Lots of good ops posts this week, from Linux systems administration and an outage retrospective to creating actionable alerts, managing large Kubernetes clusters and more besides.

Is maintaining Linux servers a dying art? This post thinks maybe, and makes a case for reclaiming it.

A detailed retrospective of a 73-hour outage. Lots of technical details, and it’s interesting to see this sort of openness from a consumer brand.

A look at redundancy as an approach to building resilient systems, using the James Webb Space Telescope as a case study.

An interesting post on using continuous testing to give development teams feedback that aids in managing availability.

A post with lots of operational details about running a large Kubernetes installation, with over 4k nodes and 200k pods.

A useful set of tips for creating actionable alerts in your monitoring system.

Indexes speed up database queries, right? Well, like most things, it’s more complicated than that. An interesting look at real-world database performance.
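As a small illustration of the "more complicated" part, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from the linked post: the `users` table, the `idx_users_email` index and the `query_plan` helper are all made up for the example, and other databases will plan these queries differently. It asks SQLite how it would run two queries against the same indexed column.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# An exact match on the indexed column can be answered with an index lookup.
exact_plan = query_plan(
    conn, "SELECT * FROM users WHERE email = ?", ("user42@example.com",)
)

# A pattern with a leading wildcard cannot use the index to narrow the search.
suffix_plan = query_plan(
    conn, "SELECT * FROM users WHERE email LIKE ?", ("%@example.com",)
)

print(exact_plan)
print(suffix_plan)
```

The exact wording of the plan varies between SQLite versions, but the first query should report a search using the index while the second falls back to a scan, even though both filter on the same indexed column.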

An argument for using GraphQL to query the state of cloud resources, as the complexity and scale of the APIs provided by the public clouds continue to increase.


A standalone reverse proxy to enforce WebAuthn authentication. It can be inserted in front of sensitive services or even chained with other proxies (e.g. OAuth, MFA) to enable a layered security model.

A fuzzer for RESTful APIs, useful for finding security and reliability issues.