Several incident management posts and tools this week, along with discussion of platform engineering teams, a good sociotechnical systems reading list, and technical posts on Kubernetes and Elasticsearch operations.

A nice high-level introduction to incident management, centred around some real-world improvements within one organisation.

Another good incident management post. This one is a video and transcript of a talk on the cost of coordination during incidents.

Audio and transcript of an interesting discussion about the responsibilities of a platform team, and emerging patterns and anti-patterns.

In discussions about software supply chain security you’ll hear folks talking about attestations. This post explains what these are, why they are important and demonstrates some tools for working with them.

A look at rightsizing Kubernetes workloads, including details of how pod limits work, the Kubernetes scheduler and the vertical pod autoscaler.
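As a rough illustration of the rightsizing idea, here is a minimal sketch that derives a CPU request and limit from observed usage samples. It is not the vertical pod autoscaler's actual algorithm and is not taken from the linked post; the function names, the 90th percentile choice and the 15% headroom are all assumptions made for the example.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of integers (pct in 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no floating point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def with_headroom(value, headroom_pct):
    """Add headroom_pct percent to value, rounding up to a whole millicore."""
    return (value * (100 + headroom_pct) + 99) // 100


def recommend_cpu(samples_millicores, headroom_pct=15):
    """Suggest a CPU request and limit in millicores (hypothetical policy).

    Request: 90th percentile of observed usage plus headroom.
    Limit:   observed peak plus headroom, never below the request.
    """
    request = with_headroom(percentile(samples_millicores, 90), headroom_pct)
    limit = with_headroom(max(samples_millicores), headroom_pct)
    return {"request": request, "limit": max(limit, request)}


# Ten made-up usage samples in millicores, with one spike.
usage = [120, 135, 150, 140, 400, 130, 145, 155, 160, 125]
print(recommend_cpu(usage))  # {'request': 184, 'limit': 460}
```

The request follows typical usage while the limit follows the peak, so a single spike raises the limit without inflating the amount of CPU the scheduler has to reserve for the pod.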

A collected set of reading material on Open Systems Theory and Sociotechnical Systems design. A good set of papers and books if you want to go deep on the subject.

Databases usually store data on disk, and when things go wrong it helps to understand how that storage works. This post looks at how to debug disk-related Elasticsearch issues.


Autometrics is a set of open source metrics libraries that make it easier to use OpenTelemetry and Prometheus. It currently supports Rust, TypeScript, Go and Python, with more languages coming.

Untitled Goose Tool is a tool designed for threat assessment and incident response for Azure Active Directory (AzureAD), Azure, and Microsoft 365 environments.