Several in-depth posts this week on crunchy operations topics, including metrics for incident response, SLOs and SLIs, and the idea that you build it, you run it.

A post on using mean-time-to metrics to improve incident response. Good discussion of the choices about which metrics to use.

A comprehensive post on setting SLOs and SLIs for complex distributed systems.

You build it, you run it. A familiar phrase, but the following playbook explores what that means and how to implement it to improve service operations.

There is a lot of interest in software bills of materials (SBOMs) at the moment, but much of it has focused on creating SBOMs. This post looks at another aspect: storage and distribution.

Down the rabbit hole with one team debugging an AWS EC2 networking issue related to sending large packets.

Databases and applications are often still considered quite separately, with separate specialists and sometimes separate teams. This post speculates about what vertical integration in that space might mean.

A detailed look at detecting silent errors in large-scale systems, combining opportunistic and ripple testing to catch hard-to-find issues.


kube-opex-analytics is a cost optimization tool for Kubernetes. It collects data and surfaces it to Prometheus/Grafana, with a focus on capacity planning and cost.