2 minute read

Some good modern operations posts this week, with a few debugging tales, discussion of monitoring, getting started with cloud services and a good introduction to service level objectives.

From our sponsor, VictorOps

Incident commanders and lifecycle coordinators help organize incident response to make on-call suck less. See how DevOps teams benefit from centralizing alerting, incident response and collaboration:
http://try.victorops.com/devopsweekly/incident-lifecycle-coordinators

News

A great tale of debugging an incident in a world of interlinked distributed systems. Some good takeaways for anyone interested in resilient architecture as well.
https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df

A long and very detailed post on how Stack Overflow do monitoring. Lots of internal dashboards, graphs and discussions of approaches and different facets of the monitoring problem.
https://nickcraver.com/blog/2018/11/29/stack-overflow-how-we-do-monitoring/

A quick introduction to Service Level Objectives, Service Level Indicators and Error Budgets for managing reliability, accompanied by an experience report from one team starting to implement them.
https://medium.com/kudos-engineering/managing-reliability-with-slos-and-error-budgets-37346665abf6

Kubernetes cluster cost overhead can be an issue, often down to the cost of idle resources. This post explores how to measure and reduce that overhead, in particular looking at CPU, memory and disk. It also explores why autoscaling probably shouldn’t be step one.
https://medium.com/kubecost/detect-overspending-by-measuring-idle-kubernetes-resources-d5d97eb205e0

An interesting article on how many metrics an individual application might expose, including examples of a few real world apps. Interesting to think about when considering capacity and cost of monitoring.
https://www.robustperception.io/how-many-metrics-should-an-application-return

When public cloud services launched they had a small number of services, and you could learn the APIs and tools fairly quickly. Not so today. This post acts as a reminder, as well as a good collection of links for anyone starting getting up to speed to with AWS or GCP.
https://peadarcoyle.com/2019/01/12/im-lost-how-do-i-learn-to-use-the-cloud/

A walkthrough of building, and debugging a simple deployment pipeline using GitLab. Takes in some certificate issues and overriding existing Docker images used in each step.
https://joaorosa.io/2019/01/13/using-flyway-and-gitlab-to-deploy-a-mysql-database-to-aws-rds-securely/

A detailed walkthrough for anyone looking to implement structured logging in Java.
https://github.com/tersesystems/terse-logback

A look at a multi-cloud data processing pipeline. Starting in AWS, this walks through a few ways of duplicating data into GCP, utilising existing services wherever possible. No mention of the cost of transfer or storage though.
https://www.vibrato.com.au/blog/towards-a-multi-cloud-serverless-data-warehouse

A quick introduction to Kubernetes networking concepts, and a basic explanation of the role of ingress, with an example using the Traefik proxy.
https://www.cloudops.com/2018/06/why-kubernetes-ingresses-arent-as-difficult-as-you-think/

Tools

GoFish is a cross-platform systems package manager, bringing the ease of use of Homebrew to Linux and Windows. Early days but a few packages starting to be available.
https://gofi.sh/index.html
https://github.com/fishworks/fish-food

Incident commanders and lifecycle coordinators help organize incident response to make on-call suck less. See how DevOps teams benefit from centralizing alerting, incident response and collaboration:
http://try.victorops.com/devopsweekly/incident-lifecycle-coordinators

Updated: