Site Reliability Engineering

For advanced learners · 11 topics (7 required · 4 optional) · updated 2026-06-08

How to keep software running reliably in production — from SLOs and observability to incident response and safe deployments.

Reading time: ~22 min (+12 min optional)
Level mix: 4 beginner · 7 intermediate

Shipping code is the easy part. Keeping it running — and knowing immediately when it breaks — is the hard part. Site reliability engineering (SRE) is the discipline that turned “operations” from “someone restarting servers” into an engineering practice with measurable goals, principled failure handling, and systematic improvement.

This path covers the core ideas: what to measure and why (SLIs, SLOs), how to understand a running system (observability, logging, monitoring), how to handle failures when they happen (incident response, runbooks), and how to deploy changes without causing new failures (blue-green and canary deployments).

Roadmap

Getting to production

Understanding what's happening

The discipline

When things go wrong

Safer releases