Site Reliability Engineering
How to keep software running reliably in production — from SLOs and observability to incident response and safe deployments.
- Reading time: ~22 min (+12 min optional)
- Level mix: 4 beginner · 7 intermediate
Shipping code is the easy part. Keeping it running — and knowing immediately when it breaks — is the hard part. Site reliability engineering (SRE) is the discipline that turned “operations” from “someone restarting servers” into an engineering practice with measurable goals, principled failure handling, and systematic improvement.
This path covers the core ideas: what to measure and why (SLIs, SLOs), how to understand a running system (observability, logging, monitoring), how to handle failures when they happen (incident response, runbooks), and how to deploy changes without causing new failures (blue-green and canary deployments).
Roadmap
Getting to production
- Deployment
The practice of getting new versions of software running in production safely, predictably, and without downtime.
Understanding what's happening
- Monitoring
Continuously observing a running system by collecting metrics and alerting on anomalies, so problems are caught early, ideally before users notice.
- Logging
Recording discrete events from a running system so the engineers operating it can reconstruct what happened, when, and why.
- Observability
The ability to understand a system's internal state from the data it emits (metrics, logs, and traces), so you can debug problems you didn't anticipate, not just the ones you alerted on.
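As a rough illustration of logging discrete events so they can be searched and correlated later, here is a minimal sketch that uses only Python's standard `logging` and `json` modules. The service name `checkout`, the `context` key, and the field names are hypothetical choices for this example, not a prescribed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            # When it happened (UTC, ISO 8601).
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # What happened.
            "message": record.getMessage(),
        }
        # Why and where: extra fields passed via extra={"context": {...}}.
        # "context" is a name assumed for this sketch.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

logger.info(
    "payment failed",
    extra={"context": {"order_id": "o-123", "reason": "card_declined"}},
)
```

Emitting one machine-readable object per event is a design choice: it keeps individual events easy to filter by field, which is what makes it practical to reconstruct a sequence of events during debugging.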
The discipline
- Site Reliability Engineering
A discipline pioneered at Google that applies software-engineering principles to operations: automation, SLOs, error budgets, and a blameless culture.
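- SLI, SLO, and SLA
The vocabulary of reliability targets: an SLI (service level indicator) measures how well a service is doing, an SLO (service level objective) is the goal for that measure, and an SLA (service level agreement) is the contractual promise to a customer, usually with penalties if it is missed.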
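To make the relationship between an SLI, an SLO, and an error budget concrete, here is a small sketch for a request-based SLI. The request counts, the 99.9% target, and the names `Window`, `sli`, and `error_budget_remaining` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical numbers)."""
    total_requests: int
    good_requests: int  # e.g. successful responses served fast enough


def sli(window: Window) -> float:
    """SLI: the measured fraction of requests that were good."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return window.good_requests / window.total_requests


def error_budget_remaining(window: Window, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


window = Window(total_requests=1_000_000, good_requests=999_400)
slo_target = 0.999  # assumed SLO: 99.9% of requests are good

print(f"SLI: {sli(window):.4%}")
print(f"SLO met: {sli(window) >= slo_target}")
print(f"Error budget remaining: {error_budget_remaining(window, slo_target):.0%}")
```

With these numbers the SLO allows about 1,000 bad requests and 600 occurred, so roughly 40% of the budget is left. An SLA would typically be set looser than the internal SLO, so the team gets a warning before a contractual promise is broken.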
When things go wrong
- Incident Response
The structured way teams handle production incidents, from detection through resolution to a blameless postmortem.
- Downtime (Optional)
Any period when a system is unavailable or not serving users correctly. It is the thing reliability engineering exists to minimize, and it is measured as the complement of uptime: 99.9% uptime means 0.1% downtime.
- Runbook (Optional)
A documented, step-by-step procedure for handling a specific operational task or failure — so any on-call engineer can respond correctly under pressure without reinventing the fix.
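The relationship between downtime and uptime described under Downtime can be made concrete with a little arithmetic. The sketch below converts an availability target into the downtime it permits over a period. The 30-day period and the three targets are assumed example values.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` while still meeting the target."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return period * (1.0 - availability_target)


def measured_availability(downtime: timedelta, period: timedelta) -> float:
    """Availability as the complement of the downtime fraction."""
    return 1.0 - downtime / period


month = timedelta(days=30)  # assumed measurement period
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target, month).total_seconds() / 60
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why the choice of target has such a large effect on how a team operates.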
Safer releases
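- Blue-Green Deployment (Optional)
A release strategy that runs two identical production environments, one live and one idle. The new version is deployed to the idle environment and all traffic is switched to it at once, so rolling back only requires switching traffic back to the previous environment.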
- Canary Deployment (Optional)
A release strategy that rolls a new version out to a small fraction of users first, watches its metrics, and gradually increases the share only if it's healthy — limiting the blast radius of a bad release.
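The canary process described above (start small, check health, widen gradually, roll back on trouble) can be sketched as a simple loop. The step sizes and the callbacks `set_traffic_share` and `is_healthy` are hypothetical stand-ins for a real load balancer and metrics check. A real rollout would also wait between steps to collect enough data.

```python
from typing import Callable, Sequence


def canary_rollout(
    set_traffic_share: Callable[[float], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[float] = (0.01, 0.05, 0.25, 1.0),
) -> bool:
    """Raise the canary's traffic share step by step; roll back on bad health.

    Returns True if the new version reached the final step.
    """
    for share in steps:
        set_traffic_share(share)
        if not is_healthy():
            set_traffic_share(0.0)  # roll back: all traffic to the old version
            return False
    return True


# Fake infrastructure so the sketch runs on its own.
history: list[float] = []
current = {"share": 0.0}


def fake_set_share(share: float) -> None:
    current["share"] = share
    history.append(share)


def fake_health_check() -> bool:
    # Pretend the release starts failing once it gets more than 5% of traffic.
    return current["share"] <= 0.05


ok = canary_rollout(fake_set_share, fake_health_check)
print("rolled out" if ok else "rolled back", history)
```

In this simulated run the release fails at the 25% step and is rolled back, so three quarters of users never saw the bad version. A blue-green switch, by contrast, would correspond to a single step from 0% to 100%.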