Operations and Reliability
Running software in production — deployment, observability, SRE, incident response, and reliability.
Operations and reliability are what keep software running once it ships: deployment, monitoring, on-call, incident response, and the engineering discipline of building systems that stay up.
Core
The essentials. Start here.
Logging
Recording discrete events from a running system, so the engineers operating it can reconstruct what happened — and when, and why.
core beginner concept
Monitoring
Continuously observing a running system — collecting metrics, alerting on anomalies — so problems are caught before users notice.
core beginner concept
Deployment
The act of getting new versions of software running in production safely, predictably, and without downtime.
core intermediate concept
Incident Response
The structured way teams handle production incidents — from detection through resolution to a blameless postmortem.
core intermediate concept
SRE
A discipline pioneered at Google that applies software-engineering principles to operations — automation, SLOs, error budgets, blameless culture.
core intermediate field
Important
What you'll meet next.
Downtime
Any period when a system is unavailable or not serving users correctly. It is the thing reliability engineering exists to minimize, and it is measured as the complement of uptime: 99.9% uptime means 0.1% downtime.
beginner concept
Runbook
A documented, step-by-step procedure for handling a specific operational task or failure — so any on-call engineer can respond correctly under pressure without reinventing the fix.
beginner concept
Blue-Green Deployment
A release strategy that runs two identical production environments — one live, one idle — and switches all traffic to the new version at once, so rollback is instant.
intermediate concept
Canary Deployment
A release strategy that rolls a new version out to a small fraction of users first, watches its metrics, and gradually increases the share only if it's healthy — limiting the blast radius of a bad release.
intermediate concept
Observability
The ability to understand a system's internal state from the data it emits — metrics, logs, and traces — so you can debug problems you didn't anticipate, not just the ones you alerted on.
intermediate concept
SLO, SLI, SLA
The vocabulary of reliability targets — an SLI measures how well a service is doing, an SLO is the goal for that measure, and an SLA is the contractual promise (with penalties) to a customer.
intermediate concept
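To make the SLI/SLO distinction above concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (good requests divided by total requests), which is one common choice but not the only one. The function names and the sample numbers are hypothetical, chosen only for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI).

    Assumption: with no traffic in the window, the service is treated
    as fully available. Other conventions are possible.
    """
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Compare the measured SLI against the SLO target (e.g. 0.999)."""
    return sli >= slo_target


# Hypothetical window: 1,000,000 requests, 500 of which failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would sit on top of this as a contractual threshold, typically looser than the internal SLO, so that the team misses its own goal before it breaches the customer promise.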
Supplemental
Niche, historical, or specialized.
Chaos Engineering
The practice of deliberately injecting failures into production systems to discover weaknesses before they cause unexpected outages — building confidence in a system's resilience by verifying it can withstand turbulent conditions.
supplemental intermediate concept
Error Budget
The acceptable amount of downtime or errors implied by a Service Level Objective — giving teams a concrete budget to spend on reliability risk versus feature velocity, and an objective trigger for halting deployments when reliability is at risk.
supplemental intermediate concept
Feature Flag Rollout
A technique for turning features on or off at runtime, which decouples feature release from code deployment and enables gradual rollouts, A/B tests, and instant rollbacks without redeploying.
supplemental intermediate concept
Toil
Manual, repetitive, automatable operational work that scales linearly as a service grows. Google's SRE model treats toil as a drain on engineering time and sets a goal of keeping it below 50% of each SRE's time.
supplemental intermediate concept
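The Error Budget entry above describes a budget implied by an SLO. The arithmetic can be sketched as follows, assuming a time-based availability SLO measured over a rolling window. The function names, the 30-day window, and the sample downtime figure are hypothetical and only illustrate the calculation.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO.

    Example: a 99.9% SLO leaves 0.1% of the window as budget.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already incurred in the window.

    A negative result means the budget is exhausted, which is the kind
    of objective signal a team might use to pause risky deployments.
    """
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)  # 30-day window
left = budget_remaining(0.999, downtime_minutes=30.0)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

Under these assumptions, a 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so 30 minutes of incidents leaves roughly 13.2 minutes to spend on further risk.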