Operations and Reliability

Running software in production — deployment, observability, SRE, incident response, and reliability.

Operations and reliability are what keep software running once it ships: deployment, monitoring, on-call, incident response, and the engineering discipline of building systems that stay up.

Core

The essentials. Start here.

Logging

Recording discrete events from a running system, so the engineers operating it can reconstruct what happened — and when, and why.

core beginner concept

Monitoring

Continuously observing a running system by collecting metrics and alerting on anomalies, so problems are caught early, ideally before users notice.

core beginner concept

Deployment

The act of getting new versions of software running in production safely, predictably, and without downtime.

core intermediate concept

Incident Response

The structured way teams handle production incidents — from detection through resolution to a blameless postmortem.

core intermediate concept

SRE

Site reliability engineering (SRE): a discipline pioneered at Google that applies software-engineering principles to operations, including automation, SLOs, error budgets, and blameless culture.

core intermediate field

Important

What you'll meet next.

Supplemental

Niche, historical, or specialized.