Monitoring
Also known as: systems monitoring
Continuously observing a running system — collecting metrics, alerting on anomalies — so problems are caught before users notice.
- Primary domain: Systems Software
- Sub-category: Dependability, Fault Tolerance & Reliability
In simple terms
Monitoring is what you do so you find out a service is broken before your users (or your boss) do. You collect numbers (request rate, error rate, latency, queue depth, CPU, memory) and you set alarms that fire when those numbers go where they shouldn’t.
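As a minimal sketch of "an alarm that fires when a number goes where it shouldn’t", the class below fires only after a metric has stayed above a threshold for several consecutive samples, so a single noisy reading does not page anyone. The class name `ThresholdAlert`, the 5% threshold, and the sample values are illustrative assumptions, not part of any particular monitoring product.

```python
from collections import deque


class ThresholdAlert:
    """Fires when a metric stays above a threshold for N consecutive samples."""

    def __init__(self, threshold: float, consecutive: int) -> None:
        self.threshold = threshold
        self.consecutive = consecutive
        # Keep only the most recent `consecutive` samples.
        self.recent = deque(maxlen=consecutive)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.consecutive
        return window_full and all(v > self.threshold for v in self.recent)


# Hypothetical error-rate samples (fraction of failed requests per minute).
alert = ThresholdAlert(threshold=0.05, consecutive=3)
samples = [0.01, 0.09, 0.02, 0.07, 0.08, 0.11]
fired = [alert.observe(s) for s in samples]
print(fired)  # only the last sample completes three breaches in a row
```

The isolated spike to 0.09 does not fire the alert; only the final run of three breaching samples does.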
More detail
Most monitoring stacks combine three signal types:
- Metrics — numeric, low-cardinality, sampled at intervals. Typically stored in time-series databases or hosted metrics services (Prometheus, InfluxDB, Datadog, CloudWatch).
- Logs — discrete events with rich context. See logging.
- Traces — the path of a single request through many services. Collected by distributed-tracing systems (Jaeger, Zipkin, Tempo).
Together these are often called the three pillars of observability, the broader practice that monitoring is part of.
The classic four “Golden Signals” (Google SRE book):
- Latency — how long requests take.
- Traffic — how many requests per second.
- Errors — fraction of requests failing.
- Saturation — how full the system is.
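The four signals above can be sketched as a summary over one observation window. This is an illustration under stated assumptions: the `Request` record, the `golden_signals` function, and the use of in-flight requests divided by capacity as the saturation measure are all hypothetical choices, and real systems usually track several latency percentiles rather than only the median.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one observation window into the four Golden Signals."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: median request duration in this window.
        "latency_ms_p50": median(r.duration_ms for r in requests) if total else 0.0,
        # Traffic: requests per second.
        "traffic_rps": total / window_seconds,
        # Errors: fraction of requests that failed with a 5xx status.
        "error_ratio": errors / total if total else 0.0,
        # Saturation (assumed measure): share of capacity currently in use.
        "saturation": in_flight / capacity,
    }


# Hypothetical two-second window with four requests.
window = [
    Request(20.0, 200),
    Request(30.0, 200),
    Request(40.0, 500),
    Request(900.0, 200),
]
signals = golden_signals(window, window_seconds=2, in_flight=8, capacity=10)
print(signals)
```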
Layered on top:
- Dashboards for at-a-glance status and incident response.
- Alerts that page humans only when needed. Bad alerting is worse than no alerting: alert fatigue makes everyone ignore the next one.
- SLOs (Service Level Objectives) — concrete targets like “99.9% of requests under 300 ms in any rolling 30-day window” — that frame what’s worth alerting on.
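To make the SLO example concrete, the sketch below checks a batch of request durations against a "99.9% under 300 ms" target and reports how much of the error budget has been spent. The function name `slo_report` and the sample data are assumptions for illustration; a production check would run over a rolling 30-day window of real measurements rather than an in-memory list.

```python
def slo_report(durations_ms, threshold_ms=300.0, target=0.999):
    """Compare observed latencies against a latency SLO."""
    total = len(durations_ms)
    if total == 0:
        raise ValueError("no requests in the window")
    good = sum(1 for d in durations_ms if d < threshold_ms)
    bad = total - good
    # Error budget: how many slow requests the target tolerates in this window.
    allowed_bad = (1 - target) * total
    return {
        "compliance": good / total,
        "budget_used": bad / allowed_bad,  # 1.0 means the budget is exhausted
        "met": good / total >= target,
    }


# Hypothetical window: 10,000 requests, 5 of them slower than 300 ms.
durations = [120.0] * 9995 + [450.0] * 5
report = slo_report(durations)
print(report)
```

With 5 slow requests out of 10,000 against a budget of about 10, roughly half the budget is spent and the SLO is still met. Alerting on how fast the budget is being consumed, rather than on every slow request, is one way an SLO frames what is worth paging for.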
Why it matters
Monitoring is the difference between “we’re flying blind” and “we know what’s happening”. It is the foundation everything in operations sits on — without it, you can’t reliably deploy, debug, or commit to availability targets.
Real-world examples
- A spike in 5xx errors after a deploy pages the on-call engineer, who rolls back.
- A latency histogram shows p99 climbing for hours while p50 stays flat, which usually points to a tail-latency problem affecting a minority of requests.
- During an incident, a shared dashboard is where most responders watch what’s happening; without it the team is reduced to chat messages and guesses.
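- Cloudflare’s public write-up of the 2017 “Cloudbleed” incident is a useful case study in what monitoring can miss: the memory-leak bug was reported by an outside security researcher rather than caught by internal alerting, after which engineers disabled the first affected feature in under an hour.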
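The p50-versus-p99 example above can be reproduced with a few lines. This sketch uses the nearest-rank percentile definition, which is one of several common definitions, and the latency samples are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, 100 requests each.
healthy = [40] * 98 + [60] * 2
degraded = [40] * 95 + [2000] * 5  # a small share of requests became very slow

for name, sample in (("healthy", healthy), ("degraded", degraded)):
    print(name, "p50 =", percentile(sample, 50), "p99 =", percentile(sample, 99))
```

In the degraded sample the median is unchanged at 40 ms while p99 jumps to 2000 ms, which is why a dashboard that shows only averages or medians can hide a tail-latency problem.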
Common misconceptions
- “More metrics = better monitoring.” Past a point, more is noise. Pick the few that matter and watch them well.
- “Monitoring tells you why.” It tells you what and when. Why comes from logs, traces, and human investigation.
Learn next
The other major signal: logging. What you do when alerts fire: incident response.