Downtime
Also known as: outage (uptime and availability are the inverse measures)
Any period when a system is unavailable or not serving users correctly — the thing reliability engineering exists to minimize, measured as the inverse of uptime.
- Primary domain: Systems Software
- Sub-category: Dependability, Fault Tolerance & Reliability
In simple terms
Downtime is any stretch of time when a system isn’t working for its users — the site won’t load, the app returns errors, the payment won’t go through. It’s the opposite of uptime, and it’s the fundamental thing operations and reliability work exists to prevent. Downtime has direct, visible costs: lost revenue, frustrated users, missed deadlines, and damaged trust. Most of the field’s practices — monitoring, redundancy, careful deployments — are ultimately about avoiding or shortening it.
More detail
Availability is usually quoted in “nines” — the percentage of time a system is up — and the gap between each level is dramatic:
| Availability | Downtime per year |
|---|---|
| 99% (“two nines”) | ~3.65 days |
| 99.9% (“three nines”) | ~8.8 hours |
| 99.99% (“four nines”) | ~52 minutes |
| 99.999% (“five nines”) | ~5 minutes |
Each extra nine costs disproportionately more to achieve, which is why targets are set deliberately via SLOs rather than reflexively chasing 100%.
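The figures in the table follow from simple arithmetic, sketched below. The function name `allowed_downtime_hours` is illustrative, and the calculation assumes a 365-day year (8,760 hours), ignoring leap years.

```python
# Convert an availability percentage into the downtime it permits per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each added nine divides the permitted downtime by ten, which is why the steps in the table shrink from days to hours to minutes.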
Downtime is also categorized by intent:
- Planned downtime — scheduled maintenance, upgrades, migrations. Modern zero-downtime techniques (blue-green, canary releases, rolling updates) aim to eliminate even this.
- Unplanned downtime — outages from bugs, hardware failure, dependency failures, traffic spikes, or human error. This is what incident response handles.
Common causes are surprisingly mundane: bad deployments, expired certificates, exhausted disk or memory, an overloaded database, a failed dependency, or a misconfiguration. Two metrics frame how teams reduce it: MTBF (mean time between failures — make outages rarer) and MTTR (mean time to recovery — make them shorter). For most services, getting good at recovering fast (MTTR) pays off more than chasing perfect prevention.
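The two metrics can be tied back to availability with the standard steady-state estimate, availability = MTBF / (MTBF + MTTR). The sketch below assumes MTBF means the average operating time between failures, excluding repair time; the numbers are made up for illustration.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate long-run availability as MTBF / (MTBF + MTTR).

    Assumes mtbf_hours is mean operating time between failures
    (repair time excluded) and mttr_hours is mean time to recovery.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about every 500 hours, takes 2 hours to recover.
baseline = steady_state_availability(mtbf_hours=500, mttr_hours=2)
# Option A: make failures half as frequent.
rarer_failures = steady_state_availability(mtbf_hours=1000, mttr_hours=2)
# Option B: recover twice as fast.
faster_recovery = steady_state_availability(mtbf_hours=500, mttr_hours=1)

print(f"baseline:        {baseline:.5f}")
print(f"rarer failures:  {rarer_failures:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
```

Under this formula, halving MTTR improves availability exactly as much as doubling MTBF. The case for focusing on recovery is therefore about which of the two is easier to achieve for a given service, not about the arithmetic.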
Why it matters
Downtime is the bottom-line failure mode of any service — the moment software stops delivering value and starts costing it. For large businesses an hour of downtime can mean millions in lost revenue, which is why so much engineering effort goes into redundancy, graceful degradation, fast rollback, and rapid incident response. Framing reliability work in terms of “how much downtime, and how fast we recover” keeps it grounded in real user and business impact.
Real-world examples
- A major cloud region outage taking down dozens of dependent websites and apps for hours — a recurring, high-profile kind of downtime.
- An expired TLS certificate silently causing an outage at midnight — one of the most common and avoidable causes.
- A retailer doing the math that a few minutes of checkout downtime on a peak shopping day costs more than a year of extra reliability investment.
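The expired-certificate example is the kind of outage a scheduled check can catch ahead of time. The sketch below is one possible approach, not a complete monitoring setup: the function names and the 14-day warning threshold are assumptions, and the date string uses the format of the `notAfter` field that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days remaining before a certificate expires.

    not_after is a certificate time string such as
    "Jan  5 09:34:43 2018 GMT"; now must be timezone-aware.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires_at - now.timestamp()) / SECONDS_PER_DAY


def should_warn(not_after: str, now: datetime, warn_days: float = 14) -> bool:
    """True when the certificate expires within warn_days (threshold is an assumption)."""
    return days_until_expiry(not_after, now) <= warn_days


# Illustrative values only: a certificate expiring four days after "now".
now = datetime(2018, 1, 1, 9, 34, 43, tzinfo=timezone.utc)
print(days_until_expiry("Jan  5 09:34:43 2018 GMT", now))
print(should_warn("Jan  5 09:34:43 2018 GMT", now))
```

In practice `now` would be `datetime.now(timezone.utc)` and the date string would come from the live certificate; it is passed in explicitly here so the check is easy to test.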
Common misconceptions
- “We should aim for zero downtime / 100% uptime.” The cost of each additional nine grows steeply, and beyond a point users can’t tell the difference; the right target is a deliberate SLO, not perfection.
- “Downtime only means a total crash.” Partial degradation — slow responses, some failing requests, one broken feature — is downtime too, often measured against an error-rate or latency objective rather than a binary up/down.
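To make the second point concrete, here is a minimal sketch of measuring availability per request instead of as a binary up/down state. The `WindowStats` fields, the 99.9% target, and the sample counts are all hypothetical; it also assumes failed and slow requests are counted separately, with no request in both groups.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one measurement window (hypothetical fields)."""
    total_requests: int
    failed_requests: int  # returned an error
    slow_requests: int    # succeeded, but exceeded the latency threshold


def good_fraction(stats: WindowStats) -> float:
    """Fraction of requests that were both successful and fast enough."""
    if stats.total_requests == 0:
        return 1.0  # no traffic: treat the window as fully good
    bad = stats.failed_requests + stats.slow_requests
    return (stats.total_requests - bad) / stats.total_requests


def meets_slo(stats: WindowStats, slo_target: float = 0.999) -> bool:
    """Compare the good fraction against an SLO target (0.999 is an assumption)."""
    return good_fraction(stats) >= slo_target


# The service never went fully down, yet it misses a 99.9% objective.
window = WindowStats(total_requests=100_000, failed_requests=50, slow_requests=150)
print(good_fraction(window))
print(meets_slo(window))
```

Counting slow requests as bad is a design choice: it treats a response the user gave up waiting for the same as an error, which matches the degradation described above.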
Learn next
How much downtime is acceptable is defined by SLOs and SLAs; handling it when it happens is incident response.