Downtime

In simple terms

Downtime is any stretch of time when a system isn’t working for its users: the site won’t load, the app returns errors, the payment won’t go through. It’s the opposite of uptime, and preventing it is the core purpose of operations and reliability work. Downtime has direct, visible costs: lost revenue, frustrated users, missed deadlines, and damaged trust. Most of the field’s practices (monitoring, redundancy, careful deployments) are ultimately about avoiding or shortening it.

More detail

Availability is usually quoted in “nines” — the percentage of time a system is up — and the gap between each level is dramatic:

Availability	Downtime per year
99% (“two nines”)	~3.65 days
99.9% (“three nines”)	~8.8 hours
99.99% (“four nines”)	~52 minutes
99.999% (“five nines”)	~5 minutes

Each extra nine costs disproportionately more to achieve, which is why targets are set deliberately via service level objectives (SLOs) rather than by reflexively chasing 100%.
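The figures in the table follow from simple arithmetic: allowed downtime is the length of the period multiplied by the fraction of time the system may be down. The sketch below reproduces them, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative, not taken from any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, that a given availability
    percentage allows over the period."""
    unavailable_fraction = 1 - availability_pct / 100
    return period_minutes * unavailable_fraction


# Print the yearly budget for each level in the table.
for pct in (99, 99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(pct)
    print(f"{pct}% -> {minutes:.1f} minutes per year")
```

The same function works for shorter windows: passing `period_minutes=30 * 24 * 60` shows that a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month.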

Downtime is also categorized by intent:

Planned downtime — scheduled maintenance, upgrades, migrations. Modern zero-downtime techniques (blue-green, canary releases, rolling updates) aim to eliminate even this.
Unplanned downtime — outages from bugs, hardware failure, dependency failures, traffic spikes, or human error. This is what incident response handles.

Common causes are surprisingly mundane: bad deployments, expired certificates, exhausted disk or memory, an overloaded database, a failed dependency, or a misconfiguration. Two metrics frame how teams reduce downtime: MTBF (mean time between failures), which improves when outages become rarer, and MTTR (mean time to recovery), which improves when outages become shorter. For most services, getting good at recovering fast (lowering MTTR) pays off more than chasing perfect prevention.
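One common way to connect these two metrics to availability is the steady-state formula availability = MTBF / (MTBF + MTTR), which treats MTBF as the average uptime between failures (some definitions instead include repair time in MTBF). The sketch below uses that formula with made-up numbers to compare two improvements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming MTBF counts only the
    uptime between failures and MTTR the time to restore service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: a failure every 500 hours, 2 hours to recover.
baseline = availability(500, 2)
rarer_outages = availability(1000, 2)   # MTBF doubled
faster_recovery = availability(500, 1)  # MTTR halved

print(f"baseline:        {baseline:.4%}")
print(f"MTBF doubled:    {rarer_outages:.4%}")
print(f"MTTR halved:     {faster_recovery:.4%}")
```

Under this model, halving MTTR and doubling MTBF yield the same availability, so the choice between them comes down to which is cheaper to achieve. Faster detection and rollback are often the easier of the two.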

Why it matters

Downtime is the bottom-line failure mode of any service — the moment software stops delivering value and starts costing it. For large businesses an hour of downtime can mean millions in lost revenue, which is why so much engineering effort goes into redundancy, graceful degradation, fast rollback, and rapid incident response. Framing reliability work in terms of “how much downtime, and how fast we recover” keeps it grounded in real user and business impact.

Real-world examples

A major cloud region outage taking down dozens of dependent websites and apps for hours — a recurring, high-profile kind of downtime.
An expired TLS certificate silently causing an outage at midnight — one of the most common and avoidable causes.
A retailer doing the math that a few minutes of checkout downtime on a peak shopping day costs more than a year of extra reliability investment.
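The expired-certificate example is avoidable with a routine expiry check. A minimal sketch using Python's standard `ssl` module follows; the helper names, the 14-day warning threshold, and the host passed in are assumptions for illustration, not a recommended policy.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a certificate's
    notAfter string, e.g. "Jan  5 09:34:43 2018 GMT"."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch
    return (expires_at - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime,
                  warn_days: float = 14.0) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the served certificate's notAfter
    field. An already-expired certificate fails verification and raises
    ssl.SSLCertVerificationError, which a monitor should also alert on."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after(host), datetime.now(timezone.utc))` for each monitored host and raise an alert well before the certificate lapses.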

Common misconceptions

“We should aim for zero downtime / 100% uptime.” The cost of each additional nine grows steeply, and beyond a point users can’t tell the difference; the right target is a deliberate SLO, not perfection.
“Downtime only means a total crash.” Partial degradation — slow responses, some failing requests, one broken feature — is downtime too, often measured against an error-rate or latency objective rather than a binary up/down.
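To make the second point concrete, partial degradation can be measured as the fraction of requests that were "good" and compared against an objective. The sketch below is one possible definition, assuming a request is good when it avoids a server error and finishes within a latency threshold; the `Request` type, the 500 ms threshold, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def good_fraction(requests: list[Request],
                  latency_threshold_ms: float = 500.0) -> float:
    """Fraction of requests with no 5xx error that finished within the
    latency threshold. An empty window counts as fully good."""
    if not requests:
        return 1.0
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


def meets_slo(requests: list[Request], target: float = 0.999,
              latency_threshold_ms: float = 500.0) -> bool:
    """True when the good fraction reaches the (assumed) target."""
    return good_fraction(requests, latency_threshold_ms) >= target


# 998 healthy requests, one server error, one slow response.
window = [Request(200, 120.0)] * 998 + [Request(503, 80.0), Request(200, 2400.0)]
print(good_fraction(window), meets_slo(window))
```

In this window the service never went fully down, yet it misses a 99.9% objective because two requests in a thousand failed or were too slow.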

Learn next

How much downtime is acceptable is defined by SLOs and SLAs; handling it when it happens is incident response.
