Chaos Engineering

In simple terms

You can’t know if your system is resilient just by looking at it or running unit tests — distributed systems fail in unexpected ways when parts of them break. Chaos Engineering is the practice of deliberately breaking things in production (in a controlled way) to discover how your system actually behaves under failure, before an unplanned outage does it for you. Netflix’s “Chaos Monkey” randomly kills virtual machines in production; if your system can’t survive the Chaos Monkey, it can’t survive a real outage. The goal is to convert unknown-unknowns (surprising failures) into known-knowns (failures you’ve already handled).

More detail

The Chaos Engineering process:

Define steady state: measure normal system behaviour (error rate, latency, throughput) so you have a baseline to compare against.
Hypothesize: state that the system maintains steady state under a specific failure condition, for example “If we kill one EC2 instance in the API tier, latency stays below 200ms.”
Introduce the failure variable: kill the instance, inject network latency, corrupt a percentage of responses, exhaust disk space, disconnect a database replica.
Observe: does the system maintain steady state? What actually happened?
Fix or accept: if the system behaved poorly, fix the weakness. If it behaved well, you have higher confidence and documented evidence.
Gradually increase scope: start in staging → move to off-peak production → run during peak traffic.
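
The steps above can be sketched as a small experiment loop. This is a minimal illustration, not any particular tool's API: Metrics, FakeService, run_experiment and the threshold values are all hypothetical names and numbers chosen for the example, and the "service" is simulated in memory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Metrics:
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    latency_ms: float   # observed latency in milliseconds


@dataclass
class ExperimentResult:
    hypothesis_held: bool
    aborted: bool
    observations: List[Metrics]


def in_steady_state(m: Metrics, max_error_rate: float, max_latency_ms: float) -> bool:
    return m.error_rate <= max_error_rate and m.latency_ms <= max_latency_ms


def run_experiment(
    read_metrics: Callable[[], Metrics],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,    # steady-state bound (assumed value)
    max_latency_ms: float = 200.0,   # steady-state bound from the hypothesis
    abort_error_rate: float = 0.05,  # abort criterion (assumed value)
    samples: int = 5,
) -> ExperimentResult:
    # Define steady state: refuse to start if the system is already unhealthy.
    baseline = read_metrics()
    if not in_steady_state(baseline, max_error_rate, max_latency_ms):
        return ExperimentResult(False, True, [baseline])

    observations = [baseline]
    held, aborted = True, False
    inject_failure()  # Introduce the failure variable.
    try:
        for _ in range(samples):  # Observe.
            m = read_metrics()
            observations.append(m)
            if m.error_rate > abort_error_rate:
                held, aborted = False, True
                break
            if not in_steady_state(m, max_error_rate, max_latency_ms):
                held = False
    finally:
        rollback()  # Always undo the failure, even if observation raised.
    return ExperimentResult(held, aborted, observations)


class FakeService:
    """In-memory stand-in for a real service and its monitoring."""

    def __init__(self, degraded_error_rate: float, degraded_latency_ms: float):
        self.instance_down = False
        self.degraded = Metrics(degraded_error_rate, degraded_latency_ms)

    def read_metrics(self) -> Metrics:
        return self.degraded if self.instance_down else Metrics(0.001, 120.0)

    def kill_instance(self) -> None:
        self.instance_down = True

    def restore_instance(self) -> None:
        self.instance_down = False


service = FakeService(degraded_error_rate=0.004, degraded_latency_ms=180.0)
result = run_experiment(service.read_metrics, service.kill_instance,
                        service.restore_instance)
print("hypothesis held:", result.hypothesis_held, "aborted:", result.aborted)
```

In a real setup, read_metrics would query your monitoring system and inject_failure would call your chaos tooling; the structure (baseline check, injection, bounded observation, guaranteed rollback) is the part worth keeping.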

Types of experiments:

Instance/pod failure: kill a random EC2 instance, ECS task, or Kubernetes pod.
Network partitions: simulate network loss between services; inject latency (100ms+); drop packets.
Dependency failure: make an upstream service return errors (500s) or time out.
Resource exhaustion: fill disk, exhaust file descriptors, spike CPU.
Region failure: simulate an AWS availability zone or region going down.
Dependency latency: inject slow responses from a database or external API.
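
Dependency failure and dependency latency can also be injected at the application level by wrapping the client call. The sketch below assumes a Python service; inject_faults, InjectedFault and fetch_profile are hypothetical names for illustration, and real tools usually inject these faults at the network or proxy layer instead.

```python
import random
import time
from functools import wraps
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real dependency error (hypothetical name)."""


def inject_faults(
    error_probability: float = 0.0,
    latency_probability: float = 0.0,
    latency_seconds: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
):
    """Decorator that makes a call fail or slow down some of the time."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Dependency latency: delay the call before it reaches the dependency.
            if rng.random() < latency_probability:
                sleep(latency_seconds)
            # Dependency failure: raise instead of calling the dependency.
            if rng.random() < error_probability:
                raise InjectedFault(f"injected failure calling {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Record delays instead of really sleeping so the example runs instantly.
recorded_delays = []


@inject_faults(latency_probability=1.0, latency_seconds=0.1,
               sleep=recorded_delays.append)
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id}


@inject_faults(error_probability=1.0)
def fetch_orders(user_id: int) -> list:
    return []


profile = fetch_profile(7)          # succeeds after a simulated 100ms delay
try:
    fetch_orders(7)
    orders_failed = False
except InjectedFault:
    orders_failed = True            # the caller must handle this path
```

The interesting part of the experiment is what the caller does next: whether it times out, retries, falls back, or propagates the error to the user.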

Netflix’s Chaos Monkey (2011): the original chaos engineering tool. It randomly terminates EC2 instances in Netflix’s production environment during business hours, when engineers are available to respond. This forces engineers to design for instance failure: services must tolerate losing any single instance without impacting customers. Netflix later expanded to Chaos Kong (simulates the loss of an entire AWS region) and Failure Injection Testing (FIT, which injects errors into service responses).

Principles of Chaos Engineering (from the Principles of Chaos manifesto, principlesofchaos.org):

Build a hypothesis around steady-state behaviour.
Vary real-world events (not hypothetical ones).
Run experiments in production.
Automate experiments to run continuously.
Minimise blast radius (scope failures carefully; have a kill switch).

Tools:

Chaos Monkey / Simian Army (Netflix, open-source): instance termination, latency injection.
Gremlin (commercial): managed chaos platform; CPU, memory, disk, network, process attacks.
Chaos Toolkit (open-source): Python-based; integrates with Kubernetes, AWS, GCP.
Litmus (CNCF): Kubernetes-native chaos engineering.
AWS Fault Injection Service (FIS, formerly Fault Injection Simulator): managed service for AWS-specific failures.
Pumba: Docker/Kubernetes container killing and network impairment.

Blast radius control: chaos experiments must have a kill switch (immediately stop the experiment), a fixed scope (only affect 1% of pods, only in one AZ), and a clear abort criterion (if error rate exceeds X%, stop). Chaos Engineering is not randomly breaking things; it’s controlled science.
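
The three controls can be sketched in a few lines. This is an illustration under stated assumptions, not any tool's API: Pod, select_targets and Guardrails are hypothetical names, the pod fleet is made up, and the scope is expressed as an integer percentage so the arithmetic stays exact.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pod:
    name: str
    zone: str


def select_targets(
    pods: List[Pod],
    zone: str,
    max_percent: int,
    rng: Optional[random.Random] = None,
) -> List[Pod]:
    """Fixed scope: pick at most max_percent of the pods in a single zone."""
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    rng = rng or random.Random()
    in_zone = [p for p in pods if p.zone == zone]
    # Round down so the experiment never exceeds its declared scope.
    count = len(in_zone) * max_percent // 100
    return rng.sample(in_zone, count)


@dataclass
class Guardrails:
    max_error_rate: float              # abort criterion
    kill_switch_engaged: bool = False  # manual stop, set by an operator

    def should_stop(self, current_error_rate: float) -> bool:
        return self.kill_switch_engaged or current_error_rate > self.max_error_rate


# Hypothetical fleet: 300 pods in one zone, 100 in another.
fleet = [Pod(f"api-{i}", "zone-a") for i in range(300)]
fleet += [Pod(f"api-{i}", "zone-b") for i in range(300, 400)]

targets = select_targets(fleet, zone="zone-a", max_percent=1)
guardrails = Guardrails(max_error_rate=0.05)
print(len(targets), "pods targeted;",
      "stop now?", guardrails.should_stop(current_error_rate=0.02))
```

Rounding down means a small fleet may yield zero targets at 1%. That is a deliberate choice in this sketch: widening the scope should be an explicit decision rather than a side effect of rounding.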

Gameday: a scheduled chaos exercise in which the team deliberately breaks something and practises incident response. With repetition, teams become faster at detecting and recovering from failures.

Why it matters

Distributed systems fail in complex, emergent ways that static analysis alone cannot predict. Chaos Engineering is one of the most direct ways to learn how your system behaves under realistic failure conditions before your customers find out. Netflix’s chaos programme is often credited with helping them maintain reliability through their move to the cloud and to microservices. Amazon runs large-scale gamedays before major traffic events (Prime Day). Understanding chaos engineering is essential for SRE and platform engineering roles, and for teams that operate critical distributed systems.

Real-world examples

Netflix: Chaos Monkey runs continuously in production; every Netflix engineer knows their service must tolerate instance loss.
Amazon: runs gamedays before Prime Day (typically July), using chaos experiments to check that the system handles scale under failure conditions.
LinkedIn: chaos experiments on Kafka clusters and ZooKeeper to verify fault tolerance.
Slack: chaos experiments on their message persistence layer to verify data durability.

Common misconceptions

“Chaos Engineering means randomly breaking production.” It means controlled experiments with a clear hypothesis, minimal blast radius, and a kill switch. Ad-hoc breaking is just breaking.
“Chaos Engineering is only for mature, large companies.” Any distributed system that needs to be reliable benefits from chaos experiments. The earlier you start, the cheaper the fixes.

Learn next

Chaos Engineering connects to error budgets, which quantify how much unreliability you can afford, and to toil, which keeps accumulating when the weaknesses that experiments reveal are left unfixed. Circuit breakers and timeout patterns are often the fixes discovered through chaos experiments. DORA metrics, specifically MTTR (mean time to recovery), tend to improve when chaos engineering is practised.