Raft
Also known as: raft consensus, raft protocol
A consensus algorithm designed to be understandable — it achieves distributed agreement through leader election and log replication, and has become the go-to replacement for Paxos in modern systems.
- Primary domain: Concurrency & Parallelism
- Sub-category: Distributed Computing
In simple terms
Raft is a recipe for getting a cluster of servers to agree on a sequence of values, even when some servers crash or messages are lost. It decomposes the problem into three separable sub-problems — leader election, log replication, and safety — and handles each in a way that is meant to be traceable by a human reader. If you’ve used etcd, CockroachDB, or TiKV, you’ve relied on Raft.
The Visual Map
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: election timeout<br/>(no heartbeat from leader)
Candidate --> Leader: wins majority of votes
Candidate --> Follower: another leader appears<br/>(higher term)
Candidate --> Candidate: split vote -> new term, retry
Leader --> Follower: discovers higher term
note right of Leader: sends heartbeats +<br/>AppendEntries; commits<br/>once a majority replicates
note right of Follower: votes for first valid<br/>candidate per term
More detail
Raft organises time into terms (monotonically increasing integers). Each term starts with an election; if no candidate wins, the term ends without a leader and a new term begins. A term has at most one leader.
Leader election: servers start as followers. If a follower receives no heartbeat from a leader within an election timeout (randomised per server, typically 150–300 ms), it becomes a candidate, increments its term, votes for itself, and requests votes from the other servers. A server grants at most one vote per term, to the first candidate that asks, provided the candidate’s log is at least as up-to-date as its own. The candidate that collects votes from a majority wins and begins sending heartbeats.
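The vote-granting rule can be sketched as follows. This is a minimal illustration, not a full implementation: the `Server` class and `handle_request_vote` function are hypothetical names, the log is reduced to a list of entry terms, and "at least as up-to-date" is taken to mean comparing the last entry's term first and the log length second, as the Raft paper defines it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Server:
    """Hypothetical per-server election state; the log is stored as entry terms only."""
    current_term: int = 0
    voted_for: Optional[int] = None
    log_terms: List[int] = field(default_factory=list)

    def last_log(self) -> Tuple[int, int]:
        """Return (term of last entry, 1-based index of last entry); (0, 0) if empty."""
        if not self.log_terms:
            return (0, 0)
        return (self.log_terms[-1], len(self.log_terms))


def handle_request_vote(server: Server, term: int, candidate_id: int,
                        last_log_term: int, last_log_index: int) -> bool:
    """Decide whether `server` grants its vote; updates the server's term and vote."""
    if term < server.current_term:
        return False                      # request from a stale term
    if term > server.current_term:
        server.current_term = term        # newer term: adopt it and clear the old vote
        server.voted_for = None
    # The candidate's log must be at least as up-to-date as ours:
    # compare the last entry's term first, then the log length.
    log_ok = (last_log_term, last_log_index) >= server.last_log()
    if log_ok and server.voted_for in (None, candidate_id):
        server.voted_for = candidate_id   # at most one vote per term
        return True
    return False


follower = Server(current_term=2, log_terms=[1, 1, 2])
print(handle_request_vote(follower, 3, 7, 2, 3))  # first valid candidate in term 3
print(handle_request_vote(follower, 3, 8, 2, 5))  # second candidate, same term
```

A real server would also persist `current_term` and `voted_for` to stable storage before replying, which this sketch omits.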
Log replication: clients send commands only to the leader. The leader appends the command to its log, then sends AppendEntries RPCs to all followers. When a majority confirms the entry, the leader commits it — applies it to the state machine and replies to the client. The leader then notifies followers in subsequent AppendEntries messages; they commit too.
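One way to picture the "majority confirms" step is as a calculation over the highest log index each server is known to hold. The sketch below is illustrative only: `majority_match_index` and `advance_commit_index` are hypothetical helpers, and the current-term check reflects a rule from the Raft paper (a leader commits by counting replicas only for entries from its own term) that the paragraph above does not spell out.

```python
from typing import Dict, List


def majority_match_index(match_index: Dict[int, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of the cluster, leader included."""
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    quorum = len(indexes) // 2 + 1
    return indexes[quorum - 1]


def advance_commit_index(commit_index: int, match_index: Dict[int, int],
                         log_terms: List[int], current_term: int) -> int:
    """Return the new commit index for a leader whose log holds `log_terms`."""
    candidate = majority_match_index(match_index, len(log_terms))
    # Only entries from the leader's current term are committed by counting replicas;
    # earlier entries become committed indirectly once such an entry commits.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index


# Five-node cluster: the leader holds 5 entries; followers 1-4 have acknowledged
# up to the indexes shown (hypothetical values).
followers = {1: 5, 2: 4, 3: 2, 4: 1}
print(advance_commit_index(2, followers, [1, 1, 2, 3, 3], current_term=3))
```

In this example, three of the five servers hold index 4, so the commit index can move from 2 to 4 while index 5 waits for one more acknowledgement.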
Safety invariants:
- A server only votes for a candidate whose log is at least as complete as its own (prevents electing a server that’s missing committed entries).
- A leader never overwrites or deletes entries in its own log, and every newly elected leader already holds all committed entries.
- Two servers never have different committed entries at the same log index, whatever the term.
Log compaction: to prevent the log growing forever, Raft supports snapshotting — periodically write the entire state machine to disk, discard the log prefix before that point.
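The idea can be sketched with a toy key-value state machine. Everything here is an assumption for illustration: `CompactingLog` is a hypothetical class, the snapshot lives in memory rather than on disk, and a real implementation would only compact entries that are already committed and applied, and would also record the term of the last entry covered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CompactingLog:
    """Toy key-value state machine whose log prefix can be folded into a snapshot."""
    snapshot_state: Dict[str, str] = field(default_factory=dict)
    snapshot_index: int = 0  # last log index covered by the snapshot
    entries: List[Tuple[str, str]] = field(default_factory=list)  # entries after the snapshot

    def append(self, key: str, value: str) -> int:
        """Append a set-key command and return its 1-based log index."""
        self.entries.append((key, value))
        return self.snapshot_index + len(self.entries)

    def compact(self, upto_index: int) -> None:
        """Fold entries up to `upto_index` into the snapshot and discard them."""
        last_index = self.snapshot_index + len(self.entries)
        if not self.snapshot_index < upto_index <= last_index:
            raise ValueError("index is outside the retained log")
        count = upto_index - self.snapshot_index
        for key, value in self.entries[:count]:
            self.snapshot_state[key] = value
        self.entries = self.entries[count:]
        self.snapshot_index = upto_index

    def state(self) -> Dict[str, str]:
        """Rebuild the full state: the snapshot plus a replay of the remaining entries."""
        result = dict(self.snapshot_state)
        result.update(self.entries)
        return result


log = CompactingLog()
for key, value in [("x", "1"), ("y", "2"), ("x", "3")]:
    log.append(key, value)
before = log.state()
log.compact(2)
print(len(log.entries), log.snapshot_index, log.state() == before)
```

The state machine's contents are the same before and after compaction; only the amount of log that has to be kept and replayed shrinks.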
Raft was explicitly designed for understandability, whereas Paxos is proven but underspecified: building a real system on it requires solving leader election, log management, and reconfiguration, none of which the original Paxos paper describes, and Raft specifies all of them. It tolerates f failures with 2f+1 nodes. Split-brain cannot corrupt the log because any two majorities share at least one server, so no two leaders can win the same term, and a stale leader cut off from the majority cannot commit new entries.
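The quorum arithmetic is small enough to check by brute force. The sketch below uses hypothetical helper names; it computes the majority size and the number of tolerated failures for a cluster of n nodes, then enumerates every pair of majorities to confirm that they always share at least one node.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest f such that n >= 2f + 1."""
    return (n - 1) // 2


def all_quorums_overlap(n: int) -> bool:
    """Check exhaustively that any two majorities of n nodes share a node."""
    quorums = combinations(range(n), quorum_size(n))
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))


for n in (3, 4, 5, 7):
    print(f"n={n}: quorum={quorum_size(n)}, "
          f"tolerates f={tolerated_failures(n)}, "
          f"quorums overlap={all_quorums_overlap(n)}")
```

Note that a four-node cluster tolerates no more failures than a three-node one, which is why Raft clusters are usually sized with an odd number of nodes.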
Under the Hood
The randomised election timeout, Raft’s neatest trick, is what makes leader election terminate quickly in practice instead of livelocking on repeated split votes:
import random

def run_election(n_nodes, max_rounds=10):
    """Each node waits a random timeout; the first to fire becomes candidate.
    Randomisation makes simultaneous candidacies (split votes) rare."""
    for round_num in range(1, max_rounds + 1):
        # each follower picks a timeout in [150, 300] ms
        timeouts = [random.randint(150, 300) for _ in range(n_nodes)]
        first = min(timeouts)
        contenders = [i for i, t in enumerate(timeouts) if t == first]
        if len(contenders) == 1:
            return round_num, contenders[0]  # clean win
        # tie: split vote, no majority -> new term, everyone retries
    return max_rounds, None

random.seed(2)
wins = [run_election(5) for _ in range(1000)]
clean = sum(1 for rounds, leader in wins if rounds == 1)
print(f'{clean}/1000 elections decided in the FIRST term')
print(f'avg terms to elect a leader: {sum(r for r, _ in wins)/len(wins):.2f}')
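Because timeouts are randomised over a 150 ms spread, one node almost always fires first and wins before the others wake, so split votes are rare and self-correct on the next term. The simulation counts only exact millisecond ties as split votes; in a real cluster, candidates that time out within roughly one message delay of each other can also split the vote, which is why the spread should be much larger than the network latency. Paxos’s dueling-proposers livelock becomes, in Raft, a quick re-roll of the dice.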
Engineering Trade-offs
- Understandability vs theoretical minimalism. Raft fixes a strong leader and a single log to stay teachable — accepting that all writes funnel through one node, where leaderless variants (EPaxos) extract more parallelism at a large cost in complexity most teams can’t afford to verify.
- Strong leader: simple, but a throughput ceiling. One leader ordering every command gives clean linearizability and easy reasoning — and caps write throughput at one node’s capacity. Multi-Raft (one group per shard) is the standard escape, distributing load across many leaders.
- Randomised timeouts: liveness vs failover speed. Short election timeouts detect a dead leader fast but risk spurious elections under load or network jitter; long ones are stable but slow to recover. The 150–300 ms range is a tuned compromise, not a law.
- Commit needs a quorum round-trip. Every committed entry waits for majority replication: durable and split-brain-proof, but cross-region clusters pay inter-region latency per write. Reads served locally by a follower may be stale; linearizable reads need a leadership check first, either a quorum round-trip by the leader or a lease that relies on bounded clock drift.
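The quorum round-trip cost in the last trade-off can be estimated with a back-of-the-envelope model. The sketch below is a simplification under stated assumptions: `commit_latency_ms` is a hypothetical helper, the round-trip times are made-up figures, and disk writes and processing time are ignored. The leader counts itself, so it needs acknowledgements from only a majority minus one of its followers, and the commit waits for the slowest of those.

```python
from typing import List


def commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Estimated commit latency: the RTT of the slowest follower the quorum needs."""
    cluster_size = len(follower_rtts_ms) + 1      # followers plus the leader
    acks_needed = cluster_size // 2               # majority minus the leader itself
    if acks_needed == 0:
        return 0.0                                # single-node cluster commits locally
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical round-trip times from the leader to each follower, in ms.
same_region = [1.0, 2.0]                 # 3 nodes in one region
three_regions = [70.0, 140.0]            # 3 nodes, one per region
five_nodes_mixed = [1.0, 2.0, 70.0, 140.0]  # 5 nodes, two followers near the leader

for name, rtts in [("same region", same_region),
                   ("three regions", three_regions),
                   ("five nodes, mixed", five_nodes_mixed)]:
    print(f"{name}: about {commit_latency_ms(rtts)} ms per commit")
```

Under this model, placing a majority of nodes close to the leader keeps commit latency low even when the remaining replicas are far away, at the cost of less geographic fault tolerance.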
Real-world examples
- etcd (Kubernetes’ key-value store) uses Raft; every change to cluster state (pod scheduling decisions, secrets) is written through etcd’s Raft log.
- CockroachDB and TiKV run one Raft group per shard (a range in CockroachDB, a Region in TiKV) to replicate data across nodes and, optionally, regions.
- Consul uses Raft for its service registry.
- InfluxDB, RethinkDB, and Neo4j (enterprise) use Raft variants.
Common misconceptions
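- “Raft is slower than Paxos.” With a stable leader, Raft and Multi-Paxos both commit an entry in one round-trip from the leader to a majority, so their steady-state message costs are comparable; performance differences in practice come mostly from implementation choices such as batching and pipelining.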
- “The leader is a bottleneck.” The leader serialises commands, which is required for linearisability. Multi-Raft (many Raft groups, each for a partition) distributes the load across leaders.
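How Multi-Raft spreads write load can be illustrated with a toy router. This is a sketch under loud assumptions: keys are assigned to groups by hash (real systems such as CockroachDB and TiKV split by key range instead), the helper names are hypothetical, and the leader of each group is fixed in a list, whereas in practice it is whichever node won that group's latest election.

```python
import hashlib
from collections import Counter
from typing import List


def group_for_key(key: str, n_groups: int) -> int:
    """Map a key to a Raft group with a stable hash (assumed partitioning scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_groups


def leader_load(keys: List[str], n_groups: int, group_leaders: List[str]) -> Counter:
    """Count writes handled per node; group_leaders[g] is the leader of group g."""
    return Counter(group_leaders[group_for_key(k, n_groups)] for k in keys)


keys = [f"user:{i}" for i in range(3000)]
single = leader_load(keys, 1, ["node-a"])
multi = leader_load(keys, 6, ["node-a", "node-b", "node-c"] * 2)
print("single group:", dict(single))
print("six groups:  ", dict(multi))
```

With one group, every write lands on the same leader; with six groups led by three different nodes, each node orders only the writes for the groups it leads, while each group on its own still has a single leader and a single log.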
Try it yourself
Show why split votes are rare — and why removing the randomisation breaks it:
python3 -c "
import random
def elect(n, spread):
    # spread=0 means all nodes time out simultaneously (no randomisation)
    timeouts = [150 + random.randint(0, spread) for _ in range(n)]
    first = min(timeouts)
    return timeouts.count(first) == 1  # True = clean single winner
for spread in (0, 5, 150):
    random.seed(1)
    clean = sum(elect(5, spread) for _ in range(10000))
    label = 'NO randomisation' if spread == 0 else f'spread {spread}ms'
    print(f'{label:18}: {clean/100:.1f}% of elections had a clean winner')
"
With no randomisation every node fires at once and ties constantly; widen the spread and clean single-winner elections become overwhelmingly likely. That one design choice is what turns consensus from a livelock risk into a sub-second operation.
Learn next
- Paxos — the theory Raft repackaged for humans.
- Consensus — the abstract guarantee both provide.
- Distributed system — where “committed” semantics matter in practice.