Tags: distributed-systems, consensus, replication, fault-tolerance, etcd, raft, advanced

In Search of an Understandable Consensus Algorithm (Raft)

The consensus algorithm designed for understandability that powers etcd, Consul, and CockroachDB

Diego Ongaro, John Ousterhout | Stanford University | 2014 | 35 min read

Summary

Raft solves the distributed consensus problem—getting multiple servers to agree on a sequence of values even when some servers fail. Unlike Paxos, Raft was designed from the ground up for understandability by decomposing consensus into three relatively independent subproblems: leader election, log replication, and safety. The result is an algorithm that engineers can actually implement correctly.

Key Takeaways

Strong Leader Model

All writes flow through a single leader, simplifying reasoning about consistency. The leader has complete authority over log replication—it never overwrites its own entries and all entries flow from leader to followers.
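As a rough sketch of what this means in code (the `Node` and `Submit` names below are invented for illustration, not from the paper or any real Raft library): followers never accept writes directly; they point clients at the leader, and only the leader appends to its log.

```go
package main

import "fmt"

// Node is a minimal, illustrative server; not a real Raft implementation.
type Node struct {
	id       int
	isLeader bool
	leaderID int      // who this node currently believes is the leader
	log      []string // replicated log entries (commands)
}

// Submit accepts a client command. Followers do not append directly;
// they tell the client where the leader is. Entries then flow only
// from leader to followers, never in the other direction.
func (n *Node) Submit(cmd string) error {
	if !n.isLeader {
		return fmt.Errorf("not leader: redirect to node %d", n.leaderID)
	}
	n.log = append(n.log, cmd)
	return nil
}

func main() {
	leader := &Node{id: 1, isLeader: true, leaderID: 1}
	follower := &Node{id: 2, leaderID: 1}

	fmt.Println(leader.Submit("x=5"))   // <nil>: appended on the leader
	fmt.Println(follower.Submit("x=5")) // error: redirect to node 1
}
```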

Randomized Election Timeouts

Split votes are resolved elegantly using randomized timeouts (150-300ms typically). This simple mechanism avoids the complexity of ranking or priority schemes while ensuring elections complete quickly.
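A minimal sketch of the mechanism, assuming a hypothetical follower loop (the helper names are invented): each follower waits a random 150-300 ms for a heartbeat and starts an election only if that timer fires first.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout in [150ms, 300ms).
// Randomization makes it unlikely that two followers time out at the
// same instant and split the vote.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	timeout := time.NewTimer(randomElectionTimeout())
	heartbeat := make(chan struct{}) // a real follower would receive AppendEntries here

	for i := 0; i < 3; i++ {
		select {
		case <-heartbeat:
			// Heard from the leader: reset with a new random timeout.
			timeout.Reset(randomElectionTimeout())
		case <-timeout.C:
			fmt.Println("election timeout elapsed: become candidate, start election")
			timeout.Reset(randomElectionTimeout())
		}
	}
}
```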

Log Matching Property

If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries. This invariant, enforced by a simple consistency check during AppendEntries, is the foundation of Raft's safety.
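A minimal sketch of that consistency check, with invented types and 1-based indexing as in the paper: the follower accepts new entries only if its entry at `prevLogIndex` has term `prevLogTerm`.

```go
package main

import "fmt"

// Entry is one log entry; Term is the term in which it was created.
type Entry struct {
	Term    int
	Command string
}

// consistencyCheck sketches the AppendEntries check: accept new entries
// only if the local log contains an entry at prevLogIndex whose term is
// prevLogTerm. Indexing is 1-based, as in the paper.
func consistencyCheck(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start of the log
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	followerLog := []Entry{{Term: 1, Command: "x=1"}, {Term: 1, Command: "x=2"}, {Term: 2, Command: "x=5"}}

	// Leader claims the entry preceding the new ones is at index 3, term 2: matches.
	fmt.Println(consistencyCheck(followerLog, 3, 2)) // true
	// Leader claims index 3 has term 3: mismatch; the follower rejects and
	// the leader retries with an earlier prevLogIndex.
	fmt.Println(consistencyCheck(followerLog, 3, 3)) // false
}
```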

Commitment by Majority

An entry is committed once the leader has replicated it to a majority of servers. Since any two majorities overlap, a committed entry can never be lost—any future leader must have it.
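One way a leader could compute its commit index from this rule, sketched with an invented `matchIndex` slice (the paper's full rule additionally requires the entry at that index to be from the leader's current term):

```go
package main

import (
	"fmt"
	"sort"
)

// majorityCommitIndex sketches commit-index advancement: an entry is
// committed once it is stored on a majority of servers. matchIndex holds,
// for every server (leader included), the highest log index known to be
// replicated on that server.
func majorityCommitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// With n servers, the (n/2 + 1)-th highest value is replicated on a
	// majority; any two majorities overlap, so it can never be lost.
	return sorted[len(sorted)/2]
}

func main() {
	// 5-server cluster: leader and two followers have index 7, two lag behind.
	fmt.Println(majorityCommitIndex([]int{7, 7, 7, 4, 3})) // 7
	// Only two servers have index 7: not yet committed, commit index stays at 5.
	fmt.Println(majorityCommitIndex([]int{7, 7, 5, 4, 3})) // 5
}
```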

Term Numbers as Logical Clocks

Terms act as logical clocks that detect stale leaders. Every RPC includes the sender's term; if a server receives a request with a stale term, it rejects it. If it discovers a higher term, it immediately reverts to follower state.
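A sketch of the rule as it might look on every incoming RPC (the `Server` and `onRPCTerm` names are illustrative, not from any real library):

```go
package main

import "fmt"

// Role is the server's current role in the protocol.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Server holds just enough state to show the term rule.
type Server struct {
	currentTerm int
	role        Role
}

// onRPCTerm applies the rule to every incoming request or response:
// stale terms are rejected, and a higher term forces the server back
// to follower state immediately.
func (s *Server) onRPCTerm(senderTerm int) (accept bool) {
	switch {
	case senderTerm < s.currentTerm:
		return false // stale sender: reject the RPC
	case senderTerm > s.currentTerm:
		s.currentTerm = senderTerm
		s.role = Follower // step down
		return true
	default:
		return true
	}
}

func main() {
	s := &Server{currentTerm: 5, role: Leader}
	fmt.Println(s.onRPCTerm(4))                    // false: stale sender rejected, s stays leader
	fmt.Println(s.onRPCTerm(6))                    // true: higher term seen, s steps down
	fmt.Println(s.role == Follower, s.currentTerm) // true 6
}
```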

Joint Consensus for Membership Changes

Cluster membership changes use a two-phase approach (the joint configuration C_old,new) to prevent split-brain scenarios where two independent majorities could form during reconfiguration.
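A sketch of the joint-consensus quorum rule, with invented helper names: while C_old,new is in effect, a decision needs a majority of the old configuration and a majority of the new one, so no single group can decide alone.

```go
package main

import "fmt"

// hasQuorum reports whether the acknowledging servers form a majority
// of the given configuration.
func hasQuorum(config []string, acks map[string]bool) bool {
	count := 0
	for _, id := range config {
		if acks[id] {
			count++
		}
	}
	return count > len(config)/2
}

// jointQuorum sketches the C_old,new rule: agreement requires separate
// majorities from BOTH the old and the new configuration.
func jointQuorum(oldConfig, newConfig []string, acks map[string]bool) bool {
	return hasQuorum(oldConfig, acks) && hasQuorum(newConfig, acks)
}

func main() {
	oldConfig := []string{"s1", "s2", "s3"}
	newConfig := []string{"s3", "s4", "s5"}

	// s1, s2 ack: a majority of the old config, but not of the new one.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s1": true, "s2": true})) // false
	// s2, s3, s4 ack: majorities of both configurations.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s2": true, "s3": true, "s4": true})) // true
}
```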

Deep Dive

Distributed systems need consensus to maintain consistency across replicas. Consider a replicated key-value store: when a client writes `x=5`, all replicas must eventually agree on this value and the order of all writes. Without consensus, replicas diverge and clients see inconsistent data.
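A toy sketch of why an agreed log order is enough (types invented for illustration): two replicas that apply the same committed log in the same order end up with identical state.

```go
package main

import "fmt"

// Command is one replicated write, e.g. "set x to 5".
type Command struct {
	Key   string
	Value int
}

// KVStore is the state machine each replica runs. If every replica applies
// the same committed log in the same order, they converge to the same state.
type KVStore struct {
	data map[string]int
}

func NewKVStore() *KVStore { return &KVStore{data: make(map[string]int)} }

// Apply executes one committed command against the local state.
func (s *KVStore) Apply(c Command) { s.data[c.Key] = c.Value }

func main() {
	// The committed log order is what consensus provides; applying it on
	// two replicas independently yields identical state.
	log := []Command{{"x", 1}, {"y", 2}, {"x", 5}}

	a, b := NewKVStore(), NewKVStore()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.data["x"], b.data["x"]) // 5 5: both replicas agree
}
```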

The fundamental challenge: servers can fail at any time, network partitions can isolate groups of servers, and messages can be delayed or reordered. Despite these failures, the system must:

  1. Never return incorrect results (safety)
  2. Eventually make progress when a majority of servers are operational (liveness)

Paxos solved this problem in 1989, but its specification is notoriously difficult to understand. Real implementations like Google's Chubby required significant extensions not covered in the original paper. Raft was created specifically to be understandable while providing the same guarantees.

Trade-offs

| Aspect | Advantage | Disadvantage |
| --- | --- | --- |
| Understandability | Designed explicitly for clarity; engineers can implement and debug it correctly | Some optimizations are harder to add without breaking the simple mental model |
| Strong Leader | Simplifies reasoning: all decisions flow through one node, making log consistency straightforward | Leader is a bottleneck for writes and a single point of failure during leader transitions |
| Majority Quorum | Simple and robust; any two majorities overlap, ensuring committed entries are never lost | Requires 2f+1 servers to tolerate f failures; can't make progress with exactly half alive |
| Synchronous Replication | Committed entries are guaranteed durable across a majority; no data loss on leader failure | Write latency includes a network RTT to the majority; slow followers don't affect latency but reduce redundancy |
| In-Order Commits | Simple log structure; state machines see commands in the same order everywhere | A stuck command blocks all subsequent commands; no out-of-order commit optimization |
