Tags: distributed-systems, consensus, replication, fault-tolerance, etcd, raft, advanced

In Search of an Understandable Consensus Algorithm (Raft)

The consensus algorithm designed for understandability that powers etcd, Consul, and CockroachDB

Diego Ongaro, John Ousterhout | Stanford University | 2014 | 35 min read

Summary

Raft solves the distributed consensus problem—getting multiple servers to agree on a sequence of values even when some servers fail. Unlike Paxos, Raft was designed from the ground up for understandability by decomposing consensus into three relatively independent subproblems: leader election, log replication, and safety. The result is an algorithm that engineers can actually implement correctly.

Key Takeaways

Strong Leader Model

All writes flow through a single leader, simplifying reasoning about consistency. The leader has complete authority over log replication—it never overwrites its own entries and all entries flow from leader to followers.
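As a rough sketch of what this means in code (the `Node` and `Submit` names below are invented for illustration, not from the paper or any real Raft library): followers never accept writes directly; they point clients at the leader, and only the leader appends to its log.

```go
package main

import "fmt"

// Node is a minimal, illustrative server; not a real Raft implementation.
type Node struct {
	id       int
	isLeader bool
	leaderID int      // who this node currently believes is the leader
	log      []string // replicated log entries (commands)
}

// Submit accepts a client command. Followers do not append directly;
// they tell the client where the leader is. Entries then flow only
// from leader to followers, never in the other direction.
func (n *Node) Submit(cmd string) error {
	if !n.isLeader {
		return fmt.Errorf("not leader: redirect to node %d", n.leaderID)
	}
	n.log = append(n.log, cmd)
	return nil
}

func main() {
	leader := &Node{id: 1, isLeader: true, leaderID: 1}
	follower := &Node{id: 2, leaderID: 1}

	fmt.Println(leader.Submit("x=5"))   // <nil>: appended on the leader
	fmt.Println(follower.Submit("x=5")) // error: redirect to node 1
}
```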

Randomized Election Timeouts

Split votes are resolved elegantly using randomized timeouts (150-300ms typically). This simple mechanism avoids the complexity of ranking or priority schemes while ensuring elections complete quickly.
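A minimal sketch of the mechanism, assuming a hypothetical follower loop (the helper names are invented): each follower waits a random 150-300 ms for a heartbeat and starts an election only if that timer fires first.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout in [150ms, 300ms).
// Randomization makes it unlikely that two followers time out at the
// same instant and split the vote.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	timeout := time.NewTimer(randomElectionTimeout())
	heartbeat := make(chan struct{}) // a real follower would receive AppendEntries here

	for i := 0; i < 3; i++ {
		select {
		case <-heartbeat:
			// Heard from the leader: reset with a new random timeout.
			timeout.Reset(randomElectionTimeout())
		case <-timeout.C:
			fmt.Println("election timeout elapsed: become candidate, start election")
			timeout.Reset(randomElectionTimeout())
		}
	}
}
```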

Log Matching Property

If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries. This invariant, enforced by a simple consistency check during AppendEntries, is the foundation of Raft's safety.
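A minimal sketch of that consistency check, with invented types and 1-based indexing as in the paper: the follower accepts new entries only if its entry at `prevLogIndex` has term `prevLogTerm`.

```go
package main

import "fmt"

// Entry is one log entry; Term is the term in which it was created.
type Entry struct {
	Term    int
	Command string
}

// consistencyCheck sketches the AppendEntries check: accept new entries
// only if the local log contains an entry at prevLogIndex whose term is
// prevLogTerm. Indexing is 1-based, as in the paper.
func consistencyCheck(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start of the log
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	followerLog := []Entry{{Term: 1, Command: "x=1"}, {Term: 1, Command: "x=2"}, {Term: 2, Command: "x=5"}}

	// Leader claims the entry preceding the new ones is at index 3, term 2: matches.
	fmt.Println(consistencyCheck(followerLog, 3, 2)) // true
	// Leader claims index 3 has term 3: mismatch; the follower rejects and
	// the leader retries with an earlier prevLogIndex.
	fmt.Println(consistencyCheck(followerLog, 3, 3)) // false
}
```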

Commitment by Majority

An entry is committed once the leader has replicated it to a majority of servers. Since any two majorities overlap, a committed entry can never be lost—any future leader must have it.
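One way a leader could compute its commit index from this rule, sketched with an invented `matchIndex` slice (the paper's full rule additionally requires the entry at that index to be from the leader's current term):

```go
package main

import (
	"fmt"
	"sort"
)

// majorityCommitIndex sketches commit-index advancement: an entry is
// committed once it is stored on a majority of servers. matchIndex holds,
// for every server (leader included), the highest log index known to be
// replicated on that server.
func majorityCommitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// With n servers, the (n/2 + 1)-th highest value is replicated on a
	// majority; any two majorities overlap, so it can never be lost.
	return sorted[len(sorted)/2]
}

func main() {
	// 5-server cluster: leader and two followers have index 7, two lag behind.
	fmt.Println(majorityCommitIndex([]int{7, 7, 7, 4, 3})) // 7
	// Only two servers have index 7: not yet committed, commit index stays at 5.
	fmt.Println(majorityCommitIndex([]int{7, 7, 5, 4, 3})) // 5
}
```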

Term Numbers as Logical Clocks

Terms act as logical clocks that detect stale leaders. Every RPC includes the sender's term; if a server receives a request with a stale term, it rejects it. If it discovers a higher term, it immediately reverts to follower state.
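A sketch of the rule as it might look on every incoming RPC (the `Server` and `onRPCTerm` names are illustrative, not from any real library):

```go
package main

import "fmt"

// Role is the server's current role in the protocol.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Server holds just enough state to show the term rule.
type Server struct {
	currentTerm int
	role        Role
}

// onRPCTerm applies the rule to every incoming request or response:
// stale terms are rejected, and a higher term forces the server back
// to follower state immediately.
func (s *Server) onRPCTerm(senderTerm int) (accept bool) {
	switch {
	case senderTerm < s.currentTerm:
		return false // stale sender: reject the RPC
	case senderTerm > s.currentTerm:
		s.currentTerm = senderTerm
		s.role = Follower // step down
		return true
	default:
		return true
	}
}

func main() {
	s := &Server{currentTerm: 5, role: Leader}
	fmt.Println(s.onRPCTerm(4))                    // false: stale sender rejected, s stays leader
	fmt.Println(s.onRPCTerm(6))                    // true: higher term seen, s steps down
	fmt.Println(s.role == Follower, s.currentTerm) // true 6
}
```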

Joint Consensus for Membership Changes

Cluster membership changes use a two-phase approach (the joint configuration C_old,new) to prevent split-brain scenarios where two independent majorities could form during reconfiguration.
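A sketch of the joint-consensus quorum rule, with invented helper names: while C_old,new is in effect, a decision needs a majority of the old configuration and a majority of the new one, so no single group can decide alone.

```go
package main

import "fmt"

// hasQuorum reports whether the acknowledging servers form a majority
// of the given configuration.
func hasQuorum(config []string, acks map[string]bool) bool {
	count := 0
	for _, id := range config {
		if acks[id] {
			count++
		}
	}
	return count > len(config)/2
}

// jointQuorum sketches the C_old,new rule: agreement requires separate
// majorities from BOTH the old and the new configuration.
func jointQuorum(oldConfig, newConfig []string, acks map[string]bool) bool {
	return hasQuorum(oldConfig, acks) && hasQuorum(newConfig, acks)
}

func main() {
	oldConfig := []string{"s1", "s2", "s3"}
	newConfig := []string{"s3", "s4", "s5"}

	// s1, s2 ack: a majority of the old config, but not of the new one.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s1": true, "s2": true})) // false
	// s2, s3, s4 ack: majorities of both configurations.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s2": true, "s3": true, "s4": true})) // true
}
```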

Deep Dive

Distributed systems need consensus to maintain consistency across replicas. Consider a replicated key-value store: when a client writes `x=5`, all replicas must eventually agree on this value and the order of all writes. Without consensus, replicas diverge and clients see inconsistent data.
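A toy sketch of why an agreed log order is enough (types invented for illustration): two replicas that apply the same committed log in the same order end up with identical state.

```go
package main

import "fmt"

// Command is one replicated write, e.g. "set x to 5".
type Command struct {
	Key   string
	Value int
}

// KVStore is the state machine each replica runs. If every replica applies
// the same committed log in the same order, they converge to the same state.
type KVStore struct {
	data map[string]int
}

func NewKVStore() *KVStore { return &KVStore{data: make(map[string]int)} }

// Apply executes one committed command against the local state.
func (s *KVStore) Apply(c Command) { s.data[c.Key] = c.Value }

func main() {
	// The committed log order is what consensus provides; applying it on
	// two replicas independently yields identical state.
	log := []Command{{"x", 1}, {"y", 2}, {"x", 5}}

	a, b := NewKVStore(), NewKVStore()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.data["x"], b.data["x"]) // 5 5: both replicas agree
}
```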

The fundamental challenge: servers can fail at any time, network partitions can isolate groups of servers, and messages can be delayed or reordered. Despite these failures, the system must:

  1. Never return incorrect results (safety)
  2. Eventually make progress when a majority of servers are operational (liveness)

Paxos solved this problem in 1989, but its specification is notoriously difficult to understand. Real implementations like Google's Chubby required significant extensions not covered in the original paper. Raft was created specifically to be understandable while providing the same guarantees.

Trade-offs

| Aspect | Advantage | Disadvantage |
| --- | --- | --- |
| Understandability | Designed explicitly for clarity; engineers can implement and debug it correctly | Some optimizations are harder to add without breaking the simple mental model |
| Strong Leader | Simplifies reasoning: all decisions flow through one node, making log consistency straightforward | Leader is a bottleneck for writes and a single point of failure during leader transitions |
| Majority Quorum | Simple and robust; any two majorities overlap, ensuring committed entries are never lost | Requires 2f+1 servers to tolerate f failures; can't make progress with exactly half alive |
| Synchronous Replication | Committed entries are guaranteed durable across a majority; no data loss on leader failure | Write latency includes a network RTT to the majority; slow followers don't affect latency but reduce redundancy |
| In-Order Commits | Simple log structure; state machines see commands in the same order everywhere | A stuck command blocks all subsequent commands; no out-of-order commit optimization |
