Whitepapers
Google's coordination service that underpins GFS, Bigtable, and MapReduce—and inspired ZooKeeper
Chubby is a distributed lock service that provides coarse-grained locking and reliable small-file storage for loosely-coupled distributed systems. Rather than exposing raw Paxos consensus to developers, Chubby wraps it in a familiar file-system interface with locks. It serves as the coordination backbone for Google's infrastructure—GFS uses it for master election, Bigtable for tablet server coordination, and MapReduce for task assignment. The design prioritizes availability and reliability over raw performance, using aggressive caching and sessions to handle thousands of clients.
Google chose a centralized lock service over a Paxos client library because: (1) developers are far more familiar with locks than with consensus protocols, (2) a lock service lets a client system make progress with fewer servers of its own, since the service supplies the replicated quorum, and (3) a lock service naturally doubles as a place to store and advertise small amounts of metadata, such as the address of an elected primary.
Chubby is designed for coarse-grained locks held for hours or days, not milliseconds. This enables aggressive client-side caching and session-based semantics. Fine-grained locks would impose far more load on the lock server and would gain little from caching.
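To make the caching point concrete, here is a minimal Go sketch of a consistent client-side cache of node contents. The types and names (clientCache, Get, Invalidate, the fetch callback) are illustrative, not Chubby's actual code; the mechanism it mirrors is that the server explicitly invalidates stale entries (in Chubby, piggy-backed on keep-alive traffic), so reads are served locally until something actually changes, which under coarse-grained locking is rare.

```go
package chubbycache

import "sync"

// cacheEntry holds one node's contents plus a validity bit that an
// invalidation from the server can clear.
type cacheEntry struct {
	contents []byte
	valid    bool
}

// clientCache serves reads locally until the server invalidates an entry.
type clientCache struct {
	mu      sync.Mutex
	entries map[string]*cacheEntry
}

func newClientCache() *clientCache {
	return &clientCache{entries: make(map[string]*cacheEntry)}
}

// Get returns cached contents, falling back to fetch (a round trip to the
// lock service) on a miss or after an invalidation.
func (c *clientCache) Get(path string, fetch func(string) ([]byte, error)) ([]byte, error) {
	c.mu.Lock()
	if e, ok := c.entries[path]; ok && e.valid {
		data := e.contents
		c.mu.Unlock()
		return data, nil
	}
	c.mu.Unlock()

	data, err := fetch(path)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[path] = &cacheEntry{contents: data, valid: true}
	c.mu.Unlock()
	return data, nil
}

// Invalidate marks an entry stale so the next Get refetches it; in Chubby the
// master delivers such invalidations piggy-backed on keep-alive replies.
func (c *clientCache) Invalidate(path string) {
	c.mu.Lock()
	if e, ok := c.entries[path]; ok {
		e.valid = false
	}
	c.mu.Unlock()
}
```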
Chubby presents a simple file-system-like namespace of directories and files (nodes). Each node can act as an advisory lock and hold a small amount of data. This familiar interface hides the Paxos machinery underneath while supporting both locking and configuration storage.
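As a concrete illustration of how a client such as the GFS master might use this interface, here is a short Go sketch of leader election: win the lock on a well-known node, then advertise yourself by writing your address into that same node. The LockService and Handle interfaces and the path are hypothetical stand-ins; only the operation names (Open, Acquire, SetContents, Release) echo the handle calls described in the paper.

```go
// Package election sketches leader election on top of a Chubby-like service.
package election

// Handle is a client's handle on one node in the namespace; the node doubles
// as a lock and as a tiny file.
type Handle interface {
	Acquire() error                // block until this client holds the lock
	SetContents(data []byte) error // overwrite the node's (small) contents
	Release() error
}

// LockService opens handles on nodes by path.
type LockService interface {
	Open(path string) (Handle, error)
}

// BecomeMaster blocks until this process wins the election, then advertises
// its address by writing it into the node every other process will read.
func BecomeMaster(ls LockService, addr string) (Handle, error) {
	h, err := ls.Open("/ls/example-cell/gfs/master-lock") // illustrative well-known path
	if err != nil {
		return nil, err
	}
	if err := h.Acquire(); err != nil {
		return nil, err
	}
	if err := h.SetContents([]byte(addr)); err != nil {
		h.Release()
		return nil, err
	}
	// The caller keeps the handle (and its session) alive for as long as it
	// serves as master; losing the session means losing the lock.
	return h, nil
}
```

The useful property is that the lock and the advertisement live in the same node: the process that wins the election writes its address exactly where every other process already knows to look.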
By the mid-2000s, Google's infrastructure consisted of thousands of machines running distributed systems like GFS, Bigtable, and MapReduce. These systems faced common coordination challenges:
Leader Election: GFS needs a single master at any time. What happens when the master crashes? How do we elect a new one without split-brain?
Configuration Management: Bigtable tablet servers need to know the current master's address. How do we distribute this reliably?
Service Discovery: MapReduce workers need to find the job coordinator. How do we maintain an up-to-date registry? (A reader-side sketch follows this list.)
Distributed Locking: Multiple processes need exclusive access to resources. How do we implement locks that survive failures?
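With a lock-plus-small-file service in place, the challenges above collapse into one pattern: the election winner writes its identity into a well-known node, and everyone else reads that node and watches it for changes. Below is a reader-side Go sketch; the Watcher interface and callback shape are hypothetical, though notification on changed file contents is one of the events the paper's clients can subscribe to.

```go
// Package discovery sketches the reader side: find the current master by
// reading the well-known node and watching it for changes.
package discovery

// Watcher opens a node, returns its current contents, and invokes onChange
// whenever the node's contents are modified (for example, after a new master
// wins the election and writes its own address).
type Watcher interface {
	GetAndWatch(path string, onChange func(newContents []byte)) ([]byte, error)
}

// CurrentMaster returns the address advertised in the well-known node and
// calls update each time a failover installs a new master.
func CurrentMaster(w Watcher, update func(addr string)) (string, error) {
	contents, err := w.GetAndWatch(
		"/ls/example-cell/gfs/master-lock", // illustrative well-known path
		func(newContents []byte) { update(string(newContents)) },
	)
	if err != nil {
		return "", err
	}
	return string(contents), nil
}
```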
The naive solution, in which each team implements Paxos for itself, had problems: consensus protocols are notoriously hard to implement and test correctly, most teams do not design for high availability from the start, and every service would still need a well-known place to publish the outcome, such as the elected primary's address.
Key insight: A lock service is a better abstraction than a consensus library for most developers.