Apache ZooKeeper: Distributed Coordination Service

The battle-tested coordination kernel that powers Kafka, HBase, and Hadoop with reliable distributed primitives

Java | 12,000 stars | Updated January 2024 | 40 min read

Summary

ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. It exposes a simple API of znodes (organized like files in a filesystem) that clients use to build higher-level primitives such as distributed locks, leader election, and barriers. ZooKeeper achieves high availability through replication and uses the Zab protocol (similar in spirit to Raft) for consensus. Its ordered, wait-free API makes it ideal for coordination workloads where correctness matters more than raw throughput.

Key Takeaways

Hierarchical Namespace (znodes)

ZooKeeper organizes data in a tree structure similar to a filesystem. Each node (znode) can store data (up to 1MB) and have children. This hierarchy naturally maps to configuration trees, service registries, and lock paths.
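
To make this concrete, here is a minimal sketch using ZooKeeper's standard Java client. The connect string, paths, and payload are illustrative assumptions, and error handling is elided:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigTree {
        public static void main(String[] args) throws Exception {
            // Connect with a 5-second session timeout; this watcher ignores events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

            // A persistent parent znode for configuration.
            zk.create("/config", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // A child znode holding a small payload (well under the 1MB limit).
            zk.create("/config/db-url", "jdbc:postgresql://db:5432/app".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Reads return the raw bytes; version metadata is available via a Stat.
            System.out.println(new String(zk.getData("/config/db-url", false, null)));
            zk.close();
        }
    }

The later sketches in this article reuse this zk handle rather than repeating the setup.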

Ephemeral Nodes for Liveness

Ephemeral znodes are automatically deleted when the client session ends. This enables failure detection - if a service crashes, its ephemeral node disappears, triggering watches on other clients.
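
For example, a service instance can advertise itself with an ephemeral znode. This sketch reuses the zk handle from above; the path and address are made-up values, and the /services parent is assumed to exist:

    // Register this instance. The node disappears automatically if the
    // process crashes or its session times out, so watchers on /services
    // learn about the failure without any explicit deregistration.
    zk.create("/services/worker-1", "10.0.0.5:8080".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);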

Sequential Nodes for Ordering

ZooKeeper can append a monotonically increasing counter to node names. This enables fair queuing and leader election - the client with the lowest sequence number becomes the leader.
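
A sketch of that election step, assuming a pre-created /election parent:

    import java.util.Collections;
    import java.util.List;

    // EPHEMERAL_SEQUENTIAL: ZooKeeper appends a zero-padded counter, e.g.
    // /election/candidate-0000000007, and removes the node if this client dies.
    String me = zk.create("/election/candidate-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);  // zero-padding makes lexicographic == numeric order
    boolean isLeader = me.endsWith(candidates.get(0));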

Watches for Event Notification

Clients can set watches on znodes to receive one-time notifications when data changes or children are modified. Watches are ordered - a client sees the watch event before seeing the new data.
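
A sketch of the resulting read-and-rewatch loop for the config znode created earlier:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    Watcher configWatcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Passing `this` re-registers the one-time watch. The
                    // ordering guarantee means this read never returns data
                    // older than the change that fired the watch.
                    byte[] fresh = zk.getData(event.getPath(), this, null);
                    System.out.println("config changed: " + new String(fresh));
                } catch (Exception e) {
                    // real code must handle ConnectionLoss and SessionExpired
                }
            }
        }
    };
    zk.getData("/config/db-url", configWatcher, null);  // initial read sets the watch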

Zab Protocol for Consensus

ZooKeeper Atomic Broadcast (Zab) ensures all updates are totally ordered and replicated to a quorum before acknowledgment. This provides sequential consistency - all clients see updates in the same order.

Session Semantics

Clients maintain sessions with ZooKeeper through heartbeats. Sessions have timeouts - if the ensemble does not hear from a client within the timeout, the session expires and its ephemeral nodes are deleted. Clients can reconnect to any server in the ensemble and resume their session.
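
Session state changes arrive on the default watcher passed at construction. A sketch of the two states that matter most; the server list is illustrative:

    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> {
        switch (event.getState()) {
            case Disconnected:
                // Transient: the client library retries other servers from the
                // connect string and resumes the same session if it reconnects
                // within the session timeout.
                break;
            case Expired:
                // Fatal for this handle: ephemeral nodes are already gone on
                // the server side. A new ZooKeeper instance must be created and
                // any ephemeral state (locks, registrations) re-established.
                break;
            default:
                break;
        }
    });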

Deep Dive

ZooKeeper was created at Yahoo! Research to solve a common problem: distributed systems need coordination, but building coordination primitives correctly is extremely difficult. Every distributed application was reimplementing locks, leader election, and configuration management - often with subtle bugs.

ZooKeeper provides a small set of simple primitives that applications combine to build higher-level coordination:

  • Configuration Management: Store and distribute config across a cluster
  • Naming/Service Discovery: Register services, discover endpoints
  • Distributed Locks: Mutual exclusion across processes
  • Leader Election: Choose one node to be the master
  • Barriers: Synchronize phases across distributed processes
  • Group Membership: Track which nodes are alive

The key insight: instead of providing these primitives directly, ZooKeeper provides a filesystem-like API with special properties (ordering, watches, ephemeral nodes) that make implementing them straightforward and correct.
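
As an example of combining these properties, here is a sketch of the classic lock recipe built from ephemeral nodes, sequence numbers, and watches. It assumes a /lock parent exists and reuses the zk handle from earlier; production code should prefer a vetted recipe such as Apache Curator's InterProcessMutex:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.data.Stat;

    String myNode = zk.create("/lock/node-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    while (true) {
        List<String> waiters = zk.getChildren("/lock", false);
        Collections.sort(waiters);
        int idx = waiters.indexOf(myNode.substring("/lock/".length()));
        if (idx == 0) break;  // lowest sequence number holds the lock
        // Watch only the immediate predecessor, not the parent, so a release
        // wakes exactly one waiter instead of the whole herd.
        CountDownLatch released = new CountDownLatch(1);
        Stat stat = zk.exists("/lock/" + waiters.get(idx - 1),
                event -> released.countDown());
        if (stat != null) released.await();  // predecessor already gone? loop again
    }
    // ... critical section; deleting myNode (or session expiry) releases the lock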

ZooKeeper in a Distributed System

Who uses ZooKeeper?

  • Apache Kafka: Broker registration, controller election, and partition leadership metadata (replaced by KRaft in newer versions)
  • Apache HBase: Region server tracking, master election, schema management
  • Apache Hadoop: NameNode high availability, ResourceManager HA
  • Apache Solr: Cluster state, collection management, leader election
  • Apache Flink: Job manager HA, checkpoint coordination

Trade-offs

Consistency Model
  • Advantage: Sequential consistency with linearizable writes ensures all clients see updates in the same order.
  • Disadvantage: Reads from followers may be stale; sync() forces a fresh read at the cost of extra latency.

Hierarchical Namespace
  • Advantage: Natural organization for configs, locks, and registries; enables subtree watches and ACLs.
  • Disadvantage: No native support for range queries; clients must know exact paths or list children.

One-time Watches
  • Advantage: Simple semantics with guaranteed ordering between the watch event and subsequent reads.
  • Disadvantage: Must be re-registered after each event, so rapid changes can be missed; careless use causes a herd effect.

Ephemeral Nodes
  • Advantage: Automatic cleanup on failure enables leader election, locks, and service discovery.
  • Disadvantage: Session timeouts must be tuned carefully: too short causes false failures, too long delays detection.

Java Implementation
  • Advantage: Mature, battle-tested, with an extensive ecosystem of client libraries and tooling.
  • Disadvantage: JVM overhead and GC pauses can trigger session timeouts; higher memory footprint.

Ensemble Size
  • Advantage: More nodes tolerate more failures and can spread read load.
  • Disadvantage: More nodes slow down writes (more ACKs needed); only odd sizes (3, 5, 7) are efficient.
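
The ensemble-size trade-off reduces to quorum arithmetic, sketched below: every write needs a majority of acknowledgments, so growing from 3 to 4 servers adds an ACK without tolerating any additional failure.

    // Majority quorum: floor(N/2) + 1 ACKs per write; tolerates floor((N-1)/2) failures.
    static int quorum(int n)    { return n / 2 + 1; }
    static int tolerated(int n) { return (n - 1) / 2; }

    // quorum(3) == 2, tolerated(3) == 1
    // quorum(4) == 3, tolerated(4) == 1   <- extra ACK, no extra fault tolerance
    // quorum(5) == 3, tolerated(5) == 2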
