Apache ZooKeeper: Distributed Coordination Service

The battle-tested coordination kernel that powers Kafka, HBase, and Hadoop with reliable distributed primitives

Java | 12,000 stars | Updated January 2024 | 40 min read

Summary

ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. It exposes a simple API of znodes (organized like files in a filesystem) that clients use to build higher-level primitives such as distributed locks, leader election, and barriers. ZooKeeper achieves high availability through replication and uses the Zab protocol (similar in spirit to Raft) for consensus. Its ordered, wait-free API makes it ideal for coordination workloads where correctness matters more than raw throughput.

Key Takeaways

Hierarchical Namespace (znodes)

ZooKeeper organizes data in a tree structure similar to a filesystem. Each node (znode) can store data (up to 1MB) and have children. This hierarchy naturally maps to configuration trees, service registries, and lock paths.
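
To make this concrete, here is a minimal sketch using ZooKeeper's standard Java client. The connect string, paths, and payload are illustrative assumptions, and error handling is elided:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigTree {
        public static void main(String[] args) throws Exception {
            // Connect with a 5-second session timeout; this watcher ignores events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

            // A persistent parent znode for configuration.
            zk.create("/config", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // A child znode holding a small payload (well under the 1MB limit).
            zk.create("/config/db-url", "jdbc:postgresql://db:5432/app".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Reads return the raw bytes; version metadata is available via a Stat.
            System.out.println(new String(zk.getData("/config/db-url", false, null)));
            zk.close();
        }
    }

The later sketches in this article reuse this zk handle rather than repeating the setup.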

Ephemeral Nodes for Liveness

Ephemeral znodes are automatically deleted when the client session ends. This enables failure detection - if a service crashes, its ephemeral node disappears, triggering watches on other clients.
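
For example, a service instance can advertise itself with an ephemeral znode. This sketch reuses the zk handle from above; the path and address are made-up values, and the /services parent is assumed to exist:

    // Register this instance. The node disappears automatically if the
    // process crashes or its session times out, so watchers on /services
    // learn about the failure without any explicit deregistration.
    zk.create("/services/worker-1", "10.0.0.5:8080".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);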

Sequential Nodes for Ordering

ZooKeeper can append a monotonically increasing counter to node names. This enables fair queuing and leader election - the client with the lowest sequence number becomes the leader.
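
A sketch of that election step, assuming a pre-created /election parent:

    import java.util.Collections;
    import java.util.List;

    // EPHEMERAL_SEQUENTIAL: ZooKeeper appends a zero-padded counter, e.g.
    // /election/candidate-0000000007, and removes the node if this client dies.
    String me = zk.create("/election/candidate-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);  // zero-padding makes lexicographic == numeric order
    boolean isLeader = me.endsWith(candidates.get(0));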

Watches for Event Notification

Clients can set watches on znodes to receive one-time notifications when data changes or children are modified. Watches are ordered - a client sees the watch event before seeing the new data.
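
A sketch of the resulting read-and-rewatch loop for the config znode created earlier:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    Watcher configWatcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Passing `this` re-registers the one-time watch. The
                    // ordering guarantee means this read never returns data
                    // older than the change that fired the watch.
                    byte[] fresh = zk.getData(event.getPath(), this, null);
                    System.out.println("config changed: " + new String(fresh));
                } catch (Exception e) {
                    // real code must handle ConnectionLoss and SessionExpired
                }
            }
        }
    };
    zk.getData("/config/db-url", configWatcher, null);  // initial read sets the watch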

Zab Protocol for Consensus

ZooKeeper Atomic Broadcast (Zab) ensures all updates are totally ordered and replicated to a quorum before acknowledgment. This provides sequential consistency - all clients see updates in the same order.

Session Semantics

Clients maintain sessions with ZooKeeper through heartbeats. Sessions have timeouts - if the ensemble does not hear from a client within the timeout, the session expires and its ephemeral nodes are deleted. Clients can reconnect to any server in the ensemble and resume their session.
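
Session state changes arrive on the default watcher passed at construction. A sketch of the two states that matter most; the server list is illustrative:

    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> {
        switch (event.getState()) {
            case Disconnected:
                // Transient: the client library retries other servers from the
                // connect string and resumes the same session if it reconnects
                // within the session timeout.
                break;
            case Expired:
                // Fatal for this handle: ephemeral nodes are already gone on
                // the server side. A new ZooKeeper instance must be created and
                // any ephemeral state (locks, registrations) re-established.
                break;
            default:
                break;
        }
    });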

Deep Dive

ZooKeeper was created at Yahoo! Research to solve a common problem: distributed systems need coordination, but building coordination primitives correctly is extremely difficult. Every distributed application was reimplementing locks, leader election, and configuration management - often with subtle bugs.

ZooKeeper provides a small set of simple primitives that applications combine to build higher-level coordination:

  • Configuration Management: Store and distribute config across a cluster
  • Naming/Service Discovery: Register services, discover endpoints
  • Distributed Locks: Mutual exclusion across processes
  • Leader Election: Choose one node to be the master
  • Barriers: Synchronize phases across distributed processes
  • Group Membership: Track which nodes are alive

The key insight: instead of providing these primitives directly, ZooKeeper provides a filesystem-like API with special properties (ordering, watches, ephemeral nodes) that make implementing them straightforward and correct.
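
As an example of combining these properties, here is a sketch of the classic lock recipe built from ephemeral nodes, sequence numbers, and watches. It assumes a /lock parent exists and reuses the zk handle from earlier; production code should prefer a vetted recipe such as Apache Curator's InterProcessMutex:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.data.Stat;

    String myNode = zk.create("/lock/node-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    while (true) {
        List<String> waiters = zk.getChildren("/lock", false);
        Collections.sort(waiters);
        int idx = waiters.indexOf(myNode.substring("/lock/".length()));
        if (idx == 0) break;  // lowest sequence number holds the lock
        // Watch only the immediate predecessor, not the parent, so a release
        // wakes exactly one waiter instead of the whole herd.
        CountDownLatch released = new CountDownLatch(1);
        Stat stat = zk.exists("/lock/" + waiters.get(idx - 1),
                event -> released.countDown());
        if (stat != null) released.await();  // predecessor already gone? loop again
    }
    // ... critical section; deleting myNode (or session expiry) releases the lock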

ZooKeeper in a Distributed System

Who uses ZooKeeper?

  • Apache Kafka: Broker registration, controller election, and partition leadership metadata (replaced by KRaft in newer versions)
  • Apache HBase: Region server tracking, master election, schema management
  • Apache Hadoop: NameNode high availability, ResourceManager HA
  • Apache Solr: Cluster state, collection management, leader election
  • Apache Flink: Job manager HA, checkpoint coordination

Trade-offs

Consistency Model
  • Advantage: Sequential consistency with linearizable writes ensures all clients see updates in the same order.
  • Disadvantage: Reads from followers may be stale; sync() forces a fresh read at the cost of extra latency.

Hierarchical Namespace
  • Advantage: Natural organization for configs, locks, and registries; enables subtree watches and ACLs.
  • Disadvantage: No native support for range queries; clients must know exact paths or list children.

One-time Watches
  • Advantage: Simple semantics with guaranteed ordering between the watch event and subsequent reads.
  • Disadvantage: Must be re-registered after each event, so rapid changes can be missed; careless use causes a herd effect.

Ephemeral Nodes
  • Advantage: Automatic cleanup on failure enables leader election, locks, and service discovery.
  • Disadvantage: Session timeouts must be tuned carefully: too short causes false failures, too long delays detection.

Java Implementation
  • Advantage: Mature, battle-tested, with an extensive ecosystem of client libraries and tooling.
  • Disadvantage: JVM overhead and GC pauses can trigger session timeouts; higher memory footprint.

Ensemble Size
  • Advantage: More nodes tolerate more failures and can spread read load.
  • Disadvantage: More nodes slow down writes (more ACKs needed); only odd sizes (3, 5, 7) are efficient.
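
The ensemble-size trade-off reduces to quorum arithmetic, sketched below: every write needs a majority of acknowledgments, so growing from 3 to 4 servers adds an ACK without tolerating any additional failure.

    // Majority quorum: floor(N/2) + 1 ACKs per write; tolerates floor((N-1)/2) failures.
    static int quorum(int n)    { return n / 2 + 1; }
    static int tolerated(int n) { return (n - 1) / 2; }

    // quorum(3) == 2, tolerated(3) == 1
    // quorum(4) == 3, tolerated(4) == 1   <- extra ACK, no extra fault tolerance
    // quorum(5) == 3, tolerated(5) == 2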
