The pub-sub system that became the backbone of modern data pipelines, processing trillions of messages daily
Kafka is a distributed publish-subscribe messaging system designed for high-throughput log processing. Unlike traditional message brokers that track per-message state, Kafka uses an append-only log abstraction where consumers track their own position. This simple design, combined with sequential disk I/O, OS page cache utilization, and zero-copy transfers, enables Kafka to achieve throughput orders of magnitude higher than traditional systems while providing durability guarantees through replication.
Kafka models each topic partition as an append-only, ordered, immutable sequence of messages. This log abstraction simplifies the broker (just append and serve), enables efficient sequential I/O, and allows consumers to rewind and replay messages—impossible in traditional queues.
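To make the abstraction concrete, here is a minimal in-memory sketch of a partition log in Java. The PartitionLog class and its method names are illustrative rather than Kafka's actual code, and a real broker persists segment files to disk instead of holding an ArrayList, but the interface is the same: append returns an offset, and reads are addressed by offset.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a partition log: an append-only, ordered, immutable
// sequence of messages addressed by offset. Illustrative only.
final class PartitionLog {
    private final List<byte[]> messages = new ArrayList<>();

    // Append a message and return its offset (its position in the log).
    synchronized long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Serve messages starting at a consumer-supplied offset. Entries are
    // never mutated or reordered, so any past offset can be replayed.
    synchronized List<byte[]> read(long fromOffset, int maxMessages) {
        int start = (int) Math.min(fromOffset, messages.size());
        int end = Math.min(start + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(start, end));
    }
}
```

Note that the broker's job reduces to two operations, append and serve-by-offset; there is no delivery state, no acknowledgment table, and no per-message metadata to maintain.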
Unlike traditional brokers that track delivery state per message, Kafka consumers track their own position (offset) in the log. This eliminates per-message bookkeeping on the broker, enabling massive throughput and allowing consumers to re-read messages at will.
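The consumer side can be sketched under the same assumptions. The SimpleConsumer class below is hypothetical and reuses the PartitionLog sketch above; it shows how a consumer that owns its offset can advance, checkpoint, or rewind its position with no broker-side acknowledgment state at all.

```java
import java.util.List;

// Sketch of a consumer that owns its position in the log. The broker does
// no per-message bookkeeping; the consumer asks for "messages from offset N"
// and advances (or rewinds) N itself. Uses the PartitionLog sketch above.
final class SimpleConsumer {
    private final PartitionLog log;
    private long offset; // consumer-owned position, checkpointed by the consumer

    SimpleConsumer(PartitionLog log, long startOffset) {
        this.log = log;
        this.offset = startOffset;
    }

    void poll(int maxMessages) {
        List<byte[]> batch = log.read(offset, maxMessages);
        for (byte[] msg : batch) {
            process(msg);
        }
        offset += batch.size(); // advance only after processing; no broker ack
    }

    // Rewinding is just resetting the offset; replay is cheap because the
    // log itself never changes.
    void seek(long newOffset) {
        this.offset = newOffset;
    }

    private void process(byte[] msg) {
        System.out.println("consumed " + msg.length + " bytes");
    }
}
```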
Kafka deliberately uses sequential disk writes and reads. Sequential disk I/O can sustain hundreds of megabytes per second (around 600 MB/s on the hardware of the time) and in some cases outperforms random memory access. Combined with the OS page cache, Kafka achieves near-memory speeds with on-disk durability.
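The data path also leans on zero-copy transfer. In Java this is FileChannel.transferTo, which delegates the file-to-socket copy to the kernel (sendfile on Linux), so log bytes move from the page cache to the network without passing through user space. The sketch below is illustrative (the segment path, port, and class name are assumptions), but transferTo is the real NIO call Kafka relies on.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of zero-copy transfer: the kernel copies file bytes directly to
// the socket, with no intermediate user-space buffer. Paths and addresses
// are illustrative.
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("/tmp/topic-0/00000000000000000000.log"); // hypothetical segment file
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```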
By 2010, LinkedIn faced a data infrastructure crisis. The company generated massive amounts of log data: user activity events (page views, searches, clicks, and other social actions) and operational metrics from its services.
This data needed to flow to multiple destinations:
- Hadoop for offline analytics and machine learning
- Real-time systems for monitoring and alerting
- Search indexes for log exploration
- Data warehouses for business intelligence
The existing solutions failed:
Traditional message queues (ActiveMQ, RabbitMQ) couldn't handle the throughput. They were designed for transactional messaging with per-message acknowledgment tracking—great for order processing, terrible for log streams.
Custom log aggregators (Scribe, Flume) provided throughput but no durability, replay, or real-time consumption. Data went one-way into Hadoop.
LinkedIn needed a system that combined:
- The high throughput of log aggregators
- The durability and replay of databases
- The pub-sub flexibility of message queues
- The horizontal scalability of distributed systems