The distributed commit log that processes trillions of events daily at LinkedIn, Uber, and Netflix
Kafka is a distributed commit log designed for high-throughput, fault-tolerant event streaming. It stores streams of records in categories called topics, which are split into partitions for parallelism. Producers append records to partitions, and consumers read them in order. The key insight: by treating message storage as a distributed log with sequential I/O and zero-copy transfers, Kafka achieves throughput of millions of messages per second while maintaining per-partition ordering and durability.
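As a rough sketch of the producer side of that model, the snippet below appends a few keyed records to a topic with the Java client. The broker address (`localhost:9092`), topic name (`page-views`), and key (`user-42`) are placeholders, not details from the original text; records that share a key hash to the same partition, which is what preserves their relative order.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition,
            // so per-user ordering is preserved within that partition.
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /page/" + i));
            }
            producer.flush();
        }
    }
}
```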
Kafka treats the commit log as the fundamental data structure. Records are appended sequentially and never modified. This append-only design favors sequential disk I/O (on the order of 600 MB/s on spinning disks) over random I/O (roughly 100 KB/s), which is how disk-backed storage can rival memory-based alternatives in throughput.
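The data structure itself is simple enough to sketch. The toy class below is an in-memory illustration of an append-only log addressed by offset, not Kafka's actual storage engine (Kafka persists segment files to disk and never rewrites them):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative append-only log: records are only ever added at the end. */
class CommitLog {
    private final List<String> records = new ArrayList<>();

    /** Append a record and return its offset (its position in the log). */
    synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Read the record at a given offset; existing records are never modified. */
    synchronized String read(long offset) {
        return records.get((int) offset);
    }
}
```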
Topics are split into partitions, each an ordered, immutable sequence of records. Partitions enable parallel production and consumption: within a consumer group, each partition is assigned to at most one consumer, so consumers beyond the partition count sit idle. The partition count is your parallelism ceiling.
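The partition count is set when the topic is created (it can be increased later, never decreased). A sketch using the Java AdminClient, with a hypothetical topic name, partition count, and broker address:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions caps a single consumer group at 12 parallel readers.
            NewTopic topic = new NewTopic("page-views", 12, (short) 3); // name, partitions, replication factor
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```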
Consumers in a group divide partitions among themselves. Adding consumers automatically rebalances the load. This enables horizontal scaling without changing producer code or topic configuration.
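A sketch of one member of a consumer group, using the same placeholder broker and topic as above plus a hypothetical group id: running several copies of this process with the same `group.id` makes Kafka split the topic's partitions among them, and any copy joining or leaving triggers a rebalance.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "analytics-service");         // same group.id => partitions are shared
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views"));
            while (true) {
                // Each group member receives records only from its assigned partitions.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```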
Kafka was created at LinkedIn in 2010 to solve a specific problem: how do you collect and process activity data (page views, searches, ad clicks) from hundreds of services in real time?
Traditional message queues like RabbitMQ work well for task distribution but struggle with:

- High throughput: millions of events per second
- Durability: messages must survive broker failures
- Replay: the ability to reprocess historical data
- Multiple consumers: many services reading the same data
Kafka solves this by treating messaging as a distributed commit log rather than a message queue. The key differences:
| Traditional Queue | Kafka |
|-------------------|-------|
| Messages deleted after consumption | Messages retained by time/size |
| Single consumer per message | Multiple consumer groups |
| Random access | Sequential, offset-based |
| Memory-focused | Disk-focused (sequential I/O) |
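The retention and offset model is what makes replay possible. As a hedged sketch (same placeholder broker and topic as above), a consumer can pin itself to a partition and rewind to the earliest retained offset without disturbing other consumer groups:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayPartition {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Pin to one partition explicitly and rewind to the earliest retained offset.
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seekToBeginning(Collections.singleton(partition));

            // Retained records are re-read in offset order; other consumer groups are unaffected.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```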
Common use cases: