Apache Kafka: Distributed Event Streaming Platform

The distributed commit log that processes trillions of events daily at LinkedIn, Uber, and Netflix

Java/Scala | 28,000 stars | Updated January 2024 | 40 min read


Summary

Kafka is a distributed commit log designed for high-throughput, fault-tolerant event streaming. It stores streams of records in categories called topics, which are split into partitions for parallelism. Producers append records to partitions, and consumers read them in order. The key insight: by treating message storage as a distributed log with sequential I/O and zero-copy transfers, Kafka achieves throughput of millions of messages per second while maintaining ordering guarantees and durability.

Key Takeaways

The Log is the Database

Kafka treats the commit log as the fundamental data structure. Records are appended sequentially and never modified. This append-only design enables sequential disk I/O (~600 MB/s) instead of random I/O (~100 KB/s), making disk-based storage competitive with memory-based alternatives.
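The append-only model can be sketched in a few lines. This is a toy stand-in (the file name and length-prefix framing are invented for illustration), not Kafka's actual segment format:

```python
import os

# Toy log segment: records are length-prefixed and only ever appended,
# so every write lands at the end of the file (sequential I/O).
class LogSegment:
    def __init__(self, path: str):
        self.path = path
        self.positions = []                  # byte position of each record

    def append(self, record: bytes) -> int:
        with open(self.path, "ab") as f:     # "ab" = append-only
            self.positions.append(f.tell())
            f.write(len(record).to_bytes(4, "big") + record)
        return len(self.positions) - 1       # the record's logical offset

    def read(self, offset: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(self.positions[offset])
            size = int.from_bytes(f.read(4), "big")
            return f.read(size)

seg = LogSegment("toy-segment.log")
seg.append(b"page_view")
seg.append(b"ad_click")
second = seg.read(1)                         # reads never modify the log
print(second)                                # b'ad_click'
os.remove("toy-segment.log")
```

Because nothing is ever rewritten in place, there is no page-level locking or update-in-place bookkeeping: "delete" is simply retention expiry or compaction, as the trade-offs section below notes.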

Partitions for Parallelism

Topics are split into partitions, each an ordered, immutable sequence of records. Partitions enable parallel production and consumption: within a consumer group, you can have at most as many active consumers as partitions. The partition count is your parallelism ceiling.
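Key-based partition selection can be sketched as follows. Kafka's default partitioner hashes the key with murmur2; crc32 here is only a stand-in for the idea:

```python
import zlib

NUM_PARTITIONS = 6   # fixed at topic creation: the parallelism ceiling

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Same key -> same hash -> same partition, which is what preserves
    # per-key ordering. (Kafka uses murmur2; crc32 is a stand-in.)
    return zlib.crc32(key) % num_partitions

# All of user-17's events land in one partition, so they stay ordered.
events = [(b"user-17", "login"), (b"user-42", "click"), (b"user-17", "logout")]
placed = [(partition_for(key), value) for key, value in events]
print(placed)
```

This is also why hot keys cause hot partitions: every record for a popular key hashes to the same place.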

Consumer Groups for Scalability

Consumers in a group divide partitions among themselves. Adding consumers automatically rebalances the load. This enables horizontal scaling without changing producer code or topic configuration.
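The rebalancing behavior can be sketched with a simplified round-robin assignor. Kafka ships several strategies (range, round-robin, sticky); this shows only the core idea:

```python
def assign_round_robin(partitions: list[str], consumers: list[str]) -> dict:
    """Deal partitions out to group members one at a time --
    a simplified take on Kafka's RoundRobinAssignor."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = [f"orders-{i}" for i in range(6)]
print(assign_round_robin(partitions, ["c1", "c2"]))
# Adding a consumer triggers a rebalance: same partitions, new split.
print(assign_round_robin(partitions, ["c1", "c2", "c3"]))
```

Note that a seventh consumer added to this six-partition topic would sit idle, which is the parallelism ceiling in action.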

Offset-Based Consumption

Each record in a partition has a unique offset (its position in the log). Consumers track their own offsets, enabling replay from any point, multiple consumer groups reading independently, and, combined with idempotent producers and transactions, exactly-once semantics.
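Offset tracking is just a per-group integer cursor into the partition; a minimal in-memory sketch (group names are invented):

```python
partition = ["e0", "e1", "e2", "e3"]          # records; index == offset

# Each consumer group tracks only its own committed offset.
committed = {"analytics": 4, "billing": 2}

def poll(group: str, max_records: int = 10) -> list:
    """Return unread records for a group and advance its offset."""
    start = committed[group]
    batch = partition[start:start + max_records]
    committed[group] = start + len(batch)
    return batch

print(poll("billing"))      # ['e2', 'e3'] -- picks up where it left off
committed["analytics"] = 0  # rewinding the offset replays everything
print(poll("analytics"))    # ['e0', 'e1', 'e2', 'e3']
```

Because the broker stores records regardless of who has read them, replay costs nothing beyond moving the cursor back.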

Replication for Durability

Each partition is replicated across multiple brokers. One replica is the leader (handles all reads/writes), others are followers. If the leader fails, a follower is promoted. ISR (In-Sync Replicas) tracks which replicas are caught up.
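The ISR bookkeeping can be sketched as below. One caveat: real Kafka judges ISR membership by time lag (`replica.lag.time.max.ms`); offsets are used here only to keep the sketch small, and the broker names are invented:

```python
# Toy view of replication state for one partition.
leader_end = 4                                  # leader's log end offset
follower_end = {"broker-2": 4, "broker-3": 2}   # followers' log end offsets
max_lag = 1

# Followers that have fallen too far behind drop out of the ISR.
isr = {b for b, end in follower_end.items() if leader_end - end <= max_lag}

# High watermark: highest offset replicated to every ISR member --
# consumers may only read up to here.
high_watermark = min([leader_end] + [follower_end[b] for b in isr])
print(sorted(isr), high_watermark)
```

Here broker-3 has fallen out of the ISR, so the high watermark advances without it; only ISR members are eligible for leader election, which is what keeps failover from losing acknowledged writes.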

Zero-Copy Data Transfer

Kafka uses the sendfile() syscall to transfer data directly from disk to the network socket, bypassing user-space buffers. Combined with batching and compression, this lets a single broker sustain throughput on the order of 2 GB/s.
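The zero-copy path is visible from Python too: `os.sendfile()` moves bytes from a file descriptor to a socket inside the kernel, with no user-space copy (Kafka does the equivalent through Java NIO's `FileChannel.transferTo`). A minimal demo over a local socket pair, with invented file contents:

```python
import os
import socket

def send_segment(path: str, sock: socket.socket) -> int:
    """Stream a file to a socket via sendfile(), never copying the
    bytes into this process's memory."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves disk pages straight to the socket buffer.
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent

with open("demo-segment.log", "wb") as f:
    f.write(b"record-0\nrecord-1\n")

a, b = socket.socketpair()               # stand-in for a consumer connection
n = send_segment("demo-segment.log", a)
a.close()
received = b.recv(64)
b.close()
os.remove("demo-segment.log")
print(n, received)
```

The payoff compounds with fan-out: many consumer groups reading the same partition are all served from the page cache via the same copy-free path.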

Deep Dive

Kafka was created at LinkedIn in 2010 to solve a specific problem: how do you collect and process activity data (page views, searches, ad clicks) from hundreds of services in real-time?

Traditional message queues like RabbitMQ work well for task distribution but struggle with:

  - High throughput: Millions of events per second
  - Durability: Messages must survive broker failures
  - Replay: Ability to reprocess historical data
  - Multiple consumers: Many services reading the same data

Kafka solves this by treating messaging as a distributed commit log rather than a message queue. The key differences:

| Traditional Queue | Kafka |
|-------------------|-------|
| Messages deleted after consumption | Messages retained by time/size |
| Single consumer per message | Multiple consumer groups |
| Random access | Sequential, offset-based |
| Memory-focused | Disk-focused (sequential I/O) |

Kafka vs Traditional Message Queue

Common use cases:

  1. Event streaming: Real-time user activity tracking
  2. Log aggregation: Collecting logs from distributed services
  3. Stream processing: ETL pipelines with Kafka Streams or Flink
  4. Event sourcing: Storing state changes as immutable events
  5. Commit log: Database replication and CDC (Change Data Capture)
  6. Metrics pipeline: Collecting and routing metrics to monitoring systems
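Use case 4 can be sketched directly: current state is never stored, only derived by folding over the immutable event log (the account example is invented):

```python
# Event sourcing in miniature: state = fold(events), never stored directly.
events = [
    {"type": "deposit",  "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit",  "amount": 5},
]

def balance_at(events: list) -> int:
    """Rebuild the account balance by replaying events in order."""
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "deposit" else -e["amount"]
    return balance

print(balance_at(events))       # 75 -- current state
print(balance_at(events[:2]))   # 70 -- any historical state is recoverable
```

Kafka's retained, replayable log is what makes this pattern practical: a new service can bootstrap its state by consuming the topic from offset 0.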

Trade-offs

| Aspect | Advantage | Disadvantage |
|--------|-----------|--------------|
| Append-only log design | Sequential I/O enables disk throughput rivaling memory; simple, debuggable data model | No in-place updates; deletions require compaction or retention expiry; large storage footprint for mutable data |
| Partition-based parallelism | Linear horizontal scaling; ordering guaranteed within partition; consumer parallelism matches partition count | Partition count is hard to change; uneven key distribution causes hot partitions; cross-partition ordering requires external coordination |
| Pull-based consumption | Consumers control their pace; easy replay and reprocessing; no need for complex flow control | Consumers must poll continuously; latency higher than push-based systems for light workloads |
| Consumer group coordination | Automatic load balancing; adding consumers scales consumption; offset tracking built-in | Rebalancing pauses consumption; max parallelism limited by partition count; complex failure semantics |
| At-least-once default delivery | Simple, high throughput; works for most use cases; easy to reason about | Requires idempotent consumers for correctness; exactly-once adds complexity and latency |
| Replicated for durability | Survives broker failures; configurable durability vs latency tradeoff (acks) | Replication factor multiplies storage; ISR shrinkage can block writes; leader election causes brief unavailability |
| Schema-agnostic | Flexible: accepts any bytes; producers/consumers manage their own serialization | No built-in schema enforcement; requires external schema registry for compatibility; debugging raw bytes is hard |
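The durability-versus-latency knob from the trade-offs above is turned in producer (and topic) configuration; a typical durability-first combination looks like this (values are illustrative, not a recommendation for every workload):

```properties
# Producer: wait for every in-sync replica to acknowledge each write.
acks=all
# Safe retries: idempotence prevents duplicates when a send is retried.
enable.idempotence=true
retries=2147483647

# Topic/broker side: refuse writes if the ISR shrinks below 2 replicas.
min.insync.replicas=2
```

Latency-sensitive pipelines instead run `acks=1` (leader only) or `acks=0` (fire and forget), accepting possible loss on leader failure in exchange for lower produce latency.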
