Apache Kafka: Distributed Event Streaming Platform

The distributed commit log that processes trillions of events daily at LinkedIn, Uber, and Netflix

Java/Scala | 28,000 stars | Updated January 2024 | 40 min read


Summary

Kafka is a distributed commit log designed for high-throughput, fault-tolerant event streaming. It stores streams of records in categories called topics, which are split into partitions for parallelism. Producers append records to partitions, and consumers read them in order. The key insight: by treating message storage as a distributed log with sequential I/O and zero-copy transfers, Kafka achieves throughput of millions of messages per second while maintaining ordering guarantees and durability.

Key Takeaways

The Log is the Database

Kafka treats the commit log as the fundamental data structure. Records are appended sequentially and never modified. This append-only design enables sequential disk I/O (~600 MB/s) instead of random I/O (~100 KB/s), making disk-based storage competitive with memory-based alternatives.
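The append-only model can be sketched in a few lines. This is a toy stand-in (the file name and length-prefix framing are invented for illustration), not Kafka's actual segment format:

```python
import os

# Toy log segment: records are length-prefixed and only ever appended,
# so every write lands at the end of the file (sequential I/O).
class LogSegment:
    def __init__(self, path: str):
        self.path = path
        self.positions = []                  # byte position of each record

    def append(self, record: bytes) -> int:
        with open(self.path, "ab") as f:     # "ab" = append-only
            self.positions.append(f.tell())
            f.write(len(record).to_bytes(4, "big") + record)
        return len(self.positions) - 1       # the record's logical offset

    def read(self, offset: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(self.positions[offset])
            size = int.from_bytes(f.read(4), "big")
            return f.read(size)

seg = LogSegment("toy-segment.log")
seg.append(b"page_view")
seg.append(b"ad_click")
second = seg.read(1)                         # reads never modify the log
print(second)                                # b'ad_click'
os.remove("toy-segment.log")
```

Because nothing is ever rewritten in place, there is no page-level locking or update-in-place bookkeeping: "delete" is simply retention expiry or compaction, as the trade-offs section below notes.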

Partitions for Parallelism

Topics are split into partitions, each an ordered, immutable sequence of records. Partitions enable parallel production and consumption: within a consumer group, you can have at most as many active consumers as partitions. The partition count is your parallelism ceiling.
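Key-based partition selection can be sketched as follows. Kafka's default partitioner hashes the key with murmur2; crc32 here is only a stand-in for the idea:

```python
import zlib

NUM_PARTITIONS = 6   # fixed at topic creation: the parallelism ceiling

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Same key -> same hash -> same partition, which is what preserves
    # per-key ordering. (Kafka uses murmur2; crc32 is a stand-in.)
    return zlib.crc32(key) % num_partitions

# All of user-17's events land in one partition, so they stay ordered.
events = [(b"user-17", "login"), (b"user-42", "click"), (b"user-17", "logout")]
placed = [(partition_for(key), value) for key, value in events]
print(placed)
```

This is also why hot keys cause hot partitions: every record for a popular key hashes to the same place.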

Consumer Groups for Scalability

Consumers in a group divide partitions among themselves. Adding consumers automatically rebalances the load. This enables horizontal scaling without changing producer code or topic configuration.
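The rebalancing behavior can be sketched with a simplified round-robin assignor. Kafka ships several strategies (range, round-robin, sticky); this shows only the core idea:

```python
def assign_round_robin(partitions: list[str], consumers: list[str]) -> dict:
    """Deal partitions out to group members one at a time --
    a simplified take on Kafka's RoundRobinAssignor."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = [f"orders-{i}" for i in range(6)]
print(assign_round_robin(partitions, ["c1", "c2"]))
# Adding a consumer triggers a rebalance: same partitions, new split.
print(assign_round_robin(partitions, ["c1", "c2", "c3"]))
```

Note that a seventh consumer added to this six-partition topic would sit idle, which is the parallelism ceiling in action.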

Offset-Based Consumption

Each record in a partition has a unique offset (its position in the log). Consumers track their own offsets, enabling replay from any point, multiple consumer groups reading independently, and, combined with idempotent producers and transactions, exactly-once semantics.
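Offset tracking is just a per-group integer cursor into the partition; a minimal in-memory sketch (group names are invented):

```python
partition = ["e0", "e1", "e2", "e3"]          # records; index == offset

# Each consumer group tracks only its own committed offset.
committed = {"analytics": 4, "billing": 2}

def poll(group: str, max_records: int = 10) -> list:
    """Return unread records for a group and advance its offset."""
    start = committed[group]
    batch = partition[start:start + max_records]
    committed[group] = start + len(batch)
    return batch

print(poll("billing"))      # ['e2', 'e3'] -- picks up where it left off
committed["analytics"] = 0  # rewinding the offset replays everything
print(poll("analytics"))    # ['e0', 'e1', 'e2', 'e3']
```

Because the broker stores records regardless of who has read them, replay costs nothing beyond moving the cursor back.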

Replication for Durability

Each partition is replicated across multiple brokers. One replica is the leader (handles all reads/writes), others are followers. If the leader fails, a follower is promoted. ISR (In-Sync Replicas) tracks which replicas are caught up.
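The ISR bookkeeping can be sketched as below. One caveat: real Kafka judges ISR membership by time lag (`replica.lag.time.max.ms`); offsets are used here only to keep the sketch small, and the broker names are invented:

```python
# Toy view of replication state for one partition.
leader_end = 4                                  # leader's log end offset
follower_end = {"broker-2": 4, "broker-3": 2}   # followers' log end offsets
max_lag = 1

# Followers that have fallen too far behind drop out of the ISR.
isr = {b for b, end in follower_end.items() if leader_end - end <= max_lag}

# High watermark: highest offset replicated to every ISR member --
# consumers may only read up to here.
high_watermark = min([leader_end] + [follower_end[b] for b in isr])
print(sorted(isr), high_watermark)
```

Here broker-3 has fallen out of the ISR, so the high watermark advances without it; only ISR members are eligible for leader election, which is what keeps failover from losing acknowledged writes.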

Zero-Copy Data Transfer

Kafka uses the sendfile() syscall to transfer data directly from disk to the network socket, bypassing user-space buffers. Combined with batching and compression, this lets a single broker sustain throughput on the order of 2 GB/s.
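The zero-copy path is visible from Python too: `os.sendfile()` moves bytes from a file descriptor to a socket inside the kernel, with no user-space copy (Kafka does the equivalent through Java NIO's `FileChannel.transferTo`). A minimal demo over a local socket pair, with invented file contents:

```python
import os
import socket

def send_segment(path: str, sock: socket.socket) -> int:
    """Stream a file to a socket via sendfile(), never copying the
    bytes into this process's memory."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves disk pages straight to the socket buffer.
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent

with open("demo-segment.log", "wb") as f:
    f.write(b"record-0\nrecord-1\n")

a, b = socket.socketpair()               # stand-in for a consumer connection
n = send_segment("demo-segment.log", a)
a.close()
received = b.recv(64)
b.close()
os.remove("demo-segment.log")
print(n, received)
```

The payoff compounds with fan-out: many consumer groups reading the same partition are all served from the page cache via the same copy-free path.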

Deep Dive

Kafka was created at LinkedIn in 2010 to solve a specific problem: how do you collect and process activity data (page views, searches, ad clicks) from hundreds of services in real-time?

Traditional message queues like RabbitMQ work well for task distribution but struggle with:

  - High throughput: Millions of events per second
  - Durability: Messages must survive broker failures
  - Replay: Ability to reprocess historical data
  - Multiple consumers: Many services reading the same data

Kafka solves this by treating messaging as a distributed commit log rather than a message queue. The key differences:

| Traditional Queue | Kafka |
|-------------------|-------|
| Messages deleted after consumption | Messages retained by time/size |
| Single consumer per message | Multiple consumer groups |
| Random access | Sequential, offset-based |
| Memory-focused | Disk-focused (sequential I/O) |

Kafka vs Traditional Message Queue

Common use cases:

  1. Event streaming: Real-time user activity tracking
  2. Log aggregation: Collecting logs from distributed services
  3. Stream processing: ETL pipelines with Kafka Streams or Flink
  4. Event sourcing: Storing state changes as immutable events
  5. Commit log: Database replication and CDC (Change Data Capture)
  6. Metrics pipeline: Collecting and routing metrics to monitoring systems
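Use case 4 can be sketched directly: current state is never stored, only derived by folding over the immutable event log (the account example is invented):

```python
# Event sourcing in miniature: state = fold(events), never stored directly.
events = [
    {"type": "deposit",  "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit",  "amount": 5},
]

def balance_at(events: list) -> int:
    """Rebuild the account balance by replaying events in order."""
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "deposit" else -e["amount"]
    return balance

print(balance_at(events))       # 75 -- current state
print(balance_at(events[:2]))   # 70 -- any historical state is recoverable
```

Kafka's retained, replayable log is what makes this pattern practical: a new service can bootstrap its state by consuming the topic from offset 0.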

Trade-offs

| Aspect | Advantage | Disadvantage |
|--------|-----------|--------------|
| Append-only log design | Sequential I/O enables disk throughput rivaling memory; simple, debuggable data model | No in-place updates; deletions require compaction or retention expiry; large storage footprint for mutable data |
| Partition-based parallelism | Linear horizontal scaling; ordering guaranteed within partition; consumer parallelism matches partition count | Partition count is hard to change; uneven key distribution causes hot partitions; cross-partition ordering requires external coordination |
| Pull-based consumption | Consumers control their pace; easy replay and reprocessing; no need for complex flow control | Consumers must poll continuously; latency higher than push-based systems for light workloads |
| Consumer group coordination | Automatic load balancing; adding consumers scales consumption; offset tracking built-in | Rebalancing pauses consumption; max parallelism limited by partition count; complex failure semantics |
| At-least-once default delivery | Simple, high throughput; works for most use cases; easy to reason about | Requires idempotent consumers for correctness; exactly-once adds complexity and latency |
| Replicated for durability | Survives broker failures; configurable durability vs latency tradeoff (acks) | Replication factor multiplies storage; ISR shrinkage can block writes; leader election causes brief unavailability |
| Schema-agnostic | Flexible: accepts any bytes; producers/consumers manage their own serialization | No built-in schema enforcement; requires external schema registry for compatibility; debugging raw bytes is hard |
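The durability-versus-latency knob from the trade-offs above is turned in producer (and topic) configuration; a typical durability-first combination looks like this (values are illustrative, not a recommendation for every workload):

```properties
# Producer: wait for every in-sync replica to acknowledge each write.
acks=all
# Safe retries: idempotence prevents duplicates when a send is retried.
enable.idempotence=true
retries=2147483647

# Topic/broker side: refuse writes if the ISR shrinks below 2 replicas.
min.insync.replicas=2
```

Latency-sensitive pipelines instead run `acks=1` (leader only) or `acks=0` (fire and forget), accepting possible loss on leader failure in exchange for lower produce latency.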
