Tags: distributed-systems, messaging, kafka, streaming, pub-sub, linkedin, data-pipeline, intermediate

Kafka: A Distributed Messaging System for Log Processing

The pub-sub system that became the backbone of modern data pipelines, processing trillions of messages daily

Jay Kreps, Neha Narkhede, Jun Rao | LinkedIn | 2011 | 30 min read
View Original Paper

Summary

Kafka is a distributed publish-subscribe messaging system designed for high-throughput log processing. Unlike traditional message brokers that track per-message state, Kafka uses an append-only log abstraction where consumers track their own position. This simple design, combined with sequential disk I/O, OS page cache utilization, and zero-copy transfers, enables Kafka to achieve throughput orders of magnitude higher than traditional systems while providing durability guarantees through replication.

Key Takeaways

Log as the Core Abstraction

Kafka models each topic partition as an append-only, ordered, immutable sequence of messages. This log abstraction simplifies the broker (just append and serve), enables efficient sequential I/O, and allows consumers to rewind and replay messages—impossible in traditional queues.
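To make the abstraction concrete, here is a minimal in-memory sketch in Java (illustrative only; Kafka's real log is a set of on-disk segment files, and the class and method names here are invented for the example). The broker's job reduces to appending records and serving reads from whatever offset the caller supplies:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the partition-as-log idea: append-only, ordered, immutable.
public class PartitionLog {
    private final List<byte[]> records = new ArrayList<>();

    // Append returns the offset assigned to the record.
    public synchronized long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    // Consumers read forward from any offset they choose, so "replay" is just
    // re-reading from an earlier offset.
    public synchronized List<byte[]> read(long fromOffset, int maxRecords) {
        int start = (int) Math.min(fromOffset, records.size());
        int end = Math.min(start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }
}
```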

Consumer-Managed Offsets

Unlike traditional brokers that track delivery state per message, Kafka consumers track their own position (offset) in the log. This eliminates per-message bookkeeping on the broker, enabling massive throughput and allowing consumers to re-read messages at will.
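Here is what that looks like with the modern Kafka Java client (which postdates the 2011 paper but keeps the same model); the broker address, topic name, and group id are placeholders. Auto-commit is disabled so the consumer advances its own stored position only after it has processed a batch:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: the consumer, not the broker, decides when its position advances.
public class OffsetManagedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "log-processors");            // hypothetical group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");           // we commit offsets ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));       // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                         // application logic
                }
                // Only after processing do we advance our position; the broker
                // keeps no per-message delivery state.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```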

Sequential I/O Over Random Access

Kafka deliberately uses sequential disk writes and reads. Linear disk I/O is fast (on the order of 600 MB/s on the commodity drives of the era) and can in some cases outperform random memory access. Combined with the OS page cache, Kafka achieves near in-memory speeds with disk durability.
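A minimal sketch of the write-path idea, assuming a hypothetical segment file name: every record is appended to the end of the current segment, so the drive (and the page cache in front of it) sees a purely sequential stream rather than scattered seeks.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: append-only writes to one segment file produce a sequential access pattern.
public class SequentialSegmentWriter {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("partition-0.log");           // hypothetical segment file
        try (FileChannel channel = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < 10_000; i++) {
                ByteBuffer record = ByteBuffer.wrap(("record-" + i + "\n").getBytes());
                channel.write(record);                        // always append, never seek
            }
            channel.force(false);                             // flush once, not per record
        }
    }
}
```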

Zero-Copy Data Transfer

Kafka uses sendfile() to transfer data directly from page cache to network socket, bypassing user-space entirely. This eliminates two memory copies and two context switches per message batch, dramatically improving throughput.
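In the JVM this surfaces as FileChannel.transferTo(), which delegates to sendfile() on Linux. The sketch below streams a log segment to a socket without the bytes ever entering user space; the file name and destination address are placeholders:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: zero-copy transfer from page cache to socket via transferTo()/sendfile().
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("partition-0.log");                    // hypothetical segment file
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("consumer-host", 9999))) { // hypothetical destination
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // The kernel copies directly from the page cache to the socket buffer;
                // no user-space copy, no extra context switches.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```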

Partitioning for Parallelism

Topics are split into partitions distributed across brokers. Each partition is an independent log with its own leader. This enables linear horizontal scaling—add partitions and brokers to increase throughput.
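A simplified sketch of key-based partition selection (Kafka's actual default partitioner uses murmur2 hashing and handles keyless records differently, so this is illustrative only):

```java
// Sketch: same key always maps to the same partition, preserving per-key ordering
// while spreading load across partitions and brokers.
public class SimplePartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6;
        for (String userId : new String[] {"alice", "bob", "carol"}) {
            System.out.printf("%s -> partition %d%n", userId, partitionFor(userId, partitions));
        }
    }
}
```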

Consumer Groups for Scalability

Consumers in a group divide partitions among themselves—each partition is consumed by exactly one consumer in the group. This provides both pub-sub (multiple groups) and queue (single group) semantics in one system.
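A toy sketch of dividing a topic's partitions among the members of one group (real Kafka assignors also handle rebalancing, stickiness, and multiple topics); the key property is that each partition has exactly one owner within the group:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: round-robin assignment of partitions to the consumers in one group.
public class GroupAssignment {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly one consumer in the group.
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Within one group, partitions are split for queue-like load sharing;
        // a second group reading the same topic gets its own full copy (pub-sub).
        System.out.println(assign(List.of("consumer-a", "consumer-b", "consumer-c"), 8));
    }
}
```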

Deep Dive

By 2010, LinkedIn faced a data infrastructure crisis. The company generated massive amounts of log data:

  • Activity events: Page views, searches, ad impressions (billions/day)
  • Operational metrics: Server CPU, memory, network stats
  • Application logs: Errors, warnings, debugging information
  • Data pipeline feeds: Database changes, derived data streams

This data needed to flow to multiple destinations:

  • Hadoop for offline analytics and machine learning
  • Real-time systems for monitoring and alerting
  • Search indexes for log exploration
  • Data warehouses for business intelligence

The existing solutions failed:

LinkedIn's Data Integration Problem

Traditional message queues (ActiveMQ, RabbitMQ) couldn't handle the throughput. They were designed for transactional messaging with per-message acknowledgment tracking—great for order processing, terrible for log streams.

Custom log aggregators (Scribe, Flume) provided throughput but no durability, replay, or real-time consumption. Data went one-way into Hadoop.

LinkedIn needed a system that combined:

  • The high throughput of log aggregators
  • The durability and replay of databases
  • The pub-sub flexibility of message queues
  • The horizontal scalability of distributed systems

Trade-offs

  • Log-Based vs Queue-Based
    Advantage: Append-only log enables replay, multiple consumers, and efficient sequential I/O.
    Disadvantage: No per-message acknowledgment; consumers must manage their own offsets.

  • Consumer-Side Offset Tracking
    Advantage: Eliminates broker-side bookkeeping; enables consumer replay and rewind.
    Disadvantage: Consumers are responsible for offset management; exactly-once requires careful implementation.

  • Partitioning for Parallelism
    Advantage: Linear horizontal scaling; independent I/O per partition.
    Disadvantage: No global ordering across partitions; partition count is hard to change.

  • High Throughput Design
    Advantage: Orders of magnitude faster than traditional brokers; handles millions of messages per second.
    Disadvantage: Not optimized for low-latency single-message delivery; batching adds latency.

  • Retention-Based Deletion
    Advantage: Messages persist for a configured time; multiple consumers read the same data; natural audit log.
    Disadvantage: Storage costs can grow large; retention must be configured carefully.
