System Design Masterclass
Tags: streaming, kafka, distributed-systems, event-driven, message-queue, advanced

Design Distributed Stream Processing System

Design a system like Apache Kafka for real-time data streaming

Millions of messages/sec | Similar systems at LinkedIn, Uber, Netflix, Confluent, Spotify | 45 min read

Summary

A stream processing system is like a super-fast delivery service for data. Imagine millions of small packages (messages) being sent every second from many senders to many receivers. The system must deliver every package in order, never lose any, and let many receivers get the same package. The hard parts are: making sure messages arrive in the right order, not losing messages even when computers crash, and letting thousands of senders and receivers work at the same time. Companies like LinkedIn, Uber, Netflix, and Spotify ask this question in interviews.

Key Takeaways

Core Problem

Think of it like a giant notebook where you can only write at the end (never erase or change old pages). Everyone can read this notebook at their own speed. This simple idea - an append-only log - makes everything else work.

The Hard Part

We want messages about the same thing (like all orders from user Bob) to stay in order. But we also want many computers working at the same time for speed. The trick is splitting data into groups called partitions - order is kept within each group, but different groups work in parallel.

Scaling Axis

We grow by adding more partitions. Each partition is like a separate notebook on a different computer. More partitions means more notebooks being written and read at the same time.

Critical Invariant

Two promises we must never break: (1) Messages in the same partition always stay in order - message 5 always comes before message 6. (2) Once we say a message is saved, it must survive even if a computer crashes.

Performance Requirement

Handle millions of messages every second. A message should go from sender to receiver in less than 100 milliseconds (one-tenth of a second). Even though we save everything to disk, we need to be super fast.

Key Tradeoff

We give up perfect global ordering (all messages everywhere in one order) to get speed. Instead, we promise ordering only within partitions. This lets us use many computers in parallel.

Design Walkthrough

Problem Statement

The Question: Design a system that can move millions of messages per second between different apps, save all messages safely, and let apps process data in real-time.

What the system needs to do (most important first):

  1. Accept messages from senders (producers) - Apps send data like orders placed, buttons clicked, or logs created. The system must accept these super fast.
  2. Store messages safely - Never lose a message once we say it is saved. Even if a computer crashes, the data must survive.
  3. Deliver messages to receivers (consumers) - Apps that want the data can read it. Many different apps can read the same messages.
  4. Keep messages in order - If user Bob places order 1, then order 2, receivers must see them in that order (1 before 2).
  5. Allow replay - If an app crashes and restarts, it can go back and re-read old messages it missed.
  6. Scale to handle huge load - Support millions of messages per second from thousands of senders and receivers.

What to say first

Let me understand the requirements before I start designing. I want to know: How many messages per second? How big are the messages? Can we ever lose messages? Do all messages need to be in one global order, or just messages about the same thing?

What the interviewer really wants to see:
- Do you understand why we use a log (like a notebook you can only write at the end of)?
- Can you explain the tradeoff between ordering and speed?
- Do you know what exactly-once means and why it is hard?
- Can you design for both high speed AND safety (durability)?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How many messages per second?

What is the expected load? How many apps are sending messages and how many are receiving?

Why ask this: The number decides how many computers we need and how we split the data.

What interviewers usually say: 1 million messages per second, 10,000 senders, 1,000 receiver groups.

How this changes your design: We need many partitions spread across many broker computers. One computer cannot handle 1 million messages alone.

Question 2: How big are the messages?

What is the typical size? What is the biggest message we might see?

Why ask this: Big messages need different handling than small ones.

What interviewers usually say: Average message is 1 KB (about one page of text). Maximum is 1 MB.

How this changes your design: We can bundle many small messages together for speed. For big messages, we might need to split them into chunks.

Question 3: Can we lose messages?

What happens if a message is lost? Is it okay to lose some, or must every message be saved?

Why ask this: Perfect safety is slower. If we can lose some messages, we can be faster.

What interviewers usually say: Once we say a message is saved, we cannot lose it. We are okay with a small delay to be safe.

How this changes your design: We need to copy each message to multiple computers before saying done. If one computer dies, the others still have the data.

Question 4: Do messages need to be in order?

Must ALL messages be in one perfect order, or just messages about the same thing (like the same user or same order ID)?

Why ask this: Perfect global order means one computer does all the work - very slow. Order per user/entity is much faster.

What interviewers usually say: Messages about the same key (like user ID) must be in order. We do not need all messages everywhere to be in one order.

How this changes your design: We send all messages with the same key to the same partition. Order is kept within partitions, but different partitions can work in parallel.

Question 5: How long do we keep messages?

Should messages be deleted after some time? Do receivers need to go back and replay old messages?

Why ask this: Keeping messages forever needs huge storage. Most systems delete old messages after some days.

What interviewers usually say: Keep messages for 7 days. Apps should be able to replay from any point in those 7 days.

How this changes your design: We need big disks. We delete messages older than 7 days. We track where each receiver left off so they can continue from there.

Summarize your assumptions

Let me summarize: 1 million messages per second, 1 KB average size, copy to 3 computers for safety, order within partition keys, keep for 7 days with replay. I will design for this.

The Hard Part

Say this to the interviewer

The hardest part is this: We want messages in order AND we want to be fast. These two goals fight each other. If one computer handles everything, we get perfect order but very slow speed. The trick is to split into partitions - each partition keeps its own order, but many partitions work at the same time.

Why this is really hard (explained simply):

  1. Order vs Speed - If you want ALL messages in one perfect order, only one computer can write them. That computer becomes the bottleneck. You cannot use 100 computers if only 1 can write.
  2. Exactly-once delivery - Making sure a message is delivered exactly one time (not zero, not two) is nearly impossible over an unreliable network. Networks fail, computers crash, things get retried. The solution is giving each message a unique ID and checking for repeats.
  3. Multiple receivers working together - If 3 receivers are reading from 6 partitions, who reads which partition? When a receiver crashes, another must take over its work. This coordination is tricky.
  4. Speed vs Safety - Writing to disk is slow. Waiting for other computers to confirm they saved a copy is slow. But we need both for safety. The trick is writing to disk in a special fast way (sequential writes) and letting senders choose how much safety they need.

Common mistake candidates make

Many people suggest a simple message queue where messages disappear after reading. That is wrong for this problem. We need a LOG - like a notebook where old pages are never torn out. This lets multiple receivers read at their own speed, lets receivers go back and re-read, and gives us built-in ordering.

The big idea: The Log

Think of a log as a notebook where:
- You can only write on new pages at the end (never change old pages)
- Each page has a number (0, 1, 2, 3, ...)
- Many people can read the notebook at the same time
- Each reader remembers which page they are on
- Readers can go back to re-read old pages

This simple idea gives us:
- Order: Page 5 always comes before page 6
- Replay: Go back to page 100 and read forward
- Multiple readers: Everyone reads at their own pace
- Safety: Old pages never change, just add new ones
- No duplicates: Each page number is unique

The Log - Like a notebook you can only add to
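To make the notebook idea concrete, here is a minimal sketch of an append-only log in Python. This is a toy (a real broker stores the log in files on disk, not in memory), and all names are illustrative:

class AppendOnlyLog:
    """A toy log: write only at the end, read from any offset."""

    def __init__(self):
        self._messages = []  # the offset of a message is its index here

    def append(self, value):
        """Add a message at the end and return its offset (0, 1, 2, ...)."""
        self._messages.append(value)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Read forward from any offset - old entries never change."""
        return self._messages[offset:offset + max_messages]

log = AppendOnlyLog()
log.append("order placed: Bob, order 1")
log.append("order placed: Bob, order 2")

# Two readers at different positions - each keeps its own bookmark.
print(log.read(0))  # a replaying reader starts from the beginning
print(log.read(1))  # another reader continues from where it left off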

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools and number of computers.

What we are measuring | Number | What this means for our design
Messages per second | 1,000,000 | Very high - we need many computers working together
Average message size | 1 KB | 1 million messages x 1 KB = 1 GB of data every second
+ 5 more rows...

What to tell the interviewer

At 1 million messages per second with 1 KB each, we are writing 1 GB every second. With 7-day storage and 3 copies, we need about 2 petabytes of disk space. This is a write-heavy system - we write once and read many times. Writes are always at the end of the log (sequential), which is super fast even on disk.
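The arithmetic behind those numbers, written out as a quick back-of-envelope check:

messages_per_sec = 1_000_000
msg_size_bytes = 1_024              # 1 KB average
replicas = 3                        # copies for safety
retention_secs = 7 * 24 * 3600      # keep for 7 days

write_rate = messages_per_sec * msg_size_bytes          # bytes per second
total_storage = write_rate * retention_secs * replicas  # bytes on disk

print(f"write rate: {write_rate / 1e9:.1f} GB/s")   # ~1.0 GB/s
print(f"storage:    {total_storage / 1e15:.1f} PB") # ~1.9 PB, call it 2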

How people use the system (from most common to least common):

  1. Send messages (produce) - Apps constantly send new data. This must be super fast. Messages always go to the end of the log.
  2. Read messages (consume) - Apps read messages, usually from where they left off. They read forward through the log.
  3. Replay old messages - Sometimes apps need to go back and re-process old data. They jump to an old position and read forward.
  4. Search for specific messages - Rare. The system is not designed for searching - it is designed for ordered reading.

How fast can one computer write?
- A good SSD can write 500 MB per second easily
- We need to write 1 GB per second total
+ 14 more lines...

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

The system has three main parts: (1) Brokers - computers that store the log data, (2) A coordination service - keeps track of which broker has which data, (3) Clients - senders and receivers that connect to brokers. Data is split into partitions, and each partition lives on one broker.

Stream Processing System - The Big Picture

What each part does and WHY:

Part | What it does | Why we need it
Broker | Stores messages on disk. Accepts writes from senders. Sends messages to receivers. | This is where the actual data lives. Each broker is a computer with lots of disk space.
Topic | A name for a group of related messages (like orders or user-clicks) | Lets us organize different types of data. Senders write to topics, receivers read from topics.
+ 5 more rows...

How real companies do it

Kafka (LinkedIn) originally used ZooKeeper for coordination. Newer versions use their own built-in system called KRaft. Pulsar (Yahoo) separates storage into BookKeeper. The core ideas are the same - brokers store data, something coordinates them, clients connect to brokers.

Important: Why partitions matter so much

Partitions are the secret to speed. Without partitions, one computer does all the work. With 100 partitions across 10 brokers, 10 computers share the work. With 1000 partitions, even more. The rule: messages with the same key (like user ID) always go to the same partition, so they stay in order.

Data Model and Storage

Now let me explain how we actually store the messages on disk. The key idea is that writing to the end of a file is FAST - even faster than some in-memory databases!

What to tell the interviewer

Each partition is stored as a series of files called segments. Each segment is like one chapter of a book. We write to the current chapter until it gets too big, then start a new chapter. Old chapters can be deleted when they expire. We also keep an index - like a book's table of contents - to quickly find specific messages.

How a partition is stored on disk

What is in a message?

Each message has these parts:

Part | What it is | Example
Offset | The message number in this partition (like a page number) | 42 means this is the 43rd message (we start at 0)
Timestamp | When the message was created | 2024-03-15 14:30:00
Key | Used to decide which partition (optional) | user_123 or order_456
Value | The actual data | The order details, click info, log text, etc.
Headers | Extra labels (optional) | source=mobile, version=2

How we find a message quickly

Imagine you want to find message number 1500. We do not want to read through all messages from the start!

The index helps us jump close to where we want:

The Index (saved every 100 messages):
+------------------+------------------+
| Message Number   | Position in File |
+ 23 more lines...
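A sketch of that lookup, assuming a sparse index with one entry per 100 messages (the entries and byte positions here are made up for illustration):

import bisect

# Sparse index: (message offset, byte position in the segment file).
index = [(0, 0), (100, 104_900), (200, 211_340), (1500, 1_572_800)]

def find_start_position(target_offset):
    """Binary-search for the closest index entry at or before the target.
    The reader then jumps to that byte position and scans forward."""
    offsets = [offset for offset, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i]

offset, position = find_start_position(1500)
print(f"jump to byte {position}, then scan forward from message {offset}")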

Key insight: We do NOT use a database

We do not use MySQL or PostgreSQL. The log file IS our database. Why? Writing to the end of a file is incredibly fast - as fast as writing to memory. The operating system automatically keeps recently-read data in memory (page cache), so repeated reads are fast too. Simple files beat complex databases for this use case.

Batching: The speed trick

Instead of sending one message at a time, we bundle many messages together:

Without batching:
- Send message 1... network round trip... done
- Send message 2... network round trip... done  
+ 12 more lines...

Replication: Keeping Data Safe

What happens if a computer crashes and its disk dies? We lose all the data on it! To prevent this, we copy each message to multiple computers.

What to tell the interviewer

Each partition has one leader and multiple followers. The leader does all the work (accepts writes, serves reads). Followers just copy everything the leader has. If the leader dies, one of the followers becomes the new leader. This way we never lose data.

How copies (replicas) work

In-Sync Replicas (ISR) - The followers that are caught up

Not all followers are always up to date. Network can be slow, computers can lag. We keep a list of followers that ARE caught up - called the In-Sync Replica set (ISR).

Setup: Partition 0 has 3 copies (1 leader + 2 followers)
Rule: We need at least 2 copies before saying "message saved"
+ 14 more lines...

Safety Setting | What it means | Speed | When to use
acks=0 | Do not wait for any confirmation | Super fast | Logs, metrics where losing some is OK
acks=1 | Wait for leader only | Fast | Most normal use cases
acks=all | Wait for all in-sync copies | Slower | Bank transactions, critical data
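With a real client this is a single setting. For example, with the kafka-python library (the broker address and topic name are placeholders):

from kafka import KafkaProducer

# acks="all": the send is confirmed only after every in-sync replica
# has the message. Switch to acks=1 or acks=0 to trade safety for speed.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder address
    acks="all",
)
future = producer.send("payments", key=b"user_123", value=b"order details")
metadata = future.get(timeout=10)  # raises if the write was not confirmed
print(metadata.partition, metadata.offset)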

What happens when the leader crashes?

Before crash:
- Partition 0: Leader = Broker 1, Followers = Broker 2, Broker 3
- All 3 have messages 0 through 999
+ 17 more lines...

The promise we must keep

Once we tell a sender their message is saved (with acks=all), that message MUST survive even if a computer crashes. This is our durability promise. Breaking this promise means losing bank transactions, orders, or critical data.

Senders (Producers) Deep Dive

Let me explain how the sender app works. The sender has a few important jobs: pick which partition to send to, bundle messages into batches, and handle failures.

Inside a sender app

How do we pick which partition?

The key you provide decides the partition. All messages with the same key go to the same partition (so they stay in order).

If you provide a key (like user ID):
    partition = hash(key) % number_of_partitions
    
+ 16 more lines...
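A runnable version of that rule. One detail matters in practice: the hash must be stable across machines and restarts, so real clients use a fixed algorithm (Kafka uses murmur2); crc32 stands in for it here:

import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Same key -> same partition, so per-key order is preserved.
    Python's built-in hash() would NOT work - it changes between
    processes, which would scatter one user's messages."""
    return zlib.crc32(key) % num_partitions

# All of Bob's orders land on the same partition and stay in order.
print(choose_partition(b"user_bob", 6))
print(choose_partition(b"user_bob", 6))  # always the same answer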

How batching works

We do not send messages one by one. We collect them into batches first.

Settings:
- batch_size = 16 KB (send when batch reaches this size)
- linger_time = 5 ms (or send after this much time, whichever comes first)
+ 26 more lines...
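A simplified sketch of that logic, using the two settings above. A real client runs this on a background thread and also flushes on a timer; here the time check happens only when a message is added:

import time

class Batcher:
    """Collect messages; send when the batch is big enough OR old enough."""

    def __init__(self, batch_size=16_384, linger_ms=5):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000
        self.buffer, self.nbytes, self.started = [], 0, None

    def add(self, message: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.buffer.append(message)
        self.nbytes += len(message)
        full = self.nbytes >= self.batch_size
        old = time.monotonic() - self.started >= self.linger_s
        if full or old:
            self.flush()

    def flush(self):
        if self.buffer:
            print(f"sending batch: {len(self.buffer)} messages, {self.nbytes} bytes")
            self.buffer, self.nbytes, self.started = [], 0, None

b = Batcher()
for i in range(5_000):
    b.add(f"message {i}".encode())  # flushes roughly every 16 KB
b.flush()  # send whatever is left over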

Key configuration

Two settings control the speed vs delay tradeoff: batch_size (how big before sending) and linger_time (how long to wait for more messages). For high throughput, use bigger batches and longer linger. For low delay, use smaller batches and shorter linger.

Receivers (Consumers) Deep Dive

Now let me explain how receiver apps work. Receivers read messages from partitions and remember where they left off (their offset). Multiple receivers can work together in a group to share the work.

How receivers share work in a group

How receiver groups work

A group is a team of receivers that share the work. Each partition is assigned to exactly one receiver in the group. This way, no message is processed twice.

Scenario: Topic has 6 partitions, Group has 3 receivers

Initial assignment:
+ 24 more lines...
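The simplest assignment strategy is round-robin, sketched below for that scenario (real systems also offer range and sticky strategies):

def assign_partitions(partitions, receivers):
    """Round-robin: each partition goes to exactly one receiver in the group."""
    assignment = {r: [] for r in receivers}
    for i, p in enumerate(partitions):
        assignment[receivers[i % len(receivers)]].append(p)
    return assignment

print(assign_partitions(range(6), ["receiver-A", "receiver-B", "receiver-C"]))
# {'receiver-A': [0, 3], 'receiver-B': [1, 4], 'receiver-C': [2, 5]}

# If receiver-C crashes, the group rebalances across the survivors:
print(assign_partitions(range(6), ["receiver-A", "receiver-B"]))
# {'receiver-A': [0, 2, 4], 'receiver-B': [1, 3, 5]}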

Tracking progress with offsets

Each receiver remembers which message it read last. This is called the offset - like a bookmark in a book.

Partition 0 has messages: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Receiver starts reading:
+ 23 more lines...

Important detail

Offsets are saved in a special topic called __consumer_offsets. This topic is replicated and safe just like any other topic. Each entry is: (group ID, topic, partition) -> (offset). When a receiver restarts, it reads its last offset from this topic.
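With the kafka-python client, the bookmark is handled like this (topic and group names are placeholders). Committing only after processing gives at-least-once delivery:

from kafka import KafkaConsumer

def process(record):
    print(record.topic, record.partition, record.offset, record.value)

consumer = KafkaConsumer(
    "orders",                            # placeholder topic
    group_id="billing-service",          # placeholder group ID
    bootstrap_servers="localhost:9092",  # placeholder address
    enable_auto_commit=False,            # commit only after processing succeeds
)
for record in consumer:
    process(record)    # do the work first...
    consumer.commit()  # ...then save the bookmark to __consumer_offsets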

Exactly-Once Delivery

The holy grail of messaging is exactly-once: each message is processed exactly one time. Not zero (lost), not two (duplicate). This is really hard to achieve.

Why exactly-once is hard

In the real world, networks fail, computers crash, messages get retried. How do you know if a message was already processed? If you retry after a timeout, maybe it worked the first time. If you do not retry, maybe it was lost. Exactly-once requires careful tracking on both the sender and receiver sides.

Problem 1: Sender retries create duplicates

Imagine: Sender sends message. Broker saves it. Network fails before confirmation reaches sender. Sender retries. Now the message is saved twice!

Each sender gets:
- Producer ID (PID): A unique number like 12345
- Sequence number: Counts up for each message (0, 1, 2, 3, ...)
+ 18 more lines...
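The broker-side duplicate check is small enough to sketch directly (illustrative names, in-memory state):

# Broker side: remember the highest sequence number seen per producer.
last_seq = {}  # producer_id -> last sequence number written

def try_append(log, producer_id, seq, message):
    """Accept each (producer_id, seq) pair at most once, so retries are safe."""
    if seq <= last_seq.get(producer_id, -1):
        return "duplicate - already saved, just re-acknowledge"
    log.append(message)
    last_seq[producer_id] = seq
    return "saved"

log = []
print(try_append(log, 12345, 0, "order placed"))  # saved
print(try_append(log, 12345, 0, "order placed"))  # a retry - detected
print(len(log))                                   # 1: stored exactly once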

Problem 2: Writing to multiple partitions atomically

Sometimes you need to write to several partitions as one unit. Either all succeed or all fail. This is called a transaction.

Transaction: All or nothing

Example: When an order comes in, we need to:
1. Write the order to the "orders" topic
2. Update inventory in the "inventory" topic
+ 21 more lines...
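With the confluent-kafka Python client, that flow might look like this (the transactional.id, topics, and keys are placeholders):

from confluent_kafka import Producer

p = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "transactional.id": "order-service-1",  # stable ID for this producer
})
p.init_transactions()

p.begin_transaction()
try:
    p.produce("orders", key="order_456", value="order placed")
    p.produce("inventory", key="sku_789", value="decrement stock")
    p.commit_transaction()  # both writes become visible together
except Exception:
    p.abort_transaction()   # neither write is seen by read_committed readers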

Problem 3: Receivers might see uncommitted messages

By default, receivers see all messages including pending ones. To only see committed messages, use the read_committed setting.

read_uncommitted (default):
- Receiver sees ALL messages immediately
- Including messages from transactions that might be rolled back
+ 20 more lines...

The practical truth about exactly-once

True exactly-once requires: (1) Idempotent senders with sequence numbers, (2) Transactions for multi-partition writes, (3) Read-committed receivers, (4) Idempotent processing in receiver apps. Most systems use at-least-once with idempotent receivers - simpler and almost as good.

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

What breaks | What happens | How we fix it | How long to recover
One broker crashes | Its partitions become unavailable briefly | Followers become leaders. Data is safe because we have copies. | 10-30 seconds to elect new leader
Broker disk dies | Data on that disk is lost forever | We have copies on other brokers. Re-copy data to a new broker. | Minutes to hours depending on data size
+ 4 more rows...

Example: What happens when a broker crashes

Before the crash:
- Partition 0: Leader = Broker 1, Followers = Broker 2, Broker 3
- All three have messages 0 through 999 (fully synced)
+ 19 more lines...

How receivers handle failures safely

The safe way to process messages:

while True:
+ 30 more lines...

Key insight

The stream system guarantees exactly-once between sender and the log. But your receiver app must handle the possibility of seeing a message twice. Use the message offset or a unique ID to detect and skip duplicates. This is called idempotent processing.
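A minimal sketch of idempotent processing, using (partition, offset) as the duplicate check. In production the "already processed" record lives in your database, updated in the same transaction as the side effect:

processed = set()  # in production: a table next to your business data

def apply_side_effect(value):
    print("processing:", value)  # e.g. charge the card, update the row

def handle(partition, offset, value):
    """Safe to call twice with the same message - the second call is a no-op."""
    msg_id = (partition, offset)  # unique per message within a topic
    if msg_id in processed:
        return  # duplicate from a retry or rebalance - skip it
    apply_side_effect(value)
    processed.add(msg_id)

handle(0, 42, "charge Bob $10")
handle(0, 42, "charge Bob $10")  # delivered again after a crash - ignored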

Growing the System Over Time

What to tell the interviewer

This design handles millions of messages per second. Let me explain how we grow it when we need more capacity, and what extra features we might add.

How we grow step by step:

When we need more speed (throughput):
- Add more partitions to the topic - more partitions = more parallel work
- Add more receivers to the group (up to the number of partitions)
- Note: We can add partitions but not remove them

When brokers are overloaded:
- Add more broker computers to the cluster
- Move some partitions to the new brokers (rebalancing)
- This can be done while the system is running

When we need more storage:
- Add bigger disks to brokers
- Or add more brokers
- Or use tiered storage (explained below)

Tiered Storage - Hot and cold data

Why tiered storage helps:
- Fast SSD is expensive, cheap cloud storage is... cheap
- Most receivers read recent messages (hot data)
- Old messages are rarely read but must be kept
- Keep hot data on fast local disks, move cold data to cheap cloud storage
- Receivers reading old data are slower, but that is okay for replays

Extra features we might add:

Feature | What it does | Why it is useful
Log compaction | Keep only the latest message for each key, delete older ones | Great for keeping latest state (like user profile). Saves space.
Quotas | Limit how fast one sender or receiver can go | Prevents one app from hogging all the resources
Schema registry | Checks that messages follow a specific format | Prevents bad data from entering the system
Dead letter queue | Special topic for messages that failed processing | Failed messages do not block others, can be fixed later
Monitoring | Track message rates, delays, disk usage | Know when something is going wrong before users complain
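Log compaction from the table above is easy to picture in code: keep only the newest value for each key (a sketch; the real broker compacts old segment files in the background):

def compact(log):
    """Keep the latest value per key. dicts keep insertion order in Python 3.7+."""
    latest = {}
    for key, value in log:  # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())

log = [("user_1", "name=Bob"), ("user_2", "name=Ann"), ("user_1", "name=Bobby")]
print(compact(log))  # [('user_1', 'name=Bobby'), ('user_2', 'name=Ann')]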

Alternative systems to mention

If the interviewer asks about alternatives: (1) For simpler needs with less throughput, RabbitMQ is easier to set up. (2) For even lower latency, Redis Streams keeps everything in memory. (3) For infinite storage with separate compute, Apache Pulsar with BookKeeper is interesting. (4) For cloud-native, AWS Kinesis or Google Pub/Sub are managed options.

Different use cases need different focus:

Logging system (like collecting server logs):
- Can lose some messages (acks=1 is fine)
- Need lots of storage
- Receivers are slow batch processors

Payment events (like order processing):
- Cannot lose any messages (acks=all required)
- Need exactly-once processing
- Receivers must be fast

IoT sensors (like millions of thermometers):
- Millions of tiny messages
- Can lose some data
- Need to handle many connections
- Maybe use MQTT first, then stream into our system

Design Trade-offs

The baseline to compare against is the naive design: a single global log (one partition) that keeps every message in one total order.

Advantages

  • + All messages in one perfect order
  • + Simple to understand
  • + One receiver handles everything

Disadvantages

  • - Cannot scale - one computer does all work
  • - If it crashes, everything stops
  • - One receiver maximum

When to use

Only when you truly need global order AND have low volume. Very rare.