Design Walkthrough
Problem Statement
The Question: Design a system that can move millions of messages per second between different apps, save all messages safely, and let apps process data in real-time.
What the system needs to do (most important first):
- 1.Accept messages from senders (producers) - Apps send data like orders placed, buttons clicked, or logs created. The system must accept these super fast.
- 2.Store messages safely - Never lose a message once we say it is saved. Even if a computer crashes, the data must survive.
- 3.Deliver messages to receivers (consumers) - Apps that want the data can read it. Many different apps can read the same messages.
- 4.Keep messages in order - If user Bob places order 1, then order 2, receivers must see them in that order (1 before 2).
- 5.Allow replay - If an app crashes and restarts, it can go back and re-read old messages it missed.
- 6.Scale to handle huge load - Support millions of messages per second from thousands of senders and receivers.
What to say first
Let me understand the requirements before I start designing. I want to know: How many messages per second? How big are the messages? Can we ever lose messages? Do all messages need to be in one global order, or just messages about the same thing?
What the interviewer really wants to see: - Do you understand why we use a log (like a notebook you can only write at the end of)? - Can you explain the tradeoff between ordering and speed? - Do you know what exactly-once means and why it is hard? - Can you design for both high speed AND safety (durability)?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How many messages per second?
What is the expected load? How many apps are sending messages and how many are receiving?
Why ask this: The number decides how many computers we need and how we split the data.
What interviewers usually say: 1 million messages per second, 10,000 senders, 1,000 receiver groups.
How this changes your design: We need many partitions spread across many broker computers. One computer cannot handle 1 million messages alone.
Question 2: How big are the messages?
What is the typical size? What is the biggest message we might see?
Why ask this: Big messages need different handling than small ones.
What interviewers usually say: Average message is 1 KB (about one page of text). Maximum is 1 MB.
How this changes your design: We can bundle many small messages together for speed. For big messages, we might need to split them into chunks.
Question 3: Can we lose messages?
What happens if a message is lost? Is it okay to lose some, or must every message be saved?
Why ask this: Perfect safety is slower. If we can lose some messages, we can be faster.
What interviewers usually say: Once we say a message is saved, we cannot lose it. We are okay with a small delay to be safe.
How this changes your design: We need to copy each message to multiple computers before saying done. If one computer dies, the others still have the data.
Question 4: Do messages need to be in order?
Must ALL messages be in one perfect order, or just messages about the same thing (like the same user or same order ID)?
Why ask this: Perfect global order means one computer does all the work - very slow. Order per user/entity is much faster.
What interviewers usually say: Messages about the same key (like user ID) must be in order. We do not need all messages everywhere to be in one order.
How this changes your design: We send all messages with the same key to the same partition. Order is kept within partitions, but different partitions can work in parallel.
Question 5: How long do we keep messages?
Should messages be deleted after some time? Do receivers need to go back and replay old messages?
Why ask this: Keeping messages forever needs huge storage. Most systems delete old messages after some days.
What interviewers usually say: Keep messages for 7 days. Apps should be able to replay from any point in those 7 days.
How this changes your design: We need big disks. We delete messages older than 7 days. We track where each receiver left off so they can continue from there.
Summarize your assumptions
Let me summarize: 1 million messages per second, 1 KB average size, copy to 3 computers for safety, order within partition keys, keep for 7 days with replay. I will design for this.
The Hard Part
Say this to the interviewer
The hardest part is this: We want messages in order AND we want to be fast. These two goals fight each other. If one computer handles everything, we get perfect order but very slow speed. The trick is to split into partitions - each partition keeps its own order, but many partitions work at the same time.
Why this is really hard (explained simply):
- 1.Order vs Speed - If you want ALL messages in one perfect order, only one computer can write them. That computer becomes the bottleneck. You cannot use 100 computers if only 1 can write.
- 2.Exactly-once delivery - Making sure a message is delivered exactly one time (not zero, not two) is nearly impossible in computers. Networks fail, computers crash, things get retried. The solution is giving each message a unique ID and checking for repeats.
- 3.Multiple receivers working together - If 3 receivers are reading from 6 partitions, who reads which partition? When a receiver crashes, another must take over its work. This coordination is tricky.
- 4.Speed vs Safety - Writing to disk is slow. Waiting for other computers to confirm they saved a copy is slow. But we need both for safety. The trick is writing to disk in a special fast way (sequential writes) and letting senders choose how much safety they need.
Common mistake candidates make
Many people suggest a simple message queue where messages disappear after reading. That is wrong for this problem. We need a LOG - like a notebook where old pages are never torn out. This lets multiple receivers read at their own speed, lets receivers go back and re-read, and gives us built-in ordering.
The big idea: The Log
Think of a log as a notebook where: - You can only write on new pages at the end (never change old pages) - Each page has a number (0, 1, 2, 3, ...) - Many people can read the notebook at the same time - Each reader remembers which page they are on - Readers can go back to re-read old pages
This simple idea gives us: - Order: Page 5 always comes before page 6 - Replay: Go back to page 100 and read forward - Multiple readers: Everyone reads at their own pace - Safety: Old pages never change, just add new ones - No duplicates: Each page number is unique
The Log - Like a notebook you can only add to
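Here is a minimal, illustrative sketch of the idea in Python (the class and method names are made up for this example, not a real client library):

```python
class AppendOnlyLog:
    """A toy in-memory log: write only at the end, read from any offset."""

    def __init__(self):
        self.entries = []             # page 0, 1, 2, ... in order

    def append(self, message):
        self.entries.append(message)  # only ever write at the end
        return len(self.entries) - 1  # the offset (page number) of this message

    def read_from(self, offset, max_count=100):
        # Readers keep their own offset and can re-read old pages any time.
        return self.entries[offset:offset + max_count]

log = AppendOnlyLog()
log.append("order placed")      # offset 0
log.append("button clicked")    # offset 1
print(log.read_from(0))         # any reader (or a replay) can start from any offset
```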
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools and number of computers.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Messages per second | 1,000,000 | Very high - we need many computers working together |
| Average message size | 1 KB | 1 million messages x 1 KB = 1 GB of data every second |
What to tell the interviewer
At 1 million messages per second with 1 KB each, we are writing 1 GB every second. With 7-day storage and 3 copies, we need about 2 petabytes of disk space. This is a write-heavy system - we write once and read many times. Writes are always at the end of the log (sequential), which is super fast even on disk.
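The back-of-envelope math behind those numbers, written out as a quick calculation:

```python
messages_per_sec = 1_000_000
avg_msg_bytes = 1_000                            # ~1 KB per message

write_rate = messages_per_sec * avg_msg_bytes    # 1 GB written every second
per_day = write_rate * 86_400                    # ~86 TB per day
raw_week = per_day * 7                           # ~605 TB for 7 days of retention
with_copies = raw_week * 3                       # ~1.8 PB with 3 copies of everything

print(f"{write_rate/1e9:.1f} GB/s, {with_copies/1e15:.1f} PB for 7 days x 3 copies")
```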
How people use the system (from most common to least common):
- 1.Send messages (produce) - Apps constantly send new data. This must be super fast. Messages always go to the end of the log.
- 2.Read messages (consume) - Apps read messages, usually from where they left off. They read forward through the log.
- 3.Replay old messages - Sometimes apps need to go back and re-process old data. They jump to an old position and read forward.
- 4.Search for specific messages - Rare. The system is not designed for searching - it is designed for ordered reading.
How fast can one computer write?
- A good SSD can write 500 MB per second easily
- We need to write 1 GB per second total, so one broker's disk cannot keep up on its own - we spread partitions across many brokers, each writing only its share
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
The system has three main parts: (1) Brokers - computers that store the log data, (2) A coordination service - keeps track of which broker has which data, (3) Clients - senders and receivers that connect to brokers. Data is split into partitions, and each partition lives on one broker.
Stream Processing System - The Big Picture
What each part does and WHY:
| Part | What it does | Why we need it |
|---|---|---|
| Broker | Stores messages on disk. Accepts writes from senders. Sends messages to receivers. | This is where the actual data lives. Each broker is a computer with lots of disk space. |
| Topic | A name for a group of related messages (like orders or user-clicks) | Lets us organize different types of data. Senders write to topics, receivers read from topics. |
How real companies do it
Kafka (LinkedIn) originally used ZooKeeper for coordination. Newer versions use their own built-in system called KRaft. Pulsar (Yahoo) separates storage into BookKeeper. The core ideas are the same - brokers store data, something coordinates them, clients connect to brokers.
Important: Why partitions matter so much
Partitions are the secret to speed. Without partitions, one computer does all the work. With 100 partitions across 10 brokers, 10 computers share the work. With 1000 partitions, even more. The rule: messages with the same key (like user ID) always go to the same partition, so they stay in order.
Data Model and Storage
Now let me explain how we actually store the messages on disk. The key idea is that writing to the end of a file is FAST - even faster than some in-memory databases!
What to tell the interviewer
Each partition is stored as a series of files called segments. Each segment is like one chapter of a book. We write to the current chapter until it gets too big, then start a new chapter. Old chapters can be deleted when they expire. We also keep an index - like a book's table of contents - to quickly find specific messages.
How a partition is stored on disk
What is in a message?
Each message has these parts:
| Part | What it is | Example |
|---|---|---|
| Offset | The message number in this partition (like page number) | 42 means this is the 43rd message (we start at 0) |
| Timestamp | When the message was created | 2024-03-15 14:30:00 |
| Key | Used to decide which partition (optional) | user_123 or order_456 |
| Value | The actual data | The order details, click info, log text, etc. |
| Headers | Extra labels (optional) | source=mobile, version=2 |
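As a sketch, a message record could be modeled like this in Python (field names mirror the table above; this is not any specific wire format):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Message:
    offset: int                  # position in the partition, assigned by the broker
    timestamp: int               # creation time, e.g. milliseconds since epoch
    key: Optional[bytes]         # decides the partition; None means "any partition"
    value: bytes                 # the actual payload (order details, click, log line)
    headers: dict = field(default_factory=dict)   # optional extra labels

msg = Message(offset=42, timestamp=1710513000000, key=b"user_123",
              value=b'{"order_id": 456}', headers={"source": "mobile"})
```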
How we find a message quickly
Imagine you want to find message number 1500. We do not want to read through all messages from the start!
The index helps us jump close to where we want:
The index is sparse - we save one entry for every 100 messages. Each entry maps a message number to its byte position in the segment file. To find message 1500, we look up the closest index entry at or below 1500, jump straight to that position in the file, and scan forward at most 99 messages.
Key insight: We do NOT use a database
We do not use MySQL or PostgreSQL. The log file IS our database. Why? Writing to the end of a file is incredibly fast - as fast as writing to memory. The operating system automatically keeps recently-read data in memory (page cache), so repeated reads are fast too. Simple files beat complex databases for this use case.
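A small sketch of how the sparse index lookup works, assuming one index entry per 100 messages (the byte positions are made-up example values):

```python
import bisect

# Sparse index: (message number -> byte position in the segment file).
# Positions are invented for illustration, roughly 1,050 bytes per message.
index = [(off, off * 1050) for off in range(0, 2001, 100)]
offsets = [off for off, _ in index]

def find_start_position(target_offset):
    # Find the last index entry at or below the target, jump there, scan forward.
    i = bisect.bisect_right(offsets, target_offset) - 1
    entry_offset, file_position = index[i]
    return entry_offset, file_position

print(find_start_position(1500))   # -> (1500, 1575000): jump straight to it
print(find_start_position(1534))   # -> (1500, 1575000): jump, then scan 34 messages forward
```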
Batching: The speed trick
Instead of sending one message at a time, we bundle many messages together:
Without batching:
- Send message 1... network round trip... done
- Send message 2... network round trip... done
- 1 million messages means 1 million round trips
With batching:
- Bundle hundreds of messages, then one network round trip for the whole batch
- Far fewer round trips, so much higher throughput, at the cost of a few milliseconds of extra delay
Replication: Keeping Data Safe
What happens if a computer crashes and its disk dies? We lose all the data on it! To prevent this, we copy each message to multiple computers.
What to tell the interviewer
Each partition has one leader and multiple followers. The leader does all the work (accepts writes, serves reads). Followers just copy everything the leader has. If the leader dies, one of the followers becomes the new leader. This way we never lose data.
How copies (replicas) work
In-Sync Replicas (ISR) - The followers that are caught up
Not all followers are always up to date. Network can be slow, computers can lag. We keep a list of followers that ARE caught up - called the In-Sync Replica set (ISR).
Setup: Partition 0 has 3 copies (1 leader + 2 followers)
Rule: We need at least 2 copies before saying "message saved"
| Safety Setting | What it means | Speed | When to use |
|---|---|---|---|
| acks=0 | Do not wait for any confirmation | Super fast | Logs, metrics where losing some is OK |
| acks=1 | Wait for leader only | Fast | Most normal use cases |
| acks=all | Wait for all in-sync copies | Slower | Bank transactions, critical data |
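A toy model of what each acks setting waits for before replying "saved" (the Replica class and send function are illustrative, not a real client API):

```python
# How many copies must confirm before we tell the sender "saved"?
class Replica:
    def __init__(self, name):
        self.name, self.log = name, []
    def append(self, msg):
        self.log.append(msg)

def send(msg, replicas, acks):
    leader, followers = replicas[0], replicas[1:]
    leader.append(msg)                        # the leader always writes first
    if acks == 0:
        return "sent (no confirmation)"       # fastest; the message may still be lost
    if acks == 1:
        return "saved on leader"              # lost only if the leader dies right now
    for follower in followers:                # acks=all: wait for every in-sync follower
        follower.append(msg)
    return "saved on all in-sync replicas"    # survives any single broker crash

replicas = [Replica("broker-1"), Replica("broker-2"), Replica("broker-3")]
print(send(b"order-1", replicas, acks="all"))
```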
What happens when the leader crashes?
Before crash:
- Partition 0: Leader = Broker 1, Followers = Broker 2, Broker 3
- All 3 have messages 0 through 999
After the crash:
- Broker 1 dies. The coordination service notices and promotes one of the in-sync followers (say Broker 2) to leader.
- Senders and receivers reconnect to the new leader and continue from message 1000. No acknowledged message is lost.
The promise we must keep
Once we tell a sender their message is saved (with acks=all), that message MUST survive even if a computer crashes. This is our durability promise. Breaking this promise means losing bank transactions, orders, or critical data.
Senders (Producers) Deep Dive
Let me explain how the sender app works. The sender has a few important jobs: pick which partition to send to, bundle messages into batches, and handle failures.
Inside a sender app
How do we pick which partition?
The key you provide decides the partition. All messages with the same key go to the same partition (so they stay in order).
If you provide a key (like user ID):
partition = hash(key) % number_of_partitions
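A sketch of that partition-picking rule. Real clients use their own hash functions (Kafka's default partitioner uses murmur2, for example); the important part is that the hash is stable, so the same key always maps to the same partition:

```python
import hashlib
import itertools

_round_robin = itertools.count()   # one common choice for messages without a key

def pick_partition(key, num_partitions):
    """Same key -> same partition (keeps per-key order); no key -> spread evenly."""
    if key is not None:
        digest = hashlib.md5(key.encode()).digest()   # stable hash (not Python's hash())
        return int.from_bytes(digest[:4], "big") % num_partitions
    return next(_round_robin) % num_partitions        # keyless messages rotate round-robin

print(pick_partition("user_123", 6))   # always lands on the same partition
print(pick_partition("user_123", 6))   # same again -> order preserved for this user
print(pick_partition(None, 6))         # keyless: just spread the load
```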
How batching works
We do not send messages one by one. We collect them into batches first.
Settings:
- batch_size = 16 KB (send when batch reaches this size)
- linger_time = 5 ms (or send after this much time, whichever comes first)
Key configuration
Two settings control the speed vs delay tradeoff: batch_size (how big before sending) and linger_time (how long to wait for more messages). For high throughput, use bigger batches and longer linger. For low delay, use smaller batches and shorter linger.
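A toy batcher that follows those two settings - flush when the batch is big enough or old enough, whichever comes first. (Real clients also flush on a background timer; this simplified sketch only checks when a new message arrives.)

```python
import time

class Batcher:
    """Toy batcher: send when the batch is big enough OR old enough."""
    def __init__(self, batch_size=16_384, linger_ms=5, send=print):
        self.batch_size, self.linger_s, self.send = batch_size, linger_ms / 1000, send
        self.buffer, self.bytes, self.started = [], 0, None

    def add(self, msg: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.buffer.append(msg)
        self.bytes += len(msg)
        if self.bytes >= self.batch_size or time.monotonic() - self.started >= self.linger_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)             # one network round trip for the whole batch
        self.buffer, self.bytes, self.started = [], 0, None

b = Batcher(batch_size=3_000, linger_ms=5)
for i in range(10):
    b.add(f"message-{i}".encode() * 100)       # ~1 KB each -> flushes every few messages
b.flush()                                      # flush whatever is left at shutdown
```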
Receivers (Consumers) Deep Dive
Now let me explain how receiver apps work. Receivers read messages from partitions and remember where they left off (their offset). Multiple receivers can work together in a group to share the work.
How receivers share work in a group
How receiver groups work
A group is a team of receivers that share the work. Each partition is assigned to exactly one receiver in the group. This way, no message is processed twice.
Scenario: Topic has 6 partitions, Group has 3 receivers
Initial assignment:
- Receiver 1 reads partitions 0 and 1
- Receiver 2 reads partitions 2 and 3
- Receiver 3 reads partitions 4 and 5
If a receiver crashes, its partitions are reassigned to the remaining receivers in the group.
Tracking progress with offsets
Each receiver remembers which message it read last. This is called the offset - like a bookmark in a book.
Partition 0 has messages: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Receiver starts reading:
- Reads and processes messages 0, 1, 2, 3, 4
- Commits offset 5 - the bookmark meaning "start from message 5 next time"
- If it crashes and restarts, it loads its saved offset and continues from message 5
Important detail
Offsets are saved in a special topic called __consumer_offsets. This topic is replicated and safe just like any other topic. Each entry is: (group ID, topic, partition) -> (offset). When a receiver restarts, it reads its last offset from this topic.
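A toy model of that offset bookkeeping, using a plain dict in place of the replicated __consumer_offsets topic:

```python
# (group, topic, partition) -> next offset to read. A dict stands in for the
# compacted, replicated __consumer_offsets topic described above.
offset_store = {}

def commit(group, topic, partition, next_offset):
    offset_store[(group, topic, partition)] = next_offset   # only the latest value matters

def position_after_restart(group, topic, partition):
    # A group that has never committed starts from the beginning (offset 0).
    return offset_store.get((group, topic, partition), 0)

commit("billing-app", "orders", 0, 5)
print(position_after_restart("billing-app", "orders", 0))   # -> 5: resume at message 5
print(position_after_restart("new-app", "orders", 0))       # -> 0: new group reads from start
```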
Exactly-Once Delivery
The holy grail of messaging is exactly-once: each message is processed exactly one time. Not zero (lost), not two (duplicate). This is really hard to achieve.
Why exactly-once is hard
In the real world, networks fail, computers crash, messages get retried. How do you know if a message was already processed? If you retry after a timeout, maybe it worked the first time. If you do not retry, maybe it was lost. Exactly-once requires careful tracking on both the sender and receiver sides.
Problem 1: Sender retries create duplicates
Imagine: Sender sends message. Broker saves it. Network fails before confirmation reaches sender. Sender retries. Now the message is saved twice!
Each sender gets:
- Producer ID (PID): A unique number like 12345
- Sequence number: Counts up for each message (0, 1, 2, 3, ...)
The broker remembers the highest sequence number it has saved for each producer ID. If a retry arrives with a sequence number it has already seen, the broker ignores it instead of writing a duplicate.
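A sketch of that broker-side duplicate check (names and structure are illustrative):

```python
# Remember the last sequence number seen per (producer ID, partition) and drop
# anything we have already stored.
last_seq = {}     # (producer_id, partition) -> highest sequence number written
log = []

def append_once(producer_id, partition, seq, msg):
    if seq <= last_seq.get((producer_id, partition), -1):
        return "duplicate, ignored"            # the retry of a message we already have
    log.append(msg)
    last_seq[(producer_id, partition)] = seq
    return f"saved at offset {len(log) - 1}"

print(append_once(12345, 0, 0, b"order-1"))    # saved at offset 0
print(append_once(12345, 0, 0, b"order-1"))    # retry after a lost ack -> ignored
print(append_once(12345, 0, 1, b"order-2"))    # next sequence number -> saved at offset 1
```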
Problem 2: Writing to multiple partitions atomically
Sometimes you need to write to several partitions as one unit. Either all succeed or all fail. This is called a transaction.
Transaction: All or nothing
Example: When an order comes in, we need to:
1. Write the order to the "orders" topic
2. Update inventory in the "inventory" topic
Either both writes become visible or neither does - a sketch of this idea follows.
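A toy version of the all-or-nothing idea - buffer the writes and make them visible together. (A real system instead writes the data plus begin/commit/abort markers into the log; this sketch only shows the behavior.)

```python
class Transaction:
    def __init__(self, topics):
        self.topics, self.pending = topics, []
    def send(self, topic, msg):
        self.pending.append((topic, msg))       # nothing is visible yet
    def commit(self):
        for topic, msg in self.pending:         # everything becomes visible together
            self.topics[topic].append(msg)
        self.pending = []
    def abort(self):
        self.pending = []                       # as if none of it ever happened

topics = {"orders": [], "inventory": []}
txn = Transaction(topics)
txn.send("orders", b"order-789 placed")
txn.send("inventory", b"item-42 count -1")
txn.commit()                                    # both writes appear; on abort, neither would
print(topics)
```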
Problem 3: Receivers might see uncommitted messages
By default, receivers see all messages including pending ones. To only see committed messages, use the read_committed setting.
read_uncommitted (default):
- Receiver sees ALL messages immediately
- Including messages from transactions that might be rolled back
read_committed:
- Receiver only sees messages from transactions that finished with a commit
- Messages from aborted transactions are skipped
The practical truth about exactly-once
True exactly-once requires: (1) Idempotent senders with sequence numbers, (2) Transactions for multi-partition writes, (3) Read-committed receivers, (4) Idempotent processing in receiver apps. Most systems use at-least-once with idempotent receivers - simpler and almost as good.
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
| What breaks | What happens | How we fix it | How long to recover |
|---|---|---|---|
| One broker crashes | Its partitions become unavailable briefly | Followers become leaders. Data is safe because we have copies. | 10-30 seconds to elect new leader |
| Broker disk dies | Data on that disk is lost forever | We have copies on other brokers. Re-copy data to a new broker. | Minutes to hours depending on data size |
Example: What happens when a broker crashes
Before the crash:
- Partition 0: Leader = Broker 1, Followers = Broker 2, Broker 3
- All three have messages 0 through 999 (fully synced)
After the crash:
- Broker 1 stops responding; the coordination service marks it dead
- One of the in-sync followers (say Broker 2) becomes the new leader for Partition 0
- Clients reconnect to the new leader; writes and reads continue from message 1000
- When Broker 1 comes back, it rejoins as a follower and copies everything it missed
How receivers handle failures safely
The safe way to process messages: read a batch, process every message, and only then commit your offset. If the receiver crashes between processing and committing, it will see a few messages again after restart, so processing must tolerate duplicates. A sketch of this loop is shown below.
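A runnable sketch of that loop, with a toy consumer standing in for a real client:

```python
# Process first, commit the offset only after processing succeeds. ToyConsumer
# and the records are illustrative stand-ins for a real client and real messages.
class ToyConsumer:
    def __init__(self, messages):
        self.messages, self.position, self.committed = messages, 0, 0
    def poll(self, max_records=100):
        batch = self.messages[self.position:self.position + max_records]
        self.position += len(batch)
        return batch
    def commit(self):
        self.committed = self.position           # bookmark saved only after processing

processed_ids = set()                            # dedup table for idempotent processing

def handle(record):
    if record["id"] in processed_ids:            # already handled on a previous attempt
        return
    print("processing", record["id"])            # stand-in for real work (DB write, etc.)
    processed_ids.add(record["id"])

consumer = ToyConsumer([{"id": 1}, {"id": 2}, {"id": 3}])
while True:
    batch = consumer.poll(max_records=2)         # 1) read a batch
    if not batch:
        break
    for record in batch:                         # 2) process every message
        handle(record)
    consumer.commit()                            # 3) only now save the offset
```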
Key insight
The stream system guarantees exactly-once between sender and the log. But your receiver app must handle the possibility of seeing a message twice. Use the message offset or a unique ID to detect and skip duplicates. This is called idempotent processing.
Growing the System Over Time
What to tell the interviewer
This design handles millions of messages per second. Let me explain how we grow it when we need more capacity, and what extra features we might add.
How we grow step by step:
When we need more speed (throughput): - Add more partitions to the topic - More partitions = more parallel work - Add more receivers to the group (up to number of partitions) - Note: We can add partitions but not remove them
When brokers are overloaded: - Add more broker computers to the cluster - Move some partitions to the new brokers (rebalancing) - This can be done while the system is running
When we need more storage: - Add bigger disks to brokers - Or add more brokers - Or use tiered storage (explained below)
Tiered Storage - Hot and cold data
Why tiered storage helps: - Fast SSD is expensive, cheap cloud storage is... cheap - Most receivers read recent messages (hot data) - Old messages are rarely read but must be kept - Keep hot data on fast local disks, move cold data to cheap cloud storage - Receivers reading old data are slower, but that is okay for replays
Extra features we might add:
| Feature | What it does | Why it is useful |
|---|---|---|
| Log compaction | Keep only the latest message for each key, delete older ones | Great for keeping latest state (like user profile). Saves space. |
| Quotas | Limit how fast one sender or receiver can go | Prevents one app from hogging all the resources |
| Schema registry | Checks that messages follow a specific format | Prevents bad data from entering the system |
| Dead letter queue | Special topic for messages that failed processing | Failed messages do not block others, can be fixed later |
| Monitoring | Track message rates, delays, disk usage | Know when something is going wrong before users complain |
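As an illustration of the log compaction row above - keep only the newest value for each key:

```python
# Toy log compaction: the input is a list of (key, value) pairs in log order.
log = [
    ("user_123", "email=old@example.com"),
    ("user_456", "email=bob@example.com"),
    ("user_123", "email=new@example.com"),   # newer value for the same key
]

def compact(entries):
    latest = {}
    for key, value in entries:               # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())              # one entry per key: the current state

print(compact(log))   # user_123 keeps only the newest email; the old one is gone
```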
Alternative systems to mention
If the interviewer asks about alternatives: (1) For simpler needs with less throughput, RabbitMQ is easier to set up. (2) For even lower latency, Redis Streams keeps everything in memory. (3) For infinite storage with separate compute, Apache Pulsar with BookKeeper is interesting. (4) For cloud-native, AWS Kinesis or Google Pub/Sub are managed options.
Different use cases need different focus:
Logging system (like collecting server logs): - Can lose some messages (acks=1 is fine) - Need lots of storage - Receivers are slow batch processors
Payment events (like order processing): - Cannot lose any messages (acks=all required) - Need exactly-once processing - Receivers must be fast
IoT sensors (like millions of thermometers): - Millions of tiny messages - Can lose some data - Need to handle many connections - Maybe use MQTT first, then stream into our system