Design Walkthrough
Problem Statement
The Question: Design a messaging app like WhatsApp where people can chat one-on-one, create group chats, see if friends are online, and share photos and videos.
What the app needs to do (most important first):
1. Send messages between two people - Alice sends a message to Bob. If Bob is online, he sees it instantly. If Bob is offline, he sees it when he opens the app later.
2. Group chats - Send one message to a group of friends (like a family group or work team). Everyone in the group sees the message.
3. Show who is online - See a green dot next to friends who are currently using the app. See "last seen 5 minutes ago" for others.
4. Message status - Show one check mark when sent, two check marks when delivered, blue check marks when read.
5. Share photos and videos - Send pictures and videos, not just text.
6. Work when offline - If your internet cuts out, messages you receive should still appear when internet comes back.
What to say first
This is a big problem with many features. Let me first understand the scale - how many users and messages. Then I will focus on the core feature (sending messages between two people) before adding group chats and media sharing.
What the interviewer really wants to see:
- Can you keep millions of phone connections open at the same time?
- What happens when someone is offline for days? Do their messages get lost?
- If I send messages A, B, C quickly, do they arrive in the right order?
- When I send a message to a group of 100 people, how does it reach everyone?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many people use this app? How many are chatting at the same time? How many messages are sent every day?
Why ask this: If millions of people use it at once, we need a very different design than if only thousands use it.
What interviewers usually say: 500 million people use it daily. 100 million are connected at the same time. 100 billion messages are sent every day.
How this changes your design: We need thousands of servers just to handle the connections. One server can handle about 50,000 connections, so we need 2,000+ servers just for that!
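The server count is plain division, and it is worth showing the arithmetic. A quick sketch using the interviewer's numbers; the 30% spare capacity is an assumption added for illustration and does not come from the problem statement:

```python
CONCURRENT_CONNECTIONS = 100_000_000   # from the interviewer's answer
CONNECTIONS_PER_SERVER = 50_000        # rough per-server capacity stated above


def gateways_needed(connections: int, per_server: int, headroom_pct: int = 0) -> int:
    """Servers required to hold `connections`, rounded up, with optional spare capacity."""
    needed = connections * (100 + headroom_pct)
    capacity = per_server * 100
    return (needed + capacity - 1) // capacity   # integer ceiling division


baseline = gateways_needed(CONCURRENT_CONNECTIONS, CONNECTIONS_PER_SERVER)
# Assumed 30% headroom so one crashed server does not overload the rest.
with_headroom = gateways_needed(CONCURRENT_CONNECTIONS, CONNECTIONS_PER_SERVER, headroom_pct=30)
print(baseline, with_headroom)   # 2000 2600
```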
Question 2: What if someone is offline for a long time?
If someone turns off their phone for a month, should their messages still be waiting when they come back?
Why ask this: This tells us how long we need to save undelivered messages.
What interviewers usually say: Yes, save messages for at least 30 days. After 30 days, we can delete undelivered messages.
How this changes your design: We need a real database to store messages, not just memory. Messages must survive even if servers restart.
Question 3: Do messages need to arrive in order?
If I send Hi then Hello quickly, must Hi always arrive before Hello?
Why ask this: Keeping perfect order is expensive and complicated.
What interviewers usually say: Messages in the same chat should be in order. Different chats can be slightly out of order.
How this changes your design: We add a number to each message (1, 2, 3...) so the phone can sort them correctly even if they arrive in the wrong order.
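One way a phone could use these numbers is a small reorder buffer: hold back any message that arrives early and release it once the gap is filled. This is only a sketch; `ReorderBuffer` and its method names are made up for this example.

```python
class ReorderBuffer:
    """Releases the messages of one chat strictly in sequence order."""

    def __init__(self, first_seq: int = 1) -> None:
        self.next_seq = first_seq   # next sequence number we expect to show
        self.pending = {}           # early arrivals, keyed by sequence number

    def receive(self, seq: int, text: str) -> list:
        """Accept one message; return whatever can now be shown, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []               # already shown or already buffered
        self.pending[seq] = text
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
out = []
out += buf.receive(2, "Hello")   # arrives early, held back
out += buf.receive(1, "Hi")      # fills the gap, releases both
print(out)                       # ['Hi', 'Hello']
```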
Question 4: How big can groups be?
What is the maximum number of people in a group chat? 10? 100? 10,000?
Why ask this: Sending a message to 10 people is easy. Sending to 10,000 people is hard.
What interviewers usually say: Maximum 256 people per group (like WhatsApp). Most groups have fewer than 20 people.
How this changes your design: For small groups, we can send the message to everyone one by one. We do not need fancy tricks for huge groups.
Summarize your assumptions
Let me summarize what I will design for: 500 million daily users, 100 million connected at the same time, 100 billion messages per day, messages saved for 30 days, messages in order within a chat, and groups up to 256 people. I will focus on one-on-one messaging first, then add groups.
The Hard Part
Say this to the interviewer
The hardest part of a messaging app is this: How do we keep 100 million phone connections open at the same time? And how do we make sure messages are never lost, even if someone is offline for days?
Why connections are tricky (explained simply):
1. Regular websites vs messaging apps - When you visit a website, your browser connects, gets the page, and disconnects. With messaging, your phone stays connected ALL the time so it can receive messages instantly.
2. 100 million open connections - Imagine 100 million people on the phone at the same time, all waiting to hear from you. You need to remember who everyone is and which line they are on.
3. Phones disconnect randomly - People walk into elevators, go into tunnels, or their battery dies. The connection breaks without warning.
4. Server crashes - When a server with 50,000 connections crashes, all those people get disconnected. They need to reconnect to a different server.
Common mistake candidates make
Many people forget about offline users. They say: just send the message through the connection. But what if the connection is broken? What if the person turned off their phone? The message would be lost forever! You MUST save messages in a database first.
The key insight:
Every user has an "inbox" - like a mailbox that saves messages for them.
- When you are online: New messages go to your inbox AND get pushed to your phone instantly.
- When you are offline: New messages go to your inbox and wait. When you open the app, you get all waiting messages.
This way, messages are NEVER lost!
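The inbox idea fits in a few lines of code. The sketch below keeps everything in memory and uses made-up names (`InboxService`, `connect`, `send`); a real system would write the inbox to a database, as discussed later.

```python
from collections import defaultdict


class InboxService:
    """Store first, then push: every message lands in the inbox before delivery."""

    def __init__(self) -> None:
        self.inbox = defaultdict(list)   # user -> list of {"message", "status"}
        self.online = {}                 # user -> function that pushes to the phone

    def send(self, recipient, message):
        entry = {"message": message, "status": "pending"}
        self.inbox[recipient].append(entry)   # saved before anything else happens
        self._try_push(recipient, entry)

    def connect(self, user, push):
        """The user opened the app: deliver everything still waiting."""
        self.online[user] = push
        for entry in self.inbox[user]:
            if entry["status"] == "pending":
                self._try_push(user, entry)

    def disconnect(self, user):
        self.online.pop(user, None)

    def _try_push(self, user, entry):
        push = self.online.get(user)
        if push is not None:              # offline users simply keep the entry
            push(entry["message"])
            entry["status"] = "delivered"


service = InboxService()
bob_screen = []
service.send("bob", "Hey Bob!")             # Bob is offline: the message waits
service.connect("bob", bob_screen.append)   # Bob opens the app: it arrives
print(bob_screen)                           # ['Hey Bob!']
```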
How messages are delivered
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total users | 2 billion | Huge number of accounts to manage |
| Daily active users | 500 million | How many people use it each day |
What to tell the interviewer
At 100 billion messages per day, we are sending over 1 million messages every second. Each message needs to be: saved in the sender's chat history, saved in the recipient's inbox, and pushed to the recipient if they are online. That is 3+ database operations per message!
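The per-second figures can be checked with a few lines of arithmetic. This assumes traffic is spread evenly over the day, which real traffic is not, so treat the result as an average rather than a peak:

```python
MESSAGES_PER_DAY = 100_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
OPS_PER_MESSAGE = 3   # sender's history + recipient's inbox + push attempt

messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
ops_per_second = messages_per_second * OPS_PER_MESSAGE

print(f"{messages_per_second:,.0f} messages/s")   # 1,157,407 messages/s
print(f"{ops_per_second:,.0f} operations/s")      # 3,472,222 operations/s
```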
Connection Servers (how many servers to keep phones connected):
- 100 million connections at once
- One server can handle 50,000 connections (with good engineering)
How people use the app (from most common to least common):
1. Send a message - Someone types and sends a message. This is the #1 action.
2. Receive a message - Getting messages pushed to your phone instantly.
3. Open a chat - Looking at old messages in a conversation. People scroll back to see chat history.
4. Check who is online - Looking at the green dot to see if friends are available.
Common interview mistake: Underestimating scale
Many people say: just use one database for messages. But 1 million messages per second is way too much for one database! We need to split the data across many databases. The good news: each user's messages are separate, so we can easily split by user.
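Splitting by user can be as simple as hashing the user ID. A minimal sketch; the shard count of 1024 and the function name `shard_for` are assumptions for illustration:

```python
import hashlib

NUM_SHARDS = 1024   # illustrative value, not a number from the interview


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route differently on each server.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("user_bob") == shard_for("user_bob"))   # True: same user, same shard
```

Plain modulo hashing moves most users when the shard count changes. Databases like Cassandra avoid this with consistent hashing, which is one reason to let the database do the splitting for you.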
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will break this into separate services - one for handling connections, one for routing messages, one for storing messages, and one for tracking who is online. Each service does one job well. We split users across servers so the load is spread out.
WhatsApp System - The Big Picture
What each part does and WHY it is separate:
| Service | What it does | Why it is separate (what to tell interviewer) |
|---|---|---|
| Gateway Servers | Keep phone connections open. When a message needs to go to Bob, find which Gateway Bob is connected to. | Why separate? Handling 100 million connections is a huge job. These servers do ONLY connection handling - nothing else. If they crash, phones reconnect to another Gateway. |
| Chat Servers | Route messages to the right place. When Alice sends to Bob, figure out where Bob is and deliver the message. | Why separate? Message routing logic is different from connection handling. Chat servers can restart without dropping connections. We can scale them independently. |
Common interview question: Why not one big service?
Interviewers often ask: Why do you need separate services? Your answer: Different parts have different needs. Gateways need to handle millions of connections. The database needs to handle millions of writes. Presence can be slightly delayed. By separating them, each part can be optimized for its job and scaled independently.
Technology Choices - Why we picked these tools:
Gateway: Custom server with WebSockets
- WebSocket keeps a permanent connection between phone and server
- Phone does not need to keep asking "any new messages?" - server pushes instantly
- One server handles 50,000+ connections with good engineering
Message Queue: Kafka (Recommended)
Why we chose it:
- Messages are NEVER lost - Kafka saves to disk
- If something goes wrong, we can replay old messages
- Handles millions of messages per second
Message Database: Cassandra (Recommended)
Why we chose it:
- Handles massive writes (millions per second)
- Scales by adding more servers (no limit)
- Data is copied to multiple servers (if one dies, data is safe)
Other options: ScyllaDB (faster Cassandra), DynamoDB (managed by AWS)
Session Store: Redis
Why we chose it:
- Super fast (in-memory)
- Simple key-value lookup: "Where is Bob connected?"
- If Redis crashes, phones just reconnect (data is rebuilt)
Important interview tip
Pick technologies YOU know! If you have used MySQL at your job, explain how it could work. Interviewers care more about your reasoning than the specific tool. Say: I will use Cassandra because it handles write-heavy workloads well, but I could also use DynamoDB if we are on AWS.
How WhatsApp actually does it
WhatsApp famously handled 900 million users with only 50 engineers! Their secret: They used Erlang, a programming language designed for phone systems. Erlang can handle millions of tiny processes per server. Most companies use more mainstream technologies like Java or Go.
Data Model and Storage
Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.
What to tell the interviewer
I will use Cassandra for message storage because it handles lots of writes well. The key decision is how to organize data: I will group messages by conversation so that loading a chat is fast.
Table 1: Messages - Where all the messages live
This table stores every message ever sent. Messages are grouped by conversation (so all messages between Alice and Bob are together).
| Column | What it stores | Example |
|---|---|---|
| conversation_id | Which conversation this message belongs to | conv_alice_bob_123 |
| message_id | Unique ID for this message (includes timestamp) | msg_20240115_143022_abc |
Why group by conversation?
When you open a chat with Bob, we need to load the last 50 messages quickly. By grouping all Bob messages together, we read from ONE place instead of searching the whole database.
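To see why this layout makes opening a chat cheap, here is an in-memory stand-in for the table. It is a sketch, not Cassandra code: `MessageStore` is a made-up class, and the second message ID is a hypothetical value in the same format as the example above.

```python
from bisect import insort
from collections import defaultdict


class MessageStore:
    """In-memory stand-in for a table partitioned by conversation_id."""

    def __init__(self) -> None:
        # conversation_id -> list of (message_id, body), kept sorted by message_id
        self.partitions = defaultdict(list)

    def save(self, conversation_id: str, message_id: str, body: str) -> None:
        insort(self.partitions[conversation_id], (message_id, body))

    def latest(self, conversation_id: str, limit: int = 50) -> list:
        """Newest `limit` messages, oldest first, read from one partition only."""
        if limit <= 0:
            return []
        return self.partitions[conversation_id][-limit:]


store = MessageStore()
# IDs start with a timestamp, so sorting by ID is sorting by time.
store.save("conv_alice_bob_123", "msg_20240115_143055_def", "Hello")
store.save("conv_alice_bob_123", "msg_20240115_143022_abc", "Hi")
recent = store.latest("conv_alice_bob_123")
print([body for _, body in recent])   # ['Hi', 'Hello']
```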
Table 2: User Inbox - Messages waiting to be delivered
This is the "mailbox" for each user. When you are offline, messages pile up here. When you come back online, we deliver everything waiting in your inbox.
| Column | What it stores | Example |
|---|---|---|
| user_id | Whose inbox is this | user_bob |
| message_id | Which message | msg_20240115_143022_abc |
Important: The inbox is the key to offline delivery!
When Bob is offline, messages go into his inbox with status pending. When Bob opens the app, we find all pending messages and deliver them. Then we change status to delivered. This is how we guarantee no message is ever lost.
Table 3: Conversations - List of all chats
This table knows which conversations exist and who is in them. When you open the app, we load your recent conversations from here.
| Column | What it stores | Example |
|---|---|---|
| conversation_id | Unique ID for this conversation | conv_abc123 |
| type | Is it one-on-one or a group? | direct or group |
Table 4: User Conversations - Each person's chat list
When you open the app, you see a list of all your chats with the newest message preview. This table makes that screen load fast.
| Column | What it stores | Example |
|---|---|---|
| user_id | Whose chat list is this | user_alice |
| updated_at | When was last activity | 2024-01-15 2:30 PM |
| conversation_id | Which conversation | conv_alice_bob_123 |
| last_message_preview | Preview of recent message | Hey, are you free for... |
| unread_count | How many unread messages | 3 |
Session Store (Redis) - Who is connected where
This is NOT in the database - it is in fast memory (Redis). We use it to find where to send messages.
# Find which Gateway server Bob is connected to
user_session:bob -> {
"gateway_id": "gateway-server-5",
What happens when Redis data disappears?
If Redis crashes, we lose the session data. But that is okay! When Bob's app tries to send a message, it will notice the connection is gone and reconnect. The new connection creates new session data. Messages waiting in the inbox (in Cassandra) are safe.
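The lookup and the "data disappears" case can be sketched with a dictionary standing in for Redis. The expiry time and the heartbeat that refreshes it are assumptions added for this example; the class and method names are hypothetical.

```python
import time


class SessionStore:
    """Dictionary stand-in for Redis: user -> gateway, with an expiry time."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self.sessions = {}   # user_id -> (gateway_id, expires_at)

    def register(self, user_id, gateway_id, now=None):
        """Called by a Gateway on connect and (assumed) on each heartbeat."""
        now = time.time() if now is None else now
        self.sessions[user_id] = (gateway_id, now + self.ttl)

    def lookup(self, user_id, now=None):
        """Return the user's gateway, or None if offline or unknown."""
        now = time.time() if now is None else now
        entry = self.sessions.get(user_id)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]


sessions = SessionStore(ttl_seconds=60)
sessions.register("bob", "gateway-server-5", now=0)
print(sessions.lookup("bob", now=10))    # gateway-server-5
print(sessions.lookup("bob", now=120))   # None: the entry expired
```

When `lookup` returns None, the sender's side treats Bob as offline and the message simply waits in his inbox, so a lost session entry delays delivery but does not lose anything.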
Message Flow Deep Dive
Let me walk through exactly what happens when Alice sends a message to Bob. This is the most important part to explain clearly!
What happens when Alice sends Hey Bob!
The check marks explained:
1. One check mark (Sent): Message saved in database. Even if everything crashes, the message is safe.
2. Two check marks (Delivered): Message reached Bob's phone. We know Bob has the message now.
3. Blue check marks (Read): Bob opened the conversation and saw the message.
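The three states only ever move forward, which makes them easy to model. A small sketch with made-up names (`Status`, `advance`):

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1        # one check mark: saved in the database
    DELIVERED = 2   # two check marks: reached the recipient's phone
    READ = 3        # blue check marks: recipient opened the chat


def advance(current: Status, reported: Status) -> Status:
    """Status only moves forward, so a late or repeated receipt is harmless."""
    return max(current, reported)


print(advance(Status.SENT, Status.READ).name)        # READ
print(advance(Status.READ, Status.DELIVERED).name)   # READ
```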
WHEN Alice sends a message to Bob:
STEP 1: Save the message (so it is never lost)
The important insight
We tell Alice "sent" BEFORE we deliver to Bob. Why? Because delivery might take time (Bob might be offline). Alice should not wait. The message is safe in the database - that is what matters.
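The ordering described here (save, confirm to the sender, then try to deliver) might look like this in code. Every parameter is a hypothetical stand-in: `database` is a list, `sessions` is a plain dict of user to gateway, and the last two are callbacks.

```python
def handle_send(message, database, sessions, notify_sender, deliver):
    """Process one outgoing message on the server side."""
    database.append(message)                 # 1. save first
    notify_sender(message["id"], "sent")     # 2. one check mark, right away
    gateway = sessions.get(message["to"])    # 3. is the recipient connected?
    if gateway is None:
        return "queued"                      #    offline: the saved copy waits
    deliver(gateway, message)                # 4. push through that Gateway
    return "pushed"


events = []
db = []


def on_status(message_id, status):
    events.append((message_id, status))


def on_deliver(gateway, message):
    events.append((gateway, message["id"]))


result = handle_send({"id": "m1", "to": "bob", "text": "Hey Bob!"},
                     db, {}, on_status, on_deliver)
print(result, events)   # queued [('m1', 'sent')]
```

Alice gets her "sent" confirmation even though Bob is offline, because step 2 does not depend on step 3 or 4.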
What happens when Bob comes back online:
WHEN Bob opens the app:
STEP 1: Connect to a Gateway server
What if Bob was offline for a week?
Bob might have hundreds of waiting messages! We deliver them in batches (50 at a time) so his phone is not overwhelmed. We also sort by conversation so recent chats come first.
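Batching with recent chats first could be sketched as below. The dictionary keys `conversation_id` and `sent_at` are hypothetical field names chosen for this example.

```python
from itertools import islice


def batches_for_reconnect(pending, batch_size=50):
    """Yield pending messages in batches, most recently active chat first."""
    latest = {}   # conversation_id -> time of its newest pending message
    for m in pending:
        cid = m["conversation_id"]
        latest[cid] = max(latest.get(cid, m["sent_at"]), m["sent_at"])
    # Oldest first inside each chat, then a stable sort puts active chats on top.
    ordered = sorted(pending, key=lambda m: m["sent_at"])
    ordered.sort(key=lambda m: (latest[m["conversation_id"]], m["conversation_id"]),
                 reverse=True)
    it = iter(ordered)
    while batch := list(islice(it, batch_size)):
        yield batch


pending = [
    {"conversation_id": "conv_work", "sent_at": 1, "text": "Meeting moved"},
    {"conversation_id": "conv_family", "sent_at": 2, "text": "Dinner?"},
    {"conversation_id": "conv_work", "sent_at": 3, "text": "Room 4"},
]
batches = list(batches_for_reconnect(pending, batch_size=2))
print([[m["text"] for m in b] for b in batches])
# [['Meeting moved', 'Room 4'], ['Dinner?']]
```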
Group Messages
Group chats are a bit different. When Alice sends a message to a group of 10 people, we need to deliver it to all 10.
What to tell the interviewer
For group messages, I send the message to each member one by one. For small groups (under 50 people), this is fast enough. For very large groups, we would need smarter strategies.
Alice sends to Family Group (4 people)
How group messages work:
1. Save the message once - We store the message in the messages table (just one copy)
2. Get the member list - Look up who is in the group: Bob, Carol, Dave
3. Fan out to everyone - For each member, add the message to their inbox
4. Deliver to online members - Bob and Dave are online, so they get the message instantly
5. Wait for offline members - Carol is offline, her inbox saves the message
6. Delivery receipts - Alice sees two check marks when ALL members receive the message
WHEN Alice sends "Who wants pizza?" to Family Group:
STEP 1: Save the message
Why groups have a size limit
If a group has 10,000 people, sending one message creates 10,000 inbox entries. That is a lot of database writes for one message! This is one reason WhatsApp capped groups at 256 people for years (the cap has since been raised). For bigger groups (like company announcements), you use Channels, which work differently.
| Group Size | How We Handle It | Why This Works |
|---|---|---|
| Small (under 20) | Fan out to all members immediately | Fast enough - 20 writes is no problem |
| Medium (20-256) | Fan out but do it in the background | Does not slow down the sender, delivery takes a few seconds |
| Large (256+) | Use channels or broadcast lists | Different design - message is stored once, members pull updates |
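The fan-out for small and medium groups can be sketched in a few lines. All names here are made up for illustration; `inboxes` is a plain dict and `push` is a callback standing in for the Gateway.

```python
def fan_out(message_id, sender, members, inboxes, online, push):
    """Add one inbox entry per recipient and push to those who are online.

    The message body is stored once elsewhere; inboxes hold only its ID.
    Returns the members who must wait until they reconnect.
    """
    waiting = []
    for member in members:
        if member == sender:
            continue                                   # the sender already has it
        inboxes.setdefault(member, []).append(message_id)
        if member in online:
            push(member, message_id)
        else:
            waiting.append(member)
    return waiting


inboxes = {}
pushed = []
waiting = fan_out(
    "msg_pizza", "alice", ["alice", "bob", "carol", "dave"],
    inboxes, online={"bob", "dave"}, push=lambda user, mid: pushed.append(user),
)
print(pushed, waiting)   # ['bob', 'dave'] ['carol']
```

The number of inbox writes is the group size minus one, which is why the cost grows directly with group size.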
Preventing Message Loss
The golden rule
A message must NEVER be lost. If Alice sends a message, Bob must eventually receive it. Getting a message twice is annoying but okay. Losing a message breaks trust forever.
Why we choose "at-least-once" delivery:
There are three choices for message delivery:
- At-most-once: Might lose messages, but never send twice. (BAD for chat!)
- Exactly-once: Never lose, never duplicate. (Very hard and slow)
- At-least-once: Never lose, might duplicate. (Good for chat!)
We choose at-least-once because:
1. Losing a message is terrible - user loses trust
2. Getting a duplicate is annoying but fixable - phone ignores duplicates
3. Exactly-once needs complicated coordination that slows everything down
ON Bob's Phone:
We keep a list of recently seen message IDs (last 10,000 messages)
How we make sure messages are never lost:
| What could go wrong | How we protect against it |
|---|---|
| Server crashes before saving message | We save to database FIRST, before doing anything else |
| Message queue (Kafka) loses message | Kafka saves to disk and copies to multiple servers |
The key insight
The inbox table is our safety net. Every message goes to the inbox. It stays there until we are SURE the recipient got it (they sent an acknowledgment). If anything goes wrong, the message is safe in the inbox.
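Here is a sketch of the "keep it until acknowledged" rule. The class and method names are hypothetical, and a dictionary stands in for the inbox table.

```python
class Inbox:
    """Entries stay until acknowledged, so unacknowledged messages get resent."""

    def __init__(self) -> None:
        self.unacked = {}   # message_id -> body, in arrival order

    def add(self, message_id, body):
        self.unacked[message_id] = body

    def deliver_all(self, send):
        """Try to send everything still waiting; safe to call repeatedly."""
        for message_id, body in list(self.unacked.items()):
            send(message_id, body)

    def ack(self, message_id):
        """Remove the entry only once the phone confirms receipt."""
        self.unacked.pop(message_id, None)


attempts = []
inbox = Inbox()
inbox.add("m1", "Hey Bob!")
inbox.deliver_all(lambda mid, body: attempts.append(mid))   # push is lost in transit
inbox.deliver_all(lambda mid, body: attempts.append(mid))   # no ack yet, so retry
inbox.ack("m1")                                             # phone confirms
inbox.deliver_all(lambda mid, body: attempts.append(mid))   # nothing left to send
print(attempts)   # ['m1', 'm1']
```

The message was sent twice, which is exactly the at-least-once trade-off: the retry is what prevents loss, and the phone is responsible for ignoring the repeat.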
Keeping messages in order:
When Alice sends messages quickly: "Hi" then "How are you?" then "Want to meet?"
They must arrive in that order! Here is how we do it:
Each message gets a special ID called TIMEUUID:
- It contains the exact time the message was created
- It is unique (no two messages have the same ID)
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
Common failures and how we handle them:
| What breaks | What happens to users | How we fix it | Why this works |
|---|---|---|---|
| Gateway server crashes | Users on that server get disconnected | Phone automatically reconnects to another Gateway | Load balancer sends them to a healthy server |
| Chat server crashes | Messages are delayed | Kafka holds messages, another Chat server picks them up | Kafka never loses messages |
When Bob reconnects after being offline:
WHEN Bob opens the app after being offline for 3 days:
STEP 1: Connect to a Gateway
What is idempotent? (important word to know)
Idempotent means: doing something twice has the same result as doing it once. Our delivery is idempotent - if we accidentally try to deliver the same message twice, it is okay! The phone ignores duplicates, and we do not create extra entries.
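On the phone, ignoring duplicates can be done with a bounded set of recently seen IDs. A minimal sketch; `SeenMessages` is a made-up name, and the default of 10,000 matches the list size mentioned earlier.

```python
from collections import OrderedDict


class SeenMessages:
    """Remembers the most recent message IDs so repeats can be dropped."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self.ids = OrderedDict()   # used as an insertion-ordered set

    def first_time(self, message_id: str) -> bool:
        """True if the message is new and should be shown; False for a repeat."""
        if message_id in self.ids:
            return False
        self.ids[message_id] = None
        if len(self.ids) > self.capacity:
            self.ids.popitem(last=False)   # forget the oldest ID
        return True


seen = SeenMessages()
print(seen.first_time("m1"))   # True: show it
print(seen.first_time("m1"))   # False: duplicate, ignore it
```

Because the list is bounded, a duplicate that arrives after more than 10,000 newer messages would slip through. That is an accepted trade-off to keep memory use on the phone small.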
Growing the System Over Time
What to tell the interviewer
This design works great for up to 100 million users in one location. Let me explain how we would grow it to support users around the world.
How we grow step by step:
Stage 1: Single Region (up to 100 million users)
- All servers in one data center (like US-East)
- Simple and fast - messages never leave the building
- Add more servers as users grow
Stage 2: Multiple Regions (100-500 million users)
- Data centers in US, Europe, and Asia
- Users connect to the nearest data center (faster!)
- Messages between regions travel through a fast backbone
Stage 3: Global Scale (500 million+ users)
- Each region is somewhat independent
- Complex coordination for cross-region messages
- This is where things get really hard!
Multiple regions around the world
When Alice (US) sends to Bob (Europe):
1. Alice's message goes to US data center
2. Saved in US database
3. Routed through backbone to Europe (adds ~100ms)
4. Saved in Europe database
5. Delivered to Bob through Europe Gateway
Total extra delay: about 100-200 milliseconds (still feels instant!)
Cool features we can add later:
1. End-to-End Encryption
Messages are encrypted on Alice's phone and decrypted on Bob's phone. Even we (the server) cannot read them!
BEFORE sending:
- Alice's phone encrypts: "Hey Bob!" becomes "X3kJ9mP..."
- Only Bob's phone has the key to decrypt it
2. Voice and Video Calls
- Real-time audio/video is different from messages
- Uses peer-to-peer connections when possible (phone to phone directly)
- Falls back to server relay when direct connection fails
3. Status Updates (Stories)
- Photos/videos that disappear after 24 hours
- Stored differently - expire automatically
- Fan out to all contacts who want to see
| Feature | Why it is different | Special handling needed |
|---|---|---|
| Voice/Video Calls | Real-time streaming, not store-and-forward | Use WebRTC for peer-to-peer, TURN servers for relay |
| Status/Stories | Temporary content, broadcast to many | Time-based expiration, lazy fan-out |
| File Sharing | Large files (100MB+) | Upload to CDN first, send link in message |
| Message Search | Find old messages by keyword | Separate search index (Elasticsearch) |
| Multi-device | Same account on phone and laptop | Sync messages to all devices, handle conflicts |
What about multiple devices?
If Bob uses WhatsApp on his phone AND his laptop, both need to see messages. With end-to-end encryption, this is tricky! WhatsApp Web originally worked by using the phone as the source - the laptop connected through the phone. Other apps like Signal sync messages to all linked devices directly, and WhatsApp has since moved to a similar multi-device model.