System Design Masterclass
Messaging | messaging, real-time, websockets, distributed-systems, presence, advanced

Design WhatsApp / Messenger

Design a real-time messaging system at massive scale

2B+ users, 100B+ messages/day | Similar to WhatsApp, Facebook Messenger, Telegram, Signal, Discord, Slack | 45 min read

Summary

WhatsApp lets you send messages to friends instantly. When your friend is online, they see your message right away. When they are offline (phone turned off or no internet), the message waits and gets delivered when they come back online. The tricky parts are: keeping millions of phones connected at the same time, making sure no message ever gets lost, showing messages in the right order, and sending one message to everyone in a group chat. Companies like Meta, Telegram, Discord, and Slack ask this question in interviews.

Key Takeaways

Core Problem

The main job is to deliver messages to people instantly when they are online, and save messages safely when they are offline so nothing gets lost.

The Hard Part

Keeping millions of phone connections open at the same time is hard. Also, if someone turns off their phone for a week, their messages must still be waiting when they come back.

Scaling Axis

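Each person's messages are stored separately, so we can split users across many servers. As a simple picture: users whose names start with A-M go to one set of servers, N-Z go to another set. In practice we hash the user ID instead of using the name, so the load is spread evenly.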

Critical Invariant

Messages must NEVER get lost. If you send a message, it MUST reach your friend eventually. Getting the same message twice is okay (annoying but not terrible). Losing a message is NOT okay.

Performance Requirement

When both people are online, messages should arrive in less than half a second. People expect instant messaging to be instant!

Key Tradeoff

We choose to sometimes send a message twice rather than risk losing it. The phone app remembers which messages it already showed, so duplicates get ignored.

Design Walkthrough

Problem Statement

The Question: Design a messaging app like WhatsApp where people can chat one-on-one, create group chats, see if friends are online, and share photos and videos.

What the app needs to do (most important first):

  1. Send messages between two people - Alice sends a message to Bob. If Bob is online, he sees it instantly. If Bob is offline, he sees it when he opens the app later.
  2. Group chats - Send one message to a group of friends (like a family group or work team). Everyone in the group sees the message.
  3. Show who is online - See a green dot next to friends who are currently using the app. See "last seen 5 minutes ago" for others.
  4. Message status - Show one check mark when sent, two check marks when delivered, blue check marks when read.
  5. Share photos and videos - Send pictures and videos, not just text.
  6. Work when offline - If your internet cuts out, messages you receive should still appear when internet comes back.

What to say first

This is a big problem with many features. Let me first understand the scale - how many users and messages. Then I will focus on the core feature (sending messages between two people) before adding group chats and media sharing.

What the interviewer really wants to see:

  • Can you keep millions of phone connections open at the same time?
  • What happens when someone is offline for days? Do their messages get lost?
  • If I send messages A, B, C quickly, do they arrive in the right order?
  • When I send a message to a group of 100 people, how does it reach everyone?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many people use this app? How many are chatting at the same time? How many messages are sent every day?

Why ask this: If millions of people use it at once, we need a very different design than if only thousands use it.

What interviewers usually say: 500 million people use it daily. 100 million are connected at the same time. 100 billion messages are sent every day.

How this changes your design: We need thousands of servers just to handle the connections. One server can handle about 50,000 connections, so we need 2,000+ servers just for that!

Question 2: What if someone is offline for a long time?

If someone turns off their phone for a month, should their messages still be waiting when they come back?

Why ask this: This tells us how long we need to save undelivered messages.

What interviewers usually say: Yes, save messages for at least 30 days. After 30 days, we can delete undelivered messages.

How this changes your design: We need a real database to store messages, not just memory. Messages must survive even if servers restart.

Question 3: Do messages need to arrive in order?

If I send Hi then Hello quickly, must Hi always arrive before Hello?

Why ask this: Keeping perfect order is expensive and complicated.

What interviewers usually say: Messages in the same chat should be in order. Different chats can be slightly out of order.

How this changes your design: We add a number to each message (1, 2, 3...) so the phone can sort them correctly even if they arrive in the wrong order.

Question 4: How big can groups be?

What is the maximum number of people in a group chat? 10? 100? 10,000?

Why ask this: Sending a message to 10 people is easy. Sending to 10,000 people is hard.

What interviewers usually say: Maximum 256 people per group (like WhatsApp). Most groups have fewer than 20 people.

How this changes your design: For small groups, we can send the message to everyone one by one. We do not need fancy tricks for huge groups.

Summarize your assumptions

Let me summarize what I will design for: 500 million daily users, 100 million connected at the same time, 100 billion messages per day, messages saved for 30 days, messages in order within a chat, and groups up to 256 people. I will focus on one-on-one messaging first, then add groups.

The Hard Part

Say this to the interviewer

The hardest part of a messaging app is this: How do we keep 100 million phone connections open at the same time? And how do we make sure messages are never lost, even if someone is offline for days?

Why connections are tricky (explained simply):

  1. Regular websites vs messaging apps - When you visit a website, your browser connects, gets the page, and disconnects. With messaging, your phone stays connected ALL the time so it can receive messages instantly.
  2. 100 million open connections - Imagine 100 million people on the phone at the same time, all waiting to hear from you. You need to remember who everyone is and which line they are on.
  3. Phones disconnect randomly - People walk into elevators, go into tunnels, or their battery dies. The connection breaks without warning.
  4. Server crashes - When a server with 50,000 connections crashes, all those people get disconnected. They need to reconnect to a different server.
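
The text does not say how phones should pace their reconnect attempts. One common approach, shown here only as a sketch, is exponential backoff with random jitter, so that 50,000 phones dropped by one crashed server do not all hit the next server in the same instant. The function name and the 1-second base and 60-second cap are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base_seconds=1.0, cap_seconds=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The upper bound doubles with each attempt up to a cap, and the actual
    wait is drawn uniformly below that bound so clients spread out.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Example: delays one phone might use across eight failed attempts.
rng = random.Random(7)  # seeded only to make the example repeatable
delays = [reconnect_delay(attempt, rng=rng) for attempt in range(8)]
```

The randomness matters more than the exact numbers: without it, every disconnected phone would retry on the same schedule.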

Common mistake candidates make

Many people forget about offline users. They say: just send the message through the connection. But what if the connection is broken? What if the person turned off their phone? The message would be lost forever! You MUST save messages in a database first.

The key insight:

Every user has an "inbox" - like a mailbox that saves messages for them.

  • When you are online: New messages go to your inbox AND get pushed to your phone instantly.
  • When you are offline: New messages go to your inbox and wait. When you open the app, you get all waiting messages.

This way, messages are NEVER lost!

[Diagram: How messages are delivered]

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Total users | 2 billion | Huge number of accounts to manage
Daily active users | 500 million | How many people use it each day
+ 6 more rows...

What to tell the interviewer

At 100 billion messages per day, we are sending over 1 million messages every second. Each message needs to be: saved in the sender's chat history, saved in the recipient's inbox, and pushed to the recipient if they are online. That is three or more operations per message, at least two of them database writes!

Connection Servers (how many servers to keep phones connected):
- 100 million connections at once
- One server can handle 50,000 connections (with good engineering)
+ 12 more lines...
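
The headline numbers can be checked with a few lines of arithmetic. This is a back-of-envelope sketch that uses only the figures quoted in this section; the variable names are made up for the example, and the result is an average rate, so real peaks would be higher.

```python
# Figures quoted in the text above.
MESSAGES_PER_DAY = 100_000_000_000     # 100 billion messages per day
CONCURRENT_CONNECTIONS = 100_000_000   # 100 million phones connected at once
CONNECTIONS_PER_GATEWAY = 50_000       # per-server capacity assumed in the text
OPERATIONS_PER_MESSAGE = 3             # sender history, recipient inbox, push

SECONDS_PER_DAY = 24 * 60 * 60

# Average rates over a day (peak traffic is not modeled here).
avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_operations_per_second = avg_messages_per_second * OPERATIONS_PER_MESSAGE

# Ceiling division: a partly filled server still counts as a whole server.
gateway_servers = -(-CONNECTIONS_PER_GATEWAY // 1 * 0) or -(-CONCURRENT_CONNECTIONS // CONNECTIONS_PER_GATEWAY)

print(f"{avg_messages_per_second:,.0f} messages/second on average")
print(f"{avg_operations_per_second:,.0f} operations/second on average")
print(f"{gateway_servers:,} gateway servers for connections alone")
```

This gives roughly 1.16 million messages per second and 2,000 gateway servers, which matches the "over 1 million" and "2,000+" figures used in the walkthrough.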

How people use the app (from most common to least common):

  1. Send a message - Someone types and sends a message. This is the #1 action.
  2. Receive a message - Getting messages pushed to your phone instantly.
  3. Open a chat - Looking at old messages in a conversation. People scroll back to see chat history.
  4. Check who is online - Looking at the green dot to see if friends are available.

Common interview mistake: Underestimating scale

Many people say: just use one database for messages. But 1 million messages per second is way too much for one database! We need to split the data across many databases. The good news: each user's messages are separate, so we can easily split by user.
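
Splitting by user can be sketched as a function from user ID to shard number. This is a minimal illustration, assuming a fixed shard count of 64 and a function name invented for the example. It uses a stable hash from `hashlib` because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count, chosen only for this example


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard number that is the same on every server."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


bob_shard = shard_for_user("user_bob")
shards_used = {shard_for_user(f"user_{i}") for i in range(10_000)}
```

Plain modulo sharding moves most users when the shard count changes, which is one reason stores like Cassandra place data with consistent hashing instead.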

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will break this into separate services - one for handling connections, one for routing messages, one for storing messages, and one for tracking who is online. Each service does one job well. We split users across servers so the load is spread out.

[Diagram: WhatsApp System - The Big Picture]

What each part does and WHY it is separate:

Service | What it does | Why it is separate (what to tell interviewer)
Gateway Servers | Keep phone connections open. When a message needs to go to Bob, find which Gateway Bob is connected to. | Why separate? Handling 100 million connections is a huge job. These servers do ONLY connection handling - nothing else. If they crash, phones reconnect to another Gateway.
Chat Servers | Route messages to the right place. When Alice sends to Bob, figure out where Bob is and deliver the message. | Why separate? Message routing logic is different from connection handling. Chat servers can restart without dropping connections. We can scale them independently.
+ 4 more rows...

Common interview question: Why not one big service?

Interviewers often ask: Why do you need separate services? Your answer: Different parts have different needs. Gateways need to handle millions of connections. The database needs to handle millions of writes. Presence can be slightly delayed. By separating them, each part can be optimized for its job and scaled independently.

Technology Choices - Why we picked these tools:

Gateway: Custom server with WebSockets

  • WebSocket keeps a permanent connection between phone and server
  • Phone does not need to keep asking "any new messages?" - server pushes instantly
  • One server handles 50,000+ connections with good engineering

Message Queue: Kafka (Recommended)

Why we chose it:

  • Messages are NEVER lost - Kafka saves to disk
  • If something goes wrong, we can replay old messages
  • Handles millions of messages per second

Message Database: Cassandra (Recommended)

Why we chose it:

  • Handles massive writes (millions per second)
  • Scales by adding more servers (no limit)
  • Data is copied to multiple servers (if one dies, data is safe)
  • Other options: ScyllaDB (faster Cassandra), DynamoDB (managed by AWS)

Session Store: Redis

Why we chose it:

  • Super fast (in-memory)
  • Simple key-value lookup: "Where is Bob connected?"
  • If Redis crashes, phones just reconnect (data is rebuilt)

Important interview tip

Pick technologies YOU know! If you have used MySQL at your job, explain how it could work. Interviewers care more about your reasoning than the specific tool. Say: I will use Cassandra because it handles write-heavy workloads well, but I could also use DynamoDB if we are on AWS.

How WhatsApp actually does it

WhatsApp famously handled 900 million users with only 50 engineers! Their secret: They used Erlang, a programming language designed for phone systems. Erlang can handle millions of tiny processes per server. Most companies use more mainstream technologies like Java or Go.

Data Model and Storage

Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.

What to tell the interviewer

I will use Cassandra for message storage because it handles lots of writes well. The key decision is how to organize data: I will group messages by conversation so that loading a chat is fast.

Table 1: Messages - Where all the messages live

This table stores every message ever sent. Messages are grouped by conversation (so all messages between Alice and Bob are together).

Column | What it stores | Example
conversation_id | Which conversation this message belongs to | conv_alice_bob_123
message_id | Unique ID for this message (includes timestamp) | msg_20240115_143022_abc
+ 5 more rows...

Why group by conversation?

When you open a chat with Bob, we need to load the last 50 messages quickly. By grouping all Bob messages together, we read from ONE place instead of searching the whole database.
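
The effect of grouping by conversation can be shown with a small in-memory model. This is only a sketch, not Cassandra itself: the `MessageStore` class is invented for the example, and it assumes message IDs sort in time order as strings, as the `msg_20240115_143022_abc` example suggests.

```python
from bisect import insort


class MessageStore:
    """Messages grouped by conversation_id and kept sorted by message_id."""

    def __init__(self):
        # conversation_id -> sorted list of (message_id, body)
        self._partitions = {}

    def append(self, conversation_id, message_id, body):
        partition = self._partitions.setdefault(conversation_id, [])
        insort(partition, (message_id, body))  # keeps the list ordered

    def latest(self, conversation_id, limit=50):
        """Return the newest `limit` messages, oldest first."""
        return self._partitions.get(conversation_id, [])[-limit:]


store = MessageStore()
# Insert 120 messages out of order to show that arrival order does not matter.
for i in reversed(range(120)):
    store.append("conv_alice_bob_123", f"msg_{i:04d}", f"message {i}")

recent = store.latest("conv_alice_bob_123")
```

Opening the chat touches one partition and reads its tail, which is the access pattern the table layout is meant to make cheap.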

Table 2: User Inbox - Messages waiting to be delivered

This is the "mailbox" for each user. When you are offline, messages pile up here. When you come back online, we deliver everything waiting in your inbox.

Column | What it stores | Example
user_id | Whose inbox is this | user_bob
message_id | Which message | msg_20240115_143022_abc
+ 5 more rows...

Important: The inbox is the key to offline delivery!

When Bob is offline, messages go into his inbox with status pending. When Bob opens the app, we find all pending messages and deliver them. Then we change status to delivered. This is how we guarantee no message is ever lost.

Table 3: Conversations - List of all chats

This table knows which conversations exist and who is in them. When you open the app, we load your recent conversations from here.

Column | What it stores | Example
conversation_id | Unique ID for this conversation | conv_abc123
type | Is it one-on-one or a group? | direct or group
+ 4 more rows...

Table 4: User Conversations - Each person's chat list

When you open the app, you see a list of all your chats with the newest message preview. This table makes that screen load fast.

Column | What it stores | Example
user_id | Whose chat list is this | user_alice
updated_at | When was last activity | 2024-01-15 2:30 PM
conversation_id | Which conversation | conv_alice_bob_123
last_message_preview | Preview of recent message | Hey, are you free for...
unread_count | How many unread messages | 3

Session Store (Redis) - Who is connected where

This is NOT in the database - it is in fast memory (Redis). We use it to find where to send messages.

# Find which Gateway server Bob is connected to
user_session:bob -> {
    "gateway_id": "gateway-server-5",
+ 11 more lines...

What happens when Redis data disappears?

If Redis crashes, we lose the session data. But that is okay! Until Bob's session is registered again, we cannot look up his Gateway, so new messages simply wait in his inbox as if he were offline. When Bob's app reconnects, the new connection creates new session data. Messages waiting in the inbox (in Cassandra) are safe.
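
The lookup and the crash case can be sketched with a plain dictionary standing in for Redis. The `SessionStore` class and the 60-second expiry are assumptions made for this example; the expiry models the idea that a session should vanish if the phone stops refreshing it. The clock is passed in so the example runs without waiting.

```python
import time


class SessionStore:
    """In-memory stand-in for the Redis session store."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}  # user_id -> (gateway_id, expires_at)

    def register(self, user_id, gateway_id):
        # Called on connect, and again whenever the session is refreshed.
        self._sessions[user_id] = (gateway_id, self._clock() + self._ttl)

    def lookup(self, user_id):
        """Return the user's Gateway, or None if unknown or expired."""
        entry = self._sessions.get(user_id)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._sessions[user_id]
            return None
        return gateway_id

    def clear(self):
        # Simulates the store crashing and coming back empty.
        self._sessions.clear()


now = [0.0]  # fake clock so the example is deterministic
sessions = SessionStore(ttl_seconds=60.0, clock=lambda: now[0])

sessions.register("bob", "gateway-server-5")
found = sessions.lookup("bob")          # Bob's Gateway is known
now[0] = 61.0
expired = sessions.lookup("bob")        # entry timed out

sessions.register("bob", "gateway-server-5")
sessions.clear()
after_crash = sessions.lookup("bob")    # treated as offline until he reconnects
```

A `None` result is safe in both cases because the caller falls back to leaving the message in the inbox.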

Message Flow Deep Dive

Let me walk through exactly what happens when Alice sends a message to Bob. This is the most important part to explain clearly!

[Diagram: What happens when Alice sends "Hey Bob!"]

The check marks explained:

  1. One check mark (Sent): Message saved in database. Even if everything crashes, the message is safe.
  2. Two check marks (Delivered): Message reached Bob's phone. We know Bob has the message now.
  3. Blue check marks (Read): Bob opened the conversation and saw the message.

WHEN Alice sends a message to Bob:

STEP 1: Save the message (so it is never lost)
+ 23 more lines...

The important insight

We tell Alice "sent" BEFORE we deliver to Bob. Why? Because delivery might take time (Bob might be offline). Alice should not wait. The message is safe in the database - that is what matters.
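
The ordering of "save, tell the sender, then try to deliver" can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the `ChatServer` class, its dictionaries, and the callback signatures are invented for the example and stand in for the real services and the inbox table.

```python
class ChatServer:
    """Toy Chat server with an in-memory inbox table."""

    def __init__(self):
        self.inbox = {}   # (recipient, message_id) -> record with a status
        self.online = {}  # recipient -> callable(message_id, body) that pushes

    def send(self, sender, recipient, message_id, body, notify_sender):
        # Step 1: save first. Everything after this line may fail safely.
        self.inbox[(recipient, message_id)] = {
            "sender": sender, "body": body, "status": "pending",
        }
        # Step 2: the sender gets the single check mark right away.
        notify_sender(message_id, "sent")
        # Step 3: best-effort push if the recipient has a live connection.
        push = self.online.get(recipient)
        if push is not None:
            try:
                push(message_id, body)
            except ConnectionError:
                pass  # connection died mid-push; the entry stays pending

    def ack(self, recipient, message_id):
        # The recipient's phone confirmed receipt (second check mark).
        entry = self.inbox.get((recipient, message_id))
        if entry is not None:
            entry["status"] = "delivered"


events = []
server = ChatServer()
server.online["bob"] = lambda mid, body: events.append(("push", mid))


def record(mid, status):
    events.append((status, mid))


server.send("alice", "bob", "m1", "Hey Bob!", record)    # Bob is online
server.send("alice", "carol", "m2", "Hi Carol", record)  # Carol is offline
status_before_ack = server.inbox[("bob", "m1")]["status"]
server.ack("bob", "m1")
status_after_ack = server.inbox[("bob", "m1")]["status"]
```

Note that a successful push alone does not mark the message delivered. Only the acknowledgment from Bob's phone does, which is what makes a retry after a lost acknowledgment possible.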

What happens when Bob comes back online:

WHEN Bob opens the app:

STEP 1: Connect to a Gateway server
+ 22 more lines...

What if Bob was offline for a week?

Bob might have hundreds of waiting messages! We deliver them in batches (50 at a time) so his phone is not overwhelmed. We also sort by conversation so recent chats come first.
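
The batching and ordering described above might look like the following sketch. The function names are invented for the example, and it assumes message IDs sort in time order as strings. Conversations with the newest activity go first, and messages inside each conversation stay oldest-first.

```python
from itertools import islice


def delivery_order(pending):
    """Order (conversation_id, message_id) pairs for delivery."""
    newest = {}
    for conversation_id, message_id in pending:
        if message_id > newest.get(conversation_id, ""):
            newest[conversation_id] = message_id
    # First pass: oldest message first overall.
    ordered = sorted(pending, key=lambda pair: pair[1])
    # Second pass: most recently active conversation first. The sort is
    # stable, so messages within a conversation keep their oldest-first order.
    ordered.sort(key=lambda pair: (newest[pair[0]], pair[0]), reverse=True)
    return ordered


def in_batches(items, size=50):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


pending = [
    ("conv_family", "msg_003"),
    ("conv_bob", "msg_001"),
    ("conv_family", "msg_005"),
    ("conv_bob", "msg_002"),
    ("conv_work", "msg_004"),
]
ordered = delivery_order(pending)
batches = list(in_batches(ordered, size=2))  # size 2 only to keep the demo small
```

With the default size of 50, a week of backlog arrives as a series of small deliveries rather than one very large one.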

Group Messages

Group chats are a bit different. When Alice sends a message to a group of 10 people, we need to deliver it to all 10.

What to tell the interviewer

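For group messages, I send the message to each member one by one. For small groups (under 20 people), this is fast enough to do right away. For bigger groups, up to the 256-member limit, the fan-out runs in the background. For very large audiences, we would need smarter strategies.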

[Diagram: Alice sends to Family Group (4 people)]

How group messages work:

  1. Save the message once - We store the message in the messages table (just one copy)
  2. Get the member list - Look up who is in the group: Bob, Carol, Dave
  3. Fan out to everyone - For each member, add the message to their inbox
  4. Deliver to online members - Bob and Dave are online, so they get the message instantly
  5. Wait for offline members - Carol is offline, her inbox saves the message
  6. Delivery receipts - Alice sees two check marks when ALL members receive the message

WHEN Alice sends "Who wants pizza?" to Family Group:

STEP 1: Save the message
+ 21 more lines...
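
The fan-out steps can be sketched with plain dictionaries standing in for the messages table and the inbox table. The function name and data shapes are assumptions made for this example. The message body is stored once, and each member's inbox only holds a pointer to it.

```python
def fan_out(message, members, messages_table, inboxes, online):
    """Store the message once, then add it to every other member's inbox.

    Returns (pushed, waiting): members who can be reached right now and
    members whose copy waits in their inbox.
    """
    messages_table[message["id"]] = message  # one copy of the body
    pushed, waiting = [], []
    for member in members:
        if member == message["sender"]:
            continue  # the sender does not need an inbox entry
        # Keyed by message ID, so repeating the fan-out adds no extra entries.
        inboxes.setdefault(member, {})[message["id"]] = "pending"
        if member in online:
            pushed.append(member)
        else:
            waiting.append(member)
    return pushed, waiting


messages_table, inboxes = {}, {}
message = {"id": "msg_pizza", "sender": "alice", "body": "Who wants pizza?"}
family = ["alice", "bob", "carol", "dave"]

pushed, waiting = fan_out(
    message, family, messages_table, inboxes, online={"alice", "bob", "dave"}
)
```

One message to a four-person group produces one row in the messages table and three inbox entries, which is the write cost the size-limit discussion refers to.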

Why groups have a size limit

If a group has 10,000 people, sending one message creates 10,000 inbox entries. That is a lot of database writes for one message! This is why WhatsApp historically limited groups to 256 people (the cap has since been raised). For bigger groups (like company announcements), you use Channels which work differently.

Group Size | How We Handle It | Why This Works
Small (under 20) | Fan out to all members immediately | Fast enough - 20 writes is no problem
Medium (20-256) | Fan out but do it in the background | Does not slow down the sender, delivery takes a few seconds
Large (256+) | Use channels or broadcast lists | Different design - message is stored once, members pull updates

Preventing Message Loss

The golden rule

A message must NEVER be lost. If Alice sends a message, Bob must eventually receive it. Getting a message twice is annoying but okay. Losing a message breaks trust forever.

Why we choose "at-least-once" delivery:

There are three choices for message delivery:

  • At-most-once: Might lose messages, but never send twice. (BAD for chat!)
  • Exactly-once: Never lose, never duplicate. (Very hard and slow)
  • At-least-once: Never lose, might duplicate. (Good for chat!)

We choose at-least-once because:

  1. Losing a message is terrible - user loses trust
  2. Getting a duplicate is annoying but fixable - phone ignores duplicates
  3. Exactly-once needs complicated coordination that slows everything down

ON Bob's Phone:

We keep a list of recently seen message IDs (last 10,000 messages)
+ 13 more lines...
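
The phone-side duplicate check can be sketched as a bounded set of recently seen IDs. The `SeenMessages` class is a name invented for this example; the default capacity of 10,000 comes from the text above, and the usage below shrinks it to 3 so the eviction is visible.

```python
from collections import OrderedDict


class SeenMessages:
    """Remembers the most recent message IDs so duplicates can be dropped."""

    def __init__(self, capacity=10_000):
        self._capacity = capacity
        self._ids = OrderedDict()  # insertion order doubles as age order

    def first_time(self, message_id):
        """Return True if this ID is new, False if it is a duplicate."""
        if message_id in self._ids:
            return False
        self._ids[message_id] = None
        if len(self._ids) > self._capacity:
            self._ids.popitem(last=False)  # forget the oldest ID
        return True


seen = SeenMessages(capacity=3)
results = [seen.first_time(m) for m in ["m1", "m2", "m1", "m3", "m4", "m1"]]
```

The last `m1` is accepted again because it had already been pushed out of the window. A bounded list only catches duplicates that arrive while the original is still remembered, so the window needs to be comfortably larger than any realistic retry gap.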

How we make sure messages are never lost:

What could go wrong | How we protect against it
Server crashes before saving message | We save to database FIRST, before doing anything else
Message queue (Kafka) loses message | Kafka saves to disk and copies to multiple servers
+ 4 more rows...

The key insight

The inbox table is our safety net. Every message goes to the inbox. It stays there until we are SURE the recipient got it (they sent an acknowledgment). If anything goes wrong, the message is safe in the inbox.

Keeping messages in order:

When Alice sends messages quickly: "Hi" then "How are you?" then "Want to meet?"

They must arrive in that order! Here is how we do it:

Each message gets a special ID called TIMEUUID:
- It contains the exact time the message was created
- It is unique (no two messages have the same ID)
+ 13 more lines...
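
The idea of an ID that is both unique and sortable by time can be sketched with a simplified stand-in. This is not the real TIMEUUID layout: the `MessageId` class below keeps only a timestamp and a tie-breaking sequence number, both invented for the example, and it covers a single sender. The clock is passed in so the example is repeatable.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class MessageId:
    """Simplified time-ordered ID: compares by timestamp, then sequence."""
    timestamp_us: int
    sequence: int


class MessageIdGenerator:
    def __init__(self, clock):
        self._clock = clock
        self._last_timestamp = -1
        self._sequence = 0

    def next_id(self):
        now = self._clock()
        if now > self._last_timestamp:
            self._last_timestamp, self._sequence = now, 0
        else:
            # Clock did not advance (or stepped back): keep the last
            # timestamp and bump the sequence so IDs still increase.
            self._sequence += 1
        return MessageId(self._last_timestamp, self._sequence)


ticks = iter([100, 100, 90])  # same tick twice, then a clock step backwards
generator = MessageIdGenerator(clock=lambda: next(ticks))
texts = ["Hi", "How are you?", "Want to meet?"]
sent = [(generator.next_id(), text) for text in texts]

arrived = [sent[2], sent[0], sent[1]]  # network delivers out of order
displayed = [text for _, text in sorted(arrived)]
```

Because the phone sorts by ID before showing messages, the arrival order stops mattering within a chat.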

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

Common failures and how we handle them:

What breaks | What happens to users | How we fix it | Why this works
Gateway server crashes | Users on that server get disconnected | Phone automatically reconnects to another Gateway | Load balancer sends them to a healthy server
Chat server crashes | Messages are delayed | Kafka holds messages, another Chat server picks them up | Kafka never loses messages
+ 5 more rows...

When Bob reconnects after being offline:

WHEN Bob opens the app after being offline for 3 days:

STEP 1: Connect to a Gateway
+ 27 more lines...

What is idempotent? (important word to know)

Idempotent means: doing something twice has the same result as doing it once. Our delivery is idempotent - if we accidentally try to deliver the same message twice, it is okay! The phone ignores duplicates, and we do not create extra entries.
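
An idempotent inbox write can be sketched by keying each entry on the pair (user, message). The function names and the dictionary that stands in for the inbox table are assumptions made for this example. Repeating either operation leaves the table exactly as it was after the first call.

```python
def record_delivery(inbox, user_id, message_id):
    """Add a pending entry unless one already exists for this pair."""
    # setdefault never overwrites, so a retry cannot reset "delivered".
    inbox.setdefault((user_id, message_id), "pending")


def mark_delivered(inbox, user_id, message_id):
    """Record the phone's acknowledgment; safe to call more than once."""
    key = (user_id, message_id)
    if key in inbox:
        inbox[key] = "delivered"


inbox = {}
record_delivery(inbox, "user_bob", "msg_1")
record_delivery(inbox, "user_bob", "msg_1")   # retry after a timeout
entries_after_retry = len(inbox)

mark_delivered(inbox, "user_bob", "msg_1")
mark_delivered(inbox, "user_bob", "msg_1")    # duplicate acknowledgment
record_delivery(inbox, "user_bob", "msg_1")   # late retry arrives after the ack
final_status = inbox[("user_bob", "msg_1")]
```

The late retry is the interesting case: with an overwrite instead of `setdefault`, it would flip the entry back to pending and trigger another delivery.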

Growing the System Over Time

What to tell the interviewer

This design works great for up to 100 million users in one location. Let me explain how we would grow it to support users around the world.

How we grow step by step:

Stage 1: Single Region (up to 100 million users)

  • All servers in one data center (like US-East)
  • Simple and fast - messages never leave the building
  • Add more servers as users grow

Stage 2: Multiple Regions (100-500 million users)

  • Data centers in US, Europe, and Asia
  • Users connect to the nearest data center (faster!)
  • Messages between regions travel through a fast backbone

Stage 3: Global Scale (500 million+ users)

  • Each region is somewhat independent
  • Complex coordination for cross-region messages
  • This is where things get really hard!

[Diagram: Multiple regions around the world]

When Alice (US) sends to Bob (Europe):

  1. Alice's message goes to US data center
  2. Saved in US database
  3. Routed through backbone to Europe (adds ~100ms)
  4. Saved in Europe database
  5. Delivered to Bob through Europe Gateway

Total extra delay: about 100-200 milliseconds (still feels instant!)

Cool features we can add later:

1. End-to-End Encryption

Messages are encrypted on Alice's phone and decrypted on Bob's phone. Even we (the server) cannot read them!

BEFORE sending:
- Alice's phone encrypts: "Hey Bob!" becomes "X3kJ9mP..."
- Only Bob's phone has the key to decrypt it
+ 12 more lines...

2. Voice and Video Calls

  • Real-time audio/video is different from messages
  • Uses peer-to-peer connections when possible (phone to phone directly)
  • Falls back to server relay when direct connection fails

3. Status Updates (Stories)

  • Photos/videos that disappear after 24 hours
  • Stored differently - expire automatically
  • Fan out to all contacts who want to see

Feature | Why it is different | Special handling needed
Voice/Video Calls | Real-time streaming, not store-and-forward | Use WebRTC for peer-to-peer, TURN servers for relay
Status/Stories | Temporary content, broadcast to many | Time-based expiration, lazy fan-out
File Sharing | Large files (100MB+) | Upload to CDN first, send link in message
Message Search | Find old messages by keyword | Separate search index (Elasticsearch)
Multi-device | Same account on phone and laptop | Sync messages to all devices, handle conflicts

What about multiple devices?

If Bob uses WhatsApp on his phone AND his laptop, both need to see messages. With end-to-end encryption, this is tricky! WhatsApp Web originally worked by using the phone as the source - the laptop connected through the phone. WhatsApp has since moved to a multi-device design where linked devices connect on their own, similar to how apps like Signal sync messages to all devices directly.

Design Trade-offs

Advantages

  • Messages are NEVER lost
  • Simple to understand
  • Survives any crash

Disadvantages

  • Slightly slower - extra database write
  • More storage needed

When to use

Always use this for real chat apps. Message loss destroys user trust.