System Design Masterclass

Messaging | messaging, real-time, websockets, distributed-systems, presence | advanced

Design WhatsApp / Messenger

Design a real-time messaging system at massive scale

2B+ users, 100B+ messages/day | Similar to WhatsApp, Facebook Messenger, Telegram, Signal, Discord, Slack | 45 min read

Summary

WhatsApp lets you send messages to friends instantly. When your friend is online, they see your message right away. When they are offline (phone turned off or no internet), the message waits and gets delivered when they come back online. The tricky parts are: keeping millions of phones connected at the same time, making sure no message ever gets lost, showing messages in the right order, and sending one message to everyone in a group chat. Companies like Meta, Telegram, Discord, and Slack ask this question in interviews.

Key Takeaways

Core Problem

The main job is to deliver messages to people instantly when they are online, and save messages safely when they are offline so nothing gets lost.

The Hard Part

Keeping millions of phone connections open at the same time is hard. Also, if someone turns off their phone for a week, their messages must still be waiting when they come back.

Scaling Axis

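Each person's messages are stored separately, so we can split users across many servers. A simple way to picture it: users whose names start with A-M go to one set of servers, N-Z go to another set. In practice the split is usually made by hashing the user ID, so the load spreads evenly.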

Critical Invariant

Messages must NEVER get lost. If you send a message, it MUST reach your friend eventually. Getting the same message twice is okay (annoying but not terrible). Losing a message is NOT okay.

Performance Requirement

When both people are online, messages should arrive in less than half a second. People expect instant messaging to be instant!

Key Tradeoff

We choose to sometimes send a message twice rather than risk losing it. The phone app remembers which messages it already showed, so duplicates get ignored.

Design Walkthrough

Problem Statement

The Question: Design a messaging app like WhatsApp where people can chat one-on-one, create group chats, see if friends are online, and share photos and videos.

What the app needs to do (most important first):

1.Send messages between two people - Alice sends a message to Bob. If Bob is online, he sees it instantly. If Bob is offline, he sees it when he opens the app later.
2.Group chats - Send one message to a group of friends (like a family group or work team). Everyone in the group sees the message.
3.Show who is online - See a green dot next to friends who are currently using the app. See "last seen 5 minutes ago" for others.
4.Message status - Show one check mark when sent, two check marks when delivered, blue check marks when read.
5.Share photos and videos - Send pictures and videos, not just text.
6.Work when offline - If your internet cuts out, messages you receive should still appear when internet comes back.

What to say first

This is a big problem with many features. Let me first understand the scale - how many users and messages. Then I will focus on the core feature (sending messages between two people) before adding group chats and media sharing.

What the interviewer really wants to see:
- Can you keep millions of phone connections open at the same time?
- What happens when someone is offline for days? Do their messages get lost?
- If I send messages A, B, C quickly, do they arrive in the right order?
- When I send a message to a group of 100 people, how does it reach everyone?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many people use this app? How many are chatting at the same time? How many messages are sent every day?

Why ask this: If millions of people use it at once, we need a very different design than if only thousands use it.

What interviewers usually say: 500 million people use it daily. 100 million are connected at the same time. 100 billion messages are sent every day.

How this changes your design: We need thousands of servers just to handle the connections. One server can handle about 50,000 connections, so we need 2,000+ servers just for that!

Question 2: What if someone is offline for a long time?

If someone turns off their phone for a month, should their messages still be waiting when they come back?

Why ask this: This tells us how long we need to save undelivered messages.

What interviewers usually say: Yes, save messages for at least 30 days. After 30 days, we can delete undelivered messages.

How this changes your design: We need a real database to store messages, not just memory. Messages must survive even if servers restart.

Question 3: Do messages need to arrive in order?

If I send Hi then Hello quickly, must Hi always arrive before Hello?

Why ask this: Keeping perfect order is expensive and complicated.

What interviewers usually say: Messages in the same chat should be in order. Different chats can be slightly out of order.

How this changes your design: We add a number to each message (1, 2, 3...) so the phone can sort them correctly even if they arrive in the wrong order.

Question 4: How big can groups be?

What is the maximum number of people in a group chat? 10? 100? 10,000?

Why ask this: Sending a message to 10 people is easy. Sending to 10,000 people is hard.

What interviewers usually say: Maximum 256 people per group (like WhatsApp). Most groups have fewer than 20 people.

How this changes your design: For small groups, we can send the message to everyone one by one. We do not need fancy tricks for huge groups.

Summarize your assumptions

Let me summarize what I will design for: 500 million daily users, 100 million connected at the same time, 100 billion messages per day, messages saved for 30 days, messages in order within a chat, and groups up to 256 people. I will focus on one-on-one messaging first, then add groups.

The Hard Part

Say this to the interviewer

The hardest part of a messaging app is this: How do we keep 100 million phone connections open at the same time? And how do we make sure messages are never lost, even if someone is offline for days?

Why connections are tricky (explained simply):

1.Regular websites vs messaging apps - When you visit a website, your browser connects, gets the page, and disconnects. With messaging, your phone stays connected ALL the time so it can receive messages instantly.
2.100 million open connections - Imagine 100 million people on the phone at the same time, all waiting to hear from you. You need to remember who everyone is and which line they are on.
3.Phones disconnect randomly - People walk into elevators, go into tunnels, or their battery dies. The connection breaks without warning.
4.Server crashes - When a server with 50,000 connections crashes, all those people get disconnected. They need to reconnect to a different server.

Common mistake candidates make

Many people forget about offline users. They say: just send the message through the connection. But what if the connection is broken? What if the person turned off their phone? The message would be lost forever! You MUST save messages in a database first.

The key insight:

Every user has an "inbox" - like a mailbox that saves messages for them.

- When you are online: New messages go to your inbox AND get pushed to your phone instantly.
- When you are offline: New messages go to your inbox and wait. When you open the app, you get all waiting messages.

This way, messages are NEVER lost!

[Diagram: How messages are delivered]

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring	Number	What this means for our design
Total users	2 billion	Huge number of accounts to manage
Daily active users	500 million	How many people use it each day

+ 6 more rows...

What to tell the interviewer

At 100 billion messages per day, we are sending over 1 million messages every second. Each message needs to be: saved in the sender's chat history, saved in the recipient's inbox, and pushed to the recipient if they are online. That is 3+ database operations per message!

Connection Servers (how many servers to keep phones connected):
- 100 million connections at once
- One server can handle 50,000 connections (with good engineering)

+ 12 more lines...
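The numbers above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the inputs are the figures quoted in this section, the rates are daily averages (peak traffic would be higher), and the "3 operations per message" value is taken as the lower bound of the "3+" mentioned above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 100_000_000_000     # 100 billion, from the scale question
concurrent_connections = 100_000_000   # 100 million phones connected at once
connections_per_server = 50_000        # per-server capacity assumed in this design
operations_per_message = 3             # lower bound of the "3+" quoted above

# Average rate over a whole day; real traffic has peaks above this.
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
db_ops_per_second = avg_messages_per_second * operations_per_message

# Ceiling division, so a partly filled server still counts as a whole server.
gateway_servers = -(-concurrent_connections // connections_per_server)

print(f"average messages/second: {avg_messages_per_second:,.0f}")
print(f"database operations/second: {db_ops_per_second:,.0f}")
print(f"gateway servers needed: {gateway_servers:,}")
```

The results line up with the claims in the text: a little over 1 million messages per second on average, and 2,000 Gateway servers before adding any spare capacity.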

How people use the app (from most common to least common):

1.Send a message - Someone types and sends a message. This is the #1 action.
2.Receive a message - Getting messages pushed to your phone instantly.
3.Open a chat - Looking at old messages in a conversation. People scroll back to see chat history.
4.Check who is online - Looking at the green dot to see if friends are available.

Common interview mistake: Underestimating scale

Many people say: just use one database for messages. But 1 million messages per second is way too much for one database! We need to split the data across many databases. The good news: each user's messages are separate, so we can easily split by user.
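Splitting by user can be sketched as a function that maps a user ID to a shard number. This is a minimal illustration, not the scheme any particular product uses: the shard count and the function name `shard_for_user` are made up for the example. A stable hash (SHA-256 here) is used because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count, chosen only for illustration


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user in ("user_alice", "user_bob"):
        print(user, "-> shard", shard_for_user(user))
```

One caveat worth mentioning in an interview: with plain modulo hashing, changing the number of shards moves most users to a different shard. Consistent hashing is a common way to reduce that movement.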

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will break this into separate services - one for handling connections, one for routing messages, one for storing messages, and one for tracking who is online. Each service does one job well. We split users across servers so the load is spread out.

[Diagram: WhatsApp System - The Big Picture]

What each part does and WHY it is separate:

Service	What it does	Why it is separate (what to tell interviewer)
Gateway Servers	Keep phone connections open. When a message needs to go to Bob, find which Gateway Bob is connected to.	Why separate? Handling 100 million connections is a huge job. These servers do ONLY connection handling - nothing else. If they crash, phones reconnect to another Gateway.
Chat Servers	Route messages to the right place. When Alice sends to Bob, figure out where Bob is and deliver the message.	Why separate? Message routing logic is different from connection handling. Chat servers can restart without dropping connections. We can scale them independently.

+ 4 more rows...

Common interview question: Why not one big service?

Interviewers often ask: Why do you need separate services? Your answer: Different parts have different needs. Gateways need to handle millions of connections. The database needs to handle millions of writes. Presence can be slightly delayed. By separating them, each part can be optimized for its job and scaled independently.

Technology Choices - Why we picked these tools:

Gateway: Custom server with WebSockets
- WebSocket keeps a permanent connection between phone and server
- Phone does not need to keep asking "any new messages?" - server pushes instantly
- One server handles 50,000+ connections with good engineering

Message Queue: Kafka (Recommended)
Why we chose it:
- Messages are NEVER lost - Kafka saves to disk
- If something goes wrong, we can replay old messages
- Handles millions of messages per second

Message Database: Cassandra (Recommended)
Why we chose it:
- Handles massive writes (millions per second)
- Scales by adding more servers (no limit)
- Data is copied to multiple servers (if one dies, data is safe)
- Other options: ScyllaDB (faster Cassandra), DynamoDB (managed by AWS)

Session Store: Redis
Why we chose it:
- Super fast (in-memory)
- Simple key-value lookup: "Where is Bob connected?"
- If Redis crashes, phones just reconnect (data is rebuilt)

Important interview tip

Pick technologies YOU know! If you have used MySQL at your job, explain how it could work. Interviewers care more about your reasoning than the specific tool. Say: I will use Cassandra because it handles write-heavy workloads well, but I could also use DynamoDB if we are on AWS.

How WhatsApp actually does it

WhatsApp famously handled 900 million users with only 50 engineers! Their secret: They used Erlang, a programming language designed for phone systems. Erlang can handle millions of tiny processes per server. Most companies use more mainstream technologies like Java or Go.

Data Model and Storage

Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.

What to tell the interviewer

I will use Cassandra for message storage because it handles lots of writes well. The key decision is how to organize data: I will group messages by conversation so that loading a chat is fast.

Table 1: Messages - Where all the messages live

This table stores every message ever sent. Messages are grouped by conversation (so all messages between Alice and Bob are together).

Column	What it stores	Example
conversation_id	Which conversation this message belongs to	conv_alice_bob_123
message_id	Unique ID for this message (includes timestamp)	msg_20240115_143022_abc

+ 5 more rows...

Why group by conversation?

When you open a chat with Bob, we need to load the last 50 messages quickly. By grouping all messages from the conversation with Bob together, we read from ONE place instead of searching the whole database.

Table 2: User Inbox - Messages waiting to be delivered

This is the "mailbox" for each user. When you are offline, messages pile up here. When you come back online, we deliver everything waiting in your inbox.

Column	What it stores	Example
user_id	Whose inbox is this	user_bob
message_id	Which message	msg_20240115_143022_abc

+ 5 more rows...

Important: The inbox is the key to offline delivery!

When Bob is offline, messages go into his inbox with status pending. When Bob opens the app, we find all pending messages and deliver them. Then we change status to delivered. This is how we guarantee no message is ever lost.

Table 3: Conversations - List of all chats

This table knows which conversations exist and who is in them. When you open the app, we load your recent conversations from here.

Column	What it stores	Example
conversation_id	Unique ID for this conversation	conv_abc123
type	Is it one-on-one or a group?	direct or group

+ 4 more rows...

Table 4: User Conversations - Each person's chat list

When you open the app, you see a list of all your chats with the newest message preview. This table makes that screen load fast.

Column	What it stores	Example
user_id	Whose chat list is this	user_alice
updated_at	When was last activity	2024-01-15 2:30 PM
conversation_id	Which conversation	conv_alice_bob_123
last_message_preview	Preview of recent message	Hey, are you free for...
unread_count	How many unread messages	3

Session Store (Redis) - Who is connected where

This is NOT in the database - it is in fast memory (Redis). We use it to find where to send messages.

# Find which Gateway server Bob is connected to
user_session:bob -> {
    "gateway_id": "gateway-server-5",

+ 11 more lines...

What happens when Redis data disappears?

If Redis crashes, we lose the session data. But that is okay! Session data only records where each user is connected, and it is rebuilt as phones reconnect: each new connection creates new session data. Until then, a user may briefly look offline, so new messages simply wait in the inbox. Messages waiting in the inbox (in Cassandra) are safe.
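The lookup and the "data disappears" case can be sketched with a small in-memory stand-in for the session store. This is not Redis client code. The class name `SessionStore`, the expiry time, and the `clear()` method are assumptions made for the example; only the `user_session:<user>` key shape and the `gateway_id` value come from the snippet above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """In-memory stand-in for the Redis session store (illustrative only)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds   # assumed expiry, so dead sessions age out
        self._clock = clock
        self._sessions: Dict[str, Tuple[str, float]] = {}  # key -> (gateway_id, expires_at)

    def register(self, user_id: str, gateway_id: str) -> None:
        """Record (or refresh) which Gateway holds the user's connection."""
        self._sessions[f"user_session:{user_id}"] = (gateway_id, self._clock() + self._ttl)

    def lookup(self, user_id: str) -> Optional[str]:
        """Return the Gateway ID, or None if the user has no live session."""
        key = f"user_session:{user_id}"
        entry = self._sessions.get(key)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._sessions[key]  # expired: treat the user as offline
            return None
        return gateway_id

    def clear(self) -> None:
        """Simulate a crash of the session store: every entry is gone."""
        self._sessions.clear()
```

After `clear()`, `lookup("bob")` returns `None`, which the router treats the same as "Bob is offline": the message stays in his inbox until a new `register` call arrives from his next connection.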

Message Flow Deep Dive

Let me walk through exactly what happens when Alice sends a message to Bob. This is the most important part to explain clearly!

[Diagram: What happens when Alice sends "Hey Bob!"]

The check marks explained:

1.One check mark (Sent): Message saved in database. Even if everything crashes, the message is safe.
2.Two check marks (Delivered): Message reached Bob's phone. We know Bob has the message now.
3.Blue check marks (Read): Bob opened the conversation and saw the message.

WHEN Alice sends a message to Bob:

STEP 1: Save the message (so it is never lost)

+ 23 more lines...

The important insight

We tell Alice "sent" BEFORE we deliver to Bob. Why? Because delivery might take time (Bob might be offline). Alice should not wait. The message is safe in the database - that is what matters.
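The save-first idea can be sketched as a toy, single-process model. Everything here is a stand-in: the class `ChatServer` and its method names are hypothetical, the dictionaries replace the inbox table and the session store, and in a real system the routing step would run asynchronously after the sender has been acknowledged.

```python
from typing import Dict, List, Set


class ChatServer:
    """Toy, single-process model of the save-first send path (not a real server)."""

    def __init__(self) -> None:
        self.inbox: Dict[str, List[dict]] = {}   # stands in for the inbox table
        self.online: Set[str] = set()            # stands in for the session store
        self.pushed: Dict[str, List[str]] = {}   # records pushes made to connected phones

    def accept(self, sender: str, recipient: str, message_id: str, text: str) -> str:
        """Persist the message, then acknowledge the sender with one check mark."""
        entry = {"message_id": message_id, "from": sender, "text": text, "status": "pending"}
        self.inbox.setdefault(recipient, []).append(entry)
        return "sent"  # depends only on the save, never on delivery

    def route(self, recipient: str, message_id: str) -> bool:
        """Push to the recipient if connected; otherwise leave the entry pending."""
        if recipient not in self.online:
            return False
        self.pushed.setdefault(recipient, []).append(message_id)
        return True

    def acknowledge(self, recipient: str, message_id: str) -> None:
        """Mark the entry delivered once the recipient's phone confirms receipt."""
        for entry in self.inbox.get(recipient, []):
            if entry["message_id"] == message_id:
                entry["status"] = "delivered"


if __name__ == "__main__":
    server = ChatServer()
    print(server.accept("alice", "bob", "m1", "Hey Bob!"))  # Alice gets her answer here
    print(server.route("bob", "m1"))                         # Bob is offline, nothing pushed
```

Keeping `accept` and `route` separate is the point of the sketch: Alice's check mark never waits on Bob's connection.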

What happens when Bob comes back online:

WHEN Bob opens the app:

STEP 1: Connect to a Gateway server

+ 22 more lines...

What if Bob was offline for a week?

Bob might have hundreds of waiting messages! We deliver them in batches (50 at a time) so his phone is not overwhelmed. We also sort by conversation so recent chats come first.
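The batching described above could look roughly like this. It is a sketch under assumptions: each pending entry is a plain dictionary, `conversation_id` matches the column used in the tables, and `sent_at` is a hypothetical numeric timestamp field added for the example.

```python
from typing import Dict, Iterator, List

BATCH_SIZE = 50  # batch size quoted in the text


def pending_batches(pending: List[dict],
                    batch_size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield pending inbox entries in batches, most recently active conversation first."""
    # Find the newest message time in each conversation.
    latest: Dict[str, int] = {}
    for entry in pending:
        conv = entry["conversation_id"]
        latest[conv] = max(latest.get(conv, entry["sent_at"]), entry["sent_at"])

    # Newest conversations first; inside a conversation, oldest message first.
    ordered = sorted(
        pending,
        key=lambda e: (-latest[e["conversation_id"]], e["conversation_id"], e["sent_at"]),
    )
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

With 120 waiting messages this yields batches of 50, 50, and 20. The phone would acknowledge each batch before asking for the next, so a dropped connection only repeats one batch.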

Group Messages

Group chats are a bit different. When Alice sends a message to a group of 10 people, we need to deliver it to all 10.

What to tell the interviewer

For group messages, I send the message to each member one by one. For small groups (under 20 people), this is fast enough to do right away. For groups up to the 256-member limit, the same fan-out runs in the background. For very large groups, we would need smarter strategies.

[Diagram: Alice sends to Family Group (4 people)]

How group messages work:

1.Save the message once - We store the message in the messages table (just one copy)
2.Get the member list - Look up who is in the group: Bob, Carol, Dave
3.Fan out to everyone - For each member, add the message to their inbox
4.Deliver to online members - Bob and Dave are online, so they get the message instantly
5.Wait for offline members - Carol is offline, her inbox saves the message
6.Delivery receipts - Alice sees two check marks when ALL members receive the message

WHEN Alice sends "Who wants pizza?" to Family Group:

STEP 1: Save the message

+ 21 more lines...

Why groups have a size limit

If a group has 10,000 people, sending one message creates 10,000 inbox entries. That is a lot of database writes for one message! This is one reason WhatsApp long capped groups at 256 people (the cap has since been raised), and it is the limit we design for here. For bigger groups (like company announcements), you use Channels which work differently.

Group Size	How We Handle It	Why This Works
Small (under 20)	Fan out to all members immediately	Fast enough - 20 writes is no problem
Medium (20-256)	Fan out but do it in the background	Does not slow down the sender, delivery takes a few seconds
Large (256+)	Use channels or broadcast lists	Different design - message is stored once, members pull updates
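The fan-out rules in the table can be sketched as one function. This is a simplified illustration: the function name `fan_out` and the dictionary used as the inbox store are made up, and the "background" case only reports the strategy. A real system would hand that loop to a worker queue instead of running it inline.

```python
from typing import Dict, List, Tuple

SMALL_GROUP_MAX = 20    # below this size, fan out right away (from the table above)
GROUP_SIZE_LIMIT = 256  # largest group this design accepts


def fan_out(message_id: str, sender: str, members: List[str],
            inboxes: Dict[str, List[str]]) -> Tuple[str, int]:
    """Write one inbox entry per recipient and report which strategy applies."""
    if len(members) > GROUP_SIZE_LIMIT:
        raise ValueError("group exceeds the size limit; use a channel or broadcast list")

    # The sender already has the message, so only the other members get an entry.
    recipients = [m for m in members if m != sender]
    for user_id in recipients:
        inboxes.setdefault(user_id, []).append(message_id)

    strategy = "inline" if len(members) < SMALL_GROUP_MAX else "background"
    return strategy, len(recipients)


if __name__ == "__main__":
    inboxes: Dict[str, List[str]] = {}
    print(fan_out("m1", "alice", ["alice", "bob", "carol", "dave"], inboxes))
```

The message body is stored once in the messages table. Only the small inbox entries are multiplied by the group size.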

Preventing Message Loss

The golden rule

A message must NEVER be lost. If Alice sends a message, Bob must eventually receive it. Getting a message twice is annoying but okay. Losing a message breaks trust forever.

Why we choose "at-least-once" delivery:

There are three choices for message delivery:
- At-most-once: Might lose messages, but never send twice. (BAD for chat!)
- Exactly-once: Never lose, never duplicate. (Very hard and slow)
- At-least-once: Never lose, might duplicate. (Good for chat!)

We choose at-least-once because:
1.Losing a message is terrible - user loses trust
2.Getting a duplicate is annoying but fixable - phone ignores duplicates
3.Exactly-once needs complicated coordination that slows everything down

ON Bob's Phone:

We keep a list of recently seen message IDs (last 10,000 messages)

+ 13 more lines...
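The duplicate check on the phone can be sketched as a bounded set of recently seen message IDs. The class name `RecentMessageIds` and its method are hypothetical. The default capacity of 10,000 is the figure mentioned above, and the oldest ID is dropped once the set is full.

```python
from collections import OrderedDict


class RecentMessageIds:
    """Bounded, insertion-ordered set of message IDs the phone has already shown."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def first_time_seen(self, message_id: str) -> bool:
        """Return True if the message is new (show it), False if it is a duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


if __name__ == "__main__":
    seen = RecentMessageIds()
    for incoming in ("m1", "m2", "m1"):
        action = "show" if seen.first_time_seen(incoming) else "ignore duplicate"
        print(incoming, "->", action)
```

One design note: the phone should probably still acknowledge a duplicate, so the server stops retrying it. A duplicate older than the remembered window would be shown again, which is the trade-off of keeping the list bounded.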

How we make sure messages are never lost:

What could go wrong	How we protect against it
Server crashes before saving message	We save to database FIRST, before doing anything else
Message queue (Kafka) loses message	Kafka saves to disk and copies to multiple servers

+ 4 more rows...

The key insight

The inbox table is our safety net. Every message goes to the inbox. It stays there until we are SURE the recipient got it (they sent an acknowledgment). If anything goes wrong, the message is safe in the inbox.

Keeping messages in order:

When Alice sends messages quickly: "Hi" then "How are you?" then "Want to meet?"

They must arrive in that order! Here is how we do it:

Each message gets a special ID called TIMEUUID:
- It contains the exact time the message was created
- It is unique (no two messages have the same ID)

+ 13 more lines...
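Sorting by a time-based ID can be sketched like this. The tuple `(created_at, unique_suffix, text)` is a simplified stand-in for a TIMEUUID plus the message body, and `ConversationView` is a hypothetical name. The point is only that the phone can restore the right order no matter how messages arrive.

```python
import bisect
from typing import List, Tuple

# (created_at_microseconds, unique_suffix, text): stand-in for a time-based ID plus payload.
Message = Tuple[int, str, str]


class ConversationView:
    """Keeps one conversation's messages sorted by ID, whatever order they arrive in."""

    def __init__(self) -> None:
        self._messages: List[Message] = []

    def receive(self, message: Message) -> None:
        # insort keeps the list sorted; tuples compare by timestamp first, then suffix.
        bisect.insort(self._messages, message)

    def texts(self) -> List[str]:
        """Return the message bodies in display order."""
        return [text for _, _, text in self._messages]


if __name__ == "__main__":
    view = ConversationView()
    view.receive((3, "c", "Want to meet?"))  # arrives first, but was sent last
    view.receive((1, "a", "Hi"))
    view.receive((2, "b", "How are you?"))
    print(view.texts())
```

This relies on the timestamps inside the IDs being comparable. If they come from different devices, clock differences between those devices can affect the order, so some designs assign the ID on the server instead.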

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

Common failures and how we handle them:

What breaks	What happens to users	How we fix it	Why this works
Gateway server crashes	Users on that server get disconnected	Phone automatically reconnects to another Gateway	Load balancer sends them to a healthy server
Chat server crashes	Messages are delayed	Kafka holds messages, another Chat server picks them up	Kafka never loses messages

+ 5 more rows...

When Bob reconnects after being offline:

WHEN Bob opens the app after being offline for 3 days:

STEP 1: Connect to a Gateway

+ 27 more lines...

What is idempotent? (important word to know)

Idempotent means: doing something twice has the same result as doing it once. Our delivery is idempotent - if we accidentally try to deliver the same message twice, it is okay! The phone ignores duplicates, and we do not create extra entries.
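On the server side, the "no extra entries" part can be sketched by keying the inbox on the pair (user ID, message ID). This is an illustration with a plain dictionary in place of the inbox table. The function names are hypothetical, and the statuses "pending" and "delivered" are the ones used earlier in this design.

```python
from typing import Dict, Tuple

# Keyed by (user_id, message_id), so the same message can only occupy one slot.
InboxTable = Dict[Tuple[str, str], str]


def write_inbox_entry(inbox: InboxTable, user_id: str, message_id: str) -> None:
    """Insert a pending entry unless one already exists; repeating the call changes nothing."""
    inbox.setdefault((user_id, message_id), "pending")


def mark_delivered(inbox: InboxTable, user_id: str, message_id: str) -> None:
    """Set the status to delivered; repeating the call leaves it delivered."""
    if (user_id, message_id) in inbox:
        inbox[(user_id, message_id)] = "delivered"


if __name__ == "__main__":
    inbox: InboxTable = {}
    write_inbox_entry(inbox, "user_bob", "m1")
    write_inbox_entry(inbox, "user_bob", "m1")  # a retry: still one entry
    mark_delivered(inbox, "user_bob", "m1")
    print(inbox)
```

Because a retry of the write never resets a delivered entry back to pending, the server can safely repeat either step after a crash or a timeout.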

Growing the System Over Time

What to tell the interviewer

This design works great for up to 100 million users in one location. Let me explain how we would grow it to support users around the world.

How we grow step by step:

Stage 1: Single Region (up to 100 million users)
- All servers in one data center (like US-East)
- Simple and fast - messages never leave the building
- Add more servers as users grow

Stage 2: Multiple Regions (100-500 million users)
- Data centers in US, Europe, and Asia
- Users connect to the nearest data center (faster!)
- Messages between regions travel through a fast backbone

Stage 3: Global Scale (500 million+ users)
- Each region is somewhat independent
- Complex coordination for cross-region messages
- This is where things get really hard!

[Diagram: Multiple regions around the world]

When Alice (US) sends to Bob (Europe):

1.Alice's message goes to US data center
2.Saved in US database
3.Routed through backbone to Europe (adds ~100ms)
4.Saved in Europe database
5.Delivered to Bob through Europe Gateway

Total extra delay: about 100-200 milliseconds (still feels instant!)

Cool features we can add later:

1. End-to-End Encryption

Messages are encrypted on Alice's phone and decrypted on Bob's phone. Even we (the server) cannot read them!

BEFORE sending:
- Alice's phone encrypts: "Hey Bob!" becomes "X3kJ9mP..."
- Only Bob's phone has the key to decrypt it

+ 12 more lines...

2. Voice and Video Calls
- Real-time audio/video is different from messages
- Uses peer-to-peer connections when possible (phone to phone directly)
- Falls back to server relay when direct connection fails

3. Status Updates (Stories)
- Photos/videos that disappear after 24 hours
- Stored differently - expire automatically
- Fan out to all contacts who want to see

Feature	Why it is different	Special handling needed
Voice/Video Calls	Real-time streaming, not store-and-forward	Use WebRTC for peer-to-peer, TURN servers for relay
Status/Stories	Temporary content, broadcast to many	Time-based expiration, lazy fan-out
File Sharing	Large files (100MB+)	Upload to CDN first, send link in message
Message Search	Find old messages by keyword	Separate search index (Elasticsearch)
Multi-device	Same account on phone and laptop	Sync messages to all devices, handle conflicts

What about multiple devices?

If Bob uses WhatsApp on his phone AND his laptop, both need to see messages. With end-to-end encryption, this is tricky! WhatsApp Web originally worked by using the phone as the source: the laptop connected through the phone. Newer versions let linked devices connect on their own. Other apps like Signal sync messages to all devices directly.

Design Trade-offs

Advantages

+Messages are NEVER lost
+Simple to understand
+Survives any crash

Disadvantages

-Slightly slower - extra database write
-More storage needed

When to use

Always use this for real chat apps. Message loss destroys user trust.
