System Design Masterclass

Design Notification Service

Design a system to send billions of notifications every day


Summary

A notification service sends alerts to users through their phone (push), email, text message (SMS), or inside the app itself. The hard parts are: sending one message to millions of people at once (like when a celebrity posts), making sure messages are never lost or sent twice, handling when email or phone services are slow or down, and remembering what each user wants (some turn off certain notifications). Companies like Facebook, Uber, Slack, and Twitter ask this in interviews.

Key Takeaways

Core Problem

The main job is to take one event (like someone liking your photo) and send a message to the right person through their favorite channel (phone notification, email, or text message).

The Hard Part

When a famous person with 10 million followers posts something, we need to send 10 million notifications. We cannot do this all at once - it would crash everything. We must spread it out carefully.

Scaling Axis

We split users into groups. Each group gets its own worker. If we need to send more notifications, we just add more workers. One worker handles one group of users.

Critical Invariant

Two golden rules: Never send the same notification twice to a user (nobody likes duplicates). Never send to someone who turned off notifications (respect their choice).

Performance Requirement

Some notifications are urgent - like a code to log in (2FA) or telling you your Uber is here. These must arrive in seconds. Other notifications like someone liked your post can wait a few minutes.

Key Tradeoff

We care more about making sure the message arrives than about perfect order. If notification A arrives before notification B (even though B happened first), that is okay. But losing a notification is NOT okay.

Design Walkthrough

Problem Statement

The Question: Design a notification service that can send billions of messages every day through phone notifications, email, text messages, and in-app alerts.

What the service needs to do (most important first):

  1. Send phone notifications - The little alerts that pop up on your phone even when the app is closed (push notifications).
  2. Send emails - For things like order confirmations, weekly summaries, or password resets.
  3. Send text messages (SMS) - For super important things like security codes or when your Uber driver is arriving.
  4. Show in-app alerts - The red badge with a number, or the notification bell inside the app.
  5. Remember user settings - Some people turn off certain notifications. We must respect that.
  6. Handle millions of users at once - When a celebrity posts, millions need to know.
  7. Never lose a notification - Important alerts like your flight is delayed must always arrive.

What to say first

Let me first understand what channels we need - push, email, SMS, in-app? Then I will ask about scale - how many notifications per day? Finally, I want to know which notifications are urgent and which can wait a bit.

What the interviewer really wants to see:

- Can you handle sending to millions of people at once without crashing?
- Do you know that different channels (email, push, SMS) have different limits and costs?
- How do you make sure no message is ever lost?
- How do you avoid sending the same notification twice?
- Can you respect user preferences at scale?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many notifications do we send per day? How many users do we have? Do some users have millions of followers (celebrities)?

Why ask this: If we send 1 million notifications per day, the design is simple. If we send 10 billion per day, we need a completely different approach.

What interviewers usually say: 10 billion notifications per day, 500 million users. Yes, some users have millions of followers.

How this changes your design: At this scale, we need message queues, multiple workers, and smart batching. We cannot send notifications one by one.

Question 2: Which channels do we need?

Do we need to support push notifications, email, SMS, and in-app? Are all of them equally important?

Why ask this: Each channel works differently. SMS costs money (a few cents per message). Email providers limit how many you can send. Push notifications are free but need device tokens.

What interviewers usually say: All channels. Push is the most common. SMS only for very important things (it costs money). Email for summaries and receipts.

How this changes your design: We need separate queues and workers for each channel because they have different speeds and limits.

Question 3: How fast must notifications arrive?

Some notifications are urgent (security codes, ride arriving). Others can wait (someone liked your old photo). What are the time requirements?

Why ask this: Urgent notifications need a fast lane - they skip the regular queue. Non-urgent can be batched and sent together.

What interviewers usually say: Security codes and ride updates must arrive in 5 seconds. Social notifications (likes, comments) can take up to 1 minute.

How this changes your design: We need priority queues. Urgent notifications get processed first. We might even use a separate fast path for critical alerts.

Question 4: What if delivery fails?

What happens if the email server is down? Or the user's phone is off? Should we retry? For how long?

Why ask this: Network failures happen all the time. We need to decide: retry forever? Give up after some time? Try a different channel?

What interviewers usually say: Retry for 24 hours with increasing wait times between tries. For critical notifications, try multiple channels (push failed? send SMS).

How this changes your design: We need a retry system with backoff (wait longer each time). We also need to track which notifications were delivered and which are still pending.

Summarize your assumptions

Let me summarize what I will design for: 10 billion notifications per day, 500 million users, all channels (push, email, SMS, in-app), critical notifications in 5 seconds, retry for 24 hours if delivery fails, and support for celebrities with millions of followers.

The Hard Part

Say this to the interviewer

The hardest part is when one event needs to notify millions of people. Think about it: a celebrity with 10 million followers posts a photo. We cannot send 10 million notifications at the same instant - email servers would block us, phone services would throttle us, and our own servers might crash.

Why sending to millions is tricky (explained simply):

  1. External services have limits - Apple lets you send maybe 100,000 push notifications per second. If you try to send 10 million at once, they will block you temporarily.
  2. It takes time - Even at 100,000 per second, 10 million takes 100 seconds. The person who created the post should not wait 100 seconds for a success message.
  3. Some will fail - Out of 10 million, maybe 100,000 phones are off or changed numbers. We need to track and retry these.
  4. Different time zones - Should we wake people up at 3am with a notification? Most users set quiet hours.
  5. People have preferences - Some turned off notifications for this celebrity. We must check 10 million user preferences.
  6. Duplicates are annoying - If our server crashes mid-way and restarts, we might try to send again. Sending the same notification twice is bad.

Common mistake candidates make

Many people design a system that waits for the notification to be delivered before responding to the sender. This is wrong! The sender (like the app that detected someone liked a photo) should get an instant OK and move on. Delivery happens in the background.

The Fan-out Problem

The solution: Spread it out and use queues

Instead of sending 10 million at once:

  1. Put all 10 million into a queue (like a waiting line)
  2. Workers pick from the queue at a steady pace
  3. Each worker sends at a rate the providers allow
  4. Failed ones go back to the queue for retry

This way, a celebrity post might take 2 minutes to reach everyone, but nothing crashes and nothing is lost.
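The queue-and-workers idea above can be sketched in a few lines. This is a minimal in-memory Python illustration (the batch size, send rate, and function names are assumptions for the sketch, not part of the design above); in production the deque would be a Kafka topic and each tick a worker process.

```python
import collections

def fan_out(follower_ids, queue, batch_size=1000):
    """Split one event into batches and enqueue them for workers."""
    for i in range(0, len(follower_ids), batch_size):
        queue.append(follower_ids[i:i + batch_size])

def worker_tick(queue, send_rate):
    """One worker pass: send at most `send_rate` notifications this tick,
    putting any unfinished batch back at the front of the queue."""
    sent = 0
    while queue and sent < send_rate:
        batch = queue.popleft()
        take = min(len(batch), send_rate - sent)
        sent += take                      # pretend these were delivered
        if take < len(batch):             # rate budget exhausted mid-batch
            queue.appendleft(batch[take:])
    return sent

queue = collections.deque()
fan_out(list(range(10_000)), queue)       # 10,000 followers -> 10 batches

total = 0
ticks = 0
while queue:
    total += worker_tick(queue, send_rate=3000)
    ticks += 1
# 10,000 notifications at 3,000 per tick: 4 ticks, nothing lost
```

The key property to point out: the sender only pays for the cheap enqueue step; the expensive delivery work is spread over as many ticks as the rate limit requires.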

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Notifications per day | 10 billion | About 115,000 every second on average, 500,000+ during busy times
Total users | 500 million | We store settings and device info for half a billion people
+ 5 more rows...

What to tell the interviewer

At 10 billion notifications per day, we are sending about 115,000 every second on average. The hardest part is not the average - it is the spikes. When a celebrity posts or breaking news happens, we might see 500,000 per second. Our design must handle these spikes without falling over.

How people use the notification system (from most common to least common):

  1. Send one notification to one person - John liked your photo. We send to just you. This is the most common case.
  2. Send to a group - Your team made a comment. We send to 5-50 people in the group.
  3. Send to followers - Celebrity posted. We send to their 10 million followers. This is the hardest case.
  4. Send to everyone - The app has a new feature. We tell all 500 million users. This is rare and usually scheduled for off-peak hours.

How many notifications per second?
- 10 billion per day / 86,400 seconds = 115,000 per second average
- Peak times (5x average) = 575,000 per second
+ 13 more lines...

Important: Provider limits are the real constraint

We can make our own servers as fast as we want. But Apple and Google control how fast we can send push notifications. If we send too fast, they will temporarily block us. Our design must respect these external limits.

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will design this as a pipeline - like an assembly line in a factory. The notification enters at one end, passes through several stations (each doing one job), and exits as a delivered message. We use queues between stations so each part can work at its own pace.

Notification Service - The Big Picture

What each part does and WHY it is separate:

Part | What it does | Why it is separate (what to tell interviewer)
API Gateway | Receives notification requests from other apps. Says OK immediately and puts the work in a queue. | We do not want the calling app to wait. It sends a request and immediately gets OK. The actual sending happens later in the background.
Priority Queue | Holds urgent notifications (security codes, ride arriving). These get processed first. | Some notifications cannot wait. A 2FA code that arrives 2 minutes late is useless. Urgent notifications skip the regular line.
+ 5 more rows...

Common interview question: Why so many parts?

Interviewers often ask: Can you make this simpler? Why not one service that does everything? Your answer: For a small app, yes, one service is fine. We split because: (1) different parts need different amounts of servers - fan-out needs many, preference checking needs few, (2) if email service breaks, push should still work, (3) each part is simple and easy to understand.

Technology Choices - Why we picked these tools:

Message Queue: Kafka (Recommended)
- Why: Can handle millions of messages per second, never loses messages, keeps messages for days so we can replay if needed
- Other options: RabbitMQ is simpler but not as fast at this scale

Cache: Redis (Recommended)
- Why: Super fast (under 1 millisecond), stores user preferences and duplicate tracking, everyone knows how to use it

Database: PostgreSQL (Recommended)
- Why: Reliable, stores notification history and user settings permanently, handles billions of rows

External Services:
- Push: Apple APNs (for iPhones) + Google FCM (for Android)
- Email: SendGrid, AWS SES, or Mailgun - all work well
- SMS: Twilio is the most popular, Vonage is another option

How real companies do it

Facebook uses a similar pipeline design. Slack separates real-time messages (WebSocket) from push notifications. Discord prefers in-app notifications over push to save money. Uber has a special fast-lane for ride updates.

Data Model and Storage

Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.

What to tell the interviewer

I will use PostgreSQL for permanent storage of notifications and user settings. I will use Redis for fast lookups - like checking if we already sent a notification or getting a user's preferences quickly.

Table 1: User Notification Settings - What each user wants

This stores what kind of notifications each user wants to receive and when.

Column | What it stores | Example
user_id | Unique ID for the user | user_12345
push_enabled | Do they want phone notifications? | true or false
+ 6 more rows...

Table 2: Device Tokens - Where to send push notifications

Each person might have multiple devices - phone, tablet, laptop. We need to track each one.

Column | What it stores | Example
id | Unique ID for this token | token_789
user_id | Who owns this device | user_12345
+ 5 more rows...

Database Index

We add an INDEX on (user_id) WHERE is_active = true. This makes finding a user's active devices FAST - we skip all the old inactive tokens.

Table 3: Notifications - Record of every notification sent

We keep track of every notification - what was sent, when, and whether it was delivered.

Column | What it stores | Example
id | Unique ID for this notification | notif_abc123
user_id | Who should receive it | user_12345
+ 10 more rows...

Redis Data - Fast temporary storage:

We use Redis for things we need to check super fast:

1. USER SETTINGS (so we do not hit database for every notification)
   Key: user:settings:12345
   Value: {push: true, email: true, quiet_start: "22:00"}
+ 16 more lines...

What a notification message looks like:

When an app wants to send a notification, it sends us a message that looks like this:

{
  "id": "notif_abc123",           // Unique ID to prevent duplicates
  "type": "social.like",           // What kind of notification
+ 20 more lines...

What is a collapse key?

If 10 people like your photo in 1 minute, we do not want to send 10 separate notifications. The collapse_key groups them. We send one notification saying "John and 9 others liked your photo" instead of 10 separate ones.
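The grouping step can be sketched simply: collect pending notifications that share a collapse_key and emit one summary message per group. A minimal Python illustration (the dict shape and the `collapse` function name are assumptions for the sketch):

```python
def collapse(pending):
    """Group pending notifications by collapse_key, one summary message each."""
    groups = {}
    for n in pending:
        groups.setdefault(n["collapse_key"], []).append(n)

    messages = []
    for key, items in groups.items():
        first = items[0]["actor"]
        if len(items) == 1:
            messages.append(f"{first} liked your photo")
        else:
            # Many actors, one collapse_key -> one bundled notification
            messages.append(f"{first} and {len(items) - 1} others liked your photo")
    return messages

pending = [
    {"collapse_key": "like:photo_9", "actor": "John"},
    {"collapse_key": "like:photo_9", "actor": "Mary"},
    {"collapse_key": "like:photo_9", "actor": "Ana"},
]
# collapse(pending) -> ["John and 2 others liked your photo"]
```

In the real system the "pending" set would sit in Redis for a short window (e.g. 5 minutes, as described later in this doc) before being flushed.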

Fan-out Deep Dive

Fan-out is when one event creates many notifications. It is the hardest part of the system. Let me explain how we handle it step by step.

The math problem

If a celebrity with 10 million followers posts, and each notification takes just 10 milliseconds to process, doing them one by one takes: 10,000,000 x 0.01 seconds = 100,000 seconds = 27 hours! That is way too slow. We must do many at once.

How Fan-out Works

Step by step: How we fan out to 10 million users:

FUNCTION handle_fan_out(event):
    
    // Step 1: Figure out who should get this notification
+ 33 more lines...

Two ways to handle fan-out:

Approach | How it works | Good for | Bad for
Push on write (send now) | When celebrity posts, immediately create 10M notifications and put them in queues | Small audiences (under 10K), urgent notifications | Celebrities with millions of followers - too slow
Pull on read (lazy fan-out) | Do not create notifications. When a user opens the app, check if anything new happened and show it then | Huge audiences, non-urgent stuff, saves work for inactive users | Users expect instant notifications, not just when they open the app
Hybrid (recommended) | Send immediately to active users (opened app in last hour). For inactive users, wait until they open the app | Best of both - fast for active users, efficient for inactive | More complex to build, two code paths

How Twitter/X does it

Twitter uses hybrid fan-out. For normal users, tweets are pushed to follower timelines immediately. For celebrities with millions of followers, tweets are stored once and pulled when followers open the app. This saves billions of database writes.
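The core of the hybrid approach is one decision per follower: push now, or defer to pull-on-read. A minimal Python sketch of that split (the one-hour "active" window matches the table above; the function name and last-seen dict are illustrative assumptions):

```python
import time

ACTIVE_WINDOW = 3600  # "active" = opened the app within the last hour

def split_followers(followers, last_seen, now=None):
    """Return (push_now, pull_later): recently active users get an immediate
    push; everyone else sees the update when they next open the app."""
    now = now if now is not None else time.time()
    push_now, pull_later = [], []
    for uid in followers:
        if now - last_seen.get(uid, 0) <= ACTIVE_WINDOW:
            push_now.append(uid)
        else:
            pull_later.append(uid)
    return push_now, pull_later

now = 1_000_000
last_seen = {"u1": now - 60, "u2": now - 7200, "u3": now - 10}
push, pull = split_followers(["u1", "u2", "u3"], last_seen, now=now)
# push == ["u1", "u3"] (seen within the hour), pull == ["u2"]
```

For a 10M-follower account, only the active slice generates immediate queue writes; the rest cost nothing until the user shows up.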

Rate limiting during fan-out:

We cannot send too fast or external providers will block us. Here is how we control the speed:

// We use a "token bucket" to control speed
// Imagine a bucket that gets 50,000 tokens per second
// Each notification we send uses 1 token
+ 16 more lines...
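The token-bucket idea from the (truncated) pseudocode above can be shown concretely. This is a minimal Python sketch, not the doc's full version; the refill rate and capacity are illustrative, not the real provider limits:

```python
class TokenBucket:
    """Bucket refills at `rate` tokens/second up to `capacity`;
    each notification sent costs one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # under the limit: send
        return False      # over the limit: wait or re-queue

bucket = TokenBucket(rate=5, capacity=5)
sent = sum(1 for _ in range(10) if bucket.allow(now=0.0))
# At time 0 only the first 5 sends pass; the rest wait for the refill.
```

In the real system there would be one bucket per provider (APNs, FCM, each email/SMS provider), each sized to that provider's documented limit.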

Delivery and Retries

Sending notifications can fail for many reasons. Networks have problems, phones are off, email addresses are wrong. We must handle all these cases and keep trying until we succeed.

What to tell the interviewer

We use at-least-once delivery with deduplication. This means: we keep trying until we succeed (at-least-once), but we check before sending to avoid duplicates. If we are not sure if it was sent, we try again - sending twice is better than not sending at all, and our duplicate checker prevents the user from seeing it twice.

How we send a push notification:

FUNCTION send_push_notification(notification):
    
    // Step 1: Did we already send this?
+ 34 more lines...

What can go wrong and what we do about it:

What went wrong | What we do | Do we try again?
Invalid token - device uninstalled the app | Remove this token from our database | No - the device does not exist anymore
User unregistered - turned off all notifications | Log it, respect their choice | No - they do not want notifications
+ 5 more rows...

How retry backoff works:

When something fails, we do not retry immediately. We wait, and each time we fail, we wait longer. This is called exponential backoff.

FUNCTION calculate_wait_time(attempt_number):
    
    // Base wait: 1 second
+ 29 more lines...
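As a concrete companion to the pseudocode above, here is a minimal Python version of exponential backoff. The base of 1 second matches the pseudocode; the cap and the 20% jitter are common additions I am assuming here (jitter spreads retries out so thousands of failed sends do not all retry at the same instant):

```python
import random

def backoff_seconds(attempt, base=1.0, cap=3600.0):
    """Wait time before retry `attempt` (1-based): base * 2^(attempt-1),
    capped at `cap`, plus up to 20% random jitter."""
    wait = min(cap, base * (2 ** (attempt - 1)))
    return wait + random.uniform(0, 0.2 * wait)

# attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 5 -> ~16s,
# attempt 20 -> capped near 1 hour (so 24h of retries stays bounded)
```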

Dead Letter Queue - Where failed notifications go:

After 24 hours of trying, some notifications still fail. We do not just throw them away. We put them in a special Dead Letter Queue for investigation.

FUNCTION handle_final_failure(notification, error):
    
    // We tried for 24 hours and still could not send
+ 25 more lines...

Why is duplicate prevention important?

Imagine you get a notification saying "John liked your photo", then the same notification appears again 5 minutes later. That is annoying! Our duplicate checker uses the notification ID stored in Redis. Before sending, we check if that ID exists. If it does, we skip sending.
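The check-then-record step can be sketched as one atomic "set if not exists" with a TTL. Below, a plain dict stands in for Redis so the sketch is self-contained; against real Redis this would be roughly `redis.set(key, 1, nx=True, ex=86400)` (the 24-hour TTL matching the retry window is my assumption):

```python
class DedupStore:
    """Dict-backed stand-in for Redis SETNX with expiry."""

    def __init__(self):
        self.seen = {}  # notif_id -> expiry timestamp

    def first_time(self, notif_id, now, ttl=86400):
        """Record the ID; return False if it was already seen within the TTL."""
        expires = self.seen.get(notif_id)
        if expires is not None and expires > now:
            return False            # duplicate: skip sending
        self.seen[notif_id] = now + ttl
        return True                 # first time: safe to send

store = DedupStore()
first = store.first_time("notif_abc123", now=0)    # True  -> send it
again = store.first_time("notif_abc123", now=100)  # False -> skip duplicate
```

Note the ordering: record *before* the external send. If the worker crashes after recording but before sending, the notification is lost unless the retry path clears the key on failure, which is why real systems pair this with delivery-status tracking.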

Preventing Problems

The three golden rules we must never break

  1. Never send the same notification twice to a user.
  2. Never send to someone who turned off notifications.
  3. Never exceed provider rate limits (or we get blocked).

How we prevent duplicates:

Every notification has a unique key that identifies it. This is not just a random ID - it describes what the notification is about.

FUNCTION create_duplicate_key(notification):
    
    // The key describes WHAT happened, not just a random ID
+ 40 more lines...

How we respect user preferences:

Before sending ANY notification, we check what the user wants. This happens for every single notification, even during a 10-million-user fan-out.

FUNCTION should_we_send_this(user_id, notification_type, channel):
    
    // Get user settings (cached in Redis for speed)
+ 31 more lines...

Why we cache user settings

We check preferences for EVERY notification. At 100,000 notifications per second, we cannot hit the database 100,000 times per second. We store user settings in Redis with a 1-hour expiration. When a user changes their settings, we delete their cache entry so the next check gets fresh data.

How we stay under provider rate limits:

Apple and Google limit how fast we can send. If we go too fast, they block us temporarily. We use rate limiters to control our speed.

// Each provider has a rate limiter
// Think of it like a traffic light that controls how fast cars can pass
+ 29 more lines...

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them. A notification service is critical - people depend on it for security alerts and important updates.

Common failures and how we handle them:

What breaks | What happens to users | How we fix it | Why this works
Redis goes down | We cannot check duplicates or user settings | Fall back to database, accept some duplicates temporarily | Sending a duplicate is annoying but not a disaster. Not sending is worse.
Kafka (queue) goes down | New notifications cannot be queued | Write to local disk, replay when Kafka comes back | We never lose notifications - they wait on disk until the queue is back
+ 4 more rows...

Using multiple providers for reliability:

For important notifications, we use multiple email and SMS providers. If one is down, we try another.

FUNCTION send_email(email):
    
    // List of providers in order of preference
+ 28 more lines...
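The failover loop from the (truncated) pseudocode above, as a minimal Python sketch: try each provider in preference order and stop at the first success. The provider names echo the ones listed earlier in this doc; the `send_fn` callable shape is an assumption for the sketch:

```python
def send_with_failover(email, providers):
    """providers: list of (name, send_fn); send_fn returns True on success.
    Returns the name of the provider that delivered, or raises if all fail."""
    errors = []
    for name, send_fn in providers:
        try:
            if send_fn(email):
                return name                     # delivered: stop trying
        except Exception as exc:
            errors.append((name, str(exc)))     # remember why it failed
    raise RuntimeError(f"all providers failed: {errors}")

providers = [
    ("sendgrid", lambda e: False),  # pretend SendGrid is down
    ("ses",      lambda e: True),   # AWS SES succeeds
    ("mailgun",  lambda e: True),   # never reached
]
# send_with_failover({...}, providers) -> "ses"
```

In practice this loop sits behind the circuit breakers described next, so a known-down provider is skipped without even attempting the call.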

Circuit breaker pattern:

When a provider has problems, we stop sending to it temporarily. This is like a circuit breaker in your house - it trips to prevent damage.

// Each provider has a circuit breaker
// States: CLOSED (working), OPEN (broken), HALF_OPEN (testing)
+ 39 more lines...

Why circuit breakers help

Without a circuit breaker, if SendGrid is down, we would keep trying and failing, slowing everything down. The circuit breaker says: after 5 failures, stop trying SendGrid for 30 seconds and use a backup instead. Then test SendGrid again to see if it is back.
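A minimal Python sketch of that behavior, using the same numbers as the paragraph above (5 failures to trip, 30-second cooldown); the class shape is an illustration, not the doc's exact pseudocode:

```python
class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens for
    `cooldown` seconds, then lets one probe request through (HALF_OPEN)."""

    def __init__(self, threshold=5, cooldown=30):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None   # None means CLOSED (requests flow normally)

    def allow(self, now):
        if self.opened_at is None:
            return True                               # CLOSED: send normally
        if now - self.opened_at >= self.cooldown:
            return True                               # HALF_OPEN: one probe
        return False                                  # OPEN: use a backup

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None                     # probe worked: close
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now                  # trip (or re-trip) open

cb = CircuitBreaker()
for _ in range(5):
    cb.record(success=False, now=0)
# Circuit is now OPEN: allow(now=10) is False, allow(now=40) is True (probe)
```

Note that a failed probe re-opens the circuit immediately (failures is already at the threshold), which is exactly the "test, then back off again" behavior described above.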

Growing the System Over Time

What to tell the interviewer

This design handles billions of notifications per day. Let me explain how we would grow it for 10x more traffic, add features like smart sending times, and handle users around the world.

How we grow step by step:

Stage 1: Starting out (up to 100 million notifications per day)
- One Kafka cluster for queues
- Simple fan-out in the application
- Single provider per channel (just SendGrid for email)
- All servers in one location

Stage 2: Scaling up (up to 10 billion notifications per day)
- Kafka partitioned by user ID (many parallel queues)
- Dedicated fan-out service (separate from main app)
- Multiple providers with automatic failover
- Priority queues for urgent notifications
- This is where our design is now

Stage 3: Global scale (100 billion+ notifications per day)
- Servers in multiple locations around the world
- Send from server closest to the user (lower delay)
- Machine learning to predict the best time to send
- Real-time A/B testing for notification content

Multi-region notification service

Cool features we can add later:

Feature | What it does | How to build it
Smart send time | Send when the user is most likely to read it (not at 3am) | Machine learning model predicts best time per user. Delay notifications until that time.
Notification bundling | Instead of 10 separate "like" notifications, send "John and 9 others liked your photo" | Hold notifications in Redis for 5 minutes. Combine similar ones before sending.
+ 4 more rows...

Different apps need different focus:

Chat app (Slack, Discord): Real-time is critical. Use WebSocket for instant in-app notifications. Only send push when the app is in the background. Group messages to reduce notification spam.

Shopping app (Amazon): Never lose order updates. Use database writes before queuing to ensure nothing is lost. Send to multiple channels for shipping updates. Track which notifications lead to purchases.

Social app (Instagram, Twitter): Handle celebrity fan-out. Use hybrid push/pull. Bundle notifications (show X and 5 others instead of 6 separate notifications). Respect digest preferences.

Ride sharing (Uber, Lyft): Sub-second delivery for ride updates. Use in-memory queues for speed. Fall back to SMS for critical updates. Location-based routing to nearest server.

Final tip for the interview

When the interviewer asks how would you improve this, talk about: (1) smart send times using ML, (2) notification bundling to reduce spam, (3) A/B testing to improve engagement, (4) multi-region for lower latency worldwide. These show you think about the product, not just the technology.

Design Trade-offs

Advantages

  • Users get notifications instantly
  • Simple to understand
  • Client app does not need to be smart

Disadvantages

  • One celebrity post means 10 million writes
  • Wastes work for users who do not open the app
  • Can overload the system during viral events

When to use

Use for small audiences (under 10,000 users) or when notifications are time-critical (security codes, ride arriving).