Design Walkthrough
Problem Statement
The Question: Design a notification service that can send billions of messages every day through phone notifications, email, text messages, and in-app alerts.
What the service needs to do (most important first):
1. Send phone notifications - The little alerts that pop up on your phone even when the app is closed (push notifications).
2. Send emails - For things like order confirmations, weekly summaries, or password resets.
3. Send text messages (SMS) - For super important things like security codes or when your Uber driver is arriving.
4. Show in-app alerts - The red badge with a number, or the notification bell inside the app.
5. Remember user settings - Some people turn off certain notifications. We must respect that.
6. Handle millions of users at once - When a celebrity posts, millions need to know.
7. Never lose a notification - Important alerts like "your flight is delayed" must always arrive.
What to say first
Let me first understand what channels we need - push, email, SMS, in-app? Then I will ask about scale - how many notifications per day? Finally, I want to know which notifications are urgent and which can wait a bit.
What the interviewer really wants to see:
- Can you handle sending to millions of people at once without crashing?
- Do you know that different channels (email, push, SMS) have different limits and costs?
- How do you make sure no message is ever lost?
- How do you avoid sending the same notification twice?
- Can you respect user preferences at scale?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many notifications do we send per day? How many users do we have? Do some users have millions of followers (celebrities)?
Why ask this: If we send 1 million notifications per day, the design is simple. If we send 10 billion per day, we need a completely different approach.
What interviewers usually say: 10 billion notifications per day, 500 million users. Yes, some users have millions of followers.
How this changes your design: At this scale, we need message queues, multiple workers, and smart batching. We cannot send notifications one by one.
Question 2: Which channels do we need?
Do we need to support push notifications, email, SMS, and in-app? Are all of them equally important?
Why ask this: Each channel works differently. SMS costs money (a few cents per message). Email providers limit how many you can send. Push notifications are free but need device tokens.
What interviewers usually say: All channels. Push is the most common. SMS only for very important things (it costs money). Email for summaries and receipts.
How this changes your design: We need separate queues and workers for each channel because they have different speeds and limits.
Question 3: How fast must notifications arrive?
Some notifications are urgent (security codes, ride arriving). Others can wait (someone liked your old photo). What are the time requirements?
Why ask this: Urgent notifications need a fast lane - they skip the regular queue. Non-urgent can be batched and sent together.
What interviewers usually say: Security codes and ride updates must arrive in 5 seconds. Social notifications (likes, comments) can take up to 1 minute.
How this changes your design: We need priority queues. Urgent notifications get processed first. We might even use a separate fast path for critical alerts.
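To make the fast-lane idea concrete, here is a minimal in-process sketch in Python. The two queues stand in for what would really be separate queues (for example, separate Kafka topics - an assumption, not a requirement): the worker always drains the urgent lane before touching the regular one.

```python
import queue

# Stand-ins for two real queues (e.g., separate Kafka topics):
# a fast lane for urgent notifications, a regular lane for the rest.
urgent = queue.Queue()
regular = queue.Queue()

def next_notification():
    """Always drain the urgent lane before touching the regular one."""
    try:
        return urgent.get_nowait()   # security codes, ride updates
    except queue.Empty:
        pass
    try:
        return regular.get_nowait()  # likes, comments, digests
    except queue.Empty:
        return None                  # nothing to send right now
```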
Question 4: What if delivery fails?
What happens if the email server is down? Or the user's phone is off? Should we retry? For how long?
Why ask this: Network failures happen all the time. We need to decide: retry forever? Give up after some time? Try a different channel?
What interviewers usually say: Retry for 24 hours with increasing wait times between tries. For critical notifications, try multiple channels (push failed? send SMS).
How this changes your design: We need a retry system with backoff (wait longer each time). We also need to track which notifications were delivered and which are still pending.
Summarize your assumptions
Let me summarize what I will design for: 10 billion notifications per day, 500 million users, all channels (push, email, SMS, in-app), critical notifications in 5 seconds, retry for 24 hours if delivery fails, and support for celebrities with millions of followers.
The Hard Part
Say this to the interviewer
The hardest part is when one event needs to notify millions of people. Think about it: a celebrity with 10 million followers posts a photo. We cannot send 10 million notifications at the same instant - email servers would block us, phone services would throttle us, and our own servers might crash.
Why sending to millions is tricky (explained simply):
1. External services have limits - Apple lets you send maybe 100,000 push notifications per second. If you try to send 10 million at once, they will block you temporarily.
2. It takes time - Even at 100,000 per second, 10 million takes 100 seconds. The person who created the post should not wait 100 seconds for a success message.
3. Some will fail - Out of 10 million, maybe 100,000 phones are off or changed numbers. We need to track and retry these.
4. Different time zones - Should we wake people up at 3am with a notification? Most users set quiet hours.
5. People have preferences - Some turned off notifications for this celebrity. We must check 10 million user preferences.
6. Duplicates are annoying - If our server crashes mid-way and restarts, we might try to send again. Sending the same notification twice is bad.
Common mistake candidates make
Many people design a system that waits for the notification to be delivered before responding to the sender. This is wrong! The sender (like the app that detected someone liked a photo) should get an instant OK and move on. Delivery happens in the background.
The Fan-out Problem
The solution: Spread it out and use queues
Instead of sending 10 million at once:
1. Put all 10 million into a queue (like a waiting line).
2. Workers pick from the queue at a steady pace.
3. Each worker sends at a rate the providers allow.
4. Failed ones go back to the queue for retry.
This way, a celebrity post might take 2 minutes to reach everyone, but nothing crashes and nothing is lost.
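As a rough sketch of that waiting-line idea (Python, with an in-process queue standing in for Kafka and a hypothetical send_push helper): the producer enqueues batches instantly, and a few workers drain them at a steady, provider-friendly pace.

```python
import queue
import threading
import time

jobs = queue.Queue()  # in-process stand-in for the fan-out queue

def send_push(user_id):
    ...  # hypothetical: hand one notification to APNs/FCM

def enqueue_fan_out(follower_ids, batch_size=1000):
    """Producer: split millions of followers into batches and queue them instantly."""
    for i in range(0, len(follower_ids), batch_size):
        jobs.put(follower_ids[i:i + batch_size])

def worker(sends_per_second=100):
    """Consumer: drain the queue at a steady pace the providers allow."""
    while True:
        batch = jobs.get()
        for user_id in batch:
            send_push(user_id)
            time.sleep(1 / sends_per_second)  # pace the sends, do not burst
        jobs.task_done()

# A few workers running in parallel; add more workers to drain faster.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
```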
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Notifications per day | 10 billion | About 115,000 every second on average, 500,000+ during busy times |
| Total users | 500 million | We store settings and device info for half a billion people |
What to tell the interviewer
At 10 billion notifications per day, we are sending about 115,000 every second on average. The hardest part is not the average - it is the spikes. When a celebrity posts or breaking news happens, we might see 500,000 per second. Our design must handle these spikes without falling over.
How people use the notification system (from most common to least common):
1. Send one notification to one person - John liked your photo. We send to just you. This is the most common case.
2. Send to a group - Your team made a comment. We send to 5-50 people in the group.
3. Send to followers - Celebrity posted. We send to their 10 million followers. This is the hardest case.
4. Send to everyone - The app has a new feature. We tell all 500 million users. This is rare and usually scheduled for off-peak hours.
How many notifications per second?
- 10 billion per day / 86,400 seconds = 115,000 per second average
- Peak times (5x average) = 575,000 per second
Important: Provider limits are the real constraint
We can make our own servers as fast as we want. But Apple and Google control how fast we can send push notifications. If we send too fast, they will temporarily block us. Our design must respect these external limits.
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will design this as a pipeline - like an assembly line in a factory. The notification enters at one end, passes through several stations (each doing one job), and exits as a delivered message. We use queues between stations so each part can work at its own pace.
Notification Service - The Big Picture
What each part does and WHY it is separate:
| Part | What it does | Why it is separate (what to tell interviewer) |
|---|---|---|
| API Gateway | Receives notification requests from other apps. Says OK immediately and puts the work in a queue. | We do not want the calling app to wait. It sends a request and immediately gets OK. The actual sending happens later in the background. |
| Priority Queue | Holds urgent notifications (security codes, ride arriving). These get processed first. | Some notifications cannot wait. A 2FA code that arrives 2 minutes late is useless. Urgent notifications skip the regular line. |
| Fan-out Service | Turns one event (celebrity posted) into millions of individual notifications, in batches. | Fan-out needs far more servers than the rest of the system. Keeping it separate lets us scale just this part. |
| Channel Workers | Separate workers for push, email, SMS, and in-app, each pulling from its own queue and talking to its provider. | Each channel has different speeds, costs, and rate limits. If the email provider breaks, push keeps working. |
Common interview question: Why so many parts?
Interviewers often ask: Can you make this simpler? Why not one service that does everything? Your answer: For a small app, yes, one service is fine. We split because: (1) different parts need different amounts of servers - fan-out needs many, preference checking needs few, (2) if email service breaks, push should still work, (3) each part is simple and easy to understand.
Technology Choices - Why we picked these tools:
Message Queue: Kafka (Recommended)
- Why: Can handle millions of messages per second, never loses messages, keeps messages for days so we can replay if needed
- Other options: RabbitMQ is simpler but not as fast at this scale

Cache: Redis (Recommended)
- Why: Super fast (under 1 millisecond), stores user preferences and duplicate tracking, everyone knows how to use it

Database: PostgreSQL (Recommended)
- Why: Reliable, stores notification history and user settings permanently, handles billions of rows

External Services:
- Push: Apple APNs (for iPhones) + Google FCM (for Android)
- Email: SendGrid, AWS SES, or Mailgun - all work well
- SMS: Twilio is the most popular, Vonage is another option
How real companies do it
Facebook uses a similar pipeline design. Slack separates real-time messages (WebSocket) from push notifications. Discord prefers in-app notifications over push to save money. Uber has a special fast-lane for ride updates.
Data Model and Storage
Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.
What to tell the interviewer
I will use PostgreSQL for permanent storage of notifications and user settings. I will use Redis for fast lookups - like checking if we already sent a notification or getting a user's preferences quickly.
Table 1: User Notification Settings - What each user wants
This stores what kind of notifications each user wants to receive and when.
| Column | What it stores | Example |
|---|---|---|
| user_id | Unique ID for the user | user_12345 |
| push_enabled | Do they want phone notifications? | true or false |
| email_enabled | Do they want emails? | true or false |
| sms_enabled | Do they want text messages? | true or false |
| quiet_hours_start | When their quiet hours begin | 22:00 |
Table 2: Device Tokens - Where to send push notifications
Each person might have multiple devices - phone, tablet, laptop. We need to track each one.
| Column | What it stores | Example |
|---|---|---|
| id | Unique ID for this token | token_789 |
| user_id | Who owns this device | user_12345 |
| token | The push token from Apple (APNs) or Google (FCM) | fcm_a1b2c3... |
| platform | Which kind of device | ios or android |
| is_active | Is this token still valid? | true or false |
Database Index
We add an INDEX on (user_id) WHERE is_active = true. This makes finding a user's active devices FAST - we skip all the old inactive tokens.
Table 3: Notifications - Record of every notification sent
We keep track of every notification - what was sent, when, and whether it was delivered.
| Column | What it stores | Example |
|---|---|---|
| id | Unique ID for this notification | notif_abc123 |
| user_id | Who should receive it | user_12345 |
| channel | How we sent it | push, email, sms, or in_app |
| status | Where it is in the pipeline | pending, sent, delivered, or failed |
| created_at | When it was created | 2024-05-01 10:30:00 |
Redis Data - Fast temporary storage:
We use Redis for things we need to check super fast:
1. USER SETTINGS (so we do not hit database for every notification)
Key: user:settings:12345
Value: {push: true, email: true, quiet_start: "22:00"}
2. DUPLICATE TRACKING (so we never send the same notification twice)
Key: notif:sent:abc123
Value: 1 (expires after 24 hours)
What a notification message looks like:
When an app wants to send a notification, it sends us a message that looks like this:
{
  "id": "notif_abc123",            // Unique ID to prevent duplicates
  "type": "social.like",           // What kind of notification
  "user_id": "user_12345",         // Who should receive it
  "collapse_key": "like:photo_99"  // Groups similar notifications (explained below)
}
What is a collapse key?
If 10 people like your photo in 1 minute, we do not want to send 10 separate notifications. The collapse_key groups them. We send one notification saying John and 9 others liked your photo instead of 10 separate ones.
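A minimal sketch of that grouping step (Python; the actor and text field names are assumptions for illustration). In the real pipeline we would hold notifications briefly, as the bundling feature later in this design suggests, then merge each group before sending.

```python
from collections import defaultdict

def bundle(notifications):
    """Merge notifications that share a collapse_key into one message."""
    groups = defaultdict(list)
    for n in notifications:
        groups[n["collapse_key"]].append(n)

    bundled = []
    for group in groups.values():
        first, others = group[0], len(group) - 1
        if others == 0:
            text = first["text"]  # only one event - send it as-is
        else:
            text = f"{first['actor']} and {others} others liked your photo"
        bundled.append({**first, "text": text})
    return bundled
```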
Fan-out Deep Dive
Fan-out is when one event creates many notifications. It is the hardest part of the system. Let me explain how we handle it step by step.
The math problem
If a celebrity with 10 million followers posts, and each notification takes just 10 milliseconds to process, doing them one by one takes: 10,000,000 x 0.01 seconds = 100,000 seconds = almost 28 hours! That is way too slow. We must do many at once - with 1,000 workers sending in parallel, the same job takes about 100 seconds.
How Fan-out Works
Step by step: How we fan out to 10 million users:
FUNCTION handle_fan_out(event):
    // Step 1: Figure out who should get this notification
    recipients = get_follower_ids(event.author_id)
    // Step 2: Split recipients into batches and queue them for the workers
Two ways to handle fan-out:
| Approach | How it works | Good for | Bad for |
|---|---|---|---|
| Push on write (send now) | When celebrity posts, immediately create 10M notifications and put them in queues | Small audiences (under 10K), urgent notifications | Celebrities with millions of followers - too slow |
| Pull on read (lazy fan-out) | Do not create notifications. When a user opens the app, check if anything new happened and show it then | Huge audiences, non-urgent stuff, saves work for inactive users | Users expect instant notifications, not just when they open the app |
| Hybrid (recommended) | Send immediately to active users (opened app in last hour). For inactive users, wait until they open the app | Best of both - fast for active users, efficient for inactive | More complex to build, two code paths |
How Twitter/X does it
Twitter uses hybrid fan-out. For normal users, tweets are pushed to follower timelines immediately. For celebrities with millions of followers, tweets are stored once and pulled when followers open the app. This saves billions of database writes.
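Here is one way the hybrid rule could look in Python. The 10,000-follower threshold and one-hour activity window are assumptions taken from the table above, and the two callbacks are hypothetical hooks into the rest of the pipeline.

```python
import time

FAN_OUT_LIMIT = 10_000  # push-on-write below this many followers
ACTIVE_WINDOW = 3_600   # "active" = opened the app within the last hour

def fan_out(event, followers, enqueue_push, store_for_pull):
    """followers: list of (user_id, last_seen_unix) pairs."""
    if len(followers) < FAN_OUT_LIMIT:
        # Small audience: push-on-write to everyone right away.
        enqueue_push([uid for uid, _ in followers], event)
        return
    # Big audience: push now only to recently active users...
    now = time.time()
    active = [uid for uid, seen in followers if now - seen < ACTIVE_WINDOW]
    enqueue_push(active, event)
    # ...everyone else pulls the event the next time they open the app.
    store_for_pull(event)
```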
Rate limiting during fan-out:
We cannot send too fast or external providers will block us. Here is how we control the speed:
// We use a "token bucket" to control speed
// Imagine a bucket that gets 50,000 tokens per second
// Each notification we send uses 1 token
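A small Python version of that bucket, as a sketch (the 50,000-per-second refill rate matches the comment above; in practice we would run one bucket per provider):

```python
import time

class TokenBucket:
    """Allow at most `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate=50_000, capacity=50_000):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def try_send(self):
        now = time.monotonic()
        # Refill: the bucket gains `rate` tokens per second, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # each notification costs one token
            return True
        return False           # over the limit - wait and try again
```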
Delivery and Retries
Sending notifications can fail for many reasons. Networks have problems, phones are off, email addresses are wrong. We must handle all these cases and keep trying until we succeed.
What to tell the interviewer
We use at-least-once delivery with deduplication. This means: we keep trying until we succeed (at-least-once), but we check before sending to avoid duplicates. If we are not sure if it was sent, we try again - sending twice is better than not sending at all, and our duplicate checker prevents the user from seeing it twice.
How we send a push notification:
FUNCTION send_push_notification(notification):
    // Step 1: Did we already send this? (check the duplicate key in Redis)
    IF already_sent(notification.id): RETURN
    // Step 2: Send via APNs/FCM; if it fails, schedule a retry with backoff
What can go wrong and what we do about it:
| What went wrong | What we do | Do we try again? |
|---|---|---|
| Invalid token - device uninstalled the app | Remove this token from our database | No - the device does not exist anymore |
| User unregistered - turned off all notifications | Log it, respect their choice | No - they do not want notifications |
| Provider rate limit hit - we sent too fast | Slow down and wait | Yes - once the rate limiter lets us |
| Temporary network error | Put it back in the queue | Yes - with exponential backoff |
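A sketch of how a delivery worker might map provider replies onto the actions in this table (Python; the error strings and the three callbacks are illustrative assumptions, not any provider's real API):

```python
def handle_provider_reply(token, error, deactivate, log_opt_out, retry):
    """error is None on success; otherwise decide: drop, log, or retry."""
    if error is None:
        return                # delivered - nothing else to do
    if error == "invalid_token":
        deactivate(token)     # app uninstalled: stop using this token
    elif error == "unregistered":
        log_opt_out(token)    # user opted out: respect the choice
    else:
        retry(token, error)   # transient failure: back off and retry
```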
How retry backoff works:
When something fails, we do not retry immediately. We wait, and each time we fail, we wait longer. This is called exponential backoff.
FUNCTION calculate_wait_time(attempt_number):
    // Base wait: 1 second, doubling after each failed attempt (1s, 2s, 4s, 8s, ...)
    RETURN 1 second * (2 ^ attempt_number)
Dead Letter Queue - Where failed notifications go:
After 24 hours of trying, some notifications still fail. We do not just throw them away. We put them in a special Dead Letter Queue for investigation.
FUNCTION handle_final_failure(notification, error):
    // We tried for 24 hours and still could not send
    // Park it in the Dead Letter Queue so engineers can investigate
    dead_letter_queue.publish(notification, error)
Why is duplicate prevention important?
Imagine you get a notification saying "John liked your photo", then the same notification appears again 5 minutes later. That is annoying! Our duplicate checker uses the notification ID stored in Redis. Before sending, we check if that ID exists. If it does, we skip sending.
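A minimal version of that check, assuming the redis-py client. SET with NX is atomic, so even two workers racing on the same notification cannot both win; the 24-hour expiry matches the retry window.

```python
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def try_mark_sent(notification_id, ttl_seconds=86_400):
    """Atomically claim this notification ID. Returns False for duplicates."""
    # SET ... NX EX: succeeds only if the key does not already exist,
    # and expires after 24 hours to match the retry window.
    return r.set(f"notif:sent:{notification_id}", 1,
                 nx=True, ex=ttl_seconds) is not None

# Usage: only deliver when we win the race to set the key.
# if try_mark_sent(notification["id"]):
#     deliver(notification)
```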
Preventing Problems
The three golden rules we must never break
1. Never send the same notification twice to a user.
2. Never send to someone who turned off notifications.
3. Never exceed provider rate limits (or we get blocked).
How we prevent duplicates:
Every notification has a unique key that identifies it. This is not just a random ID - it describes what the notification is about.
FUNCTION create_duplicate_key(notification):
    // The key describes WHAT happened, not just a random ID
    // Example: "like:photo_99:user_12345" - this like, this photo, this user
    RETURN notification.type + ":" + notification.object_id + ":" + notification.user_id
How we respect user preferences:
Before sending ANY notification, we check what the user wants. This happens for every single notification, even during a 10-million-user fan-out.
FUNCTION should_we_send_this(user_id, notification_type, channel):
    // Get user settings (cached in Redis for speed)
    settings = get_cached_settings(user_id)
    RETURN settings.allows(notification_type, channel) AND NOT in_quiet_hours(settings)
Why we cache user settings
We check preferences for EVERY notification. At 100,000 notifications per second, we cannot hit the database 100,000 times per second. We store user settings in Redis with a 1-hour expiration. When a user changes their settings, we delete their cache entry so the next check gets fresh data.
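That is the classic cache-aside pattern. A sketch with redis-py (the two database helpers are hypothetical stubs):

```python
import json
import redis

r = redis.Redis()
TTL = 3_600  # settings fall out of the cache after one hour

def load_settings_from_db(user_id):
    ...  # hypothetical: read the user's settings row

def save_settings_to_db(user_id, settings):
    ...  # hypothetical: write the user's settings row

def get_settings(user_id):
    """Cache-aside read: try Redis first, fall back to the database."""
    cached = r.get(f"user:settings:{user_id}")
    if cached is not None:
        return json.loads(cached)
    settings = load_settings_from_db(user_id)
    r.set(f"user:settings:{user_id}", json.dumps(settings), ex=TTL)
    return settings

def update_settings(user_id, settings):
    """On change: write the database, then delete the cache entry
    so the next check gets fresh data."""
    save_settings_to_db(user_id, settings)
    r.delete(f"user:settings:{user_id}")
```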
How we stay under provider rate limits:
Apple and Google limit how fast we can send. If we go too fast, they block us temporarily. We use rate limiters to control our speed.
// Each provider has a rate limiter
// Think of it like a traffic light that controls how fast cars can pass
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them. A notification service is critical - people depend on it for security alerts and important updates.
Common failures and how we handle them:
| What breaks | What happens to users | How we fix it | Why this works |
|---|---|---|---|
| Redis goes down | We cannot check duplicates or user settings | Fall back to database, accept some duplicates temporarily | Sending a duplicate is annoying but not a disaster. Not sending is worse. |
| Kafka (queue) goes down | New notifications cannot be queued | Write to local disk, replay when Kafka comes back | We never lose notifications - they wait on disk until the queue is back |
| An email/SMS provider goes down | That channel stalls | Fail over to a backup provider (see below) | Important messages still get out through another provider |
| A worker crashes mid-batch | Some notifications in its batch were already sent | Another worker re-reads the batch; the duplicate checker skips what was already sent | At-least-once delivery plus deduplication means nothing is lost and nothing is doubled |
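The "write to local disk, replay later" fallback for a queue outage could look like this sketch (Python; the spool path, kafka_publish helper, and QueueUnavailable error are illustrative assumptions):

```python
import json
import os

SPOOL = "/var/spool/notifications.jsonl"  # hypothetical local fallback file

class QueueUnavailable(Exception):
    pass

def kafka_publish(notification):
    ...  # hypothetical: hand the message to the Kafka producer

def enqueue(notification):
    """Try the queue first; if it is down, spool to local disk instead."""
    try:
        kafka_publish(notification)
    except QueueUnavailable:
        with open(SPOOL, "a") as f:
            f.write(json.dumps(notification) + "\n")

def replay_spool():
    """Once the queue is healthy again, drain the spool back into it."""
    if not os.path.exists(SPOOL):
        return
    with open(SPOOL) as f:
        for line in f:
            kafka_publish(json.loads(line))
    os.remove(SPOOL)
```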
Using multiple providers for reliability:
For important notifications, we use multiple email and SMS providers. If one is down, we try another.
FUNCTION send_email(email):
    // List of providers in order of preference
    FOR provider IN [sendgrid, aws_ses, mailgun]:
        IF provider.send(email) SUCCEEDED: RETURN SUCCESS
Circuit breaker pattern:
When a provider has problems, we stop sending to it temporarily. This is like a circuit breaker in your house - it trips to prevent damage.
// Each provider has a circuit breaker
// States: CLOSED (working), OPEN (broken), HALF_OPEN (testing)
Why circuit breakers help
Without a circuit breaker, if SendGrid is down, we would keep trying and failing, slowing everything down. The circuit breaker says after 5 failures, stop trying SendGrid for 30 seconds. Use a backup instead. Then test SendGrid again to see if it is back.
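A compact sketch of that breaker, using the numbers from the paragraph above (5 failures, 30 seconds); one instance would wrap each provider:

```python
import time

class CircuitBreaker:
    """After `threshold` straight failures, stop calling the provider
    for `cooldown` seconds, then allow one test request through."""

    def __init__(self, threshold=5, cooldown=30):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow_request(self):
        if self.opened_at is None:
            return True   # CLOSED: provider looks healthy
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True   # HALF_OPEN: let one test call through
        return False      # OPEN: skip this provider, use a backup

    def record_success(self):
        self.failures, self.opened_at = 0, None  # back to CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()    # trip to OPEN
```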
Growing the System Over Time
What to tell the interviewer
This design handles billions of notifications per day. Let me explain how we would grow it for 10x more traffic, add features like smart sending times, and handle users around the world.
How we grow step by step:
Stage 1: Starting out (up to 100 million notifications per day)
- One Kafka cluster for queues
- Simple fan-out in the application
- Single provider per channel (just SendGrid for email)
- All servers in one location

Stage 2: Scaling up (up to 10 billion notifications per day)
- Kafka partitioned by user ID (many parallel queues)
- Dedicated fan-out service (separate from the main app)
- Multiple providers with automatic failover
- Priority queues for urgent notifications
- This is where our design is now

Stage 3: Global scale (100 billion+ notifications per day)
- Servers in multiple locations around the world
- Send from the server closest to the user (lower delay)
- Machine learning to predict the best time to send
- Real-time A/B testing for notification content
Multi-region notification service
Cool features we can add later:
| Feature | What it does | How to build it |
|---|---|---|
| Smart send time | Send when the user is most likely to read it (not at 3am) | Machine learning model predicts best time per user. Delay notifications until that time. |
| Notification bundling | Instead of 10 separate like notifications, send John and 9 others liked your photo | Hold notifications in Redis for 5 minutes. Combine similar ones before sending. |
Different apps need different focus:
Chat app (Slack, Discord): Real-time is critical. Use WebSocket for instant in-app notifications. Only send push when the app is in the background. Group messages to reduce notification spam.
Shopping app (Amazon): Never lose order updates. Use database writes before queuing to ensure nothing is lost. Send to multiple channels for shipping updates. Track which notifications lead to purchases.
Social app (Instagram, Twitter): Handle celebrity fan-out. Use hybrid push/pull. Bundle notifications (show X and 5 others instead of 6 separate notifications). Respect digest preferences.
Ride sharing (Uber, Lyft): Sub-second delivery for ride updates. Use in-memory queues for speed. Fall back to SMS for critical updates. Location-based routing to nearest server.
Final tip for the interview
When the interviewer asks how would you improve this, talk about: (1) smart send times using ML, (2) notification bundling to reduce spam, (3) A/B testing to improve engagement, (4) multi-region for lower latency worldwide. These show you think about the product, not just the technology.