Design Walkthrough
Problem Statement
The Question: Design a recommendation system like Netflix that suggests movies and shows to users based on their taste.
What the system needs to do (most important first):
1. Show personalized homepage - When a user opens the app, show rows of content picked just for them ("Because you watched Stranger Things", "Top picks for you").
2. Rank search results - When someone searches, put the most relevant results at the top based on what we know about them.
3. Similar items - Show "More like this" suggestions when viewing a movie or show.
4. Handle new users - Show good suggestions even for someone who just signed up and has no history yet.
5. Learn from behavior - Improve suggestions based on what people watch, rate, and skip.
6. Support A/B testing - Test different recommendation algorithms to see which one works better.
What to say first
Let me first understand the requirements. What kind of content are we recommending - movies, products, or something else? How many users and items do we have? What signals do we get from users - explicit ratings, watch history, or both? Once I know this, I will ask about latency requirements - how fast must the homepage load?
What the interviewer really wants to see:
- Do you understand that recommendations must be served in milliseconds, not seconds?
- Can you handle the cold start problem (new users, new items)?
- Do you know the difference between offline training and online serving?
- How do you balance showing popular items vs. discovering hidden gems?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many users do we have? How many movies or shows in the catalog? How many recommendations do we need to serve per second?
Why ask this: The design changes a lot based on scale. A million users is very different from a billion users.
What interviewers usually say: 200 million users, 50,000 movies/shows in catalog, peak of 100,000 homepage loads per second.
How this changes your design: At this scale, we cannot compute recommendations on-the-fly. We must pre-compute and cache most of the work.
Question 2: What signals do we have from users?
Do users give explicit ratings (1-5 stars)? Or do we only have implicit signals like what they watched, how long they watched, and what they skipped?
Why ask this: Explicit ratings are clear but rare (few people rate). Implicit signals are abundant but noisy (someone might start a show and hate it).
What interviewers usually say: Mostly implicit signals - watch history, watch duration, browsing behavior. Some explicit thumbs up/down.
How this changes your design: We need to infer preference from behavior. Watching 90% of a movie probably means they liked it. Stopping after 5 minutes probably means they did not.
Question 3: How fresh do recommendations need to be?
If someone just finished watching a horror movie, should their homepage immediately show more horror? Or is a few hours delay okay?
Why ask this: Real-time updates need different architecture than batch updates.
What interviewers usually say: Homepage can be a few hours stale. But the "Continue Watching" row must be real-time.
How this changes your design: We use a mix - pre-compute most rows, but fetch some data in real-time (like what they are currently watching).
Question 4: What about new users with no history?
How should we handle someone who just created an account? We know nothing about their taste.
Why ask this: This is the famous "cold start" problem. You need to show something good even with zero information.
What interviewers usually say: Use any available info - sign-up country, device type, time of day. Also ask users to pick some favorite genres when they sign up.
How this changes your design: We need fallback strategies - start with popular content, use demographic signals, and learn quickly from first few interactions.
Summarize your assumptions
Let me summarize: 200 million users, 50,000 items, 100K requests per second at peak. Mostly implicit signals with some explicit ratings. Homepage can be a few hours stale but Continue Watching must be real-time. We need good cold start handling for new users.
The Hard Part
Say this to the interviewer
The hardest parts of recommendation systems are: (1) serving recommendations in under 200ms when we have millions of items to choose from, (2) handling new users with no history (cold start), and (3) keeping recommendations fresh without rebuilding everything constantly.
Why serving is hard (explained simply):
Imagine you have 50,000 movies and need to pick the best 50 for a user. If scoring one movie takes 1 millisecond, scoring all 50,000 would take 50 seconds. But the user expects the page to load in 200 milliseconds!
The solution: Do NOT score all items at request time. Instead:
1. Pre-compute most of the work offline (in batches)
2. Store the results in a fast cache
3. At request time, just look up the cached results
4. Maybe do a small amount of real-time adjustment
Common mistake candidates make
Many people say: When a user loads the homepage, run the ML model to score all items and pick the top 50. This is wrong because: (1) ML models are slow - you cannot run them for every request, (2) scoring the whole catalog for every request takes far too long, (3) this would require massive compute power at 100K requests per second.
The cold start problem (two types):
New user cold start: A user just signed up. They have watched nothing. What do we recommend?
- Show popular items (everyone likes these)
- Use whatever info we have (country, device, time of day)
- Ask them to pick favorite genres during sign-up
- Learn quickly from their first few clicks

New item cold start: We just added a new movie. Nobody has watched it yet. How do we recommend it?
- Use the movie's metadata (genre, actors, director)
- Show it to users who like similar movies
- Give it a "boost" to get initial views and learn if people like it
How recommendations flow from training to serving
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total users | 200 million | Too many to compute recommendations in real-time for each |
| Items in catalog | 50,000 | Small enough that the batch job can consider the whole catalog when generating candidates per user |
| Peak homepage loads | 100,000 per second | The request path must be a cache lookup, not model inference |
What to tell the interviewer
At 200 million users and 100K requests per second, we MUST pre-compute recommendations. The math is simple: we cannot run ML models for each request. Instead, we compute recommendations for every user in a batch job every few hours, store results in Redis or a similar cache, and serve directly from cache. The request path becomes just a cache lookup.
Common interview mistake: Real-time ML scoring
Many candidates propose running ML models at request time. Let me show why this does not work: If a model takes 10ms to score one item, and we have 50,000 items, that is 500 seconds per request. Even if we parallelize across 100 machines, that is still 5 seconds - way too slow. The only option is pre-computation.
How people use the recommendation system (from most common to least common):
1. Load homepage - User opens the app and sees multiple rows of recommendations. This is 90% of traffic. Must be fast.
2. View item details - Click on a movie to see "More like this". We need similar items ready.
3. Search - User types a query. We show results ranked by relevance AND personalization.
4. Browse category - User clicks "Action Movies". We show personalized ordering within that category.
How much cache storage do we need?
- Each user gets ~1000 pre-computed recommendations
- Each recommendation is: item_id (8 bytes) + score (4 bytes) = 12 bytes
- Total: 200 million users × 1000 recommendations × 12 bytes ≈ 2.4 TB of raw data, before JSON encoding, replication, and key overhead
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will split this into two separate pipelines: (1) Offline pipeline - trains models and pre-computes recommendations in batch, (2) Online pipeline - serves recommendations to users in real-time. This separation is critical because training is slow and serving must be fast.
Recommendation System - The Big Picture
What each service does and WHY it is separate:
| Service | What it does | Why it is separate (what to tell interviewer) |
|---|---|---|
| Recommendation Service | Handles user requests. Looks up cached recommendations. Returns top items. | This is the fast path. It only does cache lookups and simple filtering. No heavy computation. Must respond in under 200ms. |
| Filter Service | Removes items user already watched. Removes items not available in their country. | Why separate? Filtering rules change often (new country restrictions, new business rules). Keeping it separate means we can update filters without touching the main service. |
| Batch Job (Spark) | Computes recommendations for ALL users every few hours. Writes results to cache. | Why separate? This is the slow path. It runs for hours and uses massive compute. It should never block user requests. Different scaling needs - we scale up during batch runs, scale down otherwise. |
| Feature Store | Stores pre-computed user features (how many action movies watched, average rating given, etc.) | Why separate? Features are used by multiple models. Computing them once and sharing saves work. Also helps with feature consistency - training and serving use same features. |
| Event Processor | Converts raw user events (clicks, watches) into structured data for training. | Why separate? Events come in messy formats from different clients. This service normalizes them. Also handles deduplication and late-arriving events. |
Common interview question: Why two pipelines?
Interviewers often ask: Why separate offline and online? Can't you train in real-time? Your answer: Training ML models takes hours and needs lots of data. We cannot wait for training during a user request. So we train offline (slow, thorough) and serve online (fast, simple). Some companies do add real-time signals ON TOP of pre-computed recommendations, but the base is always pre-computed.
Technology Choices - Why we picked these tools:
Cache: Redis (Recommended)
- Why we chose it: Super fast lookups (under 1ms), can store 5+ TB with clustering, supports TTL (auto-delete old data)
- Other options we considered: Memcached (also works, but Redis has more features), DynamoDB (works, but more expensive for simple key-value lookups)

Batch Processing: Apache Spark (Recommended)
- Why we chose it: Handles petabytes of data, great for ML training, widely used so lots of help available
- Other options we considered: Flink (better for real-time, but we need batch), Presto (good for queries, not as good for ML)

Event Streaming: Kafka (Recommended)
- Why we chose it: Handles billions of events per day, keeps events for replay, very reliable
- Other options we considered: Kinesis (also good, especially if you use AWS), Pulsar (newer, fewer people know it)

Database: PostgreSQL for item metadata
- Why we chose it: 50,000 items is small - any database works. PostgreSQL is reliable and familiar.
Important interview tip
Pick technologies YOU know! If you have used DynamoDB at your job, use DynamoDB. If you know Cassandra well, explain why it could work. Interviewers care more about your reasoning than the specific tool.
Recommendation Algorithms Deep Dive
Let me explain the different ways to recommend items. We usually combine multiple approaches.
What to tell the interviewer
There are three main approaches to recommendations: (1) Collaborative filtering - find users similar to you and recommend what they liked, (2) Content-based - recommend items similar to what you already liked, (3) Hybrid - combine both. Modern systems use all three plus deep learning to combine signals.
Approach 1: Collaborative Filtering (What similar users liked)
The idea is simple: If you and another person both liked movies A, B, and C, and they also liked movie D, you will probably like D too.
How it works:
1. Find users who watched similar things as you
2. Look at what those users watched that you have not
3. Recommend those items
Pros: Does not need to understand the content. Works even for items that are hard to describe. Cons: Cannot recommend new items that nobody has watched yet (cold start).
Example:
User Alice watched: Inception, Interstellar, The Matrix. User Bob watched: Inception, Interstellar, The Matrix, and Arrival. Because Alice and Bob overlap heavily, we recommend Arrival to Alice.
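To make this concrete, here is a minimal sketch of user-based collaborative filtering in Python. The watch matrix, movie ordering, and cosine-similarity choice are illustrative, not the production algorithm.

```python
# Minimal user-based collaborative filtering sketch (illustrative, not production).
# Rows are users, columns are items, cells are 1 if the user watched the item.
# We find similar users, then recommend what they watched that the target has not.
import numpy as np

def recommend_user_based(watch_matrix: np.ndarray, user_idx: int, top_k: int = 5):
    target = watch_matrix[user_idx]
    # Cosine similarity between the target user and every other user
    norms = np.linalg.norm(watch_matrix, axis=1) * np.linalg.norm(target) + 1e-9
    sims = watch_matrix @ target / norms
    sims[user_idx] = 0.0  # do not compare the user with themselves

    # Score items by similarity-weighted votes, then mask already-watched items
    scores = sims @ watch_matrix
    scores[target > 0] = -np.inf
    return np.argsort(scores)[::-1][:top_k]

# Alice (row 0) and Bob (row 1) overlap on three items; Bob also watched item 3,
# so item 3 is recommended to Alice.
watches = np.array([
    [1, 1, 1, 0],   # Alice
    [1, 1, 1, 1],   # Bob
    [0, 0, 1, 0],   # Carol
])
print(recommend_user_based(watches, user_idx=0, top_k=1))  # -> [3]
```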
Approach 2: Content-Based Filtering (Similar to what you liked)
The idea: If you liked action movies with Tom Cruise, recommend other action movies with Tom Cruise.
How it works:
1. For each item, create a profile of its features (genre, actors, director, keywords)
2. For each user, create a profile of what features they like (based on what they watched)
3. Recommend items whose profile matches the user's profile
Pros: Can recommend new items (just need item features). Explains WHY we recommend ("Because you like action movies"). Cons: Can get stuck in a bubble (only action movies forever). Needs good item metadata.
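A minimal content-based sketch, assuming simple one-hot genre features and illustrative titles; real systems use much richer metadata and learned embeddings.

```python
# Content-based filtering sketch (illustrative feature set). Each item is a vector
# of content features; the user profile is the average of the vectors of items they
# watched; unwatched items are ranked by cosine similarity to that profile.
import numpy as np

GENRES = ["action", "sci-fi", "comedy", "drama"]

def genre_vector(genres):
    return np.array([1.0 if g in genres else 0.0 for g in GENRES])

items = {
    "inception":    genre_vector({"action", "sci-fi"}),
    "interstellar": genre_vector({"sci-fi", "drama"}),
    "the_matrix":   genre_vector({"action", "sci-fi"}),
    "superbad":     genre_vector({"comedy"}),
    "arrival":      genre_vector({"sci-fi", "drama"}),
}

def recommend_content_based(watched, items, top_k=2):
    profile = np.mean([items[name] for name in watched], axis=0)
    scores = {}
    for name, vec in items.items():
        if name in watched:
            continue
        denom = np.linalg.norm(profile) * np.linalg.norm(vec) + 1e-9
        scores[name] = float(profile @ vec / denom)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(recommend_content_based({"inception", "interstellar", "the_matrix"}, items))
# likely ['arrival', 'superbad'] - sci-fi/drama matches the profile, comedy does not
```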
Approach 3: Matrix Factorization (The math behind Netflix)
Imagine a giant spreadsheet where rows are users and columns are movies. Each cell contains a rating (1-5 stars) or is empty.
The problem: Most cells are empty (a user has not rated most movies).
The solution: Learn hidden patterns that explain the ratings. Then use those patterns to fill in the empty cells (predict ratings).
Think of it like this: We learn that User Alice likes "sci-fi" and "thought-provoking" movies. We learn that Inception has high scores for "sci-fi" and "thought-provoking". So we predict Alice will like Inception.
Matrix factorization - learning hidden patterns
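Here is a toy version of matrix factorization, assuming explicit ratings and plain SGD with illustrative hyperparameters; real systems use ALS or more advanced variants at far larger scale.

```python
# Toy matrix factorization with SGD (numpy only; hyperparameters are illustrative).
# We learn a small latent vector per user and per item so their dot product
# approximates the observed ratings, then use the same dot product to fill in
# the empty cells (predicted ratings).
import numpy as np

def factorize(ratings, n_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))   # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, n_factors))   # item latent factors
    observed = [(u, i, ratings[u, i])
                for u in range(n_users) for i in range(n_items) if ratings[u, i] > 0]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - U[u] @ V[i]
            u_row = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_row - reg * V[i])
    return U, V

# 0 means "not rated"; after training, U @ V.T predicts those cells too.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 1, 5, 4],
])
U, V = factorize(ratings)
print(np.round(U @ V.T, 1))
```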
Approach 4: Deep Learning (Modern approach)
Deep learning can combine many signals at once:
- What the user watched
- How long they watched
- Time of day
- Device type
- Item features
- Sequence of recent watches

Popular architectures:
- Two-tower model: One neural network for the user, one for the item. Compare their outputs.
- Transformer models: Like the ones in ChatGPT, but for sequences of watched items.
Pros: Can learn complex patterns automatically. Very accurate. Cons: Needs lots of data and compute. Harder to explain why something was recommended.
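A compact two-tower sketch in PyTorch, assuming made-up feature dimensions and random inputs; a real model would use richer features and train on logged interactions.

```python
# Minimal two-tower sketch in PyTorch (dimensions and features are illustrative).
# One tower embeds the user, the other embeds the item; the dot product of the two
# embeddings is the relevance score. In production, item embeddings are usually
# pre-computed offline and served from an index.
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_feat_dim=16, item_feat_dim=32, embed_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_feat_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_feat_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )

    def forward(self, user_features, item_features):
        u = self.user_tower(user_features)   # (batch, embed_dim)
        v = self.item_tower(item_features)   # (batch, embed_dim)
        return (u * v).sum(dim=-1)           # dot-product score per user-item pair

model = TwoTower()
users = torch.randn(4, 16)    # 4 users' feature vectors
items = torch.randn(4, 32)    # 4 items' feature vectors
print(model(users, items))    # 4 relevance scores
```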
Two-tower model - a user tower and an item tower, compared by dot product
What Netflix actually uses
Netflix uses a combination of many algorithms. They have multiple "rankers" (different models for different purposes), a "blender" that combines outputs, and business rules on top. The homepage has multiple rows, each generated by a different algorithm. They run hundreds of A/B tests to improve.
Data Model and Storage
Now let me show how we organize the data. We have different storage for different purposes.
What to tell the interviewer
We use three types of storage: (1) A relational database for item metadata (movies, shows), (2) A fast cache (Redis) for pre-computed recommendations, (3) A data warehouse for training data (all historical user events). Each storage type is optimized for its use case.
Table 1: Items - Information about movies and shows
This stores everything we know about each piece of content. We use this for content-based recommendations and for displaying item details.
| Column | What it stores | Example |
|---|---|---|
| id | Unique ID for this item | item_12345 |
| title | Name of the movie/show | Stranger Things |
Table 2: User Profiles - What we know about each user
This stores user preferences that we learn over time. It helps with personalization.
| Column | What it stores | Example |
|---|---|---|
| user_id | Unique ID for this user | user_67890 |
| preferred_genres | Genres they watch most | ["action", "comedy"] |
Table 3: User Events - Everything users do (stored in data warehouse)
This is where we collect all user behavior for training. It is stored in a data warehouse like BigQuery or Snowflake, not in the regular database.
| Column | What it stores | Example |
|---|---|---|
| event_id | Unique ID for this event | evt_abc123 |
| user_id | Who did this action | user_67890 |
Cache Structure: Pre-computed Recommendations
The most important data structure - what we actually serve to users.
Key: recommendations:{user_id}
Value: JSON with multiple recommendation rows
Why we store item IDs, not full item data
We only store item IDs in the cache, not full item details. Why? Item details change (thumbnail updated, availability changed). If we cached full details, they might become stale. Instead, we look up item details separately, and that lookup can have its own caching with shorter TTL.
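A sketch of the write and read paths for this cache entry, assuming redis-py, a 6-hour TTL, and illustrative row names.

```python
# Sketch of how the batch job might write, and the serving path might read, the
# pre-computed recommendations. Key layout follows recommendations:{user_id} above;
# the row names and TTL are illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def write_recommendations(user_id, rows):
    # Batch job: replace the whole entry (idempotent) and let it expire after a
    # few hours so stale entries fall back to popular items instead of lingering.
    r.set(f"recommendations:{user_id}", json.dumps(rows), ex=6 * 3600)

def read_recommendations(user_id):
    # Serving path: a single cache lookup; item details are hydrated separately.
    raw = r.get(f"recommendations:{user_id}")
    return json.loads(raw) if raw else None

write_recommendations("user_67890", {
    "top_picks": ["item_12345", "item_23456"],
    "because_you_watched_stranger_things": ["item_34567"],
})
print(read_recommendations("user_67890"))
```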
Handling Cold Start
The cold start problem
When a new user signs up, we know nothing about them. When a new movie is added, nobody has watched it. How do we make good recommendations with no data? This is one of the hardest problems in recommendation systems.
New User Cold Start - What we can do:
Step 1: Use whatever signals we have
- Country (people in Japan might prefer anime)
- Device (mobile users might prefer shorter content)
- Time of day (late night viewers might want different content)
- Sign-up source (came from an ad for action movies?)

Step 2: Ask during onboarding
- Show 20 popular movies and ask them to pick 5 they like
- Ask for favorite genres
- This gives us immediate signal to work with

Step 3: Fall back to popularity
- Show what is popular right now
- Popular items are popular for a reason - most people like them

Step 4: Learn quickly (exploration)
- Pay attention to first few clicks
- Update recommendations rapidly during first session
- Be willing to show diverse content to learn preferences
FUNCTION get_recommendations_for_new_user(user):
    STEP 1: Check if user completed onboarding survey - if yes, use their chosen genres and titles
    STEP 2: Else, use demographic signals we have (country, device, time of day)
    STEP 3: Blend in popular and trending items as a fallback
    STEP 4: Update quickly as their first clicks and watches arrive
New Item Cold Start - What we can do:
When a new movie is added, nobody has watched it, so collaborative filtering cannot help. We rely on:
1. Content features: Genre, actors, director. If users like Tom Hanks movies, show them the new Tom Hanks movie.
2. Item embeddings: Convert the item description to vectors using language models. Find similar items.
3. Editorial boost: Human curators can feature new releases.
4. Exploration strategy: Show new items to a small random sample of users. Measure engagement. If people like it, show to more people.
New item cold start - exploration strategy
Exploration vs Exploitation
Exploitation means showing items we are confident the user will like. Exploration means showing items we are unsure about to learn more. Good systems balance both. Too much exploitation = boring recommendations (same type forever). Too much exploration = bad recommendations (random stuff). A common approach: 90% exploitation, 10% exploration.
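One simple way to implement the 90/10 split is epsilon-greedy mixing. The sketch below assumes two candidate pools and an illustrative slot-filling policy.

```python
# Epsilon-greedy sketch of the 90% exploitation / 10% exploration split described
# above (the slot-by-slot mixing policy is illustrative).
import random

def mix_exploit_explore(personalized, fresh_or_unknown, n_slots=20, epsilon=0.1):
    """Fill each slot from the personalized list 90% of the time and from
    new/uncertain items 10% of the time, skipping duplicates."""
    result, seen = [], set()
    exploit, explore = iter(personalized), iter(fresh_or_unknown)
    while len(result) < n_slots:
        source = explore if random.random() < epsilon else exploit
        item = next(source, None)
        if item is None:                       # one pool exhausted - use the other
            item = next(exploit, None) or next(explore, None)
        if item is None:                       # both pools exhausted
            break
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(mix_exploit_explore([f"known_{i}" for i in range(30)],
                          [f"new_{i}" for i in range(10)]))
```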
Real-Time Personalization
Pre-computed recommendations are a few hours old. But we can add real-time signals to make them better.
What to tell the interviewer
The base recommendations are pre-computed every few hours. But we can apply real-time adjustments on top: filtering out items they just watched, boosting items related to what they are currently watching, and re-ranking based on time of day or device.
What we can do in real-time (during the request):
1. Filter already watched - Remove items the user has completed in the last session.
2. Continue Watching row - Show items they started but did not finish. This MUST be real-time.
3. Context-based re-ranking:
   - Night time? Rank calming content higher
   - Weekend? Rank longer movies higher
   - Kids profile? Filter out adult content
4. Session-based adjustments:
   - Just finished a sad movie? Maybe do not show another sad one right away
   - Browsing the Action category? Boost action movies in other rows too
FUNCTION get_homepage(user_id):
    STEP 1: Get pre-computed recommendations from cache
    STEP 2: Filter out items the user has already watched
    STEP 3: Fetch the real-time Continue Watching row
    STEP 4: Apply context-based re-ranking (time of day, device, profile type)
    STEP 5: Return the assembled rows
Why we do not do more in real-time:
The math is harsh. With 100,000 requests per second:
- 1 extra database query per request = 100,000 queries/second
- 10ms extra latency per request = noticeable slowdown

So we keep real-time work minimal:
- Filter and re-rank: Yes (fast operations on small lists)
- Run ML model: No (too slow)
- Query user history: Limit to recent items (from fast cache)
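As a rough sketch, the real-time step can be as small as this; the boost rules, weights, and data shapes are illustrative.

```python
# Sketch of the cheap real-time adjustments: filtering and re-ranking a small,
# already pre-computed list (boost rules and weights are illustrative).
def adjust_row(cached_items, recently_watched, context):
    """cached_items: list of (item_id, score, genres) from the pre-computed cache.
    recently_watched: set of item_ids from the fast session cache.
    context: e.g. {"kids_profile": False, "night_time": True}."""
    adjusted = []
    for item_id, score, genres in cached_items:
        if item_id in recently_watched:
            continue                                   # filter already watched
        if context.get("kids_profile") and "adult" in genres:
            continue                                   # hard business-rule filter
        if context.get("night_time") and "calming" in genres:
            score *= 1.2                               # light contextual boost
        adjusted.append((item_id, score))
    return [i for i, _ in sorted(adjusted, key=lambda x: x[1], reverse=True)]

row = [("item_1", 0.91, {"action"}), ("item_2", 0.88, {"calming"}), ("item_3", 0.95, {"drama"})]
print(adjust_row(row, recently_watched={"item_1"}, context={"night_time": True}))
```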
Where work happens: offline vs online
A/B Testing and Evaluation
Tell the interviewer about testing
You cannot just launch a new algorithm and hope it works. We need A/B testing to measure if the new algorithm is actually better. We also need offline evaluation before going live.
Offline Metrics (measured before launching):
1. Precision@K: Of the top K items we recommend, how many would the user actually watch?
2. Recall@K: Of all items the user would like, what fraction did we include in the top K?
3. NDCG (Normalized Discounted Cumulative Gain): Are the best items ranked highest?
4. Coverage: What fraction of our catalog gets recommended? (Low coverage = same items shown to everyone)
5. Diversity: How different are the recommended items from each other?
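These metrics are straightforward to compute on held-out data. Here is a reference sketch assuming binary relevance and illustrative item IDs.

```python
# Reference implementations of Precision@K, Recall@K, and NDCG@K for offline
# evaluation. A held-out set of watched items stands in for "would watch".
import math

def precision_at_k(recommended, relevant, k):
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

recommended = ["item_3", "item_7", "item_1", "item_9", "item_4"]
relevant = {"item_1", "item_4", "item_8"}
print(precision_at_k(recommended, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(recommended, relevant, 5))     # 2/3 ~= 0.67
print(ndcg_at_k(recommended, relevant, 5))       # relevant items ranked 3rd and 5th
```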
Online Metrics (measured during A/B test):
1. Click-through rate (CTR): What fraction of users click on a recommendation?
2. Watch time: How many hours of content do users watch?
3. Completion rate: Do users finish what they start?
4. Engagement: How often do users come back to the app?
5. Subscription retention: Do users cancel less often?
The most important metric is usually total watch time - it captures both good recommendations (users find things to watch) and satisfaction (they finish what they start).
A/B TEST SETUP:
    Group A (Control - 50% of users): sees recommendations from the current algorithm
    Group B (Treatment - 50% of users): sees recommendations from the new algorithm
    Run long enough to collect data, then compare online metrics (CTR, watch time, retention)
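A common way to implement sticky group assignment is to hash the user ID together with the experiment name. This sketch assumes a 50/50 split and an illustrative experiment name.

```python
# Deterministic A/B assignment: hash user_id with the experiment name so the
# assignment is sticky and independent across experiments (names and the 50/50
# split are illustrative).
import hashlib

def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

print(assign_group("user_67890", "new_two_tower_ranker_v1"))
# The same user in the same experiment always lands in the same group.
```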
The Feedback Loop Problem:
Recommendation systems create a feedback loop:
1. We recommend item X to users
2. More users watch item X (because we recommended it)
3. We see "users like item X" in our data
4. We recommend item X even more

This can make popular items TOO dominant. Solutions:
- Explicitly track if users found items through recommendations vs. search
- Inject diversity into recommendations
- Use causal inference techniques to measure true preferences
What Netflix measures
Netflix's north star metric is "quality hours watched" - not just watch time, but time spent on content users rate highly. They also measure "take rate" (users who start watching something vs. just browsing) and "retention" (users who keep their subscription).
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
Common failures and how we handle them:
| What breaks | What happens to users | How we fix it | Why this works |
|---|---|---|---|
| Cache is empty for a user | No pre-computed recommendations | Fall back to popular/trending items | Popular items work for everyone - better than nothing |
| Cache server goes down | Cannot look up recommendations | Run multiple cache replicas + a circuit breaker that triggers the fallback chain | Redundancy ensures availability |
The Fallback Chain:
Every good recommendation system has a chain of fallbacks. If one method fails, try the next:
FUNCTION get_recommendations_with_fallback(user_id):
    TRY Level 1: Personalized recommendations from cache
    TRY Level 2: Popular/trending items for the user's country
    TRY Level 3: A static default list that is always available
Monitoring and Alerting:
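In code, the chain might look like the sketch below, assuming redis-py, a short socket timeout, and an illustrative static default list; a real service would also emit fallback metrics and use a proper circuit breaker.

```python
# Possible shape of the fallback chain (key names, timeout, and the static default
# list are illustrative).
import json
import redis

cache = redis.Redis(host="localhost", port=6379, socket_timeout=0.05)
STATIC_DEFAULTS = ["item_popular_1", "item_popular_2", "item_popular_3"]

def get_recommendations_with_fallback(user_id, country):
    try:
        raw = cache.get(f"recommendations:{user_id}")
        if raw:                                   # Level 1: personalized
            return json.loads(raw)["top_picks"]
        raw = cache.get(f"popular:{country}")
        if raw:                                   # Level 2: popular in this country
            return json.loads(raw)
    except redis.RedisError:
        pass                                      # cache down: fall through
    return STATIC_DEFAULTS                        # Level 3: always available

print(get_recommendations_with_fallback("user_67890", "US"))
```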
What we watch for:
- Cache hit rate (should be >95%)
- Batch job completion time and success rate
- Request latency (p50, p95, p99)
- Fallback trigger rate (how often we hit Level 2+)
- Recommendation diversity (are we showing varied content?)
What is idempotent? (important word to know)
Idempotent means: doing something twice has the same result as doing it once. For batch jobs, we design them to be idempotent - if a job runs twice by accident, it should not corrupt data. Each run replaces old recommendations completely instead of appending.
Growing the System Over Time
What to tell the interviewer
This design works great for up to 500 million users. Let me explain how we would grow it further and what advanced features we could add.
How we grow step by step:
Stage 1: Starting out (up to 100 million users)
- Single Redis cluster for cache (5-10 TB)
- Single Spark cluster for batch processing
- Simple collaborative filtering + content-based
- Recommendations refresh every 12-24 hours
- This is enough for most companies

Stage 2: Scaling up (100-500 million users)
- Multiple Redis clusters, sharded by user_id
- Larger Spark cluster or multiple clusters
- Add deep learning models (two-tower)
- Refresh every 4-6 hours
- Add real-time feature store

Stage 3: Internet scale (500M+ users, like Netflix)
- Global cache distribution (US, Europe, Asia)
- Multiple specialized models for different use cases
- Near real-time updates for some features
- Sophisticated A/B testing framework
- Custom hardware for ML inference
Scaling the cache layer globally
Advanced features we can add later:
1. Multi-objective optimization
Optimize for multiple goals at once:
- User satisfaction (they watch and enjoy)
- Diversity (show varied content, not just one genre)
- Freshness (promote new releases)
- Business goals (promote originals, balance licensing costs)
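One simple way to combine these objectives is a weighted sum of per-objective scores; the weights below are illustrative and would normally be tuned through A/B tests.

```python
# Weighted blend of per-objective scores (weights are illustrative placeholders).
OBJECTIVE_WEIGHTS = {"relevance": 0.6, "diversity": 0.15, "freshness": 0.15, "business": 0.10}

def blended_score(objective_scores):
    """objective_scores: per-item scores in [0, 1] for each objective."""
    return sum(weight * objective_scores.get(name, 0.0)
               for name, weight in OBJECTIVE_WEIGHTS.items())

candidate = {"relevance": 0.9, "diversity": 0.3, "freshness": 0.8, "business": 0.5}
print(round(blended_score(candidate), 3))  # 0.6*0.9 + 0.15*0.3 + 0.15*0.8 + 0.1*0.5 = 0.755
```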
2. Sequence-aware recommendations
Consider the ORDER of what users watched. Watching comedies lately? Maybe they want a break from serious content.
3. Context-aware recommendations
Deeply integrate context:
- Time of day patterns (user watches different things at 9am vs 9pm)
- Seasonal content (holiday movies in December)
- Social viewing (different recs when multiple people watch together)
- Mood detection (from browsing patterns)
4. Explainable recommendations
Tell users WHY we recommend something:
- "Because you watched Stranger Things"
- "Popular in your area"
- "Trending this week"
- "Critics are raving about this"
This builds trust and helps users make decisions faster.
What Netflix actually does beyond this
Netflix has hundreds of engineers working on recommendations. They use: personalized artwork (different thumbnails for different users), personalized row ordering, personalized autoplay previews, and even personalized descriptions. The system we designed is the foundation - real systems layer many more personalization features on top.