Design Walkthrough
Problem Statement
The Question: Design an advertising system for a social media platform that serves billions of ad impressions per day.
Ad systems are the revenue engine of social platforms:
- Facebook: $115B+ annual ad revenue
- Google: $225B+ annual ad revenue
- Twitter/X: $4B+ annual ad revenue
The system must:
- Select the best ad for each user in real-time
- Track budgets and prevent overspend
- Measure and report ad performance
- Handle millions of concurrent advertisers
What to say first
This is a complex system with multiple components. Let me clarify the scope - are we focusing on the ad serving pipeline, the campaign management system, or the full end-to-end? I will assume ad serving is the priority since that is the most technically challenging.
Hidden requirements interviewers test:
- Do you understand auction mechanics (second-price, Vickrey)?
- Can you design for competing objectives (revenue vs UX)?
- Do you know ML ranking at scale?
- Can you handle real-time budget enforcement?
- Do you understand ad fraud and brand safety?
Clarifying Questions
These questions shape your architecture significantly.
Question 1: Scale
What is the scale? How many daily active users, ad impressions per day, and number of advertisers?
Why this matters: Determines caching strategy and system capacity.
Typical answer: 500M DAU, 10B impressions/day, 10M advertisers.
Architecture impact: Cannot evaluate all ads for every request - need filtering and caching.
Question 2: Ad Types
What types of ads? Feed ads, stories, search ads, display ads?
Why this matters: Different ad types have different serving patterns.
Typical answer: Focus on feed ads (in-stream native ads).
Architecture impact: Ads are served as part of feed generation, so selection must be fast.
Question 3: Auction Model
What auction model? First-price, second-price, or VCG auction?
Why this matters: Affects bidding strategy and revenue.
Typical answer: Generalized second-price (GSP) auction.
Architecture impact: Need to track the clearing price, not just the winning bid.
Question 4: Billing Model
CPC (cost per click), CPM (cost per thousand impressions), or CPA (cost per action)?
Why this matters: Determines when budget is consumed.
Typical answer: All three, but CPC is most common.
Architecture impact: Need click tracking and conversion tracking systems.
Stating assumptions
I will assume: 500M DAU, 10B impressions/day, feed ads with GSP auction, CPC billing primary. The system must select ads in under 100ms p99.
The Hard Part
Say this out loud
The hard part here is selecting the best ad from millions of candidates in under 100ms while balancing advertiser ROI, user experience, and platform revenue - all while enforcing budgets in real-time.
Why this is genuinely hard:
1. Scale of Selection: 10M ads, 500M users, and we must find the best match in milliseconds. We cannot score every ad for every request.
2. Competing Objectives:
   - Advertisers want conversions (show to likely buyers)
   - Users want relevance (not annoyed by irrelevant ads)
   - Platform wants revenue (maximize total ad spend)
   - These often conflict!
3. Real-time Budget Enforcement: If an advertiser has a $1000 daily budget and we serve 1M impressions, we must track spend accurately without adding latency.
4. Cold Start Problem: New ads have no performance data. How do we rank them fairly against proven ads?
5. Fraud Detection: Click fraud, impression fraud, bot traffic - billions of dollars at stake.
Common mistakes
Candidates often: (1) Ignore the auction mechanism entirely, (2) Propose scoring all 10M ads for every request, (3) Forget about budget enforcement, (4) Skip fraud considerations.
The fundamental tradeoff:
Ad Rank = Bid x Quality Score x Relevance
- High bid, low relevance = bad UX, users leave
- Low bid, high relevance = less revenue
- The balancing act that ML models optimize
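To make the tradeoff concrete, here is a tiny worked example with hypothetical numbers, showing how a lower bid can still win on ad rank:

```python
# Two hypothetical ads competing for one slot.
ad_a = {"bid": 5.00, "quality": 0.3, "relevance": 0.4}  # high bid, poor fit
ad_b = {"bid": 2.00, "quality": 0.9, "relevance": 0.9}  # low bid, great fit

def ad_rank(ad: dict) -> float:
    return ad["bid"] * ad["quality"] * ad["relevance"]

print(ad_rank(ad_a))  # 5.00 * 0.3 * 0.4 = 0.60
print(ad_rank(ad_b))  # 2.00 * 0.9 * 0.9 = 1.62 -> ad B wins despite bidding less
```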
Scale & Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Daily Active Users | 500 million | 500M ad requests/day minimum |
| Impressions per Day | 10 billion | ~115K impressions/second average |
| Active Advertisers | 10 million | Cannot score every ad per request - need a selection funnel |
What to say
At 10B impressions/day with 10M ads, we cannot evaluate every ad for every request. We need a funnel: broad filtering first, then detailed scoring on a smaller candidate set.
Access Pattern Analysis:
- Read-heavy for ads: Ads change rarely, are read millions of times
- Write-heavy for events: Every impression and click is an event
- Hot advertisers: Top 1% of advertisers drive 50%+ of spend
- Temporal patterns: Budgets reset daily, traffic peaks predictably
- Geographic locality: Users see ads targeted to their region
Ad serving throughput:
- 10B impressions / 86400 seconds = 115K/second average
- Peak = 5x average = ~575K impressions/second

High-Level Architecture
The ad system has three main subsystems:
1. Campaign Management: Advertisers create and manage campaigns
2. Ad Serving: Real-time ad selection for each request
3. Event Processing: Track impressions, clicks, and conversions
What to say
I will focus on ad serving since it is the most technically challenging. The key insight is a funnel architecture: filter millions down to hundreds, then score and rank the top candidates.
Ad Serving Architecture
The Ad Selection Funnel (sketched in code after the list):

1. Targeting Filter (10M to 100K ads, < 10ms)
   - Geographic targeting
   - Demographic targeting
   - Device/platform targeting
   - Budget > 0 check

2. Candidate Selection (100K to 1000 ads, < 20ms)
   - Lightweight relevance scoring
   - Diversity requirements
   - Frequency capping check

3. ML Ranking (1000 to 100 ads, < 50ms)
   - Deep learning model
   - Predicts CTR and conversion probability
   - User-ad feature interactions

4. Auction (100 to 10 ads, < 20ms)
   - Apply bids
   - Run GSP auction
   - Select winners
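Here is a minimal runnable sketch of the funnel, with a hypothetical `Ad` type and toy stand-ins for each stage; a real system would replace each step with the services described above:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Ad:
    ad_id: str
    bid: float             # CPC bid in dollars
    quality: float         # lightweight relevance score in [0, 1]
    regions: Set[str] = field(default_factory=set)

def select_ads_funnel(user_region: str, all_ads: List[Ad], num_slots: int = 10) -> List[Ad]:
    # Stage 1: targeting filter (10M -> ~100K) - cheap boolean checks only.
    candidates = [ad for ad in all_ads if user_region in ad.regions and ad.bid > 0]
    # Stage 2: candidate selection (100K -> ~1000) - lightweight score, keep top-K.
    candidates = sorted(candidates, key=lambda a: a.quality, reverse=True)[:1000]
    # Stage 3: ML ranking (1000 -> ~100) - stand-in for the deep model's prediction.
    ranked = sorted(candidates, key=lambda a: a.bid * a.quality, reverse=True)[:100]
    # Stage 4: auction (100 -> ~10) - the GSP auction picks the final slate.
    return ranked[:num_slots]
```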
Real-world reference
Facebook uses a similar funnel. Their ML ranking model has 1000+ features and serves predictions in single-digit milliseconds using specialized inference servers.
Data Model & Storage
Multiple storage systems optimized for different access patterns.
What to say
We use different storage for different needs: fast key-value for real-time serving, OLAP for analytics, and a relational database for campaign management.
```sql
-- Advertiser accounts (columns below are illustrative, not the full schema)
CREATE TABLE advertisers (
    id BIGINT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    daily_budget_cents BIGINT NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'active'
);
```

Real-time Data Structures (Redis):
```text
# Budget tracking (per advertiser)
budget:{advertiser_id}:daily_remaining -> INT (cents remaining)
budget:{advertiser_id}:daily_spent -> INT (cents spent today)
```
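A minimal sketch of an atomic debit against these keys, assuming the redis-py client (the key names match the layout above):

```python
import redis

r = redis.Redis()  # in production this is a sharded cluster, not one node

def try_spend(advertiser_id: int, cost_cents: int) -> bool:
    """Atomically debit the daily budget; refuse to serve if it would go negative."""
    key = f"budget:{advertiser_id}:daily_remaining"
    remaining = r.decrby(key, cost_cents)  # atomic across concurrent ad servers
    if remaining < 0:
        r.incrby(key, cost_cents)  # roll back: budget exhausted
        return False
    r.incrby(f"budget:{advertiser_id}:daily_spent", cost_cents)
    return True
```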
Feature Store (for ML ranking):

```text
User Features (precomputed, refreshed hourly):
user_id -> {
    age_bucket, gender, location, ...
}
```

Important detail
The ad index must be rebuilt when campaigns start/stop or budgets deplete. We use a background job that publishes updates to ad servers every few seconds.
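One way to sketch that push model, assuming Redis pub/sub as the transport (the channel name and message format are hypothetical):

```python
import json
import redis

r = redis.Redis()

# Background job side: publish index deltas every few seconds.
def publish_index_update(ad_id: str, action: str) -> None:
    # action: "add" (campaign started) or "remove" (ended / budget depleted)
    r.publish("ad-index-updates", json.dumps({"ad_id": ad_id, "action": action}))

# Ad server side: subscribe and patch the in-memory index.
def listen_for_updates(local_index: dict) -> None:
    pubsub = r.pubsub()
    pubsub.subscribe("ad-index-updates")
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        update = json.loads(msg["data"])
        if update["action"] == "remove":
            local_index.pop(update["ad_id"], None)
```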
Algorithm Deep Dive: Auction & Ranking
The core of ad selection is the auction combined with ML ranking.
Generalized Second-Price (GSP) Auction:
In a GSP auction, each winner pays just enough to beat the next-ranked bidder (plus a small increment), rather than its own bid. Unlike the fully truthful VCG auction, GSP is not strictly incentive-compatible, but it is simple to run and has long been the most common choice in practice.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class AdCandidate:
    ad_id: str
    bid: float            # advertiser CPC bid in dollars (assumed fields)
    quality_score: float  # ML-predicted quality in [0, 1]

@dataclass
class AuctionResult:
    ad_id: str
    price: float  # clearing price the winner actually pays

def run_gsp_auction(candidates: List[AdCandidate], num_slots: int) -> List[AuctionResult]:
    """
    Run generalized second-price auction.
    Rank by ad rank (bid x quality). Each winner pays the minimum bid
    needed to keep its slot: next ad rank / own quality, plus $0.01.
    """
    ranked = sorted(candidates, key=lambda c: c.bid * c.quality_score, reverse=True)
    results = []
    for i, winner in enumerate(ranked[:num_slots]):
        if i + 1 < len(ranked):
            next_rank = ranked[i + 1].bid * ranked[i + 1].quality_score
            price = next_rank / winner.quality_score + 0.01
        else:
            price = 0.01  # no competitor below: pay only the reserve increment
        results.append(AuctionResult(winner.ad_id, round(min(price, winner.bid), 2)))
    return results
```

ML Ranking Model:
The quality score comes from ML models predicting engagement.
ML Ranking Pipeline
```python
def calculate_quality_score(ad: Ad, user: User, context: Context) -> float:
    """
    Quality score balances multiple objectives.
    The weights below are illustrative; production weights are tuned per surface.
    """
    # ctr_model, conversion_model, relevance_model: trained predictors assumed
    # to be served in-process alongside the ad server.
    p_click = ctr_model.predict(ad, user, context)   # probability of a click
    p_convert = conversion_model.predict(ad, user)   # probability of a conversion
    user_value = relevance_model.predict(ad, user)   # penalizes irrelevant/annoying ads
    return 0.5 * p_click + 0.3 * p_convert + 0.2 * user_value
```

Facebook approach
Facebook uses Total Value = Advertiser Bid x Estimated Action Rate x User Value. The User Value component penalizes low-quality or irrelevant ads to protect user experience.
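For example (hypothetical numbers): a $2.00 bid with a 3% estimated action rate and a 0.8 user-value multiplier scores 2.00 x 0.03 x 0.8 = 0.048, while a $1.00 bid with a 10% action rate and the same user value scores 1.00 x 0.10 x 0.8 = 0.08 - the cheaper but more engaging ad wins.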
Budget Enforcement & Pacing
System Invariant
Never significantly overspend an advertiser's budget. Even 1% overspend across millions of advertisers means millions in refunds. Slight overspend (< 5%) is acceptable given the distributed nature of budget tracking.
The Budget Challenge:
With 100M clicks/day and 10M advertisers:
- Cannot check a central database for every click
- Cannot lock budgets for every impression
- Must handle eventual consistency gracefully
```python
class BudgetManager:
    """Optimistic budget tracking: ad servers serve from a local cache and
    sync spend to the central store asynchronously (sketch; fields assumed)."""

    def __init__(self):
        self.local_cache = {}    # advertiser_id -> estimated cents remaining
        self.pending_spend = {}  # advertiser_id -> cents spent since last sync

    def can_serve(self, advertiser_id: int, est_cost_cents: int) -> bool:
        remaining = self.local_cache.get(advertiser_id, 0)
        return remaining - self.pending_spend.get(advertiser_id, 0) >= est_cost_cents

    def record_spend(self, advertiser_id: int, cost_cents: int) -> None:
        # Debit locally now; a background job flushes pending_spend to the
        # central store and refreshes local_cache.
        self.pending_spend[advertiser_id] = (
            self.pending_spend.get(advertiser_id, 0) + cost_cents
        )
```

Budget Pacing:
Advertisers want to spread budget across the day, not spend it all in the morning.
```python
import datetime

def calculate_pacing_multiplier(campaign_id: str) -> float:
    """
    Adjust bid based on spend rate vs target rate.
    Returns < 1.0 to slow spend when ahead of pace, 1.0 when on or behind pace.
    """
    now = datetime.datetime.now()
    day_fraction = (now.hour * 3600 + now.minute * 60 + now.second) / 86400
    spent = get_spend_today(campaign_id)        # assumed lookup helpers
    daily_budget = get_daily_budget(campaign_id)
    target_spend = daily_budget * day_fraction  # linear pacing target
    if spent >= target_spend:
        # Ahead of pace: throttle proportionally (floor keeps some delivery).
        return max(0.1, target_spend / max(spent, 1))
    return 1.0
```

What to say
Budget enforcement uses optimistic tracking with eventual consistency. We debit budgets asynchronously after impressions, accepting small overspend risk for latency. Reconciliation runs hourly to catch discrepancies.
Consistency & Invariants
Key Invariants:
| Invariant | Impact if Violated | Enforcement |
|---|---|---|
| No significant budget overspend | Must refund advertisers | Atomic decrements + buffer |
| No ads shown after campaign ends | Wasted spend, angry advertisers | TTL on ad index entries |
| Frequency caps respected | User annoyance, lower CTR | Redis counters per user-ad |
| Brand safety maintained | PR disaster, legal issues | Pre-filtering + real-time checks |
| Click attribution accurate | Wrong billing | Click ID tracking, deduplication |
Critical invariant detail
Budget tracking must be monotonic. If we show an ad, we MUST debit the budget. Lost debits mean overspend. We use write-ahead logging: log the impression before serving, then debit budget.
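A minimal sketch of that ordering, with hypothetical `wal`, `budget_store`, and `render` helpers; because every event carries a unique ID, a crashed debit can be replayed from the log idempotently:

```python
import json
import time
import uuid

def serve_impression(wal, budget_store, ad, user_id: str) -> None:
    # 1. Log the impression BEFORE serving - this record is the source of truth.
    event = {
        "event_id": str(uuid.uuid4()),  # unique ID makes replays idempotent
        "ad_id": ad.ad_id,
        "advertiser_id": ad.advertiser_id,
        "cost_cents": ad.cost_cents,
        "user_id": user_id,
        "ts": time.time(),
    }
    wal.append(json.dumps(event))  # durable append-only write
    # 2. Serve the ad to the user.
    render(ad)
    # 3. Debit the budget. If this step fails, a replayer re-applies the
    #    debit from the WAL, keyed by event_id.
    budget_store.debit(ad.advertiser_id, ad.cost_cents, event["event_id"])
```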
Consistency Model:
We use different consistency levels for different data:
| Data | Consistency | Reason |
|---|---|---|
| Ad creative content | Eventually consistent | Changes rare, caching OK |
| Budget remaining | Eventual with bounds | Speed critical, small overrun OK |
| Frequency cap counts | Eventually consistent | Approximate OK |
| Click/conversion events | At-least-once | May double-count, reconcile later |
| Billing records | Strongly consistent | Money - must be accurate |
Business impact mapping
If we show an ad to a user who hit frequency cap, they see one extra ad - minor annoyance. If we overspend a budget by 50%, we owe that advertiser money. Consistency requirements follow business impact.
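As a concrete example of the cheap end of that spectrum, the frequency-cap counter from the invariants table can be a single approximate Redis counter per user-ad pair (a sketch assuming the redis-py client; key name hypothetical):

```python
import redis

r = redis.Redis()

def under_frequency_cap(user_id: str, ad_id: str, cap: int = 3) -> bool:
    """Approximate daily cap: eventual consistency is fine here because the
    worst case is one extra impression, not lost money."""
    key = f"freqcap:{user_id}:{ad_id}"
    count = r.incr(key)       # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, 86400)  # counter expires after 24 hours
    return count <= cap
```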
Failure Modes & Resilience
Proactively discuss failures
Ad systems handle billions of dollars. Let me walk through failure modes and mitigations.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| ML model serving down | Cannot rank ads | Fallback to simple rules | Better to show suboptimal ads than no ads |
| Redis budget cache down | Cannot track budgets | Local estimation + pause high spenders | Conservative approach prevents overspend |
Graceful Degradation Strategy:
```python
def select_ads(user: User, context: Context) -> List[Ad]:
    """
    Fallback chain for ad selection: each tier is simpler and more
    available than the one above it (helper names are the services above).
    """
    try:
        return full_pipeline(user, context)         # targeting + ML ranking + auction
    except MLServiceError:
        return rule_based_ranking(user, context)    # no ML: rank by bid x historical CTR
    except AdIndexError:
        return cached_ads_for_region(user.region)   # stale but safe regional cache
    except Exception:
        return []                                   # fail open: empty slot, feed still renders
```

Key insight
For ad systems, failing open is usually correct: showing no ads beats a broken page. The revenue lost during downtime is smaller than the user trust lost to a broken experience.
Evolution & Scaling
What to say
This design handles 10B impressions/day. Let me discuss how it evolves for 10x scale and additional capabilities.
Evolution Path:
Stage 1: Basic System (up to 1B impressions/day)
- Single ML model
- Centralized Redis for budgets
- Simple rule-based candidate filtering

Stage 2: Scaled System (up to 10B impressions/day)
- Distributed ad index
- Sharded budget tracking
- Cascade ML ranking (simple model -> complex model)
- Regional deployment

Stage 3: Advanced System (100B+ impressions/day)
- Real-time model training
- Federated budget tracking
- Edge-based ad selection
- Cross-device attribution
Multi-Region Architecture
Advanced Topics to Mention:
1. Real-Time Bidding (RTB): For ad exchanges, the auction runs across multiple ad networks within ~100ms.
2. Conversion Attribution: Users see an ad on mobile, then buy on desktop days later. Multi-touch attribution is complex.
3. Fraud Detection: Click farms, bot traffic, impression fraud. ML models detect anomalies.
4. Privacy Compliance: GDPR and CCPA constrain targeting. Need consent management and data retention policies.
5. A/B Testing: Test ranking models, auction parameters, and UI changes. Need holdout groups and statistical rigor.
Alternative approaches
If latency requirements were stricter, we could pre-compute ad selections for user segments and cache them. If budget accuracy were critical, we could use distributed transactions at the cost of latency. The design always trades off competing concerns.