Design Walkthrough
Problem Statement
The Question: Design a ranking system like App Store Top Charts or Amazon Bestsellers that can rank millions of items based on sales/downloads in various time windows (hourly, daily, weekly).
Real-world examples:
- App Store: Top Free, Top Paid, Top Grossing apps per category
- Amazon: Bestsellers updated hourly across thousands of categories
- Spotify: Top 50 charts per country, updated daily
- Reddit: Hot posts ranking across thousands of subreddits
- YouTube: Trending videos per region
What to say first
Before designing, I want to understand the scale, freshness requirements, and what metrics drive the ranking. The approach differs significantly based on whether we need real-time or can tolerate minutes of staleness.
Hidden requirements interviewers test:
- Can you handle high write throughput (millions of events)?
- Do you understand time-windowed aggregation?
- Can you design for both real-time and batch scenarios?
- Do you know data structures for maintaining sorted sets efficiently?
Clarifying Questions
These questions dramatically shape the architecture. Ask them early.
Question 1: Scale
How many items do we rank? How many events (purchases/downloads) per day?
Why this matters: Determines if we can use simple database queries or need streaming infrastructure.
Typical answer: 10M items, 1B events/day (roughly 12K events/second average, 100K at peak)
Architecture impact: Need streaming aggregation, cannot query raw events
Question 2: Freshness
How fresh must rankings be? Real-time, or can they be minutes/hours old?
Why this matters: Real-time requires streaming; batch is simpler but stale.
Typical answer: Updated every few minutes is acceptable (Amazon updates hourly)
Architecture impact: Can use micro-batch processing instead of true streaming
Question 3: Time Windows
What time windows do we need? Last hour, day, week, all-time?
Why this matters: Multiple windows multiply storage and computation.
Typical answer: Hourly, daily, weekly, monthly, all-time
Architecture impact: Need efficient window management, possibly different strategies per window
Question 4: Categories
Do we rank globally or per category? How many categories?
Why this matters: Per-category multiplies the number of rankings to maintain.
Typical answer: Both global and per-category, 10K categories
Architecture impact: 10K categories x 5 time windows = 50K separate rankings
Question 5: Ranking Formula
Simple count, or weighted formula (recency, engagement quality)?
Why this matters: Complex formulas require more computation.
Typical answer: Start with simple count, mention weighted as extension
Architecture impact: Simple aggregation vs complex scoring pipeline
Stating assumptions
I will assume: 10M items, 1B events/day, rankings updated every 5 minutes, 5 time windows, 10K categories, simple count-based ranking initially.
The Hard Part
Say this out loud
The hard part is maintaining accurate aggregated counts for millions of items across multiple time windows while handling 100K events per second at peak.
Why this is genuinely hard:
1. Write Amplification: Each event must update multiple time windows (hourly, daily, weekly, etc.). One event becomes five writes.
2. Time Window Management: Sliding windows are expensive. How do you efficiently expire old data from a window?
3. Sorting at Scale: Maintaining sorted order of 10M items is expensive. You cannot sort on every read.
4. Hot Items: Top items (viral apps) get disproportionate traffic. A single item might get 1M events/hour.
5. Category Fan-out: One item belongs to multiple categories. Each event updates rankings in all of its categories.
Common mistake
Candidates often propose a query like SELECT item_id, COUNT(*) FROM events GROUP BY item_id ORDER BY COUNT(*) DESC. This works for thousands of items, but not for millions of items with billions of events.
The fundamental insight:
You do not need to maintain exact rankings for ALL items. Users only ever see:
- The top 100-200 items in a category
- Maybe the rank of a specific item they search for
This means we can:
- Maintain precise rankings only for the top items (sketched right after this list)
- Use approximate counts for the long tail
- Cache aggressively, since top items change slowly
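One way to act on the first point, sketched with redis-py under the assumption that authoritative counts live in the aggregation/batch layer described later: cap each serving sorted set at a fixed size, since only the head of a ranking is ever displayed. The cap value and key names here are illustrative.

```python
import redis

r = redis.Redis()
TOP_N = 1000  # assumed cap; the UI only ever shows the top 100-200

def record_and_trim(category: str, window: str, item_id: str, count: int = 1):
    key = f"rankings:{category}:{window}"
    r.zincrby(key, count, item_id)
    # Keep only the TOP_N highest-scoring members; ranks 0..-(TOP_N+1) are the tail.
    # Trimming discards tail counts, so exact totals must live elsewhere (batch layer).
    r.zremrangebyrank(key, 0, -(TOP_N + 1))
```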
Scale and Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Total Items | 10,000,000 | Cannot sort all items on every query |
| Events per Day | 1,000,000,000 | 12K/sec avg, 100K/sec peak - need streaming |
What to say
At 1B events/day with 5 time windows, we are looking at 5B counter updates per day. This is clearly a streaming aggregation problem, not a batch query problem.
Access Pattern Analysis:
Writes (Events):
- High throughput: 100K/sec at peak
- Bursty: product launches, viral moments
- Each event: increment counters for the item across multiple windows and categories

Reads (Rankings):
- Top-K queries: get the top 100 items in category X for window Y
- Specific rank: what rank is item Z in category X?
- Highly cacheable: top items change slowly (minutes, not seconds)
Storage for counters:
- 10M items x 5 windows x 8 bytes = 400 MB (fits in memory easily)
- With 10K categories (worst case, if every item appeared in every category): 10M x 5 x 10K x 8 bytes = 4 TB (too much!)

High-Level Architecture
I will design a Lambda architecture that combines streaming for freshness with batch for accuracy.
What to say
We use a streaming layer for near-real-time rankings and a batch layer for accurate historical aggregation. Results are merged at query time. We scale writes by partitioning events, and reads by caching top-K results.
Top-K Rankings Architecture
Component Responsibilities:
1. Kafka: Durable event stream, enables replay, decouples producers from consumers (an example event payload is sketched after this list)
2. Stream Processing (Flink/Spark Streaming):
   - Aggregates events in micro-batches (every 1-5 minutes)
   - Maintains windowed counts per item/category
   - Updates Redis with the latest counts
3. Redis Sorted Sets:
   - Store current rankings per category/window
   - ZADD/ZINCRBY for atomic updates
   - ZREVRANGE for top-K queries
4. Batch Layer (Spark):
   - Processes the full event history daily
   - Produces accurate rankings (corrects any streaming drift)
   - Handles complex ranking formulas
5. Serving API:
   - Queries Redis for real-time rankings
   - Falls back to batch results if streaming lags
   - Caches responses aggressively
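For concreteness, the events flowing through Kafka into the stream processor might look like the following. The field names are assumptions for illustration, not a fixed schema.

```python
# Example download/purchase event (field names are assumed for illustration)
event = {
    "event_id": "evt-8f3a1c",            # unique ID, useful for idempotent processing
    "item_id": "app_123",
    "categories": ["games", "puzzle"],   # an item can belong to multiple categories
    "event_type": "download",            # e.g. download or purchase
    "timestamp": 1718000000,             # epoch seconds; assigns the event to time windows
    "country": "US",                     # optional, for per-region charts
}
```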
Real-world reference
Amazon updates bestseller rankings hourly using batch jobs. Apple App Store uses a hybrid approach with streaming for trending and batch for official charts. Spotify runs batch jobs every 24 hours for their official charts.
Data Model and Storage
Redis Sorted Sets are the natural choice for maintaining rankings - they provide O(log N) inserts/increments and O(log N + K) top-K retrieval.
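A minimal redis-py sketch of the operations we rely on, using the rankings:{category}:{window} key pattern described below (category and item IDs are illustrative):

```python
import redis

r = redis.Redis()

# Count one download for item "app_123" in the hourly Games chart
r.zincrby("rankings:games:hourly", 1, "app_123")

# Top 10 items with their counts, highest score first
top10 = r.zrevrange("rankings:games:hourly", 0, 9, withscores=True)

# Rank of a specific item (0-based; None if the item is not in the set)
rank = r.zrevrank("rankings:games:hourly", "app_123")
```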
What to say
Redis Sorted Sets give us exactly what we need: atomic score increments, automatic sorting, and efficient top-K queries. The key design is: rankings:{category}:{window}.
Key pattern: rankings:{category_id}:{window_type}
Examples: rankings:global:daily, rankings:games:hourly (category IDs are illustrative)

Window Management Strategy:
For time windows, we have two basic approaches, plus a hybrid of the two:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Tumbling Windows | Reset counter at window boundary | Simple, low memory | Rankings jump at boundaries |
| Sliding Windows | Track events with timestamps, sum recent | Smooth rankings | High memory, complex expiry |
| Hybrid (Bucketed) | Multiple small buckets, sum for window | Balance of both | Slightly complex |
```python
# For daily ranking, maintain 24 hourly buckets
# Daily count = sum of last 24 hourly buckets
```
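Below is a sketch of this bucketed approach with redis-py. Bucket key names and retention are assumptions; the daily ranking is materialised by summing the last 24 hourly buckets with ZUNIONSTORE (which aggregates scores with SUM by default).

```python
import time
import redis

r = redis.Redis()

def hourly_bucket_key(category, ts):
    # Bucket ID = whole hours since the epoch
    return f"counts:{category}:hour:{int(ts // 3600)}"

def record_event(category, item_id, ts=None):
    ts = ts if ts is not None else time.time()
    key = hourly_bucket_key(category, ts)
    r.zincrby(key, 1, item_id)
    r.expire(key, 26 * 3600)  # keep hourly buckets slightly longer than the daily window needs

def rebuild_daily_ranking(category):
    # Daily ranking = sum of the last 24 hourly buckets
    now = time.time()
    sources = [hourly_bucket_key(category, now - i * 3600) for i in range(24)]
    r.zunionstore(f"rankings:{category}:daily", sources)
```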
Batch Layer Schema (for accurate historical rankings):
```sql
-- Precomputed rankings (updated by batch job)
-- Columns after category_id are assumed for illustration
CREATE TABLE rankings (
    category_id  VARCHAR(100),
    window_type  VARCHAR(20),      -- hourly / daily / weekly / monthly / all_time
    item_id      VARCHAR(100),
    rank         INT,
    event_count  BIGINT,
    computed_at  TIMESTAMP,
    PRIMARY KEY (category_id, window_type, item_id)
);
```

Algorithm Deep Dive
Let me explain the key algorithms for maintaining and querying rankings efficiently.
1. Streaming Aggregation with Flink/Spark
```python
# Stream processing pseudocode (sketch; Flink/Spark specifics and event fields are assumed)
from collections import defaultdict

WINDOWS = ["hourly", "daily", "weekly", "monthly", "all_time"]

class RankingAggregator:
    """Pre-aggregates events in memory per micro-batch, then flushes to Redis."""

    def __init__(self, redis_client):
        self.redis = redis_client
        self.pending = defaultdict(int)  # (category, window, item_id) -> count

    def on_event(self, event):
        # One event touches every window, for each of its categories plus the global chart
        for category in event["categories"] + ["global"]:
            for window in WINDOWS:
                self.pending[(category, window, event["item_id"])] += 1

    def flush(self):
        # Called at the end of each micro-batch (every 1-5 minutes)
        pipe = self.redis.pipeline()
        for (category, window, item_id), count in self.pending.items():
            pipe.zincrby(f"rankings:{category}:{window}", count, item_id)
        pipe.execute()
        self.pending.clear()
```

2. Efficient Top-K Retrieval
```python
import time
import redis

class RankingService:
    def __init__(self):
        self.redis = redis.Redis()
        self.cache_ttl = 60   # Cache for 60 seconds
        self.cache = {}       # (category, window, k) -> (expires_at, result)

    def get_top_k(self, category, window, k=100):
        cache_key = (category, window, k)
        cached = self.cache.get(cache_key)
        if cached and cached[0] > time.time():
            return cached[1]
        # ZREVRANGE returns the k highest-scoring items with their counts: O(log N + k)
        result = self.redis.zrevrange(
            f"rankings:{category}:{window}", 0, k - 1, withscores=True
        )
        self.cache[cache_key] = (time.time() + self.cache_ttl, result)
        return result
```

3. Handling Hot Items (Celebrity Problem)
When one item gets massive traffic (viral app), it can overwhelm a single Redis shard.
```python
from collections import defaultdict
import time

class HotItemBuffer:
    """Sketch: absorb bursts for viral items locally and flush combined counts periodically,
    so one hot item becomes a handful of Redis writes per interval instead of millions."""

    def __init__(self, redis_client, flush_interval=5.0):
        self.redis = redis_client
        self.flush_interval = flush_interval   # seconds between flushes (assumed value)
        self.counts = defaultdict(int)         # (ranking_key, item_id) -> buffered count
        self.last_flush = time.time()

    def increment(self, ranking_key, item_id, amount=1):
        self.counts[(ranking_key, item_id)] += amount
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        pipe = self.redis.pipeline()
        for (ranking_key, item_id), count in self.counts.items():
            pipe.zincrby(ranking_key, count, item_id)
        pipe.execute()
        self.counts.clear()
        self.last_flush = time.time()
```
Hot Item Aggregation Flow
Consistency and Invariants
System Invariants
Rankings must be monotonically consistent: if item A has more events than item B in a time window, A must rank higher than B. Brief inconsistency (seconds) is acceptable during updates.
Consistency Model:
We use eventual consistency with bounded staleness:
- Rankings are eventually consistent (may be seconds to minutes stale)
- Staleness is bounded by the aggregation interval (e.g., 5 minutes max)
- Strong consistency within a single sorted set (Redis guarantees this)
| Scenario | Consistency Guarantee | Why Acceptable |
|---|---|---|
| Top-K query | Eventual (seconds stale) | Users do not notice if rank 5 vs 6 |
| Event counting | Eventual (aggregation window) | Micro-batching is sufficient |
| Cross-category | Independent | Each category ranking is isolated |
| Time window boundary | Brief inconsistency | Old window expires, new starts |
Business impact mapping
If an app briefly shows as rank 5 instead of rank 4, no business impact. If rankings are hours stale during a viral moment, we lose relevance. We optimize for freshness, not perfect accuracy.
Handling Race Conditions:
Redis ZINCRBY is atomic, so concurrent increments are safe. The main race condition is at window boundaries:
```python
import redis

r = redis.Redis()

def rotate_window(category, old_window, new_window):
    # 1. Create new window (empty sorted set auto-created on first ZINCRBY)
    #    Events now write to the new window's key
    # 2. Atomically repoint readers at the new window
    #    (assumes a small "current window" pointer key that queries resolve first)
    r.set(f"rankings:{category}:current", new_window)
    # 3. Keep the old window readable for a short grace period so in-flight
    #    reads do not fail, then let it expire
    r.expire(f"rankings:{category}:{old_window}", 300)
```

Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios and mitigations. Rankings should degrade gracefully, never completely fail.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Redis down | Cannot update or query rankings | Multi-level cache + batch fallback | Serve stale data from cache or DB |
| Kafka lag | Rankings become stale | Monitor lag, alert if > threshold | Staleness is visible, not silent |
Multi-Level Caching for Resilience:
```python
import time
import redis

class ResilientRankingService:
    """Sketch: serve from the freshest layer available, degrading to stale data on failure."""

    def __init__(self, batch_store, local_ttl=30):
        self.redis = redis.Redis()
        self.batch_store = batch_store   # assumed interface to the batch layer's rankings
        self.local_ttl = local_ttl
        self.local_cache = {}            # (category, window) -> (expires_at, result)

    def get_top_k(self, category, window, k):
        # Level 1: Local in-memory cache (fastest)
        cached = self.local_cache.get((category, window))
        if cached and cached[0] > time.time():
            return cached[1][:k]
        # Level 2: Redis sorted sets (near-real-time rankings)
        try:
            result = self.redis.zrevrange(
                f"rankings:{category}:{window}", 0, k - 1, withscores=True
            )
            self.local_cache[(category, window)] = (time.time() + self.local_ttl, result)
            return result
        except Exception:
            pass  # Redis unavailable or timing out; fall through
        # Level 3: Stale local cache if present, else the accurate-but-stale batch layer
        if cached:
            return cached[1][:k]
        return self.batch_store.get_top_k(category, window, k)
```

Monitoring and Alerting:
Freshness:
- ranking_staleness_seconds: Time since the last ranking update
- kafka_consumer_lag: Events waiting to be processed

Evolution and Scaling
What to say
This design handles millions of items and billions of events. Let me discuss how it evolves for different scale points and additional requirements.
Evolution Path:
Stage 1: Simple (up to 100K items, 10M events/day)
- Single Redis instance with sorted sets
- Cron job for batch aggregation
- Direct queries, simple caching

Stage 2: Streaming (up to 10M items, 1B events/day)
- Kafka + Flink for streaming aggregation
- Redis Cluster for sharded sorted sets
- Multi-level caching

Stage 3: Global (100M+ items, multi-region)
- Regional processing with global aggregation
- Approximate algorithms for the long tail
- Pre-computed rankings in a CDN
| Scale | Architecture | Key Change |
|---|---|---|
| < 100K items | Single Redis + Cron | Simple and sufficient |
| 100K - 10M items | Streaming + Redis Cluster | Add Kafka and Flink |
| 10M - 100M items | Add approximate counting | Approximate counters (e.g., Count-Min Sketch) for the long tail |
| > 100M items | Tiered storage | Hot items in Redis, cold in DB |
Advanced Extensions:
1. Personalized Rankings
Combine global popularity with user preferences:
```python
import math
import redis

r = redis.Redis()

def personalized_score(item_id, user_id, category):
    # Global popularity: the item's count in this category's daily chart
    global_count = r.zscore(f"rankings:{category}:daily", item_id) or 0
    # User affinity for this category, e.g. from a preference model
    # (get_user_affinity is an assumed helper returning a value in [0, 1])
    affinity = get_user_affinity(user_id, category)
    # Log-dampen raw counts so popularity does not drown out personalization
    return math.log1p(global_count) * (1 + affinity)
```

2. Trending vs Bestseller
Trending = velocity of growth, Bestseller = absolute count:
```python
# Uses the same redis client "r" as the previous snippet
def trending_score(item_id, category="global"):
    # Trending = how fast count is growing
    # Compare recent period to baseline (here: last hour vs the average hour of the last day)
    recent = r.zscore(f"rankings:{category}:hourly", item_id) or 0
    baseline = r.zscore(f"rankings:{category}:daily", item_id) or 0
    expected_per_hour = max(baseline / 24.0, 1.0)  # floor avoids divide-by-zero for new items
    return recent / expected_per_hour
```

Alternative approach
If we needed sub-second freshness for trending topics (like Twitter), I would use a different architecture: in-memory stream processing (like Apache Kafka Streams or Flink with RocksDB state) with direct serving from the stream processor, bypassing Redis entirely.
What I would do differently for...
Gaming leaderboards (exact ranks matter): Use database with proper transactions, accept higher latency. Players care about exact rank.
News/social trending: Sub-minute freshness needed. Use Kafka Streams with in-memory state, serve directly from stream processor.
E-commerce (revenue critical): Double-write to both streaming and batch, reconcile discrepancies. Cannot afford to miss sales data.