System Design Masterclass
Tags: search, ranking, leaderboard, real-time aggregation, Redis. Level: intermediate

Design Top-K Rankings System

Design App Store / Amazon Bestsellers rankings at scale

Millions of items, billions of events/day | Similar to Apple, Amazon, Google Play, Spotify, Netflix | 45 min read

Summary

Design a system that ranks millions of items (apps, products, articles) based on sales, downloads, or engagement metrics. The core challenge is maintaining accurate real-time rankings while handling massive write throughput from events and serving low-latency read queries for the top items.

Key Takeaways

Core Problem

This is fundamentally a streaming aggregation problem - counting events in time windows and maintaining sorted order of millions of items.

The Hard Part

Maintaining accurate rankings in real-time when millions of events arrive per second, while serving low-latency queries for top items.

Scaling Axis

Scale writes by partitioning events by item ID. Scale reads by caching top-K results aggressively since they change slowly.

Critical Invariant

Rankings must be monotonically consistent - an item with more sales must never appear below an item with fewer sales in the same time window.

Performance Insight

Top-K queries must return in under 50ms. Most users only view top 100 items, so optimize for that case.

Key Tradeoff

We trade real-time accuracy for performance - rankings may be 1-5 minutes stale, which is acceptable for bestseller lists.

Design Walkthrough

Problem Statement

The Question: Design a ranking system like App Store Top Charts or Amazon Bestsellers that can rank millions of items based on sales/downloads in various time windows (hourly, daily, weekly).

Real-world examples:

- App Store: Top Free, Top Paid, Top Grossing apps per category
- Amazon: Bestsellers updated hourly across thousands of categories
- Spotify: Top 50 charts per country, updated daily
- Reddit: Hot posts ranking across thousands of subreddits
- YouTube: Trending videos per region

What to say first

Before designing, I want to understand the scale, freshness requirements, and what metrics drive the ranking. The approach differs significantly based on whether we need real-time or can tolerate minutes of staleness.

Hidden requirements interviewers test:

- Can you handle high write throughput (millions of events)?
- Do you understand time-windowed aggregation?
- Can you design for both real-time and batch scenarios?
- Do you know data structures for maintaining sorted sets efficiently?

Clarifying Questions

These questions dramatically shape the architecture. Ask them early.

Question 1: Scale

How many items do we rank? How many events (purchases/downloads) per day?

Why this matters: Determines if we can use simple database queries or need streaming infrastructure.
Typical answer: 10M items, 1B events/day (roughly 12K events/second average, 100K peak)
Architecture impact: Need streaming aggregation, cannot query raw events

Question 2: Freshness

How fresh must rankings be? Real-time, or can they be minutes/hours old?

Why this matters: Real-time requires streaming; batch is simpler but stale.
Typical answer: Updated every few minutes is acceptable (Amazon updates hourly)
Architecture impact: Can use micro-batch processing instead of true streaming

Question 3: Time Windows

What time windows do we need? Last hour, day, week, all-time?

Why this matters: Multiple windows multiply storage and computation.
Typical answer: Hourly, daily, weekly, monthly, all-time
Architecture impact: Need efficient window management, possibly different strategies per window

Question 4: Categories

Do we rank globally or per category? How many categories?

Why this matters: Per-category multiplies the number of rankings to maintain.
Typical answer: Both global and per-category, 10K categories
Architecture impact: 10K categories x 5 time windows = 50K separate rankings

Question 5: Ranking Formula

Simple count, or weighted formula (recency, engagement quality)?

Why this matters: Complex formulas require more computation.
Typical answer: Start with simple count, mention weighted as extension
Architecture impact: Simple aggregation vs complex scoring pipeline

Stating assumptions

I will assume: 10M items, 1B events/day, rankings updated every 5 minutes, 5 time windows, 10K categories, simple count-based ranking initially.

The Hard Part

Say this out loud

The hard part is maintaining accurate aggregated counts for millions of items across multiple time windows while handling 100K events per second at peak.

Why this is genuinely hard:

  1. Write Amplification: Each event must update multiple time windows (hourly, daily, weekly, etc.). One event becomes 5 writes.
  2. Time Window Management: Sliding windows are expensive. How do you efficiently expire old data from a window?
  3. Sorting at Scale: Maintaining sorted order of 10M items is expensive. You cannot sort on every read.
  4. Hot Items: Top items (viral apps) get disproportionate traffic. A single item might get 1M events/hour.
  5. Category Fan-out: One item belongs to multiple categories. Each event updates rankings in all its categories.

Common mistake

Candidates often propose SELECT item_id, COUNT(*) FROM events GROUP BY item_id ORDER BY COUNT(*) DESC. This works for thousands of items, not for millions of items with billions of events.

The fundamental insight:

You do not need to maintain exact rankings for ALL items. Users only ever see:

- The top 100-200 items in a category
- Maybe the ranking of a specific item they search for

This means we can:

- Maintain precise rankings only for top items
- Use approximate counts for the long tail
- Cache aggressively since top items change slowly

The sketch below makes the top-items-only idea concrete.
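A minimal pure-Python sketch (item IDs and counts are made up) showing why extracting the top K is much cheaper than fully sorting all items:

import heapq
from collections import Counter

def top_k(counts: Counter, k: int = 100) -> list:
    # heapq.nlargest runs in O(N log K): we never fully sort all
    # 10M items, we only materialize the K entries users will see.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

counts = Counter({"app_1": 5_000, "app_2": 12_000, "app_3": 800})
print(top_k(counts, k=2))  # [('app_2', 12000), ('app_1', 5000)]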

Scale and Access Patterns

Let me quantify the scale and understand access patterns.

Dimension | Value | Impact
Total Items | 10,000,000 | Cannot sort all items on every query
Events per Day | 1,000,000,000 | 12K/sec avg, 100K/sec peak - need streaming
+ 4 more rows...

What to say

At 1B events/day with 5 time windows, we are looking at 5B counter updates per day. This is clearly a streaming aggregation problem, not a batch query problem.

Access Pattern Analysis:

Writes (Events):
- High throughput: 100K/sec peak
- Bursty: product launches, viral moments
- Each event: increment counter for item in multiple windows/categories

Reads (Rankings):
- Top-K queries: get top 100 items in category X for window Y
- Specific rank: what rank is item Z in category X?
- Highly cacheable: top items change slowly (minutes, not seconds)

Storage for counters:
- 10M items x 5 windows x 8 bytes = 400 MB (fits in memory easily)
- With 10K categories: 10M x 5 x 10K x 8 bytes = 4 TB (too much!)
+ 10 more lines...

High-Level Architecture

I will design a Lambda architecture that combines streaming for freshness with batch for accuracy.

What to say

We use a streaming layer for near-real-time rankings and a batch layer for accurate historical aggregation. Results are merged at query time. We scale writes by partitioning events, and reads by caching top-K results.

Top-K Rankings Architecture

Component Responsibilities:

  1. Kafka: Durable event stream, enables replay, decouples producers from consumers
  2. Stream Processing (Flink/Spark Streaming):
     - Aggregates events in micro-batches (every 1-5 minutes)
     - Maintains windowed counts per item/category
     - Updates Redis with latest counts
  3. Redis Sorted Sets:
     - Stores current rankings per category/window
     - ZADD/ZINCRBY for atomic updates
     - ZREVRANGE for top-K queries
  4. Batch Layer (Spark):
     - Processes full event history daily
     - Produces accurate rankings (corrects any streaming drift)
     - Handles complex ranking formulas
  5. Serving API:
     - Queries Redis for real-time rankings
     - Falls back to batch results if streaming lags
     - Caches responses aggressively

Real-world reference

Amazon updates bestseller rankings hourly using batch jobs. Apple App Store uses a hybrid approach with streaming for trending and batch for official charts. Spotify runs batch jobs every 24 hours for their official charts.

Data Model and Storage

Redis Sorted Sets are the natural choice for maintaining rankings - they provide O(log N) inserts and O(log N + K) top-K retrieval.

What to say

Redis Sorted Sets give us exactly what we need: atomic score increments, automatic sorting, and efficient top-K queries. The key design is: rankings:{category}:{window}.

Key pattern: rankings:{category_id}:{window_type}

Examples:
+ 13 more lines...
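A minimal sketch of these operations, assuming a local Redis server and the redis-py client (the window suffix and item IDs are illustrative):

import redis  # assumes a local Redis server and the redis-py client

r = redis.Redis(decode_responses=True)

def record_event(item_id: str, category_id: str, window: str) -> None:
    # ZINCRBY is atomic, so concurrent writers never lose increments.
    r.zincrby(f"rankings:{category_id}:{window}", 1, item_id)

def get_top_k(category_id: str, window: str, k: int = 100):
    # ZREVRANGE returns the K highest-scored members in O(log N + K).
    return r.zrevrange(f"rankings:{category_id}:{window}",
                       0, k - 1, withscores=True)

record_event("app_42", "games", "daily:2024-01-15")
print(get_top_k("games", "daily:2024-01-15"))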

Window Management Strategy:

For time windows, we have two approaches:

Approach | How It Works | Pros | Cons
Tumbling Windows | Reset counter at window boundary | Simple, low memory | Rankings jump at boundaries
Sliding Windows | Track events with timestamps, sum recent | Smooth rankings | High memory, complex expiry
Hybrid (Bucketed) | Multiple small buckets, sum for window | Balance of both | Slightly complex

# For daily ranking, maintain 24 hourly buckets
# Daily count = sum of last 24 hourly buckets
+ 21 more lines...
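The bucketed code above is truncated; here is a hedged sketch of the hybrid approach, assuming redis-py (key names and TTLs are illustrative):

import redis  # sketch assuming redis-py; key names and TTLs are illustrative

r = redis.Redis(decode_responses=True)

def record_event(item_id: str, category: str, hour_bucket: str) -> None:
    # Write only to the current hourly bucket, with a TTL slightly
    # longer than the largest window that will sum it.
    key = f"rankings:{category}:hour:{hour_bucket}"
    r.zincrby(key, 1, item_id)
    r.expire(key, 25 * 3600)

def materialize_daily(category: str, hour_buckets: list) -> None:
    # ZUNIONSTORE sums scores across the 24 hourly buckets by default,
    # yielding a sliding daily ranking without per-event expiry tracking.
    keys = [f"rankings:{category}:hour:{h}" for h in hour_buckets]
    r.zunionstore(f"rankings:{category}:daily", keys)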

Batch Layer Schema (for accurate historical rankings):

-- Precomputed rankings (updated by batch job)
CREATE TABLE rankings (
    category_id     VARCHAR(100),
+ 12 more lines...

Algorithm Deep Dive

Let me explain the key algorithms for maintaining and querying rankings efficiently.

1. Streaming Aggregation with Flink/Spark

# Stream processing pseudocode

class RankingAggregator:
+ 25 more lines...
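The full job is elided above; as a simplified stand-in for the Flink/Spark logic, here is a hedged micro-batch aggregator sketch (class and parameter names are hypothetical):

import time
from collections import defaultdict

class MicroBatchAggregator:
    # Buffers events locally, then flushes aggregated deltas to Redis
    # every interval - collapsing many raw events into few key updates.
    def __init__(self, redis_client, flush_interval_sec: int = 60):
        self.r = redis_client
        self.flush_interval = flush_interval_sec
        self.buffer = defaultdict(int)  # (category, window, item) -> delta
        self.last_flush = time.monotonic()

    def on_event(self, item_id, category_id, window):
        self.buffer[(category_id, window, item_id)] += 1
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One pipelined round trip instead of one ZINCRBY per raw event.
        pipe = self.r.pipeline(transaction=False)
        for (category, window, item), delta in self.buffer.items():
            pipe.zincrby(f"rankings:{category}:{window}", delta, item)
        pipe.execute()
        self.buffer.clear()
        self.last_flush = time.monotonic()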

2. Efficient Top-K Retrieval

class RankingService:
    def __init__(self):
        self.cache_ttl = 60  # Cache for 60 seconds
+ 41 more lines...
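A hedged sketch of the read path with the 60-second cache shown above (class and variable names are illustrative, redis-py assumed):

import time

class CachedRankingService:
    def __init__(self, redis_client, cache_ttl: int = 60):
        self.r = redis_client
        self.cache_ttl = cache_ttl  # cache for 60 seconds
        self._cache = {}            # (category, window, k) -> (expiry, rows)

    def get_top_k(self, category, window, k=100):
        key = (category, window, k)
        hit = self._cache.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]  # top lists change slowly; 60s staleness is fine
        rows = self.r.zrevrange(f"rankings:{category}:{window}",
                                0, k - 1, withscores=True)
        self._cache[key] = (time.monotonic() + self.cache_ttl, rows)
        return rows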

3. Handling Hot Items (Celebrity Problem)

When one item gets massive traffic (viral app), it can overwhelm a single Redis shard.

from collections import defaultdict
import time
+ 30 more lines...
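One common mitigation (not necessarily what the elided code above does) is to split a hot key across several sibling keys and merge them at materialization time; a hedged sketch assuming redis-py:

import random
import redis  # key scheme and split count are illustrative

r = redis.Redis(decode_responses=True)
NUM_SPLITS = 8  # spread one hot ranking's writes over 8 keys

def record_hot_event(item_id: str, ranking_key: str) -> None:
    # Writes fan out over NUM_SPLITS sibling keys, so no single Redis
    # shard absorbs the whole 1M-events/hour burst for a viral item.
    split = random.randrange(NUM_SPLITS)
    r.zincrby(f"{ranking_key}:split:{split}", 1, item_id)

def merge_splits(ranking_key: str) -> None:
    # Periodically sum the splits back into the main sorted set that
    # serves top-K queries (ZUNIONSTORE sums scores by default).
    keys = [f"{ranking_key}:split:{i}" for i in range(NUM_SPLITS)]
    r.zunionstore(ranking_key, keys)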

Hot Item Aggregation Flow

Consistency and Invariants

System Invariants

Rankings must be monotonically consistent: if item A has more events than item B in a time window, A must rank higher than B. Brief inconsistency (seconds) is acceptable during updates.

Consistency Model:

We use eventual consistency with bounded staleness:

- Rankings are eventually consistent (may be seconds to minutes stale)
- Staleness is bounded by the aggregation interval (e.g., 5 minutes max)
- Strong consistency within a single sorted set (Redis guarantees this)

Scenario | Consistency Guarantee | Why Acceptable
Top-K query | Eventual (seconds stale) | Users do not notice rank 5 vs 6
Event counting | Eventual (aggregation window) | Micro-batching is sufficient
Cross-category | Independent | Each category ranking is isolated
Time window boundary | Brief inconsistency | Old window expires, new starts

Business impact mapping

If an app briefly shows as rank 5 instead of rank 4, no business impact. If rankings are hours stale during a viral moment, we lose relevance. We optimize for freshness, not perfect accuracy.

Handling Race Conditions:

Redis ZINCRBY is atomic, so concurrent increments are safe. The main race condition is at window boundaries:

def rotate_window(category, old_window, new_window):
    # 1. Create new window (empty sorted set auto-created on first ZINCRBY)
    # Events now write to new window
+ 8 more lines...
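A hedged completion of the rotation logic (the current-window pointer via GETSET is my assumption, not necessarily the elided code):

import redis  # sketch assuming redis-py

r = redis.Redis(decode_responses=True)

def rotate_window(category: str, new_window: str, grace_sec: int = 600) -> None:
    pointer = f"rankings:{category}:current"
    old_window = r.getset(pointer, new_window)  # atomic pointer swap
    # Writers read the pointer and ZINCRBY into the new key; the empty
    # sorted set is created implicitly on the first increment, so there
    # is no setup step to race against.
    if old_window:
        # Keep the old window queryable briefly so caches and in-flight
        # readers drain, then let it expire rather than deleting it.
        r.expire(f"rankings:{category}:{old_window}", grace_sec)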

Failure Modes and Resilience

Proactively discuss failures

Let me walk through failure scenarios and mitigations. Rankings should degrade gracefully, never completely fail.

Failure | Impact | Mitigation | Why It Works
Redis down | Cannot update or query rankings | Multi-level cache + batch fallback | Serve stale data from cache or DB
Kafka lag | Rankings become stale | Monitor lag, alert if > threshold | Staleness is visible, not silent
+ 4 more rows...

Multi-Level Caching for Resilience:

class ResilientRankingService:
    def get_top_k(self, category, window, k):
        # Level 1: Local in-memory cache (fastest)
+ 25 more lines...
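A hedged sketch of the fallback chain (local memory, then Redis, then batch results); batch_store.get_rankings is a hypothetical interface to the precomputed rankings table:

import time

class MultiLevelCache:
    def __init__(self, redis_client, batch_store, local_ttl: int = 30):
        self.r = redis_client
        self.batch_store = batch_store  # hypothetical DB-backed batch rankings
        self.local = {}                 # key -> (expires_at, rows)
        self.local_ttl = local_ttl

    def get_top_k(self, category, window, k=100):
        key = (category, window, k)
        hit = self.local.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]                          # Level 1: in-process cache
        try:
            rows = self.r.zrevrange(f"rankings:{category}:{window}",
                                    0, k - 1, withscores=True)
            if rows:
                self.local[key] = (time.monotonic() + self.local_ttl, rows)
                return rows                        # Level 2: live Redis rankings
        except Exception:
            pass                                   # Redis down: fall through
        # Level 3: stale-but-accurate batch results; stale beats empty.
        return self.batch_store.get_rankings(category, window, k)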

Monitoring and Alerting:

Freshness:
- ranking_staleness_seconds: Time since last update
- kafka_consumer_lag: Events waiting to be processed
+ 16 more lines...

Evolution and Scaling

What to say

This design handles millions of items and billions of events. Let me discuss how it evolves for different scale points and additional requirements.

Evolution Path:

Stage 1: Simple (up to 100K items, 10M events/day)
- Single Redis instance with sorted sets
- Cron job for batch aggregation
- Direct queries, simple caching

Stage 2: Streaming (up to 10M items, 1B events/day)
- Kafka + Flink for streaming aggregation
- Redis Cluster for sharded sorted sets
- Multi-level caching

Stage 3: Global (100M+ items, multi-region)
- Regional processing with global aggregation
- Approximate algorithms for the long tail
- Pre-computed rankings in CDN

Scale | Architecture | Key Change
< 100K items | Single Redis + Cron | Simple and sufficient
100K - 10M items | Streaming + Redis Cluster | Add Kafka and Flink
10M - 100M items | Add approximate counting | Use HyperLogLog for long-tail
> 100M items | Tiered storage | Hot items in Redis, cold in DB

Advanced Extensions:

1. Personalized Rankings

Combine global popularity with user preferences:

import math

def personalized_score(item_id, user_id, category):
+ 14 more lines...
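The full formula is elided above; one plausible blend (weights and shape are my assumption) log-damps global popularity and scales it by the user's category affinity:

import math

def personalized_score(global_count: float, user_affinity: float,
                       w: float = 0.7) -> float:
    # Hypothetical blend: log-damp the raw count so megahits do not
    # drown everything, then scale by the user's affinity in [0, 1].
    return math.log1p(global_count) * ((1.0 - w) + w * user_affinity)

# A niche item with high affinity can outrank a low-affinity megahit:
print(personalized_score(1_000_000, 0.10))  # ~5.1
print(personalized_score(5_000, 0.95))      # ~8.2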

2. Trending vs Bestseller

Trending = velocity of growth, Bestseller = absolute count:

def trending_score(item_id):
    # Trending = how fast count is growing
    # Compare recent period to baseline
+ 11 more lines...
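A hedged sketch of a velocity score (the smoothing constant is my assumption): compare the recent window's count to a per-period baseline:

def trending_score(recent_count: float, baseline_count: float,
                   smoothing: float = 10.0) -> float:
    # Velocity, not volume: ratio of the recent count to the baseline.
    # Additive smoothing keeps tiny items whose counts jump from
    # 0 to 5 from dominating the trending chart.
    return (recent_count + smoothing) / (baseline_count + smoothing)

# An item growing 100 -> 1000/hour trends harder than a flat 10000/hour:
print(trending_score(1000, 100))      # ~9.2
print(trending_score(10000, 10000))   # 1.0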

Alternative approach

If we needed sub-second freshness for trending topics (like Twitter), I would use a different architecture: in-memory stream processing (like Apache Kafka Streams or Flink with RocksDB state) with direct serving from the stream processor, bypassing Redis entirely.

What I would do differently for...

Gaming leaderboards (exact ranks matter): Use database with proper transactions, accept higher latency. Players care about exact rank.

News/social trending: Sub-minute freshness needed. Use Kafka Streams with in-memory state, serve directly from stream processor.

E-commerce (revenue critical): Double-write to both streaming and batch, reconcile discrepancies. Cannot afford to miss sales data.

Design Trade-offs

Advantages

+ Simple architecture
+ Accurate results
+ Easy to debug

Disadvantages

- Stale rankings (hours old)
- High latency for updates
- Cannot handle real-time trending

When to use

When hourly updates are acceptable (e.g., Amazon bestsellers)