Design Walkthrough
Problem Statement
The Question: Design an analytics platform like Google Analytics that can track user behavior across websites and mobile apps, processing billions of events per day.
The system must support:
- Event ingestion from millions of websites/apps
- Real-time dashboards showing current active users
- Historical reports with custom date ranges and dimensions
- Funnel analysis tracking user journeys
- Segmentation by user properties and behaviors
What to say first
Before diving in, let me clarify the requirements. I want to understand the scale, latency needs for different use cases, and what types of queries we need to support.
What interviewers are testing:
- Can you design for massive write throughput?
- Do you understand the tradeoffs between real-time and batch processing?
- Can you optimize storage for both cost and query performance?
- Do you know when to use pre-aggregation vs raw data?
Clarifying Questions
These questions shape fundamental architecture decisions.
Question 1: Scale
What is the expected event volume? How many customers (websites/apps) will we support?
Why this matters: Determines if we need distributed ingestion and storage.
Typical answer: 10B events/day, 100K customers.
Architecture impact: Need Kafka for ingestion, distributed storage, and a partitioning strategy.
Question 2: Latency Requirements
What latency is acceptable for real-time dashboards vs historical reports?
Why this matters: Real-time requires stream processing; batch can use cheaper methods.
Typical answer: Real-time within 1 minute, historical queries under 2 seconds.
Architecture impact: Lambda architecture with hot and cold paths.
Question 3: Query Patterns
What types of queries will users run? Pre-defined dashboards or ad-hoc analysis?
Why this matters: Pre-defined queries can be pre-aggregated; ad-hoc needs raw data.
Typical answer: 80% standard dashboards, 20% custom queries.
Architecture impact: Pre-aggregate common dimensions, keep raw data for flexibility.
Question 4: Data Retention
How long do we need to keep data? Different retention for different granularities?
Why this matters: Storage costs grow linearly with retention.
Typical answer: Raw events 30 days, hourly aggregates 1 year, daily aggregates forever.
Architecture impact: Tiered storage with rollup jobs.
State your assumptions
I will assume: 10B events/day, 100K customers, 1-minute real-time latency, 2-second query latency, 80/20 standard/ad-hoc queries, tiered retention (30d raw, 1y hourly, unlimited daily).
The Hard Part
Say this out loud
The hard part here is serving both real-time and historical analytics from the same system while keeping costs manageable at petabyte scale.
Why this is genuinely hard:
1. Write vs Read Optimization: High write throughput wants append-only logs. Fast queries want columnar, indexed storage. These are fundamentally different.
2. Real-time vs Batch: Real-time needs stream processing (expensive, complex). Batch is cheaper but delayed. Users want both.
3. Cost at Scale: 10B events/day = 300B events/month. Storing and querying petabytes is expensive. Need smart aggregation and tiering.
4. Query Flexibility: Users want to slice data by any dimension. Pre-aggregating every combination is impossible (combinatorial explosion).
Common mistake
Candidates often design only for real-time OR batch. Production analytics systems need both - real-time for current state, batch for accurate historical analysis.
The Lambda Architecture Solution:
Process data through two parallel paths:
1. Speed Layer (Real-time): Stream processing for immediate results; approximate, recent data only
2. Batch Layer (Accurate): Periodic batch jobs for accurate historical data; handles late-arriving events
3. Serving Layer: Merges results from both layers for queries
Scale & Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Events per day | 10 billion | ~115K events/second sustained, need distributed ingestion |
| Event size | ~500 bytes average | 5 TB/day raw data, 150 TB/month |
What to say
At 10B events/day, we are ingesting 115K events per second. This is write-heavy during ingestion but read-heavy during query time. We need to optimize for both.
Storage estimation:
- Raw events: 10B events x 500 bytes = 5 TB/day
- Monthly raw: 5 TB x 30 days = 150 TB

Access Patterns:
- Writes: Append-only, high throughput, can be batched client-side
- Reads: Time-range scans, aggregations by dimensions, mostly recent data
- Hot data: Last 24 hours queried 100x more than older data
- Seasonality: Traffic spikes during business hours, marketing campaigns
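As a quick sanity check on these numbers, a back-of-envelope script; the figures mirror the assumptions above, and the 3x peak factor is a rule-of-thumb assumption rather than a measured value:

```python
# Back-of-envelope capacity estimates from the stated assumptions.
EVENTS_PER_DAY = 10_000_000_000
EVENT_SIZE_BYTES = 500
SECONDS_PER_DAY = 86_400

sustained_eps = EVENTS_PER_DAY / SECONDS_PER_DAY             # ~115,740 events/sec
raw_per_day_tb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e12    # ~5 TB/day
raw_per_month_tb = raw_per_day_tb * 30                       # ~150 TB/month

# Traffic is spiky (business hours, campaigns); provisioning for ~3x sustained
# is a common rule of thumb, not a figure from this design.
peak_eps = sustained_eps * 3

print(f"sustained: {sustained_eps:,.0f} ev/s, peak (~3x): {peak_eps:,.0f} ev/s")
print(f"raw: {raw_per_day_tb:.1f} TB/day, {raw_per_month_tb:.0f} TB/month")
```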
High-Level Architecture
What to say
I will design a Lambda architecture with separate speed and batch layers, unified by a serving layer. We scale by partitioning on customer_id and timestamp.
Analytics Pipeline Architecture
Component Responsibilities:
1. Event Ingestion
   - SDKs batch events client-side (reduce requests)
   - Event API validates, enriches, and forwards to Kafka
   - API is stateless, horizontally scalable
2. Message Buffer (Kafka)
   - Decouples ingestion from processing
   - Provides durability (events survive processor failures)
   - Enables replay for reprocessing
3. Speed Layer (Flink)
   - Processes events in real-time
   - Maintains running counts for last N minutes
   - Writes to fast store (Redis/Druid)
4. Batch Layer (Spark)
   - Runs hourly/daily jobs on raw events
   - Produces accurate aggregates
   - Handles late-arriving events
5. Serving Layer
   - Merges real-time and batch results
   - Caches frequent queries
   - Provides unified query API
Real-world reference
Google Analytics uses a similar architecture. Mixpanel uses Kafka plus ClickHouse. Amplitude uses a custom streaming pipeline with Druid for real-time and S3/Spark for batch.
Data Model & Storage
We need different storage systems optimized for different access patterns.
What to say
The source of truth is raw events in object storage. We materialize views at different granularities for query performance. Storage choice depends on query pattern.
| Storage | Purpose | Technology | Retention |
|---|---|---|---|
| Raw Events | Source of truth, replay, ad-hoc | S3/GCS (Parquet) | 30 days |
| Real-time Store | Last 24h, fast updates | Redis/Druid | 24 hours |
| OLAP Warehouse | Historical queries, aggregates | ClickHouse/BigQuery | Forever |
| Pre-aggregates | Common dashboard queries | ClickHouse materialized views | Forever |
| Query Cache | Repeated queries | Redis | TTL 5 min |
CREATE TABLE events (
    event_id UUID,
    customer_id UInt64,
    timestamp DateTime,
    event_type String              -- remaining columns omitted; the original DDL was truncated here
) ENGINE = MergeTree() ORDER BY (customer_id, timestamp);

Why ClickHouse/Columnar Storage?
- Analytics queries scan few columns across many rows
- Columnar storage reads only needed columns (10x less I/O)
- Compression is excellent for repeated values
- Vectorized execution for aggregations
-- Hourly aggregates by common dimensions
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree() ORDER BY (customer_id, event_type, hour)
AS SELECT customer_id, event_type, toStartOfHour(timestamp) AS hour,
    count() AS event_count         -- SELECT body reconstructed; the original statement was truncated here
FROM events GROUP BY customer_id, event_type, hour;

Aggregation tradeoff
We cannot pre-aggregate every dimension combination - 10 dimensions with 100 values each = 100^10 combinations. We pre-aggregate the most common 5-6 dimension combinations and fall back to raw data for rare queries.
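A minimal sketch of how the query planner could decide between a pre-aggregate and the raw table. The view names and the `PREAGGREGATED` registry are illustrative assumptions, not the actual catalog:

```python
# Hypothetical registry of pre-aggregated dimension sets and the table serving each.
PREAGGREGATED = {
    frozenset({"event_type"}): "events_hourly_mv",
    frozenset({"event_type", "country"}): "events_hourly_by_country_mv",
    frozenset({"event_type", "country", "device"}): "events_hourly_by_country_device_mv",
}

def pick_table(requested_dimensions: set[str]) -> str:
    """Route to the narrowest materialized view that covers the request, else the raw table."""
    candidates = [
        (len(dims), table)
        for dims, table in PREAGGREGATED.items()
        if requested_dimensions <= dims            # the view must contain every requested dimension
    ]
    return min(candidates)[1] if candidates else "events"   # fall back to raw events

# A dashboard grouping by event_type and country hits a view; a rare ad-hoc
# query on an unusual dimension falls back to raw data.
assert pick_table({"event_type", "country"}) == "events_hourly_by_country_mv"
assert pick_table({"browser_version"}) == "events"
```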
Event Ingestion Deep Dive
The ingestion pipeline must handle 100K+ events/second without losing data.
Event Ingestion Flow
async def handle_events(request: EventBatchRequest) -> Response:
    customer_id = authenticate(request.api_key)
    ...  # validation, enrichment, and the Kafka publish follow; see the fuller sketch below
Key Design Decisions:
1. Client-side batching: SDKs batch events for 5 seconds or 10 events before sending. Reduces API calls by 10x.
2. Async acknowledgment: API returns success after Kafka ACK, not after processing. Decouples ingestion from processing.
3. Partition by customer: Events for the same customer go to the same Kafka partition. Enables per-customer ordering and processing.
4. Dead Letter Queue: Invalid events are not lost - they are sent to a DLQ for debugging and potential replay.
At-least-once delivery
We guarantee at-least-once delivery. If Kafka ACK fails, client retries. Downstream processing handles duplicates via event_id deduplication.
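A fuller sketch of the ingestion handler reflecting these decisions. The request shape, the stub `authenticate`/`is_valid` helpers, and the `kafka`/`dlq` producer objects are assumptions for illustration, not the actual implementation:

```python
import uuid
from dataclasses import dataclass

CUSTOMER_KEYS = {"demo-key": 42}      # stand-in for the real API-key -> customer lookup

@dataclass
class EventBatchRequest:              # assumed request shape: an API key plus a batch of event dicts
    api_key: str
    events: list[dict]

def authenticate(api_key: str) -> int:
    return CUSTOMER_KEYS[api_key]     # real code would hit an auth service or cache

def is_valid(event: dict) -> bool:
    return {"event_type", "timestamp"} <= event.keys()   # minimal schema check

async def handle_events(request: EventBatchRequest, kafka, dlq) -> dict:
    customer_id = authenticate(request.api_key)

    valid, invalid = [], []
    for event in request.events:
        event.setdefault("event_id", str(uuid.uuid4()))   # dedupe key if the SDK did not set one
        event["customer_id"] = customer_id
        (valid if is_valid(event) else invalid).append(event)

    if invalid:
        await dlq.send_batch(invalid)                     # invalid events go to the dead letter queue

    # Partition by customer so one customer's events stay ordered on a single partition.
    await kafka.send_batch(topic="events", key=str(customer_id), messages=valid)

    # Acknowledge after the Kafka ACK; downstream processing happens asynchronously.
    return {"status": "accepted", "accepted": len(valid), "rejected": len(invalid)}
```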
Stream Processing (Speed Layer)
The speed layer processes events in real-time for immediate visibility.
# Apache Flink job for real-time aggregation
class AnalyticsStreamJob:
    def process(self):
        ...  # consume from Kafka, window by event time, emit counts (see the sketch under "Windowing Strategy" below)

Real-time Metrics Computed:
1. Active users (1-min, 5-min, 15-min windows)
2. Event counts by type, country, device
3. Page views per page/screen
4. Session counts and average duration
5. Conversion events (purchases, signups)
Late arrivals
Mobile events often arrive late (offline then sync). Speed layer uses watermarks with 5-minute allowed lateness. Events arriving later are dropped from real-time but captured in batch layer.
Windowing Strategy
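To make the windowing concrete, a small framework-agnostic sketch of tumbling 1-minute windows with 5-minute allowed lateness. This is plain Python illustrating the logic, not the actual Flink job; the field names are assumptions:

```python
from collections import defaultdict

WINDOW_SECONDS = 60          # tumbling 1-minute windows
ALLOWED_LATENESS = 300       # events up to 5 minutes late still update their window

class TumblingWindowCounter:
    """Counts events and unique users per 1-minute event-time window."""
    def __init__(self):
        self.windows = defaultdict(lambda: {"events": 0, "users": set()})
        self.watermark = 0   # highest event time seen, minus allowed lateness

    def on_event(self, event_time: int, user_id: str) -> None:
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        window_start = event_time - (event_time % WINDOW_SECONDS)
        if window_start + WINDOW_SECONDS <= self.watermark:
            return           # too late for the speed layer; the batch layer will pick it up
        window = self.windows[window_start]
        window["events"] += 1
        window["users"].add(user_id)

    def flush_closed_windows(self):
        """Emit windows whose end has passed the watermark (these get written to Redis/Druid)."""
        for start in sorted(self.windows):
            if start + WINDOW_SECONDS <= self.watermark:
                window = self.windows.pop(start)
                yield start, window["events"], len(window["users"])
            else:
                break        # remaining windows are still open
```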
Batch Processing Layer
The batch layer produces accurate historical data and handles late-arriving events.
# Apache Spark job for hourly aggregation
def hourly_aggregation_job(spark, hour):
    # Read raw events from S3 (Parquet), dedupe, aggregate, write to the warehouse
    ...  # see the sketch after the callout below

Batch Jobs Schedule:
1. Hourly aggregation: Runs at minute 15 (allows 15 min for late arrivals)
2. Daily rollup: Combines hourly into daily at 2 AM
3. Weekly/Monthly rollup: Further compression for long-term storage
4. Data quality checks: Validates counts match between layers
Why batch even with streaming?
Batch provides: (1) Exact counts after deduplication, (2) Handles arbitrarily late data, (3) Cheaper compute (spot instances), (4) Easier debugging and reprocessing.
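A sketch of what the hourly job might look like in PySpark; the S3 path layout, column names, and the ClickHouse JDBC sink are assumptions, not the production job:

```python
from pyspark.sql import SparkSession, functions as F

def hourly_aggregation_job(spark: SparkSession, hour: str) -> None:
    # Read one hour of raw events from S3 (path layout is an assumption).
    events = spark.read.parquet(f"s3://analytics-raw/events/hour={hour}/")

    aggregates = (
        events
        .dropDuplicates(["event_id"])                          # at-least-once delivery -> dedupe by event_id
        .groupBy("customer_id", "event_type",
                 F.date_trunc("hour", "timestamp").alias("hour"))
        .agg(F.count("*").alias("event_count"),
             F.approx_count_distinct("user_id").alias("unique_users"))
    )

    # Write to the warehouse; a JDBC sink to ClickHouse is one option (connection details assumed).
    (aggregates.write
        .format("jdbc")
        .option("url", "jdbc:clickhouse://clickhouse:8123/analytics")
        .option("dbtable", "events_hourly")
        .mode("append")
        .save())

if __name__ == "__main__":
    spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()
    hourly_aggregation_job(spark, hour="2024-01-01T00")
```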
def reprocess_late_arrivals(spark, date):
    # Find events that arrived late (ingested_at >> event_timestamp)
    late_events = spark.read.parquet(...)  # source path elided; filter where ingested_at is far past event_timestamp

Query Layer & Dashboards
The query layer serves dashboards with sub-second latency by merging real-time and batch data.
Query Flow
class QueryEngine:
    async def execute_query(self, query: AnalyticsQuery) -> QueryResult:
        # Check cache first, then route by time range (see the sketch after this list)
        ...

Query Optimization Strategies:
1. Use pre-aggregates when possible: If a query matches a materialized view, use it (100x faster)
2. Time-based routing: Recent data from the fast real-time store, historical from batch
3. Result caching: Cache dashboard queries (most users see the same dashboards)
4. Query rewriting: Expand user-friendly queries into optimized SQL
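A sketch of the routing and merge logic in the serving layer. The 24-hour cutoff and 5-minute cache TTL follow the numbers used elsewhere in this design; the store clients, query object, and `merge_by_time_bucket` helper are assumptions:

```python
import time

REALTIME_CUTOFF_SECONDS = 24 * 3600   # last 24h is served from the real-time store
CACHE_TTL_SECONDS = 300               # dashboards tolerate 5-minute-old results

def merge_by_time_bucket(parts: list[list[dict]]) -> list[dict]:
    """Sum rows sharing a time bucket across the real-time and batch slices."""
    merged: dict = {}
    for rows in parts:
        for row in rows:
            merged[row["bucket"]] = merged.get(row["bucket"], 0) + row["value"]
    return [{"bucket": b, "value": v} for b, v in sorted(merged.items())]

class QueryEngine:
    def __init__(self, cache, realtime_store, warehouse):
        self.cache = cache                    # Redis-like client (assumed interface)
        self.realtime_store = realtime_store  # Druid/Redis aggregates for recent data
        self.warehouse = warehouse            # ClickHouse for historical data

    async def execute_query(self, query) -> list[dict]:
        cache_key = query.fingerprint()                       # assumed: stable hash of the query
        if (cached := await self.cache.get(cache_key)) is not None:
            return cached

        cutoff = time.time() - REALTIME_CUTOFF_SECONDS
        parts = []
        if query.end_time > cutoff:                           # recent slice -> speed layer
            parts.append(await self.realtime_store.run(query.clamp(start=max(query.start_time, cutoff))))
        if query.start_time < cutoff:                         # historical slice -> batch layer
            parts.append(await self.warehouse.run(query.clamp(end=min(query.end_time, cutoff))))

        result = merge_by_time_bucket(parts)
        await self.cache.set(cache_key, result, ttl=CACHE_TTL_SECONDS)
        return result
```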
-- Active users in last 24 hours (uses real-time + batch)
SELECT
    toStartOfHour(timestamp) AS hour,
    uniqHLL12(user_id) AS active_users    -- remainder reconstructed; the original query was truncated here
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour ORDER BY hour;

Approximate distinct counts
For unique user counts at scale, use HyperLogLog (uniqHLL12 in ClickHouse). It keeps a fixed few-kilobyte sketch instead of storing every user ID, with roughly 2% error - acceptable for analytics.
Consistency & Invariants
System Invariants
Never lose events - analytics data drives business decisions. At-least-once delivery everywhere, deduplication at query time.
Consistency Model:
We use eventual consistency with bounded staleness:
- Real-time layer: ~1 minute delay from event to dashboard
- Batch layer: ~1-2 hour delay but exact counts
- Merged view: Shows real-time for recent, batch for historical
| Data Age | Source | Accuracy | Latency |
|---|---|---|---|
| 0-60 min | Real-time (Flink) | ~99% (may miss late arrivals) | <1 min |
| 1-24 hours | Real-time + Batch | ~99.9% (batch corrects) | <2 sec query |
| 1+ days | Batch only | 100% (fully reconciled) | <2 sec query |
Business impact
Users see slightly different numbers for today vs yesterday. This is expected - yesterday is fully processed, today is still accumulating. We show indicators when data is still processing.
Deduplication Strategy:
Events may be sent multiple times (client retries, Kafka redelivery). We deduplicate at multiple levels:
1. Client SDK: Generate unique event_id client-side
2. Ingestion: Idempotent writes to Kafka (producer ID)
3. Processing: Dedupe by event_id in Spark jobs
4. Query time: Use DISTINCT or first() aggregations
import hashlib

def generate_event_id(event):
    # Deterministic ID based on event content:
    # the same event always generates the same ID.
    content = f"{event.user_id}:{event.timestamp}:{event.event_type}:{event.page_url}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]
# This means if client retries same event, it gets same ID
# Processing can dedupe by event_id

Failure Modes & Resilience
Proactively discuss failures
Let me walk through failure modes. At 10B events/day, every component WILL fail eventually.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Event API down | Events queued in SDK | Client-side buffer (1000 events), retry with backoff | Flush buffer when API recovers |
| Kafka broker down | Reduced throughput | Replication factor 3, automatic leader election | New leader elected in seconds |
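A sketch of the SDK-side buffering and backoff mentioned in the mitigation column above; the buffer size matches the table, while the flush policy and transport callable are assumptions:

```python
import random
import time

class EventBuffer:
    """Client-side buffer: hold events while the API is unreachable, retry with backoff."""
    MAX_EVENTS = 1000                 # drop-oldest beyond this to bound memory on the client

    def __init__(self, transport):
        self.transport = transport    # assumed callable that POSTs a batch and returns True on 2xx
        self.buffer: list[dict] = []

    def track(self, event: dict) -> None:
        if len(self.buffer) >= self.MAX_EVENTS:
            self.buffer.pop(0)        # oldest events are the least valuable for real-time views
        self.buffer.append(event)

    def flush(self, max_attempts: int = 5) -> bool:
        if not self.buffer:
            return True
        batch, delay = list(self.buffer), 1.0
        for _ in range(max_attempts):
            if self.transport(batch):
                self.buffer = self.buffer[len(batch):]        # keep events tracked during the flush
                return True
            time.sleep(delay + random.uniform(0, delay))      # exponential backoff with jitter
            delay = min(delay * 2, 60.0)
        return False                  # keep buffering; a later flush will retry
```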
Data Loss Prevention:
1. Kafka retention: Keep raw events in Kafka for 7 days. Can replay if processing fails.
2. S3 raw storage: Write raw events to S3 (immutable). Source of truth for reprocessing.
3. Checkpointing: Flink checkpoints to S3 every minute. Can resume from checkpoint.
4. Idempotent processing: All jobs can be re-run safely. Duplicates are handled.
# Enable exactly-once processing with checkpointing
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s; Flink's checkpointing mode defaults to exactly-once
Backpressure handling
When downstream cannot keep up (ClickHouse slow), Flink applies backpressure to Kafka consumer. Events queue in Kafka (retained for 7 days). System self-heals when bottleneck resolves.
Evolution & Scaling
What to say
This design handles 10B events/day. Let me discuss how it evolves for 10x scale and what I would do differently for specific requirements.
Scaling Path:
Current Design (10B events/day):
- 5 Kafka brokers
- 10 Flink task managers
- 3-node ClickHouse cluster
- Single-region deployment

10x Scale (100B events/day):
- Kafka cluster per region
- Flink autoscaling based on lag
- ClickHouse sharded by customer_id
- Multi-region with regional aggregation
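As a concrete reading of "autoscaling based on lag", a sketch of the scaling decision; the per-task throughput figure and drain target are assumptions, not measured numbers:

```python
import math

PER_TASK_THROUGHPUT = 15_000      # events/sec one Flink task manager can sustain (assumed)
MAX_TASK_MANAGERS = 200

def desired_parallelism(consumer_lag: int, incoming_rate: float, drain_seconds: int = 600) -> int:
    """Size the speed layer to absorb the incoming rate plus drain the Kafka lag within drain_seconds."""
    required_rate = incoming_rate + consumer_lag / drain_seconds
    return min(MAX_TASK_MANAGERS, max(1, math.ceil(required_rate / PER_TASK_THROUGHPUT)))

# Example: 115K ev/s sustained with a 50M-event backlog to drain in ~10 minutes.
print(desired_parallelism(consumer_lag=50_000_000, incoming_rate=115_000))   # -> 14
```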
Multi-Region Architecture
Alternative Approaches:
If the real-time latency requirement was stricter (<10 seconds):
- Use Apache Druid instead of ClickHouse for real-time
- Druid has native streaming ingestion
- Tradeoff: More complex operations, higher cost

If we needed more ad-hoc query flexibility:
- Use Trino/Presto as query engine
- Query across S3, ClickHouse, and real-time stores
- Tradeoff: Slower queries, more infrastructure

If cost was the primary concern:
- Use BigQuery/Snowflake (serverless)
- No real-time layer, accept 15-minute delay
- Tradeoff: Higher query costs at scale, no real-time
Kappa vs Lambda
Lambda architecture (batch + stream) adds complexity. Kappa architecture uses only streaming with reprocessing capability. For analytics, Lambda is preferred because batch provides exact counts and easier debugging.
Cost Optimization:
1. Tiered storage: Hot data in ClickHouse, warm in S3+Parquet, cold in S3 Glacier
2. Sampling for real-time: For very high volume customers, sample 10% for real-time dashboards (with statistical confidence; a sketch follows this list)
3. Spot instances for batch: Spark jobs on spot instances (70% cost reduction)
4. Aggressive aggregation: Drop raw events after 30 days, keep only aggregates
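A sketch of deterministic sampling for the real-time path (item 2 above). Hashing the user ID keeps a given user consistently in or out of the sample so funnels stay coherent; the 10% rate matches the figure above, and the helper names are assumptions:

```python
import hashlib

SAMPLE_RATE = 0.10   # keep 10% of users in the real-time path for very high volume customers

def in_realtime_sample(user_id: str) -> bool:
    """Deterministic per-user sampling: the same user is always in or out of the sample."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000

def scale_up(sampled_count: int) -> int:
    """Estimate the true count from the sampled count (inverse of the sampling rate)."""
    return round(sampled_count / SAMPLE_RATE)

# Example: 12,430 sampled events imply roughly 124,300 actual events.
print(scale_up(12_430))
```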