Design Walkthrough
Problem Statement
The Question: Design an analytics platform like Google Analytics that can track user behavior across websites and mobile apps, processing billions of events per day.
The system must support:
- Event ingestion from millions of websites/apps
- Real-time dashboards showing current active users
- Historical reports with custom date ranges and dimensions
- Funnel analysis tracking user journeys
- Segmentation by user properties and behaviors
What to say first
Before diving in, let me clarify the requirements. I want to understand the scale, latency needs for different use cases, and what types of queries we need to support.
What interviewers are testing:
- Can you design for massive write throughput?
- Do you understand the tradeoffs between real-time and batch processing?
- Can you optimize storage for both cost and query performance?
- Do you know when to use pre-aggregation vs raw data?
Clarifying Questions
These questions shape fundamental architecture decisions.
Question 1: Scale
What is the expected event volume? How many customers (websites/apps) will we support?
Why this matters: Determines if we need distributed ingestion and storage.
Typical answer: 10B events/day, 100K customers.
Architecture impact: Need Kafka for ingestion, distributed storage, and a partitioning strategy.
Question 2: Latency Requirements
What latency is acceptable for real-time dashboards vs historical reports?
Why this matters: Real-time requires stream processing; batch can use cheaper methods.
Typical answer: Real-time within 1 minute, historical queries under 2 seconds.
Architecture impact: Lambda architecture with hot and cold paths.
Question 3: Query Patterns
What types of queries will users run? Pre-defined dashboards or ad-hoc analysis?
Why this matters: Pre-defined queries can be pre-aggregated; ad-hoc needs raw data.
Typical answer: 80% standard dashboards, 20% custom queries.
Architecture impact: Pre-aggregate common dimensions, keep raw data for flexibility.
Question 4: Data Retention
How long do we need to keep data? Different retention for different granularities?
Why this matters: Storage costs grow linearly with retention.
Typical answer: Raw events 30 days, hourly aggregates 1 year, daily aggregates forever.
Architecture impact: Tiered storage with rollup jobs.
State your assumptions
I will assume: 10B events/day, 100K customers, 1-minute real-time latency, 2-second query latency, 80/20 standard/ad-hoc queries, tiered retention (30d raw, 1y hourly, unlimited daily).
The Hard Part
Say this out loud
The hard part here is serving both real-time and historical analytics from the same system while keeping costs manageable at petabyte scale.
Why this is genuinely hard:
1. Write vs Read Optimization: High write throughput wants append-only logs. Fast queries want columnar, indexed storage. These are fundamentally different.
2. Real-time vs Batch: Real-time needs stream processing (expensive, complex). Batch is cheaper but delayed. Users want both.
3. Cost at Scale: 10B events/day = 300B events/month. Storing and querying petabytes is expensive. Need smart aggregation and tiering.
4. Query Flexibility: Users want to slice data by any dimension. Pre-aggregating every combination is impossible (combinatorial explosion).
Common mistake
Candidates often design only for real-time OR batch. Production analytics systems need both - real-time for current state, batch for accurate historical analysis.
The Lambda Architecture Solution:
Process data through two parallel paths:
1. Speed Layer (Real-time): Stream processing for immediate results; approximate, recent data only
2. Batch Layer (Accurate): Periodic batch jobs for accurate historical data; handles late-arriving events
3. Serving Layer: Merges results from both layers for queries
Scale & Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Events per day | 10 billion | ~115K events/second sustained, need distributed ingestion |
| Event size | ~500 bytes average | 5 TB/day raw data, 150 TB/month |
What to say
At 10B events/day, we are ingesting 115K events per second. This is write-heavy during ingestion but read-heavy during query time. We need to optimize for both.
Storage estimation:
- Raw events: 10B events x 500 bytes = 5 TB/day
- Monthly raw: 5 TB x 30 days = 150 TB

Access Patterns:
- Writes: Append-only, high throughput, can be batched client-side
- Reads: Time-range scans, aggregations by dimensions, mostly recent data
- Hot data: Last 24 hours queried 100x more than older data
- Seasonality: Traffic spikes during business hours, marketing campaigns
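As a quick sanity check on these numbers, a back-of-envelope script; the figures mirror the assumptions above, and the 3x peak factor is a rule-of-thumb assumption rather than a measured value:

```python
# Back-of-envelope capacity estimates from the stated assumptions.
EVENTS_PER_DAY = 10_000_000_000
EVENT_SIZE_BYTES = 500
SECONDS_PER_DAY = 86_400

sustained_eps = EVENTS_PER_DAY / SECONDS_PER_DAY             # ~115,740 events/sec
raw_per_day_tb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e12    # ~5 TB/day
raw_per_month_tb = raw_per_day_tb * 30                       # ~150 TB/month

# Traffic is spiky (business hours, campaigns); provisioning for ~3x sustained
# is a common rule of thumb, not a figure from this design.
peak_eps = sustained_eps * 3

print(f"sustained: {sustained_eps:,.0f} ev/s, peak (~3x): {peak_eps:,.0f} ev/s")
print(f"raw: {raw_per_day_tb:.1f} TB/day, {raw_per_month_tb:.0f} TB/month")
```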
High-Level Architecture
What to say
I will design a Lambda architecture with separate speed and batch layers, unified by a serving layer. We scale by partitioning on customer_id and timestamp.
Analytics Pipeline Architecture
Component Responsibilities:
1. Event Ingestion
   - SDKs batch events client-side (reduce requests)
   - Event API validates, enriches, and forwards to Kafka
   - API is stateless, horizontally scalable
2. Message Buffer (Kafka)
   - Decouples ingestion from processing
   - Provides durability (events survive processor failures)
   - Enables replay for reprocessing
3. Speed Layer (Flink)
   - Processes events in real-time
   - Maintains running counts for last N minutes
   - Writes to fast store (Redis/Druid)
4. Batch Layer (Spark)
   - Runs hourly/daily jobs on raw events
   - Produces accurate aggregates
   - Handles late-arriving events
5. Serving Layer
   - Merges real-time and batch results
   - Caches frequent queries
   - Provides unified query API
Real-world reference
Google Analytics uses a similar architecture. Mixpanel uses Kafka plus ClickHouse. Amplitude uses a custom streaming pipeline with Druid for real-time and S3/Spark for batch.
Data Model & Storage
We need different storage systems optimized for different access patterns.
What to say
The source of truth is raw events in object storage. We materialize views at different granularities for query performance. Storage choice depends on query pattern.
| Storage | Purpose | Technology | Retention |
|---|---|---|---|
| Raw Events | Source of truth, replay, ad-hoc | S3/GCS (Parquet) | 30 days |
| Real-time Store | Last 24h, fast updates | Redis/Druid | 24 hours |
| OLAP Warehouse | Historical queries, aggregates | ClickHouse/BigQuery | Forever |
| Pre-aggregates | Common dashboard queries | ClickHouse materialized views | Forever |
| Query Cache | Repeated queries | Redis | TTL 5 min |
CREATE TABLE events (
    event_id UUID,
    customer_id UInt64,
    timestamp DateTime,
    event_type String              -- remaining columns omitted; the original DDL was truncated here
) ENGINE = MergeTree() ORDER BY (customer_id, timestamp);

Why ClickHouse/Columnar Storage?
- Analytics queries scan few columns across many rows
- Columnar storage reads only needed columns (10x less I/O)
- Compression is excellent for repeated values
- Vectorized execution for aggregations
-- Hourly aggregates by common dimensions
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree() ORDER BY (customer_id, event_type, hour)
AS SELECT customer_id, event_type, toStartOfHour(timestamp) AS hour,
    count() AS event_count         -- SELECT body reconstructed; the original statement was truncated here
FROM events GROUP BY customer_id, event_type, hour;

Aggregation tradeoff
We cannot pre-aggregate every dimension combination - 10 dimensions with 100 values each = 100^10 combinations. We pre-aggregate the most common 5-6 dimension combinations and fall back to raw data for rare queries.
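A minimal sketch of how the query planner could decide between a pre-aggregate and the raw table. The view names and the `PREAGGREGATED` registry are illustrative assumptions, not the actual catalog:

```python
# Hypothetical registry of pre-aggregated dimension sets and the table serving each.
PREAGGREGATED = {
    frozenset({"event_type"}): "events_hourly_mv",
    frozenset({"event_type", "country"}): "events_hourly_by_country_mv",
    frozenset({"event_type", "country", "device"}): "events_hourly_by_country_device_mv",
}

def pick_table(requested_dimensions: set[str]) -> str:
    """Route to the narrowest materialized view that covers the request, else the raw table."""
    candidates = [
        (len(dims), table)
        for dims, table in PREAGGREGATED.items()
        if requested_dimensions <= dims            # the view must contain every requested dimension
    ]
    return min(candidates)[1] if candidates else "events"   # fall back to raw events

# A dashboard grouping by event_type and country hits a view; a rare ad-hoc
# query on an unusual dimension falls back to raw data.
assert pick_table({"event_type", "country"}) == "events_hourly_by_country_mv"
assert pick_table({"browser_version"}) == "events"
```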
Event Ingestion Deep Dive
The ingestion pipeline must handle 100K+ events/second without losing data.
Event Ingestion Flow
async def handle_events(request: EventBatchRequest) -> Response:
    customer_id = authenticate(request.api_key)
    ...  # validation, enrichment, and the Kafka publish follow; see the fuller sketch below
Key Design Decisions:
1. Client-side batching: SDKs batch events for 5 seconds or 10 events before sending. Reduces API calls by 10x.
2. Async acknowledgment: API returns success after Kafka ACK, not after processing. Decouples ingestion from processing.
3. Partition by customer: Events for the same customer go to the same Kafka partition. Enables per-customer ordering and processing.
4. Dead Letter Queue: Invalid events are not lost - they are sent to a DLQ for debugging and potential replay.
At-least-once delivery
We guarantee at-least-once delivery. If Kafka ACK fails, client retries. Downstream processing handles duplicates via event_id deduplication.
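A fuller sketch of the ingestion handler reflecting these decisions. The request shape, the stub `authenticate`/`is_valid` helpers, and the `kafka`/`dlq` producer objects are assumptions for illustration, not the actual implementation:

```python
import uuid
from dataclasses import dataclass

CUSTOMER_KEYS = {"demo-key": 42}      # stand-in for the real API-key -> customer lookup

@dataclass
class EventBatchRequest:              # assumed request shape: an API key plus a batch of event dicts
    api_key: str
    events: list[dict]

def authenticate(api_key: str) -> int:
    return CUSTOMER_KEYS[api_key]     # real code would hit an auth service or cache

def is_valid(event: dict) -> bool:
    return {"event_type", "timestamp"} <= event.keys()   # minimal schema check

async def handle_events(request: EventBatchRequest, kafka, dlq) -> dict:
    customer_id = authenticate(request.api_key)

    valid, invalid = [], []
    for event in request.events:
        event.setdefault("event_id", str(uuid.uuid4()))   # dedupe key if the SDK did not set one
        event["customer_id"] = customer_id
        (valid if is_valid(event) else invalid).append(event)

    if invalid:
        await dlq.send_batch(invalid)                     # invalid events go to the dead letter queue

    # Partition by customer so one customer's events stay ordered on a single partition.
    await kafka.send_batch(topic="events", key=str(customer_id), messages=valid)

    # Acknowledge after the Kafka ACK; downstream processing happens asynchronously.
    return {"status": "accepted", "accepted": len(valid), "rejected": len(invalid)}
```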
Stream Processing (Speed Layer)
The speed layer processes events in real-time for immediate visibility.
# Apache Flink job for real-time aggregation
class AnalyticsStreamJob:
    def process(self):
        ...  # consume from Kafka, window by event time, emit counts (see the sketch under "Windowing Strategy" below)

Real-time Metrics Computed:
1. Active users (1-min, 5-min, 15-min windows)
2. Event counts by type, country, device
3. Page views per page/screen
4. Session counts and average duration
5. Conversion events (purchases, signups)
Late arrivals
Mobile events often arrive late (offline then sync). Speed layer uses watermarks with 5-minute allowed lateness. Events arriving later are dropped from real-time but captured in batch layer.
Windowing Strategy
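To make the windowing concrete, a small framework-agnostic sketch of tumbling 1-minute windows with 5-minute allowed lateness. This is plain Python illustrating the logic, not the actual Flink job; the field names are assumptions:

```python
from collections import defaultdict

WINDOW_SECONDS = 60          # tumbling 1-minute windows
ALLOWED_LATENESS = 300       # events up to 5 minutes late still update their window

class TumblingWindowCounter:
    """Counts events and unique users per 1-minute event-time window."""
    def __init__(self):
        self.windows = defaultdict(lambda: {"events": 0, "users": set()})
        self.watermark = 0   # highest event time seen, minus allowed lateness

    def on_event(self, event_time: int, user_id: str) -> None:
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        window_start = event_time - (event_time % WINDOW_SECONDS)
        if window_start + WINDOW_SECONDS <= self.watermark:
            return           # too late for the speed layer; the batch layer will pick it up
        window = self.windows[window_start]
        window["events"] += 1
        window["users"].add(user_id)

    def flush_closed_windows(self):
        """Emit windows whose end has passed the watermark (these get written to Redis/Druid)."""
        for start in sorted(self.windows):
            if start + WINDOW_SECONDS <= self.watermark:
                window = self.windows.pop(start)
                yield start, window["events"], len(window["users"])
            else:
                break        # remaining windows are still open
```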
Batch Processing Layer
The batch layer produces accurate historical data and handles late-arriving events.
# Apache Spark job for hourly aggregation
def hourly_aggregation_job(spark, hour):
    # Read raw events from S3 (Parquet), dedupe, aggregate, write to the warehouse
    ...  # see the sketch after the callout below

Batch Jobs Schedule:
1. Hourly aggregation: Runs at minute 15 (allows 15 min for late arrivals)
2. Daily rollup: Combines hourly into daily at 2 AM
3. Weekly/Monthly rollup: Further compression for long-term storage
4. Data quality checks: Validates counts match between layers
Why batch even with streaming?
Batch provides: (1) Exact counts after deduplication, (2) Handles arbitrarily late data, (3) Cheaper compute (spot instances), (4) Easier debugging and reprocessing.
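A sketch of what the hourly job might look like in PySpark; the S3 path layout, column names, and the ClickHouse JDBC sink are assumptions, not the production job:

```python
from pyspark.sql import SparkSession, functions as F

def hourly_aggregation_job(spark: SparkSession, hour: str) -> None:
    # Read one hour of raw events from S3 (path layout is an assumption).
    events = spark.read.parquet(f"s3://analytics-raw/events/hour={hour}/")

    aggregates = (
        events
        .dropDuplicates(["event_id"])                          # at-least-once delivery -> dedupe by event_id
        .groupBy("customer_id", "event_type",
                 F.date_trunc("hour", "timestamp").alias("hour"))
        .agg(F.count("*").alias("event_count"),
             F.approx_count_distinct("user_id").alias("unique_users"))
    )

    # Write to the warehouse; a JDBC sink to ClickHouse is one option (connection details assumed).
    (aggregates.write
        .format("jdbc")
        .option("url", "jdbc:clickhouse://clickhouse:8123/analytics")
        .option("dbtable", "events_hourly")
        .mode("append")
        .save())

if __name__ == "__main__":
    spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()
    hourly_aggregation_job(spark, hour="2024-01-01T00")
```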
def reprocess_late_arrivals(spark, date):
    # Find events that arrived late (ingested_at >> event_timestamp)
    late_events = spark.read.parquet(...)  # source path elided; filter where ingested_at is far past event_timestamp

Query Layer & Dashboards
The query layer serves dashboards with sub-second latency by merging real-time and batch data.
Query Flow
class QueryEngine:
    async def execute_query(self, query: AnalyticsQuery) -> QueryResult:
        # Check cache first, then route by time range (see the sketch after this list)
        ...

Query Optimization Strategies:
1. Use pre-aggregates when possible: If a query matches a materialized view, use it (100x faster)
2. Time-based routing: Recent data from the fast real-time store, historical from batch
3. Result caching: Cache dashboard queries (most users see the same dashboards)
4. Query rewriting: Expand user-friendly queries into optimized SQL
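A sketch of the routing and merge logic in the serving layer. The 24-hour cutoff and 5-minute cache TTL follow the numbers used elsewhere in this design; the store clients, query object, and `merge_by_time_bucket` helper are assumptions:

```python
import time

REALTIME_CUTOFF_SECONDS = 24 * 3600   # last 24h is served from the real-time store
CACHE_TTL_SECONDS = 300               # dashboards tolerate 5-minute-old results

def merge_by_time_bucket(parts: list[list[dict]]) -> list[dict]:
    """Sum rows sharing a time bucket across the real-time and batch slices."""
    merged: dict = {}
    for rows in parts:
        for row in rows:
            merged[row["bucket"]] = merged.get(row["bucket"], 0) + row["value"]
    return [{"bucket": b, "value": v} for b, v in sorted(merged.items())]

class QueryEngine:
    def __init__(self, cache, realtime_store, warehouse):
        self.cache = cache                    # Redis-like client (assumed interface)
        self.realtime_store = realtime_store  # Druid/Redis aggregates for recent data
        self.warehouse = warehouse            # ClickHouse for historical data

    async def execute_query(self, query) -> list[dict]:
        cache_key = query.fingerprint()                       # assumed: stable hash of the query
        if (cached := await self.cache.get(cache_key)) is not None:
            return cached

        cutoff = time.time() - REALTIME_CUTOFF_SECONDS
        parts = []
        if query.end_time > cutoff:                           # recent slice -> speed layer
            parts.append(await self.realtime_store.run(query.clamp(start=max(query.start_time, cutoff))))
        if query.start_time < cutoff:                         # historical slice -> batch layer
            parts.append(await self.warehouse.run(query.clamp(end=min(query.end_time, cutoff))))

        result = merge_by_time_bucket(parts)
        await self.cache.set(cache_key, result, ttl=CACHE_TTL_SECONDS)
        return result
```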
-- Active users in last 24 hours (uses real-time + batch)
SELECT
    toStartOfHour(timestamp) AS hour,
    uniqHLL12(user_id) AS active_users    -- remainder reconstructed; the original query was truncated here
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour ORDER BY hour;

Approximate distinct counts
For unique user counts at scale, use HyperLogLog (uniqHLL12 in ClickHouse). It keeps a fixed few-kilobyte sketch instead of storing every user ID, with roughly 2% error - acceptable for analytics.
Consistency & Invariants
System Invariants
Never lose events - analytics data drives business decisions. At-least-once delivery everywhere, deduplication at query time.
Consistency Model:
We use eventual consistency with bounded staleness:
- Real-time layer: ~1 minute delay from event to dashboard
- Batch layer: ~1-2 hour delay but exact counts
- Merged view: Shows real-time for recent, batch for historical
| Data Age | Source | Accuracy | Latency |
|---|---|---|---|
| 0-60 min | Real-time (Flink) | ~99% (may miss late arrivals) | <1 min |
| 1-24 hours | Real-time + Batch | ~99.9% (batch corrects) | <2 sec query |
| 1+ days | Batch only | 100% (fully reconciled) | <2 sec query |
Business impact
Users see slightly different numbers for today vs yesterday. This is expected - yesterday is fully processed, today is still accumulating. We show indicators when data is still processing.
Deduplication Strategy:
Events may be sent multiple times (client retries, Kafka redelivery). We deduplicate at multiple levels:
1. Client SDK: Generate unique event_id client-side
2. Ingestion: Idempotent writes to Kafka (producer ID)
3. Processing: Dedupe by event_id in Spark jobs
4. Query time: Use DISTINCT or first() aggregations
import hashlib

def generate_event_id(event):
    # Deterministic ID based on event content:
    # the same event always generates the same ID.
    content = f"{event.user_id}:{event.timestamp}:{event.event_type}:{event.page_url}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]
# This means if client retries same event, it gets same ID
# Processing can dedupe by event_id

Failure Modes & Resilience
Proactively discuss failures
Let me walk through failure modes. At 10B events/day, every component WILL fail eventually.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Event API down | Events queued in SDK | Client-side buffer (1000 events), retry with backoff | Flush buffer when API recovers |
| Kafka broker down | Reduced throughput | Replication factor 3, automatic leader election | New leader elected in seconds |
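A sketch of the SDK-side buffering and backoff mentioned in the mitigation column above; the buffer size matches the table, while the flush policy and transport callable are assumptions:

```python
import random
import time

class EventBuffer:
    """Client-side buffer: hold events while the API is unreachable, retry with backoff."""
    MAX_EVENTS = 1000                 # drop-oldest beyond this to bound memory on the client

    def __init__(self, transport):
        self.transport = transport    # assumed callable that POSTs a batch and returns True on 2xx
        self.buffer: list[dict] = []

    def track(self, event: dict) -> None:
        if len(self.buffer) >= self.MAX_EVENTS:
            self.buffer.pop(0)        # oldest events are the least valuable for real-time views
        self.buffer.append(event)

    def flush(self, max_attempts: int = 5) -> bool:
        if not self.buffer:
            return True
        batch, delay = list(self.buffer), 1.0
        for _ in range(max_attempts):
            if self.transport(batch):
                self.buffer = self.buffer[len(batch):]        # keep events tracked during the flush
                return True
            time.sleep(delay + random.uniform(0, delay))      # exponential backoff with jitter
            delay = min(delay * 2, 60.0)
        return False                  # keep buffering; a later flush will retry
```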
Data Loss Prevention:
1. Kafka retention: Keep raw events in Kafka for 7 days. Can replay if processing fails.
2. S3 raw storage: Write raw events to S3 (immutable). Source of truth for reprocessing.
3. Checkpointing: Flink checkpoints to S3 every minute. Can resume from checkpoint.
4. Idempotent processing: All jobs can be re-run safely. Duplicates are handled.
# Enable exactly-once processing with checkpointing
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s; Flink's checkpointing mode defaults to exactly-once
Backpressure handling
When downstream cannot keep up (ClickHouse slow), Flink applies backpressure to Kafka consumer. Events queue in Kafka (retained for 7 days). System self-heals when bottleneck resolves.
Evolution & Scaling
What to say
This design handles 10B events/day. Let me discuss how it evolves for 10x scale and what I would do differently for specific requirements.
Scaling Path:
Current Design (10B events/day):
- 5 Kafka brokers
- 10 Flink task managers
- 3-node ClickHouse cluster
- Single-region deployment

10x Scale (100B events/day):
- Kafka cluster per region
- Flink autoscaling based on lag
- ClickHouse sharded by customer_id
- Multi-region with regional aggregation
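As a concrete reading of "autoscaling based on lag", a sketch of the scaling decision; the per-task throughput figure and drain target are assumptions, not measured numbers:

```python
import math

PER_TASK_THROUGHPUT = 15_000      # events/sec one Flink task manager can sustain (assumed)
MAX_TASK_MANAGERS = 200

def desired_parallelism(consumer_lag: int, incoming_rate: float, drain_seconds: int = 600) -> int:
    """Size the speed layer to absorb the incoming rate plus drain the Kafka lag within drain_seconds."""
    required_rate = incoming_rate + consumer_lag / drain_seconds
    return min(MAX_TASK_MANAGERS, max(1, math.ceil(required_rate / PER_TASK_THROUGHPUT)))

# Example: 115K ev/s sustained with a 50M-event backlog to drain in ~10 minutes.
print(desired_parallelism(consumer_lag=50_000_000, incoming_rate=115_000))   # -> 14
```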
Multi-Region Architecture
Alternative Approaches:
If the real-time latency requirement was stricter (<10 seconds):
- Use Apache Druid instead of ClickHouse for real-time
- Druid has native streaming ingestion
- Tradeoff: More complex operations, higher cost

If we needed more ad-hoc query flexibility:
- Use Trino/Presto as query engine
- Query across S3, ClickHouse, and real-time stores
- Tradeoff: Slower queries, more infrastructure

If cost was the primary concern:
- Use BigQuery/Snowflake (serverless)
- No real-time layer, accept 15-minute delay
- Tradeoff: Higher query costs at scale, no real-time
Kappa vs Lambda
Lambda architecture (batch + stream) adds complexity. Kappa architecture uses only streaming with reprocessing capability. For analytics, Lambda is preferred because batch provides exact counts and easier debugging.
Cost Optimization:
1. Tiered storage: Hot data in ClickHouse, warm in S3+Parquet, cold in S3 Glacier
2. Sampling for real-time: For very high volume customers, sample 10% for real-time dashboards (with statistical confidence; a sketch follows this list)
3. Spot instances for batch: Spark jobs on spot instances (70% cost reduction)
4. Aggressive aggregation: Drop raw events after 30 days, keep only aggregates
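A sketch of deterministic sampling for the real-time path (item 2 above). Hashing the user ID keeps a given user consistently in or out of the sample so funnels stay coherent; the 10% rate matches the figure above, and the helper names are assumptions:

```python
import hashlib

SAMPLE_RATE = 0.10   # keep 10% of users in the real-time path for very high volume customers

def in_realtime_sample(user_id: str) -> bool:
    """Deterministic per-user sampling: the same user is always in or out of the sample."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000

def scale_up(sampled_count: int) -> int:
    """Estimate the true count from the sampled count (inverse of the sampling rate)."""
    return round(sampled_count / SAMPLE_RATE)

# Example: 12,430 sampled events imply roughly 124,300 actual events.
print(scale_up(12_430))
```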