System Design Masterclass
streaming, analytics, time-series, aggregation, dashboards, advanced

Design Real-Time Analytics System

Design a system that ingests millions of events per second and provides real-time dashboards with sub-second queries

Millions of events per second | Similar to Google, Netflix, Uber, LinkedIn, Meta, Datadog, Splunk | 45 min read

Summary

A real-time analytics system takes in millions of tiny pieces of data every second (like clicks, purchases, or sensor readings), does math on them super fast (counting, averaging, summing), and shows the results on dashboards that update live. The tricky parts are: handling data that arrives late or out of order, making queries fast even when you have trillions of rows, keeping dashboards updated without overwhelming the database, and not losing any data even when servers crash. Companies like Google, Netflix, Uber, and LinkedIn ask this question in interviews.

Key Takeaways

Core Problem

The main job is to swallow millions of events every second without dropping any, crunch numbers on them in real-time, and let users query billions of rows in under one second.

The Hard Part

Events arrive out of order or late. A click that happened at 2:00 PM might reach our servers at 2:05 PM. We need to put it in the right time bucket and update our counts correctly.

Scaling Axis

We scale writes and reads separately. Writes are split by time (today's data goes to today's partition). Reads are made fast by pre-computing common answers (like hourly totals).

Critical Invariant

Never lose an event. Every click, every purchase, every sensor reading must be counted exactly once - not zero times, not twice.

Performance Requirement

Dashboards must load in under 1 second. This means we cannot scan raw data for every query - we must pre-compute summaries and use special databases designed for analytics.

Key Tradeoff

Fresh data vs fast queries. Showing data from 1 second ago needs constant updates (expensive). Showing data from 1 minute ago allows batching (cheaper). Most dashboards can tolerate a small delay.

Design Walkthrough

Problem Statement

The Question: Design a system that can take in millions of events every second (like user clicks, app events, or sensor data) and show the results on dashboards that update in real-time. Users should be able to ask questions like "how many users signed up in the last hour?" and get answers in under 1 second.

What the system needs to do (most important first):

  1. Swallow lots of data fast - Accept millions of events per second from apps, websites, IoT devices, etc. Never drop any data, even during traffic spikes.
  2. Process data in real-time - As events come in, immediately do math on them (count clicks, sum purchases, average response times).
  3. Show live dashboards - Display charts and numbers that update every few seconds. Show things like "orders in the last 5 minutes" or "errors per second".
  4. Answer historical queries fast - Let users ask questions about old data ("sales last month by country") and get answers in under 1 second, even with billions of rows.
  5. Handle late data - Some events arrive late (a phone was offline, then reconnects). Put these events in the correct time bucket.
  6. Never lose data - Even if servers crash or the network breaks, every event must be saved and counted.
  7. Scale up easily - When traffic doubles, we should be able to handle it by adding more servers.

What to say first

Let me first understand what kind of events we are processing and what questions users want to answer. Are these web clicks, app events, IoT sensor data, or logs? What metrics do they care about - counts, sums, averages, percentiles? How fresh does the data need to be - real-time (seconds) or near-real-time (minutes)?

What the interviewer really wants to see:
  • Can you design a system that separates fast writing (ingestion) from fast reading (queries)?
  • Do you understand stream processing and why it is different from batch processing?
  • Can you handle the tricky problem of late-arriving or out-of-order events?
  • Do you know how to make queries fast on huge datasets (pre-aggregation, columnar storage, partitioning)?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: What kind of events and how many?

What type of events are we ingesting? Web clicks, mobile app events, IoT sensors, or server logs? How many events per second at peak? What is the size of each event?

Why ask this: The type and volume of events affects our architecture choices.

What interviewers usually say: Mix of web and mobile events - clicks, page views, purchases, errors. Peak of 1 million events per second. Each event is about 1 KB (has user ID, timestamp, event type, and some extra data).

How this changes your design: At 1M events/sec × 1KB = 1 GB/sec of incoming data. We need a message queue that can handle this (Kafka can do 10+ GB/sec). Storage will be about 86 TB per day - we need to think about retention and compression.

Question 2: How fresh does the dashboard data need to be?

When something happens, how quickly must it show on the dashboard? Under 1 second (true real-time), under 1 minute (near real-time), or is hourly okay?

Why ask this: True real-time (sub-second) is much harder and more expensive than near-real-time.

What interviewers usually say: Live dashboards should update within 5-10 seconds. For alerting (like detecting a spike in errors), we need data within 1-2 seconds. Historical queries can be a few minutes behind.

How this changes your design: 5-10 second freshness means we can batch events for a few seconds before processing. This is much easier than processing each event individually. We use micro-batches instead of true streaming.

Question 3: What questions will users ask?

What kind of queries will the dashboards run? Simple counts (how many clicks today)? Grouped queries (clicks per country)? Complex queries (95th percentile response time by API endpoint)?

Why ask this: Different query types need different optimizations.

What interviewers usually say: Mostly counts, sums, and averages grouped by dimensions (country, device type, event type). Also need percentiles (P50, P95, P99) for response times. Time-based queries are very common (last hour, last day, last week).

How this changes your design: For counts and sums, we can pre-compute aggregates (hourly totals per country). For percentiles, we need to keep more data or use approximate algorithms (t-digest). Time-based queries mean we should partition data by time.
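Percentiles are the one place where simple pre-aggregation breaks down, because you cannot average P95s across buckets. As a hedged sketch of the approximate approach, here is what a t-digest looks like in practice, assuming the open-source Python tdigest package is available (any t-digest implementation behaves the same way):

# Sketch: approximate P95 latency with a t-digest instead of storing every raw value.
# Assumes the open-source `tdigest` package; latencies here are made up.
from tdigest import TDigest
import random

digest = TDigest()

# Stream in response times (ms) as events arrive.
for _ in range(100_000):
    digest.update(random.expovariate(1 / 120))   # fake latencies, ~120 ms mean

# A sketch of a few KB stands in for 100k raw values, yet stays close to exact.
print("approx P50:", round(digest.percentile(50), 1), "ms")
print("approx P95:", round(digest.percentile(95), 1), "ms")
print("approx P99:", round(digest.percentile(99), 1), "ms")

Each time bucket would store a serialized digest instead of raw latencies; because t-digests can be merged, "P95 over the last hour" can be answered from per-minute sketches without rescanning events.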

Question 4: How long do we keep the data?

How long should we store raw events? How long should we keep aggregated summaries? Do we need to query data from years ago?

Why ask this: Keeping raw events forever is expensive. Keeping summaries is much cheaper.

What interviewers usually say: Keep raw events for 30 days (for debugging and ad-hoc queries). Keep hourly aggregates for 1 year. Keep daily aggregates forever.

How this changes your design: We need a tiered storage strategy. Recent data (hot) stays on fast SSDs. Older data (cold) moves to cheap object storage. We automatically roll up raw events into aggregates and then delete the raw data.
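To make the tiered-retention idea concrete, here is a rough sketch of the roll-up job. Table and column names (minute_aggregates, hourly_aggregates, country_id, sum_value) are illustrative, and db stands in for a generic SQL client with an execute(sql, params) method, not any specific driver's API:

# Schematic nightly roll-up: fold minute rows into hourly rows, then let a
# retention job drop raw partitions once the summaries exist.
from datetime import datetime, timedelta

ROLLUP_SQL = """
INSERT INTO hourly_aggregates (time_bucket, event_type, country_id, event_count, sum_value)
SELECT
    toStartOfHour(time_bucket) AS time_bucket,
    event_type,
    country_id,
    sum(event_count) AS event_count,
    sum(sum_value)   AS sum_value
FROM minute_aggregates
WHERE time_bucket >= %(start)s AND time_bucket < %(end)s
GROUP BY time_bucket, event_type, country_id
"""

def roll_up_yesterday(db):
    """Fold yesterday's minute aggregates into hourly aggregates (schematic)."""
    end = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    db.execute(ROLLUP_SQL, {"start": start, "end": end})
    # A separate retention job would then drop raw-event partitions older than
    # 30 days, since the hourly and daily aggregates preserve the summaries.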

Summarize your assumptions

Let me summarize: We need to handle 1 million events per second, each about 1 KB. Dashboards should update within 10 seconds. Users mostly want counts and sums grouped by dimensions like country and device. We keep raw data for 30 days and aggregates for 1+ years. Now let me design the system.

The Hard Part

Say this to the interviewer

The hardest part of real-time analytics is handling late-arriving data. Imagine a user clicks a button at 2:00:00 PM, but their phone is on a slow network. The event reaches our server at 2:00:30 PM - 30 seconds late. We have already calculated the counts for 2:00 PM. Do we update them? What if an event arrives hours late?

Why late data is a big problem (explained simply):

  1. Events travel at different speeds - An event from a fast server reaches us in 10 milliseconds. An event from a phone on a slow network might take 30 seconds. An event from a device that was offline might arrive hours or days late.
  2. We need to count by event time, not arrival time - If a purchase happened at 2:00 PM, it should count in the 2:00 PM bucket - even if we receive it at 2:05 PM. Using arrival time would give wrong results.
  3. We cannot wait forever - We want to show the 2:00 PM count on the dashboard. How long do we wait for late events before saying "the 2:00 PM count is final"?
  4. Updating old counts is expensive - If we already computed and stored the 2:00 PM count, updating it when a late event arrives means changing data we thought was done.
  5. Alerts might fire incorrectly - If we alerted on the 2:00 PM data, then a late event changes it, should we un-alert? This gets messy fast.

Common mistake candidates make

Many people ignore late data or say: just use the time the event arrived at our server. This is wrong because: (1) it gives incorrect business metrics - if a sale happened at 2 PM, it should count at 2 PM for reporting, (2) time zone differences alone can cause events to appear late, (3) mobile apps often batch events and send them together, arriving late.

How we handle late-arriving data:

Step 1: Use event time, not processing time
  • Every event has a timestamp of when it actually happened (event time)
  • We use this timestamp to decide which time bucket it belongs to
  • Processing time (when we received it) is only used for monitoring

Step 2: Define a watermark (how late is too late)
  • We say: we will wait up to 5 minutes for late events
  • After 5 minutes, we consider a time window "closed"
  • Example: At 2:05 PM, we close the 2:00-2:01 PM window

Step 3: Handle events that arrive within the watermark
  • Event for 2:00 PM arrives at 2:03 PM? Include it normally
  • We keep windows "open" until the watermark passes

Step 4: Handle events that arrive after the watermark (very late)
  • Event for 2:00 PM arrives at 2:30 PM? It is past our 5-minute watermark
  • Option A: Drop it and log it (simple but loses data)
  • Option B: Put it in a "late data" table and update aggregates in batch (keeps data but more complex)
  • Option C: Issue a correction to the aggregate (most accurate but expensive)
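Below is a toy sketch of this watermark logic in plain Python. The names are illustrative only (this is not a real Flink API); it just shows how a watermark decides whether an event still lands in an open window or is diverted to a late-data path.

# Toy event-time windowing with a 5-minute watermark. Illustrative names only.
from collections import defaultdict

WINDOW_SECONDS = 60          # 1-minute buckets
ALLOWED_LATENESS = 5 * 60    # wait up to 5 minutes for stragglers

open_windows = defaultdict(int)   # window_start -> event count
max_event_time_seen = 0           # drives the watermark

def on_event(event_time: int, late_events: list):
    """Route one event into its window, or to a side output if it is too late."""
    global max_event_time_seen
    max_event_time_seen = max(max_event_time_seen, event_time)
    watermark = max_event_time_seen - ALLOWED_LATENESS

    window_start = event_time - (event_time % WINDOW_SECONDS)
    if window_start + WINDOW_SECONDS > watermark:
        open_windows[window_start] += 1    # on time, or tolerably late
    else:
        late_events.append(event_time)     # too late: handle in the batch path

def close_finished_windows(emit):
    """Emit and drop every window that the watermark has already passed."""
    watermark = max_event_time_seen - ALLOWED_LATENESS
    for start in sorted(open_windows):
        if start + WINDOW_SECONDS <= watermark:
            emit(start, open_windows.pop(start))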

(Diagram: How we handle events arriving at different times)

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Events per second (peak) | 1 million | Need Kafka or similar - can handle 10M+ events/sec easily
Event size | 1 KB average | 1M × 1KB = 1 GB/sec incoming data - need fast network and disks
+ 6 more rows...

What to tell the interviewer

At 1 million events per second, we need to separate the write path (ingestion) from the read path (queries) completely. Writes go to Kafka first (buffer), then to a stream processor (compute aggregates), then to a database (store). Reads hit the database through a cache layer. This separation lets us scale each part independently.

Common interview mistake: Underestimating the data volume

Many candidates do not do the math. 1M events/sec sounds manageable until you realize it is 86 billion events per day and 86 TB of raw data. Keeping this for 30 days is 2.5 PB without compression. Always do back-of-envelope calculations to show you understand the scale.
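The arithmetic is simple enough to sanity-check in a few lines; this sketch just replays the assumptions from the clarifying questions above:

# Back-of-envelope check for the numbers quoted above.
events_per_sec = 1_000_000
event_size_bytes = 1_000                              # ~1 KB per event

ingest_rate = events_per_sec * event_size_bytes       # bytes/sec
per_day = ingest_rate * 86_400                        # bytes/day
raw_30_days = per_day * 30

print(f"ingest: {ingest_rate / 1e9:.1f} GB/sec")      # ~1.0 GB/sec
print(f"per day: {per_day / 1e12:.0f} TB/day")        # ~86 TB/day
print(f"30 days raw: {raw_30_days / 1e15:.2f} PB")    # ~2.5-2.6 PB uncompressed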

How people use the system (from most common to least common):

  1. Dashboard refresh - Every 10 seconds, dashboards auto-refresh and run their queries. This is the most common read pattern - must be fast.
  2. Real-time alerting - Continuously check if metrics cross thresholds (error rate > 5%). Needs the freshest data.
  3. Ad-hoc exploration - Data analysts run custom queries to investigate issues. These can be complex and touch lots of data.
  4. Historical reports - Weekly or monthly reports that aggregate data over longer periods. Can run in background.

How much storage do we need?
- Raw events: 1 KB × 1M/sec × 86,400 sec = 86 TB/day
- With 10x compression: 8.6 TB/day
+ 14 more lines...

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will use a Lambda Architecture - we have two paths: a fast path for real-time data (streaming) and a slow path for historical data (batch). The fast path gives us speed, the slow path gives us accuracy. Most modern systems use this pattern or a simplified version called Kappa Architecture.

(Diagram: Real-Time Analytics System - the big picture)

What each part does and WHY it is there:

Component | What it does | Why we need it (what to tell interviewer)
Event Collectors | Receive events from apps and validate them. Add server timestamp. | Why separate collectors? They are stateless and easy to scale. If we get a traffic spike, we just add more collectors. They also do validation - reject bad events early before they waste resources.
Kafka (Message Buffer) | Temporarily stores events before processing. Acts as a shock absorber. | Why Kafka? It handles 1M+ events/sec easily. It keeps events even if downstream systems are slow. It allows replay - if we find a bug, we can reprocess old events. It decouples producers from consumers.
+ 4 more rows...

Common interview question: Why not just use a regular database?

Interviewers often ask: why do you need Kafka, Flink, and a special analytics database - can't PostgreSQL do it all? Your answer: at 1M events/sec, PostgreSQL would fall over. (1) It cannot ingest that fast - inserts would queue up. (2) Even if it could, queries on billions of rows would take minutes, not milliseconds. (3) We need a buffer (Kafka) because processing might be slower than ingestion. Use the right tool for each job.

Technology Choices - Why we picked these tools:

Message Queue: Kafka (Recommended)
  • Why we chose it: Battle-tested at LinkedIn, Uber, Netflix. Handles 10M+ events/sec. Keeps events for days (replay). Strong ordering guarantees.
  • Other options:
    - Amazon Kinesis: Good if you are all-in on AWS. Easier to manage but less flexible.
    - Apache Pulsar: Newer, has some nice features like tiered storage built-in.
    - Google Pub/Sub: Good if you are on GCP. Serverless, no capacity planning.

Stream Processing: Apache Flink (Recommended)
  • Why we chose it: Best at handling event time and late data (watermarks). Exactly-once processing. Very fast.
  • Other options:
    - Spark Streaming: More popular, easier to hire for. Works well for micro-batch (seconds latency).
    - Kafka Streams: Simpler, no separate cluster needed. Good for lighter processing.
    - Amazon Kinesis Analytics: Serverless option if on AWS.

Analytics Database: ClickHouse (Recommended)
  • Why we chose it: Fastest for analytical queries. Can ingest millions of rows/sec. Great compression. Open source.
  • Other options:
    - Apache Druid: Better for high-concurrency queries. More complex to operate.
    - TimescaleDB: Good if you want PostgreSQL compatibility. Slower than ClickHouse.
    - BigQuery/Snowflake: Serverless, easy to use. More expensive, less real-time.

Important interview tip

Pick technologies YOU know! If you have used Spark at work, use Spark in your design. If you know Druid well, use Druid. Interviewers care more about your reasoning than the specific tool. Say something like: I will use Flink because it handles late data well, but Spark Streaming would also work if the team knows it better.

Data Model and Storage

Now let me show how we organize the data. We have three types of data: raw events (detailed), aggregates (summarized), and metadata (dimension info).

What to tell the interviewer

I will use a star schema - a fact table for events and dimension tables for lookups. Raw events go to a data lake. Aggregates go to ClickHouse with different granularities: real-time (per minute), short-term (per hour), and long-term (per day).

Table 1: Raw Events - Every single thing that happened

This is the detailed record of everything. We keep it in a data lake (like S3) because there is too much data to keep in a database forever.

Column | What it stores | Example
event_id | Unique ID for this event | evt_abc123xyz
event_time | When it actually happened (event time) | 2024-03-15 14:23:45.123 UTC
+ 9 more rows...

Table 2: Minute Aggregates - Summaries for real-time dashboards

Every minute, we count up all the events and store the totals. This makes dashboards fast - instead of counting 1 million rows, we read 1 row.

Column | What it stores | Example
time_bucket | Which minute this is for | 2024-03-15 14:23:00
event_type | What kind of event | page_view
+ 9 more rows...

Why approximate unique users?

Counting exact unique users across billions of events is expensive - you need to remember every user ID. We use HyperLogLog, a clever algorithm that gives us 99% accurate counts using very little memory. For dashboards, this is good enough.
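For illustration, here is a hedged sketch of HyperLogLog in practice, assuming the open-source datasketch Python package (any HLL implementation is used the same way):

# Approximate unique-user counting with HyperLogLog; error is roughly 1-2%.
# Assumes the open-source `datasketch` package.
from datasketch import HyperLogLog

hll = HyperLogLog(p=14)          # 2^14 registers, roughly 16 KB of memory

# Feed user IDs into the sketch as events stream through the aggregator.
for user_id in (f"user_{i % 250_000}" for i in range(1_000_000)):
    hll.update(user_id.encode("utf-8"))

print("approx unique users:", int(hll.count()))   # close to 250,000

The tiny register array can be stored per time bucket, and sketches for adjacent buckets can be merged, which is what lets us answer "unique users in the last hour" from pre-aggregated data - something exact counts cannot do cheaply.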

Table 3: Hour and Day Aggregates - For historical queries

Same structure as minute aggregates, but grouped by hour and day. We roll up minute data into hourly, and hourly into daily. This keeps the database small and queries fast.

-- Minute aggregates table (partitioned by date for fast queries)
CREATE TABLE minute_aggregates (
    time_bucket DateTime,
+ 21 more lines...

Important: Partitioning by time

Always partition analytics data by time (day or hour). When someone asks for the last hour of data, the database only looks at today's partition - it skips all historical data. Without partitioning, every query would scan everything. Partitioning makes queries 100x faster.
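A toy illustration of partition pruning, using a hypothetical one-partition-per-day layout and made-up row counts, just to show why the query planner can ignore most of the data:

# Why time partitioning helps: a "last hour" query only opens today's partition.
from datetime import datetime, timedelta

partitions = {                      # partition key -> row count (fake numbers)
    "2024-03-13": 86_000_000_000,
    "2024-03-14": 86_000_000_000,
    "2024-03-15": 52_000_000_000,   # today, still filling
}

def partitions_to_scan(query_start: datetime, query_end: datetime):
    """Return only the partitions whose day overlaps the query range."""
    keep = []
    for day in partitions:
        day_start = datetime.strptime(day, "%Y-%m-%d")
        if day_start <= query_end and day_start + timedelta(days=1) > query_start:
            keep.append(day)
    return keep

now = datetime(2024, 3, 15, 14, 30)
print(partitions_to_scan(now - timedelta(hours=1), now))   # ['2024-03-15'] only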

Table 4: Dimensions - Lookup tables for extra details

Instead of storing country="United States" in every event (wastes space), we store country_id=1 and look up the full name when needed.

Table | What it stores | Example rows
countries | Country details | id=1, code="US", name="United States"
devices | Device categories | id=1, type="mobile", os="iOS"
event_types | Event definitions | id=1, name="page_view", category="engagement"
users | User profiles (if needed) | id=123, name="John", segment="premium"

Stream Processing Deep Dive

Let me explain step by step how we turn raw events into aggregates in real-time.

The processing pipeline has 5 stages:

Stage 1: Read from Kafka
  • Our stream processor (Flink) reads events from Kafka topics
  • Events are distributed across partitions for parallelism
  • Each worker processes events from assigned partitions

Stage 2: Parse and validate
  • Parse the JSON event into structured fields
  • Validate required fields exist (event_time, event_type)
  • Drop or quarantine invalid events

Stage 3: Enrich with dimensions
  • Look up user segment from user table
  • Convert country code to region
  • Add any computed fields

Stage 4: Window and aggregate
  • Group events into time windows (1-minute buckets)
  • Count events, sum values, compute percentiles
  • Handle late data using watermarks

Stage 5: Write to storage
  • Write aggregates to ClickHouse
  • Write raw events to data lake
  • Update real-time cache

FUNCTION process_event_stream():
    // This runs continuously, processing events as they arrive
    
+ 40 more lines...

(Diagram: How windowing works)

Handling late data in detail:

  1. Events within watermark (up to 5 minutes late): Included in the correct window. The window stays open until the watermark passes.
  2. Events after the watermark has closed: We have two options:
     - Option A: Side output - Send to a separate stream, process in batch later
     - Option B: Allowed lateness - Re-open the window, update the aggregate
  3. Very late events (hours or days): Always go to batch processing. The real-time aggregates are already finalized.

FUNCTION configure_late_data_handling():
    // Configure the window to handle late data
    
+ 37 more lines...

Tell the interviewer about exactly-once

Flink guarantees exactly-once processing through checkpointing. Every few seconds, it saves its state (which events it has processed, current aggregate values). If a failure happens, it restarts from the last checkpoint. Combined with Kafka's offsets, we never lose events and never double-count.
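As a rough sketch, this is what turning on checkpointing looks like in PyFlink (Flink's Python API). Exact method names can vary slightly between Flink versions, so treat it as indicative rather than copy-paste:

# Hedged sketch: enable exactly-once checkpoints in PyFlink.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot operator state (Kafka offsets, open window aggregates) every 10 seconds.
env.enable_checkpointing(10_000)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# On failure, Flink restores the last snapshot and rewinds the Kafka consumer to
# the offsets captured in it, so no event is lost and none is counted twice.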

Query Layer and Dashboards

Now let me explain how dashboards get their data fast - even when we have billions of rows.

What to tell the interviewer

Dashboards must load in under 1 second. We achieve this with three techniques: (1) Pre-aggregation - the hard math is already done, (2) Caching - frequently accessed data is in Redis, (3) Query optimization - ClickHouse is built for this.

How a dashboard query works (step by step):

Scenario: User opens a dashboard showing Page views by country for the last hour

  1. Dashboard sends query to API: GET /metrics?metric=page_views&group_by=country&period=1h
  2. API checks Redis cache first:
     - Cache key: metrics:page_views:by_country:1h:2024031514 (includes current hour)
     - If found and fresh (under 10 seconds old), return immediately
     - If not found, continue to database
  3. API builds SQL query for ClickHouse: SELECT country, sum(event_count) AS views FROM minute_aggregates WHERE event_type = 'page_view' AND time_bucket >= now() - INTERVAL 1 HOUR GROUP BY country
  4. ClickHouse executes query:
     - Only scans today's partition (fast)
     - Reads only the columns needed (country, event_count)
     - Returns results in ~50ms
  5. API caches and returns result:
     - Store in Redis with 10-second TTL
     - Return to dashboard
  6. Dashboard renders chart

FUNCTION query_metrics(metric, group_by, time_range):
    // Example: query_metrics("page_views", "country", "1h")
    
+ 37 more lines...

Making queries fast - three levels of optimization:

Level 1: Pre-aggregation (most important)
  • Instead of counting raw events (1 million per second), we query pre-computed aggregates (1 row per minute per dimension)
  • A query that would scan 60 million events (1 hour) now scans 60 rows
  • This is 1,000,000x less work

Level 2: Columnar storage (ClickHouse magic)
  • Regular databases store data by row: [name, age, country, ...], [name, age, country, ...]
  • ClickHouse stores data by column: [name, name, name, ...], [age, age, age, ...]
  • If your query only needs country and count, it only reads those columns, skipping everything else
  • Also enables much better compression (similar values together compress well)

Level 3: Caching (for hot data)
  • Many users look at the same dashboards
  • Instead of running the same query 1000 times, run once and cache
  • Redis can serve 100,000+ requests per second
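A minimal sketch of that cache layer using the redis-py client. The key format and 10-second TTL follow the example above, and run_query stands in for whatever issues the ClickHouse query:

# Dashboard cache layer sketch using redis-py; key names are made up.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 10    # matches the dashboard refresh interval

def cached_metric(cache_key: str, run_query):
    """Serve from Redis when possible; otherwise query and cache the result."""
    hit = r.get(cache_key)
    if hit is not None:
        return json.loads(hit)

    result = run_query()                       # e.g. the ClickHouse query
    r.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(result))
    return result

# A thousand dashboards asking for the same hour share one database query per
# 10-second window (combine with the single-flight lock described below).
rows = cached_metric("metrics:page_views:by_country:1h",
                     lambda: [{"country": "US", "views": 12345}])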

(Diagram: Query flow with caching)

Cache stampede problem

If 1000 dashboards all refresh at the same time and the cache is empty, all 1000 will hit the database simultaneously. Solution: Use a lock so only the first request queries the database. Others wait for the cache to be populated. This is called request coalescing or single-flight.

FUNCTION query_with_coalescing(cache_key, query_function):
    // Prevent multiple requests from hitting DB for the same data
    
+ 31 more lines...

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong in a real-time analytics system and how we protect against them.

Common failures and how we handle them:

What breaks | What happens to users | How we fix it | Why this works
Kafka broker goes down | Nothing - Kafka has replicas | Data is replicated to 3 brokers. If one dies, others take over. | Kafka is designed for this. We lose no events.
Stream processor crashes | Brief delay in aggregates | Flink checkpoints state to S3. Restarts from last checkpoint. | We replay events from Kafka since the checkpoint. No data loss.
+ 5 more rows...

Making sure we never lose events (durability):

Events are our source of truth. If we lose them, we lose business data forever. Here is how we protect them:

  1. At ingestion: Events are written to Kafka with acks=all - the write is not confirmed until 3 brokers have it.
  2. In Kafka: Data is replicated across 3 brokers in different racks. If 2 brokers die, data is still safe.
  3. During processing: Flink checkpoints to durable storage (S3). If processing fails, we restart from checkpoint and replay from Kafka.
  4. In storage: ClickHouse replicates data. Raw events also go to S3 data lake (triple replicated).
  5. Recovery: If everything fails, we can rebuild aggregates from raw events in the data lake.

UNDERSTANDING FLINK CHECKPOINTS:

// Every 10 seconds, Flink saves its state:
+ 18 more lines...

What is exactly-once semantics? (important interview topic)

Exactly-once means every event is processed exactly one time - not zero (lost), not two or more (duplicated). Flink achieves this with checkpoints + idempotent writes. If we crash and replay, we might process an event twice, but the write to ClickHouse uses upsert (update or insert) with the event ID, so duplicates just overwrite with the same value.
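A toy illustration of why keyed, idempotent writes make replays harmless. Here the "database" is just a dict, and the aggregate key plays the role of the upsert key in the real store:

# Why idempotent (keyed) writes make replays safe: writing the same key twice
# overwrites with identical values instead of double counting.
aggregates = {}   # (time_bucket, event_type, country) -> counts

def upsert_aggregate(time_bucket, event_type, country, event_count, sum_value):
    """Deterministic key means a replayed write lands on the same row."""
    aggregates[(time_bucket, event_type, country)] = {
        "event_count": event_count,
        "sum_value": sum_value,
    }

# First attempt crashes after the write but before the Kafka offset commit...
upsert_aggregate("2024-03-15 14:23", "purchase", "US", 412, 18_930.50)
# ...so the window is recomputed and written again on replay.
upsert_aggregate("2024-03-15 14:23", "purchase", "US", 412, 18_930.50)

assert aggregates[("2024-03-15 14:23", "purchase", "US")]["event_count"] == 412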

Handling traffic spikes (backpressure):

During Black Friday or a viral event, traffic might spike 10x. Here is how we handle it:

  1. Kafka absorbs the spike - Kafka is configured to retain data for 7 days. Even if processing is slow, events are safe.
  2. Stream processor scales up - Flink can auto-scale based on backlog. More workers = faster processing.
  3. Aggregates might be delayed - Instead of showing data from 5 seconds ago, we might show data from 30 seconds ago. Still useful.
  4. We prioritize important data - Critical alerting streams get dedicated resources. Dashboard updates can wait.
  5. We shed load gracefully - If overwhelmed, we can sample (process 1 in 10 events) rather than drop randomly; a small sketch of this follows below.
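As referenced in point 5, here is a small sketch of deterministic sampling: hash on the event ID so a given event is always kept or always dropped, then scale the counts back up for display.

# Graceful load shedding: keep a deterministic 1-in-10 sample and scale counts
# back up, rather than dropping events at random.
import zlib

SAMPLE_RATE = 10   # keep 1 in 10 events while overloaded

def keep_event(event_id: str) -> bool:
    """Hash on event_id so the same event is consistently kept or dropped."""
    return zlib.crc32(event_id.encode("utf-8")) % SAMPLE_RATE == 0

def estimated_count(sampled_count: int) -> int:
    """Counts computed from the sample are multiplied back up for dashboards."""
    return sampled_count * SAMPLE_RATE

kept = sum(keep_event(f"evt_{i}") for i in range(100_000))
print(kept, "events kept, estimated total ~", estimated_count(kept))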

(Diagram: How backpressure flows through the system)

Growing the System Over Time

What to tell the interviewer

This design handles millions of events per second in a single region. Let me explain how we would grow it for global scale, higher throughput, or new requirements.

How we grow step by step:

Stage 1: Single region (1M events/sec)
  • One Kafka cluster (10-20 brokers)
  • One Flink cluster (50-100 workers)
  • One ClickHouse cluster (20-50 nodes)
  • This is where most companies operate

Stage 2: High availability (survive failures)
  • Add standby Kafka cluster (sync replication)
  • Run Flink on Kubernetes (auto-restart failed workers)
  • ClickHouse with replicated tables (3 copies of each shard)
  • Automatic failover for all components

Stage 3: Global scale (multiple regions)
  • Kafka clusters in each region (US, EU, Asia)
  • Process data locally for low latency
  • Replicate aggregates to central location for global views
  • Users see local data fast, global data with small delay

Stage 4: Extreme scale (10M+ events/sec)
  • Multiple Kafka clusters per region
  • Tiered processing (raw → minute → hour → day)
  • Dedicated clusters for different use cases (alerts vs dashboards)
  • This is Google/Meta/Netflix scale

(Diagram: Multi-region deployment)

Cool features we can add later:

1. Anomaly detection (automatic alerting)

Instead of setting fixed thresholds (alert if error rate > 5%), use machine learning to detect unusual patterns automatically.

FUNCTION detect_anomalies(metric_name, current_value):
    // Compare current value to historical patterns
    
+ 19 more lines...

2. Drill-down queries (find root cause)

When an alert fires (error rate high!), users want to drill down: which country? which endpoint? which user segment?

  • Keep raw events for 24-48 hours in a fast store (not just aggregates)
  • Allow filtering and grouping on any dimension
  • Pre-compute common drill-down paths

3. Custom metrics (user-defined calculations)

  • Let users define their own metrics: conversion_rate = purchases / page_views
  • Store the formula, compute it at query time
  • Or pre-compute for very common custom metrics

4. Data quality monitoring

  • Track if events are arriving late more often (infrastructure problem?)
  • Alert if event volume drops unexpectedly (SDK bug?)
  • Detect duplicate events (retry bug?)

Real-world examples

Netflix uses their system (Mantis + Druid) to track billions of events daily for real-time insights. LinkedIn built Unified Streaming and Batch to process petabytes for analytics. Uber uses Apache Kafka + Flink + Pinot for trip data. These systems evolved over years - start simple and add complexity as needed.

Design Trade-offs

Advantages

  • Freshest data possible (sub-second)
  • Can react to events instantly
  • Natural fit for alerting

Disadvantages

  • More complex to build
  • Harder to handle late data
  • More expensive (always running)
  • Exactly-once is tricky

When to use

Use when you need data fresher than 1 minute: real-time alerting, fraud detection, live dashboards during events.