Design Walkthrough
Problem Statement
The Question: Design a time-series database like InfluxDB that can handle high write throughput, efficient range queries, and automatic data retention.
What is time-series data? (explain this to the interviewer)
Time-series data is any data that has a timestamp. Think of it like a diary - every entry has a date and time. Examples:
- Server monitoring: CPU usage is 45% at 2:00:00, 47% at 2:00:01, 52% at 2:00:02...
- IoT sensors: Temperature is 72F at 10:00, 73F at 10:01, 71F at 10:02...
- Stock prices: Apple stock is $150.25 at 9:30:00, $150.30 at 9:30:01...
- Website analytics: 1000 visitors at 3pm, 1200 visitors at 4pm...
What the database needs to do (most important first):
- 1.Write data very fast - Accept millions of data points every second. Data arrives constantly and we cannot tell sensors to slow down.
- 2.Query by time range - Show me all CPU readings from 2pm to 3pm. This must be fast even if there are billions of data points.
- 3.Compress data - Time-series data is huge. 1 million points per second = 86 billion points per day. We need to squeeze it down.
- 4.Delete old data automatically - Keep last 30 days, delete everything older. Storage is expensive and old data is rarely needed.
- 5.Aggregate data - Instead of every second, show me the average per minute or per hour. This makes dashboards fast.
- 6.Tag and filter - Show me CPU for server-1 only, or all servers in the us-east region.
What to say first
Let me first understand the use case. Is this for server monitoring, IoT sensors, or financial data? Each has different needs. Monitoring data can lose a few points - not critical. Financial data cannot lose anything. I will also ask about the write rate and how long we need to keep data.
What the interviewer really wants to see:
- Do you understand why regular databases (like MySQL) are bad for time-series?
- Can you design a write path that handles millions of writes per second?
- Do you know how to organize data so time-range queries are fast?
- Can you explain compression techniques for time-series data?
- How do you automatically delete old data without slowing down the system?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How fast is data coming in?
How many data points per second do we need to write? Are there spikes where suddenly 10x more data arrives?
Why ask this: This tells us how powerful our write system needs to be.
What interviewers usually say: 1 million writes per second normally, up to 5 million during spikes (like when many servers restart at once).
How this changes your design: We need a system that can burst to handle spikes. We might buffer writes in memory and batch them to disk.
Question 2: How long do we keep data?
Do we keep data forever or delete it after some time? Do we keep recent data in full detail and older data in summary form?
Why ask this: Keeping data forever would need infinite storage. Most companies keep recent data (last 30 days) in full detail and older data as summaries (hourly averages).
What interviewers usually say: Keep raw data for 30 days, hourly summaries for 1 year, daily summaries forever.
How this changes your design: We need background jobs that create summaries and delete old raw data. We also organize data by time so deleting old data is fast (just delete old files).
Question 3: What queries do users run?
Do users mostly query recent data (last hour) or also historical data (last year)? Do they query one metric or many metrics at once?
Why ask this: If 90% of queries are for the last hour, we should keep recent data in memory for super fast access.
What interviewers usually say: 80% of queries are for last 24 hours. Users often query multiple metrics together (CPU and memory for the same server).
How this changes your design: We keep recent data (last 24 hours) in a fast cache. We store data so that metrics from the same source are close together on disk.
Question 4: Can we lose any data?
If a server crashes, is it okay to lose the last few seconds of data? Or must every single data point be saved?
Why ask this: Durability (never losing data) costs performance. Writing to disk is slow. If we can lose a few seconds of data on crash, we can keep data in memory longer and write faster.
What interviewers usually say: For monitoring, losing a few seconds is okay. For financial data, we cannot lose anything.
How this changes your design: For monitoring: batch writes in memory, flush every few seconds. For financial: write to a log immediately before saying OK.
Summarize your assumptions
Let me summarize: We need to write 1 million points per second with 5x burst capacity. We keep raw data for 30 days, summaries for 1 year. Most queries are for recent data. We can tolerate losing up to 5 seconds of data on crash. I will design for a monitoring use case.
Why Regular Databases Are Bad for This
Say this to the interviewer
Before I design a solution, let me explain why we cannot just use PostgreSQL or MySQL. This shows I understand the problem deeply.
Problem 1: Regular databases update indexes on every write
When you insert a row in MySQL, it updates a B-tree index. This is great for finding any row quickly, but terrible when you are writing millions of rows per second. The index becomes a bottleneck.
Time-series insight: We almost never look up a single data point. We always query ranges (give me all points from 2pm to 3pm). We do not need a fancy index for every row.
Problem 2: Regular databases store rows scattered on disk
In MySQL, row 1 might be on disk block 100, row 2 on block 5000, row 3 on block 200. When you query a time range, the disk head jumps all over the place. This is very slow.
Time-series insight: Data points that are close in time should be close on disk. If we store January data together and February data together, querying January is just one sequential disk read.
Problem 3: Regular databases do not compress well
Time-series data has patterns. If CPU is around 50% for an hour, storing 50.1, 50.2, 50.0, 50.3... is wasteful. We could store 50 and then just the small differences (+0.1, +0.1, -0.2, +0.3).
Regular databases store each value fully. Time-series databases use special compression that understands patterns in the data.
Problem 4: Deleting old data is slow in regular databases
In MySQL, DELETE FROM metrics WHERE timestamp < 30_days_ago would scan millions of rows and take hours. It would also lock the table and slow down writes.
Time-series insight: If we store each day in a separate file, deleting old data means just deleting the file. rm january_15.db takes milliseconds, not hours.
Regular DB vs Time-Series DB
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Data points per second | 1 million (5M peak) | Need to batch writes - cannot hit disk 1M times per second |
| Size of one data point | About 20 bytes | timestamp (8 bytes) + value (8 bytes) + tags (4 bytes) |
What to tell the interviewer
At 1 million writes per second, we absolutely need batching - SSDs can only do about 100K random writes per second. By batching 10,000 points into one write, we turn 1M writes into just 100 disk operations. The key insight is that time-series data arrives in order, so we can buffer it in memory and write in chunks.
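To make the batching argument concrete, you can do the arithmetic out loud. A quick sketch using the numbers above (the 10,000-point batch size comes from the paragraph above; the 10x compression ratio is borrowed from the compression section later):

```python
# Back-of-envelope sizing with the illustrative numbers from the table above.
POINTS_PER_SEC = 1_000_000
BYTES_PER_POINT = 20          # 8B timestamp + 8B value + ~4B tag references
BATCH_SIZE = 10_000           # points grouped into one disk write
COMPRESSION_RATIO = 10        # typical for time-series data (see Compression section)

raw_per_day = POINTS_PER_SEC * 86_400 * BYTES_PER_POINT   # bytes/day before compression
disk_writes_per_sec = POINTS_PER_SEC / BATCH_SIZE          # physical writes after batching

print(f"Ingest rate:     {POINTS_PER_SEC * BYTES_PER_POINT / 1e6:.0f} MB/s")
print(f"Raw volume/day:  {raw_per_day / 1e12:.2f} TB")
print(f"Compressed/day:  {raw_per_day / COMPRESSION_RATIO / 1e9:.0f} GB")
print(f"Disk writes/sec: {disk_writes_per_sec:.0f} (vs. 1,000,000 without batching)")
```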
How people use a time-series database (from most common to least):
- 1.Dashboard queries - Show me CPU for the last hour, refreshed every 10 seconds. These need to be FAST.
- 2.Alert queries - Is CPU above 90% for the last 5 minutes? These run constantly in the background.
- 3.Investigation queries - What happened yesterday at 3pm when the site went down? Scan longer time ranges.
- 4.Batch analytics - What was the average CPU per server last month? These can be slower.
WRITES:
1 million points per second
= 60 million per minute
= 86.4 billion per day

High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will explain each part simply.
What to tell the interviewer
The design has three main parts: (1) A fast write path that buffers data in memory and batches writes to disk, (2) A storage engine that organizes data by time for fast range queries, (3) A query engine that knows how to efficiently scan time ranges and aggregate results.
Time-Series Database - The Big Picture
The Write Path - How data gets in:
- 1.Load Balancer spreads incoming data across multiple servers
- 2.Ingestion Service validates the data (is the timestamp valid? is the metric name okay?)
- 3.Write-Ahead Log (WAL) saves data to a simple log file immediately - this is our crash recovery
- 4.Memory Buffer collects many data points before writing to the main storage
Key insight: We write to WAL (fast sequential write) immediately for safety, but we batch up points in memory before writing to the main storage (slow random writes).
The Storage Engine - How data is organized:
- 1.Time Chunks - Data is split into files by time. All data from 2pm-3pm goes in one file.
- 2.Compaction - Background job that merges small files into bigger ones and compresses them
- 3.Time Index - A small lookup table that says chunk for 2pm-3pm is in file X
Key insight: By organizing data by time, a query for last hour only needs to open 1-2 files, not scan the entire database.
The Query Engine - How we find data fast:
- 1.Query Parser - Understand what the user wants (SELECT cpu WHERE time > 1 hour ago)
- 2.Query Planner - Figure out which time chunks to scan
- 3.Query Executor - Actually read the chunks and compute aggregates (avg, max, min)
- 4.Query Cache - Save recent query results so repeated queries are instant
How real systems do it
InfluxDB uses this exact pattern - WAL for durability, memory buffer for batching, time-structured merge tree (TSM) for storage. Prometheus uses a similar approach but optimizes for pull-based monitoring. TimescaleDB extends PostgreSQL with automatic time partitioning.
Data Model
Now let me explain how we organize the data. Time-series data has a special structure that is different from regular database tables.
What to tell the interviewer
Time-series data has three parts: a timestamp (when), tags (what/where), and values (the actual numbers). We organize data so that points with the same tags are stored together, making queries fast.
The three parts of a data point:
- 1.Timestamp - When was this recorded? (Example: 2024-03-15 14:30:00.123)
- 2.Tags - What is this about? Tags are like labels that describe the data.
  - metric_name: cpu.usage
  - host: server-1
  - region: us-east
  - datacenter: dc-1
- 3.Value - The actual number (Example: 45.7 percent)
Point 1:
  timestamp: 2024-03-15 14:30:00
  tags: {metric: "cpu.usage", host: "server-1", region: "us-east"}
  value: 45.7

What is a Series?
A series is all data points that have the same tags. Think of it as a single line on a graph.
Example: All CPU readings for server-1 in us-east form one series. If you draw a line graph, this would be one line.
Why this matters: We store each series together on disk. When you ask for CPU of server-1, we read just that one series, not the whole database.
| Concept | What it means | Example |
|---|---|---|
| Metric | The name of what we are measuring | cpu.usage, memory.free, http.requests |
| Tag | A label that describes the data | host=server-1, region=us-east |
| Series | All points with same metric + tags | cpu.usage where host=server-1 |
| Point | One measurement at one time | cpu.usage=45.7 at 14:30:00 |
| Field | The actual value (can have multiple) | value=45.7, or min=40, max=50, avg=45 |
How we store this on disk:
We group data in two ways:
- 1.By time - All data from 2pm-3pm goes in one chunk (file)
- 2.By series - Within each chunk, data from the same series is stored together
How data is organized on disk
The Series Cardinality Problem
Cardinality means how many unique series you have. If you have 1000 servers and 100 metrics, that is 100,000 series. This is fine. But if you add a tag like user_id with 1 million users, suddenly you have 100 billion series. This kills performance. Rule: Never use high-cardinality values (like user IDs or request IDs) as tags.
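A small sketch of how a series identifier might be built and why cardinality explodes (the `series_key` format below is an illustration, not InfluxDB's actual encoding):

```python
# A series is identified by its metric name plus its sorted tag set.
def series_key(metric: str, tags: dict[str, str]) -> str:
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}|{tag_part}"

print(series_key("cpu.usage", {"host": "server-1", "region": "us-east"}))
# -> cpu.usage|host=server-1,region=us-east

# Cardinality is multiplicative across tag values.
metrics, hosts = 100, 1_000
print(metrics * hosts)               # 100,000 series - fine
print(metrics * hosts * 1_000_000)   # add a user_id tag with 1M values -> 100 billion series
```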
Write Path Deep Dive
Let me explain step by step how data gets written. This is the most important part because time-series databases live or die by write performance.
What to tell the interviewer
The write path has to balance three things: speed (accept writes fast), durability (do not lose data), and organization (store data so reads are fast). We use a Write-Ahead Log for durability and an in-memory buffer for batching.
Step by step: What happens when data arrives
- 1.Receive and validate - Check the data is correct (valid timestamp, metric name is not empty)
- 2.Write to WAL - Immediately write to a simple log file. This is fast because it is just appending to a file. If we crash, we can recover from this log.
- 3.Add to memory buffer - Put the data point in an in-memory structure organized by series. This is super fast because it is just memory.
- 4.Return OK - Tell the client the write succeeded. At this point, data is in WAL and memory.
- 5.Background: Flush to chunks - When the memory buffer gets big enough (or after X seconds), write all the buffered data to chunk files on disk.
- 6.Background: Delete WAL - After data is safely in chunk files, delete the old WAL entries.
FUNCTION write_point(timestamp, tags, value):
    STEP 1: Validate the data
    STEP 2: Append the point to the write-ahead log
    STEP 3: Add to the memory buffer (grouped by series) and return OK

The Memory Buffer - Why it is critical
Imagine writing 1 million points per second directly to disk files. Each write needs to:
1. Find the right file
2. Seek to the right position
3. Write the data
4. Update the file index
This would be impossibly slow. SSDs can only do about 100K random writes per second.
Instead, we buffer points in memory and write them in batches:
- Collect 1 million points over 1 second
- Organize them by series
- Write each series as one sequential block
- Now we only need ~1000 writes (one per series), not 1 million
Memory buffer batching
What if we crash before flushing?
The WAL saves us. On startup, we replay the WAL to recover any data that was in memory but not yet flushed. The tradeoff: WAL writes add latency, but only a tiny bit (appending to a file is fast). If you want even faster writes and can tolerate losing a few seconds of data, you can skip the WAL entirely.
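A minimal sketch of this write path, assuming JSON-per-line WAL entries and JSON chunk files purely for readability (a real engine would use a binary, compressed format, and the class and file names here are illustrative):

```python
import json
import time
from collections import defaultdict

class WritePath:
    def __init__(self, wal_path="wal.log", flush_threshold=100_000):
        self.wal = open(wal_path, "a")
        self.buffer = defaultdict(list)      # series key -> [(timestamp, value), ...]
        self.buffered_points = 0
        self.flush_threshold = flush_threshold

    def write_point(self, timestamp, tags, value):
        # STEP 1: validate
        if timestamp <= 0 or "metric" not in tags:
            raise ValueError("invalid point")
        # STEP 2: append to the WAL (cheap sequential write; fsync policy sets durability)
        self.wal.write(json.dumps([timestamp, tags, value]) + "\n")
        # STEP 3: add to the in-memory buffer, grouped by series
        series = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        self.buffer[series].append((timestamp, value))
        self.buffered_points += 1
        # STEP 4 happens in the background: flush once the buffer is big enough
        if self.buffered_points >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write per series, points sorted by time.
        chunk_path = f"chunk_{int(time.time())}.json"
        with open(chunk_path, "w") as chunk:
            json.dump({s: sorted(pts) for s, pts in self.buffer.items()}, chunk)
        self.buffer.clear()
        self.buffered_points = 0
        # The WAL entries covering these points can now be dropped.

wp = WritePath(flush_threshold=2)
wp.write_point(1710513000, {"metric": "cpu.usage", "host": "server-1"}, 45.7)
wp.write_point(1710513001, {"metric": "cpu.usage", "host": "server-1"}, 47.2)  # triggers a flush
```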
Handling out-of-order data
Sometimes data arrives late. A sensor might send data from 5 minutes ago. We have options:
- 1.Reject it - Simple but you lose data
- 2.Keep a buffer for late data - Accept data up to X minutes late, merge it in during compaction
- 3.Separate late data - Write late data to a special location, merge later
Most systems use option 2 with a 5-15 minute window for late data.
Storage Engine Deep Dive
Now let me explain how we store data on disk so that time-range queries are fast.
What to tell the interviewer
The storage engine uses a Time-Structured Merge Tree (TSM). Data is organized into chunks by time period. Each chunk contains data sorted by series, then by time. This makes time-range queries fast because we read sequentially, not randomly.
What is a Chunk?
A chunk is a file that contains all data for a time period (like one hour). Inside the chunk:
- 1.Series Index - A list of all series in this chunk and where their data starts
- 2.Data Blocks - The actual timestamps and values, grouped by series
- 3.Compression - Everything is compressed to save space
CHUNK FILE: data_2024-03-15_14-15.tsm
=== HEADER ===
=== DATA BLOCKS (timestamps and values, grouped by series, compressed) ===
=== SERIES INDEX (series -> offset of its data block) ===

Compression - How we shrink data 10x
Time-series data compresses really well because it has patterns:
Timestamp compression (Delta-of-Delta):
- Timestamps are usually evenly spaced (every 1 second)
- Instead of storing: 1000, 1001, 1002, 1003
- Store the first value and the pattern: 1000, then +1, +1, +1
- Even better: store delta-of-deltas. If the interval is always 1, the delta-of-delta is 0, 0, 0
Value compression (Gorilla/XOR):
- Values often do not change much (CPU stays around 50%)
- Instead of storing full 64-bit floats: 50.1, 50.2, 50.15, 50.18
- XOR consecutive values - the result has lots of zeros
- Zeros compress really well
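A toy illustration of both ideas (real encoders such as Gorilla pack these residuals into variable-length bit streams; this sketch only shows why the residuals are tiny):

```python
import struct

# Delta-of-delta for timestamps: regular intervals collapse to (almost all) zeros.
timestamps = [1000, 1001, 1002, 1003, 1005]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]           # [1, 1, 1, 2]
delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]          # [0, 0, 1]
print(deltas, delta_of_deltas)   # mostly zeros -> a couple of bits per point

# XOR for float values: similar values share most of their bit pattern.
def float_bits(x: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", x))[0]

values = [50.0, 50.0, 50.1, 50.1]
for a, b in zip(values, values[1:]):
    print(f"{float_bits(a) ^ float_bits(b):064b}")   # long runs of zeros -> few meaningful bits
```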
| What we compress | Technique | Typical compression |
|---|---|---|
| Timestamps | Delta-of-delta encoding | 8 bytes -> 1-2 bits per point |
| Integer values | Delta + ZigZag + Varint | 8 bytes -> 1-2 bytes |
| Float values | XOR encoding (Gorilla) | 8 bytes -> 1-2 bytes |
| Tag values | Dictionary encoding | "us-east" -> just the number 1 |
| Everything | Block compression (LZ4) | Additional 2-3x compression |
Compaction - Keeping storage efficient
Over time, we accumulate many small chunk files. Reading 100 small files is slower than reading 10 bigger files. Compaction merges small files into bigger ones.
How compaction works:
1. Find small chunk files from the same time period
2. Read all the data from them
3. Merge the data (combine series, sort by time)
4. Compress and write to a new bigger file
5. Delete the old small files
When we compact:
- Level 1: Every 15 minutes, merge the tiny chunks from memory flushes
- Level 2: Every hour, merge the 15-minute chunks
- Level 3: Every day, merge the hourly chunks
Compaction merges small files into big ones
Why not just write big files from the start?
We cannot predict the future. Data arrives continuously. If we waited to collect a full hour of data before writing, we would need huge amounts of memory. Also, queries for very recent data would not see it. The solution is: write small files often, compact them later.
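A minimal sketch of the merge step, assuming each chunk is a dict of series -> time-sorted points (real compaction also re-compresses the blocks and rewrites the series index):

```python
import heapq
from collections import defaultdict

def compact(chunks: list[dict[str, list[tuple[int, float]]]]) -> dict[str, list[tuple[int, float]]]:
    """Merge several small chunks into one larger chunk, keeping points sorted by time."""
    merged = defaultdict(list)
    for series in {s for chunk in chunks for s in chunk}:
        runs = [chunk[series] for chunk in chunks if series in chunk]
        # Each run is already time-sorted, so a k-way merge is enough.
        merged[series] = list(heapq.merge(*runs))
    return dict(merged)

small_a = {"cpu|host=server-1": [(1000, 45.7), (1002, 46.0)]}
small_b = {"cpu|host=server-1": [(1001, 47.2)]}
print(compact([small_a, small_b]))
# {'cpu|host=server-1': [(1000, 45.7), (1001, 47.2), (1002, 46.0)]}
```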
Query Engine Deep Dive
Now let me explain how we find and return data when someone queries the database.
What to tell the interviewer
Queries follow three steps: (1) Parse the query to understand what is needed, (2) Plan which chunks to read (skip chunks outside the time range), (3) Execute by reading only the needed data and computing aggregations. We use caching and parallel reads to make it fast.
Types of queries:
- 1.Raw data query - Give me every CPU reading from 2pm to 3pm. Returns thousands of points. Used for detailed investigation.
- 2.Aggregation query - Give me the AVERAGE CPU per minute from 2pm to 3pm. Returns 60 points (one per minute). Used for dashboards (much faster).
- 3.Filter query - Give me CPU where value > 90 from 2pm to 3pm. Returns only high values. Used for alerting.
-- Raw data: Get all CPU readings for server-1 in the last hour
SELECT value
FROM cpu.usage
WHERE host = 'server-1'
AND time > now() - 1h

-- Aggregation: Average CPU per minute for the last hour
SELECT mean(value)
FROM cpu.usage
WHERE host = 'server-1'
AND time > now() - 1h
GROUP BY time(1m)

-- Filter: Find when CPU was above 90%
SELECT value
FROM cpu.usage
WHERE host = 'server-1'
AND value > 90
AND time > now() - 24h

Step by step: How a query runs
- 1.Parse - Understand the query (what metric, what time range, what aggregation)
- 2.Plan - Figure out which chunks contain data for this time range
  - Query: time > 2pm AND time < 3pm
  - Chunks: We have 14:00-14:30.tsm and 14:30-15:00.tsm
  - Only need to read those 2 files, skip all others
- 3.Filter series - Within each chunk, find only the series we need
  - Query asks for host=server-1
  - Chunk index shows cpu.usage|host=server-1 at offset 1000
  - Jump directly to offset 1000, skip other series
- 4.Read and decompress - Read the data block, decompress timestamps and values
- 5.Apply filters - If query has WHERE value > 90, filter out values <= 90
- 6.Aggregate - If query has GROUP BY time(1m), compute mean/max/min for each minute
- 7.Return results - Send back to the client
Query execution flow
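A sketch of steps 2 and 6 above - chunk pruning followed by a GROUP BY time(1m) mean - using simplified in-memory chunk metadata (a real planner works against the on-disk series index, not Python dicts):

```python
from collections import defaultdict

def prune_chunks(chunks, start, end):
    """Keep only chunks whose [min_time, max_time) overlaps the query range."""
    return [c for c in chunks if c["max_time"] > start and c["min_time"] < end]

def mean_per_minute(points, start, end):
    buckets = defaultdict(list)                    # minute bucket -> values
    for ts, value in points:
        if start <= ts < end:
            buckets[ts - ts % 60].append(value)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

chunks = [
    {"min_time": 0,    "max_time": 3600, "points": [(10, 40.0), (70, 50.0)]},
    {"min_time": 3600, "max_time": 7200, "points": [(3650, 90.0)]},
]
relevant = prune_chunks(chunks, start=0, end=120)      # the second chunk is skipped entirely
points = [p for c in relevant for p in c["points"]]
print(mean_per_minute(points, 0, 120))                 # {0: 40.0, 60: 50.0}
```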
Making queries fast:
- 1.Chunk pruning - Skip chunks outside the time range. If you query last hour, we ignore files from yesterday.
- 2.Series index - Each chunk has an index. We look up the series and jump directly to it. No scanning.
- 3.Parallel reads - Read multiple chunks at the same time using multiple threads.
- 4.Query cache - Dashboard queries repeat (refresh every 10 seconds). Cache the results for a few seconds.
- 5.Pre-aggregation - Store pre-computed hourly/daily averages. Query for last month uses pre-computed data, not raw points.
Queries that are slow (and why)
1. Queries without a time range - We have to scan everything. Always require a time range.
2. High-cardinality queries - SELECT * WHERE user_id = X when there are millions of user_ids means scanning many series.
3. Very long time ranges - Querying 1 year of raw data means reading terabytes. Use aggregations instead.
Data Retention and Downsampling
Time-series data grows forever if we let it. We need to automatically delete old data and create summaries of historical data.
What to tell the interviewer
We use a tiered retention policy: raw data for 30 days, 1-minute aggregates for 6 months, 1-hour aggregates for 2 years, 1-day aggregates forever. Deleting old data is fast because we just delete old chunk files - no scanning needed.
What is a Retention Policy?
A retention policy says: delete data older than X days. Example: keep raw data for 30 days, then delete it. This saves storage and keeps the database fast.
Why this is easy for time-series databases:
Remember, our data is organized in chunk files by time:
- data_2024-01-15.tsm (January 15)
- data_2024-01-16.tsm (January 16)
- data_2024-01-17.tsm (January 17)
To delete data older than 30 days, we just delete the old files:
rm data_2024-01-15.tsm   # Done in milliseconds!

No scanning, no slow DELETE queries.
What is Downsampling?
Downsampling means creating lower-resolution summaries of your data.
- Raw data: 3600 points per hour (one per second)
- Downsampled: 60 points per hour (one per minute)
The summary stores: min, max, avg, count for each minute. This is enough for most historical queries.
RAW DATA (1 second resolution):
14:00:00 -> 45.7
14:00:01 -> 47.2
... (3600 points per hour)
DOWNSAMPLED (1 minute): 14:00 -> min, max, avg, count

Tiered retention policy
FUNCTION run_retention_job():
    // This runs every hour
    DELETE chunk files older than the retention period
    DOWNSAMPLE raw chunks that are old enough into minute/hour summaries
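A sketch of what this job might look like, assuming one chunk file per day named like data_2024-01-15.tsm (the naming pattern, 30-day cutoff, and one-minute window are illustrative):

```python
import glob
import os
from datetime import date, timedelta

RETENTION_DAYS = 30

def run_retention_job(data_dir="."):
    cutoff = date.today() - timedelta(days=RETENTION_DAYS)
    for path in glob.glob(os.path.join(data_dir, "data_*.tsm")):
        # Parse the date out of a name like data_2024-01-15.tsm
        day = date.fromisoformat(os.path.basename(path)[len("data_"):-len(".tsm")])
        if day < cutoff:
            os.remove(path)   # whole-file delete: milliseconds, no row scans

def downsample(points, window=60):
    """Collapse (timestamp, value) points into per-window min/max/avg/count summaries."""
    summaries = {}
    for ts, value in points:
        b = summaries.setdefault(ts - ts % window,
                                 {"min": value, "max": value, "sum": 0.0, "count": 0})
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return {t: {"min": b["min"], "max": b["max"],
                "avg": b["sum"] / b["count"], "count": b["count"]}
            for t, b in summaries.items()}

print(downsample([(0, 45.7), (1, 47.2), (61, 50.0)]))
```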
Smart query routing
When a query comes in, we automatically pick the right data tier. Query for last hour? Use raw data. Query for last month? Use minute aggregates. Query for last year? Use hour aggregates. The user gets fast results without knowing about the tiers.
Scaling the System
What to tell the interviewer
We scale by sharding data across multiple servers. We can shard by metric name (all CPU data on server 1, all memory data on server 2) or by time (January data on server 1, February on server 2). Each approach has tradeoffs.
When do we need to scale?
- 1.Write throughput - One server cannot handle 10 million writes per second. Split writes across servers.
- 2.Storage capacity - One server cannot store 100 TB. Split data across servers.
- 3.Query throughput - One server cannot handle 100,000 queries per second. Add more query servers.
How we scale writes:
Option 1: Shard by metric
- Server 1 handles: cpu.*, disk.*
- Server 2 handles: memory.*, network.*
- Pro: Queries for one metric hit one server
- Con: If CPU has 10x more data, Server 1 is overloaded

Option 2: Shard by tag (like host)
- Server 1 handles: hosts starting with a-m
- Server 2 handles: hosts starting with n-z
- Pro: Even distribution if hosts are evenly named
- Con: Queries across all hosts need to hit all servers

Option 3: Shard by time
- Server 1 handles: last 7 days
- Server 2 handles: 7-30 days ago
- Server 3 handles: older than 30 days
- Pro: Recent data (most queries) on fewer, faster servers
- Con: Moving data between servers as it ages
Sharding by metric name
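A sketch of write routing for Option 1-style sharding: hash the metric name to pick a server (the shard count and server names are placeholders; production systems usually use consistent hashing so that adding a shard does not remap everything):

```python
import hashlib

SHARDS = ["tsdb-1", "tsdb-2", "tsdb-3", "tsdb-4"]

def shard_for(metric: str) -> str:
    digest = hashlib.md5(metric.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for metric in ["cpu.usage", "memory.free", "http.requests"]:
    print(metric, "->", shard_for(metric))

# A query for one metric only needs to contact one shard; a query that spans
# many metrics must fan out to every shard and merge the results.
```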
How we scale reads:
- 1.Read replicas - Copy data to multiple servers. Queries can go to any replica.
- 2.Query caching - Cache frequent query results. Dashboard queries hit cache, not database.
- 3.Pre-aggregation - Compute and store common aggregations. Queries read pre-computed results.
| Scale strategy | When to use | Pros | Cons |
|---|---|---|---|
| Single server | Less than 100K writes/sec | Simple, no complexity | Limited by one machine |
| Read replicas | High query load | Scales reads easily | Does not help write throughput |
| Shard by metric | Many different metrics | Queries hit one shard | Uneven data distribution |
| Shard by tag | Many similar tags | Even distribution | Cross-tag queries are slow |
| Shard by time | Very long retention | Old data on cheap storage | Complex data movement |
Replication for durability:
We do not want to lose data if a server dies. We replicate each shard to 2-3 servers:
- Primary: Accepts writes
- Replica 1: Gets a copy of all writes
- Replica 2: Gets a copy of all writes
If primary dies, a replica takes over. We lose zero data (or at most a few seconds if using async replication).
The cross-shard query problem
If you query SELECT * FROM cpu WHERE time > last hour and CPU data is spread across 10 shards, we need to query all 10 shards and merge the results. This is slow. Design your sharding so most queries hit one shard. For monitoring, shard by metric so dashboard queries (showing one metric) are fast.
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through common failures and how we protect against them.
| What breaks | What happens | How we fix it |
|---|---|---|
| Server crashes | In-memory data is lost | Recover from WAL on restart - we lose only a few seconds of data |
| Disk fills up | Cannot write new data | Monitor disk space, alert at 80%, auto-delete oldest data if needed |
Backpressure - What to do when data comes too fast:
If writers send data faster than we can process:
- 1.Buffer in memory - Accept more data into a bigger buffer (use more RAM)
- 2.Slow down writers - Return errors or delays so writers slow down
- 3.Drop low-priority data - If we must drop data, drop debug-level metrics first
The key is to not crash. A slow response is better than no response.
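A sketch of options 2 and 3 above, assuming a bounded in-memory queue (the 90% high-water mark and the rule for dropping debug-priority points are illustrative choices):

```python
from queue import Queue, Full

buffer = Queue(maxsize=1_000_000)   # bounded: memory use cannot grow without limit

def handle_write(point: dict) -> tuple[int, str]:
    # Shed low-priority load before we fall over.
    if buffer.qsize() > 0.9 * buffer.maxsize and point.get("priority") == "debug":
        return 200, "dropped low-priority point"
    try:
        buffer.put_nowait(point)
        return 200, "accepted"
    except Full:
        return 429, "too many requests - retry with backoff"   # push back on writers

print(handle_write({"metric": "cpu.usage", "value": 45.7}))
```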
FUNCTION handle_write(data):
    // Check if we are overloaded
    IF the memory buffer is above its limit: RETURN a "slow down" error
    ELSE: accept the write as normal

Monitoring the time-series database itself:
We eat our own cooking - we use time-series metrics to monitor our time-series database:
- Write rate (points per second)
- Query latency (p50, p95, p99)
- Memory buffer size
- Disk usage
- Compaction backlog
- Replication lag
What interviewers look for
Interviewers want to see that you think about edge cases. Saying things like what if the disk fills up? or what if data arrives out of order? shows you understand real-world systems. Always mention at least 2-3 failure scenarios and how you handle them.
Advanced Topics
If you finish early or the interviewer asks for more depth, here are advanced topics to discuss.
1. Continuous Queries / Streaming Aggregation
Instead of running aggregation queries over and over, we can pre-compute them as data arrives:
- As each CPU point arrives, immediately update the running average
- Store the pre-computed 1-minute averages directly
- Dashboard queries read the pre-computed values - super fast!
This is how real-time dashboards show live data without querying raw data.
2. Approximate Queries
For huge datasets, exact answers are slow. Approximate answers are often good enough:
- Sampling: Query 1% of the data, multiply the result by 100
- Sketches: Data structures like HyperLogLog count unique values approximately
- Pre-aggregation: Use hourly summaries instead of raw data

Example: How many unique users visited in the last month?
- Exact: Scan 30 days of raw data (slow)
- Approximate with HyperLogLog: Read one small sketch (instant)
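A sketch of the sampling idea on synthetic data: scan roughly 1% of the points and scale the count back up.

```python
import random

random.seed(42)
events = list(range(1_000_000))        # pretend: one row per request

SAMPLE_RATE = 0.01
sample = [e for e in events if random.random() < SAMPLE_RATE]
estimate = len(sample) / SAMPLE_RATE   # scale the sampled count back up

print(f"exact={len(events)}, estimate={estimate:.0f}")   # close, at ~1% of the scan cost
```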
3. Multi-tenancy
If multiple customers share the database:
- Isolation: One customer cannot see another customer's data
- Fair resource usage: One customer cannot use all the CPU/memory
- Billing: Track how much each customer uses
We add a tenant_id tag to all data and enforce it in the query layer.
4. Time-Series Specific Query Optimizations
- Last value queries: What is the current CPU? Keep a separate index of latest values.
- Gap filling: If data is missing for some minutes, fill with previous value or interpolate.
- Rate calculations: Compute rate of change (CPU is increasing at 5% per minute) - see the sketch after this list.
- Anomaly detection: Alert if value is 3 standard deviations from normal.
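A sketch of gap filling and rate calculation over a time-sorted series (the 60-second step is an assumption about the data's expected interval):

```python
def fill_gaps(points, step=60):
    """points: time-sorted (timestamp, value); carry the last value forward over missing steps."""
    filled = [points[0]]
    for ts, value in points[1:]:
        while filled[-1][0] + step < ts:
            filled.append((filled[-1][0] + step, filled[-1][1]))   # repeat last known value
        filled.append((ts, value))
    return filled

def rates(points):
    """Rate of change between consecutive points, in value units per second."""
    return [(t2, (v2 - v1) / (t2 - t1)) for (t1, v1), (t2, v2) in zip(points, points[1:])]

series = [(0, 50.0), (60, 53.0), (240, 59.0)]   # a 3-minute gap after t=60
print(fill_gaps(series))   # [(0, 50.0), (60, 53.0), (120, 53.0), (180, 53.0), (240, 59.0)]
print(rates(series))       # [(60, 0.05), (240, 0.0333...)]
```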
How real systems compare
InfluxDB: General purpose, easy to use, good compression. Prometheus: Pull-based model, great for Kubernetes monitoring. TimescaleDB: Built on PostgreSQL, can do SQL joins. Datadog: Fully managed SaaS, expensive but no operations needed. Each has different tradeoffs - there is no single best choice.
Interview Summary
How to conclude the interview
Let me summarize my design. I built a time-series database that can handle 1 million writes per second, query any time range in under 100 milliseconds, and automatically manage storage by deleting old data. The key insights are: batch writes in memory, organize data by time for fast range queries, and use compression to reduce storage 10x.
Key components:
- 1.Write Path: WAL for durability + memory buffer for batching + flush to time-sorted chunks
- 2.Storage Engine: Time-based chunks + series-based organization + compression (Gorilla for values, delta-of-delta for timestamps)
- 3.Query Engine: Chunk pruning by time + series index for direct access + parallel reads + query cache
- 4.Retention: File-based deletion (fast!) + tiered downsampling (raw -> minute -> hour -> day)
- 5.Scaling: Shard by metric name + read replicas + pre-aggregation
Key numbers to remember:
- 1 million writes/second with batching
- 10x compression typical for time-series
- 80% of queries are for the last 24 hours (keep this hot)
- Sub-100ms query latency target
- Chunk size: 1 hour (balance between write efficiency and query granularity)
Common interview mistakes to avoid:
- 1.Using a regular database - explain why MySQL/PostgreSQL is wrong
- 2.Forgetting about compression - it is essential for time-series
- 3.Not mentioning retention - data cannot grow forever
- 4.Over-engineering sharding - a single server handles a lot before you need to shard
- 5.Ignoring cardinality - high-cardinality tags kill performance