Design Walkthrough
Problem Statement
The Question: Design a time-series database like InfluxDB that can handle high write throughput, efficient range queries, and automatic data retention.
What is time-series data? (explain this to the interviewer)
Time-series data is any data that has a timestamp. Think of it like a diary - every entry has a date and time. Examples:
- Server monitoring: CPU usage is 45% at 2:00:00, 47% at 2:00:01, 52% at 2:00:02...
- IoT sensors: Temperature is 72F at 10:00, 73F at 10:01, 71F at 10:02...
- Stock prices: Apple stock is $150.25 at 9:30:00, $150.30 at 9:30:01...
- Website analytics: 1000 visitors at 3pm, 1200 visitors at 4pm...
What the database needs to do (most important first):
- 1.Write data very fast - Accept millions of data points every second. Data arrives constantly and we cannot tell sensors to slow down.
- 2.Query by time range - Show me all CPU readings from 2pm to 3pm. This must be fast even if there are billions of data points.
- 3.Compress data - Time-series data is huge. 1 million points per second = 86 billion points per day. We need to squeeze it down.
- 4.Delete old data automatically - Keep last 30 days, delete everything older. Storage is expensive and old data is rarely needed.
- 5.Aggregate data - Instead of every second, show me the average per minute or per hour. This makes dashboards fast.
- 6.Tag and filter - Show me CPU for server-1 only, or all servers in the us-east region.
What to say first
Let me first understand the use case. Is this for server monitoring, IoT sensors, or financial data? Each has different needs. Monitoring data can lose a few points - not critical. Financial data cannot lose anything. I will also ask about the write rate and how long we need to keep data.
What the interviewer really wants to see:
- Do you understand why regular databases (like MySQL) are bad for time-series?
- Can you design a write path that handles millions of writes per second?
- Do you know how to organize data so time-range queries are fast?
- Can you explain compression techniques for time-series data?
- How do you automatically delete old data without slowing down the system?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How fast is data coming in?
How many data points per second do we need to write? Are there spikes where suddenly 10x more data arrives?
Why ask this: This tells us how powerful our write system needs to be.
What interviewers usually say: 1 million writes per second normally, up to 5 million during spikes (like when many servers restart at once).
How this changes your design: We need a system that can burst to handle spikes. We might buffer writes in memory and batch them to disk.
Question 2: How long do we keep data?
Do we keep data forever or delete it after some time? Do we keep recent data in full detail and older data in summary form?
Why ask this: Keeping data forever would need infinite storage. Most companies keep recent data (last 30 days) in full detail and older data as summaries (hourly averages).
What interviewers usually say: Keep raw data for 30 days, hourly summaries for 1 year, daily summaries forever.
How this changes your design: We need background jobs that create summaries and delete old raw data. We also organize data by time so deleting old data is fast (just delete old files).
Question 3: What queries do users run?
Do users mostly query recent data (last hour) or also historical data (last year)? Do they query one metric or many metrics at once?
Why ask this: If 90% of queries are for the last hour, we should keep recent data in memory for super fast access.
What interviewers usually say: 80% of queries are for last 24 hours. Users often query multiple metrics together (CPU and memory for the same server).
How this changes your design: We keep recent data (last 24 hours) in a fast cache. We store data so that metrics from the same source are close together on disk.
Question 4: Can we lose any data?
If a server crashes, is it okay to lose the last few seconds of data? Or must every single data point be saved?
Why ask this: Durability (never losing data) costs performance. Writing to disk is slow. If we can lose a few seconds of data on crash, we can keep data in memory longer and write faster.
What interviewers usually say: For monitoring, losing a few seconds is okay. For financial data, we cannot lose anything.
How this changes your design: For monitoring: batch writes in memory, flush every few seconds. For financial: write to a log immediately before saying OK.
Summarize your assumptions
Let me summarize: We need to write 1 million points per second with 5x burst capacity. We keep raw data for 30 days, summaries for 1 year. Most queries are for recent data. We can tolerate losing up to 5 seconds of data on crash. I will design for a monitoring use case.
Why Regular Databases Are Bad for This
Say this to the interviewer
Before I design a solution, let me explain why we cannot just use PostgreSQL or MySQL. This shows I understand the problem deeply.
Problem 1: Regular databases update indexes on every write
When you insert a row in MySQL, it updates a B-tree index. This is great for finding any row quickly, but terrible when you are writing millions of rows per second. The index becomes a bottleneck.
Time-series insight: We almost never look up a single data point. We always query ranges (give me all points from 2pm to 3pm). We do not need a fancy index for every row.
Problem 2: Regular databases store rows scattered on disk
In MySQL, row 1 might be on disk block 100, row 2 on block 5000, row 3 on block 200. When you query a time range, the disk head jumps all over the place. This is very slow.
Time-series insight: Data points that are close in time should be close on disk. If we store January data together and February data together, querying January is just one sequential disk read.
Problem 3: Regular databases do not compress well
Time-series data has patterns. If CPU is around 50% for an hour, storing 50.1, 50.2, 50.0, 50.3... is wasteful. We could store 50 and then just the small differences (+0.1, +0.1, -0.2, +0.3).
Regular databases store each value fully. Time-series databases use special compression that understands patterns in the data.
Problem 4: Deleting old data is slow in regular databases
In MySQL, DELETE FROM metrics WHERE timestamp < 30_days_ago would scan millions of rows and take hours. It would also lock the table and slow down writes.
Time-series insight: If we store each day in a separate file, deleting old data means just deleting the file. rm january_15.db takes milliseconds, not hours.
Regular DB vs Time-Series DB
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Data points per second | 1 million (5M peak) | Need to batch writes - cannot hit disk 1M times per second |
| Size of one data point | About 20 bytes | timestamp (8 bytes) + value (8 bytes) + tags (4 bytes) |
What to tell the interviewer
At 1 million writes per second, we absolutely need batching - SSDs can only do about 100K random writes per second. By batching 10,000 points into one write, we turn 1M writes into just 100 disk operations. The key insight is that time-series data arrives in order, so we can buffer it in memory and write in chunks.
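To make the batching argument concrete, you can do the arithmetic out loud. A quick sketch using the numbers above (the 10,000-point batch size comes from the paragraph above; the 10x compression ratio is borrowed from the compression section later):

```python
# Back-of-envelope sizing with the illustrative numbers from the table above.
POINTS_PER_SEC = 1_000_000
BYTES_PER_POINT = 20          # 8B timestamp + 8B value + ~4B tag references
BATCH_SIZE = 10_000           # points grouped into one disk write
COMPRESSION_RATIO = 10        # typical for time-series data (see Compression section)

raw_per_day = POINTS_PER_SEC * 86_400 * BYTES_PER_POINT   # bytes/day before compression
disk_writes_per_sec = POINTS_PER_SEC / BATCH_SIZE          # physical writes after batching

print(f"Ingest rate:     {POINTS_PER_SEC * BYTES_PER_POINT / 1e6:.0f} MB/s")
print(f"Raw volume/day:  {raw_per_day / 1e12:.2f} TB")
print(f"Compressed/day:  {raw_per_day / COMPRESSION_RATIO / 1e9:.0f} GB")
print(f"Disk writes/sec: {disk_writes_per_sec:.0f} (vs. 1,000,000 without batching)")
```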
How people use a time-series database (from most common to least):
- 1.Dashboard queries - Show me CPU for the last hour, refreshed every 10 seconds. These need to be FAST.
- 2.Alert queries - Is CPU above 90% for the last 5 minutes? These run constantly in the background.
- 3.Investigation queries - What happened yesterday at 3pm when the site went down? Scan longer time ranges.
- 4.Batch analytics - What was the average CPU per server last month? These can be slower.
WRITES:
1 million points per second
= 60 million per minute
= 86.4 billion per day

High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will explain each part simply.
What to tell the interviewer
The design has three main parts: (1) A fast write path that buffers data in memory and batches writes to disk, (2) A storage engine that organizes data by time for fast range queries, (3) A query engine that knows how to efficiently scan time ranges and aggregate results.
Time-Series Database - The Big Picture
The Write Path - How data gets in:
- 1.Load Balancer spreads incoming data across multiple servers
- 2.Ingestion Service validates the data (is the timestamp valid? is the metric name okay?)
- 3.Write-Ahead Log (WAL) saves data to a simple log file immediately - this is our crash recovery
- 4.Memory Buffer collects many data points before writing to the main storage
Key insight: We write to WAL (fast sequential write) immediately for safety, but we batch up points in memory before writing to the main storage (slow random writes).
The Storage Engine - How data is organized:
- 1.Time Chunks - Data is split into files by time. All data from 2pm-3pm goes in one file.
- 2.Compaction - Background job that merges small files into bigger ones and compresses them
- 3.Time Index - A small lookup table that says chunk for 2pm-3pm is in file X
Key insight: By organizing data by time, a query for last hour only needs to open 1-2 files, not scan the entire database.
The Query Engine - How we find data fast:
- 1.Query Parser - Understand what the user wants (SELECT cpu WHERE time > 1 hour ago)
- 2.Query Planner - Figure out which time chunks to scan
- 3.Query Executor - Actually read the chunks and compute aggregates (avg, max, min)
- 4.Query Cache - Save recent query results so repeated queries are instant
How real systems do it
InfluxDB uses this exact pattern - WAL for durability, memory buffer for batching, time-structured merge tree (TSM) for storage. Prometheus uses a similar approach but optimizes for pull-based monitoring. TimescaleDB extends PostgreSQL with automatic time partitioning.
Data Model
Now let me explain how we organize the data. Time-series data has a special structure that is different from regular database tables.
What to tell the interviewer
Time-series data has three parts: a timestamp (when), tags (what/where), and values (the actual numbers). We organize data so that points with the same tags are stored together, making queries fast.
The three parts of a data point:
- 1.Timestamp - When was this recorded? (Example: 2024-03-15 14:30:00.123)
- 2.Tags - What is this about? Tags are like labels that describe the data.
  - metric_name: cpu.usage
  - host: server-1
  - region: us-east
  - datacenter: dc-1
- 3.Value - The actual number (Example: 45.7 percent)
Point 1:
  timestamp: 2024-03-15 14:30:00
  tags: {metric: "cpu.usage", host: "server-1", region: "us-east"}
  value: 45.7

What is a Series?
A series is all data points that have the same tags. Think of it as a single line on a graph.
Example: All CPU readings for server-1 in us-east form one series. If you draw a line graph, this would be one line.
Why this matters: We store each series together on disk. When you ask for CPU of server-1, we read just that one series, not the whole database.
| Concept | What it means | Example |
|---|---|---|
| Metric | The name of what we are measuring | cpu.usage, memory.free, http.requests |
| Tag | A label that describes the data | host=server-1, region=us-east |
| Series | All points with same metric + tags | cpu.usage where host=server-1 |
| Point | One measurement at one time | cpu.usage=45.7 at 14:30:00 |
| Field | The actual value (can have multiple) | value=45.7, or min=40, max=50, avg=45 |
How we store this on disk:
We group data in two ways:
- 1.By time - All data from 2pm-3pm goes in one chunk (file)
- 2.By series - Within each chunk, data from the same series is stored together
How data is organized on disk
The Series Cardinality Problem
Cardinality means how many unique series you have. If you have 1000 servers and 100 metrics, that is 100,000 series. This is fine. But if you add a tag like user_id with 1 million users, suddenly you have 100 billion series. This kills performance. Rule: Never use high-cardinality values (like user IDs or request IDs) as tags.
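A small sketch of how a series identifier might be built and why cardinality explodes (the `series_key` format below is an illustration, not InfluxDB's actual encoding):

```python
# A series is identified by its metric name plus its sorted tag set.
def series_key(metric: str, tags: dict[str, str]) -> str:
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}|{tag_part}"

print(series_key("cpu.usage", {"host": "server-1", "region": "us-east"}))
# -> cpu.usage|host=server-1,region=us-east

# Cardinality is multiplicative across tag values.
metrics, hosts = 100, 1_000
print(metrics * hosts)               # 100,000 series - fine
print(metrics * hosts * 1_000_000)   # add a user_id tag with 1M values -> 100 billion series
```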
Write Path Deep Dive
Let me explain step by step how data gets written. This is the most important part because time-series databases live or die by write performance.
What to tell the interviewer
The write path has to balance three things: speed (accept writes fast), durability (do not lose data), and organization (store data so reads are fast). We use a Write-Ahead Log for durability and an in-memory buffer for batching.
Step by step: What happens when data arrives
- 1.Receive and validate - Check the data is correct (valid timestamp, metric name is not empty)
- 2.Write to WAL - Immediately write to a simple log file. This is fast because it is just appending to a file. If we crash, we can recover from this log.
- 3.Add to memory buffer - Put the data point in an in-memory structure organized by series. This is super fast because it is just memory.
- 4.Return OK - Tell the client the write succeeded. At this point, data is in WAL and memory.
- 5.Background: Flush to chunks - When the memory buffer gets big enough (or after X seconds), write all the buffered data to chunk files on disk.
- 6.Background: Delete WAL - After data is safely in chunk files, delete the old WAL entries.
FUNCTION write_point(timestamp, tags, value):
    STEP 1: Validate the data
    STEP 2: Append the point to the write-ahead log
    STEP 3: Add to the memory buffer (grouped by series) and return OK

The Memory Buffer - Why it is critical
Imagine writing 1 million points per second directly to disk files. Each write needs to:
1. Find the right file
2. Seek to the right position
3. Write the data
4. Update the file index
This would be impossibly slow. SSDs can only do about 100K random writes per second.
Instead, we buffer points in memory and write them in batches:
- Collect 1 million points over 1 second
- Organize them by series
- Write each series as one sequential block
- Now we only need ~1000 writes (one per series), not 1 million
Memory buffer batching
What if we crash before flushing?
The WAL saves us. On startup, we replay the WAL to recover any data that was in memory but not yet flushed. The tradeoff: WAL writes add latency, but only a tiny bit (appending to a file is fast). If you want even faster writes and can tolerate losing a few seconds of data, you can skip the WAL entirely.
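A minimal sketch of this write path, assuming JSON-per-line WAL entries and JSON chunk files purely for readability (a real engine would use a binary, compressed format, and the class and file names here are illustrative):

```python
import json
import time
from collections import defaultdict

class WritePath:
    def __init__(self, wal_path="wal.log", flush_threshold=100_000):
        self.wal = open(wal_path, "a")
        self.buffer = defaultdict(list)      # series key -> [(timestamp, value), ...]
        self.buffered_points = 0
        self.flush_threshold = flush_threshold

    def write_point(self, timestamp, tags, value):
        # STEP 1: validate
        if timestamp <= 0 or "metric" not in tags:
            raise ValueError("invalid point")
        # STEP 2: append to the WAL (cheap sequential write; fsync policy sets durability)
        self.wal.write(json.dumps([timestamp, tags, value]) + "\n")
        # STEP 3: add to the in-memory buffer, grouped by series
        series = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        self.buffer[series].append((timestamp, value))
        self.buffered_points += 1
        # STEP 4 happens in the background: flush once the buffer is big enough
        if self.buffered_points >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write per series, points sorted by time.
        chunk_path = f"chunk_{int(time.time())}.json"
        with open(chunk_path, "w") as chunk:
            json.dump({s: sorted(pts) for s, pts in self.buffer.items()}, chunk)
        self.buffer.clear()
        self.buffered_points = 0
        # The WAL entries covering these points can now be dropped.

wp = WritePath(flush_threshold=2)
wp.write_point(1710513000, {"metric": "cpu.usage", "host": "server-1"}, 45.7)
wp.write_point(1710513001, {"metric": "cpu.usage", "host": "server-1"}, 47.2)  # triggers a flush
```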
Handling out-of-order data
Sometimes data arrives late. A sensor might send data from 5 minutes ago. We have options:
- 1.Reject it - Simple but you lose data
- 2.Keep a buffer for late data - Accept data up to X minutes late, merge it in during compaction
- 3.Separate late data - Write late data to a special location, merge later
Most systems use option 2 with a 5-15 minute window for late data.
Storage Engine Deep Dive
Now let me explain how we store data on disk so that time-range queries are fast.
What to tell the interviewer
The storage engine uses a Time-Structured Merge Tree (TSM). Data is organized into chunks by time period. Each chunk contains data sorted by series, then by time. This makes time-range queries fast because we read sequentially, not randomly.
What is a Chunk?
A chunk is a file that contains all data for a time period (like one hour). Inside the chunk:
- 1.Series Index - A list of all series in this chunk and where their data starts
- 2.Data Blocks - The actual timestamps and values, grouped by series
- 3.Compression - Everything is compressed to save space
CHUNK FILE: data_2024-03-15_14-15.tsm
=== HEADER ===
=== DATA BLOCKS (timestamps and values, grouped by series, compressed) ===
=== SERIES INDEX (series -> offset of its data block) ===

Compression - How we shrink data 10x
Time-series data compresses really well because it has patterns:
Timestamp compression (Delta-of-Delta):
- Timestamps are usually evenly spaced (every 1 second)
- Instead of storing: 1000, 1001, 1002, 1003
- Store the first value and the pattern: 1000, then +1, +1, +1
- Even better: store delta-of-deltas. If the interval is always 1, the delta-of-delta is 0, 0, 0
Value compression (Gorilla/XOR):
- Values often do not change much (CPU stays around 50%)
- Instead of storing full 64-bit floats: 50.1, 50.2, 50.15, 50.18
- XOR consecutive values - the result has lots of zeros
- Zeros compress really well
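A toy illustration of both ideas (real encoders such as Gorilla pack these residuals into variable-length bit streams; this sketch only shows why the residuals are tiny):

```python
import struct

# Delta-of-delta for timestamps: regular intervals collapse to (almost all) zeros.
timestamps = [1000, 1001, 1002, 1003, 1005]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]           # [1, 1, 1, 2]
delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]          # [0, 0, 1]
print(deltas, delta_of_deltas)   # mostly zeros -> a couple of bits per point

# XOR for float values: similar values share most of their bit pattern.
def float_bits(x: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", x))[0]

values = [50.0, 50.0, 50.1, 50.1]
for a, b in zip(values, values[1:]):
    print(f"{float_bits(a) ^ float_bits(b):064b}")   # long runs of zeros -> few meaningful bits
```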
| What we compress | Technique | Typical compression |
|---|---|---|
| Timestamps | Delta-of-delta encoding | 8 bytes -> 1-2 bits per point |
| Integer values | Delta + ZigZag + Varint | 8 bytes -> 1-2 bytes |
| Float values | XOR encoding (Gorilla) | 8 bytes -> 1-2 bytes |
| Tag values | Dictionary encoding | "us-east" -> just the number 1 |
| Everything | Block compression (LZ4) | Additional 2-3x compression |
Compaction - Keeping storage efficient
Over time, we accumulate many small chunk files. Reading 100 small files is slower than reading 10 bigger files. Compaction merges small files into bigger ones.
How compaction works:
1. Find small chunk files from the same time period
2. Read all the data from them
3. Merge the data (combine series, sort by time)
4. Compress and write to a new bigger file
5. Delete the old small files
When we compact:
- Level 1: Every 15 minutes, merge the tiny chunks from memory flushes
- Level 2: Every hour, merge the 15-minute chunks
- Level 3: Every day, merge the hourly chunks
Compaction merges small files into big ones
Why not just write big files from the start?
We cannot predict the future. Data arrives continuously. If we waited to collect a full hour of data before writing, we would need huge amounts of memory. Also, queries for very recent data would not see it. The solution is: write small files often, compact them later.
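A minimal sketch of the merge step, assuming each chunk is a dict of series -> time-sorted points (real compaction also re-compresses the blocks and rewrites the series index):

```python
import heapq
from collections import defaultdict

def compact(chunks: list[dict[str, list[tuple[int, float]]]]) -> dict[str, list[tuple[int, float]]]:
    """Merge several small chunks into one larger chunk, keeping points sorted by time."""
    merged = defaultdict(list)
    for series in {s for chunk in chunks for s in chunk}:
        runs = [chunk[series] for chunk in chunks if series in chunk]
        # Each run is already time-sorted, so a k-way merge is enough.
        merged[series] = list(heapq.merge(*runs))
    return dict(merged)

small_a = {"cpu|host=server-1": [(1000, 45.7), (1002, 46.0)]}
small_b = {"cpu|host=server-1": [(1001, 47.2)]}
print(compact([small_a, small_b]))
# {'cpu|host=server-1': [(1000, 45.7), (1001, 47.2), (1002, 46.0)]}
```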
Query Engine Deep Dive
Now let me explain how we find and return data when someone queries the database.
What to tell the interviewer
Queries follow three steps: (1) Parse the query to understand what is needed, (2) Plan which chunks to read (skip chunks outside the time range), (3) Execute by reading only the needed data and computing aggregations. We use caching and parallel reads to make it fast.
Types of queries:
- 1.Raw data query - Give me every CPU reading from 2pm to 3pm. Returns thousands of points. Used for detailed investigation.
- 2.Aggregation query - Give me the AVERAGE CPU per minute from 2pm to 3pm. Returns 60 points (one per minute). Used for dashboards (much faster).
- 3.Filter query - Give me CPU where value > 90 from 2pm to 3pm. Returns only high values. Used for alerting.
-- Raw data: Get all CPU readings for server-1 in the last hour
SELECT value
FROM cpu.usage
WHERE host = 'server-1'
AND time > now() - 1h

-- Aggregation: Average CPU per minute for the last hour
SELECT mean(value)
FROM cpu.usage
WHERE host = 'server-1'
AND time > now() - 1h
GROUP BY time(1m)

-- Filter: Find when CPU was above 90%
SELECT value
FROM cpu.usage
WHERE host = 'server-1'
AND value > 90
AND time > now() - 24h

Step by step: How a query runs
- 1.Parse - Understand the query (what metric, what time range, what aggregation)
- 2.Plan - Figure out which chunks contain data for this time range
  - Query: time > 2pm AND time < 3pm
  - Chunks: We have 14:00-14:30.tsm and 14:30-15:00.tsm
  - Only need to read those 2 files, skip all others
- 3.Filter series - Within each chunk, find only the series we need
  - Query asks for host=server-1
  - Chunk index shows cpu.usage|host=server-1 at offset 1000
  - Jump directly to offset 1000, skip other series
- 4.Read and decompress - Read the data block, decompress timestamps and values
- 5.Apply filters - If query has WHERE value > 90, filter out values <= 90
- 6.Aggregate - If query has GROUP BY time(1m), compute mean/max/min for each minute
- 7.Return results - Send back to the client
Query execution flow
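A sketch of steps 2 and 6 above - chunk pruning followed by a GROUP BY time(1m) mean - using simplified in-memory chunk metadata (a real planner works against the on-disk series index, not Python dicts):

```python
from collections import defaultdict

def prune_chunks(chunks, start, end):
    """Keep only chunks whose [min_time, max_time) overlaps the query range."""
    return [c for c in chunks if c["max_time"] > start and c["min_time"] < end]

def mean_per_minute(points, start, end):
    buckets = defaultdict(list)                    # minute bucket -> values
    for ts, value in points:
        if start <= ts < end:
            buckets[ts - ts % 60].append(value)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

chunks = [
    {"min_time": 0,    "max_time": 3600, "points": [(10, 40.0), (70, 50.0)]},
    {"min_time": 3600, "max_time": 7200, "points": [(3650, 90.0)]},
]
relevant = prune_chunks(chunks, start=0, end=120)      # the second chunk is skipped entirely
points = [p for c in relevant for p in c["points"]]
print(mean_per_minute(points, 0, 120))                 # {0: 40.0, 60: 50.0}
```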
Making queries fast:
- 1.Chunk pruning - Skip chunks outside the time range. If you query last hour, we ignore files from yesterday.
- 2.Series index - Each chunk has an index. We look up the series and jump directly to it. No scanning.
- 3.Parallel reads - Read multiple chunks at the same time using multiple threads.
- 4.Query cache - Dashboard queries repeat (refresh every 10 seconds). Cache the results for a few seconds.
- 5.Pre-aggregation - Store pre-computed hourly/daily averages. Query for last month uses pre-computed data, not raw points.
Queries that are slow (and why)
1. Queries without a time range - We have to scan everything. Always require a time range.
2. High-cardinality queries - SELECT * WHERE user_id = X when there are millions of user_ids means scanning many series.
3. Very long time ranges - Querying 1 year of raw data means reading terabytes. Use aggregations instead.
Data Retention and Downsampling
Time-series data grows forever if we let it. We need to automatically delete old data and create summaries of historical data.
What to tell the interviewer
We use a tiered retention policy: raw data for 30 days, 1-minute aggregates for 6 months, 1-hour aggregates for 2 years, 1-day aggregates forever. Deleting old data is fast because we just delete old chunk files - no scanning needed.
What is a Retention Policy?
A retention policy says: delete data older than X days. Example: keep raw data for 30 days, then delete it. This saves storage and keeps the database fast.
Why this is easy for time-series databases:
Remember, our data is organized in chunk files by time:
- data_2024-01-15.tsm (January 15)
- data_2024-01-16.tsm (January 16)
- data_2024-01-17.tsm (January 17)
To delete data older than 30 days, we just delete the old files:
rm data_2024-01-15.tsm   # Done in milliseconds!

No scanning, no slow DELETE queries.
What is Downsampling?
Downsampling means creating lower-resolution summaries of your data.
- Raw data: 3600 points per hour (one per second)
- Downsampled: 60 points per hour (one per minute)
The summary stores: min, max, avg, count for each minute. This is enough for most historical queries.
RAW DATA (1 second resolution):
14:00:00 -> 45.7
14:00:01 -> 47.2
... (3600 points per hour)
DOWNSAMPLED (1 minute): 14:00 -> min, max, avg, count

Tiered retention policy
FUNCTION run_retention_job():
    // This runs every hour
    DELETE chunk files older than the retention period
    DOWNSAMPLE raw chunks that are old enough into minute/hour summaries
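A sketch of what this job might look like, assuming one chunk file per day named like data_2024-01-15.tsm (the naming pattern, 30-day cutoff, and one-minute window are illustrative):

```python
import glob
import os
from datetime import date, timedelta

RETENTION_DAYS = 30

def run_retention_job(data_dir="."):
    cutoff = date.today() - timedelta(days=RETENTION_DAYS)
    for path in glob.glob(os.path.join(data_dir, "data_*.tsm")):
        # Parse the date out of a name like data_2024-01-15.tsm
        day = date.fromisoformat(os.path.basename(path)[len("data_"):-len(".tsm")])
        if day < cutoff:
            os.remove(path)   # whole-file delete: milliseconds, no row scans

def downsample(points, window=60):
    """Collapse (timestamp, value) points into per-window min/max/avg/count summaries."""
    summaries = {}
    for ts, value in points:
        b = summaries.setdefault(ts - ts % window,
                                 {"min": value, "max": value, "sum": 0.0, "count": 0})
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return {t: {"min": b["min"], "max": b["max"],
                "avg": b["sum"] / b["count"], "count": b["count"]}
            for t, b in summaries.items()}

print(downsample([(0, 45.7), (1, 47.2), (61, 50.0)]))
```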
Smart query routing
When a query comes in, we automatically pick the right data tier. Query for last hour? Use raw data. Query for last month? Use minute aggregates. Query for last year? Use hour aggregates. The user gets fast results without knowing about the tiers.
Scaling the System
What to tell the interviewer
We scale by sharding data across multiple servers. We can shard by metric name (all CPU data on server 1, all memory data on server 2) or by time (January data on server 1, February on server 2). Each approach has tradeoffs.
When do we need to scale?
- 1.Write throughput - One server cannot handle 10 million writes per second. Split writes across servers.
- 2.Storage capacity - One server cannot store 100 TB. Split data across servers.
- 3.Query throughput - One server cannot handle 100,000 queries per second. Add more query servers.
How we scale writes:
Option 1: Shard by metric
- Server 1 handles: cpu.*, disk.*
- Server 2 handles: memory.*, network.*
- Pro: Queries for one metric hit one server
- Con: If CPU has 10x more data, Server 1 is overloaded

Option 2: Shard by tag (like host)
- Server 1 handles: hosts starting with a-m
- Server 2 handles: hosts starting with n-z
- Pro: Even distribution if hosts are evenly named
- Con: Queries across all hosts need to hit all servers

Option 3: Shard by time
- Server 1 handles: last 7 days
- Server 2 handles: 7-30 days ago
- Server 3 handles: older than 30 days
- Pro: Recent data (most queries) on fewer, faster servers
- Con: Moving data between servers as it ages
Sharding by metric name
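A sketch of write routing for Option 1-style sharding: hash the metric name to pick a server (the shard count and server names are placeholders; production systems usually use consistent hashing so that adding a shard does not remap everything):

```python
import hashlib

SHARDS = ["tsdb-1", "tsdb-2", "tsdb-3", "tsdb-4"]

def shard_for(metric: str) -> str:
    digest = hashlib.md5(metric.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for metric in ["cpu.usage", "memory.free", "http.requests"]:
    print(metric, "->", shard_for(metric))

# A query for one metric only needs to contact one shard; a query that spans
# many metrics must fan out to every shard and merge the results.
```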
How we scale reads:
- 1.Read replicas - Copy data to multiple servers. Queries can go to any replica.
- 2.Query caching - Cache frequent query results. Dashboard queries hit cache, not database.
- 3.Pre-aggregation - Compute and store common aggregations. Queries read pre-computed results.
| Scale strategy | When to use | Pros | Cons |
|---|---|---|---|
| Single server | Less than 100K writes/sec | Simple, no complexity | Limited by one machine |
| Read replicas | High query load | Scales reads easily | Does not help write throughput |
| Shard by metric | Many different metrics | Queries hit one shard | Uneven data distribution |
| Shard by tag | Many similar tags | Even distribution | Cross-tag queries are slow |
| Shard by time | Very long retention | Old data on cheap storage | Complex data movement |
Replication for durability:
We do not want to lose data if a server dies. We replicate each shard to 2-3 servers:
- Primary: Accepts writes
- Replica 1: Gets a copy of all writes
- Replica 2: Gets a copy of all writes
If primary dies, a replica takes over. We lose zero data (or at most a few seconds if using async replication).
The cross-shard query problem
If you query SELECT * FROM cpu WHERE time > last hour and CPU data is spread across 10 shards, we need to query all 10 shards and merge the results. This is slow. Design your sharding so most queries hit one shard. For monitoring, shard by metric so dashboard queries (showing one metric) are fast.
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through common failures and how we protect against them.
| What breaks | What happens | How we fix it |
|---|---|---|
| Server crashes | In-memory data is lost | Recover from WAL on restart - we lose only a few seconds of data |
| Disk fills up | Cannot write new data | Monitor disk space, alert at 80%, auto-delete oldest data if needed |
Backpressure - What to do when data comes too fast:
If writers send data faster than we can process:
- 1.Buffer in memory - Accept more data into a bigger buffer (use more RAM)
- 2.Slow down writers - Return errors or delays so writers slow down
- 3.Drop low-priority data - If we must drop data, drop debug-level metrics first
The key is to not crash. A slow response is better than no response.
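A sketch of options 2 and 3 above, assuming a bounded in-memory queue (the 90% high-water mark and the rule for dropping debug-priority points are illustrative choices):

```python
from queue import Queue, Full

buffer = Queue(maxsize=1_000_000)   # bounded: memory use cannot grow without limit

def handle_write(point: dict) -> tuple[int, str]:
    # Shed low-priority load before we fall over.
    if buffer.qsize() > 0.9 * buffer.maxsize and point.get("priority") == "debug":
        return 200, "dropped low-priority point"
    try:
        buffer.put_nowait(point)
        return 200, "accepted"
    except Full:
        return 429, "too many requests - retry with backoff"   # push back on writers

print(handle_write({"metric": "cpu.usage", "value": 45.7}))
```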
FUNCTION handle_write(data):
    // Check if we are overloaded
    IF the memory buffer is above its limit: RETURN a "slow down" error
    ELSE: accept the write as normal

Monitoring the time-series database itself:
We eat our own cooking - we use time-series metrics to monitor our time-series database:
- Write rate (points per second)
- Query latency (p50, p95, p99)
- Memory buffer size
- Disk usage
- Compaction backlog
- Replication lag
What interviewers look for
Interviewers want to see that you think about edge cases. Saying things like what if the disk fills up? or what if data arrives out of order? shows you understand real-world systems. Always mention at least 2-3 failure scenarios and how you handle them.
Advanced Topics
If you finish early or the interviewer asks for more depth, here are advanced topics to discuss.
1. Continuous Queries / Streaming Aggregation
Instead of running aggregation queries over and over, we can pre-compute them as data arrives:
- As each CPU point arrives, immediately update the running average
- Store the pre-computed 1-minute averages directly
- Dashboard queries read the pre-computed values - super fast!
This is how real-time dashboards show live data without querying raw data.
2. Approximate Queries
For huge datasets, exact answers are slow. Approximate answers are often good enough:
- Sampling: Query 1% of the data, multiply the result by 100
- Sketches: Data structures like HyperLogLog count unique values approximately
- Pre-aggregation: Use hourly summaries instead of raw data

Example: How many unique users visited in the last month?
- Exact: Scan 30 days of raw data (slow)
- Approximate with HyperLogLog: Read one small sketch (instant)
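A sketch of the sampling idea on synthetic data: scan roughly 1% of the points and scale the count back up.

```python
import random

random.seed(42)
events = list(range(1_000_000))        # pretend: one row per request

SAMPLE_RATE = 0.01
sample = [e for e in events if random.random() < SAMPLE_RATE]
estimate = len(sample) / SAMPLE_RATE   # scale the sampled count back up

print(f"exact={len(events)}, estimate={estimate:.0f}")   # close, at ~1% of the scan cost
```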
3. Multi-tenancy
If multiple customers share the database:
- Isolation: One customer cannot see another customer's data
- Fair resource usage: One customer cannot use all the CPU/memory
- Billing: Track how much each customer uses
We add a tenant_id tag to all data and enforce it in the query layer.
4. Time-Series Specific Query Optimizations
- Last value queries: What is the current CPU? Keep a separate index of latest values.
- Gap filling: If data is missing for some minutes, fill with previous value or interpolate.
- Rate calculations: Compute rate of change (CPU is increasing at 5% per minute) - see the sketch after this list.
- Anomaly detection: Alert if value is 3 standard deviations from normal.
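A sketch of gap filling and rate calculation over a time-sorted series (the 60-second step is an assumption about the data's expected interval):

```python
def fill_gaps(points, step=60):
    """points: time-sorted (timestamp, value); carry the last value forward over missing steps."""
    filled = [points[0]]
    for ts, value in points[1:]:
        while filled[-1][0] + step < ts:
            filled.append((filled[-1][0] + step, filled[-1][1]))   # repeat last known value
        filled.append((ts, value))
    return filled

def rates(points):
    """Rate of change between consecutive points, in value units per second."""
    return [(t2, (v2 - v1) / (t2 - t1)) for (t1, v1), (t2, v2) in zip(points, points[1:])]

series = [(0, 50.0), (60, 53.0), (240, 59.0)]   # a 3-minute gap after t=60
print(fill_gaps(series))   # [(0, 50.0), (60, 53.0), (120, 53.0), (180, 53.0), (240, 59.0)]
print(rates(series))       # [(60, 0.05), (240, 0.0333...)]
```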
How real systems compare
InfluxDB: General purpose, easy to use, good compression. Prometheus: Pull-based model, great for Kubernetes monitoring. TimescaleDB: Built on PostgreSQL, can do SQL joins. Datadog: Fully managed SaaS, expensive but no operations needed. Each has different tradeoffs - there is no single best choice.
Interview Summary
How to conclude the interview
Let me summarize my design. I built a time-series database that can handle 1 million writes per second, query any time range in under 100 milliseconds, and automatically manage storage by deleting old data. The key insights are: batch writes in memory, organize data by time for fast range queries, and use compression to reduce storage 10x.
Key components:
- 1.Write Path: WAL for durability + memory buffer for batching + flush to time-sorted chunks
- 2.Storage Engine: Time-based chunks + series-based organization + compression (Gorilla for values, delta-of-delta for timestamps)
- 3.Query Engine: Chunk pruning by time + series index for direct access + parallel reads + query cache
- 4.Retention: File-based deletion (fast!) + tiered downsampling (raw -> minute -> hour -> day)
- 5.Scaling: Shard by metric name + read replicas + pre-aggregation
Key numbers to remember:
- 1 million writes/second with batching
- 10x compression typical for time-series
- 80% of queries are for the last 24 hours (keep this hot)
- Sub-100ms query latency target
- Chunk size: 1 hour (balance between write efficiency and query granularity)
Common interview mistakes to avoid:
- 1.Using a regular database - explain why MySQL/PostgreSQL is wrong
- 2.Forgetting about compression - it is essential for time-series
- 3.Not mentioning retention - data cannot grow forever
- 4.Over-engineering sharding - a single server handles a lot before you need to shard
- 5.Ignoring cardinality - high-cardinality tags kill performance