Design Walkthrough
Problem Statement
The Question: Design a system to collect performance metrics (CPU, memory, disk, network, custom app metrics) from thousands of servers and enable real-time monitoring dashboards.
Metrics collection is essential for:

- Operational visibility: Know when systems are unhealthy
- Alerting: Trigger alerts when metrics exceed thresholds
- Debugging: Correlate metrics with incidents
- Capacity planning: Understand resource utilization trends
What to say first
Before I design, let me understand the scale, retention requirements, and query patterns. Metrics systems have very specific access patterns that drive architecture decisions.
Hidden requirements interviewers are testing:

- Do you understand time-series data characteristics?
- Can you handle high cardinality (many unique metric series)?
- Do you know about downsampling and retention policies?
- Can you design for both real-time dashboards and historical analysis?
Clarifying Questions
These questions reveal you understand the unique challenges of metrics systems.
Question 1: Scale
How many servers? How many metrics per server? What is the collection interval?
Why this matters: Determines write throughput.

Typical answer: 10,000 servers, 1,000 metrics each, collected every 10 seconds

Calculation: 10,000 x 1,000 / 10 = 1 million data points per second
Question 2: Retention
How long do we need to store raw data? Is downsampling acceptable for older data?
Why this matters: Determines storage strategy.

Typical answer: Raw data for 15 days, 1-minute rollups for 6 months, hourly for 2 years

Architecture impact: Need tiered storage with automatic downsampling
Question 3: Query Patterns
What are the common query patterns? Real-time dashboards? Historical analysis? Alerting?
Why this matters: Different queries need different optimizations.

Typical answer:
- Real-time: Last 5 minutes, specific host, refresh every 10s
- Dashboard: Last 24 hours, aggregated across service
- Investigation: Arbitrary time range, filter by multiple tags
Question 4: Labels/Tags
How many dimensions/labels per metric? (e.g., host, service, region, endpoint)
Why this matters: High-cardinality labels are the #1 scaling challenge.

Typical answer: 5-10 labels per metric

Warning: Labels like user_id or request_id cause cardinality explosion
Stating assumptions
I will assume: 10K servers, 1M metrics/sec ingestion, 15-day raw retention with downsampling, labels limited to low-cardinality dimensions (host, service, region).
The Hard Part
Say this out loud
The hard part here is managing cardinality - the number of unique time series. With 10K hosts x 100 services x 50 metrics x 5 regions, the worst-case cross product is 250 million unique series. Each series needs separate storage and indexing.
Why cardinality is the killer:
1. Storage: Each unique series needs its own data structure
2. Memory: Index for fast lookups must fit in memory
3. Query: Aggregating across millions of series is slow
4. Cost: Storage grows linearly with cardinality
Safe labels:
http_requests{service="api", region="us-east", status="200"}
Cardinality: 10 services x 5 regions x 5 status codes = 250 series
Dangerous labels:
http_requests{service="api", user_id="12345", endpoint="/users/12345"}
Cardinality: 10 services x 1M users x 10K endpoints = 100 BILLION series
This is why Prometheus documentation warns against high-cardinality labels!

Common mistake
Candidates often focus on write throughput without addressing cardinality. Always ask about label dimensions - this is what breaks metrics systems in production.
Other hard problems:
- Write amplification: Each data point may update multiple indexes
- Compression: Time-series data compresses well, but decompression adds query latency
- Out-of-order data: Network delays cause metrics to arrive late
- Query fan-out: A dashboard query may touch thousands of series
Scale & Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Servers | 10,000 | Each runs metrics agent |
| Metrics per server | 1,000 | System + application metrics |
| Collection interval | 10 seconds | ~1 million data points/sec ingested |
| Unique series | ~10 million | Label index must fit in memory |
What to say
At 1M data points per second and 10M unique series, we need a purpose-built time-series database. General-purpose databases cannot handle this write pattern efficiently.
Access Pattern Analysis:
Writes:
- Append-only (never update historical data)
- Bursty (metrics arrive in batches from agents)
- Recent data is hot, old data is cold

Reads:
- Time-range queries (always filter by time)
- Aggregations (avg, sum, max over time)
- Label filtering (service=api AND region=us-east)
- Recent data queried frequently, old data rarely
Raw data:
1M points/sec x 20 bytes x 86,400 sec/day = 1.7 TB/day
15 days retention = 25.5 TB raw data

High-Level Architecture
The architecture separates write path (optimized for throughput) from read path (optimized for query latency).
What to say
I will separate the write path from the read path. Writes go through a buffer for batching, reads hit a query layer with caching. We scale writes by partitioning on metric name hash.
Metrics Collection Architecture
Component Responsibilities:
1. Metrics Agents (on each server)
- Collect system metrics (CPU, memory, disk, network)
- Collect application metrics (custom counters, gauges, histograms)
- Batch and compress before sending
- Buffer locally if ingestion is unavailable

2. Ingestion Layer
- Validate and parse incoming metrics
- Add a server-side timestamp if missing
- Route to the appropriate Kafka partition by metric name hash (see the routing sketch after this list)
- Stateless, horizontally scalable

3. Kafka Buffer
- Decouples ingestion from storage
- Handles traffic spikes
- Enables replay if storage falls behind
- Partitioned by metric name for parallel processing

4. Storage Writers
- Consume from Kafka
- Batch writes to the time-series DB
- Handle out-of-order data
- Trigger downsampling jobs

5. Time-Series Database
- Optimized for time-series workloads
- Compressed columnar storage
- Inverted index for label queries
- Options: InfluxDB, TimescaleDB, Prometheus, ClickHouse

6. Query Engine
- Parse PromQL/InfluxQL/SQL queries
- Fan out to relevant storage partitions
- Aggregate results
- Cache frequent queries
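To make the routing step in the ingestion layer concrete, here is a minimal sketch of partitioning by metric name hash (the partition count and function name are assumptions, not part of the design above):

import hashlib

NUM_PARTITIONS = 64  # assumed Kafka partition count

def partition_for(metric_name: str) -> int:
    # Hash the metric name so every sample of one metric lands on the same
    # partition, and therefore on the same storage writer, in order.
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

Hashing only the metric name keeps each metric's samples ordered on a single partition; if one metric dominates traffic, hashing the full series key (name plus labels) spreads the load more evenly.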
Data Model & Storage
Time-series data has unique characteristics that require specialized storage.
What to say
The data model is: metric name + labels (dimensions) + timestamp + value. Storage is optimized for append-only writes and time-range queries.
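To make this model concrete, here is a minimal in-memory representation (the class and field names are illustrative, not taken from any particular TSDB):

from dataclasses import dataclass
from typing import Dict

@dataclass
class Sample:
    metric: str             # e.g. "cpu_usage"
    labels: Dict[str, str]  # e.g. {"host": "server1", "service": "api"}
    timestamp: int          # unix epoch seconds
    value: float

    def series_key(self) -> str:
        # Metric name + sorted labels uniquely identifies one time series;
        # every distinct key is one more unit of cardinality.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.metric}{{{pairs}}}"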
Metric format (Prometheus-style):
cpu_usage{host="server1", service="api", region="us-east"} 0.85 1703001600

Storage Layout (LSM-tree based):
Time-Series Storage Layout
Timestamps (delta-of-delta encoding):
Raw: 1703001600, 1703001610, 1703001620, 1703001630
Delta: 1703001600, +10, +10, +10
Delta-of-delta: 1703001600, +10, 0, 0 (regular scrape intervals collapse to runs of zeros, which bit-pack to roughly one bit per sample)

Inverted Index for Label Queries:
Query: cpu_usage{service="api", region="us-east"}
Inverted Index:
cpu_usage        -> [series IDs for that metric]
service="api"    -> [series IDs with that label]
region="us-east" -> [series IDs with that label]
Matching series  =  intersection of the posting lists

Real-world reference
Prometheus uses exactly this model: 2-hour blocks with compressed chunks, inverted index for labels, and head block in memory for recent data. InfluxDB and VictoriaMetrics use similar approaches.
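To illustrate the delta-of-delta idea from the storage layout above, here is a toy encoder (real implementations such as Gorilla-style chunks then bit-pack the output; that step is omitted here):

from typing import List

def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    # Regularly spaced scrapes produce long runs of zeros, which is why
    # timestamps compress down to roughly a bit per sample.
    out = [timestamps[0]]               # first timestamp stored raw
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # store only the change in the delta
        prev_delta = delta
    return out

# [1703001600, 1703001610, 1703001620, 1703001630] -> [1703001600, 10, 0, 0]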
Write Path Deep Dive
The write path must handle 1M+ data points per second while ensuring durability.
Write Path Flow
from kafka import KafkaProducer  # kafka-python client; config values below are assumptions

class MetricsIngestionService:
    def __init__(self):
        self.kafka_producer = KafkaProducer(
            batch_size=65536, compression_type="snappy", acks="all"
        )

Handling Out-of-Order Data:
Network delays and clock skew cause metrics to arrive out of order.
class TimeSeriesWriter:
    def __init__(self):
        self.ooo_window = 3600  # Accept data up to 1 hour late (seconds)

Important detail
Out-of-order handling is expensive. Prometheus historically rejected OOO data entirely. Newer versions (2.39+) support it but with performance overhead. Design agents to minimize clock skew.
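A hedged sketch of the acceptance decision the storage writer makes for each incoming sample (the window matches the snippet above; the function name and thresholds are assumptions):

import time

OOO_WINDOW_SECONDS = 3600  # same 1-hour window as TimeSeriesWriter above

def classify_sample(sample_ts: float, now: float) -> str:
    if sample_ts > now + 60:
        return "reject"   # far-future timestamp, likely agent clock skew
    if now - sample_ts <= OOO_WINDOW_SECONDS:
        return "accept"   # goes into the in-memory head block
    return "drop"         # too old; would force rewriting sealed, compressed blocks

# Example: a sample arriving 10 minutes late is still accepted
assert classify_sample(time.time() - 600, time.time()) == "accept"

Bounding the window is what keeps out-of-order support cheap: anything older would require reopening blocks that have already been compressed and indexed.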
Query Path Deep Dive
Queries must be fast even when touching millions of data points.
Dashboard query: Average CPU by service over last hour
avg by (service) (avg_over_time(cpu_usage[1h]))

Query Execution
Query Optimizations:
| Optimization | How It Works | Impact |
|---|---|---|
| Query caching | Cache results of identical queries for short TTL | 10x faster for dashboard refreshes |
| Pre-aggregation | Store pre-computed rollups (1m, 5m, 1h) | 100x faster for long time ranges |
| Chunk caching | Cache decompressed chunks in memory | Avoid repeated decompression |
| Query pushdown | Push filters to storage layer | Reduce data transfer |
| Parallel execution | Fan out to multiple storage nodes | Linear speedup with nodes |
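To make the pre-aggregation row concrete, here is a sketch of how a query planner might pick which resolution to read based on the requested range (the thresholds are assumptions):

def choose_resolution(range_seconds: int) -> str:
    # Read the coarsest rollup that still gives enough points to draw;
    # a 30-day chart needs ~720 hourly points instead of ~260,000 raw 10s points.
    if range_seconds <= 6 * 3600:      # up to 6 hours: raw 10-second data
        return "raw_10s"
    if range_seconds <= 7 * 86400:     # up to a week: 1-minute rollups
        return "rollup_1m"
    if range_seconds <= 90 * 86400:    # up to ~3 months: 5-minute rollups
        return "rollup_5m"
    return "rollup_1h"                 # longer ranges: hourly rollups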
from cachetools import LRUCache  # assumed cache implementation

class QueryEngine:
    def __init__(self):
        self.query_cache = LRUCache(maxsize=10000)  # cache results of repeated dashboard queries

Consistency & Invariants
System Invariants
Never lose metrics data that was acknowledged. Delayed visibility is acceptable (eventual consistency), but data loss is not.
Durability Guarantees:
1. Write-ahead log (WAL): Every write is logged before acknowledgment (see the sketch below)
2. Kafka retention: Buffer holds data until confirmed written to the TSDB
3. Replication: Storage has replicas for durability
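A minimal sketch of guarantee 1, write-ahead logging (the file format and names are assumptions): the batch is flushed to disk before the sender gets an acknowledgment.

import json
import os

class WriteAheadLog:
    def __init__(self, path: str = "metrics.wal"):
        self.f = open(path, "ab")  # append-only log file

    def append(self, batch: list) -> None:
        # Persist the batch durably; only after fsync is it safe to ack the agent.
        self.f.write((json.dumps(batch) + "\n").encode("utf-8"))
        self.f.flush()
        os.fsync(self.f.fileno())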
Consistency Model:
| Scenario | Consistency | Why Acceptable |
|---|---|---|
| Recent data query | Read-your-writes | Agent sees its own metrics immediately |
| Dashboard query | Eventual (~seconds) | Slight delay in dashboards is fine |
| Cross-region query | Eventual (~minutes) | Regional lag acceptable for global view |
| Alerting | At-least-once | Better to alert twice than miss alert |
What to say
For metrics, we prioritize durability and availability over strong consistency. A dashboard showing data 5 seconds stale is fine. Missing a critical metric that caused an outage is not.
Handling Duplicates:
At-least-once delivery means duplicates are possible. Metrics are idempotent by design:
Duplicate handling:
Same timestamp, same value -> No problem, just overwrites

Failure Modes & Resilience
Proactively discuss failures
Metrics systems must be highly available - if monitoring is down during an outage, you are blind. Let me walk through failure scenarios.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Ingestion node down | Some metrics delayed | Load balancer routes to healthy nodes | Stateless ingestion, instant failover |
| Kafka broker down | Reduced throughput | Kafka replication (RF=3) | Other brokers have copies |
Agent Resilience:
The metrics agent is critical - it runs on every server.
class MetricsAgent:
    def __init__(self):
        # DiskBackedBuffer is an assumed helper that spills batches to local
        # disk so metrics survive network outages and agent restarts
        self.buffer = DiskBackedBuffer(max_size_mb=100)

Real-world reference
Datadog agent buffers up to 100MB locally, automatically retries with exponential backoff, and can survive network outages for hours. This is critical for edge deployments with unreliable connectivity.
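A hedged sketch of that retry behaviour (backoff parameters are assumptions, and the disk-backed buffer is simplified to an in-memory queue here):

import random
import time
from collections import deque

class BufferedSender:
    def __init__(self, send_fn, max_batches: int = 1000):
        self.send_fn = send_fn                   # e.g. HTTP POST to the ingestion layer
        self.buffer = deque(maxlen=max_batches)  # oldest batches dropped when full

    def enqueue(self, batch) -> None:
        self.buffer.append(batch)

    def flush(self, max_attempts: int = 8) -> None:
        backoff, attempts = 1.0, 0
        while self.buffer and attempts < max_attempts:
            try:
                self.send_fn(self.buffer[0])
                self.buffer.popleft()            # drop only after a successful send
                backoff, attempts = 1.0, 0
            except Exception:
                attempts += 1
                # exponential backoff with jitter so agents don't retry in lockstep
                time.sleep(backoff + random.random())
                backoff = min(backoff * 2, 300)  # cap at 5 minutes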
Cardinality Protection:
from collections import defaultdict

class CardinalityLimiter:
    def __init__(self, max_series_per_metric=10000):
        self.max_series_per_metric = max_series_per_metric  # new series beyond this are rejected
        self.series_counts = defaultdict(set)  # metric -> set of series_ids

Evolution & Scaling
What to say
This design handles 1M metrics/sec with 10M series. For 10x scale, we need to shard the time-series database and add query federation. For 100x, we need a different storage tier for cold data.
Evolution Path:
Stage 1: Single TSDB (up to 1M metrics/sec)
- Single Prometheus/InfluxDB instance per region
- Vertical scaling (bigger machines)
- Works for most companies

Stage 2: Sharded TSDB (up to 10M metrics/sec)
- Shard by metric name hash
- Query layer routes to the correct shard
- Each shard is an independent TSDB

Stage 3: Tiered Storage (up to 100M+ metrics/sec)
- Hot: Recent data in fast TSDB
- Warm: Older data in cheaper storage (compressed)
- Cold: Archived in object storage (S3/GCS)
- Query engine federates across tiers (see the sketch after this list)
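As a sketch of the Stage 3 federation step (tier boundaries are assumptions matching the retention targets above), the query engine splits a requested time range across tiers and merges the results:

from typing import List, Tuple

HOT_WINDOW = 15 * 86400    # last 15 days of raw data in the fast TSDB
WARM_WINDOW = 180 * 86400  # 15 days to ~6 months in cheaper compressed storage

def split_by_tier(start: int, end: int, now: int) -> List[Tuple[str, int, int]]:
    # Returns (tier, sub_start, sub_end) pieces; one sub-query is issued per tier.
    pieces = []
    hot_start, warm_start = now - HOT_WINDOW, now - WARM_WINDOW
    if end > hot_start:
        pieces.append(("hot", max(start, hot_start), end))
    if start < hot_start and end > warm_start:
        pieces.append(("warm", max(start, warm_start), min(end, hot_start)))
    if start < warm_start:
        pieces.append(("cold", start, min(end, warm_start)))
    return pieces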
Tiered Storage Architecture
Downsampling Strategy:
def downsample_to_1min(raw_data: List[DataPoint]) -> List[RollupPoint]:
    """Convert 10-second raw data to 1-minute rollups. DataPoint/RollupPoint are
    assumed record types: (timestamp, value) and (ts, min, max, avg, count)."""
    buckets = {}  # minute-aligned timestamp -> list of raw values
    for point in raw_data:
        buckets.setdefault(point.timestamp - point.timestamp % 60, []).append(point.value)
    return [RollupPoint(ts, min(v), max(v), sum(v) / len(v), len(v)) for ts, v in sorted(buckets.items())]

Alternative approach
For extreme scale (billions of series), consider VictoriaMetrics or Thanos - they are purpose-built for massive Prometheus-compatible deployments. ClickHouse is excellent if you need SQL and can tolerate its complexity.
What I would do differently for...
Startup (simple): Single Prometheus with Grafana. Add Thanos sidecar when you outgrow it.
Enterprise (complex queries): InfluxDB or TimescaleDB for SQL support and better ad-hoc queries.
Extreme scale (100M+ series): VictoriaMetrics or custom solution with ClickHouse backend.
Multi-tenant SaaS: Add tenant isolation at ingestion, storage, and query layers. Consider per-tenant cardinality limits.