Design Walkthrough
Problem Statement
The Question: Design a system to collect performance metrics (CPU, memory, disk, network, custom app metrics) from thousands of servers and enable real-time monitoring dashboards.
Metrics collection is essential for:

- Operational visibility: Know when systems are unhealthy
- Alerting: Trigger alerts when metrics exceed thresholds
- Debugging: Correlate metrics with incidents
- Capacity planning: Understand resource utilization trends
What to say first
Before I design, let me understand the scale, retention requirements, and query patterns. Metrics systems have very specific access patterns that drive architecture decisions.
Hidden requirements interviewers are testing:

- Do you understand time-series data characteristics?
- Can you handle high cardinality (many unique metric series)?
- Do you know about downsampling and retention policies?
- Can you design for both real-time dashboards and historical analysis?
Clarifying Questions
These questions reveal you understand the unique challenges of metrics systems.
Question 1: Scale
How many servers? How many metrics per server? What is the collection interval?
Why this matters: Determines write throughput.

Typical answer: 10,000 servers, 1,000 metrics each, collected every 10 seconds

Calculation: 10,000 x 1,000 / 10 = 1 million data points per second
Question 2: Retention
How long do we need to store raw data? Is downsampling acceptable for older data?
Why this matters: Determines storage strategy.

Typical answer: Raw data for 15 days, 1-minute rollups for 6 months, hourly for 2 years

Architecture impact: Need tiered storage with automatic downsampling
Question 3: Query Patterns
What are the common query patterns? Real-time dashboards? Historical analysis? Alerting?
Why this matters: Different queries need different optimizations.

Typical answer:
- Real-time: Last 5 minutes, specific host, refresh every 10s
- Dashboard: Last 24 hours, aggregated across service
- Investigation: Arbitrary time range, filter by multiple tags
Question 4: Labels/Tags
How many dimensions/labels per metric? (e.g., host, service, region, endpoint)
Why this matters: High-cardinality labels are the #1 scaling challenge.

Typical answer: 5-10 labels per metric

Warning: Labels like user_id or request_id cause cardinality explosion
Stating assumptions
I will assume: 10K servers, 1M metrics/sec ingestion, 15-day raw retention with downsampling, labels limited to low-cardinality dimensions (host, service, region).
The Hard Part
Say this out loud
The hard part here is managing cardinality - the number of unique time series. With 10K hosts x 100 services x 50 metrics x 5 regions, the worst-case cross product is 250 million unique series. Each series needs separate storage and indexing.
Why cardinality is the killer:
1. Storage: Each unique series needs its own data structure
2. Memory: Index for fast lookups must fit in memory
3. Query: Aggregating across millions of series is slow
4. Cost: Storage grows linearly with cardinality
Safe labels:
http_requests{service="api", region="us-east", status="200"}
Cardinality: 10 services x 5 regions x 5 status codes = 250 series
Dangerous labels:
http_requests{service="api", user_id="12345", endpoint="/users/12345"}
Cardinality: 10 services x 1M users x 10K endpoints = 100 BILLION series
This is why Prometheus documentation warns against high-cardinality labels!

Common mistake
Candidates often focus on write throughput without addressing cardinality. Always ask about label dimensions - this is what breaks metrics systems in production.
Other hard problems:
- Write amplification: Each data point may update multiple indexes
- Compression: Time-series data compresses well, but decompression adds query latency
- Out-of-order data: Network delays cause metrics to arrive late
- Query fan-out: A dashboard query may touch thousands of series
Scale & Access Patterns
Let me quantify the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Servers | 10,000 | Each runs metrics agent |
| Metrics per server | 1,000 | System + application metrics |
| Collection interval | 10 seconds | ~1 million data points/sec ingested |
| Unique series | ~10 million | Label index must fit in memory |
What to say
At 1M data points per second and 10M unique series, we need a purpose-built time-series database. General-purpose databases cannot handle this write pattern efficiently.
Access Pattern Analysis:
Writes:
- Append-only (never update historical data)
- Bursty (metrics arrive in batches from agents)
- Recent data is hot, old data is cold

Reads:
- Time-range queries (always filter by time)
- Aggregations (avg, sum, max over time)
- Label filtering (service=api AND region=us-east)
- Recent data queried frequently, old data rarely
Raw data:
1M points/sec x 20 bytes x 86,400 sec/day = 1.7 TB/day
15 days retention = 25.5 TB raw data

High-Level Architecture
The architecture separates write path (optimized for throughput) from read path (optimized for query latency).
What to say
I will separate the write path from the read path. Writes go through a buffer for batching, reads hit a query layer with caching. We scale writes by partitioning on metric name hash.
Metrics Collection Architecture
Component Responsibilities:
1. Metrics Agents (on each server)
- Collect system metrics (CPU, memory, disk, network)
- Collect application metrics (custom counters, gauges, histograms)
- Batch and compress before sending
- Buffer locally if ingestion is unavailable

2. Ingestion Layer
- Validate and parse incoming metrics
- Add a server-side timestamp if missing
- Route to the appropriate Kafka partition by metric name hash (see the routing sketch after this list)
- Stateless, horizontally scalable

3. Kafka Buffer
- Decouples ingestion from storage
- Handles traffic spikes
- Enables replay if storage falls behind
- Partitioned by metric name for parallel processing

4. Storage Writers
- Consume from Kafka
- Batch writes to the time-series DB
- Handle out-of-order data
- Trigger downsampling jobs

5. Time-Series Database
- Optimized for time-series workloads
- Compressed columnar storage
- Inverted index for label queries
- Options: InfluxDB, TimescaleDB, Prometheus, ClickHouse

6. Query Engine
- Parse PromQL/InfluxQL/SQL queries
- Fan out to relevant storage partitions
- Aggregate results
- Cache frequent queries
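To make the routing step in the ingestion layer concrete, here is a minimal sketch of partitioning by metric name hash (the partition count and function name are assumptions, not part of the design above):

import hashlib

NUM_PARTITIONS = 64  # assumed Kafka partition count

def partition_for(metric_name: str) -> int:
    # Hash the metric name so every sample of one metric lands on the same
    # partition, and therefore on the same storage writer, in order.
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

Hashing only the metric name keeps each metric's samples ordered on a single partition; if one metric dominates traffic, hashing the full series key (name plus labels) spreads the load more evenly.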
Data Model & Storage
Time-series data has unique characteristics that require specialized storage.
What to say
The data model is: metric name + labels (dimensions) + timestamp + value. Storage is optimized for append-only writes and time-range queries.
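To make this model concrete, here is a minimal in-memory representation (the class and field names are illustrative, not taken from any particular TSDB):

from dataclasses import dataclass
from typing import Dict

@dataclass
class Sample:
    metric: str             # e.g. "cpu_usage"
    labels: Dict[str, str]  # e.g. {"host": "server1", "service": "api"}
    timestamp: int          # unix epoch seconds
    value: float

    def series_key(self) -> str:
        # Metric name + sorted labels uniquely identifies one time series;
        # every distinct key is one more unit of cardinality.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.metric}{{{pairs}}}"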
Metric format (Prometheus-style):
cpu_usage{host="server1", service="api", region="us-east"} 0.85 1703001600

Storage Layout (LSM-tree based):
Time-Series Storage Layout
Timestamps (delta-of-delta encoding):
Raw: 1703001600, 1703001610, 1703001620, 1703001630
Delta: 1703001600, +10, +10, +10
Delta-of-delta: 1703001600, +10, 0, 0 (regular scrape intervals collapse to runs of zeros, which bit-pack to roughly one bit per sample)

Inverted Index for Label Queries:
Query: cpu_usage{service="api", region="us-east"}
Inverted Index:
cpu_usage        -> [series IDs for that metric]
service="api"    -> [series IDs with that label]
region="us-east" -> [series IDs with that label]
Matching series  =  intersection of the posting lists

Real-world reference
Prometheus uses exactly this model: 2-hour blocks with compressed chunks, inverted index for labels, and head block in memory for recent data. InfluxDB and VictoriaMetrics use similar approaches.
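To illustrate the delta-of-delta idea from the storage layout above, here is a toy encoder (real implementations such as Gorilla-style chunks then bit-pack the output; that step is omitted here):

from typing import List

def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    # Regularly spaced scrapes produce long runs of zeros, which is why
    # timestamps compress down to roughly a bit per sample.
    out = [timestamps[0]]               # first timestamp stored raw
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # store only the change in the delta
        prev_delta = delta
    return out

# [1703001600, 1703001610, 1703001620, 1703001630] -> [1703001600, 10, 0, 0]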
Write Path Deep Dive
The write path must handle 1M+ data points per second while ensuring durability.
Write Path Flow
from kafka import KafkaProducer  # kafka-python client; config values below are assumptions

class MetricsIngestionService:
    def __init__(self):
        self.kafka_producer = KafkaProducer(
            batch_size=65536, compression_type="snappy", acks="all"
        )

Handling Out-of-Order Data:
Network delays and clock skew cause metrics to arrive out of order.
class TimeSeriesWriter:
    def __init__(self):
        self.ooo_window = 3600  # Accept data up to 1 hour late (seconds)

Important detail
Out-of-order handling is expensive. Prometheus historically rejected OOO data entirely. Newer versions (2.39+) support it but with performance overhead. Design agents to minimize clock skew.
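A hedged sketch of the acceptance decision the storage writer makes for each incoming sample (the window matches the snippet above; the function name and thresholds are assumptions):

import time

OOO_WINDOW_SECONDS = 3600  # same 1-hour window as TimeSeriesWriter above

def classify_sample(sample_ts: float, now: float) -> str:
    if sample_ts > now + 60:
        return "reject"   # far-future timestamp, likely agent clock skew
    if now - sample_ts <= OOO_WINDOW_SECONDS:
        return "accept"   # goes into the in-memory head block
    return "drop"         # too old; would force rewriting sealed, compressed blocks

# Example: a sample arriving 10 minutes late is still accepted
assert classify_sample(time.time() - 600, time.time()) == "accept"

Bounding the window is what keeps out-of-order support cheap: anything older would require reopening blocks that have already been compressed and indexed.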
Query Path Deep Dive
Queries must be fast even when touching millions of data points.
Dashboard query: Average CPU by service over last hour
avg by (service) (avg_over_time(cpu_usage[1h]))

Query Execution
Query Optimizations:
| Optimization | How It Works | Impact |
|---|---|---|
| Query caching | Cache results of identical queries for short TTL | 10x faster for dashboard refreshes |
| Pre-aggregation | Store pre-computed rollups (1m, 5m, 1h) | 100x faster for long time ranges |
| Chunk caching | Cache decompressed chunks in memory | Avoid repeated decompression |
| Query pushdown | Push filters to storage layer | Reduce data transfer |
| Parallel execution | Fan out to multiple storage nodes | Linear speedup with nodes |
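To make the pre-aggregation row concrete, here is a sketch of how a query planner might pick which resolution to read based on the requested range (the thresholds are assumptions):

def choose_resolution(range_seconds: int) -> str:
    # Read the coarsest rollup that still gives enough points to draw;
    # a 30-day chart needs ~720 hourly points instead of ~260,000 raw 10s points.
    if range_seconds <= 6 * 3600:      # up to 6 hours: raw 10-second data
        return "raw_10s"
    if range_seconds <= 7 * 86400:     # up to a week: 1-minute rollups
        return "rollup_1m"
    if range_seconds <= 90 * 86400:    # up to ~3 months: 5-minute rollups
        return "rollup_5m"
    return "rollup_1h"                 # longer ranges: hourly rollups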
from cachetools import LRUCache  # assumed cache implementation

class QueryEngine:
    def __init__(self):
        self.query_cache = LRUCache(maxsize=10000)  # cache results of repeated dashboard queries

Consistency & Invariants
System Invariants
Never lose metrics data that was acknowledged. Delayed visibility is acceptable (eventual consistency), but data loss is not.
Durability Guarantees:
1. Write-ahead log (WAL): Every write is logged before acknowledgment (see the sketch below)
2. Kafka retention: Buffer holds data until confirmed written to the TSDB
3. Replication: Storage has replicas for durability
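A minimal sketch of guarantee 1, write-ahead logging (the file format and names are assumptions): the batch is flushed to disk before the sender gets an acknowledgment.

import json
import os

class WriteAheadLog:
    def __init__(self, path: str = "metrics.wal"):
        self.f = open(path, "ab")  # append-only log file

    def append(self, batch: list) -> None:
        # Persist the batch durably; only after fsync is it safe to ack the agent.
        self.f.write((json.dumps(batch) + "\n").encode("utf-8"))
        self.f.flush()
        os.fsync(self.f.fileno())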
Consistency Model:
| Scenario | Consistency | Why Acceptable |
|---|---|---|
| Recent data query | Read-your-writes | Agent sees its own metrics immediately |
| Dashboard query | Eventual (~seconds) | Slight delay in dashboards is fine |
| Cross-region query | Eventual (~minutes) | Regional lag acceptable for global view |
| Alerting | At-least-once | Better to alert twice than miss alert |
What to say
For metrics, we prioritize durability and availability over strong consistency. A dashboard showing data 5 seconds stale is fine. Missing a critical metric that caused an outage is not.
Handling Duplicates:
At-least-once delivery means duplicates are possible. Metrics are idempotent by design:
Duplicate handling:
Same timestamp, same value -> No problem, just overwrites

Failure Modes & Resilience
Proactively discuss failures
Metrics systems must be highly available - if monitoring is down during an outage, you are blind. Let me walk through failure scenarios.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Ingestion node down | Some metrics delayed | Load balancer routes to healthy nodes | Stateless ingestion, instant failover |
| Kafka broker down | Reduced throughput | Kafka replication (RF=3) | Other brokers have copies |
Agent Resilience:
The metrics agent is critical - it runs on every server.
class MetricsAgent:
    def __init__(self):
        # DiskBackedBuffer is an assumed helper that spills batches to local
        # disk so metrics survive network outages and agent restarts
        self.buffer = DiskBackedBuffer(max_size_mb=100)

Real-world reference
Datadog agent buffers up to 100MB locally, automatically retries with exponential backoff, and can survive network outages for hours. This is critical for edge deployments with unreliable connectivity.
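A hedged sketch of that retry behaviour (backoff parameters are assumptions, and the disk-backed buffer is simplified to an in-memory queue here):

import random
import time
from collections import deque

class BufferedSender:
    def __init__(self, send_fn, max_batches: int = 1000):
        self.send_fn = send_fn                   # e.g. HTTP POST to the ingestion layer
        self.buffer = deque(maxlen=max_batches)  # oldest batches dropped when full

    def enqueue(self, batch) -> None:
        self.buffer.append(batch)

    def flush(self, max_attempts: int = 8) -> None:
        backoff, attempts = 1.0, 0
        while self.buffer and attempts < max_attempts:
            try:
                self.send_fn(self.buffer[0])
                self.buffer.popleft()            # drop only after a successful send
                backoff, attempts = 1.0, 0
            except Exception:
                attempts += 1
                # exponential backoff with jitter so agents don't retry in lockstep
                time.sleep(backoff + random.random())
                backoff = min(backoff * 2, 300)  # cap at 5 minutes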
Cardinality Protection:
from collections import defaultdict

class CardinalityLimiter:
    def __init__(self, max_series_per_metric=10000):
        self.max_series_per_metric = max_series_per_metric  # new series beyond this are rejected
        self.series_counts = defaultdict(set)  # metric -> set of series_ids

Evolution & Scaling
What to say
This design handles 1M metrics/sec with 10M series. For 10x scale, we need to shard the time-series database and add query federation. For 100x, we need a different storage tier for cold data.
Evolution Path:
Stage 1: Single TSDB (up to 1M metrics/sec)
- Single Prometheus/InfluxDB instance per region
- Vertical scaling (bigger machines)
- Works for most companies

Stage 2: Sharded TSDB (up to 10M metrics/sec)
- Shard by metric name hash
- Query layer routes to the correct shard
- Each shard is an independent TSDB

Stage 3: Tiered Storage (up to 100M+ metrics/sec)
- Hot: Recent data in fast TSDB
- Warm: Older data in cheaper storage (compressed)
- Cold: Archived in object storage (S3/GCS)
- Query engine federates across tiers (see the sketch after this list)
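As a sketch of the Stage 3 federation step (tier boundaries are assumptions matching the retention targets above), the query engine splits a requested time range across tiers and merges the results:

from typing import List, Tuple

HOT_WINDOW = 15 * 86400    # last 15 days of raw data in the fast TSDB
WARM_WINDOW = 180 * 86400  # 15 days to ~6 months in cheaper compressed storage

def split_by_tier(start: int, end: int, now: int) -> List[Tuple[str, int, int]]:
    # Returns (tier, sub_start, sub_end) pieces; one sub-query is issued per tier.
    pieces = []
    hot_start, warm_start = now - HOT_WINDOW, now - WARM_WINDOW
    if end > hot_start:
        pieces.append(("hot", max(start, hot_start), end))
    if start < hot_start and end > warm_start:
        pieces.append(("warm", max(start, warm_start), min(end, hot_start)))
    if start < warm_start:
        pieces.append(("cold", start, min(end, warm_start)))
    return pieces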
Tiered Storage Architecture
Downsampling Strategy:
def downsample_to_1min(raw_data: List[DataPoint]) -> List[RollupPoint]:
    """Convert 10-second raw data to 1-minute rollups. DataPoint/RollupPoint are
    assumed record types: (timestamp, value) and (ts, min, max, avg, count)."""
    buckets = {}  # minute-aligned timestamp -> list of raw values
    for point in raw_data:
        buckets.setdefault(point.timestamp - point.timestamp % 60, []).append(point.value)
    return [RollupPoint(ts, min(v), max(v), sum(v) / len(v), len(v)) for ts, v in sorted(buckets.items())]

Alternative approach
For extreme scale (billions of series), consider VictoriaMetrics or Thanos - they are purpose-built for massive Prometheus-compatible deployments. ClickHouse is excellent if you need SQL and can tolerate its complexity.
What I would do differently for...
Startup (simple): Single Prometheus with Grafana. Add Thanos sidecar when you outgrow it.
Enterprise (complex queries): InfluxDB or TimescaleDB for SQL support and better ad-hoc queries.
Extreme scale (100M+ series): VictoriaMetrics or custom solution with ClickHouse backend.
Multi-tenant SaaS: Add tenant isolation at ingestion, storage, and query layers. Consider per-tenant cardinality limits.