System Design Masterclass
Infrastructure | metrics | monitoring | time-series | aggregation | observability | advanced

Design Server Metrics Collection System

Collect performance metrics from thousands of servers in real-time

Millions of metrics/sec | Similar to Datadog, New Relic, Prometheus, Grafana, Splunk | 45 min read

Summary

Design a system to collect, store, and query performance metrics from thousands of servers. The core challenges are handling high write throughput (millions of data points per second), storing time-series data efficiently, and enabling fast queries across multiple dimensions. This question is asked at Datadog, New Relic, and any company building observability infrastructure.

Key Takeaways

Core Problem

This is fundamentally a time-series data problem: high write throughput, append-only, time-based queries, with heavy compression opportunities.
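To make the compression opportunity concrete, here is a minimal sketch of delta-of-delta encoding for timestamps, one well-known technique for time-series data. It is an illustration under the assumption of integer timestamps scraped at a roughly fixed interval, not a description of any particular product's storage format. The function names are hypothetical.

```python
def delta_of_delta_encode(timestamps: list[int]) -> list[int]:
    """Encode timestamps as: first value, then the change in delta between points.

    With a fixed scrape interval, almost every encoded value after the
    second one is 0, which a bit-level encoder can store very cheaply.
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded: list[int]) -> list[int]:
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    decoded = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        decoded.append(decoded[-1] + delta)
    return decoded


# Points scraped every 10 s, with the last one arriving 1 s late.
raw = [1000, 1010, 1020, 1030, 1041]
print(delta_of_delta_encode(raw))  # [1000, 10, 0, 0, 1]
```

The small, mostly-zero output is what makes append-only, regularly sampled data compress so well compared with storing full timestamps.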

The Hard Part

Handling millions of unique metric series while keeping query latency low. Cardinality explosion from labels/tags is the main risk: every additional label value multiplies the number of series the system must index and store.
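A rough back-of-the-envelope sketch shows how quickly cardinality grows. The worst-case number of series for one metric name is the product of the distinct values of each label. The label names and counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case unique series for one metric name.

    Assumes every combination of label values can occur, so this is an
    upper bound; correlated labels produce fewer series in practice.
    """
    return prod(label_cardinalities.values())


# Hypothetical labels on a single CPU metric.
base = {"host": 5_000, "cpu_core": 32}
print(worst_case_series(base))  # 160000

# Adding one more label with 50 values multiplies the total by 50.
with_endpoint = {**base, "endpoint": 50}
print(worst_case_series(with_endpoint))  # 8000000
```

This is why unbounded label values (user IDs, request IDs) are usually the first thing to rule out when clarifying requirements.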

Scaling Axis

Shard writes by a hash of the metric name; partition storage by time range (hot/warm/cold tiers).
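The two scaling axes can be sketched as a pair of small routing functions. This is a minimal illustration, not a prescribed design: the function names, the shard count, and the tier age thresholds (24 hours and 30 days) are all assumptions made for the example.

```python
import hashlib

# Hypothetical tier boundaries; real values depend on retention requirements.
HOT_MAX_AGE_S = 24 * 3600          # last 24 hours
WARM_MAX_AGE_S = 30 * 24 * 3600    # last 30 days


def shard_for(metric_name: str, num_shards: int) -> int:
    """Pick a write shard from a stable hash of the metric name.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process, so it would route inconsistently.
    """
    digest = hashlib.sha256(metric_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def tier_for(age_seconds: float) -> str:
    """Pick a storage tier from the age of the data."""
    if age_seconds < HOT_MAX_AGE_S:
        return "hot"
    if age_seconds < WARM_MAX_AGE_S:
        return "warm"
    return "cold"


print(shard_for("cpu.user", 16) == shard_for("cpu.user", 16))  # True
print(tier_for(3600), tier_for(7 * 86400), tier_for(90 * 86400))  # hot warm cold
```

Hashing on the metric name alone keeps all series of one metric on the same shard, which simplifies queries but can create hot shards for very popular metrics; that trade-off is worth raising during the design discussion.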

The Question: Design a system to collect performance metrics (CPU, memory, disk, network, custom app metrics) from thousands of servers and enable real-time monitoring dashboards.

Metrics collection is essential for:
- Operational visibility: Know when systems are unhealthy
- Alerting: Trigger alerts when metrics exceed thresholds
- Debugging: Correlate metrics with incidents
- Capacity planning: Understand resource utilization trends

What to say first

Before I design, let me understand the scale, retention requirements, and query patterns. Metrics systems have very specific access patterns that drive architecture decisions.

Hidden requirements interviewers are testing:
- Do you understand time-series data characteristics?
- Can you handle high cardinality (many unique metric series)?
- Do you know about downsampling and retention policies?
- Can you design for both real-time dashboards and historical analysis?
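Downsampling, mentioned above, can be sketched as rolling raw points into fixed-width time buckets and keeping a few aggregates per bucket. The function name, the `(timestamp, value)` tuple input, and the choice of min/max/sum/count are assumptions for illustration; a real system would pick aggregates to match its query needs.

```python
def downsample(points, bucket_seconds: int) -> dict:
    """Roll raw (unix_ts, value) points into fixed-width buckets.

    Returns {bucket_start_ts: {"min", "max", "sum", "count"}}. Storing
    sum and count (rather than the average) lets buckets be merged again
    later into coarser buckets without losing accuracy.
    """
    buckets = {}
    for ts, value in points:
        start = int(ts) // bucket_seconds * bucket_seconds
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = {"min": value, "max": value, "sum": value, "count": 1}
        else:
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
            agg["sum"] += value
            agg["count"] += 1
    return buckets


# Three raw points rolled up into 60-second buckets.
raw = [(0, 1.0), (10, 3.0), (60, 5.0)]
rolled = downsample(raw, 60)
print(rolled[0])   # {'min': 1.0, 'max': 3.0, 'sum': 4.0, 'count': 2}
print(rolled[60])  # {'min': 5.0, 'max': 5.0, 'sum': 5.0, 'count': 1}
```

A retention policy then amounts to keeping raw points for a short window and only the rolled-up buckets for older data, which serves real-time dashboards and historical analysis from different resolutions.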
