Design Walkthrough
Problem Statement
The Question: Design a distributed tracing system that can track requests across hundreds of microservices, handling billions of spans per day.
Distributed tracing is essential for:
- Debugging production issues: find where requests fail in a chain of 20 services
- Performance optimization: identify slow services and dependencies
- Understanding system behavior: visualize request flow and dependencies
- SLA monitoring: track end-to-end latency across service boundaries
What to say first
Before I design the system, let me clarify what we mean by tracing. A trace represents a single request journey through the system. Each service creates a span - a unit of work with timing information. I will design the system to collect, store, and query these spans.
Hidden requirements interviewers are testing:
- Do you understand the tracing data model (traces, spans, context)?
- Can you design efficient context propagation?
- How do you handle sampling at scale?
- Can you design a storage system for time-series span data?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many services do we have? What is the request volume? How many spans per request on average?
Why this matters: Determines storage requirements and sampling strategy.
Typical answer: 500 services, 100K RPS, average 20 spans per request = 2M spans/second.
Architecture impact: Need aggressive sampling; cannot store everything.
Question 2: Query Patterns
How will engineers query traces? By trace ID? By service? By error type? Time range?
Why this matters: Determines indexing strategy and storage schema.
Typical answer: Lookup by trace ID is most common; we also need to find traces by service, operation, and error.
Architecture impact: Need secondary indexes, possibly separate stores for different access patterns.
Question 3: Retention
How long do we need to keep traces? Same retention for all traces or tiered?
Why this matters: Determines storage costs and architecture.
Typical answer: 7 days for normal traces, 30 days for errors, 90 days for aggregated metrics.
Architecture impact: Need tiered storage with different retention policies.
Question 4: Transport Types
What protocols do services use? HTTP, gRPC, message queues, databases?
Why this matters: Each transport needs different context propagation.
Typical answer: Mix of HTTP, gRPC, Kafka, and database calls.
Architecture impact: Need pluggable context propagation for each transport.
Stating assumptions
Based on this, I will assume: 500 services, 100K RPS, 20 spans per request (2M spans/sec), 1% sampling rate (20K spans/sec stored), 7-day retention, mixed transports. Let me estimate storage: 20K spans/sec x 1KB per span x 86,400 sec/day x 7 days = about 12TB.
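A quick sanity check of that arithmetic (the 1KB average span size is the assumption doing the heavy lifting):

```python
spans_per_sec = 100_000 * 20                           # 2M spans/sec before sampling
stored_per_sec = spans_per_sec * 0.01                  # 1% head sampling -> 20K spans/sec
bytes_per_week = stored_per_sec * 1_000 * 86_400 * 7   # ~1KB per span, 7-day retention
print(f"{bytes_per_week / 1e12:.1f} TB")               # -> 12.1 TB
```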
The Hard Part
Say this out loud
The hard part here is propagating trace context across all service boundaries with minimal overhead, while making consistent sampling decisions so we never get partial traces.
Why this is genuinely hard:
1. Context Propagation: Every outgoing call (HTTP, gRPC, queue publish) must include trace context. Every incoming call must extract it. This requires instrumenting every framework and library.
2. Consistent Sampling: If Service A decides to sample a trace, Services B, C, and D must also sample it. Independent sampling decisions create partial traces that are useless.
3. Clock Skew: Spans from different machines have different clocks. A child span might appear to start before its parent. Need to handle this gracefully.
4. Async Boundaries: Message queues break the synchronous call chain. How do you connect a Kafka consumer span to its producer span?
The Sampling Consistency Problem
Common mistake
Candidates often propose each service deciding independently whether to sample. This creates partial traces where you see Service A and C but not B - completely useless for debugging.
The fundamental data model:
- Trace: A tree of spans representing one request journey (identified by trace_id)
- Span: A unit of work in one service (identified by span_id)
- Parent-Child: Spans link to their parent via parent_span_id
- Context: The trace_id, span_id, and sampling decision that propagates across service boundaries
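As a concrete sketch, the same model in Python (a minimal illustration of these definitions; field names beyond the identifiers above are assumptions, not a particular SDK's schema):

```python
from dataclasses import dataclass, field

@dataclass
class SpanContext:
    trace_id: str        # 128-bit hex; shared by every span in the trace
    span_id: str         # 64-bit hex; identifies this span
    sampled: bool        # the sampling decision travels with the context

@dataclass
class Span:
    context: SpanContext
    parent_span_id: str              # empty string for the root span
    service: str                     # which service did the work
    operation: str                   # e.g. "GET /checkout"
    start_time_us: int               # microseconds since epoch
    duration_us: int
    tags: dict = field(default_factory=dict)   # annotations, e.g. {"error": True}
```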
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Services | 500 | Need wide instrumentation coverage |
| Request Rate | 100,000 RPS | 2M spans/second before sampling (20 spans/request) |
| Sampling Rate | 1% baseline | ~20K spans/second stored |
| Retention | 7 days | ~12TB of span storage |
What to say
At 2M spans per second, we cannot store everything. With 1% sampling, we get 20K spans/second which is manageable. The key is making sampling decisions intelligently - we want 100% of errors and slow requests even if we sample 1% of normal traffic.
Access Pattern Analysis:
- Write-heavy: Spans are written once and read rarely (most traces are never viewed)
- Point queries: Most queries are by trace_id (find all spans for one trace)
- Search queries: Find traces by service, operation, error, or latency range
- Time-bounded: Queries almost always include a time range
- Append-only: Spans are immutable once written
Span ingestion:
- 100K RPS x 20 spans = 2M spans/second
- After 1% sampling: 20K spans/second

High-Level Architecture
Let me design the system in layers: instrumentation, collection, processing, storage, and query.
What to say
I will structure this as a pipeline: instrumentation generates spans, collectors aggregate them, processors sample and enrich, storage persists, and query services expose APIs. Each layer can scale independently.
Distributed Tracing Architecture
Component Responsibilities:
1. Tracing SDK: Embedded in each service
   - Creates spans for incoming/outgoing calls
   - Propagates context via headers
   - Exports spans to the local agent
2. Local Agent: Runs on each host (see the sketch after this list)
   - Batches spans from multiple processes
   - Performs head-based sampling
   - Forwards to collectors
3. Collector Cluster: Receives spans from all agents
   - Validates and enriches spans
   - Routes to Kafka for processing
   - Horizontally scalable
4. Stream Processor: Consumes from Kafka
   - Assembles complete traces
   - Performs tail-based sampling
   - Computes service dependencies
5. Storage Tiers: Hot, warm, cold
   - Hot: Last 24 hours, fast queries
   - Warm: 7 days, slightly slower
   - Cold: 90 days, compressed, batch access
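A minimal sketch of the local agent's batch-and-forward loop; the queue bound, batch size, flush interval, and the injected `send_to_collector` callable are all illustrative assumptions:

```python
import queue
import time

class LocalAgent:
    def __init__(self, send_to_collector, max_batch=500, flush_interval=1.0):
        self.spans = queue.Queue(maxsize=10_000)  # bounded: drop rather than grow
        self.send = send_to_collector             # network call, injected for testability
        self.max_batch = max_batch
        self.flush_interval = flush_interval

    def submit(self, span) -> bool:
        # Head-based sampling could be applied here, before spans ever leave the host.
        try:
            self.spans.put_nowait(span)
            return True
        except queue.Full:
            return False                          # shed load; never block the caller

    def run(self):
        # Drain up to max_batch spans each flush_interval and forward to a collector.
        while True:
            batch, deadline = [], time.time() + self.flush_interval
            while len(batch) < self.max_batch and time.time() < deadline:
                try:
                    batch.append(self.spans.get(timeout=0.05))
                except queue.Empty:
                    continue
            if batch:
                self.send(batch)
```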
Real-world reference
Jaeger uses this architecture. Zipkin is similar but simpler. Datadog and Lightstep add ML-based analysis on top. Google Dapper pioneered many of these concepts.
Data Model and Storage
The span is the fundamental unit. Let me define the data model and storage strategy.
What to say
The data model follows the OpenTelemetry standard: traces contain spans, spans have context, timing, and annotations. Storage is optimized for trace ID lookups with secondary indexes for search.
Span {
    // Identity
    trace_id: string[32]        // 128-bit, globally unique
    span_id: string[16]         // 64-bit, unique within the trace
    parent_span_id: string[16]  // empty for the root span
}

Storage Schema Design:
We need to optimize for two access patterns:
1. Get trace by ID: Return all spans for a trace
2. Search traces: Find traces matching criteria
-- Primary table: lookup by trace_id (illustrative schema)
CREATE TABLE traces (
    trace_id text,
    span_id  text,
    span     blob,                     -- serialized span payload
    PRIMARY KEY (trace_id, span_id)
);

Why Cassandra?
- Write-optimized: LSM tree handles high write throughput
- Time-series friendly: Good for append-only, time-bounded data
- Scalable: Linear scaling with nodes
- TTL support: Automatic data expiration
Alternative: ClickHouse for analytics
CREATE TABLE spans (
    trace_id   FixedString(32),
    span_id    FixedString(16),
    service    LowCardinality(String),   -- illustrative analytics columns
    start_time DateTime64(6)
) ENGINE = MergeTree() ORDER BY (service, start_time);

Context Propagation Deep Dive
Context propagation is the mechanism that links spans across service boundaries. Without it, we have isolated spans, not traces.
Critical component
Context propagation is where most tracing implementations fail in practice. If even one service in the chain does not propagate context, the trace is broken.
HTTP Headers for context propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

The W3C traceparent header packs version, trace_id, parent span_id, and flags into one value; the trailing 01 flag marks the trace as sampled.

Context Propagation Flow
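A minimal sketch of extracting that context on the receiving side (error handling omitted; the return shape is an illustration, not a standard API):

```python
def parse_traceparent(header: str):
    # W3C format: version-trace_id-parent_span_id-flags
    version, trace_id, parent_span_id, flags = header.split("-")
    sampled = (int(flags, 16) & 0x01) == 1   # low bit carries the sampling decision
    return trace_id, parent_span_id, sampled

# parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
# -> ("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True)
```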
Propagation for different transports:
| Transport | Propagation Method | Implementation |
|---|---|---|
| HTTP | Headers (traceparent, tracestate) | Interceptor on HTTP client/server |
| gRPC | Metadata | Interceptor pattern |
| Kafka | Message headers | Inject on produce, extract on consume |
| Database | Typically none (leaf span) | Client-side span wraps the query |
class TracingHTTPClient:
    # Outbound interceptor; the tracer interface shown is illustrative, not a specific SDK's
    def __init__(self, tracer, base_client):
        self.tracer = tracer
        self.base_client = base_client

    def get(self, url, headers=None):
        headers = dict(headers or {})
        with self.tracer.start_span("http.client") as span:
            headers["traceparent"] = span.to_traceparent()   # inject context
            return self.base_client.get(url, headers=headers)

class TracingMiddleware:
    # Inbound interceptor; extracts context before handling the request
    def __init__(self, tracer, app):
        self.tracer = tracer
        self.app = app

    def __call__(self, request):
        parent = self.tracer.extract(request.headers.get("traceparent"))  # None if absent
        with self.tracer.start_span("http.server", parent=parent):
            return self.app(request)

Sampling Strategies
At 2M spans/second, storing everything is impractical. Sampling is essential, but must be done carefully to preserve trace integrity.
System Invariant
All spans in a trace must have the same sampling decision. If we sample a trace, we must sample ALL its spans. Partial traces are useless.
Two types of sampling:
| Type | When Decided | Pros | Cons |
|---|---|---|---|
| Head-based | At trace start (first span) | Simple, low overhead | Cannot consider outcome (errors, latency) |
| Tail-based | After trace complete | Can prioritize errors and slow traces | Must buffer all spans temporarily |
Head vs Tail Sampling
class HeadBasedSampler:
    def __init__(self, base_rate=0.01):
        self.base_rate = base_rate  # 1% default

    def should_sample(self, trace_id: str) -> bool:
        # Decide from the trace_id itself (assumed 128-bit hex), so every
        # service reaches the same answer without coordination.
        return int(trace_id[:8], 16) < self.base_rate * 0xFFFFFFFF

class TailBasedSampler:
    def __init__(self):
        self.buffer = {}  # trace_id -> list of spans

    def add_span(self, span):
        self.buffer.setdefault(span.trace_id, []).append(span)

    def should_keep(self, trace_id: str) -> bool:
        # Keep traces containing errors or slow spans; the span fields and the
        # 1-second threshold here are illustrative.
        spans = self.buffer.get(trace_id, [])
        return any(s.error for s in spans) or any(s.duration_us > 1_000_000 for s in spans)

Best practice
Use head-based sampling for volume control, plus tail-based sampling for ensuring important traces (errors, slow requests) are always captured. This hybrid approach gives you the best of both worlds.
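A small sketch of that hybrid policy, assuming the sampler shapes above; the decision runs in the stream processor once a trace has been assembled:

```python
def keep_trace(trace_id, head, tail):
    # The deterministic head decision gives a predictable 1% baseline volume...
    if head.should_sample(trace_id):
        return True
    # ...and the tail check rescues error and slow traces the baseline would drop.
    return tail.should_keep(trace_id)
```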
Consistency and Correctness
System Invariants
1. Sampling must be consistent: all spans in a trace are sampled, or none are.
2. Parent-child relationships must be preserved.
3. Timestamps must be monotonic within a span (start before end).
Clock Skew Handling:
Different machines have different clocks. A child span might appear to start before its parent due to clock skew.
Machine A clock: 10:00:00.000 (accurate)
Machine B clock: 10:00:00.050 (50ms ahead)
Solutions for clock skew:
1. NTP synchronization: Keep clocks within a few milliseconds
2. Adjust on display: UI adjusts child span to fit within parent (see the sketch below)
3. Use logical clocks: Vector clocks for ordering (complex)
4. Record both: Store raw and adjusted timestamps
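A minimal sketch of option 2, clamping a child span into its parent's window at display time (one possible adjustment, not the only one):

```python
def clamp_to_parent(child_start_us, child_end_us, parent_start_us, parent_end_us):
    # Shift, then trim, the child so it renders inside its parent despite clock skew.
    duration = child_end_us - child_start_us
    start = max(child_start_us, parent_start_us)   # never start before the parent
    end = min(start + duration, parent_end_us)     # never end after the parent
    return start, end
```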
Trace Completeness:
How do we know when a trace is complete? We may never know for certain - an async process might add spans later.
import time

def is_trace_likely_complete(spans: list, timeout_seconds: int = 60) -> bool:
    if not spans:
        return False
    # Heuristic: the trace started long enough ago that stragglers are unlikely;
    # a short grace period absorbs late-arriving async spans.
    oldest_start = min(s.start_time for s in spans)   # epoch seconds; field name illustrative
    grace_seconds = 5
    return time.time() - oldest_start > timeout_seconds + grace_seconds

Business impact
If we store incomplete traces, engineers get confused by missing spans. If we wait too long, memory usage explodes. The 60-second timeout with 5-second grace period balances these concerns for most request-response workloads.
Failure Modes and Resilience
Proactively discuss failures
Tracing is observability infrastructure - it must not impact application reliability. Let me walk through failure modes.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Collector down | Spans lost, no traces | Local agent buffers, multiple collectors | Agents buffer 1-2 minutes, retry to healthy collectors |
| Kafka backlog | Delayed trace availability | Increase partitions, consumer parallelism | Kafka handles backpressure, traces appear later |
Critical Design Principle: Tracing must fail open. An issue with tracing should never impact the application being traced.
import queue

class DefensiveTracer:
    def __init__(self, exporter, buffer_size=10000):
        self.exporter = exporter
        self.buffer = queue.Queue(maxsize=buffer_size)  # bounded buffer caps memory use

Non-negotiable
The SDK must have timeouts on all I/O operations (100ms max for export), bounded buffers (to prevent memory exhaustion), and exception handling that never propagates to application code.
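A hedged sketch of those guardrails on the export path of the `DefensiveTracer` above (the `exporter.export` signature and batch size are assumptions):

```python
def safe_export(exporter, buffer, max_batch=500, timeout_s=0.1):
    # Drain the bounded buffer and export with a hard 100ms I/O cap; never raise.
    batch = []
    try:
        while not buffer.empty() and len(batch) < max_batch:
            batch.append(buffer.get_nowait())
        if batch:
            exporter.export(batch, timeout=timeout_s)
    except Exception:
        pass   # a tracing failure must never propagate into application code
```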
Evolution and Scaling
What to say
This design handles up to 100K RPS with 500 services. Let me discuss how it evolves for 10x scale and additional capabilities like distributed profiling.
Evolution Path:
Stage 1: Basic Tracing (up to 50K RPS)
- Single collector cluster
- Head-based sampling only
- Cassandra for storage
- Basic UI for trace lookup

Stage 2: Production Tracing (up to 500K RPS)
- Distributed collectors per region
- Kafka for durability and buffering
- Tail-based sampling for errors/slow requests
- Secondary indexes for search
- Service dependency graphs

Stage 3: Advanced Observability Platform
- Continuous profiling linked to traces
- Anomaly detection on trace patterns
- Auto-instrumentation for common frameworks
- Trace-to-logs and trace-to-metrics correlation
Multi-Region Architecture
Scaling Strategies:
| Component | Scaling Approach | Limit |
|---|---|---|
| Collectors | Add more instances, partition by trace_id hash | Linear with traffic |
| Kafka | Add partitions, partition by trace_id | Millions of spans/sec |
| Storage | Add nodes, time-based partitioning | Petabytes |
| Query | Read replicas, caching, query routing | Thousands of queries/sec |
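A small sketch of the trace_id-keyed partitioning used by the collector and Kafka rows above: every span of a trace lands in the same partition, so trace assembly in the stream processor stays local (the hash choice here is illustrative):

```python
import hashlib

def partition_for_trace(trace_id: str, num_partitions: int) -> int:
    # Stable, process-independent hash: all producers agree on the partition.
    digest = hashlib.md5(trace_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# partition_for_trace("0af7651916cd43dd8448eb211c80319c", 64) is the same on every host
```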
Alternative approach
If cost is a primary concern, I would consider a streaming architecture where traces are analyzed in real-time and only anomalies are stored. Tools like Apache Flink can compute aggregates and detect patterns without storing raw spans.
What I would add for specific use cases:
Debugging production issues: Add exemplars - links from metrics/logs to representative traces that explain the anomaly.
Performance optimization: Add continuous profiling (CPU, memory) and link profiles to spans for flamegraph analysis.
Security and compliance: Add trace sanitization to remove PII, audit logging for trace access, and data residency controls.