Design Walkthrough
Problem Statement
The Question: Design a distributed tracing system that can track requests across hundreds of microservices, handling billions of spans per day.
Distributed tracing is essential for:
- Debugging production issues: find where requests fail in a chain of 20 services
- Performance optimization: identify slow services and dependencies
- Understanding system behavior: visualize request flow and dependencies
- SLA monitoring: track end-to-end latency across service boundaries
What to say first
Before I design the system, let me clarify what we mean by tracing. A trace represents a single request journey through the system. Each service creates a span - a unit of work with timing information. I will design the system to collect, store, and query these spans.
Hidden requirements interviewers are testing:
- Do you understand the tracing data model (traces, spans, context)?
- Can you design efficient context propagation?
- How do you handle sampling at scale?
- Can you design a storage system for time-series span data?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many services do we have? What is the request volume? How many spans per request on average?
Why this matters: Determines storage requirements and sampling strategy.
Typical answer: 500 services, 100K RPS, average 20 spans per request = 2M spans/second.
Architecture impact: Need aggressive sampling; cannot store everything.
Question 2: Query Patterns
How will engineers query traces? By trace ID? By service? By error type? Time range?
Why this matters: Determines indexing strategy and storage schema.
Typical answer: Lookup by trace ID is most common; we also need to find traces by service, operation, and error.
Architecture impact: Need secondary indexes, possibly separate stores for different access patterns.
Question 3: Retention
How long do we need to keep traces? Same retention for all traces or tiered?
Why this matters: Determines storage costs and architecture.
Typical answer: 7 days for normal traces, 30 days for errors, 90 days for aggregated metrics.
Architecture impact: Need tiered storage with different retention policies.
Question 4: Transport Types
What protocols do services use? HTTP, gRPC, message queues, databases?
Why this matters: Each transport needs different context propagation.
Typical answer: Mix of HTTP, gRPC, Kafka, and database calls.
Architecture impact: Need pluggable context propagation for each transport.
Stating assumptions
Based on this, I will assume: 500 services, 100K RPS, 20 spans per request (2M spans/sec), 1% sampling rate (20K spans/sec stored), 7-day retention, mixed transports. Let me estimate storage: 20K spans/sec x 1KB per span x 86,400 sec/day x 7 days = about 12TB.
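A quick sanity check of that arithmetic (the 1KB average span size is the assumption doing the heavy lifting):

```python
spans_per_sec = 100_000 * 20                           # 2M spans/sec before sampling
stored_per_sec = spans_per_sec * 0.01                  # 1% head sampling -> 20K spans/sec
bytes_per_week = stored_per_sec * 1_000 * 86_400 * 7   # ~1KB per span, 7-day retention
print(f"{bytes_per_week / 1e12:.1f} TB")               # -> 12.1 TB
```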
The Hard Part
Say this out loud
The hard part here is propagating trace context across all service boundaries with minimal overhead, while making consistent sampling decisions so we never get partial traces.
Why this is genuinely hard:
1. Context Propagation: Every outgoing call (HTTP, gRPC, queue publish) must include trace context. Every incoming call must extract it. This requires instrumenting every framework and library.
2. Consistent Sampling: If Service A decides to sample a trace, Services B, C, and D must also sample it. Independent sampling decisions create partial traces that are useless.
3. Clock Skew: Spans from different machines have different clocks. A child span might appear to start before its parent. Need to handle this gracefully.
4. Async Boundaries: Message queues break the synchronous call chain. How do you connect a Kafka consumer span to its producer span?
The Sampling Consistency Problem
Common mistake
Candidates often propose each service deciding independently whether to sample. This creates partial traces where you see Service A and C but not B - completely useless for debugging.
The fundamental data model:
- Trace: A tree of spans representing one request journey (identified by trace_id)
- Span: A unit of work in one service (identified by span_id)
- Parent-Child: Spans link to their parent via parent_span_id
- Context: The trace_id, span_id, and sampling decision that propagates across service boundaries
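As a concrete sketch, the same model in Python (a minimal illustration of these definitions; field names beyond the identifiers above are assumptions, not a particular SDK's schema):

```python
from dataclasses import dataclass, field

@dataclass
class SpanContext:
    trace_id: str        # 128-bit hex; shared by every span in the trace
    span_id: str         # 64-bit hex; identifies this span
    sampled: bool        # the sampling decision travels with the context

@dataclass
class Span:
    context: SpanContext
    parent_span_id: str              # empty string for the root span
    service: str                     # which service did the work
    operation: str                   # e.g. "GET /checkout"
    start_time_us: int               # microseconds since epoch
    duration_us: int
    tags: dict = field(default_factory=dict)   # annotations, e.g. {"error": True}
```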
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Services | 500 | Need wide instrumentation coverage |
| Request Rate | 100,000 RPS | 2M spans/second before sampling (20 spans/request) |
| Sampling Rate | 1% baseline | ~20K spans/second stored |
| Retention | 7 days | ~12TB of span storage |
What to say
At 2M spans per second, we cannot store everything. With 1% sampling, we get 20K spans/second which is manageable. The key is making sampling decisions intelligently - we want 100% of errors and slow requests even if we sample 1% of normal traffic.
Access Pattern Analysis:
- Write-heavy: Spans are written once and read rarely (most traces are never viewed)
- Point queries: Most queries are by trace_id (find all spans for one trace)
- Search queries: Find traces by service, operation, error, or latency range
- Time-bounded: Queries almost always include a time range
- Append-only: Spans are immutable once written
Span ingestion:
- 100K RPS x 20 spans = 2M spans/second
- After 1% sampling: 20K spans/second

High-Level Architecture
Let me design the system in layers: instrumentation, collection, processing, storage, and query.
What to say
I will structure this as a pipeline: instrumentation generates spans, collectors aggregate them, processors sample and enrich, storage persists, and query services expose APIs. Each layer can scale independently.
Distributed Tracing Architecture
Component Responsibilities:
1. Tracing SDK: Embedded in each service
   - Creates spans for incoming/outgoing calls
   - Propagates context via headers
   - Exports spans to the local agent
2. Local Agent: Runs on each host (see the sketch after this list)
   - Batches spans from multiple processes
   - Performs head-based sampling
   - Forwards to collectors
3. Collector Cluster: Receives spans from all agents
   - Validates and enriches spans
   - Routes to Kafka for processing
   - Horizontally scalable
4. Stream Processor: Consumes from Kafka
   - Assembles complete traces
   - Performs tail-based sampling
   - Computes service dependencies
5. Storage Tiers: Hot, warm, cold
   - Hot: Last 24 hours, fast queries
   - Warm: 7 days, slightly slower
   - Cold: 90 days, compressed, batch access
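A minimal sketch of the local agent's batch-and-forward loop; the queue bound, batch size, flush interval, and the injected `send_to_collector` callable are all illustrative assumptions:

```python
import queue
import time

class LocalAgent:
    def __init__(self, send_to_collector, max_batch=500, flush_interval=1.0):
        self.spans = queue.Queue(maxsize=10_000)  # bounded: drop rather than grow
        self.send = send_to_collector             # network call, injected for testability
        self.max_batch = max_batch
        self.flush_interval = flush_interval

    def submit(self, span) -> bool:
        # Head-based sampling could be applied here, before spans ever leave the host.
        try:
            self.spans.put_nowait(span)
            return True
        except queue.Full:
            return False                          # shed load; never block the caller

    def run(self):
        # Drain up to max_batch spans each flush_interval and forward to a collector.
        while True:
            batch, deadline = [], time.time() + self.flush_interval
            while len(batch) < self.max_batch and time.time() < deadline:
                try:
                    batch.append(self.spans.get(timeout=0.05))
                except queue.Empty:
                    continue
            if batch:
                self.send(batch)
```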
Real-world reference
Jaeger uses this architecture. Zipkin is similar but simpler. Datadog and Lightstep add ML-based analysis on top. Google Dapper pioneered many of these concepts.
Data Model and Storage
The span is the fundamental unit. Let me define the data model and storage strategy.
What to say
The data model follows the OpenTelemetry standard: traces contain spans, spans have context, timing, and annotations. Storage is optimized for trace ID lookups with secondary indexes for search.
Span {
    // Identity
    trace_id: string[32]        // 128-bit, globally unique
    span_id: string[16]         // 64-bit, unique within the trace
    parent_span_id: string[16]  // empty for the root span
}

Storage Schema Design:
We need to optimize for two access patterns:
1. Get trace by ID: Return all spans for a trace
2. Search traces: Find traces matching criteria
-- Primary table: lookup by trace_id (illustrative schema)
CREATE TABLE traces (
    trace_id text,
    span_id  text,
    span     blob,                     -- serialized span payload
    PRIMARY KEY (trace_id, span_id)
);

Why Cassandra?
- Write-optimized: LSM tree handles high write throughput
- Time-series friendly: Good for append-only, time-bounded data
- Scalable: Linear scaling with nodes
- TTL support: Automatic data expiration
Alternative: ClickHouse for analytics
CREATE TABLE spans (
    trace_id   FixedString(32),
    span_id    FixedString(16),
    service    LowCardinality(String),   -- illustrative analytics columns
    start_time DateTime64(6)
) ENGINE = MergeTree() ORDER BY (service, start_time);

Context Propagation Deep Dive
Context propagation is the mechanism that links spans across service boundaries. Without it, we have isolated spans, not traces.
Critical component
Context propagation is where most tracing implementations fail in practice. If even one service in the chain does not propagate context, the trace is broken.
HTTP Headers for context propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

The W3C traceparent header packs version, trace_id, parent span_id, and flags into one value; the trailing 01 flag marks the trace as sampled.

Context Propagation Flow
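A minimal sketch of extracting that context on the receiving side (error handling omitted; the return shape is an illustration, not a standard API):

```python
def parse_traceparent(header: str):
    # W3C format: version-trace_id-parent_span_id-flags
    version, trace_id, parent_span_id, flags = header.split("-")
    sampled = (int(flags, 16) & 0x01) == 1   # low bit carries the sampling decision
    return trace_id, parent_span_id, sampled

# parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
# -> ("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True)
```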
Propagation for different transports:
| Transport | Propagation Method | Implementation |
|---|---|---|
| HTTP | Headers (traceparent, tracestate) | Interceptor on HTTP client/server |
| gRPC | Metadata | Interceptor pattern |
| Kafka | Message headers | Inject on produce, extract on consume |
| Database | Typically none (leaf span) | Client-side span wraps the query |
class TracingHTTPClient:
    # Outbound interceptor; the tracer interface shown is illustrative, not a specific SDK's
    def __init__(self, tracer, base_client):
        self.tracer = tracer
        self.base_client = base_client

    def get(self, url, headers=None):
        headers = dict(headers or {})
        with self.tracer.start_span("http.client") as span:
            headers["traceparent"] = span.to_traceparent()   # inject context
            return self.base_client.get(url, headers=headers)

class TracingMiddleware:
    # Inbound interceptor; extracts context before handling the request
    def __init__(self, tracer, app):
        self.tracer = tracer
        self.app = app

    def __call__(self, request):
        parent = self.tracer.extract(request.headers.get("traceparent"))  # None if absent
        with self.tracer.start_span("http.server", parent=parent):
            return self.app(request)

Sampling Strategies
At 2M spans/second, storing everything is impractical. Sampling is essential, but must be done carefully to preserve trace integrity.
System Invariant
All spans in a trace must have the same sampling decision. If we sample a trace, we must sample ALL its spans. Partial traces are useless.
Two types of sampling:
| Type | When Decided | Pros | Cons |
|---|---|---|---|
| Head-based | At trace start (first span) | Simple, low overhead | Cannot consider outcome (errors, latency) |
| Tail-based | After trace complete | Can prioritize errors and slow traces | Must buffer all spans temporarily |
Head vs Tail Sampling
class HeadBasedSampler:
    def __init__(self, base_rate=0.01):
        self.base_rate = base_rate  # 1% default

    def should_sample(self, trace_id: str) -> bool:
        # Decide from the trace_id itself (assumed 128-bit hex), so every
        # service reaches the same answer without coordination.
        return int(trace_id[:8], 16) < self.base_rate * 0xFFFFFFFF

class TailBasedSampler:
    def __init__(self):
        self.buffer = {}  # trace_id -> list of spans

    def add_span(self, span):
        self.buffer.setdefault(span.trace_id, []).append(span)

    def should_keep(self, trace_id: str) -> bool:
        # Keep traces containing errors or slow spans; the span fields and the
        # 1-second threshold here are illustrative.
        spans = self.buffer.get(trace_id, [])
        return any(s.error for s in spans) or any(s.duration_us > 1_000_000 for s in spans)

Best practice
Use head-based sampling for volume control, plus tail-based sampling for ensuring important traces (errors, slow requests) are always captured. This hybrid approach gives you the best of both worlds.
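A small sketch of that hybrid policy, assuming the sampler shapes above; the decision runs in the stream processor once a trace has been assembled:

```python
def keep_trace(trace_id, head, tail):
    # The deterministic head decision gives a predictable 1% baseline volume...
    if head.should_sample(trace_id):
        return True
    # ...and the tail check rescues error and slow traces the baseline would drop.
    return tail.should_keep(trace_id)
```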
Consistency and Correctness
System Invariants
1. Sampling must be consistent: all spans in a trace are sampled, or none are.
2. Parent-child relationships must be preserved.
3. Timestamps must be monotonic within a span (start before end).
Clock Skew Handling:
Different machines have different clocks. A child span might appear to start before its parent due to clock skew.
Machine A clock: 10:00:00.000 (accurate)
Machine B clock: 10:00:00.050 (50ms ahead)
Solutions for clock skew:
1. NTP synchronization: Keep clocks within a few milliseconds
2. Adjust on display: UI adjusts child span to fit within parent (see the sketch below)
3. Use logical clocks: Vector clocks for ordering (complex)
4. Record both: Store raw and adjusted timestamps
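A minimal sketch of option 2, clamping a child span into its parent's window at display time (one possible adjustment, not the only one):

```python
def clamp_to_parent(child_start_us, child_end_us, parent_start_us, parent_end_us):
    # Shift, then trim, the child so it renders inside its parent despite clock skew.
    duration = child_end_us - child_start_us
    start = max(child_start_us, parent_start_us)   # never start before the parent
    end = min(start + duration, parent_end_us)     # never end after the parent
    return start, end
```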
Trace Completeness:
How do we know when a trace is complete? We may never know for certain - an async process might add spans later.
import time

def is_trace_likely_complete(spans: list, timeout_seconds: int = 60) -> bool:
    if not spans:
        return False
    # Heuristic: the trace started long enough ago that stragglers are unlikely;
    # a short grace period absorbs late-arriving async spans.
    oldest_start = min(s.start_time for s in spans)   # epoch seconds; field name illustrative
    grace_seconds = 5
    return time.time() - oldest_start > timeout_seconds + grace_seconds

Business impact
If we store incomplete traces, engineers get confused by missing spans. If we wait too long, memory usage explodes. The 60-second timeout with 5-second grace period balances these concerns for most request-response workloads.
Failure Modes and Resilience
Proactively discuss failures
Tracing is observability infrastructure - it must not impact application reliability. Let me walk through failure modes.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Collector down | Spans lost, no traces | Local agent buffers, multiple collectors | Agents buffer 1-2 minutes, retry to healthy collectors |
| Kafka backlog | Delayed trace availability | Increase partitions, consumer parallelism | Kafka handles backpressure, traces appear later |
Critical Design Principle: Tracing must fail open. An issue with tracing should never impact the application being traced.
import queue

class DefensiveTracer:
    def __init__(self, exporter, buffer_size=10000):
        self.exporter = exporter
        self.buffer = queue.Queue(maxsize=buffer_size)  # bounded buffer caps memory use

Non-negotiable
The SDK must have timeouts on all I/O operations (100ms max for export), bounded buffers (to prevent memory exhaustion), and exception handling that never propagates to application code.
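A hedged sketch of those guardrails on the export path of the `DefensiveTracer` above (the `exporter.export` signature and batch size are assumptions):

```python
def safe_export(exporter, buffer, max_batch=500, timeout_s=0.1):
    # Drain the bounded buffer and export with a hard 100ms I/O cap; never raise.
    batch = []
    try:
        while not buffer.empty() and len(batch) < max_batch:
            batch.append(buffer.get_nowait())
        if batch:
            exporter.export(batch, timeout=timeout_s)
    except Exception:
        pass   # a tracing failure must never propagate into application code
```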
Evolution and Scaling
What to say
This design handles up to 100K RPS with 500 services. Let me discuss how it evolves for 10x scale and additional capabilities like distributed profiling.
Evolution Path:
Stage 1: Basic Tracing (up to 50K RPS)
- Single collector cluster
- Head-based sampling only
- Cassandra for storage
- Basic UI for trace lookup

Stage 2: Production Tracing (up to 500K RPS)
- Distributed collectors per region
- Kafka for durability and buffering
- Tail-based sampling for errors/slow requests
- Secondary indexes for search
- Service dependency graphs

Stage 3: Advanced Observability Platform
- Continuous profiling linked to traces
- Anomaly detection on trace patterns
- Auto-instrumentation for common frameworks
- Trace-to-logs and trace-to-metrics correlation
Multi-Region Architecture
Scaling Strategies:
| Component | Scaling Approach | Limit |
|---|---|---|
| Collectors | Add more instances, partition by trace_id hash | Linear with traffic |
| Kafka | Add partitions, partition by trace_id | Millions of spans/sec |
| Storage | Add nodes, time-based partitioning | Petabytes |
| Query | Read replicas, caching, query routing | Thousands of queries/sec |
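A small sketch of the trace_id-keyed partitioning used by the collector and Kafka rows above: every span of a trace lands in the same partition, so trace assembly in the stream processor stays local (the hash choice here is illustrative):

```python
import hashlib

def partition_for_trace(trace_id: str, num_partitions: int) -> int:
    # Stable, process-independent hash: all producers agree on the partition.
    digest = hashlib.md5(trace_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# partition_for_trace("0af7651916cd43dd8448eb211c80319c", 64) is the same on every host
```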
Alternative approach
If cost is a primary concern, I would consider a streaming architecture where traces are analyzed in real-time and only anomalies are stored. Tools like Apache Flink can compute aggregates and detect patterns without storing raw spans.
What I would add for specific use cases:
Debugging production issues: Add exemplars - links from metrics/logs to representative traces that explain the anomaly.
Performance optimization: Add continuous profiling (CPU, memory) and link profiles to spans for flamegraph analysis.
Security and compliance: Add trace sanitization to remove PII, audit logging for trace access, and data residency controls.