System Design Masterclass
Infrastructure | tracing | observability | microservices | debugging | distributed-systems | advanced

Design Distributed Tracing System

Design a system for tracing requests across microservices

Billions of spans/day | Similar to Google, Uber, Netflix, Datadog, New Relic, Lightstep | 45 min read

Summary

Distributed tracing tracks requests as they flow through multiple microservices, enabling debugging and performance analysis in complex architectures. The core challenge is propagating trace context across service boundaries with minimal overhead while ingesting and storing billions of spans efficiently. This question is asked at Google, Uber, Netflix, and other companies that run a microservices architecture.

Key Takeaways

Core Problem

This is fundamentally a causality tracking problem. We need to reconstruct the path and timing of a request across dozens of services that only see their local piece.

The Hard Part

Propagating trace context across all service boundaries (HTTP, gRPC, queues, databases) with near-zero overhead while sampling intelligently to manage data volume.
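One way to picture propagation over HTTP is a small extract/inject pair. The sketch below assumes the W3C Trace Context `traceparent` header layout (`00-<trace id>-<parent span id>-<flags>`), which the text above does not prescribe. The function names and the dict-based span are hypothetical, and header keys are assumed to be lowercased already.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_context(headers):
    """Parse the incoming header; return None if it is absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are treated as invalid.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def start_span(incoming_headers):
    """Continue the caller's trace if context is present, else start a new one."""
    ctx = extract_context(incoming_headers)
    return {
        "trace_id": ctx["trace_id"] if ctx else secrets.token_hex(16),
        "parent_span_id": ctx["parent_span_id"] if ctx else None,
        "span_id": secrets.token_hex(8),
        "sampled": ctx["sampled"] if ctx else True,
    }


def inject_context(span, outgoing_headers):
    """Write this span's identity into the headers of a downstream call."""
    flags = "01" if span["sampled"] else "00"
    outgoing_headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-{flags}"
    return outgoing_headers
```

The per-request cost here is one regex match and one string format, which is the kind of overhead budget the design is aiming for. Queues and gRPC would carry the same fields in message attributes or metadata instead of HTTP headers.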

Scaling Axis

Partition storage and queries by a hash of the trace ID, so all spans of a trace land on the same shard. Sample aggressively (0.1-1% of traces) but keep 100% of errors and slow requests.
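A rough sketch of both ideas follows: a shard function and a keep/drop decision. Note that keeping every error and slow request implies the decision is made after the request finishes (tail-based sampling), since the outcome is not known up front. The shard count, the 1000 ms slow threshold, and the function names are assumptions chosen for illustration.

```python
import hashlib


def _digest(trace_id):
    return hashlib.sha256(trace_id.encode("utf-8")).digest()


def shard_for(trace_id, num_shards=64):
    """All spans of one trace map to the same shard."""
    return int.from_bytes(_digest(trace_id)[:8], "big") % num_shards


def keep_trace(trace_id, had_error, duration_ms,
               sample_rate=0.01, slow_threshold_ms=1000):
    """Keep every error and slow trace; keep a deterministic fraction of the rest."""
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    # Use different digest bytes than shard_for so the two decisions stay independent.
    bucket = int.from_bytes(_digest(trace_id)[8:16], "big")
    # Integer comparison avoids float rounding at sample_rate == 1.0.
    return bucket < int(sample_rate * 2**64)
```

Because the decision depends only on the trace ID, every collector that sees a span of the same healthy trace reaches the same verdict without coordination.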

The Question: Design a distributed tracing system that can track requests across hundreds of microservices, handling billions of spans per day.

Distributed tracing is essential for:
- Debugging production issues - Find where requests fail in a chain of 20 services
- Performance optimization - Identify slow services and dependencies
- Understanding system behavior - Visualize request flow and dependencies
- SLA monitoring - Track end-to-end latency across service boundaries

What to say first

Before I design the system, let me clarify what we mean by tracing. A trace represents a single request journey through the system. Each service creates a span - a unit of work with timing information. I will design the system to collect, store, and query these spans.
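The trace and span vocabulary can be sketched as a small data model plus a helper that times a unit of work. The field names, the `traced` helper, and the in-memory `COLLECTED` list are hypothetical stand-ins; a real tracer would hand finished spans to an exporter rather than a list.

```python
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                  # shared by every span in the request
    span_id: str                   # unique to this unit of work
    parent_span_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_unix_ns: int             # wall-clock start, comparable across hosts
    duration_ns: Optional[int] = None
    tags: dict = field(default_factory=dict)


COLLECTED = []  # stand-in for an exporter queue


@contextmanager
def traced(service, operation, parent=None):
    """Create a span, time the wrapped block, and record the finished span."""
    span = Span(
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
        service=service,
        operation=operation,
        start_unix_ns=time.time_ns(),
    )
    started = time.monotonic_ns()  # monotonic clock for the duration itself
    try:
        yield span
    except Exception as exc:
        span.tags["error"] = repr(exc)
        raise
    finally:
        span.duration_ns = time.monotonic_ns() - started
        COLLECTED.append(span)


with traced("gateway", "GET /checkout") as root:
    with traced("payments", "charge", parent=root) as child:
        child.tags["amount_cents"] = 1299
```

After this runs, `COLLECTED` holds the payments span followed by the gateway span, because inner spans finish first. Both carry the same `trace_id`, and the child points at the root through `parent_span_id`.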

Hidden requirements interviewers are testing:
- Do you understand the tracing data model (traces, spans, context)?
- Can you design efficient context propagation?
- How do you handle sampling at scale?
- Can you design a storage system for time-series span data?
