Design Walkthrough
Problem Statement
The Question: Design a system that shows live driver location to riders and calculates accurate ETA, similar to Uber or Lyft.
This problem has two interconnected parts:
1. Live Location Sharing: Driver app sends GPS coordinates, rider app displays driver position in real-time on a map
2. ETA Calculation: Estimate arrival time based on current location, destination, route, and real-time traffic conditions
Both must work at massive scale (millions of concurrent rides) with low latency.
What to say first
This is a real-time geospatial streaming problem. Before I design, let me clarify the scale, latency requirements, and what happens during network issues. I will address location ingestion, storage, fanout to riders, and ETA computation separately.
Hidden requirements interviewers are testing:
- Can you handle high-throughput location ingestion?
- Do you understand geospatial indexing (quadtrees, geohash)?
- Can you design efficient pub/sub for location updates?
- Do you know how ETA calculation actually works (graph algorithms, traffic data)?
- How do you handle offline/reconnection scenarios?
Clarifying Questions
Ask these questions to scope the problem correctly. Each answer shapes your architecture.
Question 1: Scale
How many concurrent active rides? How frequently do drivers send location updates?
Why this matters: Determines ingestion throughput and storage requirements.
Typical answer: 1M concurrent rides, drivers send location every 3-4 seconds.
Architecture impact: 1M rides x 1 update/4s = 250K location writes/second.
Question 2: Latency
What is the acceptable delay between driver movement and rider seeing it on screen?
Why this matters: Sub-second requires WebSockets/SSE, multi-second allows polling.
Typical answer: Less than 2 seconds end-to-end.
Architecture impact: Need persistent connections, not HTTP polling.
Question 3: ETA Accuracy
How accurate must ETA be? Should it account for real-time traffic?
Why this matters: Static routing is simple; live traffic adds complexity.
Typical answer: Yes, incorporate live traffic. ETA should be within 2-3 minutes of actual.
Architecture impact: Need a traffic data pipeline and route recalculation on traffic changes.
Question 4: Offline Handling
What happens if driver or rider goes offline temporarily?
Why this matters: Mobile networks are unreliable; need graceful degradation.
Typical answer: Show last known location with timestamp, reconnect automatically.
Architecture impact: Need location history, client-side caching, reconnection logic.
Stating assumptions
I will assume: 1M concurrent rides, location updates every 4 seconds, sub-2-second delivery to rider, real-time traffic for ETA, graceful offline handling with last-known location.
The Hard Part
Say this out loud
The hard part here is the fan-out problem: when a driver location changes, we need to efficiently notify all relevant subscribers (rider, support, internal systems) without broadcasting to everyone.
Why this is genuinely hard:
1. Write Amplification: One driver update could trigger notifications to multiple systems (rider app, dispatch, ETA service, analytics). Naive fan-out creates N writes per location update.
2. Geospatial Routing: You cannot just broadcast to all riders. Updates must be routed only to the rider whose driver moved, which requires maintaining subscription mappings.
3. Connection Management: 1M riders with persistent WebSocket connections. Each connection consumes memory and file descriptors. Need connection pooling and horizontal scaling.
4. ETA is Actually Hard: ETA is not just distance/speed. It requires:
   - Road network graph
   - Current traffic conditions
   - Historical patterns
   - Turn costs, traffic lights
   - Dynamic rerouting
Common mistake
Candidates often propose polling (rider polls every second). This does not scale: 1M riders polling every second = 1M requests/second of mostly unchanged data. Push-based architecture is essential.
The fundamental insight:
This is a pub/sub problem with geographic partitioning:
- Publishers: Drivers sending location updates
- Subscribers: Riders watching their specific driver
- Topic: ride_id or driver_id

Unlike general pub/sub, we can optimize because:
- Each rider subscribes to exactly ONE driver (their assigned driver)
- We know the mapping (ride_id -> driver_id -> rider_id)
- Updates are geographically clustered
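A minimal sketch of that point-to-point routing, assuming hypothetical helpers for the ride-mapping lookup and the WebSocket push (not a specific framework):

def on_location_update(driver_id: str, location: dict) -> None:
    # The ride mapping is the key optimization: driver_id -> (ride_id, rider_id)
    ride = get_ride_by_driver(driver_id)          # hypothetical lookup, e.g. a Redis hash
    if ride is None:
        return                                    # driver is idle; nothing to fan out
    # Point-to-point delivery: exactly one subscriber per publisher
    push_to_rider(ride["rider_id"], {             # hypothetical push via the WebSocket tier
        "ride_id": ride["ride_id"],
        "lat": location["lat"],
        "lng": location["lng"],
        "ts": location["ts"],
    })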
Scale & Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Calculation |
|---|---|---|
| Concurrent Rides | 1,000,000 | Peak hour estimate |
| Location Update Frequency | Every 4 seconds | GPS battery optimization |
What to say
At 250K location updates per second, we need a streaming architecture. The 1:1 mapping of driver to rider actually makes fan-out simple - no broadcast needed, just point-to-point delivery.
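As a quick back-of-envelope check of these numbers (a sketch only; the 200-byte average event size is the assumption used in the storage estimate below):

CONCURRENT_RIDES = 1_000_000
UPDATE_INTERVAL_S = 4                  # one GPS update per driver every 4 seconds
EVENT_SIZE_BYTES = 200                 # assumed average serialized location event

writes_per_second = CONCURRENT_RIDES / UPDATE_INTERVAL_S           # 250,000
events_per_day = writes_per_second * 86_400                        # ~21.6B, i.e. ~20B rounded
storage_per_day_tb = events_per_day * EVENT_SIZE_BYTES / 1e12      # ~4.3 TB/day

print(f"{writes_per_second:,.0f} writes/s, {events_per_day / 1e9:.1f}B events/day, {storage_per_day_tb:.1f} TB/day")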
Access Pattern Analysis:
Writes (Location Ingestion):
- High throughput, append-only
- Each write is independent
- Need low latency acknowledgment to driver

Reads (Location Delivery):
- Point-to-point: rider reads their driver location
- Very low latency requirement (<1 second)
- Push-based, not pull-based

ETA Computation:
- Triggered by location updates
- CPU-intensive (graph traversal)
- Can be cached per route segment
Location Storage (if we keep 24 hours):
- 20B events x 200 bytes = 4 TB/day
- Need time-series database with TTL

High-Level Architecture
Let me design the system in layers: ingestion, processing, storage, and delivery.
What to say
I will separate location ingestion from delivery. Drivers write to a stream, processors consume and route to the correct rider. We scale by geographic region.
Live Location System Architecture
Component Responsibilities:
1. Location Ingestion Service (see the sketch after this list)
   - Receives GPS data from driver apps
   - Validates coordinates (lat: -90 to 90, lng: -180 to 180)
   - Adds server timestamp
   - Publishes to Kafka topic partitioned by driver_id

2. Kafka / Message Stream
   - Decouples ingestion from processing
   - Partitioned by driver_id for ordering
   - Enables replay for debugging
   - Handles back-pressure

3. Location Processor
   - Consumes from Kafka
   - Looks up ride_id and rider_id from ride mapping
   - Updates Redis cache with latest location
   - Writes to time-series DB for history
   - Triggers ETA recalculation
   - Publishes to WebSocket delivery

4. ETA Service
   - Receives (driver_location, destination)
   - Computes route using road network graph
   - Incorporates real-time traffic
   - Returns ETA in seconds

5. WebSocket Servers
   - Maintains persistent connections with riders
   - Receives location updates for specific riders
   - Pushes to connected rider apps
   - Handles reconnection, heartbeats
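A minimal sketch of the ingestion service's hot path, assuming the kafka-python client and an illustrative topic name driver-locations (neither is mandated by the design):

import json, time
from kafka import KafkaProducer   # kafka-python; any Kafka client works

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                          # placeholder broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest_location(driver_id: str, lat: float, lng: float, client_ts: float) -> bool:
    # Validate coordinates before accepting the update
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        return False
    event = {
        "driver_id": driver_id,
        "lat": lat,
        "lng": lng,
        "client_ts": client_ts,
        "server_ts": time.time(),        # server timestamp added at ingestion
    }
    # Keying by driver_id keeps one driver's updates in one partition, preserving order
    producer.send("driver-locations", key=driver_id, value=event)
    return True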
Real-world reference
Uber uses a system called Ringpop for routing location updates to the correct WebSocket server. They partition by geographic cells (H3 hexagons) for efficient nearby-driver queries.
Data Model & Storage
We need multiple storage systems optimized for different access patterns.
What to say
Redis is the source of truth for current location (fast reads). TimeSeries DB stores history. The ride mapping table tells us which rider subscribes to which driver.
# Current driver location (updated every 4 seconds)
Key: driver:location:{driver_id}
Value: {lat, lng, heading, speed, server_ts}    # representative fields

Time-Series Storage (Location History):

CREATE TABLE location_history (
    driver_id   VARCHAR(36) NOT NULL,
    ride_id     VARCHAR(36),
    lat         DOUBLE PRECISION NOT NULL,   -- illustrative remaining columns
    lng         DOUBLE PRECISION NOT NULL,
    recorded_at TIMESTAMP NOT NULL,
    PRIMARY KEY (driver_id, recorded_at)
);

Geospatial Indexing with Geohash:
Geohash converts (lat, lng) into a string where nearby locations share a prefix.
Example:
- San Francisco: 9q8yy
- Nearby location: 9q8yz
- Far location: dpz83 (New York)
This enables efficient range queries: "Find all drivers where geohash LIKE '9q8y%'" returns drivers in a geographic region.
import geohash

def encode_location(lat: float, lng: float, precision: int = 7) -> str:
    """Encode a coordinate into a geohash string; nearby points share a prefix."""
    # e.g. San Francisco (37.7749, -122.4194) encodes with the '9q8yy' prefix shown above
    return geohash.encode(lat, lng, precision)

Algorithm Deep Dive: ETA Calculation
ETA calculation is more complex than distance/speed. Let me walk through how it actually works.
What to say
ETA is fundamentally a shortest-path problem on a weighted graph, where edge weights represent travel time and change based on real-time traffic.
ETA Calculation Flow
Step 1: Map Matching
GPS coordinates are noisy. Map matching snaps the GPS point to the most likely road segment.
- GPS says driver is at (37.7749, -122.4194)
- But that point is 10m off the road
- Map matching finds the closest road segment
- Returns: "Driver is on Market Street, heading East"
def map_match(lat: float, lng: float, heading: float) -> RoadSegment:
    """
    Find the road segment this GPS point most likely represents.
    Sketch: fetch candidate segments near the point from a spatial index, then pick
    the one that best matches both distance and the reported heading.
    """
    candidates = road_index.nearby_segments(lat, lng, radius_m=30)   # hypothetical spatial index
    return min(candidates,
               key=lambda seg: seg.distance_to(lat, lng) + heading_penalty(seg, heading))  # hypothetical helpers

Step 2: Route Calculation with Traffic
The road network is a graph where:
- Nodes = intersections
- Edges = road segments
- Edge weight = travel time (not distance!)

Travel time varies based on:
- Road type (highway vs residential)
- Current traffic conditions
- Time of day (rush hour)
- Historical patterns
import heapq
from dataclasses import dataclass
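# A minimal sketch (not a production routing engine): Dijkstra over travel-time
# weights. The Edge shape and graph representation here are illustrative assumptions.

@dataclass
class Edge:
    to_node: int
    travel_time_s: float     # current estimate, already blended with live traffic

def shortest_travel_time(graph: dict[int, list[Edge]], source: int, dest: int) -> float:
    """Return the estimated travel time in seconds from source to dest."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dest:
            return cost
        if cost > best.get(node, float("inf")):
            continue                      # stale heap entry
        for edge in graph.get(node, []):
            new_cost = cost + edge.travel_time_s
            if new_cost < best.get(edge.to_node, float("inf")):
                best[edge.to_node] = new_cost
                heapq.heappush(heap, (new_cost, edge.to_node))
    return float("inf")                   # destination unreachable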
Step 3: Traffic Data Pipeline
Real-time traffic comes from multiple sources:
- Driver speed reports (actual vs expected)
- Historical patterns (Tuesday 8am is always slow)
- Incidents (accidents, road closures)
- Events (concerts, sports games)
| Traffic Source | Update Frequency | Latency |
|---|---|---|
| Driver speed probes | Every 4 seconds | Real-time |
| Historical patterns | Pre-computed | N/A |
| Incidents | Event-driven | Minutes |
| Special events | Scheduled | Hours ahead |
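A sketch of how a segment's expected travel time might blend these sources; the weighting scheme and parameter names here are assumptions, not a production formula:

def segment_travel_time_s(length_m: float,
                          historical_speed_mps: float,
                          live_probe_speed_mps: float | None,
                          probe_count: int) -> float:
    """Blend live speed probes with the historical baseline for one road segment."""
    if live_probe_speed_mps is None or probe_count == 0:
        speed = historical_speed_mps                  # no live data: fall back to history
    else:
        # Trust live probes more as more drivers report; cap the live weight at 0.8
        live_weight = min(0.8, probe_count / 10)
        speed = live_weight * live_probe_speed_mps + (1 - live_weight) * historical_speed_mps
    return length_m / max(speed, 0.5)                 # guard against near-zero speeds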
Optimization: Contraction Hierarchies
A* on the full road network is too slow for production (millions of edges). Uber and Google use Contraction Hierarchies - a preprocessing technique that enables continental-scale routing in milliseconds by pre-computing shortcuts.
Consistency & Invariants
System Invariants
- Location shown to the rider must never be more than 5 seconds stale during an active ride.
- If the location is stale, show the timestamp and a warning.
- ETA must recalculate when the route changes by more than 10%.
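A small sketch of enforcing these invariants on the delivery side; the thresholds come from the invariants above, and interpreting "route changes" as a change in route length is an assumption:

import time

STALENESS_LIMIT_S = 5.0
ETA_RECALC_THRESHOLD = 0.10   # recalculate when the route length changes by >10%

def location_is_stale(last_update_ts: float, now: float | None = None) -> bool:
    """True if the last location update is older than the staleness invariant allows."""
    now = time.time() if now is None else now
    return (now - last_update_ts) > STALENESS_LIMIT_S

def should_recalculate_eta(previous_route_m: float, current_route_m: float) -> bool:
    """True if the route changed enough to violate the ETA invariant."""
    if previous_route_m <= 0:
        return True
    return abs(current_route_m - previous_route_m) / previous_route_m > ETA_RECALC_THRESHOLD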
Why we choose eventual consistency for location:
Location data is naturally eventually consistent:
- GPS readings have inherent latency
- Network transmission adds delay
- A 1-second-old location is still useful
- Strong consistency would add unacceptable latency
| Consistency Model | Use Case | Reasoning |
|---|---|---|
| Eventual | Live location display | 1-2 second staleness is fine for map display |
| Eventual | ETA updates | ETA is an estimate anyway; updates every few seconds |
| Strong | Ride assignment | Cannot assign same driver to two riders |
| Strong | Payment | Must not double-charge or lose charges |
Business impact mapping
If rider sees driver 2 seconds behind actual position, minor inconvenience. If ride gets assigned to wrong driver, major operational issue. We use strong consistency only where business impact justifies the latency cost.
Handling Out-of-Order Updates:
Network conditions can cause location updates to arrive out of order:
Timeline:
- T=0: Driver at location A
- T=4: Driver at location B
- T=8: Driver at location C

Arrival order: A, C, B (B was delayed)
Solution: Include client timestamp, only update if newer than current.
def update_location(driver_id: str, location: dict) -> bool:
    """
    Only update if this location is newer than what we have.
    Returns True if the update was applied, False if it was stale.
    """
    current = location_cache.get(driver_id)            # hypothetical store, e.g. a Redis hash
    if current is not None and current["client_ts"] >= location["client_ts"]:
        return False                                   # out-of-order or duplicate update: ignore
    location_cache[driver_id] = location
    return True

Failure Modes & Resilience
Proactively discuss failures
Let me walk through what happens when things go wrong. This is critical for a system riders depend on for safety.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Driver app offline | No location updates | Show last known with timestamp, prediction | Rider knows it is stale; can call driver |
| Kafka down | Location updates queue up | Buffer in ingestion service, retry | Locations arrive late but not lost |
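A sketch of the "buffer in ingestion service, retry" mitigation; the bounded in-memory deque and the retry-on-next-call policy are assumptions, not a prescribed implementation:

from collections import deque

class BufferedPublisher:
    """Keep recent location events in memory while the message stream is unavailable."""

    def __init__(self, publish_fn, max_buffered: int = 10_000):
        self.publish_fn = publish_fn               # e.g. a Kafka produce call
        self.buffer = deque(maxlen=max_buffered)   # oldest events are dropped first if full

    def publish(self, event: dict) -> None:
        self.buffer.append(event)
        self.flush()

    def flush(self) -> None:
        # Drain the buffer in order; stop on the first failure and retry on the next call
        while self.buffer:
            event = self.buffer[0]
            try:
                self.publish_fn(event)
            except Exception:
                return                             # stream still down; keep buffering
            self.buffer.popleft()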
Client-Side Resilience:
The rider app must handle poor connectivity gracefully:
interface LocationUpdate {
  lat: number;
  lng: number;
  heading: number;          // illustrative remaining fields
  timestamp: number;        // ms since epoch; used to label stale data
}

Graceful degradation philosophy
The principle is: always show something useful. Stale data with a warning is better than an error screen. Predicted location is better than frozen location. Cached ETA is better than no ETA.
Evolution & Scaling
What to say
This design works for 1M concurrent rides in a single region. Let me discuss scaling to 10M rides globally and adding features like nearby driver display.
Evolution Path:
Stage 1: Single Region (1M rides)
- Single Kafka cluster
- Redis cluster in one region
- ETA service with road graph for one country
- Works for most ride-sharing companies

Stage 2: Multi-Region (10M rides)
- Kafka cluster per region
- Redis cluster per region
- Global ride metadata in distributed DB
- Cross-region rides handled specially

Stage 3: Real-time Nearby Drivers (Uber-scale)
- Geospatial indexing for driver discovery
- Cell-based partitioning (H3 hexagons)
- Millions of writes/sec just for nearby queries
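A small sketch of cell-based partitioning, assuming the h3-py v4 Python API (latlng_to_cell / grid_disk); the resolution and ring size are tunable choices:

import h3   # h3-py v4 API assumed

def cell_for(lat: float, lng: float, resolution: int = 8) -> str:
    """Map a coordinate to its H3 cell; the cell id doubles as a partition key."""
    return h3.latlng_to_cell(lat, lng, resolution)

def cells_to_query(lat: float, lng: float, ring_size: int = 1, resolution: int = 8) -> list[str]:
    """The rider's cell plus its neighbors, to catch drivers just over a cell boundary."""
    center = h3.latlng_to_cell(lat, lng, resolution)
    return list(h3.grid_disk(center, ring_size))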
Geographic Partitioning (H3 Cells)
Advanced Feature: Nearby Drivers Display
The rider app shows available drivers nearby before booking. This is harder than active ride tracking:
- Rider is NOT subscribing to a specific driver
- Rider wants ALL drivers in a geographic area
- As drivers move, the set changes constantly
- Cannot push individual updates for every driver
def get_nearby_drivers(lat: float, lng: float, radius_km: float) -> list[Driver]:
    """
    Find available drivers within radius.
    Sketch: query candidate drivers by geo cell, then filter by exact distance.
    """
    cells = covering_cells(lat, lng, radius_km)                          # hypothetical cell-cover helper
    candidates = [d for cell in cells for d in drivers_in_cell(cell)]    # hypothetical index lookup
    return [d for d in candidates if distance_km(lat, lng, d.lat, d.lng) <= radius_km]

Alternative approach
If we needed sub-100ms ETA for every request globally, I would pre-compute ETA matrices between common points (airports, downtown areas) and use lookup instead of real-time calculation. This is what Google Maps does for popular routes.
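A sketch of that alternative: a precomputed matrix keyed by coarse origin/destination cells, with a fall-through to live routing. The names and refresh policy are illustrative:

# Precomputed ETAs between popular coarse cells, refreshed offline (e.g. hourly)
precomputed_eta_s: dict[tuple[str, str], float] = {}

def eta_seconds(origin_cell: str, dest_cell: str, compute_live_eta) -> float:
    """Serve popular routes from the precomputed matrix; compute the rest on demand."""
    cached = precomputed_eta_s.get((origin_cell, dest_cell))
    if cached is not None:
        return cached                                  # O(1) lookup, sub-millisecond
    return compute_live_eta(origin_cell, dest_cell)    # fall back to the routing service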
What I would do differently for...
Food delivery (DoorDash): Add order status (preparing, ready, picked up) to location updates. ETA includes restaurant prep time.
Package tracking (FedEx): Lower update frequency (per stop, not continuous). Aggregate tracking for efficiency.
Fleet management: Add vehicle telemetry (fuel, engine diagnostics). Longer history retention for compliance.
Emergency services: Stricter latency requirements (<500ms). Redundant paths. Priority routing.