Design Walkthrough
Problem Statement
The Question: Design a system that shows live driver location to riders and calculates accurate ETA, similar to Uber or Lyft.
This problem has two interconnected parts:
1. Live Location Sharing: Driver app sends GPS coordinates, rider app displays driver position in real-time on a map
2. ETA Calculation: Estimate arrival time based on current location, destination, route, and real-time traffic conditions
Both must work at massive scale (millions of concurrent rides) with low latency.
What to say first
This is a real-time geospatial streaming problem. Before I design, let me clarify the scale, latency requirements, and what happens during network issues. I will address location ingestion, storage, fanout to riders, and ETA computation separately.
Hidden requirements interviewers are testing:
- Can you handle high-throughput location ingestion?
- Do you understand geospatial indexing (quadtrees, geohash)?
- Can you design efficient pub/sub for location updates?
- Do you know how ETA calculation actually works (graph algorithms, traffic data)?
- How do you handle offline/reconnection scenarios?
Clarifying Questions
Ask these questions to scope the problem correctly. Each answer shapes your architecture.
Question 1: Scale
How many concurrent active rides? How frequently do drivers send location updates?
Why this matters: Determines ingestion throughput and storage requirements.
Typical answer: 1M concurrent rides, drivers send location every 3-4 seconds.
Architecture impact: 1M rides x 1 update/4s = 250K location writes/second.
Question 2: Latency
What is the acceptable delay between driver movement and rider seeing it on screen?
Why this matters: Sub-second requires WebSockets/SSE, multi-second allows polling.
Typical answer: Less than 2 seconds end-to-end.
Architecture impact: Need persistent connections, not HTTP polling.
Question 3: ETA Accuracy
How accurate must ETA be? Should it account for real-time traffic?
Why this matters: Static routing is simple; live traffic adds complexity.
Typical answer: Yes, incorporate live traffic. ETA should be within 2-3 minutes of actual.
Architecture impact: Need a traffic data pipeline and route recalculation on traffic changes.
Question 4: Offline Handling
What happens if driver or rider goes offline temporarily?
Why this matters: Mobile networks are unreliable; need graceful degradation.
Typical answer: Show last known location with timestamp, reconnect automatically.
Architecture impact: Need location history, client-side caching, reconnection logic.
Stating assumptions
I will assume: 1M concurrent rides, location updates every 4 seconds, sub-2-second delivery to rider, real-time traffic for ETA, graceful offline handling with last-known location.
The Hard Part
Say this out loud
The hard part here is the fan-out problem: when a driver location changes, we need to efficiently notify all relevant subscribers (rider, support, internal systems) without broadcasting to everyone.
Why this is genuinely hard:
1. Write Amplification: One driver update could trigger notifications to multiple systems (rider app, dispatch, ETA service, analytics). Naive fan-out creates N writes per location update.
2. Geospatial Routing: You cannot just broadcast to all riders. Updates must be routed only to the rider whose driver moved, which requires maintaining subscription mappings.
3. Connection Management: 1M riders with persistent WebSocket connections. Each connection consumes memory and file descriptors. Need connection pooling and horizontal scaling.
4. ETA is Actually Hard: ETA is not just distance/speed. It requires:
   - Road network graph
   - Current traffic conditions
   - Historical patterns
   - Turn costs, traffic lights
   - Dynamic rerouting
Common mistake
Candidates often propose polling (rider polls every second). This does not scale: 1M riders polling every second = 1M requests/second of mostly unchanged data. Push-based architecture is essential.
The fundamental insight:
This is a pub/sub problem with geographic partitioning:
- Publishers: Drivers sending location updates
- Subscribers: Riders watching their specific driver
- Topic: ride_id or driver_id

Unlike general pub/sub, we can optimize because:
- Each rider subscribes to exactly ONE driver (their assigned driver)
- We know the mapping (ride_id -> driver_id -> rider_id)
- Updates are geographically clustered
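A minimal sketch of that point-to-point routing, assuming hypothetical helpers for the ride-mapping lookup and the WebSocket push (not a specific framework):

def on_location_update(driver_id: str, location: dict) -> None:
    # The ride mapping is the key optimization: driver_id -> (ride_id, rider_id)
    ride = get_ride_by_driver(driver_id)          # hypothetical lookup, e.g. a Redis hash
    if ride is None:
        return                                    # driver is idle; nothing to fan out
    # Point-to-point delivery: exactly one subscriber per publisher
    push_to_rider(ride["rider_id"], {             # hypothetical push via the WebSocket tier
        "ride_id": ride["ride_id"],
        "lat": location["lat"],
        "lng": location["lng"],
        "ts": location["ts"],
    })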
Scale & Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Calculation |
|---|---|---|
| Concurrent Rides | 1,000,000 | Peak hour estimate |
| Location Update Frequency | Every 4 seconds | GPS battery optimization |
What to say
At 250K location updates per second, we need a streaming architecture. The 1:1 mapping of driver to rider actually makes fan-out simple - no broadcast needed, just point-to-point delivery.
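As a quick back-of-envelope check of these numbers (a sketch only; the 200-byte average event size is the assumption used in the storage estimate below):

CONCURRENT_RIDES = 1_000_000
UPDATE_INTERVAL_S = 4                  # one GPS update per driver every 4 seconds
EVENT_SIZE_BYTES = 200                 # assumed average serialized location event

writes_per_second = CONCURRENT_RIDES / UPDATE_INTERVAL_S           # 250,000
events_per_day = writes_per_second * 86_400                        # ~21.6B, i.e. ~20B rounded
storage_per_day_tb = events_per_day * EVENT_SIZE_BYTES / 1e12      # ~4.3 TB/day

print(f"{writes_per_second:,.0f} writes/s, {events_per_day / 1e9:.1f}B events/day, {storage_per_day_tb:.1f} TB/day")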
Access Pattern Analysis:
Writes (Location Ingestion):
- High throughput, append-only
- Each write is independent
- Need low latency acknowledgment to driver

Reads (Location Delivery):
- Point-to-point: rider reads their driver location
- Very low latency requirement (<1 second)
- Push-based, not pull-based

ETA Computation:
- Triggered by location updates
- CPU-intensive (graph traversal)
- Can be cached per route segment
Location Storage (if we keep 24 hours):
- 20B events x 200 bytes = 4 TB/day
- Need time-series database with TTL

High-Level Architecture
Let me design the system in layers: ingestion, processing, storage, and delivery.
What to say
I will separate location ingestion from delivery. Drivers write to a stream, processors consume and route to the correct rider. We scale by geographic region.
Live Location System Architecture
Component Responsibilities:
1. Location Ingestion Service (see the sketch after this list)
   - Receives GPS data from driver apps
   - Validates coordinates (lat: -90 to 90, lng: -180 to 180)
   - Adds server timestamp
   - Publishes to Kafka topic partitioned by driver_id

2. Kafka / Message Stream
   - Decouples ingestion from processing
   - Partitioned by driver_id for ordering
   - Enables replay for debugging
   - Handles back-pressure

3. Location Processor
   - Consumes from Kafka
   - Looks up ride_id and rider_id from ride mapping
   - Updates Redis cache with latest location
   - Writes to time-series DB for history
   - Triggers ETA recalculation
   - Publishes to WebSocket delivery

4. ETA Service
   - Receives (driver_location, destination)
   - Computes route using road network graph
   - Incorporates real-time traffic
   - Returns ETA in seconds

5. WebSocket Servers
   - Maintains persistent connections with riders
   - Receives location updates for specific riders
   - Pushes to connected rider apps
   - Handles reconnection, heartbeats
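A minimal sketch of the ingestion service's hot path, assuming the kafka-python client and an illustrative topic name driver-locations (neither is mandated by the design):

import json, time
from kafka import KafkaProducer   # kafka-python; any Kafka client works

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                          # placeholder broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest_location(driver_id: str, lat: float, lng: float, client_ts: float) -> bool:
    # Validate coordinates before accepting the update
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        return False
    event = {
        "driver_id": driver_id,
        "lat": lat,
        "lng": lng,
        "client_ts": client_ts,
        "server_ts": time.time(),        # server timestamp added at ingestion
    }
    # Keying by driver_id keeps one driver's updates in one partition, preserving order
    producer.send("driver-locations", key=driver_id, value=event)
    return True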
Real-world reference
Uber uses a system called Ringpop for routing location updates to the correct WebSocket server. They partition by geographic cells (H3 hexagons) for efficient nearby-driver queries.
Data Model & Storage
We need multiple storage systems optimized for different access patterns.
What to say
Redis is the source of truth for current location (fast reads). TimeSeries DB stores history. The ride mapping table tells us which rider subscribes to which driver.
# Current driver location (updated every 4 seconds)
Key: driver:location:{driver_id}
Value: {lat, lng, heading, speed, server_ts}    # representative fields

Time-Series Storage (Location History):

CREATE TABLE location_history (
    driver_id   VARCHAR(36) NOT NULL,
    ride_id     VARCHAR(36),
    lat         DOUBLE PRECISION NOT NULL,   -- illustrative remaining columns
    lng         DOUBLE PRECISION NOT NULL,
    recorded_at TIMESTAMP NOT NULL,
    PRIMARY KEY (driver_id, recorded_at)
);

Geospatial Indexing with Geohash:
Geohash converts (lat, lng) into a string where nearby locations share a prefix.
Example:
- San Francisco: 9q8yy
- Nearby location: 9q8yz
- Far location: dpz83 (New York)
This enables efficient range queries: "Find all drivers where geohash LIKE '9q8y%'" returns drivers in a geographic region.
import geohash

def encode_location(lat: float, lng: float, precision: int = 7) -> str:
    """Encode a coordinate into a geohash string; nearby points share a prefix."""
    # e.g. San Francisco (37.7749, -122.4194) encodes with the '9q8yy' prefix shown above
    return geohash.encode(lat, lng, precision)

Algorithm Deep Dive: ETA Calculation
ETA calculation is more complex than distance/speed. Let me walk through how it actually works.
What to say
ETA is fundamentally a shortest-path problem on a weighted graph, where edge weights represent travel time and change based on real-time traffic.
ETA Calculation Flow
Step 1: Map Matching
GPS coordinates are noisy. Map matching snaps the GPS point to the most likely road segment.
- GPS says driver is at (37.7749, -122.4194)
- But that point is 10m off the road
- Map matching finds the closest road segment
- Returns: "Driver is on Market Street, heading East"
def map_match(lat: float, lng: float, heading: float) -> RoadSegment:
    """
    Find the road segment this GPS point most likely represents.
    Sketch: fetch candidate segments near the point from a spatial index, then pick
    the one that best matches both distance and the reported heading.
    """
    candidates = road_index.nearby_segments(lat, lng, radius_m=30)   # hypothetical spatial index
    return min(candidates,
               key=lambda seg: seg.distance_to(lat, lng) + heading_penalty(seg, heading))  # hypothetical helpers

Step 2: Route Calculation with Traffic
The road network is a graph where:
- Nodes = intersections
- Edges = road segments
- Edge weight = travel time (not distance!)

Travel time varies based on:
- Road type (highway vs residential)
- Current traffic conditions
- Time of day (rush hour)
- Historical patterns
import heapq
from dataclasses import dataclass
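# A minimal sketch (not a production routing engine): Dijkstra over travel-time
# weights. The Edge shape and graph representation here are illustrative assumptions.

@dataclass
class Edge:
    to_node: int
    travel_time_s: float     # current estimate, already blended with live traffic

def shortest_travel_time(graph: dict[int, list[Edge]], source: int, dest: int) -> float:
    """Return the estimated travel time in seconds from source to dest."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dest:
            return cost
        if cost > best.get(node, float("inf")):
            continue                      # stale heap entry
        for edge in graph.get(node, []):
            new_cost = cost + edge.travel_time_s
            if new_cost < best.get(edge.to_node, float("inf")):
                best[edge.to_node] = new_cost
                heapq.heappush(heap, (new_cost, edge.to_node))
    return float("inf")                   # destination unreachable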
Step 3: Traffic Data Pipeline
Real-time traffic comes from multiple sources:
- Driver speed reports (actual vs expected)
- Historical patterns (Tuesday 8am is always slow)
- Incidents (accidents, road closures)
- Events (concerts, sports games)
| Traffic Source | Update Frequency | Latency |
|---|---|---|
| Driver speed probes | Every 4 seconds | Real-time |
| Historical patterns | Pre-computed | N/A |
| Incidents | Event-driven | Minutes |
| Special events | Scheduled | Hours ahead |
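A sketch of how a segment's expected travel time might blend these sources; the weighting scheme and parameter names here are assumptions, not a production formula:

def segment_travel_time_s(length_m: float,
                          historical_speed_mps: float,
                          live_probe_speed_mps: float | None,
                          probe_count: int) -> float:
    """Blend live speed probes with the historical baseline for one road segment."""
    if live_probe_speed_mps is None or probe_count == 0:
        speed = historical_speed_mps                  # no live data: fall back to history
    else:
        # Trust live probes more as more drivers report; cap the live weight at 0.8
        live_weight = min(0.8, probe_count / 10)
        speed = live_weight * live_probe_speed_mps + (1 - live_weight) * historical_speed_mps
    return length_m / max(speed, 0.5)                 # guard against near-zero speeds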
Optimization: Contraction Hierarchies
A* on the full road network is too slow for production (millions of edges). Uber and Google use Contraction Hierarchies - a preprocessing technique that enables continental-scale routing in milliseconds by pre-computing shortcuts.
Consistency & Invariants
System Invariants
- Location shown to the rider must never be more than 5 seconds stale during an active ride.
- If the location is stale, show the timestamp and a warning.
- ETA must recalculate when the route changes by more than 10%.
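A small sketch of enforcing these invariants on the delivery side; the thresholds come from the invariants above, and interpreting "route changes" as a change in route length is an assumption:

import time

STALENESS_LIMIT_S = 5.0
ETA_RECALC_THRESHOLD = 0.10   # recalculate when the route length changes by >10%

def location_is_stale(last_update_ts: float, now: float | None = None) -> bool:
    """True if the last location update is older than the staleness invariant allows."""
    now = time.time() if now is None else now
    return (now - last_update_ts) > STALENESS_LIMIT_S

def should_recalculate_eta(previous_route_m: float, current_route_m: float) -> bool:
    """True if the route changed enough to violate the ETA invariant."""
    if previous_route_m <= 0:
        return True
    return abs(current_route_m - previous_route_m) / previous_route_m > ETA_RECALC_THRESHOLD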
Why we choose eventual consistency for location:
Location data is naturally eventually consistent:
- GPS readings have inherent latency
- Network transmission adds delay
- A 1-second-old location is still useful
- Strong consistency would add unacceptable latency
| Consistency Model | Use Case | Reasoning |
|---|---|---|
| Eventual | Live location display | 1-2 second staleness is fine for map display |
| Eventual | ETA updates | ETA is an estimate anyway; updates every few seconds |
| Strong | Ride assignment | Cannot assign same driver to two riders |
| Strong | Payment | Must not double-charge or lose charges |
Business impact mapping
If rider sees driver 2 seconds behind actual position, minor inconvenience. If ride gets assigned to wrong driver, major operational issue. We use strong consistency only where business impact justifies the latency cost.
Handling Out-of-Order Updates:
Network conditions can cause location updates to arrive out of order:
Timeline:
- T=0: Driver at location A
- T=4: Driver at location B
- T=8: Driver at location C

Arrival order: A, C, B (B was delayed)
Solution: Include client timestamp, only update if newer than current.
def update_location(driver_id: str, location: dict) -> bool:
    """
    Only update if this location is newer than what we have.
    Returns True if the update was applied, False if it was stale.
    """
    current = location_cache.get(driver_id)            # hypothetical store, e.g. a Redis hash
    if current is not None and current["client_ts"] >= location["client_ts"]:
        return False                                   # out-of-order or duplicate update: ignore
    location_cache[driver_id] = location
    return True

Failure Modes & Resilience
Proactively discuss failures
Let me walk through what happens when things go wrong. This is critical for a system riders depend on for safety.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Driver app offline | No location updates | Show last known with timestamp, prediction | Rider knows it is stale; can call driver |
| Kafka down | Location updates queue up | Buffer in ingestion service, retry | Locations arrive late but not lost |
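A sketch of the "buffer in ingestion service, retry" mitigation; the bounded in-memory deque and the retry-on-next-call policy are assumptions, not a prescribed implementation:

from collections import deque

class BufferedPublisher:
    """Keep recent location events in memory while the message stream is unavailable."""

    def __init__(self, publish_fn, max_buffered: int = 10_000):
        self.publish_fn = publish_fn               # e.g. a Kafka produce call
        self.buffer = deque(maxlen=max_buffered)   # oldest events are dropped first if full

    def publish(self, event: dict) -> None:
        self.buffer.append(event)
        self.flush()

    def flush(self) -> None:
        # Drain the buffer in order; stop on the first failure and retry on the next call
        while self.buffer:
            event = self.buffer[0]
            try:
                self.publish_fn(event)
            except Exception:
                return                             # stream still down; keep buffering
            self.buffer.popleft()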
Client-Side Resilience:
The rider app must handle poor connectivity gracefully:
interface LocationUpdate {
  lat: number;
  lng: number;
  heading: number;          // illustrative remaining fields
  timestamp: number;        // ms since epoch; used to label stale data
}

Graceful degradation philosophy
The principle is: always show something useful. Stale data with a warning is better than an error screen. Predicted location is better than frozen location. Cached ETA is better than no ETA.
Evolution & Scaling
What to say
This design works for 1M concurrent rides in a single region. Let me discuss scaling to 10M rides globally and adding features like nearby driver display.
Evolution Path:
Stage 1: Single Region (1M rides)
- Single Kafka cluster
- Redis cluster in one region
- ETA service with road graph for one country
- Works for most ride-sharing companies

Stage 2: Multi-Region (10M rides)
- Kafka cluster per region
- Redis cluster per region
- Global ride metadata in distributed DB
- Cross-region rides handled specially

Stage 3: Real-time Nearby Drivers (Uber-scale)
- Geospatial indexing for driver discovery
- Cell-based partitioning (H3 hexagons)
- Millions of writes/sec just for nearby queries
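A small sketch of cell-based partitioning, assuming the h3-py v4 Python API (latlng_to_cell / grid_disk); the resolution and ring size are tunable choices:

import h3   # h3-py v4 API assumed

def cell_for(lat: float, lng: float, resolution: int = 8) -> str:
    """Map a coordinate to its H3 cell; the cell id doubles as a partition key."""
    return h3.latlng_to_cell(lat, lng, resolution)

def cells_to_query(lat: float, lng: float, ring_size: int = 1, resolution: int = 8) -> list[str]:
    """The rider's cell plus its neighbors, to catch drivers just over a cell boundary."""
    center = h3.latlng_to_cell(lat, lng, resolution)
    return list(h3.grid_disk(center, ring_size))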
Geographic Partitioning (H3 Cells)
Advanced Feature: Nearby Drivers Display
The rider app shows available drivers nearby before booking. This is harder than active ride tracking:
- Rider is NOT subscribing to a specific driver
- Rider wants ALL drivers in a geographic area
- As drivers move, the set changes constantly
- Cannot push individual updates for every driver
def get_nearby_drivers(lat: float, lng: float, radius_km: float) -> list[Driver]:
    """
    Find available drivers within radius.
    Sketch: query candidate drivers by geo cell, then filter by exact distance.
    """
    cells = covering_cells(lat, lng, radius_km)                          # hypothetical cell-cover helper
    candidates = [d for cell in cells for d in drivers_in_cell(cell)]    # hypothetical index lookup
    return [d for d in candidates if distance_km(lat, lng, d.lat, d.lng) <= radius_km]

Alternative approach
If we needed sub-100ms ETA for every request globally, I would pre-compute ETA matrices between common points (airports, downtown areas) and use lookup instead of real-time calculation. This is what Google Maps does for popular routes.
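A sketch of that alternative: a precomputed matrix keyed by coarse origin/destination cells, with a fall-through to live routing. The names and refresh policy are illustrative:

# Precomputed ETAs between popular coarse cells, refreshed offline (e.g. hourly)
precomputed_eta_s: dict[tuple[str, str], float] = {}

def eta_seconds(origin_cell: str, dest_cell: str, compute_live_eta) -> float:
    """Serve popular routes from the precomputed matrix; compute the rest on demand."""
    cached = precomputed_eta_s.get((origin_cell, dest_cell))
    if cached is not None:
        return cached                                  # O(1) lookup, sub-millisecond
    return compute_live_eta(origin_cell, dest_cell)    # fall back to the routing service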
What I would do differently for...
Food delivery (DoorDash): Add order status (preparing, ready, picked up) to location updates. ETA includes restaurant prep time.
Package tracking (FedEx): Lower update frequency (per stop, not continuous). Aggregate tracking for efficiency.
Fleet management: Add vehicle telemetry (fuel, engine diagnostics). Longer history retention for compliance.
Emergency services: Stricter latency requirements (<500ms). Redundant paths. Priority routing.