Design Walkthrough
Problem Statement
The Question: Design a system that can sort and process 1 petabyte of data efficiently using a cluster of commodity machines.
Large-scale data processing is essential for:
- Analytics pipelines - processing daily logs, computing aggregations
- ETL jobs - transforming and loading data into warehouses
- Machine learning - feature extraction, training data preparation
- Search indexing - building inverted indexes from raw documents
What to say first
Before I design, let me clarify the requirements. I need to understand the data characteristics, processing patterns, and latency requirements to choose the right architecture.
Hidden requirements interviewers are testing:
- Do you understand external sorting (data larger than memory)?
- Can you reason about data movement and shuffle costs?
- Do you know how to handle failures in long-running jobs?
- Can you optimize for data skew?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Data Size and Shape
What is the total data size? How many records? What is the average record size?
Why this matters: Determines memory requirements and partitioning strategy.
Typical answer: 1 PB total, 10 trillion records, 100 bytes average
Architecture impact: Data will not fit in memory - need external sort
Question 2: Processing Type
Is this a one-time sort or a recurring job? Batch or streaming?
Why this matters: Determines framework choice and optimization strategy.
Typical answer: Recurring daily batch job
Architecture impact: Optimize for throughput, can checkpoint and restart
Question 3: Key Distribution
Is the sort key uniformly distributed or are there hot keys?
Why this matters: Skewed keys cause stragglers that slow everything.
Typical answer: Some keys have 1000x more data (power law distribution)
Architecture impact: Need skew handling - salting or custom partitioning
Question 4: Latency Requirements
What is the acceptable job completion time? Minutes, hours, or days?
Why this matters: Determines resource allocation and parallelism.
Typical answer: Complete within 4 hours
Architecture impact: Need high parallelism - thousands of workers
Stating assumptions
Based on this, I will assume: 1 PB input, 10 trillion records, recurring daily batch, skewed key distribution, 4-hour SLA, 1000-node cluster available.
The Hard Part
Say this out loud
The hard part here is minimizing data shuffle while handling skewed data distribution. Network transfer is the bottleneck - a naive approach would shuffle the entire dataset multiple times.
Why this is genuinely hard:
1. External Sorting: Data does not fit in memory. Must sort chunks, write to disk, then merge. This is fundamentally different from in-memory quicksort.
2. Shuffle Cost: Moving 1 PB across the network takes hours even with 10 Gbps links. Every byte shuffled is expensive.
3. Data Skew: If key 'user_123' has 1 billion records and all others have 1000, one worker gets 1000x more work.
4. Fault Tolerance: A 4-hour job on 1000 machines will see failures. Must checkpoint and recover without restarting from scratch.
5. Resource Contention: 1000 machines competing for network bandwidth, disk I/O, and memory creates complex bottlenecks.
Common mistake
Candidates often propose in-memory sorting algorithms without acknowledging that 1 PB will not fit in memory. Always start with the constraint that data exceeds available RAM.
The fundamental insight:
Distributed sorting has two phases:
1. Local Sort: Each machine sorts its local data chunk (fits in memory or uses external sort)
2. Global Merge: Combine sorted chunks into globally sorted output
The trick is making phase 2 efficient. MapReduce does this with partitioned shuffle - each reducer gets a range of keys, so local sort + merge = global sort.
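A toy illustration of why this works - if every record is routed to the partition that owns its key range, sorting each partition independently and concatenating the partitions in range order yields a globally sorted result. The keys and split points below are made up purely for illustration:

```python
import bisect

def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        partitions[bisect.bisect_left(boundaries, k)].append(k)
    return partitions

keys = [42, 7, 99, 15, 63, 3, 88, 27]        # made-up input
boundaries = [25, 60]                        # made-up split points -> 3 reducers
parts = range_partition(keys, boundaries)
locally_sorted = [sorted(p) for p in parts]  # each "reducer" sorts only its own range
globally_sorted = [k for p in locally_sorted for k in p]
assert globally_sorted == sorted(keys)       # concatenation is already globally sorted
```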
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Total Data Size | 1 PB (1,000 TB) | Cannot fit in memory - need external sort |
| Record Count | 10 trillion | Sorting is O(n log n) - massive compute |
What to say
With 1 PB and 4-hour SLA, we need to process about 70 GB per second sustained. The network can handle 1.25 TB/s, so shuffle is feasible but must be minimized.
Time to sort 1 PB: at the required ~70 GB/s aggregate throughput, each of the 1,000 nodes only needs to sustain ~70 MB/s over the 4-hour window - comfortably within sequential disk bandwidth, so the SLA is feasible as long as I/O stays sequential and shuffle volume stays close to one pass over the data (see the quick check after the list below).
Data per machine: 1 PB / 1,000 nodes = 1 TB each
Access Pattern Analysis:
- Write once, read once: Intermediate data written and read exactly once
- Sequential I/O: Sorting produces sequential disk access patterns
- All-to-all communication: Shuffle phase connects every mapper to every reducer
- Bursty network: Shuffle creates sudden network demand
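A quick back-of-envelope check of these numbers, using only the figures stated above (1 PB, 4-hour SLA, 1,000 nodes, 10 Gbps links):

```python
# Back-of-envelope check of the scale estimates above.
TOTAL_BYTES = 1_000_000_000_000_000   # 1 PB
SLA_SECONDS = 4 * 3600                # 4-hour SLA
NODES = 1_000
NIC_GBPS = 10                         # per-node network link

required_gb_per_sec = TOTAL_BYTES / SLA_SECONDS / 1e9
per_node_mb_per_sec = TOTAL_BYTES / SLA_SECONDS / NODES / 1e6
aggregate_net_tb_per_sec = NODES * NIC_GBPS / 8 / 1000
one_pass_shuffle_minutes = TOTAL_BYTES / (aggregate_net_tb_per_sec * 1e12) / 60

print(f"Required throughput:   {required_gb_per_sec:.0f} GB/s")          # ~69 GB/s
print(f"Per node:              {per_node_mb_per_sec:.0f} MB/s")          # ~69 MB/s
print(f"Aggregate network:     {aggregate_net_tb_per_sec:.2f} TB/s")     # 1.25 TB/s
print(f"One full-data shuffle: {one_pass_shuffle_minutes:.0f} min")      # ~13 min at line rate (real shuffles achieve far less)
```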
High-Level Architecture
Let me start with the classic MapReduce architecture and then discuss modern improvements.
What to say
I will use a MapReduce-style architecture with three phases: Map (local processing), Shuffle (data redistribution), and Reduce (final aggregation). This minimizes data movement by moving computation to data.
Distributed Sort Architecture
Component Responsibilities:
1. Input Storage (HDFS/S3): Stores input data in large blocks (128 MB-1 GB). Data locality enables mappers to read local data.
2. Map Phase:
   - Read input block
   - Parse into records
   - Sort records locally by key
   - Partition into buckets by key range
   - Write partitioned, sorted data to local disk
3. Shuffle Phase:
   - Each reducer pulls its partition from all mappers
   - Data transferred over network
   - Most expensive phase - minimize this
4. Reduce Phase:
   - Receive sorted partitions from all mappers
   - K-way merge of sorted streams
   - Write final sorted output
5. Output Storage: Store final sorted data, partitioned by key range (a toy walkthrough of these phases follows this list)
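A toy, in-process sketch of the data flow through these phases. The mapper blocks, sample records, and partition boundaries are made up, and the "shuffle" is just an in-memory exchange rather than a network transfer:

```python
import bisect
import heapq
from collections import defaultdict

# Made-up inputs: 3 "mappers", each holding an unsorted block of (key, value) records.
map_inputs = [
    [(17, "a"), (3, "b"), (88, "c")],
    [(55, "d"), (21, "e"), (9, "f")],
    [(70, "g"), (2, "h"), (41, "i")],
]
boundaries = [20, 60]   # made-up split points -> reducers own [min,20), [20,60), [60,max)

def map_phase(block):
    """Sort the local block, then split it into per-reducer partitions by key range."""
    partitions = defaultdict(list)
    for key, value in sorted(block):
        partitions[bisect.bisect_right(boundaries, key)].append((key, value))
    return partitions

# Shuffle: each reducer pulls its partition from every mapper.
mapper_outputs = [map_phase(block) for block in map_inputs]
reducer_inputs = {
    r: [m.get(r, []) for m in mapper_outputs] for r in range(len(boundaries) + 1)
}

def reduce_phase(sorted_runs):
    """K-way merge the already-sorted partitions received from all mappers."""
    return list(heapq.merge(*sorted_runs))

output = [kv for r in sorted(reducer_inputs) for kv in reduce_phase(reducer_inputs[r])]
assert output == sorted(kv for block in map_inputs for kv in block)
```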
Real-world reference
Google invented MapReduce for exactly this use case. Apache Spark improved on it with in-memory processing and DAG execution. Modern systems like Spark can sort 1 PB in under an hour.
Data Model and Storage
The storage layer must support high-throughput sequential I/O for both reading and writing massive datasets.
What to say
Data is stored in a distributed file system like HDFS or object storage like S3. The key insight is block-level data locality - we schedule computation where data lives to minimize network transfer.
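A minimal sketch of locality-aware task placement, assuming the scheduler can see each block's replica locations and each worker's free slots. The function and structures here are illustrative, not a real HDFS/YARN API:

```python
def pick_worker(block_replicas, free_slots):
    """Prefer a worker that already holds a replica of the block; else fall back.

    block_replicas: list of hostnames storing the block (e.g. 3 replicas)
    free_slots: dict of hostname -> number of free task slots
    """
    # Node-local: a replica host with a free slot means zero network transfer for the read.
    for host in block_replicas:
        if free_slots.get(host, 0) > 0:
            return host
    # Otherwise run anywhere with capacity and pull the block over the network.
    candidates = [h for h, slots in free_slots.items() if slots > 0]
    return max(candidates, key=lambda h: free_slots[h]) if candidates else None

# Illustrative usage with made-up hosts:
print(pick_worker(["node-07", "node-42", "node-91"],
                  {"node-07": 0, "node-42": 2, "node-13": 5}))   # -> node-42 (data-local)
```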
Input Data Layout:
/data/input/
  part-00000.parquet  (128 MB, replicated 3x)

File Format Choices:

| Format | Compression | Splittable | Best For |
|---|---|---|---|
| Parquet | High (columnar) | Yes | Analytics queries |
| ORC | High (columnar) | Yes | Hive workloads |
| Avro | Medium (row) | Yes | Schema evolution |
| CSV/JSON | Low | Yes* | Interchange |
*Only with external index
A sketch of the per-node external sort, assuming newline-delimited records sorted lexicographically by their full line:

```python
import heapq
import os
import tempfile

def external_sort(input_file, output_file, memory_limit):
    """Sort a file larger than memory using external merge sort."""
    runs = []
    # Phase 1: read runs that fit within memory_limit, sort each, spill to disk.
    with open(input_file) as src:
        while True:
            chunk = src.readlines(memory_limit)   # ~memory_limit bytes of lines
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"                 # keep records newline-framed
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge of the sorted runs produces the sorted output.
    with open(output_file, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
        os.unlink(run.name)
```

Important optimization
Use buffer pools and async I/O. Reading and writing one record at a time is slow. Batch into 64KB-1MB buffers for sequential I/O efficiency.
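A small sketch of the buffering idea - issue large sequential reads and carry partial records across chunk boundaries instead of one small read per record. The buffer size and newline framing are illustrative:

```python
def read_records(path, buffer_size=1 << 20):
    """Yield newline-delimited records while issuing ~1 MB sequential reads."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            buf = f.read(buffer_size)               # one large sequential read
            if not buf:
                break
            buf = leftover + buf
            *records, leftover = buf.split(b"\n")   # last piece may be a partial record
            for record in records:
                yield record
    if leftover:
        yield leftover                              # final record without trailing newline
```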
Algorithm Deep Dive
Let me explain the key algorithms for distributed sorting.
1. Partitioning Strategy
How we divide keys across reducers determines load balance.
A sketch of sampling-based range partitioning, assuming keys are comparable and a small random sample is representative of the full distribution:

```python
import bisect
import random

def create_range_partitioner(input_data, num_partitions, sample_size=100_000):
    """Sample input to find partition boundaries that balance load."""
    data = list(input_data)
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # Pick num_partitions - 1 evenly spaced split points from the sorted sample.
    step = len(sample) / num_partitions
    boundaries = [sample[int(step * i)] for i in range(1, num_partitions)]

    def partition(key):
        # Keys below the first boundary go to 0, above the last to num_partitions - 1.
        return bisect.bisect_right(boundaries, key)

    return partition
```

2. Handling Data Skew
When some keys have much more data than others:
Data Skew Problem
A sketch of key salting, assuming a second, much smaller pass merges the per-salt partial results for a hot key back together:

```python
import random

def salt_key(key, num_salts=10):
    """Add a random salt suffix to spread a hot key across num_salts partitions."""
    return f"{key}#{random.randrange(num_salts)}"

def unsalt_key(salted_key):
    """Recover the original key when re-aggregating the partial results."""
    return salted_key.rsplit("#", 1)[0]

# 'user_123' now lands on up to 10 different partitions instead of overloading one reducer.
print(salt_key("user_123"))   # e.g. user_123#7
```

3. Combiner Optimization
Reduce shuffle by pre-aggregating on the map side:
A sketch of a map-side combiner for an aggregation (a count, in this illustration) - partial sums leave the mapper instead of one record per input row:

```python
from collections import Counter

def map_with_combiner(input_block):
    """Combine records with the same key before shuffle (map-side pre-aggregation)."""
    partial_counts = Counter()
    for record in input_block:
        key = record["key"]          # assumes dict-shaped records, for illustration
        partial_counts[key] += 1
    # Emit one (key, partial_count) pair per distinct key instead of one pair per record.
    return list(partial_counts.items())
```

Which optimization matters most?
For sorting: range partitioning with sampling to balance load. For aggregations: combiners to reduce shuffle. For skewed data: salting or custom partitioning. Always profile to find the bottleneck.
Consistency and Invariants
System Invariants
The system must guarantee: (1) No data loss - every input record appears in output exactly once, (2) Deterministic output - same input always produces same output, (3) Correct ordering - output is globally sorted by key.
Exactly-Once Processing:
In a distributed system with failures, ensuring each record is processed exactly once is challenging:
| Guarantee | How to Achieve | Cost |
|---|---|---|
| At-least-once | Retry failed tasks, may produce duplicates | Simple, fast recovery |
| At-most-once | No retries, may lose data on failure | Unacceptable for batch |
| Exactly-once | Idempotent writes + checkpointing | More complex, some overhead |
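A minimal sketch of one common way to get exactly-once output for a batch task: write to an attempt-scoped temporary file, then atomically rename it into place, so retried attempts either fully replace the output or never publish partial data. Paths and naming here are illustrative:

```python
import os

def commit_task_output(partition_id, attempt_id, records, output_dir):
    """Idempotent task commit: stage to a temp file, then atomically rename."""
    tmp_path = os.path.join(output_dir, f"_tmp-part-{partition_id:05d}-attempt-{attempt_id}")
    final_path = os.path.join(output_dir, f"part-{partition_id:05d}")
    with open(tmp_path, "w") as f:
        f.writelines(records)
        f.flush()
        os.fsync(f.fileno())          # make the data durable before publishing it
    # os.replace is atomic on POSIX: a retried attempt overwrites rather than duplicates.
    os.replace(tmp_path, final_path)
    return final_path
```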
A sketch of the determinism checklist expressed as configuration - the config keys are illustrative, not a real framework API:

```python
def ensure_determinism(job_config):
    """Ensure the job produces identical output on retry (assumes job_config is a dict)."""
    job_config.update({
        "random_seed": 42,                  # any randomness (sampling, salting) must be seeded
        "use_wall_clock_in_output": False,  # no timestamps or hostnames baked into records
        "stable_sort": True,                # break ties the same way on every run
        "num_output_partitions": 1024,      # fixed partition count, not sized at runtime
    })
    return job_config
```

Business impact
Non-deterministic jobs cause silent data quality issues. If job A produces different output than job A retry, downstream systems see inconsistent data. Financial reports, ML training data, and audit logs must be deterministic.
Correctness Verification:
How do we know the output is correctly sorted?
A sketch of the verification: stream each partition once to confirm it is internally sorted and that it starts after the previous partition ended, using O(1) memory. It assumes newline-delimited, lexicographically sorted partition files listed in key-range order:

```python
def verify_sorted_output(output_partitions):
    """Verify global sort order without loading all data into memory."""
    previous_key = None                       # last key seen, across partition boundaries
    for path in output_partitions:            # paths must be in partition (key-range) order
        with open(path) as f:
            for line in f:                    # streaming check: O(1) memory
                key = line.rstrip("\n")
                if previous_key is not None and key < previous_key:
                    return False              # out of order within or across partitions
                previous_key = key
    return True
```

Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure modes. With 1000 machines running for 4 hours, failures are guaranteed - not exceptional.
| Failure | Impact | Mitigation | Recovery Time |
|---|---|---|---|
| Worker node crash | Lose in-progress task | Re-execute task on another node | Minutes |
| Disk failure | Lose intermediate data | Re-run upstream tasks | 10-30 minutes |
Fault Tolerance Strategy:
MapReduce-style systems use lineage-based recovery:
Lineage-Based Recovery
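A toy sketch of the lineage idea: each partition records which upstream partitions and which deterministic function produced it, so a lost partition can be recomputed instead of restarting the job. The class and the tiny task graph are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """A piece of a dataset plus the recipe (lineage) that produced it."""
    compute: callable                      # deterministic function of the parent data
    parents: list = field(default_factory=list)
    data: object = None                    # None means lost or not yet materialized

    def get(self):
        if self.data is None:              # lost: recompute from lineage, recursively
            self.data = self.compute(*[p.get() for p in self.parents])
        return self.data

# Illustrative lineage: raw block -> locally sorted run -> range-filtered partition.
raw = Partition(compute=lambda: [5, 1, 9, 3])
sorted_run = Partition(compute=lambda xs: sorted(xs), parents=[raw])
bucket = Partition(compute=lambda xs: [x for x in xs if x >= 4], parents=[sorted_run])

bucket.data = None                          # simulate losing this partition to a node crash
print(bucket.get())                         # recomputed from its parents: [5, 9]
```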
A sketch of speculative execution, assuming the scheduler can see per-task progress estimates and can launch a duplicate attempt, keeping whichever copy finishes first. The task and cluster interfaces are illustrative:

```python
class SpeculativeScheduler:
    """Launch backup copies of slow tasks to avoid stragglers."""

    def __init__(self, slowness_factor=1.5):
        self.slowness_factor = slowness_factor

    def find_stragglers(self, running_tasks):
        """A task is a straggler if its estimated finish time is far above the median."""
        estimates = sorted(t.estimated_seconds_remaining() for t in running_tasks)
        median = estimates[len(estimates) // 2]
        return [t for t in running_tasks
                if t.estimated_seconds_remaining() > self.slowness_factor * median]

    def run_speculation_round(self, running_tasks, cluster):
        for task in self.find_stragglers(running_tasks):
            if not task.has_backup():
                backup = cluster.launch_copy(task)     # same input, different node
                task.attach_backup(backup)             # first attempt to finish wins
```

Key insight
Speculative execution is critical for large jobs. Without it, a single slow node (bad disk, noisy neighbor) can double job runtime. With speculation, the backup task finishes while the original is still struggling.
Evolution and Scaling
What to say
This MapReduce architecture works well up to 1 PB with 4-hour SLA. Let me discuss how it evolves for larger scale and lower latency requirements.
Evolution Path:
Stage 1: Classic MapReduce (Hadoop)
- Disk-based, fault-tolerant
- High latency (writes intermediate data to disk)
- Good for: Very large datasets, unreliable hardware

Stage 2: In-Memory Processing (Spark)
- Keep data in memory between stages
- 10-100x faster for iterative jobs
- Good for: ML pipelines, interactive queries

Stage 3: Streaming (Flink/Spark Streaming)
- Process data as it arrives
- Sub-second latency
- Good for: Real-time analytics
| System | Latency | Throughput | Fault Tolerance | Best For |
|---|---|---|---|---|
| Hadoop MR | Minutes-hours | High | Excellent | Batch ETL |
| Spark | Seconds-minutes | Very high | Good | Interactive/ML |
| Flink | Milliseconds | High | Excellent | Streaming |
| Presto/Trino | Seconds | Medium | Limited | Ad-hoc SQL |
| BigQuery | Seconds | Very high | Managed | Analytics |
Modern Optimizations:
Published petabyte-scale sort benchmarks show each generation improving on the last: disk-based Hadoop MapReduce runs measured in hours, while in-memory engines such as Spark have completed comparable sorts several times faster on far less hardware.
Alternative approaches
If latency requirement was sub-minute, I would use Spark with aggressive caching. If data was continuously arriving, I would use Flink with incremental processing. If this was a one-time migration, I might even use a single beefy machine with NVMe SSDs.
What I would do differently for...
Sorting 100 TB instead of 1 PB: Consider a single large machine with 1TB RAM and fast SSDs. Simpler, faster for moderate scale.
Real-time leaderboard: Use a streaming system (Flink) with incremental updates rather than batch recomputation.
ML feature pipeline: Spark MLlib with DataFrame API for vectorized operations and built-in ML algorithms.
Cost-sensitive workload: Use spot/preemptible instances with checkpointing. 70% cost savings with 10% longer runtime.