Design Walkthrough
Problem Statement
The Question: Design a system that can sort and process 1 petabyte of data efficiently using a cluster of commodity machines.
Large-scale data processing is essential for:
- Analytics pipelines - processing daily logs, computing aggregations
- ETL jobs - transforming and loading data into warehouses
- Machine learning - feature extraction, training data preparation
- Search indexing - building inverted indexes from raw documents
What to say first
Before I design, let me clarify the requirements. I need to understand the data characteristics, processing patterns, and latency requirements to choose the right architecture.
Hidden requirements interviewers are testing:
- Do you understand external sorting (data larger than memory)?
- Can you reason about data movement and shuffle costs?
- Do you know how to handle failures in long-running jobs?
- Can you optimize for data skew?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Data Size and Shape
What is the total data size? How many records? What is the average record size?
Why this matters: Determines memory requirements and partitioning strategy.
Typical answer: 1 PB total, 10 trillion records, 100 bytes average
Architecture impact: Data will not fit in memory - need external sort
Question 2: Processing Type
Is this a one-time sort or a recurring job? Batch or streaming?
Why this matters: Determines framework choice and optimization strategy.
Typical answer: Recurring daily batch job
Architecture impact: Optimize for throughput, can checkpoint and restart
Question 3: Key Distribution
Is the sort key uniformly distributed or are there hot keys?
Why this matters: Skewed keys cause stragglers that slow everything.
Typical answer: Some keys have 1000x more data (power law distribution)
Architecture impact: Need skew handling - salting or custom partitioning
Question 4: Latency Requirements
What is the acceptable job completion time? Minutes, hours, or days?
Why this matters: Determines resource allocation and parallelism.
Typical answer: Complete within 4 hours
Architecture impact: Need high parallelism - thousands of workers
Stating assumptions
Based on this, I will assume: 1 PB input, 10 trillion records, recurring daily batch, skewed key distribution, 4-hour SLA, 1000-node cluster available.
The Hard Part
Say this out loud
The hard part here is minimizing data shuffle while handling skewed data distribution. Network transfer is the bottleneck - a naive approach would shuffle the entire dataset multiple times.
Why this is genuinely hard:
1. External Sorting: Data does not fit in memory. Must sort chunks, write to disk, then merge. This is fundamentally different from in-memory quicksort.
2. Shuffle Cost: Moving 1 PB across the network takes hours even with 10 Gbps links. Every byte shuffled is expensive.
3. Data Skew: If key 'user_123' has 1 billion records and all others have 1000, one worker gets 1000x more work.
4. Fault Tolerance: A 4-hour job on 1000 machines will see failures. Must checkpoint and recover without restarting from scratch.
5. Resource Contention: 1000 machines competing for network bandwidth, disk I/O, and memory creates complex bottlenecks.
Common mistake
Candidates often propose in-memory sorting algorithms without acknowledging that 1 PB will not fit in memory. Always start with the constraint that data exceeds available RAM.
The fundamental insight:
Distributed sorting has two phases:
1. Local Sort: Each machine sorts its local data chunk (fits in memory or uses external sort)
2. Global Merge: Combine sorted chunks into globally sorted output
The trick is making phase 2 efficient. MapReduce does this with partitioned shuffle - each reducer gets a range of keys, so local sort + merge = global sort.
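A toy illustration of why this works - if every record is routed to the partition that owns its key range, sorting each partition independently and concatenating the partitions in range order yields a globally sorted result. The keys and split points below are made up purely for illustration:

```python
import bisect

def range_partition(keys, boundaries):
    """Route each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        partitions[bisect.bisect_left(boundaries, k)].append(k)
    return partitions

keys = [42, 7, 99, 15, 63, 3, 88, 27]        # made-up input
boundaries = [25, 60]                        # made-up split points -> 3 reducers
parts = range_partition(keys, boundaries)
locally_sorted = [sorted(p) for p in parts]  # each "reducer" sorts only its own range
globally_sorted = [k for p in locally_sorted for k in p]
assert globally_sorted == sorted(keys)       # concatenation is already globally sorted
```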
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Total Data Size | 1 PB (1,000 TB) | Cannot fit in memory - need external sort |
| Record Count | 10 trillion | Sorting is O(n log n) - massive compute |
What to say
With 1 PB and 4-hour SLA, we need to process about 70 GB per second sustained. The network can handle 1.25 TB/s, so shuffle is feasible but must be minimized.
Time to sort 1 PB: at the required ~70 GB/s aggregate throughput, each of the 1,000 nodes only needs to sustain ~70 MB/s over the 4-hour window - comfortably within sequential disk bandwidth, so the SLA is feasible as long as I/O stays sequential and shuffle volume stays close to one pass over the data (see the quick check after the list below).
Data per machine: 1 PB / 1,000 nodes = 1 TB each
Access Pattern Analysis:
- Write once, read once: Intermediate data written and read exactly once
- Sequential I/O: Sorting produces sequential disk access patterns
- All-to-all communication: Shuffle phase connects every mapper to every reducer
- Bursty network: Shuffle creates sudden network demand
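A quick back-of-envelope check of these numbers, using only the figures stated above (1 PB, 4-hour SLA, 1,000 nodes, 10 Gbps links):

```python
# Back-of-envelope check of the scale estimates above.
TOTAL_BYTES = 1_000_000_000_000_000   # 1 PB
SLA_SECONDS = 4 * 3600                # 4-hour SLA
NODES = 1_000
NIC_GBPS = 10                         # per-node network link

required_gb_per_sec = TOTAL_BYTES / SLA_SECONDS / 1e9
per_node_mb_per_sec = TOTAL_BYTES / SLA_SECONDS / NODES / 1e6
aggregate_net_tb_per_sec = NODES * NIC_GBPS / 8 / 1000
one_pass_shuffle_minutes = TOTAL_BYTES / (aggregate_net_tb_per_sec * 1e12) / 60

print(f"Required throughput:   {required_gb_per_sec:.0f} GB/s")          # ~69 GB/s
print(f"Per node:              {per_node_mb_per_sec:.0f} MB/s")          # ~69 MB/s
print(f"Aggregate network:     {aggregate_net_tb_per_sec:.2f} TB/s")     # 1.25 TB/s
print(f"One full-data shuffle: {one_pass_shuffle_minutes:.0f} min")      # ~13 min at line rate (real shuffles achieve far less)
```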
High-Level Architecture
Let me start with the classic MapReduce architecture and then discuss modern improvements.
What to say
I will use a MapReduce-style architecture with three phases: Map (local processing), Shuffle (data redistribution), and Reduce (final aggregation). This minimizes data movement by moving computation to data.
Distributed Sort Architecture
Component Responsibilities:
1. Input Storage (HDFS/S3): Stores input data in large blocks (128 MB-1 GB). Data locality enables mappers to read local data.
2. Map Phase:
   - Read input block
   - Parse into records
   - Sort records locally by key
   - Partition into buckets by key range
   - Write partitioned, sorted data to local disk
3. Shuffle Phase:
   - Each reducer pulls its partition from all mappers
   - Data transferred over network
   - Most expensive phase - minimize this
4. Reduce Phase:
   - Receive sorted partitions from all mappers
   - K-way merge of sorted streams
   - Write final sorted output
5. Output Storage: Store final sorted data, partitioned by key range (a toy walkthrough of these phases follows this list)
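A toy, in-process sketch of the data flow through these phases. The mapper blocks, sample records, and partition boundaries are made up, and the "shuffle" is just an in-memory exchange rather than a network transfer:

```python
import bisect
import heapq
from collections import defaultdict

# Made-up inputs: 3 "mappers", each holding an unsorted block of (key, value) records.
map_inputs = [
    [(17, "a"), (3, "b"), (88, "c")],
    [(55, "d"), (21, "e"), (9, "f")],
    [(70, "g"), (2, "h"), (41, "i")],
]
boundaries = [20, 60]   # made-up split points -> reducers own [min,20), [20,60), [60,max)

def map_phase(block):
    """Sort the local block, then split it into per-reducer partitions by key range."""
    partitions = defaultdict(list)
    for key, value in sorted(block):
        partitions[bisect.bisect_right(boundaries, key)].append((key, value))
    return partitions

# Shuffle: each reducer pulls its partition from every mapper.
mapper_outputs = [map_phase(block) for block in map_inputs]
reducer_inputs = {
    r: [m.get(r, []) for m in mapper_outputs] for r in range(len(boundaries) + 1)
}

def reduce_phase(sorted_runs):
    """K-way merge the already-sorted partitions received from all mappers."""
    return list(heapq.merge(*sorted_runs))

output = [kv for r in sorted(reducer_inputs) for kv in reduce_phase(reducer_inputs[r])]
assert output == sorted(kv for block in map_inputs for kv in block)
```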
Real-world reference
Google invented MapReduce for exactly this use case. Apache Spark improved on it with in-memory processing and DAG execution. Modern systems like Spark can sort 1 PB in under an hour.
Data Model and Storage
The storage layer must support high-throughput sequential I/O for both reading and writing massive datasets.
What to say
Data is stored in a distributed file system like HDFS or object storage like S3. The key insight is block-level data locality - we schedule computation where data lives to minimize network transfer.
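A minimal sketch of locality-aware task placement, assuming the scheduler can see each block's replica locations and each worker's free slots. The function and structures here are illustrative, not a real HDFS/YARN API:

```python
def pick_worker(block_replicas, free_slots):
    """Prefer a worker that already holds a replica of the block; else fall back.

    block_replicas: list of hostnames storing the block (e.g. 3 replicas)
    free_slots: dict of hostname -> number of free task slots
    """
    # Node-local: a replica host with a free slot means zero network transfer for the read.
    for host in block_replicas:
        if free_slots.get(host, 0) > 0:
            return host
    # Otherwise run anywhere with capacity and pull the block over the network.
    candidates = [h for h, slots in free_slots.items() if slots > 0]
    return max(candidates, key=lambda h: free_slots[h]) if candidates else None

# Illustrative usage with made-up hosts:
print(pick_worker(["node-07", "node-42", "node-91"],
                  {"node-07": 0, "node-42": 2, "node-13": 5}))   # -> node-42 (data-local)
```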
Input Data Layout:
/data/input/
  part-00000.parquet  (128 MB, replicated 3x)

File Format Choices:

| Format | Compression | Splittable | Best For |
|---|---|---|---|
| Parquet | High (columnar) | Yes | Analytics queries |
| ORC | High (columnar) | Yes | Hive workloads |
| Avro | Medium (row) | Yes | Schema evolution |
| CSV/JSON | Low | Yes* | Interchange |
*Only with external index
A sketch of the per-node external sort, assuming newline-delimited records sorted lexicographically by their full line:

```python
import heapq
import os
import tempfile

def external_sort(input_file, output_file, memory_limit):
    """Sort a file larger than memory using external merge sort."""
    runs = []
    # Phase 1: read runs that fit within memory_limit, sort each, spill to disk.
    with open(input_file) as src:
        while True:
            chunk = src.readlines(memory_limit)   # ~memory_limit bytes of lines
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"                 # keep records newline-framed
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge of the sorted runs produces the sorted output.
    with open(output_file, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
        os.unlink(run.name)
```

Important optimization
Use buffer pools and async I/O. Reading and writing one record at a time is slow. Batch into 64KB-1MB buffers for sequential I/O efficiency.
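A small sketch of the buffering idea - issue large sequential reads and carry partial records across chunk boundaries instead of one small read per record. The buffer size and newline framing are illustrative:

```python
def read_records(path, buffer_size=1 << 20):
    """Yield newline-delimited records while issuing ~1 MB sequential reads."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            buf = f.read(buffer_size)               # one large sequential read
            if not buf:
                break
            buf = leftover + buf
            *records, leftover = buf.split(b"\n")   # last piece may be a partial record
            for record in records:
                yield record
    if leftover:
        yield leftover                              # final record without trailing newline
```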
Algorithm Deep Dive
Let me explain the key algorithms for distributed sorting.
1. Partitioning Strategy
How we divide keys across reducers determines load balance.
A sketch of sampling-based range partitioning, assuming keys are comparable and a small random sample is representative of the full distribution:

```python
import bisect
import random

def create_range_partitioner(input_data, num_partitions, sample_size=100_000):
    """Sample input to find partition boundaries that balance load."""
    data = list(input_data)
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # Pick num_partitions - 1 evenly spaced split points from the sorted sample.
    step = len(sample) / num_partitions
    boundaries = [sample[int(step * i)] for i in range(1, num_partitions)]

    def partition(key):
        # Keys below the first boundary go to 0, above the last to num_partitions - 1.
        return bisect.bisect_right(boundaries, key)

    return partition
```

2. Handling Data Skew
When some keys have much more data than others:
Data Skew Problem
A sketch of key salting, assuming a second, much smaller pass merges the per-salt partial results for a hot key back together:

```python
import random

def salt_key(key, num_salts=10):
    """Add a random salt suffix to spread a hot key across num_salts partitions."""
    return f"{key}#{random.randrange(num_salts)}"

def unsalt_key(salted_key):
    """Recover the original key when re-aggregating the partial results."""
    return salted_key.rsplit("#", 1)[0]

# 'user_123' now lands on up to 10 different partitions instead of overloading one reducer.
print(salt_key("user_123"))   # e.g. user_123#7
```

3. Combiner Optimization
Reduce shuffle by pre-aggregating on the map side:
A sketch of a map-side combiner for an aggregation (a count, in this illustration) - partial sums leave the mapper instead of one record per input row:

```python
from collections import Counter

def map_with_combiner(input_block):
    """Combine records with the same key before shuffle (map-side pre-aggregation)."""
    partial_counts = Counter()
    for record in input_block:
        key = record["key"]          # assumes dict-shaped records, for illustration
        partial_counts[key] += 1
    # Emit one (key, partial_count) pair per distinct key instead of one pair per record.
    return list(partial_counts.items())
```

Which optimization matters most?
For sorting: range partitioning with sampling to balance load. For aggregations: combiners to reduce shuffle. For skewed data: salting or custom partitioning. Always profile to find the bottleneck.
Consistency and Invariants
System Invariants
The system must guarantee: (1) No data loss - every input record appears in output exactly once, (2) Deterministic output - same input always produces same output, (3) Correct ordering - output is globally sorted by key.
Exactly-Once Processing:
In a distributed system with failures, ensuring each record is processed exactly once is challenging:
| Guarantee | How to Achieve | Cost |
|---|---|---|
| At-least-once | Retry failed tasks, may produce duplicates | Simple, fast recovery |
| At-most-once | No retries, may lose data on failure | Unacceptable for batch |
| Exactly-once | Idempotent writes + checkpointing | More complex, some overhead |
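A minimal sketch of one common way to get exactly-once output for a batch task: write to an attempt-scoped temporary file, then atomically rename it into place, so retried attempts either fully replace the output or never publish partial data. Paths and naming here are illustrative:

```python
import os

def commit_task_output(partition_id, attempt_id, records, output_dir):
    """Idempotent task commit: stage to a temp file, then atomically rename."""
    tmp_path = os.path.join(output_dir, f"_tmp-part-{partition_id:05d}-attempt-{attempt_id}")
    final_path = os.path.join(output_dir, f"part-{partition_id:05d}")
    with open(tmp_path, "w") as f:
        f.writelines(records)
        f.flush()
        os.fsync(f.fileno())          # make the data durable before publishing it
    # os.replace is atomic on POSIX: a retried attempt overwrites rather than duplicates.
    os.replace(tmp_path, final_path)
    return final_path
```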
A sketch of the determinism checklist expressed as configuration - the config keys are illustrative, not a real framework API:

```python
def ensure_determinism(job_config):
    """Ensure the job produces identical output on retry (assumes job_config is a dict)."""
    job_config.update({
        "random_seed": 42,                  # any randomness (sampling, salting) must be seeded
        "use_wall_clock_in_output": False,  # no timestamps or hostnames baked into records
        "stable_sort": True,                # break ties the same way on every run
        "num_output_partitions": 1024,      # fixed partition count, not sized at runtime
    })
    return job_config
```

Business impact
Non-deterministic jobs cause silent data quality issues. If job A produces different output than job A retry, downstream systems see inconsistent data. Financial reports, ML training data, and audit logs must be deterministic.
Correctness Verification:
How do we know the output is correctly sorted?
A sketch of the verification: stream each partition once to confirm it is internally sorted and that it starts after the previous partition ended, using O(1) memory. It assumes newline-delimited, lexicographically sorted partition files listed in key-range order:

```python
def verify_sorted_output(output_partitions):
    """Verify global sort order without loading all data into memory."""
    previous_key = None                       # last key seen, across partition boundaries
    for path in output_partitions:            # paths must be in partition (key-range) order
        with open(path) as f:
            for line in f:                    # streaming check: O(1) memory
                key = line.rstrip("\n")
                if previous_key is not None and key < previous_key:
                    return False              # out of order within or across partitions
                previous_key = key
    return True
```

Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure modes. With 1000 machines running for 4 hours, failures are guaranteed - not exceptional.
| Failure | Impact | Mitigation | Recovery Time |
|---|---|---|---|
| Worker node crash | Lose in-progress task | Re-execute task on another node | Minutes |
| Disk failure | Lose intermediate data | Re-run upstream tasks | 10-30 minutes |
Fault Tolerance Strategy:
MapReduce-style systems use lineage-based recovery:
Lineage-Based Recovery
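A toy sketch of the lineage idea: each partition records which upstream partitions and which deterministic function produced it, so a lost partition can be recomputed instead of restarting the job. The class and the tiny task graph are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """A piece of a dataset plus the recipe (lineage) that produced it."""
    compute: callable                      # deterministic function of the parent data
    parents: list = field(default_factory=list)
    data: object = None                    # None means lost or not yet materialized

    def get(self):
        if self.data is None:              # lost: recompute from lineage, recursively
            self.data = self.compute(*[p.get() for p in self.parents])
        return self.data

# Illustrative lineage: raw block -> locally sorted run -> range-filtered partition.
raw = Partition(compute=lambda: [5, 1, 9, 3])
sorted_run = Partition(compute=lambda xs: sorted(xs), parents=[raw])
bucket = Partition(compute=lambda xs: [x for x in xs if x >= 4], parents=[sorted_run])

bucket.data = None                          # simulate losing this partition to a node crash
print(bucket.get())                         # recomputed from its parents: [5, 9]
```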
A sketch of speculative execution, assuming the scheduler can see per-task progress estimates and can launch a duplicate attempt, keeping whichever copy finishes first. The task and cluster interfaces are illustrative:

```python
class SpeculativeScheduler:
    """Launch backup copies of slow tasks to avoid stragglers."""

    def __init__(self, slowness_factor=1.5):
        self.slowness_factor = slowness_factor

    def find_stragglers(self, running_tasks):
        """A task is a straggler if its estimated finish time is far above the median."""
        estimates = sorted(t.estimated_seconds_remaining() for t in running_tasks)
        median = estimates[len(estimates) // 2]
        return [t for t in running_tasks
                if t.estimated_seconds_remaining() > self.slowness_factor * median]

    def run_speculation_round(self, running_tasks, cluster):
        for task in self.find_stragglers(running_tasks):
            if not task.has_backup():
                backup = cluster.launch_copy(task)     # same input, different node
                task.attach_backup(backup)             # first attempt to finish wins
```

Key insight
Speculative execution is critical for large jobs. Without it, a single slow node (bad disk, noisy neighbor) can double job runtime. With speculation, the backup task finishes while the original is still struggling.
Evolution and Scaling
What to say
This MapReduce architecture works well up to 1 PB with 4-hour SLA. Let me discuss how it evolves for larger scale and lower latency requirements.
Evolution Path:
Stage 1: Classic MapReduce (Hadoop)
- Disk-based, fault-tolerant
- High latency (writes intermediate data to disk)
- Good for: Very large datasets, unreliable hardware

Stage 2: In-Memory Processing (Spark)
- Keep data in memory between stages
- 10-100x faster for iterative jobs
- Good for: ML pipelines, interactive queries

Stage 3: Streaming (Flink/Spark Streaming)
- Process data as it arrives
- Sub-second latency
- Good for: Real-time analytics
| System | Latency | Throughput | Fault Tolerance | Best For |
|---|---|---|---|---|
| Hadoop MR | Minutes-hours | High | Excellent | Batch ETL |
| Spark | Seconds-minutes | Very high | Good | Interactive/ML |
| Flink | Milliseconds | High | Excellent | Streaming |
| Presto/Trino | Seconds | Medium | Limited | Ad-hoc SQL |
| BigQuery | Seconds | Very high | Managed | Analytics |
Modern Optimizations:
Published petabyte-scale sort benchmarks show each generation improving on the last: disk-based Hadoop MapReduce runs measured in hours, while in-memory engines such as Spark have completed comparable sorts several times faster on far less hardware.
Alternative approaches
If latency requirement was sub-minute, I would use Spark with aggressive caching. If data was continuously arriving, I would use Flink with incremental processing. If this was a one-time migration, I might even use a single beefy machine with NVMe SSDs.
What I would do differently for...
Sorting 100 TB instead of 1 PB: Consider a single large machine with 1TB RAM and fast SSDs. Simpler, faster for moderate scale.
Real-time leaderboard: Use a streaming system (Flink) with incremental updates rather than batch recomputation.
ML feature pipeline: Spark MLlib with DataFrame API for vectorized operations and built-in ML algorithms.
Cost-sensitive workload: Use spot/preemptible instances with checkpointing. 70% cost savings with 10% longer runtime.