Design Walkthrough
Problem Statement
The Question: Design a system to migrate 5 petabytes of data from on-premises data centers to cloud storage within 3 months, with zero data loss.
Large-scale data migration is critical for:
- Cloud transformation: Moving enterprise workloads to cloud
- Data center consolidation: Shutting down physical infrastructure
- Platform modernization: Moving from legacy to modern data platforms
- Disaster recovery: Establishing cloud-based backup and recovery
What to say first
Before designing, I need to understand the data characteristics, network constraints, downtime tolerance, and validation requirements. These fundamentally shape the migration strategy.
Hidden requirements interviewers test:
- Can you estimate transfer times and identify bottlenecks?
- Do you understand validation and consistency challenges?
- Can you plan for failures and partial migrations?
- Do you know when physical transfer beats network transfer?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer dramatically changes the architecture.
Question 1: Data Volume and Distribution
How much data total? How is it distributed - many small files or few large files? What is the file size distribution?
- Why this matters: Small files (millions of 1KB files) have different bottlenecks than large files (thousands of 1TB files).
- Typical answer: 5 PB total, 80% in large files (100GB+), 20% in small files (under 1MB), 500 million files total.
- Architecture impact: Need different strategies for small vs large files.
Question 2: Network Capacity
What is the available bandwidth between source and cloud? Is it dedicated or shared with production traffic?
- Why this matters: Determines if online transfer is feasible.
- Typical answer: 10 Gbps dedicated, can burst to 40 Gbps.
- Architecture impact: At 10 Gbps, 5 PB takes 46 days minimum - tight for a 3-month window.
Question 3: Data Change Rate
Is the data static or actively changing during migration? What is the change rate?
- Why this matters: Static data can use bulk transfer; changing data needs CDC (Change Data Capture).
- Typical answer: 90% static (historical), 10% active with 100GB/day change rate.
- Architecture impact: Need incremental sync for active data, bulk transfer for historical data.
Question 4: Downtime Tolerance
Can we have downtime during cutover? How long? Is there a rollback requirement?
- Why this matters: Zero-downtime requires dual-write and careful cutover.
- Typical answer: 4-hour maintenance window acceptable, need ability to roll back for 1 week.
- Architecture impact: Can do big-bang cutover, but need validation before cutover and rollback capability.
Stating assumptions
I will assume: 5 PB total, 10 Gbps network, 90% static data, 4-hour cutover window, and zero tolerance for data loss. Given the timeline, we may need to supplement network transfer with physical devices.
The Hard Part
Say this out loud
The hard part is validating that petabytes of data transferred correctly without re-reading everything, while handling failures and changes during the migration window.
Why this is genuinely hard:
1. Scale of Validation: You cannot simply compare every byte - reading 5 PB for comparison takes as long as the transfer itself.
2. Concurrent Modifications: Source data changes during migration. How do you ensure the destination reflects the final state?
3. Failure Recovery: A transfer of millions of files will have failures. Tracking state for retry without re-transferring everything is complex.
4. Network Saturation: Maximizing throughput without impacting production traffic requires careful bandwidth management.
5. Cutover Coordination: The moment you switch from source to destination, all applications must see consistent data.
Common mistake
Candidates often focus only on transfer speed and ignore validation. An unvalidated migration is worthless - you cannot trust the data.
The fundamental constraint:
Transfer Time = Data Volume / Bandwidth
5 PB at 10 Gbps = 5,000,000 GB / 1.25 GB/s = 4,000,000 seconds = 46 days
This assumes 100% utilization (impossible in practice).
Realistic: 60-70% utilization = 66-77 days
With a 90-day window, network transfer is feasible but leaves little margin for error. Physical transfer devices provide insurance.
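A quick sanity check of these numbers in a few lines of Python (pure arithmetic, mirroring the figures above):

```python
# Sanity-check the transfer window: 5 PB over a 10 Gbps link.
DATA_BITS = 5 * 10**15 * 8     # 5 PB in bits
LINK_BPS = 10 * 10**9          # 10 Gbps

ideal_days = DATA_BITS / LINK_BPS / 86400
print(f"Ideal (100% utilization): {ideal_days:.0f} days")      # ~46 days

for utilization in (0.7, 0.6):
    print(f"At {utilization:.0%} utilization: {ideal_days / utilization:.0f} days")
    # ~66 days at 70%, ~77 days at 60% - uncomfortably close to the 90-day window
```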
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Total Data | 5 PB (5,000 TB) | Requires weeks of transfer time |
| File Count | 500 million files | Metadata operations become bottleneck |
What to say
The 500 million small files are actually harder than the 4 PB of large files. Small file transfers are metadata-bound, not bandwidth-bound. We need different strategies for each.
Access Pattern Analysis:
- Large files: Bandwidth-bound. Parallelize by chunking files into 100MB-1GB segments.
- Small files: Metadata-bound. Batch into archives (tar/zip) for transfer, expand at destination (a routing sketch follows this list).
- Active data: Needs continuous replication until cutover.
- Read patterns: Migration is write-heavy to destination, read-heavy from source.
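To make the split concrete, here is a minimal routing sketch; the 1 MB threshold comes from the stated file-size distribution, and the function name is illustrative:

```python
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # ~1 MB, matching the stated size distribution (assumed cutoff)

def plan_transfer(size_bytes: int) -> str:
    """Route one file to a transfer strategy (illustrative)."""
    if size_bytes < SMALL_FILE_THRESHOLD:
        return "batch"    # pack into a tar archive with other small files
    return "chunked"      # multipart upload in 100MB-1GB parts
```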
Network transfer time:
5 PB / 10 Gbps = 46 days (theoretical)
With 70% efficiency = 66 days
High-Level Architecture
Let me design a migration architecture that handles both bulk historical data and incremental active data.
What to say
I will design a three-phase migration: bulk historical transfer, incremental sync, and cutover. This separates the static data problem from the changing data problem.
Migration Architecture Overview
Component Responsibilities:
1. File Scanner: Inventories source files, creates manifest with checksums
2. Change Data Capture (CDC): Tracks file modifications during migration
3. Orchestrator: Manages migration workflow, tracks progress, handles retries
4. State Database: Tracks status of every file (pending, in-progress, transferred, validated)
5. Job Queue: Distributes transfer work across agents, handles priorities
6. Transfer Agents: Parallel workers that transfer files from source to staging
7. Snowball Devices: Physical devices for bulk historical data transfer
8. Staging Bucket: Temporary landing zone in cloud for validation
9. Validator: Verifies checksums, record counts, and data integrity
10. Destination Storage: Final cloud storage location
Real-world reference
AWS Database Migration Service (DMS) and Azure Data Box follow similar patterns. Google Transfer Appliance handles the physical device aspect.
Data Model and Storage
The migration state database is critical for tracking progress, enabling restart, and generating reports.
What to say
The migration state database is the source of truth for what has been transferred and validated. It enables restartability and provides an audit trail.
```sql
-- File inventory and transfer status
CREATE TABLE migration_files (
    file_id BIGSERIAL PRIMARY KEY,
    source_path TEXT NOT NULL,
    checksum_sha256 CHAR(64),
    status TEXT NOT NULL DEFAULT 'pending'  -- pending, in-progress, transferred, validated
);
```
Manifest File Format:
Generate a manifest during initial scan for validation:
```json
{
  "source_path": "/data/warehouse/sales/2023/01/transactions.parquet",
  "dest_path": "s3://migration-bucket/warehouse/sales/2023/01/transactions.parquet",
  "size_bytes": 1073741824,
  "checksum_sha256": "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1",
  "last_modified": "2024-01-15T10:30:00Z",
  "file_type": "parquet",
  "record_count": 50000000,
  "partition_key": "sales/2023/01"
}
```
Important detail
Checksums must be computed during initial scan and stored. Recomputing checksums for 5 PB takes days - you cannot do it during validation.
Algorithm Deep Dive
Let me walk through the key algorithms for each migration phase.
Phase 1: Inventory and Planning
```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor
```
Phase 2: Chunked Transfer for Large Files
```python
import boto3
from dataclasses import dataclass
from typing import List, Optional
```
Phase 3: Small File Batching
```python
import tarfile
import io
```
Why batching matters
Transferring 1 million 1KB files individually requires 1 million API calls. Batching into 1000 archives of 1000 files each reduces API calls by 1000x and improves throughput dramatically.
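A minimal sketch of that batching step, again assuming an S3 destination; archive sizing, naming, and the expansion job at the destination are left out:

```python
import io
import tarfile
from typing import Iterable

import boto3

def upload_small_file_batch(paths: Iterable[str], bucket: str, archive_key: str) -> int:
    """Pack a batch of small files into one tar archive and upload it as a single object."""
    buffer = io.BytesIO()
    count = 0
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path in paths:
            tar.add(path, arcname=path.lstrip("/"))  # preserve the relative layout inside the archive
            count += 1
    buffer.seek(0)
    boto3.client("s3").put_object(Bucket=bucket, Key=archive_key, Body=buffer.getvalue())
    return count
```

Expanding the archives at the destination (as described above) would be a separate cloud-side job, not shown here.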
Consistency and Invariants
System Invariants
Zero data loss is non-negotiable. Every byte in source must exist in destination with identical content. The migration is not complete until this is proven.
Validation Strategy:
We cannot re-read 5 PB for validation. Instead, use a multi-layer approach:
| Validation Type | What It Checks | Coverage | Cost |
|---|---|---|---|
| Checksum Match | Every file content identical | 100% of files | Pre-computed during scan |
| Size Match | File sizes match exactly | 100% of files | Metadata only - fast |
| Record Count | Row counts match for structured data | Tables/Parquet | Query both sides |
| Statistical Sample | Random sample byte comparison | 0.1% of data | Read sample blocks |
| Manifest Reconciliation | All files in manifest exist | 100% of files | List operations |
```python
class MigrationValidator:
    def __init__(self, source_client, dest_client, state_db):
        self.source = source_client
        self.dest = dest_client
        self.state_db = state_db
```
Cloud provider checksums
S3 provides ETags (MD5 for single-part uploads), GCS provides CRC32C and MD5, Azure Blob provides MD5. Use these instead of recomputing when possible.
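As a concrete example of the cheap layers in the table above (size match plus pre-computed checksum), here is a hedged sketch; parsing the bucket out of dest_path and storing the source SHA-256 as object metadata at upload time are conventions assumed by this sketch, not S3 features:

```python
import boto3

def validate_object(manifest_entry: dict) -> bool:
    """Cheap validation for one object: size match plus the pre-computed checksum."""
    s3 = boto3.client("s3")
    bucket, key = manifest_entry["dest_path"].removeprefix("s3://").split("/", 1)
    head = s3.head_object(Bucket=bucket, Key=key)

    if head["ContentLength"] != manifest_entry["size_bytes"]:
        return False  # size mismatch is caught from metadata alone, no data read

    # Assumed convention: the transfer agent stored the source SHA-256 as user metadata
    # at upload time, so validation never has to re-read the object body.
    stored = head.get("Metadata", {}).get("checksum-sha256")
    return stored == manifest_entry["checksum_sha256"]
```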
Failure Modes and Resilience
Proactively discuss failures
A 90-day migration will encounter failures. The system must handle them gracefully without losing progress.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Network interruption | Transfer stops mid-file | Multipart upload with checkpoints | Resume from last checkpoint |
| Source file deleted | Cannot complete transfer | Detect via CDC, mark as deleted | Skip file, log for review |
```python
import time
import random
from functools import wraps
```
Idempotency Design:
Every operation must be safe to retry:
```python
class TransferJob:
    def execute(self, job_id: str) -> JobResult:
        """Execute transfer job idempotently."""
```
Evolution and Scaling
What to say
This design handles 5 PB in 90 days. For larger migrations or tighter timelines, we would scale transfer agents, use more physical devices, or leverage cloud-native transfer services.
Cutover Strategy:
The final cutover is the most critical moment:
Cutover Timeline
```python
class CutoverOrchestrator:
    def execute_cutover(self) -> CutoverResult:
        """Execute the migration cutover."""
```
Scaling to Larger Migrations:
| Scale | Approach | Key Changes |
|---|---|---|
| 10 PB | More Snowball devices in parallel | Coordinate device logistics, expand staging |
| 50 PB | Snowmobile (shipping container) | Single shipment vs many devices |
| 100+ PB | Dedicated network links + Snowmobile | AWS Direct Connect, multi-month timeline |
| Continuous sync | Real-time CDC with Kafka/Kinesis | Event streaming instead of batch |
| Multi-cloud | Abstract destination, parallel transfers | Cloud-agnostic transfer layer |
Alternative approaches
For active databases, consider AWS DMS, Azure Database Migration Service, or GCP Database Migration Service. These handle CDC natively and support zero-downtime migrations for supported database engines.
What I would do differently for...
Database migration: Use native CDC tools (Debezium, DMS) rather than file-level sync. Validate at row level.
Data lake migration: Migrate hot data first via network, cold data via Snowball. Use partition-level validation.
Real-time requirements: Implement dual-write during migration period, validate consistency continuously.
Compliance-sensitive data: Add encryption validation, chain-of-custody logging, and audit trails for every file.