Design Walkthrough
Problem Statement
The Question: Design a system to migrate 5 petabytes of data from on-premises data centers to cloud storage within 3 months, with zero data loss.
Large-scale data migration is critical for:
- Cloud transformation: Moving enterprise workloads to cloud
- Data center consolidation: Shutting down physical infrastructure
- Platform modernization: Moving from legacy to modern data platforms
- Disaster recovery: Establishing cloud-based backup and recovery
What to say first
Before designing, I need to understand the data characteristics, network constraints, downtime tolerance, and validation requirements. These fundamentally shape the migration strategy.
Hidden requirements interviewers test:
- Can you estimate transfer times and identify bottlenecks?
- Do you understand validation and consistency challenges?
- Can you plan for failures and partial migrations?
- Do you know when physical transfer beats network transfer?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer dramatically changes the architecture.
Question 1: Data Volume and Distribution
How much data total? How is it distributed - many small files or few large files? What is the file size distribution?
- Why this matters: Small files (millions of 1KB files) have different bottlenecks than large files (thousands of 1TB files).
- Typical answer: 5 PB total, 80% in large files (100GB+), 20% in small files (under 1MB), 500 million files total.
- Architecture impact: Need different strategies for small vs large files.
Question 2: Network Capacity
What is the available bandwidth between source and cloud? Is it dedicated or shared with production traffic?
- Why this matters: Determines if online transfer is feasible.
- Typical answer: 10 Gbps dedicated, can burst to 40 Gbps.
- Architecture impact: At 10 Gbps, 5 PB takes 46 days minimum - tight for a 3-month window.
Question 3: Data Change Rate
Is the data static or actively changing during migration? What is the change rate?
- Why this matters: Static data can use bulk transfer; changing data needs CDC (Change Data Capture).
- Typical answer: 90% static (historical), 10% active with 100GB/day change rate.
- Architecture impact: Need incremental sync for active data, bulk transfer for historical data.
Question 4: Downtime Tolerance
Can we have downtime during cutover? How long? Is there a rollback requirement?
- Why this matters: Zero-downtime requires dual-write and careful cutover.
- Typical answer: 4-hour maintenance window acceptable, need ability to roll back for 1 week.
- Architecture impact: Can do big-bang cutover, but need validation before cutover and rollback capability.
Stating assumptions
I will assume: 5 PB total, 10 Gbps network, 90% static data, 4-hour cutover window, and zero tolerance for data loss. Given the timeline, we may need to supplement network transfer with physical devices.
The Hard Part
Say this out loud
The hard part is validating that petabytes of data transferred correctly without re-reading everything, while handling failures and changes during the migration window.
Why this is genuinely hard:
1. Scale of Validation: You cannot simply compare every byte - reading 5 PB for comparison takes as long as the transfer itself.
2. Concurrent Modifications: Source data changes during migration. How do you ensure the destination reflects the final state?
3. Failure Recovery: A transfer of millions of files will have failures. Tracking state for retry without re-transferring everything is complex.
4. Network Saturation: Maximizing throughput without impacting production traffic requires careful bandwidth management.
5. Cutover Coordination: The moment you switch from source to destination, all applications must see consistent data.
Common mistake
Candidates often focus only on transfer speed and ignore validation. An unvalidated migration is worthless - you cannot trust the data.
The fundamental constraint:
Transfer Time = Data Volume / Bandwidth
5 PB at 10 Gbps = 5,000,000 GB / 1.25 GB/s = 4,000,000 seconds = 46 days
This assumes 100% utilization (impossible in practice).
Realistic: 60-70% utilization = 66-77 days
With a 90-day window, network transfer is feasible but leaves little margin for error. Physical transfer devices provide insurance.
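A quick sanity check of these numbers in a few lines of Python (pure arithmetic, mirroring the figures above):

```python
# Sanity-check the transfer window: 5 PB over a 10 Gbps link.
DATA_BITS = 5 * 10**15 * 8     # 5 PB in bits
LINK_BPS = 10 * 10**9          # 10 Gbps

ideal_days = DATA_BITS / LINK_BPS / 86400
print(f"Ideal (100% utilization): {ideal_days:.0f} days")      # ~46 days

for utilization in (0.7, 0.6):
    print(f"At {utilization:.0%} utilization: {ideal_days / utilization:.0f} days")
    # ~66 days at 70%, ~77 days at 60% - uncomfortably close to the 90-day window
```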
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Total Data | 5 PB (5,000 TB) | Requires weeks of transfer time |
| File Count | 500 million files | Metadata operations become bottleneck |
What to say
The 500 million small files are actually harder than the 4 PB of large files. Small file transfers are metadata-bound, not bandwidth-bound. We need different strategies for each.
Access Pattern Analysis:
- Large files: Bandwidth-bound. Parallelize by chunking files into 100MB-1GB segments.
- Small files: Metadata-bound. Batch into archives (tar/zip) for transfer, expand at destination (a routing sketch follows this list).
- Active data: Needs continuous replication until cutover.
- Read patterns: Migration is write-heavy to destination, read-heavy from source.
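To make the split concrete, here is a minimal routing sketch; the 1 MB threshold comes from the stated file-size distribution, and the function name is illustrative:

```python
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # ~1 MB, matching the stated size distribution (assumed cutoff)

def plan_transfer(size_bytes: int) -> str:
    """Route one file to a transfer strategy (illustrative)."""
    if size_bytes < SMALL_FILE_THRESHOLD:
        return "batch"    # pack into a tar archive with other small files
    return "chunked"      # multipart upload in 100MB-1GB parts
```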
Network transfer time:
5 PB / 10 Gbps = 46 days (theoretical)
With 70% efficiency = 66 days
High-Level Architecture
Let me design a migration architecture that handles both bulk historical data and incremental active data.
What to say
I will design a three-phase migration: bulk historical transfer, incremental sync, and cutover. This separates the static data problem from the changing data problem.
Migration Architecture Overview
Component Responsibilities:
1. File Scanner: Inventories source files, creates manifest with checksums
2. Change Data Capture (CDC): Tracks file modifications during migration
3. Orchestrator: Manages migration workflow, tracks progress, handles retries
4. State Database: Tracks status of every file (pending, in-progress, transferred, validated)
5. Job Queue: Distributes transfer work across agents, handles priorities
6. Transfer Agents: Parallel workers that transfer files from source to staging
7. Snowball Devices: Physical devices for bulk historical data transfer
8. Staging Bucket: Temporary landing zone in cloud for validation
9. Validator: Verifies checksums, record counts, and data integrity
10. Destination Storage: Final cloud storage location
Real-world reference
AWS Database Migration Service (DMS) and Azure Data Box follow similar patterns. Google Transfer Appliance handles the physical device aspect.
Data Model and Storage
The migration state database is critical for tracking progress, enabling restart, and generating reports.
What to say
The migration state database is the source of truth for what has been transferred and validated. It enables restartability and provides an audit trail.
```sql
-- File inventory and transfer status
CREATE TABLE migration_files (
    file_id BIGSERIAL PRIMARY KEY,
    source_path TEXT NOT NULL,
    checksum_sha256 CHAR(64),
    status TEXT NOT NULL DEFAULT 'pending'  -- pending, in-progress, transferred, validated
);
```
Manifest File Format:
Generate a manifest during initial scan for validation:
```json
{
  "source_path": "/data/warehouse/sales/2023/01/transactions.parquet",
  "dest_path": "s3://migration-bucket/warehouse/sales/2023/01/transactions.parquet",
  "size_bytes": 1073741824,
  "checksum_sha256": "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1",
  "last_modified": "2024-01-15T10:30:00Z",
  "file_type": "parquet",
  "record_count": 50000000,
  "partition_key": "sales/2023/01"
}
```
Important detail
Checksums must be computed during initial scan and stored. Recomputing checksums for 5 PB takes days - you cannot do it during validation.
Algorithm Deep Dive
Let me walk through the key algorithms for each migration phase.
Phase 1: Inventory and Planning
```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor
```
Phase 2: Chunked Transfer for Large Files
```python
import boto3
from dataclasses import dataclass
from typing import List, Optional
```
Phase 3: Small File Batching
```python
import tarfile
import io
```
Why batching matters
Transferring 1 million 1KB files individually requires 1 million API calls. Batching into 1000 archives of 1000 files each reduces API calls by 1000x and improves throughput dramatically.
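A minimal sketch of that batching step, again assuming an S3 destination; archive sizing, naming, and the expansion job at the destination are left out:

```python
import io
import tarfile
from typing import Iterable

import boto3

def upload_small_file_batch(paths: Iterable[str], bucket: str, archive_key: str) -> int:
    """Pack a batch of small files into one tar archive and upload it as a single object."""
    buffer = io.BytesIO()
    count = 0
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path in paths:
            tar.add(path, arcname=path.lstrip("/"))  # preserve the relative layout inside the archive
            count += 1
    buffer.seek(0)
    boto3.client("s3").put_object(Bucket=bucket, Key=archive_key, Body=buffer.getvalue())
    return count
```

Expanding the archives at the destination (as described above) would be a separate cloud-side job, not shown here.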
Consistency and Invariants
System Invariants
Zero data loss is non-negotiable. Every byte in source must exist in destination with identical content. The migration is not complete until this is proven.
Validation Strategy:
We cannot re-read 5 PB for validation. Instead, use a multi-layer approach:
| Validation Type | What It Checks | Coverage | Cost |
|---|---|---|---|
| Checksum Match | Every file content identical | 100% of files | Pre-computed during scan |
| Size Match | File sizes match exactly | 100% of files | Metadata only - fast |
| Record Count | Row counts match for structured data | Tables/Parquet | Query both sides |
| Statistical Sample | Random sample byte comparison | 0.1% of data | Read sample blocks |
| Manifest Reconciliation | All files in manifest exist | 100% of files | List operations |
```python
class MigrationValidator:
    def __init__(self, source_client, dest_client, state_db):
        self.source = source_client
        self.dest = dest_client
        self.state_db = state_db
```
Cloud provider checksums
S3 provides ETags (MD5 for single-part uploads), GCS provides CRC32C and MD5, Azure Blob provides MD5. Use these instead of recomputing when possible.
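As a concrete example of the cheap layers in the table above (size match plus pre-computed checksum), here is a hedged sketch; parsing the bucket out of dest_path and storing the source SHA-256 as object metadata at upload time are conventions assumed by this sketch, not S3 features:

```python
import boto3

def validate_object(manifest_entry: dict) -> bool:
    """Cheap validation for one object: size match plus the pre-computed checksum."""
    s3 = boto3.client("s3")
    bucket, key = manifest_entry["dest_path"].removeprefix("s3://").split("/", 1)
    head = s3.head_object(Bucket=bucket, Key=key)

    if head["ContentLength"] != manifest_entry["size_bytes"]:
        return False  # size mismatch is caught from metadata alone, no data read

    # Assumed convention: the transfer agent stored the source SHA-256 as user metadata
    # at upload time, so validation never has to re-read the object body.
    stored = head.get("Metadata", {}).get("checksum-sha256")
    return stored == manifest_entry["checksum_sha256"]
```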
Failure Modes and Resilience
Proactively discuss failures
A 90-day migration will encounter failures. The system must handle them gracefully without losing progress.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Network interruption | Transfer stops mid-file | Multipart upload with checkpoints | Resume from last checkpoint |
| Source file deleted | Cannot complete transfer | Detect via CDC, mark as deleted | Skip file, log for review |
```python
import time
import random
from functools import wraps
```
Idempotency Design:
Every operation must be safe to retry:
```python
class TransferJob:
    def execute(self, job_id: str) -> JobResult:
        """Execute transfer job idempotently."""
```
Evolution and Scaling
What to say
This design handles 5 PB in 90 days. For larger migrations or tighter timelines, we would scale transfer agents, use more physical devices, or leverage cloud-native transfer services.
Cutover Strategy:
The final cutover is the most critical moment:
Cutover Timeline
```python
class CutoverOrchestrator:
    def execute_cutover(self) -> CutoverResult:
        """Execute the migration cutover."""
```
Scaling to Larger Migrations:
| Scale | Approach | Key Changes |
|---|---|---|
| 10 PB | More Snowball devices in parallel | Coordinate device logistics, expand staging |
| 50 PB | Snowmobile (shipping container) | Single shipment vs many devices |
| 100+ PB | Dedicated network links + Snowmobile | AWS Direct Connect, multi-month timeline |
| Continuous sync | Real-time CDC with Kafka/Kinesis | Event streaming instead of batch |
| Multi-cloud | Abstract destination, parallel transfers | Cloud-agnostic transfer layer |
Alternative approaches
For active databases, consider AWS DMS, Azure Database Migration Service, or GCP Database Migration Service. These handle CDC natively and support zero-downtime migrations for supported database engines.
What I would do differently for...
Database migration: Use native CDC tools (Debezium, DMS) rather than file-level sync. Validate at row level.
Data lake migration: Migrate hot data first via network, cold data via Snowball. Use partition-level validation.
Real-time requirements: Implement dual-write during migration period, validate consistency continuously.
Compliance-sensitive data: Add encryption validation, chain-of-custody logging, and audit trails for every file.