Design Walkthrough
Problem Statement
The Question: Design a file download and synchronization library that handles large file transfers, resumable uploads/downloads, and bidirectional sync between client and server.
This library is used by:
- Mobile apps: Downloading assets, uploading user content
- Desktop sync clients: Dropbox-style folder sync
- Backup applications: Large file transfers to cloud
- Media applications: Streaming downloads with offline support
What to say first
Before designing the library, let me clarify the requirements. I want to understand the file sizes involved, network conditions, consistency requirements, and what platforms this library targets.
Hidden requirements interviewers are testing: - Do you understand chunked transfer protocols? - Can you handle partial failures and state recovery? - How do you resolve conflicts in bidirectional sync? - Do you consider mobile constraints (battery, bandwidth)?
Clarifying Questions
These questions shape the library design significantly.
Question 1: File Size Range
What is the range of file sizes? KB documents, GB videos, or TB datasets?
Why this matters: Determines chunking strategy and memory management.
Typical answer: Support 1KB to 50GB files.
Architecture impact: Need a streaming/chunked approach; files cannot be loaded into memory.
Question 2: Network Conditions
What network conditions should we handle? Reliable datacenter or flaky mobile?
Why this matters: Determines retry strategy and chunk size.
Typical answer: Support mobile networks with frequent disconnections.
Architecture impact: Aggressive resumability, smaller chunks, offline queue.
Question 3: Sync Direction
Download only, upload only, or bidirectional sync?
Why this matters: Bidirectional sync requires conflict resolution.
Typical answer: Full bidirectional sync.
Architecture impact: Need vector clocks or similar for conflict detection.
Question 4: Consistency Model
Is eventual consistency acceptable? How should conflicts be resolved?
Why this matters: Determines sync algorithm complexity.
Typical answer: Eventual consistency with automatic conflict resolution.
Architecture impact: Last-writer-wins or merge strategies needed.
Stating assumptions
Based on this, I will assume: files up to 50GB, unreliable mobile networks, bidirectional sync, eventual consistency with last-writer-wins for conflicts, and need to support iOS/Android/Desktop.
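These assumptions can be captured up front as a configuration object that the rest of the design refers back to. This is a minimal sketch; the `SyncConfig` name, fields, and defaults are illustrative assumptions, not a real SDK API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyncConfig:
    """Illustrative defaults derived from the stated assumptions."""
    max_file_size: int = 50 * 1024**3    # 50 GB upper bound
    chunk_size: int = 8 * 1024 * 1024    # 8 MB chunks suit flaky networks
    bidirectional: bool = True           # full two-way sync
    conflict_strategy: str = "last_writer_wins"
    max_retries: int = 5                 # assumed retry budget
```

A frozen dataclass keeps the configuration immutable once a transfer starts, so mid-flight chunks are never sized inconsistently.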
The Hard Part
Say this out loud
The hard part here is handling partial failures gracefully - ensuring data integrity when transfers are interrupted at any point, and resolving conflicts when the same file is modified on multiple devices.
Why this is genuinely hard:
1. Partial Upload Problem: Client uploads 90% of a file, then crashes. Server has incomplete data. How do we resume without re-uploading everything?
2. Completion Ambiguity: Upload finishes, server commits, but the ACK never reaches the client. Client retries, potentially creating duplicates.
3. Conflict Detection: File modified on phone (offline) and laptop (online) simultaneously. Which version wins? Can we merge?
4. Integrity Verification: How do we know the file was not corrupted in transit? Checksums at what granularity?
5. State Recovery: App crashes mid-sync. On restart, what is the state of each file? We need durable progress tracking.
Common mistake
Candidates often design for the happy path only. Production file sync must handle: network timeout mid-chunk, app killed by OS, disk full, server returns 500 after partial upload, clock skew between devices.
The fundamental challenge:
File sync is a distributed systems problem disguised as a simple CRUD operation. Each device is a node that can modify data independently, and we need to reconcile state across all nodes.
Scale and Access Patterns
Let me analyze the scale and patterns this library must handle.
| Dimension | Value | Impact |
|---|---|---|
| Max file size | 50 GB | Must use chunked transfer, cannot buffer in memory |
| Min file size | 1 KB | Many small files need batching for efficiency |
What to say
The library must handle bimodal file sizes - many small files (documents) and occasional large files (videos). This requires different optimization strategies: batching for small files, chunking for large files.
Access Pattern Analysis:
- Download patterns: Usually sequential, can benefit from prefetching
- Upload patterns: Often bursty (user saves multiple files)
- Sync patterns: Mostly no-change (polling finds nothing new)
- Conflict rate: Low (less than 1% of syncs) but must be handled gracefully
Chunk size selection:
- Too small (64KB): 50GB file = ~780,000 chunks = massive per-chunk overhead
- Too large (100MB): one failure = 100MB retransmitted
- Middle ground (8MB, the default used later): 50GB file = roughly 6,400 chunks, with at most 8MB lost per failure
High-Level Architecture
The library has three main components: Transfer Engine, Sync Engine, and State Manager.
What to say
I will design this as a layered library with clear separation: low-level transfer primitives, sync logic built on top, and persistent state management throughout.
File Sync Library Architecture
Component Responsibilities:
1. Public API Layer
   - Simple interface: download(), upload(), sync()
   - Progress callbacks and cancellation
   - Configuration (chunk size, concurrency, retry policy)
2. Sync Engine
   - Calculates the diff between local and remote state
   - Detects and resolves conflicts
   - Manages the sync queue and priorities
3. Transfer Engine
   - Chunked upload/download with resume support
   - Retry logic with exponential backoff
   - Bandwidth throttling and estimation
4. State Manager
   - Persists transfer progress to survive crashes
   - Tracks file metadata and sync state
   - SQLite for durability, memory cache for speed
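The public API layer described above can be sketched as a thin facade. The `FileSyncClient` name, signatures, and defaults here are assumptions for illustration, not the API of any real SDK.

```python
from typing import Callable, Optional

class FileSyncClient:
    """Hypothetical facade over the sync, transfer, and state components."""

    def __init__(self, chunk_size: int = 8 * 1024 * 1024,
                 max_concurrency: int = 4):
        self.chunk_size = chunk_size          # transfer chunk size
        self.max_concurrency = max_concurrency
        self._cancelled = False

    def download(self, file_id: str, dest: str,
                 on_progress: Optional[Callable[[int, int], None]] = None) -> None:
        ...  # delegates to the Transfer Engine; reports (bytes_done, total)

    def cancel(self) -> None:
        # Cancellation is a flag the transfer loop checks between chunks.
        self._cancelled = True
```

Keeping progress and cancellation at this layer means the lower-level engines stay free of UI concerns.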
Real-world reference
Dropbox SDK uses a similar architecture. AWS S3 Transfer Manager has TransferState persistence. Google Drive API client maintains a local database of sync state.
Data Model and Storage
The library needs persistent storage to survive crashes and track sync state.
What to say
I will use SQLite for persistent state because it is embedded, ACID-compliant, and available on all platforms. The schema tracks files, chunks, and sync metadata.
```sql
-- Track all known files (local and remote)
CREATE TABLE files (
    id TEXT PRIMARY KEY,        -- unique file identifier
    local_checksum TEXT,        -- checksum of the local copy
    remote_checksum TEXT,       -- last known server-side checksum
    version INTEGER DEFAULT 0   -- incremented on each change
);
```
Key Design Decisions:
1. File ID Strategy: Use a content-based ID (hash of path + account) for deduplication
2. Checksum Storage: Store both local and remote checksums to detect changes without re-downloading
3. Chunk State: Persist per-chunk progress to resume from the exact point of failure
4. Version Tracking: Simple integer version for conflict detection (can upgrade to vector clocks)
```python
import sqlite3

class StateManager:
    def __init__(self, db_path: str):
        # Durable store for file, chunk, and sync state; survives crashes.
        self.conn = sqlite3.connect(db_path)
```
Algorithm Deep Dive: Chunked Transfer
Chunked transfer is the foundation of reliable large file handling.
Chunked Upload Flow
```python
class ChunkedUploader:
    def __init__(self, chunk_size: int = 8 * 1024 * 1024):  # 8MB default
        self.chunk_size = chunk_size
```
Resumable Download Implementation:
```python
from typing import Callable, Optional

class ResumableDownloader:
    async def download(self, file_id: str, dest_path: str,
                       progress_callback: Optional[Callable] = None) -> bool:
        ...
```
Key technique: Atomic rename
Always download to a temporary file, then atomic rename on success. This ensures users never see partial files in the destination.
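The temp-file-plus-atomic-rename technique can be sketched as follows. This is a minimal illustration (helper name and error handling are assumed away); the key points are creating the temp file in the destination's directory so the rename stays on one filesystem, and using `os.replace`, which is atomic on both POSIX and Windows.

```python
import hashlib
import os
import tempfile

def finalize_download(dest_path: str, chunks) -> str:
    """Write an iterable of byte chunks to a temp file, then atomically
    rename it into place so readers never observe a partial file.
    Returns the SHA-256 hex digest for end-to-end verification."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file must live on the same filesystem as dest for an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    checksum = hashlib.sha256()
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            checksum.update(chunk)
    os.replace(tmp_path, dest_path)  # atomic: file appears fully-formed or not at all
    return checksum.hexdigest()
```

Computing the checksum while streaming avoids a second pass over what may be a multi-gigabyte file.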
Algorithm Deep Dive: Sync and Conflict Resolution
Bidirectional sync requires careful conflict detection and resolution.
Sync State Machine
```python
class SyncEngine:
    def __init__(self, db_path: str):
        # Sync decisions are driven by durable state, not in-memory guesses.
        self.state = StateManager(db_path)
```
Conflict Resolution Strategies:
| Strategy | How It Works | Best For |
|---|---|---|
| Last-writer-wins | Most recent modification timestamp wins | Simple files, photos |
| Keep both | Rename conflicted file with device suffix | Documents, important files |
| Server wins | Remote version always takes precedence | Managed enterprise data |
| Merge | Combine changes (requires format knowledge) | Text files, JSON/XML |
| User prompt | Ask user to choose which version | Critical data, explicit control |
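As a sketch, last-writer-wins reduces to a timestamp comparison, with one wrinkle from the failure-mode discussion: clock skew between devices. When timestamps are within a tolerance window (the 2-second value here is an assumption to tune per deployment), it is safer to declare a conflict and fall back to keep-both.

```python
from typing import NamedTuple

class Version(NamedTuple):
    mtime: float      # modification time, seconds since epoch
    device_id: str

CLOCK_SKEW_TOLERANCE = 2.0  # seconds; assumed, tune per deployment

def last_writer_wins(local: Version, remote: Version) -> str:
    """Return 'local', 'remote', or 'conflict' when the timestamps are
    too close to trust given cross-device clock skew."""
    delta = local.mtime - remote.mtime
    if abs(delta) <= CLOCK_SKEW_TOLERANCE:
        return "conflict"  # escalate to keep-both or a user prompt
    return "local" if delta > 0 else "remote"
```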
```python
class ConflictResolver:
    def __init__(self, strategy: str = 'keep_both'):
        # Strategy names map to the table above: 'last_writer_wins',
        # 'keep_both', 'server_wins', 'merge', 'user_prompt'.
        self.strategy = strategy
```
Consistency and Invariants
System Invariants
1. Never corrupt or lose user data - partial transfers must be recoverable or fail completely.
2. Checksums must verify at every stage - chunk, file, and end-to-end.
3. Sync state must be durable - a crash at any point must be recoverable.
Data Integrity Guarantees:
The library maintains integrity through multiple layers:
```python
class IntegrityVerifier:
    def verify_upload(self, file_path: str,
                      server_response: UploadResponse) -> bool:
        # Compare the local file's checksum against the one the server
        # computed; a mismatch means retransmit, never silent acceptance.
        ...
```
Business impact mapping
If a file is corrupted during sync, users lose work and trust. If sync is slow due to excessive verification, users disable it. We verify at boundaries (upload complete, download complete) not continuously.
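Boundary verification can use a streaming hash so even a 50GB file is never held in memory. A minimal sketch (the 1MB read size is an assumed tuning value):

```python
import hashlib

def file_sha256(path: str, buf_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 in fixed-size reads, so memory use
    stays constant regardless of file size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(buf_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest()
```

Running this once at upload-complete and once at download-complete gives the end-to-end check without continuous overhead.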
Idempotency Design:
Every operation must be safe to retry:
```python
async def upload_chunk_idempotent(self, upload_id: str,
                                  chunk_idx: int,
                                  data: bytes) -> bool:
    # (upload_id, chunk_idx) acts as the idempotency key: re-sending an
    # already-committed chunk must be a no-op on the server.
    ...
```
Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. File sync libraries face more failure scenarios than most systems because they bridge unreliable networks, local filesystems, and remote storage.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Network timeout mid-chunk | Chunk partially uploaded | Server discards partial, client retries | Chunks are atomic units |
| App killed by OS | Transfer state lost | Persist progress to SQLite after each chunk | State survives process death |
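The "persist progress to SQLite" mitigation implies a recovery query on startup: completed chunks are skipped, everything else is re-queued. A sketch, assuming a hypothetical `chunks(file_id, chunk_idx, status)` table (the design above only states that per-chunk progress is persisted; this schema is illustrative):

```python
import sqlite3

def pending_chunks(conn: sqlite3.Connection, file_id: str) -> list:
    """Return indexes of chunks not yet confirmed complete, so a
    restarted transfer resumes from the exact point of failure."""
    rows = conn.execute(
        "SELECT chunk_idx FROM chunks WHERE file_id = ? AND status != 'done' "
        "ORDER BY chunk_idx",
        (file_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

Because chunk status is only marked 'done' after the server acknowledges the chunk, a crash between upload and acknowledgment just causes one idempotent re-upload.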
Retry Strategy:
```python
class RetryHandler:
    def __init__(self,
                 max_attempts: int = 5,
                 base_delay: float = 1.0):  # base for exponential backoff (assumed default)
        self.max_attempts = max_attempts
        self.base_delay = base_delay
```
Bandwidth Management:
```python
from typing import Optional

class BandwidthManager:
    def __init__(self, max_bandwidth: Optional[int] = None):
        self.max_bandwidth = max_bandwidth  # bytes/sec; None = unlimited
```
Evolution and Scaling
What to say
This design works well for single-user scenarios with reasonable file counts. Let me discuss how it evolves for enterprise scale and advanced features.
Evolution Path:
Stage 1: Basic Library (Current Design)
- Single user, single device
- Sequential sync
- Basic conflict resolution

Stage 2: Multi-Device Sync
- Add device ID to version vectors
- Real-time push notifications for changes
- Smarter conflict resolution with merge

Stage 3: Enterprise Features
- Selective sync (choose folders)
- Deduplication across organization
- Admin controls and audit logging
Advanced Optimizations:
| Optimization | How It Works | Benefit |
|---|---|---|
| Delta sync | Upload only changed bytes using rsync algorithm | 90%+ bandwidth savings for edited files |
| Block-level dedup | Hash file blocks, upload only unique blocks | Storage savings across all users |
```python
class DeltaSyncEngine:
    """Upload only changed portions of files using an rsync-like algorithm."""
```
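The core idea can be sketched in a simplified fixed-block form: hash equal-size blocks on both sides and upload only blocks whose hashes differ. Note this is not the full rsync algorithm, which adds a rolling checksum to handle insertions that shift block boundaries; the 4MB block size is also an assumed value.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size

def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indexes of blocks in `new` that differ from `old` and so must
    be uploaded; blocks past the end of `old` always count as changed."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]
```

For a 1GB file with a one-block edit, this uploads one block instead of the whole file, which is where the quoted 90%+ bandwidth savings come from.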
Alternative approach
If bandwidth is unlimited but latency matters (enterprise LAN), I would use a simpler design with larger chunks and more parallelism. If storage cost matters, I would add content-addressable storage with deduplication.
What I would do differently for...
Mobile-first app: Smaller chunks (1-2MB), aggressive battery optimization, pause on cellular
Backup application: One-way sync only, immutable uploads, versioning instead of overwrite
Real-time collaboration: WebSocket for change notifications, operational transform for conflicts, sub-file sync
Media streaming: Range request support, progressive download, adaptive bitrate