Design Walkthrough
Problem Statement
The Question: Design a file download and synchronization library that handles large file transfers, resumable uploads/downloads, and bidirectional sync between client and server.
This library is used by:
- Mobile apps: Downloading assets, uploading user content
- Desktop sync clients: Dropbox-style folder sync
- Backup applications: Large file transfers to cloud
- Media applications: Streaming downloads with offline support
What to say first
Before designing the library, let me clarify the requirements. I want to understand the file sizes involved, network conditions, consistency requirements, and what platforms this library targets.
Hidden requirements interviewers are testing: - Do you understand chunked transfer protocols? - Can you handle partial failures and state recovery? - How do you resolve conflicts in bidirectional sync? - Do you consider mobile constraints (battery, bandwidth)?
Clarifying Questions
These questions shape the library design significantly.
Question 1: File Size Range
What is the range of file sizes? KB documents, GB videos, or TB datasets?
Why this matters: Determines chunking strategy and memory management.
Typical answer: Support 1KB to 50GB files.
Architecture impact: Need a streaming/chunked approach; files cannot be loaded into memory.
Question 2: Network Conditions
What network conditions should we handle? Reliable datacenter or flaky mobile?
Why this matters: Determines retry strategy and chunk size.
Typical answer: Support mobile networks with frequent disconnections.
Architecture impact: Aggressive resumability, smaller chunks, offline queue.
Question 3: Sync Direction
Download only, upload only, or bidirectional sync?
Why this matters: Bidirectional sync requires conflict resolution.
Typical answer: Full bidirectional sync.
Architecture impact: Need vector clocks or similar for conflict detection.
Question 4: Consistency Model
Is eventual consistency acceptable? How should conflicts be resolved?
Why this matters: Determines sync algorithm complexity.
Typical answer: Eventual consistency with automatic conflict resolution.
Architecture impact: Last-writer-wins or merge strategies needed.
Stating assumptions
Based on this, I will assume: files up to 50GB, unreliable mobile networks, bidirectional sync, eventual consistency with last-writer-wins for conflicts, and need to support iOS/Android/Desktop.
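These assumptions can be captured up front as a configuration object that the rest of the design refers back to. This is a minimal sketch; the `SyncConfig` name, fields, and defaults are illustrative assumptions, not a real SDK API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyncConfig:
    """Illustrative defaults derived from the stated assumptions."""
    max_file_size: int = 50 * 1024**3    # 50 GB upper bound
    chunk_size: int = 8 * 1024 * 1024    # 8 MB chunks suit flaky networks
    bidirectional: bool = True           # full two-way sync
    conflict_strategy: str = "last_writer_wins"
    max_retries: int = 5                 # assumed retry budget
```

A frozen dataclass keeps the configuration immutable once a transfer starts, so mid-flight chunks are never sized inconsistently.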
The Hard Part
Say this out loud
The hard part here is handling partial failures gracefully - ensuring data integrity when transfers are interrupted at any point, and resolving conflicts when the same file is modified on multiple devices.
Why this is genuinely hard:
1. Partial Upload Problem: Client uploads 90% of a file, then crashes. Server has incomplete data. How do we resume without re-uploading everything?
2. Completion Ambiguity: Upload finishes, server commits, but the ACK never reaches the client. Client retries, potentially creating duplicates.
3. Conflict Detection: File modified on phone (offline) and laptop (online) simultaneously. Which version wins? Can we merge?
4. Integrity Verification: How do we know the file was not corrupted in transit? Checksums at what granularity?
5. State Recovery: App crashes mid-sync. On restart, what is the state of each file? We need durable progress tracking.
Common mistake
Candidates often design for the happy path only. Production file sync must handle: network timeout mid-chunk, app killed by OS, disk full, server returns 500 after partial upload, clock skew between devices.
The fundamental challenge:
File sync is a distributed systems problem disguised as a simple CRUD operation. Each device is a node that can modify data independently, and we need to reconcile state across all nodes.
Scale and Access Patterns
Let me analyze the scale and patterns this library must handle.
| Dimension | Value | Impact |
|---|---|---|
| Max file size | 50 GB | Must use chunked transfer, cannot buffer in memory |
| Min file size | 1 KB | Many small files need batching for efficiency |
What to say
The library must handle bimodal file sizes - many small files (documents) and occasional large files (videos). This requires different optimization strategies: batching for small files, chunking for large files.
Access Pattern Analysis:
- Download patterns: Usually sequential, can benefit from prefetching
- Upload patterns: Often bursty (user saves multiple files)
- Sync patterns: Mostly no-change (polling finds nothing new)
- Conflict rate: Low (less than 1% of syncs) but must be handled gracefully
Chunk size selection:
- Too small (64KB): 50GB file = ~780,000 chunks = massive per-chunk overhead
- Too large (100MB): one failure = 100MB retransmitted
- Middle ground (8MB, the default used later): 50GB file = roughly 6,400 chunks, with at most 8MB lost per failure
High-Level Architecture
The library has three main components: Transfer Engine, Sync Engine, and State Manager.
What to say
I will design this as a layered library with clear separation: low-level transfer primitives, sync logic built on top, and persistent state management throughout.
File Sync Library Architecture
Component Responsibilities:
1. Public API Layer
   - Simple interface: download(), upload(), sync()
   - Progress callbacks and cancellation
   - Configuration (chunk size, concurrency, retry policy)
2. Sync Engine
   - Calculates the diff between local and remote state
   - Detects and resolves conflicts
   - Manages the sync queue and priorities
3. Transfer Engine
   - Chunked upload/download with resume support
   - Retry logic with exponential backoff
   - Bandwidth throttling and estimation
4. State Manager
   - Persists transfer progress to survive crashes
   - Tracks file metadata and sync state
   - SQLite for durability, memory cache for speed
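The public API layer described above can be sketched as a thin facade. The `FileSyncClient` name, signatures, and defaults here are assumptions for illustration, not the API of any real SDK.

```python
from typing import Callable, Optional

class FileSyncClient:
    """Hypothetical facade over the sync, transfer, and state components."""

    def __init__(self, chunk_size: int = 8 * 1024 * 1024,
                 max_concurrency: int = 4):
        self.chunk_size = chunk_size          # transfer chunk size
        self.max_concurrency = max_concurrency
        self._cancelled = False

    def download(self, file_id: str, dest: str,
                 on_progress: Optional[Callable[[int, int], None]] = None) -> None:
        ...  # delegates to the Transfer Engine; reports (bytes_done, total)

    def cancel(self) -> None:
        # Cancellation is a flag the transfer loop checks between chunks.
        self._cancelled = True
```

Keeping progress and cancellation at this layer means the lower-level engines stay free of UI concerns.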
Real-world reference
Dropbox SDK uses a similar architecture. AWS S3 Transfer Manager has TransferState persistence. Google Drive API client maintains a local database of sync state.
Data Model and Storage
The library needs persistent storage to survive crashes and track sync state.
What to say
I will use SQLite for persistent state because it is embedded, ACID-compliant, and available on all platforms. The schema tracks files, chunks, and sync metadata.
```sql
-- Track all known files (local and remote)
CREATE TABLE files (
    id TEXT PRIMARY KEY,        -- unique file identifier
    local_checksum TEXT,        -- checksum of the local copy
    remote_checksum TEXT,       -- last known server-side checksum
    version INTEGER DEFAULT 0   -- incremented on each change
);
```
Key Design Decisions:
1. File ID Strategy: Use a content-based ID (hash of path + account) for deduplication
2. Checksum Storage: Store both local and remote checksums to detect changes without re-downloading
3. Chunk State: Persist per-chunk progress to resume from the exact point of failure
4. Version Tracking: Simple integer version for conflict detection (can upgrade to vector clocks)
```python
import sqlite3

class StateManager:
    def __init__(self, db_path: str):
        # Durable store for file, chunk, and sync state; survives crashes.
        self.conn = sqlite3.connect(db_path)
```
Algorithm Deep Dive: Chunked Transfer
Chunked transfer is the foundation of reliable large file handling.
Chunked Upload Flow
```python
class ChunkedUploader:
    def __init__(self, chunk_size: int = 8 * 1024 * 1024):  # 8MB default
        self.chunk_size = chunk_size
```
Resumable Download Implementation:
```python
from typing import Callable, Optional

class ResumableDownloader:
    async def download(self, file_id: str, dest_path: str,
                       progress_callback: Optional[Callable] = None) -> bool:
        ...
```
Key technique: Atomic rename
Always download to a temporary file, then atomic rename on success. This ensures users never see partial files in the destination.
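The temp-file-plus-atomic-rename technique can be sketched as follows. This is a minimal illustration (helper name and error handling are assumed away); the key points are creating the temp file in the destination's directory so the rename stays on one filesystem, and using `os.replace`, which is atomic on both POSIX and Windows.

```python
import hashlib
import os
import tempfile

def finalize_download(dest_path: str, chunks) -> str:
    """Write an iterable of byte chunks to a temp file, then atomically
    rename it into place so readers never observe a partial file.
    Returns the SHA-256 hex digest for end-to-end verification."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file must live on the same filesystem as dest for an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    checksum = hashlib.sha256()
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            checksum.update(chunk)
    os.replace(tmp_path, dest_path)  # atomic: file appears fully-formed or not at all
    return checksum.hexdigest()
```

Computing the checksum while streaming avoids a second pass over what may be a multi-gigabyte file.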
Algorithm Deep Dive: Sync and Conflict Resolution
Bidirectional sync requires careful conflict detection and resolution.
Sync State Machine
```python
class SyncEngine:
    def __init__(self, db_path: str):
        # Sync decisions are driven by durable state, not in-memory guesses.
        self.state = StateManager(db_path)
```
Conflict Resolution Strategies:
| Strategy | How It Works | Best For |
|---|---|---|
| Last-writer-wins | Most recent modification timestamp wins | Simple files, photos |
| Keep both | Rename conflicted file with device suffix | Documents, important files |
| Server wins | Remote version always takes precedence | Managed enterprise data |
| Merge | Combine changes (requires format knowledge) | Text files, JSON/XML |
| User prompt | Ask user to choose which version | Critical data, explicit control |
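As a sketch, last-writer-wins reduces to a timestamp comparison, with one wrinkle from the failure-mode discussion: clock skew between devices. When timestamps are within a tolerance window (the 2-second value here is an assumption to tune per deployment), it is safer to declare a conflict and fall back to keep-both.

```python
from typing import NamedTuple

class Version(NamedTuple):
    mtime: float      # modification time, seconds since epoch
    device_id: str

CLOCK_SKEW_TOLERANCE = 2.0  # seconds; assumed, tune per deployment

def last_writer_wins(local: Version, remote: Version) -> str:
    """Return 'local', 'remote', or 'conflict' when the timestamps are
    too close to trust given cross-device clock skew."""
    delta = local.mtime - remote.mtime
    if abs(delta) <= CLOCK_SKEW_TOLERANCE:
        return "conflict"  # escalate to keep-both or a user prompt
    return "local" if delta > 0 else "remote"
```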
```python
class ConflictResolver:
    def __init__(self, strategy: str = 'keep_both'):
        # Strategy names map to the table above: 'last_writer_wins',
        # 'keep_both', 'server_wins', 'merge', 'user_prompt'.
        self.strategy = strategy
```
Consistency and Invariants
System Invariants
1. Never corrupt or lose user data - partial transfers must be recoverable or fail completely.
2. Checksums must verify at every stage - chunk, file, and end-to-end.
3. Sync state must be durable - a crash at any point must be recoverable.
Data Integrity Guarantees:
The library maintains integrity through multiple layers:
```python
class IntegrityVerifier:
    def verify_upload(self, file_path: str,
                      server_response: UploadResponse) -> bool:
        # Compare the local file's checksum against the one the server
        # computed; a mismatch means retransmit, never silent acceptance.
        ...
```
Business impact mapping
If a file is corrupted during sync, users lose work and trust. If sync is slow due to excessive verification, users disable it. We verify at boundaries (upload complete, download complete) not continuously.
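Boundary verification can use a streaming hash so even a 50GB file is never held in memory. A minimal sketch (the 1MB read size is an assumed tuning value):

```python
import hashlib

def file_sha256(path: str, buf_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 in fixed-size reads, so memory use
    stays constant regardless of file size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(buf_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest()
```

Running this once at upload-complete and once at download-complete gives the end-to-end check without continuous overhead.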
Idempotency Design:
Every operation must be safe to retry:
```python
async def upload_chunk_idempotent(self, upload_id: str,
                                  chunk_idx: int,
                                  data: bytes) -> bool:
    # (upload_id, chunk_idx) acts as the idempotency key: re-sending an
    # already-committed chunk must be a no-op on the server.
    ...
```
Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. File sync libraries face more failure scenarios than most systems because they bridge unreliable networks, local filesystems, and remote storage.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Network timeout mid-chunk | Chunk partially uploaded | Server discards partial, client retries | Chunks are atomic units |
| App killed by OS | Transfer state lost | Persist progress to SQLite after each chunk | State survives process death |
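The "persist progress to SQLite" mitigation implies a recovery query on startup: completed chunks are skipped, everything else is re-queued. A sketch, assuming a hypothetical `chunks(file_id, chunk_idx, status)` table (the design above only states that per-chunk progress is persisted; this schema is illustrative):

```python
import sqlite3

def pending_chunks(conn: sqlite3.Connection, file_id: str) -> list:
    """Return indexes of chunks not yet confirmed complete, so a
    restarted transfer resumes from the exact point of failure."""
    rows = conn.execute(
        "SELECT chunk_idx FROM chunks WHERE file_id = ? AND status != 'done' "
        "ORDER BY chunk_idx",
        (file_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

Because chunk status is only marked 'done' after the server acknowledges the chunk, a crash between upload and acknowledgment just causes one idempotent re-upload.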
Retry Strategy:
```python
class RetryHandler:
    def __init__(self,
                 max_attempts: int = 5,
                 base_delay: float = 1.0):  # base for exponential backoff (assumed default)
        self.max_attempts = max_attempts
        self.base_delay = base_delay
```
Bandwidth Management:
```python
from typing import Optional

class BandwidthManager:
    def __init__(self, max_bandwidth: Optional[int] = None):
        self.max_bandwidth = max_bandwidth  # bytes/sec; None = unlimited
```
Evolution and Scaling
What to say
This design works well for single-user scenarios with reasonable file counts. Let me discuss how it evolves for enterprise scale and advanced features.
Evolution Path:
Stage 1: Basic Library (Current Design)
- Single user, single device
- Sequential sync
- Basic conflict resolution

Stage 2: Multi-Device Sync
- Add device ID to version vectors
- Real-time push notifications for changes
- Smarter conflict resolution with merge

Stage 3: Enterprise Features
- Selective sync (choose folders)
- Deduplication across organization
- Admin controls and audit logging
Advanced Optimizations:
| Optimization | How It Works | Benefit |
|---|---|---|
| Delta sync | Upload only changed bytes using rsync algorithm | 90%+ bandwidth savings for edited files |
| Block-level dedup | Hash file blocks, upload only unique blocks | Storage savings across all users |
```python
class DeltaSyncEngine:
    """Upload only changed portions of files using an rsync-like algorithm."""
```
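The core idea can be sketched in a simplified fixed-block form: hash equal-size blocks on both sides and upload only blocks whose hashes differ. Note this is not the full rsync algorithm, which adds a rolling checksum to handle insertions that shift block boundaries; the 4MB block size is also an assumed value.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size

def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indexes of blocks in `new` that differ from `old` and so must
    be uploaded; blocks past the end of `old` always count as changed."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]
```

For a 1GB file with a one-block edit, this uploads one block instead of the whole file, which is where the quoted 90%+ bandwidth savings come from.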
Alternative approach
If bandwidth is unlimited but latency matters (enterprise LAN), I would use a simpler design with larger chunks and more parallelism. If storage cost matters, I would add content-addressable storage with deduplication.
What I would do differently for...
Mobile-first app: Smaller chunks (1-2MB), aggressive battery optimization, pause on cellular
Backup application: One-way sync only, immutable uploads, versioning instead of overwrite
Real-time collaboration: WebSocket for change notifications, operational transform for conflicts, sub-file sync
Media streaming: Range request support, progressive download, adaptive bitrate