Tags: distributed-systems, storage, file-system, google, hdfs, replication, advanced

The Google File System

How Google reimagined distributed storage by embracing component failures and optimizing for large sequential workloads

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung | Google | 2003 | 35 min read

Summary

GFS is a scalable distributed file system for large data-intensive applications. It runs on thousands of commodity machines, providing fault tolerance through replication while delivering high aggregate throughput. GFS challenged conventional file system assumptions: it embraced component failures as normal, optimized for large streaming reads and appends rather than random access, and relaxed consistency guarantees to achieve better performance. GFS became the storage foundation for MapReduce and Bigtable, enabling Google's web-scale data processing.

Key Takeaways

Failure is the Norm

With thousands of commodity machines, hardware failures happen constantly—disks fail, machines crash, networks partition. GFS treats failure as expected behavior, not exceptional. Automatic recovery and replication are built into the core design.
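
The paper describes the master noticing failed chunkservers through missed heartbeats and re-replicating their chunks automatically. The sketch below is a minimal illustration of that idea; the function name, timeout, and data structures are assumptions for this example, not the actual GFS implementation.

```python
import time

HEARTBEAT_TIMEOUT = 60   # seconds without a heartbeat before a chunkserver is presumed dead
REPLICATION_TARGET = 3   # desired number of replicas per chunk

def find_rereplication_work(chunk_replicas, last_heartbeat, now=None):
    """Return chunks that have fallen below the replication target, most at-risk first.

    chunk_replicas: dict of chunk_id -> set of chunkserver ids holding a replica
    last_heartbeat: dict of chunkserver id -> timestamp of its last heartbeat
    """
    now = time.time() if now is None else now
    dead = {cs for cs, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT}
    work = []
    for chunk_id, replicas in chunk_replicas.items():
        live = replicas - dead
        if len(live) < REPLICATION_TARGET:
            work.append((chunk_id, live))
    # Chunks with the fewest surviving replicas are closest to data loss, so fix them first.
    work.sort(key=lambda item: len(item[1]))
    return work
```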

Large Files, Sequential Access

GFS optimizes for multi-GB files accessed sequentially, not millions of small files with random access. This matches Google's workloads: web crawls, logs, MapReduce intermediates. The chunk size is 64 MB (versus a typical 4 KB file system block), reducing metadata overhead dramatically.
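
A rough back-of-the-envelope calculation (not a figure from the paper) shows why the large chunk size matters when the master keeps all metadata in memory; `chunk_index` is an illustrative helper, not a GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB GFS chunk
BLOCK_SIZE = 4 * 1024           # 4 KB block in a conventional file system

def chunk_index(byte_offset: int) -> int:
    """Which chunk of a file a given byte offset falls into."""
    return byte_offset // CHUNK_SIZE

file_size = 1024 ** 4                  # a 1 TB file
print(file_size // CHUNK_SIZE)         # 16,384 chunk entries for the master to track
print(file_size // BLOCK_SIZE)         # ~268 million entries under a 4 KB block scheme
print(chunk_index(200 * 1024 ** 2))    # byte offset 200 MB falls in chunk 3
```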

Single Master Architecture

One master manages all metadata—namespace, file-to-chunk mapping, chunk locations. This simplifies design and enables sophisticated decisions (placement, rebalancing, garbage collection). The master is not a bottleneck because clients cache metadata and access data directly from chunkservers.
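
A minimal sketch of that read path, assuming a hypothetical `master.lookup` RPC and a chunkserver `read_chunk` call (these names are illustrative, not the real GFS interfaces):

```python
CHUNK_SIZE = 64 * 1024 * 1024

class ClientSketch:
    """Illustrative GFS-style client: metadata from the master, data from chunkservers."""

    def __init__(self, master):
        self.master = master
        self.location_cache = {}   # (path, chunk index) -> (chunk handle, [chunkserver stubs])

    def read(self, path, offset, length):
        index = offset // CHUNK_SIZE
        key = (path, index)
        if key not in self.location_cache:
            # One small metadata round trip; the answer is cached, so repeated
            # reads of this chunk never contact the master again.
            self.location_cache[key] = self.master.lookup(path, index)
        handle, servers = self.location_cache[key]
        # Bulk data is fetched directly from a replica, bypassing the master entirely.
        return servers[0].read_chunk(handle, offset % CHUNK_SIZE, length)
```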

Relaxed Consistency Model

GFS provides weaker consistency than POSIX in exchange for performance. After concurrent successful writes, a region is 'consistent' (all replicas hold the same bytes) but may not be 'defined' (it may not reflect any single write in its entirety). Applications cope by relying on append-at-least-once semantics and handling duplicates.
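
The paper notes that readers cope with padding and partial fragments by validating records with checksums. The following is a hedged sketch of such record framing; the marker value and layout are invented for illustration.

```python
import struct
import zlib

MAGIC = 0x5245434F   # arbitrary marker used here to recognize record boundaries

def frame_record(payload: bytes) -> bytes:
    """Prefix a record with a marker, its length, and a checksum so readers can
    tell real records apart from padding or torn writes."""
    return struct.pack(">III", MAGIC, len(payload), zlib.crc32(payload)) + payload

def parse_record(buf: bytes, pos: int):
    """Return (payload, next position) if a valid record starts at pos, else None."""
    header = buf[pos:pos + 12]
    if len(header) < 12:
        return None
    magic, length, crc = struct.unpack(">III", header)
    payload = buf[pos + 12:pos + 12 + length]
    if magic != MAGIC or len(payload) < length or zlib.crc32(payload) != crc:
        return None   # padding, garbage, or an undefined region: the reader skips ahead
    return payload, pos + 12 + length
```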

Record Append as Primary Operation

Most writes are appends, not overwrites. GFS provides an atomic record append: the client supplies the data, and GFS chooses the offset and guarantees the record is appended atomically at least once. This enables concurrent appends from multiple clients without locking, which is ideal for logs and queues.
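
Because record append is at-least-once, a retried append can leave duplicate copies of a record in the file; the paper suggests readers filter them using unique identifiers carried inside the records. A minimal sketch, where `append_rpc` is a stand-in for the actual client library call:

```python
import uuid

def make_record(payload: bytes) -> dict:
    """Writers embed a unique id so a record duplicated by a retry can be spotted later."""
    return {"id": uuid.uuid4().hex, "payload": payload}

def append_with_retry(append_rpc, record, attempts=3):
    """At-least-once append: on a failed or ambiguous reply the client just retries,
    which may write the same record more than once (GFS picks the offset each time)."""
    for _ in range(attempts):
        try:
            return append_rpc(record)
        except TimeoutError:
            continue
    raise RuntimeError("record append failed after retries")

def read_deduplicated(records):
    """Readers keep the first copy of each id and silently drop the duplicates."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec
```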

Separation of Control and Data Flow

Metadata operations go through the master; data flows directly between clients and chunkservers. This separation is crucial: the master handles only small, fast metadata requests, while terabytes of data move between clients and chunkservers without ever touching it.
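
A sketch of how that separation plays out on the write path, assuming placeholder calls (`grant_or_lookup_lease`, `push`, `apply`) rather than the real GFS RPCs:

```python
def write_record(master, order_by_distance, path, chunk_index, data: bytes):
    """Illustrative write path.

    Control flow (small messages): ask the master which replica holds the lease.
    Data flow (bulk bytes): push the data along a chain of chunkservers,
    then send one small commit request to the primary."""
    # Control: the master only hands out the lease holder and replica locations.
    primary, secondaries = master.grant_or_lookup_lease(path, chunk_index)

    # Data: bytes travel client -> nearest replica -> next replica -> ...,
    # pipelined along the chain and never passing through the master.
    chain = order_by_distance([primary] + secondaries)
    data_id = chain[0].push(data, forward_to=chain[1:])

    # Control again: a small request asks the primary to assign a mutation order
    # and have the secondaries apply the data they already buffered.
    return primary.apply(path, chunk_index, data_id, secondaries)
```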

Deep Dive

By 2003, Google's storage needs were unprecedented:

  • Petabytes of data: Web crawls, indexes, logs, user data
  • Thousands of machines: Commodity hardware, not expensive servers
  • Continuous failures: At scale, something is always broken
  • Extreme throughput: MapReduce jobs reading/writing terabytes

Existing distributed file systems (AFS, NFS) were designed for different workloads—small files, random access, few clients. Google needed a new approach.

Google's Workload Characteristics

Key observations that shaped GFS design:

  1. Component failures are routine: With thousands of disks, multiple failures happen daily. The system must self-heal automatically.
  2. Files are huge: Multi-GB files are common. Optimizing for small files wastes engineering effort.
  3. Reads are mostly sequential: Streaming large reads dominate. Random reads are rare and can be slow.
  4. Writes are mostly appends: Files grow by appending (logs, crawl data). Random writes are rare.
  5. Throughput matters more than latency: Batch processing cares about aggregate bandwidth, not single-request latency.

These observations led to design choices that would seem wrong for general-purpose file systems but were exactly right for Google.

Trade-offs

Single Master Architecture
  • Advantage: Simplifies design; enables sophisticated global decisions (placement, rebalancing); fast metadata operations
  • Disadvantage: Single point of failure (mitigated by shadow masters); metadata bottleneck for very large namespaces

Large Chunk Size (64 MB)
  • Advantage: Reduces metadata volume; fewer client-master interactions; efficient for large sequential workloads
  • Disadvantage: Wastes space for small files; can create hot spots when small files are popular

Relaxed Consistency
  • Advantage: Enables concurrent appends without locking; higher write throughput; simpler failure handling
  • Disadvantage: Applications must handle duplicates and undefined regions; more complex client logic

Record Append (At-Least-Once)
  • Advantage: Atomic concurrent appends without coordination; ideal for logs and queues
  • Disadvantage: Duplicates possible; applications must deduplicate; not suitable for exactly-once requirements

Replication Over RAID
  • Advantage: Handles correlated failures (rack, switch); enables read parallelism; works with commodity disks
  • Disadvantage: 3x storage overhead; more network traffic for writes; eventual consistency between replicas
