Tags: distributed-systems, storage, file-system, google, hdfs, replication, advanced

The Google File System

How Google reimagined distributed storage by embracing component failures and optimizing for large sequential workloads

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung | Google | 2003 | 35 min read

Summary

GFS is a scalable distributed file system for large data-intensive applications. It runs on thousands of commodity machines, providing fault tolerance through replication while delivering high aggregate throughput. GFS challenged conventional file system assumptions: it embraced component failures as normal, optimized for large streaming reads and appends rather than random access, and relaxed consistency guarantees to achieve better performance. GFS became the storage foundation for MapReduce and Bigtable, enabling Google's web-scale data processing.

Key Takeaways

Failure is the Norm

With thousands of commodity machines, hardware failures happen constantly—disks fail, machines crash, networks partition. GFS treats failure as expected behavior, not exceptional. Automatic recovery and replication are built into the core design.
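
The paper describes the master noticing failed chunkservers through missed heartbeats and re-replicating their chunks automatically. The sketch below is a minimal illustration of that idea; the function name, timeout, and data structures are assumptions for this example, not the actual GFS implementation.

```python
import time

HEARTBEAT_TIMEOUT = 60   # seconds without a heartbeat before a chunkserver is presumed dead
REPLICATION_TARGET = 3   # desired number of replicas per chunk

def find_rereplication_work(chunk_replicas, last_heartbeat, now=None):
    """Return chunks that have fallen below the replication target, most at-risk first.

    chunk_replicas: dict of chunk_id -> set of chunkserver ids holding a replica
    last_heartbeat: dict of chunkserver id -> timestamp of its last heartbeat
    """
    now = time.time() if now is None else now
    dead = {cs for cs, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT}
    work = []
    for chunk_id, replicas in chunk_replicas.items():
        live = replicas - dead
        if len(live) < REPLICATION_TARGET:
            work.append((chunk_id, live))
    # Chunks with the fewest surviving replicas are closest to data loss, so fix them first.
    work.sort(key=lambda item: len(item[1]))
    return work
```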

Large Files, Sequential Access

GFS optimizes for multi-GB files accessed sequentially, not millions of small files with random access. This matches Google's workloads: web crawls, logs, MapReduce intermediates. The chunk size is 64 MB (versus a typical 4 KB file system block), reducing metadata overhead dramatically.
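
A rough back-of-the-envelope calculation (not a figure from the paper) shows why the large chunk size matters when the master keeps all metadata in memory; `chunk_index` is an illustrative helper, not a GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB GFS chunk
BLOCK_SIZE = 4 * 1024           # 4 KB block in a conventional file system

def chunk_index(byte_offset: int) -> int:
    """Which chunk of a file a given byte offset falls into."""
    return byte_offset // CHUNK_SIZE

file_size = 1024 ** 4                  # a 1 TB file
print(file_size // CHUNK_SIZE)         # 16,384 chunk entries for the master to track
print(file_size // BLOCK_SIZE)         # ~268 million entries under a 4 KB block scheme
print(chunk_index(200 * 1024 ** 2))    # byte offset 200 MB falls in chunk 3
```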

Single Master Architecture

One master manages all metadata—namespace, file-to-chunk mapping, chunk locations. This simplifies design and enables sophisticated decisions (placement, rebalancing, garbage collection). The master is not a bottleneck because clients cache metadata and access data directly from chunkservers.
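
A minimal sketch of that read path, assuming a hypothetical `master.lookup` RPC and a chunkserver `read_chunk` call (these names are illustrative, not the real GFS interfaces):

```python
CHUNK_SIZE = 64 * 1024 * 1024

class ClientSketch:
    """Illustrative GFS-style client: metadata from the master, data from chunkservers."""

    def __init__(self, master):
        self.master = master
        self.location_cache = {}   # (path, chunk index) -> (chunk handle, [chunkserver stubs])

    def read(self, path, offset, length):
        index = offset // CHUNK_SIZE
        key = (path, index)
        if key not in self.location_cache:
            # One small metadata round trip; the answer is cached, so repeated
            # reads of this chunk never contact the master again.
            self.location_cache[key] = self.master.lookup(path, index)
        handle, servers = self.location_cache[key]
        # Bulk data is fetched directly from a replica, bypassing the master entirely.
        return servers[0].read_chunk(handle, offset % CHUNK_SIZE, length)
```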

Relaxed Consistency Model

GFS provides weaker consistency than POSIX in exchange for performance. After concurrent successful writes, a region is 'consistent' (all replicas hold the same bytes) but may not be 'defined' (it may not reflect any single write in its entirety). Applications cope by relying on append-at-least-once semantics and handling duplicates.
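
The paper notes that readers cope with padding and partial fragments by validating records with checksums. The following is a hedged sketch of such record framing; the marker value and layout are invented for illustration.

```python
import struct
import zlib

MAGIC = 0x5245434F   # arbitrary marker used here to recognize record boundaries

def frame_record(payload: bytes) -> bytes:
    """Prefix a record with a marker, its length, and a checksum so readers can
    tell real records apart from padding or torn writes."""
    return struct.pack(">III", MAGIC, len(payload), zlib.crc32(payload)) + payload

def parse_record(buf: bytes, pos: int):
    """Return (payload, next position) if a valid record starts at pos, else None."""
    header = buf[pos:pos + 12]
    if len(header) < 12:
        return None
    magic, length, crc = struct.unpack(">III", header)
    payload = buf[pos + 12:pos + 12 + length]
    if magic != MAGIC or len(payload) < length or zlib.crc32(payload) != crc:
        return None   # padding, garbage, or an undefined region: the reader skips ahead
    return payload, pos + 12 + length
```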

Record Append as Primary Operation

Most writes are appends, not overwrites. GFS provides an atomic record append: the client supplies the data, and GFS chooses the offset and guarantees the record is appended atomically at least once. This enables concurrent appends from multiple clients without locking, which is ideal for logs and queues.
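
Because record append is at-least-once, a retried append can leave duplicate copies of a record in the file; the paper suggests readers filter them using unique identifiers carried inside the records. A minimal sketch, where `append_rpc` is a stand-in for the actual client library call:

```python
import uuid

def make_record(payload: bytes) -> dict:
    """Writers embed a unique id so a record duplicated by a retry can be spotted later."""
    return {"id": uuid.uuid4().hex, "payload": payload}

def append_with_retry(append_rpc, record, attempts=3):
    """At-least-once append: on a failed or ambiguous reply the client just retries,
    which may write the same record more than once (GFS picks the offset each time)."""
    for _ in range(attempts):
        try:
            return append_rpc(record)
        except TimeoutError:
            continue
    raise RuntimeError("record append failed after retries")

def read_deduplicated(records):
    """Readers keep the first copy of each id and silently drop the duplicates."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec
```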

Separation of Control and Data Flow

Metadata operations go through the master; data flows directly between clients and chunkservers. This separation is crucial: the master handles only small, fast metadata requests, while terabytes of data move between clients and chunkservers without ever touching it.
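
A sketch of how that separation plays out on the write path, assuming placeholder calls (`grant_or_lookup_lease`, `push`, `apply`) rather than the real GFS RPCs:

```python
def write_record(master, order_by_distance, path, chunk_index, data: bytes):
    """Illustrative write path.

    Control flow (small messages): ask the master which replica holds the lease.
    Data flow (bulk bytes): push the data along a chain of chunkservers,
    then send one small commit request to the primary."""
    # Control: the master only hands out the lease holder and replica locations.
    primary, secondaries = master.grant_or_lookup_lease(path, chunk_index)

    # Data: bytes travel client -> nearest replica -> next replica -> ...,
    # pipelined along the chain and never passing through the master.
    chain = order_by_distance([primary] + secondaries)
    data_id = chain[0].push(data, forward_to=chain[1:])

    # Control again: a small request asks the primary to assign a mutation order
    # and have the secondaries apply the data they already buffered.
    return primary.apply(path, chunk_index, data_id, secondaries)
```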

Deep Dive

By 2003, Google's storage needs were unprecedented:

  • Petabytes of data: Web crawls, indexes, logs, user data
  • Thousands of machines: Commodity hardware, not expensive servers
  • Continuous failures: At scale, something is always broken
  • Extreme throughput: MapReduce jobs reading/writing terabytes

Existing distributed file systems (AFS, NFS) were designed for different workloads—small files, random access, few clients. Google needed a new approach.

Google's Workload Characteristics

Key observations that shaped GFS design:

  1. Component failures are routine: With thousands of disks, multiple failures happen daily. The system must self-heal automatically.
  2. Files are huge: Multi-GB files are common. Optimizing for small files wastes engineering effort.
  3. Reads are mostly sequential: Streaming large reads dominate. Random reads are rare and can be slow.
  4. Writes are mostly appends: Files grow by appending (logs, crawl data). Random writes are rare.
  5. Throughput matters more than latency: Batch processing cares about aggregate bandwidth, not single-request latency.

These observations led to design choices that would seem wrong for general-purpose file systems but were exactly right for Google.

Trade-offs

Single Master Architecture
  • Advantage: Simplifies design; enables sophisticated global decisions (placement, rebalancing); fast metadata operations
  • Disadvantage: Single point of failure (mitigated by shadow masters); metadata bottleneck for very large namespaces

Large Chunk Size (64 MB)
  • Advantage: Reduces metadata volume; fewer client-master interactions; efficient for large sequential workloads
  • Disadvantage: Wastes space for small files; can create hot spots when small files are popular

Relaxed Consistency
  • Advantage: Enables concurrent appends without locking; higher write throughput; simpler failure handling
  • Disadvantage: Applications must handle duplicates and undefined regions; more complex client logic

Record Append (At-Least-Once)
  • Advantage: Atomic concurrent appends without coordination; ideal for logs and queues
  • Disadvantage: Duplicates possible; applications must deduplicate; not suitable for exactly-once requirements

Replication Over RAID
  • Advantage: Handles correlated failures (rack, switch); enables read parallelism; works with commodity disks
  • Disadvantage: 3x storage overhead; more network traffic for writes; eventual consistency between replicas
