Tags: distributed-systems, cluster-management, kubernetes, google, containers, scheduling, orchestration, advanced

Large-scale cluster management at Google with Borg

The internal system that runs everything at Google and inspired Kubernetes—managing millions of jobs across hundreds of thousands of machines

Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes · Google · 2015 · 35 min read

Summary

Borg is Google's cluster management system that has been running production workloads for over a decade. It manages hundreds of thousands of machines across multiple datacenters, running hundreds of thousands of jobs from thousands of applications. Borg achieves high utilization by mixing batch and latency-sensitive workloads, provides high availability through replication and preemption, and simplifies operations through declarative job specifications and a unified API. The lessons learned from Borg directly shaped Kubernetes, which adopted many of its core concepts while addressing known limitations.

Key Takeaways

Cells as the Unit of Management

Borg divides machines into cells (typically ~10,000 machines). Each cell is managed by one logically centralized Borgmaster, itself replicated for fault tolerance. This provides failure isolation (a Borgmaster outage affects only its own cell) and enables independent scaling and upgrades across cells.

Declarative Job Specifications

Users declare what they want (CPU, memory, constraints), not how to achieve it. Borg figures out where to place tasks, handles failures, and manages scaling. This separation of concerns is fundamental to Kubernetes' design.
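
A minimal sketch of this declarative model in Python (Borg's real configs are written in its BCL language; every type and field below is illustrative): the user states resources and constraints, and a placement function, standing in for the scheduler, decides where tasks land.

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    # The *what*: resources and constraints the user declares.
    name: str
    replicas: int
    cpu_millicores: int
    ram_mb: int
    constraints: dict = field(default_factory=dict)

@dataclass
class Machine:
    name: str
    free_cpu: int
    free_ram_mb: int
    attrs: dict = field(default_factory=dict)

def place(spec, machines):
    """The *how*: the scheduler picks machines; users never name them."""
    placement = {}
    for task in range(spec.replicas):
        for m in machines:
            fits = (m.free_cpu >= spec.cpu_millicores and
                    m.free_ram_mb >= spec.ram_mb and
                    all(m.attrs.get(k) == v for k, v in spec.constraints.items()))
            if fits:
                m.free_cpu -= spec.cpu_millicores
                m.free_ram_mb -= spec.ram_mb
                placement[task] = m.name
                break
    return placement

machines = [Machine("m1", 2000, 4096, {"arch": "x86_64"}),
            Machine("m2", 2000, 4096, {"arch": "arm64"})]
spec = JobSpec("web", replicas=2, cpu_millicores=500, ram_mb=1024,
               constraints={"arch": "x86_64"})
print(place(spec, machines))  # {0: 'm1', 1: 'm1'} — both replicas fit on the matching machine
```

The point is the interface, not the placement heuristic: the spec never mentions a machine name, so Borg is free to reschedule tasks elsewhere after a failure without any change to the spec.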

Priority and Preemption

Jobs have priorities (production, batch, best-effort). Higher priority jobs can preempt lower priority ones. This enables high utilization—batch jobs fill unused capacity—while guaranteeing resources for critical services.
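
The preemption rule can be sketched as follows; the priority tiers and the (name, priority, cpu) tuple layout are illustrative, not Borg's actual priority bands:

```python
PRIORITY = {"best-effort": 0, "batch": 1, "production": 2}  # illustrative tiers

def admit(running, task, capacity):
    """Admit `task` = (name, priority, cpu) onto a machine with `capacity` cpu,
    preempting strictly lower-priority tasks (lowest first) until it fits.
    Returns (new_running, preempted); rejects without preempting if it can't fit."""
    used = sum(cpu for _, _, cpu in running)
    survivors = sorted(running, key=lambda t: t[1])  # lowest priority first
    preempted = []
    while used + task[2] > capacity and survivors and survivors[0][1] < task[1]:
        victim = survivors.pop(0)  # never evict equal or higher priority
        preempted.append(victim)
        used -= victim[2]
    if used + task[2] > capacity:
        return running, []  # still doesn't fit: reject, undo nothing
    return survivors + [task], preempted

running = [("mapreduce", PRIORITY["batch"], 600),
           ("crawler", PRIORITY["best-effort"], 300)]
new_running, evicted = admit(running, ("websearch", PRIORITY["production"], 700),
                             capacity=1000)
# The production task lands; both lower-priority tasks are evicted to make room.
```

Evicted batch work is not lost in Borg: preempted tasks are rescheduled elsewhere, which is exactly why batch can safely soak up spare capacity.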

Allocs for Resource Bundling

An alloc reserves resources on a machine for one or more tasks. This enables co-located helper processes (sidecars), resource banking for future use, and predictable colocation of related tasks.
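
A toy model of an alloc, with made-up names and units: the reservation is made once, and tasks (including sidecar helpers) draw from it without renegotiating with the scheduler.

```python
class Alloc:
    """A reserved slice of one machine's resources; tasks draw from it.
    Names and units are illustrative, not Borg's API."""
    def __init__(self, cpu_millicores, ram_mb):
        self.cpu = cpu_millicores
        self.ram = ram_mb
        self.tasks = {}

    def start_task(self, name, cpu, ram):
        used_cpu = sum(c for c, _ in self.tasks.values())
        used_ram = sum(r for _, r in self.tasks.values())
        if used_cpu + cpu > self.cpu or used_ram + ram > self.ram:
            return False  # would overflow the reservation
        self.tasks[name] = (cpu, ram)
        return True

# Reserve once, then co-locate a server with its log-shipping helper;
# the unused remainder stays banked inside the alloc for future tasks.
a = Alloc(cpu_millicores=1000, ram_mb=2048)
assert a.start_task("server", 600, 1024)
assert a.start_task("log-saver", 200, 256)    # sidecar shares the same alloc
assert not a.start_task("too-big", 500, 512)  # only 200 millicores remain
```

Kubernetes inherited this shape almost directly: a Pod is a resource envelope holding one or more co-scheduled containers.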


Borglet as the Node Agent

Each machine runs a Borglet agent that starts and stops tasks, manages local resources, and reports status. The Borgmaster polls Borglets rather than having Borglets push updates, which improves scalability and resilience.
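
The pull-based flow can be sketched like this (class and method names are invented for illustration): the master decides when to ask, so a dead node simply stops answering rather than leaving the master to sort out a backlog of pushed events.

```python
class Borglet:
    """Per-machine agent; answers polls with its current task status."""
    def __init__(self, name, tasks):
        self.name, self.tasks, self.alive = name, tasks, True
    def poll(self):
        return {"tasks": list(self.tasks)} if self.alive else None

class Borgmaster:
    """Pull-based control loop: the master polls at its own pace, so node
    churn cannot overwhelm it with inbound updates."""
    def __init__(self, borglets):
        self.borglets = borglets
        self.state = {}
    def poll_all(self):
        for b in self.borglets:
            reply = b.poll()
            self.state[b.name] = reply["tasks"] if reply else "unreachable"
        return self.state

nodes = [Borglet("node-1", ["web.0"]), Borglet("node-2", ["batch.3"])]
nodes[1].alive = False  # simulate a machine failure
master = Borgmaster(nodes)
print(master.poll_all())  # {'node-1': ['web.0'], 'node-2': 'unreachable'}
```

The trade-off is latency: a change on a node is only noticed at the next poll, which the Trade-offs section below lists as the cost of this design.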

High Utilization Through Mixing

Borg achieves 60-70% average utilization by mixing latency-sensitive (production) and batch workloads. Batch jobs use resources not needed by production jobs, dramatically reducing the fleet needed compared to static partitioning.
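
Back-of-the-envelope arithmetic, with invented load figures (not numbers from the paper), shows why mixing shrinks the fleet relative to static partitioning:

```python
# Illustrative load figures in "machine-equivalents" of work.
prod_peak, prod_avg = 600, 250    # production must be provisioned for its peak
batch_peak, batch_avg = 200, 150

# Static partitioning: each pool is sized for its own peak.
static_fleet = prod_peak + batch_peak                 # 800 machines

# Mixed cell: batch runs in production's idle headroom, so the fleet only
# needs to cover whichever is larger, production's peak or the combined average.
mixed_fleet = max(prod_peak, prod_avg + batch_avg)    # 600 machines

utilization = (prod_avg + batch_avg) / mixed_fleet    # lands in the 60-70% band
print(static_fleet, mixed_fleet, round(utilization, 2))  # 800 600 0.67
```

With these assumed numbers, mixing saves a quarter of the machines while keeping full headroom for production's peak, because batch is the first thing preempted when production ramps up.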

Deep Dive

By 2015, Google operated hundreds of thousands of machines running everything from Search and Gmail to MapReduce and machine learning training. Managing this infrastructure manually was impossible.

The challenges:

  1. Scale: Hundreds of thousands of machines, millions of running tasks
  2. Diversity: Latency-sensitive services (Search, Gmail) alongside batch jobs (MapReduce, ML training)
  3. Efficiency: Machines are expensive; unutilized capacity is wasted money
  4. Reliability: Services must survive machine failures, which happen constantly at scale
  5. Velocity: Thousands of developers deploying updates continuously

Borg has been Google's answer to these challenges for over a decade. It hides the complexity of cluster management behind a simple abstraction: tell Borg what you want to run, and it figures out where and how.

Trade-offs

Centralized Scheduler (Borgmaster)
  Advantage: Global view enables optimal placement; simpler than distributed scheduling; easier consistency.
  Disadvantage: Potential bottleneck at scale; single point of failure (mitigated by replication).

Priority and Preemption
  Advantage: Enables high utilization by mixing workloads; guarantees resources for production.
  Disadvantage: Batch jobs experience unpredictable interruptions; quota management becomes more complex.

Resource Reclamation
  Advantage: 20-30% efficiency gain from using reserved-but-unused resources.
  Disadvantage: Risk of resource contention if usage predictions are wrong; estimation adds complexity.

Cell-Based Architecture
  Advantage: Failure isolation; independent scaling; clear administrative boundaries.
  Disadvantage: Cross-cell coordination is harder; resources fragment across cells.

Poll-Based Communication
  Advantage: Borgmaster controls its own load; resilient to Borglet failures; simpler state management.
  Disadvantage: Delayed detection of changes; polling overhead at scale.
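
The resource-reclamation trade-off above can be illustrated with a small estimator; the safety margin and the policy itself are invented for illustration, not the paper's actual algorithm:

```python
def reclaimable(reservation, recent_usage, safety_margin=1.25):
    """Capacity that can be lent to best-effort work: the reservation minus a
    padded estimate of recent peak usage. Margin and policy are illustrative."""
    estimate = min(reservation, max(recent_usage) * safety_margin)
    return reservation - estimate

# A production task reserves 1000 millicores but recently used at most 400.
slack = reclaimable(1000, [350, 400, 380])
print(slack)  # 500.0 millicores lendable to batch until usage rises
```

The disadvantage in the table falls out of the margin: pad too little and a production usage spike collides with the batch work borrowing its slack; pad too much and the 20-30% efficiency gain evaporates.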