Tags: distributed-systems, cluster-management, kubernetes, google, containers, scheduling, orchestration, advanced

Large-scale cluster management at Google with Borg

The internal system that runs everything at Google and inspired Kubernetes—managing millions of jobs across hundreds of thousands of machines

Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes · Google · 2015 · 35 min read

Summary

Borg is Google's cluster management system that has been running production workloads for over a decade. It manages hundreds of thousands of machines across multiple datacenters, running hundreds of thousands of jobs from thousands of applications. Borg achieves high utilization by mixing batch and latency-sensitive workloads, provides high availability through replication and preemption, and simplifies operations through declarative job specifications and a unified API. The lessons learned from Borg directly shaped Kubernetes, which adopted many of its core concepts while addressing known limitations.

Key Takeaways

Cells as the Unit of Management

Borg divides machines into cells (typically ~10,000 machines). Each cell is managed by one logically centralized Borgmaster, itself replicated for fault tolerance. This provides failure isolation (a Borgmaster outage affects only its own cell) and enables independent scaling and upgrades across cells.

Declarative Job Specifications

Users declare what they want (CPU, memory, constraints), not how to achieve it. Borg figures out where to place tasks, handles failures, and manages scaling. This separation of concerns is fundamental to Kubernetes' design.
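
A minimal sketch of this declarative model in Python (Borg's real configs are written in its BCL language; every type and field below is illustrative): the user states resources and constraints, and a placement function, standing in for the scheduler, decides where tasks land.

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    # The *what*: resources and constraints the user declares.
    name: str
    replicas: int
    cpu_millicores: int
    ram_mb: int
    constraints: dict = field(default_factory=dict)

@dataclass
class Machine:
    name: str
    free_cpu: int
    free_ram_mb: int
    attrs: dict = field(default_factory=dict)

def place(spec, machines):
    """The *how*: the scheduler picks machines; users never name them."""
    placement = {}
    for task in range(spec.replicas):
        for m in machines:
            fits = (m.free_cpu >= spec.cpu_millicores and
                    m.free_ram_mb >= spec.ram_mb and
                    all(m.attrs.get(k) == v for k, v in spec.constraints.items()))
            if fits:
                m.free_cpu -= spec.cpu_millicores
                m.free_ram_mb -= spec.ram_mb
                placement[task] = m.name
                break
    return placement

machines = [Machine("m1", 2000, 4096, {"arch": "x86_64"}),
            Machine("m2", 2000, 4096, {"arch": "arm64"})]
spec = JobSpec("web", replicas=2, cpu_millicores=500, ram_mb=1024,
               constraints={"arch": "x86_64"})
print(place(spec, machines))  # {0: 'm1', 1: 'm1'} — both replicas fit on the matching machine
```

The point is the interface, not the placement heuristic: the spec never mentions a machine name, so Borg is free to reschedule tasks elsewhere after a failure without any change to the spec.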

Priority and Preemption

Jobs have priorities (production, batch, best-effort). Higher priority jobs can preempt lower priority ones. This enables high utilization—batch jobs fill unused capacity—while guaranteeing resources for critical services.
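
The preemption rule can be sketched as follows; the priority tiers and the (name, priority, cpu) tuple layout are illustrative, not Borg's actual priority bands:

```python
PRIORITY = {"best-effort": 0, "batch": 1, "production": 2}  # illustrative tiers

def admit(running, task, capacity):
    """Admit `task` = (name, priority, cpu) onto a machine with `capacity` cpu,
    preempting strictly lower-priority tasks (lowest first) until it fits.
    Returns (new_running, preempted); rejects without preempting if it can't fit."""
    used = sum(cpu for _, _, cpu in running)
    survivors = sorted(running, key=lambda t: t[1])  # lowest priority first
    preempted = []
    while used + task[2] > capacity and survivors and survivors[0][1] < task[1]:
        victim = survivors.pop(0)  # never evict equal or higher priority
        preempted.append(victim)
        used -= victim[2]
    if used + task[2] > capacity:
        return running, []  # still doesn't fit: reject, undo nothing
    return survivors + [task], preempted

running = [("mapreduce", PRIORITY["batch"], 600),
           ("crawler", PRIORITY["best-effort"], 300)]
new_running, evicted = admit(running, ("websearch", PRIORITY["production"], 700),
                             capacity=1000)
# The production task lands; both lower-priority tasks are evicted to make room.
```

Evicted batch work is not lost in Borg: preempted tasks are rescheduled elsewhere, which is exactly why batch can safely soak up spare capacity.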

Allocs for Resource Bundling

An alloc reserves resources on a machine for one or more tasks. This enables co-located helper processes (sidecars), resource banking for future use, and predictable colocation of related tasks.
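
A toy model of an alloc, with made-up names and units: the reservation is made once, and tasks (including sidecar helpers) draw from it without renegotiating with the scheduler.

```python
class Alloc:
    """A reserved slice of one machine's resources; tasks draw from it.
    Names and units are illustrative, not Borg's API."""
    def __init__(self, cpu_millicores, ram_mb):
        self.cpu = cpu_millicores
        self.ram = ram_mb
        self.tasks = {}

    def start_task(self, name, cpu, ram):
        used_cpu = sum(c for c, _ in self.tasks.values())
        used_ram = sum(r for _, r in self.tasks.values())
        if used_cpu + cpu > self.cpu or used_ram + ram > self.ram:
            return False  # would overflow the reservation
        self.tasks[name] = (cpu, ram)
        return True

# Reserve once, then co-locate a server with its log-shipping helper;
# the unused remainder stays banked inside the alloc for future tasks.
a = Alloc(cpu_millicores=1000, ram_mb=2048)
assert a.start_task("server", 600, 1024)
assert a.start_task("log-saver", 200, 256)    # sidecar shares the same alloc
assert not a.start_task("too-big", 500, 512)  # only 200 millicores remain
```

Kubernetes inherited this shape almost directly: a Pod is a resource envelope holding one or more co-scheduled containers.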


Borglet as the Node Agent

Each machine runs a Borglet agent that starts and stops tasks, manages local resources, and reports status. The Borgmaster polls Borglets rather than having Borglets push updates, which improves scalability and resilience.
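
The pull-based flow can be sketched like this (class and method names are invented for illustration): the master decides when to ask, so a dead node simply stops answering rather than leaving the master to sort out a backlog of pushed events.

```python
class Borglet:
    """Per-machine agent; answers polls with its current task status."""
    def __init__(self, name, tasks):
        self.name, self.tasks, self.alive = name, tasks, True
    def poll(self):
        return {"tasks": list(self.tasks)} if self.alive else None

class Borgmaster:
    """Pull-based control loop: the master polls at its own pace, so node
    churn cannot overwhelm it with inbound updates."""
    def __init__(self, borglets):
        self.borglets = borglets
        self.state = {}
    def poll_all(self):
        for b in self.borglets:
            reply = b.poll()
            self.state[b.name] = reply["tasks"] if reply else "unreachable"
        return self.state

nodes = [Borglet("node-1", ["web.0"]), Borglet("node-2", ["batch.3"])]
nodes[1].alive = False  # simulate a machine failure
master = Borgmaster(nodes)
print(master.poll_all())  # {'node-1': ['web.0'], 'node-2': 'unreachable'}
```

The trade-off is latency: a change on a node is only noticed at the next poll, which the Trade-offs section below lists as the cost of this design.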

High Utilization Through Mixing

Borg achieves 60-70% average utilization by mixing latency-sensitive (production) and batch workloads. Batch jobs use resources not needed by production jobs, dramatically reducing the fleet needed compared to static partitioning.
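
Back-of-the-envelope arithmetic, with invented load figures (not numbers from the paper), shows why mixing shrinks the fleet relative to static partitioning:

```python
# Illustrative load figures in "machine-equivalents" of work.
prod_peak, prod_avg = 600, 250    # production must be provisioned for its peak
batch_peak, batch_avg = 200, 150

# Static partitioning: each pool is sized for its own peak.
static_fleet = prod_peak + batch_peak                 # 800 machines

# Mixed cell: batch runs in production's idle headroom, so the fleet only
# needs to cover whichever is larger, production's peak or the combined average.
mixed_fleet = max(prod_peak, prod_avg + batch_avg)    # 600 machines

utilization = (prod_avg + batch_avg) / mixed_fleet    # lands in the 60-70% band
print(static_fleet, mixed_fleet, round(utilization, 2))  # 800 600 0.67
```

With these assumed numbers, mixing saves a quarter of the machines while keeping full headroom for production's peak, because batch is the first thing preempted when production ramps up.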

Deep Dive

By 2015, Google operated hundreds of thousands of machines running everything from Search and Gmail to MapReduce and machine learning training. Managing this infrastructure manually was impossible.

The challenges:

  1. Scale: Hundreds of thousands of machines, millions of running tasks
  2. Diversity: Latency-sensitive services (Search, Gmail) alongside batch jobs (MapReduce, ML training)
  3. Efficiency: Machines are expensive; unutilized capacity is wasted money
  4. Reliability: Services must survive machine failures, which happen constantly at scale
  5. Velocity: Thousands of developers deploying updates continuously

Borg has been Google's answer to these challenges for over a decade. It hides the complexity of cluster management behind a simple abstraction: tell Borg what you want to run, and it figures out where and how.

Trade-offs

Centralized Scheduler (Borgmaster)
  Advantage: Global view enables optimal placement; simpler than distributed scheduling; easier consistency.
  Disadvantage: Potential bottleneck at scale; single point of failure (mitigated by replication).

Priority and Preemption
  Advantage: Enables high utilization by mixing workloads; guarantees resources for production.
  Disadvantage: Batch jobs experience unpredictable interruptions; quota management becomes more complex.

Resource Reclamation
  Advantage: 20-30% efficiency gain from using reserved-but-unused resources.
  Disadvantage: Risk of resource contention if usage predictions are wrong; estimation adds complexity.

Cell-Based Architecture
  Advantage: Failure isolation; independent scaling; clear administrative boundaries.
  Disadvantage: Cross-cell coordination is harder; resources fragment across cells.

Poll-Based Communication
  Advantage: Borgmaster controls its own load; resilient to Borglet failures; simpler state management.
  Disadvantage: Delayed detection of changes; polling overhead at scale.
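
The resource-reclamation trade-off above can be illustrated with a small estimator; the safety margin and the policy itself are invented for illustration, not the paper's actual algorithm:

```python
def reclaimable(reservation, recent_usage, safety_margin=1.25):
    """Capacity that can be lent to best-effort work: the reservation minus a
    padded estimate of recent peak usage. Margin and policy are illustrative."""
    estimate = min(reservation, max(recent_usage) * safety_margin)
    return reservation - estimate

# A production task reserves 1000 millicores but recently used at most 400.
slack = reclaimable(1000, [350, 400, 380])
print(slack)  # 500.0 millicores lendable to batch until usage rises
```

The disadvantage in the table falls out of the margin: pad too little and a production usage spike collides with the batch work borrowing its slack; pad too much and the 20-30% efficiency gain evaporates.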