Whitepapers
The internal system that runs everything at Google and inspired Kubernetes—managing millions of jobs across hundreds of thousands of machines
Borg is Google's cluster management system that has been running production workloads for over a decade. It manages hundreds of thousands of machines across multiple datacenters, running hundreds of thousands of jobs from thousands of applications. Borg achieves high utilization by mixing batch and latency-sensitive workloads, provides high availability through replication and preemption, and simplifies operations through declarative job specifications and a unified API. The lessons learned from Borg directly shaped Kubernetes, which adopted many of its core concepts while addressing known limitations.
Borg divides machines into cells (typically ~10,000 machines). Each cell is managed by a single logically centralized Borgmaster, replicated for fault tolerance. This provides failure isolation—a Borgmaster failure affects only one cell—and enables independent scaling and upgrades across cells.
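As a minimal sketch of the idea, a fleet can be partitioned into fixed-size cells so that any single control-plane failure is confined to one partition. The names (`partition_into_cells`, `CELL_SIZE`) are illustrative, not Borg's actual interfaces.

```python
# Illustrative sketch: group a flat fleet into independently managed cells.
# CELL_SIZE reflects the ~10,000-machine figure mentioned above.
CELL_SIZE = 10_000

def partition_into_cells(machines, cell_size=CELL_SIZE):
    """Split a list of machine IDs into cells, each managed on its own."""
    return [machines[i:i + cell_size] for i in range(0, len(machines), cell_size)]

fleet = [f"machine-{n}" for n in range(25_000)]
cells = partition_into_cells(fleet)
# Three cells: two full cells of 10,000 machines and one of 5,000.
```

The payoff is blast-radius control: losing the manager of `cells[0]` leaves the other cells running and upgradable on their own schedules.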
Users declare what they want (CPU, memory, constraints) not how to achieve it. Borg figures out where to place tasks, handles failures, and manages scaling. This separation of concerns is fundamental to Kubernetes' design.
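To make the declarative contract concrete, here is a hedged sketch of what such a job request might look like. The field names below are hypothetical and do not reproduce actual BCL (Borg Configuration Language) syntax; the point is that the spec states requirements, never machine placements.

```python
# Hypothetical declarative job request: the user states *what* to run,
# the system decides *where*. Field names are illustrative only.
job_spec = {
    "name": "webserver",
    "replicas": 5,                               # how many task instances
    "resources": {"cpu": 2.0, "ram_mb": 4096},   # what each task needs
    "constraints": {"os": "linux"},              # requirements, not placements
    "priority": "production",
}

def validate(spec):
    """Accept a spec that declares requirements; placement stays the system's job."""
    required = {"name", "replicas", "resources"}
    return required.issubset(spec)
```

Note what is absent: no machine names, no IP addresses, no restart logic. That omission is the separation of concerns Kubernetes inherited.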
Jobs have priorities (production, batch, best-effort). Higher priority jobs can preempt lower priority ones. This enables high utilization—batch jobs fill unused capacity—while guaranteeing resources for critical services.
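The preemption mechanic can be sketched in a few lines. This is a simplified assumption-laden model (one machine, CPU only, greedy eviction of the lowest-priority tasks first), not Borg's actual scheduler.

```python
# Simplified sketch of priority-based preemption on a single machine.
PRIORITY = {"production": 2, "batch": 1, "best-effort": 0}

def schedule(machine_tasks, capacity, new_task):
    """Try to place new_task; evict strictly lower-priority tasks if needed.

    Each task is a dict with 'cpu' and 'priority'. Returns the evicted tasks.
    """
    used = sum(t["cpu"] for t in machine_tasks)
    evicted = []
    # Evict lowest-priority victims first until the new task fits.
    for victim in sorted(machine_tasks, key=lambda t: PRIORITY[t["priority"]]):
        if used + new_task["cpu"] <= capacity:
            break
        if PRIORITY[victim["priority"]] < PRIORITY[new_task["priority"]]:
            machine_tasks.remove(victim)
            evicted.append(victim)
            used -= victim["cpu"]
    if used + new_task["cpu"] <= capacity:
        machine_tasks.append(new_task)
    return evicted
```

A batch task filling spare capacity gets evicted the moment a production task needs its cores, which is exactly why batch jobs must tolerate interruption.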
By 2015, Google operated hundreds of thousands of machines running everything from Search and Gmail to MapReduce and machine learning training. Managing this infrastructure manually was impossible.
The challenges: machines fail constantly at this scale, workloads range from latency-sensitive services to throughput-oriented batch jobs, and statically partitioning machines per application would waste enormous amounts of capacity.
Borg has been Google's answer to these challenges for over a decade. It hides the complexity of cluster management behind a simple abstraction: tell Borg what you want to run, and it figures out where and how.
Each machine runs a Borglet agent that starts/stops tasks, manages local resources, and reports status. The Borgmaster polls Borglets rather than Borglets pushing updates, enabling better scalability and resilience.
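The master-driven polling flow can be sketched as follows. Class and method names here are illustrative stand-ins, not Borg's real interfaces; the point is that the master pulls state at its own pace, so one dead agent never blocks the rest of the sweep.

```python
# Simplified sketch of master-driven polling: the master controls its own
# load and simply records agents it cannot reach.
class Borglet:
    """Stand-in for a per-machine agent (illustrative, not Borg's API)."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def report_status(self):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        return {"name": self.name, "tasks_running": 3}

def poll_all(borglets):
    """Poll every agent; collect statuses and a list of unreachable agents."""
    statuses, unreachable = [], []
    for b in borglets:
        try:
            statuses.append(b.report_status())
        except ConnectionError:
            unreachable.append(b.name)  # its tasks become rescheduling candidates
    return statuses, unreachable
```

Contrast this with a push model, where a flood of agent updates could overload the master at exactly the moment (a correlated failure) when it is busiest.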
Borg achieves 60-70% average utilization by mixing latency-sensitive (production) and batch workloads. Batch jobs use resources not needed by production jobs, dramatically reducing the fleet needed compared to static partitioning.
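A hedged sketch of the reclamation arithmetic: capacity that production jobs have reserved but are not using can be lent to batch work, minus a safety margin. The formula and the margin value below are illustrative assumptions, not the paper's actual estimator.

```python
# Illustrative reclamation estimate: how much reserved-but-idle CPU a
# production job's reservation can lend to batch work. The 10% safety
# margin is an assumed value for the sketch.
def reclaimable(reservation, observed_usage, safety_margin=0.1):
    """CPU cores batch jobs may borrow, given reservation vs. observed usage."""
    headroom = reservation - observed_usage * (1 + safety_margin)
    return max(0.0, headroom)

# A production job reserving 10 cores but using only 4 leaves ~5.6 cores
# for batch work -- borrowed capacity that evaporates if usage rises.
```

This is also where the risk in the comparison table below comes from: if observed usage is a bad predictor of future usage, batch work gets evicted or production briefly contends for its own reservation.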
| Aspect | Advantage | Disadvantage |
|---|---|---|
| Centralized Scheduler (Borgmaster) | Global view enables optimal placement; simpler than distributed scheduling; easier consistency | Potential bottleneck at scale; single point of failure (mitigated by replication) |
| Priority and Preemption | Enables high utilization by mixing workloads; guarantees resources for production | Batch jobs experience unpredictable interruptions; complexity in quota management |
| Resource Reclamation | 20-30% efficiency gain by using reserved-but-unused resources | Risk of resource contention if predictions wrong; complexity in estimation |
| Cell-Based Architecture | Failure isolation; independent scaling; administrative boundaries | Cross-cell coordination harder; resource fragmentation across cells |
| Poll-Based Communication | Borgmaster controls load; resilient to Borglet failures; simpler state management | Delayed detection of changes; polling overhead at scale |