Tags: distributed-systems, databases, google, transactions, consistency, truetime, global-scale, advanced

Spanner: Google's Globally Distributed Database

The first system to achieve external consistency at global scale using synchronized clocks and TrueTime

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford. Google, 2012. 40 min read.

Summary

Spanner is Google's globally distributed database that provides external consistency (linearizability) for distributed transactions while maintaining high availability. The breakthrough is TrueTime—an API that exposes clock uncertainty, enabling Spanner to order transactions globally without communication. By waiting out clock uncertainty before committing, Spanner guarantees that if transaction T1 commits before T2 starts, T1's timestamp is less than T2's. This seemingly simple property enables globally consistent reads and true ACID transactions across continents.

Key Takeaways

TrueTime: Bounded Clock Uncertainty

Instead of pretending clocks are perfectly synchronized (they are not), TrueTime exposes the uncertainty as an interval [earliest, latest]. Spanner waits out this uncertainty before committing, guaranteeing correct ordering. GPS receivers and atomic clocks typically keep the uncertainty under 7ms.
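
The paper presents TrueTime as three calls: TT.now(), which returns an interval, plus TT.after(t) and TT.before(t). Below is a minimal Python sketch of that shape; the fixed 7ms epsilon and the use of the local clock are illustrative assumptions, since in production epsilon is derived continuously from GPS and atomic-clock time masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # true time is guaranteed to be <= latest

class TrueTime:
    """Toy stand-in for the TrueTime API (TT.now / TT.after / TT.before)."""

    def __init__(self, epsilon: float = 0.007):
        # Assumed fixed 7ms uncertainty bound; the real system computes
        # epsilon dynamically from its time masters.
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()  # local clock; a stand-in for the real time source
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        # True only once t has definitely passed on every correct clock.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only while t has definitely not yet arrived.
        return self.now().latest < t
```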

External Consistency Without Global Coordination

Traditional distributed transactions require coordination (such as two-phase commit), which adds latency. Spanner's commit-wait protocol uses time instead of communication: if you wait long enough, you are guaranteed to be ordered after all concurrent transactions.
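
Continuing the TrueTime sketch above, the commit-wait rule is short: take the timestamp at the top of the current uncertainty interval, then block until that timestamp is provably in the past before making writes visible. Here apply_writes is a hypothetical placeholder for Paxos replication and lock release:

```python
import time

def commit(tt: TrueTime, apply_writes) -> float:
    # Pick the commit timestamp at the top of the uncertainty window.
    s = tt.now().latest
    # Commit wait: block until s has definitely passed in real time.
    # Any transaction that starts after this returns will pick its own
    # timestamp as TT.now().latest >= true time > s, so ordering holds.
    while not tt.after(s):
        time.sleep(0.001)
    apply_writes(s)  # make the writes visible at timestamp s
    return s
```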

Lock-Free Snapshot Reads

Any read at a past timestamp can execute on any replica without locks or coordination. The timestamp guarantees a consistent snapshot. This enables fast, globally distributed read replicas.
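
A toy model of why no locks are needed: each replica keeps timestamped versions per key, plus a safe time up to which its state is known to be complete, and a read at t_read succeeds on any replica whose safe time has passed t_read. The paper's safe-time computation is more involved; the data structures here are invented for illustration:

```python
class Replica:
    """Toy MVCC replica: versions per key, applied in timestamp order."""

    def __init__(self):
        self.versions: dict[str, list[tuple[float, object]]] = {}
        self.t_safe = 0.0  # all writes with ts <= t_safe are applied here

    def apply(self, key: str, ts: float, value: object) -> None:
        # Writes arrive from the replication log in increasing timestamp order.
        self.versions.setdefault(key, []).append((ts, value))
        self.t_safe = max(self.t_safe, ts)

    def snapshot_read(self, key: str, t_read: float) -> object:
        # Lock-free: the timestamp alone pins a consistent snapshot.
        if t_read > self.t_safe:
            raise RuntimeError("replica not caught up; route elsewhere")
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= t_read:
                result = value  # newest version at or before t_read
            else:
                break
        return result
```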

Paxos Groups for Replication

Data is sharded across directories, each replicated via Paxos for fault tolerance. Leaders handle writes; any replica can serve snapshot reads. Paxos provides consensus within a shard; TrueTime provides ordering across shards.
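
To make the division of labor concrete, here is a sketch building on the Replica class above. It is illustrative only: a real Paxos round involves prepare/accept phases, leader leases, and majority quorums, all elided here.

```python
import random

class PaxosGroup:
    """Toy shard: one Paxos-replicated group of replicas."""

    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas
        self.leader = replicas[0]  # all writes go through the leader

    def write(self, key: str, ts: float, value: object) -> None:
        # Stand-in for a consensus round: in reality the leader commits
        # once a majority accepts; here we simply apply everywhere.
        for r in self.replicas:
            r.apply(key, ts, value)

    def snapshot_read(self, key: str, t_read: float) -> object:
        # Any replica whose safe time has passed t_read can serve this.
        candidates = [r for r in self.replicas if r.t_safe >= t_read]
        return random.choice(candidates).snapshot_read(key, t_read)
```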

Semi-Relational Data Model

Unlike Bigtable's key-value model, Spanner supports a schema with typed columns, secondary indexes, and SQL queries. Tables can be hierarchically interleaved for locality. It's the bridge between NoSQL scale and relational features.
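
Interleaving can be illustrated with plain sorted keys, borrowing the paper's Users/Albums running example: encode each child row's key with its parent's key as a prefix, and a parent plus all of its children become adjacent in the sorted key space. The tuple encoding below is a simplification of the real row layout:

```python
def user_key(user_id: int) -> tuple:
    return ("Users", user_id)

def album_key(user_id: int, album_id: int) -> tuple:
    # Parent key as prefix: albums interleave under their user.
    return ("Users", user_id, "Albums", album_id)

rows = sorted([
    user_key(2), album_key(1, 11), user_key(1),
    album_key(2, 10), album_key(1, 10),
])
# Sorted order groups each user with its albums:
#   ('Users', 1), ('Users', 1, 'Albums', 10), ('Users', 1, 'Albums', 11),
#   ('Users', 2), ('Users', 2, 'Albums', 10)
```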

Synchronous Replication Across Continents

Spanner synchronously replicates across multiple datacenters (zones), surviving zone failures without data loss. Write latency is bounded by cross-zone RTT plus commit-wait, typically 10-100ms depending on geography.
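
A back-of-the-envelope version of that bound, with every number an assumption chosen only to make the arithmetic concrete:

```python
cross_zone_rtt = 0.080        # assumed 80ms round trip between distant zones
epsilon        = 0.007        # assumed clock-uncertainty bound
commit_wait    = 2 * epsilon  # ~14ms spent waiting out uncertainty

write_latency = cross_zone_rtt + commit_wait
print(f"~{write_latency * 1000:.0f}ms per synchronous global write")  # ~94ms
```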

Deep Dive

By 2012, Google had massive global systems built on Bigtable—but Bigtable had limitations:

- No cross-row transactions: Updating multiple rows atomically required complex application logic
- No strong consistency for reads: Applications dealt with eventual consistency or added synchronization
- No SQL interface: Developers wrote custom MapReduce jobs for queries

Meanwhile, traditional SQL databases couldn't scale globally:

- Single-master bottleneck: All writes go to one location
- Replication lag: Read replicas serve stale data
- 2PC across datacenters: Distributed transactions have unacceptable latency

Google needed a system that combined:

- Global distribution (like Bigtable)
- Strong consistency (like single-node SQL)
- ACID transactions across shards

The Gap Spanner Fills

The fundamental challenge is clock synchronization. In a single datacenter, we can use a single clock or tightly synchronized clocks. Globally, clocks drift. Network delays vary. How do you order transactions across continents?

Previous approaches:

- Logical clocks (Lamport): Provide causality but not real-time ordering (see the sketch after this list)
- Vector clocks: Track causality precisely but don't map to wall-clock time
- NTP: Reduces drift but can't bound uncertainty reliably
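
A tiny demonstration of the first limitation, using a two-server scenario invented for illustration: Lamport timestamps respect causality, but two servers that never communicate can disagree arbitrarily with real-time order.

```python
class LamportClock:
    """Logical clock: captures causality, not wall-clock order."""

    def __init__(self):
        self.t = 0

    def tick(self) -> int:
        # A local event: just advance the counter.
        self.t += 1
        return self.t

    def on_receive(self, msg_t: int) -> int:
        # Message delivery: jump past the sender's timestamp.
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
for _ in range(100):
    a.tick()        # server A is busy; its counter races ahead
later = b.tick()    # happens after all of A's events in real time...
assert later < a.t  # ...yet carries a smaller Lamport timestamp
```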

Spanner's answer: TrueTime—a novel API that exposes clock uncertainty and uses specialized hardware (GPS + atomic clocks) to minimize it.

Trade-offs

External Consistency
- Advantage: Transactions are globally ordered by real time: if T1 commits before T2 starts, T1.timestamp < T2.timestamp. Enables truly consistent global reads.
- Disadvantage: Commit-wait adds latency (average 2×ε ≈ 14ms). Every write pays this cost regardless of whether external consistency is needed.

TrueTime Dependency
- Advantage: Eliminates distributed coordination for ordering; time replaces communication. An elegant solution to a fundamental problem.
- Disadvantage: Requires specialized hardware (GPS receivers, atomic clocks) in every datacenter. Not practical outside Google or large cloud providers.

Synchronous Cross-Region Replication
- Advantage: Zero data loss on region failure. Reads from any replica see committed data. Strong durability guarantees.
- Disadvantage: Write latency includes cross-region RTT (50-150ms). Unsuitable for latency-critical writes to global data.

Paxos Groups per Shard
- Advantage: Each shard operates independently; failures and slowness are isolated. Scales by adding shards.
- Disadvantage: Cross-shard transactions require 2PC, adding latency and complexity. Hot shards become bottlenecks.

Semi-Relational Model
- Advantage: SQL queries, schemas, secondary indexes: familiar relational features. Interleaving enables efficient hierarchical access.
- Disadvantage: Less flexible than key-value stores. Schema changes, while non-blocking, require careful planning. Global indexes are expensive.
