Tags: distributed-systems, databases, google, transactions, consistency, truetime, global-scale, advanced

Spanner: Google's Globally Distributed Database

The first system to achieve external consistency at global scale using synchronized clocks and TrueTime

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford. Google, 2012. 40 min read.

Summary

Spanner is Google's globally distributed database that provides external consistency (linearizability) for distributed transactions while maintaining high availability. The breakthrough is TrueTime—an API that exposes clock uncertainty, enabling Spanner to order transactions globally without communication. By waiting out clock uncertainty before committing, Spanner guarantees that if transaction T1 commits before T2 starts, T1's timestamp is less than T2's. This seemingly simple property enables globally consistent reads and true ACID transactions across continents.

Key Takeaways

TrueTime: Bounded Clock Uncertainty

Instead of pretending clocks are perfectly synchronized (they are not), TrueTime exposes the uncertainty as an interval [earliest, latest]. Spanner waits out this uncertainty before committing, guaranteeing correct ordering. GPS receivers and atomic clocks typically keep the uncertainty under 7ms.
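
The paper presents TrueTime as three calls: TT.now(), which returns an interval, plus TT.after(t) and TT.before(t). Below is a minimal Python sketch of that shape; the fixed 7ms epsilon and the use of the local clock are illustrative assumptions, since in production epsilon is derived continuously from GPS and atomic-clock time masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # true time is guaranteed to be <= latest

class TrueTime:
    """Toy stand-in for the TrueTime API (TT.now / TT.after / TT.before)."""

    def __init__(self, epsilon: float = 0.007):
        # Assumed fixed 7ms uncertainty bound; the real system computes
        # epsilon dynamically from its time masters.
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()  # local clock; a stand-in for the real time source
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        # True only once t has definitely passed on every correct clock.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only while t has definitely not yet arrived.
        return self.now().latest < t
```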

External Consistency Without Global Coordination

Traditional distributed transactions require coordination (such as two-phase commit), which adds latency. Spanner's commit-wait protocol uses time instead of communication: if you wait long enough, you are guaranteed to be ordered after all concurrent transactions.
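
Continuing the TrueTime sketch above, the commit-wait rule is short: take the timestamp at the top of the current uncertainty interval, then block until that timestamp is provably in the past before making writes visible. Here apply_writes is a hypothetical placeholder for Paxos replication and lock release:

```python
import time

def commit(tt: TrueTime, apply_writes) -> float:
    # Pick the commit timestamp at the top of the uncertainty window.
    s = tt.now().latest
    # Commit wait: block until s has definitely passed in real time.
    # Any transaction that starts after this returns will pick its own
    # timestamp as TT.now().latest >= true time > s, so ordering holds.
    while not tt.after(s):
        time.sleep(0.001)
    apply_writes(s)  # make the writes visible at timestamp s
    return s
```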

Lock-Free Snapshot Reads

Any read at a past timestamp can execute on any replica without locks or coordination. The timestamp guarantees a consistent snapshot. This enables fast, globally distributed read replicas.
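
A toy model of why no locks are needed: each replica keeps timestamped versions per key, plus a safe time up to which its state is known to be complete, and a read at t_read succeeds on any replica whose safe time has passed t_read. The paper's safe-time computation is more involved; the data structures here are invented for illustration:

```python
class Replica:
    """Toy MVCC replica: versions per key, applied in timestamp order."""

    def __init__(self):
        self.versions: dict[str, list[tuple[float, object]]] = {}
        self.t_safe = 0.0  # all writes with ts <= t_safe are applied here

    def apply(self, key: str, ts: float, value: object) -> None:
        # Writes arrive from the replication log in increasing timestamp order.
        self.versions.setdefault(key, []).append((ts, value))
        self.t_safe = max(self.t_safe, ts)

    def snapshot_read(self, key: str, t_read: float) -> object:
        # Lock-free: the timestamp alone pins a consistent snapshot.
        if t_read > self.t_safe:
            raise RuntimeError("replica not caught up; route elsewhere")
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= t_read:
                result = value  # newest version at or before t_read
            else:
                break
        return result
```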

Paxos Groups for Replication

Data is sharded across directories, each replicated via Paxos for fault tolerance. Leaders handle writes; any replica can serve snapshot reads. Paxos provides consensus within a shard; TrueTime provides ordering across shards.
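
To make the division of labor concrete, here is a sketch building on the Replica class above. It is illustrative only: a real Paxos round involves prepare/accept phases, leader leases, and majority quorums, all elided here.

```python
import random

class PaxosGroup:
    """Toy shard: one Paxos-replicated group of replicas."""

    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas
        self.leader = replicas[0]  # all writes go through the leader

    def write(self, key: str, ts: float, value: object) -> None:
        # Stand-in for a consensus round: in reality the leader commits
        # once a majority accepts; here we simply apply everywhere.
        for r in self.replicas:
            r.apply(key, ts, value)

    def snapshot_read(self, key: str, t_read: float) -> object:
        # Any replica whose safe time has passed t_read can serve this.
        candidates = [r for r in self.replicas if r.t_safe >= t_read]
        return random.choice(candidates).snapshot_read(key, t_read)
```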

Semi-Relational Data Model

Unlike Bigtable's key-value model, Spanner supports a schema with typed columns, secondary indexes, and SQL queries. Tables can be hierarchically interleaved for locality. It's the bridge between NoSQL scale and relational features.
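
Interleaving can be illustrated with plain sorted keys, borrowing the paper's Users/Albums running example: encode each child row's key with its parent's key as a prefix, and a parent plus all of its children become adjacent in the sorted key space. The tuple encoding below is a simplification of the real row layout:

```python
def user_key(user_id: int) -> tuple:
    return ("Users", user_id)

def album_key(user_id: int, album_id: int) -> tuple:
    # Parent key as prefix: albums interleave under their user.
    return ("Users", user_id, "Albums", album_id)

rows = sorted([
    user_key(2), album_key(1, 11), user_key(1),
    album_key(2, 10), album_key(1, 10),
])
# Sorted order groups each user with its albums:
#   ('Users', 1), ('Users', 1, 'Albums', 10), ('Users', 1, 'Albums', 11),
#   ('Users', 2), ('Users', 2, 'Albums', 10)
```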

Synchronous Replication Across Continents

Spanner synchronously replicates across multiple datacenters (zones), surviving zone failures without data loss. Write latency is bounded by cross-zone RTT plus commit-wait, typically 10-100ms depending on geography.
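
A back-of-the-envelope version of that bound, with every number an assumption chosen only to make the arithmetic concrete:

```python
cross_zone_rtt = 0.080        # assumed 80ms round trip between distant zones
epsilon        = 0.007        # assumed clock-uncertainty bound
commit_wait    = 2 * epsilon  # ~14ms spent waiting out uncertainty

write_latency = cross_zone_rtt + commit_wait
print(f"~{write_latency * 1000:.0f}ms per synchronous global write")  # ~94ms
```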

Deep Dive

By 2012, Google had massive global systems built on Bigtable—but Bigtable had limitations:

- No cross-row transactions: Updating multiple rows atomically required complex application logic
- No strong consistency for reads: Applications dealt with eventual consistency or added synchronization
- No SQL interface: Developers wrote custom MapReduce jobs for queries

Meanwhile, traditional SQL databases couldn't scale globally:

- Single-master bottleneck: All writes go to one location
- Replication lag: Read replicas serve stale data
- 2PC across datacenters: Distributed transactions have unacceptable latency

Google needed a system that combined:

- Global distribution (like Bigtable)
- Strong consistency (like single-node SQL)
- ACID transactions across shards

The Gap Spanner Fills

The fundamental challenge is clock synchronization. In a single datacenter, we can use a single clock or tightly synchronized clocks. Globally, clocks drift. Network delays vary. How do you order transactions across continents?

Previous approaches:

- Logical clocks (Lamport): Provide causality but not real-time ordering (see the sketch after this list)
- Vector clocks: Track causality precisely but don't map to wall-clock time
- NTP: Reduces drift but can't bound uncertainty reliably
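
A tiny demonstration of the first limitation, using a two-server scenario invented for illustration: Lamport timestamps respect causality, but two servers that never communicate can disagree arbitrarily with real-time order.

```python
class LamportClock:
    """Logical clock: captures causality, not wall-clock order."""

    def __init__(self):
        self.t = 0

    def tick(self) -> int:
        # A local event: just advance the counter.
        self.t += 1
        return self.t

    def on_receive(self, msg_t: int) -> int:
        # Message delivery: jump past the sender's timestamp.
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
for _ in range(100):
    a.tick()        # server A is busy; its counter races ahead
later = b.tick()    # happens after all of A's events in real time...
assert later < a.t  # ...yet carries a smaller Lamport timestamp
```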

Spanner's answer: TrueTime—a novel API that exposes clock uncertainty and uses specialized hardware (GPS + atomic clocks) to minimize it.

Trade-offs

External Consistency
- Advantage: Transactions are globally ordered by real time: if T1 commits before T2 starts, T1.timestamp < T2.timestamp. Enables truly consistent global reads.
- Disadvantage: Commit-wait adds latency (average 2×ε ≈ 14ms). Every write pays this cost regardless of whether external consistency is needed.

TrueTime Dependency
- Advantage: Eliminates distributed coordination for ordering; time replaces communication. An elegant solution to a fundamental problem.
- Disadvantage: Requires specialized hardware (GPS receivers, atomic clocks) in every datacenter. Not practical outside Google or large cloud providers.

Synchronous Cross-Region Replication
- Advantage: Zero data loss on region failure. Reads from any replica see committed data. Strong durability guarantees.
- Disadvantage: Write latency includes cross-region RTT (50-150ms). Unsuitable for latency-critical writes to global data.

Paxos Groups per Shard
- Advantage: Each shard operates independently; failures and slowness are isolated. Scales by adding shards.
- Disadvantage: Cross-shard transactions require 2PC, adding latency and complexity. Hot shards become bottlenecks.

Semi-Relational Model
- Advantage: SQL queries, schemas, secondary indexes: familiar relational features. Interleaving enables efficient hierarchical access.
- Disadvantage: Less flexible than key-value stores. Schema changes, while non-blocking, require careful planning. Global indexes are expensive.
