System Design Masterclass
Infrastructure, distributed-systems, database, ACID, consensus, replication, advanced

Design Globally Distributed SQL Database

Design a database with ACID transactions across multiple data centers

Petabytes of data across continents | Similar to Google Spanner, CockroachDB, YugabyteDB, TiDB, Amazon Aurora | 45 min read

Summary

A globally distributed SQL database lets you store data across multiple data centers around the world while still giving you the guarantees of a traditional database - your transactions either fully succeed or fully fail (ACID). The hard parts are: making sure all copies of data agree on what the "truth" is (consensus), handling network delays between far-apart data centers, keeping the database fast even when computers are thousands of miles apart, and deciding what happens when part of the network goes down. Companies like Google (Spanner), CockroachDB, and YugabyteDB have built systems like this.

Key Takeaways

Core Problem

The main job is to let users read and write data as if it were in one place, even though the data is actually copied across data centers around the world. Every write must be seen by all readers in the same order.

The Hard Part

When data centers are far apart, messages take time to travel (a network round trip between New York and London takes roughly 70ms). We need all data centers to agree on the order of operations, but we cannot wait too long or the database becomes slow.

Scaling Axis

We split data into chunks called shards. Each shard is copied to multiple data centers. We can add more shards to store more data, but cross-shard transactions are slower because they need to coordinate.

Critical Invariant

Once a transaction is committed (saved), it must never be lost, even if a data center goes down. All transactions must appear to execute in a single, agreed-upon order - this is called serializability.

Performance Requirement

Reads should be fast (under 10 milliseconds) when reading from nearby data. Writes are slower (50-200 milliseconds) because we need agreement from multiple data centers before we say the write is done.

Key Tradeoff

We trade speed for safety. Waiting for multiple data centers to agree makes writes slower, but it means your data is safe even if an entire data center burns down.

Design Walkthrough

Problem Statement

The Question: Design a database system that stores data across multiple data centers around the world. It should support SQL queries and give the same guarantees as a traditional database (ACID transactions).

What the database needs to do (most important first):

  1. Store data reliably - Data must not be lost, even if an entire data center goes down (fire, power outage, earthquake).
  2. Support SQL queries - Users write normal SQL like SELECT, INSERT, UPDATE, DELETE. They should not need to know their data is spread across the world.
  3. ACID transactions - When a user saves data, it either fully saves or fully fails. No half-done operations. Multiple users reading the same data see the same thing.
  4. Low latency reads - When reading data, it should be fast (under 10ms) if the data is stored nearby.
  5. Automatic failover - If one data center goes down, another takes over automatically. Users should not notice.
  6. Horizontal scaling - We can add more machines to store more data and handle more requests.

What to say first

Let me first understand the requirements. Are we building a general-purpose SQL database, or is this for a specific use case like banking or e-commerce? How important is write speed versus read speed? Do we need all data centers to have all data, or can some data live only in certain regions?

What the interviewer really wants to see:

  • Do you understand the CAP theorem? (You cannot have perfect Consistency, Availability, and Partition tolerance all at once.)
  • Can you explain how distributed consensus works (like Raft or Paxos)?
  • Do you know why clocks are tricky in distributed systems?
  • Can you design a system that is correct first, then make it fast?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How much data and how many requests?

How much total data will this database store? How many reads and writes per second? Is it read-heavy (lots of reads, few writes) or write-heavy?

Why ask this: The design changes a lot based on scale. A database handling 100 writes/second is very different from one handling 1 million writes/second.

What interviewers usually say: Petabytes of data, millions of reads per second, thousands of writes per second. Read-heavy workload (100 reads for every 1 write).

How this changes your design: Since it is read-heavy, we can have many read replicas. Writes need consensus which is slower, but that is okay since writes are less common.

Question 2: How many data centers and where?

How many data centers do we need? Where are they located? Do we need data to be stored in specific regions for legal reasons (like EU data must stay in EU)?

Why ask this: More data centers means better availability but slower writes (more machines need to agree).

What interviewers usually say: 3-5 data centers across different continents. Some data may have regional requirements.

How this changes your design: With 3-5 data centers, we can use a 3-replica system where 2 out of 3 must agree. Regional requirements mean we need to track where data is allowed to live.

Question 3: What consistency level do we need?

Do all reads need to see the latest data (strong consistency)? Or is it okay if a read sometimes shows slightly old data (eventual consistency)?

Why ask this: Strong consistency means every read sees the most recent write. This is slower but simpler for developers. Eventual consistency is faster but harder to reason about.

What interviewers usually say: We need strong consistency for most operations. Some reads can be eventually consistent if it makes them faster.

How this changes your design: We need to support both modes. Strong reads go to the leader. Eventually consistent reads can go to any replica.

Question 4: What happens during a network partition?

If a data center gets cut off from the others (network failure), should it stop accepting writes (prioritize consistency) or keep working with possibly stale data (prioritize availability)?

Why ask this: This is the CAP theorem in action. You cannot have both perfect consistency AND availability during a network partition.

What interviewers usually say: Prioritize consistency. It is better to reject a write than to have conflicting data.

How this changes your design: We choose CP (Consistency over Availability). During a partition, minority partitions stop accepting writes. This is what Spanner and CockroachDB do.

Summarize your assumptions

Let me summarize: Petabytes of data across 3-5 data centers, millions of reads and thousands of writes per second, strong consistency by default with optional eventual consistency for fast reads, and we prioritize consistency over availability during network failures.

The Hard Part

Say this to the interviewer

The hardest part of a distributed database is making sure all copies of the data agree on what happened and in what order. This is called distributed consensus. It is hard because: messages between data centers can be slow, arrive out of order, or get lost entirely.

Why is distributed consensus hard? (explained simply)

  1. Messages are slow - A network round trip between New York and London takes roughly 70ms. If we need 3 round trips to agree, that is over 200ms just waiting for messages.
  2. Messages can get lost - A network cable can get cut. A router can fail. We cannot tell the difference between a slow message and a lost message.
  3. Clocks are not synchronized - Your computer in New York might say it is 10:00:00.000, but a computer in London might say 10:00:00.050. Which one is right? Neither! There is no perfect global clock.
  4. Machines can fail - A server can crash in the middle of an operation. When it comes back, it needs to figure out what happened.
  5. The order matters - If Alice sends $100 to Bob, then Bob sends $50 to Carol, we MUST process these in order. If we do them out of order, the account balances are wrong.

Common mistake candidates make

Many candidates say: just use timestamps to order operations - whoever has the earlier timestamp goes first. This is WRONG because clocks are not synchronized. Machine A might write at 10:00:00.001 and Machine B might write at 10:00:00.000, but Machine B's write actually happened later in real time. Using wall clock time leads to lost writes and data corruption.

Two main approaches to distributed consensus:

Approach 1: Leader-based consensus (Raft/Paxos)

  • One machine is the leader for each piece of data
  • All writes go through the leader
  • The leader tells followers about the write
  • When a majority of followers confirm, the write is committed
  • If the leader dies, followers elect a new leader
  • Used by: CockroachDB, TiDB, YugabyteDB

Approach 2: Synchronized clocks (Google Spanner)

  • Use special atomic clocks and GPS receivers to keep clocks synchronized
  • Clocks have a known error bound (within about 7ms of each other)
  • When you do a write, wait for the error bound to pass before confirming
  • This guarantees your write is truly after all earlier writes
  • Used by: Google Spanner (requires special hardware)

How Raft consensus works (simplified)

Understanding ACID in Distributed Systems

What is ACID? (the promises a database makes)

ACID stands for four promises the database makes to you:

Letter | What it means | Simple example | Why it is hard when distributed
A - Atomicity | All or nothing. A transaction either fully completes or fully fails. | You transfer $100 from Account A to B. Either both accounts change, or neither changes. Never just one. | If machine 1 succeeds but machine 2 fails, we need to undo machine 1's change.
C - Consistency | The database follows all the rules you set (like account balance cannot be negative). | If you set a rule that users must have a unique email, the database will never allow two users with the same email. | We need to check rules across multiple machines before allowing a write.
I - Isolation | Transactions do not interfere with each other. Running 10 transactions at once gives the same result as running them one by one. | If Alice and Bob both read their balance at the same time, they each see a consistent snapshot - not a half-updated state. | We need to coordinate reads and writes across machines to prevent seeing partial updates.
D - Durability | Once you get a success message, your data is safe forever (even if power goes out). | If the database says your payment was saved, it is saved. Even if the server crashes 1 second later. | Data must be written to multiple data centers before we say success.

The hardest one: Isolation

Isolation is the trickiest part. Imagine this scenario:

  • Alice's account has $100
  • Alice tries to send $80 to Bob
  • At the SAME TIME, Alice tries to send $80 to Carol

If both transactions read the balance ($100) at the same time, both see enough money. If both withdraw, Alice's account goes to -$60. This is the lost update problem.

How we prevent this: Before writing, we check if the data changed since we read it. If it changed, we retry the transaction.

TRANSACTION 1 (Alice sends $80 to Bob):
    Step 1: Read Alice balance = $100, version = 5
    Step 2: Calculate new balance = $100 - $80 = $20
+ 14 more lines...

Two types of locking

**Pessimistic locking**: Lock the row BEFORE reading. Others wait. Safer but slower.

**Optimistic locking**: Read without locking, check at write time. Faster but may need retries.

Distributed databases usually use optimistic locking because pessimistic locks across data centers are very slow.
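
As a rough sketch of the optimistic approach (the `kv` store and its `get`/`put_if_version` calls are hypothetical, not a real client API), a retry loop might look like this:

python
import time

class WriteConflict(Exception):
    """Raised when the row's version changed between our read and our write."""

def withdraw(kv, account_key, amount, max_retries=5):
    # Optimistic locking: read the row and its version, compute the new value,
    # then write back only if the version is unchanged; otherwise retry.
    for attempt in range(max_retries):
        row, version = kv.get(account_key)          # e.g. ({"balance": 100}, 5)
        if row["balance"] < amount:
            raise ValueError("insufficient funds")
        new_row = {**row, "balance": row["balance"] - amount}
        try:
            # Write only if the stored version still equals the one we read
            # (a compare-and-swap). A concurrent writer makes this fail.
            kv.put_if_version(account_key, new_row, expected_version=version)
            return new_row["balance"]
        except WriteConflict:
            time.sleep(0.01 * (attempt + 1))        # back off briefly, then retry
    raise WriteConflict("gave up after repeated write conflicts")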

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right approach.

What we are measuring | Number | What this means for our design
Total data size | 1 Petabyte (1000 TB) | Way too big for one machine. Must split data across thousands of machines.
Number of data centers | 5 (US-East, US-West, Europe, Asia, Australia) | Need consensus across continents. Minimum round-trip times of 50-200ms.
+ 6 more rows...

What to tell the interviewer

At this scale, we MUST shard the data - one machine cannot hold 1 PB. We will split data into thousands of shards (chunks). Each shard is replicated 3-5 times across different data centers. For reads, we can serve from any replica. For writes, we need consensus which takes 50-200ms due to network distance.

How people use the database (most common patterns):

  1. Point reads - Get one row by its primary key. Example: Get user with ID = 123. This is the most common and should be fastest.
  2. Range scans - Get multiple rows in a range. Example: Get all orders from January. Need good indexes.
  3. Single-shard writes - Update one row. Example: Update user's email. Needs consensus within that shard.
  4. Multi-shard transactions - Update rows in different shards. Example: Transfer money between accounts on different shards. This is SLOW because we need two-phase commit.
  5. Cross-region reads - Read data that lives in another continent. Slow due to network distance. Can cache if eventual consistency is okay.

How much storage per machine?
- 1 PB total data
- Each row stored 3 times (3 replicas) = 3 PB raw storage needed
+ 15 more lines...

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will design this with three main layers: (1) SQL layer that understands your queries, (2) Transaction layer that ensures ACID, and (3) Storage layer that actually stores and replicates data. Data is split into shards, and each shard has a leader that handles writes.

Distributed SQL Database Architecture

What each layer does:

Layer | What it does | Why it exists
SQL Gateway | Parses SQL queries, creates execution plan, routes to right shards | Hides complexity from users. They write normal SQL without knowing about shards.
Transaction Coordinator | Manages transactions across shards. Ensures atomicity (all or nothing). | If a transaction touches multiple shards, someone needs to coordinate them.
Metadata Service | Tracks which shard has which data. Updated when shards split or move. | Gateways need to know where to send each query.
Storage Nodes | Actually store data on disk. Run consensus (Raft) for replication. | The actual storage. Each shard has 1 leader + 2 followers.

How data is organized:

Data is split into ranges (also called shards or tablets). Each range covers a portion of the key space:

  • Range 1: All rows where primary key starts with A-M
  • Range 2: All rows where primary key starts with N-Z

Each range has:

  • One leader replica - handles all writes and strong reads
  • Two follower replicas - receive copies of writes, can serve eventually consistent reads

All three replicas run the Raft consensus protocol to agree on writes.

Data across multiple data centers

Why leaders are in different data centers

Notice that Shard 1's leader is in US-East, but Shard 2's leader is in US-West. We spread leaders across data centers so that: (1) if one data center dies, only some shards need to elect new leaders, and (2) write traffic is spread out instead of all going to one place.
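
As a small illustration of that placement idea (a simple round-robin assignment; real systems also balance on load and locality, not just rotation):

python
def assign_leaders(shard_ids, data_centers):
    # Spread shard leaders across data centers in round-robin order so that
    # losing one data center forces elections for only a fraction of shards
    # and write traffic is not concentrated in one place.
    return {shard: data_centers[i % len(data_centers)]
            for i, shard in enumerate(shard_ids)}

# assign_leaders(["shard-1", "shard-2", "shard-3"], ["us-east", "us-west", "eu"])
#   -> {"shard-1": "us-east", "shard-2": "us-west", "shard-3": "eu"}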

Data Model and Storage

Now let me explain how we actually store the data on disk and organize it for fast queries.

What to tell the interviewer

We use a key-value storage engine (like RocksDB) under the hood. SQL tables are mapped to key-value pairs. Data is sorted by primary key, which makes range queries fast. Each shard stores a continuous range of keys.

How SQL tables become key-value pairs:

SQL:

sql
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);

INSERT INTO users VALUES (1, 'Alice', 'alice@email.com');

Stored as key-value:

Key: /users/1
Value: {name: "Alice", email: "alice@email.com"}

The key includes the table name and primary key. Values are stored as a binary format (like Protocol Buffers).
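
As a tiny illustration of that mapping (using JSON here purely for readability; production engines use a compact binary encoding such as Protocol Buffers):

python
import json

def encode_row(table, primary_key, columns):
    # Key: table name plus primary key. Value: the remaining columns.
    key = f"/{table}/{primary_key}"
    value = json.dumps(columns)
    return key, value

# encode_row("users", 1, {"name": "Alice", "email": "alice@email.com"})
#   -> ("/users/1", '{"name": "Alice", "email": "alice@email.com"}')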

SQL Query | Key-Value Operation | Why this works
SELECT * FROM users WHERE id = 5 | GET /users/5 | Direct key lookup - very fast O(1)
SELECT * FROM users WHERE id BETWEEN 10 AND 20 | SCAN /users/10 to /users/20 | Keys are sorted, so range scan is efficient
SELECT * FROM users WHERE name = 'Alice' | Full table scan OR use index | Without index, must scan all keys in table
INSERT INTO users VALUES (6, ...) | PUT /users/6 = ... | Insert is just a key-value put
UPDATE users SET name = 'Bob' WHERE id = 5 | PUT /users/5 = (new value) | Update reads then writes the key

How indexes work:

When you create an index, we store an extra key-value pair that maps the indexed column to the primary key:

sql
CREATE INDEX idx_email ON users(email);

This creates additional keys:

Key: /users/idx_email/alice@email.com -> 1
Key: /users/idx_email/bob@email.com -> 2

Now finding a user by email is:

  1. Look up /users/idx_email/alice@email.com -> get ID = 1
  2. Look up /users/1 -> get the full row
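
A minimal sketch of that two-step lookup, assuming a hypothetical key-value client with a `get` call:

python
def find_user_by_email(kv, email):
    # Step 1: the index entry maps the email to the primary key.
    user_id = kv.get(f"/users/idx_email/{email}")    # e.g. -> 1
    if user_id is None:
        return None                                   # no such email
    # Step 2: the primary key points at the full row.
    return kv.get(f"/users/{user_id}")                # e.g. -> {"name": "Alice", ...}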

How sharding works:

We split the key space into ranges. Each range is one shard:

Shard 1: All keys from /users/1 to /users/1000000
Shard 2: All keys from /users/1000001 to /users/2000000
Shard 3: All keys from /users/2000001 to /users/3000000

The metadata service keeps track of these ranges:

Shard Map:
  Shard 1: keys [/users/1, /users/1000000], leader = us-east-node-5
  Shard 2: keys [/users/1000001, /users/2000000], leader = us-west-node-3
  Shard 3: keys [/users/2000001, /users/3000000], leader = eu-node-7
FUNCTION route_query(query):
    // Example query: SELECT * FROM users WHERE id = 1500000
    
+ 16 more lines...

What if a query touches multiple shards?

If you query `SELECT * FROM users WHERE id > 500000 AND id < 1500000`, this touches Shard 1 AND Shard 2. The gateway must: (1) send sub-queries to both shards, (2) combine the results, (3) if it is a write transaction, use two-phase commit to ensure atomicity.
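
A rough sketch of that scatter-gather step for a read-only range query (the shard map fields and the `shard.scan` call are assumptions for illustration):

python
def range_query(shard_map, start_key, end_key):
    # 1. Find every shard whose key range overlaps [start_key, end_key).
    shards = [s for s in shard_map
              if s.end_key > start_key and s.start_key < end_key]

    # 2. Send a sub-query to each shard, clipped to that shard's range.
    partial_results = []
    for shard in shards:
        lo = max(start_key, shard.start_key)
        hi = min(end_key, shard.end_key)
        partial_results.append(shard.scan(lo, hi))   # network call to the shard leader

    # 3. Merge: keys are sorted within each shard and shards cover disjoint
    #    ranges, so concatenating in shard order keeps the global sort order.
    return [row for part in partial_results for row in part]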

Consensus Deep Dive - How Raft Works

Let me explain step by step how Raft consensus makes sure all replicas agree on writes.

What to tell the interviewer

Raft is a consensus algorithm that ensures all replicas agree on the same sequence of operations. It works by: (1) electing a leader, (2) leader receives all writes, (3) leader sends writes to followers, (4) when majority confirms, write is committed.

The three states in Raft:

  1. Leader - The boss. Receives all writes, sends them to followers. Only one leader per shard at a time.
  2. Follower - Receives commands from the leader, stores them, replies "got it". If the leader dies, it might become a candidate.
  3. Candidate - A follower that wants to become leader. Asks other nodes for votes.

Raft state transitions

How a write works with Raft:

Example: User wants to update their email to "new@email.com"
Shard has 3 nodes: Node A (Leader), Node B (Follower), Node C (Follower)
+ 31 more lines...

What happens when the leader dies?

Scenario: Node A (Leader) crashes. Node B and C are followers.

STEP 1: Followers notice leader is gone
+ 31 more lines...

Why do we need majority?

With 3 nodes, majority = 2. With 5 nodes, majority = 3. We need majority because: (1) only one group can have majority, preventing split-brain (two leaders), (2) any two majorities overlap, so a new leader will have all committed data from the old leader.
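
A tiny sketch of the quorum arithmetic, just to make the overlap argument concrete:

python
def majority(n_replicas):
    # Smallest group size that is more than half of the replicas.
    return n_replicas // 2 + 1

# With 3 replicas we need 2 acks; with 5 replicas we need 3.
assert majority(3) == 2 and majority(5) == 3

def quorums_overlap(n_replicas):
    # Two groups of majority size must share at least one node, so a newly
    # elected leader always overlaps the group that acknowledged the last
    # committed write - it cannot miss committed data.
    return 2 * majority(n_replicas) > n_replicas

assert quorums_overlap(3) and quorums_overlap(5)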

Two-Phase Commit - Transactions Across Shards

When a transaction needs to update data in multiple shards, we need a way to make it atomic (all or nothing). This is where two-phase commit (2PC) comes in.

Say this to the interviewer

Two-phase commit is needed when a transaction touches multiple shards. It works in two phases: (1) PREPARE - ask all shards if they can commit, (2) COMMIT - if all said yes, tell all to commit. The downside is it is slow (multiple round trips) and blocks if the coordinator crashes.

Example scenario:

Alice wants to transfer $100 from her account to Bob's account.

  • Alice's account is on Shard 1
  • Bob's account is on Shard 2

We need to:

  1. Subtract $100 from Alice (Shard 1)
  2. Add $100 to Bob (Shard 2)

Both MUST succeed or both MUST fail. If only one succeeds, money disappears or appears from nowhere!

Two-Phase Commit

FUNCTION transfer_money(from_account, to_account, amount):
    // This transaction touches two shards
    
+ 38 more lines...

What happens if something fails?

Failure | When | What happens
Shard says NO to PREPARE | Phase 1 | Abort all. No data changed. Safe.
Coordinator crashes after PREPARE, before COMMIT | Between phases | Shards are stuck with locks! After timeout, check coordinator's log to find decision.
Shard crashes after PREPARE | Phase 1-2 | When shard recovers, it checks coordinator for the decision.
Network dies after COMMIT sent | Phase 2 | Shard may not know to commit. Coordinator retries. Commits are idempotent.

The blocking problem with 2PC

If the coordinator crashes AFTER sending PREPARE but BEFORE sending COMMIT, all shards are stuck holding locks. They cannot commit (what if coordinator decided to abort?) and cannot abort (what if coordinator decided to commit?). Modern systems solve this by: (1) making the coordinator itself replicated with Raft, (2) using timeouts and having shards query the coordinator's replicas.
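
A rough sketch of the participant-side recovery rule described above (the `get_decision` lookup and the local store calls are hypothetical):

python
def recover_prepared_transaction(txn_id, local_store, coordinator_replicas):
    # A shard stuck in the PREPARED state cannot decide on its own: it must
    # learn the coordinator's decision. Because the coordinator is replicated
    # with Raft, some replica should eventually know the outcome.
    for coordinator in coordinator_replicas:
        decision = coordinator.get_decision(txn_id)   # "COMMIT", "ABORT", or None
        if decision == "COMMIT":
            local_store.commit(txn_id)    # apply the prepared write, release locks
            return
        if decision == "ABORT":
            local_store.abort(txn_id)     # undo the prepared write, release locks
            return
    # No replica knows the outcome yet: keep holding locks and ask again later.
    local_store.schedule_retry(txn_id)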

Avoid cross-shard transactions when possible

2PC is slow (4 message round trips minimum) and complex. Good schema design puts related data in the same shard. For example: all of a user's data (profile, settings, orders) shares the same shard key (user_id), so transactions on one user's data do not need 2PC.
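
One way to picture this (the key layout is illustrative, not any particular product's format): keys sort lexicographically and each shard covers a contiguous key range, so rows that share the /users/123 prefix land next to each other:

Key: /users/123                  -> Alice's profile row
Key: /users/123/settings/theme   -> one of Alice's settings
Key: /users/123/orders/9001      -> one of Alice's orders

A transaction that touches only Alice's rows therefore usually stays within a single shard and avoids 2PC.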

Clock Synchronization - The Time Problem

One of the trickiest parts of distributed databases is dealing with time. Let me explain why clocks matter and how to handle them.

Say this to the interviewer

Clocks in distributed systems are NEVER perfectly synchronized. A computer in New York might think it is 10:00:00.000, while a computer in Tokyo thinks it is 10:00:00.123. If we use timestamps to order operations, we might order them wrong. There are two solutions: (1) do not use timestamps, use logical clocks, or (2) use special hardware to bound clock drift (like Google Spanner).

Why do we care about time?

Imagine this scenario:

  1. Alice in New York writes: SET balance = 100 at 10:00:00.000 (NY time)
  2. Bob in Tokyo reads: GET balance at 10:00:00.050 (Tokyo time)

Should Bob see Alice's write? It depends on whether Alice's write REALLY happened before Bob's read. But if clocks are not synchronized:

  • The NY clock might be 100ms ahead of the Tokyo clock
  • So Bob's read at 10:00:00.050 Tokyo time might actually be BEFORE Alice's write at 10:00:00.000 NY time in real time!

Using wall clock timestamps leads to wrong answers.

Solution 1: Logical clocks (most databases use this)

Instead of using real time, we use counters that only go up:

  • Every operation gets a sequence number
  • If operation A causes operation B, then A's number < B's number
  • We do not care about real time, just the order

Each node has a counter that starts at 0.

RULES:
+ 21 more lines...
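
A minimal Lamport-clock sketch of those rules (simplified; systems like CockroachDB actually use hybrid logical clocks that also fold in physical time):

python
class LamportClock:
    """Minimal logical clock: a per-node counter that only moves forward."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Any local operation (a write, a commit) bumps the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # Attach the current counter value to every outgoing message.
        return self.local_event()

    def receive(self, remote_counter):
        # On receipt, jump ahead of whatever the sender had seen, so the
        # receiver's later events are ordered after the sender's.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter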

Solution 2: TrueTime (Google Spanner's approach)

Google uses special hardware (atomic clocks and GPS receivers) to keep clocks synchronized within a known error bound (usually around 7ms).

Instead of saying "it is exactly 10:00:00.000", the clock says "it is between 10:00:00.000 and 10:00:00.007".

When you do a write:

  1. Record the current time interval [earliest, latest]
  2. Wait until you are SURE the write timestamp is in the past
  3. This means waiting for the error bound (about 7ms)

This guarantees that if write A completes before write B starts, A's timestamp < B's timestamp.
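
A sketch of that commit-wait rule (the interval-returning clock below is modeled on the published description of TrueTime; the 7ms bound is illustrative):

python
import time
from dataclasses import dataclass

@dataclass
class TimeInterval:
    earliest: float   # the true time is guaranteed to be >= earliest
    latest: float     # ... and <= latest

def clock_now(error_bound_s=0.007):
    # Stand-in for a TrueTime-style clock: wall time plus a known error bound.
    t = time.time()
    return TimeInterval(t - error_bound_s, t + error_bound_s)

def commit(apply_write):
    # 1. Pick the commit timestamp at the upper edge of the uncertainty window.
    commit_ts = clock_now().latest
    apply_write(commit_ts)
    # 2. Commit wait: acknowledge only once every correct clock agrees that
    #    commit_ts is in the past, i.e. the earliest bound has moved past it.
    while clock_now().earliest <= commit_ts:
        time.sleep(0.001)
    return commit_ts   # now it is safe to report success to the client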

Google TrueTime: commit wait

Which approach to recommend?

Most companies should use logical clocks (like CockroachDB, TiDB, YugabyteDB do). TrueTime requires special hardware that only Google has in every data center. For interview purposes, explain both approaches and say you would use logical clocks unless you have Google's infrastructure.

Read and Write Paths

Let me walk through exactly what happens when you read or write data.

The Write Path (step by step):

User runs: UPDATE users SET email = "new@email.com" WHERE id = 123

STEP 1: SQL Gateway receives the query
+ 34 more lines...

The Read Path (two options):

Read Type | Where it goes | Speed | Consistency | When to use
Strong read | Always goes to leader | 50-200ms if leader is far away | Always sees latest data | When you MUST have latest data (checking bank balance before transfer)
Stale read | Can go to any replica, including nearby ones | 1-10ms to nearby replica | Might see slightly old data (milliseconds to seconds old) | When slightly old data is okay (showing user profile, search results)

STRONG READ (must see latest data):

User runs: SELECT balance FROM accounts WHERE id = 123
+ 29 more lines...

Follower reads with read leases

Some systems let followers serve strong reads using 'read leases'. The leader grants a lease saying 'you can serve reads for the next 10 seconds, and I promise not to commit any new writes without telling you first'. This makes strong reads faster because you can go to a nearby follower instead of the leader.
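
A rough sketch of the follower-side lease check (the lease fields and the 10-second duration are illustrative assumptions):

python
import time

class FollowerLease:
    """Illustrative lease that lets a follower answer strong reads locally."""

    def __init__(self):
        self.expires_at = 0.0   # monotonic deadline set when the leader grants a lease

    def grant(self, duration_s=10.0):
        # While the lease is live, the leader promises to tell this follower
        # about every new commit before acknowledging it to clients.
        self.expires_at = time.monotonic() + duration_s

    def is_valid(self):
        return time.monotonic() < self.expires_at

def read(key, local_storage, lease, leader):
    if lease.is_valid():
        return local_storage.get(key)   # fast, local strong read
    return leader.read(key)             # lease expired: fall back to the leader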

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

What breaks | What happens to users | How we handle it | Recovery time
One node crashes | Shard still works (2 of 3 replicas alive) | Raft continues with remaining nodes. Dead node recovers and catches up. | 0 downtime for users. Node catches up in minutes.
Leader crashes | Writes to that shard pause briefly | Followers elect new leader automatically | 300-500ms to elect new leader
+ 5 more rows...

Handling network partitions:

A network partition is when some data centers cannot talk to others. This is the scariest failure because it creates the "split-brain" problem: two groups each think they are the only ones alive.

Network partition scenario

SCENARIO: 3 replicas, network cuts Europe off from US

Partition A (US-East + US-West): 2 nodes
+ 37 more lines...

This is CP from CAP theorem

By requiring majority for writes, we choose Consistency over Availability. During a partition, the minority side becomes unavailable (rejects writes). This is the right choice for a database - we never want two conflicting writes to both succeed.

Growing the System Over Time

What to tell the interviewer

The system grows by: (1) adding more shards when data grows, (2) adding more replicas when read traffic grows, (3) adding more data centers for global coverage. Shards can split automatically when they get too big.

How to handle growing data: Shard splitting

When a shard gets too big (say, over 10GB), we split it into two smaller shards:

BEFORE: Shard 1 covers keys A-Z (getting too big, 15GB)

STEP 1: Pick a split point
+ 26 more lines...

How to handle growing traffic: Add replicas

If a shard is getting too many reads:

  1. Add more follower replicas
  2. Spread read traffic across all followers
  3. The leader still handles writes and strong reads

How to handle global users: Add data centers

To reduce latency for users in new regions:

  1. Add a new data center in that region
  2. Place follower replicas there
  3. Local users read from local followers
  4. For latency-sensitive data, move the leader to that region

Scaling from 3 to 5 data centers

More replicas = slower writes

With 3 replicas, we need 2 to agree (majority). With 5 replicas, we need 3 to agree. More replicas means waiting for more acknowledgments, which can slow writes. The tradeoff: more replicas = better durability and read scalability, but slower writes.

Interview Tips and Common Questions

Questions interviewers often ask:

Question | Good answer
Why not just use timestamps? | Computer clocks are not synchronized. A write that happened later might have an earlier timestamp. This leads to lost updates. We use logical clocks or synchronized clocks with bounded error (TrueTime).
How do you handle a write that touches 1000 rows across 100 shards? | This is very slow! 2PC with 100 participants is a bad idea. Better design: batch the writes, use eventual consistency, or redesign schema so related data is on the same shard.
+ 4 more rows...

What makes a great answer

  1. Start with the simplest correct solution, then optimize.
  2. Acknowledge tradeoffs explicitly ("we are trading latency for consistency").
  3. Reference real systems (Spanner, CockroachDB, TiDB).
  4. Know when NOT to use a distributed database ("if your data fits on one machine, do not distribute it").

Red flags interviewers look for:

  1. Ignoring network latency - "We will just synchronize all writes instantly." No, light takes time to travel!
  2. Trusting wall clocks - "We will use timestamps to order events." This is wrong.
  3. Ignoring failure modes - "We will handle failures later." Failures are the core challenge.
  4. Over-engineering - "Let us add 10 data centers and shard by user, region, and time." Start simple.
  5. Not knowing CAP theorem - You cannot have perfect consistency, availability, and partition tolerance. Know which one you are giving up.

Real systems to study

  • **Google Spanner**: The original globally distributed SQL database. Uses TrueTime. Read the paper!
  • **CockroachDB**: Open-source Spanner-like database. Uses Raft + hybrid logical clocks.
  • **TiDB**: MySQL-compatible distributed database. Uses Raft.
  • **YugabyteDB**: PostgreSQL-compatible distributed database. Uses Raft.

Design Trade-offs

The simplest alternative to this design is a single-region deployment (no global distribution). Its trade-offs:

Advantages

  • Simple to operate
  • Low latency within region
  • No cross-region coordination needed

Disadvantages

  • Single region failure = total outage
  • High latency for users far from the region
  • Does not meet disaster recovery requirements

When to use

Use only for development or when you truly do not need global distribution. Upgrade to multi-region when you need disaster recovery.