Design Walkthrough
Problem Statement
The Question: Design a database system that stores data across multiple data centers around the world. It should support SQL queries and give the same guarantees as a traditional database (ACID transactions).
What the database needs to do (most important first):
1. Store data reliably - Data must not be lost, even if an entire data center goes down (fire, power outage, earthquake).
2. Support SQL queries - Users write normal SQL like SELECT, INSERT, UPDATE, DELETE. They should not need to know their data is spread across the world.
3. ACID transactions - When a user saves data, it either fully saves or fully fails. No half-done operations. Multiple users reading the same data see the same thing.
4. Low latency reads - Reads should be fast (under 10ms) when the data is stored nearby.
5. Automatic failover - If one data center goes down, another takes over automatically. Users should not notice.
6. Horizontal scaling - We can add more machines to store more data and handle more requests.
What to say first
Let me first understand the requirements. Are we building a general-purpose SQL database, or is this for a specific use case like banking or e-commerce? How important is write speed versus read speed? Do we need all data centers to have all data, or can some data live only in certain regions?
What the interviewer really wants to see:
- Do you understand the CAP theorem? (You cannot have perfect Consistency, Availability, and Partition tolerance all at once.)
- Can you explain how distributed consensus works (like Raft or Paxos)?
- Do you know why clocks are tricky in distributed systems?
- Can you design a system that is correct first, then make it fast?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How much data and how many requests?
How much total data will this database store? How many reads and writes per second? Is it read-heavy (lots of reads, few writes) or write-heavy?
Why ask this: The design changes a lot based on scale. A database handling 100 writes/second is very different from one handling 1 million writes/second.
What interviewers usually say: Petabytes of data, millions of reads per second, thousands of writes per second. Read-heavy workload (100 reads for every 1 write).
How this changes your design: Since it is read-heavy, we can have many read replicas. Writes need consensus which is slower, but that is okay since writes are less common.
Question 2: How many data centers and where?
How many data centers do we need? Where are they located? Do we need data to be stored in specific regions for legal reasons (like EU data must stay in EU)?
Why ask this: More data centers means better availability but slower writes (more machines need to agree).
What interviewers usually say: 3-5 data centers across different continents. Some data may have regional requirements.
How this changes your design: With 3-5 data centers, we can use a 3-replica system where 2 out of 3 must agree. Regional requirements mean we need to track where data is allowed to live.
Question 3: What consistency level do we need?
Do all reads need to see the latest data (strong consistency)? Or is it okay if a read sometimes shows slightly old data (eventual consistency)?
Why ask this: Strong consistency means every read sees the most recent write. This is slower but simpler for developers. Eventual consistency is faster but harder to reason about.
What interviewers usually say: We need strong consistency for most operations. Some reads can be eventually consistent if it makes them faster.
How this changes your design: We need to support both modes. Strong reads go to the leader. Eventually consistent reads can go to any replica.
Question 4: What happens during a network partition?
If a data center gets cut off from the others (network failure), should it stop accepting writes (prioritize consistency) or keep working with possibly stale data (prioritize availability)?
Why ask this: This is the CAP theorem in action. You cannot have both perfect consistency AND availability during a network partition.
What interviewers usually say: Prioritize consistency. It is better to reject a write than to have conflicting data.
How this changes your design: We choose CP (Consistency over Availability). During a partition, minority partitions stop accepting writes. This is what Spanner and CockroachDB do.
Summarize your assumptions
Let me summarize: Petabytes of data across 3-5 data centers, millions of reads and thousands of writes per second, strong consistency by default with optional eventual consistency for fast reads, and we prioritize consistency over availability during network failures.
The Hard Part
Say this to the interviewer
The hardest part of a distributed database is making sure all copies of the data agree on what happened and in what order. This is called distributed consensus. It is hard because: messages between data centers can be slow, arrive out of order, or get lost entirely.
Why is distributed consensus hard? (explained simply)
1. Messages are slow - A round trip between New York and London takes roughly 70ms on real networks. If we need 3 round trips to agree, that is over 200ms spent just waiting for messages.
2. Messages can get lost - A network cable can get cut. A router can fail. We cannot tell the difference between a slow message and a lost message.
3. Clocks are not synchronized - Your computer in New York might say it is 10:00:00.000, but a computer in London might say 10:00:00.050. Which one is right? Neither! There is no perfect global clock.
4. Machines can fail - A server can crash in the middle of an operation. When it comes back, it needs to figure out what happened.
5. The order matters - If Alice sends $100 to Bob, then Bob sends $50 to Carol, we MUST process these in order. If we do them out of order, the account balances are wrong.
Common mistake candidates make
Many candidates say: just use timestamps to order operations - whoever has the earlier timestamp goes first. This is WRONG because clocks are not synchronized. Machine A might write at 10:00:00.001 and Machine B might write at 10:00:00.000, but Machine B's write actually happened later in real time. Using wall clock time leads to lost writes and data corruption.
Two main approaches to distributed consensus:
Approach 1: Leader-based consensus (Raft/Paxos)
- One machine is the leader for each piece of data
- All writes go through the leader
- The leader tells followers about the write
- When a majority of the replicas confirm (the leader plus enough followers), the write is committed
- If the leader dies, followers elect a new leader
- Used by: CockroachDB, TiDB, YugabyteDB
Approach 2: Synchronized clocks (Google Spanner)
- Use special atomic clocks and GPS receivers to keep clocks synchronized
- Clocks have a known error bound (within about 7ms of each other)
- When you do a write, wait for the error bound to pass before confirming
- This guarantees your write is truly after all earlier writes
- Used by: Google Spanner (requires special hardware)
How Raft consensus works (simplified)
Understanding ACID in Distributed Systems
What is ACID? (the promises a database makes)
ACID stands for four promises the database makes to you:
| Letter | What it means | Simple example | Why it is hard when distributed |
|---|---|---|---|
| A - Atomicity | All or nothing. A transaction either fully completes or fully fails. | You transfer $100 from Account A to B. Either both accounts change, or neither changes. Never just one. | If machine 1 succeeds but machine 2 fails, we need to undo machine 1's change. |
| C - Consistency | The database follows all the rules you set (like account balance cannot be negative). | If you set a rule that users must have a unique email, the database will never allow two users with the same email. | We need to check rules across multiple machines before allowing a write. |
| I - Isolation | Transactions do not interfere with each other. Running 10 transactions at once gives the same result as running them one by one. | If Alice and Bob both read their balance at the same time, they each see a consistent snapshot - not a half-updated state. | We need to coordinate reads and writes across machines to prevent seeing partial updates. |
| D - Durability | Once you get a success message, your data is safe forever (even if power goes out). | If the database says your payment was saved, it is saved. Even if the server crashes 1 second later. | Data must be written to multiple data centers before we say success. |
The hardest one: Isolation
Isolation is the trickiest part. Imagine this scenario:
- Alice's account has $100
- Alice tries to send $80 to Bob
- At the SAME TIME, Alice tries to send $80 to Carol
If both transactions read the balance ($100) at the same time, both see enough money. If both withdraw, Alice's account goes to -$60. This is the lost update problem.
How we prevent this: Before writing, we check if the data changed since we read it. If it changed, we retry the transaction.
```
TRANSACTION 1 (Alice sends $80 to Bob):
  Step 1: Read Alice balance = $100, version = 5
  Step 2: Calculate new balance = $100 - $80 = $20
  Step 3: Write balance = $20 ONLY IF version is still 5 (then bump version to 6)
  If the version changed in the meantime, another transaction got there first - abort and retry.
```

Two types of locking
- **Pessimistic locking**: Lock the row BEFORE reading. Others wait. Safer but slower.
- **Optimistic locking**: Read without locking, check at write time. Faster but may need retries.

Distributed databases usually use optimistic locking because pessimistic locks across data centers are very slow.
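To make the version check concrete, here is a minimal Python sketch of optimistic locking, using a toy in-memory `store` dictionary in place of a real storage engine (a real database performs the check-and-write as one atomic step):

```python
# Minimal sketch of optimistic locking: read a versioned value, compute the
# new value, and only write it back if the version has not changed.
# The in-memory `store` and helper names are illustrative, not a real API.

store = {"alice": {"balance": 100, "version": 5}}

def transfer_out(account: str, amount: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        row = store[account]
        read_version, balance = row["version"], row["balance"]

        if balance < amount:
            return False                      # rule: balance cannot go negative

        new_balance = balance - amount

        # Compare-and-swap: commit only if nobody else wrote in the meantime.
        # (A real database makes this check-and-write a single atomic step.)
        if store[account]["version"] == read_version:
            store[account] = {"balance": new_balance, "version": read_version + 1}
            return True
        # Version moved on: another transaction won the race - retry.
    return False

print(transfer_out("alice", 80))   # True: balance becomes 20, version becomes 6
print(transfer_out("alice", 80))   # False: only $20 left, second withdrawal rejected
```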
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right approach.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total data size | 1 Petabyte (1000 TB) | Way too big for one machine. Must split data across thousands of machines. |
| Number of data centers | 5 (US-East, US-West, Europe, Asia, Australia) | Need consensus across continents. Minimum round-trip times of 50-200ms. |
What to tell the interviewer
At this scale, we MUST shard the data - one machine cannot hold 1 PB. We will split data into thousands of shards (chunks). Each shard is replicated 3-5 times across different data centers. For reads, we can serve from any replica. For writes, we need consensus which takes 50-200ms due to network distance.
How people use the database (most common patterns):
1. Point reads - Get one row by its primary key. Example: Get user with ID = 123. This is the most common and should be fastest.
2. Range scans - Get multiple rows in a range. Example: Get all orders from January. Need good indexes.
3. Single-shard writes - Update one row. Example: Update user's email. Needs consensus within that shard.
4. Multi-shard transactions - Update rows in different shards. Example: Transfer money between accounts on different shards. This is SLOW because we need two-phase commit.
5. Cross-region reads - Read data that lives in another continent. Slow due to network distance. Can cache if eventual consistency is okay.
How much storage per machine?
- 1 PB total data
- Each row stored 3 times (3 replicas) = 3 PB raw storage needed
- Assuming roughly 10 TB of usable disk per storage node (an illustrative figure), that is on the order of 300+ machines just for storage

High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will design this with three main layers: (1) SQL layer that understands your queries, (2) Transaction layer that ensures ACID, and (3) Storage layer that actually stores and replicates data. Data is split into shards, and each shard has a leader that handles writes.
Distributed SQL Database Architecture
What each layer does:
| Layer | What it does | Why it exists |
|---|---|---|
| SQL Gateway | Parses SQL queries, creates execution plan, routes to right shards | Hides complexity from users. They write normal SQL without knowing about shards. |
| Transaction Coordinator | Manages transactions across shards. Ensures atomicity (all or nothing). | If a transaction touches multiple shards, someone needs to coordinate them. |
| Metadata Service | Tracks which shard has which data. Updated when shards split or move. | Gateways need to know where to send each query. |
| Storage Nodes | Actually store data on disk. Run consensus (Raft) for replication. | The actual storage. Each shard has 1 leader + 2 followers. |
How data is organized:
Data is split into ranges (also called shards or tablets). Each range covers a portion of the key space:
- Range 1: All rows where the primary key starts with A-M
- Range 2: All rows where the primary key starts with N-Z
Each range has:
- One leader replica - handles all writes and strong reads
- Two follower replicas - receive copies of writes, can serve eventually consistent reads
All three replicas run the Raft consensus protocol to agree on writes.
Data across multiple data centers
Why leaders are in different data centers
Notice that Shard 1's leader is in US-East, but Shard 2's leader is in US-West. We spread leaders across data centers so that: (1) if one data center dies, only some shards need to elect new leaders, and (2) write traffic is spread out instead of all going to one place.
Data Model and Storage
Now let me explain how we actually store the data on disk and organize it for fast queries.
What to tell the interviewer
We use a key-value storage engine (like RocksDB) under the hood. SQL tables are mapped to key-value pairs. Data is sorted by primary key, which makes range queries fast. Each shard stores a continuous range of keys.
How SQL tables become key-value pairs:
SQL:
```sql
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
);

INSERT INTO users VALUES (1, 'Alice', 'alice@email.com');
```
Stored as key-value:
```
Key:   /users/1
Value: {name: "Alice", email: "alice@email.com"}
```

The key includes the table name and primary key. Values are stored in a binary format (like Protocol Buffers).
| SQL Query | Key-Value Operation | Why this works |
|---|---|---|
| SELECT * FROM users WHERE id = 5 | GET /users/5 | A single key lookup - very fast |
| SELECT * FROM users WHERE id BETWEEN 10 AND 20 | SCAN /users/10 to /users/20 | Keys are sorted, so range scan is efficient |
| SELECT * FROM users WHERE name = 'Alice' | Full table scan OR use index | Without index, must scan all keys in table |
| INSERT INTO users VALUES (6, ...) | PUT /users/6 = ... | Insert is just a key-value put |
| UPDATE users SET name = 'Bob' WHERE id = 5 | PUT /users/5 = (new value) | Update reads then writes the key |
How indexes work:
When you create an index, we store an extra key-value pair that maps the indexed column to the primary key:
```sql
CREATE INDEX idx_email ON users(email);
```
This creates additional keys:
```
Key: /users/idx_email/alice@email.com -> 1
Key: /users/idx_email/bob@email.com   -> 2
```

Now finding a user by email is:
1. Look up /users/idx_email/alice@email.com -> get ID = 1
2. Look up /users/1 -> get the full row
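A small Python sketch of this table-to-key-value mapping, using JSON strings and made-up key formats purely for illustration (real engines use a compact binary encoding):

```python
import json

# Illustrative key/value encoding: real engines use a compact binary format
# (e.g. Protocol Buffers) rather than JSON strings.
kv = {}

def insert_user(user_id: int, name: str, email: str) -> None:
    kv[f"/users/{user_id}"] = json.dumps({"name": name, "email": email})
    kv[f"/users/idx_email/{email}"] = str(user_id)         # secondary index entry

def get_user_by_id(user_id: int) -> dict:
    return json.loads(kv[f"/users/{user_id}"])              # one key lookup

def get_user_by_email(email: str) -> dict:
    user_id = kv[f"/users/idx_email/{email}"]                # step 1: index -> primary key
    return json.loads(kv[f"/users/{user_id}"])                # step 2: primary key -> full row

insert_user(1, "Alice", "alice@email.com")
print(get_user_by_email("alice@email.com"))   # {'name': 'Alice', 'email': 'alice@email.com'}
```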
How sharding works:
We split the key space into ranges. Each range is one shard:
```
Shard 1: All keys from /users/1       to /users/1000000
Shard 2: All keys from /users/1000001 to /users/2000000
Shard 3: All keys from /users/2000001 to /users/3000000
```

The metadata service keeps track of these ranges:
```
Shard Map:
Shard 1: keys [/users/1,       /users/1000000], leader = us-east-node-5
Shard 2: keys [/users/1000001, /users/2000000], leader = us-west-node-3
Shard 3: keys [/users/2000001, /users/3000000], leader = eu-node-7
```

The gateway uses this map to route each query:

```
FUNCTION route_query(query):
    // Example query: SELECT * FROM users WHERE id = 1500000
    shard = shard_map.lookup(key_for(query))   // id 1500000 falls in Shard 2
    send the query to shard.leader             // us-west-node-3
```
What if a query touches multiple shards?
If you query `SELECT * FROM users WHERE id > 500000 AND id < 1500000`, this touches Shard 1 AND Shard 2. The gateway must: (1) send sub-queries to both shards, (2) combine the results, (3) if it is a write transaction, use two-phase commit to ensure atomicity.
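Here is a rough Python sketch of that routing logic, with a hard-coded shard map standing in for the metadata service (shard names, ranges, and node names are illustrative):

```python
import bisect

# Each shard owns a contiguous range of primary keys [start, end], inclusive.
# This hard-coded list stands in for the metadata service's shard map.
SHARDS = [
    {"name": "shard-1", "start": 1,         "end": 1_000_000, "leader": "us-east-node-5"},
    {"name": "shard-2", "start": 1_000_001, "end": 2_000_000, "leader": "us-west-node-3"},
    {"name": "shard-3", "start": 2_000_001, "end": 3_000_000, "leader": "eu-node-7"},
]
STARTS = [s["start"] for s in SHARDS]

def shard_for_key(key: int) -> dict:
    """Point lookup: find the shard whose range contains this primary key."""
    i = bisect.bisect_right(STARTS, key) - 1
    return SHARDS[i]

def shards_for_range(low: int, high: int) -> list[dict]:
    """A range query may span several shards; fan sub-queries out to each one."""
    return [s for s in SHARDS if s["start"] <= high and low <= s["end"]]

print(shard_for_key(1_500_000)["leader"])                          # us-west-node-3
print([s["name"] for s in shards_for_range(500_001, 1_500_000)])   # ['shard-1', 'shard-2']
```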
Consensus Deep Dive - How Raft Works
Let me explain step by step how Raft consensus makes sure all replicas agree on writes.
What to tell the interviewer
Raft is a consensus algorithm that ensures all replicas agree on the same sequence of operations. It works by: (1) electing a leader, (2) leader receives all writes, (3) leader sends writes to followers, (4) when majority confirms, write is committed.
The three states in Raft:
1. Leader - The boss. Receives all writes, sends them to followers. Only one leader per shard at a time.
2. Follower - Receives commands from the leader, stores them, replies "got it". If the leader dies, it might become a candidate.
3. Candidate - A follower that wants to become leader. Asks other nodes for votes.
Raft state transitions
How a write works with Raft:
Example: User wants to update their email to "new@email.com"
Shard has 3 nodes: Node A (Leader), Node B (Follower), Node C (Follower)
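The steps can be sketched as follows. This is a stripped-down illustration of the "replicate, then commit on majority acks" rule only, not real Raft (which also tracks terms, log indexes, heartbeats, and persistence):

```python
# Stripped-down sketch of the Raft commit rule for one write.
# Real Raft also tracks terms, log indexes, persistence, and heartbeats;
# here we only show "replicate to followers, commit on majority acks".

class Follower:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []

    def append(self, entry) -> bool:
        if not self.reachable:
            return False        # simulates a crashed or partitioned node
        self.log.append(entry)
        return True

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []           # committed entries

    def write(self, entry) -> bool:
        acks = 1                # 1. Leader appends locally (counts as one ack)
        for follower in self.followers:
            if follower.append(entry):   # 2. Send the entry to every follower
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:     # 3. Commit only with a majority
            self.log.append(entry)
            return True         # safe to tell the client "committed"
        return False            # not enough replicas reachable - reject the write

leader = Leader([Follower(), Follower(reachable=False)])
print(leader.write("SET email = 'new@email.com' WHERE id = 123"))  # True: 2 of 3 have it
```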
What happens when the leader dies?
Scenario: Node A (Leader) crashes. Node B and C are followers.
STEP 1: Followers notice the leader is gone (no heartbeats for a while)
STEP 2: A follower becomes a candidate and asks the others for votes; whoever gets a majority becomes the new leader

Why do we need majority?
With 3 nodes, majority = 2. With 5 nodes, majority = 3. We need majority because: (1) only one group can have majority, preventing split-brain (two leaders), (2) any two majorities overlap, so a new leader will have all committed data from the old leader.
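The overlap property is easy to check mechanically. This tiny sketch enumerates every pair of 3-node majorities in a 5-node cluster and confirms they always share at least one node:

```python
from itertools import combinations

# Any two majorities of the same cluster must share at least one node.
# That shared node is why a new leader cannot "forget" committed writes.
nodes = {"A", "B", "C", "D", "E"}
majority_size = len(nodes) // 2 + 1        # 3 of 5

majorities = [set(c) for c in combinations(nodes, majority_size)]
assert all(m1 & m2 for m1 in majorities for m2 in majorities)
print(f"every pair of {majority_size}-node majorities overlaps")
```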
Two-Phase Commit - Transactions Across Shards
When a transaction needs to update data in multiple shards, we need a way to make it atomic (all or nothing). This is where two-phase commit (2PC) comes in.
Say this to the interviewer
Two-phase commit is needed when a transaction touches multiple shards. It works in two phases: (1) PREPARE - ask all shards if they can commit, (2) COMMIT - if all said yes, tell all to commit. The downside is it is slow (multiple round trips) and blocks if the coordinator crashes.
Example scenario:
Alice wants to transfer $100 from her account to Bob's account.
- Alice's account is on Shard 1
- Bob's account is on Shard 2
We need to:
1. Subtract $100 from Alice (Shard 1)
2. Add $100 to Bob (Shard 2)
Both MUST succeed or both MUST fail. If only one succeeds, money disappears or appears from nowhere!
Two-Phase Commit
```
FUNCTION transfer_money(from_account, to_account, amount):
    // This transaction touches two shards
    PHASE 1 (PREPARE): ask both shards to lock the rows and vote yes or no
    PHASE 2 (COMMIT/ABORT): if both voted yes, tell both to commit; otherwise tell both to abort
```
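A slightly fuller Python sketch of the coordinator's logic, with hypothetical `Shard` objects and method names standing in for real RPCs:

```python
# Minimal sketch of a two-phase commit coordinator. Shard objects, method
# names, and the in-memory bookkeeping are illustrative, not a real API.

class Shard:
    def __init__(self, name):
        self.name = name
        self.prepared = {}

    def prepare(self, txn_id, operation) -> bool:
        """Phase 1: lock the rows, record the intent durably, vote yes/no."""
        if not operation.get("valid", True):
            return False                      # e.g. balance would go negative
        self.prepared[txn_id] = operation     # locks held until commit/abort
        return True

    def commit(self, txn_id):
        self.prepared.pop(txn_id, None)       # apply the change, release locks (idempotent)

    def abort(self, txn_id):
        self.prepared.pop(txn_id, None)       # undo the intent, release locks

def two_phase_commit(txn_id, work: dict) -> bool:
    """work maps each participating shard to the operation it must apply."""
    # PHASE 1: every shard must vote yes before anything becomes visible.
    voted_yes = []
    for shard, operation in work.items():
        if shard.prepare(txn_id, operation):
            voted_yes.append(shard)
        else:
            for s in voted_yes:
                s.abort(txn_id)               # any "no" aborts the whole transaction
            return False
    # A real coordinator durably logs the commit decision here so it can be
    # recovered after a crash; omitted in this sketch.
    # PHASE 2: everyone voted yes, so tell every shard to commit.
    for shard in work:
        shard.commit(txn_id)
    return True

shard1, shard2 = Shard("shard-1"), Shard("shard-2")
ok = two_phase_commit("txn-42", {
    shard1: {"sql": "UPDATE accounts SET balance = balance - 100 WHERE id = 'alice'"},
    shard2: {"sql": "UPDATE accounts SET balance = balance + 100 WHERE id = 'bob'"},
})
print(ok)   # True: both shards committed, or neither did
```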
What happens if something fails?
| Failure | When | What happens |
|---|---|---|
| Shard says NO to PREPARE | Phase 1 | Abort all. No data changed. Safe. |
| Coordinator crashes after PREPARE, before COMMIT | Between phases | Shards are stuck with locks! After timeout, check coordinator's log to find decision. |
| Shard crashes after PREPARE | Phase 1-2 | When shard recovers, it checks coordinator for the decision. |
| Network dies after COMMIT sent | Phase 2 | Shard may not know to commit. Coordinator retries. Commits are idempotent. |
The blocking problem with 2PC
If the coordinator crashes AFTER sending PREPARE but BEFORE sending COMMIT, all shards are stuck holding locks. They cannot commit (what if coordinator decided to abort?) and cannot abort (what if coordinator decided to commit?). Modern systems solve this by: (1) making the coordinator itself replicated with Raft, (2) using timeouts and having shards query the coordinator's replicas.
Avoid cross-shard transactions when possible
2PC is slow (4 message round trips minimum) and complex. Good schema design puts related data in the same shard. For example: all of a user's data (profile, settings, orders) shares the same shard key (user_id), so transactions on one user's data do not need 2PC.
Clock Synchronization - The Time Problem
One of the trickiest parts of distributed databases is dealing with time. Let me explain why clocks matter and how to handle them.
Say this to the interviewer
Clocks in distributed systems are NEVER perfectly synchronized. A computer in New York might think it is 10:00:00.000, while a computer in Tokyo thinks it is 10:00:00.123. If we use timestamps to order operations, we might order them wrong. There are two solutions: (1) do not use timestamps, use logical clocks, or (2) use special hardware to bound clock drift (like Google Spanner).
Why do we care about time?
Imagine this scenario:
1. Alice in New York writes `SET balance = 100` at 10:00:00.000 (NY time)
2. Bob in Tokyo reads `GET balance` at 10:00:00.050 (Tokyo time)
Should Bob see Alice's write? It depends on whether Alice's write REALLY happened before Bob's read. But if clocks are not synchronized:
- The NY clock might be 100ms ahead of the Tokyo clock
- So Bob's read at 10:00:00.050 Tokyo time might actually be BEFORE Alice's write at 10:00:00.000 NY time in real time!
Using wall clock timestamps leads to wrong answers.
Solution 1: Logical clocks (most databases use this)
Instead of using real time, we use counters that only go up:
- Every operation gets a sequence number
- If operation A causes operation B, then A's number < B's number
- We do not care about real time, just the order
Each node has a counter that starts at 0.
RULES:
1. Before doing a local operation or sending a message, increment your counter.
2. Attach your counter value to every message you send.
3. When you receive a message, set your counter to max(your counter, the received counter) + 1.

Solution 2: TrueTime (Google Spanner's approach)
Google uses special hardware (atomic clocks and GPS receivers) to keep clocks synchronized within a known error bound (usually around 7ms).
Instead of saying "it is exactly 10:00:00.000", the clock says "it is between 10:00:00.000 and 10:00:00.007".
When you do a write:
1. Record the current time interval [earliest, latest]
2. Wait until you are SURE the write timestamp is in the past
3. This means waiting for the error bound (7ms)
This guarantees that if write A completes before write B starts, A's timestamp < B's timestamp.
Google TrueTime: commit wait
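A rough sketch of the commit-wait rule. The `now_interval()` helper is an assumed stand-in for TrueTime's interval API, and 7ms is the illustrative uncertainty bound from the text, not a measured value:

```python
import time

# Sketch of Spanner-style commit wait. The now_interval() helper is an
# assumption standing in for TrueTime's API, which returns an interval
# [earliest, latest] guaranteed to contain true physical time.
CLOCK_UNCERTAINTY = 0.007   # assume clocks are within 7 ms of true time

def now_interval():
    t = time.time()
    return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)

def commit_with_wait(apply_write):
    earliest, latest = now_interval()
    commit_timestamp = latest          # pick a timestamp no clock can be ahead of
    apply_write(commit_timestamp)
    # Commit wait: do not acknowledge until commit_timestamp is definitely in
    # the past everywhere, i.e. even the slowest clock has moved past it.
    while now_interval()[0] < commit_timestamp:
        time.sleep(0.001)
    return commit_timestamp            # now it is safe to tell the client "committed"

ts = commit_with_wait(lambda t: None)
print(f"acknowledged at timestamp {ts:.3f} only after the uncertainty window passed")
```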
Which approach to recommend?
Most companies should use logical clocks (like CockroachDB, TiDB, YugabyteDB do). TrueTime requires special hardware that only Google has in every data center. For interview purposes, explain both approaches and say you would use logical clocks unless you have Google's infrastructure.
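For reference, here is a minimal Lamport-clock sketch of the counter rules above. Real systems such as CockroachDB use hybrid logical clocks, which combine a counter like this with wall-clock time:

```python
# Minimal Lamport clock: counters that only move forward and respect causality.
# If event A can influence event B (same node, or via a message), A's number
# is guaranteed to be smaller than B's. No wall-clock time involved.

class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        self.counter += 1
        return self.counter                    # attach this to the outgoing message

    def receive(self, message_counter: int) -> int:
        self.counter = max(self.counter, message_counter) + 1
        return self.counter

ny, tokyo = LamportClock(), LamportClock()
write_ts = ny.send()                 # New York writes and replicates the change
read_ts = tokyo.receive(write_ts)    # Tokyo applies it before serving the read
assert write_ts < read_ts            # causal order preserved without synced clocks
print(write_ts, read_ts)             # 1 2
```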
Read and Write Paths
Let me walk through exactly what happens when you read or write data.
The Write Path (step by step):
User runs: UPDATE users SET email = "new@email.com" WHERE id = 123
STEP 1: The SQL gateway receives the query and routes it to the leader of the shard that owns id = 123
STEP 2: The leader replicates the write to its followers via Raft; once a majority confirms, it commits and returns success

The Read Path (two options):
| Read Type | Where it goes | Speed | Consistency | When to use |
|---|---|---|---|---|
| Strong read | Always goes to leader | 50-200ms if leader is far away | Always sees latest data | When you MUST have latest data (checking bank balance before transfer) |
| Stale read | Can go to any replica, including nearby ones | 1-10ms to nearby replica | Might see slightly old data (milliseconds to seconds old) | When slightly old data is okay (showing user profile, search results) |
STRONG READ (must see latest data):
User runs: SELECT balance FROM accounts WHERE id = 123
The gateway sends the read to the shard's leader. The leader confirms it is still the leader (or holds a valid lease), then returns the latest committed value. This may mean a cross-continent round trip if the leader is far away.

Follower reads with read leases
Some systems let followers serve strong reads using 'read leases'. The leader grants a lease saying 'you can serve reads for the next 10 seconds, and I promise not to commit any new writes without telling you first'. This makes strong reads faster because you can go to a nearby follower instead of the leader.
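A small sketch of how a gateway might pick a replica for each read mode; replica names and latencies are made-up illustrative values, not measurements from a real deployment:

```python
# Illustrative routing of strong vs. stale reads.
REPLICAS = [
    {"node": "us-east-node-5", "role": "leader",   "rtt_ms": 140},
    {"node": "eu-node-7",      "role": "follower", "rtt_ms": 3},
    {"node": "asia-node-2",    "role": "follower", "rtt_ms": 90},
]

def choose_replica(consistency: str) -> dict:
    if consistency == "strong":
        # Must see the latest committed data: go to the leader (or a follower
        # holding a valid read lease, if the system supports leases).
        return next(r for r in REPLICAS if r["role"] == "leader")
    # Stale read: any replica will do, so pick the closest one.
    return min(REPLICAS, key=lambda r: r["rtt_ms"])

print(choose_replica("strong")["node"])   # us-east-node-5 (140 ms away, always fresh)
print(choose_replica("stale")["node"])    # eu-node-7 (3 ms away, may lag slightly)
```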
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
| What breaks | What happens to users | How we handle it | Recovery time |
|---|---|---|---|
| One node crashes | Shard still works (2 of 3 replicas alive) | Raft continues with remaining nodes. Dead node recovers and catches up. | 0 downtime for users. Node catches up in minutes. |
| Leader crashes | Writes to that shard pause briefly | Followers elect new leader automatically | 300-500ms to elect new leader |
Handling network partitions:
A network partition is when some data centers cannot talk to others. This is the scariest failure because it creates the "split-brain" problem: two groups each think they are the only ones alive.
Network partition scenario
SCENARIO: 3 replicas, network cuts Europe off from the US
Partition A (US-East + US-West): 2 nodes - still a majority, keeps accepting reads and writes
Partition B (Europe): 1 node - a minority, cannot commit writes, can at most serve stale reads

This is CP from the CAP theorem
By requiring majority for writes, we choose Consistency over Availability. During a partition, the minority side becomes unavailable (rejects writes). This is the right choice for a database - we never want two conflicting writes to both succeed.
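The decision each side makes during a partition boils down to a quorum check, sketched here with illustrative data center names:

```python
# During a partition, each side checks whether it can still reach a majority
# of the shard's replicas before accepting a write. Names are illustrative.
ALL_REPLICAS = {"us-east", "us-west", "europe"}

def can_accept_writes(reachable: set[str]) -> bool:
    return len(reachable & ALL_REPLICAS) > len(ALL_REPLICAS) // 2

# Network cuts Europe off from the two US data centers.
print(can_accept_writes({"us-east", "us-west"}))   # True: majority side keeps working
print(can_accept_writes({"europe"}))               # False: minority side rejects writes
```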
Growing the System Over Time
What to tell the interviewer
The system grows by: (1) adding more shards when data grows, (2) adding more replicas when read traffic grows, (3) adding more data centers for global coverage. Shards can split automatically when they get too big.
How to handle growing data: Shard splitting
When a shard gets too big (say, over 10GB), we split it into two smaller shards:
BEFORE: Shard 1 covers keys A-Z (getting too big, 15GB)
STEP 1: Pick a split point (say, the key that divides the data roughly in half, e.g. M)
STEP 2: Create a new shard for N-Z, copy that half of the data, and update the metadata service
AFTER:  Shard 1 covers A-M (~7.5GB), Shard 2 covers N-Z (~7.5GB)

How to handle growing traffic: Add replicas
If a shard is getting too many reads:
1. Add more follower replicas
2. Spread read traffic across all followers
3. Leader still handles writes and strong reads
How to handle global users: Add data centers
To reduce latency for users in new regions:
1. Add a new data center in that region
2. Place follower replicas there
3. Local users read from local followers
4. For latency-sensitive data, move the leader to that region
Scaling from 3 to 5 data centers
More replicas = slower writes
With 3 replicas, we need 2 to agree (majority). With 5 replicas, we need 3 to agree. More replicas means waiting for more acknowledgments, which can slow writes. The tradeoff: more replicas = better durability and read scalability, but slower writes.
Interview Tips and Common Questions
Questions interviewers often ask:
| Question | Good answer |
|---|---|
| Why not just use timestamps? | Computer clocks are not synchronized. A write that happened later might have an earlier timestamp. This leads to lost updates. We use logical clocks or synchronized clocks with bounded error (TrueTime). |
| How do you handle a write that touches 1000 rows across 100 shards? | This is very slow! 2PC with 100 participants is a bad idea. Better design: batch the writes, use eventual consistency, or redesign schema so related data is on the same shard. |
What makes a great answer
1. Start with the simplest correct solution, then optimize.
2. Acknowledge tradeoffs explicitly ("we are trading latency for consistency").
3. Reference real systems (Spanner, CockroachDB, TiDB).
4. Know when NOT to use a distributed database ("if your data fits on one machine, do not distribute it").
Red flags interviewers look for:
1. Ignoring network latency - "We will just synchronize all writes instantly." No, light takes time to travel!
2. Trusting wall clocks - "We will use timestamps to order events." This is wrong.
3. Ignoring failure modes - "We will handle failures later." Failures are the core challenge.
4. Over-engineering - "Let us add 10 data centers and shard by user, region, and time." Start simple.
5. Not knowing the CAP theorem - You cannot have perfect consistency, availability, and partition tolerance. Know which one you are giving up.
Real systems to study
- **Google Spanner**: The original globally distributed SQL database. Uses TrueTime. Read the paper!
- **CockroachDB**: Open-source Spanner-like database. Uses Raft + hybrid logical clocks.
- **TiDB**: MySQL-compatible distributed database. Uses Raft.
- **YugabyteDB**: PostgreSQL-compatible distributed database. Uses Raft.