System Design Masterclass
Tags: storage, key-value, distributed-systems, redis, dynamodb, cassandra, advanced

Design Distributed Key-Value Store

Design a distributed key-value store like Redis, DynamoDB, or Cassandra

Millions of reads/writes per second, petabytes of data | Similar to Amazon, Google, Facebook, Netflix, Uber | 45 min read

Summary

A key-value store is like a giant dictionary. You give it a name (key) and it gives you back a value. Think of it like a locker room - each locker has a number (key) and something inside (value). The hard parts are: splitting data across many computers so it fits, making copies so nothing is lost if a computer breaks, and deciding what happens when two people try to change the same thing at once. Amazon, Google, Facebook, and Netflix all ask this question in interviews.

Key Takeaways

Core Problem

The main job is to spread data across many computers and keep copies in sync. When one computer dies, the others must keep working without losing any data.

The Hard Part

When a computer dies, how do you notice it? How do you make sure requests go to other computers? How do you make new copies of the lost data? All this must happen fast without losing anything.

Scaling Axis

We use a trick called consistent hashing. It is like assigning each piece of data to a spot on a circle. When we add a new computer, only a small amount of data needs to move - not everything.

Critical Invariant

Never lose data that we said we saved. If we tell someone their data is saved, it must still be there even if computers break right after.

Performance Requirement

Reading data should take less than 10 milliseconds (that is 0.01 seconds - very fast!). The slowest requests matter most because users notice delays.

Key Tradeoff

We cannot have everything. Either we always show the latest data (but sometimes refuse requests), or we always respond (but sometimes show old data). This is called the CAP theorem.

Design Walkthrough

Problem Statement

The Question: Design a system that stores data as key-value pairs across many computers. It should handle millions of requests per second and never lose data.

What is a key-value store? Think of it like a dictionary or a phone book:

  • The KEY is what you look up (like a person's name)
  • The VALUE is what you get back (like their phone number)

Real examples of key-value stores:

  1. Caching (Redis, Memcached) - Store frequently used data in memory for super fast access. Like keeping your favorite books on your desk instead of walking to the library.
  2. Database (DynamoDB, Cassandra) - Store important data permanently. Like a filing cabinet that remembers everything.
  3. User sessions - When you log into a website, it remembers you. The key is your session ID, the value is who you are.
  4. Shopping carts - Key is your user ID, value is what is in your cart.
  5. Settings and configs (etcd, Consul) - Key is the setting name, value is the setting value.

What to say first

Before I start designing, let me ask some questions. Key-value stores can be very different from each other. Some keep data in memory and lose it when restarted (like a cache). Others save data to disk and keep it forever (like a database). I need to know which one we are building.

What the interviewer really wants to see:

  • Do you understand that we cannot have perfect speed, perfect reliability, AND perfect consistency at the same time? (CAP theorem)
  • Can you explain how to split data across many computers fairly? (consistent hashing)
  • Do you know how to make copies of data so nothing is lost? (replication)
  • What happens when computers break or cannot talk to each other? (failure handling)
  • Do you know the difference between a fast cache and a permanent database?

Clarifying Questions

Before you design anything, ask questions. The answers completely change what you build. Good questions show the interviewer you think carefully.

Question 1: Do we need to keep data forever?

Is this a cache (okay to lose data when computer restarts) or a database (data must survive forever)?

Why ask this: This changes EVERYTHING about how we build it.

If it is a cache: Keep data only in memory (RAM). Super fast but data disappears when computer restarts. Simpler to build.

If it is a database: Must save data to disk. Need multiple copies. Must survive computer crashes. More complex.

What interviewers usually say: Build a database - data must survive restarts and computer failures.

How this changes your design: We need to write data to disk, make 3 copies on different computers, and have a plan when computers die.

Question 2: How fresh does data need to be?

When someone reads data, must they always see the very latest version? Or is it okay if they sometimes see slightly old data?

Why ask this: This is THE most important question. It decides everything about how we handle copies.

Strong consistency: Every read sees the latest write. Like a bank account - you must see your real balance.

  • How it works: We must check with multiple computers before answering.
  • Downside: Slower, and sometimes we cannot answer if computers cannot talk to each other.

Eventual consistency: Reads might show slightly old data, but it will catch up. Like social media likes - if the count is off by 1 for a second, nobody cares.

  • How it works: We can answer from any computer that has the data.
  • Downside: You might see old data briefly.

What interviewers usually say: We want eventual consistency most of the time, but let users choose strong consistency when they need it.

How this changes your design: We use a voting system. For strong consistency, we ask multiple computers and wait for them to agree. For eventual consistency, we ask just one.

Question 3: How much data and how big are the values?

How many keys will we store? How big is each value - tiny (a few bytes) or huge (megabytes)?

Why ask this: Determines how many computers we need and how we store data.

Small values (under 1KB): Most key-value stores. User profiles, settings, counters. Can fit lots in memory.

Large values (MB or more): Need special handling. Might need to split into chunks. Think storing images or files.

What interviewers usually say: 10 billion keys, each value is about 1KB on average, some up to 400KB.

How this changes your design: 10 billion keys times 1KB = 10 terabytes of data. With 3 copies = 30 TB. We need about 100 computers with 300GB each.

Question 4: More reads or more writes?

Do people read data more often than they write it? Are there some keys that everyone wants (hot keys)?

Why ask this: Helps us optimize for what happens most often.

Read-heavy (most apps): 100 reads for every 1 write. Make more copies so we can read from many places.

Write-heavy (logs, events): Lots of new data coming in. Optimize for fast writes.

Hot keys problem: If everyone wants the same key (like a celebrity's profile), that one computer gets overloaded.

What interviewers usually say: Read-heavy, about 10 reads for every 1 write. Some hot keys exist.

How this changes your design: More copies for popular data. Special handling for hot keys - maybe cache them everywhere.

Summarize your assumptions

Let me summarize what I will design: A key-value database (not just a cache) with 10 billion keys, values up to 400KB, eventual consistency by default with strong consistency option, and a read-heavy workload. I will make 3 copies of each key for safety.

The Hard Part

Say this to the interviewer

The hardest part is not storing the data - it is handling failures. Computers break. Networks have problems. When bad things happen, we must keep working AND not lose any data. Let me explain the tricky scenarios.

Why failures are genuinely hard (explained simply):

  1. Is the computer dead or just slow? If a computer does not respond, we do not know if it crashed or if the network is slow. If we think it is dead but it is not, we might have two computers both thinking they are the boss!
  2. Adding new computers is tricky. When a new computer joins, we need to move some data to it. But we cannot stop the whole system to do this. We must keep working while moving data around.
  3. Two people write at the same time. Person A and Person B both try to change the same key at the same instant. Each talks to a different computer. Now we have two different values for the same key. Which one wins?
  4. Reading what you just wrote. You save something, then immediately read it back. But your read might go to a different computer that has not received your write yet! You see old data even though you just updated it.
  5. Computer comes back from the dead. A computer was "dead" for 5 minutes. When it comes back, suddenly everyone sends requests to it. It gets overwhelmed and might crash again!

Common mistake candidates make

Many people only design for the happy case when everything works. Interviewers want to see you think about failures. What if a computer dies in the middle of saving? What if the network splits in half? Always think: What could go wrong?

The CAP Theorem - The rule you cannot break

CAP stands for:

  • Consistency: Everyone sees the same data at the same time
  • Availability: The system always responds to requests
  • Partition tolerance: The system works even when some computers cannot talk to each other

The rule: When computers cannot talk to each other (partition), you MUST choose: either be consistent OR be available. You cannot have both.

Imagine two computers, A and B, cannot talk to each other. If someone writes to A and someone reads from B, what happens?

  • Choose consistency: Refuse the read from B (not available) because B does not have the latest data.
  • Choose availability: Let B answer with old data (not consistent) so we can still respond.

Most key-value stores choose AVAILABILITY because:

  • A little bit of old data is usually okay
  • Being unavailable directly hurts users
  • We can fix the data later when computers reconnect

[Diagram: CAP Theorem - You Must Choose]

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right approach.

| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total keys | 10 billion | Way too many for one computer. Must split across many machines. |
| Total data size | 10 TB (before copies) | 10 billion x 1KB. With 3 copies = 30 TB total. |

+ 6 more rows...

What to tell the interviewer

With 10 billion keys and 3 copies, we need about 100 computers. Since we have 10 times more reads than writes, we should make reads super fast using caching and spreading data well. Most requests should never hit the disk - they should come from memory.

STORAGE MATH:
- 10 billion keys x 1KB average = 10 TB of data
- With 3 copies = 30 TB total storage
+ 12 more lines...
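
To sanity-check these numbers, here is a tiny Python sketch of the same back-of-envelope math (the 1KB average value, 3 copies, and roughly 300GB of usable data per node are the assumptions from the estimates above):

# Back-of-envelope capacity math (assumptions: 10B keys, 1KB average value,
# replication factor 3, ~300 GB of usable data per node).
TOTAL_KEYS = 10_000_000_000
AVG_VALUE_BYTES = 1_000              # 1 KB average value
REPLICATION_FACTOR = 3
NODE_CAPACITY_BYTES = 300 * 10**9    # ~300 GB usable per node

raw_bytes = TOTAL_KEYS * AVG_VALUE_BYTES          # 10 TB before copies
total_bytes = raw_bytes * REPLICATION_FACTOR      # 30 TB with 3 copies
nodes_needed = total_bytes / NODE_CAPACITY_BYTES  # ~100 nodes

print(f"Raw data:     {raw_bytes / 10**12:.0f} TB")
print(f"With copies:  {total_bytes / 10**12:.0f} TB")
print(f"Nodes needed: {nodes_needed:.0f}")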

How people use key-value stores (from most common to least):

1. Read a single key - "Give me the value for user_123" - This must be FAST. Most common operation.

2. Write a single key - "Save this value for user_123" - Needs to be durable (not lost).

3. Delete a key - "Remove user_123" - Less common but important.

4. Scan a range - "Give me all keys from user_100 to user_200" - Some key-value stores support this, some do not.

Access patterns to know about:

  • Zipf distribution: A small number of keys get most of the traffic. Think celebrities on social media.
  • Hot spots: When everyone wants the same key at once, that computer gets overwhelmed.
  • Write bursts: Sometimes lots of writes come at once (like during a sale or event).

High-Level Architecture

Now let me draw the big picture. I will explain what each part does and WHY we need it.

What to tell the interviewer

I will use a masterless design where every computer can accept both reads and writes. This means if any computer dies, the others keep working. There is no single boss computer that could become a single point of failure. This is how DynamoDB and Cassandra work.

[Diagram: Key-Value Store - The Big Picture]

What each part does and WHY it exists:

| Part | What it does | Why we need it |
|---|---|---|
| Client Library | Lives in your app. Knows which node owns which key. Sends requests directly to the right node. | Saves time by going directly to the right computer instead of asking around. |
| Load Balancer | Spreads traffic across healthy nodes. Removes dead nodes from rotation. | Protects against overloading one computer. Hides failures from users. |
| Coordinator Node | Receives a request and talks to the other nodes that have copies. Waits for enough responses. | Makes sure data is saved to multiple places before saying done. |
| Storage Node | Saves data to memory and disk. Accepts copies from other nodes. | Actually stores the data! |
| Gossip Protocol | Nodes chat with each other to share health info. Who is alive? Who is dead? | Lets nodes discover problems without a central boss. |

How a write request works (step by step):

  1. App wants to save key "user_123" with value "John"
  2. Client library calculates: "user_123" belongs to Node 2 (using consistent hashing)
  3. Request goes to Node 2 (the coordinator)
  4. Node 2 figures out: "This key also needs copies on Node 3 and Node 4"
  5. Node 2 sends the write to Node 2, Node 3, and Node 4 at the same time
  6. Node 2 waits for 2 of 3 to say "saved!" (this is called quorum)
  7. Node 2 tells the app "success!"
  8. The third node might still be saving - that is okay

How a read request works (step by step):

  1. App wants to read key "user_123"
  2. Client library calculates: "user_123" is on Nodes 2, 3, and 4
  3. For fast eventual consistency: Ask just one node (fastest)
  4. For strong consistency: Ask 2 nodes, make sure they agree
  5. Return the value to the app

Common interview question: Why no leader/boss?

The interviewer might ask: Why not have one leader node? Answer: A leader is a single point of failure. If the leader dies, the whole system stops until we pick a new one. With a masterless design, any node can do any job. If one dies, the others keep working instantly. The tradeoff: it is harder to keep data perfectly consistent without a leader.

Technology choices and why:

Storage: LSM Tree (Log-Structured Merge Tree)

  • Why: Super fast writes because we just append to a file
  • Used by: Cassandra, RocksDB, LevelDB
  • Tradeoff: Reads can be slower (might check multiple files)

In-memory index: Hash Table or Skip List

  • Why: Find any key quickly (O(1) for a hash table, O(log n) for a skip list)
  • Keeps pointers to where data lives on disk

Communication: gRPC

  • Why: Fast, supports streaming, works well between services
  • Alternative: HTTP REST is simpler but slower

How real companies do it

DynamoDB (Amazon): Masterless, eventual consistency default, can choose strong consistency. Cassandra (Open source): Same idea, you run it yourself. Redis: Simpler, has a leader for each piece of data, super fast because mostly in-memory. etcd: Strong consistency using Raft consensus, used for important configs.

Data Partitioning: Consistent Hashing

How do we decide which computer stores which key? This is called partitioning. We use a clever trick called consistent hashing.

What to tell the interviewer

We use consistent hashing to split data across computers. The magic is: when we add or remove a computer, only a tiny fraction of data needs to move. With the simple approach (hash mod N), adding one computer means EVERYTHING moves. Consistent hashing is much better.

The problem with simple hashing:

Simple approach: Computer number = hash(key) mod number_of_computers

Example with 4 computers:

  • hash("user_1") = 17, 17 mod 4 = 1, goes to Computer 1
  • hash("user_2") = 22, 22 mod 4 = 2, goes to Computer 2

Now we add a 5th computer:

  • hash("user_1") = 17, 17 mod 5 = 2, NOW goes to Computer 2!
  • hash("user_2") = 22, 22 mod 5 = 2, stays on Computer 2

Almost ALL keys change computers! This is terrible - we have to move all our data just to add one machine.
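
A quick way to see how bad this is: the small Python sketch below (key names and node counts are made up for illustration) counts how many keys change owners when we grow from 4 to 5 computers using hash mod N. Roughly 80% of the keys move.

# How many keys move when we grow from 4 to 5 nodes using hash(key) mod N?
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys changed owners")  # roughly 80%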

Consistent hashing - the smart way:

Imagine a clock face (a circle) numbered 0 to 100.

  1. Put each computer at a spot on the circle (based on its hash)
  2. Put each key at a spot on the circle (based on its hash)
  3. Each key is owned by the first computer clockwise from it

Now when we add a new computer:

  • It takes over some keys from its clockwise neighbor
  • ONLY those keys move
  • Everything else stays where it is!

[Diagram: Consistent Hashing Ring]

How it works step by step:

  1. Hash the computer name to get its position: hash("Computer_A") = 25
  2. Hash the key name to get its position: hash("user_123") = 30
  3. Walk clockwise from the key's position until you hit a computer
  4. That computer owns the key!

In the example above:

  • Key X (position 30) walks clockwise and hits Computer B (position 50)
  • Key Y (position 60) walks clockwise and hits Computer C (position 75)
  • Key Z (position 10) walks clockwise and hits Computer A (position 25)

When we add Computer D at position 60:

  • Keys between 50 and 60 move from Computer C to Computer D
  • Everything else stays the same!
  • Only a small fraction of the keys moves (on average about 1/N when adding one computer to a ring of N) instead of all of them

Virtual nodes - making it fair:

Problem: With only a few computers, one might get way more keys than others (unfair!)

Solution: Each real computer pretends to be many "virtual" computers on the ring.

  • Computer A appears at positions: 10, 35, 60, 85 (4 virtual nodes)
  • Computer B appears at positions: 20, 45, 70, 95 (4 virtual nodes)

Now the keys are spread more evenly!

In real systems:

  • Each computer has 100-200 virtual nodes
  • This gives very even distribution
  • Bigger computers can have more virtual nodes (they handle more data)

FUNCTION find_owner(key):
    // Step 1: Hash the key to get its position on the ring
    position = hash(key) mod RING_SIZE
+ 28 more lines...

Important detail for replication

When finding computers for copies, skip virtual nodes that belong to the same physical machine! You want copies on DIFFERENT machines. If Machine A has virtual nodes at positions 10, 35, and 60, and your key is at position 8, do not put all copies on Machine A just because its virtual nodes are next on the ring.
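
To tie the ring, virtual nodes, and this replication detail together, here is a minimal Python sketch of a consistent hashing ring (node names, the number of virtual nodes, and the MD5-based hash are illustrative assumptions, not a specific library's API):

import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        # Each physical node appears many times on the ring as "virtual nodes".
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def get_replicas(self, key, n=3):
        # Walk clockwise from the key's position, collecting DISTINCT physical
        # nodes so that all copies land on different machines.
        start = bisect.bisect(self.positions, _hash(key))
        replicas = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in replicas:
                replicas.append(node)
            if len(replicas) == n:
                break
        return replicas

ring = ConsistentHashRing(["node-A", "node-B", "node-C", "node-D"])
print(ring.get_replicas("user_123"))  # e.g. ['node-C', 'node-A', 'node-D']

Note how get_replicas keeps walking until it has collected N distinct physical nodes, which is exactly the rule described above.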

Replication: Making Copies for Safety

Replication means keeping copies of data on multiple computers. If one computer dies, we do not lose any data because other computers have copies.

What to tell the interviewer

We copy each piece of data to N computers (usually 3). When writing, we wait for W computers to confirm. When reading, we ask R computers. The magic formula is: if W + R > N, we always get consistent data because at least one computer must have seen the latest write.

The three numbers that control everything: N, W, R

  • N = Number of copies (how many computers have this data)
  • W = Write quorum (how many must confirm before we say "saved!")
  • R = Read quorum (how many we ask when reading)

The magic rule: W + R > N means strong consistency

Why? If we write to W computers and read from R computers, and W + R > N, the set of computers we wrote to and the set we read from must overlap in at least one computer. That computer has the latest write, so we always get fresh data!

| Setting | N | W | R | W+R | What happens |
|---|---|---|---|---|---|
| Strong consistency | 3 | 2 | 2 | 4 > 3 ✓ | Always see latest data. Slower writes. |
| Fast writes | 3 | 1 | 3 | 4 > 3 ✓ | Writes are fast (only wait for 1). Reads must check all 3. |
| Fast reads | 3 | 3 | 1 | 4 > 3 ✓ | Reads are fast (only ask 1). Writes must wait for all 3. |
| Eventual consistency | 3 | 1 | 1 | 2 < 3 ✗ | Super fast but might see old data. Used by most apps. |

[Diagram: How a Write Works with W=2, N=3]

Three ways to do replication:

1. Synchronous (wait for all) - Wait for ALL N computers before saying done

  • Pros: Very safe - data is definitely everywhere
  • Cons: Slow - as slow as the slowest computer. If one is dead, we are stuck.

2. Asynchronous (do not wait at all) - Say "done!" immediately, copy to others in background

  • Pros: Super fast writes
  • Cons: If the first computer dies before copying, data is lost!

3. Quorum (wait for some) - RECOMMENDED - Wait for W out of N computers

  • Pros: Balance of speed and safety
  • Cons: Need to choose W and R carefully

FUNCTION write_with_quorum(key, value, N=3, W=2):
    // Find the N nodes that should store this key
    nodes = find_replicas(key, N)
+ 33 more lines...
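
As a concrete companion to the pseudocode above, here is a hedged Python sketch of a quorum write: the coordinator fans the write out to all replicas in parallel and reports success once W of them acknowledge. The replica.put call, the timeout, and the thread-pool transport are stand-ins, not a real client API:

from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def write_with_quorum(replicas, key, value, w=2, timeout_s=0.05):
    # Fan the write out to every replica at once; succeed once W of them confirm.
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica.put, key, value) for replica in replicas]
    acks = 0
    try:
        for future in as_completed(futures, timeout=timeout_s):
            try:
                if future.result():
                    acks += 1
            except Exception:
                pass  # a dead or failing replica; the others may still ack
            if acks >= w:
                return True  # quorum reached; slower replicas keep writing
    except TimeoutError:
        pass  # some replicas were too slow to answer in time
    finally:
        pool.shutdown(wait=False)  # do not block on the stragglers
    return False  # quorum not met - the caller should retry or report an error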

Hinted Handoff - What happens when a node is temporarily down

Imagine we need to write to Nodes 1, 2, and 3, but Node 3 is down.

  1. We write to Nodes 1 and 2 (success! W=2 is met)
  2. For Node 3, we write to Node 4 instead with a "hint"
  3. The hint says: "This data is meant for Node 3. Give it to them when they wake up."
  4. When Node 3 comes back online, Node 4 sends it the data
  5. Node 4 then deletes its temporary copy

This keeps the system available even when nodes are temporarily down!
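
A minimal sketch of the hinted-handoff bookkeeping in Python (the node objects, their put and is_alive methods, and the hint store are hypothetical stand-ins):

# Hinted handoff: if a replica is down, keep its copy (plus a hint about the
# intended owner) on another node, and hand it over once the owner is back.
class HintStore:
    def __init__(self):
        self.hints = []  # (intended_node, key, value) waiting for delivery

    def write_or_hint(self, node, key, value):
        if node.is_alive():
            node.put(key, value)
        else:
            self.hints.append((node, key, value))  # stored locally for now

    def deliver_hints(self):
        # Run periodically in the background on the node holding the hints.
        for node, key, value in list(self.hints):
            if node.is_alive():
                node.put(key, value)  # the intended owner finally gets its copy
                self.hints.remove((node, key, value))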

What DynamoDB does

DynamoDB uses N=3 by default. For eventual consistency reads (default), R=1 (fast!). For strongly consistent reads, R=2. For writes, W=2. This gives fast eventual reads while still being safe.

Storage Engine: How We Save Data

Now let me explain how each computer actually saves the data. We use something called an LSM Tree (Log-Structured Merge Tree). It sounds scary but it is actually simple!

What to tell the interviewer

We use an LSM tree storage engine because it is incredibly fast for writes. Instead of updating data in place (which requires finding it first), we just write everything at the end of a file. This is used by Cassandra, RocksDB, LevelDB, and inside DynamoDB.

Why random writes are slow:

Imagine a traditional database stored alphabetically on disk:

  • To save "user_banana", we find the B section and insert it
  • This means reading the disk, finding the spot, moving things around, writing
  • Disks are SLOW at finding random spots (like flipping through a book)

Why sequential writes are fast:

  • Just write at the end of the file, like adding to a diary
  • No searching, no moving things around
  • Disks are FAST at sequential writes (like writing on a blank page)

LSM trees convert random writes into sequential writes!

[Diagram: LSM Tree Structure]

How writing works (step by step):

  1. Write to the Write-Ahead Log (WAL): This is just appending to a file. If we crash, we can recover from this log. This is sequential - very fast!
  2. Write to the MemTable: This is an in-memory sorted data structure (like a balanced tree). Super fast because it is in memory.
  3. Return success to the user: We are done! The user does not wait for disk.
  4. Later: Flush to disk: When the MemTable gets big (like 64MB), we write it all to disk as a sorted file called an SSTable (Sorted String Table). This is sequential - fast!

No random disk writes ever! That is why LSM trees are so fast for writes.

How reading works (step by step):

  1. Check the MemTable first: Maybe the data is recent and still in memory. Instant!
  2. Check disk files from newest to oldest: The data might be in any file, so we check them in order.
  3. Use Bloom filters to skip files: A bloom filter is a small summary that tells us "this key is DEFINITELY NOT in this file" or "this key MIGHT be in this file". We skip files that definitely do not have our key.

Reads can be slower because we might check multiple files. But bloom filters help a lot - we skip most files.

FUNCTION write(key, value):
    // Step 1: Write to log (crash recovery)
    append_to_file(WAL, key, value)
+ 52 more lines...
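
To make the write and read paths concrete, here is a deliberately tiny Python LSM sketch: append to a WAL, update an in-memory memtable, flush to a sorted file, and read newest-first. The JSON file format, the flush threshold, and the lack of real indexes and bloom filters are simplifications:

import json, os

class TinyLSM:
    def __init__(self, data_dir, memtable_limit=1000):
        self.data_dir = data_dir
        self.memtable = {}                  # recent writes, kept in memory
        self.memtable_limit = memtable_limit
        self.sstables = []                  # flushed files, newest last
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        # 1) Append to the write-ahead log (sequential, crash-safe).
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2) Update the in-memory memtable.
        self.memtable[key] = value
        # 3) Flush to a sorted on-disk SSTable when the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable_{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):
            with open(path) as f:  # real engines use indexes and bloom filters
                table = json.load(f)
            if key in table:
                return table[key]
        return None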

Compaction - Cleaning up old files:

Over time, we get many SSTable files. This is bad because:

  • Reads check many files
  • Deleted data still takes space (tombstones)
  • Same key might be in multiple files (old versions)

Compaction merges files together:

  • Combine small files into bigger files
  • Remove old versions (keep only newest)
  • Remove tombstones (actually delete the data)
  • Result: Fewer, more organized files

The tradeoff: Write Amplification

LSM trees have a cost: data gets written multiple times. First to the MemTable, then to Level 0, then merged to Level 1, and so on. A piece of data might be written 10-30 times total! This is called write amplification. We trade extra disk writes for faster user-facing writes. For most apps, this is a great tradeoff.

| Operation | LSM Tree | B+ Tree (traditional) | Which is better? |
|---|---|---|---|
| Write | Super fast - just append | Slower - find the right spot first | LSM Tree wins |
| Read (found) | Might check multiple files | One lookup | B+ Tree wins |
| Read (not found) | Fast with bloom filters | One lookup | About the same |
| Range scan (get A-Z) | Merge from multiple files | Just read in order | B+ Tree wins |
| Best for | Write-heavy workloads | Read-heavy workloads | Depends on your app! |

Handling Conflicts: When Two Writes Disagree

The golden rule

Never lose data that we promised to save. If we told a user their write succeeded, that data must be recoverable. When two writes conflict, we must decide which one wins using a rule that everyone agrees on.

The conflict problem explained simply:

Imagine two computers (A and B) cannot talk to each other (network problem).

  1. User 1 writes to Computer A: "user_123 = Alice"
  2. At the same time, User 2 writes to Computer B: "user_123 = Bob"
  3. Both computers say "Success!"
  4. Network problem is fixed. A and B sync up.
  5. Computer A has "Alice", Computer B has "Bob"
  6. Which one wins? They need to agree!

[Diagram: Write Conflict During Network Split]

Solution 1: Last-Write-Wins (LWW) - The simple approach

Every write gets a timestamp. Higher timestamp wins.

  • Alice was written at time 100
  • Bob was written at time 102
  • Bob wins because 102 > 100

Pros: Super simple. No application changes needed.
Cons: Alice's write is silently lost! She thought it saved but it did not "win".
When to use: When losing occasional writes is okay. Most apps use this.

STRUCTURE VersionedValue:
    value: the actual data
    timestamp: when it was written (from computer clock)
+ 12 more lines...
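
The same idea in a few lines of Python, as a hedged sketch (the timestamps here come from whoever wrote the value; real systems also have to worry about clock skew between machines):

from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float  # wall-clock time when the write happened

def merge_lww(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    # Higher timestamp wins; the other write is silently discarded.
    return a if a.timestamp >= b.timestamp else b

alice = VersionedValue("Alice", timestamp=100.0)
bob = VersionedValue("Bob", timestamp=102.0)
print(merge_lww(alice, bob).value)  # "Bob" - Alice's write is lost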

Solution 2: Vector Clocks - The careful approach

Instead of a single timestamp, track the "version history" from each computer.

Example:

  • Alice is written on Computer A: version {A: 1}
  • Bob is written on Computer B: version {B: 1}

When syncing, we see:

  • {A: 1} vs {B: 1} - Neither includes the other!
  • This is a TRUE CONFLICT - we cannot automatically pick a winner

The system keeps BOTH values and asks the application to decide.

Pros: Never silently loses data. Detects real conflicts.
Cons: Complex. Application must handle conflicts.
When to use: When no write can be lost (important data).
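
Here is a small Python sketch of the vector-clock comparison that detects this situation (the dictionary representation and node names are illustrative):

def compare(vc_a: dict, vc_b: dict):
    # Returns "a_newer", "b_newer", "equal", or "conflict".
    nodes = set(vc_a) | set(vc_b)
    a_ahead = any(vc_a.get(n, 0) > vc_b.get(n, 0) for n in nodes)
    b_ahead = any(vc_b.get(n, 0) > vc_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"   # neither version includes the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

print(compare({"A": 1}, {"B": 1}))                  # conflict - keep both, let the app decide
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # a_newer - safe to overwrite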

Solution 3: CRDTs (Conflict-free Replicated Data Types) - The clever approach

Special data structures designed to never conflict!

Example: A "grow-only counter" - Computer A: counter = 5 - Computer B: counter = 3 - To merge: just take the maximum = 5 - No conflict possible!

Types of CRDTs: - Counters: Track adds and subtracts separately, merge by adding - Sets: Track adds and removes separately, merge by combining - Registers: Like LWW but smarter

Pros: No conflicts ever. Automatic merging. Cons: Only works for specific data types. Cannot do everything. When to use: Likes, counters, collaborative editing.
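
As an illustration, a grow-only counter fits in a few lines of Python. Each node increments its own slot, and merging takes the per-node maximum, so replicas can merge in any order and always agree (node ids are made up):

class GCounter:
    def __init__(self):
        self.counts = {}  # node_id -> how many increments that node has recorded

    def increment(self, node_id, amount=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + amount

    def merge(self, other):
        # Per-node maximum: applying merges in any order gives the same result.
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(self.counts.get(node_id, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two replicas of the same counter: one has seen 5 increments from node-A,
# the other (a stale copy) has only seen 3 of them.
r1, r2 = GCounter(), GCounter()
r1.increment("node-A", 5)
r2.increment("node-A", 3)
r1.merge(r2)
print(r1.value())  # 5 - take the maximum, exactly as described above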

What to tell the interviewer

For most key-value stores, Last-Write-Wins is fine. DynamoDB uses it with optional conditional writes (only write if current value is X). For critical data where no write can be lost, use vector clocks or application-level conflict resolution. For counters and likes, CRDTs work great.

Handling Failures: What Can Go Wrong

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through failure scenarios and how we handle them. This is critical for a storage system - losing data is unacceptable!

Types of failures and how we handle them:

| What breaks | How we notice | What we do | How we recover |
|---|---|---|---|
| Computer crashes | Other computers stop hearing from it (heartbeat) | Send requests to other copies | Start a new computer, copy data to it |
| Network split | Computers in Group A cannot reach Group B | Each group keeps working with available copies | When fixed, sync and resolve conflicts |
| Disk dies | Disk errors when reading/writing | Mark computer unhealthy, use other copies | Replace disk, copy data from replicas |
| Computer is slow | Requests take too long | Send same request to another computer too | If slow often, remove from rotation |
| Data corruption | Checksum does not match | Return error, try another copy | Replace corrupted data from good copy |

How do computers know others are alive? Gossip Protocol

Imagine a small town where people gossip:

  • Every second, each person tells a random neighbor: "Have you heard about Bob? I saw him 2 seconds ago."
  • If everyone says "I have not heard from Bob in 10 seconds," Bob is probably not around.

Computers do the same thing!

  1. Every second, each computer picks a random other computer
  2. They share what they know: "Computer A was healthy 1 second ago, Computer B was healthy 3 seconds ago"
  3. If nobody has heard from Computer C in 10 seconds, Computer C is probably dead
  4. Mark Computer C as unhealthy and stop sending it requests

STRUCTURE ComputerHealth:
    computer_id: which computer
    last_seen: when we last heard they were alive
+ 34 more lines...
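
Continuing the pseudocode above, here is a hedged Python sketch of the core bookkeeping: every node remembers when it last heard about each peer, shares that view with one random peer per round, and declares peers dead after a timeout (the 10-second threshold mirrors the example; the network transport is left as a callback):

import random
import time

class GossipState:
    def __init__(self, my_id, dead_after_s=10):
        self.my_id = my_id
        self.dead_after_s = dead_after_s
        self.last_seen = {my_id: time.time()}  # peer_id -> last heartbeat time

    def merge_from_peer(self, peer_view: dict):
        # Keep the freshest timestamp we have heard for every peer.
        for peer_id, seen_at in peer_view.items():
            self.last_seen[peer_id] = max(self.last_seen.get(peer_id, 0), seen_at)

    def gossip_round(self, send_to_peer):
        # Once a second: refresh our own heartbeat and share our view with one
        # randomly chosen peer (send_to_peer stands in for the network call).
        self.last_seen[self.my_id] = time.time()
        others = [p for p in self.last_seen if p != self.my_id]
        if others:
            send_to_peer(random.choice(others), dict(self.last_seen))

    def dead_peers(self):
        now = time.time()
        return [p for p, seen in self.last_seen.items()
                if p != self.my_id and now - seen > self.dead_after_s]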

Read Repair - Fixing inconsistencies while reading

When we read from multiple computers and get different answers:

  1. We return the newest value to the user (based on timestamp)
  2. In the background, we update the computers that had old data

This is like noticing a typo while reading and fixing it for next time.

FUNCTION read_with_repair(key):
    // Get all 3 copies
    computers = find_replicas(key, 3)
+ 14 more lines...
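
A compact Python sketch of the same idea, reusing the versioned-value shape from the conflict section (the replica get and put calls are hypothetical stubs, and a real system would do the repair writes asynchronously):

def read_with_repair(replicas, key):
    # Ask every replica, return the newest copy, and quietly fix the stale ones.
    answers = [(replica, replica.get(key)) for replica in replicas]
    freshest = max((v for _, v in answers if v is not None),
                   key=lambda v: v.timestamp, default=None)
    if freshest is None:
        return None
    for replica, value in answers:
        if value is None or value.timestamp < freshest.timestamp:
            replica.put(key, freshest)  # repair the stale copy
    return freshest.value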

Anti-entropy - Background consistency checker

In addition to read repair, we run a background process that:

  1. Compares data between replicas
  2. Finds differences
  3. Fixes them

This catches inconsistencies even for data that is never read.

Uses Merkle Trees for efficiency:

  • Instead of comparing every key (slow!)
  • Create a "summary hash" of all keys
  • Compare summaries first
  • Only look at details where summaries differ
  • Much faster!
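
A simplified Python sketch of the Merkle-style comparison: hash buckets of keys into leaf summaries, compare a root summary first, and only drill into buckets whose hashes differ (the bucket count and the two-level layout are simplifications of a real Merkle tree):

import hashlib

def bucket_hashes(data: dict, num_buckets=4):
    # Leaf level: hash the keys and values inside each bucket.
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(data):
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
        buckets[idx].append(f"{key}={data[key]}")
    return [hashlib.md5("|".join(b).encode()).hexdigest() for b in buckets]

def root_hash(leaf_hashes):
    # One parent hash summarising everything; compare this first.
    return hashlib.md5("".join(leaf_hashes).encode()).hexdigest()

def find_diverging_buckets(replica_a: dict, replica_b: dict):
    a_leaves, b_leaves = bucket_hashes(replica_a), bucket_hashes(replica_b)
    if root_hash(a_leaves) == root_hash(b_leaves):
        return []  # replicas agree - nothing to compare key by key
    return [i for i, (a, b) in enumerate(zip(a_leaves, b_leaves)) if a != b]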

Defense in depth

We have multiple layers of protection: (1) Write to multiple computers immediately, (2) Hinted handoff if one is down, (3) Read repair fixes problems during normal reads, (4) Anti-entropy fixes problems in background. Data would have to get past ALL these protections to be lost!

Growing the System Over Time

What to tell the interviewer

This design works great up to thousands of computers and petabytes of data. Let me explain how a system starts small and grows, and what changes at each stage.

Evolution path - From startup to giant:

Stage 1: Just starting (up to 1TB of data)

  • Single computer with in-memory storage
  • Simple hash map for data
  • Write to disk for safety
  • Good enough for development and small apps

Stage 2: Growing (up to 10TB)

  • One primary computer handles all writes
  • Add read replicas to handle more reads
  • Simple to understand and operate
  • Limitations: Primary is a bottleneck, if primary dies there is downtime

Stage 3: Big scale (up to PBs) - What we designed

  • Consistent hashing spreads data across many computers
  • No single boss - any computer can handle requests
  • 3 copies of everything
  • Can lose computers without any downtime

Stage 4: Global scale (multiple data centers)

  • Computers in different cities/countries
  • Users talk to the nearest data center
  • Data copies across data centers
  • Survives entire data center failures

[Diagram: Multi-Region Deployment]

Different types of key-value stores for different needs:

| System | How it works | Consistency | Best for |
|---|---|---|---|
| Redis | One primary per shard, data mostly in memory | Strong per shard | Caching, sessions, leaderboards. Super fast. |
| Cassandra | No primary, any node can write, tunable | Eventual or tunable | Logs, time-series, write-heavy apps. |
| DynamoDB | Managed service, no primary, tunable | Eventual or strong | General purpose, serverless apps. |
| etcd | Raft consensus, one leader | Strong | Config storage, leader election. Small data. |
| CockroachDB | Raft per range, SQL support | Strong | When you need SQL + distribution. |

What would I change for different use cases?

For caching (like Redis):

  • Keep data only in memory - no disk
  • Simpler replication - one primary per shard
  • Add eviction - delete old data when memory is full

For time-series data (like metrics):

  • Partition by time - all March data together
  • Compress aggressively - metrics compress well
  • Auto-delete old data - keep only last 30 days

For configuration storage (like etcd):

  • Strong consistency always - config must be correct
  • Smaller data size - megabytes not terabytes
  • Add "watch" feature - notify when config changes

For global deployment:

  • Prefer reading from local data center (fast)
  • Write might need to cross data centers (slower but safer)
  • Use conflict resolution for cross-region writes

Final interview tip

Remember: start simple! Do not jump to the complex global design immediately. Start with single node, then add sharding, then add replication, then add multiple regions. The interviewer wants to see that you know when to add complexity and when to keep it simple.

Design Trade-offs

These trade-offs describe choosing strong consistency (W + R > N) over eventual consistency.

Advantages

  • Every read sees the most recent write
  • No surprises - what you write is what you read
  • Simple for application developers - no stale data to worry about

Disadvantages

  • Slower - must wait for multiple computers to agree
  • Less available - if computers cannot talk, we refuse requests
  • Higher latency, especially across data centers

When to use

Bank balances, inventory counts, anything where wrong data causes real problems. Use when correctness matters more than speed.