Design Walkthrough
Problem Statement
The Question: Design a system that stores data as key-value pairs across many computers. It should handle millions of requests per second and never lose data.
What is a key-value store? Think of it like a dictionary or a phone book: - The KEY is what you look up (like a person's name) - The VALUE is what you get back (like their phone number)
Real examples of key-value stores:
1. Caching (Redis, Memcached) - Store frequently used data in memory for super fast access. Like keeping your favorite books on your desk instead of walking to the library.
2. Database (DynamoDB, Cassandra) - Store important data permanently. Like a filing cabinet that remembers everything.
3. User sessions - When you log into a website, it remembers you. The key is your session ID, the value is who you are.
4. Shopping carts - Key is your user ID, value is what is in your cart.
5. Settings and configs (etcd, Consul) - Key is the setting name, value is the setting value.
What to say first
Before I start designing, let me ask some questions. Key-value stores can be very different from each other. Some keep data in memory and lose it when restarted (like a cache). Others save data to disk and keep it forever (like a database). I need to know which one we are building.
What the interviewer really wants to see:
- Do you understand that we cannot have perfect speed, perfect reliability, AND perfect consistency at the same time? (CAP theorem)
- Can you explain how to split data across many computers fairly? (consistent hashing)
- Do you know how to make copies of data so nothing is lost? (replication)
- What happens when computers break or cannot talk to each other? (failure handling)
- Do you know the difference between a fast cache and a permanent database?
Clarifying Questions
Before you design anything, ask questions. The answers completely change what you build. Good questions show the interviewer you think carefully.
Question 1: Do we need to keep data forever?
Is this a cache (okay to lose data when computer restarts) or a database (data must survive forever)?
Why ask this: This changes EVERYTHING about how we build it.
If it is a cache: Keep data only in memory (RAM). Super fast but data disappears when computer restarts. Simpler to build.
If it is a database: Must save data to disk. Need multiple copies. Must survive computer crashes. More complex.
What interviewers usually say: Build a database - data must survive restarts and computer failures.
How this changes your design: We need to write data to disk, make 3 copies on different computers, and have a plan when computers die.
Question 2: How fresh does data need to be?
When someone reads data, must they always see the very latest version? Or is it okay if they sometimes see slightly old data?
Why ask this: This is THE most important question. It decides everything about how we handle copies.
Strong consistency: Every read sees the latest write. Like a bank account - you must see your real balance.
- How it works: We must check with multiple computers before answering.
- Downside: Slower, and sometimes we cannot answer if computers cannot talk to each other.

Eventual consistency: Reads might show slightly old data, but it will catch up. Like social media likes - if the count is off by 1 for a second, nobody cares.
- How it works: We can answer from any computer that has the data.
- Downside: You might see old data briefly.
What interviewers usually say: We want eventual consistency most of the time, but let users choose strong consistency when they need it.
How this changes your design: We use a voting system. For strong consistency, we ask multiple computers and wait for them to agree. For eventual consistency, we ask just one.
Question 3: How much data and how big are the values?
How many keys will we store? How big is each value - tiny (a few bytes) or huge (megabytes)?
Why ask this: Determines how many computers we need and how we store data.
Small values (under 1KB): Most key-value stores. User profiles, settings, counters. Can fit lots in memory.
Large values (MB or more): Need special handling. Might need to split into chunks. Think storing images or files.
What interviewers usually say: 10 billion keys, each value is about 1KB on average, some up to 400KB.
How this changes your design: 10 billion keys times 1KB = 10 terabytes of data. With 3 copies = 30 TB. We need about 100 computers with 300GB each.
Question 4: More reads or more writes?
Do people read data more often than they write it? Are there some keys that everyone wants (hot keys)?
Why ask this: Helps us optimize for what happens most often.
Read-heavy (most apps): 100 reads for every 1 write. Make more copies so we can read from many places.
Write-heavy (logs, events): Lots of new data coming in. Optimize for fast writes.
Hot keys problem: If everyone wants the same key (like a celebrity's profile), that one computer gets overloaded.
What interviewers usually say: Read-heavy, about 10 reads for every 1 write. Some hot keys exist.
How this changes your design: More copies for popular data. Special handling for hot keys - maybe cache them everywhere.
Summarize your assumptions
Let me summarize what I will design: A key-value database (not just a cache) with 10 billion keys, values up to 400KB, eventual consistency by default with strong consistency option, and a read-heavy workload. I will make 3 copies of each key for safety.
The Hard Part
Say this to the interviewer
The hardest part is not storing the data - it is handling failures. Computers break. Networks have problems. When bad things happen, we must keep working AND not lose any data. Let me explain the tricky scenarios.
Why failures are genuinely hard (explained simply):
1. Is the computer dead or just slow? If a computer does not respond, we do not know if it crashed or if the network is slow. If we think it is dead but it is not, we might have two computers both thinking they are the boss!
2. Adding new computers is tricky. When a new computer joins, we need to move some data to it. But we cannot stop the whole system to do this. We must keep working while moving data around.
3. Two people write at the same time. Person A and Person B both try to change the same key at the same instant. Each talks to a different computer. Now we have two different values for the same key. Which one wins?
4. Reading what you just wrote. You save something, then immediately read it back. But your read might go to a different computer that has not received your write yet! You see old data even though you just updated it.
5. Computer comes back from the dead. A computer was "dead" for 5 minutes. When it comes back, suddenly everyone sends requests to it. It gets overwhelmed and might crash again!
Common mistake candidates make
Many people only design for the happy case when everything works. Interviewers want to see you think about failures. What if a computer dies in the middle of saving? What if the network splits in half? Always think: What could go wrong?
The CAP Theorem - The rule you cannot break
CAP stands for:
- Consistency: Everyone sees the same data at the same time
- Availability: The system always responds to requests
- Partition tolerance: The system works even when some computers cannot talk to each other
The rule: When computers cannot talk to each other (partition), you MUST choose: either be consistent OR be available. You cannot have both.
Imagine two computers, A and B, cannot talk to each other:
- If someone writes to A and someone reads from B, what happens?
- Choose consistency: Refuse the read from B (not available) because B does not have the latest data.
- Choose availability: Let B answer with old data (not consistent) so we can still respond.

Most key-value stores choose AVAILABILITY because:
- A little bit of old data is usually okay
- Being unavailable directly hurts users
- We can fix the data later when computers reconnect
CAP Theorem - You Must Choose
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right approach.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total keys | 10 billion | Way too many for one computer. Must split across many machines. |
| Total data size | 10 TB (before copies) | 10 billion x 1KB. With 3 copies = 30 TB total. |
What to tell the interviewer
With 10 billion keys and 3 copies, we need about 100 computers. Since we have 10 times more reads than writes, we should make reads super fast using caching and spreading data well. Most requests should never hit the disk - they should come from memory.
STORAGE MATH:
- 10 billion keys x 1KB average = 10 TB of data
- With 3 copies = 30 TB total storage

How people use key-value stores (from most common to least):
1. Read a single key - "Give me the value for user_123" - This must be FAST. Most common operation.
2. Write a single key - "Save this value for user_123" - Needs to be durable (not lost).
3. Delete a key - "Remove user_123" - Less common but important.
4. Scan a range - "Give me all keys from user_100 to user_200" - Some key-value stores support this, some do not.
Access patterns to know about:
- Zipf distribution: A small number of keys get most of the traffic. Think celebrities on social media.
- Hot spots: When everyone wants the same key at once, that computer gets overwhelmed.
- Write bursts: Sometimes lots of writes come at once (like during a sale or event).
High-Level Architecture
Now let me draw the big picture. I will explain what each part does and WHY we need it.
What to tell the interviewer
I will use a masterless design where every computer can accept both reads and writes. This means if any computer dies, the others keep working. There is no single boss computer that could become a single point of failure. This is how DynamoDB and Cassandra work.
Key-Value Store - The Big Picture
What each part does and WHY it exists:
| Part | What it does | Why we need it |
|---|---|---|
| Client Library | Lives in your app. Knows which node owns which key. Sends requests directly to the right node. | Saves time by going directly to the right computer instead of asking around. |
| Load Balancer | Spreads traffic across healthy nodes. Removes dead nodes from rotation. | Protects against overloading one computer. Hides failures from users. |
| Coordinator Node | Receives a request and talks to the other nodes that have copies. Waits for enough responses. | Makes sure data is saved to multiple places before saying done. |
| Storage Node | Saves data to memory and disk. Accepts copies from other nodes. | Actually stores the data! |
| Gossip Protocol | Nodes chat with each other to share health info. Who is alive? Who is dead? | Lets nodes discover problems without a central boss. |
How a write request works (step by step):
1. App wants to save key "user_123" with value "John"
2. Client library calculates: "user_123" belongs to Node 2 (using consistent hashing)
3. Request goes to Node 2 (the coordinator)
4. Node 2 figures out: "This key also needs copies on Node 3 and Node 4"
5. Node 2 sends the write to Node 2, Node 3, and Node 4 at the same time
6. Node 2 waits for 2 of 3 to say "saved!" (this is called quorum)
7. Node 2 tells the app "success!"
8. The third node might still be saving - that is okay
How a read request works (step by step):
1. App wants to read key "user_123"
2. Client library calculates: "user_123" is on Nodes 2, 3, and 4
3. For fast eventual consistency: Ask just one node (fastest)
4. For strong consistency: Ask 2 nodes, make sure they agree
5. Return the value to the app
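To make these two flows concrete, here is a minimal, self-contained Python sketch of the coordinator logic, assuming toy in-memory nodes and the N=3, W=2 numbers used above. The node names and helper functions are illustrative assumptions, not a real client library.

import time

# Toy sketch of the coordinator flow above. The three "nodes" are just
# in-memory dicts mapping key -> (timestamp, value).
nodes = {"node2": {}, "node3": {}, "node4": {}}

def coordinate_write(key, value, w=2):
    acks = 0
    for store in nodes.values():               # send the write to all N copies
        store[key] = (time.time(), value)
        acks += 1                               # in reality, count confirmations
    return acks >= w                            # success once W copies confirm

def coordinate_read(key, r=1):
    # r=1 is the fast eventual-consistency read; r=2 is the quorum read
    replies = [store[key] for store in list(nodes.values())[:r] if key in store]
    return max(replies)[1] if replies else None  # newest timestamp wins

coordinate_write("user_123", "John")
print(coordinate_read("user_123", r=2))         # -> John

In a real store the confirmations arrive asynchronously and the coordinator answers as soon as W of them are in, rather than looping synchronously as this sketch does.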
Common interview question: Why no leader/boss?
The interviewer might ask: Why not have one leader node? Answer: A leader is a single point of failure. If the leader dies, the whole system stops until we pick a new one. With a masterless design, any node can do any job. If one dies, the others keep working instantly. The tradeoff: it is harder to keep data perfectly consistent without a leader.
Technology choices and why:
Storage: LSM Tree (Log-Structured Merge Tree)
- Why: Super fast writes because we just append to a file
- Used by: Cassandra, RocksDB, LevelDB
- Tradeoff: Reads can be slower (might check multiple files)

In-memory index: Hash Table or Skip List
- Why: Find any key quickly - O(1) for a hash table, O(log n) for a skip list
- Keeps pointers to where data lives on disk

Communication: gRPC
- Why: Fast, supports streaming, works well between services
- Alternative: HTTP REST is simpler but slower
How real companies do it
- DynamoDB (Amazon): Masterless, eventual consistency default, can choose strong consistency.
- Cassandra (Open source): Same idea, you run it yourself.
- Redis: Simpler, has a leader for each piece of data, super fast because mostly in-memory.
- etcd: Strong consistency using Raft consensus, used for important configs.
Data Partitioning: Consistent Hashing
How do we decide which computer stores which key? This is called partitioning. We use a clever trick called consistent hashing.
What to tell the interviewer
We use consistent hashing to split data across computers. The magic is: when we add or remove a computer, only a tiny fraction of data needs to move. With the simple approach (hash mod N), adding one computer means EVERYTHING moves. Consistent hashing is much better.
The problem with simple hashing:
Simple approach: Computer number = hash(key) mod number_of_computers
Example with 4 computers: - hash("user_1") = 17, 17 mod 4 = 1, goes to Computer 1 - hash("user_2") = 22, 22 mod 4 = 2, goes to Computer 2
Now we add a 5th computer: - hash("user_1") = 17, 17 mod 5 = 2, NOW goes to Computer 2! - hash("user_2") = 22, 22 mod 5 = 2, stays on Computer 2
Almost ALL keys change computers! This is terrible - we have to move all our data just to add one machine.
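You can see the damage with a tiny, self-contained check. The md5 hash and the user_N key names here are arbitrary choices for the demo:

import hashlib

def h(key):
    # Stable hash of the key (md5 is just a convenient choice for the demo)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user_{i}" for i in range(10000)]
moved = sum(1 for k in keys if h(k) % 4 != h(k) % 5)
print(f"{moved / len(keys):.0%} of keys change computers")  # roughly 80%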
Consistent hashing - the smart way:
Imagine a clock face (a circle) numbered 0 to 100.
1. Put each computer at a spot on the circle (based on its hash)
2. Put each key at a spot on the circle (based on its hash)
3. Each key is owned by the first computer clockwise from it
Now when we add a new computer:
- It takes over some keys from its clockwise neighbor
- ONLY those keys move
- Everything else stays where it is!
Consistent Hashing Ring
How it works step by step:
1. Hash the computer name to get its position: hash("Computer_A") = 25
2. Hash the key name to get its position: hash("user_123") = 30
3. Walk clockwise from the key's position until you hit a computer
4. That computer owns the key!
In the example above:
- Key X (position 30) walks clockwise and hits Computer B (position 50)
- Key Y (position 60) walks clockwise and hits Computer C (position 75)
- Key Z (position 10) walks clockwise and hits Computer A (position 25)

When we add Computer D at position 60:
- Keys between 50 and 60 move from Computer C to Computer D
- Everything else stays the same!
- Only about 1/4 of the keys move (instead of all of them)
Virtual nodes - making it fair:
Problem: With only a few computers, one might get way more keys than others (unfair!)
Solution: Each real computer pretends to be many "virtual" computers on the ring.
- Computer A appears at positions: 10, 35, 60, 85 (4 virtual nodes)
- Computer B appears at positions: 20, 45, 70, 95 (4 virtual nodes)
Now the keys are spread more evenly!
In real systems:
- Each computer has 100-200 virtual nodes
- This gives very even distribution
- Bigger computers can have more virtual nodes (they handle more data)
FUNCTION find_owner(key):
    // Step 1: Hash the key to get its position on the ring
    position = hash(key) mod RING_SIZE
    // Step 2: Walk clockwise from that position and return the first
    // computer we meet (wrapping around to the start of the ring)
    RETURN first_node_clockwise(position)

Important detail for replication
When finding computers for copies, skip virtual nodes that belong to the same physical machine! You want copies on DIFFERENT machines. If Machine A has virtual nodes at positions 10, 35, and 60, and your key is at position 8, do not put all copies on Machine A just because its virtual nodes are next on the ring.
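Here is a minimal Python sketch of a ring with virtual nodes that also enforces the rule above: when choosing replicas, skip virtual nodes whose physical machine is already in the list. The machine names, the 100 virtual nodes per machine, and the md5 hash are assumptions for illustration.

import hashlib
from bisect import bisect_right

def ring_hash(name):
    # Position on the ring (md5 is just a convenient stable hash for the sketch)
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, machines, vnodes=100):
        # Each physical machine appears at `vnodes` positions on the ring
        self.ring = sorted((ring_hash(f"{m}#{i}"), m)
                           for m in machines for i in range(vnodes))
        self.positions = [pos for pos, _ in self.ring]

    def replicas_for(self, key, n=3):
        # Walk clockwise from the key's position, skipping virtual nodes
        # that belong to a physical machine we already picked
        idx = bisect_right(self.positions, ring_hash(key))
        owners = []
        while len(owners) < n:
            machine = self.ring[idx % len(self.ring)][1]
            if machine not in owners:
                owners.append(machine)
            idx += 1
        return owners

ring = HashRing(["machine_A", "machine_B", "machine_C", "machine_D"])
print(ring.replicas_for("user_123"))   # three different physical machines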
Replication: Making Copies for Safety
Replication means keeping copies of data on multiple computers. If one computer dies, we do not lose any data because other computers have copies.
What to tell the interviewer
We copy each piece of data to N computers (usually 3). When writing, we wait for W computers to confirm. When reading, we ask R computers. The magic formula is: if W + R > N, we always get consistent data because at least one computer must have seen the latest write.
The three numbers that control everything: N, W, R
- N = Number of copies (how many computers have this data)
- W = Write quorum (how many must confirm before we say "saved!")
- R = Read quorum (how many we ask when reading)
The magic rule: W + R > N means strong consistency
Why? If we write to W computers and read from R computers, and W + R > N, then at least one computer has seen both the write AND is being read from. So we always get fresh data!
| Setting | N | W | R | W+R | What happens |
|---|---|---|---|---|---|
| Strong consistency | 3 | 2 | 2 | 4 > 3 ✓ | Always see latest data. Slower writes. |
| Fast writes | 3 | 1 | 3 | 4 > 3 ✓ | Writes are fast (only wait for 1). Reads must check all 3. |
| Fast reads | 3 | 3 | 1 | 4 > 3 ✓ | Reads are fast (only ask 1). Writes must wait for all 3. |
| Eventual consistency | 3 | 1 | 1 | 2 < 3 ✗ | Super fast but might see old data. Used by most apps. |
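The rule in the table is easy to verify in a couple of lines; the four settings below mirror the table rows:

def is_strongly_consistent(n, w, r):
    # W + R > N means the write set and the read set must overlap in at
    # least one computer, so some computer we read from saw the latest write
    return w + r > n

settings = {"strong": (3, 2, 2), "fast writes": (3, 1, 3),
            "fast reads": (3, 3, 1), "eventual": (3, 1, 1)}
for name, (n, w, r) in settings.items():
    print(name, is_strongly_consistent(n, w, r))   # only "eventual" is False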
How a Write Works with W=2, N=3
Three ways to do replication:
1. Synchronous (wait for all)
- Wait for ALL N computers before saying done
- Pros: Very safe - data is definitely everywhere
- Cons: Slow - as slow as the slowest computer. If one is dead, we are stuck.

2. Asynchronous (do not wait at all)
- Say "done!" immediately, copy to others in background
- Pros: Super fast writes
- Cons: If the first computer dies before copying, data is lost!

3. Quorum (wait for some) - RECOMMENDED
- Wait for W out of N computers
- Pros: Balance of speed and safety
- Cons: Need to choose W and R carefully
FUNCTION write_with_quorum(key, value, N=3, W=2):
    // Find the N nodes that should store this key
    nodes = find_replicas(key, N)
    // Send the write to all N nodes in parallel, then return
    // success as soon as W of them confirm the write
    RETURN wait_for_acks(send_to_all(nodes, key, value), W)

Hinted Handoff - What happens when a node is temporarily down
Imagine we need to write to Nodes 1, 2, and 3, but Node 3 is down.
1. We write to Nodes 1 and 2 (success! W=2 is met)
2. For Node 3, we write to Node 4 instead with a "hint"
3. The hint says: "This data is meant for Node 3. Give it to them when they wake up."
4. When Node 3 comes back online, Node 4 sends it the data
5. Node 4 then deletes its temporary copy
This keeps the system available even when nodes are temporarily down!
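A minimal sketch of hinted handoff under toy assumptions: nodes are small in-memory objects, a failed write is parked on a fallback node together with a hint naming its real owner, and the hints are replayed when the owner recovers. None of the class or function names come from a real system.

class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data = {}     # key -> value
        self.hints = []    # (intended_node, key, value) held for a down peer

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

def write_with_hints(replicas, fallback, key, value, w=2):
    acks = 0
    for node in replicas:
        try:
            node.write(key, value)
            acks += 1
        except ConnectionError:
            # Park the write on the fallback node with a hint that says
            # "this belongs to <node> - hand it over when it wakes up"
            fallback.hints.append((node, key, value))
    return acks >= w    # still succeeds as long as W replicas confirmed

def replay_hints(fallback):
    # Run when a down node recovers: deliver the hinted writes, then drop them
    for node, key, value in list(fallback.hints):
        if node.up:
            node.write(key, value)
            fallback.hints.remove((node, key, value))

Real systems also put a time limit on hints (Cassandra calls this the hint window) so a node that stays dead for a long time does not pile up hints forever.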
What DynamoDB does
DynamoDB uses N=3 by default. For eventual consistency reads (default), R=1 (fast!). For strongly consistent reads, R=2. For writes, W=2. This gives fast eventual reads while still being safe.
Storage Engine: How We Save Data
Now let me explain how each computer actually saves the data. We use something called an LSM Tree (Log-Structured Merge Tree). It sounds scary but it is actually simple!
What to tell the interviewer
We use an LSM tree storage engine because it is incredibly fast for writes. Instead of updating data in place (which requires finding it first), we just write everything at the end of a file. This is used by Cassandra, RocksDB, LevelDB, and inside DynamoDB.
Why random writes are slow:
Imagine a traditional database stored alphabetically on disk:
- To save "user_banana", we find the B section and insert it
- This means reading the disk, finding the spot, moving things around, writing
- Disks are SLOW at finding random spots (like flipping through a book)
Why sequential writes are fast:
- Just write at the end of the file, like adding to a diary
- No searching, no moving things around
- Disks are FAST at sequential writes (like writing on a blank page)
LSM trees convert random writes into sequential writes!
LSM Tree Structure
How writing works (step by step):
1. Write to the Write-Ahead Log (WAL): This is just appending to a file. If we crash, we can recover from this log. This is sequential - very fast!
2. Write to the MemTable: This is an in-memory sorted data structure (like a balanced tree). Super fast because it is in memory.
3. Return success to the user: We are done! The user does not wait for disk.
4. Later: Flush to disk: When the MemTable gets big (like 64MB), we write it all to disk as a sorted file called an SSTable (Sorted String Table). This is sequential - fast!
No random disk writes ever! That is why LSM trees are so fast for writes.
How reading works (step by step):
1. Check the MemTable first: Maybe the data is recent and still in memory. Instant!
2. Check disk files from newest to oldest: The data might be in any file, so we check them in order.
3. Use Bloom filters to skip files: A Bloom filter is a small summary that tells us "this key is DEFINITELY NOT in this file" or "this key MIGHT be in this file". We skip files that definitely do not have our key.
Reads can be slower because we might check multiple files. But bloom filters help a lot - we skip most files.
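Here is a small, self-contained sketch of that read path: check the MemTable, then scan SSTables from newest to oldest, skipping any file whose Bloom filter says the key is definitely absent. The Bloom filter here is a toy version with a bit array and a few md5-based hash functions, just to show the idea.

import hashlib

class ToyBloomFilter:
    # Tiny Bloom filter: k hash positions set in an m-bit integer
    def __init__(self, m=1024, k=3):
        self.bits, self.m, self.k = 0, m, k

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely not here"; True means "might be here"
        return all(self.bits >> p & 1 for p in self._positions(key))

def lsm_read(key, memtable, sstables):
    # sstables is a list of (bloom_filter, {key: value}), newest first
    if key in memtable:
        return memtable[key]
    for bloom, table in sstables:
        if not bloom.might_contain(key):
            continue          # skip this file without touching "disk"
        if key in table:
            return table[key]
    return None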
FUNCTION write(key, value):
    // Step 1: Write to log (crash recovery)
    append_to_file(WAL, key, value)
    // Step 2: Write to the in-memory MemTable and return success
    // (the MemTable is flushed to a sorted SSTable on disk later)
    memtable.put(key, value)

Compaction - Cleaning up old files:
Over time, we get many SSTable files. This is bad because:
- Reads check many files
- Deleted data still takes space (tombstones)
- Same key might be in multiple files (old versions)

Compaction merges files together:
- Combine small files into bigger files
- Remove old versions (keep only newest)
- Remove tombstones (actually delete the data)
- Result: Fewer, more organized files
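A minimal compaction sketch under simple assumptions: each SSTable is a dict, the list is ordered newest first, and a delete is stored as a None tombstone. Merging keeps only the newest version of each key and drops the tombstones.

TOMBSTONE = None  # assumption: a delete is stored as a None value

def compact(sstables):
    # sstables are ordered newest first; keep only the newest version
    merged = {}
    for table in sstables:
        for key, value in table.items():
            merged.setdefault(key, value)   # first (newest) version wins
    # Drop tombstones: the key is now really gone from the merged file
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

old = [{"a": 2, "b": TOMBSTONE}, {"a": 1, "b": 5, "c": 7}]
print(compact(old))   # {'a': 2, 'c': 7}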
The tradeoff: Write Amplification
LSM trees have a cost: data gets written multiple times. First to the MemTable, then to Level 0, then merged to Level 1, and so on. A piece of data might be written 10-30 times total! This is called write amplification. We trade extra disk writes for faster user-facing writes. For most apps, this is a great tradeoff.
| Operation | LSM Tree | B+ Tree (traditional) | Which is better? |
|---|---|---|---|
| Write | Super fast - just append | Slower - find the right spot first | LSM Tree wins |
| Read (found) | Might check multiple files | One lookup | B+ Tree wins |
| Read (not found) | Fast with bloom filters | One lookup | About the same |
| Range scan (get A-Z) | Merge from multiple files | Just read in order | B+ Tree wins |
| Best for | Write-heavy workloads | Read-heavy workloads | Depends on your app! |
Handling Conflicts: When Two Writes Disagree
The golden rule
Never lose data that we promised to save. If we told a user their write succeeded, that data must be recoverable. When two writes conflict, we must decide which one wins using a rule that everyone agrees on.
The conflict problem explained simply:
Imagine two computers (A and B) cannot talk to each other (network problem).
1. User 1 writes to Computer A: "user_123 = Alice"
2. At the same time, User 2 writes to Computer B: "user_123 = Bob"
3. Both computers say "Success!"
4. Network problem is fixed. A and B sync up.
5. Computer A has "Alice", Computer B has "Bob"
6. Which one wins? They need to agree!
Write Conflict During Network Split
Solution 1: Last-Write-Wins (LWW) - The simple approach
Every write gets a timestamp. Higher timestamp wins.
- Alice was written at time 100
- Bob was written at time 102
- Bob wins because 102 > 100
Pros: Super simple. No application changes needed. Cons: Alice's write is silently lost! She thought it saved but it did not "win". When to use: When losing occasional writes is okay. Most apps use this.
STRUCTURE VersionedValue:
    value: the actual data
    timestamp: when it was written (from the computer's clock)
    // On conflict, the copy with the higher timestamp wins

Solution 2: Vector Clocks - The careful approach
Instead of a single timestamp, track the "version history" from each computer.
Example:
- Alice is written on Computer A: version {A: 1}
- Bob is written on Computer B: version {B: 1}

When syncing, we see:
- {A: 1} vs {B: 1} - Neither includes the other!
- This is a TRUE CONFLICT - we cannot automatically pick a winner
The system keeps BOTH values and asks the application to decide.
Pros: Never silently loses data. Detects real conflicts. Cons: Complex. Application must handle conflicts. When to use: When no write can be lost (important data).
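A minimal comparison function for vector clocks, assuming each clock is a dict from computer name to counter; it reports whether one version supersedes the other or whether we have a true conflict that the application must resolve.

def compare(vc_a, vc_b):
    # A "descends from" B if every counter in A is >= the matching one in B
    names = set(vc_a) | set(vc_b)
    a_geq = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in names)
    b_geq = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in names)
    if a_geq and not b_geq:
        return "A is newer"
    if b_geq and not a_geq:
        return "B is newer"
    if a_geq and b_geq:
        return "same version"
    return "conflict - keep both and let the application decide"

print(compare({"A": 1}, {"B": 1}))                   # conflict (Alice vs Bob)
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))   # A is newer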
Solution 3: CRDTs (Conflict-free Replicated Data Types) - The clever approach
Special data structures designed to never conflict!
Example: A "grow-only counter" - Computer A: counter = 5 - Computer B: counter = 3 - To merge: just take the maximum = 5 - No conflict possible!
Types of CRDTs: - Counters: Track adds and subtracts separately, merge by adding - Sets: Track adds and removes separately, merge by combining - Registers: Like LWW but smarter
Pros: No conflicts ever. Automatic merging. Cons: Only works for specific data types. Cannot do everything. When to use: Likes, counters, collaborative editing.
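The example above treats the counter as one number; the standard grow-only counter (G-counter) gives each computer its own slot, merges by taking the per-slot maximum, and reports the sum of the slots - the "track adds separately, merge by adding" idea from the list above. A minimal sketch:

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}              # node_id -> how much that node added

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Per-node maximum: safe to apply in any order, any number of times
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(5); b.increment(3)
a.merge(b)
print(a.value())   # 8 - both computers' additions survive the merge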
What to tell the interviewer
For most key-value stores, Last-Write-Wins is fine. DynamoDB uses it with optional conditional writes (only write if current value is X). For critical data where no write can be lost, use vector clocks or application-level conflict resolution. For counters and likes, CRDTs work great.
Handling Failures: What Can Go Wrong
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through failure scenarios and how we handle them. This is critical for a storage system - losing data is unacceptable!
Types of failures and how we handle them:
| What breaks | How we notice | What we do | How we recover |
|---|---|---|---|
| Computer crashes | Other computers stop hearing from it (heartbeat) | Send requests to other copies | Start a new computer, copy data to it |
| Network split | Computers in Group A cannot reach Group B | Each group keeps working with available copies | When fixed, sync and resolve conflicts |
| Disk dies | Disk errors when reading/writing | Mark computer unhealthy, use other copies | Replace disk, copy data from replicas |
| Computer is slow | Requests take too long | Send same request to another computer too | If slow often, remove from rotation |
| Data corruption | Checksum does not match | Return error, try another copy | Replace corrupted data from good copy |
How do computers know others are alive? Gossip Protocol
Imagine a small town where people gossip:
- Every second, each person tells a random neighbor: "Have you heard about Bob? I saw him 2 seconds ago."
- If everyone says "I have not heard from Bob in 10 seconds," Bob is probably not around.
Computers do the same thing!
1. Every second, each computer picks a random other computer
2. They share what they know: "Computer A was healthy 1 second ago, Computer B was healthy 3 seconds ago"
3. If nobody has heard from Computer C in 10 seconds, Computer C is probably dead
4. Mark Computer C as unhealthy and stop sending it requests
STRUCTURE ComputerHealth:
    computer_id: which computer
    last_seen: when we last heard they were alive
    // If now - last_seen > 10 seconds, mark this computer as dead

Read Repair - Fixing inconsistencies while reading
When we read from multiple computers and get different answers:
1. We return the newest value to the user (based on timestamp)
2. In the background, we update the computers that had old data
This is like noticing a typo while reading and fixing it for next time.
FUNCTION read_with_repair(key):
    // Get all 3 copies
    computers = find_replicas(key, 3)
    values = read_from_all(computers, key)
    // Return the newest value; repair stale copies in the background
    RETURN newest_of(values)

Anti-entropy - Background consistency checker
In addition to read repair, we run a background process that:
1. Compares data between replicas
2. Finds differences
3. Fixes them
This catches inconsistencies even for data that is never read.
Uses Merkle Trees for efficiency:
- Instead of comparing every key (slow!)
- Create a "summary hash" of all keys
- Compare summaries first
- Only look at details where summaries differ
- Much faster!
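A toy, one-level version of the idea: split each replica's keys into fixed buckets, summarize every bucket with one hash, and only compare keys inside buckets whose summaries differ. A real Merkle tree nests these summaries into a tree, but the sketch shows the saving; the bucket count and md5 are arbitrary choices.

import hashlib

def bucket_for(key, buckets=16):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def bucket_hashes(data, buckets=16):
    # One summary hash per bucket of keys (a flat stand-in for a Merkle tree)
    summaries = [hashlib.md5() for _ in range(buckets)]
    for key in sorted(data):
        summaries[bucket_for(key, buckets)].update(f"{key}={data[key]};".encode())
    return [s.hexdigest() for s in summaries]

def buckets_to_sync(replica_a, replica_b, buckets=16):
    a, b = bucket_hashes(replica_a, buckets), bucket_hashes(replica_b, buckets)
    # Only buckets whose summaries differ need a key-by-key comparison
    return [i for i in range(buckets) if a[i] != b[i]]

r1 = {"user_1": "Alice", "user_2": "Bob"}
r2 = {"user_1": "Alice", "user_2": "Bobby"}    # one stale value
print(buckets_to_sync(r1, r2))                 # only user_2's bucket differs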
Defense in depth
We have multiple layers of protection: (1) Write to multiple computers immediately, (2) Hinted handoff if one is down, (3) Read repair fixes problems during normal reads, (4) Anti-entropy fixes problems in background. Data would have to get past ALL these protections to be lost!
Growing the System Over Time
What to tell the interviewer
This design works great up to thousands of computers and petabytes of data. Let me explain how a system starts small and grows, and what changes at each stage.
Evolution path - From startup to giant:
Stage 1: Just starting (up to 1TB of data)
- Single computer with in-memory storage
- Simple hash map for data
- Write to disk for safety
- Good enough for development and small apps

Stage 2: Growing (up to 10TB)
- One primary computer handles all writes
- Add read replicas to handle more reads
- Simple to understand and operate
- Limitations: Primary is a bottleneck, if primary dies there is downtime

Stage 3: Big scale (up to PBs) - What we designed
- Consistent hashing spreads data across many computers
- No single boss - any computer can handle requests
- 3 copies of everything
- Can lose computers without any downtime

Stage 4: Global scale (multiple data centers)
- Computers in different cities/countries
- Users talk to the nearest data center
- Data copies across data centers
- Survives entire data center failures
Multi-Region Deployment
Different types of key-value stores for different needs:
| System | How it works | Consistency | Best for |
|---|---|---|---|
| Redis | One primary per shard, data mostly in memory | Strong per shard | Caching, sessions, leaderboards. Super fast. |
| Cassandra | No primary, any node can write, tunable | Eventual or tunable | Logs, time-series, write-heavy apps. |
| DynamoDB | Managed service, no primary, tunable | Eventual or strong | General purpose, serverless apps. |
| etcd | Raft consensus, one leader | Strong | Config storage, leader election. Small data. |
| CockroachDB | Raft per range, SQL support | Strong | When you need SQL + distribution. |
What would I change for different use cases?
For caching (like Redis):
- Keep data only in memory - no disk
- Simpler replication - one primary per shard
- Add eviction - delete old data when memory is full

For time-series data (like metrics):
- Partition by time - all March data together
- Compress aggressively - metrics compress well
- Auto-delete old data - keep only last 30 days

For configuration storage (like etcd):
- Strong consistency always - config must be correct
- Smaller data size - megabytes not terabytes
- Add "watch" feature - notify when config changes

For global deployment:
- Prefer reading from local data center (fast)
- Write might need to cross data centers (slower but safer)
- Use conflict resolution for cross-region writes
Final interview tip
Remember: start simple! Do not jump to the complex global design immediately. Start with single node, then add sharding, then add replication, then add multiple regions. The interviewer wants to see that you know when to add complexity and when to keep it simple.