System Design Masterclass
Infrastructure, disaster-recovery, high-availability, failover, replication, multi-region, advanced

Design Multi-Region Disaster Recovery System

Design a system that keeps apps running when entire data centers fail

Millions of requests per second | Similar to Netflix, Amazon, Google, Banks, Trading Platforms, Healthcare Systems | 45 min read

Summary

A disaster recovery system keeps your app running even when an entire data center goes down (fire, power outage, network failure). The hard parts are: copying data to another location fast enough so you do not lose recent changes (RPO - Recovery Point Objective), switching users to the backup location quickly (RTO - Recovery Time Objective), and knowing when a real disaster happened vs just a small hiccup. Companies like Netflix, Amazon, and banks ask this question because downtime costs millions of dollars per minute.

Key Takeaways

Core Problem

When a data center catches fire or loses power, we need another data center ready to take over. The challenge is keeping the backup in sync without slowing down the main system.

The Hard Part

Data takes time to copy between regions. If the main region fails before data is copied, that data is lost forever. Faster copying means more cost and slower performance.

Key Metrics

RPO (Recovery Point Objective) = how much data can you lose? RTO (Recovery Time Objective) = how long can users wait? A bank might need RPO of 0 seconds and RTO of 30 seconds. A blog might accept RPO of 1 hour and RTO of 4 hours.

Critical Invariant

Never let both regions think they are the primary at the same time (split-brain). This causes data conflicts that are nearly impossible to fix.

Performance Requirement

Failover should complete in under 60 seconds for most apps. During normal operation, replication should not add more than 50ms latency to writes.

Key Tradeoff

Synchronous replication = zero data loss but slower writes. Asynchronous replication = fast writes but might lose recent data. Most systems use async with careful monitoring.

Design Walkthrough

Problem Statement

The Question: Design a disaster recovery system that can automatically switch to a backup data center when the main one fails, with minimal data loss and downtime.

What the system needs to do (most important first):

  1. Keep data in sync - Copy all changes from the main data center to the backup data center continuously.
  2. Detect failures - Know when the main data center is truly down (not just a small network hiccup).
  3. Switch traffic automatically - When disaster happens, send all users to the backup data center without manual work.
  4. Minimize data loss - Lose as little recent data as possible (ideally zero).
  5. Fast recovery - Get users back to a working app within minutes, not hours.
  6. Prevent split-brain - Never let both data centers accept writes at the same time.
  7. Support failback - After the main data center is fixed, move back to it safely.

What to say first

Let me first understand the requirements. What kind of application is this - a banking system, social media, or e-commerce? How much data loss is acceptable - zero, a few seconds, or minutes? How long can users wait during failover - seconds or minutes? These answers will completely change my design.

What the interviewer really wants to see:

  - Do you understand RPO (how much data loss is OK) and RTO (how long users can wait)?
  - Can you explain the tradeoff between synchronous and asynchronous replication?
  - Do you know how to detect a real disaster vs a false alarm?
  - Can you prevent split-brain (both sides thinking they are primary)?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. These questions show the interviewer you understand disaster recovery is not one-size-fits-all.

Question 1: What is the acceptable data loss (RPO)?

If the main data center suddenly dies, how much recent data can we lose? Zero? A few seconds? Minutes?

Why ask this: This is the MOST important question. It decides everything.

What interviewers usually say:

  - Banking/Trading: Zero data loss (RPO = 0)
  - E-commerce: Up to 5 seconds of data loss is OK
  - Social media: Up to 1 minute is acceptable

How this changes your design:

  - RPO = 0 means synchronous replication (every write waits for the backup to confirm)
  - RPO > 0 allows asynchronous replication (much faster, but might lose recent data)

Question 2: How long can users wait (RTO)?

When disaster happens, how many seconds or minutes can users see errors before we must be back online?

Why ask this: This decides how automated your failover needs to be.

What interviewers usually say:

  - Trading platform: Under 30 seconds (every second costs money)
  - E-commerce: Under 5 minutes (users will retry)
  - Internal tools: Under 1 hour (employees can wait)

How this changes your design:

  - RTO < 1 minute needs fully automatic failover with hot standby
  - RTO < 15 minutes can use warm standby with some manual steps
  - RTO > 1 hour can use cold standby (cheaper but slower)

Question 3: What components need protection?

Is it just the database, or do we need to protect application servers, caches, message queues, and external integrations too?

Why ask this: Database failover is hard. Full application failover is much harder.

What interviewers usually say: The whole application stack - databases, app servers, caches, and queues.

How this changes your design: We need to coordinate failover across multiple systems, not just flip a database switch.

Question 4: Automatic or manual failover?

Should the system failover automatically when it detects a problem, or should a human approve it first?

Why ask this: Automatic failover is faster but risky (what if it is a false alarm?). Manual is safer but slower.

What interviewers usually say: Automatic for clear disasters, but alert humans for uncertain situations.

How this changes your design: We need smart health checks that can tell the difference between a true disaster and a temporary glitch.

Summarize your assumptions

Let me summarize: We need RPO of 5 seconds (can lose up to 5 seconds of data), RTO of 60 seconds (users should be back within 1 minute), full stack protection (database, app servers, cache), and automatic failover with human alerts. I will design for an e-commerce platform handling 10,000 requests per second.

The Hard Part

Say this to the interviewer

The hardest part of disaster recovery is the replication lag problem. Data takes time to travel between data centers. If Region A fails before data reaches Region B, that data is gone forever. We must choose between speed and safety.

Why data replication is tricky (explained simply):

  1. Distance causes delay - A round trip between New York and London takes roughly 70 milliseconds over fiber. Every write that waits for confirmation from London is about 70ms slower.
  2. Synchronous is safe but slow - If we wait for London to confirm every write, our app becomes 70ms slower per write. For 1000 writes per second, that is painful.
  3. Asynchronous is fast but risky - If we do not wait, writes are fast. But if New York dies before London got the data, it is lost.
  4. The lag keeps changing - Sometimes the backup is 100ms behind, sometimes 5 seconds behind. We need to monitor this constantly.
  5. Not just databases - We also need to replicate cache data, message queues, and session data. Each has different replication methods.
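
To make the cost of waiting concrete, here is a small back-of-the-envelope sketch in Python. The 70ms round trip and the 5ms local commit time are assumptions for illustration, not measurements.

# Rough impact of synchronous cross-region replication on write latency.
# All numbers here are illustrative assumptions, not measurements.

local_commit_ms = 5          # assumed time to commit inside the primary region
cross_region_rtt_ms = 70     # assumed New York <-> London round trip over fiber

sync_write_ms = local_commit_ms + cross_region_rtt_ms
async_write_ms = local_commit_ms   # replication happens in the background

print(f"Async write latency: ~{async_write_ms} ms")
print(f"Sync write latency:  ~{sync_write_ms} ms")

# A single connection issuing writes one at a time is capped by that latency:
print(f"Max sequential writes per connection (sync): ~{1000 // sync_write_ms} per second")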

Common mistake candidates make

Many candidates say: Just use synchronous replication for everything. This is wrong because: (1) it makes the app very slow - every write waits for cross-region confirmation, (2) if the network between regions has problems, the whole app stops, (3) it costs much more money for network bandwidth.

Three ways to replicate data:

Option 1: Synchronous Replication (Zero data loss)
  - Every write waits until both regions have the data
  - If Region B cannot confirm, the write fails
  - Good: Never lose data
  - Bad: Very slow, app stops if network has issues
  - Use for: Bank account balances, stock trades

Option 2: Asynchronous Replication (Fast but might lose data)
  - Write succeeds immediately in Region A
  - Region B gets the data a few seconds later
  - Good: Fast writes, app works even if network has issues
  - Bad: Might lose last few seconds of data in a disaster
  - Use for: Social media posts, shopping carts

Option 3: Semi-synchronous (Middle ground)
  - Write waits for at least one backup to confirm
  - If the backup is too slow, the write succeeds anyway after a timeout
  - Good: Usually zero data loss, app does not stop if the backup is slow
  - Bad: Still some risk of data loss if the backup is behind
  - Use for: E-commerce orders, user profiles
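
A minimal Python sketch of the semi-synchronous idea: commit locally, then wait up to a timeout for one replica acknowledgment. The commit_locally and send_to_replica callables are hypothetical placeholders, not a real database API.

import concurrent.futures

REPLICA_ACK_TIMEOUT_SECONDS = 0.2   # assumed ceiling on how long a write waits for an ack

def semi_synchronous_write(record, replicas, commit_locally, send_to_replica):
    # commit_locally and send_to_replica are hypothetical callables standing in
    # for the real storage layer and replication transport.
    commit_locally(record)   # the write is durable in the primary region

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    futures = [pool.submit(send_to_replica, replica, record) for replica in replicas]
    done, _ = concurrent.futures.wait(
        futures,
        timeout=REPLICA_ACK_TIMEOUT_SECONDS,
        return_when=concurrent.futures.FIRST_COMPLETED,
    )
    pool.shutdown(wait=False)   # let the remaining replica sends finish in the background

    if any(f.exception() is None for f in done):
        return "acknowledged-by-replica"   # usual case: effectively zero data loss
    return "local-only"                    # replica was slow; small RPO risk, worth alerting on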

Synchronous vs Asynchronous Replication

Scale and Access Patterns

Before designing, let me figure out what we need to protect and how much data we are dealing with.

What we are measuring | Number | What this means for our design
Requests per second | 10,000 | Need fast health checks that do not add latency
Database size | 500 GB | Can replicate in near real-time with good network
+ 5 more rows...

What to tell the interviewer

For an e-commerce site doing 10,000 requests per second with 5-second RPO and 60-second RTO, I will use asynchronous replication with careful monitoring. The backup region will be hot (running and ready) so we can switch in under 60 seconds. I will use a global load balancer to route traffic.

What needs to be replicated (in order of importance):

  1. Database - The most critical. Contains orders, users, products. Must replicate continuously.
  2. Message queues - If someone placed an order but the message is stuck in a queue in the failed region, we lose that order.
  3. Session data - If users are logged in and we lose their sessions, they all have to log in again (annoying but not catastrophic).
  4. Cache data - Cache can be rebuilt from the database. Losing it means slower performance for a few minutes, not data loss.
  5. Static files - Images, CSS, JavaScript. These should already be on a CDN (Content Delivery Network) in multiple locations.

How much network bandwidth do we need for replication?

- Average write size: 2 KB (database row + indexes)
+ 9 more lines...
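
A quick Python sketch of the bandwidth math, using the 2 KB average write size from above and an assumed 30% write ratio for the 10,000 requests per second; plug in your own numbers.

# Back-of-the-envelope replication bandwidth. The write ratio is an assumption.
requests_per_second = 10_000
write_ratio = 0.30                 # assumed share of requests that modify data
avg_write_size_kb = 2              # database row + indexes, from the estimate above

writes_per_second = requests_per_second * write_ratio
bandwidth_kb_per_second = writes_per_second * avg_write_size_kb
bandwidth_mbps = bandwidth_kb_per_second * 8 / 1000   # KB/s -> megabits/s (approx.)

print(f"{writes_per_second:.0f} writes/s -> ~{bandwidth_kb_per_second / 1000:.1f} MB/s "
      f"(~{bandwidth_mbps:.0f} Mbps) of replication traffic before protocol overhead")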

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will show two regions - the primary (main) and the secondary (backup).

What to tell the interviewer

I will use an active-passive setup. The primary region handles all traffic. The secondary region receives replicated data and stays ready. A global load balancer decides which region to send users to. Health checkers watch for problems and trigger failover.

Disaster Recovery Architecture

What each part does and WHY:

Component | What it does | Why it matters for DR
Global Load Balancer | Sends users to the right region. Examples: AWS Route 53, Cloudflare, Google Cloud DNS | This is how we switch traffic during failover. It can route based on health checks.
Health Checker | Constantly checks if each region is working. Checks every 10-30 seconds. | Detects disasters. Must be OUTSIDE both regions (otherwise it fails with the region).
+ 4 more rows...
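
To make the global load balancer row concrete, here is a hedged sketch of DNS-based failover with AWS Route 53 via boto3. The hosted zone ID, domain, and IP addresses are placeholders; exact values depend on your account, and a real setup would also create the health check itself.

import boto3

route53 = boto3.client("route53")

# Placeholder identifiers - replace with real values from your account.
HOSTED_ZONE_ID = "ZEXAMPLE123"
DOMAIN = "app.example.com."

def create_failover_records(primary_ip, secondary_ip, primary_health_check_id):
    # Failover routing policy: Route 53 serves the PRIMARY record while its
    # health check passes, and switches to SECONDARY when it does not.
    changes = []
    for role, ip in (("PRIMARY", primary_ip), ("SECONDARY", secondary_ip)):
        record = {
            "Name": DOMAIN,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,
            "TTL": 30,   # keep the TTL low so clients re-resolve quickly after failover
            "ResourceRecords": [{"Value": ip}],
        }
        if role == "PRIMARY":
            record["HealthCheckId"] = primary_health_check_id
        changes.append({"Action": "UPSERT", "ResourceRecordSet": record})

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": changes},
    )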

Common interview question: Why not active-active?

Interviewers often ask: Why not have both regions active and serving traffic? Your answer: Active-active is great for read-heavy workloads, but creates problems for writes. If a user updates their profile in Region A and another update happens in Region B at the same time, which one wins? Conflict resolution is very hard. Active-passive is simpler and safer for most applications.

Three standby strategies - pick based on your RTO and budget:

Strategy | What it means | RTO | Cost | When to use
Hot Standby | Backup region is running with warm caches, ready to serve instantly | 30-60 seconds | High (paying for idle servers) | Trading platforms, banks, critical e-commerce
Warm Standby | Backup has fewer servers running, needs to scale up | 5-15 minutes | Medium | Most business applications
Cold Standby | Backup region has no servers running, just data backups | 1-4 hours | Low | Dev environments, internal tools, non-critical apps

Real world example: Netflix

Netflix runs active-active across 3 AWS regions. They use a system called Zuul for routing and can survive any single region failure. They test by actually killing regions regularly (Chaos Engineering). This costs them extra but they cannot afford downtime - millions of users would be angry.

Health Checking and Failure Detection

Knowing WHEN to failover is just as hard as knowing HOW to failover. We must avoid two bad situations:

  1. False positive - We think the region is dead but it is actually fine. We failover unnecessarily, which is disruptive and might cause data issues.
  2. False negative - The region is actually dead but we think it is fine. Users see errors while we wait.

What to tell the interviewer

I will use multiple health checkers from different locations. They vote on whether a region is healthy. A single checker saying unhealthy is ignored (might be network issue to that checker). But if 3 out of 5 checkers agree the region is unhealthy for 30 seconds, we trigger failover.

What we check for health:

  1. Can we reach the load balancer? - Basic network connectivity
  2. Do app servers respond? - Send a test request, expect a response in under 1 second
  3. Is the database working? - Can we read and write a test row?
  4. Is replication working? - Is the replica more than X seconds behind?
  5. Are background jobs processing? - Is the message queue backing up?

Health Check Architecture

FUNCTION check_region_health(region):
    // Run health checks from 5 different locations
    results = []
+ 36 more lines...
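
The pseudocode above is truncated, so here is a compact Python sketch of the same quorum idea: several independent checkers vote, and failover is only proposed when a majority has reported the region unhealthy for a sustained window. The probe_from helper is a hypothetical stand-in for an HTTP check run from each checker location.

import time

QUORUM = 3                # 3 of 5 checkers must agree
UNHEALTHY_WINDOW_S = 30   # the condition must hold for 30 seconds

def region_is_unhealthy(region, checker_locations, probe_from):
    # probe_from(location, region) -> True if that checker sees the region healthy.
    # It is a hypothetical helper wrapping an HTTP health-endpoint call.
    failed_votes = sum(1 for loc in checker_locations if not probe_from(loc, region))
    return failed_votes >= QUORUM

def should_trigger_failover(region, checker_locations, probe_from, interval_s=5):
    # Require the quorum to hold continuously for UNHEALTHY_WINDOW_S before failover.
    deadline = time.monotonic() + UNHEALTHY_WINDOW_S
    while time.monotonic() < deadline:
        if not region_is_unhealthy(region, checker_locations, probe_from):
            return False          # a single healthy reading resets the decision
        time.sleep(interval_s)
    return True                   # sustained quorum of failures: propose failover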

The split-brain problem

The scariest scenario: Region A thinks it is down (network issue) but it is actually fine and still serving some users. Region B takes over and also serves users. Now BOTH regions are accepting writes and their databases diverge. This is called split-brain and causes data corruption that is very hard to fix.

How to prevent split-brain:

  1. Fencing - Before Region B becomes primary, it must make sure Region A cannot accept writes anymore. This is called fencing or STONITH (Shoot The Other Node In The Head).
  2. Lease-based leadership - Region A has a lease (like a rental agreement) that says I am the primary until time X. It must renew this lease regularly. If it cannot renew (network down), the lease expires and Region B can take over.
  3. Quorum-based systems - Use a coordination service like ZooKeeper or etcd that requires majority agreement before any region can claim to be primary.

SETUP:
    Region A holds a lease that says: "I am primary until 10:00:05"
    Region A must renew this lease every 5 seconds
+ 21 more lines...
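
A hedged sketch of the lease idea using the python-etcd3 client (assuming an etcd cluster is reachable; the endpoint and key name are placeholders). The primary keeps refreshing a short-lived lease; if it cannot, it must stop accepting writes, which is exactly what prevents split-brain.

import time
import etcd3   # python-etcd3 client; assumes an etcd cluster is available

LEASE_TTL_SECONDS = 5
LEADER_KEY = "/dr/primary-region"   # arbitrary key name chosen for this sketch

client = etcd3.client(host="etcd.internal", port=2379)   # placeholder endpoint

def run_as_primary(region_name, accept_writes, stop_accepting_writes):
    # accept_writes / stop_accepting_writes are hypothetical hooks into the app.
    lease = client.lease(LEASE_TTL_SECONDS)
    # NOTE: production code should claim the key atomically (for example an etcd
    # transaction that only writes if the key is absent) to avoid a startup race.
    client.put(LEADER_KEY, region_name, lease=lease)
    accept_writes()
    try:
        while True:
            time.sleep(LEASE_TTL_SECONDS / 2)
            lease.refresh()   # raises if etcd is unreachable
    except Exception:
        # Could not renew: assume the lease expired and another region may be
        # primary now. Stop accepting writes immediately to avoid split-brain.
        stop_accepting_writes()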

The Failover Process

When we decide to failover, we need to do it carefully and in the right order. Rushing can cause data loss or split-brain.

What to tell the interviewer

Failover has 5 steps: (1) Stop sending new traffic to Region A, (2) Wait for in-flight requests to finish, (3) Confirm Region A is truly stopped, (4) Promote Region B database to primary, (5) Send traffic to Region B. Each step must complete before the next starts.

Failover Process Step by Step

FUNCTION execute_failover(from_region, to_region):
    
    log("Starting failover from " + from_region + " to " + to_region)
+ 60 more lines...
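
Since the block above is truncated, here is a minimal Python sketch of the five-step sequence. Every helper on the infra object (stop_traffic_to, drain_requests, fence_region, promote_database, route_traffic_to) is a hypothetical placeholder for the real infrastructure calls.

import logging

log = logging.getLogger("failover")

def execute_failover(from_region, to_region, infra):
    # infra is a hypothetical object wrapping the real infrastructure operations.
    # Each step must succeed before the next one starts; any failure aborts.
    log.info("Starting failover from %s to %s", from_region, to_region)

    infra.stop_traffic_to(from_region)                 # 1. stop sending NEW requests
    infra.drain_requests(from_region, timeout_s=10)    # 2. let in-flight requests finish

    if not infra.fence_region(from_region):            # 3. old primary must not accept writes
        raise RuntimeError("Could not fence old primary - aborting to avoid split-brain")

    infra.promote_database(to_region)                  # 4. replica becomes the new primary
    infra.route_traffic_to(to_region)                  # 5. send users to the new primary

    log.info("Failover complete: %s is now primary", to_region)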

How long does each step take?

Step 1 (stop traffic): 1-2 seconds. Step 2 (drain requests): 5-10 seconds. Step 3 (fence): 2-5 seconds. Step 4 (promote): 5-15 seconds. Step 5 (route traffic): 2-5 seconds. Total: 15-40 seconds for a well-prepared hot standby. DNS propagation can add minutes, which is why we use Global Load Balancers that do not rely on DNS.

Data Replication Strategies

Let me dive deeper into how we actually copy data between regions. Different data needs different strategies.

What to tell the interviewer

I will use different replication strategies for different types of data. Critical financial data uses synchronous replication. User activity and logs use asynchronous. Cache data does not need replication - we rebuild it from the database.

Data Type | Replication Strategy | Why | What if we lose it?
Account balances, payments | Synchronous | Cannot lose money transactions | Customer loses money, lawsuits
Orders, bookings | Semi-synchronous | Important but can recover from logs | Need to reprocess, customer inconvenience
User sessions | Asynchronous | Can regenerate, users just re-login | Annoying but not critical
Cache data | Do not replicate | Rebuilt from database automatically | Slower performance for a few minutes
Logs and analytics | Asynchronous with batching | High volume, not real-time critical | Might lose recent logs, usually OK

Database replication in detail:

For PostgreSQL:
  - Built-in streaming replication sends every change to replicas
  - Can configure synchronous_commit for critical tables
  - Replication lag is usually under 1 second on good networks

For MySQL:
  - Semi-synchronous replication waits for at least one replica
  - Group Replication provides automatic failover
  - Can use tools like Orchestrator for managing replicas

FUNCTION monitor_replication_lag():
    // Run this every second
    // Alert if lag gets too high
+ 22 more lines...
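
Here is a hedged Python version of that lag monitor using psycopg2 against the PostgreSQL replica. pg_last_xact_replay_timestamp() is a real PostgreSQL function available on standbys, while the connection string and the send_alert helper are placeholders.

import time
import psycopg2   # assumes psycopg2 is installed and the replica accepts this connection

LAG_ALERT_SECONDS = 5   # matches the 5-second RPO target

LAG_QUERY = """
    SELECT COALESCE(
        EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())),
        0
    ) AS lag_seconds;
"""

def monitor_replication_lag(replica_dsn, send_alert):
    # replica_dsn and send_alert are placeholders for your connection string
    # and alerting hook (PagerDuty, Slack, etc.).
    conn = psycopg2.connect(replica_dsn)
    conn.autocommit = True
    with conn.cursor() as cur:
        while True:
            cur.execute(LAG_QUERY)
            lag_seconds = float(cur.fetchone()[0])
            if lag_seconds > LAG_ALERT_SECONDS:
                send_alert(f"Replication lag is {lag_seconds:.1f}s - RPO at risk")
            time.sleep(1)   # the pseudocode above runs this every second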

Message queue replication:

Queues like Kafka and RabbitMQ need special handling:

Option 1: Mirrored queues - Every message is copied to the backup region immediately. More reliable but slower.

Option 2: Consumer acknowledgment - Messages stay in the queue until the consumer confirms processing. If region fails, unprocessed messages are reprocessed in the new region.

Option 3: Idempotent processing - Design your system so processing the same message twice is safe. Then you can replay messages after failover without duplicates.
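
A small Python sketch of Option 3: the consumer records processed message IDs so a replay after failover is harmless. An in-memory set stands in here for what would normally be a durable store such as a database table or Redis.

# Idempotent consumer sketch: processing the same message twice has no effect.
processed_message_ids = set()   # in real systems this would be a durable store

def handle_message(message, process):
    # message is assumed to carry a unique, stable "id"; process() does the real work.
    if message["id"] in processed_message_ids:
        return "duplicate-ignored"      # replayed after failover: safe to skip
    process(message)
    processed_message_ids.add(message["id"])
    return "processed"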

The hardest part: In-flight transactions

What happens to a request that was in the middle of processing when the region died? The user clicked Buy Now, we charged their card, but the region died before we saved the order. Now the user is charged but has no order. Solution: Use a two-phase approach - first reserve (card auth), then confirm (charge) only after the order is safely saved.
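
A sketch of that reserve-then-confirm flow in Python. authorize_card, save_order, capture_payment, and void_authorization are hypothetical helpers; the point is the ordering, where money is only captured after the order is durably saved (and replicated).

def place_order(order, authorize_card, save_order, capture_payment, void_authorization):
    # Two-phase purchase: reserve first, charge only after the order is durable.
    # All four callables are hypothetical stand-ins for payment/storage services.
    auth_id = authorize_card(order["payment_details"], order["total"])   # 1. reserve funds only
    try:
        save_order(order, auth_id)        # 2. persist (and replicate) the order
    except Exception:
        void_authorization(auth_id)       # region died or save failed: release the hold
        raise
    capture_payment(auth_id)              # 3. actually charge, now that the order exists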

Failback: Returning to the Original Region

After we fix the original region, we usually want to move back to it. This is called failback. It is not as urgent as failover, but we need to do it carefully.

What to tell the interviewer

Failback is planned, not emergency. We take our time to: (1) Bring the old region back online, (2) Sync all the data that changed during the outage, (3) Test thoroughly that everything works, (4) Schedule failback during low traffic, (5) Monitor closely after switching back.

Why failback is different from failover:

  1. Not urgent - Users are already being served. We can take hours or days.
  2. Data sync is backwards - Now Region B has the latest data and Region A needs to catch up.
  3. Need to validate - Region A failed for a reason. We must verify it is truly fixed.
  4. Can schedule for low traffic - Do it at 3 AM on a Sunday, not during peak hours.

FUNCTION execute_failback(current_region, original_region):
    // current_region = where we are running now (was the backup)
    // original_region = where we want to return to (was the primary)
+ 55 more lines...

Do you always need to failback?

Not necessarily. If both regions are equally capable, you can just stay in the new region and make the old one your new backup. Failback is mainly needed when: (1) The original region has better hardware, (2) Cost is lower in the original region, (3) Most users are closer to the original region, (4) Compliance requires primary to be in a specific location.

Testing Your Disaster Recovery

Say this to the interviewer

A disaster recovery plan that is never tested is not a plan - it is a hope. We must regularly test failover to make sure it actually works. Netflix famously runs Chaos Monkey that randomly kills servers. We should do similar testing for regional failover.

Types of DR testing (from least to most realistic):

1. Tabletop exercise (easiest)
  - Team sits in a room and walks through the failover plan
  - Ask what-if questions: What if the database takes 10 minutes to promote?
  - Find gaps in documentation and runbooks
  - Do this quarterly

2. Component testing (medium)
  - Test individual pieces: Can we promote the database? Does the DNS update work?
  - Does not test the full end-to-end flow
  - Do this monthly

3. Simulated failover (harder)
  - Route a small percentage of traffic (5%) to the backup region
  - Verify it handles real user requests correctly
  - Do this monthly

4. Full failover drill (hardest but most valuable)
  - Actually failover the entire application to the backup region
  - Real users are served from the backup
  - Do this quarterly or twice a year
  - Schedule during low traffic and have engineers ready

// Netflix-style chaos testing
// Run these tests regularly to find weaknesses
+ 28 more lines...
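
The chaos-testing block above is truncated, so here is a hedged Python sketch of a drill runner a game-day script might follow. simulate_region_outage, wait_for_failover, verify_user_journey, and restore_region are hypothetical helpers wrapping your real chaos and monitoring tooling.

import time

def run_failover_drill(region, tooling, rto_budget_s=60):
    # tooling is a hypothetical object wrapping the real chaos/monitoring hooks.
    start = time.monotonic()

    tooling.simulate_region_outage(region)            # e.g. block traffic at the network layer
    tooling.wait_for_failover(timeout_s=rto_budget_s)

    elapsed = time.monotonic() - start
    checks_passed = tooling.verify_user_journey()     # login, browse, checkout, etc.

    tooling.restore_region(region)                    # always end the drill cleanly
    return {
        "failover_seconds": round(elapsed, 1),
        "within_rto": elapsed <= rto_budget_s,
        "user_journey_ok": checks_passed,
    }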

Game days

Many companies run Game Days where they intentionally cause failures to test their systems. Google, Amazon, and Netflix do this regularly. The rule is: if you are afraid to test your disaster recovery, that means you REALLY need to test it.

What Can Go Wrong

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong with disaster recovery and how we protect against them.

What breaks | What happens | How we prevent it | How we detect it
Replication falls behind | Data loss during failover | Monitor lag, alert if > 5 seconds | Metrics dashboard, alerts
False failover trigger | Unnecessary disruption, possible data issues | Multiple health checkers must agree, wait 30 seconds | Require human confirmation for uncertain cases
+ 5 more rows...

The nightmare scenario: Split-brain

This is the worst thing that can happen. Here is how it could occur:

  1. Region A is the primary, serving users
  2. Network between Region A and the health checkers fails (but Region A is fine)
  3. Health checkers think Region A is dead
  4. Region B becomes primary
  5. But Region A is still running and some users are still reaching it!
  6. Now BOTH regions are accepting writes
  7. User updates their email to alice@new.com in Region A
  8. Same user updates their email to alice@other.com in Region B
  9. When regions reconnect, which email wins?

Prevention: Region A must KNOW it lost the health check connection and stop accepting writes itself. This is why we use leases - Region A stops accepting writes when its lease expires, even if it thinks it is fine.

What happens if split-brain does occur?

You need to: (1) Immediately stop one region, (2) Compare the data between both regions, (3) Manually merge the differences (this is very hard), (4) Figure out what caused it and fix the prevention mechanism. Some companies have automated conflict resolution (last write wins) but this can lose data. It is better to prevent split-brain than to recover from it.

Cost Considerations

Disaster recovery costs money. The interviewer may ask about this. Here is how to think about it.

What to tell the interviewer

DR cost is about trading money for safety. Hot standby costs roughly double (duplicate everything). Warm standby costs 30-50% more. Cold standby costs 10-20% more. The right choice depends on how much downtime costs your business.

Cost Component | Hot Standby | Warm Standby | Cold Standby
Compute (servers) | 100% duplicate | 20-30% of primary | 0% (start when needed)
Database | Full replica running | Full replica running | Backup snapshots only
+ 4 more rows...

How to calculate if DR is worth it:

  1. Estimate downtime cost: How much money do you lose per minute of downtime? For a big e-commerce site doing $10M/day, that is about $7,000 per minute.
  2. Estimate disaster probability: How often do major outages happen? AWS regions have about 99.99% uptime, meaning roughly 1 hour of downtime per year.
  3. Calculate expected loss without DR: 1 hour of downtime x $7,000/minute = $420,000 per year.
  4. Compare to DR cost: If hot standby costs $200,000/year but saves $420,000 in downtime, it is worth it.
  5. Add reputation damage: Downtime hurts trust. Customers might leave permanently. This is hard to measure but real.

COMPANY: E-commerce site
Revenue: $10 million per day = $7,000 per minute
Primary infrastructure cost: $50,000 per month
+ 21 more lines...
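
A Python version of the expected-loss math above, using the same assumed numbers; swap in your own revenue, uptime, and DR costs.

# Expected-loss calculation using the assumptions above.
revenue_per_minute = 7_000          # ~$10M/day e-commerce site
expected_downtime_minutes = 60      # ~99.99% regional uptime -> roughly 1 hour/year
hot_standby_cost_per_year = 200_000

expected_loss_without_dr = revenue_per_minute * expected_downtime_minutes   # $420,000
net_benefit = expected_loss_without_dr - hot_standby_cost_per_year          # $220,000

print(f"Expected downtime loss without DR: ${expected_loss_without_dr:,}/year")
print(f"Hot standby cost:                  ${hot_standby_cost_per_year:,}/year")
print(f"Net benefit of hot standby:        ${net_benefit:,}/year (before reputation damage)")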

Cloud provider DR features

AWS, Google Cloud, and Azure all have built-in DR features that can reduce costs: (1) Reserved instances for standby (30-40% cheaper), (2) Auto-scaling that can spin up servers quickly, (3) Managed database failover (RDS Multi-AZ, Cloud SQL HA), (4) Cross-region replication built into their services. Using these managed services is often cheaper than building your own DR system.

Design Trade-offs

Advantages

  • + Simple to understand and operate
  • + No write conflicts - only one region accepts writes
  • + Lower cost - standby does not need full capacity
  • + Easier to reason about data consistency

Disadvantages

  • - Standby resources are mostly idle (wasteful)
  • - Failover takes time (30-60 seconds for hot standby)
  • - Users located near the backup region see higher latency during normal operation

When to use

Most applications. Good default choice. Use this unless you have specific reasons for active-active.