Why Distributed Systems Fail: 15 Failure Scenarios Every Engineer Must Know
A comprehensive guide to the most common failure modes in distributed systems, from network partitions to split-brain scenarios, with practical fixes for each.
It was 2 AM on Black Friday. Our payment system processed $50 million per hour. Then a single network switch failed.
Within 90 seconds, our entire checkout system went down. Not because the switch was critical. It was not. But because of a cascade of failures we never anticipated. A timeout here. A retry storm there. A circuit breaker that did not trip. A health check that lied.
We lost $2.3 million in those 47 minutes of downtime.
Distributed systems do not fail like single machines. They fail in strange, unexpected ways. They fail partially. They fail silently. They fail in ways that make you question reality itself.
After 15 years of building systems at scale, and watching them break in spectacular fashion, I have seen the same failure patterns repeat across companies, technologies, and teams. This guide covers the 15 most common ways distributed systems fail, why they happen, and how to fix them.
The Nature of Distributed System Failures
Before we dive into specific failures, understand three fundamental truths:
1. Partial failures are the norm. In a distributed system, some parts can fail while others work fine. This is fundamentally different from a single machine, which is either up or down.
2. Failures are often silent. A service might return wrong data instead of an error. A message might be lost without notification. A node might think it is the leader when it is not.
3. Failures combine in unexpected ways. Two small issues that are harmless alone can combine to create catastrophic failures. This is why distributed systems are hard to reason about.
Let us examine each failure mode in detail.
1. Network Partitions
What happens: Nodes cannot communicate with each other, even though each node is healthy. The network splits into separate groups that cannot reach each other.
Real example: In 2012, a network partition split a major cloud provider's cluster. Nodes on both sides thought they were the majority. Both sides elected leaders. Both sides accepted writes. When the partition healed, they had two conflicting versions of the truth.
Why it is dangerous:
- Nodes cannot distinguish between "other node is dead" and "network is broken"
- Each partition may continue operating independently
- Data can diverge, creating conflicts that are hard to resolve
How to fix it:
- Accept that partitions will happen. Design your system assuming network failures are normal, not exceptional.
- Use quorum-based decisions. Require a majority of nodes to agree before making changes. A 3-node cluster needs 2 nodes. A 5-node cluster needs 3. This prevents split-brain scenarios.
- Choose your CAP trade-off explicitly. During a partition, you must choose between consistency and availability. Know which one your system prioritizes.
- Implement partition detection. Use heartbeats with reasonable timeouts. When a node stops responding, assume partition until proven otherwise.
- Design for partition recovery. Have a clear strategy for reconciling data when partitions heal: last-write-wins, vector clocks, or application-specific merge logic.
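The quorum arithmetic above is simple enough to sketch directly; these helpers are illustrative, not from any particular library:

```python
def quorum_size(cluster_size):
    # Strict majority: the smallest group that can make decisions alone
    return cluster_size // 2 + 1

def has_quorum(acks, cluster_size):
    # A write or leader election proceeds only with majority acknowledgment
    return acks >= quorum_size(cluster_size)

# A 3-node cluster needs 2 acks; a 5-node cluster needs 3.
```

Because two disjoint majorities cannot exist in the same cluster, at most one side of a partition can keep accepting writes.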
2. Split-Brain Syndrome
What happens: Multiple nodes believe they are the leader or primary. Each accepts writes independently, causing data divergence and corruption.
Real example: A database cluster used a simple heartbeat for leader election. The leader became slow due to garbage collection. Followers stopped receiving heartbeats and elected a new leader. The old leader finished GC, did not know it was demoted, and continued accepting writes. Two leaders, two versions of truth.
Why it is dangerous:
- Both leaders accept writes, creating conflicting data
- Clients get different responses depending on which leader they reach
- Data corruption may not be detected until much later
How to fix it:
- Use fencing tokens. Every leader gets a unique, monotonically increasing token. Resources only accept requests from the highest token they have seen.
- Implement STONITH (Shoot The Other Node In The Head). When you become leader, forcibly revoke the old leader's ability to make changes: cut its network, or fence off its storage access.
- Use distributed consensus. Protocols like Raft or Paxos guarantee exactly one leader. They are complex but solve split-brain correctly.
- Add lease-based leadership. Leaders must periodically renew their lease. If they cannot renew (due to slowness or partition), they automatically step down.
```python
# Example: Fencing token approach
class StaleLeaderError(Exception):
    pass

class FencedResource:
    def __init__(self):
        self.highest_token_seen = 0

    def write(self, token, data):
        if token < self.highest_token_seen:
            raise StaleLeaderError("Your token is outdated")
        self.highest_token_seen = token
        # Proceed with write
        self._do_write(data)
```
3. Cascading Failures
What happens: One component fails, causing dependent components to fail, which causes their dependents to fail. A small failure snowballs into a system-wide outage.
Real example: A single database became slow due to a bad query. Services waiting for the database accumulated connections. Connection pools filled up. Services could not get connections, so they timed out. Clients retried, creating more load. The retry storm made everything worse. Within minutes, the entire system was down.
Why it is dangerous:
- Small failures amplify into large ones
- Systems fail in non-obvious ways
- Recovery is hard because everything is broken at once
How to fix it:
- Implement circuit breakers. When a downstream service fails repeatedly, stop calling it. Return a fallback response or error immediately. This prevents waiting and resource exhaustion.
```python
# Circuit breaker example
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.last_failure_time = None

    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"  # Let one trial request through
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial request, or too many failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        else:
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"  # Trial succeeded; resume normal operation
            self.failure_count = 0
            return result
```
- Use bulkheads. Isolate different parts of your system. If the payment service is failing, it should not bring down product browsing. Separate thread pools, connection pools, and resources.
- Set aggressive timeouts. Do not wait forever for slow services. A 5-second timeout is often better than a 30-second one. Fast failure is better than slow failure.
- Implement backpressure. When overwhelmed, reject new requests rather than queuing them infinitely. Tell callers to slow down.
- Limit retry attempts. Retries with exponential backoff help, but unlimited retries during an outage create retry storms. Set maximum retry counts.
4. Byzantine Failures
What happens: A node behaves incorrectly, not by crashing, but by sending wrong or malicious data. It might lie about its state, send conflicting messages to different nodes, or corrupt data.
Real example: A server had faulty RAM that occasionally flipped bits. It did not crash. It just sometimes returned wrong data. Other servers received this wrong data and made decisions based on it. The corruption spread through the system before anyone noticed.
Why it is dangerous:
- The node does not appear to be failing
- Wrong data propagates through the system
- Standard failure detection does not catch it
- Other nodes trust the bad data
How to fix it:
- Use checksums and hashes. Verify data integrity at every step. Corrupted data should be detected and rejected.
- Implement Byzantine fault-tolerant protocols. Protocols like PBFT (Practical Byzantine Fault Tolerance) can tolerate up to f Byzantine nodes in a 3f+1 node system. Complex but necessary for critical systems.
- Add redundancy with voting. Read from multiple sources and compare. If three nodes return data and two agree, use the majority answer.
- Use cryptographic signatures. Nodes sign their messages. Other nodes verify signatures. This prevents nodes from claiming they sent different messages.
- Monitor for inconsistencies. Continuously compare data across nodes. Alert when nodes disagree about facts they should agree on.
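As a minimal sketch of the checksum advice, using Python's standard hashlib (the payload here is made up for illustration):

```python
import hashlib

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected: str) -> bool:
    # Reject any data whose hash differs from what the sender attached
    return checksum(payload) == expected

record = b"balance=100"
digest = checksum(record)
assert verify(record, digest)               # intact data passes
assert not verify(b"balance=900", digest)   # a flipped digit is caught
```

Attaching and re-verifying the digest at every hop means a bit-flipping node corrupts only its own copy, not the system's view of the truth.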
5. Clock Skew and Time Issues
What happens: Different servers have different ideas of what time it is. This breaks any logic that depends on timestamps: ordering events, cache expiration, certificate validation, distributed locks.
Real example: A distributed lock service used timestamps to determine lock ownership. Server A acquired a lock at 10:00:00 by its own clock. Server B's clock ran 5 seconds fast, so from B's point of view the lock was already 5 seconds old the instant it was created. B concluded the lock had expired 5 seconds early and took ownership while A, trusting its own clock, still believed it held the lock. Now both A and B believed they owned the lock.
Why it is dangerous:
- Events appear to happen in wrong order
- Locks expire too early or too late
- Cache invalidation fails
- Security certificates rejected incorrectly
How to fix it:
- Never trust wall-clock time for ordering. Use logical clocks (Lamport clocks) or vector clocks instead. These track causal ordering without relying on physical time.
- Use NTP consistently. Run NTP on all servers. Monitor clock drift. Alert when skew exceeds acceptable limits.
- Use TrueTime or similar. Google's Spanner uses TrueTime, which provides bounded clock uncertainty. You know time is within a range, and you can wait out the uncertainty.
- Design for clock uncertainty. If a lock expires at time T, consider it expired at T minus maximum clock skew. Build in safety margins.
- Use lease durations, not absolute times. Instead of "expires at 10:00:00", use "expires in 30 seconds". This is more robust to clock differences.
```python
from datetime import datetime

# Bad: Absolute time
lock_expires_at = datetime(2024, 1, 15, 10, 0, 0)

# Good: Relative duration with safety margin
LOCK_DURATION = 30      # seconds
CLOCK_SKEW_MARGIN = 5   # seconds
effective_lock_duration = LOCK_DURATION - CLOCK_SKEW_MARGIN
```
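To make the logical-clock advice from fix 1 concrete, here is a minimal Lamport clock sketch (illustrative, not a library API):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current logical time
        return self.tick()

    def receive(self, msg_time):
        # On receipt, jump past the sender's timestamp so causality is preserved
        self.time = max(self.time, msg_time) + 1
        return self.time
```

If node A stamps a send at logical time 1, any node receiving it moves to at least time 2, so the receive always orders after the send regardless of either machine's wall clock.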
6. Message Loss and Duplication
What happens: Messages between services get lost, duplicated, or delivered out of order. A message you sent might never arrive. Or it might arrive twice. Or messages sent in order A, B, C arrive as C, A, B.
Real example: An order service sent a "process payment" message to the payment service. The message was lost due to a network hiccup. The order remained stuck in "pending" state forever. No error was raised. The customer never got their order.
Why it is dangerous:
- Lost messages mean lost work
- Duplicate messages mean duplicate work (double charges, double orders)
- Out-of-order messages cause logical errors
How to fix it:
- Use message queues with acknowledgments. Send message, wait for ACK. If no ACK, resend. Messages are not lost because unacknowledged messages are retried.
- Make operations idempotent. If the same message is processed twice, the result should be the same. Use idempotency keys.
```python
# Idempotent payment processing
def process_payment(idempotency_key, amount, user_id):
    # Check if already processed
    existing = db.query(
        "SELECT * FROM payments WHERE idempotency_key = ?",
        idempotency_key
    )
    if existing:
        return existing  # Return previous result, don't charge again

    # Process new payment
    result = payment_gateway.charge(amount, user_id)

    # Store with idempotency key. In production, back this with a unique
    # constraint on idempotency_key to close the check-then-insert race.
    db.insert("payments", {
        "idempotency_key": idempotency_key,
        "result": result
    })
    return result
```
- Use sequence numbers for ordering. Each message gets a sequence number. Receivers process in order, buffering out-of-order messages.
- Implement exactly-once semantics. This is hard. Most systems provide at-least-once (with idempotency) or at-most-once. True exactly-once requires careful design.
- Set up dead letter queues. Messages that fail repeatedly go to a dead letter queue for manual inspection rather than being lost forever.
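The sequence-number idea above can be sketched as a small receiver that buffers early arrivals (class and field names are illustrative):

```python
class OrderedReceiver:
    """Delivers messages in sequence order, buffering any that arrive early."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.delivered = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        # Flush every contiguous message starting at the expected sequence number
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Messages arriving in the order 3, 1, 2 are still delivered as 1, 2, 3; message 3 simply waits in the buffer until the gap is filled.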
7. Thundering Herd
What happens: Many clients simultaneously make the same request, overwhelming a service. Common after a cache expires, a service restarts, or a popular item becomes available.
Real example: A flash sale started at noon. We cached product data with a 10-minute TTL. At 12:00:00, 100,000 users hit the site. At 12:10:00, the cache expired. All 100,000 users' requests went to the database simultaneously. The database could not handle 100,000 concurrent queries. It died. The site went down during peak traffic.
Why it is dangerous:
- Cache miss storms during high traffic
- Startup storms when services restart
- Database overload
- Self-inflicted denial of service
How to fix it:
- Use cache locking. When cache misses, only one request fetches from the database. Others wait for the cache to be populated.
```python
def get_with_cache_lock(key):
    # Try cache first
    value = cache.get(key)
    if value:
        return value

    # Try to acquire lock
    lock_key = f"lock:{key}"
    if cache.set(lock_key, "1", nx=True, ex=10):  # Got the lock
        # Fetch from database
        value = database.get(key)
        cache.set(key, value, ex=300)
        cache.delete(lock_key)
        return value
    else:
        # Someone else is fetching, wait and retry
        time.sleep(0.1)
        return get_with_cache_lock(key)  # Retry
```
- Stagger cache expiration. Add random jitter to TTLs. Instead of all caches expiring at exactly 10 minutes, use 9-11 minutes randomly. This spreads the load.
- Use background refresh. Before cache expires, refresh it in the background. The cache never actually expires for users.
- Implement request coalescing. If many identical requests arrive simultaneously, only make one backend request and share the result.
- Use circuit breakers on the read path. If the database is overwhelmed, return stale cached data or a fallback response rather than adding more load.
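The TTL-jitter suggestion is nearly a one-liner; the constants here are illustrative:

```python
import random

BASE_TTL = 600  # seconds (10 minutes)

def jittered_ttl(base=BASE_TTL, spread=0.1):
    # Expire somewhere in [base - 10%, base + 10%] so keys written together
    # do not all expire in the same instant
    return base + random.uniform(-spread, spread) * base
```

With a 10% spread, 100,000 keys cached at noon expire over a two-minute window instead of hitting the database in the same second.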
8. Hot Spots and Uneven Load
What happens: Some nodes or partitions receive much more traffic than others. While most of your cluster sits idle, one node is overwhelmed.
Real example: We sharded user data by user ID. Justin Bieber created an account. His user ID landed on shard 7. Every time he tweeted, millions of followers queried shard 7. That single shard received 1000x more traffic than others. It became a bottleneck for the entire system.
Why it is dangerous:
- One node becomes a bottleneck
- Scaling out does not help (you just add idle nodes)
- Performance is limited by the hottest shard
- Single point of failure for hot data
How to fix it:
- Use consistent hashing with virtual nodes. Each physical node has multiple virtual nodes spread across the hash ring. This improves distribution.
- Identify and split hot keys. If user 123 is hot, split their data across multiple shards. Read from any shard, write to all.
- Add caching for hot data. If a key is hot, cache it aggressively. Celebrities' profiles can be cached at edge nodes.
- Use read replicas for hot reads. Fan out read traffic across multiple replicas. Only writes go to the primary.
- Implement adaptive load balancing. Route traffic based on actual node load, not just consistent hashing. If a node is hot, send new requests elsewhere.
```python
# Hot key splitting example
import random

def write_celebrity_tweet(user_id, tweet):
    if is_celebrity(user_id):
        # Write to multiple shards for fan-out
        shards = get_celebrity_shards(user_id)  # e.g., 10 shards
        for shard in shards:
            shard.write(tweet)
    else:
        # Normal users: single shard
        shard = get_shard(user_id)
        shard.write(tweet)

def read_celebrity_tweets(user_id):
    if is_celebrity(user_id):
        # Read from any shard (random for load balancing)
        shard = random.choice(get_celebrity_shards(user_id))
        return shard.read(user_id)
    else:
        return get_shard(user_id).read(user_id)
```
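The virtual-node suggestion from fix 1 can be sketched as a simple hash ring (the vnode count and hash function are arbitrary choices for illustration):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears at many points on the ring
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first vnode at or after the key's hash
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]
```

With many virtual nodes per physical node, each node's share of the key space evens out, and removing a node remaps only the keys it owned rather than reshuffling everything.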
9. Retry Storms
What happens: When a service slows down or fails, clients retry. The retries add more load. The service becomes slower. More retries. More load. A death spiral.
Real example: A payment gateway had a brief slowdown. Requests that normally took 100ms took 5 seconds. Clients had 3-second timeouts, so they retried. The gateway now had 2x the requests. It became slower. More timeouts. More retries. 4x requests. 8x. Within a minute, the gateway was processing 100x normal load and completely unresponsive.
Why it is dangerous:
- Retries amplify load during failures
- Can turn minor slowdowns into major outages
- Recovery becomes impossible while retries continue
- System cannot stabilize
How to fix it:
- Exponential backoff with jitter. Each retry waits longer than the last, with randomization to prevent synchronized retries.
```python
import random
import time

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries; surface the error
            # Exponential backoff with jitter
            base_delay = min(2 ** attempt, 60)  # Cap at 60 seconds
            jitter = random.uniform(0, base_delay * 0.1)
            time.sleep(base_delay + jitter)
```
- Limit retry count. Do not retry forever. After 3-5 retries, give up and return an error. Let the user decide whether to try again.
- Use circuit breakers. If a service is failing, stop calling it entirely. This removes retry load completely.
- Implement adaptive throttling. Track success rate. When it drops, reduce request rate proactively. Do not wait for failures.
- Add client-side rate limiting. Even if servers are available, clients should limit their own request rate to prevent overwhelming the system.
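Client-side rate limiting is commonly implemented as a token bucket; here is a minimal sketch (parameter values are illustrative):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, bursting up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Caller should drop or delay, not retry immediately
```

A client that checks `allow()` before every outbound call caps its own contribution to a retry storm, no matter how aggressive the retry logic above it is.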
10. Resource Exhaustion
What happens: A system runs out of a critical resource: memory, file descriptors, connections, disk space, threads. It stops functioning even though it is not technically "crashed".
Real example: A service had a slow memory leak, losing about 10MB per hour. Nobody noticed because memory was abundant. After 2 weeks, memory was full. The service started swapping to disk. Response times went from 10ms to 10 seconds. The service was technically running but practically dead.
Why it is dangerous:
- Service appears up but does not work
- Happens gradually, then suddenly
- Often not caught by simple health checks
- Can affect other services on same machine
Common resources that get exhausted:
| Resource | Symptom | Common Cause |
|---|---|---|
| Memory | Swap, OOM kills | Memory leaks, unbounded caches |
| File descriptors | "Too many open files" | Connection leaks, unclosed files |
| Connections | "Connection refused" | Pool exhaustion, slow queries |
| Disk space | Write failures | Logs, temp files, data growth |
| Threads | Deadlock, slow response | Blocking calls, thread leaks |
| CPU | High latency, timeouts | Infinite loops, inefficient code |
How to fix it:
- Set resource limits. Use cgroups, container limits, or application-level limits. A runaway process should not take down the entire machine.
- Implement graceful degradation. When resources are low, start rejecting new work rather than accepting everything and performing poorly.
- Monitor resource usage. Alert before exhaustion. If memory is at 80%, investigate before it hits 100%.
- Use timeouts and deadlines. Resources held indefinitely will eventually exhaust. Set limits on how long operations can hold resources.
- Implement resource pools. Use connection pools, thread pools, and object pools. These provide bounded resource usage and faster leak detection.
```python
# Connection pool with limits
import threading
import time
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, max_connections=100):
        self.max_connections = max_connections
        self.available = []
        self.in_use = 0
        self.lock = threading.Lock()

    @contextmanager
    def get_connection(self, timeout=30):
        conn = self._acquire(timeout)
        try:
            yield conn
        finally:
            self._release(conn)

    def _acquire(self, timeout):
        deadline = time.time() + timeout
        while True:
            with self.lock:
                if self.available:
                    self.in_use += 1
                    return self.available.pop()
                if self.in_use < self.max_connections:
                    self.in_use += 1
                    return self._create_connection()
            if time.time() > deadline:
                raise TimeoutError("Could not acquire connection")
            time.sleep(0.1)

    def _release(self, conn):
        # Return the connection to the pool for reuse
        with self.lock:
            self.in_use -= 1
            self.available.append(conn)

    def _create_connection(self):
        # Placeholder: open a real database/network connection here
        return object()
```
11. Data Corruption and Inconsistency
What happens: Data becomes incorrect, inconsistent between copies, or violates expected invariants. This can happen silently, with no errors, just wrong data.
Real example: A user's account balance was stored in two places: a SQL database and a cache. A race condition occurred: the database was updated to $100, but before the cache was updated, another process read the old cache value ($50) and overwrote the database. The user lost $50. No errors. No crashes. Just silent data corruption.
Why it is dangerous:
- Data errors can go undetected for a long time
- Wrong data leads to wrong business decisions
- Fixing corrupted data is extremely difficult
- Trust in the system erodes
How to fix it:
- Use transactions appropriately. Group related operations in transactions. Use appropriate isolation levels.
- Implement optimistic concurrency. Use version numbers. When updating, check that the version has not changed since you read it.
```python
def update_balance(user_id, amount):
    while True:
        # Read with version
        user = db.query(
            "SELECT balance, version FROM users WHERE id = ?",
            user_id
        )
        new_balance = user.balance + amount
        new_version = user.version + 1

        # Update only if version unchanged
        result = db.execute(
            """UPDATE users
               SET balance = ?, version = ?
               WHERE id = ? AND version = ?""",
            new_balance, new_version, user_id, user.version
        )
        if result.rows_affected == 1:
            return new_balance  # Success
        # Version changed since we read; retry
```
- Design for eventual consistency. If you cannot have strong consistency, make sure the system converges to correct state over time.
- Validate data continuously. Run background jobs that check invariants. "Sum of all account balances should equal total deposits minus withdrawals."
- Use event sourcing or audit logs. Store every change as an event. You can always replay events to reconstruct correct state.
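A continuous validation job can be as simple as re-checking one invariant; the ledger fields here are illustrative:

```python
def ledger_invariant_holds(balances, total_deposits, total_withdrawals):
    """Invariant: the sum of all account balances equals deposits minus withdrawals."""
    return sum(balances.values()) == total_deposits - total_withdrawals

# Run this periodically in the background. Alert on failure rather than
# auto-"fixing" it: the discrepancy is the evidence you need for debugging.
```

The race-condition example above would trip this check as soon as the background job next ran, instead of going unnoticed for months.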
12. Configuration Drift and Mismatch
What happens: Different instances of the same service have different configurations. Some have the bug fix, some do not. Some point to the new database, some to the old. Behavior becomes unpredictable.
Real example: We rolled out a configuration change to enable a new feature. 3 out of 10 servers got the new config. Users randomly got different behavior depending on which server handled their request. Some could use the feature, some could not. Some users saw the feature appear and disappear as they made requests.
Why it is dangerous:
- Inconsistent behavior across requests
- Some servers work, some do not
- Extremely hard to debug (works on my machine)
- Rollbacks are partial and messy
How to fix it:
- Use centralized configuration. Store config in one place (Consul, etcd, AWS Parameter Store). All servers read from the same source.
- Version your configurations. Every config change gets a version. You know exactly what config each server is running.
- Implement atomic config updates. All servers should switch to new config at the same time, or roll back together.
- Validate config before deployment. Check that new configuration is valid and does not break invariants before rolling it out.
- Include config version in health checks. Health check should report which config version is running. Load balancers can route based on config version.
A health check response that includes config info might look like:

```json
{
  "status": "healthy",
  "config_version": "v2.3.1",
  "config_hash": "abc123",
  "loaded_at": "2024-01-15T10:30:00Z"
}
```
13. Dependency Failures
What happens: Your system depends on external services (databases, APIs, cloud services). When they fail, your system fails. Often in surprising ways.
Real example: Our service depended on an external geocoding API. The API had 99.9% availability. Pretty good, right? But we called it on every request. Our P99 latency was entirely determined by that API. When it had a bad day (99% availability), 1% of our users had failures. We had thousands of complaints.
Why it is dangerous:
- You do not control external services
- Their failures become your failures
- Their slowdowns become your slowdowns
- Contractual SLAs do not prevent outages
How to fix it:
- Implement fallbacks. If the geocoding API is down, use a cached result, a default value, or a degraded experience. Do not let one dependency take down everything.
- Use bulkheads. Isolate dependencies so one failing dependency does not consume all your resources waiting.
- Cache aggressively. Reduce dependency on external calls. Cache results where possible.
- Implement async processing. Do not block user requests on non-critical dependencies. Send email async, update analytics async.
- Have a dependency SLA budget. If you need 99.99% availability and have 5 dependencies at 99.9%, you are already over budget. Either improve dependencies or reduce coupling.
```python
# Fallback pattern example
def get_user_location(user_id):
    # Try primary: external API
    try:
        location = geocoding_api.get_location(user_id, timeout=2)
        cache.set(f"location:{user_id}", location, ttl=3600)
        return location
    except (TimeoutError, APIError):
        pass

    # Fallback 1: cached value
    cached = cache.get(f"location:{user_id}")
    if cached:
        return cached

    # Fallback 2: user-provided location from profile
    profile = db.get_user_profile(user_id)
    if profile and profile.location:
        return profile.location

    # Fallback 3: default
    return DEFAULT_LOCATION
```
14. Poison Messages and Bad Data
What happens: A malformed or unexpected message enters the system and causes repeated failures. The message is retried, fails again, retried, fails again. It blocks the entire queue.
Real example: A message queue had a message with an emoji in an unexpected field. The processing code crashed on that character. The message was retried. Crashed again. Retried. Crashed. The queue backed up. All messages behind the poison message were stuck. Processing stopped entirely.
Why it is dangerous:
- One bad message blocks all messages
- Retries do not help (message is inherently bad)
- Queue backs up rapidly
- Can go unnoticed until it is catastrophic
How to fix it:
- Implement dead letter queues. After N failures, move the message to a separate queue for manual inspection. Do not let it block other messages.
- Validate input early. Check message format before processing. Reject malformed messages immediately.
- Use schema validation. Define message schemas (JSON Schema, Protobuf, Avro). Validate against schema on receive.
- Handle errors gracefully. Do not let one field crash the entire processor. Log the error, skip the field, continue.
- Monitor dead letter queue. Alert when messages go to DLQ. Investigate promptly.
```python
# Poison message handling
def process_message(message):
    try:
        validate_schema(message)
        do_processing(message)
        return SUCCESS
    except ValidationError as e:
        # Bad message format - send to DLQ immediately
        dead_letter_queue.send(message, error=str(e))
        return REJECTED
    except ProcessingError as e:
        # Processing error - maybe retry
        if message.retry_count < MAX_RETRIES:
            return RETRY
        else:
            dead_letter_queue.send(message, error=str(e))
            return REJECTED
```
15. Deployment and Rollout Failures
What happens: A deployment introduces a bug or incompatibility. But you only find out after it is fully rolled out, affecting all users.
Real example: A new version had a bug that only appeared under load. Testing in staging (low traffic) showed no issues. We deployed to 100% of production. Under real traffic, the bug manifested. We had to roll back, but rollback also takes time. Users were affected for 20 minutes.
Why it is dangerous:
- New code deployed to everyone at once
- No way to limit blast radius
- Rollbacks take time
- Some bugs only appear at scale
How to fix it:
- Use canary deployments. Deploy to 1% of servers first. Compare metrics to baseline. If something is wrong, only 1% of users are affected.
- Implement feature flags. Separate deployment from release. Deploy code dark, then enable for small percentage of users.
- Have instant rollback. Rollback should be one click, one command. Practice it regularly.
- Use progressive rollout. 1% → 5% → 25% → 50% → 100%, with monitoring at each stage.
- Implement automatic rollback. If error rate exceeds threshold, automatically roll back without human intervention.
```yaml
# Example: Progressive rollout with automatic rollback
rollout:
  stages:
    - percentage: 1
      duration: 15m
      rollback_threshold:
        error_rate: 1%
        latency_p99: 500ms
    - percentage: 10
      duration: 30m
      rollback_threshold:
        error_rate: 0.5%
        latency_p99: 300ms
    - percentage: 50
      duration: 1h
      rollback_threshold:
        error_rate: 0.1%
        latency_p99: 200ms
    - percentage: 100
```
The Meta-Pattern: Defense in Depth
Notice a pattern? Most fixes involve:
- Detecting failures quickly (monitoring, health checks, tracing)
- Limiting blast radius (bulkheads, canaries, feature flags)
- Recovering automatically (circuit breakers, retries, fallbacks)
- Failing gracefully (degraded mode, timeouts, dead letter queues)
This is defense in depth. No single defense is perfect. Layer multiple defenses. When one fails, another catches the problem.
Summary: Your Distributed Systems Survival Kit
| Failure Mode | Quick Detection | Quick Fix |
|---|---|---|
| Network Partition | Heartbeat timeouts | Quorum-based decisions |
| Split-Brain | Leader health checks | Fencing tokens |
| Cascading Failure | Dependency metrics | Circuit breakers |
| Byzantine | Checksum validation | Redundancy + voting |
| Clock Skew | NTP monitoring | Logical clocks |
| Message Loss | Delivery confirmations | Idempotent operations |
| Thundering Herd | Cache hit rate | Cache locking + jitter |
| Hot Spots | Per-shard metrics | Key splitting |
| Retry Storm | Request rate monitoring | Exponential backoff |
| Resource Exhaustion | Resource metrics | Pools + limits |
| Data Corruption | Invariant checks | Optimistic concurrency |
| Config Drift | Config version checks | Centralized config |
| Dependency Failure | Dependency health | Fallbacks + caching |
| Poison Message | DLQ monitoring | Dead letter queues |
| Deployment Issues | Error rate spike | Canary + instant rollback |
Final Thoughts
Distributed systems will fail. That is not pessimism. It is physics. Networks are unreliable. Hardware breaks. Software has bugs. The question is not whether your system will fail, but how it will fail and how quickly you will recover.
The engineers who build reliable systems are not the ones who prevent all failures. They are the ones who:
- Expect failures and design for them
- Detect failures quickly through monitoring
- Limit blast radius when failures occur
- Recover automatically where possible
- Learn from failures through postmortems
Build your systems with these failure modes in mind. Test them with chaos engineering. Monitor them religiously. And when they fail (because they will), you will be ready.
The best distributed systems are not the ones that never fail. They are the ones that fail gracefully, recover quickly, and teach their operators something new with every incident.
Frequently Asked Questions
What is the single most important thing to implement first?
Start with monitoring and observability. You cannot fix what you cannot see. Implement distributed tracing, centralized logging, and basic metrics before worrying about complex fault tolerance patterns.
How do I know which failure modes apply to my system?
Every distributed system faces network partitions, message loss, and some form of resource exhaustion. Beyond those, it depends on your architecture. If you have a leader-based system, worry about split-brain. If you have caches, worry about thundering herd. If you have queues, worry about poison messages.
Is it worth implementing all these fixes?
No. Implement based on risk. Start with the failures most likely to occur and most damaging when they do. For most systems: timeouts, retries with backoff, circuit breakers, and monitoring cover 80% of cases.
How do I test for these failures?
Use chaos engineering. Tools like Chaos Monkey, Gremlin, or Litmus inject failures into your system. Kill processes, add network latency, fill disks. Test in production (carefully) or in production-like environments.
What about serverless and managed services?
They handle some failures for you (resource exhaustion, deployment rollouts) but introduce others (cold starts, vendor outages, rate limits). The principles still apply, but the specific failure modes change.