Handling transient failures with exponential backoff
Retry with exponential backoff is a fundamental resilience pattern for handling transient failures in distributed systems. When a request fails, instead of immediately retrying (which can overwhelm an already-struggling service), the client waits progressively longer between attempts. Combined with jitter (random delay variation), this prevents the "thundering herd" problem where many clients retry simultaneously. This pattern is essential for production systems dealing with network hiccups, rate limiting, and service overload; Amazon, Google, and Microsoft all recommend exponential backoff in their API client guidelines.
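As a rough sketch of what the pattern looks like in code (the `TransientError` type, the `operation` callable, and the parameter defaults below are illustrative placeholders, not from any particular library):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever retriable error type your client raises."""

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call `operation`, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                    # out of attempts: surface the failure
            delay = min(base_delay * (2 ** attempt), max_delay)  # 1s, 2s, 4s, 8s, ... capped
            delay *= random.uniform(0.5, 1.5)            # +/-50% jitter to desynchronize clients
            time.sleep(delay)
```

Capping the delay (`max_delay`) keeps a long outage from pushing waits into minutes, and re-raising on the last attempt lets callers decide what to do with a request that genuinely cannot succeed right now.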
The pattern only works for transient failures (network blips, temporary overload, rate limits). Permanent failures (404, invalid input) should fail fast without retry. Smart clients distinguish between retriable errors (5xx, timeouts, and 429 rate limiting) and non-retriable errors (other 4xx).
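A minimal classification helper might look like the following; the `is_retriable` name and the exact status-code set are assumptions for illustration, chosen to match the rule of thumb above (429 being the one 4xx that is transient by nature):

```python
# Retriable: rate limiting plus the common server-side / gateway errors.
RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def is_retriable(status_code=None, timed_out=False):
    if timed_out:
        return True                         # timeouts are usually transient
    return status_code in RETRIABLE_STATUS  # everything else (400, 404, ...) fails fast
```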
Linear backoff (1s, 2s, 3s) is too aggressive: the gap between attempts barely grows, so retry pressure on the failing service stays high. Exponential backoff (1s, 2s, 4s, 8s) gives failing services breathing room to recover. The exponential curve means most retry pressure dissipates quickly.
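To make the difference concrete, here is the arithmetic behind the two schedules mentioned above, assuming a 1-second base delay:

```python
base = 1.0
linear      = [base * (n + 1) for n in range(5)]  # 1, 2, 3, 4, 5 seconds
exponential = [base * 2 ** n for n in range(5)]   # 1, 2, 4, 8, 16 seconds
# After five attempts the linear client has only backed off 15s in total,
# while the exponential client has backed off 31s, with most of that wait
# concentrated in the later attempts, when the service needs it most.
```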
Without jitter, all clients retry at the same intervals, creating synchronized traffic spikes. Random jitter (±50%) spreads retries across time, preventing simultaneous storms that crash recovering services.
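A quick illustration of the effect, using an assumed `jittered` helper and the ±50% spread from the text:

```python
import random

def jittered(delay, spread=0.5):
    """Apply +/-50% random jitter to a computed backoff delay."""
    return delay * random.uniform(1 - spread, 1 + spread)

# Without jitter, every client that failed together retries at exactly 4s.
# With jitter, the same clients land anywhere between 2s and 6s:
print(sorted(round(jittered(4.0), 2) for _ in range(5)))
```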
In distributed systems, temporary failures are not exceptions; they are the norm. Network packets get lost, services temporarily overload, rate limits trigger, and downstream dependencies hiccup. The question is not whether requests will fail, but when they will and how to handle the failures.
Why failures happen:
- Network instability: Packet loss, router congestion, DNS lookup failures
- Service overload: CPU spikes, memory pressure, connection pool exhaustion
- Rate limiting: APIs protecting themselves from excessive load
- Cold starts: Serverless functions warming up, cache misses cascading
- Deployment rollouts: Brief unavailability during code deploys
Immediately retrying failed requests floods an already-struggling service with more load, making the problem worse. This creates positive feedback loops where retries cause more failures, triggering more retries, until the entire system collapses.
Real-world example: In 2013, Google had a brief outage that lasted only 2-3 minutes. But because millions of clients immediately retried, the flood of retry traffic kept services down for an additional hour. Proper backoff would have allowed recovery in minutes.