Handling transient failures with exponential backoff
Retry with exponential backoff is a fundamental resilience pattern for handling transient failures in distributed systems. When a request fails, instead of immediately retrying (which can overwhelm an already-struggling service), the client waits progressively longer between attempts. Combined with jitter (random delay variation), this prevents the "thundering herd" problem where many clients retry simultaneously. This pattern is essential for production systems dealing with network hiccups, rate limiting, and service overload; Amazon, Google, and Microsoft all recommend exponential backoff in their API client guidelines.
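As a rough sketch of what the pattern looks like in code (the `TransientError` type, the `operation` callable, and the parameter defaults below are illustrative placeholders, not from any particular library):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever retriable error type your client raises."""

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call `operation`, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                    # out of attempts: surface the failure
            delay = min(base_delay * (2 ** attempt), max_delay)  # 1s, 2s, 4s, 8s, ... capped
            delay *= random.uniform(0.5, 1.5)            # +/-50% jitter to desynchronize clients
            time.sleep(delay)
```

Capping the delay (`max_delay`) keeps a long outage from pushing waits into minutes, and re-raising on the last attempt lets callers decide what to do with a request that genuinely cannot succeed right now.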
The pattern only works for transient failures (network blips, temporary overload, rate limits). Permanent failures (404, invalid input) should fail fast without retry. Smart clients distinguish between retriable errors (5xx, timeouts, and 429 rate limiting) and non-retriable errors (other 4xx).
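A minimal classification helper might look like the following; the `is_retriable` name and the exact status-code set are assumptions for illustration, chosen to match the rule of thumb above (429 being the one 4xx that is transient by nature):

```python
# Retriable: rate limiting plus the common server-side / gateway errors.
RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def is_retriable(status_code=None, timed_out=False):
    if timed_out:
        return True                         # timeouts are usually transient
    return status_code in RETRIABLE_STATUS  # everything else (400, 404, ...) fails fast
```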
Linear backoff (1s, 2s, 3s) is too aggressive: the gap between attempts barely grows, so retry pressure on the failing service stays high. Exponential backoff (1s, 2s, 4s, 8s) gives failing services breathing room to recover. The exponential curve means most retry pressure dissipates quickly.
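To make the difference concrete, here is the arithmetic behind the two schedules mentioned above, assuming a 1-second base delay:

```python
base = 1.0
linear      = [base * (n + 1) for n in range(5)]  # 1, 2, 3, 4, 5 seconds
exponential = [base * 2 ** n for n in range(5)]   # 1, 2, 4, 8, 16 seconds
# After five attempts the linear client has only backed off 15s in total,
# while the exponential client has backed off 31s, with most of that wait
# concentrated in the later attempts, when the service needs it most.
```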
Without jitter, all clients retry at the same intervals, creating synchronized traffic spikes. Random jitter (±50%) spreads retries across time, preventing simultaneous storms that crash recovering services.
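A quick illustration of the effect, using an assumed `jittered` helper and the ±50% spread from the text:

```python
import random

def jittered(delay, spread=0.5):
    """Apply +/-50% random jitter to a computed backoff delay."""
    return delay * random.uniform(1 - spread, 1 + spread)

# Without jitter, every client that failed together retries at exactly 4s.
# With jitter, the same clients land anywhere between 2s and 6s:
print(sorted(round(jittered(4.0), 2) for _ in range(5)))
```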
In distributed systems, temporary failures are not exceptions; they are the norm. Network packets get lost, services temporarily overload, rate limits trigger, and downstream dependencies hiccup. The question is not whether requests will fail, but when they will and how to handle the failures.
Why failures happen:
- Network instability: Packet loss, router congestion, DNS lookup failures
- Service overload: CPU spikes, memory pressure, connection pool exhaustion
- Rate limiting: APIs protecting themselves from excessive load
- Cold starts: Serverless functions warming up, cache misses cascading
- Deployment rollouts: Brief unavailability during code deploys
Immediately retrying failed requests floods an already-struggling service with more load, making the problem worse. This creates positive feedback loops where retries cause more failures, triggering more retries, until the entire system collapses.
Real-world example: In 2013, Google had a brief outage that lasted only 2-3 minutes. But because millions of clients immediately retried, the flood of retry traffic kept services down for an additional hour. Proper backoff would have allowed recovery in minutes.