Isolating failures to prevent system-wide impact
The Bulkhead pattern confines a failure to the component where it occurs, preventing cascade failures that take down entire systems. Named after the ship compartments that contain flooding, bulkheads partition resources (thread pools, connection pools, per-dependency circuit breakers) so that a failure in one area cannot consume resources needed by others. When one service degrades, bulkheads ensure that other services continue functioning. This is critical for multi-tenant systems (isolating tenant failures), microservices (limiting blast radius), and any system with multiple critical paths (for example, isolating authentication from data processing).
Like ship compartments that prevent one leak from sinking the entire ship, bulkheads isolate failures. If Service A fails and exhausts its thread pool, Services B and C continue using their separate pools.
Thread pools, connection pools, and rate limiters act as bulkheads. Instead of one shared pool (50 threads for all requests), partition the capacity into separate bulkheads (15 threads for the critical API, 15 for reporting, 20 for background jobs).
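The partitioning above can be sketched with one thread pool per workload. This is a minimal illustration, not a production implementation: the pool names, `submit`, `slow_report`, and `handle_checkout` are hypothetical, and the sizes are simply the 15/15/20 split from the example.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical bulkheads using the 15 + 15 + 20 = 50 thread split described above.
BULKHEADS = {
    "critical_api": ThreadPoolExecutor(max_workers=15, thread_name_prefix="critical"),
    "reporting": ThreadPoolExecutor(max_workers=15, thread_name_prefix="reporting"),
    "background": ThreadPoolExecutor(max_workers=20, thread_name_prefix="background"),
}

def submit(bulkhead, fn, *args):
    """Run fn on the pool that belongs to the named bulkhead."""
    return BULKHEADS[bulkhead].submit(fn, *args)

def slow_report(release):
    # Simulates a report blocked on a slow dependency until `release` is set.
    release.wait()
    return "report done"

def handle_checkout(order_id):
    return f"order {order_id} accepted"

release = threading.Event()

# Saturate the reporting bulkhead: 15 tasks occupy its threads, 5 more wait in its queue.
stuck = [submit("reporting", slow_report, release) for _ in range(20)]

# The critical pool has its own threads, so this completes even though reporting is stuck.
result = submit("critical_api", handle_checkout, 42).result(timeout=1)
print(result)

# Unblock the simulated reports and shut the pools down cleanly.
release.set()
for future in stuck:
    future.result()
for pool in BULKHEADS.values():
    pool.shutdown()
```

Note that `ThreadPoolExecutor` queues excess work without limit, so a real bulkhead would probably also bound the queue or reject work when a pool is saturated.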
In SaaS systems, one misbehaving tenant can consume all resources. Bulkheads isolate tenants: Tenant A's connection pool exhaustion doesn't impact Tenant B. This isolation is essential for fair resource allocation.
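One way to sketch tenant isolation is a per-tenant concurrency cap built on semaphores. The `TenantBulkhead` class, its method names, and the limit of 2 are assumptions made for illustration; a real system might partition connection pools or queues instead.

```python
import threading

class TenantBulkhead:
    """Caps concurrent in-flight requests per tenant (hypothetical helper)."""

    def __init__(self, max_concurrent_per_tenant):
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, tenant_id):
        # Lazily create one semaphore per tenant; the lock guards the dict.
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[tenant_id]

    def try_acquire(self, tenant_id):
        # Non-blocking: returns False immediately when the tenant is at its limit.
        return self._semaphore_for(tenant_id).acquire(blocking=False)

    def release(self, tenant_id):
        self._semaphore_for(tenant_id).release()

bulkhead = TenantBulkhead(max_concurrent_per_tenant=2)

# Tenant A tries to hold three slots; the third attempt is rejected.
tenant_a = [bulkhead.try_acquire("tenant-a") for _ in range(3)]
# Tenant B has its own compartment and is unaffected.
tenant_b = bulkhead.try_acquire("tenant-b")

print(tenant_a)  # [True, True, False]
print(tenant_b)  # True

# Once Tenant A finishes a request, it can acquire a slot again.
bulkhead.release("tenant-a")
after_release = bulkhead.try_acquire("tenant-a")
```

Rejecting immediately rather than blocking keeps an overloaded tenant from tying up the caller's threads while it waits.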
In distributed systems, components share resources: thread pools, connection pools, memory, CPU time, network bandwidth. When one component fails or slows down, it can consume all shared resources, causing unrelated components to fail. This creates cascade failures where a single failing service takes down the entire system.
Real-world cascade failure example:
Imagine an e-commerce API with three endpoints:

- `/checkout` - Critical revenue path
- `/search` - Important but not urgent
- `/analytics` - Nice to have, low priority
All three share a single thread pool of 100 threads. The analytics endpoint starts timing out due to a database query problem. Each analytics request holds a thread for 30 seconds before timing out. Within 5 minutes:
One low-priority component took down the entire system.
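To get a feel for how quickly a shared pool can fill, here is a back-of-the-envelope sketch. The pool size (100) and hold time (30 seconds) come from the scenario above; the arrival rates are assumptions, since the scenario does not state one, and the function `seconds_until_exhausted` is a hypothetical helper. It also assumes stuck requests arrive at a steady rate into an initially idle pool.

```python
def seconds_until_exhausted(pool_size, hold_seconds, stuck_requests_per_second):
    """Return seconds until every thread is held, or None if the pool never fills.

    Assumes a steady arrival rate of stuck requests into an initially idle pool.
    """
    # Steady-state threads in use = arrival rate * hold time (Little's law).
    steady_state_in_use = stuck_requests_per_second * hold_seconds
    if steady_state_in_use < pool_size:
        return None
    # When the pool does fill, it fills before the first stuck request times out,
    # so no thread is released in the meantime and occupancy grows linearly.
    return pool_size / stuck_requests_per_second

# Assumed rate: 5 stuck analytics requests per second.
fast = seconds_until_exhausted(pool_size=100, hold_seconds=30, stuck_requests_per_second=5)
# Assumed rate: 2 per second, which settles at 60 busy threads and never fills the pool alone.
slow = seconds_until_exhausted(pool_size=100, hold_seconds=30, stuck_requests_per_second=2)

print(fast)  # 20.0
print(slow)  # None
```

Under these assumptions, roughly 3.4 stuck requests per second is enough to hold all 100 threads, and any threads already busy with `/checkout` or `/search` traffic lower that threshold further.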
In multi-tenant SaaS platforms, one tenant sending excessive traffic (a "noisy neighbor") can exhaust shared resources, degrading service for all other tenants. This is unacceptable: Tenant A should never impact Tenant B's experience.
Why this happens: