System Design Pattern · Reliability · intermediate

Tags: bulkhead, isolation, fault-isolation, resource-pools, compartmentalization

Bulkhead Pattern

Isolating failures to prevent system-wide impact

Used in: Thread Pools, Connection Pools, Service Isolation | 20 min read

Summary

The Bulkhead pattern isolates failures to specific components, preventing cascade failures that take down entire systems. Named after ship compartments that contain flooding, bulkheads partition resources (thread pools, connection pools, circuit breakers) so that failure in one area doesn't consume resources needed by others. When one service degrades, bulkheads ensure other services continue functioning. This is critical for multi-tenant systems (isolating tenant failures), microservices (limiting blast radius), and any system with multiple critical paths (isolating authentication from data processing).

Key Takeaways

Compartmentalization Prevents Cascade Failures

Like ship compartments that prevent one leak from sinking the entire ship, bulkheads isolate failures. If Service A fails and exhausts its thread pool, Services B and C continue using their separate pools.

Resource Pools as Bulkheads

Thread pools, connection pools, and rate limiters act as bulkheads. Instead of one shared pool (50 threads for all requests), partition into bulkheads (15 for critical API, 15 for reporting, 20 for background jobs).
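As a concrete sketch of the 50-thread split above, here is how partitioned pools might look in Python using `concurrent.futures`. The pool names and the `submit` helper are illustrative, not a prescribed API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizing from the example above: instead of one shared
# 50-thread pool, each workload class gets its own bounded executor.
BULKHEADS = {
    "critical_api": ThreadPoolExecutor(max_workers=15),
    "reporting": ThreadPoolExecutor(max_workers=15),
    "background_jobs": ThreadPoolExecutor(max_workers=20),
}

def submit(workload: str, fn, *args):
    """Route work to its bulkhead; each executor has its own thread
    pool and its own queue, so a backlog in one class cannot consume
    the threads of another."""
    return BULKHEADS[workload].submit(fn, *args)

result = submit("critical_api", lambda x: x * 2, 21).result()
```

With this layout, a flood of reporting jobs queues only behind the 15 reporting threads; the critical API's 15 threads remain free.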

Multi-Tenancy Requires Bulkheads

In SaaS systems, one misbehaving tenant can consume all resources. Bulkheads isolate tenants: Tenant A's connection pool exhaustion doesn't impact Tenant B. This isolation is essential for fair resource allocation.
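A minimal sketch of per-tenant isolation, using bounded semaphores as the bulkheads. The tenant IDs and limits here are made up for illustration:

```python
import threading

# Each tenant gets its own bounded pool of request slots (limits
# are illustrative).
TENANT_LIMITS = {"tenant_a": 10, "tenant_b": 10}
_slots = {t: threading.BoundedSemaphore(n) for t, n in TENANT_LIMITS.items()}

def try_acquire(tenant: str) -> bool:
    """Non-blocking: returns False when the tenant's bulkhead is full."""
    return _slots[tenant].acquire(blocking=False)

def release(tenant: str) -> None:
    _slots[tenant].release()

# Tenant A exhausting its slots does not touch tenant B's capacity.
for _ in range(10):
    assert try_acquire("tenant_a")
a_rejected = not try_acquire("tenant_a")   # tenant A is now throttled
b_allowed = try_acquire("tenant_b")        # tenant B is unaffected
```

Rejecting the noisy tenant immediately (rather than queueing) keeps the blast radius inside that tenant's own partition.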

Sizing Bulkheads is Critical

Too small, and you create artificial bottlenecks while capacity elsewhere sits idle. Too large, and the isolation is meaningless. Size bulkheads based on traffic patterns, criticality, and failure blast radius.

Bulkheads Create Graceful Degradation

When one bulkhead fills up, only that functionality degrades. User-facing API continues working while background analytics fail. System prioritizes critical paths over nice-to-have features.

Monitoring Each Bulkhead Separately

System-wide metrics hide problems. If the total thread pool is 60% utilized but the critical API bulkhead is 100% full, you have an incident. Monitor saturation per bulkhead, not in aggregate.
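The arithmetic makes the point. Assuming the three-pool split used earlier, the aggregate number looks healthy while one bulkhead is saturated (the utilization figures are illustrative):

```python
# Illustrative snapshot: aggregate utilization looks fine, but the
# critical bulkhead is completely full.
pools = {
    "critical_api":    {"in_use": 15, "size": 15},  # 100% -> incident
    "reporting":       {"in_use": 10, "size": 15},
    "background_jobs": {"in_use": 5,  "size": 20},
}

aggregate = (sum(p["in_use"] for p in pools.values())
             / sum(p["size"] for p in pools.values()))   # 30/50 = 60%
saturated = [name for name, p in pools.items()
             if p["in_use"] / p["size"] >= 1.0]
```

An alert on `aggregate` never fires here; an alert on per-bulkhead saturation flags `critical_api` immediately.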

Pattern Details

In distributed systems, components share resources: thread pools, connection pools, memory, CPU time, network bandwidth. When one component fails or slows down, it can consume all shared resources, causing unrelated components to fail. This creates cascade failures where a single failing service takes down the entire system.

Real-world cascade failure example:

Imagine an e-commerce API with three endpoints:

  * `/checkout` - critical revenue path
  * `/search` - important but not urgent
  * `/analytics` - nice to have, low priority

All three share a single thread pool of 100 threads. The analytics endpoint starts timing out due to a database query problem. Each analytics request holds a thread for 30 seconds before timing out. Within 5 minutes:

  1. All 100 threads are stuck on analytics requests
  2. Checkout requests arrive but no threads available
  3. Revenue-critical checkout is down because analytics failed
  4. Search also fails for the same reason

One low-priority component took down the entire system.

Cascade Failure: Shared Thread Pool
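The failure mode can be simulated with semaphores standing in for thread pools. The numbers mirror the scenario above, and the stuck analytics requests are modeled, as a simplification, as slots that are acquired and never released:

```python
import threading

def run(pool_for):
    """Simulate 100 stuck analytics requests followed by 5 checkouts."""
    served = {"checkout": 0, "rejected": 0}
    for _ in range(100):                      # analytics grabs slots first
        pool_for("analytics").acquire(blocking=False)
    for _ in range(5):                        # then checkout traffic arrives
        if pool_for("checkout").acquire(blocking=False):
            served["checkout"] += 1
        else:
            served["rejected"] += 1
    return served

# One shared 100-slot pool: analytics starves checkout completely.
shared = threading.Semaphore(100)
shared_result = run(lambda _: shared)

# Partitioned pools: analytics can only exhaust its own 80 slots.
pools = {"analytics": threading.Semaphore(80),
         "checkout": threading.Semaphore(20)}
bulkhead_result = run(lambda name: pools[name])
```

In the shared-pool run every checkout is rejected; with bulkheads, all five checkouts succeed while only analytics degrades.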

The Noisy Neighbor Problem

In multi-tenant SaaS platforms, one tenant sending excessive traffic (a noisy neighbor) can exhaust shared resources, degrading service for every other tenant. This is unacceptable: Tenant A should never impact Tenant B's experience.

Why this happens:

  1. Shared resource pools: All components use same threads/connections
  2. No isolation: Slow component blocks fast components
  3. Unbounded queueing: Requests pile up waiting for resources
  4. No prioritization: Critical paths treated same as background jobs
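As a counterpoint to causes 1-3 above, here is a sketch of bounded, per-class queues that reject immediately instead of piling up. The class names and limits are illustrative:

```python
import queue

# Per-class bounded queues: a backlog in one class stays in that class,
# and a full queue rejects instantly instead of holding a thread.
queues = {
    "critical": queue.Queue(maxsize=50),
    "background": queue.Queue(maxsize=10),   # smaller bound, lower priority
}

def enqueue(klass: str, request) -> bool:
    """Bounded, non-blocking enqueue: returns False when the class's
    queue is full (fail fast rather than queue unboundedly)."""
    try:
        queues[klass].put_nowait(request)
        return True
    except queue.Full:
        return False

# A background flood fills only the background queue.
flood = [enqueue("background", i) for i in range(15)]
critical_ok = enqueue("critical", "checkout#1")
```

The background flood is capped at 10 queued items (the last 5 are rejected), and critical traffic still enqueues normally.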

Trade-offs

Aspect | Advantage | Disadvantage
