Availability & Reliability Fundamentals

Understanding the nines - from 99% to 99.999% and everything in between

Foundation knowledge | 35 min read

Summary

Availability is the percentage of time a system is operational and serving requests correctly. Reliability is the probability that a system will perform its intended function without failure. These related but distinct concepts are measured in "nines" (99.9%, 99.99%), and each additional nine is exponentially harder to achieve. Understanding SLAs, SLOs, error budgets, failure modes, and redundancy patterns is essential for designing systems that meet business requirements without over-engineering.

Key Takeaways

Each Nine is 10x Harder

Going from 99% to 99.9% isn't 0.9% harder—it's an order of magnitude harder. 99% allows 3.65 days downtime/year; 99.9% allows only 8.7 hours. Each additional nine requires fundamentally different architecture and operational practices.
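A quick way to see this is to compute the downtime budget each level implies. A minimal sketch in plain Python (the list of levels is illustrative):

```python
# Downtime budget per year implied by each availability level ("nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in [0.99, 0.999, 0.9999, 0.99999]:
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} -> {downtime_minutes / 60:8.2f} hours/year "
          f"({downtime_minutes:8.1f} minutes)")
```

Each added nine divides the budget by ten: roughly 87.7 hours at 99%, 8.8 hours at 99.9%, 53 minutes at 99.99%, and about 5 minutes at 99.999%.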

Availability is Multiplicative, Not Additive

If you depend on three services each with 99.9% availability, your combined availability is 0.999³ = 99.7%, not 99.9%. Long dependency chains dramatically reduce overall availability. This is why microservices often have worse availability than monoliths.
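A short sketch of the serial-dependency math (the service count and availabilities are assumed for illustration):

```python
import math

# A request path that depends on several services in series is only up
# when every dependency is up, so availabilities multiply.
dependency_availabilities = [0.999, 0.999, 0.999]

combined = math.prod(dependency_availabilities)
print(f"combined availability: {combined:.4%}")  # 99.7003%
```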

MTTR Matters More Than MTBF

Reducing Mean Time To Recovery from 1 hour to 10 minutes improves availability more than doubling Mean Time Between Failures. Focus on fast detection, diagnosis, and recovery rather than preventing all failures.
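The effect falls straight out of the availability formula. A sketch with assumed numbers (one failure a month, one hour to recover) comparing the two improvements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"baseline (MTBF 720h, MTTR 1h): {availability(720, 1):.4%}")    # 99.8613%
print(f"MTTR cut to 10 minutes:        {availability(720, 1/6):.4%}")  # 99.9769%
print(f"MTBF doubled to 1440h:         {availability(1440, 1):.4%}")   # 99.9306%
```

With these numbers, the 10-minute MTTR leaves only a sixth of the original downtime, while doubling MTBF merely halves it.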

Redundancy Has Diminishing Returns

Two servers eliminate single points of failure. Three servers provide quorum. Beyond that, operational complexity often exceeds reliability gains. Most systems don't need more than N+2 redundancy.
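Under the (strong) assumption of independent failures, N replicas where any one can serve traffic give A = 1 - (1 - a)^N. A sketch with an assumed 99% per-replica availability:

```python
# Availability of N redundant replicas, assuming failures are independent.
# In practice, correlated failures (shared network, bad deploy) break this
# assumption long before N gets large.
per_replica = 0.99

for n in range(1, 5):
    combined = 1 - (1 - per_replica) ** n
    print(f"{n} replica(s): {combined:.6%}")
```

On paper each replica adds two nines, but the independence assumption is exactly what fails in real systems, which is why the fourth replica's operational cost rarely pays for itself.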

Fail-Open vs Fail-Closed is a Business Decision

When a dependency fails, do you serve degraded results (fail-open) or reject requests (fail-closed)? There's no universal answer—it depends on whether showing stale/incomplete data is better or worse than showing nothing.
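A minimal sketch of the two policies around a flaky dependency (the service call and fallback here are hypothetical stand-ins):

```python
CACHED_FALLBACK = ["top sellers"]  # stale but safe default (hypothetical)

def fetch_recommendations(user_id: str) -> list[str]:
    # Stand-in for a call to a recommendation service; assume it can fail.
    raise TimeoutError("recommendation service unavailable")

def recommendations_fail_open(user_id: str) -> list[str]:
    # Fail-open: degrade to stale/default data and keep serving.
    try:
        return fetch_recommendations(user_id)
    except Exception:
        return CACHED_FALLBACK

def recommendations_fail_closed(user_id: str) -> list[str]:
    # Fail-closed: let the error propagate; serve nothing rather than
    # risk serving wrong or stale data (e.g. an authorization check).
    return fetch_recommendations(user_id)
```

A recommendations widget usually fails open; an authorization or payments check almost always fails closed.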

Planned Downtime Still Counts

Maintenance windows, deployments, and migrations all reduce availability. A system with a 99.99% uptime target can't take 4 hours of planned downtime per year: 99.99% allows only about 52.6 minutes of total downtime annually. Factor operational needs into your availability targets.

Deep Dive

Availability measures the proportion of time a system is operational and accessible.

Availability = Uptime / (Uptime + Downtime)
            = MTBF / (MTBF + MTTR)

Where:
- MTBF: Mean Time Between Failures
- MTTR: Mean Time To Recovery
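For example, with an assumed MTBF of 500 hours and an MTTR of 30 minutes:

Availability = 500 / (500 + 0.5) ≈ 99.9%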

Availability vs. Reliability:
- Availability: Is the system up right now?
- Reliability: Will the system keep working correctly over time?

A system can be highly available (rarely down) but unreliable (frequently returns wrong results), or reliable (always correct when up) but not highly available (frequent outages).

Example: A database with 99.99% availability but occasional data corruption is available but unreliable. A database that's down for maintenance weekly but never corrupts data is reliable but not highly available.

Trade-offs

| Aspect | Advantage | Disadvantage |
| --- | --- | --- |
| More Nines | Higher availability, better user experience, competitive advantage | Exponentially more expensive, diminishing returns, increased complexity |
| Fail-Open Strategy | Better user experience during partial failures, continued service availability | May serve incorrect/stale data, harder to detect silent failures, potential for inconsistency |
| Active-Active Redundancy | No wasted resources, instant failover, better resource utilization | Complex state synchronization, potential for split-brain, harder to reason about |
| Aggressive Health Checks | Fast failure detection, quick removal of unhealthy instances | False positives during transient issues, thundering herd on recovery, may cause cascading failures |
