Availability & Reliability Fundamentals

Understanding the nines - from 99% to 99.999% and everything in between

Foundation knowledge | 35 min read

Summary

Availability is the percentage of time a system is operational and serving requests correctly. Reliability is the probability that a system will perform its intended function without failure. These related but distinct concepts are measured in "nines" (99.9%, 99.99%), and each additional nine is exponentially harder to achieve. Understanding SLAs, SLOs, error budgets, failure modes, and redundancy patterns is essential for designing systems that meet business requirements without over-engineering.

Key Takeaways

Each Nine is 10x Harder

Going from 99% to 99.9% isn't 0.9% harder—it's an order of magnitude harder. 99% allows 3.65 days downtime/year; 99.9% allows only 8.7 hours. Each additional nine requires fundamentally different architecture and operational practices.
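A quick way to see this is to compute the downtime budget each level implies. A minimal sketch in plain Python (the list of levels is illustrative):

```python
# Downtime budget per year implied by each availability level ("nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in [0.99, 0.999, 0.9999, 0.99999]:
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} -> {downtime_minutes / 60:8.2f} hours/year "
          f"({downtime_minutes:8.1f} minutes)")
```

Each added nine divides the budget by ten: roughly 87.7 hours at 99%, 8.8 hours at 99.9%, 53 minutes at 99.99%, and about 5 minutes at 99.999%.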

Availability is Multiplicative, Not Additive

If you depend on three services each with 99.9% availability, your combined availability is 0.999³ = 99.7%, not 99.9%. Long dependency chains dramatically reduce overall availability. This is why microservices often have worse availability than monoliths.
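A short sketch of the serial-dependency math (the service count and availabilities are assumed for illustration):

```python
import math

# A request path that depends on several services in series is only up
# when every dependency is up, so availabilities multiply.
dependency_availabilities = [0.999, 0.999, 0.999]

combined = math.prod(dependency_availabilities)
print(f"combined availability: {combined:.4%}")  # 99.7003%
```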

MTTR Matters More Than MTBF

Reducing Mean Time To Recovery from 1 hour to 10 minutes improves availability more than doubling Mean Time Between Failures. Focus on fast detection, diagnosis, and recovery rather than preventing all failures.
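The effect falls straight out of the availability formula. A sketch with assumed numbers (one failure a month, one hour to recover) comparing the two improvements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"baseline (MTBF 720h, MTTR 1h): {availability(720, 1):.4%}")    # 99.8613%
print(f"MTTR cut to 10 minutes:        {availability(720, 1/6):.4%}")  # 99.9769%
print(f"MTBF doubled to 1440h:         {availability(1440, 1):.4%}")   # 99.9306%
```

With these numbers, the 10-minute MTTR leaves only a sixth of the original downtime, while doubling MTBF merely halves it.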

Redundancy Has Diminishing Returns

Two servers eliminate single points of failure. Three servers provide quorum. Beyond that, operational complexity often exceeds reliability gains. Most systems don't need more than N+2 redundancy.
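Under the (strong) assumption of independent failures, N replicas where any one can serve traffic give A = 1 - (1 - a)^N. A sketch with an assumed 99% per-replica availability:

```python
# Availability of N redundant replicas, assuming failures are independent.
# In practice, correlated failures (shared network, bad deploy) break this
# assumption long before N gets large.
per_replica = 0.99

for n in range(1, 5):
    combined = 1 - (1 - per_replica) ** n
    print(f"{n} replica(s): {combined:.6%}")
```

On paper each replica adds two nines, but the independence assumption is exactly what fails in real systems, which is why the fourth replica's operational cost rarely pays for itself.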

Fail-Open vs Fail-Closed is a Business Decision

When a dependency fails, do you serve degraded results (fail-open) or reject requests (fail-closed)? There's no universal answer—it depends on whether showing stale/incomplete data is better or worse than showing nothing.
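A minimal sketch of the two policies around a flaky dependency (the service call and fallback here are hypothetical stand-ins):

```python
CACHED_FALLBACK = ["top sellers"]  # stale but safe default (hypothetical)

def fetch_recommendations(user_id: str) -> list[str]:
    # Stand-in for a call to a recommendation service; assume it can fail.
    raise TimeoutError("recommendation service unavailable")

def recommendations_fail_open(user_id: str) -> list[str]:
    # Fail-open: degrade to stale/default data and keep serving.
    try:
        return fetch_recommendations(user_id)
    except Exception:
        return CACHED_FALLBACK

def recommendations_fail_closed(user_id: str) -> list[str]:
    # Fail-closed: let the error propagate; serve nothing rather than
    # risk serving wrong or stale data (e.g. an authorization check).
    return fetch_recommendations(user_id)
```

A recommendations widget usually fails open; an authorization or payments check almost always fails closed.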

Planned Downtime Still Counts

Maintenance windows, deployments, and migrations all reduce availability. A system with a 99.99% uptime target can't take 4 hours of planned downtime per year: 99.99% allows only about 52.6 minutes of total downtime annually. Factor operational needs into your availability targets.

Deep Dive

Availability measures the proportion of time a system is operational and accessible.

Availability = Uptime / (Uptime + Downtime)
            = MTBF / (MTBF + MTTR)

Where:
- MTBF: Mean Time Between Failures
- MTTR: Mean Time To Recovery
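For example, with an assumed MTBF of 500 hours and an MTTR of 30 minutes:

Availability = 500 / (500 + 0.5) ≈ 99.9%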

Availability vs. Reliability:
- Availability: Is the system up right now?
- Reliability: Will the system keep working correctly over time?

A system can be highly available (rarely down) but unreliable (frequently returns wrong results), or reliable (always correct when up) but not highly available (frequent outages).

Example: A database with 99.99% availability but occasional data corruption is available but unreliable. A database that's down for maintenance weekly but never corrupts data is reliable but not highly available.

Trade-offs

| Aspect | Advantage | Disadvantage |
| --- | --- | --- |
| More Nines | Higher availability, better user experience, competitive advantage | Exponentially more expensive, diminishing returns, increased complexity |
| Fail-Open Strategy | Better user experience during partial failures, continued service availability | May serve incorrect/stale data, harder to detect silent failures, potential for inconsistency |
| Active-Active Redundancy | No wasted resources, instant failover, better resource utilization | Complex state synchronization, potential for split-brain, harder to reason about |
| Aggressive Health Checks | Fast failure detection, quick removal of unhealthy instances | False positives during transient issues, thundering herd on recovery, may cause cascading failures |
