CAP Theorem Explained: What Every Engineer Should Know
A clear explanation of the CAP theorem with real-world examples. Learn how to apply CAP theorem thinking in system design interviews.
Ready to Master System Design Interviews?
Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.
Complete Solutions
Architecture diagrams & trade-off analysis
Real Interview Problems
From actual FAANG interviews
7-day money-back guarantee • Lifetime access • New problems added quarterly
"We need strong consistency for this feature."
"But that will hurt availability during network partitions."
"Can't we have both?"
This conversation happens in engineering teams constantly. The CAP theorem explains why you can't have everything, and more importantly, how to make the right trade-offs for your system.
The CAP theorem is one of the most cited concepts in distributed systems, yet it's also one of the most misunderstood. In this guide, I'll explain what it actually means, clear up common misconceptions, and show you how to apply it in system design interviews.
What is the CAP Theorem?
The CAP theorem, proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, states:
A distributed data store cannot simultaneously provide more than two of these three guarantees:
- Consistency: Every read receives the most recent write
- Availability: Every request receives a response (not an error)
- Partition Tolerance: The system continues to operate despite network failures
The theorem is often stated as "pick two out of three," but this is misleading. Let me explain why.
The Three Properties Explained
Consistency (C)
Definition: Every read receives the most recent write, or an error.
If you write a value and then immediately read it from any node, you get the value you just wrote. All nodes see the same data at the same time.
Example:
Time 0: Write X=5 to Node A
Time 1: Read X from Node B → Returns 5 (not an old value)
This is called strong consistency or linearizability. It's what you'd expect from a single-server database.
What consistency is NOT:
- It's not about data being "correct" or "valid"
- It's not about transactions (that's ACID)
- It's specifically about read-after-write visibility across nodes
Availability (A)
Definition: Every request receives a (non-error) response, without guarantee that it contains the most recent write.
If you send a request to any working node, you get a response. The system doesn't hang, time out, or return an error due to unavailable data.
Example:
Request: Read X from Node B
Response: X=3 (an older value, but still a valid response)
What availability is NOT:
- It doesn't mean 100% uptime
- It doesn't mean low latency
- It means: working nodes must respond to requests
Partition Tolerance (P)
Definition: The system continues to operate despite arbitrary message loss or failure of part of the system.
A network partition occurs when nodes can't communicate with each other. This happens due to:
- Network cable being cut
- Switch failure
- Data center connectivity issues
- Network congestion causing timeouts
Example:
Normal state: [Node A] ←→ [Node B] ←→ [Node C]
Partition: [Node A] ←→ [Node B] ✗ [Node C]
(Node B and Node C cannot reach each other)
Partition tolerance means the system doesn't completely fail when this happens.
Why You Can't Have All Three
Here's the key insight: network partitions are unavoidable in distributed systems. Networks fail. This isn't a theoretical concern; it happens regularly in production.
So the real choice is: when a partition occurs, do you sacrifice Consistency or Availability?
The Proof (Simplified)
Imagine two nodes, A and B, and a network partition occurs between them.
Scenario: A client writes X=5 to Node A. Another client tries to read X from Node B.
Option 1: Prioritize Consistency (CP)
Node B can't verify the latest value with Node A (they're partitioned). So Node B either:
- Waits indefinitely (sacrificing availability)
- Returns an error (sacrificing availability)
Option 2: Prioritize Availability (AP)
Node B returns its local value (maybe X=3, the old value). The client gets a response (available), but it's stale (inconsistent).
You can't do both. If you want both nodes to respond AND return the same value, you need them to communicate. During a partition, they can't.
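The two options above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not a real replication protocol: `Node`, `Unavailable`, `make_pair`, and the `mode` flag are hypothetical names invented for this example.

```python
class Unavailable(Exception):
    """Raised when a CP node refuses to answer during a partition."""


class Node:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "CP" or "AP" (hypothetical flag)
        self.value = None         # this node's local copy of X
        self.peer = None          # the other replica
        self.partitioned = False  # True when the peer is unreachable

    def write(self, value):
        # Accept the write locally and replicate only if the peer is reachable.
        # (A real CP system would also restrict which side may accept writes.)
        self.value = value
        if not self.partitioned:
            self.peer.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm the latest value with the peer: refuse to answer.
            raise Unavailable(f"{self.name} cannot reach its peer")
        # AP (or no partition): answer from the local copy, which may be stale.
        return self.value


def make_pair(mode):
    """Build nodes A and B, replicate X=3, partition them, then write X=5 to A."""
    a, b = Node("A", mode), Node("B", mode)
    a.peer, b.peer = b, a
    a.write(3)                            # replicated before the partition
    a.partitioned = b.partitioned = True  # the network splits
    a.write(5)                            # reaches Node A only
    return a, b
```

With `make_pair("AP")`, reading from Node B returns the stale value 3. With `make_pair("CP")`, the same read raises `Unavailable`. Neither mode can return 5 from Node B while the partition lasts.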
The Real Choice: CP vs AP
Since partitions are inevitable, every distributed system must choose:
| Choice | Behavior During Partition | Example Systems |
|---|---|---|
| CP | Sacrifices availability for consistency | MongoDB, HBase |
| AP | Sacrifices consistency for availability | Cassandra, DynamoDB, CouchDB |
CP Systems: Consistency over Availability
When a partition occurs, CP systems:
- Refuse to serve requests if they can't guarantee consistency
- Return errors rather than stale data
- May become unavailable to some or all users
When to choose CP:
- Financial transactions (bank transfers)
- Inventory systems (can't sell same item twice)
- Leader election (only one leader allowed)
- Anything where stale data causes real problems
Example: Banking
If Node A and Node B can't communicate, and both receive withdrawal requests for the same account, a CP system would:
- Only allow withdrawals from one node
- Reject requests to the other (unavailable)
- Prevent overdrafts (consistent)
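A minimal sketch of that behavior, assuming three replicas rather than two so that exactly one side of any split can hold a majority. The function and exception names are hypothetical, and a real system would also need to replicate the new balance before acknowledging.

```python
class NoQuorum(Exception):
    """This node is on the minority side of a partition."""


class InsufficientFunds(Exception):
    """The withdrawal would overdraw the account."""


def withdraw(balance, amount, reachable_replicas, total_replicas):
    """Return the new balance, or raise if the withdrawal cannot be made safely.

    reachable_replicas counts this node plus every replica it can contact.
    """
    majority = total_replicas // 2 + 1
    if reachable_replicas < majority:
        # CP choice: reject the request rather than risk a conflicting update.
        raise NoQuorum("cannot reach a majority of replicas")
    if amount > balance:
        raise InsufficientFunds("withdrawal exceeds balance")
    return balance - amount
```

A node that can see two of three replicas serves the withdrawal, while the isolated node raises `NoQuorum`. That is the "unavailable but consistent" outcome described above.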
AP Systems: Availability over Consistency
When a partition occurs, AP systems:
- Continue serving requests from all nodes
- May return stale or inconsistent data
- Resolve conflicts later (eventual consistency)
When to choose AP:
- Social media feeds (seeing a post 30 seconds late is fine)
- Product catalogs (slight staleness is acceptable)
- User sessions (some duplication is tolerable)
- Metrics and analytics (approximate is okay)
Example: Social Media
If Node A and Node B can't communicate:
- Both continue serving timeline requests
- A user might see slightly stale content
- When partition heals, systems sync up (eventual consistency)
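One common way to "sync up" after the partition heals is last-write-wins. The sketch below assumes each replica stores `key -> (timestamp, value)` and that timestamps are comparable across nodes, which real clocks only approximate. The function name and data layout are illustrative.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Merge two key -> (timestamp, value) maps, keeping the newest write per key.

    On a timestamp tie the entry from replica_a is kept.
    """
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Each side accepted writes independently during the partition.
side_a = {"post:1": (10, "hello"), "post:2": (12, "draft")}
side_b = {"post:2": (15, "final"), "post:3": (11, "new")}
healed = merge_last_write_wins(side_a, side_b)
```

Note that the older write to `post:2` is silently discarded. That is acceptable for a feed, but it is the reason last-write-wins is a poor fit for data such as balances.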
Common Misconceptions
Misconception 1: "Pick Two"
Wrong: "We chose CA, so we don't need partition tolerance."
Reality: You don't choose partition tolerance; it's a reality of distributed systems. The choice is between CP and AP when partitions occur.
A "CA" system is essentially a single-server system (no partitions possible because there's only one node).
Misconception 2: CAP is Binary
Wrong: "MongoDB is CP, so it's never available during partitions."
Reality: CAP properties exist on a spectrum. Systems can:
- Be CP for some operations, AP for others
- Allow tunable consistency (choose per-request)
- Sacrifice availability only during actual partitions
Misconception 3: Consistency = ACID
Wrong: "CAP consistency means we have transactions."
Reality:
- CAP Consistency = All nodes see same data (linearizability)
- ACID Consistency = Data satisfies business rules
They're different concepts with confusingly similar names.
Misconception 4: Availability = Uptime
Wrong: "99.99% uptime means we're available in CAP terms."
Reality: CAP availability means every request to a non-failing node receives a response. A system could have 99.99% uptime but still sacrifice CAP availability during partitions by returning errors.
Real-World Database Examples
Let's examine how real databases handle CAP trade-offs.
MongoDB (Default: CP)
MongoDB uses a primary-secondary replication model.
During normal operation:
- Writes go to primary
- Primary replicates to secondaries
- Reads can go to primary (consistent) or secondaries (potentially stale)
During partition:
- If primary is isolated, election occurs
- Minority partition becomes unavailable (can't write)
- Majority partition elects new primary
- System prioritizes consistency over availability
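The majority rule behind this behavior can be sketched as follows. It is a simplified model that treats every member as a voter and ignores priorities and arbiters, and `writable_side` is a hypothetical helper, not part of any MongoDB API.

```python
def writable_side(partition_sides):
    """Return the side holding a strict majority of members, or None.

    partition_sides is a list of lists of member names, one list per side.
    Only a side with a strict majority can elect a primary and accept writes.
    """
    total = sum(len(side) for side in partition_sides)
    for side in partition_sides:
        if len(side) > total // 2:
            return side
    return None
```

For a three-member set split as `[["a", "b"], ["c"]]`, the side with `a` and `b` stays writable. For a four-member set split evenly, no side has a majority and the whole set is unavailable for writes, which is one reason odd member counts are usually recommended.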
Configuration options:
- "w: majority": Wait for majority acknowledgment (more consistent)
- "w: 1": Only wait for primary (faster, less durable)
- "readPreference: secondary": Allow stale reads (more available)
Cassandra (Default: AP)
Cassandra uses a peer-to-peer model with tunable consistency.
During normal operation:
- Writes go to multiple replicas
- Reads query multiple replicas
- Conflicts resolved by timestamp (last-write-wins)
During partition:
- Both partitions continue accepting reads and writes
- When partition heals, data synchronizes
- Conflicts resolved automatically or manually
Configuration options:
- CONSISTENCY ONE: Fast, potentially stale
- CONSISTENCY QUORUM: Balanced
- CONSISTENCY ALL: Consistent but slow, less available
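The effect of these levels follows from simple arithmetic: with N replicas, a read that contacts R of them is guaranteed to overlap a write acknowledged by W of them whenever R + W > N. A small sketch of that rule, with helper names chosen for illustration:

```python
def quorum(replication_factor):
    """Number of replicas in a quorum: a strict majority."""
    return replication_factor // 2 + 1


def reads_overlap_writes(n, w, r):
    """True when every set of r read replicas shares a node with every set of w write replicas."""
    return r + w > n


N = 3
one, majority, all_replicas = 1, quorum(N), N

# QUORUM writes + QUORUM reads: 2 + 2 > 3, so reads see the latest acknowledged write.
quorum_both = reads_overlap_writes(N, majority, majority)
# ONE + ONE: 1 + 1 <= 3, so a read may hit a replica that missed the write.
one_both = reads_overlap_writes(N, one, one)
```

This only captures the overlap condition. It says nothing about concurrent writes, which Cassandra still resolves by timestamp.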
DynamoDB (Default: AP, Configurable)
AWS DynamoDB offers flexibility:
Eventually Consistent Reads (default):
- High availability
- May return stale data
- Lower latency, lower cost
Strongly Consistent Reads:
- Returns most recent write
- Higher latency
- May fail during partitions
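In the DynamoDB API this choice is a per-request flag: the `GetItem` operation accepts a `ConsistentRead` parameter. The sketch below only builds the request arguments so it runs without AWS credentials. The table name `Accounts` and key attribute `account_id` are made up for the example.

```python
def build_get_item_request(table_name, key, strongly_consistent=False):
    """Build keyword arguments for a DynamoDB GetItem call.

    ConsistentRead=False (the default) requests an eventually consistent read;
    ConsistentRead=True requests a strongly consistent one.
    """
    return {
        "TableName": table_name,
        "Key": key,
        "ConsistentRead": strongly_consistent,
    }


# Hypothetical table and key, using the low-level attribute-value format.
balance_read = build_get_item_request(
    "Accounts", {"account_id": {"S": "42"}}, strongly_consistent=True
)
```

With boto3, a dictionary like this would typically be passed as `client.get_item(**balance_read)`. Using the flag only on reads that need it keeps the cheaper eventually consistent path for everything else.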
PostgreSQL (Single-node: CA, Replicated: Varies)
Single node:
- No partitions possible
- Fully consistent and available
- Not partition tolerant (single point of failure)
With streaming replication:
- Synchronous: CP (waits for replica acknowledgment)
- Asynchronous: Potential for data loss on failover
Applying CAP in System Design Interviews
When discussing database choices in interviews, demonstrate CAP awareness:
Step 1: Identify Consistency Requirements
Ask yourself:
- What happens if users see stale data?
- Is this a financial/inventory system where correctness matters?
- Can we recover from inconsistency?
Step 2: Consider Availability Requirements
Ask yourself:
- What's the cost of downtime?
- Do users need 24/7 access?
- Is it better to return stale data or an error?
Step 3: Make and Justify the Trade-off
Example: E-commerce Product Catalog
"For the product catalog, I'd choose an AP system like Cassandra. Here's why:
Consistency: Product details (name, description, images) don't change frequently. If a user sees a slightly stale price for a few seconds, it's not catastrophic.
Availability: We need the catalog available 24/7. Users can't browse or buy if the catalog is down. Revenue impact is huge.
The trade-off: We'll use eventual consistency. Users might briefly see old data, but the system stays up during network issues."
Example: Bank Account Balance
"For account balances, I'd choose a CP system. Here's why:
Consistency: Users must see accurate balances. Showing a wrong balance leads to overdrafts, failed payments, and lost trust.
Availability: During partitions, I'd rather show 'temporarily unavailable' than show incorrect data.
The trade-off: During rare network partitions, users might see error messages. This is better than incorrect balances."
Step 4: Discuss Mitigation Strategies
Show that you understand the nuances:
For AP systems:
"We accept eventual consistency, but we can mitigate issues by:
- Using conflict resolution strategies (last-write-wins, application-specific merge)
- Showing users timestamps ('updated 5 seconds ago')
- Making critical paths strongly consistent while keeping others eventual"
For CP systems:
"We prioritize consistency, but we can improve availability by:
- Using read replicas for read-heavy workloads
- Implementing circuit breakers for graceful degradation
- Caching with short TTLs for frequently accessed data"
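As an illustration of the last mitigation, here is a minimal TTL cache sketch. The class name and the injectable `clock` parameter are assumptions made for the example, and production systems would more likely use an existing cache such as Redis or Memcached.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}      # key -> (expires_at, value)

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]
            return None
        return value
```

The TTL bounds how stale a cached read can be, so this trades a small, explicit amount of consistency for availability. It suits reads such as display-only data, not the balance check that gates a withdrawal.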
Beyond CAP: The PACELC Theorem
CAP only describes behavior during partitions. But what about normal operation?
The PACELC theorem extends CAP:
If there is a Partition, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.
| System | During Partition | Normal Operation |
|---|---|---|
| Cassandra | PA (Available) | EL (Low Latency) |
| MongoDB | PC (Consistent) | EC (Consistent) |
| DynamoDB | PA (Available) | EL or EC (Configurable) |
| PNUTS (Yahoo) | PC (Consistent) | EL (Low Latency) |
Why PACELC Matters
Even without partitions, there's a consistency-latency trade-off:
Strong consistency requires coordination:
- Wait for acknowledgment from multiple nodes
- Higher latency
- Lower throughput
Eventual consistency avoids coordination:
- Write locally, sync later
- Lower latency
- Higher throughput
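The latency cost of coordination can be made concrete with a little arithmetic. Assuming replicas are contacted in parallel, a write that waits for k acknowledgments completes when the k-th fastest replica responds. The function name and the latency figures below are illustrative only.

```python
def write_latency_ms(replica_latencies_ms, acks_required):
    """Latency of a write that returns once acks_required replicas have acknowledged.

    Assumes all replicas are contacted in parallel.
    """
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the replica count")
    return sorted(replica_latencies_ms)[acks_required - 1]


# Hypothetical round-trip times: same rack, same region, remote region.
latencies = [2, 15, 120]
fast = write_latency_ms(latencies, 1)     # wait for one replica
quorum = write_latency_ms(latencies, 2)   # wait for a majority
strict = write_latency_ms(latencies, 3)   # wait for all replicas
```

With these numbers the write takes 2 ms, 15 ms, or 120 ms depending on how many acknowledgments it waits for, even though the network is perfectly healthy.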
Example: Cassandra (PA/EL)
- During partition (PA): Stays available, accepts writes to both partitions, reconciles later
- Normal operation (EL): Default CONSISTENCY ONE prioritizes latency over consistency
Example: MongoDB (PC/EC)
- During partition (PC): Primary-only writes, secondary reads may be stale, prioritizes consistency
- Normal operation (EC): Writes to primary with replication, consistent reads by default
CAP in Practice: Real Design Decisions
Scenario 1: User Session Storage
Requirements:
- Users should stay logged in
- Session data accessed with every request
- Global users across regions
Analysis:
- Stale session data is annoying (forced re-login) but not catastrophic
- Downtime means nobody can use the app
- High read volume, low write volume
Decision: AP system (Redis Cluster, DynamoDB)
"I'd use an AP approach. If a user's session is briefly stale, worst case they see slightly old preferences. But if sessions are unavailable, the whole app is unusable. I'd use DynamoDB with eventual consistency for reads."
Scenario 2: Order Processing
Requirements:
- Process customer orders
- Update inventory
- Handle payments
Analysis:
- Incorrect inventory leads to overselling (bad)
- Incorrect payment processing is unacceptable
- Downtime loses revenue but is recoverable
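One way to make that downtime recoverable is to queue orders that fail while the database is unreachable and retry them once connectivity returns. The sketch below is a simplified in-memory version. `OrderRetryQueue` and the `process_order` callback are hypothetical, and the callback is assumed to raise `ConnectionError` during a partition.

```python
from collections import deque


class OrderRetryQueue:
    """Holds orders that could not be processed and retries them in arrival order."""

    def __init__(self, process_order):
        self.process_order = process_order  # assumed to raise ConnectionError on partition
        self.pending = deque()

    def submit(self, order):
        """Try to process now; queue the order if the backend is unreachable."""
        try:
            self.process_order(order)
            return True
        except ConnectionError:
            self.pending.append(order)
            return False

    def drain(self):
        """Retry queued orders; stop at the first one that still fails."""
        processed = 0
        while self.pending:
            try:
                self.process_order(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            processed += 1
        return processed
```

For this to be safe, `process_order` should be idempotent (for example, keyed by an order ID), since a request may have partially succeeded before the connection failed. A real deployment would also persist the queue rather than keep it in memory.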
Decision: CP system (PostgreSQL, CockroachDB)
"I'd prioritize consistency here. If we oversell inventory or double-charge customers, we have serious problems. During partitions, I'd rather return errors than process incorrect orders. We can use a queue to retry failed orders when the partition heals."
Scenario 3: Social Media Feed
Requirements:
- Show posts from followed users
- Support millions of users globally
- Real-time-ish updates
Analysis:
- Seeing a post 10 seconds late is fine
- Complete downtime means users can't scroll (engagement loss)
- Write volume is high (many posts per second)
Decision: AP system (Cassandra, DynamoDB)
"Eventual consistency is perfect here. Users don't notice if a post appears a few seconds late. But if the feed is completely down, users leave. I'd use Cassandra with eventual consistency and optimize for availability and low latency."
Summary: CAP Decision Framework
Use this framework when making CAP decisions:
1. Ask: "What happens if data is stale?"
| Answer | Implication |
|---|---|
| "Users are confused but no real harm" | AP is probably fine |
| "Financial loss or data corruption" | CP is necessary |
| "Depends on the operation" | Consider per-operation consistency |
2. Ask: "What happens during downtime?"
| Answer | Implication |
|---|---|
| "Users can wait or retry later" | CP is acceptable |
| "Revenue loss, user churn" | AP is important |
| "Safety or compliance issues" | Depends on specifics |
3. Make the trade-off explicit
"For this system, I'm choosing [AP/CP] because [specific reason]. The trade-off is [what we sacrifice], which we'll mitigate by [strategy]."
Key Takeaways
- CAP is about partitions. The real question is: when networks fail (and they will), do you prioritize consistency or availability?
- "Pick two" is misleading. You don't pick partition tolerance; it's a fact of distributed life. You choose between CP and AP.
- Most systems are configurable. Modern databases let you tune consistency per-operation. Use strong consistency where it matters, eventual elsewhere.
- Consider PACELC. Even without partitions, there's a latency-consistency trade-off. Systems make choices on both axes.
- Context matters. There's no universally correct choice. Banking needs CP. Social media often works fine with AP. Know your requirements.
Frequently Asked Questions
Is CAP theorem still relevant with modern cloud databases?
Yes, but the landscape has evolved. Modern databases like CockroachDB and Spanner offer strong consistency with high availability through sophisticated techniques (synchronized clocks, consensus algorithms). But the fundamental trade-off still exists; these systems just push the boundaries.
Can a system be CA?
A true CA system would have no network partitions, which is only possible with a single server. As soon as you have multiple nodes, partitions become possible, and you must choose CP or AP when they occur.
How do I explain CAP in an interview?
Keep it simple: "CAP says that during a network partition, a distributed system must choose between consistency (all nodes see the same data) and availability (all nodes respond to requests). My choice depends on whether stale data or downtime is worse for this use case."
What about systems that claim to be "fully consistent and available"?
They likely mean one of:
- Single-node (no partitions possible)
- Consistency/availability under normal operation, with trade-offs during partitions
- Marketing speak that doesn't hold under scrutiny
Always understand what happens during network failures.