Back to Blog
distributed-systemscap-theoremfundamentals

CAP Theorem Explained: What Every Engineer Should Know

A clear explanation of the CAP theorem with real-world examples. Learn how to apply CAP theorem thinking in system design interviews.

14 min readBy SystemExperts
From the Interviewer's Side

Ready to Master System Design Interviews?

Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.

Complete Solutions

Architecture diagrams & trade-off analysis

Real Interview Problems

From actual FAANG interviews

7-day money-back guarantee • Lifetime access • New problems added quarterly

"We need strong consistency for this feature."

"But that will hurt availability during network partitions."

"Can't we have both?"

This conversation happens in engineering teams constantly. The CAP theorem explains why you can't have everything, and more importantly, how to make the right trade-offs for your system.

The CAP theorem is one of the most cited concepts in distributed systems, yet it's also one of the most misunderstood. In this guide, I'll explain what it actually means, clear up common misconceptions, and show you how to apply it in system design interviews.


What is the CAP Theorem?

The CAP theorem, proposed by Eric Brewer in 2000 and later proven by Seth Gilbert and Nancy Lynch, states:

A distributed data store cannot simultaneously provide more than two of these three guarantees:

  1. Consistency , Every read receives the most recent write
  2. Availability , Every request receives a response (not an error)
  3. Partition Tolerance , The system continues to operate despite network failures

The theorem is often stated as "pick two out of three," but this is misleading. Let me explain why.


The Three Properties Explained

Consistency (C)

Definition: Every read receives the most recent write, or an error.

If you write a value and then immediately read it from any node, you get the value you just wrote. All nodes see the same data at the same time.

Example:

Time 0: Write X=5 to Node A
Time 1: Read X from Node B → Returns 5 (not an old value)

This is called strong consistency or linearizability. It's what you'd expect from a single-server database.

What consistency is NOT:

  • It's not about data being "correct" or "valid"
  • It's not about transactions (that's ACID)
  • It's specifically about read-after-write visibility across nodes

Availability (A)

Definition: Every request receives a (non-error) response, without guarantee that it contains the most recent write.

If you send a request to any working node, you get a response. The system doesn't hang, timeout, or return an error due to unavailable data.

Example:

Request: Read X from Node B
Response: X=3 (an older value, but still a valid response)

What availability is NOT:

  • It doesn't mean 100% uptime
  • It doesn't mean low latency
  • It means: working nodes must respond to requests

Partition Tolerance (P)

Definition: The system continues to operate despite arbitrary message loss or failure of part of the system.

A network partition occurs when nodes can't communicate with each other. This happens due to:

  • Network cable being cut
  • Switch failure
  • Data center connectivity issues
  • Network congestion causing timeouts

Example:

Normal state:     [Node A] ←→ [Node B] ←→ [Node C]

Partition:        [Node A] ←→ [Node B]     [Node C]
                         ↑ cannot reach ↑

Partition tolerance means the system doesn't completely fail when this happens.


Why You Can't Have All Three

Here's the key insight: network partitions are unavoidable in distributed systems. Networks fail. This isn't a theoretical concern, it happens regularly in production.

So the real choice is: when a partition occurs, do you sacrifice Consistency or Availability?

The Proof (Simplified)

Imagine two nodes, A and B, and a network partition occurs between them.

Scenario: A client writes X=5 to Node A. Another client tries to read X from Node B.

Option 1: Prioritize Consistency (CP)

Node B can't verify the latest value with Node A (they're partitioned). So Node B either:

  • Waits indefinitely (sacrificing availability)
  • Returns an error (sacrificing availability)

Option 2: Prioritize Availability (AP)

Node B returns its local value (maybe X=3, the old value). The client gets a response (available), but it's stale (inconsistent).

You can't do both. If you want both nodes to respond AND return the same value, you need them to communicate. During a partition, they can't.


The Real Choice: CP vs AP

Since partitions are inevitable, every distributed system must choose:

ChoiceBehavior During PartitionExample Systems
CPSacrifices availability for consistencyMongoDB, HBase, Redis (in cluster mode)
APSacrifices consistency for availabilityCassandra, DynamoDB, CouchDB

CP Systems: Consistency over Availability

When a partition occurs, CP systems:

  • Refuse to serve requests if they can't guarantee consistency
  • Return errors rather than stale data
  • May become unavailable to some or all users

When to choose CP:

  • Financial transactions (bank transfers)
  • Inventory systems (can't sell same item twice)
  • Leader election (only one leader allowed)
  • Anything where stale data causes real problems

Example: Banking

If Node A and Node B can't communicate, and both receive withdrawal requests for the same account, a CP system would:

  • Only allow withdrawals from one node
  • Reject requests to the other (unavailable)
  • Prevent overdrafts (consistent)

AP Systems: Availability over Consistency

When a partition occurs, AP systems:

  • Continue serving requests from all nodes
  • May return stale or inconsistent data
  • Resolve conflicts later (eventual consistency)

When to choose AP:

  • Social media feeds (seeing a post 30 seconds late is fine)
  • Product catalogs (slight staleness is acceptable)
  • User sessions (some duplication is tolerable)
  • Metrics and analytics (approximate is okay)

Example: Social Media

If Node A and Node B can't communicate:

  • Both continue serving timeline requests
  • A user might see slightly stale content
  • When partition heals, systems sync up (eventual consistency)

Common Misconceptions

Misconception 1: "Pick Two"

Wrong: "We chose CA, so we don't need partition tolerance."

Reality: You don't choose partition tolerance, it's a reality of distributed systems. The choice is between CP and AP when partitions occur.

A "CA" system is essentially a single-server system (no partitions possible because there's only one node).

Misconception 2: CAP is Binary

Wrong: "MongoDB is CP, so it's never available during partitions."

Reality: CAP properties exist on a spectrum. Systems can:

  • Be CP for some operations, AP for others
  • Allow tunable consistency (choose per-request)
  • Sacrifice availability only during actual partitions

Misconception 3: Consistency = ACID

Wrong: "CAP consistency means we have transactions."

Reality:

  • CAP Consistency = All nodes see same data (linearizability)
  • ACID Consistency = Data satisfies business rules

They're different concepts with confusingly similar names.

Misconception 4: Availability = Uptime

Wrong: "99.99% uptime means we're available in CAP terms."

Reality: CAP availability means every request to a non-failing node receives a response. A system could have 99.99% uptime but still sacrifice CAP availability during partitions by returning errors.


Real-World Database Examples

Let's examine how real databases handle CAP trade-offs.

MongoDB (Default: CP)

MongoDB uses a primary-secondary replication model.

During normal operation:

  • Writes go to primary
  • Primary replicates to secondaries
  • Reads can go to primary (consistent) or secondaries (potentially stale)

During partition:

  • If primary is isolated, election occurs
  • Minority partition becomes unavailable (can't write)
  • Majority partition elects new primary
  • System prioritizes consistency over availability

Configuration options:

  • w: majority , Wait for majority acknowledgment (more consistent)
  • w: 1 , Only wait for primary (faster, less durable)
  • readPreference: secondary , Allow stale reads (more available)

Cassandra (Default: AP)

Cassandra uses a peer-to-peer model with tunable consistency.

During normal operation:

  • Writes go to multiple replicas
  • Reads query multiple replicas
  • Conflicts resolved by timestamp (last-write-wins)

During partition:

  • Both partitions continue accepting reads and writes
  • When partition heals, data synchronizes
  • Conflicts resolved automatically or manually

Configuration options:

  • CONSISTENCY ONE , Fast, potentially stale
  • CONSISTENCY QUORUM , Balanced
  • CONSISTENCY ALL , Consistent but slow, less available

DynamoDB (Default: AP, Configurable)

AWS DynamoDB offers flexibility:

Eventually Consistent Reads (default):

  • High availability
  • May return stale data
  • Lower latency, lower cost

Strongly Consistent Reads:

  • Returns most recent write
  • Higher latency
  • May fail during partitions

PostgreSQL (Single-node: CA, Replicated: Varies)

Single node:

  • No partitions possible
  • Fully consistent and available
  • Not partition tolerant (single point of failure)

With streaming replication:

  • Synchronous: CP (waits for replica acknowledgment)
  • Asynchronous: Potential for data loss on failover

Applying CAP in System Design Interviews

When discussing database choices in interviews, demonstrate CAP awareness:

Step 1: Identify Consistency Requirements

Ask yourself:

  • What happens if users see stale data?
  • Is this a financial/inventory system where correctness matters?
  • Can we recover from inconsistency?

Step 2: Consider Availability Requirements

Ask yourself:

  • What's the cost of downtime?
  • Do users need 24/7 access?
  • Is it better to return stale data or an error?

Step 3: Make and Justify the Trade-off

Example: E-commerce Product Catalog

"For the product catalog, I'd choose an AP system like Cassandra. Here's why:

Consistency: Product details (name, description, images) don't change frequently. If a user sees a slightly stale price for a few seconds, it's not catastrophic.

Availability: We need the catalog available 24/7. Users can't browse or buy if the catalog is down. Revenue impact is huge.

The trade-off: We'll use eventual consistency. Users might briefly see old data, but the system stays up during network issues."

Example: Bank Account Balance

"For account balances, I'd choose a CP system. Here's why:

Consistency: Users must see accurate balances. Showing a wrong balance leads to overdrafts, failed payments, and lost trust.

Availability: During partitions, I'd rather show 'temporarily unavailable' than show incorrect data.

The trade-off: During rare network partitions, users might see error messages. This is better than incorrect balances."

Step 4: Discuss Mitigation Strategies

Show that you understand the nuances:

For AP systems:

"We accept eventual consistency, but we can mitigate issues by:

  • Using conflict resolution strategies (last-write-wins, application-specific merge)
  • Showing users timestamps ('updated 5 seconds ago')
  • Making critical paths strongly consistent while keeping others eventual"

For CP systems:

"We prioritize consistency, but we can improve availability by:

  • Using read replicas for read-heavy workloads
  • Implementing circuit breakers for graceful degradation
  • Caching with short TTLs for frequently accessed data"

Beyond CAP: The PACELC Theorem

CAP only describes behavior during partitions. But what about normal operation?

The PACELC theorem extends CAP:

If there is a Partition, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.

SystemDuring PartitionNormal Operation
CassandraPA (Available)EL (Low Latency)
MongoDBPC (Consistent)EC (Consistent)
DynamoDBPA (Available)EL or EC (Configurable)
PNUTS (Yahoo)PC (Consistent)EL (Low Latency)

Why PACELC Matters

Even without partitions, there's a consistency-latency trade-off:

Strong consistency requires coordination:

  • Wait for acknowledgment from multiple nodes
  • Higher latency
  • Lower throughput

Eventual consistency avoids coordination:

  • Write locally, sync later
  • Lower latency
  • Higher throughput

Example: Cassandra (PA/EL)

  • During partition (PA): Stays available, accepts writes to both partitions, reconciles later
  • Normal operation (EL): Default CONSISTENCY ONE prioritizes latency over consistency

Example: MongoDB (PC/EC)

  • During partition (PC): Primary-only writes, secondary reads may be stale, prioritizes consistency
  • Normal operation (EC): Writes to primary with replication, consistent reads by default

CAP in Practice: Real Design Decisions

Scenario 1: User Session Storage

Requirements:

  • Users should stay logged in
  • Session data accessed with every request
  • Global users across regions

Analysis:

  • Stale session data is annoying (forced re-login) but not catastrophic
  • Downtime means nobody can use the app
  • High read volume, low write volume

Decision: AP system (Redis Cluster, DynamoDB)

"I'd use an AP approach. If a user's session is briefly stale, worst case they see slightly old preferences. But if sessions are unavailable, the whole app is unusable. I'd use DynamoDB with eventual consistency for reads."

Scenario 2: Order Processing

Requirements:

  • Process customer orders
  • Update inventory
  • Handle payments

Analysis:

  • Incorrect inventory leads to overselling (bad)
  • Incorrect payment processing is unacceptable
  • Downtime loses revenue but is recoverable

Decision: CP system (PostgreSQL, CockroachDB)

"I'd prioritize consistency here. If we oversell inventory or double-charge customers, we have serious problems. During partitions, I'd rather return errors than process incorrect orders. We can use a queue to retry failed orders when the partition heals."

Scenario 3: Social Media Feed

Requirements:

  • Show posts from followed users
  • Support millions of users globally
  • Real-time-ish updates

Analysis:

  • Seeing a post 10 seconds late is fine
  • Complete downtime means users can't scroll (engagement loss)
  • Write volume is high (many posts per second)

Decision: AP system (Cassandra, DynamoDB)

"Eventual consistency is perfect here. Users don't notice if a post appears a few seconds late. But if the feed is completely down, users leave. I'd use Cassandra with eventual consistency and optimize for availability and low latency."


Summary: CAP Decision Framework

Use this framework when making CAP decisions:

1. Ask: "What happens if data is stale?"

AnswerImplication
"Users are confused but no real harm"AP is probably fine
"Financial loss or data corruption"CP is necessary
"Depends on the operation"Consider per-operation consistency

2. Ask: "What happens during downtime?"

AnswerImplication
"Users can wait or retry later"CP is acceptable
"Revenue loss, user churn"AP is important
"Safety or compliance issues"Depends on specifics

3. Make the trade-off explicit

"For this system, I'm choosing [AP/CP] because [specific reason]. The trade-off is [what we sacrifice], which we'll mitigate by [strategy]."


Key Takeaways

  1. CAP is about partitions. The real question is: when networks fail (and they will), do you prioritize consistency or availability?

  2. "Pick two" is misleading. You don't pick partition tolerance, it's a fact of distributed life. You choose between CP and AP.

  3. Most systems are configurable. Modern databases let you tune consistency per-operation. Use strong consistency where it matters, eventual elsewhere.

  4. Consider PACELC. Even without partitions, there's a latency-consistency trade-off. Systems make choices on both axes.

  5. Context matters. There's no universally correct choice. Banking needs CP. Social media often works fine with AP. Know your requirements.---

Frequently Asked Questions

Is CAP theorem still relevant with modern cloud databases?

Yes, but the landscape has evolved. Modern databases like CockroachDB and Spanner offer strong consistency with high availability through sophisticated techniques (synchronized clocks, consensus algorithms). But the fundamental trade-off still exists, these systems just push the boundaries.

Can a system be CA?

A true CA system would have no network partitions, which is only possible with a single server. As soon as you have multiple nodes, partitions become possible, and you must choose CP or AP when they occur.

How do I explain CAP in an interview?

Keep it simple: "CAP says that during a network partition, a distributed system must choose between consistency (all nodes see the same data) and availability (all nodes respond to requests). My choice depends on whether stale data or downtime is worse for this use case."

What about systems that claim to be "fully consistent and available"?

They likely mean one of:

  1. Single-node (no partitions possible)
  2. Consistency/availability under normal operation, with trade-offs during partitions
  3. Marketing speak that doesn't hold under scrutiny

Always understand what happens during network failures.

From the Interviewer's Side

Ready to Master System Design Interviews?

Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.

Complete Solutions

Architecture diagrams & trade-off analysis

Real Interview Problems

From actual FAANG interviews

7-day money-back guarantee • Lifetime access • New problems added quarterly

FREE DOWNLOAD • 7-PAGE PDF

FREE: System Design Interview Cheat Sheet

Get the 7-page PDF cheat sheet with critical numbers, decision frameworks, and the interview approach used by 10,000+ engineers.

Includes:Critical NumbersDecision Frameworks35 Patterns5-Step Method

No spam. Unsubscribe anytime.