
Building Data-Intensive Systems: The Complete Guide (2026)

Learn how to choose the right database, scale your system, and design data architecture. Covers PostgreSQL vs MongoDB, database sharding, replication, caching with Redis, and CAP theorem with real examples from Instagram, Uber, and Netflix.

39 min read · By SystemExperts

You're building an application. It works great with 100 users. Then 10,000. Then 1 million. Suddenly, your database is slow, queries timeout, and users complain.

This is where most engineers panic. They hear about "sharding" and "NoSQL" and start rewriting everything. Usually making things worse.

After working on systems handling billions of records at companies like Instagram, Uber, and Netflix, I've learned the truth: 90% of scaling problems are solved by boring solutions. PostgreSQL + Redis handles more than you think.

This guide teaches you the practical framework for data-intensive systems:

  • When to use which database (PostgreSQL vs MongoDB vs Redis vs Cassandra)
  • How to scale reads with replication and caching
  • When you actually need sharding (later than you think)
  • Real architectures from Instagram, Uber, Netflix, and Stack Overflow

The hardest part is not the technology; it's knowing when to use what. This guide gives you that framework.


What is a Data-Intensive System?

A data-intensive system is one where the main challenge is handling data: storing it, finding it, moving it, and keeping it correct. Compare this to a compute-intensive system (like video encoding), where the challenge is doing lots of calculations.

Real-world examples of data-intensive systems:

  • Facebook: Stores billions of posts, photos, and messages
  • Google Search: Indexes trillions of web pages
  • Netflix: Stores and streams millions of videos
  • Your bank: Tracks every transaction you've ever made
  • Even a simple blog: Stores posts, comments, and user accounts

Every data system must do three things:

  1. Store data so you can find it later (databases)
  2. Remember expensive calculations so you don't redo them (caches)
  3. Send messages between different parts of the system (queues)

This guide covers all three, but mainly focuses on choosing the right database and scaling it.

The Golden Rule of Database Selection: Boring technology is usually the right choice. PostgreSQL, MySQL, Redis, and Kafka are "boring" because millions of people use them. That means bugs are found and fixed, documentation is excellent, and you can hire engineers who know them. Exciting new databases are exciting because nobody knows their problems yet.

How to Choose the Right Database Type

Think of databases like containers for data. Different containers are good for different things:

Your Data Shape               | Best Database Type | Examples
Structured with relationships | Relational         | PostgreSQL, MySQL
Flexible documents            | Document           | MongoDB, CouchDB
Simple key-value lookups      | Key-Value          | Redis, DynamoDB
Connected graph of things     | Graph              | Neo4j, Amazon Neptune
Time-ordered events           | Time-Series        | TimescaleDB, InfluxDB

Quick mental models:

  • Relational database = Filing cabinet. Great when your data has a clear structure, like customer records with name, address, and order history.
  • Document database = Folder of papers. Great when each item is different, like blog posts where each post has different fields.
  • Key-value store = Dictionary. Great when you just need to look things up by name, like session data.
  • Graph database = Family tree. Great when relationships matter most, like social networks.

Picking the right database makes your life easy. Picking the wrong one means fighting your database every day.


SQL vs NoSQL: Choosing the Right Data Model

Understanding data models is crucial for system design interviews and real-world architecture decisions. Let's break down each type.

Relational Databases (SQL): PostgreSQL, MySQL, SQLite

This is what most engineers think of when they hear "database." Data is organized into tables. Each row is one record. Each column is one piece of information.

Example: A customer database

id | name  | email          | city
1  | Alice | alice@mail.com | New York
2  | Bob   | bob@mail.com   | London

When to use relational databases:

  • Your data has a clear, fixed structure
  • You need to join data together (show me all orders for customer X)
  • You need transactions (transfer money from A to B - both must succeed or both must fail)
  • You are building most web applications

Real-world examples: Banking systems, e-commerce, inventory management, most SaaS products.

Why relational databases are so popular: Relational databases have been around for 50 years. They are well understood, well documented, and handle 99% of use cases. PostgreSQL and MySQL are free, reliable, and can handle millions of records easily. When in doubt, start with PostgreSQL.

Document Databases (NoSQL): MongoDB, CouchDB, Firestore

Instead of rows in a table, you store documents (JSON-like objects). Each document can have different fields. This is where "NoSQL" became popular.

Example: A blog post

{
  "id": "post_123",
  "title": "My First Post",
  "content": "Hello world...",
  "author": {
    "name": "Alice",
    "bio": "Writer and coder"
  },
  "tags": ["hello", "first"],
  "comments": [
    {"user": "Bob", "text": "Great post!"}
  ]
}

When to use document databases:

  • Each record can have different fields
  • You mostly read and write whole documents at once
  • Your data has natural groupings (a blog post with its comments)
  • You need flexible schemas that change often

Real-world examples: Content management, user profiles with varying fields, product catalogs where products have different attributes.

Key-Value Stores: Redis, Memcached, DynamoDB

The simplest and fastest data model. You store a value and give it a key (a name). To get it back, you ask for it by key.

Think of it like a giant dictionary:

  • Key: user:123:session → Value: {loggedIn: true, cart: [...]}
  • Key: cache:homepage → Value: <html>...</html>

When to use key-value stores:

  • You always know the key you want
  • You do not need to search inside the values
  • You need super fast reads and writes
  • You are building caches or session storage

Real-world examples: Session storage, caching, feature flags, rate limiting, leaderboards.

Warning - Key-value trap: Key-value stores are fast because they are simple. If you start doing complex queries (find all users in New York), you are using the wrong database. Key-value is for lookups by key, not for searching.
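To make the lookup-by-key model concrete, here is a tiny in-memory stand-in for a store like Redis, with the kind of TTL-based expiry session data typically uses. The `TinyKV` class is purely illustrative (a real deployment would use an actual Redis client), but the interface mirrors the basic set/get-with-expiry pattern:

```python
import time

class TinyKV:
    """In-memory stand-in for a key-value store like Redis (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        # ttl is in seconds; None means the entry never expires
        expires_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # lazily expire stale entries on read
            return None
        return value

kv = TinyKV()
kv.set("user:123:session", {"loggedIn": True, "cart": []}, ttl=3600)
```

Notice there is no way to ask "find all logged-in users" - that is exactly the key-value trap the warning above describes.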

Graph Databases: Neo4j, Amazon Neptune, JanusGraph

When the connections between things matter more than the things themselves, use a graph database. This is essential for social networks, recommendation engines, and fraud detection.

Example: Social network

  • Alice FOLLOWS Bob
  • Bob FOLLOWS Charlie
  • Charlie FOLLOWS Alice
  • Question: Who should we recommend Alice follow? (friends of friends)

In a relational database, this query needs many JOINs and gets slow fast. In a graph database, it is natural and fast.
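Purely as an illustration of why this query is "natural" for a graph, here is the friends-of-friends recommendation over an in-memory adjacency map in Python. A real graph database does the same traversal at scale, with indexes and a query language; the `follows` data and `recommend` helper are invented for this sketch:

```python
# Who follows whom, as an adjacency map (the example from above)
follows = {
    "alice": {"bob"},
    "bob": {"charlie"},
    "charlie": {"alice"},
}

def recommend(user):
    """People followed by the people `user` follows,
    excluding the user and anyone they already follow."""
    first_degree = follows.get(user, set())
    second_degree = set()
    for friend in first_degree:
        second_degree |= follows.get(friend, set())
    return second_degree - first_degree - {user}
```

Each extra "hop" is just another traversal step here, whereas in SQL it is another self-JOIN on the follows table.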

When to use graph databases:

  • Relationships are the main thing you query
  • You need to traverse connections (friend of friend of friend)
  • Your queries are about paths and connections
  • Examples: Social networks, recommendation engines, fraud detection, knowledge graphs

Real-world examples: LinkedIn connections, fraud detection (who is connected to known fraudsters), recommendation systems.

SQL vs NoSQL: Complete Comparison Table

Data Model       | Best For                                            | Not Good For                                  | Popular Databases
Relational (SQL) | Structured data, complex queries, ACID transactions | Highly connected data, very flexible schemas  | PostgreSQL, MySQL, SQLite
Document (NoSQL) | Flexible schemas, self-contained records            | Many-to-many relationships, complex joins     | MongoDB, CouchDB, Firestore
Key-Value        | Caching, sessions, simple lookups                   | Complex queries, searching by value           | Redis, Memcached, DynamoDB
Graph            | Social networks, recommendations, fraud detection   | Simple CRUD, tabular reports                  | Neo4j, Amazon Neptune, JanusGraph
Time-Series      | Metrics, IoT data, logs, events over time           | Random access, complex relationships          | TimescaleDB, InfluxDB, Prometheus

Bottom line: Start with PostgreSQL. Add Redis for caching. Only add other databases when you have a specific, measured problem that PostgreSQL can't solve.


B-Trees vs LSM-Trees: Database Storage Engines Explained

Understanding how databases store data helps you pick the right one and tune it properly. This is a common system design interview topic.

There are two main approaches, and they're optimized for different workloads.

Approach 1: B-Trees (Most Relational Databases)

Think of a B-tree like a really efficient filing system. Data is organized in a tree structure where finding any record takes the same amount of time.

Imagine a library:

  • To find a book, you do not search every shelf
  • You check the catalog: Fiction → Mystery → Authors A-D → Shelf 7
  • Each step narrows down where to look
  • Finding any book takes about the same number of steps

How B-trees work (simplified):

  1. Data is stored in sorted order
  2. There is an index (like a book catalog) that points to where data lives
  3. Updates modify the data in place
  4. Good for both reads AND writes

B-trees are used by: PostgreSQL, MySQL, SQLite, Oracle - most relational databases.

B-Tree: Like a library catalog system

              Root: 1-1000
         /      |      |      \
      1-250  251-500  501-750  751-1000
       /  \
 Records   Records
   1-50     51-100

Approach 2: LSM-Trees (Log-Structured Merge Trees)

Instead of updating data in place, LSM-trees just write new data to the end of a log. Periodically, they merge and compact these logs.

Imagine a notebook:

  • Instead of erasing and rewriting, you just add new entries at the end
  • To find something, you search backwards from the newest entry
  • Periodically, you rewrite the notebook, removing old crossed-out entries

How LSM-trees work (simplified):

  1. All writes go to memory first (very fast)
  2. When memory is full, flush to disk as a sorted file
  3. Periodically merge these files together
  4. Reading might need to check multiple files

LSM-trees are used by: Cassandra, RocksDB, LevelDB, HBase - databases optimized for writes.

B-Tree vs LSM-Tree Comparison

Feature             | B-Tree                          | LSM-Tree
Write speed         | Good - updates in place         | Excellent - just append to log
Read speed          | Excellent - one lookup          | Good - might check multiple files
Space usage         | Good - no duplication           | Needs more space - old data kept until compaction
Write amplification | Lower - each write updates once | Higher - data rewritten during compaction
Best for            | Balanced read/write workloads   | Write-heavy workloads

Which one to pick? For most applications, B-trees (PostgreSQL, MySQL) are the right choice. They are well-understood and handle both reads and writes well. Only consider LSM-trees if you have a write-heavy workload (like logging or time-series data) AND you have measured that writes are actually your bottleneck.

Row-Oriented vs Column-Oriented Storage

Most databases store data row by row:

Row 1: [Alice, 25, New York]
Row 2: [Bob, 30, London]
Row 3: [Charlie, 35, Paris]

This is great when you read whole rows (show me everything about Alice).

But what if you want to calculate the average age of all users? You need to read every row, but you only care about the age column. That is wasteful.

Column-oriented databases store data column by column:

Names: [Alice, Bob, Charlie]
Ages: [25, 30, 35]
Cities: [New York, London, Paris]

Now calculating average age just reads the ages column. Much faster for analytics!

When to use column-oriented databases:

  • You have millions/billions of rows
  • You run analytical queries (averages, sums, counts)
  • You read many rows but few columns
  • You are building data warehouses or analytics systems

Examples: Amazon Redshift, Google BigQuery, ClickHouse, Apache Parquet files.

Warning: Do not use column stores for OLTP. Column stores are terrible for normal application workloads (show me user 123, update their address). They are designed for analytics (what is the average order value by country). Using a column store for your main application database is a common expensive mistake.
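The row-versus-column trade-off above can be shown in a few lines. This sketch uses plain Python lists to mimic the two layouts (real column stores add compression and vectorized execution on top of this idea); the data is the three-user example from above:

```python
# Row layout: one tuple per record (name, age, city)
rows = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "London"),
    ("Charlie", 35, "Paris"),
]

# Column layout: one list per field
columns = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "London", "Paris"],
}

# Average age in a row store: scan every full row just to reach one field
avg_rows = sum(row[1] for row in rows) / len(rows)

# Average age in a column store: read only the age column
avg_cols = sum(columns["age"]) / len(columns["age"])
```

Both give the same answer, but the column version touches a fraction of the data - and at billions of rows, that fraction is the whole ballgame.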


Database Replication: Master-Slave, Multi-Master, and Leaderless

Database replication is how you achieve high availability and scale reads. This section covers the three main replication strategies.

Why You Need Database Replication

Imagine you have one computer storing all your data. What happens when:

  • That computer crashes? Your app is down.
  • Users in another country connect? Slow response.
  • Lots of people read at once? Computer gets overloaded.

The solution: Keep copies of your data on multiple computers.

The three reasons to replicate:

  1. High Availability - If one computer dies, others keep working
  2. Lower Latency - Put copies close to users (Europe copy for European users)
  3. Read Scaling - Spread read load across many computers

But there is a catch: Keeping copies in sync is HARD. What if someone updates the data on one copy? How do other copies find out?

Leader-Follower Replication (Master-Slave)

This is the most common replication pattern. One database is the leader (master). All writes go to the leader. The leader sends changes to followers (slaves). Reads can go to any copy.

Think of it like a classroom:

  • The teacher (leader) writes on the main whiteboard
  • Students (followers) copy what the teacher writes to their notebooks
  • If you want to change something, tell the teacher
  • If you just want to read, look at any notebook
                    ┌──────────────────┐
    All Writes ───▶ │  Leader (Main)   │
                    └────────┬─────────┘
                             │ Copy changes
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │Follower 1│  │Follower 2│  │Follower 3│
        └──────────┘  └──────────┘  └──────────┘
              ▲              ▲              ▲
              └──────────────┴──────────────┘
                      Read Requests

The Problem with Followers: Replication Lag

Followers get updates AFTER the leader. There is always a small delay (usually milliseconds, but can be seconds under load).

What can go wrong:

  1. User updates their profile (goes to leader)
  2. User immediately views their profile (goes to follower)
  3. Follower has not received the update yet
  4. User sees old data and thinks the update failed!

Solutions to replication lag:

  1. Read-your-own-writes: After a user writes, read from the leader for a few seconds
  2. Monotonic reads: Always route the same user to the same follower
  3. Consistent prefix reads: Make sure related data arrives in order

Most databases have these built in - you just need to configure them.
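To make solutions 1 and 2 concrete, here is a sketch of a read-routing helper. The five-second window, the `last_write_at` map, and the function names are all invented for illustration; real systems often track replication positions (LSNs) instead of wall-clock time:

```python
import time
import zlib

READ_YOUR_WRITES_WINDOW = 5.0  # seconds to pin a writer to the leader (tunable assumption)
last_write_at = {}             # user_id -> timestamp of that user's most recent write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_read_node(user_id, leader, followers):
    # Read-your-own-writes: users who wrote recently read from the leader,
    # so they always see their own update.
    if time.time() - last_write_at.get(user_id, 0.0) < READ_YOUR_WRITES_WINDOW:
        return leader
    # Monotonic reads: everyone else is pinned to one follower by a stable
    # hash, so their view never jumps backwards between replicas.
    return followers[zlib.crc32(user_id.encode()) % len(followers)]
```

`zlib.crc32` is used instead of Python's built-in `hash()` because the latter is randomized per process for strings.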

Synchronous vs Asynchronous replication: Synchronous: Leader waits for followers to confirm before saying write is done. Safer but slower. Asynchronous: Leader says done immediately, followers catch up later. Faster but might lose data if leader crashes. Most systems use async with one sync follower as a compromise.

Multi-Leader Replication (For Multiple Data Centers)

What if you have data centers in USA and Europe? With single leader, all writes go to one place. European users have slow writes.

Solution: Leaders in both data centers. Each accepts writes and syncs with the other.

BUT this creates a new problem: write conflicts.

Imagine:

  • USA leader: Set username to "alice"
  • Europe leader: Set username to "alice_eu"
  • Both happened at the same time. Which one wins?

Conflict resolution strategies:

  1. Last write wins: Use timestamps, newest change wins (but clocks can be wrong!)
  2. Merge values: Keep both values somehow
  3. Custom logic: Let the application decide how to merge
  4. Avoid conflicts: Route all writes for one user to the same leader

Warning: Avoid multi-leader if possible. Multi-leader replication is complex and error-prone. Write conflicts are painful to debug and fix. Only use it if you truly need writes in multiple geographic regions AND the latency to a single leader is unacceptable. Most apps do fine with single leader + read replicas.

Leaderless Replication (Dynamo-Style)

No single leader. Writes go to multiple nodes. Reads check multiple nodes and use the newest value.

How it works:

  1. To write: Send to 3 nodes. If 2 confirm, success.
  2. To read: Ask 3 nodes. If 2 agree, that is the answer.
  3. If nodes disagree, pick the newer value.

This is called quorum: As long as majority agrees, you are good.

Used by: Amazon DynamoDB, Cassandra, Riak

When to use: When you need very high availability and can tolerate some inconsistency. Good for things like shopping carts where being always available matters more than perfect consistency.
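A minimal sketch of the quorum-read rule described above, assuming each replica replies with a `(version, value)` pair (or nothing at all). Real Dynamo-style systems use vector clocks or last-write-wins timestamps rather than a single integer version, so treat this as the shape of the idea, not an implementation:

```python
def quorum_read(replies, r=2):
    """replies: one entry per node asked - a (version, value) tuple,
    or None if the node did not answer in time."""
    answered = [reply for reply in replies if reply is not None]
    if len(answered) < r:
        raise RuntimeError("quorum not reached")
    # Nodes can disagree while replication is in flight;
    # the highest version wins.
    version, value = max(answered)
    return value
```

With 3 replicas, write quorum W=2 and read quorum R=2 satisfy W + R > N, which is what guarantees a read overlaps with the latest successful write.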

Replication Comparison

Replication Type | Write Speed      | Consistency                                        | Availability           | Best For
Single Leader    | Fast (one place) | Strong on the leader (follower reads may be stale) | Good (leader can fail) | Most applications
Multi-Leader     | Fast everywhere  | Weak (conflicts possible)                          | Excellent              | Multi-region apps
Leaderless       | Fast everywhere  | Eventually consistent                              | Excellent              | High-availability needs

Database Sharding: When and How to Partition Data

Sharding (also called partitioning) is how you scale writes and handle massive datasets. But it comes with significant complexity; don't do it until you absolutely need to.

When to Consider Database Sharding

Your database has 1 billion records. One machine cannot:

  • Store all that data (too much disk space needed)
  • Handle all the queries (too slow)
  • Keep up with all the writes (CPU maxed out)

Solution: Split the data across multiple machines

This is called partitioning (or sharding). Each machine holds a piece of the data.

Think of it like a library with multiple buildings:

  • Building A: Books by authors A-H
  • Building B: Books by authors I-P
  • Building C: Books by authors Q-Z

To find a book, first figure out which building, then search there.

Warning: Do not shard too early! Sharding adds HUGE complexity. Cross-shard queries become hard. Joins become impossible or slow. Operations become much harder. One PostgreSQL server handles millions of rows easily. Only shard when you have MEASURED that one machine cannot handle your load. Most companies never need to shard.

How to Decide Which Machine Gets Which Data?

Method 1: Range-Based Partitioning

Split data by ranges of a key.

Example: Users table partitioned by user ID:

  • Shard 1: Users 1-1,000,000
  • Shard 2: Users 1,000,001-2,000,000
  • Shard 3: Users 2,000,001-3,000,000

Good: Range queries are efficient (get all users 500-600)
Bad: Can get unbalanced (if most new users are in shard 3, it gets overloaded)

Method 2: Hash-Based Partitioning

Hash the key and use that to pick the shard.

Example: hash(user_id) % 3 = shard number

  • User 123 → hash = 456 → 456 % 3 = 0 → Shard 0
  • User 789 → hash = 234 → 234 % 3 = 0 → Shard 0
  • User 555 → hash = 111 → 111 % 3 = 2 → Shard 2

Good: Data spreads evenly across shards
Bad: Range queries need to check ALL shards (slow)
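The `hash(user_id) % 3` rule from above takes only a few lines. One caveat worth baking into the code: Python's built-in `hash()` is randomized per process for strings, so a stable checksum like `zlib.crc32` is used instead (the shard count of 3 matches the example):

```python
import zlib

N_SHARDS = 3

def shard_for(user_id):
    # Stable hash of the key, then modulo the shard count.
    # (Built-in hash() would give different answers on each restart.)
    return zlib.crc32(str(user_id).encode("utf-8")) % N_SHARDS
```

Note that with plain modulo, changing `N_SHARDS` remaps almost every key, which is why resharding is painful and why some systems reach for schemes like consistent hashing.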

The Hot Spot Problem

Even with good partitioning, some keys get way more traffic than others.

Example: A celebrity with 100 million followers posts something. Everyone reads that one post. The shard holding that post gets crushed.

Solutions to hot spots:

  1. Add randomness: Instead of storing under post_123, store under post_123_0, post_123_1, etc. Spread the load.
  2. Caching: Put hot data in a cache (Redis) in front of the database.
  3. Application awareness: Detect hot items and handle them specially.

There is no perfect automatic solution. You need monitoring to detect hot spots and engineering to fix them.
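Solution 1 (add randomness) looks like this in a sketch: keep several suffixed copies of the hot key and let each reader pick one at random, so no single partition absorbs all the traffic. The copy count of 4 and the helper names are invented for illustration, and a plain dict stands in for the sharded store:

```python
import random

N_COPIES = 4  # number of suffixed copies kept for a known-hot key (illustrative)

def write_hot(store, key, value):
    # Write the value under every suffix so any copy can serve a read
    for i in range(N_COPIES):
        store[f"{key}_{i}"] = value

def read_hot(store, key):
    # Each reader picks a random copy, spreading load across partitions
    return store[f"{key}_{random.randrange(N_COPIES)}"]

store = {}
write_hot(store, "post_123", "celebrity post body")
```

The trade-off is visible in the code: reads get N_COPIES times cheaper per partition, but every write now costs N_COPIES writes.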

What About Cross-Shard Queries?

If you shard by user_id, queries like "show me all orders for user 123" are fast (just check one shard).

But what about "show me all orders placed today"? Orders are spread across ALL shards. You need to:

  1. Send the query to every shard
  2. Wait for all to respond
  3. Merge the results

This is called scatter-gather and it is slow.

How to handle this:

  1. Design your sharding key carefully: Shard by the thing you query by most
  2. Denormalize: Store data redundantly so common queries hit one shard
  3. Secondary indexes: Build indexes that span shards (complex!)
  4. Accept the cost: Sometimes scatter-gather is okay for rare queries

Partitioning strategy: Think about your most common queries FIRST. If you mostly query by user_id, shard by user_id. If you mostly query by time, shard by time. The wrong sharding key makes your database slower, not faster.

Partitioning Methods Comparison

Partitioning Method | Good For                               | Bad For                          | Example Use
Range (by ID)       | Sequential access, range queries       | Hot spots at one end             | Time-series data, logs
Range (by time)     | Time-based queries, archiving old data | Recent data gets all traffic     | Events, metrics
Hash                | Even distribution, random access       | Range queries, sorting           | User data, sessions
Geographic          | Location-based queries, data locality  | Users who move, global queries   | Local businesses, regional data
Composite           | Complex access patterns                | Complexity, more planning needed | Multi-tenant SaaS

CAP Theorem, ACID Transactions, and Consistency Models

Understanding consistency is essential for system design interviews. This section covers the CAP theorem, ACID transactions, and isolation levels.

Strong Consistency vs Eventual Consistency

Consistency means everyone sees the same data. Sounds simple, but in distributed systems, it is really hard.

Imagine you post a photo:

  • You: "I just posted! I can see it!"
  • Your friend: "I do not see any new post."
  • Both are looking at the same app, but seeing different data.

This happens because your post went to one server, and your friend is reading from another server that has not received the update yet.

Levels of consistency (from strongest to weakest):

  1. Strong consistency: Everyone sees the same data. After a write completes, all readers see it.
  2. Eventual consistency: Eventually everyone sees the same data. There might be a delay.
  3. No guarantee: Different readers might see different data forever (broken system!).

The CAP Theorem Explained: You cannot have all three: Consistency, Availability, Partition tolerance. Since network failures happen (partitions are inevitable), you must choose: either stay consistent (some requests fail during failures) or stay available (requests succeed but might return stale data). Most systems choose availability and deal with eventual consistency. This is why understanding CAP theorem is critical for system design interviews.

ACID Transactions: What They Mean and Why They Matter

A transaction is a group of operations that either ALL succeed or ALL fail. No partial results.

Example: Transferring money

Transaction:
  1. Subtract $100 from Alice's account
  2. Add $100 to Bob's account

If step 1 succeeds but step 2 fails, Alice loses $100 and Bob gets nothing. That is broken!

With a transaction, if anything fails, the whole thing is rolled back. Alice keeps her money.

ACID Properties (what transactions guarantee):

  • Atomicity: All or nothing. Either the whole transaction succeeds or nothing changes.
  • Consistency: The database stays valid. Rules (like "balance cannot be negative") are always followed.
  • Isolation: Concurrent transactions do not interfere with each other.
  • Durability: Once a transaction commits, it is saved forever (even if power goes out).

WITHOUT TRANSACTION (dangerous!):

    subtract_from_alice(100)   // Works
    add_to_bob(100)            // FAILS! Database error!

    Result: Alice lost $100, Bob got nothing.
    Money disappeared!


WITH TRANSACTION (safe):

    BEGIN TRANSACTION
        subtract_from_alice(100)   // Tentative
        add_to_bob(100)            // FAILS!
    ROLLBACK (because something failed)

    Result: Both operations are undone.
    Alice still has her $100. Nothing changed.


HAPPY PATH:

    BEGIN TRANSACTION
        subtract_from_alice(100)   // Works
        add_to_bob(100)            // Works
    COMMIT

    Result: Alice has $100 less, Bob has $100 more.
    Money moved safely.
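The same transfer can be run for real with SQLite from Python's standard library, which is enough to watch a transaction roll back. This is a minimal sketch (the table, the CHECK rule, and the helper names are invented for the example); `with conn:` opens a transaction, commits on success, and rolls back automatically if anything raises:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # rule: no negative balances
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except sqlite3.IntegrityError:
        # CHECK failed: the whole transfer was rolled back, nothing changed
        return False

def balance(name):
    return conn.execute("SELECT balance FROM accounts WHERE name = ?",
                        (name,)).fetchone()[0]
```

Trying to move $150 out of Alice's $100 trips the CHECK constraint, the transaction rolls back, and both balances are untouched; moving $100 commits cleanly.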

Isolation Levels: How Much Do Transactions See Each Other?

When two transactions run at the same time, what happens?

Read Uncommitted (weakest, rarely used)

  • Transactions can see changes that have not been committed yet
  • Problem: You might see data that gets rolled back (dirty reads)

Read Committed (the default in PostgreSQL and many other databases)

  • You only see committed changes
  • Problem: If you read twice, you might see different values (non-repeatable reads)

Repeatable Read (MySQL InnoDB default)

  • Once you read a value, it stays the same for your whole transaction
  • Problem: New rows might appear (phantom reads)

Serializable (strongest)

  • Transactions behave as if they ran one after another
  • Safest but slowest

Most apps use the default and it is fine. Do not overthink isolation levels when starting. The database default (usually Read Committed or Repeatable Read) works for most applications. Only change it if you have a specific problem. Higher isolation = slower performance.

Distributed Transactions: Across Multiple Databases

Transactions within ONE database are well-solved. But what if your transaction spans multiple systems?

Example: Order placement

  1. Charge credit card (Payment Service)
  2. Reserve inventory (Inventory Service)
  3. Create order record (Order Service)

All three must succeed or all must fail.

Solution 1: Two-Phase Commit (2PC)

Phase 1 (Prepare): Ask everyone "can you commit?"
Phase 2 (Commit): If everyone said yes, tell everyone to commit.

Problem: If the coordinator crashes between phases, everything is stuck.

Solution 2: Saga Pattern (more common)

Each step has a compensating action (undo). If something fails, run the undos.

  1. Charge card → If later steps fail, refund card
  2. Reserve inventory → If later steps fail, release inventory
  3. Create order → Done!

Sagas do not give true atomicity, but they are more practical for microservices.
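The saga idea fits in one small function: run each step, remember its undo, and on failure run the undos in reverse order. This is only a sketch of the control flow (real sagas need durable state, retries, and idempotent compensations); the order-placement steps and the `log` list are simulated:

```python
def run_saga(steps):
    """steps: list of (action, compensate) pairs, executed in order.
    If an action fails, run the compensations for the completed
    steps in reverse order, then report failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
        return True
    except Exception:
        for compensate in reversed(completed):
            compensate()  # undo in reverse order
        return False

# Simulated order placement where inventory reservation fails
log = []

def reserve_inventory():
    raise RuntimeError("out of stock")

result = run_saga([
    (lambda: log.append("charge_card"),  lambda: log.append("refund_card")),
    (reserve_inventory,                  lambda: log.append("release_inventory")),
    (lambda: log.append("create_order"), lambda: log.append("cancel_order")),
])
```

Here the card is charged, the inventory step blows up, and the only compensation that runs is the refund - exactly the "charge card, then refund if later steps fail" flow described above.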

Distributed Transaction Approaches

Approach               | Consistency | Availability     | Complexity | Use When
Single DB Transaction  | Strong      | Depends on DB    | Low        | Everything in one database
Two-Phase Commit       | Strong      | Lower (blocking) | High       | Need true atomicity across DBs
Saga with Compensation | Eventual    | High             | Medium     | Microservices, can tolerate retries
Eventual Consistency   | Weak        | Highest          | Low        | Can show stale data briefly

Database Selection Guide: How to Choose the Right Database

This is the most important section. Use this decision framework for system design interviews and real architecture decisions.

The Most Important Advice: Start Simple

Most applications work great with:

  • PostgreSQL (or MySQL) for the main database
  • Redis for caching and sessions
  • That's it.

Seriously. Instagram ran on PostgreSQL for years with billions of photos. Stack Overflow handles millions of requests with just two SQL Server instances.

Add complexity only when you MEASURE a real problem. Premature optimization is the root of all evil.

The Decision Process:

  1. Start with PostgreSQL
  2. Add Redis when you need caching
  3. Add read replicas when reads are slow
  4. Add a search engine when full-text search is needed
  5. Consider sharding when one machine truly cannot handle the load

At each step, MEASURE first, then decide.

Database Selection Framework

START HERE: What are you building?

├── Normal web application (users, orders, content)
│   └── PostgreSQL or MySQL
│       ├── Need caching? Add Redis
│       ├── Need search? Add Elasticsearch
│       └── Need analytics? Add read replica or data warehouse

├── High-speed caching or sessions
│   └── Redis or Memcached

├── Real-time analytics on events
│   └── ClickHouse, TimescaleDB, or Apache Druid

├── Document storage with flexible schema
│   └── MongoDB or PostgreSQL JSONB
│       (PostgreSQL JSONB is often good enough!)

├── Graph relationships (social network, recommendations)
│   └── Neo4j or Amazon Neptune
│       (Or PostgreSQL with recursive CTEs for simpler cases)

├── Time-series data (metrics, IoT, logs)
│   └── TimescaleDB, InfluxDB, or Prometheus

├── Full-text search
│   └── Elasticsearch or PostgreSQL full-text search
│       (PostgreSQL is often good enough for basic search)

├── Message queue / event streaming
│   └── Kafka, RabbitMQ, or Redis Streams

└── Blob storage (files, images, videos)
    └── S3, GCS, or Azure Blob Storage
        (NOT a database - databases are bad at storing files)

Signs You Might Need to Scale

Symptom             | Possible Cause   | Solution to Try First | Nuclear Option
Slow queries        | Missing indexes  | Add indexes           | Read replicas
Database CPU high   | Too many queries | Add caching           | Read replicas
Slow writes         | Write contention | Optimize queries      | Write sharding
Running out of disk | Too much data    | Archive old data      | Sharding
Slow joins          | Too many tables  | Denormalize           | Different database

Warning - Common mistakes to avoid:

  1. Using MongoDB because "it scales" when PostgreSQL would work fine
  2. Sharding when adding an index would fix the problem
  3. Adding a new database for each feature instead of using one well
  4. Choosing a database because a famous company uses it
  5. Not measuring before optimizing

How Big Tech Companies Actually Build Their Data Architecture

These real-world examples prove that boring technology scales better than you think.

Instagram (billions of photos, millions of users)

  • Main database: PostgreSQL (heavily sharded by user ID)
  • Caching: Memcached and Redis
  • Photo storage: Custom blob storage → later moved to Facebook infrastructure
  • Search: Elasticsearch
  • Key lesson: Started with Django + PostgreSQL and scaled it

Uber (millions of trips per day)

  • Main database: Started with PostgreSQL, moved to MySQL (the Schemaless storage layer is built on top of MySQL)
  • Real-time: Redis
  • Analytics: Apache Hive, Presto
  • Time-series: M3 (custom)
  • Key lesson: Boring relational databases, operated well, handle more than you think

Stack Overflow (millions of questions, billions of page views)

  • Main database: SQL Server (just TWO servers!)
  • Caching: Redis
  • Search: Elasticsearch
  • Key lesson: You can go incredibly far with traditional databases and good engineering

Netflix (200 million subscribers, billions of hours streamed)

  • Main database: Started with Oracle, moved to Cassandra for scale
  • Caching: EVCache (based on Memcached)
  • Search: Elasticsearch
  • Analytics: Apache Spark, Druid
  • Key lesson: Only moved to NoSQL at massive scale

Caching Strategies: Redis, Memcached, and Cache Patterns

Caching is the fastest way to improve performance. Most read-heavy applications should implement caching before considering database changes.

Why Caching Makes Everything 10x Faster

Databases are slow. Reading from disk takes milliseconds. Reading from memory takes microseconds - 1000x faster.

Caching stores frequently accessed data in memory so you do not have to ask the database every time.

Example:

  • Without cache: 100ms to fetch user profile (database query)
  • With cache: 1ms to fetch user profile (memory lookup)
  • If 90% of requests hit cache: Average = 0.9 * 1ms + 0.1 * 100ms = 10.9ms

That is 10x faster with caching!

Cache-Aside Pattern (Most Common)

The application manages the cache. To read:

  1. Check cache. If found, return it (cache hit).
  2. If not found (cache miss), read from database.
  3. Store in cache for next time.
  4. Return the data.

To write:

  1. Write to database.
  2. Invalidate (delete) the cache entry.
  3. Next read will repopulate the cache.

Here is the same pattern in Python with redis-py (db stands in for whatever database client you use; the SQL is illustrative):

import json
import redis

r = redis.Redis()

def get_user(user_id, db):
    # Step 1: Check cache first
    cache_key = f"user:{user_id}"
    cached_user = r.get(cache_key)

    if cached_user is not None:
        # Cache hit! Return immediately (fast)
        return json.loads(cached_user)

    # Step 2: Cache miss - get from database (slow)
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # Step 3: Store in cache for next time.
    # Expire after 1 hour so stale data eventually goes away.
    r.set(cache_key, json.dumps(user), ex=3600)

    return user

def update_user(user_id, new_data, db):
    # Step 1: Update database
    db.query("UPDATE users SET ... WHERE id = %s", user_id, new_data)

    # Step 2: Invalidate cache (delete the old cached version)
    r.delete(f"user:{user_id}")

    # Next read will fetch fresh data and re-cache it

Warning: Cache invalidation is hard. "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

The problem: When data changes, you must remember to invalidate the cache. If you forget, users see stale data. If you invalidate too aggressively, you lose the caching benefit.

Other Caching Patterns

Write-Through: Write to cache AND database together. Cache is always fresh. But writes are slower (two writes).

Write-Behind (Write-Back): Write to cache only. Cache writes to database later in batches. Very fast writes but risk of data loss if cache crashes before writing.

Read-Through: Cache talks to database automatically on miss. Application only talks to cache. Simpler application code.

For most applications, Cache-Aside is the right choice. It is simple, well-understood, and gives you full control.
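To make the write-side difference concrete, here is a minimal sketch comparing a write-through write with a cache-aside write. Plain dicts stand in for the cache and the database so the sketch runs anywhere; the keys and values are made up for illustration:

```python
# Toy stand-ins: one dict for the cache, one for the database,
# so the two write patterns can be compared without a real Redis/DB.
cache = {}
database = {}

def write_through(key, value):
    # Write to database AND cache together: the cache is always fresh,
    # at the cost of doing two writes on every update.
    database[key] = value
    cache[key] = value

def cache_aside_write(key, value):
    # Write to the database, then invalidate: the next read repopulates
    # the cache with fresh data.
    database[key] = value
    cache.pop(key, None)

write_through("user:1", {"name": "Ada"})
assert cache["user:1"] == database["user:1"]   # cache is fresh

cache_aside_write("user:1", {"name": "Ada L."})
assert "user:1" not in cache                   # entry invalidated
```

The trade-off is visible in the code: write-through pays for a second write on every update, while cache-aside pays for a cache miss on the first read after each update.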

What to Cache?

Cache things that are:

  • Read frequently
  • Expensive to compute or fetch
  • Changed infrequently
  • Okay to be slightly stale

Good candidates:

  • User profiles (read on every page, change rarely)
  • Product listings (read constantly, update occasionally)
  • Session data (read every request)
  • Computed results (leaderboards, statistics)

Bad candidates:

  • Data that changes constantly (real-time stock prices)
  • Data that must be 100% fresh (bank balances)
  • Rarely accessed data (wastes memory)
  • Data that is already fast to fetch (indexed lookups)

Caching Patterns Comparison

| Pattern       | How It Works                             | Best For                   | Drawback                    |
|---------------|------------------------------------------|----------------------------|-----------------------------|
| Cache-Aside   | App manages cache, reads from DB on miss | Most applications          | App must handle cache logic |
| Write-Through | Write to cache + DB together             | Data that must not be lost | Slower writes               |
| Write-Behind  | Write to cache, DB updated later         | Write-heavy workloads      | Data loss if cache fails    |
| Read-Through  | Cache fetches from DB automatically      | Simpler application code   | Less control                |

Cache expiration strategy: Always set a TTL (time to live) on cache entries. Even if you invalidate on updates, TTL is a safety net. If something goes wrong with invalidation, stale data will eventually expire. Start with 1 hour TTL and adjust based on your needs.


Message Queues vs Event Streams: Kafka, RabbitMQ, and Redis

Message queues and event streams enable asynchronous processing and service decoupling. Understanding when to use each is important for system design interviews.

When to Use Message Queues

Sometimes you do not want to do something immediately. You want to do it later, or have someone else do it.

Example: User signs up

  • Must do immediately: Create account, log them in
  • Can do later: Send welcome email, notify analytics, sync to CRM

Without a queue: User waits while all those things happen (slow). With a queue: Create account, put "send email" task in queue, respond to user. Background worker sends email later.

Benefits of queues:

  1. Decoupling: Services do not need to know about each other
  2. Resilience: If email service is down, messages wait in queue
  3. Scaling: Add more workers to process queue faster
  4. Smoothing: Handle bursts by queueing and processing steadily

Synchronous - User Waits:
User → Create Account → Send Email → Notify Analytics → Response

Asynchronous - Fast Response:
User → Create Account → Response (Done!)
              ↓
           Queue
         ↙      ↘
   Send Email   Notify Analytics

Message Queue vs Event Stream

Message Queue (RabbitMQ, SQS, etc.)

  • Messages are tasks to be done
  • Once a message is processed, it is deleted
  • Each message goes to ONE consumer
  • Use case: Background jobs, task distribution

Event Stream (Kafka, etc.)

  • Events are facts about what happened
  • Events are kept for a while (days/weeks/forever)
  • Multiple consumers can read the same event
  • Use case: Event sourcing, real-time analytics, data pipelines

Message Queue vs Event Stream Comparison

| Feature          | Message Queue                             | Event Stream                               |
|------------------|-------------------------------------------|--------------------------------------------|
| Message lifetime | Deleted after processing                  | Kept for a configurable time               |
| Consumer model   | One consumer per message                  | Multiple consumers can read the same event |
| Ordering         | Best effort (some queues guarantee order) | Strong ordering per partition              |
| Replay           | Cannot replay processed messages          | Can replay from any point                  |
| Use case         | Task processing, work distribution        | Event sourcing, data pipelines, analytics  |

When to Use What?

Use a Message Queue (RabbitMQ, Amazon SQS, Redis) when:

  • You have tasks to distribute to workers
  • Messages should be processed once (in practice, at least once) and then forgotten
  • You need simple pub/sub for notifications
  • Example: Send email, process image, generate report

Use an Event Stream (Kafka, Amazon Kinesis) when:

  • You need to keep a log of everything that happened
  • Multiple systems need to react to the same event
  • You need to replay events (rebuild state, debug issues)
  • You are doing real-time analytics
  • Example: User activity stream, order events, IoT sensor data

Start with Redis for simple queues. For many applications, Redis lists or streams work great as a simple queue. You probably already have Redis for caching. Only add Kafka or RabbitMQ when you outgrow Redis or need their specific features.
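As a sketch of that Redis-list approach, here is the producer/worker shape. A collections.deque stands in for the Redis list so the sketch runs anywhere (with real Redis you would LPUSH on enqueue and BRPOP in the worker loop); the queue name and task fields are made up for illustration:

```python
import json
from collections import deque

# A deque stands in for a Redis list; appendleft ~ LPUSH, pop ~ BRPOP.
email_queue = deque()

def sign_up(user_email):
    # Must happen now: create the account (elided), then respond fast.
    # Can happen later: enqueue the welcome email for a background worker.
    email_queue.appendleft(json.dumps({"task": "welcome_email", "to": user_email}))
    return "account created"

def run_worker():
    # Background worker: drain the queue, one task at a time, in FIFO order.
    handled = []
    while email_queue:
        task = json.loads(email_queue.pop())
        handled.append(task["to"])   # here you would actually send the email
    return handled

sign_up("ada@example.com")
sign_up("alan@example.com")
assert run_worker() == ["ada@example.com", "alan@example.com"]
```

The user-facing sign_up call returns immediately; the slow work happens in the worker loop, which you can scale by running more workers against the same queue.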

Common Patterns with Queues

Work Queue: Multiple workers process tasks from one queue. Good for parallel processing.

Pub/Sub: One message goes to all subscribers. Good for notifications and broadcasting.

Request-Reply: Send a message and wait for response in a reply queue. Good for async RPC.

Dead Letter Queue: Messages that fail repeatedly go to a special queue for investigation.

Delay Queue: Messages are not processed until a specified time. Good for scheduled tasks.
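The dead letter queue pattern above can be sketched in a few lines. This is a toy model, again with deques standing in for real queues and a deliberately failing "poison" message to trigger the retry path; MAX_ATTEMPTS and the message shape are assumptions for illustration:

```python
from collections import deque

MAX_ATTEMPTS = 3
main_queue = deque()
dead_letter_queue = deque()

def process(message):
    # Stand-in handler: messages marked "poison" always fail, simulating a bug.
    if message["body"] == "poison":
        raise ValueError("cannot process")

def run_once():
    # One pass over the queue: retry failures, park repeat offenders
    # in the dead letter queue for later investigation.
    for _ in range(len(main_queue)):
        msg = main_queue.popleft()
        try:
            process(msg)
        except Exception:
            msg["attempts"] += 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letter_queue.append(msg)   # give up, keep for debugging
            else:
                main_queue.append(msg)          # retry later

main_queue.extend([{"body": "ok", "attempts": 0},
                   {"body": "poison", "attempts": 0}])
for _ in range(MAX_ATTEMPTS):
    run_once()

assert len(dead_letter_queue) == 1                      # poison message parked
assert dead_letter_queue[0]["attempts"] == MAX_ATTEMPTS
```

Managed queues like SQS offer this behavior natively (a redrive policy with a max receive count); the point of the sketch is that the failing message ends up somewhere you can inspect instead of looping forever.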


System Architecture Examples: From Startup to Scale

These reference architectures show how to evolve your system as you grow. Use these as templates for system design interviews.

Architecture 1: Startup / MVP (0 to 1M users)

For most startups and small-to-medium applications. This handles more than you think.

  • Main Database: PostgreSQL (handles almost everything)
  • Cache: Redis (sessions, hot data)
  • Background Jobs: Redis queue or PostgreSQL-based (Sidekiq, Celery)
  • Search: PostgreSQL full-text search (upgrade to Elasticsearch later if needed)

This handles millions of users. Seriously. Do not over-engineer.

                    Load Balancer
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
        ┌──────────┐         ┌──────────┐
        │App Server│         │App Server│
        └────┬─────┘         └────┬─────┘
             │                    │
        ┌────┴────────────────────┴────┐
        ▼                              ▼
   ┌──────────┐                 ┌──────────┐
   │PostgreSQL│                 │  Redis   │
   └──────────┘                 └────┬─────┘
                                     │
                          ┌──────────┴──────────┐
                          ▼                     ▼
                    ┌──────────┐         ┌──────────┐
                    │ Worker 1 │         │ Worker 2 │
                    └──────────┘         └──────────┘

Architecture 2: Growth Stage (1M to 10M users)

When reads are the bottleneck but writes are still manageable. Add read replicas and distributed caching.

  • Main Database: PostgreSQL primary + read replicas
  • Cache: Redis cluster
  • Search: Elasticsearch (for complex search requirements)
  • Background Jobs: Dedicated job queue (Sidekiq, Celery)
  • CDN: CloudFront/Cloudflare for static assets

    CDN ─── Load Balancer
                │
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌───────┐ ┌───────┐ ┌───────┐
│App 1  │ │App 2  │ │App 3  │
└───┬───┘ └───┬───┘ └───┬───┘
    │         │         │
    └─────────┼─────────┘
              │
    ┌─────────┼─────────────────┐
    ▼         ▼                 ▼
┌────────┐ ┌─────────────┐ ┌──────────────┐
│ Redis  │ │Elasticsearch│ │  PostgreSQL  │
│Cluster │ │             │ │  (Primary)   │
└────────┘ └─────────────┘ └──────┬───────┘
                                  │
                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
               ┌────────┐   ┌────────┐   ┌────────┐
               │Replica │   │Replica │   │Replica │
               └────────┘   └────────┘   └────────┘
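The application-side change in Architecture 2 is routing reads to replicas while keeping writes on the primary. A minimal sketch of that routing decision, with strings standing in for real connection handles (names and the SELECT-prefix heuristic are illustrative, not a production router):

```python
import random

# Faked connection handles; in production these would be real connections
# to the PostgreSQL primary and its read replicas.
PRIMARY = "primary"
REPLICAS = ["replica-1", "replica-2", "replica-3"]

def route(sql):
    # Writes (and anything inside a transaction) must hit the primary.
    # Plain SELECTs can be spread across replicas. Remember that replicas
    # lag slightly, so read-your-own-writes flows may need the primary too.
    if sql.lstrip().upper().startswith("SELECT"):
        return random.choice(REPLICAS)
    return PRIMARY

assert route("SELECT * FROM users") in REPLICAS
assert route("UPDATE users SET name = 'Ada'") == PRIMARY
assert route("INSERT INTO users VALUES (1)") == PRIMARY
```

Many teams get this for free from their framework or a proxy (e.g. PgBouncer plus a router) rather than hand-rolling it, but the replication-lag caveat in the comment applies either way.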

Architecture 3: Hyperscale (10M+ users, Multiple Teams)

When you truly need to handle massive scale. Only move here when you have multiple teams AND operational maturity.

  • Application Layer: Microservices with API Gateway
  • Data Layer: Service-specific databases (polyglot persistence)
  • Event Bus: Kafka for inter-service communication
  • Caching: Distributed Redis cluster
  • Search: Elasticsearch cluster
  • Analytics: Separate data warehouse (Redshift, BigQuery)

              CDN ─── API Gateway
                         │
    ┌────────────────────┼────────────────────┐
    ▼                    ▼                    ▼
┌──────────┐      ┌──────────┐         ┌──────────┐
│User Svc  │      │Order Svc │         │Product   │
│          │      │          │         │Svc       │
└────┬─────┘      └────┬─────┘         └────┬─────┘
     │                 │                    │
     ▼                 ▼                    ▼
┌──────────┐      ┌──────────┐         ┌──────────┐
│PostgreSQL│      │PostgreSQL│         │MongoDB   │
└──────────┘      └──────────┘         └──────────┘
     │                 │                    │
     └─────────────────┼────────────────────┘
                       ▼
                 ┌──────────┐
                 │  Kafka   │
                 └────┬─────┘
              ┌───────┼───────┐
              ▼       ▼       ▼
         ┌───────┐ ┌───────┐ ┌─────────┐
         │Notify │ │Search │ │Data     │
         │Svc    │ │Svc    │ │Warehouse│
         └───────┘ └───────┘ └─────────┘

Warning: Do not start with Architecture 3! Many teams jump to microservices too early. Start with Architecture 1. Move to Architecture 2 when you have real scaling problems. Only move to Architecture 3 when you have multiple teams that need to work independently AND you have the operational maturity to handle it.


Key Takeaways: Data-Intensive Systems Checklist

Use this checklist when designing systems or preparing for system design interviews:

  1. Start simple: PostgreSQL + Redis handles most applications (proven by Instagram, Stack Overflow)

  2. Measure first: Do not optimize for imaginary problems. Profile before you change.

  3. Add complexity only when needed: Each new database is operational burden

  4. Understand the tradeoffs: There is no perfect database, only tradeoffs (CAP theorem)

  5. Design for your access patterns: How you query data determines your architecture

  6. Plan for growth but don't over-engineer: You can always scale later

  7. Pick boring technology: PostgreSQL, MySQL, Redis, well-understood tools are easier to operate

  8. Replication before partitioning: Scale reads with replicas before you shard

  9. Cache aggressively: The fastest database query is one you don't make

  10. Monitor everything: You cannot fix what you cannot measure


Tradeoffs Summary

Single Relational Database (PostgreSQL/MySQL)

Pros: Simple to operate, ACID transactions, well-understood, can handle millions of records, good tooling and documentation

Cons: Single point of failure without replicas, limited horizontal scaling for writes, schema changes can be painful

When to use: Start here. Most applications never outgrow this. Add complexity only when you have measured problems.

Relational + Read Replicas

Pros: Scales reads well, same familiar database, good for read-heavy workloads, simple failover

Cons: Replication lag (eventual consistency), writes still limited to one node, more operational complexity

When to use: When reads are your bottleneck but writes are still manageable. Most medium-sized applications.

Document Database (MongoDB)

Pros: Flexible schema, good for nested/hierarchical data, scales horizontally, developer friendly

Cons: No joins (denormalization required), weaker consistency guarantees, can lead to data duplication

When to use: When schema flexibility is critical and data is naturally document-shaped. Content management, catalogs with varying attributes.

Key-Value Store (Redis/DynamoDB)

Pros: Extremely fast, simple model, scales well, great for caching

Cons: Limited query capabilities, no relationships, memory can be expensive

When to use: Caching, sessions, rate limiting, leaderboards. NOT as a primary database for complex data.

Sharded Database

Pros: Handles massive scale, horizontal write scaling, data locality

Cons: Complex to implement and operate, cross-shard queries are hard, resharding is painful, joins become impossible

When to use: Only when a single database truly cannot handle your load. This is rarer than you think. Most companies never need sharding.

Polyglot Persistence (Multiple Databases)

Pros: Right tool for each job, independent scaling, optimized for specific use cases

Cons: Operational complexity, data consistency challenges, more things to monitor and maintain

When to use: Large applications with distinct data needs. Search engine for search, time-series DB for metrics, etc. Add databases one at a time for specific needs.

Strong Consistency

Pros: Data always correct, simpler application logic, no stale reads

Cons: Higher latency, lower availability during partitions, harder to scale globally

When to use: Financial transactions, inventory systems, anything where stale data causes real problems.

Eventual Consistency

Pros: Higher availability, better latency, easier to scale globally

Cons: Application must handle stale data, more complex reasoning, conflicts possible

When to use: Social feeds, product catalogs, analytics - where slightly stale data is acceptable.

Frequently Asked Questions

When should I use PostgreSQL vs MongoDB?

Use PostgreSQL when you have structured data with relationships, need ACID transactions, or are building most web applications. PostgreSQL handles 99% of use cases and scales further than most people think (Instagram used it for years).

Use MongoDB when you have truly flexible schemas where each document has different fields, or when you're storing self-contained documents like blog posts with embedded comments. However, PostgreSQL's JSONB often provides the same flexibility with better querying capabilities.

How do I know when to shard my database?

You should consider sharding only when you've exhausted these options first:

  1. Query optimization and proper indexing
  2. Vertical scaling (bigger server)
  3. Read replicas for read-heavy workloads
  4. Caching with Redis

If you still have problems after all of these, and your data exceeds 10TB or you need more than 50,000 write operations per second, then consider sharding. Most companies never reach this point.

What's the difference between replication and sharding?

Replication creates copies of your entire database on multiple servers. All servers have the same data. This improves read performance and availability but doesn't help with write scaling.

Sharding splits your data across multiple servers. Each server has different data. This helps with both storage capacity and write scaling but adds significant complexity.

Rule of thumb: Try replication first. Only shard when you've measured that writes are your bottleneck.

Should I use Redis or Memcached for caching?

Use Redis for most applications. It supports more data structures (lists, sets, sorted sets), has persistence options, and can serve as both a cache and a message queue. It's the industry standard.

Use Memcached only if you need simple key-value caching at massive scale with multi-threaded performance, or if you're already running it.

How does the CAP theorem affect my database choice?

The CAP theorem states you can only have two of three properties: Consistency, Availability, and Partition tolerance. Since network partitions are inevitable, you're really choosing between:

  • CP (Consistency + Partition tolerance): PostgreSQL, MySQL with synchronous replication. System may become unavailable during network issues but data is always consistent.

  • AP (Availability + Partition tolerance): Cassandra, DynamoDB, MongoDB (with certain configurations). System stays available but may return stale data during network issues.

Most applications should start with CP (PostgreSQL) and only move to AP systems when availability requirements justify the consistency tradeoffs.

Can a single PostgreSQL database really handle millions of users?

Yes. With proper indexing, query optimization, and caching (Redis), a single PostgreSQL server can handle millions of users. Stack Overflow serves billions of page views with just two SQL Server instances.

The key is:

  1. Proper indexes on frequently queried columns
  2. Redis caching for hot data
  3. Connection pooling (PgBouncer)
  4. Query optimization (avoid N+1 queries)

Only scale horizontally when you've maximized what a single server can do.
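The N+1 problem mentioned above is easiest to see by counting queries. This toy model fakes a database that tallies every query, with made-up user IDs and post data, so the difference between per-user lookups and one batched query is visible:

```python
# A fake database that counts queries, to make the N+1 problem visible.
class CountingDB:
    def __init__(self):
        self.queries = 0
        self.posts = {1: ["p1", "p2"], 2: ["p3"], 3: []}

    def posts_for_user(self, user_id):
        self.queries += 1        # one query per user: the N+1 shape
        return self.posts[user_id]

    def posts_for_users(self, user_ids):
        self.queries += 1        # one batched query (WHERE user_id IN (...))
        return {u: self.posts[u] for u in user_ids}

db = CountingDB()
users = [1, 2, 3]

# N+1: one query to load the users (not shown) plus one query per user.
for u in users:
    db.posts_for_user(u)
assert db.queries == 3

# Batched: a single query fetches every user's posts at once.
db.queries = 0
db.posts_for_users(users)
assert db.queries == 1
```

With 3 users the difference is small; with 10,000 users per page it is the difference between one round trip and ten thousand. Most ORMs have an eager-loading option that does exactly this batching for you.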
