Tags: databases, scaling, partitioning, sharding, replication, consistency, data-models, storage-engines, distributed-systems, architecture

Building Data-Intensive Systems

A practical guide to choosing databases, scaling strategies, and data architecture - based on real-world principles

60 min read

Summary

This guide teaches you how to pick the right database and scale your system. Think of data like water - you need the right container for the right job. A bucket works for a garden, but you need pipes and tanks for a city. Similarly, a simple database works for small apps, but big systems need special techniques like splitting data across machines (sharding) or keeping copies everywhere (replication). The hardest part is not the technology - it is knowing WHEN to use WHAT. This guide gives you a simple framework: start simple, measure problems, then pick the right tool for that specific problem.

Key Takeaways

Core Philosophy

Start simple. Measure real problems. Pick the right tool for each specific problem. Do not optimize for problems you do not have yet.

The Common Mistake

Engineers often pick fancy databases because they sound cool. Then they spend months fighting the database instead of building features. Pick boring technology that you understand.

The Framework

Ask three questions: (1) What shape is my data? (2) How will I read and write it? (3) What happens when it grows 10x? The answers tell you which database to pick.

The Tradeoff Triangle

You cannot have everything. The CAP theorem names three properties: Strong Consistency (every read sees the latest write), High Availability (every request gets a response), and Partition Tolerance (the system survives network splits). Because real networks do partition, the practical choice during a partition is between consistency and availability.

Performance Reality

Most performance problems are not about the database. They are about bad queries, missing indexes, or fetching too much data. Fix these first before changing databases.

Evolution Strategy

Systems evolve. Start with one database. When specific problems appear, add specialized tools for those problems. Do not build for imaginary scale.

Deep Dive

The Big Picture: What is a Data-Intensive System?

What makes a system data-intensive?

A data-intensive system is one where the main challenge is handling data - storing it, finding it, moving it, and keeping it correct. Compare this to a compute-intensive system (like video encoding) where the challenge is doing lots of calculations.

Examples of data-intensive systems:

  • Facebook: Stores billions of posts, photos, and messages
  • Google Search: Indexes trillions of web pages
  • Netflix: Stores and streams millions of videos
  • Your bank: Tracks every transaction you have ever made
  • Even a simple blog: Stores posts, comments, and user accounts

The three things every data system must do:

  1. Store data so you can find it later (databases)
  2. Remember expensive calculations so you do not redo them (caches)
  3. Send messages between different parts of the system (queues)

This guide focuses on choosing and using the right tools for each job.

The Golden Rule

Boring technology is usually the right choice. PostgreSQL, MySQL, Redis, and Kafka are boring because millions of people use them. That means bugs are found and fixed, documentation is great, and you can hire people who know them. Exciting new databases are exciting because nobody knows their problems yet.

How to think about databases

Databases are like containers for data. Different containers are good for different things:

  • A filing cabinet (relational database) is great when your data has a clear structure - like customer records with name, address, and order history.
  • A document folder (document database) is great when each item is different - like blog posts where each post has different fields.
  • A dictionary (key-value store) is great when you just need to look things up by name - like session data where you look up by session ID.
  • A family tree (graph database) is great when relationships matter most - like social networks where you care about who knows whom.

Picking the right container makes your life easy. Picking the wrong one means fighting your database every day.

Different databases for different shapes of data

Data Models: How to Organize Your Data

The Relational Model (Tables with Rows and Columns)

This is what most people think of when they hear database. Data is organized into tables. Each row is one record. Each column is one piece of information.

Example: A customer database

| id | name | email | city |
|---|---|---|---|
| 1 | Alice | alice@mail.com | New York |
| 2 | Bob | bob@mail.com | London |

When to use relational databases:

  • Your data has a clear, fixed structure
  • You need to join data together (show me all orders for customer X)
  • You need transactions (transfer money from A to B - both must succeed or both must fail)
  • You are building most web applications

Real-world examples: Banking systems, e-commerce, inventory management, most SaaS products.

Why relational databases are so popular

Relational databases have been around for 50 years. They are well understood, well documented, and handle 99% of use cases. PostgreSQL and MySQL are free, reliable, and can handle millions of records easily. When in doubt, start with PostgreSQL.

The Document Model (Flexible JSON-like Documents)

Instead of rows in a table, you store documents. Each document can have different fields. Think of it like storing JSON objects.

Example: A blog post

    {
      "id": "post_123",
      "title": "My First Post",
      "content": "Hello world...",
      "author": { "name": "Alice", "bio": "Writer and coder" },
      "tags": ["hello", "first"],
      "comments": [
        {"user": "Bob", "text": "Great post!"}
      ]
    }

When to use document databases:

  • Each record can have different fields
  • You mostly read and write whole documents at once
  • Your data has natural groupings (a blog post with its comments)
  • You need flexible schemas that change often

Real-world examples: Content management, user profiles with varying fields, product catalogs where products have different attributes.

The Key-Value Model (Simple Lookups)

The simplest model. You store a value and give it a key (a name). To get it back, you ask for it by key.

Think of it like a giant dictionary:

  • Key: `user:123:session` → Value: `{loggedIn: true, cart: [...]}`
  • Key: `cache:homepage` → Value: `<html>...</html>`

When to use key-value stores:

  • You always know the key you want
  • You do not need to search inside the values
  • You need super fast reads and writes
  • You are building caches or session storage

Real-world examples: Session storage, caching, feature flags, rate limiting, leaderboards.
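To make the lookup-by-key idea concrete, here is a tiny sketch using a plain Python dictionary as a stand-in for a key-value store such as Redis; the key names are made up for illustration:

    # Toy key-value store: an in-memory dict standing in for Redis/Memcached.
    store = {}

    # Write: always by key.
    store["user:123:session"] = {"logged_in": True, "cart": ["sku_42"]}
    store["cache:homepage"] = "<html>...</html>"

    # Read: also by key - a single O(1) lookup, no searching inside values.
    session = store.get("user:123:session")
    print(session)

    # The moment you need "find all sessions for users in New York",
    # you are scanning every value - a sign you want a different data model.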

Key-value trap

Key-value stores are fast because they are simple. If you start doing complex queries (find all users in New York), you are using the wrong database. Key-value is for lookups by key, not for searching.

The Graph Model (Relationships First)

When the connections between things matter more than the things themselves, use a graph database.

Example: Social network

  • Alice FOLLOWS Bob
  • Bob FOLLOWS Charlie
  • Charlie FOLLOWS Alice
  • Question: Who should we recommend Alice follow? (friends of friends)

In a relational database, this query needs many JOINs and gets slow fast. In a graph database, it is natural and fast.
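To make the friends-of-friends idea concrete, here is a small sketch over an in-memory adjacency map built from the FOLLOWS edges above; a real graph database expresses this as a query, but the traversal logic is the same:

    # Graph as an adjacency map: who each person FOLLOWS.
    follows = {
        "Alice": {"Bob"},
        "Bob": {"Charlie"},
        "Charlie": {"Alice"},
    }

    def recommend(person):
        """People followed by the people `person` follows (friends of friends)."""
        direct = follows.get(person, set())
        candidates = set()
        for friend in direct:
            candidates |= follows.get(friend, set())
        # Do not recommend people already followed, or the person themselves.
        return candidates - direct - {person}

    print(recommend("Alice"))  # {'Charlie'}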

When to use graph databases:

  • Relationships are the main thing you query
  • You need to traverse connections (friend of friend of friend)
  • Your queries are about paths and connections
  • Examples: Social networks, recommendation engines, fraud detection, knowledge graphs

Real-world examples: LinkedIn connections, fraud detection (who is connected to known fraudsters), recommendation systems.

| Data Model | Best For | Not Good For | Example Databases |
|---|---|---|---|
| Relational | Structured data, complex queries, transactions | Highly connected data, very flexible schemas | PostgreSQL, MySQL, SQLite |
| Document | Flexible schemas, self-contained records | Many-to-many relationships, complex joins | MongoDB, CouchDB, Firestore |
| Key-Value | Caching, sessions, simple lookups | Complex queries, searching by value | Redis, Memcached, DynamoDB |
| Graph | Highly connected data, relationship queries | Simple CRUD, tabular reports | Neo4j, Amazon Neptune, JanusGraph |
| Time-Series | Events over time, metrics, logs | Random access, complex relationships | TimescaleDB, InfluxDB, Prometheus |

Storage Engines: How Databases Actually Work

Why does this matter?

Understanding how databases store data helps you pick the right one and tune it properly. There are two main approaches, and they are good at different things.

Approach 1: B-Trees (Most Relational Databases)

Think of a B-tree like a really efficient filing system. Data is organized in a tree structure where finding any record takes roughly the same small number of steps, no matter how big the table gets.

Imagine a library:

  • To find a book, you do not search every shelf
  • You check the catalog: Fiction → Mystery → Authors A-D → Shelf 7
  • Each step narrows down where to look
  • Finding any book takes about the same number of steps

How B-trees work (simplified):

  1. Data is stored in sorted order
  2. There is an index (like a book catalog) that points to where data lives
  3. Updates modify the data in place
  4. Good for both reads AND writes

B-trees are used by: PostgreSQL, MySQL, SQLite, Oracle - most relational databases.

B-Tree: Like a library catalog system
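The property that matters is that a lookup over sorted data takes a logarithmic number of steps. Here is a toy illustration of that idea using Python's bisect module - not a real B-tree (no pages, no balancing), just the sorted-index intuition:

    import bisect

    # A sorted "index": keys kept in order, each pointing at a row location.
    keys = [3, 8, 15, 21, 42, 57, 90]          # sorted user IDs
    row_locations = ["r0", "r1", "r2", "r3", "r4", "r5", "r6"]

    def lookup(key):
        """Binary search the sorted keys - O(log n) steps, like walking a B-tree."""
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return row_locations[i]
        return None

    print(lookup(42))  # "r4"
    print(lookup(10))  # None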

Approach 2: LSM-Trees (Log-Structured Merge Trees)

Instead of updating data in place, LSM-trees just write new data to the end of a log. Periodically, they merge and compact these logs.

Imagine a notebook:

  • Instead of erasing and rewriting, you just add new entries at the end
  • To find something, you search backwards from the newest entry
  • Periodically, you rewrite the notebook, removing old crossed-out entries

How LSM-trees work (simplified):

  1. All writes go to memory first (very fast)
  2. When memory is full, flush to disk as a sorted file
  3. Periodically merge these files together
  4. Reading might need to check multiple files

LSM-trees are used by: Cassandra, RocksDB, LevelDB, HBase - databases optimized for writes.
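A heavily simplified sketch of the same idea in Python: recent writes live in an in-memory table, full memtables are flushed as sorted runs, and reads check the newest data first. Real engines add write-ahead logs, Bloom filters, and compaction, all omitted here:

    MEMTABLE_LIMIT = 3

    memtable = {}        # newest writes, held in memory
    sstables = []        # flushed, sorted, immutable runs (newest last)

    def put(key, value):
        memtable[key] = value
        if len(memtable) >= MEMTABLE_LIMIT:
            # Flush: write out a sorted run and start a fresh memtable.
            sstables.append(sorted(memtable.items()))
            memtable.clear()

    def get(key):
        # Check memory first, then flushed runs from newest to oldest.
        if key in memtable:
            return memtable[key]
        for run in reversed(sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

    put("a", 1); put("b", 2); put("c", 3)   # third write triggers a flush
    put("a", 99)                            # newer value shadows the old one
    print(get("a"), get("b"))               # 99 2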

| Feature | B-Tree | LSM-Tree |
|---|---|---|
| Write speed | Good - updates pages in place | Excellent - just append to the log |
| Read speed | Excellent - one lookup | Good - might check multiple files |
| Space usage | Good - no duplication | Needs more space - old data remains until compaction |
| Write amplification | At least twice per write - the write-ahead log plus the page itself | Varies - data is rewritten during compaction, but the writes are sequential |
| Best for | Balanced read/write workloads | Write-heavy workloads |

Which one to pick?

For most applications, B-trees (PostgreSQL, MySQL) are the right choice. They are well-understood and handle both reads and writes well. Only consider LSM-trees if you have a write-heavy workload (like logging or time-series data) AND you have measured that writes are actually your bottleneck.

Row-Oriented vs Column-Oriented Storage

Most databases store data row by row:

Row 1: [Alice, 25, New York]
Row 2: [Bob, 30, London]
Row 3: [Charlie, 35, Paris]

This is great when you read whole rows (show me everything about Alice).

But what if you want to calculate the average age of all users? You need to read every row, but you only care about the age column. That is wasteful.

Column-oriented databases store data column by column:

Names: [Alice, Bob, Charlie]
Ages: [25, 30, 35]
Cities: [New York, London, Paris]

Now calculating average age just reads the ages column. Much faster for analytics!

When to use column-oriented databases:

  • You have millions/billions of rows
  • You run analytical queries (averages, sums, counts)
  • You read many rows but few columns
  • You are building data warehouses or analytics systems

Examples: Amazon Redshift, Google BigQuery, ClickHouse, Apache Parquet files.
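A tiny illustration of why the layout matters: the same three records stored both ways, with the average-age query touching only the age column in the columnar layout:

    # Row-oriented: each record stored together.
    rows = [
        ("Alice", 25, "New York"),
        ("Bob", 30, "London"),
        ("Charlie", 35, "Paris"),
    ]

    # Column-oriented: each column stored together.
    columns = {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["New York", "London", "Paris"],
    }

    # Average age, row store: must walk every row even though only age is needed.
    avg_from_rows = sum(row[1] for row in rows) / len(rows)

    # Average age, column store: read just the one column.
    ages = columns["age"]
    avg_from_columns = sum(ages) / len(ages)

    print(avg_from_rows, avg_from_columns)  # 30.0 30.0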

Do not use column stores for OLTP

Column stores are terrible for normal application workloads (show me user 123, update their address). They are designed for analytics (what is the average order value by country). Using a column store for your main application database is a common expensive mistake.

Replication: Keeping Copies of Your Data

Why make copies?

Imagine you have one computer storing all your data. What happens when:

  • That computer crashes? Your app is down.
  • Users in another country connect? Slow response.
  • Lots of people read at once? Computer gets overloaded.

The solution: Keep copies of your data on multiple computers.

The three reasons to replicate:

  1. High Availability - If one computer dies, others keep working
  2. Lower Latency - Put copies close to users (Europe copy for European users)
  3. Read Scaling - Spread read load across many computers

But there is a catch: Keeping copies in sync is HARD. What if someone updates the data on one copy? How do other copies find out?

Leader-Follower Replication (Most Common)

One database is the leader. All writes go to the leader. The leader sends changes to followers. Reads can go to any copy.

Think of it like a classroom:

  • The teacher (leader) writes on the main whiteboard
  • Students (followers) copy what the teacher writes to their notebooks
  • If you want to change something, tell the teacher
  • If you just want to read, look at any notebook

Leader-Follower Replication

The problem with followers: Replication Lag

Followers get updates AFTER the leader. There is always a small delay (usually milliseconds, but can be seconds under load).

What can go wrong:

  1. User updates their profile (goes to leader)
  2. User immediately views their profile (goes to follower)
  3. Follower has not received the update yet
  4. User sees old data and thinks the update failed!

Solutions to replication lag:

  1. Read-your-own-writes: After a user writes, read from the leader for a few seconds
  2. Monotonic reads: Always route the same user to the same follower
  3. Consistent prefix reads: Make sure related data arrives in order

Most databases have these built in - you just need to configure them.
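As an illustration of solution 1, here is a sketch of read-your-own-writes routing done in the application; the node names and the 10-second window are made-up values for the example, and many databases and proxies can handle this routing for you:

    import time

    RECENT_WRITE_WINDOW = 10  # seconds; pick based on your typical replication lag
    last_write_at = {}        # user_id -> timestamp of their last write

    def record_write(user_id):
        last_write_at[user_id] = time.time()

    def choose_node_for_read(user_id):
        """Send recent writers to the leader; everyone else can hit a replica."""
        wrote_recently = time.time() - last_write_at.get(user_id, 0) < RECENT_WRITE_WINDOW
        return "leader" if wrote_recently else "replica"

    record_write("user_42")
    print(choose_node_for_read("user_42"))   # "leader"  (their own write is still fresh)
    print(choose_node_for_read("user_99"))   # "replica" (no recent write)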

Synchronous vs Asynchronous replication

Synchronous: The leader waits for followers to confirm before reporting the write as done. Safer but slower.

Asynchronous: The leader reports done immediately and followers catch up later. Faster, but recent writes can be lost if the leader crashes.

Most systems use asynchronous replication with one synchronous follower as a compromise.

Multi-Leader Replication (For Multiple Data Centers)

What if you have data centers in USA and Europe? With single leader, all writes go to one place. European users have slow writes.

Solution: Leaders in both data centers. Each accepts writes and syncs with the other.

BUT this creates a new problem: write conflicts.

Imagine:

  • USA leader: Set username to "alice"
  • Europe leader: Set username to "alice_eu"
  • Both happened at the same time. Which one wins?

Conflict resolution strategies:

  1. Last write wins: Use timestamps; the newest change wins (but clocks can be wrong!) - sketched below
  2. Merge values: Keep both values somehow
  3. Custom logic: Let the application decide how to merge
  4. Avoid conflicts: Route all writes for one user to the same leader
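A minimal sketch of strategy 1 (last write wins), assuming each write carries the wall-clock timestamp of the node that accepted it; note that clock skew means the "loser" is silently discarded:

    # Each write carries the wall-clock time of the node that accepted it.
    write_usa    = {"value": "alice",    "timestamp": 1700000000.120}
    write_europe = {"value": "alice_eu", "timestamp": 1700000000.450}

    def resolve_last_write_wins(a, b):
        """Keep whichever write has the later timestamp; the other is silently lost."""
        return a if a["timestamp"] >= b["timestamp"] else b

    winner = resolve_last_write_wins(write_usa, write_europe)
    print(winner["value"])  # "alice_eu" - but only because its clock said so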

Avoid multi-leader if possible

Multi-leader replication is complex and error-prone. Write conflicts are painful to debug and fix. Only use it if you truly need writes in multiple geographic regions AND the latency to a single leader is unacceptable. Most apps do fine with single leader + read replicas.

Leaderless Replication (Dynamo-Style)

No single leader. Writes go to multiple nodes. Reads check multiple nodes and use the newest value.

How it works:

  1. To write: Send to 3 nodes. If 2 confirm, success.
  2. To read: Ask 3 nodes. If 2 agree, that is the answer.
  3. If nodes disagree, pick the newer value.

This is called a quorum: as long as a majority agrees, you are good.

Used by: Amazon DynamoDB, Cassandra, Riak

When to use: When you need very high availability and can tolerate some inconsistency. Good for things like shopping carts where being always available matters more than perfect consistency.
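A sketch of the quorum rule with n = 3 replicas, a write quorum of w = 2 and a read quorum of r = 2, using version numbers instead of real networking; the point is that w + r > n forces every read quorum to overlap the latest successful write:

    N, W, R = 3, 2, 2                 # replicas, write quorum, read quorum
    nodes = [{}, {}, {}]              # each node: key -> (version, value)

    def write(key, value, version, reachable=(0, 1, 2)):
        acks = 0
        for i in reachable:
            nodes[i][key] = (version, value)
            acks += 1
        return acks >= W              # success only if a write quorum acknowledged

    def read(key, reachable=(0, 1, 2)):
        replies = [nodes[i][key] for i in reachable[:R] if key in nodes[i]]
        if len(replies) < R:
            raise RuntimeError("not enough replicas for a read quorum")
        return max(replies)[1]        # highest version number wins

    write("cart", ["book"], version=1)
    write("cart", ["book", "pen"], version=2, reachable=(0, 1))  # node 2 missed it
    print(read("cart", reachable=(1, 2)))   # ['book', 'pen'] - the quorum overlap saves us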

| Replication Type | Write Speed | Consistency | Availability | Best For |
|---|---|---|---|---|
| Single Leader | Fast (one place) | Strong on the leader (replica reads may be stale) | Good (leader can fail) | Most applications |
| Multi-Leader | Fast everywhere | Weak (conflicts possible) | Excellent | Multi-region apps |
| Leaderless | Fast everywhere | Eventually consistent | Excellent | High-availability needs |

Partitioning: Splitting Data Across Machines

When one machine is not enough

Your database has 1 billion records. One machine cannot:

  • Store all that data (too much disk space needed)
  • Handle all the queries (too slow)
  • Keep up with all the writes (CPU maxed out)

Solution: Split the data across multiple machines

This is called partitioning (or sharding). Each machine holds a piece of the data.

Think of it like a library with multiple buildings:

  • Building A: Books by authors A-H
  • Building B: Books by authors I-P
  • Building C: Books by authors Q-Z

To find a book, first figure out which building, then search there.

Do not shard too early!

Sharding adds HUGE complexity. Cross-shard queries become hard, joins become slow or impossible, and operational work (backups, migrations, failovers) gets much harder. One PostgreSQL server handles millions of rows easily. Only shard when you have MEASURED that one machine cannot handle your load. Most companies never need to shard.

How to decide which machine gets which data?

Method 1: Range-Based Partitioning

Split data by ranges of a key.

Example: Users table partitioned by user ID:

  • Shard 1: Users 1-1,000,000
  • Shard 2: Users 1,000,001-2,000,000
  • Shard 3: Users 2,000,001-3,000,000

Good: Range queries are efficient (get all users 500-600).
Bad: Can get unbalanced (if most new users land in shard 3, it gets overloaded).

Method 2: Hash-Based Partitioning

Hash the key and use that to pick the shard.

Example: hash(user_id) % 3 = shard number

  • User 123 → hash = 456 → 456 % 3 = 0 → Shard 0
  • User 789 → hash = 234 → 234 % 3 = 0 → Shard 0
  • User 555 → hash = 113 → 113 % 3 = 2 → Shard 2

Good: Data spreads evenly across shards.
Bad: Range queries need to check ALL shards (slow).
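A sketch of hash-based routing in Python; hashlib is used only to get a hash that is stable across processes (Python's built-in hash is not), and the caveat at the end is why real systems prefer consistent hashing or a fixed number of virtual partitions:

    import hashlib

    NUM_SHARDS = 3

    def shard_for(user_id):
        """Stable hash of the key, modulo the shard count."""
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    for uid in (123, 789, 555):
        print(uid, "-> shard", shard_for(uid))

    # Caveat: changing NUM_SHARDS remaps almost every key, which is why real
    # systems use consistent hashing or many fixed "virtual" partitions.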

Range vs Hash Partitioning

The hot spot problem

Even with good partitioning, some keys get way more traffic than others.

Example: A celebrity with 100 million followers posts something. Everyone reads that one post. The shard holding that post gets crushed.

Solutions to hot spots:

  1. Add randomness: Instead of storing under post_123, store under post_123_0, post_123_1, etc., and spread the load (sketched below).
  2. Caching: Put hot data in a cache (Redis) in front of the database.
  3. Application awareness: Detect hot items and handle them specially.

There is no perfect automatic solution. You need monitoring to detect hot spots and engineering to fix them.
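Here is what solution 1 (add randomness) can look like in code: writes to one hot counter are spread across a handful of sub-keys, and reads gather them back together. The suffix count of 10 is an arbitrary choice for the example:

    import random

    NUM_SUFFIXES = 10
    store = {}   # stand-in for the sharded database

    def increment_hot_counter(key):
        # Spread writes for one hot key across several sub-keys,
        # so they land on different shards instead of crushing one.
        sub_key = f"{key}_{random.randrange(NUM_SUFFIXES)}"
        store[sub_key] = store.get(sub_key, 0) + 1

    def read_hot_counter(key):
        # Reads have to gather and merge all the sub-keys.
        return sum(store.get(f"{key}_{i}", 0) for i in range(NUM_SUFFIXES))

    for _ in range(1000):
        increment_hot_counter("post_123_likes")
    print(read_hot_counter("post_123_likes"))  # 1000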

What about cross-shard queries?

If you shard by user_id, queries like "show me all orders for user 123" are fast (just check one shard).

But what about "show me all orders placed today"? Orders are spread across ALL shards. You need to:

  1. Send the query to every shard
  2. Wait for all to respond
  3. Merge the results

This is called scatter-gather and it is slow.
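A sketch of both query shapes over three in-memory shards: the by-user query goes straight to one shard, while the by-day query has to scatter to every shard and gather the results. A real system does this over the network, which is what makes it slow:

    # Three shards, each holding its own orders (sharded by user_id % 3).
    shards = [
        [{"order_id": 1, "user_id": 12, "day": "2024-05-01"}],   # 12 % 3 == 0
        [{"order_id": 2, "user_id": 7,  "day": "2024-05-02"}],   # 7  % 3 == 1
        [{"order_id": 3, "user_id": 5,  "day": "2024-05-02"}],   # 5  % 3 == 2
    ]

    def orders_for_user(user_id):
        # Fast path: the sharding key tells us exactly which shard to ask.
        shard = shards[user_id % len(shards)]
        return [o for o in shard if o["user_id"] == user_id]

    def orders_for_day(day):
        # Scatter-gather: every shard must answer before we can merge.
        results = []
        for shard in shards:                                          # 1. send to every shard
            results.extend(o for o in shard if o["day"] == day)       # 2. wait for all responses
        return sorted(results, key=lambda o: o["order_id"])           # 3. merge the results

    print(orders_for_user(7))
    print(orders_for_day("2024-05-02"))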

How to handle this:

  1. Design your sharding key carefully: Shard by the thing you query by most
  2. Denormalize: Store data redundantly so common queries hit one shard
  3. Secondary indexes: Build indexes that span shards (complex!)
  4. Accept the cost: Sometimes scatter-gather is okay for rare queries

Partitioning strategy

Think about your most common queries FIRST. If you mostly query by user_id, shard by user_id. If you mostly query by time, shard by time. The wrong sharding key makes your database slower, not faster.

| Partitioning Method | Good For | Bad For | Example Use |
|---|---|---|---|
| Range (by ID) | Sequential access, range queries | Hot spots at one end | Time-series data, logs |
| Range (by time) | Time-based queries, archiving old data | Recent data gets all traffic | Events, metrics |
| Hash | Even distribution, random access | Range queries, sorting | User data, sessions |
| Geographic | Location-based queries, data locality | Users who move, global queries | Local businesses, regional data |
| Composite | Complex access patterns | Complexity, more planning needed | Multi-tenant SaaS |

Consistency and Transactions

What is consistency?

Consistency means everyone sees the same data. Sounds simple, but in distributed systems, it is really hard.

Imagine you post a photo:

  • You: "I just posted! I can see it!"
  • Your friend: "I do not see any new post."
  • Both are looking at the same app, but seeing different data.

This happens because your post went to one server, and your friend is reading from another server that has not received the update yet.

Levels of consistency (from strongest to weakest):

  1. Strong consistency: Everyone sees the same data. After a write completes, all readers see it.
  2. Eventual consistency: Eventually everyone sees the same data. There might be a delay.
  3. No guarantee: Different readers might see different data forever (broken system!).

The CAP Theorem (simplified)

You cannot have all three: Consistency, Availability, Partition tolerance. Since network failures happen (partitions), you must choose: either stay consistent (some requests fail during failures) or stay available (requests succeed but might return stale data). Most systems choose availability and deal with eventual consistency.

Transactions: All or Nothing

A transaction is a group of operations that either ALL succeed or ALL fail. No partial results.

Example: Transferring money

Transaction:
  1. Subtract $100 from Alice's account
  2. Add $100 to Bob's account

If step 1 succeeds but step 2 fails, Alice loses $100 and Bob gets nothing. That is broken!

With a transaction, if anything fails, the whole thing is rolled back. Alice keeps her money.

ACID Properties (what transactions guarantee):

  • Atomicity: All or nothing. Either the whole transaction succeeds or nothing changes.
  • Consistency: The database stays valid. Rules (like "balance cannot be negative") are always followed.
  • Isolation: Concurrent transactions do not interfere with each other.
  • Durability: Once a transaction commits, it is saved forever (even if power goes out).
Transaction Example (simplified)
WITHOUT TRANSACTION (dangerous!):

    subtract_from_alice(100)   // Works
    add_to_bob(100)            // FAILS! Database error!
    
    Result: Alice lost $100, Bob got nothing.
    Money disappeared!


WITH TRANSACTION (safe):

    BEGIN TRANSACTION
        subtract_from_alice(100)   // Tentative
        add_to_bob(100)            // FAILS!
    ROLLBACK (because something failed)
    
    Result: Both operations are undone.
    Alice still has her $100. Nothing changed.


HAPPY PATH:

    BEGIN TRANSACTION
        subtract_from_alice(100)   // Works
        add_to_bob(100)            // Works
    COMMIT
    
    Result: Alice has $100 less, Bob has $100 more.
    Money moved safely.
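The same rollback behaviour in runnable form, using Python's built-in sqlite3 module; the table and the simulated failure are made up for the example, but the pattern is essentially the same with PostgreSQL or MySQL:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 500), ('bob', 100)")
    conn.commit()

    def transfer(amount):
        try:
            # "with conn" wraps the statements in a transaction: it commits on
            # success and rolls back automatically if anything inside raises.
            with conn:
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,)
                )
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,)
                )
                raise RuntimeError("simulated failure after the second write")
        except RuntimeError:
            pass  # the transaction was rolled back; no money moved

    transfer(100)
    print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
    # [('alice', 500), ('bob', 100)] - both rows unchanged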

Isolation Levels: How much do transactions see each other?

When two transactions run at the same time, what happens?

Read Uncommitted (weakest, rarely used)

  • Transactions can see changes that have not been committed yet
  • Problem: You might see data that gets rolled back (dirty reads)

Read Committed (common default, including PostgreSQL)

  • You only see committed changes
  • Problem: If you read twice, you might see different values (non-repeatable reads)

Repeatable Read (MySQL/InnoDB default)

  • Once you read a value, it stays the same for your whole transaction
  • Problem: New rows might appear (phantom reads)

Serializable (strongest)

  • Transactions behave as if they ran one after another
  • Safest but slowest

Most apps use the default and it is fine

Do not overthink isolation levels when starting. The database default (usually Read Committed or Repeatable Read) works for most applications. Only change it if you have a specific problem. Higher isolation = slower performance.

Distributed Transactions: Across Multiple Databases

Transactions within ONE database are well-solved. But what if your transaction spans multiple systems?

Example: Order placement

  1. Charge credit card (Payment Service)
  2. Reserve inventory (Inventory Service)
  3. Create order record (Order Service)

All three must succeed or all must fail.

Solution 1: Two-Phase Commit (2PC)

Phase 1 (Prepare): Ask everyone "can you commit?"
Phase 2 (Commit): If everyone said yes, tell everyone to commit.

Problem: If the coordinator crashes between phases, everything is stuck.

Solution 2: Saga Pattern (more common)

Each step has a compensating action (undo). If something fails, run the undos.

  1. Charge card → If later steps fail, refund card
  2. Reserve inventory → If later steps fail, release inventory
  3. Create order → Done!

Sagas do not give true atomicity, but they are more practical for microservices.
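A sketch of a saga coordinator: each step is paired with a compensating action, and when a step fails, the compensations for the steps that already succeeded run in reverse order. The service calls here are just placeholder functions:

    def charge_card(order):      print("charged card")
    def refund_card(order):      print("refunded card")
    def reserve_stock(order):    print("reserved inventory")
    def release_stock(order):    print("released inventory")
    def create_order(order):     raise RuntimeError("order service is down")

    # Each saga step: (action, compensating action)
    steps = [
        (charge_card, refund_card),
        (reserve_stock, release_stock),
        (create_order, None),          # the last step has nothing to undo
    ]

    def run_saga(order):
        completed = []
        try:
            for action, compensate in steps:
                action(order)
                completed.append(compensate)
            return True
        except Exception:
            # Undo what already succeeded, newest first.
            for compensate in reversed(completed):
                if compensate:
                    compensate(order)
            return False

    print(run_saga({"id": 123}))   # charges, reserves, fails, then undoes both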

| Approach | Consistency | Availability | Complexity | Use When |
|---|---|---|---|---|
| Single DB Transaction | Strong | Depends on DB | Low | Everything in one database |
| Two-Phase Commit | Strong | Lower (blocking) | High | Need true atomicity across DBs |
| Saga with Compensation | Eventual | High | Medium | Microservices, can tolerate retries |
| Eventual Consistency | Weak | Highest | Low | Can show stale data briefly |

When to Use What: A Decision Framework

The most important advice: Start simple

Most applications work great with:

  • PostgreSQL (or MySQL) for the main database
  • Redis for caching and sessions
  • That is it.

Seriously. Instagram ran on PostgreSQL for years with billions of photos. Stack Overflow handles millions of requests with two SQL Server instances.

Add complexity only when you MEASURE a real problem.

The Decision Process

  1. Start with PostgreSQL
  2. Add Redis when you need caching
  3. Add read replicas when reads are slow
  4. Add a search engine when full-text search is needed
  5. Consider sharding when one machine truly cannot handle the load

At each step, MEASURE first, then decide.

Decision Tree: Which Database for Your Use Case?

Database Selection Framework
START HERE: What are you building?

├── Normal web application (users, orders, content)
│   └── PostgreSQL or MySQL
│       ├── Need caching? Add Redis
│       ├── Need search? Add Elasticsearch
│       └── Need analytics? Add read replica or data warehouse
│
├── High-speed caching or sessions
│   └── Redis or Memcached
│
├── Real-time analytics on events
│   └── ClickHouse, TimescaleDB, or Apache Druid
│
├── Document storage with flexible schema
│   └── MongoDB or PostgreSQL JSONB
│       (PostgreSQL JSONB is often good enough!)
│
├── Graph relationships (social network, recommendations)
│   └── Neo4j or Amazon Neptune
│       (Or PostgreSQL with recursive CTEs for simpler cases)
│
├── Time-series data (metrics, IoT, logs)
│   └── TimescaleDB, InfluxDB, or Prometheus
│
├── Full-text search
│   └── Elasticsearch or PostgreSQL full-text search
│       (PostgreSQL is often good enough for basic search)
│
├── Message queue / event streaming
│   └── Kafka, RabbitMQ, or Redis Streams
│
└── Blob storage (files, images, videos)
    └── S3, GCS, or Azure Blob Storage
        (NOT a database - databases are bad at storing files)

Signs you might need to scale

| Symptom | Possible Cause | Solution to Try First | Nuclear Option |
|---------|---------------|----------------------|----------------|
| Slow queries | Missing indexes | Add indexes | Read replicas |
| Database CPU high | Too many queries | Add caching | Read replicas |
| Slow writes | Write contention | Optimize queries | Write sharding |
| Running out of disk | Too much data | Archive old data | Sharding |
| Slow joins | Too many tables | Denormalize | Different database |

Common mistakes to avoid

  1. Using MongoDB because "it scales" when PostgreSQL would work fine
  2. Sharding when adding an index would fix the problem
  3. Adding a new database for each feature instead of using one well
  4. Choosing a database because a famous company uses it
  5. Not measuring before optimizing

Real-world example: How companies actually do it

Instagram (billions of photos, millions of users)

  • Main database: PostgreSQL (heavily sharded by user ID)
  • Caching: Memcached and Redis
  • Photo storage: Custom blob storage → later moved to Facebook infrastructure
  • Search: Elasticsearch

Uber (millions of trips per day)

  • Main database: Started with PostgreSQL, later moved to MySQL-based storage (Schemaless)
  • Real-time: Redis
  • Analytics: Apache Hive, Presto
  • Time-series: M3 (custom)

Stack Overflow (millions of questions, billions of page views)

  • Main database: SQL Server (just TWO servers!)
  • Caching: Redis
  • Search: Elasticsearch
  • Lesson: You can go very far with traditional databases and good engineering

Netflix (200 million subscribers, billions of hours streamed)

  • Main database: Started with Oracle, moved to Cassandra for scale
  • Caching: EVCache (based on Memcached)
  • Search: Elasticsearch
  • Analytics: Apache Spark, Druid

Caching: Making Things Fast

Why cache?

Databases are slow. Reading from disk takes milliseconds. Reading from memory takes microseconds - 1000x faster.

Caching stores frequently accessed data in memory so you do not have to ask the database every time.

Example:

  • Without cache: 100ms to fetch a user profile (database query)
  • With cache: 1ms to fetch a user profile (memory lookup)
  • If 90% of requests hit the cache: Average = 0.9 × 1ms + 0.1 × 100ms = 10.9ms

That is roughly 10x faster with caching!

Cache-Aside Pattern (Most Common)

The application manages the cache.

To read:

  1. Check the cache. If found, return it (cache hit).
  2. If not found (cache miss), read from the database.
  3. Store the result in the cache for next time.
  4. Return the data.

To write:

  1. Write to the database.
  2. Invalidate (delete) the cache entry.
  3. The next read will repopulate the cache.

Cache-Aside Pattern (step by step)
FUNCTION get_user(user_id):
    
    // Step 1: Check cache first
    cache_key = "user:" + user_id
    cached_user = REDIS.get(cache_key)
    
    IF cached_user exists:
        // Cache hit! Return immediately (fast)
        RETURN cached_user
    
    // Step 2: Cache miss - get from database (slow)
    user = DATABASE.query("SELECT * FROM users WHERE id = ?", user_id)
    
    // Step 3: Store in cache for next time
    // Expire after 1 hour so stale data eventually goes away
    REDIS.set(cache_key, user, expire_in=1_hour)
    
    RETURN user


FUNCTION update_user(user_id, new_data):
    
    // Step 1: Update database
    DATABASE.query("UPDATE users SET ... WHERE id = ?", user_id, new_data)
    
    // Step 2: Invalidate cache (delete the old cached version)
    cache_key = "user:" + user_id
    REDIS.delete(cache_key)
    
    // Next read will fetch fresh data and re-cache it

Cache invalidation is hard

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

The problem: When data changes, you must remember to invalidate the cache. If you forget, users see stale data. If you invalidate too aggressively, you lose the caching benefit.

Other Caching Patterns

Write-Through: Write to cache AND database together. Cache is always fresh. But writes are slower (two writes).

Write-Behind (Write-Back): Write to cache only. Cache writes to database later in batches. Very fast writes but risk of data loss if cache crashes before writing.

Read-Through: Cache talks to database automatically on miss. Application only talks to cache. Simpler application code.

For most applications, Cache-Aside is the right choice. It is simple, well-understood, and gives you full control.

What to cache?

Cache things that are:

  • Read frequently
  • Expensive to compute or fetch
  • Changed infrequently
  • Okay to be slightly stale

Good candidates:

  • User profiles (read on every page, change rarely)
  • Product listings (read constantly, update occasionally)
  • Session data (read every request)
  • Computed results (leaderboards, statistics)

Bad candidates:

  • Data that changes constantly (real-time stock prices)
  • Data that must be 100% fresh (bank balances)
  • Rarely accessed data (wastes memory)
  • Data that is already fast to fetch (indexed lookups)

| Pattern | How It Works | Best For | Drawback |
|---|---|---|---|
| Cache-Aside | App manages cache, reads from DB on miss | Most applications | App must handle cache logic |
| Write-Through | Write to cache + DB together | Data that must not be lost | Slower writes |
| Write-Behind | Write to cache, DB updated later | Write-heavy workloads | Data loss if cache fails |
| Read-Through | Cache fetches from DB automatically | Simpler application code | Less control |

Cache expiration strategy

Always set a TTL (time to live) on cache entries. Even if you invalidate on updates, TTL is a safety net. If something goes wrong with invalidation, stale data will eventually expire. Start with 1 hour TTL and adjust based on your needs.

Message Queues and Stream Processing

Why use message queues?

Sometimes you do not want to do something immediately. You want to do it later, or have someone else do it.

Example: User signs up

  • Must do immediately: Create account, log them in
  • Can do later: Send welcome email, notify analytics, sync to CRM

Without a queue: The user waits while all of those things happen (slow).

With a queue: Create the account, put a "send email" task in the queue, and respond to the user. A background worker sends the email later.

Benefits of queues:

  1. Decoupling: Services do not need to know about each other
  2. Resilience: If the email service is down, messages wait in the queue
  3. Scaling: Add more workers to process the queue faster
  4. Smoothing: Handle bursts by queueing and processing steadily

Synchronous vs Asynchronous Processing

Message Queue vs Event Stream

Message Queue (RabbitMQ, SQS, etc.)

  • Messages are tasks to be done
  • Once a message is processed, it is deleted
  • Each message goes to ONE consumer
  • Use case: Background jobs, task distribution

Event Stream (Kafka, etc.)

  • Events are facts about what happened
  • Events are kept for a while (days/weeks/forever)
  • Multiple consumers can read the same event
  • Use case: Event sourcing, real-time analytics, data pipelines

| Feature | Message Queue | Event Stream |
|---|---|---|
| Message lifetime | Deleted after processing | Kept for configurable time |
| Consumer model | One consumer per message | Multiple consumers can read same event |
| Ordering | Best effort (some guarantee order) | Strong ordering per partition |
| Replay | Cannot replay processed messages | Can replay from any point |
| Use case | Task processing, work distribution | Event sourcing, data pipelines, analytics |

When to use what?

Use a Message Queue (RabbitMQ, Amazon SQS, Redis) when:

  • You have tasks to distribute to workers
  • Messages should be processed once and then forgotten
  • You need simple pub/sub for notifications
  • Example: Send email, process image, generate report

Use an Event Stream (Kafka, Amazon Kinesis) when:

  • You need to keep a log of everything that happened
  • Multiple systems need to react to the same event
  • You need to replay events (rebuild state, debug issues)
  • You are doing real-time analytics
  • Example: User activity stream, order events, IoT sensor data

Start with Redis for simple queues

For many applications, Redis lists or streams work great as a simple queue. You probably already have Redis for caching. Only add Kafka or RabbitMQ when you outgrow Redis or need their specific features.
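A minimal sketch of that approach using the redis-py client and a Redis list as the queue; the queue name, payload shape, and localhost connection are assumptions for the example:

    import json
    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    # Producer: the web request just enqueues the task and returns immediately.
    def enqueue_welcome_email(user_id, email):
        task = {"type": "welcome_email", "user_id": user_id, "email": email}
        r.lpush("jobs", json.dumps(task))

    # Worker: a separate process that blocks until a job is available.
    def run_worker():
        while True:
            _queue, payload = r.brpop("jobs")     # blocks until something arrives
            task = json.loads(payload)
            print("sending email to", task["email"])

    enqueue_welcome_email(42, "alice@mail.com")
    # run_worker() would normally run in its own worker process.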

Common patterns with queues

Work Queue: Multiple workers process tasks from one queue. Good for parallel processing.

Pub/Sub: One message goes to all subscribers. Good for notifications and broadcasting.

Request-Reply: Send a message and wait for response in a reply queue. Good for async RPC.

Dead Letter Queue: Messages that fail repeatedly go to a special queue for investigation.

Delay Queue: Messages are not processed until a specified time. Good for scheduled tasks.

Putting It All Together: Reference Architectures

Architecture 1: Simple Web Application

For most startups and small-to-medium applications.

  • Main Database: PostgreSQL (handles almost everything)
  • Cache: Redis (sessions, hot data)
  • Background Jobs: Redis queue or PostgreSQL-based (Sidekiq, Celery)
  • Search: PostgreSQL full-text search (upgrade to Elasticsearch later if needed)

This handles millions of users. Seriously. Do not over-engineer.

Simple Web Application Architecture

Architecture 2: Growing Application with Read Scaling

When reads are the bottleneck but writes are still manageable.

  • Main Database: PostgreSQL primary + read replicas
  • Cache: Redis cluster
  • Search: Elasticsearch (for complex search requirements)
  • Background Jobs: Dedicated job queue (Sidekiq, Celery)
  • CDN: CloudFront/Cloudflare for static assets

Read-Scaled Architecture

Architecture 3: Large-Scale Application

When you truly need to handle massive scale.

  • Application Layer: Microservices with API Gateway
  • Data Layer: Service-specific databases (polyglot persistence)
  • Event Bus: Kafka for inter-service communication
  • Caching: Distributed Redis cluster
  • Search: Elasticsearch cluster
  • Analytics: Separate data warehouse (Redshift, BigQuery)

Large-Scale Microservices Architecture

Do not start with Architecture 3!

Many teams jump to microservices too early. Start with Architecture 1. Move to Architecture 2 when you have real scaling problems. Only move to Architecture 3 when you have multiple teams that need to work independently AND you have the operational maturity to handle it.

Key Takeaways

  1. Start simple: PostgreSQL + Redis handles most applications
  2. Measure first: Do not optimize for imaginary problems
  3. Add complexity only when needed: Each new database is operational burden
  4. Understand the tradeoffs: There is no perfect database, only tradeoffs
  5. Design for your access patterns: How you query data determines your architecture
  6. Plan for growth but do not over-engineer: You can always scale later
  7. Pick boring technology: Well-understood tools are easier to operate
  8. Replication before partitioning: Scale reads with replicas before you shard
  9. Cache aggressively: The fastest database query is one you do not make
  10. Monitor everything: You cannot fix what you cannot measure

Trade-offs

| Aspect | Advantage | Disadvantage |
|---|---|---|