Database Sharding: When and How to Scale Your Database
Complete guide to database sharding. Learn horizontal partitioning strategies, shard key selection, and when sharding is the right choice.
Ready to Master System Design Interviews?
Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.
Complete Solutions
Architecture diagrams & trade-off analysis
Real Interview Problems
From actual FAANG interviews
7-day money-back guarantee • Lifetime access • New problems added quarterly
Your database has 500 million rows. Queries that used to take milliseconds now take seconds. Adding indexes doesn't help anymore. You've upgraded to the biggest server you can buy, and it's still not enough.
You need to shard.
Database sharding is one of the most important concepts in system design, yet it's also one of the most misunderstood. In this guide, I'll explain what sharding is, when you actually need it (hint: later than you think), how to choose a sharding strategy, and the pitfalls that catch most teams.
What is Database Sharding?
Sharding is the practice of splitting a database into multiple smaller databases, called shards, where each shard contains a subset of the data.
Before Sharding:
┌─────────────────────────────────────┐
│ Single Database │
│ Users: 500 million rows │
│ Orders: 2 billion rows │
└─────────────────────────────────────┘
After Sharding:
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ Shard 4 │
│ Users A-F │ │ Users G-M │ │ Users N-S │ │ Users T-Z │
│ 125M rows │ │ 125M rows │ │ 125M rows │ │ 125M rows │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
Each shard is a complete database server with its own CPU, memory, and storage. Queries only hit the shard containing the relevant data.
Sharding vs. Partitioning
These terms are often confused:
| Term | Definition | Example |
|---|---|---|
| Partitioning | Splitting data within a single database instance | PostgreSQL table partitioning |
| Sharding | Splitting data across multiple database instances | Each shard is a separate server |
Partitioning is a database feature. Sharding is an architectural pattern.
Horizontal vs. Vertical Sharding
Horizontal sharding (most common): Split rows across shards. Each shard has all columns but only some rows.
Horizontal: Split by user_id
Shard 1: user_id 1-1000000
Shard 2: user_id 1000001-2000000
Vertical sharding: Split columns across databases. Each database has all rows but only some columns.
Vertical: Split by data type
Database A: user_id, name, email (frequently accessed)
Database B: user_id, bio, preferences (rarely accessed)
When people say "sharding," they usually mean horizontal sharding.
When Do You Need Sharding?
The short answer: Later than you think.
Sharding adds significant complexity. Before you shard, try these alternatives:
Level 1: Optimize Queries
- Add proper indexes
- Rewrite slow queries
- Use EXPLAIN ANALYZE
- Denormalize where appropriate
When it works: Most applications. You'd be surprised how far good indexing takes you.
Level 2: Vertical Scaling
- Upgrade to more CPU, RAM, faster storage
- Move to SSDs if you haven't
- Use cloud instances with more resources
When it works: Up to ~1TB of data, ~10,000 QPS for simple queries.
Level 3: Read Replicas
- Create read-only copies of your database
- Route read traffic to replicas
- Keep writes on the primary
┌──────────┐
│ Primary │ ← All writes
└────┬─────┘
│ Replication
┌────┼────────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Replica1│ │Replica2│ │Replica3│ ← All reads
└────────┘ └────────┘ └────────┘
When it works: Read-heavy workloads (90%+ reads). Most web applications.
Level 4: Caching
- Add Redis/Memcached for frequently accessed data
- Cache query results
- Cache computed values
When it works: When the same data is read repeatedly.
Level 5: Sharding
Only consider sharding when:
- Data exceeds single-server storage capacity
- Write throughput exceeds single-server capability
- You've exhausted the options above
When you need it:
- 10+ TB of data
- 50,000+ write QPS
- Geographic distribution requirements
Sharding Strategies
There are several approaches to deciding which data goes to which shard.
Strategy 1: Range-Based Sharding
Divide data based on ranges of a key value.
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
Pros:
- Simple to understand and implement
- Range queries are efficient (all data in one shard)
- Easy to add new shards for new ranges
Cons:
- Hot spots: New users all go to the latest shard
- Uneven distribution: Some ranges may be more active
- Rebalancing is complex
When to use: Time-series data, sequential IDs where old data is accessed less.
Example implementation:
```python
def get_shard(user_id):
    if user_id <= 1_000_000:
        return "shard_1"
    elif user_id <= 2_000_000:
        return "shard_2"
    elif user_id <= 3_000_000:
        return "shard_3"
    else:
        return "shard_4"
```
Strategy 2: Hash-Based Sharding
Apply a hash function to the shard key, then modulo by number of shards.
shard_id = hash(user_id) % num_shards
user_id: 12345
hash(12345) = 7832619
7832619 % 4 = 3
→ Goes to Shard 3
Pros:
- Even distribution of data
- No hot spots (assuming good hash function)
- Simple routing logic
Cons:
- Range queries span all shards (expensive)
- Adding shards requires data redistribution
- A single very hot key still lands on one shard (hashing evens out keys, not per-key load)
When to use: Uniform access patterns, key-value lookups.
Example implementation:
```python
import hashlib

def get_shard(user_id, num_shards=4):
    # MD5 gives a hash that is stable across processes and restarts,
    # unlike Python's built-in hash() for strings
    hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return f"shard_{hash_value % num_shards}"
```
Strategy 3: Consistent Hashing
A variation of hash-based sharding that minimizes data movement when adding/removing shards.
Hash Ring:
0
/ | \
/ | \
Shard1 Shard2
\ | /
\ | /
128
/ | \
/ | \
Shard3 Shard4
\ | /
\ | /
255
Keys are hashed to positions on a ring. Each key goes to the first shard clockwise from its position.
Pros:
- Adding a shard only moves K/N keys (K = total keys, N = shards)
- Removing a shard only affects that shard's keys
- Smooth scaling
Cons:
- More complex to implement
- Can still have uneven distribution (solved with virtual nodes)
When to use: Systems that frequently add/remove shards.
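The ring above can be sketched in a few lines of Python. This is an illustrative implementation, not a production one: it assumes MD5 as the stable hash and 100 virtual nodes per shard to smooth out the distribution.

```python
import bisect
import hashlib

def stable_hash(value: str) -> int:
    # Stable across processes; MD5 is fine here since this is placement, not security
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard owns many points on the ring ("virtual nodes"),
        # which evens out the key distribution
        self.ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def get_shard(self, key) -> str:
        # First ring point clockwise from the key's position (wrap around)
        idx = bisect.bisect(self.points, stable_hash(str(key))) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard_1", "shard_2", "shard_3", "shard_4"])
print(ring.get_shard(12345))
```

Adding a fifth shard means building a ring with five names: only keys whose nearest clockwise point changes will move, roughly a fifth of them, instead of nearly everything as with plain modulo hashing.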
Strategy 4: Directory-Based Sharding
Maintain a lookup table mapping keys to shards.
Lookup Table:
┌──────────┬─────────┐
│ user_id │ shard │
├──────────┼─────────┤
│ 1-1000 │ shard_1 │
│ 1001-1500│ shard_2 │
│ 1501-3000│ shard_3 │
│ premium │ shard_4 │
└──────────┴─────────┘
Pros:
- Maximum flexibility
- Can optimize placement per key
- Easy to move specific keys
Cons:
- Lookup table becomes a single point of failure
- Extra hop for every query
- Table can become large
When to use: Complex sharding logic, premium/VIP user handling.
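A minimal sketch of a directory router, using the ranges from the table above plus an override map for pinned users (the "premium" row). The in-memory table is an assumption for illustration; in production the directory would live in a replicated, highly available store.

```python
import bisect

class DirectoryRouter:
    def __init__(self):
        # Range entries: (upper_bound_inclusive, shard), kept sorted by bound
        self.ranges = [(1000, "shard_1"), (1500, "shard_2"), (3000, "shard_3")]
        # Explicit per-key placements, e.g. premium users pinned to a dedicated shard
        self.overrides = {}

    def get_shard(self, user_id):
        if user_id in self.overrides:
            return self.overrides[user_id]
        bounds = [b for b, _ in self.ranges]
        idx = bisect.bisect_left(bounds, user_id)
        if idx < len(self.ranges):
            return self.ranges[idx][1]
        raise KeyError(f"no shard mapped for user_id {user_id}")

router = DirectoryRouter()
router.overrides[42] = "shard_4"   # pin a premium user
print(router.get_shard(1200))      # shard_2
print(router.get_shard(42))        # shard_4
```

The flexibility comes from the fact that moving a key (or a range) is just an update to this table, at the cost of consulting it on every request.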
Strategy Comparison
| Strategy | Distribution | Range Queries | Resharding | Complexity |
|---|---|---|---|---|
| Range | Uneven | Fast | Difficult | Low |
| Hash | Even | Slow | Difficult | Low |
| Consistent Hash | Even | Slow | Easy | Medium |
| Directory | Custom | Depends | Easy | High |
Choosing a Shard Key
The shard key is the most important decision in sharding. Choose wrong, and you'll either have hot spots or expensive cross-shard queries.
Good Shard Key Characteristics
1. High cardinality
Many unique values distribute data evenly.
Good: user_id (millions of unique values)
Bad: country (only ~200 values, US gets 50% of traffic)
2. Even distribution
Values should be accessed uniformly.
Good: user_id (users accessed somewhat evenly)
Bad: created_date (recent dates accessed much more)
3. Query patterns align
Most queries should hit a single shard.
Good: user_id when queries are "SELECT * FROM orders WHERE user_id = ?"
Bad: user_id when queries are "SELECT * FROM orders WHERE date = ?"
Shard Key Examples
E-commerce: Orders table
| Key Option | Pros | Cons |
|---|---|---|
| user_id | One user's orders on one shard | Reporting by date is slow |
| order_id | Even distribution | User's order history spans shards |
| date | Time-range queries fast | Hot spot on recent dates |
| (user_id, order_id) | Balanced | Complex |
Best choice for most cases: user_id if queries are user-centric.
Social Media: Posts table
| Key Option | Pros | Cons |
|---|---|---|
| user_id | User's posts together | Celebrity users = hot spots |
| post_id | Even distribution | User's posts scattered |
| (user_id, post_id) | Balanced | Complex routing |
Best choice: user_id with special handling for high-follower accounts.
Chat App: Messages table
| Key Option | Pros | Cons |
|---|---|---|
| conversation_id | Conversation messages together | Some conversations huge |
| message_id | Even distribution | Conversation messages scattered |
| (sender_id, recipient_id) | User's messages together | Group chats complex |
Best choice: conversation_id or composite key.
Anti-Patterns
1. Sharding by low-cardinality field
Bad: Shard by country
US = 50% of traffic → one hot shard
2. Sharding by monotonically increasing key
Bad: Shard by auto-increment ID with range sharding
All new writes go to the last shard
3. Sharding without considering query patterns
Bad: Shard users by user_id
Then query: "SELECT * FROM users WHERE email = ?"
Must query all shards
Cross-Shard Operations
The hardest part of sharding is operations that span multiple shards.
Cross-Shard Queries
Scatter-Gather Pattern:
Query all shards, combine results.
```python
def search_users(query):
    results = []
    for shard in all_shards:
        # Parameterized query: never interpolate user input into SQL
        rows = shard.query(
            "SELECT * FROM users WHERE name LIKE %s", (f"%{query}%",)
        )
        results.extend(rows)
    # Combine across shards, then apply the final sort and limit
    return sorted(results, key=lambda u: u["name"])[:100]
```
Performance: Slowest shard determines response time. Avoid when possible.
Cross-Shard Joins
Problem: User data on shard 1, orders on shard 3.
Solutions:
- Denormalize: Store user data with orders (data duplication)
- Application-level join: Query both shards, join in application
- Broadcast table: Replicate small tables to all shards
```python
# Application-level join
user = shard_1.query("SELECT * FROM users WHERE id = 123")
orders = shard_3.query("SELECT * FROM orders WHERE user_id = 123")

# Join in application
result = {**user, "orders": orders}
```
Cross-Shard Transactions
The hardest problem. Distributed transactions across shards are:
- Slow (2-phase commit)
- Complex (failure handling)
- Sometimes impossible (CAP theorem)
Solutions:
- Design to avoid them: Keep related data on same shard
- Saga pattern: Sequence of local transactions with compensating actions
- Eventual consistency: Accept temporary inconsistency
Saga Example (Order Processing):
1. Create order (Order Shard) → Success
2. Reserve inventory (Inventory Shard) → Success
3. Charge payment (Payment Shard) → Fail
4. Compensate: Release inventory (Inventory Shard)
5. Compensate: Cancel order (Order Shard)
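The flow above can be captured with a tiny saga runner. This is a hedged sketch: the shard operations are hypothetical stand-ins, and a real saga also needs durable state so compensation can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensate) pairs; on failure, undo completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()  # best-effort compensation
            return False
    return True

log = []

def charge_payment():
    raise RuntimeError("payment declined")  # simulate the failure in step 3

ok = run_saga([
    (lambda: log.append("create order"),      lambda: log.append("cancel order")),
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (charge_payment,                          lambda: log.append("refund payment")),
])
print(ok, log)
```

Note the compensations run in reverse order of the completed steps, exactly as in the numbered example: release inventory first, then cancel the order.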
Rebalancing and Resharding
As data grows, you'll need to add shards. This is painful.
The Resharding Problem
Hash-based sharding nightmare:
Before: 4 shards
hash(user_123) % 4 = 3 → Shard 3
After: 5 shards
hash(user_123) % 5 = 3 → Shard 3 (lucky!)
But:
hash(user_456) % 4 = 2 → Was Shard 2
hash(user_456) % 5 = 1 → Now Shard 1 (must move!)
With simple modulo hashing, going from 4 to 5 shards moves roughly 80% of your data — only keys where hash % 4 happens to equal hash % 5 stay put.
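You can verify the scale of the problem with a quick simulation, using MD5 as a stand-in for any stable hash function:

```python
import hashlib

def shard(key, n):
    # Stable hash of the key, then simple modulo placement
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % n

keys = range(100_000)
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 shards")
```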
Solutions
1. Consistent hashing
Only K/N keys move when adding a shard.
2. Virtual shards
Create more logical shards than physical servers. Moving is just reassigning virtual shards.
Physical: 4 servers
Virtual: 256 shards (64 per server)
To add a 5th server:
Move 51 virtual shards to new server (20% of data)
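A sketch of the virtual-shard mapping, again assuming MD5 for stable key hashing. The key point is that a key's virtual shard never changes; resharding only edits the assignment table.

```python
import hashlib

NUM_VIRTUAL = 256

def virtual_shard(user_id):
    # Fixed forever: depends only on NUM_VIRTUAL, never on the server count
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return h % NUM_VIRTUAL

# Start: 4 servers, 64 virtual shards each
assignment = {v: f"server_{v % 4}" for v in range(NUM_VIRTUAL)}

def get_server(user_id):
    return assignment[virtual_shard(user_id)]

# Add a 5th server: reassign about a fifth of the virtual shards
# (every 5th one here); no key ever re-hashes
for v in range(0, NUM_VIRTUAL, 5):
    assignment[v] = "server_4"

print(get_server(12345))
```

Moving a virtual shard means copying its data to the new server and flipping one table entry, which is far easier to do incrementally than rehashing everything.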
3. Double-write migration
During migration:
- Write to both old and new shard
- Background job migrates existing data
- When complete, switch reads to new shard
- Stop writes to old shard
```python
def write_user(user):
    old_shard = get_old_shard(user.id)
    new_shard = get_new_shard(user.id)
    old_shard.write(user)
    if old_shard != new_shard:
        new_shard.write(user)  # double-write during migration
```
Sharding Implementation Approaches
Application-Level Sharding
Application code handles shard routing.
```python
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, user_id):
        # Note: for string keys, use a stable hash (e.g. hashlib) rather than
        # hash(), which varies between Python processes
        shard_id = hash(user_id) % len(self.shards)
        return self.shards[shard_id]

    def query(self, user_id, sql):
        shard = self.get_shard(user_id)
        return shard.execute(sql)

# Usage
router = ShardRouter([db1, db2, db3, db4])
user = router.query(user_id=123, sql="SELECT * FROM users WHERE id = 123")
```
Pros: Full control, no external dependencies.
Cons: Complex application code; shard logic spreads everywhere.
Proxy-Level Sharding
A proxy sits between application and databases.
┌─────────┐ ┌─────────┐ ┌──────────────────────┐
│ App │────▶│ Proxy │────▶│ Shard 1, 2, 3, 4 │
└─────────┘ │(Vitess) │ └──────────────────────┘
└─────────┘
Examples: Vitess, ProxySQL, Apache ShardingSphere. (PgBouncer, often mentioned here, is a connection pooler, not a sharding proxy.)
Pros: Transparent to the application, centralized routing.
Cons: Additional infrastructure, potential bottleneck.
Database-Native Sharding
Some databases have built-in sharding.
MongoDB: Built-in sharding with config servers
sh.shardCollection("mydb.users", { user_id: "hashed" })
CockroachDB, Spanner: Automatic sharding and rebalancing
PostgreSQL + Citus: Extension for distributed PostgreSQL
Pros: Less custom code, built-in rebalancing.
Cons: Vendor lock-in, limited flexibility.
Common Pitfalls
Pitfall 1: Sharding Too Early
Problem: You shard at 100GB when you could handle 1TB with optimization.
Impact: Years of unnecessary complexity, slower development.
Solution: Exhaust alternatives first. Most startups never need sharding.
Pitfall 2: Wrong Shard Key
Problem: Sharded by timestamp, all writes go to one shard.
Impact: Hot spots, no write scaling benefit.
Solution: Analyze query patterns before choosing shard key.
Pitfall 3: Forgetting Cross-Shard Queries
Problem: Designed for single-shard queries, then business needs reporting.
Impact: Reports take hours or crash the system.
Solution: Plan for analytical queries from the start. Consider a separate analytics database.
Pitfall 4: No Rebalancing Strategy
Problem: Shards become uneven over time, no plan to fix it.
Impact: One shard becomes bottleneck, others underutilized.
Solution: Use consistent hashing or virtual shards. Plan for rebalancing.
Pitfall 5: Ignoring Operational Complexity
Problem: 4 databases is 4x the backups, monitoring, and on-call alerts.
Impact: Team drowns in operations, shipping slows.
Solution: Invest in automation. Consider managed databases.
Sharding in System Design Interviews
When to bring up sharding:
Mention Sharding When:
- Scale is clearly massive (billions of rows)
- Write throughput is explicitly high
- Interviewer asks about horizontal scaling
- Data doesn't fit on single server
Don't Jump to Sharding When:
- Scale is unclear (ask first)
- Read-heavy workload (replicas might suffice)
- Early in the interview (show you know simpler solutions)
How to Discuss Sharding
1. Acknowledge the trade-off:
"Sharding adds complexity, cross-shard queries, distributed transactions, and operational overhead. So I'd first try read replicas and caching. But at this scale, we likely need to shard."
2. Choose a shard key:
"For this user-centric application, I'd shard by user_id. This keeps all of a user's data on one shard, making most queries single-shard operations."
3. Address cross-shard concerns:
"For the reporting use case, I'd use a separate analytics database that receives events from all shards. This way, operational queries stay fast while analytics has all the data it needs."
4. Discuss the strategy:
"I'd use consistent hashing to make future resharding easier. We could start with 4 shards and double when needed."
Summary: Sharding Decision Framework
┌─────────────────────────────────────────────────────────────┐
│ Do You Need Sharding? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ Have you optimized queries? │
│ (indexes, query rewriting) │
└───────────────┬───────────────┘
│ Yes
▼
┌───────────────────────────────┐
│ Have you scaled vertically? │
│ (bigger server, SSD) │
└───────────────┬───────────────┘
│ Yes
▼
┌───────────────────────────────┐
│ Is workload read-heavy? │
│ (>90% reads) │
└───────────────┬───────────────┘
│ Yes → Add read replicas
│ No
▼
┌───────────────────────────────┐
│ Is hot data cacheable? │
└───────────────┬───────────────┘
│ Yes → Add caching
│ No
▼
┌───────────────────────────────┐
│ Data > 10TB OR │
│ Write QPS > 50,000? │
└───────────────┬───────────────┘
│ Yes
▼
┌─────────────────┐
│ Consider Sharding│
└─────────────────┘
Frequently Asked Questions
At what data size should I consider sharding?
There's no magic number, but as rough guidelines: if you're under 1TB and under 10,000 write QPS, you probably don't need sharding. Modern databases on good hardware can handle surprising amounts of data.
Can I shard after launch, or is it too late?
You can, but it's expensive. Resharding a production system requires careful planning, double-writes, data migration, and rollback plans. It's not too late, but it's harder than starting with sharding.
Should I use a database with built-in sharding or implement my own?
For most teams, use a database with built-in sharding (MongoDB, CockroachDB, Spanner). Implementing your own is a massive undertaking. Only build custom sharding if you have specific requirements that off-the-shelf solutions can't meet.
How do I handle auto-increment IDs across shards?
Options:
- UUID: Globally unique, no coordination needed
- Snowflake IDs: Time-based, includes shard ID
- Central ID service: Assigns ID ranges to shards
- Composite keys: (shard_id, local_id)
Most teams use UUIDs or Snowflake-style IDs.
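A hedged sketch of a Snowflake-style generator, using the common 41-bit millisecond timestamp / 10-bit shard ID / 12-bit sequence layout. The custom epoch is an arbitrary choice, and real deployments also have to handle clock drift.

```python
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch

class SnowflakeGenerator:
    def __init__(self, shard_id):
        assert 0 <= shard_id < 1024  # must fit in 10 bits
        self.shard_id = shard_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
            if self.sequence == 0:
                # Sequence exhausted this millisecond; spin until the next one
                while int(time.time() * 1000) <= now:
                    pass
                now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        # 41 bits timestamp | 10 bits shard | 12 bits sequence
        return ((now - EPOCH_MS) << 22) | (self.shard_id << 12) | self.sequence

gen = SnowflakeGenerator(shard_id=3)
print(gen.next_id())
```

IDs sort by creation time, and the shard that generated an ID can be recovered with `(id >> 12) & 0x3FF`, which makes routing and debugging easier than with opaque UUIDs.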