Design Walkthrough
Problem Statement
The Question: Design a control plane for a distributed database that manages 1000+ nodes, handles automatic failover, coordinates schema changes, and maintains cluster metadata.
The control plane is responsible for:
- Cluster membership: Which nodes are alive, their roles and capabilities
- Data placement: Which node owns which data ranges/shards
- Leader election: Ensuring exactly one leader per shard/partition
- Schema management: Coordinating DDL across all nodes
- Configuration: Cluster-wide settings and policies
What to say first
Before designing, let me clarify the separation between the control plane and the data plane. The control plane manages metadata and coordination; the data plane handles the actual data reads and writes. They have very different consistency and performance requirements.
Hidden requirements interviewers are testing:
- Do you understand the CAP implications for metadata vs data?
- Can you prevent split-brain scenarios?
- How do you handle the thundering herd on leader failure?
- Can you reason about consensus protocols at scale?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many nodes in the cluster? How many shards/partitions? What is the expected metadata size?
Why this matters: Determines if metadata fits on a single node or needs sharding.
Typical answer: 1000-5000 nodes, 100K+ shards, metadata in the GB range.
Architecture impact: Need a hierarchical control plane or sharded metadata.
Question 2: Consistency Requirements
What happens if control plane is unavailable? Can data plane continue serving reads/writes?
Why this matters: Determines coupling between control plane and data plane.
Typical answer: Data plane should continue with cached metadata for some time.
Architecture impact: Need lease-based ownership with expiry, not hard locks.
Question 3: Failure Detection
How quickly must we detect node failures? What is acceptable failover time?
Why this matters: Faster detection risks more false positives.
Typical answer: Detect within 10s, fail over within 30s.
Architecture impact: Need tunable failure detectors, not just fixed heartbeat timeouts.
Question 4: Multi-Region
Single region or multi-region deployment? What are the latency constraints?
Why this matters: Cross-region consensus adds 100s of ms of latency.
Typical answer: Multi-region for disaster recovery; accept higher latency for control plane ops.
Architecture impact: Hierarchical control plane with regional leaders.
Stating assumptions
I will assume: 1000 nodes, 100K shards, multi-region deployment, data plane can operate with cached metadata for minutes, failover target under 30 seconds.
The Hard Part
Say this out loud
The hard part here is preventing split-brain scenarios where two nodes both believe they are the leader for the same data, while still achieving fast failover when nodes actually fail.
Why this is genuinely hard:
1. Split-Brain Prevention: If a network partition splits the cluster, both sides might elect leaders. Two leaders writing to the same data means corruption.
2. Failure Detection vs False Positives: Fast detection catches failures quickly but triggers false failovers during GC pauses or network blips.
3. Thundering Herd: When a leader fails, all followers might simultaneously try to become leader or query the control plane.
4. Consistency During Reconfiguration: Adding or removing nodes while serving traffic, without losing data.
Common mistake
Using simple heartbeats without quorum-based failure detection. A node isolated by network partition keeps serving stale data while a new leader is elected.
The fundamental tradeoff:
Faster failover detection increases risk of false positives. Slower failover detection increases unavailability window.
Solution: Use leases - a node owns data only while its lease is valid. Lease must expire before new leader can be elected. This bounds the window of potential split-brain.
Split-Brain Prevention with Leases
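A minimal sketch of the data-plane side of this rule (the Lease type, canServe, and clockSkewBound are illustrative names, not given in the prompt): the leaseholder re-checks its lease before serving and stops slightly before the nominal expiry to absorb clock skew.

import "time"

// Lease as cached by a data node (illustrative fields).
type Lease struct {
    ShardID    string
    HolderID   string
    Epoch      uint64
    Expiration time.Time
}

const clockSkewBound = 500 * time.Millisecond // safety margin for clock drift

// canServe reports whether this node may still act as leaseholder for the shard.
// It refuses slightly before the nominal expiry so an old leader stops serving
// before the control plane can possibly grant the lease to anyone else.
func canServe(l *Lease, nodeID string, now time.Time) bool {
    if l == nil || l.HolderID != nodeID {
        return false
    }
    return now.Before(l.Expiration.Add(-clockSkewBound))
}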
Scale and Access Patterns
Let me estimate the scale and understand access patterns for the control plane.
| Dimension | Value | Impact |
|---|---|---|
| Nodes | 1,000 - 5,000 | Membership state manageable on single consensus group |
| Shards/Ranges | 100,000+ | Need to shard metadata or use hierarchical approach |
What to say
Control plane operations are low throughput but require strong consistency. Data plane is high throughput with relaxed consistency. These different profiles suggest separating them architecturally.
Access Pattern Analysis:
- Reads dominate: Nodes read cluster state frequently, write rarely
- Locality matters: Each node mostly cares about the shards it owns
- Bursty writes: Leader failures cause a burst of metadata updates
- Strong consistency required: Stale metadata can cause data loss
Heartbeat traffic:
- 1000 nodes x 1 heartbeat/second = 1000 msg/s
- Each heartbeat ~200 bytes = 200 KB/s (trivial)
High-Level Architecture
Let me design a hierarchical control plane architecture.
What to say
I will use a hierarchical approach: a global control plane for cluster-wide decisions, and per-shard Raft groups for data leadership. This separates cluster management from data replication.
Control Plane Architecture
Component Responsibilities:
1. Global Control Plane (3-5 nodes)
- Cluster membership management
- Shard-to-node assignment (shard map)
- Schema versioning and distribution
- Global configuration
- Uses Raft/Paxos for consensus

2. Metadata Store
- Cluster state: node list, health status, capacity
- Shard map: which nodes own which ranges (see the sketch after this list)
- Schema: table definitions, indexes, version
- Stored in control plane nodes (replicated via Raft)

3. Per-Shard Raft Groups
- Each shard has its own Raft group (typically 3-5 replicas)
- Handles data replication and leader election for that shard
- Reports to the global control plane for coordination
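To make the shard map concrete, here is a sketch of a single entry; the type and field names (ShardDescriptor, LeaseholderID, Epoch) are assumptions for illustration.

// ShardDescriptor: one entry in the control plane's shard map (illustrative).
type ShardDescriptor struct {
    ShardID       string
    StartKey      []byte   // inclusive start of the key range
    EndKey        []byte   // exclusive end of the key range
    Replicas      []string // node IDs in this shard's Raft group
    LeaseholderID string   // node currently allowed to serve the range
    Epoch         uint64   // bumped on every leaseholder change
}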
Real-world reference
CockroachDB uses this pattern - a global meta range stores the shard map, and each data range has its own Raft group. TiDB has separate PD (Placement Driver) nodes for control plane.
Data Model and Storage
The control plane manages several types of metadata, each with different consistency requirements.
What to say
The control plane stores three categories of data: cluster topology (changes rarely), shard map (changes on rebalancing), and ephemeral state (heartbeats, locks). Each has different storage needs.
// Cluster membership (fields after node_id are illustrative)
message Node {
  string node_id = 1;
  string address = 2;
  int64 capacity_bytes = 3;
}
Storage Choices:
| Data Type | Storage | Consistency | Reason |
|---|---|---|---|
| Cluster membership | Raft-replicated | Strong | Must never disagree on who is in cluster |
| Shard map | Raft-replicated | Strong | Data loss if two nodes think they own same shard |
| Schema | Raft-replicated | Strong | Queries fail if schema version mismatch |
| Heartbeats | Leader memory | Eventual | High frequency, leader tracks locally |
| Leases | Raft-replicated | Strong | Critical for split-brain prevention |
// LeaseManager (sketch): runs on the control-plane leader and tracks shard leases.
type LeaseManager struct {
    raft   *RaftNode
    leases map[string]*Lease // shardID -> Lease
}
Consensus and Leader Election
The control plane uses consensus (Raft or Paxos) for strong consistency. Let me explain the leader election flow.
What to say
I will use Raft for the control plane consensus. It is easier to understand and implement than Paxos, with equivalent consistency guarantees. The key property is that Raft guarantees at most one leader per term, which is the foundation the leases and epochs below build on.
Raft Leader Election Flow
Two Levels of Leader Election:
1. Control Plane Leader (Raft)
- 3-5 control plane nodes form a Raft group
- One is elected leader and handles all writes
- Followers act as read replicas
- Election on leader failure (typically 1-5 seconds)

2. Shard Leader (Per-Shard Raft)
- Each shard has its own Raft group (3 replicas)
- The shard leader handles reads/writes for that shard
- The control plane tracks who the shard leaders are
- Election is independent of the control plane
// FailureDetector (sketch): classifies nodes by time since their last heartbeat.
type FailureDetector struct {
    suspicionThreshold time.Duration        // e.g., 5 seconds
    deadThreshold      time.Duration        // e.g., 30 seconds
    lastHeartbeat      map[string]time.Time // nodeID -> last heartbeat (illustrative field)
}
Critical detail
Always wait for the old lease to expire before granting a new one. This is the ONLY way to prevent split-brain. The old leader will stop serving once its lease expires, even if it cannot reach the control plane.
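A sketch of how the control plane can enforce that rule when granting a lease, reusing the LeaseManager and Lease types sketched above and assuming the RaftNode exposes a Propose method (synchronization and error handling elided):

import (
    "errors"
    "time"
)

var errLeaseStillHeld = errors.New("previous lease has not expired yet")

// Grant hands shardID's lease to newHolder, but only after any previous lease
// has expired. Epochs are strictly increasing so storage nodes can fence off a
// deposed leader's writes.
func (lm *LeaseManager) Grant(shardID, newHolder string, ttl time.Duration) (*Lease, error) {
    var nextEpoch uint64 = 1
    if old, ok := lm.leases[shardID]; ok {
        if time.Now().Before(old.Expiration) {
            // Never overlap leases: the caller must wait for expiry and retry.
            return nil, errLeaseStillHeld
        }
        nextEpoch = old.Epoch + 1
    }
    lease := &Lease{
        ShardID:    shardID,
        HolderID:   newHolder,
        Epoch:      nextEpoch,
        Expiration: time.Now().Add(ttl),
    }
    // Commit the grant through the control plane's Raft group before
    // acknowledging, so a CP failover cannot forget who holds the lease.
    if err := lm.raft.Propose(lease); err != nil {
        return nil, err
    }
    lm.leases[shardID] = lease
    return lease, nil
}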
Shard Management and Rebalancing
The control plane must handle shard splits, merges, and rebalancing as data grows or nodes are added/removed.
What to say
Shard management has three operations: split (when a shard gets too large), merge (when adjacent shards are small), and move (for rebalancing load). Each must maintain consistency during the transition.
Shard Split Process
Shard Split Algorithm:
1. Trigger: Shard exceeds a size threshold (e.g., 512 MB); see the sketch after this list
2. Find split point: Middle key, or based on access patterns
3. Atomic split: Leader creates two child ranges, replicates via Raft
4. Update metadata: Control plane updates the shard map
5. Client refresh: Clients learn about the new ranges on next access
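A sketch of the trigger and split-point step referenced above (ShardStats, MiddleKey, and the constant name are illustrative; the threshold matches the 512 MB example):

const splitThresholdBytes = 512 << 20 // 512 MB, as in the example above

// ShardStats: what a shard leader reports about its own range (illustrative).
type ShardStats struct {
    ShardID   string
    SizeBytes int64
    MiddleKey []byte // approximate median key, tracked by the storage engine
}

// maybeSplit decides whether a shard should split and at which key. The split
// itself is then executed atomically through the shard's Raft group, and the
// control plane updates the shard map afterwards.
func maybeSplit(s ShardStats) (splitKey []byte, ok bool) {
    if s.SizeBytes < splitThresholdBytes {
        return nil, false
    }
    return s.MiddleKey, true
}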
// Rebalancer (sketch): moves shards between nodes to even out load.
type Rebalancer struct {
    cp            *ControlPlane
    targetPerNode int // target shards per node
    maxConcurrent int // cap on simultaneous shard moves (illustrative field)
}
Real-world insight
CockroachDB and TiDB both limit concurrent rebalancing operations to avoid overwhelming the cluster. Typically only 1-3 shard moves happen simultaneously.
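A sketch of that throttling, building on the Rebalancer fields above; moveShard is a hypothetical helper, not a real API:

// rebalanceOnce plans at most (maxConcurrent - inFlight) shard moves per pass,
// so the cluster never runs more than a handful of data copies at once.
func (r *Rebalancer) rebalanceOnce(overloaded, underloaded []string, inFlight int) {
    budget := r.maxConcurrent - inFlight
    for i := 0; i < budget && i < len(overloaded) && i < len(underloaded); i++ {
        go r.moveShard(overloaded[i], underloaded[i])
    }
}

// moveShard (stub): in a real system this would add a replica on the target
// node, wait for it to catch up via Raft snapshot/log, then remove a replica
// from the source and update the shard map through the control plane.
func (r *Rebalancer) moveShard(source, target string) {
    // add-then-remove replica choreography elided
}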
Schema Changes (DDL)
Schema changes (CREATE TABLE, ADD COLUMN, CREATE INDEX) must be coordinated across all nodes without blocking reads/writes.
Say this out loud
Online schema changes are surprisingly hard. The challenge is that different nodes see the new schema at different times. A node with the old schema might reject data that is valid under the new schema.
The Problem:
Imagine adding a new column with a NOT NULL constraint:
- Node A gets the new schema at T=0 and starts requiring the column
- Node B still has the old schema at T=1 and sends data without the column
- Node A rejects it. Inconsistency!
Solution: Two-Phase Schema Change
Online Schema Change Protocol
type SchemaState int

const (
    StateDeleteOnly SchemaState = iota // new schema element visible to deletes only
    StateWriteOnly                     // writes must maintain it; reads still ignore it
    StatePublic                        // fully visible to reads and writes
)
Real-world reference
Google F1 and CockroachDB use this two-phase (or multi-phase) approach. Each phase ensures no node is more than one version ahead or behind any other.
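A sketch of the check behind that guarantee (function and parameter names are illustrative): the control plane activates the next schema version only once every node has acknowledged the current one, so no two nodes can ever be more than one version apart.

// canActivate reports whether the cluster may advance to nextVersion.
// Every node must already have acknowledged at least nextVersion-1.
func canActivate(nextVersion int64, ackedVersionByNode map[string]int64) bool {
    for _, acked := range ackedVersionByNode {
        if acked < nextVersion-1 {
            return false
        }
    }
    return true
}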
Consistency and Invariants
System Invariants
1. At most one leader per shard at any time.
2. No two nodes can believe they own overlapping key ranges.
3. The schema version on any node is within 1 of any other node.
Violation of any invariant means data corruption.
Why strong consistency is non-negotiable for control plane:
- Shard ownership: If two nodes think they own [A-M], they will both accept writes. On reconciliation, one set of writes is lost.
- Leader election: Two leaders means split-brain. Clients see inconsistent data.
- Schema: If node A has column X and node B does not, data written through A cannot be read through B.
| Invariant | Violation Consequence | Prevention Mechanism |
|---|---|---|
| Single leader per shard | Split-brain, data loss | Leases with epoch numbers |
| Non-overlapping ranges | Duplicate or lost data | Raft-replicated shard map |
| Schema version bounded | Query failures, data corruption | Two-phase schema changes |
| Cluster membership consistent | Phantom nodes, missed failures | Consensus-based membership |
// Every write includes the leader's epoch
// Storage nodes reject writes with old epochs
What to say
We use epochs (also called terms or generations) as a logical clock. Any operation with a stale epoch is rejected. This ensures that even if an old leader wakes up from a network partition, its writes will be rejected by storage nodes that have seen a higher epoch.
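A sketch of that check on the storage-node side (WriteRequest and shardEpochs are illustrative names): each node remembers the highest epoch it has seen per shard and rejects anything older.

import "errors"

var errStaleEpoch = errors.New("stale epoch: a newer leader exists for this shard")

// WriteRequest: the part of a data-plane write relevant to fencing (illustrative).
type WriteRequest struct {
    ShardID string
    Epoch   uint64
    Key     []byte
    Value   []byte
}

// checkEpoch fences off deposed leaders. shardEpochs holds the highest epoch
// this storage node has ever observed for each shard.
func checkEpoch(shardEpochs map[string]uint64, req WriteRequest) error {
    if req.Epoch < shardEpochs[req.ShardID] {
        return errStaleEpoch
    }
    if req.Epoch > shardEpochs[req.ShardID] {
        shardEpochs[req.ShardID] = req.Epoch // remember the newer leader
    }
    return nil
}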
Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. Control plane failures are especially critical because they can affect the entire cluster.
| Failure | Impact | Mitigation | Recovery Time |
|---|---|---|---|
| CP leader crash | No new elections or config changes | Raft elects new CP leader | 1-5 seconds |
| CP minority down | No impact (quorum still available) | Replace failed nodes | N/A |
Handling Control Plane Failures:
// Data nodes can continue operating with cached metadata
// when control plane is unavailable
Graceful degradation
When the control plane is down: reads continue; writes continue until leases expire (typically 30-60s); no rebalancing or failover; no schema changes. The cluster is in a degraded but functional state.
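A sketch of that degraded mode on a data node (CachedShardMap and its fields are illustrative): reads keep using the cached shard map, while writes additionally require a lease that has not yet expired, which is what bounds the window.

import "time"

// CachedShardMap: a data node's local copy of control-plane metadata (illustrative).
type CachedShardMap struct {
    Shards      map[string]string    // shardID -> leaseholder node ID
    LeaseExpiry map[string]time.Time // shardID -> when this node's lease runs out
}

// canServeRead: stale metadata is acceptable for reads while the CP is unreachable.
func (c *CachedShardMap) canServeRead(shardID string) bool {
    _, ok := c.Shards[shardID]
    return ok
}

// canServeWrite: writes also need an unexpired lease, which limits the degraded
// window to roughly the lease TTL (the 30-60s above).
func (c *CachedShardMap) canServeWrite(shardID string, now time.Time) bool {
    expiry, ok := c.LeaseExpiry[shardID]
    return ok && now.Before(expiry)
}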
Disaster Recovery:
1. CP state backup: Periodically snapshot Raft state to external storage (e.g., S3); see the sketch after this list
2. Bootstrap from backup: If a majority of CP nodes is lost, restore from the latest snapshot
3. Data node rejoins: Nodes compare their state with the CP and catch up
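A sketch of the periodic backup step (SnapshotStore is an assumed interface standing in for S3 or any object store; the snapshot function is supplied by the Raft layer):

import (
    "context"
    "time"
)

// SnapshotStore abstracts wherever snapshots land, e.g. an object store (assumed interface).
type SnapshotStore interface {
    Put(ctx context.Context, name string, data []byte) error
}

// backupLoop periodically snapshots the control plane's Raft state so the CP
// can be bootstrapped again even if a majority of its nodes is lost.
func backupLoop(ctx context.Context, snapshot func() ([]byte, error), store SnapshotStore, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-ticker.C:
            data, err := snapshot() // serialized Raft state machine + membership config
            if err != nil {
                continue // log and retry on the next tick
            }
            _ = store.Put(ctx, "cp-snapshot-"+t.UTC().Format(time.RFC3339), data)
        }
    }
}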
Evolution and Scaling
What to say
This design works well for clusters up to a few thousand nodes. Let me discuss how it evolves for 10K+ nodes and multi-region deployments.
Evolution Path:
Stage 1: Single Control Plane (up to 1000 nodes)
- 3-5 CP nodes in a single region
- All metadata in a single Raft group
- Simple, easy to operate

Stage 2: Hierarchical Control Plane (1000-10000 nodes)
- Regional control planes for local operations
- Global control plane for cross-region coordination
- Metadata sharded by table or key range

Stage 3: Federated Control Plane (10000+ nodes)
- Multiple independent clusters
- Global catalog service for discovery
- Cross-cluster replication for DR
Multi-Region Control Plane
Scaling Challenges at 10K+ Nodes:
1. Heartbeat storm: 10K heartbeats/second overwhelms a single CP
- Solution: Hierarchical reporting, sampling, or gossip protocols (sketched after this list)
2. Metadata size: Millions of shards means GBs of metadata
- Solution: Shard the shard map itself, lazy loading
3. Failure detection latency: Checking 10K nodes takes time
- Solution: Parallel checks, regional aggregation
4. Consensus latency: Cross-region Raft adds 100s of ms
- Solution: Regional consensus for local ops, global for cross-region
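For the heartbeat-storm point, a sketch of the hierarchical reporting mentioned in item 1 (RegionalAggregator and HealthSummary are illustrative): nodes heartbeat a regional aggregator, and only a compact summary flows up to the global control plane.

import (
    "sync"
    "time"
)

// HealthSummary: what a region reports upward instead of raw heartbeats (illustrative).
type HealthSummary struct {
    Region    string
    Healthy   int
    Suspected []string // node IDs that missed recent heartbeats
}

// RegionalAggregator absorbs per-node heartbeats locally.
type RegionalAggregator struct {
    mu       sync.Mutex
    region   string
    lastSeen map[string]time.Time // nodeID -> last heartbeat
}

func NewRegionalAggregator(region string) *RegionalAggregator {
    return &RegionalAggregator{region: region, lastSeen: make(map[string]time.Time)}
}

func (a *RegionalAggregator) Heartbeat(nodeID string, now time.Time) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.lastSeen[nodeID] = now
}

// Summarize turns thousands of local heartbeats into one small upstream message.
func (a *RegionalAggregator) Summarize(now time.Time, suspectAfter time.Duration) HealthSummary {
    a.mu.Lock()
    defer a.mu.Unlock()
    s := HealthSummary{Region: a.region}
    for id, seen := range a.lastSeen {
        if now.Sub(seen) > suspectAfter {
            s.Suspected = append(s.Suspected, id)
        } else {
            s.Healthy++
        }
    }
    return s
}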
Alternative approach
If we needed to scale beyond 10K nodes, I would consider a gossip-based protocol (like Cassandra) for failure detection and use the control plane only for authoritative decisions. This trades consistency for scalability.
What I would do differently for...
Latency-critical systems: Use regional control planes with async replication to global. Accept that global view might be stale by seconds.
Cost-sensitive deployments: Co-locate control plane with data nodes (embedded mode). Fewer dedicated machines but more complex operations.
Multi-tenant: Separate control planes per tenant for isolation. Share data nodes with namespace separation.