Design Walkthrough
Problem Statement
The Question: Design a control plane for a distributed database that manages 1000+ nodes, handles automatic failover, coordinates schema changes, and maintains cluster metadata.
The control plane is responsible for:
- Cluster membership: Which nodes are alive, their roles and capabilities
- Data placement: Which node owns which data ranges/shards
- Leader election: Ensuring exactly one leader per shard/partition
- Schema management: Coordinating DDL across all nodes
- Configuration: Cluster-wide settings and policies
What to say first
Before designing, let me clarify the separation between the control plane and the data plane. The control plane manages metadata and coordination; the data plane handles the actual data reads and writes. They have very different consistency and performance requirements.
Hidden requirements interviewers are testing:
- Do you understand the CAP implications for metadata vs data?
- Can you prevent split-brain scenarios?
- How do you handle the thundering herd on leader failure?
- Can you reason about consensus protocols at scale?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many nodes in the cluster? How many shards/partitions? What is the expected metadata size?
Why this matters: Determines if metadata fits on a single node or needs sharding.
Typical answer: 1000-5000 nodes, 100K+ shards, metadata in the GB range.
Architecture impact: Need a hierarchical control plane or sharded metadata.
Question 2: Consistency Requirements
What happens if control plane is unavailable? Can data plane continue serving reads/writes?
Why this matters: Determines coupling between control plane and data plane.
Typical answer: Data plane should continue with cached metadata for some time.
Architecture impact: Need lease-based ownership with expiry, not hard locks.
Question 3: Failure Detection
How quickly must we detect node failures? What is acceptable failover time?
Why this matters: Faster detection risks more false positives.
Typical answer: Detect within 10s, fail over within 30s.
Architecture impact: Need tunable failure detectors, not just fixed heartbeat timeouts.
Question 4: Multi-Region
Single region or multi-region deployment? What are the latency constraints?
Why this matters: Cross-region consensus adds 100s of ms of latency.
Typical answer: Multi-region for disaster recovery; accept higher latency for control plane ops.
Architecture impact: Hierarchical control plane with regional leaders.
Stating assumptions
I will assume: 1000 nodes, 100K shards, multi-region deployment, data plane can operate with cached metadata for minutes, failover target under 30 seconds.
The Hard Part
Say this out loud
The hard part here is preventing split-brain scenarios where two nodes both believe they are the leader for the same data, while still achieving fast failover when nodes actually fail.
Why this is genuinely hard:
1. Split-Brain Prevention: If a network partition splits the cluster, both sides might elect leaders. Two leaders writing to the same data means corruption.
2. Failure Detection vs False Positives: Fast detection catches failures quickly but triggers false failovers during GC pauses or network blips.
3. Thundering Herd: When a leader fails, all followers might simultaneously try to become leader or query the control plane.
4. Consistency During Reconfiguration: Adding or removing nodes while serving traffic, without losing data.
Common mistake
Using simple heartbeats without quorum-based failure detection. A node isolated by network partition keeps serving stale data while a new leader is elected.
The fundamental tradeoff:
Faster failover detection increases risk of false positives. Slower failover detection increases unavailability window.
Solution: Use leases - a node owns data only while its lease is valid. Lease must expire before new leader can be elected. This bounds the window of potential split-brain.
Split-Brain Prevention with Leases
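A minimal sketch of the data-plane side of this rule (the Lease type, canServe, and clockSkewBound are illustrative names, not given in the prompt): the leaseholder re-checks its lease before serving and stops slightly before the nominal expiry to absorb clock skew.

import "time"

// Lease as cached by a data node (illustrative fields).
type Lease struct {
    ShardID    string
    HolderID   string
    Epoch      uint64
    Expiration time.Time
}

const clockSkewBound = 500 * time.Millisecond // safety margin for clock drift

// canServe reports whether this node may still act as leaseholder for the shard.
// It refuses slightly before the nominal expiry so an old leader stops serving
// before the control plane can possibly grant the lease to anyone else.
func canServe(l *Lease, nodeID string, now time.Time) bool {
    if l == nil || l.HolderID != nodeID {
        return false
    }
    return now.Before(l.Expiration.Add(-clockSkewBound))
}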
Scale and Access Patterns
Let me estimate the scale and understand access patterns for the control plane.
| Dimension | Value | Impact |
|---|---|---|
| Nodes | 1,000 - 5,000 | Membership state manageable on single consensus group |
| Shards/Ranges | 100,000+ | Need to shard metadata or use hierarchical approach |
What to say
Control plane operations are low throughput but require strong consistency. Data plane is high throughput with relaxed consistency. These different profiles suggest separating them architecturally.
Access Pattern Analysis:
- Reads dominate: Nodes read cluster state frequently, write rarely
- Locality matters: Each node mostly cares about the shards it owns
- Bursty writes: Leader failures cause a burst of metadata updates
- Strong consistency required: Stale metadata can cause data loss
Heartbeat traffic:
- 1000 nodes x 1 heartbeat/second = 1000 msg/s
- Each heartbeat ~200 bytes = 200 KB/s (trivial)
High-Level Architecture
Let me design a hierarchical control plane architecture.
What to say
I will use a hierarchical approach: a global control plane for cluster-wide decisions, and per-shard Raft groups for data leadership. This separates cluster management from data replication.
Control Plane Architecture
Component Responsibilities:
1. Global Control Plane (3-5 nodes)
- Cluster membership management
- Shard-to-node assignment (shard map)
- Schema versioning and distribution
- Global configuration
- Uses Raft/Paxos for consensus

2. Metadata Store
- Cluster state: node list, health status, capacity
- Shard map: which nodes own which ranges (see the sketch after this list)
- Schema: table definitions, indexes, version
- Stored in control plane nodes (replicated via Raft)

3. Per-Shard Raft Groups
- Each shard has its own Raft group (typically 3-5 replicas)
- Handles data replication and leader election for that shard
- Reports to the global control plane for coordination
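To make the shard map concrete, here is a sketch of a single entry; the type and field names (ShardDescriptor, LeaseholderID, Epoch) are assumptions for illustration.

// ShardDescriptor: one entry in the control plane's shard map (illustrative).
type ShardDescriptor struct {
    ShardID       string
    StartKey      []byte   // inclusive start of the key range
    EndKey        []byte   // exclusive end of the key range
    Replicas      []string // node IDs in this shard's Raft group
    LeaseholderID string   // node currently allowed to serve the range
    Epoch         uint64   // bumped on every leaseholder change
}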
Real-world reference
CockroachDB uses this pattern - a global meta range stores the shard map, and each data range has its own Raft group. TiDB has separate PD (Placement Driver) nodes for control plane.
Data Model and Storage
The control plane manages several types of metadata, each with different consistency requirements.
What to say
The control plane stores three categories of data: cluster topology (changes rarely), shard map (changes on rebalancing), and ephemeral state (heartbeats, locks). Each has different storage needs.
// Cluster membership (fields after node_id are illustrative)
message Node {
  string node_id = 1;
  string address = 2;
  int64 capacity_bytes = 3;
}
Storage Choices:
| Data Type | Storage | Consistency | Reason |
|---|---|---|---|
| Cluster membership | Raft-replicated | Strong | Must never disagree on who is in cluster |
| Shard map | Raft-replicated | Strong | Data loss if two nodes think they own same shard |
| Schema | Raft-replicated | Strong | Queries fail if schema version mismatch |
| Heartbeats | Leader memory | Eventual | High frequency, leader tracks locally |
| Leases | Raft-replicated | Strong | Critical for split-brain prevention |
// LeaseManager (sketch): runs on the control-plane leader and tracks shard leases.
type LeaseManager struct {
    raft   *RaftNode
    leases map[string]*Lease // shardID -> Lease
}
Consensus and Leader Election
The control plane uses consensus (Raft or Paxos) for strong consistency. Let me explain the leader election flow.
What to say
I will use Raft for the control plane consensus. It is easier to understand and implement than Paxos, with equivalent consistency guarantees. The key property is that Raft guarantees at most one leader per term, which is the foundation the leases and epochs below build on.
Raft Leader Election Flow
Two Levels of Leader Election:
1. Control Plane Leader (Raft)
- 3-5 control plane nodes form a Raft group
- One is elected leader and handles all writes
- Followers act as read replicas
- Election on leader failure (typically 1-5 seconds)

2. Shard Leader (Per-Shard Raft)
- Each shard has its own Raft group (3 replicas)
- The shard leader handles reads/writes for that shard
- The control plane tracks who the shard leaders are
- Election is independent of the control plane
// FailureDetector (sketch): classifies nodes by time since their last heartbeat.
type FailureDetector struct {
    suspicionThreshold time.Duration        // e.g., 5 seconds
    deadThreshold      time.Duration        // e.g., 30 seconds
    lastHeartbeat      map[string]time.Time // nodeID -> last heartbeat (illustrative field)
}
Critical detail
Always wait for the old lease to expire before granting a new one. This is the ONLY way to prevent split-brain. The old leader will stop serving once its lease expires, even if it cannot reach the control plane.
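A sketch of how the control plane can enforce that rule when granting a lease, reusing the LeaseManager and Lease types sketched above and assuming the RaftNode exposes a Propose method (synchronization and error handling elided):

import (
    "errors"
    "time"
)

var errLeaseStillHeld = errors.New("previous lease has not expired yet")

// Grant hands shardID's lease to newHolder, but only after any previous lease
// has expired. Epochs are strictly increasing so storage nodes can fence off a
// deposed leader's writes.
func (lm *LeaseManager) Grant(shardID, newHolder string, ttl time.Duration) (*Lease, error) {
    var nextEpoch uint64 = 1
    if old, ok := lm.leases[shardID]; ok {
        if time.Now().Before(old.Expiration) {
            // Never overlap leases: the caller must wait for expiry and retry.
            return nil, errLeaseStillHeld
        }
        nextEpoch = old.Epoch + 1
    }
    lease := &Lease{
        ShardID:    shardID,
        HolderID:   newHolder,
        Epoch:      nextEpoch,
        Expiration: time.Now().Add(ttl),
    }
    // Commit the grant through the control plane's Raft group before
    // acknowledging, so a CP failover cannot forget who holds the lease.
    if err := lm.raft.Propose(lease); err != nil {
        return nil, err
    }
    lm.leases[shardID] = lease
    return lease, nil
}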
Shard Management and Rebalancing
The control plane must handle shard splits, merges, and rebalancing as data grows or nodes are added/removed.
What to say
Shard management has three operations: split (when a shard gets too large), merge (when adjacent shards are small), and move (for rebalancing load). Each must maintain consistency during the transition.
Shard Split Process
Shard Split Algorithm:
1. Trigger: Shard exceeds a size threshold (e.g., 512 MB); see the sketch after this list
2. Find split point: Middle key, or based on access patterns
3. Atomic split: Leader creates two child ranges, replicates via Raft
4. Update metadata: Control plane updates the shard map
5. Client refresh: Clients learn about the new ranges on next access
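A sketch of the trigger and split-point step referenced above (ShardStats, MiddleKey, and the constant name are illustrative; the threshold matches the 512 MB example):

const splitThresholdBytes = 512 << 20 // 512 MB, as in the example above

// ShardStats: what a shard leader reports about its own range (illustrative).
type ShardStats struct {
    ShardID   string
    SizeBytes int64
    MiddleKey []byte // approximate median key, tracked by the storage engine
}

// maybeSplit decides whether a shard should split and at which key. The split
// itself is then executed atomically through the shard's Raft group, and the
// control plane updates the shard map afterwards.
func maybeSplit(s ShardStats) (splitKey []byte, ok bool) {
    if s.SizeBytes < splitThresholdBytes {
        return nil, false
    }
    return s.MiddleKey, true
}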
// Rebalancer (sketch): moves shards between nodes to even out load.
type Rebalancer struct {
    cp            *ControlPlane
    targetPerNode int // target shards per node
    maxConcurrent int // cap on simultaneous shard moves (illustrative field)
}
Real-world insight
CockroachDB and TiDB both limit concurrent rebalancing operations to avoid overwhelming the cluster. Typically only 1-3 shard moves happen simultaneously.
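A sketch of that throttling, building on the Rebalancer fields above; moveShard is a hypothetical helper, not a real API:

// rebalanceOnce plans at most (maxConcurrent - inFlight) shard moves per pass,
// so the cluster never runs more than a handful of data copies at once.
func (r *Rebalancer) rebalanceOnce(overloaded, underloaded []string, inFlight int) {
    budget := r.maxConcurrent - inFlight
    for i := 0; i < budget && i < len(overloaded) && i < len(underloaded); i++ {
        go r.moveShard(overloaded[i], underloaded[i])
    }
}

// moveShard (stub): in a real system this would add a replica on the target
// node, wait for it to catch up via Raft snapshot/log, then remove a replica
// from the source and update the shard map through the control plane.
func (r *Rebalancer) moveShard(source, target string) {
    // add-then-remove replica choreography elided
}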
Schema Changes (DDL)
Schema changes (CREATE TABLE, ADD COLUMN, CREATE INDEX) must be coordinated across all nodes without blocking reads/writes.
Say this out loud
Online schema changes are surprisingly hard. The challenge is that different nodes see the new schema at different times. A node with the old schema might reject data that is valid under the new schema.
The Problem:
Imagine adding a new column with a NOT NULL constraint:
- Node A gets the new schema at T=0 and starts requiring the column
- Node B still has the old schema at T=1 and sends data without the column
- Node A rejects it. Inconsistency!
Solution: Two-Phase Schema Change
Online Schema Change Protocol
type SchemaState int

const (
    StateDeleteOnly SchemaState = iota // new schema element visible to deletes only
    StateWriteOnly                     // writes must maintain it; reads still ignore it
    StatePublic                        // fully visible to reads and writes
)
Real-world reference
Google F1 and CockroachDB use this two-phase (or multi-phase) approach. Each phase ensures no node is more than one version ahead or behind any other.
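A sketch of the check behind that guarantee (function and parameter names are illustrative): the control plane activates the next schema version only once every node has acknowledged the current one, so no two nodes can ever be more than one version apart.

// canActivate reports whether the cluster may advance to nextVersion.
// Every node must already have acknowledged at least nextVersion-1.
func canActivate(nextVersion int64, ackedVersionByNode map[string]int64) bool {
    for _, acked := range ackedVersionByNode {
        if acked < nextVersion-1 {
            return false
        }
    }
    return true
}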
Consistency and Invariants
System Invariants
1. At most one leader per shard at any time.
2. No two nodes can believe they own overlapping key ranges.
3. The schema version on any node is within 1 of any other node.
Violation of any invariant means data corruption.
Why strong consistency is non-negotiable for control plane:
- Shard ownership: If two nodes think they own [A-M], they will both accept writes. On reconciliation, one set of writes is lost.
- Leader election: Two leaders means split-brain. Clients see inconsistent data.
- Schema: If node A has column X and node B does not, data written through A cannot be read through B.
| Invariant | Violation Consequence | Prevention Mechanism |
|---|---|---|
| Single leader per shard | Split-brain, data loss | Leases with epoch numbers |
| Non-overlapping ranges | Duplicate or lost data | Raft-replicated shard map |
| Schema version bounded | Query failures, data corruption | Two-phase schema changes |
| Cluster membership consistent | Phantom nodes, missed failures | Consensus-based membership |
// Every write includes the leader's epoch
// Storage nodes reject writes with old epochs
What to say
We use epochs (also called terms or generations) as a logical clock. Any operation with a stale epoch is rejected. This ensures that even if an old leader wakes up from a network partition, its writes will be rejected by storage nodes that have seen a higher epoch.
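A sketch of that check on the storage-node side (WriteRequest and shardEpochs are illustrative names): each node remembers the highest epoch it has seen per shard and rejects anything older.

import "errors"

var errStaleEpoch = errors.New("stale epoch: a newer leader exists for this shard")

// WriteRequest: the part of a data-plane write relevant to fencing (illustrative).
type WriteRequest struct {
    ShardID string
    Epoch   uint64
    Key     []byte
    Value   []byte
}

// checkEpoch fences off deposed leaders. shardEpochs holds the highest epoch
// this storage node has ever observed for each shard.
func checkEpoch(shardEpochs map[string]uint64, req WriteRequest) error {
    if req.Epoch < shardEpochs[req.ShardID] {
        return errStaleEpoch
    }
    if req.Epoch > shardEpochs[req.ShardID] {
        shardEpochs[req.ShardID] = req.Epoch // remember the newer leader
    }
    return nil
}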
Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. Control plane failures are especially critical because they can affect the entire cluster.
| Failure | Impact | Mitigation | Recovery Time |
|---|---|---|---|
| CP leader crash | No new elections or config changes | Raft elects new CP leader | 1-5 seconds |
| CP minority down | No impact (quorum still available) | Replace failed nodes | N/A |
Handling Control Plane Failures:
// Data nodes can continue operating with cached metadata
// when control plane is unavailable
Graceful degradation
When the control plane is down: reads continue; writes continue until leases expire (typically 30-60s); no rebalancing or failover; no schema changes. The cluster is in a degraded but functional state.
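A sketch of that degraded mode on a data node (CachedShardMap and its fields are illustrative): reads keep using the cached shard map, while writes additionally require a lease that has not yet expired, which is what bounds the window.

import "time"

// CachedShardMap: a data node's local copy of control-plane metadata (illustrative).
type CachedShardMap struct {
    Shards      map[string]string    // shardID -> leaseholder node ID
    LeaseExpiry map[string]time.Time // shardID -> when this node's lease runs out
}

// canServeRead: stale metadata is acceptable for reads while the CP is unreachable.
func (c *CachedShardMap) canServeRead(shardID string) bool {
    _, ok := c.Shards[shardID]
    return ok
}

// canServeWrite: writes also need an unexpired lease, which limits the degraded
// window to roughly the lease TTL (the 30-60s above).
func (c *CachedShardMap) canServeWrite(shardID string, now time.Time) bool {
    expiry, ok := c.LeaseExpiry[shardID]
    return ok && now.Before(expiry)
}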
Disaster Recovery:
1. CP state backup: Periodically snapshot Raft state to external storage (e.g., S3); see the sketch after this list
2. Bootstrap from backup: If a majority of CP nodes is lost, restore from the latest snapshot
3. Data node rejoins: Nodes compare their state with the CP and catch up
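A sketch of the periodic backup step (SnapshotStore is an assumed interface standing in for S3 or any object store; the snapshot function is supplied by the Raft layer):

import (
    "context"
    "time"
)

// SnapshotStore abstracts wherever snapshots land, e.g. an object store (assumed interface).
type SnapshotStore interface {
    Put(ctx context.Context, name string, data []byte) error
}

// backupLoop periodically snapshots the control plane's Raft state so the CP
// can be bootstrapped again even if a majority of its nodes is lost.
func backupLoop(ctx context.Context, snapshot func() ([]byte, error), store SnapshotStore, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-ticker.C:
            data, err := snapshot() // serialized Raft state machine + membership config
            if err != nil {
                continue // log and retry on the next tick
            }
            _ = store.Put(ctx, "cp-snapshot-"+t.UTC().Format(time.RFC3339), data)
        }
    }
}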
Evolution and Scaling
What to say
This design works well for clusters up to a few thousand nodes. Let me discuss how it evolves for 10K+ nodes and multi-region deployments.
Evolution Path:
Stage 1: Single Control Plane (up to 1000 nodes)
- 3-5 CP nodes in a single region
- All metadata in a single Raft group
- Simple, easy to operate

Stage 2: Hierarchical Control Plane (1000-10000 nodes)
- Regional control planes for local operations
- Global control plane for cross-region coordination
- Metadata sharded by table or key range

Stage 3: Federated Control Plane (10000+ nodes)
- Multiple independent clusters
- Global catalog service for discovery
- Cross-cluster replication for DR
Multi-Region Control Plane
Scaling Challenges at 10K+ Nodes:
1. Heartbeat storm: 10K heartbeats/second overwhelms a single CP
- Solution: Hierarchical reporting, sampling, or gossip protocols (sketched after this list)
2. Metadata size: Millions of shards means GBs of metadata
- Solution: Shard the shard map itself, lazy loading
3. Failure detection latency: Checking 10K nodes takes time
- Solution: Parallel checks, regional aggregation
4. Consensus latency: Cross-region Raft adds 100s of ms
- Solution: Regional consensus for local ops, global for cross-region
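For the heartbeat-storm point, a sketch of the hierarchical reporting mentioned in item 1 (RegionalAggregator and HealthSummary are illustrative): nodes heartbeat a regional aggregator, and only a compact summary flows up to the global control plane.

import (
    "sync"
    "time"
)

// HealthSummary: what a region reports upward instead of raw heartbeats (illustrative).
type HealthSummary struct {
    Region    string
    Healthy   int
    Suspected []string // node IDs that missed recent heartbeats
}

// RegionalAggregator absorbs per-node heartbeats locally.
type RegionalAggregator struct {
    mu       sync.Mutex
    region   string
    lastSeen map[string]time.Time // nodeID -> last heartbeat
}

func NewRegionalAggregator(region string) *RegionalAggregator {
    return &RegionalAggregator{region: region, lastSeen: make(map[string]time.Time)}
}

func (a *RegionalAggregator) Heartbeat(nodeID string, now time.Time) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.lastSeen[nodeID] = now
}

// Summarize turns thousands of local heartbeats into one small upstream message.
func (a *RegionalAggregator) Summarize(now time.Time, suspectAfter time.Duration) HealthSummary {
    a.mu.Lock()
    defer a.mu.Unlock()
    s := HealthSummary{Region: a.region}
    for id, seen := range a.lastSeen {
        if now.Sub(seen) > suspectAfter {
            s.Suspected = append(s.Suspected, id)
        } else {
            s.Healthy++
        }
    }
    return s
}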
Alternative approach
If we needed to scale beyond 10K nodes, I would consider a gossip-based protocol (like Cassandra) for failure detection and use the control plane only for authoritative decisions. This trades consistency for scalability.
What I would do differently for...
Latency-critical systems: Use regional control planes with async replication to global. Accept that global view might be stale by seconds.
Cost-sensitive deployments: Co-locate control plane with data nodes (embedded mode). Fewer dedicated machines but more complex operations.
Multi-tenant: Separate control planes per tenant for isolation. Share data nodes with namespace separation.