Design Walkthrough
Problem Statement
The Question: Design a distributed command and control system that can manage 100,000+ nodes across multiple regions with reliable command delivery and state reporting.
This system pattern is essential for:
- Infrastructure orchestration - Kubernetes, Nomad, cloud control planes
- Configuration management - Puppet, Chef, Ansible at scale
- Fleet management - Managing server fleets, IoT devices, edge nodes
- Deployment systems - Rolling out changes across distributed infrastructure
What to say first
Before I design, let me clarify the requirements. I want to understand the scale, command types, consistency needs, and security requirements for this control plane.
Hidden requirements interviewers test:
- Can you design for network unreliability?
- Do you understand the CAP tradeoffs for control planes?
- How do you handle partial failures?
- Can you reason about security in distributed systems?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each shapes your architecture significantly.
Question 1: Scale and Distribution
How many nodes do we need to manage? Are they globally distributed or regional? What is the network topology?
- Why this matters: Determines hierarchy depth and communication patterns.
- Typical answer: 100K nodes across 5 regions, mix of cloud and edge.
- Architecture impact: Need hierarchical control with regional aggregation.
Question 2: Command Types
What types of commands? Real-time (immediate execution) vs batch? Are commands idempotent?
- Why this matters: Determines delivery guarantees and retry logic.
- Typical answer: Mix of immediate (security patches) and scheduled (config updates).
- Architecture impact: Need priority queues and idempotent command design.
Question 3: Consistency Requirements
Do all nodes need to execute commands simultaneously? What happens if some nodes are offline?
- Why this matters: Determines the consistency model.
- Typical answer: Eventual execution OK, but need tracking of which nodes completed.
- Architecture impact: At-least-once delivery with acknowledgment tracking.
Question 4: Security Requirements
What authentication and authorization model? Do we need audit logging? Encryption requirements?
- Why this matters: Control planes are high-value targets.
- Typical answer: Mutual TLS, RBAC, complete audit trail.
- Architecture impact: PKI infrastructure, authorization service, audit logging.
Stating assumptions
I will assume: 100K nodes across 5 regions, mix of immediate and batch commands, eventual consistency acceptable, strong security requirements with mTLS and RBAC.
The Hard Part
Say this out loud
The hard part here is ensuring reliable command delivery and execution tracking across an unreliable network where nodes can be offline, partitioned, or fail at any time.
Why this is genuinely hard:
1. Unreliable Network: Nodes go offline, network partitions happen, messages get lost. How do you guarantee a command eventually reaches every target?
2. Partial Failures: A command reaches 90% of nodes, then the controller crashes. How do you resume? How do you avoid re-executing on nodes that already completed?
3. State Consistency: 100K nodes each report state. How do you maintain a consistent view without overwhelming the central controller?
4. Security Surface: Every communication channel is an attack vector. A compromised node should not be able to issue commands to others.
Common mistake
Candidates often design a simple request-response system. This fails when nodes are offline. You need a persistent command queue with acknowledgment tracking.
The fundamental insight:
Push Model: Controller pushes to nodes
- Problem: What if the node is offline?

Pull Model: Nodes pull from the controller
- Problem: Polling overhead at scale

Hybrid: Nodes pull + push notifications
- Best of both: reliable and responsive

Most production systems use pull-based delivery with push notifications for urgency.
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Managed Nodes | 100,000 | Need hierarchical architecture |
| Regions | 5 | Regional controllers for locality |
What to say
This is not a high-throughput system in terms of commands, but it is high-fanout. The challenge is reliable delivery to 100K endpoints, not raw throughput.
Access Pattern Analysis:
- Command flow: One-to-many broadcast (1 command to N nodes)
- State flow: Many-to-one aggregation (N nodes reporting to 1 controller)
- Asymmetric: More state flowing up than commands flowing down
- Bursty commands: Mostly quiet, then bursts during deployments
- Continuous state: Constant stream of heartbeats and status
State aggregation load:
- 100K nodes x 1 update/30s = 3,333 updates/second
- Each update ~1KB = 3.3 MB/second
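A quick back-of-envelope check of those numbers, assuming roughly 1 KB per status update:

```python
# Back-of-envelope check of the state aggregation load estimated above.
nodes = 100_000
update_interval_s = 30
update_size_bytes = 1_000  # ~1 KB per update (assumption from the estimate)

updates_per_second = nodes / update_interval_s                              # ~3,333
megabytes_per_second = updates_per_second * update_size_bytes / 1_000_000   # ~3.3 MB/s
print(f"{updates_per_second:,.0f} updates/s, {megabytes_per_second:.1f} MB/s")
```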
High-Level Architecture

Let me design a hierarchical control plane with reliable command delivery.
What to say
I will use a hierarchical architecture with regional controllers. This provides locality, fault isolation, and scalable fanout. We scale by adding regions, not by scaling the central controller.
Distributed Control Architecture
Component Responsibilities:
1. API Gateway: Authentication, authorization, rate limiting
2. Command Service: Accepts commands, validates, persists to queue
   - Assigns unique command ID
   - Determines target nodes (all, by tag, by region)
   - Tracks command lifecycle
3. Command Queue: Durable message queue (Kafka/SQS)
   - Persists commands until acknowledged
   - Supports replay for failed deliveries
   - Partitioned by region
4. Regional Controller: Manages nodes in one region
   - Receives commands from central queue
   - Fans out to local agents
   - Aggregates state from agents
   - Reports summary to central
5. Node Agent: Runs on each managed node
   - Pulls commands from regional controller
   - Executes commands locally
   - Reports status and heartbeat
   - Handles local retries
Real-world reference
Kubernetes uses a similar pattern: the API server as the central point, etcd for state, controllers for reconciliation. Puppet uses pull-based agents with catalog compilation. AWS Systems Manager uses regional endpoints with central orchestration.
Data Model and Storage
The data model must support command tracking, node state, and audit logging.
What to say
We need three main data stores: a durable queue for commands, a database for state, and a time-series store for metrics. The queue is the source of truth for pending commands.
```sql
-- Managed nodes registry (a fuller sketch follows below)
CREATE TABLE nodes (
    node_id UUID PRIMARY KEY
);
```
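A fuller version of the registry, plus a per-node tracking table, might look roughly like the following; every column beyond `node_id`, and the `command_executions` table itself, are assumptions added for illustration rather than the original schema.

```sql
-- Node registry (columns other than node_id are assumed for illustration)
CREATE TABLE nodes (
    node_id        UUID PRIMARY KEY,
    region         TEXT NOT NULL,
    tags           JSONB NOT NULL DEFAULT '{}',     -- used for command targeting
    status         TEXT NOT NULL DEFAULT 'unknown',
    last_heartbeat TIMESTAMPTZ                      -- drives offline detection
);

-- Per-node command tracking (assumed schema; supports acknowledgment and replay)
CREATE TABLE command_executions (
    command_id   UUID NOT NULL,
    node_id      UUID NOT NULL REFERENCES nodes(node_id),
    status       TEXT NOT NULL,                     -- pending / delivered / succeeded / failed
    sequence_num BIGINT,                            -- per-node ordering
    completed_at TIMESTAMPTZ,
    PRIMARY KEY (command_id, node_id)
);
```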
Command Queue Structure (Kafka/SQS):

```json
{
  "command_id": "uuid-here",
  "type": "execute_script"
}
```
State Storage Strategy:

- Hot state (Redis): Current node status, active commands
- Warm state (PostgreSQL): Command history, node registry
- Cold state (S3/Object Storage): Audit logs, old executions
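As a small illustration of the hot tier, a regional controller might cache each node's latest status in Redis with a short TTL; the key layout and the 90-second TTL are assumptions, and the warm-tier write to PostgreSQL is only indicated in a comment.

```python
# Hot-state write sketch (assumed key layout "node:<id>"), using redis-py.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_node_status(node_id: str, status: dict) -> None:
    key = f"node:{node_id}"
    r.set(key, json.dumps({**status, "updated_at": time.time()}))
    r.expire(key, 90)  # three missed 30s heartbeats => key expires => node treated as offline
    # Warm tier: append the same update to PostgreSQL (command/status history) asynchronously.
```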
Important detail
Commands must be signed. Each command includes a cryptographic signature that agents verify before execution. This prevents command injection if the queue is compromised.
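A minimal sketch of that signing step, assuming Ed25519 keys via the `cryptography` package and a simplified canonical encoding of the command:

```python
# Command signing sketch; the canonical byte encoding is an assumption for illustration.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()   # held only by the command service
verify_key = signing_key.public_key()        # distributed to every agent

canonical = b"command_id=uuid-here;type=execute_script;expires_at=1700000000"
signature = signing_key.sign(canonical)

# On the agent, before execution:
try:
    verify_key.verify(signature, canonical)  # raises InvalidSignature if tampered with
except InvalidSignature:
    raise RuntimeError("refusing to execute: bad command signature")
```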
Command Delivery Deep Dive
Reliable command delivery is the core challenge. Let me walk through the delivery mechanism.
Command Delivery Flow
Pull-based Delivery with Push Notification:
Agents primarily pull commands, but we add push for urgency:
```python
class ControlAgent:
    def __init__(self, node_id: str, controller_url: str):
        self.node_id = node_id
```
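To make the hybrid model concrete, here is a minimal sketch of the agent's pull loop, assuming a long-poll HTTP API on the regional controller; the endpoint paths, payload fields, and the `executor` object are illustrative assumptions. A push notification would simply wake this loop early rather than replace it.

```python
# Pull-first agent loop (sketch). Endpoints and field names are assumptions.
import time
import requests  # third-party HTTP client

class ControlAgent:
    def __init__(self, node_id: str, controller_url: str):
        self.node_id = node_id
        self.controller_url = controller_url

    def poll_once(self, wait_seconds: int = 30) -> list:
        """Long-poll the regional controller for pending commands."""
        resp = requests.get(
            f"{self.controller_url}/v1/nodes/{self.node_id}/commands",
            params={"wait": wait_seconds},
            timeout=wait_seconds + 5,
        )
        resp.raise_for_status()
        return resp.json().get("commands", [])

    def acknowledge(self, command_id: str, status: str) -> None:
        """Report the outcome so the controller can track completion per node."""
        requests.post(
            f"{self.controller_url}/v1/nodes/{self.node_id}/commands/{command_id}/ack",
            json={"status": status},
            timeout=10,
        )

    def run_forever(self, executor) -> None:
        backoff = 1
        while True:
            try:
                for cmd in self.poll_once():
                    status = executor.execute(cmd)   # execution should be idempotent (next section)
                    self.acknowledge(cmd["command_id"], status)
                backoff = 1
            except requests.RequestException:
                time.sleep(backoff)                  # controller unreachable: back off and retry
                backoff = min(backoff * 2, 60)
```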
Idempotency and Exactly-Once Execution:

```python
class IdempotentExecutor:
    def __init__(self):
        self.executed_commands = {}  # Persisted to disk
```
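A minimal sketch of that idea, assuming one durable marker file per command ID under a local state directory; the paths and the `run_fn` callback are illustrative.

```python
# Idempotent execution sketch: record the command ID durably before acking.
import os

class IdempotentExecutor:
    def __init__(self, state_dir: str = "/var/lib/agent/executed"):
        self.state_dir = state_dir
        os.makedirs(state_dir, exist_ok=True)

    def _marker_path(self, command_id: str) -> str:
        return os.path.join(self.state_dir, command_id)

    def execute(self, command: dict, run_fn) -> str:
        command_id = command["command_id"]
        if os.path.exists(self._marker_path(command_id)):
            # Already ran this command (e.g. crashed before the ack was sent):
            # report success again instead of executing twice.
            return "already_executed"
        result = run_fn(command)
        # Persist the marker *before* reporting success, so a crash between
        # execution and ack can never cause a re-run after restart.
        with open(self._marker_path(command_id), "w") as f:
            f.write(str(result))
            f.flush()
            os.fsync(f.fileno())
        return "succeeded"
```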
Key insight

The agent persists executed command IDs locally before reporting success. This ensures that if the agent crashes after execution but before reporting, it will not re-execute on restart.
Consistency and Invariants
System Invariants
1. Commands must never execute without valid authentication and authorization.
2. Every command execution must be logged in the audit trail.
3. Commands must not execute after their expiry time.
Consistency Model:
We use eventual consistency with these guarantees:
| Guarantee | How We Achieve It | Why It Matters |
|---|---|---|
| At-least-once delivery | Persistent queue + acks | Commands are not lost |
| Exactly-once execution | Idempotency at agent | No double execution |
| Ordered per node | Sequence numbers | Dependencies work |
| Eventual completion | Retry with backoff | Offline nodes catch up |
| Audit completeness | Write-ahead logging | Complete history |
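For the per-node ordering guarantee, one way the agent can enforce it is to track the last applied sequence number and hold back anything that arrives out of order; the `seq` field name is an assumption.

```python
# Per-node ordering sketch: apply commands strictly in sequence-number order.
class OrderedApplier:
    def __init__(self):
        self.last_applied_seq = 0
        self.held_back = {}  # seq -> command, for out-of-order arrivals

    def submit(self, command: dict, apply_fn) -> None:
        self.held_back[command["seq"]] = command
        # Drain every command whose predecessor has already been applied.
        while self.last_applied_seq + 1 in self.held_back:
            apply_fn(self.held_back.pop(self.last_applied_seq + 1))
            self.last_applied_seq += 1
```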
Business impact mapping
If a command takes 5 minutes to reach all nodes instead of 30 seconds, operations are slower but safe. If a command executes twice or skips authorization, we have a security incident. We optimize for safety over speed.
Security Invariants:
```python
class CommandVerifier:
    def verify_command(self, cmd: Command) -> bool:
        # 1. Check expiry
        ...  # full verification sketch below
```
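A fuller sketch of the verifier, assuming Ed25519 signatures (matching the signing sketch earlier), a simple issuer-to-permissions map for RBAC, and an illustrative `Command` shape:

```python
# Command verification sketch: expiry, signature, then authorization.
import time
from dataclasses import dataclass

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

@dataclass
class Command:
    command_id: str
    issuer: str
    action: str
    expires_at: float   # unix timestamp
    signature: bytes    # signature over canonical_bytes()

    def canonical_bytes(self) -> bytes:
        # Assumed canonical encoding; a real system needs a stable, versioned format.
        return f"{self.command_id}|{self.issuer}|{self.action}|{self.expires_at}".encode()

class CommandVerifier:
    def __init__(self, issuer_keys: dict, issuer_permissions: dict):
        self.issuer_keys = issuer_keys                # issuer -> Ed25519PublicKey
        self.issuer_permissions = issuer_permissions  # issuer -> set of allowed actions

    def verify_command(self, cmd: Command) -> bool:
        # 1. Check expiry: never execute stale commands.
        if time.time() > cmd.expires_at:
            return False
        # 2. Verify the signature against the issuer's registered public key.
        key = self.issuer_keys.get(cmd.issuer)
        if key is None:
            return False
        try:
            key.verify(cmd.signature, cmd.canonical_bytes())
        except InvalidSignature:
            return False
        # 3. Authorization (RBAC): is this issuer allowed to run this action?
        return cmd.action in self.issuer_permissions.get(cmd.issuer, set())
```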
Failure Modes and Resilience

Proactively discuss failures
Let me walk through failure modes. Control planes must be highly resilient since they manage critical infrastructure.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Central controller down | Cannot issue new commands | Multi-AZ deployment, automatic failover | Standby promotes, queue replays |
| Regional controller down | Region cannot receive commands | Multiple controllers per region, load balanced | Traffic shifts to healthy controller |
Handling Offline Nodes:
```python
class RegionalController:
    async def handle_node_reconnect(self, node_id: str, last_command_id: str):
        # Find all commands this node missed
        ...  # full catch-up sketch below
```
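A sketch of the catch-up path, assuming a durable per-region command log with a `commands_since` lookup and a `transport.deliver` call that returns an acknowledgment; both interfaces are assumptions for illustration.

```python
# Offline-node catch-up sketch: replay missed commands in order on reconnect.
class RegionalController:
    def __init__(self, command_log, transport):
        self.command_log = command_log  # durable, ordered log of commands for this region
        self.transport = transport      # delivers one command to an agent, returns True on ack

    async def handle_node_reconnect(self, node_id: str, last_command_id: str) -> None:
        # Everything issued after the node's last acknowledged command, oldest first.
        missed = self.command_log.commands_since(last_command_id, node_id)
        for cmd in missed:
            if cmd.get("expired", False):
                continue                         # never replay stale commands
            delivered = await self.transport.deliver(node_id, cmd)
            if not delivered:
                break                            # node dropped again; resume on next reconnect
```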
Split Brain Prevention:

```python
from kazoo.client import KazooClient

class RegionalControllerCluster:
    def __init__(self, region: str, zk_client: KazooClient):
        self.region = region
```
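A sketch of how that cluster could elect a single active controller per region using ZooKeeper via Kazoo; the znode path, instance ID, and `serve_fn` callback are assumptions for illustration.

```python
# Leader election sketch: only the elected instance fans out commands for the region.
from kazoo.client import KazooClient

class RegionalControllerCluster:
    def __init__(self, region: str, zk_client: KazooClient, instance_id: str):
        self.region = region
        self.zk = zk_client
        self.instance_id = instance_id
        self.is_leader = False

    def run(self, serve_fn) -> None:
        """Block in the election; serve_fn runs only while this instance holds leadership,
        so two controllers can never both act as the region's command fan-out point."""
        election = self.zk.Election(f"/controllers/{self.region}/leader", self.instance_id)

        def _as_leader():
            self.is_leader = True
            try:
                serve_fn()
            finally:
                self.is_leader = False

        election.run(_as_leader)  # blocks until elected; leadership releases when _as_leader returns

# Usage (illustrative):
# zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
# zk.start()
# RegionalControllerCluster("us-east-1", zk, "controller-a").run(serve_requests)
```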
Graceful degradation

When the central control plane is unreachable, regional controllers continue operating with cached policies. Nodes continue running their last known configuration. New commands queue until connectivity is restored.
Evolution and Scaling
What to say
This design works well up to 100K nodes across 5 regions. Let me discuss how it evolves for 1M+ nodes and global deployment.
Evolution Path:
Stage 1: Single Region (up to 10K nodes)
- Single controller, single queue
- Direct agent connections
- Simple but limited

Stage 2: Multi-Region (up to 100K nodes)
- Regional controllers as designed
- Hierarchical state aggregation
- Regional autonomy during partitions

Stage 3: Global Scale (1M+ nodes)
- Add another hierarchy level (zone controllers)
- Edge caching of commands
- Gossip protocol for state
Scaled Architecture
Advanced Features for Scale:
| Feature | Purpose | When to Add |
|---|---|---|
| Command batching | Reduce message overhead | More than 1K commands/minute |
| Delta state sync | Reduce bandwidth | State updates exceed 10MB/s |
| Gossip protocol | Decentralized state | Need faster local propagation |
| Command caching at edge | Reduce latency | Global deployment |
| Predictive pre-positioning | Anticipate commands | Repeated deployment patterns |
Alternative approaches
If we needed stronger consistency, we could use Raft consensus for the control plane (like etcd in Kubernetes). If we needed lower latency, we could use a mesh network where agents share commands peer-to-peer.
What I would do differently for...
IoT devices (millions of constrained nodes): Use MQTT with QoS levels, smaller message formats (protobuf), and edge gateways that aggregate.
Security-critical commands: Add multi-party approval, time-locked execution, and hardware attestation before execution.
Real-time requirements: Switch to push-only with persistent connections (gRPC streaming), accept higher infrastructure cost.