Design Walkthrough
Problem Statement
The Question: Design a distributed command and control system that can manage 100,000+ nodes across multiple regions with reliable command delivery and state reporting.
This system pattern is essential for:
- Infrastructure orchestration - Kubernetes, Nomad, cloud control planes
- Configuration management - Puppet, Chef, Ansible at scale
- Fleet management - Managing server fleets, IoT devices, edge nodes
- Deployment systems - Rolling out changes across distributed infrastructure
What to say first
Before I design, let me clarify the requirements. I want to understand the scale, command types, consistency needs, and security requirements for this control plane.
Hidden requirements interviewers test:
- Can you design for network unreliability?
- Do you understand the CAP tradeoffs for control planes?
- How do you handle partial failures?
- Can you reason about security in distributed systems?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each shapes your architecture significantly.
Question 1: Scale and Distribution
How many nodes do we need to manage? Are they globally distributed or regional? What is the network topology?
- Why this matters: Determines hierarchy depth and communication patterns.
- Typical answer: 100K nodes across 5 regions, mix of cloud and edge.
- Architecture impact: Need hierarchical control with regional aggregation.
Question 2: Command Types
What types of commands? Real-time (immediate execution) vs batch? Are commands idempotent?
- Why this matters: Determines delivery guarantees and retry logic.
- Typical answer: Mix of immediate (security patches) and scheduled (config updates).
- Architecture impact: Need priority queues and idempotent command design.
Question 3: Consistency Requirements
Do all nodes need to execute commands simultaneously? What happens if some nodes are offline?
- Why this matters: Determines the consistency model.
- Typical answer: Eventual execution OK, but need tracking of which nodes completed.
- Architecture impact: At-least-once delivery with acknowledgment tracking.
Question 4: Security Requirements
What authentication and authorization model? Do we need audit logging? Encryption requirements?
- Why this matters: Control planes are high-value targets.
- Typical answer: Mutual TLS, RBAC, complete audit trail.
- Architecture impact: PKI infrastructure, authorization service, audit logging.
Stating assumptions
I will assume: 100K nodes across 5 regions, mix of immediate and batch commands, eventual consistency acceptable, strong security requirements with mTLS and RBAC.
The Hard Part
Say this out loud
The hard part here is ensuring reliable command delivery and execution tracking across an unreliable network where nodes can be offline, partitioned, or fail at any time.
Why this is genuinely hard:
1. Unreliable Network: Nodes go offline, network partitions happen, messages get lost. How do you guarantee a command eventually reaches every target?
2. Partial Failures: A command reaches 90% of nodes, then the controller crashes. How do you resume? How do you avoid re-executing on nodes that already completed?
3. State Consistency: 100K nodes each report state. How do you maintain a consistent view without overwhelming the central controller?
4. Security Surface: Every communication channel is an attack vector. A compromised node should not be able to issue commands to others.
Common mistake
Candidates often design a simple request-response system. This fails when nodes are offline. You need a persistent command queue with acknowledgment tracking.
The fundamental insight:
Push Model: Controller pushes to nodes
- Problem: What if the node is offline?

Pull Model: Nodes pull from the controller
- Problem: Polling overhead at scale

Hybrid: Nodes pull + push notifications
- Best of both: reliable and responsive

Most production systems use pull-based delivery with push notifications for urgency.
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Managed Nodes | 100,000 | Need hierarchical architecture |
| Regions | 5 | Regional controllers for locality |
What to say
This is not a high-throughput system in terms of commands, but it is high-fanout. The challenge is reliable delivery to 100K endpoints, not raw throughput.
Access Pattern Analysis:
- Command flow: One-to-many broadcast (1 command to N nodes)
- State flow: Many-to-one aggregation (N nodes reporting to 1 controller)
- Asymmetric: More state flowing up than commands flowing down
- Bursty commands: Mostly quiet, then bursts during deployments
- Continuous state: Constant stream of heartbeats and status
State aggregation load:
- 100K nodes x 1 update/30s = 3,333 updates/second
- Each update ~1KB = 3.3 MB/second
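A quick back-of-envelope check of those numbers, assuming roughly 1 KB per status update:

```python
# Back-of-envelope check of the state aggregation load estimated above.
nodes = 100_000
update_interval_s = 30
update_size_bytes = 1_000  # ~1 KB per update (assumption from the estimate)

updates_per_second = nodes / update_interval_s                              # ~3,333
megabytes_per_second = updates_per_second * update_size_bytes / 1_000_000   # ~3.3 MB/s
print(f"{updates_per_second:,.0f} updates/s, {megabytes_per_second:.1f} MB/s")
```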
High-Level Architecture

Let me design a hierarchical control plane with reliable command delivery.
What to say
I will use a hierarchical architecture with regional controllers. This provides locality, fault isolation, and scalable fanout. We scale by adding regions, not by scaling the central controller.
Distributed Control Architecture
Component Responsibilities:
1. API Gateway: Authentication, authorization, rate limiting
2. Command Service: Accepts commands, validates, persists to queue
   - Assigns unique command ID
   - Determines target nodes (all, by tag, by region)
   - Tracks command lifecycle
3. Command Queue: Durable message queue (Kafka/SQS)
   - Persists commands until acknowledged
   - Supports replay for failed deliveries
   - Partitioned by region
4. Regional Controller: Manages nodes in one region
   - Receives commands from central queue
   - Fans out to local agents
   - Aggregates state from agents
   - Reports summary to central
5. Node Agent: Runs on each managed node
   - Pulls commands from regional controller
   - Executes commands locally
   - Reports status and heartbeat
   - Handles local retries
Real-world reference
Kubernetes uses a similar pattern: the API server as the central point, etcd for state, controllers for reconciliation. Puppet uses pull-based agents with catalog compilation. AWS Systems Manager uses regional endpoints with central orchestration.
Data Model and Storage
The data model must support command tracking, node state, and audit logging.
What to say
We need three main data stores: a durable queue for commands, a database for state, and a time-series store for metrics. The queue is the source of truth for pending commands.
```sql
-- Managed nodes registry (a fuller sketch follows below)
CREATE TABLE nodes (
    node_id UUID PRIMARY KEY
);
```
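A fuller version of the registry, plus a per-node tracking table, might look roughly like the following; every column beyond `node_id`, and the `command_executions` table itself, are assumptions added for illustration rather than the original schema.

```sql
-- Node registry (columns other than node_id are assumed for illustration)
CREATE TABLE nodes (
    node_id        UUID PRIMARY KEY,
    region         TEXT NOT NULL,
    tags           JSONB NOT NULL DEFAULT '{}',     -- used for command targeting
    status         TEXT NOT NULL DEFAULT 'unknown',
    last_heartbeat TIMESTAMPTZ                      -- drives offline detection
);

-- Per-node command tracking (assumed schema; supports acknowledgment and replay)
CREATE TABLE command_executions (
    command_id   UUID NOT NULL,
    node_id      UUID NOT NULL REFERENCES nodes(node_id),
    status       TEXT NOT NULL,                     -- pending / delivered / succeeded / failed
    sequence_num BIGINT,                            -- per-node ordering
    completed_at TIMESTAMPTZ,
    PRIMARY KEY (command_id, node_id)
);
```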
Command Queue Structure (Kafka/SQS):

```json
{
  "command_id": "uuid-here",
  "type": "execute_script"
}
```
State Storage Strategy:

- Hot state (Redis): Current node status, active commands
- Warm state (PostgreSQL): Command history, node registry
- Cold state (S3/Object Storage): Audit logs, old executions
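As a small illustration of the hot tier, a regional controller might cache each node's latest status in Redis with a short TTL; the key layout and the 90-second TTL are assumptions, and the warm-tier write to PostgreSQL is only indicated in a comment.

```python
# Hot-state write sketch (assumed key layout "node:<id>"), using redis-py.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_node_status(node_id: str, status: dict) -> None:
    key = f"node:{node_id}"
    r.set(key, json.dumps({**status, "updated_at": time.time()}))
    r.expire(key, 90)  # three missed 30s heartbeats => key expires => node treated as offline
    # Warm tier: append the same update to PostgreSQL (command/status history) asynchronously.
```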
Important detail
Commands must be signed. Each command includes a cryptographic signature that agents verify before execution. This prevents command injection if the queue is compromised.
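A minimal sketch of that signing step, assuming Ed25519 keys via the `cryptography` package and a simplified canonical encoding of the command:

```python
# Command signing sketch; the canonical byte encoding is an assumption for illustration.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()   # held only by the command service
verify_key = signing_key.public_key()        # distributed to every agent

canonical = b"command_id=uuid-here;type=execute_script;expires_at=1700000000"
signature = signing_key.sign(canonical)

# On the agent, before execution:
try:
    verify_key.verify(signature, canonical)  # raises InvalidSignature if tampered with
except InvalidSignature:
    raise RuntimeError("refusing to execute: bad command signature")
```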
Command Delivery Deep Dive
Reliable command delivery is the core challenge. Let me walk through the delivery mechanism.
Command Delivery Flow
Pull-based Delivery with Push Notification:
Agents primarily pull commands, but we add push for urgency:
```python
class ControlAgent:
    def __init__(self, node_id: str, controller_url: str):
        self.node_id = node_id
```
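To make the hybrid model concrete, here is a minimal sketch of the agent's pull loop, assuming a long-poll HTTP API on the regional controller; the endpoint paths, payload fields, and the `executor` object are illustrative assumptions. A push notification would simply wake this loop early rather than replace it.

```python
# Pull-first agent loop (sketch). Endpoints and field names are assumptions.
import time
import requests  # third-party HTTP client

class ControlAgent:
    def __init__(self, node_id: str, controller_url: str):
        self.node_id = node_id
        self.controller_url = controller_url

    def poll_once(self, wait_seconds: int = 30) -> list:
        """Long-poll the regional controller for pending commands."""
        resp = requests.get(
            f"{self.controller_url}/v1/nodes/{self.node_id}/commands",
            params={"wait": wait_seconds},
            timeout=wait_seconds + 5,
        )
        resp.raise_for_status()
        return resp.json().get("commands", [])

    def acknowledge(self, command_id: str, status: str) -> None:
        """Report the outcome so the controller can track completion per node."""
        requests.post(
            f"{self.controller_url}/v1/nodes/{self.node_id}/commands/{command_id}/ack",
            json={"status": status},
            timeout=10,
        )

    def run_forever(self, executor) -> None:
        backoff = 1
        while True:
            try:
                for cmd in self.poll_once():
                    status = executor.execute(cmd)   # execution should be idempotent (next section)
                    self.acknowledge(cmd["command_id"], status)
                backoff = 1
            except requests.RequestException:
                time.sleep(backoff)                  # controller unreachable: back off and retry
                backoff = min(backoff * 2, 60)
```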
Idempotency and Exactly-Once Execution:

```python
class IdempotentExecutor:
    def __init__(self):
        self.executed_commands = {}  # Persisted to disk
```
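A minimal sketch of that idea, assuming one durable marker file per command ID under a local state directory; the paths and the `run_fn` callback are illustrative.

```python
# Idempotent execution sketch: record the command ID durably before acking.
import os

class IdempotentExecutor:
    def __init__(self, state_dir: str = "/var/lib/agent/executed"):
        self.state_dir = state_dir
        os.makedirs(state_dir, exist_ok=True)

    def _marker_path(self, command_id: str) -> str:
        return os.path.join(self.state_dir, command_id)

    def execute(self, command: dict, run_fn) -> str:
        command_id = command["command_id"]
        if os.path.exists(self._marker_path(command_id)):
            # Already ran this command (e.g. crashed before the ack was sent):
            # report success again instead of executing twice.
            return "already_executed"
        result = run_fn(command)
        # Persist the marker *before* reporting success, so a crash between
        # execution and ack can never cause a re-run after restart.
        with open(self._marker_path(command_id), "w") as f:
            f.write(str(result))
            f.flush()
            os.fsync(f.fileno())
        return "succeeded"
```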
Key insight

The agent persists executed command IDs locally before reporting success. This ensures that if the agent crashes after execution but before reporting, it will not re-execute on restart.
Consistency and Invariants
System Invariants
1. Commands must never execute without valid authentication and authorization.
2. Every command execution must be logged in the audit trail.
3. Commands must not execute after their expiry time.
Consistency Model:
We use eventual consistency with these guarantees:
| Guarantee | How We Achieve It | Why It Matters |
|---|---|---|
| At-least-once delivery | Persistent queue + acks | Commands are not lost |
| Exactly-once execution | Idempotency at agent | No double execution |
| Ordered per node | Sequence numbers | Dependencies work |
| Eventual completion | Retry with backoff | Offline nodes catch up |
| Audit completeness | Write-ahead logging | Complete history |
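For the per-node ordering guarantee, one way the agent can enforce it is to track the last applied sequence number and hold back anything that arrives out of order; the `seq` field name is an assumption.

```python
# Per-node ordering sketch: apply commands strictly in sequence-number order.
class OrderedApplier:
    def __init__(self):
        self.last_applied_seq = 0
        self.held_back = {}  # seq -> command, for out-of-order arrivals

    def submit(self, command: dict, apply_fn) -> None:
        self.held_back[command["seq"]] = command
        # Drain every command whose predecessor has already been applied.
        while self.last_applied_seq + 1 in self.held_back:
            apply_fn(self.held_back.pop(self.last_applied_seq + 1))
            self.last_applied_seq += 1
```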
Business impact mapping
If a command takes 5 minutes to reach all nodes instead of 30 seconds, operations are slower but safe. If a command executes twice or skips authorization, we have a security incident. We optimize for safety over speed.
Security Invariants:
```python
class CommandVerifier:
    def verify_command(self, cmd: Command) -> bool:
        # 1. Check expiry
        ...  # full verification sketch below
```
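A fuller sketch of the verifier, assuming Ed25519 signatures (matching the signing sketch earlier), a simple issuer-to-permissions map for RBAC, and an illustrative `Command` shape:

```python
# Command verification sketch: expiry, signature, then authorization.
import time
from dataclasses import dataclass

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

@dataclass
class Command:
    command_id: str
    issuer: str
    action: str
    expires_at: float   # unix timestamp
    signature: bytes    # signature over canonical_bytes()

    def canonical_bytes(self) -> bytes:
        # Assumed canonical encoding; a real system needs a stable, versioned format.
        return f"{self.command_id}|{self.issuer}|{self.action}|{self.expires_at}".encode()

class CommandVerifier:
    def __init__(self, issuer_keys: dict, issuer_permissions: dict):
        self.issuer_keys = issuer_keys                # issuer -> Ed25519PublicKey
        self.issuer_permissions = issuer_permissions  # issuer -> set of allowed actions

    def verify_command(self, cmd: Command) -> bool:
        # 1. Check expiry: never execute stale commands.
        if time.time() > cmd.expires_at:
            return False
        # 2. Verify the signature against the issuer's registered public key.
        key = self.issuer_keys.get(cmd.issuer)
        if key is None:
            return False
        try:
            key.verify(cmd.signature, cmd.canonical_bytes())
        except InvalidSignature:
            return False
        # 3. Authorization (RBAC): is this issuer allowed to run this action?
        return cmd.action in self.issuer_permissions.get(cmd.issuer, set())
```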
Failure Modes and Resilience

Proactively discuss failures
Let me walk through failure modes. Control planes must be highly resilient since they manage critical infrastructure.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Central controller down | Cannot issue new commands | Multi-AZ deployment, automatic failover | Standby promotes, queue replays |
| Regional controller down | Region cannot receive commands | Multiple controllers per region, load balanced | Traffic shifts to healthy controller |
Handling Offline Nodes:
```python
class RegionalController:
    async def handle_node_reconnect(self, node_id: str, last_command_id: str):
        # Find all commands this node missed
        ...  # full catch-up sketch below
```
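A sketch of the catch-up path, assuming a durable per-region command log with a `commands_since` lookup and a `transport.deliver` call that returns an acknowledgment; both interfaces are assumptions for illustration.

```python
# Offline-node catch-up sketch: replay missed commands in order on reconnect.
class RegionalController:
    def __init__(self, command_log, transport):
        self.command_log = command_log  # durable, ordered log of commands for this region
        self.transport = transport      # delivers one command to an agent, returns True on ack

    async def handle_node_reconnect(self, node_id: str, last_command_id: str) -> None:
        # Everything issued after the node's last acknowledged command, oldest first.
        missed = self.command_log.commands_since(last_command_id, node_id)
        for cmd in missed:
            if cmd.get("expired", False):
                continue                         # never replay stale commands
            delivered = await self.transport.deliver(node_id, cmd)
            if not delivered:
                break                            # node dropped again; resume on next reconnect
```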
Split Brain Prevention:

```python
from kazoo.client import KazooClient

class RegionalControllerCluster:
    def __init__(self, region: str, zk_client: KazooClient):
        self.region = region
```
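A sketch of how that cluster could elect a single active controller per region using ZooKeeper via Kazoo; the znode path, instance ID, and `serve_fn` callback are assumptions for illustration.

```python
# Leader election sketch: only the elected instance fans out commands for the region.
from kazoo.client import KazooClient

class RegionalControllerCluster:
    def __init__(self, region: str, zk_client: KazooClient, instance_id: str):
        self.region = region
        self.zk = zk_client
        self.instance_id = instance_id
        self.is_leader = False

    def run(self, serve_fn) -> None:
        """Block in the election; serve_fn runs only while this instance holds leadership,
        so two controllers can never both act as the region's command fan-out point."""
        election = self.zk.Election(f"/controllers/{self.region}/leader", self.instance_id)

        def _as_leader():
            self.is_leader = True
            try:
                serve_fn()
            finally:
                self.is_leader = False

        election.run(_as_leader)  # blocks until elected; leadership releases when _as_leader returns

# Usage (illustrative):
# zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
# zk.start()
# RegionalControllerCluster("us-east-1", zk, "controller-a").run(serve_requests)
```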
Graceful degradation

When the central control plane is unreachable, regional controllers continue operating with cached policies. Nodes continue running their last known configuration. New commands queue until connectivity is restored.
Evolution and Scaling
What to say
This design works well up to 100K nodes across 5 regions. Let me discuss how it evolves for 1M+ nodes and global deployment.
Evolution Path:
Stage 1: Single Region (up to 10K nodes)
- Single controller, single queue
- Direct agent connections
- Simple but limited

Stage 2: Multi-Region (up to 100K nodes)
- Regional controllers as designed
- Hierarchical state aggregation
- Regional autonomy during partitions

Stage 3: Global Scale (1M+ nodes)
- Add another hierarchy level (zone controllers)
- Edge caching of commands
- Gossip protocol for state
Scaled Architecture
Advanced Features for Scale:
| Feature | Purpose | When to Add |
|---|---|---|
| Command batching | Reduce message overhead | More than 1K commands/minute |
| Delta state sync | Reduce bandwidth | State updates exceed 10MB/s |
| Gossip protocol | Decentralized state | Need faster local propagation |
| Command caching at edge | Reduce latency | Global deployment |
| Predictive pre-positioning | Anticipate commands | Repeated deployment patterns |
Alternative approaches
If we needed stronger consistency, we could use Raft consensus for the control plane (like etcd in Kubernetes). If we needed lower latency, we could use a mesh network where agents share commands peer-to-peer.
What I would do differently for...
IoT devices (millions of constrained nodes): Use MQTT with QoS levels, smaller message formats (protobuf), and edge gateways that aggregate.
Security-critical commands: Add multi-party approval, time-locked execution, and hardware attestation before execution.
Real-time requirements: Switch to push-only with persistent connections (gRPC streaming), accept higher infrastructure cost.