Design Walkthrough
Problem Statement
The Question: Design a distributed message queue that can handle millions of messages per second with reliability guarantees.
Message queues are essential for:
- Decoupling services: the producer does not need to know about consumers
- Load leveling: handle traffic spikes by buffering messages
- Reliability: persist messages until successfully processed
- Async processing: fire-and-forget for non-blocking operations
What to say first
Before designing, I need to clarify the delivery semantics, ordering requirements, and scale. These fundamentally shape the architecture.
What interviewers are really testing:
- Do you understand delivery guarantees (at-least-once vs exactly-once)?
- Can you reason about ordering in distributed systems?
- How do you handle consumer failures?
- Do you understand the CAP theorem implications?
Real-world context
RabbitMQ: Traditional queue with complex routing. Kafka: Distributed log with partitions. SQS: Managed queue with visibility timeout. Each optimizes for different use cases.
Clarifying Questions
These questions reveal senior thinking and shape your entire design.
Question 1: Delivery Semantics
What delivery guarantee do we need? At-most-once, at-least-once, or exactly-once?
Why this matters: Exactly-once is extremely hard (requires idempotency). At-least-once with idempotent consumers is the practical choice.
Typical answer: At-least-once with consumer idempotency
Architecture impact: Need message acknowledgment, redelivery on timeout, deduplication keys
Question 2: Ordering
Do messages need to be processed in order? Globally or per-key?
Why this matters: Global ordering kills parallelism. Per-key ordering (like Kafka) allows scaling.
Typical answer: Per-key ordering (all messages for user X in order)
Architecture impact: Partition by key, single consumer per partition
Question 3: Scale
What is the message throughput? Size of messages? Retention period?
Why this matters: Determines storage strategy, replication factor, partition count.
Typical answer: 1M messages/sec, 1KB average, 7-day retention
Architecture impact: Need ~600TB storage, multiple brokers, tiered storage
Question 4: Consumer Model
Point-to-point (one consumer) or pub-sub (multiple consumers)?
Why this matters: Pub-sub requires tracking multiple consumer offsets.
Typical answer: Both - consumer groups for scaling, multiple groups for pub-sub
Architecture impact: Need consumer group abstraction, offset storage per group
Stating assumptions
I will assume: at-least-once delivery, per-partition ordering, 1M msg/sec throughput, 1KB messages, 7-day retention, consumer groups with pub-sub support.
The Hard Part
Say this out loud
The hard part here is guaranteeing that every message is processed exactly once, even when consumers fail mid-processing or network partitions occur.
Why this is genuinely hard:
1. The Two Generals Problem: you cannot know whether the consumer both received your message AND processed it successfully in a distributed system.
2. Consumer failure mid-processing: a consumer reads a message, starts processing, and crashes. Did it complete? Should we redeliver?
3. Duplicate delivery: a consumer processes a message and sends an ACK, but the network drops the ACK. The queue redelivers, and the message is processed twice.
4. Ordering with parallelism: to scale, you want multiple consumers. But multiple consumers break ordering.
The Acknowledgment Problem
Common mistake
Candidates claim exactly-once delivery without explaining how. True exactly-once requires an idempotent consumer OR a transactional outbox pattern. The queue alone cannot guarantee it.
The practical solution:
Exactly-Once = At-Least-Once Delivery + Idempotent Consumer

The queue guarantees at-least-once (it keeps retrying until the consumer ACKs). The consumer must handle duplicates using:
- Deduplication by message ID
- Idempotent operations (SET vs INCREMENT)
- Transactional outbox pattern
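A minimal sketch of consumer-side deduplication by message ID, the usual way to get effectively-exactly-once on top of at-least-once delivery. The Message type, in-memory set, and handler names are illustrative; in production the dedup store would be something like Redis SETNX or a unique-key table.

```python
from dataclasses import dataclass

@dataclass
class Message:
    id: str           # producer-assigned, stable across redeliveries
    payload: dict

processed_ids = set()     # illustrative; use Redis SETNX or a unique-key table in production

def handle(msg: Message) -> None:
    """At-least-once delivery + this dedup check approximates exactly-once processing."""
    if msg.id in processed_ids:
        return                     # duplicate redelivery: side effects already applied
    apply_side_effects(msg)        # ideally committed atomically with the dedup record
    processed_ids.add(msg.id)

def apply_side_effects(msg: Message) -> None:
    print("processing", msg.id)    # stand-in for the real business logic
```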
Scale & Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Messages/sec | 1,000,000 | Need partitioning, multiple brokers |
| Message size | 1 KB average | 1 GB/sec write throughput needed |
What to say
At 1M messages/sec with 1KB each, we need 1 GB/sec sustained write throughput. This requires partitioning across many brokers. Sequential disk writes can handle 100-200 MB/sec per disk.
Write throughput:
- 1M msg/sec x 1 KB = 1 GB/sec
- Per broker (good SSD): ~200 MB/sec sequential write, so roughly 5 brokers minimum for ingest alone, more once replication and headroom are included

Access Pattern Analysis:
- Writes: append-only, sequential (fast)
- Reads: sequential from an offset (fast), random seek for replay (slower)
- Hot data: recent messages (keep in page cache)
- Cold data: old messages (read from disk, consider tiered storage)
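A quick back-of-envelope check of the storage numbers, using only the figures stated above (1M msg/sec, 1 KB messages, 7-day retention, replication factor 3):

```python
msgs_per_sec = 1_000_000
msg_size = 1_000                              # bytes, ~1 KB average

write_rate = msgs_per_sec * msg_size          # 1 GB/sec sustained ingest
retention = 7 * 24 * 3600                     # seconds in 7 days

raw_storage = write_rate * retention          # ~605 TB, matching the ~600 TB estimate
with_replication = raw_storage * 3            # ~1.8 PB on disk with replication factor 3

print(raw_storage / 1e12, with_replication / 1e12)   # 604.8, 1814.4 (TB)
```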
High-Level Architecture
I will design a Kafka-style distributed log architecture, as it scales better than traditional queue semantics.
What to say
I will model this as a distributed commit log with partitions. Each partition is an ordered, immutable sequence of messages. We scale by adding partitions.
Distributed Message Queue Architecture
Core Components:
1. Brokers: store and serve messages
   - Each broker stores multiple partition replicas
   - Leaders handle reads/writes, followers replicate
   - Partition assignment comes from the coordinator, so brokers hold no cluster-wide state

2. Partitions: unit of parallelism and ordering
   - Ordered, append-only log
   - Messages assigned by key hash or round-robin (see the partitioner sketch after this list)
   - Each partition has one leader, N-1 followers

3. Coordinator (ZooKeeper/Raft): cluster metadata
   - Broker membership and health
   - Partition leader election
   - Consumer group coordination

4. Consumer Groups: scaling abstraction
   - Each partition is assigned to exactly one consumer within a group
   - Multiple groups can read the same partition (pub-sub)
   - Offset tracking per group
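As referenced above, a minimal partitioner sketch showing key-hash vs round-robin assignment (illustrative, not any specific client library's algorithm):

```python
import hashlib
from itertools import count

_round_robin = count()

def choose_partition(key, num_partitions):
    """Keyed messages always hash to the same partition (per-key ordering);
    unkeyed messages are spread round-robin for throughput."""
    if key is None:
        return next(_round_robin) % num_partitions
    digest = hashlib.md5(key).digest()               # any stable hash works; md5 is illustrative
    return int.from_bytes(digest[:4], "big") % num_partitions

# choose_partition(b"user-42", 12) returns the same partition on every call,
# so all of user-42's messages stay in order.
```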
Why this design (Kafka-style)?
Traditional queues (RabbitMQ) delete messages after consumption. Log-based queues retain messages, enabling replay, multiple consumers, and simpler replication. This is why Kafka dominates at scale.
Data Model & Storage
The storage layer is critical for performance and durability.
What to say
The storage is an append-only log per partition. We optimize for sequential I/O which is 100x faster than random I/O. The OS page cache serves hot data.
```
partition-0/
|-- 00000000000000000000.log    # Segment file (messages)
|-- 00000000000000000000.index  # Offset -> file position index
```

Message Format:

```
+-----------+-----------+-----------+----------+----------+
| Offset    | Size      | Timestamp | Key      | Value    |
| (8 bytes) | (4 bytes) | (8 bytes) | (varies) | (varies) |
+-----------+-----------+-----------+----------+----------+
```

Consumer Offset Storage:

```
Option 1: Internal topic (Kafka __consumer_offsets)
  Key:   (group_id, topic, partition)
  Value: (offset, metadata, timestamp)
```

Critical detail
The write path must fsync to disk before acknowledging. But fsync per message is slow. Solution: batch writes, fsync the batch. Configure based on durability vs latency tradeoff.
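A sketch of that batching idea, often called group commit: buffer appends and pay for one fsync per batch rather than per message. Class and parameter names here are illustrative.

```python
import os

class SegmentWriter:
    """Group commit: buffer appends, fsync once per batch, then ACK the whole batch."""

    def __init__(self, path, max_batch=500):
        self.f = open(path, "ab")
        self.max_batch = max_batch
        self.pending = []                        # records written but not yet durable

    def append(self, record: bytes):
        self.f.write(record)
        self.pending.append(record)
        if len(self.pending) >= self.max_batch:  # a timer would also trigger this in practice
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())                # durability point: safe to ACK producers now
        acked, self.pending = self.pending, []
        return acked
```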
Algorithm Deep Dive
Let me explain the core algorithms that make this work.
1. Leader Election (Raft/ZAB)
When a partition leader fails, we need to elect a new leader from in-sync replicas (ISR).
Leader Election Flow
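A simplified sketch of ISR-based election: when the leader dies, the controller promotes the first in-sync replica that is still alive. The partition object and its fields are illustrative, not a real coordinator API.

```python
def elect_leader(partition, live_brokers):
    """Promote the first live in-sync replica after a leader failure.

    Restricting the choice to the ISR is what preserves the invariant that
    acknowledged messages are never lost: ISR members have every committed offset.
    """
    for replica in partition.isr:            # in-sync replicas, in preference order
        if replica in live_brokers:
            partition.leader = replica
            return replica
    raise RuntimeError("no in-sync replica available; manual/unclean election required")
```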
2. Replication Protocol
```python
class PartitionLeader:
    def __init__(self):
        self.log = []            # append-only message log
        self.followers = []      # in-sync follower replicas (ISR)
        self.high_watermark = 0  # highest offset replicated to all ISR; consumers read up to here
```

3. Consumer Group Rebalancing
When consumers join/leave, partitions must be reassigned.
```python
def range_assignment(partitions, consumers):
    """Range: divide partitions evenly, assign contiguous ranges per consumer."""
    per_consumer = -(-len(partitions) // len(consumers))      # ceiling division
    return {c: partitions[i * per_consumer:(i + 1) * per_consumer]
            for i, c in enumerate(consumers)}                 # e.g. 6 partitions, 3 consumers -> 2 each
```

Interview insight
Mention that rebalancing causes consumer pause (stop-the-world in older protocols). Cooperative rebalancing in modern systems minimizes this. This shows you understand operational concerns.
Consistency & Invariants
System Invariants
1. A message acknowledged to the producer must never be lost.
2. A message must be delivered to consumers at least once.
3. Messages within a partition must be consumed in order.
Durability Guarantee:
When can we acknowledge a message to the producer?
| acks Setting | Durability | Latency | Use Case |
|---|---|---|---|
| acks=0 | None (fire and forget) | Lowest | Metrics, logs where loss OK |
| acks=1 | Leader only | Low | Most use cases |
| acks=all | All ISR replicas | Higher | Financial, critical data |
```python
def produce_message(message, acks="all"):
    # Send to the partition leader for this message's partition
    leader = get_partition_leader(message.partition)
    # Leader appends locally, then waits for 0 / 1 / all ISR replicas per the acks setting
    offset = leader.append(message, required_acks=acks)
    return offset            # returned only once the requested durability level is met
```

Business impact mapping
Lost payment event = lost revenue and angry customer. Lost click tracking event = slightly inaccurate analytics. Choose acks setting based on business impact, not technical preference.
Ordering Guarantee:
Ordering is guaranteed ONLY within a partition. Cross-partition ordering requires one of:
- A single partition (kills parallelism)
- An external sequencing service
- Consumer-side reordering with sequence numbers (sketch below)
Ordering Within Partition
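If cross-partition ordering is truly required, the consumer-side option above can be a small reordering buffer keyed by a producer-assigned sequence number. A minimal sketch, with all names illustrative:

```python
import heapq

class Reorderer:
    """Buffer out-of-order messages and release them in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = []                        # min-heap of (seq, message)

    def accept(self, seq, message):
        heapq.heappush(self.buffer, (seq, message))
        ready = []
        while self.buffer and self.buffer[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.buffer)[1])
            self.next_seq += 1                  # only contiguous prefixes are delivered downstream
        return ready
```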
Failure Modes & Resilience
Proactively discuss failures
Let me walk through failure scenarios and how the system handles them.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Broker failure | Partitions on broker unavailable | Replication factor 3, leader election | New leader elected from ISR in seconds |
| Consumer crash | Partition stops processing | Heartbeat timeout, rebalance | Partition assigned to another consumer |
Handling Consumer Failure:
```python
class Consumer:
    def __init__(self, group_id):
        self.group_id = group_id
        self.assigned = []           # partitions handed to us by the group coordinator
        # Crash handling: if we stop heartbeating, the coordinator reassigns our partitions
        # and the new owner resumes from the last committed offset (at-least-once).
```

The poison pill problem
A malformed message that always fails processing causes infinite redelivery. Solution: Dead letter queue (DLQ) after N retries. Move poison messages out of main flow for manual inspection.
```python
def process_with_dlq(message, max_retries=3):
    retry_count = message.headers.get("retry_count", 0)
    if retry_count >= max_retries:
        send_to_dlq(message)    # poison pill: move out of the main flow for manual inspection
        return
    process(message)            # on failure, redeliver with retry_count incremented
```
Evolution & Scaling
What to say
This design handles 1M messages/sec. Let me discuss evolution for 10x scale and operational improvements.
Scaling Strategies:
1. Add Partitions (horizontal scaling)
   - More partitions = more parallelism
   - But: partition count cannot be reduced later, so choose carefully
   - Recommendation: start with ~3x the expected number of consumers

2. Add Brokers (capacity scaling)
   - Rebalance partitions across the new brokers
   - Each broker then handles a smaller share of partitions

3. Tiered Storage (cost optimization)
   - Hot data: local SSD
   - Cold data: S3/cloud object storage
   - Kafka 3.0+ supports this natively (see the read-path sketch below)
Tiered Storage Architecture
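A sketch of the tiered read path (serve recent offsets locally, older ones from object storage). The objects and methods here are illustrative, not Kafka's actual API.

```python
def read(partition, offset, object_store):
    """Hot reads hit local segments (page cache / SSD); cold reads fall back to object storage."""
    if offset >= partition.local_start_offset:          # still within local retention
        return partition.read_local(offset)
    segment = partition.remote_segment_for(offset)      # index of segments offloaded to S3/GCS
    return object_store.fetch(segment, offset)
```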
Evolution Path:
Stage 1: Single cluster (up to 100K msg/sec)
- 3-5 brokers, ZooKeeper quorum
- Good for starting out; simple operations

Stage 2: Large cluster (up to 1M msg/sec)
- 20-50 brokers, dedicated controller
- Remove ZooKeeper (KRaft mode in Kafka 3.0+)

Stage 3: Multi-cluster (10M+ msg/sec)
- Multiple clusters per region
- MirrorMaker for cross-cluster replication
- Global namespace with routing
Alternative approaches
If I needed simpler operations over raw performance, I would use a managed service like Amazon SQS/SNS or Cloud Pub/Sub. If I needed complex routing logic, RabbitMQ with exchanges. The right choice depends on operational capacity and specific requirements.
| Approach | When to Use | Tradeoff |
|---|---|---|
| Self-managed Kafka | High throughput, need replay, have ops team | Complex operations, but full control |
| Managed Kafka (MSK/Confluent) | Want Kafka semantics without ops burden | Higher cost, some limitations |
| Amazon SQS | Simple queue, serverless, low ops | No ordering in standard queues (FIFO queues add it with throughput limits), no replay |
| RabbitMQ | Complex routing, lower throughput OK | Harder to scale, no native replay |
| Redis Streams | Already have Redis, moderate scale | Memory-bound, less durable |