Design Walkthrough
Problem Statement
The Question: Design a distributed message queue that can handle millions of messages per second with reliability guarantees.
Message queues are essential for:
- Decoupling services: the producer does not need to know about consumers
- Load leveling: handle traffic spikes by buffering messages
- Reliability: persist messages until successfully processed
- Async processing: fire-and-forget for non-blocking operations
What to say first
Before designing, I need to clarify the delivery semantics, ordering requirements, and scale. These fundamentally shape the architecture.
What interviewers are really testing:
- Do you understand delivery guarantees (at-least-once vs exactly-once)?
- Can you reason about ordering in distributed systems?
- How do you handle consumer failures?
- Do you understand the CAP theorem implications?
Real-world context
RabbitMQ: Traditional queue with complex routing. Kafka: Distributed log with partitions. SQS: Managed queue with visibility timeout. Each optimizes for different use cases.
Clarifying Questions
These questions reveal senior thinking and shape your entire design.
Question 1: Delivery Semantics
What delivery guarantee do we need? At-most-once, at-least-once, or exactly-once?
Why this matters: Exactly-once is extremely hard (requires idempotency). At-least-once with idempotent consumers is the practical choice.
Typical answer: At-least-once with consumer idempotency
Architecture impact: Need message acknowledgment, redelivery on timeout, deduplication keys
Question 2: Ordering
Do messages need to be processed in order? Globally or per-key?
Why this matters: Global ordering kills parallelism. Per-key ordering (like Kafka) allows scaling.
Typical answer: Per-key ordering (all messages for user X in order)
Architecture impact: Partition by key, single consumer per partition
Question 3: Scale
What is the message throughput? Size of messages? Retention period?
Why this matters: Determines storage strategy, replication factor, partition count.
Typical answer: 1M messages/sec, 1KB average, 7-day retention
Architecture impact: Need ~600TB storage, multiple brokers, tiered storage
Question 4: Consumer Model
Point-to-point (one consumer) or pub-sub (multiple consumers)?
Why this matters: Pub-sub requires tracking multiple consumer offsets.
Typical answer: Both - consumer groups for scaling, multiple groups for pub-sub
Architecture impact: Need consumer group abstraction, offset storage per group
Stating assumptions
I will assume: at-least-once delivery, per-partition ordering, 1M msg/sec throughput, 1KB messages, 7-day retention, consumer groups with pub-sub support.
The Hard Part
Say this out loud
The hard part here is guaranteeing that every message is processed exactly once, even when consumers fail mid-processing or network partitions occur.
Why this is genuinely hard:
1. The Two Generals Problem: you cannot know whether the consumer both received your message AND processed it successfully in a distributed system.
2. Consumer failure mid-processing: a consumer reads a message, starts processing, and crashes. Did it complete? Should we redeliver?
3. Duplicate delivery: a consumer processes a message and sends an ACK, but the network drops the ACK. The queue redelivers, and the message is processed twice.
4. Ordering with parallelism: to scale, you want multiple consumers. But multiple consumers break ordering.
The Acknowledgment Problem
Common mistake
Candidates claim exactly-once delivery without explaining how. True exactly-once requires an idempotent consumer OR a transactional outbox pattern. The queue alone cannot guarantee it.
The practical solution:
Exactly-Once = At-Least-Once Delivery + Idempotent Consumer

The queue guarantees at-least-once (it keeps retrying until the consumer ACKs). The consumer must handle duplicates using:
- Deduplication by message ID
- Idempotent operations (SET vs INCREMENT)
- Transactional outbox pattern
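A minimal sketch of consumer-side deduplication by message ID, the usual way to get effectively-exactly-once on top of at-least-once delivery. The Message type, in-memory set, and handler names are illustrative; in production the dedup store would be something like Redis SETNX or a unique-key table.

```python
from dataclasses import dataclass

@dataclass
class Message:
    id: str           # producer-assigned, stable across redeliveries
    payload: dict

processed_ids = set()     # illustrative; use Redis SETNX or a unique-key table in production

def handle(msg: Message) -> None:
    """At-least-once delivery + this dedup check approximates exactly-once processing."""
    if msg.id in processed_ids:
        return                     # duplicate redelivery: side effects already applied
    apply_side_effects(msg)        # ideally committed atomically with the dedup record
    processed_ids.add(msg.id)

def apply_side_effects(msg: Message) -> None:
    print("processing", msg.id)    # stand-in for the real business logic
```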
Scale & Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Messages/sec | 1,000,000 | Need partitioning, multiple brokers |
| Message size | 1 KB average | 1 GB/sec write throughput needed |
What to say
At 1M messages/sec with 1KB each, we need 1 GB/sec sustained write throughput. This requires partitioning across many brokers. Sequential disk writes can handle 100-200 MB/sec per disk.
Write throughput:
- 1M msg/sec x 1 KB = 1 GB/sec
- Per broker (good SSD): ~200 MB/sec sequential write, so roughly 5 brokers minimum for ingest alone, more once replication and headroom are included

Access Pattern Analysis:
- Writes: append-only, sequential (fast)
- Reads: sequential from an offset (fast), random seek for replay (slower)
- Hot data: recent messages (keep in page cache)
- Cold data: old messages (read from disk, consider tiered storage)
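A quick back-of-envelope check of the storage numbers, using only the figures stated above (1M msg/sec, 1 KB messages, 7-day retention, replication factor 3):

```python
msgs_per_sec = 1_000_000
msg_size = 1_000                              # bytes, ~1 KB average

write_rate = msgs_per_sec * msg_size          # 1 GB/sec sustained ingest
retention = 7 * 24 * 3600                     # seconds in 7 days

raw_storage = write_rate * retention          # ~605 TB, matching the ~600 TB estimate
with_replication = raw_storage * 3            # ~1.8 PB on disk with replication factor 3

print(raw_storage / 1e12, with_replication / 1e12)   # 604.8, 1814.4 (TB)
```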
High-Level Architecture
I will design a Kafka-style distributed log architecture, as it scales better than traditional queue semantics.
What to say
I will model this as a distributed commit log with partitions. Each partition is an ordered, immutable sequence of messages. We scale by adding partitions.
Distributed Message Queue Architecture
Core Components:
1. Brokers: store and serve messages
   - Each broker stores multiple partition replicas
   - Leaders handle reads/writes, followers replicate
   - Partition assignment comes from the coordinator, so brokers hold no cluster-wide state

2. Partitions: unit of parallelism and ordering
   - Ordered, append-only log
   - Messages assigned by key hash or round-robin (see the partitioner sketch after this list)
   - Each partition has one leader, N-1 followers

3. Coordinator (ZooKeeper/Raft): cluster metadata
   - Broker membership and health
   - Partition leader election
   - Consumer group coordination

4. Consumer Groups: scaling abstraction
   - Each partition is assigned to exactly one consumer within a group
   - Multiple groups can read the same partition (pub-sub)
   - Offset tracking per group
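As referenced above, a minimal partitioner sketch showing key-hash vs round-robin assignment (illustrative, not any specific client library's algorithm):

```python
import hashlib
from itertools import count

_round_robin = count()

def choose_partition(key, num_partitions):
    """Keyed messages always hash to the same partition (per-key ordering);
    unkeyed messages are spread round-robin for throughput."""
    if key is None:
        return next(_round_robin) % num_partitions
    digest = hashlib.md5(key).digest()               # any stable hash works; md5 is illustrative
    return int.from_bytes(digest[:4], "big") % num_partitions

# choose_partition(b"user-42", 12) returns the same partition on every call,
# so all of user-42's messages stay in order.
```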
Why this design (Kafka-style)?
Traditional queues (RabbitMQ) delete messages after consumption. Log-based queues retain messages, enabling replay, multiple consumers, and simpler replication. This is why Kafka dominates at scale.
Data Model & Storage
The storage layer is critical for performance and durability.
What to say
The storage is an append-only log per partition. We optimize for sequential I/O which is 100x faster than random I/O. The OS page cache serves hot data.
```
partition-0/
|-- 00000000000000000000.log    # Segment file (messages)
|-- 00000000000000000000.index  # Offset -> file position index
```

Message Format:

```
+-----------+-----------+-----------+----------+----------+
| Offset    | Size      | Timestamp | Key      | Value    |
| (8 bytes) | (4 bytes) | (8 bytes) | (varies) | (varies) |
+-----------+-----------+-----------+----------+----------+
```

Consumer Offset Storage:

```
Option 1: Internal topic (Kafka __consumer_offsets)
  Key:   (group_id, topic, partition)
  Value: (offset, metadata, timestamp)
```

Critical detail
The write path must fsync to disk before acknowledging. But fsync per message is slow. Solution: batch writes, fsync the batch. Configure based on durability vs latency tradeoff.
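A sketch of that batching idea, often called group commit: buffer appends and pay for one fsync per batch rather than per message. Class and parameter names here are illustrative.

```python
import os

class SegmentWriter:
    """Group commit: buffer appends, fsync once per batch, then ACK the whole batch."""

    def __init__(self, path, max_batch=500):
        self.f = open(path, "ab")
        self.max_batch = max_batch
        self.pending = []                        # records written but not yet durable

    def append(self, record: bytes):
        self.f.write(record)
        self.pending.append(record)
        if len(self.pending) >= self.max_batch:  # a timer would also trigger this in practice
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())                # durability point: safe to ACK producers now
        acked, self.pending = self.pending, []
        return acked
```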
Algorithm Deep Dive
Let me explain the core algorithms that make this work.
1. Leader Election (Raft/ZAB)
When a partition leader fails, we need to elect a new leader from in-sync replicas (ISR).
Leader Election Flow
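A simplified sketch of ISR-based election: when the leader dies, the controller promotes the first in-sync replica that is still alive. The partition object and its fields are illustrative, not a real coordinator API.

```python
def elect_leader(partition, live_brokers):
    """Promote the first live in-sync replica after a leader failure.

    Restricting the choice to the ISR is what preserves the invariant that
    acknowledged messages are never lost: ISR members have every committed offset.
    """
    for replica in partition.isr:            # in-sync replicas, in preference order
        if replica in live_brokers:
            partition.leader = replica
            return replica
    raise RuntimeError("no in-sync replica available; manual/unclean election required")
```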
2. Replication Protocol
```python
class PartitionLeader:
    def __init__(self):
        self.log = []            # append-only message log
        self.followers = []      # in-sync follower replicas (ISR)
        self.high_watermark = 0  # highest offset replicated to all ISR; consumers read up to here
```

3. Consumer Group Rebalancing
When consumers join/leave, partitions must be reassigned.
```python
def range_assignment(partitions, consumers):
    """Range: divide partitions evenly, assign contiguous ranges per consumer."""
    per_consumer = -(-len(partitions) // len(consumers))      # ceiling division
    return {c: partitions[i * per_consumer:(i + 1) * per_consumer]
            for i, c in enumerate(consumers)}                 # e.g. 6 partitions, 3 consumers -> 2 each
```

Interview insight
Mention that rebalancing causes consumer pause (stop-the-world in older protocols). Cooperative rebalancing in modern systems minimizes this. This shows you understand operational concerns.
Consistency & Invariants
System Invariants
1. A message acknowledged to the producer must never be lost.
2. A message must be delivered to consumers at least once.
3. Messages within a partition must be consumed in order.
Durability Guarantee:
When can we acknowledge a message to the producer?
| acks Setting | Durability | Latency | Use Case |
|---|---|---|---|
| acks=0 | None (fire and forget) | Lowest | Metrics, logs where loss OK |
| acks=1 | Leader only | Low | Most use cases |
| acks=all | All ISR replicas | Higher | Financial, critical data |
```python
def produce_message(message, acks="all"):
    # Send to the partition leader for this message's partition
    leader = get_partition_leader(message.partition)
    # Leader appends locally, then waits for 0 / 1 / all ISR replicas per the acks setting
    offset = leader.append(message, required_acks=acks)
    return offset            # returned only once the requested durability level is met
```

Business impact mapping
Lost payment event = lost revenue and angry customer. Lost click tracking event = slightly inaccurate analytics. Choose acks setting based on business impact, not technical preference.
Ordering Guarantee:
Ordering is guaranteed ONLY within a partition. Cross-partition ordering requires one of:
- A single partition (kills parallelism)
- An external sequencing service
- Consumer-side reordering with sequence numbers (sketch below)
Ordering Within Partition
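If cross-partition ordering is truly required, the consumer-side option above can be a small reordering buffer keyed by a producer-assigned sequence number. A minimal sketch, with all names illustrative:

```python
import heapq

class Reorderer:
    """Buffer out-of-order messages and release them in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = []                        # min-heap of (seq, message)

    def accept(self, seq, message):
        heapq.heappush(self.buffer, (seq, message))
        ready = []
        while self.buffer and self.buffer[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.buffer)[1])
            self.next_seq += 1                  # only contiguous prefixes are delivered downstream
        return ready
```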
Failure Modes & Resilience
Proactively discuss failures
Let me walk through failure scenarios and how the system handles them.
| Failure | Impact | Mitigation | Recovery |
|---|---|---|---|
| Broker failure | Partitions on broker unavailable | Replication factor 3, leader election | New leader elected from ISR in seconds |
| Consumer crash | Partition stops processing | Heartbeat timeout, rebalance | Partition assigned to another consumer |
Handling Consumer Failure:
```python
class Consumer:
    def __init__(self, group_id):
        self.group_id = group_id
        self.assigned = []           # partitions handed to us by the group coordinator
        # Crash handling: if we stop heartbeating, the coordinator reassigns our partitions
        # and the new owner resumes from the last committed offset (at-least-once).
```

The poison pill problem
A malformed message that always fails processing causes infinite redelivery. Solution: Dead letter queue (DLQ) after N retries. Move poison messages out of main flow for manual inspection.
```python
def process_with_dlq(message, max_retries=3):
    retry_count = message.headers.get("retry_count", 0)
    if retry_count >= max_retries:
        send_to_dlq(message)    # poison pill: move out of the main flow for manual inspection
        return
    process(message)            # on failure, redeliver with retry_count incremented
```
Evolution & Scaling
What to say
This design handles 1M messages/sec. Let me discuss evolution for 10x scale and operational improvements.
Scaling Strategies:
1. Add Partitions (horizontal scaling)
   - More partitions = more parallelism
   - But: partition count cannot be reduced later, so choose carefully
   - Recommendation: start with ~3x the expected number of consumers

2. Add Brokers (capacity scaling)
   - Rebalance partitions across the new brokers
   - Each broker then handles a smaller share of partitions

3. Tiered Storage (cost optimization)
   - Hot data: local SSD
   - Cold data: S3/cloud object storage
   - Kafka 3.0+ supports this natively (see the read-path sketch below)
Tiered Storage Architecture
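A sketch of the tiered read path (serve recent offsets locally, older ones from object storage). The objects and methods here are illustrative, not Kafka's actual API.

```python
def read(partition, offset, object_store):
    """Hot reads hit local segments (page cache / SSD); cold reads fall back to object storage."""
    if offset >= partition.local_start_offset:          # still within local retention
        return partition.read_local(offset)
    segment = partition.remote_segment_for(offset)      # index of segments offloaded to S3/GCS
    return object_store.fetch(segment, offset)
```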
Evolution Path:
Stage 1: Single cluster (up to 100K msg/sec)
- 3-5 brokers, ZooKeeper quorum
- Good for starting out; simple operations

Stage 2: Large cluster (up to 1M msg/sec)
- 20-50 brokers, dedicated controller
- Remove ZooKeeper (KRaft mode in Kafka 3.0+)

Stage 3: Multi-cluster (10M+ msg/sec)
- Multiple clusters per region
- MirrorMaker for cross-cluster replication
- Global namespace with routing
Alternative approaches
If I needed simpler operations over raw performance, I would use a managed service like Amazon SQS/SNS or Cloud Pub/Sub. If I needed complex routing logic, RabbitMQ with exchanges. The right choice depends on operational capacity and specific requirements.
| Approach | When to Use | Tradeoff |
|---|---|---|
| Self-managed Kafka | High throughput, need replay, have ops team | Complex operations, but full control |
| Managed Kafka (MSK/Confluent) | Want Kafka semantics without ops burden | Higher cost, some limitations |
| Amazon SQS | Simple queue, serverless, low ops | No ordering in standard queues (FIFO queues add it with throughput limits), no replay |
| RabbitMQ | Complex routing, lower throughput OK | Harder to scale, no native replay |
| Redis Streams | Already have Redis, moderate scale | Memory-bound, less durable |