Design Walkthrough
Problem Statement
The Question: Design an on-call escalation system that handles millions of alerts per day, ensuring critical incidents reach the right responders with automatic escalation.
On-call escalation systems are essential for:
- Incident Response: Getting the right person notified when systems fail
- Escalation Policies: Auto-escalating if the primary on-call does not respond
- Schedule Management: Handling rotations, overrides, and handoffs
- Multi-Channel Delivery: SMS, phone calls, push notifications, email
- Audit Trail: Recording who was notified, when, and their response
What to say first
Before I design, let me clarify the requirements. I want to understand the scale, delivery guarantees needed, and the complexity of scheduling rules we need to support.
Hidden requirements interviewers are testing:
- Can you design a reliable system where failure means missed incidents?
- Do you understand the state machine complexity of escalations?
- Can you handle the scheduling complexity (timezones, overrides, rotations)?
- Do you know how to guarantee delivery across unreliable channels?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many alerts per day? How many on-call schedules and users do we need to support?
Why this matters: Determines storage and processing requirements.
Typical answer: 1M alerts/day, 10K organizations, 100K users.
Architecture impact: Need distributed processing; cannot be a single node.
Question 2: Delivery Guarantee
What is the SLA for alert delivery? Is it acceptable to occasionally send duplicate notifications?
Why this matters: Determines the consistency model.
Typical answer: 99.99% delivery within 30 seconds; duplicates are acceptable.
Architecture impact: At-least-once delivery, retry aggressively.
Question 3: Notification Channels
Which notification channels do we need to support? Phone calls, SMS, push, email, Slack?
Why this matters: Each channel has different reliability and latency characteristics.
Typical answer: All of the above, with user-configurable preferences.
Architecture impact: Need an abstraction layer for channels, handle partial failures.
Question 4: Schedule Complexity
Do we need to support schedule overrides, rotations, and follow-the-sun?
Why this matters: Scheduling is surprisingly complex.
Typical answer: Yes, full scheduling with overrides taking precedence.
Architecture impact: Need a sophisticated schedule resolution engine.
Stating assumptions
Based on this, I will assume 1M alerts/day, a 99.99% delivery SLA, multi-channel notifications, complex scheduling with overrides, and that at-least-once delivery is acceptable.
The Hard Part
Say this out loud
The hard part here is guaranteeing that unacknowledged alerts always escalate, while correctly resolving who is on-call at any given moment across complex schedules with overrides.
Why this is genuinely hard:
1. Guaranteed Escalation: If the escalation timer process crashes, alerts could be stuck forever. We need fault-tolerant timer management.
2. Schedule Resolution: Who is on-call right now? Depends on: base rotation, any active overrides, timezone of the schedule, gaps in coverage, and escalation level.
3. Multi-Channel Delivery: SMS providers fail, phone calls go to voicemail, push tokens expire. Each channel has different failure modes.
4. Clock and Timer Reliability: Escalation after 5 minutes means exactly 5 minutes, not whenever the cron job runs next.
Common mistake
Candidates often use simple cron jobs for escalations. But cron is not precise, not distributed, and loses state on restart. You need a proper distributed timer system.
The fundamental reliability challenge:
This system has an unusual property: failure is catastrophic. If PagerDuty fails to escalate an alert, a production outage goes unnoticed for hours. The entire value proposition is reliability.
Compare to other systems:
- E-commerce: A missed order is bad but recoverable
- Social media: A delayed notification is annoying but fine
- On-call: A missed escalation means extended outage, potential revenue loss, customer impact
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Alerts per day | 1,000,000 | About 12 alerts/second average, bursty |
| Peak alerts/second | 1,000 | Need to handle correlated failures |
What to say
At 12 alerts/second average with 1000/second peaks, we need to handle correlated failures where one outage triggers thousands of alerts. The system is write-heavy during incidents but read-heavy for schedule lookups.
Access Pattern Analysis:
- Bursty writes: Alerts come in waves during outages
- Time-sensitive: Escalations must fire precisely
- Read-heavy schedules: Who is on-call is queried constantly
- Audit requirements: Every action must be logged for compliance
- Multi-tenant: Strong isolation between organizations
Storage needed:
- 1M alerts/day x 365 days x 1KB = 365 GB/year for alerts
- 5M notifications/day x 365 x 500B = 912 GB/year for notifications
- Schedules: 50K policies x 10KB = 500 MB (fits in memory)
Timer requirements:
- 1M alerts x 3 escalation levels = 3M timers/day
- Average timer duration: 5 minutes
- Concurrent active timers: ~10,000 at any moment (about 35 timers/second x a 300-second average duration, by Little's law)

High-Level Architecture
Let me start with a simple architecture and evolve it.
What to say
I will design this as an event-driven system with clear separation between alert ingestion, escalation management, and notification delivery. The key insight is that escalations are a state machine.
On-Call Escalation Architecture
Component Responsibilities:
1. Alert Ingestion: API, webhooks, email parsing - normalize all alert sources
2. Deduplication: Group related alerts, prevent notification storms
3. Rule Engine: Match alerts to escalation policies based on service, severity, tags
4. Escalation Manager: Core state machine - tracks incident state, triggers escalations
5. Timer Service: Reliable distributed timers for escalation deadlines
6. Schedule Resolver: Determines who is on-call given complex schedule rules
7. Notification Queue: Reliable delivery to multiple channels with retries
Real-world reference
PagerDuty uses a similar event-driven architecture. They process alerts through a pipeline with deduplication, then manage escalation state machines independently per incident.
Data Model and Storage
Let me define the core data models and storage choices.
What to say
PostgreSQL for transactional data (incidents, schedules), Redis for timers and caching, and a time-series DB for metrics and audit logs.
-- Organizations and Users
CREATE TABLE organizations (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL,                              -- remaining columns are illustrative
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Incident State Machine:
The incident goes through well-defined states:
Incident State Machine
Important detail
State transitions must be atomic. Use database transactions to ensure the incident state and timer are updated together. A crash between them could leave the system in an inconsistent state.
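To make the lifecycle concrete, here is a minimal sketch of the states and allowed transitions; the exact state names are assumptions based on the lifecycle described above, not a fixed schema:

```python
from enum import Enum

class IncidentState(Enum):
    TRIGGERED = "triggered"              # alert matched a policy, notifications going out
    ACKNOWLEDGED = "acknowledged"        # a responder took ownership; escalation stops
    RESOLVED = "resolved"                # underlying problem fixed
    DELIVERY_FAILED = "delivery_failed"  # every notification channel failed

# Allowed transitions; anything else is rejected so duplicate or replayed
# events cannot move an incident backwards.
ALLOWED_TRANSITIONS = {
    IncidentState.TRIGGERED: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED,
                              IncidentState.DELIVERY_FAILED},
    IncidentState.DELIVERY_FAILED: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),
}

def can_transition(current: IncidentState, target: IncidentState) -> bool:
    return target in ALLOWED_TRANSITIONS[current]
```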
Schedule Resolution Deep Dive
Schedule resolution is deceptively complex. Let me walk through the algorithm.
What to say
Schedule resolution requires layering: base rotation, then overrides on top, all calculated in the schedule timezone. I will show the algorithm.
def resolve_oncall(schedule_id: str, at_time: datetime) -> User:
"""
    Determine who is on-call for a schedule at a given time.
    See the fuller sketch after the edge-case list below.
    """

Edge Cases to Handle:
1. Timezone transitions: DST changes can cause 23- or 25-hour days
2. Schedule gaps: No one is on-call (should alert admins)
3. Deleted users: User in rotation no longer exists
4. Overlapping overrides: Later override wins
5. Handoff timing: Exactly at a rotation boundary, who gets it?
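A sketch of the layering that resolve_oncall would delegate to after loading the schedule and its overrides. The schedule and override shapes here are assumptions for illustration, not the exact data model:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def resolve_from_layers(schedule, overrides, at_time: datetime):
    """Layered resolution: overrides first, then the base rotation.

    Assumed shapes: `schedule` has timezone, rotation_start (aware datetime),
    shift_length (timedelta), and an ordered `users` list; each override has
    start, end, created_at, and user.
    """
    tz = ZoneInfo(schedule.timezone)
    local_time = at_time.astimezone(tz)

    # 1. Overrides take precedence; when overrides overlap, the later-created one wins.
    active = [o for o in overrides if o.start <= local_time < o.end]
    if active:
        return max(active, key=lambda o: o.created_at).user

    # 2. Base rotation: which shift are we in? Working in the schedule's own timezone
    #    keeps DST transitions (23- or 25-hour days) consistent.
    elapsed = local_time - schedule.rotation_start.astimezone(tz)
    if elapsed < timedelta(0) or not schedule.users:
        return None  # gap in coverage: alert the schedule admins instead of paging nobody
    shift_index = elapsed // schedule.shift_length
    user = schedule.users[shift_index % len(schedule.users)]

    # 3. A deleted user in the rotation is treated as a gap, not silently skipped.
    return user if user.is_active else None
```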
Caching strategy
Cache resolved schedules for short periods (1-5 minutes). Schedules change rarely but are queried constantly during incidents. Invalidate on any schedule or override change.
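A minimal sketch of that cache, assuming we only cache "who is on-call right now" lookups with a 60-second TTL:

```python
import time
from datetime import datetime, timezone
from typing import Dict, Tuple

CACHE_TTL_SECONDS = 60  # within the 1-5 minute guidance above
_oncall_cache: Dict[str, Tuple[float, object]] = {}

def current_oncall_cached(schedule_id: str):
    """Cache only 'right now' lookups; historical audit queries bypass the cache."""
    hit = _oncall_cache.get(schedule_id)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    user = resolve_oncall(schedule_id, datetime.now(timezone.utc))  # from the section above
    _oncall_cache[schedule_id] = (time.time(), user)
    return user

def invalidate_schedule_cache(schedule_id: str) -> None:
    """Call this on any schedule or override change."""
    _oncall_cache.pop(schedule_id, None)
```

In a multi-node deployment this cache would live in Redis rather than process memory, but the TTL-plus-invalidation shape is the same.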
Timer Service Deep Dive
The timer service is critical for reliable escalations. It must guarantee timers fire even if nodes fail.
Say this out loud
The timer service is the heart of reliability. If it fails to fire a timer, an incident goes unescalated. I will use a distributed approach with Redis sorted sets.
import redis
import time
import json

Why this design works (sketched below):
1. Sorted set ordering: Timers are automatically ordered by fire time
2. Atomic claiming: Lua script ensures no double-firing
3. Processing set: Tracks in-flight timers for recovery
4. Multiple workers: Any worker can poll and process timers
5. Crash recovery: Stuck timers are moved back automatically
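A minimal sketch of the timer service under these assumptions; the key names, Lua script, and worker loop are illustrative rather than a specific production implementation:

```python
import json
import time
import uuid

import redis

r = redis.Redis(decode_responses=True)

# Atomically move due timers from the pending sorted set to the processing set.
# KEYS[1] = pending zset, KEYS[2] = processing zset
# ARGV[1] = now (unix seconds), ARGV[2] = max timers to claim
CLAIM_SCRIPT = r.register_script("""
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, member in ipairs(due) do
    redis.call('ZREM', KEYS[1], member)
    redis.call('ZADD', KEYS[2], ARGV[1], member)
end
return due
""")

def schedule_timer(incident_id: str, level: int, fire_at: float) -> None:
    """Register an escalation deadline as a member of the pending sorted set."""
    member = json.dumps({"id": str(uuid.uuid4()), "incident_id": incident_id, "level": level})
    r.zadd("timers:pending", {member: fire_at})

def poll_timers(batch_size: int = 100) -> None:
    """Worker loop: claim due timers atomically, fire them, then acknowledge."""
    while True:
        due = CLAIM_SCRIPT(keys=["timers:pending", "timers:processing"],
                           args=[time.time(), batch_size])
        for member in due:
            timer = json.loads(member)
            escalate_incident(timer["incident_id"])  # shown later in this walkthrough
            r.zrem("timers:processing", member)      # acknowledge only after processing
        if not due:
            time.sleep(0.5)

def recover_stuck_timers(max_processing_seconds: int = 60) -> None:
    """Move timers stuck in the processing set (crashed worker) back to pending."""
    cutoff = time.time() - max_processing_seconds
    for member in r.zrangebyscore("timers:processing", "-inf", cutoff):
        r.zrem("timers:processing", member)
        r.zadd("timers:pending", {member: time.time()})
```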
Critical consideration
Redis persistence is crucial here. Use AOF with fsync=always or accept that some timers might be lost on Redis crash. For production, replicate Redis and consider additional persistence layer.
Notification Delivery
Multi-channel notification delivery requires handling different failure modes per channel.
What to say
Each notification channel has different reliability characteristics. We use a priority queue with retries and fallback to secondary channels on failure.
| Channel | Latency | Reliability | Failure Modes |
|---|---|---|---|
| Push Notification | 1-5 seconds | Medium | Token expired, app not installed, phone off |
| SMS | 5-30 seconds | High | Carrier issues, international routing, spam filters |
| Phone Call | 10-60 seconds | High | Voicemail, no answer, busy signal |
| Email | 30-300 seconds | Medium | Spam filters, delayed delivery, inbox full |
| Slack/Teams | 1-5 seconds | High | Rate limits, user DND, channel muted |
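To make the fallback behaviour concrete, here is a sketch of a common channel-provider interface and a delivery loop that falls back to the next channel on failure; the class and method names are assumptions, not a specific provider API. The NotificationService below would wrap providers implementing this interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol

class ChannelProvider(Protocol):
    """Common interface each channel integration (SMS, voice, push, email, chat) implements."""
    def send(self, user_id: str, message: str) -> bool: ...

@dataclass
class NotificationRequest:
    user_id: str
    incident_id: str
    message: str
    channel_order: List[str]  # the user's configured priority, e.g. ["push", "sms", "voice"]

def deliver_with_fallback(request: NotificationRequest,
                          providers: Dict[str, ChannelProvider]) -> Optional[str]:
    """Try channels in priority order; fall back to the next on failure.

    Returns the channel that succeeded, or None if every channel failed,
    in which case the incident should be marked as a delivery failure.
    """
    for channel in request.channel_order:
        provider = providers.get(channel)
        if provider is None:
            continue
        try:
            if provider.send(request.user_id, request.message):
                return channel
        except Exception:
            # Provider outage, timeout, expired token, etc.: fall through to the next channel
            continue
    return None
```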
class NotificationService:
def __init__(self):
        # Channel name -> provider client, one per channel in the table above
        self.providers = dict()

Notification Deduplication:
Users should not receive duplicate notifications for the same incident level:
import redis

redis_client = redis.Redis()

def should_notify(user_id: str, incident_id: str, level: int) -> bool:
    """Check if we should send a notification. Prevents duplicates."""
    # Build dedup key from user, incident, and level
    dedup_key = "notified:" + user_id + ":" + incident_id + ":" + str(level)
    # SET with NX succeeds only if the key did not already exist (first notification).
    # Key expires after 1 hour to allow re-notification on repeat escalations.
    is_first = redis_client.set(dedup_key, "1", nx=True, ex=3600)
    return bool(is_first)

Consistency and Invariants
System Invariants
The system must NEVER fail to escalate an unacknowledged incident. This is the core reliability guarantee that customers pay for.
Critical invariants:
1. Escalation guarantee: Unacknowledged incident must always escalate after timeout (a safety-net sweep enforcing this is sketched below)
2. Notification delivery: At least one channel must succeed, or the incident is marked as a delivery failure
3. Schedule accuracy: On-call resolution must be deterministic for audit
4. State consistency: Incident state and timers must be updated atomically
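Beyond per-incident timers, a periodic sweep can enforce the escalation guarantee even if a timer is lost. A sketch, with illustrative table and column names:

```python
def enforce_escalation_guarantee(db):
    """Safety-net sweep run every minute by a separate worker.

    Any incident still unacknowledged whose escalation is overdue gets escalated
    even if its timer was lost. Table and column names are illustrative.
    """
    rows = db.query(
        "SELECT id FROM incidents "
        "WHERE state = 'triggered' "
        "AND last_escalated_at < now() - interval '10 minutes'")
    for row in rows:
        escalate_incident(row.id)  # shown below under 'Ensuring atomic state transitions'
```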
Why at-least-once delivery:
We explicitly choose at-least-once over exactly-once:
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| At-least-once | Guarantees delivery | May duplicate | Alerting systems (default) |
| Exactly-once | No duplicates | May miss delivery | Never for critical alerts |
| At-most-once | Simple, no retries | May lose alerts | Low-priority notifications only |
Business impact mapping
If a user gets two SMS messages for the same incident, they are mildly annoyed. If they get zero messages, a production outage goes unnoticed. We optimize for the failure mode that matters.
Ensuring atomic state transitions:
def escalate_incident(incident_id: str):
"""Atomically escalate incident to next level using DB transaction."""
    with db.transaction():
        # Guarded update (the db helper and column names are illustrative): escalate
        # only while the incident is still unacknowledged, so a duplicate timer fire
        # is a harmless no-op.
        rows = db.execute(
            "UPDATE incidents "
            "SET escalation_level = escalation_level + 1, last_escalated_at = now() "
            "WHERE id = %s AND state = 'triggered'", (incident_id,))
        if rows == 0:
            return  # already acknowledged or resolved
        # Re-arm the next escalation deadline before committing, so the state
        # change and the timer cannot diverge if the process crashes here.
        arm_escalation_timer(incident_id)  # illustrative helper: enqueue into the timer service

Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. For an alerting system, reliability is the product - we must handle every failure gracefully.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Database down | Cannot process alerts | Queue incoming alerts, process when recovered | Alerts are not lost, just delayed |
| Timer service crash | Escalations may be delayed | Multiple workers, stuck timer recovery | At least one worker will process |
Self-Monitoring:
An on-call system must monitor itself. If PagerDuty is down, who alerts the PagerDuty team?
class SelfMonitor:
"""Self-monitoring using independent infrastructure."""
Real-world approach
PagerDuty runs their self-monitoring on separate infrastructure in different cloud regions. If their primary system fails, the backup can still reach the on-call team.
Evolution and Scaling
What to say
This design handles our initial scale well. Let me discuss how it evolves for 10x growth and additional features.
Evolution Path:
Stage 1: Single Region (up to 1M alerts/day)
- PostgreSQL with read replicas
- Redis cluster for timers
- Simple queue for notifications
- Good for most organizations

Stage 2: Multi-Region (up to 10M alerts/day)
- Regional deployments for latency
- Cross-region replication for schedules
- Global load balancing for API
- Regional notification delivery

Stage 3: Enterprise Scale (up to 100M alerts/day)
- Kafka for alert ingestion
- Partitioned processing by org
- Dedicated infrastructure for large customers
- Custom integrations and workflows
Multi-Region Architecture
Additional Features for V2:
1. Incident Response Automation: Auto-remediation, runbooks
2. Analytics and Insights: MTTA, MTTR tracking, on-call load
3. Integrations: Slack workflows, Jira ticket creation
4. Intelligent Alerting: ML-based deduplication, correlation
5. Status Pages: Customer-facing incident communication
Alternative approach
If we needed lower latency for notifications, I would consider edge deployment with local notification delivery and centralized coordination. This is how some CDN-based solutions work.
What I would do differently for...
Consumer app notifications: Relax delivery guarantees, prioritize cost efficiency over reliability. One missed push notification is acceptable.
Financial alerts (trading): Use exactly-once semantics with acknowledgment tracking. Duplicate trade alerts could cause real harm.
IoT alerting: Optimize for high volume, low urgency. Batch notifications, aggregate related alerts more aggressively.