Design Walkthrough
Problem Statement
The Question: Design an on-call escalation system that handles millions of alerts per day, ensuring critical incidents reach the right responders with automatic escalation.
On-call escalation systems are essential for:
- Incident Response: Getting the right person notified when systems fail
- Escalation Policies: Auto-escalating if the primary on-call does not respond
- Schedule Management: Handling rotations, overrides, and handoffs
- Multi-Channel Delivery: SMS, phone calls, push notifications, email
- Audit Trail: Recording who was notified, when, and their response
What to say first
Before I design, let me clarify the requirements. I want to understand the scale, delivery guarantees needed, and the complexity of scheduling rules we need to support.
Hidden requirements interviewers are testing:
- Can you design a reliable system where failure means missed incidents?
- Do you understand the state machine complexity of escalations?
- Can you handle the scheduling complexity (timezones, overrides, rotations)?
- Do you know how to guarantee delivery across unreliable channels?
Clarifying Questions
Ask these questions to demonstrate senior thinking. Each answer shapes your architecture.
Question 1: Scale
How many alerts per day? How many on-call schedules and users do we need to support?
Why this matters: Determines storage and processing requirements.
Typical answer: 1M alerts/day, 10K organizations, 100K users.
Architecture impact: Need distributed processing; cannot be a single node.
Question 2: Delivery Guarantee
What is the SLA for alert delivery? Is it acceptable to occasionally send duplicate notifications?
Why this matters: Determines the consistency model.
Typical answer: 99.99% delivery within 30 seconds; duplicates are acceptable.
Architecture impact: At-least-once delivery, retry aggressively.
Question 3: Notification Channels
Which notification channels do we need to support? Phone calls, SMS, push, email, Slack?
Why this matters: Each channel has different reliability and latency characteristics.
Typical answer: All of the above, with user-configurable preferences.
Architecture impact: Need an abstraction layer for channels, handle partial failures.
Question 4: Schedule Complexity
Do we need to support schedule overrides, rotations, and follow-the-sun?
Why this matters: Scheduling is surprisingly complex.
Typical answer: Yes, full scheduling with overrides taking precedence.
Architecture impact: Need a sophisticated schedule resolution engine.
Stating assumptions
Based on this, I will assume 1M alerts/day, a 99.99% delivery SLA, multi-channel notifications, complex scheduling with overrides, and that at-least-once delivery is acceptable.
The Hard Part
Say this out loud
The hard part here is guaranteeing that unacknowledged alerts always escalate, while correctly resolving who is on-call at any given moment across complex schedules with overrides.
Why this is genuinely hard:
1. Guaranteed Escalation: If the escalation timer process crashes, alerts could be stuck forever. We need fault-tolerant timer management.
2. Schedule Resolution: Who is on-call right now? Depends on: base rotation, any active overrides, timezone of the schedule, gaps in coverage, and escalation level.
3. Multi-Channel Delivery: SMS providers fail, phone calls go to voicemail, push tokens expire. Each channel has different failure modes.
4. Clock and Timer Reliability: Escalation after 5 minutes means exactly 5 minutes, not whenever the cron job runs next.
Common mistake
Candidates often use simple cron jobs for escalations. But cron is not precise, not distributed, and loses state on restart. You need a proper distributed timer system.
The fundamental reliability challenge:
This system has an unusual property: failure is catastrophic. If PagerDuty fails to escalate an alert, a production outage goes unnoticed for hours. The entire value proposition is reliability.
Compare to other systems:
- E-commerce: A missed order is bad but recoverable
- Social media: A delayed notification is annoying but fine
- On-call: A missed escalation means extended outage, potential revenue loss, customer impact
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Alerts per day | 1,000,000 | About 12 alerts/second average, bursty |
| Peak alerts/second | 1,000 | Need to handle correlated failures |
What to say
At 12 alerts/second average with 1000/second peaks, we need to handle correlated failures where one outage triggers thousands of alerts. The system is write-heavy during incidents but read-heavy for schedule lookups.
Access Pattern Analysis:
- Bursty writes: Alerts come in waves during outages
- Time-sensitive: Escalations must fire precisely
- Read-heavy schedules: Who is on-call is queried constantly
- Audit requirements: Every action must be logged for compliance
- Multi-tenant: Strong isolation between organizations
Storage needed:
- 1M alerts/day x 365 days x 1KB = 365 GB/year for alerts
- 5M notifications/day x 365 x 500B = 912 GB/year for notifications
- Schedules: 50K policies x 10KB = 500 MB (fits in memory)
Timer requirements:
- 1M alerts x 3 escalation levels = 3M timers/day
- Average timer duration: 5 minutes
- Concurrent active timers: ~10,000 at any moment (about 35 timers/second x a 300-second average duration, by Little's law)

High-Level Architecture
Let me start with a simple architecture and evolve it.
What to say
I will design this as an event-driven system with clear separation between alert ingestion, escalation management, and notification delivery. The key insight is that escalations are a state machine.
On-Call Escalation Architecture
Component Responsibilities:
1. Alert Ingestion: API, webhooks, email parsing - normalize all alert sources
2. Deduplication: Group related alerts, prevent notification storms
3. Rule Engine: Match alerts to escalation policies based on service, severity, tags
4. Escalation Manager: Core state machine - tracks incident state, triggers escalations
5. Timer Service: Reliable distributed timers for escalation deadlines
6. Schedule Resolver: Determines who is on-call given complex schedule rules
7. Notification Queue: Reliable delivery to multiple channels with retries
Real-world reference
PagerDuty uses a similar event-driven architecture. They process alerts through a pipeline with deduplication, then manage escalation state machines independently per incident.
Data Model and Storage
Let me define the core data models and storage choices.
What to say
PostgreSQL for transactional data (incidents, schedules), Redis for timers and caching, and a time-series DB for metrics and audit logs.
-- Organizations and Users
CREATE TABLE organizations (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL,                              -- remaining columns are illustrative
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Incident State Machine:
The incident goes through well-defined states:
Incident State Machine
Important detail
State transitions must be atomic. Use database transactions to ensure the incident state and timer are updated together. A crash between them could leave the system in an inconsistent state.
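To make the lifecycle concrete, here is a minimal sketch of the states and allowed transitions; the exact state names are assumptions based on the lifecycle described above, not a fixed schema:

```python
from enum import Enum

class IncidentState(Enum):
    TRIGGERED = "triggered"              # alert matched a policy, notifications going out
    ACKNOWLEDGED = "acknowledged"        # a responder took ownership; escalation stops
    RESOLVED = "resolved"                # underlying problem fixed
    DELIVERY_FAILED = "delivery_failed"  # every notification channel failed

# Allowed transitions; anything else is rejected so duplicate or replayed
# events cannot move an incident backwards.
ALLOWED_TRANSITIONS = {
    IncidentState.TRIGGERED: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED,
                              IncidentState.DELIVERY_FAILED},
    IncidentState.DELIVERY_FAILED: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),
}

def can_transition(current: IncidentState, target: IncidentState) -> bool:
    return target in ALLOWED_TRANSITIONS[current]
```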
Schedule Resolution Deep Dive
Schedule resolution is deceptively complex. Let me walk through the algorithm.
What to say
Schedule resolution requires layering: base rotation, then overrides on top, all calculated in the schedule timezone. I will show the algorithm.
def resolve_oncall(schedule_id: str, at_time: datetime) -> User:
"""
    Determine who is on-call for a schedule at a given time.
    See the fuller sketch after the edge-case list below.
    """

Edge Cases to Handle:
1. Timezone transitions: DST changes can cause 23- or 25-hour days
2. Schedule gaps: No one is on-call (should alert admins)
3. Deleted users: User in rotation no longer exists
4. Overlapping overrides: Later override wins
5. Handoff timing: Exactly at a rotation boundary, who gets it?
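A sketch of the layering that resolve_oncall would delegate to after loading the schedule and its overrides. The schedule and override shapes here are assumptions for illustration, not the exact data model:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def resolve_from_layers(schedule, overrides, at_time: datetime):
    """Layered resolution: overrides first, then the base rotation.

    Assumed shapes: `schedule` has timezone, rotation_start (aware datetime),
    shift_length (timedelta), and an ordered `users` list; each override has
    start, end, created_at, and user.
    """
    tz = ZoneInfo(schedule.timezone)
    local_time = at_time.astimezone(tz)

    # 1. Overrides take precedence; when overrides overlap, the later-created one wins.
    active = [o for o in overrides if o.start <= local_time < o.end]
    if active:
        return max(active, key=lambda o: o.created_at).user

    # 2. Base rotation: which shift are we in? Working in the schedule's own timezone
    #    keeps DST transitions (23- or 25-hour days) consistent.
    elapsed = local_time - schedule.rotation_start.astimezone(tz)
    if elapsed < timedelta(0) or not schedule.users:
        return None  # gap in coverage: alert the schedule admins instead of paging nobody
    shift_index = elapsed // schedule.shift_length
    user = schedule.users[shift_index % len(schedule.users)]

    # 3. A deleted user in the rotation is treated as a gap, not silently skipped.
    return user if user.is_active else None
```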
Caching strategy
Cache resolved schedules for short periods (1-5 minutes). Schedules change rarely but are queried constantly during incidents. Invalidate on any schedule or override change.
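A minimal sketch of that cache, assuming we only cache "who is on-call right now" lookups with a 60-second TTL:

```python
import time
from datetime import datetime, timezone
from typing import Dict, Tuple

CACHE_TTL_SECONDS = 60  # within the 1-5 minute guidance above
_oncall_cache: Dict[str, Tuple[float, object]] = {}

def current_oncall_cached(schedule_id: str):
    """Cache only 'right now' lookups; historical audit queries bypass the cache."""
    hit = _oncall_cache.get(schedule_id)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    user = resolve_oncall(schedule_id, datetime.now(timezone.utc))  # from the section above
    _oncall_cache[schedule_id] = (time.time(), user)
    return user

def invalidate_schedule_cache(schedule_id: str) -> None:
    """Call this on any schedule or override change."""
    _oncall_cache.pop(schedule_id, None)
```

In a multi-node deployment this cache would live in Redis rather than process memory, but the TTL-plus-invalidation shape is the same.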
Timer Service Deep Dive
The timer service is critical for reliable escalations. It must guarantee timers fire even if nodes fail.
Say this out loud
The timer service is the heart of reliability. If it fails to fire a timer, an incident goes unescalated. I will use a distributed approach with Redis sorted sets.
import redis
import time
import json

Why this design works (sketched below):
1. Sorted set ordering: Timers are automatically ordered by fire time
2. Atomic claiming: Lua script ensures no double-firing
3. Processing set: Tracks in-flight timers for recovery
4. Multiple workers: Any worker can poll and process timers
5. Crash recovery: Stuck timers are moved back automatically
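A minimal sketch of the timer service under these assumptions; the key names, Lua script, and worker loop are illustrative rather than a specific production implementation:

```python
import json
import time
import uuid

import redis

r = redis.Redis(decode_responses=True)

# Atomically move due timers from the pending sorted set to the processing set.
# KEYS[1] = pending zset, KEYS[2] = processing zset
# ARGV[1] = now (unix seconds), ARGV[2] = max timers to claim
CLAIM_SCRIPT = r.register_script("""
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, member in ipairs(due) do
    redis.call('ZREM', KEYS[1], member)
    redis.call('ZADD', KEYS[2], ARGV[1], member)
end
return due
""")

def schedule_timer(incident_id: str, level: int, fire_at: float) -> None:
    """Register an escalation deadline as a member of the pending sorted set."""
    member = json.dumps({"id": str(uuid.uuid4()), "incident_id": incident_id, "level": level})
    r.zadd("timers:pending", {member: fire_at})

def poll_timers(batch_size: int = 100) -> None:
    """Worker loop: claim due timers atomically, fire them, then acknowledge."""
    while True:
        due = CLAIM_SCRIPT(keys=["timers:pending", "timers:processing"],
                           args=[time.time(), batch_size])
        for member in due:
            timer = json.loads(member)
            escalate_incident(timer["incident_id"])  # shown later in this walkthrough
            r.zrem("timers:processing", member)      # acknowledge only after processing
        if not due:
            time.sleep(0.5)

def recover_stuck_timers(max_processing_seconds: int = 60) -> None:
    """Move timers stuck in the processing set (crashed worker) back to pending."""
    cutoff = time.time() - max_processing_seconds
    for member in r.zrangebyscore("timers:processing", "-inf", cutoff):
        r.zrem("timers:processing", member)
        r.zadd("timers:pending", {member: time.time()})
```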
Critical consideration
Redis persistence is crucial here. Use AOF with fsync=always or accept that some timers might be lost on Redis crash. For production, replicate Redis and consider additional persistence layer.
Notification Delivery
Multi-channel notification delivery requires handling different failure modes per channel.
What to say
Each notification channel has different reliability characteristics. We use a priority queue with retries and fallback to secondary channels on failure.
| Channel | Latency | Reliability | Failure Modes |
|---|---|---|---|
| Push Notification | 1-5 seconds | Medium | Token expired, app not installed, phone off |
| SMS | 5-30 seconds | High | Carrier issues, international routing, spam filters |
| Phone Call | 10-60 seconds | High | Voicemail, no answer, busy signal |
| Email | 30-300 seconds | Medium | Spam filters, delayed delivery, inbox full |
| Slack/Teams | 1-5 seconds | High | Rate limits, user DND, channel muted |
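To make the fallback behaviour concrete, here is a sketch of a common channel-provider interface and a delivery loop that falls back to the next channel on failure; the class and method names are assumptions, not a specific provider API. The NotificationService below would wrap providers implementing this interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol

class ChannelProvider(Protocol):
    """Common interface each channel integration (SMS, voice, push, email, chat) implements."""
    def send(self, user_id: str, message: str) -> bool: ...

@dataclass
class NotificationRequest:
    user_id: str
    incident_id: str
    message: str
    channel_order: List[str]  # the user's configured priority, e.g. ["push", "sms", "voice"]

def deliver_with_fallback(request: NotificationRequest,
                          providers: Dict[str, ChannelProvider]) -> Optional[str]:
    """Try channels in priority order; fall back to the next on failure.

    Returns the channel that succeeded, or None if every channel failed,
    in which case the incident should be marked as a delivery failure.
    """
    for channel in request.channel_order:
        provider = providers.get(channel)
        if provider is None:
            continue
        try:
            if provider.send(request.user_id, request.message):
                return channel
        except Exception:
            # Provider outage, timeout, expired token, etc.: fall through to the next channel
            continue
    return None
```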
class NotificationService:
def __init__(self):
        # Channel name -> provider client, one per channel in the table above
        self.providers = dict()

Notification Deduplication:
Users should not receive duplicate notifications for the same incident level:
import redis

redis_client = redis.Redis()

def should_notify(user_id: str, incident_id: str, level: int) -> bool:
    """Check if we should send a notification. Prevents duplicates."""
    # Build dedup key from user, incident, and level
    dedup_key = "notified:" + user_id + ":" + incident_id + ":" + str(level)
    # SET with NX succeeds only if the key did not already exist (first notification).
    # Key expires after 1 hour to allow re-notification on repeat escalations.
    is_first = redis_client.set(dedup_key, "1", nx=True, ex=3600)
    return bool(is_first)

Consistency and Invariants
System Invariants
The system must NEVER fail to escalate an unacknowledged incident. This is the core reliability guarantee that customers pay for.
Critical invariants:
1. Escalation guarantee: Unacknowledged incident must always escalate after timeout (a safety-net sweep enforcing this is sketched below)
2. Notification delivery: At least one channel must succeed, or the incident is marked as a delivery failure
3. Schedule accuracy: On-call resolution must be deterministic for audit
4. State consistency: Incident state and timers must be updated atomically
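Beyond per-incident timers, a periodic sweep can enforce the escalation guarantee even if a timer is lost. A sketch, with illustrative table and column names:

```python
def enforce_escalation_guarantee(db):
    """Safety-net sweep run every minute by a separate worker.

    Any incident still unacknowledged whose escalation is overdue gets escalated
    even if its timer was lost. Table and column names are illustrative.
    """
    rows = db.query(
        "SELECT id FROM incidents "
        "WHERE state = 'triggered' "
        "AND last_escalated_at < now() - interval '10 minutes'")
    for row in rows:
        escalate_incident(row.id)  # shown below under 'Ensuring atomic state transitions'
```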
Why at-least-once delivery:
We explicitly choose at-least-once over exactly-once:
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| At-least-once | Guarantees delivery | May duplicate | Alerting systems (default) |
| Exactly-once | No duplicates | May miss delivery | Never for critical alerts |
| At-most-once | Simple, no retries | May lose alerts | Low-priority notifications only |
Business impact mapping
If a user gets two SMS messages for the same incident, they are mildly annoyed. If they get zero messages, a production outage goes unnoticed. We optimize for the failure mode that matters.
Ensuring atomic state transitions:
def escalate_incident(incident_id: str):
"""Atomically escalate incident to next level using DB transaction."""
    with db.transaction():
        # Guarded update (the db helper and column names are illustrative): escalate
        # only while the incident is still unacknowledged, so a duplicate timer fire
        # is a harmless no-op.
        rows = db.execute(
            "UPDATE incidents "
            "SET escalation_level = escalation_level + 1, last_escalated_at = now() "
            "WHERE id = %s AND state = 'triggered'", (incident_id,))
        if rows == 0:
            return  # already acknowledged or resolved
        # Re-arm the next escalation deadline before committing, so the state
        # change and the timer cannot diverge if the process crashes here.
        arm_escalation_timer(incident_id)  # illustrative helper: enqueue into the timer service

Failure Modes and Resilience
Proactively discuss failures
Let me walk through the failure modes. For an alerting system, reliability is the product - we must handle every failure gracefully.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Database down | Cannot process alerts | Queue incoming alerts, process when recovered | Alerts are not lost, just delayed |
| Timer service crash | Escalations may be delayed | Multiple workers, stuck timer recovery | At least one worker will process |
Self-Monitoring:
An on-call system must monitor itself. If PagerDuty is down, who alerts the PagerDuty team?
class SelfMonitor:
"""Self-monitoring using independent infrastructure."""
Real-world approach
PagerDuty runs their self-monitoring on separate infrastructure in different cloud regions. If their primary system fails, the backup can still reach the on-call team.
Evolution and Scaling
What to say
This design handles our initial scale well. Let me discuss how it evolves for 10x growth and additional features.
Evolution Path:
Stage 1: Single Region (up to 1M alerts/day)
- PostgreSQL with read replicas
- Redis cluster for timers
- Simple queue for notifications
- Good for most organizations

Stage 2: Multi-Region (up to 10M alerts/day)
- Regional deployments for latency
- Cross-region replication for schedules
- Global load balancing for API
- Regional notification delivery

Stage 3: Enterprise Scale (up to 100M alerts/day)
- Kafka for alert ingestion
- Partitioned processing by org
- Dedicated infrastructure for large customers
- Custom integrations and workflows
Multi-Region Architecture
Additional Features for V2:
1. Incident Response Automation: Auto-remediation, runbooks
2. Analytics and Insights: MTTA, MTTR tracking, on-call load
3. Integrations: Slack workflows, Jira ticket creation
4. Intelligent Alerting: ML-based deduplication, correlation
5. Status Pages: Customer-facing incident communication
Alternative approach
If we needed lower latency for notifications, I would consider edge deployment with local notification delivery and centralized coordination. This is how some CDN-based solutions work.
What I would do differently for...
Consumer app notifications: Relax delivery guarantees, prioritize cost efficiency over reliability. One missed push notification is acceptable.
Financial alerts (trading): Use exactly-once semantics with acknowledgment tracking. Duplicate trade alerts could cause real harm.
IoT alerting: Optimize for high volume, low urgency. Batch notifications, aggregate related alerts more aggressively.