System Design Masterclass
Infrastructure | on-call | alerting | escalation | incident-management | notifications | intermediate

Design On-Call Escalation System

Design an incident management and escalation system like PagerDuty

Millions of alerts/day | Similar to PagerDuty, Opsgenie, VictorOps, Atlassian, Slack | 45 min read

Summary

On-call escalation systems ensure critical alerts reach the right people at the right time, escalating automatically when an alert goes unacknowledged. The core challenge is guaranteeing alert delivery with minimal latency while handling complex scheduling rules and multi-channel notifications. This question comes up in interviews at PagerDuty, Opsgenie, Atlassian, and companies building internal incident management tools.

Key Takeaways

Core Problem

This is fundamentally a reliable delivery problem with time-based state transitions. We must guarantee alerts reach humans and escalate automatically if not acknowledged.
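To make "time-based state transitions" concrete, here is a minimal sketch of an incident as a state machine driven by an acknowledgment deadline. The names (`Incident`, `State`, `tick`, `timeout_s`) and the choice to re-notify the last level once the policy is exhausted are assumptions for illustration, not how any particular product works. Time is passed in explicitly so the logic stays testable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"


@dataclass
class Incident:
    levels: List[str]       # responders in escalation order (hypothetical)
    timeout_s: float        # seconds to wait for an ack at each level
    state: State = State.TRIGGERED
    level: int = 0
    deadline: float = 0.0   # when the current level's ack window expires

    def trigger(self, now: float) -> str:
        """Open the ack window and return the first responder to notify."""
        self.deadline = now + self.timeout_s
        return self.levels[0]

    def acknowledge(self) -> None:
        """An acknowledgment stops any further escalation."""
        self.state = State.ACKNOWLEDGED

    def tick(self, now: float) -> Optional[str]:
        """Called by a timer; returns who to notify if the deadline passed."""
        if self.state is not State.TRIGGERED or now < self.deadline:
            return None
        # Assumption: once the last level is reached, keep re-notifying it.
        if self.level + 1 < len(self.levels):
            self.level += 1
        self.deadline = now + self.timeout_s
        return self.levels[self.level]
```

In a real deployment the deadline would need to live in durable storage so that a crashed worker cannot silently drop a pending escalation; this sketch keeps it in memory only.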

The Hard Part

Guaranteeing alert delivery across multiple channels (SMS, phone, push, email) while correctly handling complex on-call schedules with overrides, rotations, and timezone rules.
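The scheduling half of this problem can be sketched as a lookup: given a point in time, who is on call? The example below assumes a simple fixed-length rotation plus a list of overrides that take precedence, and normalizes every query to UTC so that callers in different timezones get the same answer. `Rotation`, `Override`, and `on_call` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Override:
    start: datetime   # inclusive, timezone-aware
    end: datetime     # exclusive, timezone-aware
    user: str


@dataclass
class Rotation:
    users: List[str]      # rotation order
    start: datetime       # timezone-aware start of the first shift
    shift: timedelta      # length of each shift
    overrides: List[Override] = field(default_factory=list)

    def on_call(self, at: datetime) -> str:
        """Return the responder on call at `at` (must be timezone-aware)."""
        at = at.astimezone(timezone.utc)
        # Overrides win over the regular rotation.
        for o in self.overrides:
            if o.start <= at < o.end:
                return o.user
        # Whole shifts elapsed since the rotation started.
        shifts_elapsed = (at - self.start) // self.shift
        return self.users[shifts_elapsed % len(self.users)]
```

Real schedules also involve layered rotations and handoff times in a team's local timezone (which shift with daylight saving); those are left out here.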

Scaling Axis

Scale by partitioning alerts by organization/team. Each partition handles its own escalation timers independently.
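One way to sketch this partitioning is a stable hash of the organization ID, so every alert and timer for the same organization lands on the same partition. The function name `partition_for` and the use of SHA-256 are assumptions; a production system might prefer consistent hashing to limit data movement when the partition count changes.

```python
import hashlib


def partition_for(org_id: str, num_partitions: int) -> int:
    """Map an organization ID to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Use a cryptographic hash rather than Python's built-in hash(), which is
    # randomized per process and so would not be stable across workers.
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends only on `org_id`, any worker can compute where an alert belongs without a central lookup.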

The Question: Design an on-call escalation system that handles millions of alerts per day, ensuring critical incidents reach the right responders with automatic escalation.

On-call escalation systems are essential for:
- Incident Response: Getting the right person notified when systems fail
- Escalation Policies: Auto-escalating if primary on-call does not respond
- Schedule Management: Handling rotations, overrides, and handoffs
- Multi-Channel Delivery: SMS, phone calls, push notifications, email
- Audit Trail: Recording who was notified, when, and their response
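The multi-channel delivery and audit trail items can be illustrated together. The sketch below tries channels in a preferred order, falls back to the next one when a send fails, and records every attempt. The `Sender` signature, the `Attempt` record, and the sequential-fallback policy are all assumptions; a real system might instead notify on several channels in parallel according to each user's preferences.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sender signature: (user, message) -> True if delivered.
Sender = Callable[[str, str], bool]


@dataclass
class Attempt:
    channel: str
    user: str
    delivered: bool


def notify(user: str, message: str, channels: Dict[str, Sender],
           order: List[str], audit: List[Attempt]) -> bool:
    """Try each channel in order until one delivers; log every attempt."""
    for name in order:
        try:
            ok = channels[name](user, message)
        except Exception:
            # Treat a provider error like a failed delivery and move on.
            ok = False
        audit.append(Attempt(channel=name, user=user, delivered=ok))
        if ok:
            return True
    return False
```

If `notify` returns `False`, every channel failed for that user, which is itself a signal to escalate to the next level immediately.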

What to say first

Before I design, let me clarify the requirements. I want to understand the scale, delivery guarantees needed, and the complexity of scheduling rules we need to support.

Hidden requirements interviewers are testing:
- Can you design a reliable system where failure means missed incidents?
- Do you understand the state machine complexity of escalations?
- Can you handle the scheduling complexity (timezones, overrides, rotations)?
- Do you know how to guarantee delivery across unreliable channels?
