Design Walkthrough
Problem Statement
The Question: Design a job scheduler - a system that runs tasks automatically at scheduled times across many computers.
What is a job scheduler? Think of it like a smart alarm clock for your computer. Instead of waking you up, it runs programs at specific times.
Real examples of jobs that need scheduling:
- 1.Send reminder emails - At 9am every day, send reminders to users who have items in their shopping cart.
- 2.Process yesterday's data - At midnight, add up all sales from the day and save a report.
- 3.Clean up old files - Every Sunday at 3am, delete files older than 30 days.
- 4.Multi-step pipelines - First download data, then clean it, then analyze it, then send a report (each step depends on the one before it).
What the system needs to do (most important first):
- 1.Run jobs on time - When you say run this at 2pm, it runs at 2pm (or very close to it).
- 2.Never miss a job - If a job is scheduled, it MUST run. Silent failures are not acceptable.
- 3.Handle crashes - If a computer dies mid-job, detect it and run the job on another computer.
- 4.Support repeating schedules - Run every hour, every day at 9am, every Monday, etc.
- 5.Handle job dependencies - Run job B only after job A finishes successfully.
- 6.Scale to millions - Handle millions of jobs per day across hundreds of worker computers.
What to say first
Before I design, let me understand what type of scheduler we need. Are we building a time-based scheduler (run at 2pm every day), an event-based queue (run when a message arrives), or a workflow system where jobs have dependencies? Each one is different.
Three types of job schedulers (explained simply):
1. Time-based (like an alarm clock) Example: Run the report generator every night at midnight. How it works: Check the clock, when it is time, start the job.
2. Event-based (like a doorbell) Example: When a new order comes in, process it. How it works: Wait for a message, when it arrives, start the job.
3. Workflow (like a recipe) Example: First download data, then clean it, then analyze it. How it works: Run jobs in order, waiting for each step to finish before starting the next.
What the interviewer really wants to see: - Do you know that jobs can fail halfway through? - Can you make sure a job runs exactly once, not zero times and not twice? - How do you handle computers crashing? - Can you scale to millions of jobs?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How many jobs per day?
How many jobs will run each day? Is it thousands, millions, or billions? How many jobs run at the same time?
Why ask this: 1,000 jobs per day and 10 million jobs per day need completely different designs.
What interviewers usually say: About 1 million jobs per day, with up to 10,000 running at the same time. Jobs can be quick (a few seconds) or long (several hours).
How this changes your design: At this scale, one computer cannot handle it. We need many worker computers and a distributed scheduler.
Question 2: What types of jobs?
Do we need time-based jobs (run at 2pm), event-based jobs (run when something happens), or workflow jobs (run step by step in order)?
Why ask this: Workflow jobs with dependencies are much harder to build than simple time-based jobs.
What interviewers usually say: We need all three types - scheduled jobs, event-triggered jobs, and multi-step workflows.
How this changes your design: We need both a scheduler (for time-based) and an orchestrator (for workflows with steps).
Question 3: What happens when a job fails?
Should we retry failed jobs? How many times? What if a computer crashes in the middle of running a job?
Why ask this: Failure handling is the hardest part of job schedulers.
What interviewers usually say: Retry failed jobs up to 3 times with increasing wait times. Detect crashed computers and reassign their jobs. Never lose a job.
How this changes your design: We need heartbeats (regular I am alive signals), timeouts, and a dead letter queue for jobs that keep failing.
Question 4: Can jobs run twice?
Is it okay if a job accidentally runs twice? Or must it run exactly once?
Why ask this: Exactly-once is MUCH harder than at-least-once.
What interviewers usually say: Running twice is okay as long as the job is designed to handle it (idempotent). Focus on never missing a job.
How this changes your design: We can use simpler at-least-once delivery and require jobs to be safe to retry. This is much easier than guaranteeing exactly-once.
Summarize your assumptions
Let me summarize what I will design for: 1 million jobs per day, 10,000 concurrent jobs, all three job types (time, event, workflow), at-least-once delivery with retries up to 3 times, jobs can run from seconds to hours, and we need to detect crashed workers and reassign their jobs.
The Hard Part
Say this to the interviewer
The hardest part of a job scheduler is handling what happens when a worker computer dies in the middle of running a job. The job might be half-done. Should we run it again? What if that causes duplicate work? This is the core problem we need to solve.
Why this is genuinely hard (explained simply):
Problem 1: The Computer Dies Mid-Job
Imagine this story: - 10:00 AM - We give Job-123 to Worker-A - 10:01 AM - Worker-A starts processing 1 million rows of data - 10:02 AM - Worker-A has processed 500,000 rows (halfway done) - 10:03 AM - Worker-A crashes (power failure, memory full, network dies) - 10:08 AM - We notice Worker-A is not responding - 10:09 AM - We give Job-123 to Worker-B
The big question: Did the job finish? Half finish? Not start? - If we run it again, we might process the same 500,000 rows twice (duplicates!) - If we do not run it again, we lose the other 500,000 rows
What happens when a worker dies
Problem 2: Clock Disagreement
Different computers have slightly different times: - Server A thinks it is 2:00:00 PM - Server B thinks it is 1:59:58 PM - Server C thinks it is 2:00:02 PM
A job scheduled for 2:00 PM might: - Run on Server A (it is 2pm there) - Not run on Server B (it is not 2pm yet there) - If both servers are schedulers, the job might run twice!
Problem 3: Too Many Jobs at Once (Thundering Herd)
Many people schedule jobs at popular times: - 100,000 jobs scheduled for exactly midnight - 50,000 jobs scheduled for exactly 9:00 AM - All try to start at the same second - The system gets overwhelmed
Common mistake candidates make
Many candidates say: just use a database table with a status column and check it every second. This fails because: (1) checking the database every second for millions of jobs is slow, (2) when two servers check at the same time, they might pick the same job, (3) if the scheduler crashes, nobody knows which jobs it was about to run.
How we solve these hard problems:
| Problem | Solution | How it works (simple explanation) |
|---|---|---|
| Worker dies mid-job | Heartbeats + Idempotent jobs | Workers send I am alive signals every 30 seconds. If we do not hear from a worker for 5 minutes, we assume it died and reassign its jobs. Jobs must be designed so running twice is safe. |
| Clock disagreement | One scheduler decides the time | Only one scheduler is in charge. It uses its own clock to decide when jobs are due. Other schedulers are backups. |
| Too many jobs at once | Add random delays (jitter) | Instead of all jobs starting at exactly midnight, spread them out: some at 12:00:00, some at 12:00:01, some at 12:00:02, etc. |
| Same job assigned twice | Database locks | When a scheduler picks a job, it locks that row in the database. Other schedulers cannot pick the same job. |
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Total jobs per day | 1 million | About 12 jobs per second on average, but with big spikes at midnight and top of hour |
| Jobs running at the same time | 10,000 | Need hundreds of worker computers, not just one |
What to tell the interviewer
At 1 million jobs per day, that is about 12 jobs per second on average. But at midnight and the top of each hour, we might see 100 times more jobs starting at once. So we need to handle bursts of 1,200 jobs per second, not just 12.
Jobs per day: 1,000,000
Jobs per second (average): 1,000,000 / 86,400 seconds = ~12 jobs/sec
Jobs per second (peak burst): 12 x 100 = 1,200 jobs/secHow people use the scheduler (from most common to least common):
- 1.Check job status - Did my job finish? Did it fail? What was the error?
- 2.Create new job runs - The scheduler checks every second for jobs that are due and creates run records.
- 3.Workers grab jobs - Workers ask for new jobs to run from the queue.
- 4.Update job progress - Workers report that they are still alive and making progress.
- 5.Handle failures - Detect dead workers and reassign their jobs.
The burst problem
The scheduler is mostly quiet, but at popular times (midnight, 9am, top of hour), there are HUGE spikes. We must design for the peak, not the average. If we can only handle 12 jobs per second and 1,200 come in at midnight, we will fall behind and jobs will be late.
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will separate the system into three main parts: the Scheduler (decides WHEN to run jobs), the Queue (holds jobs waiting to be picked up), and the Workers (actually run the jobs). This way, each part can grow independently.
Job Scheduler - The Big Picture
What each part does and WHY it is separate:
| Part | What it does | Why it is separate |
|---|---|---|
| API Service | Lets users create jobs, check status, and cancel jobs. | Users interact with this. It should always be available even if other parts are busy. |
| Scheduler | Checks the clock every second. When a job is due, it puts the job in the queue. | Why separate? The scheduler needs to be very reliable. If it dies, a backup takes over. We do not want user traffic to slow down the scheduler. |
Common interview question: Why not one big service?
Interviewers often ask: Why do you need separate parts? Can one service do everything? Your answer: Yes, for a small system with few jobs, one service works fine. We split into parts when: (1) different parts need to scale differently - we might need 100 workers but only 2 schedulers, (2) different parts can fail independently - if workers are slow, the scheduler should still work, (3) different parts have different reliability needs - the scheduler MUST always run, but one worker dying is okay.
Technology Choices - Why we picked these tools:
Database: PostgreSQL (Recommended) - Why we chose it: We need to query jobs by time (show me all jobs due before 2pm), track status changes with transactions, and PostgreSQL is very reliable - Other options we considered: - MySQL: Also works great - pick this if your team knows it better - MongoDB: Not ideal because we need transactions and complex queries - DynamoDB: Harder to query by time ranges
Job Queue: Redis or Kafka - Why we need it: Workers should not constantly ask the database for new jobs. The queue notifies them when jobs are ready. - Redis: Simple and fast, good for most cases - Kafka: Better if you need to replay old messages or handle huge scale - RabbitMQ: Good middle ground, widely used
Workers: Stateless processes - Why stateless: If a worker dies, we lose nothing. Another worker can pick up the job. - Can run on Kubernetes, EC2, or any cloud container service
Important interview tip
Pick technologies YOU know! If you have used MySQL at your job, use MySQL. If you know RabbitMQ, use RabbitMQ. Interviewers care more about your reasoning than the specific tool. Say: I will use PostgreSQL because I have experience with it and it handles our requirements well.
How real companies do it
Airflow uses PostgreSQL for job metadata and Celery (Redis/RabbitMQ) for the queue. Uber built Cadence (now called Temporal) which uses a more advanced event-sourcing approach. For most companies, PostgreSQL + Redis works perfectly fine.
Data Model and Storage
Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.
What to tell the interviewer
I will use a SQL database with three main tables: job_definitions (what jobs exist and when they should run), job_executions (records of each time a job ran), and job_dependencies (which jobs must finish before other jobs can start).
Table 1: Job Definitions - What jobs exist and when they should run
This table stores the blueprint for each job - its name, schedule, and what to run.
| Column | What it stores | Example |
|---|---|---|
| job_id | Unique ID for this job | job_abc123 |
| name | Human-readable name | Daily Sales Report |
Table 2: Job Executions - Records of each time a job ran
Every time a job runs, we create a new row here. This tracks the history - when it started, when it finished, did it succeed or fail?
| Column | What it stores | Example |
|---|---|---|
| execution_id | Unique ID for this run | run_xyz789 |
| job_id | Which job this is a run of | job_abc123 |
Database Index (makes searches fast)
We add an INDEX on (status, scheduled_time). This makes it FAST to find all pending jobs that are due - the query the scheduler runs every second.
Table 3: Job Dependencies - Which jobs must run first
For workflows where Job B cannot start until Job A finishes. This table tracks the order.
| Column | What it stores | Example |
|---|---|---|
| id | Unique ID | dep_111 |
| job_id | The job that waits | job_B |
| depends_on_job_id | The job that must finish first | job_A |
| dag_id | Groups related jobs together | daily_pipeline |
Job States - The Life of a Job
A job moves through different states as it runs:
Job Status Changes
What each status means (explained simply):
- PENDING - Job is waiting for its scheduled time. Like an alarm set for 9am. - QUEUED - Time is up, job is in the queue waiting for a worker. Like standing in line. - RUNNING - A worker is executing the job right now. Like being served. - SUCCESS - Job completed without errors. Hooray! - FAILED - Job hit an error. We might retry it. - DEAD - Job failed too many times. It goes to the dead letter queue for humans to investigate.
Important: Why we need heartbeat_at
The heartbeat_at column is critical. Workers update this column every 30 seconds while running a job. If a worker dies, it stops updating. The scheduler checks for jobs where status is RUNNING but heartbeat_at is more than 5 minutes old - those workers are probably dead, and we need to reassign their jobs.
Scheduler Deep Dive
The scheduler is the brain of the system. It decides when jobs should run. Let me explain how it works step by step.
Key insight
The scheduler does NOT run jobs itself. It only looks at the clock, finds jobs that are due, and puts them in the queue. Workers do the actual work. This separation is very important for scaling.
What the scheduler does every second (the main loop):
- 1.Check if I am the leader - Only one scheduler should be active at a time. If I am not the leader, wait.
- 2.Find jobs that are due - Look at the database for jobs where scheduled_time is now or in the past, and status is pending.
- 3.Put them in the queue - For each job that is due, add it to the job queue so workers can pick it up.
- 4.Find stuck jobs - Look for jobs that are running but have not sent a heartbeat in 5 minutes. Those workers probably died.
- 5.Reassign stuck jobs - Put the stuck jobs back in pending so they can be picked up again.
- 6.Schedule next runs - For repeating jobs (like every hour), create the next scheduled run.
EVERY SECOND, THE SCHEDULER DOES THIS:
STEP 1: Am I the leader?Leader Election - Why we need it
Imagine if we had two schedulers both checking for due jobs at the same time. They would both find the same jobs and put them in the queue twice! The same job would run twice.
Solution: Only ONE scheduler is the leader. The others are backups. If the leader dies, a backup becomes the new leader.
LEADER ELECTION WITH DATABASE LOCK:
The database has a special "leader_lock" that only ONE scheduler can hold.Alternative: Use ZooKeeper or etcd
For bigger systems with multiple data centers, use ZooKeeper or etcd for leader election instead of a database lock. These tools are designed specifically for distributed coordination.
Understanding Cron Schedules
Cron is a standard way to write schedules. Here are some examples:
| What you want | Cron expression | Meaning |
|---|---|---|
| Every day at 9am | 0 9 * * * | At minute 0, hour 9, any day, any month, any weekday |
| Every hour | 0 * * * * | At minute 0, any hour |
| Every Monday at 8am | 0 8 * * 1 | At 8am, on day 1 (Monday) |
| Every 15 minutes | */15 * * * * | Every 15th minute |
| First day of month at midnight | 0 0 1 * * | At midnight on day 1 |
Worker Deep Dive
Workers are the muscles of the system. They do the actual work of running jobs. Let me explain how they work.
Key insight
Workers are stateless - they do not remember anything between jobs. If a worker dies, no data is lost because everything is saved in the database. This makes workers easy to add, remove, or replace.
What a worker does (explained simply):
- 1.Wait for a job - Check the queue for jobs to run. If empty, wait.
- 2.Grab a job - Take one job from the queue. The queue makes sure no two workers grab the same job.
- 3.Mark it as running - Update the database: this job is now running, I am the worker running it.
- 4.Start heartbeat - Every 30 seconds, tell the database I am still alive and working on this job.
- 5.Run the job - Execute the actual code (send email, process data, generate report, etc.).
- 6.Mark it as done - Update the database: job succeeded (or failed with error message).
- 7.Repeat - Go back to step 1 and get the next job.
WORKER MAIN LOOP:
FOREVER:Why heartbeats are critical
Without heartbeats, we cannot tell the difference between: - A slow job that is still working (might take hours) - A dead worker that will never finish
With heartbeats: - Worker says I am alive every 30 seconds - If we do not hear from a worker for 5 minutes, it is probably dead - We can reassign that job to another worker
Handling failures and retries
Jobs fail for many reasons: network timeouts, database errors, bugs in the code. Here is how we handle it:
WHEN A JOB FAILS:
STEP 1: Check if we can retryJobs must be idempotent (safe to run twice)
Since jobs can be retried, every job must be safe to run twice. For example, if a job inserts data, it should check if the data already exists first. If a job sends an email, it should use a unique ID to prevent sending the same email twice. This is called being "idempotent" - running twice has the same result as running once.
Example of making a job idempotent:
BAD JOB (not idempotent - running twice causes problems):
INSERT INTO processed_orders VALUES (order_123, ...)
-- If job runs twice, we get duplicate rows!Workflow Dependencies (DAGs)
Sometimes jobs need to run in order. For example: first download data, then clean it, then analyze it. We cannot analyze data we have not downloaded yet! This is called a workflow or DAG (Directed Acyclic Graph).
What is a DAG?
DAG stands for Directed Acyclic Graph. It is just a fancy way to say: jobs have an order, and you cannot have loops. Job A leads to Job B leads to Job C - but C cannot lead back to A.
Example: Daily Data Pipeline
How this workflow runs (explained step by step):
- 1.Jobs A and B start first (they have no dependencies) 2. Job C waits until BOTH A and B are done 3. Once C finishes, D and E can start (they were waiting for C) 4. Job F waits until BOTH D and E are done 5. When F finishes, the whole pipeline is complete
WHEN A DAG IS TRIGGERED:
STEP 1: Create execution records for ALL jobs in the DAGChecking if dependencies are satisfied:
TO CHECK IF JOB C CAN RUN:
Look up all jobs that C depends on (A and B)What if a dependency fails?
If Job A fails, then Job C cannot run (it needs A's output). We have choices: (1) Mark C as failed automatically, (2) Wait for a human to fix A and retry, (3) Skip C and continue with jobs that do not depend on A. Most systems mark downstream jobs as failed and alert the team.
Real-world example: Airflow
Apache Airflow is a popular job scheduler that uses DAGs. You define your workflow as a Python file, and Airflow figures out the order and runs jobs when their dependencies are done. Many companies use Airflow for data pipelines.
Preventing Double Execution
The golden rule
A job must never run twice at the same time. If your daily report job is still running from yesterday, today's run should wait - not start a second copy. Two copies running at once can corrupt data or cause chaos.
Two problems to prevent:
- 1.Double scheduling - The scheduler puts the same job in the queue twice 2. Concurrent execution - Two workers run the same job at the same time
Solution 1: Prevent double scheduling with database locks
THE PROBLEM:
- Scheduler A checks database at 9:00:00.000
- Scheduler B checks database at 9:00:00.001 (1 millisecond later)Solution 2: Prevent concurrent execution with distributed locks
Some jobs should NEVER run at the same time, even from different dag runs. For example, a database migration job.
BEFORE RUNNING A JOB:
STEP 1: Try to get a lock for this jobWhen to use distributed locks
Not all jobs need this! If each job run is independent (like sending different emails), running two at once is fine. Use locks for jobs that modify shared state (like database migrations) or jobs where two copies would cause problems (like generating the same report twice).
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
Common failures and how we handle them:
| What breaks | What happens | How we fix it | Recovery time |
|---|---|---|---|
| Scheduler dies | New jobs are not being scheduled | Backup scheduler detects leader is gone and takes over | About 30 seconds |
| Worker dies mid-job | Job is stuck in running status forever | Scheduler sees heartbeat is old, marks job for retry | About 5 minutes |
The split-brain problem
What if the network breaks and Scheduler A cannot talk to the database, but is still sending heartbeats to itself?
THE PROBLEM:
- Scheduler A is the leader
- Network partitions (Scheduler A cannot reach database)Dead Letter Queue (DLQ) - Where failed jobs go to die
When a job fails too many times, we do not keep retrying forever. We move it to a special place called the dead letter queue.
WHEN A JOB EXCEEDS MAX RETRIES:
STEP 1: Move to dead letter queueWhy not retry forever?
If a job has a bug that always causes it to fail, retrying forever would: (1) waste resources, (2) fill up the queue with the same failing job, (3) potentially cause other problems. It is better to give up after a few tries and let a human investigate.
Growing the System Over Time
What to tell the interviewer
This design works great for up to a few million jobs per day with a single scheduler and database. Let me explain how we would grow it if we need to handle 10x or 100x more jobs.
How we grow step by step:
Stage 1: Starting out (up to 1 million jobs per day) - One PostgreSQL database - One main scheduler with a hot backup - 50-100 workers - One job queue (Redis or RabbitMQ) - This handles most companies' needs - Simple to operate and debug
Stage 2: Growing bigger (up to 10 million jobs per day) - Add read replicas to the database - Partition the job_executions table by date - Add more workers (easy to scale) - Maybe add a second queue for priority jobs - Archive old execution records to cold storage
Stage 3: Huge scale (100+ million jobs per day) - Shard the scheduler by tenant or job type - Each shard has its own database partition - Multiple queues for different job types - Consider event-sourcing (like Temporal) for complex workflows
Sharded Architecture (for very large scale)
Do not over-engineer!
Most companies never need sharding. A single PostgreSQL database handles millions of jobs per day. Only add complexity when you actually need it. Start simple, measure, and scale when you see real bottlenecks.
Different flavors of job schedulers:
If you need sub-second precision: Use in-memory scheduling (like Quartz for Java) instead of database polling. The scheduler keeps upcoming jobs in memory and triggers them precisely.
If you need complex workflows: Use event-sourcing like Temporal. Every step in the workflow is recorded as an event. You can replay the entire history to debug problems. This is what Uber built.
If you need machine learning pipelines: Add resource management (GPU allocation), artifact tracking (save model files), and experiment versioning. Look at tools like Kubeflow or MLflow.
If you need multi-region: Each region has its own scheduler and workers. Jobs are routed to the right region based on where the data lives. A global coordinator handles cross-region workflows.
How real companies do it
Airbnb uses Apache Airflow for data pipelines, running millions of tasks per day. Uber built Cadence (now Temporal) for complex workflows with exactly-once guarantees. Netflix uses Conductor for microservice orchestration. For most companies, Airflow or a simple Redis-based scheduler works perfectly fine.