System Design Masterclass
Tags: Infrastructure · job-scheduler · distributed-systems · cron · task-queue · dag · advanced

Design Distributed Job Scheduler

Design a system that runs tasks automatically on a schedule, like Airflow or Kubernetes CronJobs

Millions of jobs/day | Similar to Airbnb, Uber, Netflix, Google, Meta | 45 min read

Summary

A job scheduler is like an alarm clock for your computer - it runs tasks at specific times without anyone clicking a button. Think about sending millions of emails at 9am, or processing all of yesterday's orders at midnight. The tricky parts are: making sure each job runs exactly once (not zero times, not twice), handling what happens when a computer crashes in the middle of a job, and running jobs in the right order when one job depends on another. Companies like Airbnb, Uber, Netflix, and Google ask this question because they run millions of background tasks every day.

Key Takeaways

Core Problem

The main job is to make sure every scheduled task runs exactly once, at the right time, even when computers crash or the network has problems.

The Hard Part

What if a computer dies while running a job halfway? The job is half-done. If we run it again, we might do the same work twice. If we skip it, we lose data. This is the hardest problem to solve.

Scaling Axis

We split jobs by user or by job type. Each worker machine pulls tasks from a shared queue, so adding more workers lets us run more jobs at once.

Critical Invariant

A job that is supposed to run MUST run. It should never be silently skipped. Also, the same job should not run twice at the same time. A small delay is okay, but losing a job is not.

Performance Requirement

Being a few seconds late is fine. Missing a job completely is a disaster. We care more about reliability than speed.

Key Tradeoff

We trade speed for safety. Jobs go through a persistent queue (saved to disk) so even if a computer crashes, we never lose a job.

Design Walkthrough

Problem Statement

The Question: Design a job scheduler - a system that runs tasks automatically at scheduled times across many computers.

What is a job scheduler? Think of it like a smart alarm clock for your computer. Instead of waking you up, it runs programs at specific times.

Real examples of jobs that need scheduling:

  1. Send reminder emails - At 9am every day, send reminders to users who have items in their shopping cart.
  2. Process yesterday's data - At midnight, add up all sales from the day and save a report.
  3. Clean up old files - Every Sunday at 3am, delete files older than 30 days.
  4. Multi-step pipelines - First download data, then clean it, then analyze it, then send a report (each step depends on the one before it).

What the system needs to do (most important first):

  1. Run jobs on time - When you say run this at 2pm, it runs at 2pm (or very close to it).
  2. Never miss a job - If a job is scheduled, it MUST run. Silent failures are not acceptable.
  3. Handle crashes - If a computer dies mid-job, detect it and run the job on another computer.
  4. Support repeating schedules - Run every hour, every day at 9am, every Monday, etc.
  5. Handle job dependencies - Run job B only after job A finishes successfully.
  6. Scale to millions - Handle millions of jobs per day across hundreds of worker computers.

What to say first

Before I design, let me understand what type of scheduler we need. Are we building a time-based scheduler (run at 2pm every day), an event-based queue (run when a message arrives), or a workflow system where jobs have dependencies? Each one is different.

Three types of job schedulers (explained simply):

1. Time-based (like an alarm clock). Example: Run the report generator every night at midnight. How it works: Check the clock; when it is time, start the job.

2. Event-based (like a doorbell). Example: When a new order comes in, process it. How it works: Wait for a message; when it arrives, start the job.

3. Workflow (like a recipe). Example: First download data, then clean it, then analyze it. How it works: Run jobs in order, waiting for each step to finish before starting the next.

What the interviewer really wants to see:

  • Do you know that jobs can fail halfway through?
  • Can you make sure a job runs exactly once, not zero times and not twice?
  • How do you handle computers crashing?
  • Can you scale to millions of jobs?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How many jobs per day?

How many jobs will run each day? Is it thousands, millions, or billions? How many jobs run at the same time?

Why ask this: 1,000 jobs per day and 10 million jobs per day need completely different designs.

What interviewers usually say: About 1 million jobs per day, with up to 10,000 running at the same time. Jobs can be quick (a few seconds) or long (several hours).

How this changes your design: At this scale, one computer cannot handle it. We need many worker computers and a distributed scheduler.

Question 2: What types of jobs?

Do we need time-based jobs (run at 2pm), event-based jobs (run when something happens), or workflow jobs (run step by step in order)?

Why ask this: Workflow jobs with dependencies are much harder to build than simple time-based jobs.

What interviewers usually say: We need all three types - scheduled jobs, event-triggered jobs, and multi-step workflows.

How this changes your design: We need both a scheduler (for time-based) and an orchestrator (for workflows with steps).

Question 3: What happens when a job fails?

Should we retry failed jobs? How many times? What if a computer crashes in the middle of running a job?

Why ask this: Failure handling is the hardest part of job schedulers.

What interviewers usually say: Retry failed jobs up to 3 times with increasing wait times. Detect crashed computers and reassign their jobs. Never lose a job.

How this changes your design: We need heartbeats (regular "I am alive" signals), timeouts, and a dead letter queue for jobs that keep failing.

Question 4: Can jobs run twice?

Is it okay if a job accidentally runs twice? Or must it run exactly once?

Why ask this: Exactly-once is MUCH harder than at-least-once.

What interviewers usually say: Running twice is okay as long as the job is designed to handle it (idempotent). Focus on never missing a job.

How this changes your design: We can use simpler at-least-once delivery and require jobs to be safe to retry. This is much easier than guaranteeing exactly-once.

Summarize your assumptions

Let me summarize what I will design for: 1 million jobs per day, 10,000 concurrent jobs, all three job types (time, event, workflow), at-least-once delivery with retries up to 3 times, jobs can run from seconds to hours, and we need to detect crashed workers and reassign their jobs.

The Hard Part

Say this to the interviewer

The hardest part of a job scheduler is handling what happens when a worker computer dies in the middle of running a job. The job might be half-done. Should we run it again? What if that causes duplicate work? This is the core problem we need to solve.

Why this is genuinely hard (explained simply):

Problem 1: The Computer Dies Mid-Job

Imagine this story:

  • 10:00 AM - We give Job-123 to Worker-A
  • 10:01 AM - Worker-A starts processing 1 million rows of data
  • 10:02 AM - Worker-A has processed 500,000 rows (halfway done)
  • 10:03 AM - Worker-A crashes (power failure, memory full, network dies)
  • 10:08 AM - We notice Worker-A is not responding
  • 10:09 AM - We give Job-123 to Worker-B

The big question: Did the job finish? Half finish? Not start?

  • If we run it again, we might process the same 500,000 rows twice (duplicates!)
  • If we do not run it again, we lose the other 500,000 rows

What happens when a worker dies

Problem 2: Clock Disagreement

Different computers have slightly different times:

  • Server A thinks it is 2:00:00 PM
  • Server B thinks it is 1:59:58 PM
  • Server C thinks it is 2:00:02 PM

A job scheduled for 2:00 PM might:

  • Run on Server A (it is 2pm there)
  • Not run on Server B (it is not 2pm yet there)
  • If both servers are schedulers, the job might run twice!

Problem 3: Too Many Jobs at Once (Thundering Herd)

Many people schedule jobs at popular times:

  • 100,000 jobs scheduled for exactly midnight
  • 50,000 jobs scheduled for exactly 9:00 AM
  • All try to start at the same second
  • The system gets overwhelmed

Common mistake candidates make

Many candidates say: just use a database table with a status column and check it every second. This fails because: (1) checking the database every second for millions of jobs is slow, (2) when two servers check at the same time, they might pick the same job, (3) if the scheduler crashes, nobody knows which jobs it was about to run.

How we solve these hard problems:

Problem | Solution | How it works (simple explanation)
Worker dies mid-job | Heartbeats + idempotent jobs | Workers send "I am alive" signals every 30 seconds. If we do not hear from a worker for 5 minutes, we assume it died and reassign its jobs. Jobs must be designed so running twice is safe.
Clock disagreement | One scheduler decides the time | Only one scheduler is in charge. It uses its own clock to decide when jobs are due. Other schedulers are backups.
Too many jobs at once | Add random delays (jitter) | Instead of all jobs starting at exactly midnight, spread them out: some at 12:00:00, some at 12:00:01, some at 12:00:02, and so on (see the sketch below).
Same job assigned twice | Database locks | When a scheduler picks a job, it locks that row in the database. Other schedulers cannot pick the same job.
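
To make the jitter idea concrete, here is a minimal sketch of how a scheduler might spread out start times. This is illustrative only - the 2-minute window, the function name, and the use of Python are assumptions, not part of the walkthrough above.

    import random
    from datetime import datetime, timedelta

    JITTER_WINDOW_SEC = 120  # assumed 2-minute spread; tune per system

    def jittered_start(scheduled_time: datetime) -> datetime:
        # Spread jobs scheduled for the same instant (e.g. exactly midnight)
        # over a short window so the queue is not hit by thousands at once.
        return scheduled_time + timedelta(seconds=random.uniform(0, JITTER_WINDOW_SEC))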

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Total jobs per day | 1 million | About 12 jobs per second on average, but with big spikes at midnight and at the top of each hour
Jobs running at the same time | 10,000 | Need hundreds of worker computers, not just one
+ 4 more rows...

What to tell the interviewer

At 1 million jobs per day, that is about 12 jobs per second on average. But at midnight and the top of each hour, we might see 100 times more jobs starting at once. So we need to handle bursts of 1,200 jobs per second, not just 12.

Jobs per day: 1,000,000
Jobs per second (average): 1,000,000 / 86,400 seconds = ~12 jobs/sec
Jobs per second (peak burst): 12 x 100 = 1,200 jobs/sec
+ 11 more lines...

How people use the scheduler (from most common to least common):

  1. Check job status - Did my job finish? Did it fail? What was the error?
  2. Create new job runs - The scheduler checks every second for jobs that are due and creates run records.
  3. Workers grab jobs - Workers ask for new jobs to run from the queue.
  4. Update job progress - Workers report that they are still alive and making progress.
  5. Handle failures - Detect dead workers and reassign their jobs.

The burst problem

The scheduler is mostly quiet, but at popular times (midnight, 9am, top of hour), there are HUGE spikes. We must design for the peak, not the average. If we can only handle 12 jobs per second and 1,200 come in at midnight, we will fall behind and jobs will be late.

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will separate the system into three main parts: the Scheduler (decides WHEN to run jobs), the Queue (holds jobs waiting to be picked up), and the Workers (actually run the jobs). This way, each part can grow independently.

Job Scheduler - The Big Picture

What each part does and WHY it is separate:

Part | What it does | Why it is separate
API Service | Lets users create jobs, check status, and cancel jobs. | Users interact with this. It should always be available even if other parts are busy.
Scheduler | Checks the clock every second. When a job is due, it puts the job in the queue. | The scheduler needs to be very reliable. If it dies, a backup takes over. We do not want user traffic to slow down the scheduler.
+ 4 more rows...

Common interview question: Why not one big service?

Interviewers often ask: Why do you need separate parts? Can one service do everything? Your answer: Yes, for a small system with few jobs, one service works fine. We split into parts when: (1) different parts need to scale differently - we might need 100 workers but only 2 schedulers, (2) different parts can fail independently - if workers are slow, the scheduler should still work, (3) different parts have different reliability needs - the scheduler MUST always run, but one worker dying is okay.

Technology Choices - Why we picked these tools:

Database: PostgreSQL (Recommended)
  • Why we chose it: We need to query jobs by time (show me all jobs due before 2pm), track status changes with transactions, and PostgreSQL is very reliable.
  • Other options we considered:
    • MySQL: Also works great - pick this if your team knows it better.
    • MongoDB: Not ideal because we need transactions and complex queries.
    • DynamoDB: Harder to query by time ranges.

Job Queue: Redis or Kafka
  • Why we need it: Workers should not constantly ask the database for new jobs. The queue notifies them when jobs are ready.
  • Redis: Simple and fast, good for most cases.
  • Kafka: Better if you need to replay old messages or handle huge scale.
  • RabbitMQ: Good middle ground, widely used.

Workers: Stateless processes
  • Why stateless: If a worker dies, we lose nothing. Another worker can pick up the job.
  • Can run on Kubernetes, EC2, or any cloud container service.

Important interview tip

Pick technologies YOU know! If you have used MySQL at your job, use MySQL. If you know RabbitMQ, use RabbitMQ. Interviewers care more about your reasoning than the specific tool. Say: I will use PostgreSQL because I have experience with it and it handles our requirements well.

How real companies do it

Airflow uses PostgreSQL for job metadata and Celery (Redis/RabbitMQ) for the queue. Uber built Cadence (now called Temporal) which uses a more advanced event-sourcing approach. For most companies, PostgreSQL + Redis works perfectly fine.

Data Model and Storage

Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.

What to tell the interviewer

I will use a SQL database with three main tables: job_definitions (what jobs exist and when they should run), job_executions (records of each time a job ran), and job_dependencies (which jobs must finish before other jobs can start).

Table 1: Job Definitions - What jobs exist and when they should run

This table stores the blueprint for each job - its name, schedule, and what to run.

Column | What it stores | Example
job_id | Unique ID for this job | job_abc123
name | Human-readable name | Daily Sales Report
+ 8 more rows...

Table 2: Job Executions - Records of each time a job ran

Every time a job runs, we create a new row here. This tracks the history - when it started, when it finished, did it succeed or fail?

Column | What it stores | Example
execution_id | Unique ID for this run | run_xyz789
job_id | Which job this is a run of | job_abc123
+ 9 more rows...

Database Index (makes searches fast)

We add an INDEX on (status, scheduled_time). This makes it FAST to find all pending jobs that are due - the query the scheduler runs every second.
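
For illustration, assuming PostgreSQL, psycopg2, and the column names sketched in this section (status, scheduled_time), the index and the every-second poll could look roughly like this - a sketch, not a fixed schema:

    import psycopg2

    conn = psycopg2.connect("dbname=scheduler")  # connection details are illustrative
    with conn, conn.cursor() as cur:
        # One-time index so the every-second poll stays fast even with
        # millions of execution rows.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_executions_due "
            "ON job_executions (status, scheduled_time)")
        # The scheduler's poll: all pending runs whose time has come.
        cur.execute(
            "SELECT execution_id FROM job_executions "
            "WHERE status = 'PENDING' AND scheduled_time <= now() "
            "ORDER BY scheduled_time LIMIT 1000")
        due_jobs = cur.fetchall()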

Table 3: Job Dependencies - Which jobs must run first

For workflows where Job B cannot start until Job A finishes. This table tracks the order.

Column | What it stores | Example
id | Unique ID | dep_111
job_id | The job that waits | job_B
depends_on_job_id | The job that must finish first | job_A
dag_id | Groups related jobs together | daily_pipeline

Job States - The Life of a Job

A job moves through different states as it runs:

Job Status Changes

What each status means (explained simply):

  • PENDING - Job is waiting for its scheduled time. Like an alarm set for 9am.
  • QUEUED - Time is up, job is in the queue waiting for a worker. Like standing in line.
  • RUNNING - A worker is executing the job right now. Like being served.
  • SUCCESS - Job completed without errors. Hooray!
  • FAILED - Job hit an error. We might retry it.
  • DEAD - Job failed too many times. It goes to the dead letter queue for humans to investigate.

Important: Why we need heartbeat_at

The heartbeat_at column is critical. Workers update this column every 30 seconds while running a job. If a worker dies, it stops updating. The scheduler checks for jobs where status is RUNNING but heartbeat_at is more than 5 minutes old - those workers are probably dead, and we need to reassign their jobs.
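
As a sketch of that check (assuming psycopg2, the job_executions table above, and a worker_id column that the truncated table does not show), the scheduler could reclaim dead workers' jobs with a single UPDATE:

    import psycopg2

    conn = psycopg2.connect("dbname=scheduler")  # illustrative
    with conn, conn.cursor() as cur:
        # Any RUNNING job whose worker has been silent for 5 minutes is
        # presumed dead; put it back in PENDING so another worker picks it up.
        cur.execute(
            "UPDATE job_executions "
            "SET status = 'PENDING', worker_id = NULL "
            "WHERE status = 'RUNNING' "
            "AND heartbeat_at < now() - interval '5 minutes'")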

Scheduler Deep Dive

The scheduler is the brain of the system. It decides when jobs should run. Let me explain how it works step by step.

Key insight

The scheduler does NOT run jobs itself. It only looks at the clock, finds jobs that are due, and puts them in the queue. Workers do the actual work. This separation is very important for scaling.

What the scheduler does every second (the main loop):

  1. Check if I am the leader - Only one scheduler should be active at a time. If I am not the leader, wait.
  2. Find jobs that are due - Look at the database for jobs where scheduled_time is now or in the past, and status is pending.
  3. Put them in the queue - For each job that is due, add it to the job queue so workers can pick it up.
  4. Find stuck jobs - Look for jobs that are running but have not sent a heartbeat in 5 minutes. Those workers probably died.
  5. Reassign stuck jobs - Put the stuck jobs back in pending so they can be picked up again.
  6. Schedule next runs - For repeating jobs (like every hour), create the next scheduled run.

EVERY SECOND, THE SCHEDULER DOES THIS:

STEP 1: Am I the leader?
+ 39 more lines...

Leader Election - Why we need it

Imagine if we had two schedulers both checking for due jobs at the same time. They would both find the same jobs and put them in the queue twice! The same job would run twice.

Solution: Only ONE scheduler is the leader. The others are backups. If the leader dies, a backup becomes the new leader.

LEADER ELECTION WITH DATABASE LOCK:

The database has a special "leader_lock" that only ONE scheduler can hold.
+ 19 more lines...
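
One concrete way to implement that lock is a PostgreSQL advisory lock. This is a sketch, assuming psycopg2; the key value and polling interval are arbitrary choices, not requirements of the design above.

    import time
    import psycopg2

    LEADER_LOCK_KEY = 42  # any agreed-upon integer; illustrative

    conn = psycopg2.connect("dbname=scheduler")  # illustrative
    cur = conn.cursor()

    def try_become_leader() -> bool:
        # pg_try_advisory_lock succeeds for exactly one session at a time,
        # and the lock is released automatically if that session dies.
        cur.execute("SELECT pg_try_advisory_lock(%s)", (LEADER_LOCK_KEY,))
        return cur.fetchone()[0]

    while not try_become_leader():
        time.sleep(5)  # stand by as a backup and keep retrying
    # From here on, this process runs the scheduling loop as the leader.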

Alternative: Use ZooKeeper or etcd

For bigger systems with multiple data centers, use ZooKeeper or etcd for leader election instead of a database lock. These tools are designed specifically for distributed coordination.

Understanding Cron Schedules

Cron is a standard way to write schedules. Here are some examples:

What you want | Cron expression | Meaning
Every day at 9am | 0 9 * * * | At minute 0, hour 9, any day, any month, any weekday
Every hour | 0 * * * * | At minute 0 of every hour
Every Monday at 8am | 0 8 * * 1 | At 8:00 on weekday 1 (Monday)
Every 15 minutes | */15 * * * * | At every 15th minute
First day of month at midnight | 0 0 1 * * | At 00:00 on day 1 of every month
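
In code, schedulers rarely parse cron strings by hand. One common option in Python is the croniter library (assumed here purely for illustration), which computes the next due time from an expression:

    from datetime import datetime
    from croniter import croniter  # third-party library: pip install croniter

    schedule = "0 9 * * *"                 # every day at 9am
    now = datetime(2024, 1, 15, 10, 0)     # it is already past 9am today...
    next_run = croniter(schedule, now).get_next(datetime)
    print(next_run)                        # ...so the next run is 2024-01-16 09:00:00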

Worker Deep Dive

Workers are the muscles of the system. They do the actual work of running jobs. Let me explain how they work.

Key insight

Workers are stateless - they do not remember anything between jobs. If a worker dies, no data is lost because everything is saved in the database. This makes workers easy to add, remove, or replace.

What a worker does (explained simply):

  1. Wait for a job - Check the queue for jobs to run. If empty, wait.
  2. Grab a job - Take one job from the queue. The queue makes sure no two workers grab the same job.
  3. Mark it as running - Update the database: this job is now running, I am the worker running it.
  4. Start heartbeat - Every 30 seconds, tell the database I am still alive and working on this job.
  5. Run the job - Execute the actual code (send email, process data, generate report, etc.).
  6. Mark it as done - Update the database: job succeeded (or failed with error message).
  7. Repeat - Go back to step 1 and get the next job.

WORKER MAIN LOOP:

FOREVER:
+ 34 more lines...
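
To make the loop above concrete, here is a minimal worker sketch. It assumes a Redis list as the queue, psycopg2 for the metadata database, and started_at/finished_at columns that the truncated table earlier does not show; heartbeating is sketched separately below.

    import json
    import redis
    import psycopg2

    queue = redis.Redis()                              # job queue (illustrative)
    conn = psycopg2.connect("dbname=scheduler")        # metadata DB (illustrative)

    def run_job(execution_id: str) -> None:
        ...  # the actual work: send the email, build the report, etc.

    while True:
        item = queue.brpop("job_queue", timeout=5)     # block until a job arrives
        if item is None:
            continue                                   # queue was empty; poll again
        execution_id = json.loads(item[1])["execution_id"]
        with conn, conn.cursor() as cur:               # step 3: mark it as running
            cur.execute(
                "UPDATE job_executions SET status = 'RUNNING', started_at = now() "
                "WHERE execution_id = %s", (execution_id,))
        try:                                           # step 5: run the job
            run_job(execution_id)
            final_status = "SUCCESS"
        except Exception:
            final_status = "FAILED"
        with conn, conn.cursor() as cur:               # step 6: record the outcome
            cur.execute(
                "UPDATE job_executions SET status = %s, finished_at = now() "
                "WHERE execution_id = %s", (final_status, execution_id))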

Why heartbeats are critical

Without heartbeats, we cannot tell the difference between:

  • A slow job that is still working (might take hours)
  • A dead worker that will never finish

With heartbeats:

  • The worker says "I am alive" every 30 seconds
  • If we do not hear from a worker for 5 minutes, it is probably dead
  • We can reassign that job to another worker (a minimal sketch follows below)
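
A minimal heartbeat sketch, assuming psycopg2 and a background thread; the helper name, interval, and connection details are illustrative assumptions.

    import threading
    import psycopg2

    def start_heartbeat(execution_id: str, interval_sec: int = 30) -> threading.Event:
        # Bump heartbeat_at every 30 seconds until the caller sets the stop event.
        stop = threading.Event()
        conn = psycopg2.connect("dbname=scheduler")  # illustrative
        def beat():
            while not stop.wait(interval_sec):
                with conn, conn.cursor() as cur:
                    cur.execute(
                        "UPDATE job_executions SET heartbeat_at = now() "
                        "WHERE execution_id = %s", (execution_id,))
        threading.Thread(target=beat, daemon=True).start()
        return stop  # the worker sets this event when the job finishes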

Handling failures and retries

Jobs fail for many reasons: network timeouts, database errors, bugs in the code. Here is how we handle it:

WHEN A JOB FAILS:

STEP 1: Check if we can retry
+ 33 more lines...
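
The "increasing wait times" between retries are usually exponential backoff. A minimal sketch - the base delay and jitter range are illustrative choices, not requirements from the walkthrough:

    import random

    BASE_DELAY_SEC = 60   # assumed first-retry delay
    MAX_RETRIES = 3

    def retry_delay(attempt: int) -> float:
        # attempt 0 -> ~1 min, attempt 1 -> ~2 min, attempt 2 -> ~4 min,
        # plus a little jitter so many failing jobs do not retry in lockstep.
        return BASE_DELAY_SEC * (2 ** attempt) + random.uniform(0, 10)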

Jobs must be idempotent (safe to run twice)

Since jobs can be retried, every job must be safe to run twice. For example, if a job inserts data, it should check if the data already exists first. If a job sends an email, it should use a unique ID to prevent sending the same email twice. This is called being "idempotent" - running twice has the same result as running once.

Example of making a job idempotent:

BAD JOB (not idempotent - running twice causes problems):
    INSERT INTO processed_orders VALUES (order_123, ...)
    -- If job runs twice, we get duplicate rows!
+ 13 more lines...

Workflow Dependencies (DAGs)

Sometimes jobs need to run in order. For example: first download data, then clean it, then analyze it. We cannot analyze data we have not downloaded yet! This is called a workflow or DAG (Directed Acyclic Graph).

What is a DAG?

DAG stands for Directed Acyclic Graph. It is just a fancy way to say: jobs have an order, and you cannot have loops. Job A leads to Job B leads to Job C - but C cannot lead back to A.

Example: Daily Data Pipeline

How this workflow runs (explained step by step):

  1. Jobs A and B start first (they have no dependencies)
  2. Job C waits until BOTH A and B are done
  3. Once C finishes, D and E can start (they were waiting for C)
  4. Job F waits until BOTH D and E are done
  5. When F finishes, the whole pipeline is complete

WHEN A DAG IS TRIGGERED:

STEP 1: Create execution records for ALL jobs in the DAG
+ 24 more lines...

Checking if dependencies are satisfied:

TO CHECK IF JOB C CAN RUN:

Look up all jobs that C depends on (A and B)
+ 14 more lines...
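
As one way to express that check in SQL (a sketch assuming the job_dependencies and job_executions tables above; a real system would also scope the executions to the current DAG run):

    import psycopg2

    conn = psycopg2.connect("dbname=scheduler")  # illustrative
    with conn, conn.cursor() as cur:
        # Count dependencies of job_C that do NOT yet have a successful run.
        cur.execute(
            "SELECT count(*) FROM job_dependencies d "
            "WHERE d.job_id = %s AND NOT EXISTS ("
            "  SELECT 1 FROM job_executions e "
            "  WHERE e.job_id = d.depends_on_job_id AND e.status = 'SUCCESS')",
            ("job_C",))
        unmet = cur.fetchone()[0]
        can_run = (unmet == 0)  # job_C may start only when every dependency succeeded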

What if a dependency fails?

If Job A fails, then Job C cannot run (it needs A's output). We have choices: (1) Mark C as failed automatically, (2) Wait for a human to fix A and retry, (3) Skip C and continue with jobs that do not depend on A. Most systems mark downstream jobs as failed and alert the team.

Real-world example: Airflow

Apache Airflow is a popular job scheduler that uses DAGs. You define your workflow as a Python file, and Airflow figures out the order and runs jobs when their dependencies are done. Many companies use Airflow for data pipelines.

Preventing Double Execution

The golden rule

A job must never run twice at the same time. If your daily report job is still running from yesterday, today's run should wait - not start a second copy. Two copies running at once can corrupt data or cause chaos.

Two problems to prevent:

  1. Double scheduling - The scheduler puts the same job in the queue twice
  2. Concurrent execution - Two workers run the same job at the same time

Solution 1: Prevent double scheduling with database locks

THE PROBLEM:
    - Scheduler A checks database at 9:00:00.000
    - Scheduler B checks database at 9:00:00.001 (1 millisecond later)
+ 19 more lines...
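
One common PostgreSQL pattern for this is SELECT ... FOR UPDATE SKIP LOCKED: two schedulers running the same query cannot claim the same rows. A sketch, assuming the schema above and psycopg2:

    import psycopg2

    conn = psycopg2.connect("dbname=scheduler")  # illustrative
    with conn, conn.cursor() as cur:
        # Claim due jobs inside one transaction. SKIP LOCKED makes a second
        # scheduler running the same query simply skip rows we already hold.
        cur.execute(
            "SELECT execution_id FROM job_executions "
            "WHERE status = 'PENDING' AND scheduled_time <= now() "
            "ORDER BY scheduled_time LIMIT 100 "
            "FOR UPDATE SKIP LOCKED")
        claimed = [row[0] for row in cur.fetchall()]
        cur.execute(
            "UPDATE job_executions SET status = 'QUEUED' "
            "WHERE execution_id = ANY(%s)", (claimed,))
        # ...then push the claimed execution_ids onto the job queue...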

Solution 2: Prevent concurrent execution with distributed locks

Some jobs should NEVER run at the same time, even from different DAG runs. For example, a database migration job.

BEFORE RUNNING A JOB:

STEP 1: Try to get a lock for this job
+ 24 more lines...
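
A sketch of such a lock using Redis; the key name, TTL, and helper names are illustrative, and a production version would release the lock atomically (for example with a small Lua script).

    import redis

    r = redis.Redis()  # illustrative connection

    def try_lock(job_id: str, worker_id: str, ttl_sec: int = 3600) -> bool:
        # SET ... NX EX: succeeds only if the key does not already exist, and
        # expires on its own so a crashed worker cannot hold the lock forever.
        return bool(r.set(f"job-lock:{job_id}", worker_id, nx=True, ex=ttl_sec))

    def release_lock(job_id: str, worker_id: str) -> None:
        # Only the owner should release the lock it acquired.
        if r.get(f"job-lock:{job_id}") == worker_id.encode():
            r.delete(f"job-lock:{job_id}")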

When to use distributed locks

Not all jobs need this! If each job run is independent (like sending different emails), running two at once is fine. Use locks for jobs that modify shared state (like database migrations) or jobs where two copies would cause problems (like generating the same report twice).

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

Common failures and how we handle them:

What breaks | What happens | How we fix it | Recovery time
Scheduler dies | New jobs are not being scheduled | Backup scheduler detects the leader is gone and takes over | About 30 seconds
Worker dies mid-job | Job is stuck in RUNNING status forever | Scheduler sees the heartbeat is old and marks the job for retry | About 5 minutes
+ 4 more rows...

The split-brain problem

What if the network breaks so Scheduler A cannot talk to the database, but Scheduler A still believes it is the leader?

THE PROBLEM:
    - Scheduler A is the leader
    - Network partitions (Scheduler A cannot reach database)
+ 17 more lines...

Dead Letter Queue (DLQ) - Where failed jobs go to die

When a job fails too many times, we do not keep retrying forever. We move it to a special place called the dead letter queue.

WHEN A JOB EXCEEDS MAX RETRIES:

STEP 1: Move to dead letter queue
+ 22 more lines...

Why not retry forever?

If a job has a bug that always causes it to fail, retrying forever would: (1) waste resources, (2) fill up the queue with the same failing job, (3) potentially cause other problems. It is better to give up after a few tries and let a human investigate.

Growing the System Over Time

What to tell the interviewer

This design works great for up to a few million jobs per day with a single scheduler and database. Let me explain how we would grow it if we need to handle 10x or 100x more jobs.

How we grow step by step:

Stage 1: Starting out (up to 1 million jobs per day)
  • One PostgreSQL database
  • One main scheduler with a hot backup
  • 50-100 workers
  • One job queue (Redis or RabbitMQ)
  • This handles most companies' needs
  • Simple to operate and debug

Stage 2: Growing bigger (up to 10 million jobs per day)
  • Add read replicas to the database
  • Partition the job_executions table by date
  • Add more workers (easy to scale)
  • Maybe add a second queue for priority jobs
  • Archive old execution records to cold storage

Stage 3: Huge scale (100+ million jobs per day)
  • Shard the scheduler by tenant or job type
  • Each shard has its own database partition
  • Multiple queues for different job types
  • Consider event-sourcing (like Temporal) for complex workflows

Sharded Architecture (for very large scale)

Do not over-engineer!

Most companies never need sharding. A single PostgreSQL database handles millions of jobs per day. Only add complexity when you actually need it. Start simple, measure, and scale when you see real bottlenecks.

Different flavors of job schedulers:

If you need sub-second precision: Use in-memory scheduling (like Quartz for Java) instead of database polling. The scheduler keeps upcoming jobs in memory and triggers them precisely.

If you need complex workflows: Use event-sourcing like Temporal. Every step in the workflow is recorded as an event. You can replay the entire history to debug problems. This is what Uber built.

If you need machine learning pipelines: Add resource management (GPU allocation), artifact tracking (save model files), and experiment versioning. Look at tools like Kubeflow or MLflow.

If you need multi-region: Each region has its own scheduler and workers. Jobs are routed to the right region based on where the data lives. A global coordinator handles cross-region workflows.

How real companies do it

Airbnb uses Apache Airflow for data pipelines, running millions of tasks per day. Uber built Cadence (now Temporal) for complex workflows with exactly-once guarantees. Netflix uses Conductor for microservice orchestration. For most companies, Airflow or a simple Redis-based scheduler works perfectly fine.

Design Trade-offs

Advantages

  • Very simple to build
  • Easy to understand and debug
  • Uses a familiar SQL database

Disadvantages

  • Polling has overhead - checking every second uses resources
  • Hot rows - many processes reading the same rows
  • Limited to about 1 million jobs per day

When to use

Best for most companies. Start with this unless you KNOW you need more scale. Simpler is better.