System Design Masterclass

Design API Rate Limiter

Design a distributed rate limiting system for APIs

100K+ requests per second · Similar to Stripe, GitHub, Twitter, Cloudflare, AWS, Google · 45 min read

Summary

A rate limiter is like a bouncer at a club - it controls how many requests each user can make in a given time. If someone tries to make too many requests (like 1000 per minute when the limit is 100), the rate limiter says "slow down" and blocks the extra requests. The tricky part is: when you have many servers, how do they all agree on the count? If Server A sees 50 requests and Server B sees 50 requests for the same user, together that is 100 - but neither server knows about the other! Companies like Stripe, GitHub, Twitter, and Cloudflare ask this question in interviews.

Key Takeaways

Core Problem

The main job is counting how many requests each user made recently. Sounds simple, but when you have 10 servers, they all need to agree on the count - and they need to agree FAST (under 5 milliseconds).

The Hard Part

Two requests from the same user can hit two different servers at the same instant. Both servers check the count, both see 99 (limit is 100), both allow. Now the count is 101. This is called a race condition.

Scaling Axis

We can split users across different rate limiter servers. User A goes to Server 1, User B goes to Server 2. This way each server handles fewer users.

Critical Invariant

Never let someone make way more requests than allowed. A small overrun (105 instead of 100) is okay, but 200 when limit is 100 is a bug. The system must be close to correct.

Performance Requirement

Checking the rate limit must be super fast - under 5 milliseconds. Every single API request goes through the rate limiter, so if it is slow, everything is slow.

Key Tradeoff

We pick speed over perfect accuracy. Why? A small overrun (5%) rarely hurts anyone. But if every request takes 50ms longer because we are double-checking counts, users will leave.

Design Walkthrough

Problem Statement

The Question: Design a rate limiter that can handle 100,000+ requests per second across many servers.

What is a rate limiter? Think of it like a water faucet with a limit. You can only fill one glass per minute. If you try to fill two glasses, the faucet stops you.

Why do we need rate limiting? (most important first):

  1. Stop abuse and attacks - A bad person might try to crash your website by sending millions of requests. Rate limiting stops them.
  2. Fair sharing - If one customer uses all the resources, other customers suffer. Rate limiting keeps things fair.
  3. Control costs - APIs cost money to run. Rate limiting prevents surprise bills from runaway usage.
  4. Keep the service running - If too many requests come at once, servers can crash. Rate limiting protects them.

What to say first

Before I start designing, let me ask some questions to understand what exactly we need to build. I want to know the scale, how accurate we need to be, and what happens when someone hits the limit.

What the interviewer really wants to see:

  - Do you understand that this is hard because of multiple servers?
  - Do you know that being super accurate means being slow, and being fast means being a bit inaccurate?
  - Can you think about what happens when things break?
  - Do you know different ways to count requests and when to use each?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many requests per second do we need to handle? How many different users or API keys do we need to track?

Why ask this: If we only have 100 requests per second, one server can handle it easily. If we have 100,000 per second, we need many servers working together.

What interviewers usually say: 100,000 requests per second, 1 million different API keys to track.

How this changes your design: With this scale, we cannot use just one server. We need a distributed system with multiple rate limiters.

Question 2: How accurate does it need to be?

If the limit is 100 requests per minute, is it okay if sometimes we allow 105? Or must it be exactly 100, never more?

Why ask this: Perfect accuracy needs all servers to talk to each other before every request. This is slow. A little bit of error lets us be much faster.

What interviewers usually say: Small overrun is fine. We are not charging per request.

How this changes your design: We can use simpler and faster methods. We do not need complex locks that slow everything down.

Question 3: Fixed or sliding time window?

Should we count requests per clock minute (like 2:00 to 2:01) or any 60-second period (like the last 60 seconds from now)?

Why ask this: Fixed windows are simpler but have a problem - someone can make 100 requests at 2:00:59 and another 100 at 2:01:01. That is 200 in 2 seconds! Sliding windows are more accurate but harder to build.

What interviewers usually say: Sliding window is preferred to prevent this gaming.

How this changes your design: We need to use the sliding window counter method (I will explain later) instead of simple counting.

Question 4: One region or global?

Is the rate limit for one data center only, or should it count requests from all around the world together?

Why ask this: If someone in New York and someone in Tokyo both count toward the same limit, we need to share data across the world. This is much harder.

What interviewers usually say: Start with one region, mention global as a future improvement.

How this changes your design: We can start simpler with everything in one place, then talk about how to make it global later.

Summarize your assumptions

Let me summarize what I will design for: 100,000 requests per second, 1 million API keys, sliding window is preferred, small overrun (like 5%) is acceptable, and we will start with one region. I will mention global rate limiting as a future feature.

The Hard Part

Say this to the interviewer

The hardest part is counting accurately when you have many servers. Imagine you have 10 rate limiter servers. User A sends a request to Server 1. At the exact same moment, User A sends another request to Server 2. Both servers need to know the total count - but they do not know about each other!

Why distributed counting is tricky (explained simply):

  1. The race condition problem - Two servers read the count at the same time. Both see 99. Both think "okay, limit is 100, I can allow this." Both allow. Now the real count is 101.
  2. The speed problem - We could make all servers ask a central place before every request. But that central place becomes slow. And talking between servers takes time (network latency).
  3. The time boundary problem - If the limit is 100 per minute, what happens at minute boundaries? Someone could make 100 requests at 2:00:59 (end of one minute) and 100 more at 2:01:01 (start of next minute). That is 200 requests in just 2 seconds!
  4. The clock problem - Different servers have slightly different times. Server A thinks it is 2:00:00, Server B thinks it is 2:00:01. Their counts for "this minute" might not match.

Common mistake candidates make

Many people say: just use a counter and increment it! This works for one server, but breaks completely when you have many servers. Always think about the distributed case first - what if there are 10 servers?

The fundamental choice you must make:

Option 1: Be perfectly accurate (slow)

  - Before every request, all servers must agree on the count
  - This is called "distributed consensus" and takes 50-100 milliseconds
  - Your API becomes slow because of rate limiting

Option 2: Be very fast but slightly inaccurate (recommended)

  - Each server makes its own decision quickly
  - Sometimes the total count goes a bit over the limit
  - Your API stays fast

For rate limiting, we almost always pick Option 2. Why? If someone makes 105 requests instead of 100, nothing terrible happens. But if every API call takes 50ms longer, users will leave.

The race condition problem

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

| What we are measuring | Number | What this means for our design |
| --- | --- | --- |
| Requests per second | 100,000 | One server cannot handle this alone - Redis can do 100K+ operations per second, so it works |
| Unique API keys | 1,000,000 | Need to track 1 million different users - this fits in memory easily |

+ 4 more rows...

What to tell the interviewer

At 100K requests per second with 1 million keys, the data easily fits in memory (about 100 MB). Redis can handle this throughput on a single node with room to spare. The challenge is not storage or throughput - it is coordinating counts across multiple rate limiter instances while staying under 5ms latency.
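The "fits in memory" claim is easy to sanity-check. A quick sketch (the ~100 bytes per key is an assumption, roughly in line with the per-user storage estimate later in this section):

```python
# Back-of-envelope capacity check. The ~100 bytes per key figure is an
# assumption: key name + counters + Redis per-key overhead.
BYTES_PER_KEY = 100
NUM_KEYS = 1_000_000          # 1 million API keys

total_mb = NUM_KEYS * BYTES_PER_KEY / 1_000_000
print(f"Memory for all counters: ~{total_mb:.0f} MB")   # ~100 MB
```

About 100 MB for every counter in the system, which comfortably fits in a single Redis node's memory.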

How people use the rate limiter (from most common to least common):

  1. Check and allow - Most requests (99%) are under the limit and pass through. This must be super fast.
  2. Check and block - About 1% of requests hit the limit. We return "429 Too Many Requests" error.
  3. Near the limit - When someone is close to their limit, they might slow down or retry later.
  4. Way over limit - Bad actors or bugs that send way too many requests. Block quickly.

How much space does one user's rate limit data need?

- API key: about 30 bytes
- Current count: 8 bytes
+ 11 more lines...

Key insight about hot keys

Some API keys are much busier than others. One big customer might make 10,000 requests per second, while most make 1 per second. We call the busy keys "hot keys" and might need special handling for them (like local caching).

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will put the rate limiter in front of our API servers. Every request must pass through the rate limiter first. I will use Redis to store the counts because it is super fast (in-memory) and has atomic operations (no race conditions).

Rate Limiter System - The Big Picture

What each part does and WHY it is there:

| Part | What it does | Why we need it (what to tell interviewer) |
| --- | --- | --- |
| Load Balancer | Spreads incoming requests across multiple rate limiters | One rate limiter cannot handle 100K requests per second. We need several working together. |
| Rate Limiter | Checks if this user has made too many requests. If yes, block. If no, allow. | This is the core of our system. It makes the allow/block decision for every request. |
| Redis | Stores the count of requests for each user. Super fast because it is all in memory. | We need a central place to keep counts. Redis is perfect because: (1) in-memory = fast, (2) atomic operations = no race conditions, (3) TTL = automatic cleanup. |
| API Servers | Do the actual work (whatever the API does). | Only get requests that passed the rate limit check. Protected from overload. |

Common interview question: Why not put rate limiting in the API server itself?

You could! For simple cases, this works fine. We use a separate rate limiter layer when: (1) we have many different API servers and want consistent limits, (2) we want to block bad traffic before it reaches our expensive API servers, (3) we want to change rate limit rules without redeploying all API servers.

Why we picked these tools:

Redis (Recommended for rate limiting)

  - Why we chose it: Super fast (in-memory), atomic operations (INCR command), automatic expiry (TTL), everyone uses it so lots of help available
  - Other options we considered:
    - Memcached: Also fast, but no atomic increment with expiry in one command
    - In-memory in each rate limiter: Would not share counts across servers
    - Database (PostgreSQL): Too slow for 100K operations per second

How real companies do it

Stripe uses Redis for rate limiting. GitHub uses a combination of Redis and in-memory caches. Cloudflare does rate limiting at the edge (their servers around the world) with local counting and periodic sync. For most companies, Redis works great.

Data Model and Storage

Now let me show how we store the rate limit data in Redis. The key design is simple but important.

What to tell the interviewer

I will use Redis as the single source of truth for request counts. The key pattern is rate_limit:{api_key}:{time_window}. I will use Redis Lua scripts to make check-and-increment atomic (no race conditions).

How we name the keys in Redis:

Think of Redis keys like file names. We need a naming system that is easy to understand and look up.

Key pattern: rate_limit:{api_key}:{time_window}

Examples:
+ 11 more lines...
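As an illustrative sketch (the helper name and window bucketing are assumptions, not from the original), here is how a rate limiter might build keys in this pattern:

```python
import time

def make_rate_limit_key(api_key, window_seconds=60, now=None):
    """Build a key following the rate_limit:{api_key}:{time_window} pattern.
    The window id is the epoch time rounded down to the window size, so
    every request inside the same window maps to the same Redis key."""
    now = time.time() if now is None else now
    window_id = int(now // window_seconds)
    return f"rate_limit:{api_key}:{window_id}"

# Two requests 10 seconds apart land in the same 60-second window,
# so they increment the same counter:
print(make_rate_limit_key("abc123", now=1_700_000_000))
print(make_rate_limit_key("abc123", now=1_700_000_010))
```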

Making it atomic (no race conditions):

The problem: If we do "read count, check if under limit, then increment" as separate steps, two requests can sneak through between the steps.

The solution: Use a Lua script that Redis runs as ONE atomic operation. Nobody can interrupt it.

-- This script runs as ONE atomic operation in Redis
-- Nobody can interrupt it, so no race conditions!
+ 21 more lines...
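The rest of the Lua script is elided above. As a stand-in, here is the same check-and-increment logic in pure Python, where a lock plays the role of Redis's single-threaded script execution (names are illustrative):

```python
import threading

counts = {}                      # key -> current count (stands in for Redis)
lock = threading.Lock()          # stands in for Redis running Lua atomically

def check_and_increment(key, limit):
    """Atomically: read the count, and allow + increment only if under the
    limit. In Redis this whole body would be one Lua script, so no other
    request can interleave between the read and the write."""
    with lock:
        current = counts.get(key, 0)
        if current >= limit:
            return False         # over the limit: block (HTTP 429)
        counts[key] = current + 1
        # The real script would also set a TTL so the key expires
        # automatically when the window ends.
        return True

# First 100 calls pass, the 101st is blocked:
allowed = [check_and_increment("rate_limit:abc:1", limit=100) for _ in range(101)]
print(allowed.count(True), allowed.count(False))   # 100 1
```

Without the lock (or the Lua script in the real system), two threads could both read 99 and both increment, which is exactly the race condition described earlier.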

Why Lua scripts?

When Redis runs a Lua script, it blocks everything else until the script finishes. This means no other request can sneak in between our read and write. It is like putting a "do not disturb" sign on the door.

For Sliding Window (better accuracy):

The simple version above uses fixed windows (like 2:00 to 2:01). But someone could game it at the boundary. The sliding window counter is smarter:

-- Sliding Window Counter - More accurate than fixed windows
-- Uses a weighted average of current and previous window
+ 38 more lines...

Why sliding window is better

Imagine the limit is 100 per minute. With fixed windows, you could make 100 requests at 2:00:59 and 100 more at 2:01:01 (200 in 2 seconds!). With sliding window, we look at the last 60 seconds from RIGHT NOW, so this trick does not work.

Rate Limiting Algorithms Deep Dive

There are several ways to count requests and enforce limits. Let me explain the main ones in simple terms.

| Algorithm | How it works (simple explanation) | Good things | Bad things | When to use |
| --- | --- | --- | --- | --- |
| Fixed Window | Count requests in each clock minute (2:00-2:01, 2:01-2:02) | Super simple, very fast | Can be gamed at boundaries (200 requests in 2 seconds) | Simple cases where boundary gaming is okay |
| Sliding Window Log | Keep a list of all request timestamps, count how many in last 60 seconds | Most accurate | Uses lots of memory (storing every timestamp) | When accuracy is critical and memory is not a problem |
| Sliding Window Counter | Keep counts for current and previous window, calculate weighted average | Accurate enough, low memory | Slightly approximate | Best choice for most real systems |
| Token Bucket | Imagine a bucket of tokens. Each request takes one. Tokens refill over time. | Allows bursts (use saved tokens) | More complex to implement | When you want to allow occasional bursts |
| Leaky Bucket | Requests go into a bucket that leaks at a steady rate | Very smooth output rate | Does not allow any bursts | When you need strictly steady traffic |

Algorithm 1: Fixed Window Counter (Simplest)

Think of this like a taxi meter that resets at the start of each minute.

Fixed Window - The Boundary Problem

The Problem: 200 requests in just 2 seconds while the limit is 100 per minute! This happens because we reset the count at minute boundaries.
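The boundary problem can be simulated in a few lines (a hypothetical sketch; timestamps are seconds, so 59 stands for 2:00:59 and 61 for 2:01:01):

```python
def fixed_window_allows(history, now, limit=100, window=60):
    """Allow if fewer than `limit` prior requests fall in the same
    fixed window (clock minute) as `now`."""
    window_id = int(now // window)
    in_window = sum(1 for t in history if int(t // window) == window_id)
    return in_window < limit

history = []
# 100 requests at 2:00:59 (t=59) and 100 more at 2:01:01 (t=61):
for t in [59] * 100 + [61] * 100:
    if fixed_window_allows(history, t):
        history.append(t)

print(len(history))   # 200 - all allowed, only 2 seconds apart!
```

The count resets at t=60, so the second burst starts from zero even though the first burst happened moments earlier.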

Algorithm 2: Sliding Window Counter (Recommended)

Instead of hard resets, we use a weighted average. It is like looking at a 60-second window that slides with time.

def check_rate_limit(user_id, limit=100, window_seconds=60):
    """
    Example: It is now 2:00:30 (30 seconds into minute 2:00)
+ 39 more lines...
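The rest of the function is elided above. Here is a self-contained sketch of the same weighted-average idea (parameter names and the in-memory `counts` dict are illustrative stand-ins for Redis):

```python
def sliding_window_check(counts, user_id, now, limit=100, window=60):
    """Sliding window counter: weight the previous window's count by how
    much of it still overlaps the last `window` seconds.
    `counts` maps (user_id, window_id) -> request count."""
    current_window = int(now // window)
    elapsed = now - current_window * window       # seconds into this window
    prev_weight = (window - elapsed) / window     # overlap with prev window

    current = counts.get((user_id, current_window), 0)
    previous = counts.get((user_id, current_window - 1), 0)
    estimated = current + previous * prev_weight

    if estimated >= limit:
        return False                              # block
    counts[(user_id, current_window)] = current + 1
    return True

counts = {}
# It is now 2:00:30, i.e. 30 seconds into the current window, so half the
# previous window still counts. Previous window: 80 requests, current: 50.
# Estimate = 50 + 80 * 0.5 = 90, which is under the limit of 100.
counts[("u1", 1)] = 80
counts[("u1", 2)] = 50
print(sliding_window_check(counts, "u1", now=150))   # True
```

The estimate is approximate (it assumes requests were spread evenly across the previous window), but it needs only two counters per user instead of a full timestamp log.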

Algorithm 3: Token Bucket (When you want to allow bursts)

Imagine a bucket that can hold 100 tokens. Each request needs one token. Tokens refill at a steady rate (like 1 per second).

def token_bucket_check(user_id, bucket_size=100, refill_rate=1.67):
    """
    bucket_size: How many tokens the bucket can hold (burst capacity)
+ 35 more lines...
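The rest of the function is elided above. A self-contained sketch (the in-memory `buckets` dict and the extra `now` parameter are illustrative stand-ins; in production the state would live in Redis):

```python
import time

buckets = {}   # user_id -> (tokens, last_refill_time); stands in for Redis

def token_bucket_check(user_id, bucket_size=100, refill_rate=1.67, now=None):
    """Token bucket: tokens refill continuously at `refill_rate` per second
    (1.67/sec ~= 100 per minute), capped at `bucket_size`. Each request
    spends one token if one is available."""
    now = time.time() if now is None else now
    tokens, last = buckets.get(user_id, (bucket_size, now))
    tokens = min(bucket_size, tokens + (now - last) * refill_rate)
    if tokens < 1:
        buckets[user_id] = (tokens, now)
        return False                     # bucket empty: block
    buckets[user_id] = (tokens - 1, now)
    return True

# A burst of 100 requests at t=0 drains the bucket; the 101st is blocked.
results = [token_bucket_check("u1", now=0) for _ in range(101)]
print(results.count(True), results[-1])          # 100 False
# After 1 second, ~1.67 tokens have refilled, so a request passes again:
print(token_bucket_check("u1", now=1))           # True
```

This is what "allows bursts" means: a full bucket lets 100 requests through instantly, then the refill rate enforces the long-term average.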

Which algorithm should you pick?

For most API rate limiting: use Sliding Window Counter. It is simple, uses little memory, and accurate enough. Use Token Bucket only if you specifically want to allow bursts (like letting a user make 10 quick requests, then slow down).

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them. Interviewers love when you bring this up without being asked!

Common failures and how we handle them:

| What breaks | What happens to users | How we fix it | Why this works |
| --- | --- | --- | --- |
| Redis goes down | Cannot check rate limits | Allow all requests (fail open) + alert the team | Better to let some extra traffic through than block everyone |
| Redis is slow | Every API request becomes slow | Set a 5ms timeout + fail open if timeout | If Redis does not respond in 5ms, just allow the request and move on |
| One rate limiter crashes | Traffic shifts to other rate limiters | Use health checks + auto-restart | Since rate limiters are stateless, any one can handle any request |
| Network split between servers | Different servers see different counts | Accept some inaccuracy during the split | When network heals, counts become accurate again |
| One user sends millions of requests | Their requests flood Redis | Use local cache for hot keys | Cache the count locally for a few milliseconds to reduce Redis load |

The big decision: Fail Open vs Fail Closed

When Redis is down, what should we do?

Fail Open (Allow all requests)

  - User experience: API keeps working
  - Risk: Some users might exceed their limits
  - When to use: Most rate limiting (protecting resources)

Fail Closed (Block all requests)

  - User experience: API stops working completely
  - Benefit: Strict limit enforcement
  - When to use: Only when overrun is truly dangerous (like billing, critical resources)

def check_rate_limit_safely(user_id, limit):
    """
    Check rate limit with proper error handling.
+ 28 more lines...
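The rest of the function is elided above. A hedged sketch of fail-open handling, where `redis_check` is an illustrative stand-in for the real Redis call:

```python
def check_rate_limit_safely(user_id, limit, redis_check):
    """Fail open: if the Redis check raises (connection error, timeout),
    allow the request instead of blocking all traffic. A real version
    would also pass a ~5ms timeout to the Redis client and emit an
    alert/metric so the on-call team knows rate limiting is degraded."""
    try:
        return redis_check(user_id, limit)
    except Exception:
        return True                  # Redis unavailable: fail open

# Stand-ins for illustration:
def redis_is_down(user_id, limit):
    raise ConnectionError("redis unreachable")

def redis_says_block(user_id, limit):
    return False

print(check_rate_limit_safely("u1", 100, redis_is_down))     # True (fail open)
print(check_rate_limit_safely("u1", 100, redis_says_block))  # False (normal block)
```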

Circuit Breaker Pattern (Smart failure handling)

If Redis keeps failing, we do not want to keep trying and failing. Instead, we "trip the circuit" and skip Redis entirely for a while.

class RateLimiterWithCircuitBreaker:
    def __init__(self):
        self.failure_count = 0
+ 34 more lines...
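The rest of the class is elided above. A hedged, self-contained sketch of the pattern (the threshold, cooldown, and attribute names beyond `failure_count` are illustrative):

```python
import time

class RateLimiterWithCircuitBreaker:
    """After `threshold` consecutive Redis failures, stop calling Redis
    for `cooldown` seconds and fail open, giving Redis time to recover."""

    def __init__(self, redis_check, threshold=5, cooldown=30):
        self.redis_check = redis_check
        self.threshold = threshold
        self.cooldown = cooldown
        self.failure_count = 0
        self.opened_at = None          # timestamp when the circuit tripped

    def check(self, user_id, limit, now=None):
        now = time.time() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                return True            # circuit open: skip Redis, fail open
            self.opened_at = None      # cooldown over: try Redis again
            self.failure_count = 0
        try:
            result = self.redis_check(user_id, limit)
            self.failure_count = 0     # success resets the counter
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.threshold:
                self.opened_at = now   # too many failures: trip the circuit
            return True                # single failure: still fail open

def failing_redis(user_id, limit):
    raise ConnectionError("redis unreachable")

rl = RateLimiterWithCircuitBreaker(failing_redis, threshold=3, cooldown=30)
for t in range(3):
    rl.check("u1", 100, now=t)         # three failures trip the circuit
print(rl.opened_at is not None)        # True: circuit is open
print(rl.check("u1", 100, now=5))      # True: fail open, Redis not called
```

While the circuit is open, Redis gets zero rate limit traffic, which is exactly the breathing room it needs to recover.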

What is a circuit breaker?

Like an electrical circuit breaker that trips when there is too much current, this pattern trips when there are too many failures. It protects Redis from being hammered when it is already struggling, and gives it time to recover.

Growing the System Over Time

What to tell the interviewer

This design works great for one region with up to 500K requests per second. Let me explain how we would grow it if we need to go bigger or serve users around the world.

How we grow step by step:

Stage 1: Single Redis (up to 100K requests per second)

  - One Redis server handles everything
  - Super simple, strongly consistent
  - Single point of failure (use a replica for backup)
  - This is enough for most companies!

Stage 2: Redis Cluster (up to 1M requests per second)

  - Split users across multiple Redis servers
  - User A goes to Redis 1, User B goes to Redis 2
  - Each Redis handles fewer users = more capacity
  - Still in one data center

Stage 3: Multi-Region (Global users)

  - Users in US talk to US servers
  - Users in Europe talk to Europe servers
  - Either: separate limits per region, OR sync counts across regions

Growing from single Redis to global

Global Rate Limiting (The hard mode)

If a user has a limit of 100 per minute globally, but they are making requests from both US and Europe, how do we count?

Option 1: One central Redis (Simple but slow for distant users)

  - All rate limit checks go to one Redis in, say, US
  - Users in Europe have extra latency (crossing the ocean)
  - Simple and accurate

Option 2: Regional Redis with sync (Complex but fast)

  - Each region has its own Redis
  - Counts sync between regions every few seconds
  - Fast for users, but counts might be slightly off during sync delays

| Approach | How it works | Good things | Bad things |
| --- | --- | --- | --- |
| Single global Redis | Everyone talks to one Redis | Simple, always accurate | Slow for distant users (100ms+ latency) |
| Regional with budget | Each region gets a share (US: 50, EU: 50) | Fast, no sync needed | Unfair if user only uses one region |
| Regional with sync | Local Redis + sync counts every 1-5 seconds | Fast and fairly accurate | Complex, can have brief overruns during sync delay |

Cool features we can add later:

1. Different limits for different endpoints

  - POST /upload: 10 per minute (expensive operation)
  - GET /status: 1000 per minute (cheap operation)
  - Just use different Redis keys for different endpoints

2. Different limits for different users

  - Free users: 100 per minute
  - Paid users: 10,000 per minute
  - Store user tier in Redis or database, look it up when checking

3. Graduated response

  - First warning: slow down the response (add delay)
  - Second warning: return 429 error
  - Repeated abuse: block the user completely

What would you do differently for...

**Billing-critical limits** (like AWS billing quotas): Use stronger consistency - it is worth the extra latency because money is involved.

**DDoS protection**: Use local counting at the edge without any sync. Speed matters more than accuracy when you are being attacked.

**Per-user monthly quotas**: Store in a database with proper transactions. Checked less often (not every request), so accuracy matters more than speed.

Design Trade-offs: Fixed Window Counter

Advantages

  + Super simple to build
  + Uses very little memory (just one number per user)
  + Very fast - just increment a counter

Disadvantages

  - Can be gamed at window boundaries (200 requests in 2 seconds when limit is 100 per minute)
  - Resets abruptly which can look weird to users

When to use

Use for simple cases where boundary gaming is not a big deal. Like internal APIs or non-critical limits.