Design Walkthrough
Problem Statement
The Question: Design a rate limiter that can handle 100,000+ requests per second across many servers.
What is a rate limiter? Think of it like a water faucet with a limit. You can only fill one glass per minute. If you try to fill two glasses, the faucet stops you.
Why do we need rate limiting? (most important first):
1. Stop abuse and attacks - A bad person might try to crash your website by sending millions of requests. Rate limiting stops them.
2. Fair sharing - If one customer uses all the resources, other customers suffer. Rate limiting keeps things fair.
3. Control costs - APIs cost money to run. Rate limiting prevents surprise bills from runaway usage.
4. Keep the service running - If too many requests come at once, servers can crash. Rate limiting protects them.
What to say first
Before I start designing, let me ask some questions to understand what exactly we need to build. I want to know the scale, how accurate we need to be, and what happens when someone hits the limit.
What the interviewer really wants to see:
- Do you understand that this is hard because of multiple servers?
- Do you know that being super accurate means being slow, and being fast means being a bit inaccurate?
- Can you think about what happens when things break?
- Do you know different ways to count requests and when to use each?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many requests per second do we need to handle? How many different users or API keys do we need to track?
Why ask this: If we only have 100 requests per second, one server can handle it easily. If we have 100,000 per second, we need many servers working together.
What interviewers usually say: 100,000 requests per second, 1 million different API keys to track.
How this changes your design: With this scale, we cannot use just one server. We need a distributed system with multiple rate limiters.
Question 2: How accurate does it need to be?
If the limit is 100 requests per minute, is it okay if sometimes we allow 105? Or must it be exactly 100, never more?
Why ask this: Perfect accuracy needs all servers to talk to each other before every request. This is slow. A little bit of error lets us be much faster.
What interviewers usually say: Small overrun is fine. We are not charging per request.
How this changes your design: We can use simpler and faster methods. We do not need complex locks that slow everything down.
Question 3: Fixed or sliding time window?
Should we count requests per clock minute (like 2:00 to 2:01) or any 60-second period (like the last 60 seconds from now)?
Why ask this: Fixed windows are simpler but have a problem - someone can make 100 requests at 2:00:59 and another 100 at 2:01:01. That is 200 in 2 seconds! Sliding windows are more accurate but harder to build.
What interviewers usually say: Sliding window is preferred to prevent this gaming.
How this changes your design: We need to use the sliding window counter method (I will explain later) instead of simple counting.
Question 4: One region or global?
Is the rate limit for one data center only, or should it count requests from all around the world together?
Why ask this: If someone in New York and someone in Tokyo both count toward the same limit, we need to share data across the world. This is much harder.
What interviewers usually say: Start with one region, mention global as a future improvement.
How this changes your design: We can start simpler with everything in one place, then talk about how to make it global later.
Summarize your assumptions
Let me summarize what I will design for: 100,000 requests per second, 1 million API keys, sliding window is preferred, small overrun (like 5%) is acceptable, and we will start with one region. I will mention global rate limiting as a future feature.
The Hard Part
Say this to the interviewer
The hardest part is counting accurately when you have many servers. Imagine you have 10 rate limiter servers. User A sends a request to Server 1. At the exact same moment, User A sends another request to Server 2. Both servers need to know the total count - but they do not know about each other!
Why distributed counting is tricky (explained simply):
1. The race condition problem - Two servers read the count at the same time. Both see 99. Both think "okay, limit is 100, I can allow this." Both allow. Now the real count is 101.
2. The speed problem - We could make all servers ask a central place before every request. But that central place becomes a bottleneck, and talking between servers takes time (network latency).
3. The time boundary problem - If the limit is 100 per minute, what happens at minute boundaries? Someone could make 100 requests at 2:00:59 (end of one minute) and 100 more at 2:01:01 (start of the next). That is 200 requests in just 2 seconds!
4. The clock problem - Different servers have slightly different times. Server A thinks it is 2:00:00, Server B thinks it is 2:00:01. Their counts for "this minute" might not match.
Common mistake candidates make
Many people say: just use a counter and increment it! This works for one server, but breaks completely when you have many servers. Always think about the distributed case first - what if there are 10 servers?
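The broken read-check-increment flow is easy to see on paper. A toy sketch (all names here are illustrative; the two "servers" are just two reads of a shared count):

```python
LIMIT = 100
count = 99  # shared count, stored centrally

# Both servers read the count "at the same moment" -- each sees 99
seen_by_a = count
seen_by_b = count

# Each applies the check independently, and each decides to allow
a_allows = seen_by_a < LIMIT
b_allows = seen_by_b < LIMIT

# Both increment, so the real total is now 101: one over the limit
count += 2

print(a_allows, b_allows, count)  # True True 101
```

The fix is to make read-check-increment a single atomic step, which is exactly what the Redis Lua scripts later in this design do.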
The fundamental choice you must make:
Option 1: Be perfectly accurate (slow)
- Before every request, all servers must agree on the count
- This is called "distributed consensus" and takes 50-100 milliseconds
- Your API becomes slow because of rate limiting

Option 2: Be very fast but slightly inaccurate (recommended)
- Each server makes its own decision quickly
- Sometimes the total count goes a bit over the limit
- Your API stays fast
For rate limiting, we almost always pick Option 2. Why? If someone makes 105 requests instead of 100, nothing terrible happens. But if every API call takes 50ms longer, users will leave.
The race condition problem
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Requests per second | 100,000 | Too much for a single application server, but a single Redis node can do 100K+ operations per second, so the counting store works |
| Unique API keys | 1,000,000 | Need to track 1 million different users - this fits in memory easily |
What to tell the interviewer
At 100K requests per second with 1 million keys, the data easily fits in memory (about 100 MB). Redis can handle this throughput on a single node with room to spare. The challenge is not storage or throughput - it is coordinating counts across multiple rate limiter instances while staying under 5ms latency.
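That 100 MB figure is quick back-of-envelope arithmetic. The per-key byte count below is a rough assumption that folds in Redis per-key overhead:

```python
keys = 1_000_000
bytes_per_key = 100  # ~30 B key name + 8 B counter + Redis per-key overhead (rough assumption)

total_mb = keys * bytes_per_key / 1_000_000
print(total_mb)  # 100.0
```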
How people use the rate limiter (from most common to least common):
1. Check and allow - Most requests (99%) are under the limit and pass through. This must be super fast.
2. Check and block - About 1% of requests hit the limit. We return a "429 Too Many Requests" error.
3. Near the limit - When someone is close to their limit, they might slow down or retry later.
4. Way over limit - Bad actors or bugs that send way too many requests. Block quickly.
How much space does one user's rate limit data need?
- API key: about 30 bytes
- Current count: 8 bytes
- Plus Redis per-key overhead, so budget roughly 100 bytes per key in practice

Key insight about hot keys
Some API keys are much busier than others. One big customer might make 10,000 requests per second, while most make 1 per second. We call the busy keys "hot keys" and might need special handling for them (like local caching).
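One way to sketch that local cache for hot keys (a hypothetical helper; the class name, TTL, and structure are illustrative, not part of the design above):

```python
import time

class HotKeyCache:
    """Cache rate-limit decisions locally for a few milliseconds.

    For a key doing 10,000 req/s, caching the decision for 50ms
    turns roughly 500 Redis round trips into 1.
    """
    def __init__(self, ttl_ms=50):
        self.ttl = ttl_ms / 1000.0
        self.entries = {}  # key -> (decision, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]  # cached decision is still fresh
        return None  # missing or expired: caller must ask Redis

    def put(self, key, decision):
        self.entries[key] = (decision, time.time() + self.ttl)
```

The trade-off is the same one as everywhere in this design: a hot key can overrun its limit by up to (requests per TTL window) in exchange for far less Redis load.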
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will put the rate limiter in front of our API servers. Every request must pass through the rate limiter first. I will use Redis to store the counts because it is super fast (in-memory) and has atomic operations (no race conditions).
Rate Limiter System - The Big Picture
What each part does and WHY it is there:
| Part | What it does | Why we need it (what to tell interviewer) |
|---|---|---|
| Load Balancer | Spreads incoming requests across multiple rate limiters | One rate limiter cannot handle 100K requests per second. We need several working together. |
| Rate Limiter | Checks if this user has made too many requests. If yes, block. If no, allow. | This is the core of our system. It makes the allow/block decision for every request. |
| Redis | Stores the count of requests for each user. Super fast because it is all in memory. | We need a central place to keep counts. Redis is perfect because: (1) in-memory = fast, (2) atomic operations = no race conditions, (3) TTL = automatic cleanup. |
| API Servers | Do the actual work (whatever the API does). | Only get requests that passed the rate limit check. Protected from overload. |
Common interview question: Why not put rate limiting in the API server itself?
You could! For simple cases, this works fine. We use a separate rate limiter layer when: (1) we have many different API servers and want consistent limits, (2) we want to block bad traffic before it reaches our expensive API servers, (3) we want to change rate limit rules without redeploying all API servers.
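The "separate layer in front" idea can be sketched as a wrapper around any handler. This is a minimal illustration, not a real framework integration; all names and the response shape are made up:

```python
def rate_limited(check):
    """Wrap a handler so the rate-limit check runs before any real work."""
    def wrap(handler):
        def inner(api_key, *args, **kwargs):
            if not check(api_key):
                # Blocked before ever reaching the (expensive) API server
                return {"status": 429, "body": "Too Many Requests"}
            return handler(api_key, *args, **kwargs)
        return inner
    return wrap

# Toy check that blocks one specific key (stands in for the Redis check)
def always_allow_except(api_key):
    return api_key != "abusive_key"

@rate_limited(always_allow_except)
def get_status(api_key):
    return {"status": 200, "body": "ok"}
```

Because the check is a separate layer, swapping the rules (or the backing store) needs no change to the handlers themselves.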
Why we picked these tools:
Redis (recommended for rate limiting)
- Why we chose it: super fast (in-memory), atomic operations (the INCR command), automatic expiry (TTL), and it is widely used, so lots of help is available
- Other options we considered:
  - Memcached: also fast, but no atomic increment-with-expiry in one command
  - In-memory counters in each rate limiter: would not share counts across servers
  - Database (PostgreSQL): too slow for 100K operations per second
How real companies do it
Stripe uses Redis for rate limiting. GitHub uses a combination of Redis and in-memory caches. Cloudflare does rate limiting at the edge (their servers around the world) with local counting and periodic sync. For most companies, Redis works great.
Data Model and Storage
Now let me show how we store the rate limit data in Redis. The key design is simple but important.
What to tell the interviewer
I will use Redis as the single source of truth for request counts. The key pattern is rate_limit:api_key:window_id. I will use Redis Lua scripts to make check-and-increment atomic (no race conditions).
How we name the keys in Redis:
Think of Redis keys like file names. We need a naming system that is easy to understand and look up.
Key pattern: rate_limit:{api_key}:{time_window}
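Computing the `{time_window}` part is one line of arithmetic. A small sketch (the helper name is illustrative):

```python
import time

def make_key(api_key, window_seconds=60):
    """Build the Redis key for the current fixed window."""
    window_id = int(time.time() // window_seconds)  # e.g. minutes since the epoch
    return f"rate_limit:{api_key}:{window_id}"
```

Because the window id changes every minute, old keys stop being written and their TTL cleans them up automatically.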
Examples: rate_limit:abc123:28551600 - here abc123 is the API key and 28551600 is an illustrative window id (for example, the current minute number since the epoch).

Making it atomic (no race conditions):
The problem: If we do "read count, check if under limit, then increment" as separate steps, two requests can sneak through between the steps.
The solution: Use a Lua script that Redis runs as ONE atomic operation. Nobody can interrupt it.
```lua
-- This script runs as ONE atomic operation in Redis
-- Nobody can interrupt it, so no race conditions!
-- KEYS[1] = the window key, ARGV[1] = window length in seconds
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])  -- first hit: start the window's TTL
end
return count
```

The caller allows the request when the returned count is at or below the limit.
Why Lua scripts?
When Redis runs a Lua script, it blocks everything else until the script finishes. This means no other request can sneak in between our read and write. It is like putting a "do not disturb" sign on the door.
For Sliding Window (better accuracy):
The simple version above uses fixed windows (like 2:00 to 2:01). But someone could game it at the boundary. The sliding window counter is smarter:
```lua
-- Sliding Window Counter - More accurate than fixed windows
-- Uses a weighted average of current and previous window
-- KEYS[1] = current window key, KEYS[2] = previous window key
-- ARGV[1] = seconds elapsed in current window, ARGV[2] = window length, ARGV[3] = limit
local curr = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local weight = 1 - (tonumber(ARGV[1]) / tonumber(ARGV[2]))
if prev * weight + curr >= tonumber(ARGV[3]) then
  return 0  -- over the limit: block
end
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], 2 * tonumber(ARGV[2]))  -- keep long enough to serve as "previous"
return 1  -- under the limit: allow
```
Why sliding window is better
Imagine the limit is 100 per minute. With fixed windows, you could make 100 requests at 2:00:59 and 100 more at 2:01:01 (200 in 2 seconds!). With sliding window, we look at the last 60 seconds from RIGHT NOW, so this trick does not work.
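The weighted-average idea is simple enough to check by hand. A pure-Python sketch of just the estimate:

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_seconds, window_seconds=60):
    """Weighted count of requests over the last `window_seconds` seconds."""
    # Fraction of the previous window that still overlaps the sliding window
    weight = 1 - (elapsed_seconds / window_seconds)
    return prev_count * weight + curr_count

# 30s into the current minute: half of last minute's 100 requests still
# count, plus 20 so far this minute -> estimate of 70
print(sliding_window_estimate(100, 20, elapsed_seconds=30))  # 70.0
```

If the estimate is at or above the limit, the request is blocked; otherwise the current window's counter is incremented.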
Rate Limiting Algorithms Deep Dive
There are several ways to count requests and enforce limits. Let me explain the main ones in simple terms.
| Algorithm | How it works (simple explanation) | Good things | Bad things | When to use |
|---|---|---|---|---|
| Fixed Window | Count requests in each clock minute (2:00-2:01, 2:01-2:02) | Super simple, very fast | Can be gamed at boundaries (200 requests in 2 seconds) | Simple cases where boundary gaming is okay |
| Sliding Window Log | Keep a list of all request timestamps, count how many in last 60 seconds | Most accurate | Uses lots of memory (storing every timestamp) | When accuracy is critical and memory is not a problem |
| Sliding Window Counter | Keep counts for current and previous window, calculate weighted average | Accurate enough, low memory | Slightly approximate | Best choice for most real systems |
| Token Bucket | Imagine a bucket of tokens. Each request takes one. Tokens refill over time. | Allows bursts (use saved tokens) | More complex to implement | When you want to allow occasional bursts |
| Leaky Bucket | Requests go into a bucket that leaks at a steady rate | Very smooth output rate | Does not allow any bursts | When you need strictly steady traffic |
Algorithm 1: Fixed Window Counter (Simplest)
Think of this like a taxi meter that resets at the start of each minute.
Fixed Window - The Boundary Problem
The Problem: 200 requests in just 2 seconds while the limit is 100 per minute! This happens because we reset the count at minute boundaries.
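The boundary gaming is easy to see with window arithmetic (a tiny sketch):

```python
def window_id(t, window_seconds=60):
    """Which fixed window a timestamp (seconds) falls into."""
    return int(t // window_seconds)

# 100 requests at t=59s land in window 0; 100 more at t=61s land in window 1.
# Each window sees only 100 (under its own limit), yet 200 arrived
# within 2 seconds of wall-clock time.
print(window_id(59), window_id(61))  # 0 1
```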
Algorithm 2: Sliding Window Counter (Recommended)
Instead of hard resets, we use a weighted average. It is like looking at a 60-second window that slides with time.
```python
import time

import redis

r = redis.Redis()  # assumes a Redis instance reachable on localhost

def check_rate_limit(user_id, limit=100, window_seconds=60):
    """
    Sliding window counter.

    Example: It is now 2:00:30 (30 seconds into minute 2:00).
    Half of the previous minute still overlaps our sliding window,
    so we count: previous_count * 0.5 + current_count.
    """
    now = time.time()
    current_window = int(now // window_seconds)
    elapsed = now - current_window * window_seconds  # seconds into current window

    curr_key = f"rate_limit:{user_id}:{current_window}"
    prev_key = f"rate_limit:{user_id}:{current_window - 1}"

    curr = int(r.get(curr_key) or 0)
    prev = int(r.get(prev_key) or 0)

    # The previous window counts only for the fraction still inside the window
    weight = 1 - (elapsed / window_seconds)
    estimated = prev * weight + curr

    if estimated >= limit:
        return False  # over the limit: block

    pipe = r.pipeline()
    pipe.incr(curr_key)
    pipe.expire(curr_key, window_seconds * 2)  # keep long enough to serve as "previous"
    pipe.execute()
    return True  # under the limit: allow
```

(In production, the read-check-increment above should itself run as a Lua script so it stays atomic.)

Algorithm 3: Token Bucket (When you want to allow bursts)
Imagine a bucket that can hold 100 tokens. Each request needs one token. Tokens refill at a steady rate (like 1 per second).
```python
import time

# In-memory sketch; in production the bucket state would live in Redis
_buckets = {}  # user_id -> (tokens_left, last_refill_time)

def token_bucket_check(user_id, bucket_size=100, refill_rate=1.67):
    """
    bucket_size: how many tokens the bucket can hold (burst capacity)
    refill_rate: tokens added per second (1.67/sec is roughly 100/minute)
    Returns True if the request is allowed.
    """
    now = time.time()
    tokens, last = _buckets.get(user_id, (bucket_size, now))

    # Refill based on elapsed time, never beyond the bucket's capacity
    tokens = min(bucket_size, tokens + (now - last) * refill_rate)

    if tokens < 1:
        _buckets[user_id] = (tokens, now)
        return False  # no token available: block

    _buckets[user_id] = (tokens - 1, now)  # spend one token
    return True
```

Which algorithm should you pick?
For most API rate limiting: use Sliding Window Counter. It is simple, uses little memory, and accurate enough. Use Token Bucket only if you specifically want to allow bursts (like letting a user make 10 quick requests, then slow down).
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them. Interviewers love when you bring this up without being asked!
Common failures and how we handle them:
| What breaks | What happens to users | How we fix it | Why this works |
|---|---|---|---|
| Redis goes down | Cannot check rate limits | Allow all requests (fail open) + alert the team | Better to let some extra traffic through than block everyone |
| Redis is slow | Every API request becomes slow | Set a 5ms timeout + fail open if timeout | If Redis does not respond in 5ms, just allow the request and move on |
| One rate limiter crashes | Traffic shifts to other rate limiters | Use health checks + auto-restart | Since rate limiters are stateless, any one can handle any request |
| Network split between servers | Different servers see different counts | Accept some inaccuracy during the split | When network heals, counts become accurate again |
| One user sends millions of requests | Their requests flood Redis | Use local cache for hot keys | Cache the count locally for a few milliseconds to reduce Redis load |
The big decision: Fail Open vs Fail Closed
When Redis is down, what should we do?
Fail Open (allow all requests)
- User experience: API keeps working
- Risk: some users might exceed their limits
- When to use: most rate limiting (protecting resources)

Fail Closed (block all requests)
- User experience: API stops working completely
- Benefit: strict limit enforcement
- When to use: only when overrun is truly dangerous (like billing or critical resources)
```python
import logging

import redis

r = redis.Redis(socket_timeout=0.005)  # give Redis at most 5ms to answer

def check_rate_limit_safely(user_id, limit):
    """
    Check rate limit with proper error handling.
    Fails OPEN: if Redis is down or slow, allow the request and alert.
    """
    try:
        key = f"rate_limit:{user_id}"
        count = r.incr(key)
        if count == 1:
            r.expire(key, 60)  # first request in this window: start the clock
        return count <= limit
    except redis.RedisError:
        # Redis unavailable or timed out: let the request through, page the team
        logging.warning("rate limiter Redis unavailable - failing open")
        return True
```

Circuit Breaker Pattern (Smart failure handling)
If Redis keeps failing, we do not want to keep trying and failing. Instead, we "trip the circuit" and skip Redis entirely for a while.
```python
import time

class RateLimiterWithCircuitBreaker:
    def __init__(self, redis_check, failure_threshold=5, reset_after=30):
        self.redis_check = redis_check        # callable(user_id, limit) -> bool; may raise
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after        # seconds before we try Redis again
        self.opened_at = None                 # set when the circuit trips

    def check(self, user_id, limit):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return True                   # circuit open: skip Redis, fail open
            # Cool-down over: close the circuit and try Redis again
            self.opened_at = None
            self.failure_count = 0
        try:
            allowed = self.redis_check(user_id, limit)
            self.failure_count = 0            # a success resets the failure count
            return allowed
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # too many failures: trip the circuit
            return True                       # fail open on any single error
```

What is a circuit breaker?
Like an electrical circuit breaker that trips when there is too much current, this pattern trips when there are too many failures. It protects Redis from being hammered when it is already struggling, and gives it time to recover.
Growing the System Over Time
What to tell the interviewer
This design works great for one region with up to 500K requests per second. Let me explain how we would grow it if we need to go bigger or serve users around the world.
How we grow step by step:
Stage 1: Single Redis (up to 100K requests per second)
- One Redis server handles everything
- Super simple, strongly consistent
- Single point of failure (use a replica for backup)
- This is enough for most companies!

Stage 2: Redis Cluster (up to 1M requests per second)
- Split users across multiple Redis servers: User A goes to Redis 1, User B goes to Redis 2
- Each Redis handles fewer users = more capacity
- Still in one data center

Stage 3: Multi-Region (global users)
- Users in the US talk to US servers; users in Europe talk to Europe servers
- Either keep separate limits per region, or sync counts across regions
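Stage 2's key-to-node mapping can be sketched like this (node names are made up, and real Redis Cluster uses hash slots rather than simple modulo, but the idea is the same):

```python
import hashlib

REDIS_NODES = ["redis-1", "redis-2", "redis-3", "redis-4"]  # hypothetical nodes

def shard_for(api_key):
    """Hash the API key to pick a node, so each node tracks a slice of users."""
    digest = hashlib.md5(api_key.encode()).hexdigest()
    return REDIS_NODES[int(digest, 16) % len(REDIS_NODES)]
```

Because all requests for one API key hash to the same node, each counter still lives in exactly one place and stays atomic.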
Growing from single Redis to global
Global Rate Limiting (The hard mode)
If a user has a limit of 100 per minute globally, but they are making requests from both US and Europe, how do we count?
Option 1: One central Redis (simple but slow for distant users)
- All rate limit checks go to one Redis in, say, the US
- Users in Europe pay extra latency (crossing the ocean)
- Simple and accurate

Option 2: Regional Redis with sync (complex but fast)
- Each region has its own Redis
- Counts sync between regions every few seconds
- Fast for users, but counts might be slightly off during sync delays
| Approach | How it works | Good things | Bad things |
|---|---|---|---|
| Single global Redis | Everyone talks to one Redis | Simple, always accurate | Slow for distant users (100ms+ latency) |
| Regional with budget | Each region gets a share (US: 50, EU: 50) | Fast, no sync needed | Unfair if user only uses one region |
| Regional with sync | Local Redis + sync counts every 1-5 seconds | Fast and fairly accurate | Complex, can have brief overruns during sync delay |
Cool features we can add later:
1. Different limits for different endpoints
- POST /upload: 10 per minute (expensive operation)
- GET /status: 1000 per minute (cheap operation)
- Just use different Redis keys for different endpoints

2. Different limits for different users
- Free users: 100 per minute
- Paid users: 10,000 per minute
- Store the user tier in Redis or a database and look it up when checking

3. Graduated response
- First warning: slow the response down (add delay)
- Second warning: return a 429 error
- Repeated abuse: block the user completely
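Per-endpoint limits need nothing more than richer key names. A small sketch (the limits and helper names are illustrative):

```python
# Hypothetical per-endpoint limits (requests per minute)
ENDPOINT_LIMITS = {"POST /upload": 10, "GET /status": 1000}

def key_for(api_key, endpoint, window_id):
    """Separate counter per (user, endpoint) pair."""
    return f"rate_limit:{api_key}:{endpoint}:{window_id}"

def limit_for(endpoint, default=100):
    """Look up the endpoint's limit, falling back to a default."""
    return ENDPOINT_LIMITS.get(endpoint, default)
```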
What would you do differently for...
**Billing-critical limits** (like AWS billing quotas): Use stronger consistency - it is worth the extra latency because money is involved.

**DDoS protection**: Use local counting at the edge without any sync. Speed matters more than accuracy when you are being attacked.

**Per-user monthly quotas**: Store in a database with proper transactions. They are checked less often (not on every request), so accuracy matters more than speed.