Design a Rate Limiter: Complete System Design Guide
How to design a rate limiter for system design interviews. Covers token bucket, sliding window, distributed rate limiting, and edge cases.
Ready to Master System Design Interviews?
Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.
Complete Solutions
Architecture diagrams & trade-off analysis
Real Interview Problems
From actual FAANG interviews
7-day money-back guarantee • Lifetime access • New problems added quarterly
"Design a rate limiter."
This question appears in almost every system design interview because rate limiting is everywhere. Every API you've ever used (Stripe, Twitter, AWS) has rate limiting. It's a fundamental building block that protects systems from abuse and ensures fair resource allocation.
What makes this question interesting is the depth. The basic concept is simple, but the implementation reveals your understanding of algorithms, distributed systems, and real-world trade-offs.
Why Rate Limiting Matters
Before diving into design, understand why companies rate limit:
1. Prevent resource exhaustion
Without limits, a single user (or bot) could consume all your server capacity. Rate limiting ensures no single client monopolizes resources.
2. Protect against attacks
DDoS attacks and brute-force attempts rely on high request volumes. Rate limiting is the first line of defense.
3. Control costs
Cloud services charge by usage. Runaway scripts or bugs can generate massive bills. Rate limiting provides a safety net.
4. Ensure fair usage
In multi-tenant systems, rate limiting prevents one tenant from degrading service for others.
5. Business model enforcement
SaaS products often tier features by usage limits. Rate limiting enforces these tiers.
Step 1: Clarify Requirements
Start by asking questions. Rate limiters can vary significantly based on requirements.
Questions to Ask
1. What are we rate limiting?
- Requests per user?
- Requests per IP address?
- Requests per API key?
- Requests per endpoint?
2. What are the limits?
- 100 requests per minute?
- 1000 requests per hour?
- Multiple tiers (free vs. paid)?
3. What happens when limits are exceeded?
- Return 429 (Too Many Requests)?
- Queue and process later?
- Degrade service quality?
4. Is this distributed or single-server?
- Single application server?
- Multiple servers behind a load balancer?
- Multiple data centers?
5. What's the accuracy requirement?
- Exactly 100 requests allowed?
- Approximately 100 (95-105 acceptable)?
Typical Requirements
For this design, assume:
- Rate limit by user ID and API endpoint
- 100 requests per minute per user per endpoint
- Return 429 when exceeded with retry-after header
- Distributed across multiple servers
- Soft limits acceptable (slight over-allowance is OK)
Step 2: High-Level Design
Here's the architecture:
┌────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client   │────▶│  Load Balancer  │────▶│   API Server    │
└────────────┘     └─────────────────┘     └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │  Rate Limiter   │
                                           │   Middleware    │
                                           └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │      Redis      │
                                           │ (Rate Counters) │
                                           └─────────────────┘
Request Flow
- Client sends request to API
- Request hits rate limiter middleware
- Middleware checks Redis for current count
- If under limit: increment counter, allow request
- If over limit: return 429 with retry-after header
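The flow above can be sketched in a few lines. This is an illustrative stand-in: the counter store is a plain dict so the sketch is self-contained (a real deployment would use the Redis commands shown later), and the function name `check_and_increment` is made up for this example.

```python
import time

# In-memory stand-in for the Redis counter store (illustrative only)
_counters = {}

def check_and_increment(user_id, limit=100, window_seconds=60, now=None):
    """Return (allowed, retry_after) per the request flow above."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = (user_id, window)

    # Step 3-4: check current count and increment
    count = _counters.get(key, 0) + 1
    _counters[key] = count
    if count <= limit:
        return True, 0

    # Step 5: over limit, tell the client when the window resets
    retry_after = (window + 1) * window_seconds - now
    return False, retry_after
```

A middleware would call this before the handler and translate `(False, retry_after)` into a 429 response with a `Retry-After` header.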
Why Redis?
Rate limiting requires:
- Fast reads/writes: checking limits shouldn't add latency
- Atomic operations: multiple servers must coordinate
- TTL support: counters must expire automatically
Redis provides all three with sub-millisecond latency.
Step 3: Rate Limiting Algorithms
This is where you differentiate yourself. There are five main algorithms, each with trade-offs.
Algorithm 1: Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals) and count requests in each.
Window: 10:00:00 - 10:00:59 → Count: 45
Window: 10:01:00 - 10:01:59 → Count: 78
Window: 10:02:00 - 10:02:59 → Count: 12
Implementation:
import time

import redis as redis_lib

redis = redis_lib.Redis(decode_responses=True)  # shared client used by the snippets below

def is_allowed(user_id, limit=100):
    # Get current minute as key
    current_window = int(time.time() / 60)
    key = f"rate:{user_id}:{current_window}"

    # Atomic increment
    count = redis.incr(key)

    # Set expiry on first request
    if count == 1:
        redis.expire(key, 60)

    return count <= limit
Pros:
- Simple to implement
- Memory efficient (one counter per window)
Cons:
- Boundary problem: User can send 100 requests at 10:00:59 and 100 more at 10:01:00 (200 in 2 seconds)
When to use: When simplicity matters more than precision.
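The boundary problem is easy to demonstrate with an injected clock: a client can land the full limit at the end of one window and the full limit again at the start of the next. A self-contained simulation (in-memory dict instead of Redis, function name illustrative):

```python
def fixed_window_allowed(counts, user_id, now, limit=100, window_seconds=60):
    # Same fixed-window logic as above, with an injected clock for the demo
    key = (user_id, int(now // window_seconds))
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= limit

counts = {}
# 100 requests at 10:00:59 (t=59): all land in window 0, all allowed
burst1 = [fixed_window_allowed(counts, "u1", now=59.0) for _ in range(100)]
# 100 more at 10:01:00 (t=60): new window, fresh counter, all allowed again
burst2 = [fixed_window_allowed(counts, "u1", now=60.0) for _ in range(100)]
print(sum(burst1) + sum(burst2))  # 200 requests accepted within ~1 second
```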
Algorithm 2: Sliding Window Log
Track timestamps of all requests. Count requests within the sliding window.
Requests: [10:00:15, 10:00:23, 10:00:45, 10:01:02, 10:01:15]
At 10:01:30, window is 10:00:30 - 10:01:30
Requests in window: [10:00:45, 10:01:02, 10:01:15] = 3 requests
Implementation:
import uuid

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    key = f"rate:{user_id}"

    # Remove old entries
    window_start = now - window_seconds
    redis.zremrangebyscore(key, 0, window_start)

    # Count current requests
    count = redis.zcard(key)

    if count < limit:
        # Add current request (uuid avoids member collisions at the same timestamp)
        redis.zadd(key, {str(uuid.uuid4()): now})
        redis.expire(key, window_seconds)
        return True

    return False
Pros:
- Precise, no boundary problem
- Smooth rate limiting
Cons:
- Memory intensive (stores every request timestamp)
- More Redis operations per request
When to use: When precision is critical and traffic is moderate.
Algorithm 3: Sliding Window Counter
Hybrid approach. Combines fixed windows with weighted counting.
Previous window (10:00-10:01): 84 requests
Current window (10:01-10:02): 36 requests
At 10:01:15 (25% into current window):
Weighted count = 84 × 0.75 + 36 × 1.0 = 63 + 36 = 99 requests
Implementation:
def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1

    # Position in current window (0.0 to 1.0)
    window_position = (now % window_seconds) / window_seconds

    # Get counts
    current_key = f"rate:{user_id}:{current_window}"
    previous_key = f"rate:{user_id}:{previous_window}"
    current_count = int(redis.get(current_key) or 0)
    previous_count = int(redis.get(previous_key) or 0)

    # Weighted count
    weighted_count = previous_count * (1 - window_position) + current_count

    if weighted_count < limit:
        redis.incr(current_key)
        redis.expire(current_key, window_seconds * 2)
        return True

    return False
Pros:
- Memory efficient (two counters)
- Smooths out boundary problem
- Good balance of precision and efficiency
Cons:
- Approximation (not exact)
- Slightly more complex than fixed window
When to use: Best general-purpose choice for most applications.
Algorithm 4: Token Bucket
Tokens are added to a bucket at a fixed rate. Each request consumes a token. If no tokens available, request is rejected.
Bucket capacity: 10 tokens
Refill rate: 1 token per second
Time 0: 10 tokens (full)
Request → 9 tokens
Request → 8 tokens
...
Time 5: 3 tokens + 5 new = 8 tokens
Request → 7 tokens
Implementation:
def is_allowed(user_id, capacity=10, refill_rate=1):
    key = f"bucket:{user_id}"
    now = time.time()

    # Get current state
    bucket = redis.hgetall(key)

    if not bucket:
        # Initialize bucket
        tokens = capacity - 1  # Consume one for this request
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        redis.expire(key, 3600)
        return True

    # Calculate tokens to add
    last_refill = float(bucket["last_refill"])
    elapsed = now - last_refill
    tokens = float(bucket["tokens"])

    # Refill tokens
    tokens = min(capacity, tokens + elapsed * refill_rate)

    if tokens >= 1:
        # Consume token
        tokens -= 1
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        return True

    return False
Pros:
- Allows bursts (up to bucket capacity)
- Smooth long-term rate limiting
- Used by many real systems (AWS, Stripe)
Cons:
- More complex state (tokens + timestamp)
- Two parameters to tune (capacity, rate)
When to use: When you want to allow bursts while maintaining average rate.
Algorithm 5: Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate. If queue is full, requests are dropped.
Queue capacity: 10
Processing rate: 1 request/second
Requests enter queue → processed in order → overflow dropped
Pros:
- Smooth output rate (good for downstream systems)
- Simple to understand
Cons:
- Adds latency (requests wait in queue)
- Doesn't handle bursts well
When to use: When you need to smooth traffic to downstream services.
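Unlike the other algorithms, no code is given for this one. Here is a minimal in-memory sketch, with the caveat that it implements the "meter" variant (the bucket level drains at the processing rate and a request is rejected when the bucket is full) rather than the queueing variant described above; the class name and parameters are illustrative. The clock is injected so it can be unit tested:

```python
class LeakyBucket:
    """Leaky bucket as a meter: the level drains at leak_rate per second."""

    def __init__(self, capacity=10, leak_rate=1.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last = None

    def is_allowed(self, now):
        if self.last is not None:
            # Drain the bucket for the elapsed time
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

The queueing variant smooths output but adds latency; the meter variant rejects immediately, which is the more common choice when the goal is rate limiting rather than traffic shaping.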
Algorithm Comparison
| Algorithm | Memory | Precision | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Low | Low | Poor | Simple |
| Sliding Log | High | High | Good | Medium |
| Sliding Counter | Low | Medium | Good | Medium |
| Token Bucket | Low | High | Excellent | Medium |
| Leaky Bucket | Medium | High | Poor | Medium |
Recommendation: For interviews, explain Token Bucket or Sliding Window Counter. They show sophistication while being practical.
Step 4: Distributed Rate Limiting
The real challenge. With multiple API servers, how do you maintain consistent limits?
The Problem
Server 1: User sent 60 requests (sees 60)
Server 2: User sent 60 requests (sees 60)
Total: 120 requests, but each server thinks user is under 100 limit
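The split-counter problem can be shown in a few lines: two servers with independent counters each stay "under" the limit while the combined total is well over it. The `LocalCounter` class is illustrative:

```python
class LocalCounter:
    # Each server counts only the requests it sees
    def __init__(self, limit=100):
        self.limit = limit
        self.count = 0

    def is_allowed(self):
        self.count += 1
        return self.count <= self.limit

server1, server2 = LocalCounter(limit=100), LocalCounter(limit=100)
# Load balancer splits 120 requests evenly: 60 to each server
results = [server1.is_allowed() for _ in range(60)] + \
          [server2.is_allowed() for _ in range(60)]
print(sum(results))  # 120: every request allowed, though the limit is 100
```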
Solution: Centralized Counter (Redis)
All servers share the same Redis instance for counters.
┌──────────┐     ┌──────────┐     ┌──────────┐
│ Server 1 │     │ Server 2 │     │ Server 3 │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
               ┌──────▼──────┐
               │    Redis    │
               │  (Counters) │
               └─────────────┘
Redis commands for atomic operations:
# Atomic increment with expiry
pipe = redis.pipeline()
pipe.incr(key)
pipe.expire(key, window_seconds)
count, _ = pipe.execute()
Handling Redis Latency
Adding Redis adds network latency. For high-traffic systems:
Option 1: Local cache + async sync
LOCAL_LIMIT = 10  # per-server allowance between syncs (illustrative value)

class RateLimiter:
    def __init__(self):
        self.local_counts = {}  # Local counter
        self.sync_interval = 1  # Sync every second

    def is_allowed(self, user_id):
        # Check local count first
        if self.local_counts.get(user_id, 0) >= LOCAL_LIMIT:
            return False
        # Increment locally
        self.local_counts[user_id] = self.local_counts.get(user_id, 0) + 1
        return True

    def sync_to_redis(self):
        # Background job: push local counts to Redis
        for user_id, count in self.local_counts.items():
            redis.incrby(f"rate:{user_id}", count)
        self.local_counts.clear()
Trade-off: Reduces Redis calls but allows slight over-limit.
Option 2: Rate limit at load balancer
Use sticky sessions so all requests from a user hit the same server. Rate limit locally.
Trade-off: Simpler, but less flexible.
Step 5: Handling Edge Cases
Strong candidates proactively address these:
1. What if Redis goes down?
Option A: Fail open (allow all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: allow request
        logger.warning("Redis unavailable, allowing request")
        return True

Option B: Fail closed (reject all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: reject request
        return False

Option C: Local fallback

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Fall back to local rate limiting
        return check_local_limit(user_id)
Best practice: Fail open with monitoring. Temporary over-allowance is usually better than service outage.
2. Race conditions
Two requests arrive simultaneously. Both check count (99), both increment (100, 101).
Solution: Use atomic Redis operations.
-- Lua script for atomic check-and-increment
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
3. Clock synchronization
Servers have slightly different clocks. Window boundaries differ.
Solution: Use Redis server time, not local time.
def get_current_window():
    # Use Redis TIME command for consistency
    redis_time = redis.time()
    return int(redis_time[0] / 60)  # Current minute
4. Thundering herd at window reset
All rate-limited users retry exactly when window resets.
Solution: Add jitter to retry-after headers.
import random

def get_retry_after(window_end):
    base_delay = window_end - time.time()
    jitter = random.uniform(0, 5)  # Random 0-5 seconds
    return base_delay + jitter
Step 6: API Design
Response Headers
Return rate limit info in headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067200
When exceeded:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30
Client-Friendly Design
Good rate limiters help clients:
- Clear error messages: "Rate limit exceeded. Try again in 30 seconds."
- Retry-After header: Tells clients exactly when to retry
- Remaining count: Lets clients pace themselves
- Consistent across endpoints: Predictable behavior
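On the client side, a well-behaved consumer reads these headers and backs off. A minimal sketch, where `send` is a placeholder for any HTTP client call returning `(status_code, headers)`:

```python
import random
import time

def request_with_backoff(send, max_attempts=3):
    """Call send() until it returns a non-429 status or attempts run out.

    send() must return (status_code, headers).
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Honor Retry-After, plus jitter to avoid synchronized retries
            delay = float(headers.get("Retry-After", 1)) + random.uniform(0, 1)
            time.sleep(delay)
    return status
```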
Step 7: Rate Limiting Strategies
By User
Most common. Each user has their own limit.
key = f"rate:user:{user_id}:{endpoint}"
By IP Address
For unauthenticated endpoints (login, signup).
key = f"rate:ip:{client_ip}:{endpoint}"
Caution: NAT and proxies can make many users share one IP.
By API Key
For B2B APIs. Each customer gets their own limit.
key = f"rate:api_key:{api_key}"
Tiered Limits
Different limits for different user tiers:
def get_limit(user):
    if user.plan == "enterprise":
        return 10000  # per minute
    elif user.plan == "pro":
        return 1000
    else:
        return 100
Per-Endpoint Limits
Different limits for different endpoints:
ENDPOINT_LIMITS = {
    "/api/search": 10,    # Expensive operation
    "/api/users": 100,    # Normal operation
    "/api/health": 1000,  # Cheap operation
}
Step 8: Monitoring and Observability
Metrics to Track
# Prometheus metrics example
from prometheus_client import Counter, Gauge

rate_limit_hits = Counter(
    "rate_limit_hits_total",
    "Number of requests that hit rate limits",
    ["user_tier", "endpoint"]
)

# Beware label cardinality: a per-user label can explode the metric count
rate_limit_remaining = Gauge(
    "rate_limit_remaining",
    "Remaining requests in current window",
    ["user_id"]
)
Alerts
Set up alerts for:
- High rate of 429 responses (might indicate attack or misconfiguration)
- Redis latency spikes
- Users consistently hitting limits (might need limit adjustment)
Dashboard
Visualize:
- 429 response rate over time
- Top rate-limited users
- Limit utilization by tier
- Redis operation latency
Implementation: Complete Token Bucket
Here's a production-ready implementation:
import time

import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second

    def is_allowed(self, key):
        now = time.time()

        # Lua script so check-and-consume happens atomically in Redis
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1])
        local last_refill = tonumber(bucket[2])

        if tokens == nil then
            -- Initialize bucket
            tokens = capacity - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return {1, tokens}
        end

        -- Refill tokens
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            return {1, tokens}
        end

        return {0, tokens}
        """

        result = self.redis.eval(
            lua_script,
            1,
            key,
            self.capacity,
            self.refill_rate,
            now,
        )
        allowed = result[0] == 1
        remaining = result[1]
        return allowed, remaining

# Usage (inside a request handler)
limiter = TokenBucketRateLimiter(
    redis_client=redis.Redis(),
    capacity=100,
    refill_rate=100 / 60  # 100 per minute
)

allowed, remaining = limiter.is_allowed(f"user:{user_id}")
if not allowed:
    return Response(status=429, headers={
        "X-RateLimit-Remaining": "0",
        "Retry-After": "1"
    })
Common Interview Follow-Ups
"How would you handle 10x traffic?"
- Scale Redis horizontally (Redis Cluster)
- Add local caching layer
- Consider eventual consistency (sync less frequently)
- Rate limit earlier in the stack (CDN/load balancer)
"What if you can't use Redis?"
Alternatives:
- In-memory with gossip protocol: Servers share state periodically
- Database: Slower, but works for low-traffic systems
- Cloud services: AWS API Gateway, Cloudflare rate limiting
"How do you test a rate limiter?"
- Unit tests for algorithm correctness
- Load tests for performance under stress
- Chaos tests for Redis failures
- Integration tests for header correctness
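Unit testing for algorithm correctness is straightforward once the clock is injectable. A pytest-style example; the `TokenBucket` here is a small in-memory version written for the test, not the Redis-backed implementation above:

```python
class TokenBucket:
    # Minimal in-memory token bucket with an injected clock, for testing
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = 0.0

    def is_allowed(self, now):
        # Refill for elapsed time, capped at capacity, then try to consume
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def test_burst_then_refill():
    bucket = TokenBucket(capacity=3, refill_rate=1.0)
    # Burst up to capacity is allowed
    assert all(bucket.is_allowed(now=0.0) for _ in range(3))
    # Bucket is empty: next request rejected
    assert not bucket.is_allowed(now=0.0)
    # After 2 seconds, 2 tokens have refilled
    assert bucket.is_allowed(now=2.0)
    assert bucket.is_allowed(now=2.0)
    assert not bucket.is_allowed(now=2.0)
```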
"Token bucket vs sliding window?"
- Token bucket: Better for allowing bursts while maintaining average rate
- Sliding window: Better for strict limits without burst allowance
Summary: The Winning Answer
| Time | What to Cover |
|---|---|
| 0-5 min | Clarify: what to limit, limits, distributed? |
| 5-15 min | Explain 2-3 algorithms, compare trade-offs |
| 15-25 min | High-level design with Redis |
| 25-35 min | Deep dive: distributed challenges, race conditions |
| 35-45 min | Edge cases: Redis failure, monitoring |
Key differentiators:
- Explain multiple algorithms, not just one
- Discuss distributed challenges proactively
- Address failure modes
- Show awareness of production concerns (monitoring, alerting)
Frequently Asked Questions
Which algorithm should I use in production?
Token bucket is the most common choice. It's used by AWS, Stripe, and most major APIs. It handles bursts gracefully while maintaining average rate limits.
Should rate limiting happen at the API gateway or application?
Both. API gateway provides first line of defense (DDoS protection). Application-level provides granular, user-specific limits.
How do I choose limit values?
Start conservative and adjust based on data. Monitor 95th percentile usage. Set limits slightly above normal usage but below abuse thresholds.
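Computing that 95th percentile from observed per-user usage takes one stdlib call; a sketch, assuming `usage` holds requests-per-minute samples collected from monitoring:

```python
from statistics import quantiles

def p95(samples):
    """95th percentile of per-minute usage samples."""
    return quantiles(samples, n=100, method="inclusive")[94]

usage = [12, 15, 9, 30, 22, 18, 25, 11, 40, 95]
print(p95(usage))  # set the limit somewhat above this value
```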
What about rate limiting in microservices?
Each service should have its own rate limiting. Use service mesh (Istio, Envoy) for consistent rate limiting across services.