Design a Rate Limiter: Complete System Design Guide
How to design a rate limiter for system design interviews. Covers token bucket, sliding window, distributed rate limiting, and edge cases.
Ready to Master System Design Interviews?
Learn from 25+ real interview problems from Netflix, Uber, Google, and Stripe. Created by a senior engineer who's taken 200+ system design interviews at FAANG companies.
Complete Solutions
Architecture diagrams & trade-off analysis
Real Interview Problems
From actual FAANG interviews
7-day money-back guarantee • Lifetime access • New problems added quarterly
"Design a rate limiter."
This question appears in almost every system design interview because rate limiting is everywhere. Every API you've ever used (Stripe, Twitter, AWS) has rate limiting. It's a fundamental building block that protects systems from abuse and ensures fair resource allocation.
What makes this question interesting is the depth. The basic concept is simple, but the implementation reveals your understanding of algorithms, distributed systems, and real-world trade-offs.
Why Rate Limiting Matters
Before diving into design, understand why companies rate limit:
1. Prevent resource exhaustion
Without limits, a single user (or bot) could consume all your server capacity. Rate limiting ensures no single client monopolizes resources.
2. Protect against attacks
DDoS attacks and brute-force attempts rely on high request volumes. Rate limiting is the first line of defense.
3. Control costs
Cloud services charge by usage. Runaway scripts or bugs can generate massive bills. Rate limiting provides a safety net.
4. Ensure fair usage
In multi-tenant systems, rate limiting prevents one tenant from degrading service for others.
5. Business model enforcement
SaaS products often tier features by usage limits. Rate limiting enforces these tiers.
Step 1: Clarify Requirements
Start by asking questions. Rate limiters can vary significantly based on requirements.
Questions to Ask
1. What are we rate limiting?
- Requests per user?
- Requests per IP address?
- Requests per API key?
- Requests per endpoint?
2. What are the limits?
- 100 requests per minute?
- 1000 requests per hour?
- Multiple tiers (free vs. paid)?
3. What happens when limits are exceeded?
- Return 429 (Too Many Requests)?
- Queue and process later?
- Degrade service quality?
4. Is this distributed or single-server?
- Single application server?
- Multiple servers behind a load balancer?
- Multiple data centers?
5. What's the accuracy requirement?
- Exactly 100 requests allowed?
- Approximately 100 (95-105 acceptable)?
Typical Requirements
For this design, assume:
- Rate limit by user ID and API endpoint
- 100 requests per minute per user per endpoint
- Return 429 when exceeded with retry-after header
- Distributed across multiple servers
- Soft limits acceptable (slight over-allowance is OK)
Step 2: High-Level Design
Here's the architecture:
┌────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client   │────▶│  Load Balancer  │────▶│   API Server    │
└────────────┘     └─────────────────┘     └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │  Rate Limiter   │
                                           │   Middleware    │
                                           └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │      Redis      │
                                           │ (Rate Counters) │
                                           └─────────────────┘
Request Flow
- Client sends request to API
- Request hits rate limiter middleware
- Middleware checks Redis for current count
- If under limit: increment counter, allow request
- If over limit: return 429 with retry-after header
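The flow above can be sketched in a few lines. This is an illustrative stand-in: the counter store is a plain dict so the sketch is self-contained (a real deployment would use the Redis commands shown later), and the function name `check_and_increment` is made up for this example.

```python
import time

# In-memory stand-in for the Redis counter store (illustrative only)
_counters = {}

def check_and_increment(user_id, limit=100, window_seconds=60, now=None):
    """Return (allowed, retry_after) per the request flow above."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = (user_id, window)

    # Step 3-4: check current count and increment
    count = _counters.get(key, 0) + 1
    _counters[key] = count
    if count <= limit:
        return True, 0

    # Step 5: over limit, tell the client when the window resets
    retry_after = (window + 1) * window_seconds - now
    return False, retry_after
```

A middleware would call this before the handler and translate `(False, retry_after)` into a 429 response with a `Retry-After` header.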
Why Redis?
Rate limiting requires:
- Fast reads/writes: checking limits shouldn't add latency
- Atomic operations: multiple servers must coordinate
- TTL support: counters must expire automatically
Redis provides all three with sub-millisecond latency.
Step 3: Rate Limiting Algorithms
This is where you differentiate yourself. There are five main algorithms, each with trade-offs.
Algorithm 1: Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals) and count requests in each.
Window: 10:00:00 - 10:00:59 → Count: 45
Window: 10:01:00 - 10:01:59 → Count: 78
Window: 10:02:00 - 10:02:59 → Count: 12
Implementation:
import time

import redis as redis_lib

redis = redis_lib.Redis(decode_responses=True)  # shared client used by the snippets below

def is_allowed(user_id, limit=100):
    # Get current minute as key
    current_window = int(time.time() / 60)
    key = f"rate:{user_id}:{current_window}"

    # Atomic increment
    count = redis.incr(key)

    # Set expiry on first request
    if count == 1:
        redis.expire(key, 60)

    return count <= limit
Pros:
- Simple to implement
- Memory efficient (one counter per window)
Cons:
- Boundary problem: User can send 100 requests at 10:00:59 and 100 more at 10:01:00 (200 in 2 seconds)
When to use: When simplicity matters more than precision.
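The boundary problem is easy to demonstrate with an injected clock: a client can land the full limit at the end of one window and the full limit again at the start of the next. A self-contained simulation (in-memory dict instead of Redis, function name illustrative):

```python
def fixed_window_allowed(counts, user_id, now, limit=100, window_seconds=60):
    # Same fixed-window logic as above, with an injected clock for the demo
    key = (user_id, int(now // window_seconds))
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= limit

counts = {}
# 100 requests at 10:00:59 (t=59): all land in window 0, all allowed
burst1 = [fixed_window_allowed(counts, "u1", now=59.0) for _ in range(100)]
# 100 more at 10:01:00 (t=60): new window, fresh counter, all allowed again
burst2 = [fixed_window_allowed(counts, "u1", now=60.0) for _ in range(100)]
print(sum(burst1) + sum(burst2))  # 200 requests accepted within ~1 second
```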
Algorithm 2: Sliding Window Log
Track timestamps of all requests. Count requests within the sliding window.
Requests: [10:00:15, 10:00:23, 10:00:45, 10:01:02, 10:01:15]
At 10:01:30, window is 10:00:30 - 10:01:30
Requests in window: [10:00:45, 10:01:02, 10:01:15] = 3 requests
Implementation:
import uuid

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    key = f"rate:{user_id}"

    # Remove old entries
    window_start = now - window_seconds
    redis.zremrangebyscore(key, 0, window_start)

    # Count current requests
    count = redis.zcard(key)

    if count < limit:
        # Add current request (uuid avoids member collisions at the same timestamp)
        redis.zadd(key, {str(uuid.uuid4()): now})
        redis.expire(key, window_seconds)
        return True

    return False
Pros:
- Precise, no boundary problem
- Smooth rate limiting
Cons:
- Memory intensive (stores every request timestamp)
- More Redis operations per request
When to use: When precision is critical and traffic is moderate.
Algorithm 3: Sliding Window Counter
Hybrid approach. Combines fixed windows with weighted counting.
Previous window (10:00-10:01): 84 requests
Current window (10:01-10:02): 36 requests
At 10:01:15 (25% into current window):
Weighted count = 84 × 0.75 + 36 × 1.0 = 63 + 36 = 99 requests
Implementation:
def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1

    # Position in current window (0.0 to 1.0)
    window_position = (now % window_seconds) / window_seconds

    # Get counts
    current_key = f"rate:{user_id}:{current_window}"
    previous_key = f"rate:{user_id}:{previous_window}"
    current_count = int(redis.get(current_key) or 0)
    previous_count = int(redis.get(previous_key) or 0)

    # Weighted count
    weighted_count = previous_count * (1 - window_position) + current_count

    if weighted_count < limit:
        redis.incr(current_key)
        redis.expire(current_key, window_seconds * 2)
        return True

    return False
Pros:
- Memory efficient (two counters)
- Smooths out boundary problem
- Good balance of precision and efficiency
Cons:
- Approximation (not exact)
- Slightly more complex than fixed window
When to use: Best general-purpose choice for most applications.
Algorithm 4: Token Bucket
Tokens are added to a bucket at a fixed rate. Each request consumes a token. If no tokens available, request is rejected.
Bucket capacity: 10 tokens
Refill rate: 1 token per second
Time 0: 10 tokens (full)
Request → 9 tokens
Request → 8 tokens
...
Time 5: 3 tokens + 5 new = 8 tokens
Request → 7 tokens
Implementation:
def is_allowed(user_id, capacity=10, refill_rate=1):
    key = f"bucket:{user_id}"
    now = time.time()

    # Get current state
    bucket = redis.hgetall(key)

    if not bucket:
        # Initialize bucket
        tokens = capacity - 1  # Consume one for this request
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        redis.expire(key, 3600)
        return True

    # Calculate tokens to add
    last_refill = float(bucket["last_refill"])
    elapsed = now - last_refill
    tokens = float(bucket["tokens"])

    # Refill tokens
    tokens = min(capacity, tokens + elapsed * refill_rate)

    if tokens >= 1:
        # Consume token
        tokens -= 1
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        return True

    return False
Pros:
- Allows bursts (up to bucket capacity)
- Smooth long-term rate limiting
- Used by many real systems (AWS, Stripe)
Cons:
- More complex state (tokens + timestamp)
- Two parameters to tune (capacity, rate)
When to use: When you want to allow bursts while maintaining average rate.
Algorithm 5: Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate. If queue is full, requests are dropped.
Queue capacity: 10
Processing rate: 1 request/second
Requests enter queue → processed in order → overflow dropped
Pros:
- Smooth output rate (good for downstream systems)
- Simple to understand
Cons:
- Adds latency (requests wait in queue)
- Doesn't handle bursts well
When to use: When you need to smooth traffic to downstream services.
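Unlike the other algorithms, no code is given for this one. Here is a minimal in-memory sketch, with the caveat that it implements the "meter" variant (the bucket level drains at the processing rate and a request is rejected when the bucket is full) rather than the queueing variant described above; the class name and parameters are illustrative. The clock is injected so it can be unit tested:

```python
class LeakyBucket:
    """Leaky bucket as a meter: the level drains at leak_rate per second."""

    def __init__(self, capacity=10, leak_rate=1.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last = None

    def is_allowed(self, now):
        if self.last is not None:
            # Drain the bucket for the elapsed time
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

The queueing variant smooths output but adds latency; the meter variant rejects immediately, which is the more common choice when the goal is rate limiting rather than traffic shaping.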
Algorithm Comparison
| Algorithm | Memory | Precision | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Low | Low | Poor | Simple |
| Sliding Log | High | High | Good | Medium |
| Sliding Counter | Low | Medium | Good | Medium |
| Token Bucket | Low | High | Excellent | Medium |
| Leaky Bucket | Medium | High | Poor | Medium |
Recommendation: For interviews, explain Token Bucket or Sliding Window Counter. They show sophistication while being practical.
Step 4: Distributed Rate Limiting
The real challenge. With multiple API servers, how do you maintain consistent limits?
The Problem
Server 1: User sent 60 requests (sees 60)
Server 2: User sent 60 requests (sees 60)
Total: 120 requests, but each server thinks user is under 100 limit
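The split-counter problem can be shown in a few lines: two servers with independent counters each stay "under" the limit while the combined total is well over it. The `LocalCounter` class is illustrative:

```python
class LocalCounter:
    # Each server counts only the requests it sees
    def __init__(self, limit=100):
        self.limit = limit
        self.count = 0

    def is_allowed(self):
        self.count += 1
        return self.count <= self.limit

server1, server2 = LocalCounter(limit=100), LocalCounter(limit=100)
# Load balancer splits 120 requests evenly: 60 to each server
results = [server1.is_allowed() for _ in range(60)] + \
          [server2.is_allowed() for _ in range(60)]
print(sum(results))  # 120: every request allowed, though the limit is 100
```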
Solution: Centralized Counter (Redis)
All servers share the same Redis instance for counters.
┌──────────┐     ┌──────────┐     ┌──────────┐
│ Server 1 │     │ Server 2 │     │ Server 3 │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
               ┌──────▼──────┐
               │    Redis    │
               │  (Counters) │
               └─────────────┘
Redis commands for atomic operations:
# Atomic increment with expiry
pipe = redis.pipeline()
pipe.incr(key)
pipe.expire(key, window_seconds)
count, _ = pipe.execute()
Handling Redis Latency
Adding Redis adds network latency. For high-traffic systems:
Option 1: Local cache + async sync
LOCAL_LIMIT = 10  # per-server allowance between syncs (illustrative value)

class RateLimiter:
    def __init__(self):
        self.local_counts = {}  # Local counter
        self.sync_interval = 1  # Sync every second

    def is_allowed(self, user_id):
        # Check local count first
        if self.local_counts.get(user_id, 0) >= LOCAL_LIMIT:
            return False
        # Increment locally
        self.local_counts[user_id] = self.local_counts.get(user_id, 0) + 1
        return True

    def sync_to_redis(self):
        # Background job: push local counts to Redis
        for user_id, count in self.local_counts.items():
            redis.incrby(f"rate:{user_id}", count)
        self.local_counts.clear()
Trade-off: Reduces Redis calls but allows slight over-limit.
Option 2: Rate limit at load balancer
Use sticky sessions so all requests from a user hit the same server. Rate limit locally.
Trade-off: Simpler, but less flexible.
Step 5: Handling Edge Cases
Strong candidates proactively address these:
1. What if Redis goes down?
Option A: Fail open (allow all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: allow request
        logger.warning("Redis unavailable, allowing request")
        return True

Option B: Fail closed (reject all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: reject request
        return False

Option C: Local fallback

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Fall back to local rate limiting
        return check_local_limit(user_id)
Best practice: Fail open with monitoring. Temporary over-allowance is usually better than service outage.
2. Race conditions
Two requests arrive simultaneously. Both check count (99), both increment (100, 101).
Solution: Use atomic Redis operations.
-- Lua script for atomic check-and-increment
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
3. Clock synchronization
Servers have slightly different clocks. Window boundaries differ.
Solution: Use Redis server time, not local time.
def get_current_window():
    # Use Redis TIME command for consistency
    redis_time = redis.time()
    return int(redis_time[0] / 60)  # Current minute
4. Thundering herd at window reset
All rate-limited users retry exactly when window resets.
Solution: Add jitter to retry-after headers.
import random

def get_retry_after(window_end):
    base_delay = window_end - time.time()
    jitter = random.uniform(0, 5)  # Random 0-5 seconds
    return base_delay + jitter
Step 6: API Design
Response Headers
Return rate limit info in headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067200
When exceeded:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30
Client-Friendly Design
Good rate limiters help clients:
- Clear error messages: "Rate limit exceeded. Try again in 30 seconds."
- Retry-After header: Tells clients exactly when to retry
- Remaining count: Lets clients pace themselves
- Consistent across endpoints: Predictable behavior
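On the client side, a well-behaved consumer reads these headers and backs off. A minimal sketch, where `send` is a placeholder for any HTTP client call returning `(status_code, headers)`:

```python
import random
import time

def request_with_backoff(send, max_attempts=3):
    """Call send() until it returns a non-429 status or attempts run out.

    send() must return (status_code, headers).
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Honor Retry-After, plus jitter to avoid synchronized retries
            delay = float(headers.get("Retry-After", 1)) + random.uniform(0, 1)
            time.sleep(delay)
    return status
```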
Step 7: Rate Limiting Strategies
By User
Most common. Each user has their own limit.
key = f"rate:user:{user_id}:{endpoint}"
By IP Address
For unauthenticated endpoints (login, signup).
key = f"rate:ip:{client_ip}:{endpoint}"
Caution: NAT and proxies can make many users share one IP.
By API Key
For B2B APIs. Each customer gets their own limit.
key = f"rate:api_key:{api_key}"
Tiered Limits
Different limits for different user tiers:
def get_limit(user):
    if user.plan == "enterprise":
        return 10000  # per minute
    elif user.plan == "pro":
        return 1000
    else:
        return 100
Per-Endpoint Limits
Different limits for different endpoints:
ENDPOINT_LIMITS = {
    "/api/search": 10,    # Expensive operation
    "/api/users": 100,    # Normal operation
    "/api/health": 1000,  # Cheap operation
}
Step 8: Monitoring and Observability
Metrics to Track
# Prometheus metrics example
from prometheus_client import Counter, Gauge

rate_limit_hits = Counter(
    "rate_limit_hits_total",
    "Number of requests that hit rate limits",
    ["user_tier", "endpoint"]
)

# Beware label cardinality: a per-user label can explode the metric count
rate_limit_remaining = Gauge(
    "rate_limit_remaining",
    "Remaining requests in current window",
    ["user_id"]
)
Alerts
Set up alerts for:
- High rate of 429 responses (might indicate attack or misconfiguration)
- Redis latency spikes
- Users consistently hitting limits (might need limit adjustment)
Dashboard
Visualize:
- 429 response rate over time
- Top rate-limited users
- Limit utilization by tier
- Redis operation latency
Implementation: Complete Token Bucket
Here's a production-ready implementation:
import time

import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second

    def is_allowed(self, key):
        now = time.time()

        # Lua script so check-and-consume happens atomically in Redis
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1])
        local last_refill = tonumber(bucket[2])

        if tokens == nil then
            -- Initialize bucket
            tokens = capacity - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return {1, tokens}
        end

        -- Refill tokens
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            return {1, tokens}
        end

        return {0, tokens}
        """

        result = self.redis.eval(
            lua_script,
            1,
            key,
            self.capacity,
            self.refill_rate,
            now,
        )
        allowed = result[0] == 1
        remaining = result[1]
        return allowed, remaining

# Usage (inside a request handler)
limiter = TokenBucketRateLimiter(
    redis_client=redis.Redis(),
    capacity=100,
    refill_rate=100 / 60  # 100 per minute
)

allowed, remaining = limiter.is_allowed(f"user:{user_id}")
if not allowed:
    return Response(status=429, headers={
        "X-RateLimit-Remaining": "0",
        "Retry-After": "1"
    })
Common Interview Follow-Ups
"How would you handle 10x traffic?"
- Scale Redis horizontally (Redis Cluster)
- Add local caching layer
- Consider eventual consistency (sync less frequently)
- Rate limit earlier in the stack (CDN/load balancer)
"What if you can't use Redis?"
Alternatives:
- In-memory with gossip protocol: Servers share state periodically
- Database: Slower, but works for low-traffic systems
- Cloud services: AWS API Gateway, Cloudflare rate limiting
"How do you test a rate limiter?"
- Unit tests for algorithm correctness
- Load tests for performance under stress
- Chaos tests for Redis failures
- Integration tests for header correctness
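Unit testing for algorithm correctness is straightforward once the clock is injectable. A pytest-style example; the `TokenBucket` here is a small in-memory version written for the test, not the Redis-backed implementation above:

```python
class TokenBucket:
    # Minimal in-memory token bucket with an injected clock, for testing
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = 0.0

    def is_allowed(self, now):
        # Refill for elapsed time, capped at capacity, then try to consume
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def test_burst_then_refill():
    bucket = TokenBucket(capacity=3, refill_rate=1.0)
    # Burst up to capacity is allowed
    assert all(bucket.is_allowed(now=0.0) for _ in range(3))
    # Bucket is empty: next request rejected
    assert not bucket.is_allowed(now=0.0)
    # After 2 seconds, 2 tokens have refilled
    assert bucket.is_allowed(now=2.0)
    assert bucket.is_allowed(now=2.0)
    assert not bucket.is_allowed(now=2.0)
```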
"Token bucket vs sliding window?"
- Token bucket: Better for allowing bursts while maintaining average rate
- Sliding window: Better for strict limits without burst allowance
Summary: The Winning Answer
| Time | What to Cover |
|---|---|
| 0-5 min | Clarify: what to limit, limits, distributed? |
| 5-15 min | Explain 2-3 algorithms, compare trade-offs |
| 15-25 min | High-level design with Redis |
| 25-35 min | Deep dive: distributed challenges, race conditions |
| 35-45 min | Edge cases: Redis failure, monitoring |
Key differentiators:
- Explain multiple algorithms, not just one
- Discuss distributed challenges proactively
- Address failure modes
- Show awareness of production concerns (monitoring, alerting)
Frequently Asked Questions
Which algorithm should I use in production?
Token bucket is the most common choice. It's used by AWS, Stripe, and most major APIs. It handles bursts gracefully while maintaining average rate limits.
Should rate limiting happen at the API gateway or application?
Both. API gateway provides first line of defense (DDoS protection). Application-level provides granular, user-specific limits.
How do I choose limit values?
Start conservative and adjust based on data. Monitor 95th percentile usage. Set limits slightly above normal usage but below abuse thresholds.
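Computing that 95th percentile from observed per-user usage takes one stdlib call; a sketch, assuming `usage` holds requests-per-minute samples collected from monitoring:

```python
from statistics import quantiles

def p95(samples):
    """95th percentile of per-minute usage samples."""
    return quantiles(samples, n=100, method="inclusive")[94]

usage = [12, 15, 9, 30, 22, 18, 25, 11, 40, 95]
print(p95(usage))  # set the limit somewhat above this value
```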
What about rate limiting in microservices?
Each service should have its own rate limiting. Use service mesh (Istio, Envoy) for consistent rate limiting across services.