system-design • rate-limiter • interview

Design a Rate Limiter: Complete System Design Guide

How to design a rate limiter for system design interviews. Covers token bucket, sliding window, distributed rate limiting, and edge cases.

14 min read • By SystemExperts

"Design a rate limiter."

This question appears in almost every system design interview because rate limiting is everywhere. Every major API you've used, from Stripe to Twitter to AWS, enforces rate limits. It's a fundamental building block that protects systems from abuse and ensures fair resource allocation.

What makes this question interesting is the depth. The basic concept is simple, but the implementation reveals your understanding of algorithms, distributed systems, and real-world trade-offs.


Why Rate Limiting Matters

Before diving into design, understand why companies rate limit:

1. Prevent resource exhaustion

Without limits, a single user (or bot) could consume all your server capacity. Rate limiting ensures no single client monopolizes resources.

2. Protect against attacks

DDoS attacks and brute-force attempts rely on high request volumes. Rate limiting is the first line of defense.

3. Control costs

Cloud services charge by usage. Runaway scripts or bugs can generate massive bills. Rate limiting provides a safety net.

4. Ensure fair usage

In multi-tenant systems, rate limiting prevents one tenant from degrading service for others.

5. Business model enforcement

SaaS products often tier features by usage limits. Rate limiting enforces these tiers.


Step 1: Clarify Requirements

Start by asking questions. Rate limiters can vary significantly based on requirements.

Questions to Ask

1. What are we rate limiting?

  • Requests per user?
  • Requests per IP address?
  • Requests per API key?
  • Requests per endpoint?

2. What are the limits?

  • 100 requests per minute?
  • 1000 requests per hour?
  • Multiple tiers (free vs. paid)?

3. What happens when limits are exceeded?

  • Return 429 (Too Many Requests)?
  • Queue and process later?
  • Degrade service quality?

4. Is this distributed or single-server?

  • Single application server?
  • Multiple servers behind a load balancer?
  • Multiple data centers?

5. What's the accuracy requirement?

  • Exactly 100 requests allowed?
  • Approximately 100 (95-105 acceptable)?

Typical Requirements

For this design, assume:

  • Rate limit by user ID and API endpoint
  • 100 requests per minute per user per endpoint
  • Return 429 with a Retry-After header when exceeded
  • Distributed across multiple servers
  • Soft limits acceptable (slight over-allowance is OK)
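
A quick back-of-envelope check shows why a single Redis node is a reasonable starting point. All the traffic numbers below are assumptions, purely for illustration:

# Hypothetical sizing: every number here is an assumption
active_users = 1_000_000
endpoints = 10
bytes_per_counter = 100   # key + value + Redis overhead, rough
windows_per_key = 2       # current + previous window

total_bytes = active_users * endpoints * windows_per_key * bytes_per_counter
print(total_bytes / 1e9)  # ~2.0 GB: fits comfortably on one Redis node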

Step 2: High-Level Design

Here's the architecture:

┌────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client   │────▶│  Load Balancer  │────▶│   API Server    │
└────────────┘     └─────────────────┘     └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │  Rate Limiter   │
                                           │   Middleware    │
                                           └────────┬────────┘
                                                    │
                                           ┌────────▼────────┐
                                           │     Redis       │
                                           │ (Rate Counters) │
                                           └─────────────────┘

Request Flow

  1. Client sends request to API
  2. Request hits rate limiter middleware
  3. Middleware checks Redis for current count
  4. If under limit: increment counter, allow request
  5. If over limit: return 429 with retry-after header
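
As a concrete anchor, the middleware can be a thin hook in front of every route. Here is a minimal Flask-style sketch; `check_rate_limit` is a hypothetical helper wrapping one of the algorithms below:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.before_request
def enforce_rate_limit():
    # check_rate_limit is hypothetical: it wraps one of the algorithms
    # below and returns (allowed, retry_after_seconds)
    allowed, retry_after = check_rate_limit(request)
    if not allowed:
        resp = jsonify(error="Rate limit exceeded")
        resp.status_code = 429
        resp.headers["Retry-After"] = str(retry_after)
        return resp  # returning a response here short-circuits the request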

Why Redis?

Rate limiting requires:

  • Fast reads/writes: checking limits shouldn't add noticeable latency
  • Atomic operations: multiple servers must coordinate on one counter
  • TTL support: counters must expire automatically

Redis provides all three with sub-millisecond latency.
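
For reference, the three requirements map directly onto redis-py calls. A quick sketch:

import redis

r = redis.Redis()
r.incr("rate:user42")        # atomic increment, safe across servers
r.expire("rate:user42", 60)  # counter disappears when the window ends
r.ttl("rate:user42")         # seconds left in the window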


Step 3: Rate Limiting Algorithms

This is where you differentiate yourself. There are four main algorithms, each with trade-offs.

Algorithm 1: Fixed Window Counter

The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals) and count requests in each.

Window: 10:00:00 - 10:00:59 → Count: 45
Window: 10:01:00 - 10:01:59 → Count: 78
Window: 10:02:00 - 10:02:59 → Count: 12

Implementation:

import time

import redis as redis_lib

# Shared client for the snippets below; decode_responses=True so hash
# values come back as str rather than bytes
redis = redis_lib.Redis(decode_responses=True)

def is_allowed(user_id, limit=100):
    # Get current minute as key
    current_window = int(time.time() / 60)
    key = f"rate:{user_id}:{current_window}"

    # Atomic increment
    count = redis.incr(key)

    # Set expiry on first request (note: if the process dies between
    # INCR and EXPIRE, the key never expires; the Lua script later in
    # this post closes that gap)
    if count == 1:
        redis.expire(key, 60)

    return count <= limit

Pros:

  • Simple to implement
  • Memory efficient (one counter per window)

Cons:

  • Boundary problem: User can send 100 requests at 10:00:59 and 100 more at 10:01:00 (200 in 2 seconds)

When to use: When simplicity matters more than precision.


Algorithm 2: Sliding Window Log

Track timestamps of all requests. Count requests within the sliding window.

Requests: [10:00:15, 10:00:23, 10:00:45, 10:01:02, 10:01:15]

At 10:01:30, window is 10:00:30 - 10:01:30
Requests in window: [10:00:45, 10:01:02, 10:01:15] = 3 requests

Implementation:

import uuid

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    key = f"rate:{user_id}"

    # Remove entries that fell out of the window
    window_start = now - window_seconds
    redis.zremrangebyscore(key, 0, window_start)

    # Count requests still in the window
    # (note: the check and the add below are separate commands; wrap
    # them in a pipeline or Lua script to stay atomic under concurrency)
    count = redis.zcard(key)

    if count < limit:
        # Record this request: unique member, timestamp as score
        redis.zadd(key, {str(uuid.uuid4()): now})
        redis.expire(key, window_seconds)
        return True

    return False

Pros:

  • Precise, no boundary problem
  • Smooth rate limiting

Cons:

  • Memory intensive (stores every request timestamp)
  • More Redis operations per request

When to use: When precision is critical and traffic is moderate.


Algorithm 3: Sliding Window Counter

Hybrid approach. Combines fixed windows with weighted counting.

Previous window (10:00-10:01): 84 requests
Current window (10:01-10:02): 36 requests

At 10:01:15 (25% into current window):
Weighted count = 84 × 0.75 + 36 × 1.0 = 63 + 36 = 99 requests

Implementation:

def is_allowed(user_id, limit=100, window_seconds=60):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1

    # Position in current window (0.0 to 1.0)
    window_position = (now % window_seconds) / window_seconds

    # Get counts
    current_key = f"rate:{user_id}:{current_window}"
    previous_key = f"rate:{user_id}:{previous_window}"

    current_count = int(redis.get(current_key) or 0)
    previous_count = int(redis.get(previous_key) or 0)

    # Weighted count
    weighted_count = previous_count * (1 - window_position) + current_count

    if weighted_count < limit:
        redis.incr(current_key)
        redis.expire(current_key, window_seconds * 2)
        return True

    return False

Pros:

  • Memory efficient (two counters)
  • Smooths out boundary problem
  • Good balance of precision and efficiency

Cons:

  • Approximation (not exact)
  • Slightly more complex than fixed window

When to use: Best general-purpose choice for most applications.


Algorithm 4: Token Bucket

Tokens are added to a bucket at a fixed rate. Each request consumes a token. If no tokens are available, the request is rejected.

Bucket capacity: 10 tokens
Refill rate: 1 token per second

Time 0: 10 tokens (full)
Request → 9 tokens
Request → 8 tokens
...
Time 5: 3 tokens + 5 new = 8 tokens
Request → 7 tokens

Implementation:

def is_allowed(user_id, capacity=10, refill_rate=1):
    key = f"bucket:{user_id}"
    now = time.time()

    # Get current state (read-modify-write is racy across servers;
    # the Lua version at the end of this post makes it atomic)
    bucket = redis.hgetall(key)

    if not bucket:
        # Initialize bucket
        tokens = capacity - 1  # Consume one for this request
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        redis.expire(key, 3600)
        return True

    # Calculate tokens to add
    last_refill = float(bucket["last_refill"])
    elapsed = now - last_refill
    tokens = float(bucket["tokens"])

    # Refill tokens
    tokens = min(capacity, tokens + elapsed * refill_rate)

    if tokens >= 1:
        # Consume token
        tokens -= 1
        redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
        return True

    return False

Pros:

  • Allows bursts (up to bucket capacity)
  • Smooth long-term rate limiting
  • Used by many real systems (AWS, Stripe)

Cons:

  • More complex state (tokens + timestamp)
  • Two parameters to tune (capacity, rate)

When to use: When you want to allow bursts while maintaining average rate.


Algorithm 5: Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests are dropped.

Queue capacity: 10
Processing rate: 1 request/second

Requests enter queue → processed in order → overflow dropped
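
A minimal in-process sketch of the "leaky bucket as meter" variant: it tracks the bucket's fill level rather than actually queueing requests, which keeps it comparable to the token bucket above. A true queueing variant would hold requests and release them at the leak rate.

import time

class LeakyBucket:
    def __init__(self, capacity=10, leak_rate=1.0):
        self.capacity = capacity    # max outstanding requests
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0.0
        self.last_check = time.time()

    def allow(self):
        now = time.time()
        # Drain at the fixed leak rate since the last check
        self.water = max(0.0, self.water - (now - self.last_check) * self.leak_rate)
        self.last_check = now
        if self.water < self.capacity:
            self.water += 1.0  # this request occupies one slot
            return True
        return False           # bucket full: drop the request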

Pros:

  • Smooth output rate (good for downstream systems)
  • Simple to understand

Cons:

  • Adds latency (requests wait in queue)
  • Doesn't handle bursts well

When to use: When you need to smooth traffic to downstream services.


Algorithm Comparison

Algorithm       | Memory | Precision | Burst Handling | Complexity
----------------|--------|-----------|----------------|-----------
Fixed Window    | Low    | Low       | Poor           | Simple
Sliding Log     | High   | High      | Good           | Medium
Sliding Counter | Low    | Medium    | Good           | Medium
Token Bucket    | Low    | High      | Excellent      | Medium
Leaky Bucket    | Medium | High      | Poor           | Medium

Recommendation: For interviews, explain Token Bucket or Sliding Window Counter. They show sophistication while being practical.


Step 4: Distributed Rate Limiting

The real challenge. With multiple API servers, how do you maintain consistent limits?

The Problem

Server 1: handled 60 of the user's requests → sees 60
Server 2: handled 60 of the user's requests → sees 60
Total: 120 requests, but each server thinks the user is under the 100 limit

Solution: Centralized Counter (Redis)

All servers share the same Redis instance for counters.

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Server 1 │     │ Server 2 │     │ Server 3 │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     └────────────────┼────────────────┘
                      │
               ┌──────▼──────┐
               │    Redis    │
               │  (Counters) │
               └─────────────┘

Redis commands for atomic operations:

# Atomic increment with expiry
pipe = redis.pipeline()
pipe.incr(key)
pipe.expire(key, window_seconds)
count, _ = pipe.execute()

Handling Redis Latency

Adding Redis adds network latency. For high-traffic systems:

Option 1: Local cache + async sync

# Per-server allowance between syncs: an assumed value to tune
# (e.g., the global limit divided by the number of servers)
LOCAL_LIMIT = 10

class RateLimiter:
    def __init__(self):
        self.local_counts = {}  # Local counter
        self.sync_interval = 1  # Sync every second

    def is_allowed(self, user_id):
        # Check local count first
        if self.local_counts.get(user_id, 0) >= LOCAL_LIMIT:
            return False

        # Increment locally
        self.local_counts[user_id] = self.local_counts.get(user_id, 0) + 1
        return True

    def sync_to_redis(self):
        # Background job: push local counts to Redis
        for user_id, count in self.local_counts.items():
            redis.incrby(f"rate:{user_id}", count)
        self.local_counts.clear()
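
One way to schedule the sync job is a simple daemon thread; a sketch under that assumption (in production you'd more likely use your framework's scheduler):

import threading
import time

def start_sync_loop(limiter, interval=1.0):
    # Flush local counts to Redis every `interval` seconds
    def loop():
        while True:
            time.sleep(interval)
            limiter.sync_to_redis()
    threading.Thread(target=loop, daemon=True).start()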

Trade-off: Reduces Redis calls but allows slight over-limit.

Option 2: Rate limit at load balancer

Use sticky sessions so all requests from a user hit the same server. Rate limit locally.

Trade-off: Simpler, but less flexible.


Step 5: Handling Edge Cases

Strong candidates proactively address these:

1. What if Redis goes down?

Option A: Fail open (allow all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: allow request
        logger.warning("Redis unavailable, allowing request")
        return True

Option B: Fail closed (reject all requests)

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Redis down: reject request
        return False

Option C: Local fallback

def is_allowed(user_id):
    try:
        return check_redis_limit(user_id)
    except RedisError:
        # Fall back to local rate limiting
        return check_local_limit(user_id)

Best practice: Fail open with monitoring. Temporary over-allowance is usually better than service outage.

2. Race conditions

Two requests arrive simultaneously. Both check count (99), both increment (100, 101).

Solution: Use atomic Redis operations.

-- Lua script for atomic check-and-increment
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
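
In redis-py, such a script can be registered once and then invoked atomically. A minimal sketch, assuming the script text above is stored in a string `lua` and `redis_client` is a redis-py client:

# Register once at startup; redis-py then invokes it via EVALSHA
check_and_incr = redis_client.register_script(lua)

def is_allowed(user_id, limit=100, window_seconds=60):
    key = f"rate:{user_id}:{int(time.time() / 60)}"
    count = check_and_incr(keys=[key], args=[window_seconds])
    return count <= limit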

3. Clock synchronization

Servers have slightly different clocks. Window boundaries differ.

Solution: Use Redis server time, not local time.

def get_current_window():
    # Use Redis TIME command for consistency
    redis_time = redis.time()
    return int(redis_time[0] / 60)  # Current minute

4. Thundering herd at window reset

All rate-limited users retry exactly when window resets.

Solution: Add jitter to retry-after headers.

def get_retry_after(window_end):
    base_delay = window_end - time.time()
    jitter = random.uniform(0, 5)  # Random 0-5 seconds
    return base_delay + jitter

Step 6: API Design

Response Headers

Return rate limit info in headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067200

When exceeded:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30

Client-Friendly Design

Good rate limiters help clients:

  1. Clear error messages: "Rate limit exceeded. Try again in 30 seconds."
  2. Retry-After header: Tells clients exactly when to retry
  3. Remaining count: Lets clients pace themselves
  4. Consistent across endpoints: Predictable behavior
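
A tiny helper keeps these headers consistent across endpoints. A sketch (names are illustrative):

def rate_limit_headers(limit, remaining, reset_epoch):
    # HTTP header values must be strings
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }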

Step 7: Rate Limiting Strategies

By User

Most common. Each user has their own limit.

key = f"rate:user:{user_id}:{endpoint}"

By IP Address

For unauthenticated endpoints (login, signup).

key = f"rate:ip:{client_ip}:{endpoint}"

Caution: NAT and proxies can make many users share one IP.
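
A related pitfall: behind a proxy or load balancer, the socket address you see is the proxy's, not the client's. A Flask-style sketch of extracting the real client IP (an assumption: your own edge sets X-Forwarded-For):

def client_ip(request):
    # Only trust X-Forwarded-For when your own proxy layer sets it;
    # clients can forge this header if it reaches you unfiltered
    forwarded = request.headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.remote_addr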

By API Key

For B2B APIs. Each customer gets their own limit.

key = f"rate:api_key:{api_key}"

Tiered Limits

Different limits for different user tiers:

def get_limit(user):
    if user.plan == "enterprise":
        return 10000  # per minute
    elif user.plan == "pro":
        return 1000
    else:
        return 100

Per-Endpoint Limits

Different limits for different endpoints:

ENDPOINT_LIMITS = {
    "/api/search": 10,      # Expensive operation
    "/api/users": 100,      # Normal operation
    "/api/health": 1000,    # Cheap operation
}
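
These strategies compose. One way to combine tiered and per-endpoint limits, reusing `get_limit` and `ENDPOINT_LIMITS` from above (the default of 100 is an assumption):

def limit_for(user, endpoint):
    # Apply the stricter of the user's tier limit and the endpoint limit
    return min(get_limit(user), ENDPOINT_LIMITS.get(endpoint, 100))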

Step 8: Monitoring and Observability

Metrics to Track

# Prometheus metrics example
from prometheus_client import Counter, Gauge

rate_limit_hits = Counter(
    "rate_limit_hits_total",
    "Number of requests that hit rate limits",
    ["user_tier", "endpoint"]
)

# Caution: per-user labels create high cardinality; prefer
# aggregating or sampling in large systems
rate_limit_remaining = Gauge(
    "rate_limit_remaining",
    "Remaining requests in current window",
    ["user_id"]
)

Alerts

Set up alerts for:

  • High rate of 429 responses (might indicate attack or misconfiguration)
  • Redis latency spikes
  • Users consistently hitting limits (might need limit adjustment)

Dashboard

Visualize:

  • 429 response rate over time
  • Top rate-limited users
  • Limit utilization by tier
  • Redis operation latency

Implementation: Complete Token Bucket

Here's a production-ready implementation:

import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second

    def is_allowed(self, key):
        now = time.time()

        # Lua script for atomic operation
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1])
        local last_refill = tonumber(bucket[2])

        if tokens == nil then
            -- Initialize bucket
            tokens = capacity - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return {1, tokens}
        end

        -- Refill tokens
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            return {1, tokens}
        end

        return {0, tokens}
        """

        result = self.redis.eval(
            lua_script,
            1,
            key,
            self.capacity,
            self.refill_rate,
            now
        )

        allowed = result[0] == 1
        remaining = result[1]

        return allowed, remaining

# Usage
limiter = TokenBucketRateLimiter(
    redis_client=redis.Redis(),
    capacity=100,
    refill_rate=100/60  # 100 per minute
)

allowed, remaining = limiter.is_allowed(f"user:{user_id}")
if not allowed:
    return Response(status=429, headers={
        "X-RateLimit-Remaining": "0",  # header values must be strings
        "Retry-After": "1"
    })

Common Interview Follow-Ups

"How would you handle 10x traffic?"

  1. Scale Redis horizontally (Redis Cluster); see the sharding sketch after this list
  2. Add local caching layer
  3. Consider eventual consistency (sync less frequently)
  4. Rate limit earlier in the stack (CDN/load balancer)
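
For point 1, the essence is routing each key to a stable shard so a user's counter lives on exactly one node. A minimal client-side sharding sketch with hypothetical node addresses (Redis Cluster does this routing for you):

import hashlib
import redis

# Hypothetical shard addresses
NODES = [redis.Redis(port=p) for p in (6379, 6380, 6381)]

def node_for(key):
    # hashlib gives a stable hash (Python's built-in hash() is
    # randomized per process), so a user always maps to the same node
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]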

"What if you can't use Redis?"

Alternatives:

  • In-memory with gossip protocol: Servers share state periodically
  • Database: Slower, but works for low-traffic systems
  • Cloud services: AWS API Gateway, Cloudflare rate limiting

"How do you test a rate limiter?"

  1. Unit tests for algorithm correctness (see the sketch after this list)
  2. Load tests for performance under stress
  3. Chaos tests for Redis failures
  4. Integration tests for header correctness
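
For point 1, the refill math is easy to test in isolation by injecting timestamps instead of calling time.time(). A minimal sketch using the numbers from the token bucket walkthrough above:

def refill(tokens, last_refill, now, capacity, refill_rate):
    # Same refill rule as the token bucket implementations above
    return min(capacity, tokens + (now - last_refill) * refill_rate)

def test_partial_refill():
    # Mirrors the walkthrough: 3 tokens + 5 seconds at 1 token/s = 8
    assert refill(3, 0, 5, capacity=10, refill_rate=1) == 8

def test_refill_caps_at_capacity():
    assert refill(3, 0, 100, capacity=10, refill_rate=1) == 10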

"Token bucket vs sliding window?"

  • Token bucket: Better for allowing bursts while maintaining average rate
  • Sliding window: Better for strict limits without burst allowance

Summary: The Winning Answer

Time      | What to Cover
----------|--------------------------------------------------
0-5 min   | Clarify: what to limit, the limits, distributed or not
5-15 min  | Explain 2-3 algorithms, compare trade-offs
15-25 min | High-level design with Redis
25-35 min | Deep dive: distributed challenges, race conditions
35-45 min | Edge cases: Redis failure, monitoring

Key differentiators:

  • Explain multiple algorithms, not just one
  • Discuss distributed challenges proactively
  • Address failure modes
  • Show awareness of production concerns (monitoring, alerting)

Frequently Asked Questions

Which algorithm should I use in production?

Token bucket is the most common choice. It's used by AWS, Stripe, and most major APIs. It handles bursts gracefully while maintaining average rate limits.

Should rate limiting happen at the API gateway or application?

Both. The API gateway provides the first line of defense (DDoS protection), while application-level limiting provides granular, user-specific limits.

How do I choose limit values?

Start conservative and adjust based on data. Monitor 95th percentile usage. Set limits slightly above normal usage but below abuse thresholds.

What about rate limiting in microservices?

Each service should have its own rate limiting. Use service mesh (Istio, Envoy) for consistent rate limiting across services.
