System Design Masterclass


Tags: Payments, authentication, oauth, jwt, sessions, security, intermediate

Design Authentication and User Login

Design a secure authentication system handling billions of logins per day like Google or Facebook

Billions of logins/day, sub-100ms latency, 99.99% availability | Similar to Google, Facebook, Auth0, Okta, AWS Cognito | 45 min read

Summary

Design an authentication system that handles billions of logins per day with high security and low latency. The core challenges are secure credential storage, session management at scale, preventing credential stuffing attacks, and supporting multiple authentication methods (password, OAuth, MFA). We solve this with distributed session stores, JWT tokens for stateless auth, rate limiting, and defense in depth.

Key Takeaways

Core Problem

This is fundamentally a trust establishment problem - securely verifying that a user is who they claim to be, then maintaining that trust through a session, all while preventing abuse at scale.

The Hard Part

Balancing security with usability at scale. Strong security (long passwords, frequent MFA) hurts UX. Weak security leads to account takeovers. Finding the right balance while handling billions of requests is the challenge.

Scaling Axis

Scale by partitioning sessions by user ID across a distributed cache (Redis Cluster). Stateless JWT tokens can eliminate session lookups entirely for read-heavy workloads.

The Question: Design an authentication system for a large-scale web application that handles billions of logins per day.

Authentication is the gateway to every user interaction:
- Identity verification: Proving users are who they claim to be
- Session management: Maintaining authenticated state across requests
- Security: Preventing unauthorized access and account takeovers
- Compliance: Meeting regulatory requirements (GDPR, SOC2)

What to say first

Before I design, let me clarify the authentication methods we need to support, the scale requirements, and the security constraints. Authentication has many variations and the design depends heavily on these factors.

Hidden requirements interviewers are testing:
- Do you understand password hashing (bcrypt, Argon2) and why it matters?
- Can you design for both security AND scale (these often conflict)?
- Do you know the difference between authentication and authorization?
- Can you handle session management in a distributed system?
- Do you understand OAuth flows and when to use them?

Critical Invariant

Never store passwords in plaintext. Never leak timing information during authentication. Never allow session fixation or hijacking.

Performance Requirement

Login must complete in under 500ms p99. Session validation must be under 10ms. These requirements shape whether we use stateful sessions or stateless tokens.

Key Tradeoff

Stateful sessions (Redis) give instant revocation but require distributed state. Stateless tokens (JWT) need no shared session store, so validation scales horizontally, but a token cannot be revoked before it expires unless you add a server-side denylist.

Design Walkthrough

Problem Statement

Clarifying Questions

Ask these questions to demonstrate security awareness and systems thinking.

Question 1: Authentication Methods

What authentication methods do we need to support? Username/password only, or also OAuth (Google, Facebook), SSO, passwordless, MFA?

Why this matters: Each method has different security properties and implementation complexity.
Typical answer: Password + OAuth + optional MFA for sensitive actions.
Architecture impact: Need to support multiple identity providers; credential storage varies by method.

Question 2: Scale and Geography

What is the login volume? Are users global? Do we need to handle login from multiple devices simultaneously?

Why this matters: Determines session storage strategy and consistency requirements.
Typical answer: Billions of logins/day, global users, multiple device support.
Architecture impact: Need distributed session store; consider regional deployments.

Question 3: Security Requirements

What is the sensitivity of the data? Do we need to detect suspicious logins? Is there regulatory compliance (SOC2, GDPR)?

Why this matters: High-security applications need additional layers (MFA, anomaly detection).
Typical answer: Moderate sensitivity, need to detect credential stuffing, GDPR compliant.
Architecture impact: Need rate limiting, anomaly detection, audit logging.

Question 4: Session Requirements

How long should sessions last? Do we need instant session revocation (e.g., for logout from all devices)?

Why this matters: Instant revocation requires stateful sessions. Long sessions need refresh mechanisms.
Typical answer: Sessions last 30 days with activity, need logout-all-devices feature.
Architecture impact: Stateful sessions required for instant revocation, or short-lived tokens with refresh.

Stating assumptions

Based on this, I will assume: Password + OAuth + optional MFA, billions of logins/day globally, moderate security with credential stuffing protection, 30-day sessions with instant revocation capability.

The Hard Part

Say this out loud

The hard part here is balancing security with performance at scale. Secure password hashing is intentionally slow (to prevent brute force), but we need sub-second login times for billions of users.

Why this is genuinely hard:

1. Security vs Performance: Password hashing (bcrypt) is designed to be slow (100-500ms). At billions of logins/day, this is a massive compute cost.
2. Distributed Sessions: If a user logs in on Server A and their next request hits Server B, how does B know they are authenticated?
3. Credential Stuffing: Attackers use leaked password databases to try username/password combinations at scale. How do you detect and block this without affecting legitimate users?
4. Session Security: Sessions can be hijacked (stolen cookies), fixated (attacker sets session ID), or replayed. Each attack vector needs defense.

Common mistakes

1) Storing passwords with fast hashes (MD5, SHA256) - these are NOT secure for passwords.
2) Using sequential session IDs - these are guessable.
3) Not implementing rate limiting - allows brute force attacks.

The fundamental tensions:

- Security vs UX: More security (longer passwords, frequent MFA) = worse UX
- Stateful vs Stateless: Stateful sessions = instant revocation but harder to scale. Stateless (JWT) = infinite scale but delayed revocation
- Consistency vs Availability: Strong session consistency can cause login failures during partitions

Scale and Access Patterns

Let me estimate the scale and understand access patterns.

Dimension	Value	Impact
Daily Logins	1 billion	~12K logins/second average, 50K+ peak
Active Sessions	500 million	Distributed session storage required

+ 4 more rows...

What to say

The critical insight is that login (write-heavy, slow due to hashing) and session validation (read-heavy, must be fast) have completely different characteristics. We should optimize them separately.

Access Pattern Analysis:

- Login: Write-heavy, computationally expensive, relatively rare per user
- Session Validation: Read-heavy (100:1 ratio to login), must be sub-10ms
- Password Changes: Very rare, can be slow
- Session Revocation: Rare but must be fast when it happens
- Geographic Distribution: Users log in from specific regions, but their sessions may be accessed globally

Password hashing compute:
- 500M password logins/day
- 200ms average hash time

+ 14 more lines...
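As a rough worked version of this estimate, the arithmetic can be sketched as below. The login volume and hash time come from the figures above; the 4x peak-to-average ratio is an assumption, chosen to match the 12K average and 50K+ peak in the scale table.

```python
# Back-of-envelope sizing for password hashing.
PASSWORD_LOGINS_PER_DAY = 500_000_000   # from the estimate above
HASH_SECONDS = 0.200                    # average time for one hash
SECONDS_PER_DAY = 86_400
PEAK_TO_AVERAGE = 4                     # assumption: peak is roughly 4x average

cpu_seconds_per_day = PASSWORD_LOGINS_PER_DAY * HASH_SECONDS
average_cores = cpu_seconds_per_day / SECONDS_PER_DAY
peak_cores = average_cores * PEAK_TO_AVERAGE

print(f"CPU-seconds per day: {cpu_seconds_per_day:,.0f}")
print(f"Cores busy on average: {average_cores:,.0f}")
print(f"Cores needed at peak: {peak_cores:,.0f}")
```

Under these assumptions hashing alone keeps on the order of a thousand cores busy around the clock, which is why the login path is worth isolating on its own fleet.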

High-Level Architecture

Let me design the system with separate paths for authentication and session validation.

What to say

I will separate the authentication flow (login) from session validation. Login is write-heavy and slow; session validation is read-heavy and must be fast. Different optimization strategies for each.

Authentication System Architecture

Component Responsibilities:

API Gateway: Routes requests, extracts session tokens, rate limiting

Authentication Service: Handles login flows
- Password verification (bcrypt/Argon2)
- OAuth token exchange
- MFA verification
- Creates sessions on success

Session Service: Lightweight session validation
- Validates session tokens
- Returns user context
- Handles session refresh

User Database: Stores credentials
- Sharded by user_id
- Password hashes, OAuth links, MFA secrets

Session Store (Redis Cluster): Active sessions
- Sharded by session_id
- Fast reads for validation
- TTL for automatic expiry
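To make the sharding concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, modulo 16384). The function name `key_slot` and the example key layouts are hypothetical. Note that the literal braces in the example keys are Redis hash tags, which is a different use of braces from the `{session_id}` placeholder notation elsewhere in this article.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384  # Redis Cluster divides the key space into 16384 slots


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16 mod 16384)."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty {...} section, only that
    # part is hashed, so related keys can be forced onto the same shard.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc_hqx(data, 0) % HASH_SLOTS


# Plain session keys spread across slots by session ID.
print(key_slot("session:abc123"), key_slot("session:def456"))

# Hypothetical layout: tagging by user keeps all of a user's sessions on one
# shard, which makes "logout from all devices" a single-shard operation.
print(key_slot("session:{user-42}:abc123") == key_slot("session:{user-42}:def456"))
```

Tagging by user trades some balance for locality: a very active user concentrates load on one shard, so whether it is worth it depends on how often logout-all is needed.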

Real-world reference

Google uses a similar architecture with separate auth and session services. Auth0 separates identity verification from session management. Facebook uses a hybrid approach with short-lived access tokens and longer refresh tokens.

Data Model and Storage

Let me define the data models for users, credentials, and sessions.

What to say

The user table stores identity, the credentials table stores authentication methods (separated for security), and sessions are in Redis for fast access.

-- Users table (sharded by user_id)
CREATE TABLE users (
    user_id         UUID PRIMARY KEY,

+ 40 more lines...

Key: session:{session_id}
Value: JSON object
TTL: 30 days (refreshed on activity)

+ 19 more lines...
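The key layout above can be exercised with a small in-memory stand-in for Redis. This is only a sketch: the class name, the record fields (`user_id`, `device`, `created_at`), and the explicit `now` parameter are assumptions made for illustration, not the article's actual schema.

```python
import json
import secrets
import time
from typing import Optional


class InMemorySessionStore:
    """Stand-in for Redis: maps key -> (expiry timestamp, JSON value)."""

    def __init__(self, ttl_seconds: int = 30 * 24 * 3600):
        self.ttl = ttl_seconds
        self._data = {}

    def create(self, user_id: str, device: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # 32 random bytes from the OS CSPRNG: not sequential, not guessable.
        session_id = secrets.token_urlsafe(32)
        record = {"user_id": user_id, "device": device, "created_at": now}
        self._data[f"session:{session_id}"] = (now + self.ttl, json.dumps(record))
        return session_id

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        key = f"session:{session_id}"
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._data[key]          # expired: behave like a TTL eviction
            return None
        # Sliding expiry: activity pushes the TTL out again.
        self._data[key] = (now + self.ttl, value)
        return json.loads(value)

    def revoke(self, session_id: str) -> None:
        self._data.pop(f"session:{session_id}", None)
```

With a real Redis deployment the expiry bookkeeping would be delegated to the server's own TTL handling rather than checked in application code.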

Why separate credentials table?

1. Security: Credentials can have different access controls than user data
2. Flexibility: Easy to add new auth methods (passkeys, WebAuthn)
3. Audit: Separate audit trail for credential changes
4. Multi-method: User can have password + OAuth + passkey simultaneously

Critical security detail

Password hashes must use bcrypt (cost factor 12+) or Argon2id. Never use MD5, SHA1, or even SHA256 for passwords. These are too fast and allow brute force attacks.

Authentication Flow Deep Dive

Let me walk through the login flow in detail.

Password Login Flow

async def login(email: str, password: str, request: Request) -> LoginResponse:
    # Step 1: Rate limiting
    if not await rate_limiter.check(f"login:{email}", limit=5, window=300):

+ 40 more lines...

Critical Security Details:

1. Timing Attack Prevention: Always perform password hash even for non-existent users. Otherwise, attackers can enumerate valid emails by measuring response time.
2. Rate Limiting: Per-email AND per-IP. Prevents brute force and credential stuffing.
3. Fraud Detection: Analyze login context (IP, device, time, location) to detect anomalies.
4. Audit Logging: Log all authentication events for security review and compliance.
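The timing point in item 1 can be illustrated with a minimal sketch. It uses `hashlib.scrypt` from the standard library purely as a stand-in so the example runs without extra packages; the article's recommendation of bcrypt or Argon2id still applies. The `USERS` table, `DUMMY_RECORD`, and function names are hypothetical.

```python
import hashlib
import hmac
import os

# Stand-in parameters; production code should use bcrypt or Argon2id as stated above.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1}


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)


def make_record(password: str) -> dict:
    salt = os.urandom(16)
    return {"salt": salt, "hash": hash_password(password, salt)}


# Dummy record so unknown emails cost the same hashing work as known ones.
DUMMY_RECORD = make_record("not-a-real-password")

# Hypothetical in-memory user table for the sketch.
USERS = {"alice@example.com": make_record("correct horse battery staple")}


def verify_login(email: str, password: str) -> bool:
    record = USERS.get(email)
    target = record if record is not None else DUMMY_RECORD
    # The expensive hash runs on every attempt, whether or not the user exists.
    candidate = hash_password(password, target["salt"])
    # compare_digest avoids leaking how many leading bytes matched.
    matches = hmac.compare_digest(candidate, target["hash"])
    # Same generic result for "no such user" and "wrong password".
    return record is not None and matches
```

The caller should also return one generic error message for both failure cases, so neither the response body nor the response time reveals whether the email is registered.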

OAuth Flow

For OAuth (Google, Facebook), we redirect to the provider, receive an authorization code, exchange it for tokens, then fetch user info. The flow is stateless - we use a PKCE code verifier stored in a short-lived cookie.
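For the PKCE part of this flow, the verifier and challenge can be derived as in the sketch below, following the S256 method from RFC 7636. The function names are hypothetical, and `provider_accepts` only imitates the check the identity provider performs on its side.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    # 32 random bytes -> 43-character URL-safe string (RFC 7636 allows 43-128).
    return secrets.token_urlsafe(32)


def make_code_challenge(verifier: str) -> str:
    # S256 method: BASE64URL(SHA256(verifier)) with the "=" padding removed.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def provider_accepts(verifier_from_client: str, challenge_from_redirect: str) -> bool:
    # What the provider does at token exchange: recompute and compare.
    return secrets.compare_digest(
        make_code_challenge(verifier_from_client), challenge_from_redirect
    )


verifier = make_code_verifier()              # kept in the short-lived cookie
challenge = make_code_challenge(verifier)    # sent in the authorization redirect
print(provider_accepts(verifier, challenge))
```

Because only the challenge travels through the browser redirect, an attacker who intercepts the authorization code cannot redeem it without the verifier.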

Session Management

What to say

I will use a hybrid approach: short-lived access tokens (15 min) for API calls, and longer refresh tokens (30 days) stored server-side. This gives us fast validation AND instant revocation.

Session Token Types:

1. Access Token: Short-lived (15 min), can be JWT for stateless validation
2. Refresh Token: Long-lived (30 days), stored in Redis, used to get new access tokens
3. Session ID: Ties access and refresh tokens together, used for logout-all

async def validate_session(access_token: str) -> User:
    # Option 1: Stateless JWT validation (fastest, no Redis call)
    if is_jwt(access_token):

+ 20 more lines...
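To show what stateless validation involves, here is a self-contained sketch of HS256 signing and verification using only the standard library. It is an illustration under assumptions, not the article's implementation: the function names are hypothetical, a shared HMAC secret is assumed, and a production system would normally use a vetted JWT library, often with asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_access_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def validate_access_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the token is authentic and unexpired, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("header and payload must be JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm: never let the token's header choose it.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token expired")
    return claims
```

Validation touches no shared state, which is what makes it fast, and also why a token stays valid until `exp` even after logout unless a revocation check is added.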

Token Refresh Flow:

async def refresh_tokens(refresh_token: str) -> TokenPair:
    # Validate refresh token in Redis
    session = await redis.get(f"refresh:{refresh_token}")

+ 25 more lines...

Refresh token rotation

Always rotate refresh tokens on use. If an attacker steals a refresh token and uses it first, the legitimate user's next refresh fails because the token has already been rotated, which surfaces the compromise.
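Rotation can be sketched with an in-memory stand-in for the Redis-backed store. The class and method names are hypothetical, and revoking the whole session when a rotated token is replayed is a common extension that goes one step beyond what the text above describes.

```python
import secrets


class RefreshTokenStore:
    """In-memory stand-in for the Redis-backed refresh token store."""

    def __init__(self):
        self._active = {}   # token -> session_id (currently valid tokens)
        self._used = {}     # token -> session_id (already rotated tokens)

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = session_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; raise PermissionError on failure."""
        session_id = self._active.pop(token, None)
        if session_id is not None:
            self._used[token] = session_id
            return self.issue(session_id)
        if token in self._used:
            # A rotated token was replayed: either the attacker or the
            # legitimate user holds a stale copy, so revoke the whole session.
            self.revoke_session(self._used[token])
            raise PermissionError("refresh token reuse detected; session revoked")
        raise PermissionError("unknown refresh token")

    def revoke_session(self, session_id: str) -> None:
        self._active = {t: s for t, s in self._active.items() if s != session_id}
```

After a replay is detected, both parties must log in again, which is the point: the server cannot tell which of them is the attacker.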

Security Defenses

System Invariants

1) Never store passwords in plaintext or reversible encryption.
2) Never leak whether an email exists via timing or error messages.
3) Never allow session fixation (attacker setting session ID).
4) Always use HTTPS and secure cookie flags.

Defense in Depth:

Attack	Defense	Implementation
Brute Force	Rate limiting + account lockout	5 attempts per 5 min, exponential backoff
Credential Stuffing	Rate limit + device fingerprint + CAPTCHA	Detect high-velocity attempts from same IP range

+ 4 more rows...
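The "5 attempts per 5 min" rule in the table can be sketched as a sliding-window limiter. This is a single-process illustration with hypothetical names; the per-IP limit of 100 is an assumption, and a real deployment would keep the counters in a shared store so every gateway instance sees them.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    def __init__(self):
        self._hits = defaultdict(deque)   # key -> timestamps of allowed attempts

    def check(self, key: str, limit: int, window: float,
              now: Optional[float] = None) -> bool:
        """Return True and record the attempt if the key is under its limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        while hits and hits[0] <= now - window:
            hits.popleft()                # drop attempts that left the window
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True


def allow_login_attempt(limiter: SlidingWindowLimiter, email: str, ip: str,
                        now: Optional[float] = None) -> bool:
    # Evaluate both keys every time so each counter sees the attempt.
    by_email = limiter.check(f"login:email:{email}", limit=5, window=300, now=now)
    by_ip = limiter.check(f"login:ip:{ip}", limit=100, window=300, now=now)
    return by_email and by_ip
```

Keying on email stops one account being hammered from many addresses, while keying on IP stops one address walking through many accounts.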

async def detect_credential_stuffing(request: Request) -> bool:
    ip = request.client_ip
    ip_range = get_ip_range(ip)  # /24 subnet

+ 22 more lines...
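One signal such a detector can use is the number of distinct accounts that fail from the same /24 subnet in a short window. The sketch below shows that idea only; the thresholds, the `StuffingDetector` class, and its in-memory state are assumptions, and it handles IPv4 addresses only.

```python
import ipaddress
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300          # assumption: 5-minute observation window
MAX_DISTINCT_ACCOUNTS = 20    # assumption: threshold per /24 subnet


def get_ip_range(ip: str) -> str:
    """Collapse an IPv4 address to its /24 subnet, e.g. 203.0.113.7 -> 203.0.113.0/24."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


class StuffingDetector:
    def __init__(self):
        # subnet -> (timestamp, email) for each failed login
        self._failures = defaultdict(deque)

    def record_failure(self, ip: str, email: str, now=None) -> None:
        now = time.time() if now is None else now
        self._failures[get_ip_range(ip)].append((now, email))

    def is_suspicious(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        events = self._failures[get_ip_range(ip)]
        while events and events[0][0] <= now - WINDOW_SECONDS:
            events.popleft()
        # Many different accounts failing from one subnet is the signature of
        # credential stuffing; one user mistyping hits only a single account.
        return len({email for _, email in events}) > MAX_DISTINCT_ACCOUNTS
```

A positive result would usually trigger a step-up such as a CAPTCHA rather than a hard block, since shared networks can put many legitimate users behind one subnet.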

Password Security:

import argon2
from argon2 import PasswordHasher

+ 34 more lines...

Failure Modes and Resilience

Proactively discuss failures

Let me walk through failure scenarios. Authentication is critical - we need graceful degradation, not complete outage.

Failure	Impact	Mitigation	Why It Works
Redis down	Cannot validate sessions	JWT fallback + local cache	Stateless tokens work without Redis
User DB down	Cannot login	Cached credentials for repeat users	Allow recent users to re-authenticate

+ 4 more rows...

async def validate_session_resilient(token: str) -> User:
    # Try Redis first
    try:

+ 21 more lines...
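The fallback idea can be sketched as follows. This version is synchronous and takes the two lookups as parameters so it stays self-contained; `SessionStoreUnavailable`, `lookup_session`, and `verify_signed_token` are hypothetical names standing in for the real Redis client and JWT validator.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("auth")


class SessionStoreUnavailable(Exception):
    """Raised by the (hypothetical) session store client when Redis is unreachable."""


def validate_session_resilient(
    token: str,
    lookup_session: Callable[[str], Optional[dict]],
    verify_signed_token: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return user context for a valid token, or None if it should be rejected."""
    try:
        # Normal path: the store is authoritative, so revocations apply immediately.
        return lookup_session(token)
    except SessionStoreUnavailable:
        # Degraded path: trust the token's signature and expiry only.
        # Revoked-but-unexpired tokens are accepted until the store recovers,
        # which is why access tokens are kept short-lived.
        claims = verify_signed_token(token)
        if claims is not None:
            logger.warning("degraded session validation for user %s", claims.get("sub"))
        return claims
```

Only an unreachable store triggers the fallback. A store that answers "no such session" is a definitive rejection and must not fall through to the stateless path.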

Graceful degradation principle

For authentication, availability often beats perfect security. A user locked out of their account is worse than a slight security risk during an outage. Log degraded operations for post-incident review.

Evolution and Scaling

What to say

This design handles billions of logins. Let me discuss how it evolves for additional requirements: passwordless, passkeys, and zero-trust architecture.

Evolution Path:

Stage 1: Basic Auth (current design)
- Password + OAuth
- Redis sessions
- MFA optional

Stage 2: Passwordless
- Magic links via email
- WebAuthn/Passkeys
- Biometric authentication

Stage 3: Zero Trust
- Continuous authentication
- Risk-based access
- Device trust scoring

Scaling Considerations:

Scale Challenge	Solution	Tradeoff
Password hashing CPU	Dedicated auth servers + async queuing	Added latency for login
Session storage	Redis Cluster + regional sharding	Complexity, eventual consistency
Global users	Regional auth services + session replication	Cost, consistency delay
Peak load (Super Bowl)	Pre-warm sessions + queue overflow	Degraded UX during peak

Passkeys (WebAuthn) eliminate passwords entirely:

1. Registration:

+ 15 more lines...

Alternative approach

If we needed even lower latency, we could use edge authentication - validate JWTs at CDN edge nodes using shared public keys. This gives sub-millisecond validation globally, but requires careful key rotation.

What I would do differently for...

Banking app: Stronger MFA requirements, session timeout after 5 minutes of inactivity, transaction signing for sensitive operations.

Social network: Longer sessions, risk-based authentication, focus on account recovery flows.

Enterprise SSO: SAML/OIDC federation, just-in-time provisioning, group-based access control.

Design Trade-offs

Stateful sessions (server-side session store)

Advantages

+ Instant revocation
+ Server controls session
+ Can store arbitrary data

Disadvantages

- Requires distributed cache
- Network call per validation
- Session store can become a single point of failure

When to use

When instant logout and session control are required
