Design Walkthrough
Problem Statement
The Question: Design an authentication system for a large-scale web application that handles billions of logins per day.
Authentication is the gateway to every user interaction:
- Identity verification: Proving users are who they claim to be
- Session management: Maintaining authenticated state across requests
- Security: Preventing unauthorized access and account takeovers
- Compliance: Meeting regulatory requirements (GDPR, SOC2)
What to say first
Before I design, let me clarify the authentication methods we need to support, the scale requirements, and the security constraints. Authentication has many variations and the design depends heavily on these factors.
Hidden requirements interviewers are testing:
- Do you understand password hashing (bcrypt, Argon2) and why it matters?
- Can you design for both security AND scale (these often conflict)?
- Do you know the difference between authentication and authorization?
- Can you handle session management in a distributed system?
- Do you understand OAuth flows and when to use them?
Clarifying Questions
Ask these questions to demonstrate security awareness and systems thinking.
Question 1: Authentication Methods
What authentication methods do we need to support? Username/password only, or also OAuth (Google, Facebook), SSO, passwordless, MFA?
Why this matters: Each method has different security properties and implementation complexity.
Typical answer: Password + OAuth + optional MFA for sensitive actions.
Architecture impact: Need to support multiple identity providers; credential storage varies by method.
Question 2: Scale and Geography
What is the login volume? Are users global? Do we need to handle login from multiple devices simultaneously?
Why this matters: Determines session storage strategy and consistency requirements.
Typical answer: Billions of logins/day, global users, multiple device support.
Architecture impact: Need a distributed session store; consider regional deployments.
Question 3: Security Requirements
What is the sensitivity of the data? Do we need to detect suspicious logins? Is there regulatory compliance (SOC2, GDPR)?
Why this matters: High-security applications need additional layers (MFA, anomaly detection).
Typical answer: Moderate sensitivity, need to detect credential stuffing, GDPR compliant.
Architecture impact: Need rate limiting, anomaly detection, audit logging.
Question 4: Session Requirements
How long should sessions last? Do we need instant session revocation (e.g., for logout from all devices)?
Why this matters: Instant revocation requires stateful sessions. Long sessions need refresh mechanisms.
Typical answer: Sessions last 30 days with activity, need logout-all-devices feature.
Architecture impact: Stateful sessions required for instant revocation, or short-lived tokens with refresh.
Stating assumptions
Based on this, I will assume: Password + OAuth + optional MFA, billions of logins/day globally, moderate security with credential stuffing protection, 30-day sessions with instant revocation capability.
The Hard Part
Say this out loud
The hard part here is balancing security with performance at scale. Secure password hashing is intentionally slow (to prevent brute force), but we need sub-second login times for billions of users.
Why this is genuinely hard:
1. Security vs Performance: Password hashing (bcrypt) is designed to be slow (100-500ms). At billions of logins/day, this is a massive compute cost.
2. Distributed Sessions: If a user logs in on Server A and their next request hits Server B, how does B know they are authenticated?
3. Credential Stuffing: Attackers use leaked password databases to try username/password combinations at scale. How do you detect and block this without affecting legitimate users?
4. Session Security: Sessions can be hijacked (stolen cookies), fixated (attacker sets session ID), or replayed. Each attack vector needs defense.
Common mistakes
1) Storing passwords with fast hashes (MD5, SHA256) - these are NOT secure for passwords.
2) Using sequential session IDs - these are guessable.
3) Not implementing rate limiting - this allows brute-force attacks.
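The second mistake has a simple fix: draw session IDs from a cryptographically secure random source. A minimal sketch follows; the function name and the 32-byte default are assumptions for illustration, not part of the original design.

```python
import secrets


def new_session_id(num_bytes: int = 32) -> str:
    """Return a URL-safe session ID backed by num_bytes of CSPRNG output."""
    # secrets draws from the OS random source, so IDs are not guessable
    # the way sequential counters or timestamps are.
    return secrets.token_urlsafe(num_bytes)


sid = new_session_id()
print(len(sid))
```

With 32 random bytes an attacker has no practical way to enumerate valid IDs, unlike a sequential counter.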
The fundamental tensions:
- Security vs UX: More security (longer passwords, frequent MFA) = worse UX
- Stateful vs Stateless: Stateful sessions = instant revocation but harder to scale. Stateless (JWT) = easy horizontal scaling but delayed revocation
- Consistency vs Availability: Strong session consistency can cause login failures during partitions
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Daily Logins | 1 billion | ~12K logins/second average, 50K+ peak |
| Active Sessions | 500 million | Distributed session storage required |
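The figures in this table can be sanity-checked with a few lines of arithmetic. The sketch below uses the 1 billion logins/day and 50K peak from the table, plus the 500M password logins/day and 200ms hash time that this walkthrough uses for its hashing estimate. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(events_per_day: float) -> float:
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


def hashing_cores(password_logins_per_day: float, hash_seconds: float) -> float:
    """CPU cores kept busy, on average, by password hashing alone."""
    cpu_seconds_per_day = password_logins_per_day * hash_seconds
    return cpu_seconds_per_day / SECONDS_PER_DAY


avg_logins = average_per_second(1_000_000_000)   # 1 billion logins/day
peak_ratio = 50_000 / avg_logins                 # quoted peak vs. average
cores = hashing_cores(500_000_000, 0.200)        # 500M password logins, 200 ms each

print(f"average logins/s: {avg_logins:,.0f}")
print(f"peak-to-average ratio: {peak_ratio:.1f}x")
print(f"average cores busy hashing: {cores:,.0f}")
```

The average comes out near 11.6K logins/second, and hashing alone keeps roughly 1,160 cores busy on average, before any headroom for peaks.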
What to say
The critical insight is that login (write-heavy, slow due to hashing) and session validation (read-heavy, must be fast) have completely different characteristics. We should optimize them separately.
Access Pattern Analysis:
- Login: Write-heavy, computationally expensive, relatively rare per user
- Session Validation: Read-heavy (100:1 ratio to login), must be sub-10ms
- Password Changes: Very rare, can be slow
- Session Revocation: Rare but must be fast when it happens
- Geographic Distribution: Users log in from specific regions, sessions accessed globally
Password hashing compute:
- 500M password logins/day
- 200ms average hash time
High-Level Architecture
Let me design the system with separate paths for authentication and session validation.
What to say
I will separate the authentication flow (login) from session validation. Login is write-heavy and slow; session validation is read-heavy and must be fast. Different optimization strategies for each.
Authentication System Architecture
Component Responsibilities:
API Gateway: Routes requests, extracts session tokens, rate limiting
Authentication Service: Handles login flows
- Password verification (bcrypt/Argon2)
- OAuth token exchange
- MFA verification
- Creates sessions on success
Session Service: Lightweight session validation
- Validates session tokens
- Returns user context
- Handles session refresh
User Database: Stores credentials
- Sharded by user_id
- Password hashes, OAuth links, MFA secrets
Session Store (Redis Cluster): Active sessions
- Sharded by session_id
- Fast reads for validation
- TTL for automatic expiry
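To make "sharded by session_id" concrete: Redis Cluster assigns each key to one of 16384 hash slots using a CRC16 of the key. The sketch below approximates that mapping with the standard library. It assumes `binascii.crc_hqx` with an initial value of 0 matches the CRC16 variant Redis uses (it reproduces the 0x31C3 check value for the input `123456789`), and it ignores hash tags. The helper names are hypothetical.

```python
import binascii

REDIS_CLUSTER_SLOTS = 16384


def session_key(session_id: str) -> str:
    """Key layout used for session records in this walkthrough."""
    return f"session:{session_id}"


def cluster_slot(key: str) -> int:
    """Approximate Redis Cluster hash slot: CRC16 of the key modulo 16384."""
    # Hash tags ({...} inside a key) are deliberately not handled here.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % REDIS_CLUSTER_SLOTS


slot = cluster_slot(session_key("3f9a1c"))
print(slot)
```

Because the session ID is random, keys spread evenly across slots, so no single shard becomes a hot spot for validation reads.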
Real-world reference
Google uses a similar architecture with separate auth and session services. Auth0 separates identity verification from session management. Facebook uses a hybrid approach with short-lived access tokens and longer refresh tokens.
Data Model and Storage
Let me define the data models for users, credentials, and sessions.
What to say
The user table stores identity, the credentials table stores authentication methods (separated for security), and sessions are in Redis for fast access.
-- Users table (sharded by user_id)
CREATE TABLE users (
    user_id UUID PRIMARY KEY,

Key: session:{session_id}
Value: JSON object
TTL: 30 days (refreshed on activity)

Why separate credentials table?
1. Security: Credentials can have different access controls than user data
2. Flexibility: Easy to add new auth methods (passkeys, WebAuthn)
3. Audit: Separate audit trail for credential changes
4. Multi-method: User can have password + OAuth + passkey simultaneously
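The multi-method point is easier to see in code. The sketch below models one user with several credential rows. All class and field names are hypothetical and do not reproduce the schema in this walkthrough.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class CredentialType(Enum):
    PASSWORD = "password"
    OAUTH = "oauth"
    PASSKEY = "passkey"


@dataclass
class Credential:
    user_id: str
    type: CredentialType
    # Password hash, OAuth subject ID, or passkey public key, depending on type.
    secret_or_ref: str
    provider: Optional[str] = None  # e.g. "google" for OAuth rows


@dataclass
class User:
    user_id: str
    email: str
    credentials: List[Credential] = field(default_factory=list)

    def methods(self) -> Set[CredentialType]:
        """Authentication methods this user has enrolled."""
        return {c.type for c in self.credentials}


user = User("u1", "a@example.com")
user.credentials.append(Credential("u1", CredentialType.PASSWORD, "<hash>"))
user.credentials.append(Credential("u1", CredentialType.OAUTH, "sub-123", "google"))
print(sorted(m.value for m in user.methods()))
```

Adding a new method such as passkeys is then one more enum value and one more row, with no change to the user record.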
Critical security detail
Password hashes must use bcrypt (cost factor 12+) or Argon2id. Never use MD5, SHA1, or even SHA256 for passwords. These are too fast and allow brute force attacks.
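A minimal sketch of salted, deliberately slow password hashing follows. It swaps in scrypt from Python's standard library as a stand-in so the example runs without extra packages. The design itself calls for bcrypt or Argon2id, and the cost parameters shown are assumed values that would need tuning.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

# Assumed scrypt cost parameters; tune them for the target hash time.
N, R, P = 2**14, 8, 1


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for storage. A fresh random salt per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=N, r=R, p=P, dklen=32)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=N, r=R, p=P, dklen=32)
    return hmac.compare_digest(candidate, expected)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
```

A single SHA256 call takes microseconds. The memory-hard function above takes orders of magnitude longer per guess, and that gap is what makes offline brute force expensive.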
Authentication Flow Deep Dive
Let me walk through the login flow in detail.
Password Login Flow
async def login(email: str, password: str, request: Request) -> LoginResponse:
    # Step 1: Rate limiting
    if not await rate_limiter.check(f"login:{email}", limit=5, window=300):
Critical Security Details:
1. Timing Attack Prevention: Always perform a password hash, even for non-existent users. Otherwise, attackers can enumerate valid emails by measuring response time.
2. Rate Limiting: Per-email AND per-IP. Prevents brute force and credential stuffing.
3. Fraud Detection: Analyze login context (IP, device, time, location) to detect anomalies.
4. Audit Logging: Log all authentication events for security review and compliance.
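The timing-attack point can be sketched as follows: when the email is unknown, the code still hashes the submitted password against a dummy record, so both paths do the same work. The in-memory `USERS` table and the function names are hypothetical, and scrypt again stands in for bcrypt/Argon2 to keep the example within the standard library.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional, Tuple


def _kdf(password: str, salt: bytes) -> bytes:
    # scrypt is a stand-in for bcrypt/Argon2 in this sketch.
    return hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)


# Hypothetical in-memory user table: email -> (salt, password hash).
USERS: Dict[str, Tuple[bytes, bytes]] = {}


def register(email: str, password: str) -> None:
    salt = secrets.token_bytes(16)
    USERS[email] = (salt, _kdf(password, salt))


# Dummy record computed once at startup; used when the email is unknown.
_DUMMY_SALT = secrets.token_bytes(16)
_DUMMY_HASH = _kdf(secrets.token_urlsafe(16), _DUMMY_SALT)


def check_credentials(email: str, password: str) -> bool:
    record: Optional[Tuple[bytes, bytes]] = USERS.get(email)
    salt, expected = record if record is not None else (_DUMMY_SALT, _DUMMY_HASH)
    candidate = _kdf(password, salt)            # same work on both paths
    matches = hmac.compare_digest(candidate, expected)
    return matches and record is not None       # unknown email never succeeds


register("a@example.com", "s3cret")
print(check_credentials("a@example.com", "s3cret"))
```

The caller should also return the same error message for "unknown email" and "wrong password", so the response body leaks no more than the timing does.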
OAuth Flow
For OAuth (Google, Facebook), we redirect to the provider, receive an authorization code, exchange it for tokens, then fetch user info. The flow needs no server-side state between the redirect and the callback: the PKCE code verifier is stored in a short-lived cookie.
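For reference, the PKCE values mentioned here can be generated as in the sketch below, using the S256 method from RFC 7636. The function names are illustrative. The verifier would go into the short-lived cookie and the challenge into the authorization request.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """43-character URL-safe verifier (RFC 7636 allows 43 to 128 characters)."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) without padding, as the S256 method defines."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
print(len(verifier), len(challenge))
```

On the callback, the service sends the verifier with the code exchange, and the provider recomputes the challenge to confirm that the same client started the flow.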
Session Management
What to say
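I will use a hybrid approach: short-lived access tokens (15 min) for API calls, and longer refresh tokens (30 days) stored server-side. This gives us fast validation, and revocation takes effect within one access-token lifetime, or instantly for requests that also check the session in Redis.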
Session Token Types:
1. Access Token: Short-lived (15 min), can be JWT for stateless validation
2. Refresh Token: Long-lived (30 days), stored in Redis, used to get new access tokens
3. Session ID: Ties access and refresh tokens together, used for logout-all
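A stateless access token of the kind described above can be sketched with the standard library alone. This is an illustrative HS256 implementation, not a replacement for a vetted JWT library. The `sid` claim, which carries the session ID, and the function names are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_access_token(user_id: str, session_id: str, key: bytes,
                       ttl_seconds: int = 15 * 60,
                       now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "sid": session_id,
              "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims))
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_access_token(token: str, key: bytes,
                        now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = hmac.new(key, f"{header_b64}.{claims_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError):
        return None
    if header.get("alg") != "HS256":   # refuse any other algorithm
        return None
    current = time.time() if now is None else now
    return claims if current < claims.get("exp", 0) else None


key = b"k" * 32
token = issue_access_token("u1", "s1", key, now=1000)
print(verify_access_token(token, key, now=1060))
```

Validation needs only the key, with no Redis round trip. That is why the 15-minute lifetime bounds how long a revoked session can keep working.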
async def validate_session(access_token: str) -> User:
    # Option 1: Stateless JWT validation (fastest, no Redis call)
    if is_jwt(access_token):
Token Refresh Flow:
async def refresh_tokens(refresh_token: str) -> TokenPair:
    # Validate refresh token in Redis
    session = await redis.get(f"refresh:{refresh_token}")
Refresh token rotation
Always rotate refresh tokens on use. If an attacker steals a refresh token and uses it, the legitimate user's next refresh will fail because the token has already been rotated, and that reuse of a retired token signals the compromise.
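Rotation with reuse detection can be sketched as below. An in-memory class stands in for the Redis-backed store. Revoking the whole session when a retired token reappears is an assumption about how to act on the signal, not something the original flow specifies.

```python
import secrets
from typing import Dict, Optional, Set


class RefreshTokenStore:
    """In-memory stand-in for the Redis-backed refresh token store."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}    # refresh token -> session_id
        self._retired: Dict[str, str] = {}   # rotated-out token -> session_id
        self._revoked: Set[str] = set()      # sessions revoked after reuse

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = session_id
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Swap a valid token for a new one; return None if it is rejected."""
        if token in self._retired:
            # A retired token came back: treat the session as compromised.
            self._revoke_session(self._retired[token])
            return None
        session_id = self._active.pop(token, None)
        if session_id is None or session_id in self._revoked:
            return None
        self._retired[token] = session_id
        return self.issue(session_id)

    def _revoke_session(self, session_id: str) -> None:
        self._revoked.add(session_id)
        for tok in [t for t, s in self._active.items() if s == session_id]:
            del self._active[tok]


store = RefreshTokenStore()
original = store.issue("s1")
attacker_token = store.rotate(original)   # attacker uses the stolen token first
victim_result = store.rotate(original)    # legitimate client retries the old one
print(victim_result, store.rotate(attacker_token))
```

In this run the victim's refresh fails, and the attacker's freshly rotated token stops working too, so both parties are forced back through a full login.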
Security Defenses
System Invariants
1) Never store passwords in plaintext or reversible encryption.
2) Never leak whether an email exists via timing or error messages.
3) Never allow session fixation (attacker setting session ID).
4) Always use HTTPS and secure cookie flags.
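Invariants 3 and 4 can be illustrated together: at login the server ignores any session ID the client presented, mints a fresh one, and sets it with hardened cookie attributes. The cookie name `session`, the `SameSite=Lax` choice, and the function names are assumptions for this sketch.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional, Tuple


def session_cookie_header(session_id: str, max_age_seconds: int) -> str:
    """Build a Set-Cookie value with Secure, HttpOnly, and SameSite set."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    morsel = cookie["session"]
    morsel["secure"] = True        # only sent over HTTPS
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["samesite"] = "Lax"     # limits cross-site sends
    morsel["path"] = "/"
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()


def login_cookie(presented_session_id: Optional[str],
                 max_age_seconds: int) -> Tuple[str, str]:
    """Issue a fresh session ID at login, whatever the client presented."""
    # Discarding presented_session_id is what defeats session fixation.
    fresh_id = secrets.token_urlsafe(32)
    return fresh_id, session_cookie_header(fresh_id, max_age_seconds)


fresh_id, header = login_cookie("attacker-chosen-id", 900)
print(header)
```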
Defense in Depth:
| Attack | Defense | Implementation |
|---|---|---|
| Brute Force | Rate limiting + account lockout | 5 attempts per 5 min, exponential backoff |
| Credential Stuffing | Rate limit + device fingerprint + CAPTCHA | Detect high-velocity attempts from same IP range |
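The "5 attempts per 5 min" rule in the table can be sketched as a sliding-window counter keyed per email and per IP. This in-memory version is illustrative only: a multi-server deployment would need a shared store, and the exponential backoff from the table is not implemented. The class name and key prefixes are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    def __init__(self, limit: int = 5, window_seconds: float = 300.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record an attempt for key; return False once the limit is reached."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()                 # drop attempts outside the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def allow_login(limiter: SlidingWindowLimiter, email: str, ip: str,
                now: Optional[float] = None) -> bool:
    # Evaluate both keys so each counter records the attempt.
    by_email = limiter.allow(f"login:email:{email}", now)
    by_ip = limiter.allow(f"login:ip:{ip}", now)
    return by_email and by_ip


limiter = SlidingWindowLimiter(limit=5, window_seconds=300)
results = [allow_login(limiter, "a@example.com", "203.0.113.7", now=t) for t in range(6)]
print(results)
```

Keying on both dimensions matters because credential stuffing spreads attempts across many emails from a few IPs, while targeted brute force does the opposite.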
async def detect_credential_stuffing(request: Request) -> bool:
    ip = request.client_ip
    ip_range = get_ip_range(ip)  # /24 subnet
Password Security:
import argon2
from argon2 import PasswordHasher
Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios. Authentication is critical - we need graceful degradation, not complete outage.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Redis down | Cannot validate sessions | JWT fallback + local cache | Stateless tokens work without Redis |
| User DB down | Cannot login | Cached credentials for repeat users | Allow recent users to re-authenticate |
async def validate_session_resilient(token: str) -> User:
    # Try Redis first
    try:
Graceful degradation principle
For authentication, availability often beats perfect security. A user locked out of their account is worse than a slight security risk during an outage. Log degraded operations for post-incident review.
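The fallback idea can be sketched as a small wrapper: try the session store first, and if it is unavailable, fall back to verifying the signed token locally and log the degraded path. The sketch is synchronous for brevity. The exception type, the two callables, and the logger name are hypothetical placeholders for whatever the real services expose.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("auth.degraded")


class SessionStoreUnavailable(Exception):
    """Raised by the session lookup when the store cannot be reached."""


def validate_with_fallback(
    token: str,
    lookup_session: Callable[[str], Optional[dict]],
    verify_signed_token: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Prefer the authoritative store; degrade to stateless verification."""
    try:
        return lookup_session(token)
    except SessionStoreUnavailable:
        # Revocations made during the outage are not visible on this path,
        # so every degraded validation is logged for post-incident review.
        logger.warning("session store unavailable; using signed-token fallback")
        return verify_signed_token(token)


def broken_store(token: str) -> Optional[dict]:
    raise SessionStoreUnavailable("redis timeout")


def fake_verify(token: str) -> Optional[dict]:
    return {"sub": "u1"} if token == "valid" else None


print(validate_with_fallback("valid", broken_store, fake_verify))
```

The trade-off is explicit in the comment: during the outage a revoked session can still pass until its signed token expires.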
Evolution and Scaling
What to say
This design handles billions of logins. Let me discuss how it evolves for additional requirements: passwordless, passkeys, and zero-trust architecture.
Evolution Path:
Stage 1: Basic Auth (current design)
- Password + OAuth
- Redis sessions
- MFA optional
Stage 2: Passwordless
- Magic links via email
- WebAuthn/Passkeys
- Biometric authentication
Stage 3: Zero Trust
- Continuous authentication
- Risk-based access
- Device trust scoring
Scaling Considerations:
| Scale Challenge | Solution | Tradeoff |
|---|---|---|
| Password hashing CPU | Dedicated auth servers + async queuing | Added latency for login |
| Session storage | Redis Cluster + regional sharding | Complexity, eventual consistency |
| Global users | Regional auth services + session replication | Cost, consistency delay |
| Peak load (Super Bowl) | Pre-warm sessions + queue overflow | Degraded UX during peak |
Passkeys (WebAuthn) eliminate passwords entirely:
1. Registration:
Alternative approach
If we needed even lower latency, we could use edge authentication - validate JWTs at CDN edge nodes using shared public keys. This gives sub-millisecond validation globally, but requires careful key rotation.
What I would do differently for...
Banking app: Stronger MFA requirements, session timeout after 5 minutes of inactivity, transaction signing for sensitive operations.
Social network: Longer sessions, risk-based authentication, focus on account recovery flows.
Enterprise SSO: SAML/OIDC federation, just-in-time provisioning, group-based access control.