Design Walkthrough
Problem Statement
The Question: Design a URL shortening service like bit.ly where users can paste a long link and get a short one back. When someone visits the short link, they get sent to the original long link.
What the service needs to do (most important first):
- 1. Shorten a URL - User gives us a long link like https://amazon.com/very/long/product/page/12345. We give back something short like bit.ly/x7Kp2m.
- 2. Redirect quickly - When someone clicks bit.ly/x7Kp2m, we send them to the original Amazon link in under 100 milliseconds.
- 3. Handle lots of traffic - Popular links might get clicked millions of times. The service must not slow down.
- 4. Track clicks - Count how many people clicked each link, when they clicked, and where they clicked from.
- 5. Custom short links - Let users pick their own short code like bit.ly/my-sale instead of a random code.
- 6. Link expiration - Some links should stop working after a set time (like a 24-hour sale link).
What to say first
Let me understand what we are building. Do we need custom short codes or just random ones? Do links expire or last forever? Do we need click tracking? Once I know the features, I will ask about scale - how many URLs and how many clicks per day.
What the interviewer really wants to see:
- Can you generate unique short codes without duplicates, even with multiple servers?
- Do you understand that reads (redirects) happen way more than writes (creating links)?
- Can you design a system that responds in under 100 milliseconds?
- How do you handle a viral link that suddenly gets millions of clicks?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many new short URLs do we create per day? How many redirects happen per day? This tells me if we need one server or thousands.
Why ask this: The design for 1,000 URLs per day is very different from 100 million per day.
What interviewers usually say: 100 million new URLs per day, 10 billion redirects per day. Redirects happen 100 times more than URL creation.
How this changes your design: Since redirects happen 100x more, we must make redirects super fast using caching. Creating URLs can be a bit slower.
Question 2: How short should the URL be?
Should the short code be 6 characters, 7 characters, or longer? Shorter is easier to type but we run out of combinations faster.
Why ask this: A 6-character code using letters and numbers (a-z, A-Z, 0-9) gives us 62^6 = 56 billion combinations. A 7-character code gives us 62^7 = 3.5 trillion combinations.
What interviewers usually say: Start with 7 characters. This gives us enough combinations for many years.
How this changes your design: With 100 million new URLs per day, 7 characters will last us about 95 years. We are safe.
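Quick sanity check on that 95-year figure - it is just the code space divided by the yearly volume (the 100 million URLs per day rate is the assumption from Question 1):

```python
codes = 62 ** 7                      # 7-character Base62 codes: about 3.52 trillion combinations
urls_per_year = 100_000_000 * 365    # 100 million new URLs per day
print(f"{codes:,}")                  # 3,521,614,606,208
print(round(codes / urls_per_year))  # ~96, so roughly 95+ years before the code space runs out
```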
Question 3: Do links expire?
Should short links work forever, or should they expire after some time? Can users set their own expiration?
Why ask this: If links never expire, our database grows forever. If they expire, we can reuse old short codes and delete old data.
What interviewers usually say: Links last forever by default, but users can set an expiration if they want.
How this changes your design: We need to store a created_at and expires_at time for each link. A background job can clean up expired links.
Question 4: Do we need analytics?
Should we track how many times each link was clicked? Do we need to know when and where people clicked?
Why ask this: Analytics adds complexity. We need to store every single click, which is way more data than just the URLs.
What interviewers usually say: Yes, track total clicks. Nice to have: clicks by day, by country, by device.
How this changes your design: We cannot update the database on every click (too slow). We need to batch the updates or use a separate analytics system.
Summarize your assumptions
Let me summarize: 100 million new URLs per day, 10 billion redirects per day, 7-character codes, links last forever by default, and we need basic click tracking. Redirects must be under 100 milliseconds.
The Hard Part
Say this to the interviewer
The hardest part of a URL shortener is generating unique short codes. If we have 10 servers all creating URLs at the same time, how do we make sure they never create the same short code? Even one duplicate would break the system.
Why unique IDs are tricky (explained simply):
- 1. Many servers at once - If 10 servers are creating URLs at the same time, they might accidentally pick the same short code.
- 2. Must be fast - We cannot check the database every time to see if a code is taken. That would be too slow.
- 3. Must never repeat - If bit.ly/abc123 goes to Site A, it can never go to Site B. Ever. People have shared that link.
- 4. Codes should look random - We do not want bit.ly/1, bit.ly/2, bit.ly/3. People could guess URLs and find private links.
- 5. Need billions of them - At 100 million URLs per day, we need 36 billion codes per year. We cannot run out.
Common mistake candidates make
Many people say: just use a random string and check if it exists in the database. This is wrong because: (1) checking the database every time is slow, (2) random collisions become more likely as the database fills up, (3) with multiple servers, two might check at the same time and both think the code is free.
Three ways to generate unique short codes:
Option 1: Counter with Base62 Encoding (Recommended)
- Keep a counter that goes up: 1, 2, 3, 4...
- Convert each number to letters and numbers (Base62)
- Number 1 becomes "1", number 62 becomes "10", number 1000000 becomes "4c92"
- Why this is good: Guaranteed unique, very fast, no database check needed
- How to share the counter: Give each server a range (Server 1 gets 1-1000000, Server 2 gets 1000001-2000000)
Option 2: Hash the Long URL
- Use a hash function (like MD5) on the long URL
- Take the first 7 characters of the hash
- Why this is tricky: Different URLs might have the same first 7 hash characters (collision). Need to handle this.
Option 3: Pre-generate Codes
- Create millions of random codes ahead of time and store them in a table
- When someone needs a short URL, grab one from the table and mark it as used
- Why this works: No collision possible because each code is used only once
How we generate unique codes
What is Base62?
- We use 62 characters: a-z (26) + A-Z (26) + 0-9 (10) = 62
- Just like Base10 uses 0-9, Base62 uses more characters, so a big counter value becomes a short code (see the sketch below)
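A minimal sketch of that conversion in Python. The digit-first alphabet order is a choice that matches the examples in the text (1 becomes "1", 62 becomes "10"); any fixed order works as long as everyone uses the same one:

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Convert a non-negative counter value into a short Base62 code."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))

print(base62_encode(1))          # "1"
print(base62_encode(62))         # "10"
print(base62_encode(1_000_000))  # "4c92"
```

Decoding is the same idea in reverse, but the redirect path never needs it - we store the encoded short code directly as the lookup key.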
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| New URLs per day | 100 million | About 1,160 writes per second - one database handles this easily |
| Redirects per day | 10 billion | About 115,000 reads per second - need heavy caching |
What to tell the interviewer
This is a read-heavy system with 100:1 read to write ratio. Our main focus should be making redirects super fast using caching. At 115,000 redirects per second, we need Redis cache in front of our database. A cache hit should happen 99% of the time.
Common interview mistake: Ignoring the read-heavy pattern
Many candidates focus on making URL creation fast. But redirects happen 100x more! A slow redirect (even 500ms) would make users angry. A slow URL creation (even 2 seconds) is fine - users only do it once.
How people use the service (from most common to least common):
- 1. Click a short link (redirect) - This is 99% of all traffic. Someone clicks bit.ly/abc123 and goes to the original site. Must be super fast.
- 2. Create a short link - User pastes a long URL and gets a short one back. Happens 100x less than redirects.
- 3. View analytics - User checks how many clicks their link got. Happens rarely.
- 4. Delete a link - User removes a link they created. Very rare.
How much space does one short URL need?
- Short code: 7 bytes
- Long URL: 200 bytes average (some are longer)
- Add some metadata (user, timestamps, click count) and it is roughly 250 bytes per URL - about 25 GB of new data per day, or around 9 TB per year (rough math below)
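Back-of-the-envelope storage math under those numbers (the ~50 bytes of per-row metadata is an assumption, not a measured figure):

```python
bytes_per_url = 7 + 200 + 50         # short code + average long URL + assumed metadata overhead
urls_per_day = 100_000_000

per_day = bytes_per_url * urls_per_day
per_year = per_day * 365
print(per_day / 1e9, "GB per day")     # ~25.7 GB of new URL data per day
print(per_year / 1e12, "TB per year")  # ~9.4 TB per year - manageable for one PostgreSQL cluster for years
```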
High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will separate the system into two main paths: creating URLs and redirecting. Redirects need to be super fast, so they go through a cache first. URL creation can be a bit slower since it happens 100x less often.
URL Shortener - The Big Picture
What each part does and WHY it is separate:
| Part | What it does | Why it is separate (what to tell interviewer) |
|---|---|---|
| Redirect Service | Looks up short code, sends user to long URL | This is the hot path - 99% of traffic. Must be super fast. Talks to cache first, database only if cache misses. |
| URL Service | Creates new short URLs, handles custom codes | Separate from redirects because creating URLs is slower and less frequent. We do not want slow URL creation to affect fast redirects. |
| Analytics Service | Counts clicks, stores time and location data | Why separate? We cannot update the database on every click - too slow. Analytics collects clicks in batches and writes them later. |
| Redis Cache | Stores recently used short code to long URL mappings | Why needed? Database lookups take 5-10 milliseconds. Cache lookups take 0.5 milliseconds. At 115K redirects per second, we need 99% to hit cache. |
| Zookeeper | Gives each server a range of IDs to use | Why needed? If Server 1 and Server 2 both try to use ID 1000, we get a duplicate. Zookeeper gives Server 1 IDs 1-1M and Server 2 IDs 1M-2M. |
Common interview question: Why not hash the URL?
Interviewers often ask: Why not just hash the long URL to get the short code? Your answer: Hashing works but has problems: (1) Two different URLs might have the same hash prefix (collision), (2) We need extra logic to handle collisions, (3) Same URL from different users would get the same short code - what if one user wants to delete it? Counter-based approach avoids all these problems.
Technology Choices - Why we picked these tools:
Database: PostgreSQL (Recommended)
- Why we chose it: Great at storing key-value data (short code to long URL), handles our scale easily, supports good indexes
- Other options we considered:
  - MySQL: Also works great - pick what your team knows
  - DynamoDB: Good for key-value lookups, but harder to do analytics queries
  - Cassandra: Good if we need to scale beyond 100TB, but adds complexity
Cache: Redis (Recommended)
- Why we chose it: Super fast (0.5ms lookups), handles 100K+ operations per second, perfect for our read-heavy workload
- Other options we considered:
  - Memcached: Also works, but Redis has more features
  - Local in-memory cache: Only works for small systems
ID Generation: Zookeeper or Database Sequence
- Why we need it: Multiple servers need unique IDs without talking to each other
- Zookeeper: Gives each server a range of IDs to use
- Database Sequence: PostgreSQL can auto-increment IDs (simpler but slower)
Important interview tip
Pick technologies YOU know! If you have used MySQL at your job, use MySQL. If you know MongoDB, explain how it would work here. Interviewers care more about your reasoning than the specific tool.
Data Model and Storage
Now let me show how we organize the data in the database. We need two main tables: one for URLs and one for click tracking.
What to tell the interviewer
I will use a SQL database with two main tables: urls (stores the short code and long URL) and clicks (stores analytics data). The short code is the primary key for fast lookups.
Table 1: URLs - Stores the mapping from short code to long URL
This is the main table. When someone clicks a short link, we look up the short code here and find the long URL.
| Column | What it stores | Example |
|---|---|---|
| short_code | The 7-character code (PRIMARY KEY) | x7Kp2mQ |
| long_url | The original long URL | https://amazon.com/very/long/path |
| user_id | Who created the link | user_42 |
| click_count | Total clicks so far (updated in batches) | 1024 |
| created_at | When the link was created | 2024-01-15 10:30 |
| expires_at | When the link stops working (optional, NULL means never) | NULL |
Database Index
The short_code is the PRIMARY KEY, so lookups are super fast. We also add an INDEX on user_id so users can see all their links quickly.
Table 2: Clicks - Stores every click for analytics
Every time someone clicks a short link, we record it here. This table gets huge (billions of rows), so we store it separately.
| Column | What it stores | Example |
|---|---|---|
| id | Unique click ID | click_789 |
| short_code | Which link was clicked | x7Kp2mQ |
| clicked_at | When the click happened | 2024-01-15 10:31:02 |
| country | Where the click came from | US |
| device | Mobile, desktop, or tablet | mobile |
| referrer | Where the visitor found the link | twitter.com |
Important: Do NOT write clicks directly
We get 115,000 clicks per second. Writing each click to the database one by one would kill the database. Instead, we batch them: collect 1000 clicks in memory, then write them all at once. Or use a time-series database like TimescaleDB that is built for this.
How we handle the click count:
The urls table has a click_count column. We do NOT update it on every click (too slow). Instead:
- 1. Every click goes to a counter in Redis (super fast)
- 2. Every 5 minutes, a background job reads the Redis counters
- 3. The job updates the click_count in PostgreSQL in batches (sketch below)
- 4. This way, the count might be 5 minutes behind, but that is okay for analytics
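Here is a minimal sketch of that counter-and-flush pattern, assuming the redis-py and psycopg2 clients and a clicks:<short_code> key scheme (library choices, key format, and connection string are illustrative, not part of the design itself):

```python
import time
import redis     # assumption: redis-py client
import psycopg2  # assumption: psycopg2 PostgreSQL driver

r = redis.Redis()
db = psycopg2.connect("dbname=shortener")  # hypothetical connection string

def record_click(short_code: str) -> None:
    # Hot path: one in-memory increment per click, no database write.
    r.incr(f"clicks:{short_code}")

def flush_click_counts() -> None:
    # Background job: move the Redis counters into PostgreSQL in batches.
    with db.cursor() as cur:
        for key in r.scan_iter("clicks:*"):
            short_code = key.decode().split(":", 1)[1]
            raw = r.getset(key, 0)              # read the counter and reset it to zero
            count = int(raw) if raw else 0
            if count:
                cur.execute(
                    "UPDATE urls SET click_count = click_count + %s WHERE short_code = %s",
                    (count, short_code),
                )
    db.commit()

if __name__ == "__main__":
    while True:
        flush_click_counts()
        time.sleep(300)  # run the flush roughly every 5 minutes
```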
CREATE TABLE urls (
    short_code  VARCHAR(10) PRIMARY KEY,
    long_url    TEXT NOT NULL,
    user_id     BIGINT,
    click_count BIGINT DEFAULT 0,
    created_at  TIMESTAMP DEFAULT NOW(),
    expires_at  TIMESTAMP
);
CREATE INDEX idx_urls_user_id ON urls (user_id);
The Redirect Flow
This is the most important part of the system. When someone clicks bit.ly/x7Kp2mQ, we need to send them to the right place in under 100 milliseconds.
What to tell the interviewer
The redirect flow is our hot path - 99% of all traffic. It must be blazing fast. We check the cache first (0.5ms), only go to the database if cache misses (5-10ms). We also record the click for analytics but do not wait for it.
What happens when someone clicks a short link
FUNCTION handle_redirect(short_code):
    STEP 1: Check the cache first (super fast - 0.5ms)
    STEP 2: On a cache miss, look up the short code in the database (5-10ms) and save the result to the cache
    STEP 3: If the code does not exist or has expired, return HTTP 404
    STEP 4: Send the click data to the analytics queue - do NOT wait for it
    STEP 5: Return an HTTP 301 or 302 redirect to the long URL
HTTP 301 vs HTTP 302 - Which redirect to use?
HTTP 301 (Moved Permanently)
- Tells the browser: "Remember this! Next time, go directly to the long URL."
- Good for: SEO (search engines pass link value to the original site)
- Bad for: Analytics (browser might skip us next time)
HTTP 302 (Temporary Redirect)
- Tells the browser: "Go there this time, but ask me again next time."
- Good for: Analytics (we see every click)
- Bad for: SEO (search engines do not pass full link value)
Our choice: Use 301 by default (better for users), but let users pick 302 if they need accurate click counts.
Why we do NOT wait for analytics
In Step 4, we send click data to a queue and do NOT wait. Why? Writing to the database takes time. If we waited, redirects would be slower. Instead, we put the click data in a fast queue (like Kafka) and let the analytics service process it later. Users do not notice a few seconds delay in analytics.
Making it even faster with a caching strategy:
- 1. Cache everything that gets clicked - After a database lookup, always save to Redis
- 2. Pre-warm popular links - When a link goes viral, make sure all Redis servers have it
- 3. Use local cache too - Each server can keep the top 10,000 links in memory (even faster than Redis)
- 4. Set smart expiration - Popular links stay in cache longer (24 hours), rarely clicked links expire faster (1 hour). A small cache-aside sketch follows.
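Putting the hot path together, here is a minimal cache-aside sketch. It assumes redis-py, psycopg2, and Flask for the HTTP layer, with a url:<short_code> cache key and a one-hour TTL - all illustrative choices rather than the only way to build it:

```python
import redis                               # assumption: redis-py client
import psycopg2                            # assumption: psycopg2 PostgreSQL driver
from flask import Flask, abort, redirect   # assumption: Flask for the HTTP layer

app = Flask(__name__)
cache = redis.Redis()
db = psycopg2.connect("dbname=shortener")  # hypothetical connection string

def lookup_long_url(short_code: str):
    # 1. Cache first: ~0.5 ms, and it absorbs ~99% of redirect traffic.
    cached = cache.get(f"url:{short_code}")
    if cached:
        return cached.decode()
    # 2. Cache miss: fall back to the database (~5-10 ms).
    with db.cursor() as cur:
        cur.execute("SELECT long_url FROM urls WHERE short_code = %s", (short_code,))
        row = cur.fetchone()
    if row is None:
        return None
    # 3. Warm the cache so the next click is fast; popular codes could get a longer TTL.
    cache.set(f"url:{short_code}", row[0], ex=3600)
    return row[0]

@app.route("/<short_code>")
def handle_redirect(short_code):
    long_url = lookup_long_url(short_code)
    if long_url is None:
        abort(404)
    # Click tracking would be pushed to a queue here, fire-and-forget,
    # so it never slows the redirect down (omitted for brevity).
    return redirect(long_url, code=301)  # or 302 when accurate click counts matter more
```

A small per-process dictionary holding the hottest few thousand codes (point 3 in the list above) would sit in front of cache.get and make viral links even cheaper to serve.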
Creating Short URLs
When a user gives us a long URL, we need to create a unique short code and save the mapping. This happens 100x less than redirects, so it can be a bit slower.
What to tell the interviewer
For creating URLs, the key challenge is generating unique short codes across multiple servers. I will use a counter-based approach where each server gets a range of IDs from Zookeeper. This guarantees uniqueness without checking the database.
How we create a new short URL
FUNCTION create_short_url(long_url, user_id, custom_code = null):
    STEP 1: Validate the long URL (well-formed, not on the blocklist)
    STEP 2: If custom_code is given, check it is free and claim it
    STEP 3: Otherwise, take the next ID from this server's range and Base62-encode it
    STEP 4: Save the short_code to long_url mapping in the database
    STEP 5: Return the short URL
How the counter works across multiple servers:
Problem: We have 10 servers creating URLs. How do we make sure they never create the same short code?
Solution: Give each server its own range of numbers.
- 1. Server 1 starts up and asks Zookeeper: "Give me some IDs"
- 2. Zookeeper says: "You get IDs 1 to 1,000,000"
- 3. Server 1 uses ID 1, then 2, then 3... up to 1,000,000
- 4. Server 2 asks Zookeeper and gets: "You get IDs 1,000,001 to 2,000,000"
- 5. When Server 1 runs out, it asks for another range
This way, no two servers ever use the same ID.
When a server starts up:
    range_start, range_end = ZOOKEEPER.get_id_range(size = 1000000)
    current_id = range_start
When the server creates a URL:
    short_code = base62_encode(current_id)
    current_id = current_id + 1
Why not use database auto-increment?
PostgreSQL can auto-increment IDs, but it becomes a bottleneck. Every URL creation would need to talk to the database to get the next ID. With Zookeeper ranges, each server can create 1 million URLs without talking to anyone. Much faster.
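Below is a sketch of the range-based allocator each URL server could run. The reserve_range callback stands in for the coordination call (Zookeeper, or a database sequence bumped by 1,000,000 at a time); its name and signature are hypothetical:

```python
import threading

class RangeIdAllocator:
    """Hands out unique integer IDs from a pre-reserved block of numbers."""

    def __init__(self, reserve_range, block_size: int = 1_000_000):
        # reserve_range(size) -> (start, end) is the coordination call; end is exclusive.
        self._reserve_range = reserve_range
        self._block_size = block_size
        self._lock = threading.Lock()
        self._next, self._end = reserve_range(block_size)

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:
                # This block is used up: reserve a fresh one before continuing.
                self._next, self._end = self._reserve_range(self._block_size)
            current = self._next
            self._next += 1
            return current

# Stand-in coordinator so the sketch runs on its own.
_counter = 0
def fake_reserve_range(size: int):
    global _counter
    start, _counter = _counter, _counter + size
    return start, start + size

allocator = RangeIdAllocator(fake_reserve_range)
print(allocator.next_id())  # 0 - this integer would then be Base62-encoded into the short code
print(allocator.next_id())  # 1
```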
Analytics and Click Tracking
Users want to know: How many people clicked my link? When did they click? Where are they from? But with 115,000 clicks per second, we cannot write each click to the database immediately.
The problem with real-time analytics
If we tried to INSERT INTO clicks for every click, the database would die. 115,000 inserts per second is way too many. Instead, we collect clicks in batches and write them together.
How we track clicks without killing the database
Two-part analytics system:
Part 1: Fast Counter (for total clicks)
- Every click increments a Redis counter
- Redis handles 100,000+ increments per second easily
- Every 5 minutes, we copy Redis counts to PostgreSQL
- Users see total clicks that are at most 5 minutes old
Part 2: Detailed Logging (for who, when, where)
- Every click goes to a Kafka queue
- Analytics workers pull from Kafka in batches of 1000
- Workers write batches to TimescaleDB (time-series database)
- Users can see detailed analytics a few minutes later
FUNCTION record_click(short_code, request_info):
    // This runs ASYNC - we do not wait for it
    REDIS.INCR("clicks:" + short_code)
    KAFKA.send("clicks", {short_code, timestamp, country, device, referrer})
FUNCTION analytics_worker():
    // This runs continuously in the background
    batch = KAFKA.poll(max = 1000)
    TIMESCALEDB.insert_batch("clicks", batch)
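The same worker written against real clients, assuming kafka-python for the consumer and psycopg2 with a TimescaleDB-backed clicks table for storage (topic name, batch size, and column names are illustrative):

```python
import json
import psycopg2                  # assumption: psycopg2 PostgreSQL/TimescaleDB driver
from kafka import KafkaConsumer  # assumption: kafka-python client

consumer = KafkaConsumer(
    "clicks",                                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
db = psycopg2.connect("dbname=analytics")      # hypothetical connection string

while True:
    # Pull up to 1000 click events at a time instead of inserting them one by one.
    records = consumer.poll(timeout_ms=1000, max_records=1000)
    batch = [rec.value for partition in records.values() for rec in partition]
    if not batch:
        continue
    with db.cursor() as cur:
        cur.executemany(
            "INSERT INTO clicks (short_code, clicked_at, country, device)"
            " VALUES (%(short_code)s, %(clicked_at)s, %(country)s, %(device)s)",
            batch,   # each event is assumed to be a dict with these keys
        )
    db.commit()
```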
What is TimescaleDB?
TimescaleDB is PostgreSQL with special features for time-series data (data that has timestamps). It automatically splits data by time (last hour, last day, last week) so queries like "show me clicks for the last 7 days" are super fast. Perfect for click analytics.
What analytics we show to users:
- 1. Total clicks - From the click_count in the urls table (5 min delay max)
- 2. Clicks over time - Chart showing clicks per hour/day from TimescaleDB
- 3. Top countries - Group clicks by country
- 4. Top referrers - Where did clicks come from (Twitter, Facebook, etc.)
- 5. Device breakdown - Mobile vs Desktop vs Tablet
What Can Go Wrong and How We Handle It
Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
| What breaks | What happens to users | How we fix it | Why this works |
|---|---|---|---|
| Redis cache goes down | Redirects become slow (hit database) | Keep database read replicas + auto-restart Redis | Slow is better than broken. Database can handle some load. |
| Database goes down | Cannot create new URLs, redirects fail on cache miss | Use database replicas + failover | Read replica becomes primary. Recent URLs are in cache. |
Handling a viral link:
Imagine a celebrity tweets a short link and suddenly 10 million people click it in 1 minute. How do we survive?
- 1. Cache is king - The link is already in the Redis cache. All 10 million requests hit cache.
- 2. No database pressure - The database only saw 1 request (when we first loaded the link into cache).
- 3. Analytics handles it - Clicks go to the Kafka queue. We process them as fast as we can. If we fall behind, that is okay - Kafka stores them.
- 4. Serve from the edge if needed - If the cache cannot keep up, we can return the cached result from CDN edge servers.
FUNCTION create_url_with_rate_limit(long_url, user_id, ip_address):
    STEP 1: Check the rate limit for this IP (for example, 100 new URLs per hour)
    STEP 2: If the limit is exceeded, return HTTP 429 Too Many Requests
    STEP 3: Otherwise, run the safety checks and call create_short_url(long_url, user_id)
Checking for malware and spam
Before shortening any URL, we check it against Google Safe Browsing API. This tells us if the URL is known malware, phishing, or spam. If it is bad, we refuse to shorten it. We also block certain domains entirely (like known spam sites).
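For the rate limit in the pseudocode above, a minimal fixed-window version with Redis could look like this (redis-py, the 100-per-hour limit, and the key format are all illustrative assumptions):

```python
import redis  # assumption: redis-py client

r = redis.Redis()
LIMIT = 100              # assumed policy: max new URLs per IP per hour
WINDOW_SECONDS = 3600

def allow_create(ip_address: str) -> bool:
    """Fixed-window rate limit: count URL creations per IP in the current hour."""
    key = f"rate:{ip_address}"
    count = r.incr(key)
    if count == 1:
        # First request in this window: start the one-hour countdown.
        r.expire(key, WINDOW_SECONDS)
    return count <= LIMIT

# Inside create_url_with_rate_limit: if not allow_create(ip_address), return HTTP 429.
```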
Growing the System Over Time
What to tell the interviewer
This design works great for up to 100 million URLs per day. Let me explain how we would grow it if we need to support even more traffic or users around the world.
How we grow step by step:
Stage 1: Starting out (up to 10 million URLs per day)
- One PostgreSQL database
- One Redis cluster
- A few application servers
- This handles most companies' needs
Stage 2: Growing fast (10-100 million URLs per day)
- Add PostgreSQL read replicas
- Add more Redis nodes (cluster mode)
- Add more application servers behind a load balancer
- Add a CDN for static content
Stage 3: Global scale (100 million+ URLs per day)
- Multiple data centers (US, Europe, Asia)
- Database replication across regions
- Route users to the nearest data center
- This is what bit.ly and TinyURL do
Multi-region setup for global users
Cool features we can add later:
1. Link previews - When someone shares a short link on Twitter or Slack, show a preview of where it goes
- Fetch the title and image from the original page
- Store this metadata with the URL
2. QR codes - Generate a QR code for each short link
- People can scan it with their phone instead of typing
- Good for printed materials like posters and business cards
3. Password-protected links - Let users set a password on a link
- Visitors must enter the password to see the destination
- Good for private content
4. A/B testing - One short link can go to different pages for different users
- 50% go to Page A, 50% go to Page B
- Useful for marketing tests
5. Link editing - Let users change where a short link goes
- Useful if the original page moves
- But be careful - this could be abused
Security consideration for link editing
If users can change where a link goes, bad actors could: (1) Share a safe link, (2) Wait for people to trust it, (3) Change it to a malware site. Solution: Only allow editing within first 24 hours, or require re-verification when changing to a different domain.
Different types of URL shorteners need different focus:
Public shortener (like bit.ly): Focus on speed, analytics, and preventing abuse. Anyone can create links.
Enterprise shortener (internal company links): Focus on access control, integration with company systems, and audit logs.
Marketing shortener (like Rebrandly): Focus on custom domains, detailed analytics, and campaign tracking.
Social media shortener (like Twitter t.co): Focus on safety scanning, preview generation, and extremely high traffic.