System Design Masterclass
Infrastructure · web-crawler · distributed-systems · url-frontier · robots-txt · politeness · intermediate

Design Web Crawler

Design a system that can crawl and index billions of web pages

Billions of pages | Similar to Google, Microsoft, Amazon, Apple, Bing, DuckDuckGo, Common Crawl | 45 min read

Summary

A web crawler is like a robot that visits websites and saves what it finds. Google uses crawlers to find and save billions of web pages so you can search them later. The tricky parts are: (1) not visiting the same page twice (there are billions of URLs!), (2) being polite - not hitting the same website too fast, or it will block you, (3) following the rules in robots.txt that tell crawlers what they can and cannot visit, and (4) doing this fast enough to keep up with an internet that adds millions of new pages every day. Companies like Google, Bing, Amazon, and search startups ask this question in interviews.

Key Takeaways

Core Problem

The main job is to visit web pages, save their content, and find new links to visit. Sounds simple, but the internet has billions of pages and you need to avoid visiting the same page twice while being nice to websites.

The Hard Part

How do you know if you already visited a URL? Saving all 50 billion URLs in a plain list would use too much memory. We use a clever trick called a Bloom filter: it might occasionally claim we have seen a URL when we actually have not (a false positive), but it never claims we have not seen a URL that we actually have (no false negatives).

Scaling Axis

We can easily add more crawler machines - each one handles different websites. The tricky part is the shared state: which URLs have we seen? Which websites are we currently crawling? This shared knowledge is the bottleneck.

Critical Invariant

Never hit a website faster than its robots.txt allows. If robots.txt says wait 10 seconds between requests, you MUST wait. Breaking this rule gets your crawler blocked and gives your company a bad name.

Performance Requirement

To crawl 1 billion pages per day, we need to fetch about 11,500 pages per second. Since each page takes about 500ms to download, we need thousands of crawler machines working at the same time.

Key Tradeoff

Freshness vs Coverage: Do we re-visit popular pages often (to keep them fresh) or visit new pages (to increase coverage)? Google re-crawls CNN.com every few minutes but might visit a small blog only once a month.

Design Walkthrough

Problem Statement

The Question: Design a web crawler that can find, download, and store billions of web pages while following the rules and not overloading any website.

What the crawler needs to do (most important first):

  1. Find new pages - Start with some seed URLs (like cnn.com, wikipedia.org) and discover new pages by following links.
  2. Download pages - Fetch the HTML content of each page and save it for later processing.
  3. Follow robots.txt rules - Every website has a robots.txt file that says what crawlers can and cannot visit. We must obey it.
  4. Be polite - Do not hit the same website too fast. Wait between requests to the same domain.
  5. Avoid duplicates - The internet has many links pointing to the same page. Visit each page only once.
  6. Handle traps - Some websites have infinite pages (like calendars that go forever). Do not get stuck.
  7. Stay fresh - Re-visit important pages to catch updates. News sites change every minute; personal blogs change rarely.

What to say first

Let me first understand the scope. Are we building a general web crawler like Google, or a focused crawler for specific sites? How many pages do we need to crawl per day? Do we need to store the full page content or just extract specific data? Once I know this, I can design the right system.

What the interviewer really wants to see:

  • Can you handle the scale? (billions of URLs without running out of memory)
  • Do you know about politeness? (robots.txt, crawl delays)
  • Can you avoid getting stuck? (spider traps, infinite loops)
  • Do you understand the tradeoffs? (freshness vs coverage, breadth vs depth)

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many pages do we need to crawl per day? Are we crawling the whole internet or just specific websites?

Why ask this: A crawler for 1 million pages is very different from one for 1 billion pages.

What interviewers usually say: Assume we want to crawl 1 billion pages per day, covering the whole publicly accessible web.

How this changes your design: At 1 billion pages per day spread across 86,400 seconds, we need about 1,000,000,000 / 86,400 ≈ 11,574 pages per second (call it 11,500). This needs thousands of machines working together.

Question 2: What do we do with the pages?

Do we store the full HTML content? Do we extract text for search indexing? Do we need to render JavaScript?

Why ask this: JavaScript rendering is 10-100x slower and more expensive than simple HTML fetching.

What interviewers usually say: Store full HTML for now. JavaScript rendering can be a follow-up discussion.

How this changes your design: Without JavaScript, we can use simple HTTP fetchers. With JavaScript, we need headless browsers (Chrome without a screen), which are much slower.

Question 3: How fresh should pages be?

How often should we re-crawl pages? Should CNN.com be crawled every minute while personal blogs are crawled monthly?

Why ask this: Re-crawling affects how many total fetches we need to do.

What interviewers usually say: Yes, prioritize freshness for important/changing pages. Assign crawl priorities based on page importance and change frequency.

How this changes your design: We need a priority system and tracking of when each page was last crawled and how often it changes.

Question 4: What about private or illegal content?

Should we crawl pages behind logins? What about content that might be illegal in some countries?

Why ask this: This is a real concern for search engines.

What interviewers usually say: Only crawl public pages (no login required). Assume we have a content moderation system as a separate service.

How this changes your design: We skip any URL that requires authentication. Content filtering happens after crawling.

Summarize your assumptions

Let me summarize: We are building a large-scale web crawler targeting 1 billion pages per day. We store full HTML content, follow robots.txt rules, and prioritize re-crawling based on page importance. We do not render JavaScript in the basic design but can discuss it as an extension.

The Hard Part

Say this to the interviewer

The hardest part of web crawling is remembering what we have already seen. There are 50+ billion web pages on the internet. If we try to store every URL in a simple list, we run out of memory. But we MUST avoid visiting the same page twice, or we waste time and annoy website owners.

Why URL deduplication is tricky (explained simply):

  1. Too many URLs - 50 billion URLs at about 100 bytes each is 5 TB just for the URL strings. That does not fit in memory!
  2. Same page, different URL - These all point to the same page (see the normalization sketch after this list):
     http://example.com/page
     https://example.com/page
     https://www.example.com/page
     https://example.com/page?utm_source=twitter
     https://example.com/page#section1
  3. Need to check fast - We find millions of new URLs every hour. Checking each one against 50 billion stored URLs must be FAST.
  4. Distributed problem - Our crawlers run on thousands of machines. How do they all know what has been seen?
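
A minimal normalization sketch in Python, assuming we treat http/https as equivalent, strip the "www." prefix, drop fragments, and remove common tracking parameters (the TRACKING_PREFIXES list and the choice of rules are illustrative; real crawlers tune these per site):

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_", "fbclid", "gclid")   # assumed list of tracking params

def normalize_url(url):
    parts = urlparse(url.strip())
    # Treat http and https as the same page, lowercase the host, drop "www."
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters but keep meaningful ones (e.g. ?page=2)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not k.lower().startswith(TRACKING_PREFIXES)]
    # Drop the #fragment - it never changes what the server returns
    return urlunparse(("https", host, parts.path or "/", "",
                       urlencode(query), ""))

# All five example URLs above normalize to the same string:
print(normalize_url("http://example.com/page?utm_source=twitter#section1"))
# -> https://example.com/page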

Common mistake candidates make

Many people say: just use a database to store all URLs. This is slow! Checking if a URL exists in a 50 billion row database takes too long. Even with good indexes, this becomes a bottleneck when you are finding 10,000 new URLs per second.

The clever solution: Bloom Filters

A Bloom filter is a space-efficient data structure that can tell you if you have PROBABLY seen something before.

  • How it works: We hash each URL into several positions in a big bit array. To check if we have seen a URL, we check if all those positions are set to 1.
  • The tradeoff: It might say "yes, seen before" when we actually have not (false positive). But it NEVER says "no, not seen" when we actually have (no false negatives).
  • Why this is okay: If we skip a URL we have not seen (false positive), we miss that page. At web scale, missing a few pages is fine. But if we visit the same page twice (false negative), we waste resources. Bloom filters prevent false negatives.
  • Space savings: A Bloom filter for 10 billion URLs with 1% false positive rate needs only about 12 GB of memory. Storing actual URLs would need 1 TB+!

How a Bloom Filter works
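
To make the idea concrete, here is a toy Bloom filter in Python. The sizes are illustrative only; a production filter would be sized for billions of entries and sharded across machines:

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)   # one big bit array

    def _positions(self, url):
        # Derive k positions from k salted hashes of the URL
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, url):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter(num_bits=10_000_000, num_hashes=7)
seen.add("https://example.com/page")
print(seen.probably_seen("https://example.com/page"))   # True
print(seen.probably_seen("https://example.com/other"))  # False (almost certainly)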

The second hard part: Being Polite

Websites do not want you hitting them 1000 times per second. This can crash their servers. Most websites publish a robots.txt file that tells crawlers:

  1. What pages they CAN visit
  2. What pages they should NOT visit
  3. How long to wait between requests (crawl-delay)

Example robots.txt for a website:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
Crawl-delay: 1

This says: Most crawlers should wait 10 seconds between requests and cannot visit /admin/ or /private/. But Google can crawl everything with only 1 second delay (because Google brings traffic!).
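
Python's standard library can check rules like the ones above. A small sketch using urllib.robotparser (which handles Allow/Disallow and Crawl-delay; "MyCrawler" is a placeholder user-agent name, and the results in the comments assume the example file above is actually served at that URL):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()   # fetches and parses the file

# Check the rules for our crawler's user agent
print(robots.can_fetch("MyCrawler", "https://example.com/admin/settings"))
# False under the rules above (Disallow: /admin/ for User-agent: *)
print(robots.can_fetch("MyCrawler", "https://example.com/news/story1"))
# True - not disallowed
print(robots.crawl_delay("MyCrawler"))
# 10 - the Crawl-delay from the "User-agent: *" block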

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Pages to crawl per day | 1 billion | About 11,500 pages per second
Average page size | 500 KB | 500 TB of new data per day
+ 6 more rows...

What to tell the interviewer

At 1 billion pages per day, we need about 11,500 pages per second. Since each download takes ~500ms, we need thousands of parallel connections. The URL frontier (list of URLs to visit) will have billions of entries. We need distributed storage and clever memory tricks like Bloom filters.

How the crawler works (the main loop):

  1. Pick a URL to visit - Get the next URL from the frontier (the to-do list)
  2. Check robots.txt - Is this URL allowed? Load from cache or fetch fresh.
  3. Check politeness - Did we recently visit this domain? If yes, wait.
  4. Download the page - Fetch the HTML content.
  5. Save the content - Store the HTML in our storage system.
  6. Extract new links - Find all links on the page.
  7. Add new links to frontier - After deduplication, add to the to-do list.
  8. Repeat - Go back to step 1.
How many crawlers do we need?
- Target: 11,500 pages per second
- Average download time: 500ms (0.5 seconds)
+ 17 more lines...

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will break this into separate parts - a URL frontier that decides what to crawl next, crawler workers that download pages, a content store that saves the HTML, and a link extractor that finds new URLs. The key insight is that we partition by domain - all URLs from cnn.com go to the same queue, so we can control the crawl rate per site.

Web Crawler - The Big Picture

What each part does and WHY it is separate:

Part | What it does | Why it is separate (what to tell interviewer)
URL Normalizer | Cleans up URLs - removes tracking parameters, fixes http/https, handles www vs non-www | Many URLs look different but point to the same page. Normalizing first reduces duplicates by 20-30%.
Bloom Filter | Quickly checks if we have seen a URL before using very little memory | Checking 50 billion URLs must be FAST. The Bloom filter uses about 62 GB of memory and answers in microseconds. A database would be too slow.
+ 5 more rows...

Common interview question: How do you avoid hitting one site too fast?

Each domain has its own queue in the frontier. When we pick URLs to crawl, we round-robin across domains. If cnn.com says wait 1 second between requests, we track when we last hit cnn.com and wait. Meanwhile, we crawl other domains. One slow domain does not block the whole crawler.
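
A minimal per-domain politeness gate, assuming we simply track the last fetch time for each domain in memory (a real frontier does this inside its back-queue selector rather than sleeping inline):

import time

last_fetch_time = {}   # domain -> unix timestamp of our last request

def wait_for_politeness(domain, crawl_delay_seconds):
    # Sleep just long enough so consecutive requests to one domain
    # are at least crawl_delay_seconds apart.
    now = time.time()
    earliest_allowed = last_fetch_time.get(domain, 0) + crawl_delay_seconds
    if now < earliest_allowed:
        time.sleep(earliest_allowed - now)
    last_fetch_time[domain] = time.time()

# Example: robots.txt for cnn.com asks for a 1-second delay
wait_for_politeness("cnn.com", 1.0)   # returns immediately on first call
wait_for_politeness("cnn.com", 1.0)   # sleeps ~1 second before returning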

Technology Choices - Why we picked these tools:

URL Frontier: Custom or Kafka - Why: Need ordered queues per domain with persistence - Kafka works well because it handles billions of messages and partitions by key (domain)

Bloom Filter: In-memory on dedicated servers - Why: Must be fast (microseconds) and fit in memory - Redis with RedisBloom module, or custom service on machines with lots of RAM

Content Store: S3 or HDFS - Why: Storing 100TB+ per day needs distributed storage - S3 is easier to manage; HDFS is faster for processing

Metadata: PostgreSQL or Cassandra - PostgreSQL for smaller scale (millions of pages) - Cassandra for larger scale (billions of pages) - better write throughput

Message Queue: Kafka - Why: Connects all the pieces reliably - Handles 10,000+ messages per second easily

Important interview tip

Pick technologies YOU know! The interviewer cares more about your reasoning than the specific tool. Say: I would use Kafka because I have used it before, but RabbitMQ or AWS SQS would also work.

URL Frontier Deep Dive

The URL Frontier is the most important part of a web crawler. It is like a smart to-do list that decides what to crawl next.

What to tell the interviewer

The frontier has two jobs: (1) Prioritization - which pages are more important to crawl first? and (2) Politeness - make sure we do not hit any website too fast. I will use a two-level queue design with front queues for priority and back queues for politeness.

The two-level queue design:

Level 1: Front Queues (Priority)
  • Multiple queues, each for a different priority level
  • High priority: Important sites (cnn.com), pages that change often, pages we have not crawled in a long time
  • Low priority: Less important pages, newly discovered URLs from unknown sites

Level 2: Back Queues (Politeness)
  • One queue per domain (all cnn.com URLs in one queue)
  • We only take from a domain queue when enough time has passed since the last request
  • This ensures we never hit a domain too fast

URL Frontier - Two Level Queues

How we decide priority (what makes a page important):

  1. PageRank or similar - Pages linked by many other pages are more important
  2. How often it changes - News sites change every minute; personal blogs change monthly
  3. Time since last crawl - Pages we have not visited in a long time get higher priority
  4. Depth from seed - Pages closer to seed URLs (like homepage) are usually more important
FUNCTION get_next_url_to_crawl():
    // This function picks the next URL to crawl.
    // It must respect politeness (crawl-delay) per domain.
+ 46 more lines...

Handling 200 million domains

We cannot have 200 million queues in memory! Solution: Use a two-tier system. Keep active domains (recently seen) in memory, and store inactive domains on disk. When a URL arrives for an inactive domain, load its queue into memory and evict a less-active one.
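
One way to sketch that two-tier idea: keep the most recently used domain queues in an in-memory LRU and spill the rest to disk. Here disk_store is a hypothetical persistence interface with load/save, and the size limit is an assumed budget:

from collections import OrderedDict

MAX_ACTIVE_DOMAINS = 1_000_000          # assumed in-memory budget
active_queues = OrderedDict()           # domain -> list of URLs, in LRU order

def enqueue_url(domain, url, disk_store):
    if domain in active_queues:
        active_queues.move_to_end(domain)            # mark as recently used
    else:
        # Load this domain's queue from disk (empty list if brand new)
        active_queues[domain] = disk_store.load(domain)
        if len(active_queues) > MAX_ACTIVE_DOMAINS:
            # Evict the least recently used domain's queue back to disk
            cold_domain, cold_queue = active_queues.popitem(last=False)
            disk_store.save(cold_domain, cold_queue)
    active_queues[domain].append(url)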

Politeness and robots.txt

The golden rule of web crawling

Always follow robots.txt rules. Always respect crawl delays. A crawler that breaks these rules will get blocked by websites, get your IP blacklisted, and can even get your company sued. Being a good citizen of the web is not optional!

What is robots.txt?

Every website can have a robots.txt file at the root (like example.com/robots.txt). It tells crawlers:

  1. What you can visit - Allow or Disallow certain paths
  2. How fast you can crawl - Crawl-delay in seconds
  3. Where the sitemap is - A list of all pages on the site
# Example robots.txt for example.com

# Rules for all crawlers
+ 18 more lines...

How we handle robots.txt:

  1. Fetch once, cache for hours - Before crawling any page from a domain, fetch its robots.txt. Cache it for 24 hours (or until it changes).
  2. Parse the rules - Convert robots.txt into a data structure we can query quickly.
  3. Check before every fetch - Before downloading any URL, check if robots.txt allows it.
  4. Respect crawl-delay - Track when we last hit each domain. Wait the required time.
FUNCTION can_crawl(url):
    domain = extract_domain(url)
    path = extract_path(url)
+ 34 more lines...

What if robots.txt fetch fails?

If we cannot fetch robots.txt (timeout, 500 error), we should be conservative and assume we cannot crawl. Try again later. If robots.txt returns 404 (does not exist), it means there are no restrictions - we can crawl everything with a reasonable default delay.

Common politeness rules:

  1. Default delay: If no crawl-delay is specified, use 1 second between requests to the same domain.
  2. Respect 429 responses: If a website returns HTTP 429 (Too Many Requests), back off! Double the delay and try again later (see the backoff sketch after this list).
  3. Track errors per domain: If a domain keeps returning errors, reduce crawl rate or skip it temporarily.
  4. Time of day: Some crawlers slow down during business hours when websites are busy.
  5. Identify yourself: Set a proper User-Agent header so website owners know who is crawling and can contact you.
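
A sketch of the backoff rule from point 2, assuming we keep the current per-domain delay in memory; the one-hour cap is an assumption so a bad afternoon does not freeze a domain forever:

current_delay = {}     # domain -> seconds to wait between requests

DEFAULT_DELAY = 1.0    # rule 1: one second if robots.txt says nothing
MAX_DELAY = 3600.0     # assumed cap: never back off more than an hour

def on_response(domain, status_code):
    delay = current_delay.get(domain, DEFAULT_DELAY)
    if status_code == 429:
        # Too Many Requests: double the delay (exponential backoff)
        current_delay[domain] = min(delay * 2, MAX_DELAY)
    elif 200 <= status_code < 300:
        # Healthy responses let the delay drift back toward the default
        current_delay[domain] = max(delay / 2, DEFAULT_DELAY)

on_response("example.com", 429)   # delay becomes 2 seconds
on_response("example.com", 429)   # delay becomes 4 seconds
on_response("example.com", 200)   # delay eases back to 2 seconds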

Avoiding Spider Traps

Say this to the interviewer

Some websites have pages that go on forever. A calendar might have next month links that never end. A search page might have infinite parameter combinations. These are called spider traps and they can make our crawler run forever on one site, wasting resources.

Types of spider traps:

  1. Infinite calendars - A calendar with "next month" links that goes forever into the future (or past)
  2. Session IDs in URLs - Each visit creates a new URL like /page?session=abc123, /page?session=xyz789
  3. Parameter combinations - /search?color=red&size=small has millions of combinations
  4. Soft 404s - Pages that return HTTP 200 but say "Page not found" - links to non-existent pages generate more non-existent pages
  5. Dynamic content generators - Sites that generate random content on every visit

Example: Calendar Spider Trap

How we detect and avoid traps:

  1. URL length limit - If a URL is longer than 2000 characters, skip it. Long URLs often mean too many parameters.
  2. Depth limit - Do not follow links more than 15-20 levels deep from the homepage.
  3. Max pages per domain - Limit how many pages we crawl from one domain (maybe 1 million max).
  4. Duplicate content detection - If many URLs from the same domain have identical content, stop crawling that pattern.
  5. URL pattern detection - If we see /calendar/2024/01, /calendar/2024/02, /calendar/2024/03... recognize the pattern and limit it.
FUNCTION should_crawl(url, metadata):
    // Check various trap indicators before crawling
    
+ 48 more lines...

Real-world example

Early Google crawlers got stuck on calendar websites that had infinite dates. They added rules to detect date patterns in URLs and limit how many calendar pages to crawl per site. Modern crawlers are much smarter about detecting these patterns.

Data Model and Storage

Now let me show how we organize the data. We have two main types of storage: metadata (information about URLs) and content (the actual HTML pages).

What to tell the interviewer

I will separate metadata storage from content storage. Metadata (URL, last crawl time, priority) goes in a database for fast queries. Content (actual HTML) goes in distributed file storage like S3 because it is huge. This separation lets us optimize each storage for its use case.

Table 1: URLs - Everything we know about each URL

This stores both URLs we have crawled and URLs we plan to crawl.

Column | What it stores | Example
url_hash | Unique ID (hash of normalized URL) | abc123def456
url | The actual URL | https://cnn.com/news/story1
+ 11 more rows...

Why use url_hash as the key?

URLs can be very long (2000+ characters). Using a hash (like SHA-256 truncated to 16 bytes) as the primary key makes lookups faster and uses less storage. We index by both url_hash and domain for different query patterns.
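
For example, a 16-byte key derived from the normalized URL might look like the sketch below (SHA-256 truncated to 128 bits, which keeps accidental collisions negligible even at tens of billions of URLs):

import hashlib

def url_hash(normalized_url):
    # SHA-256 gives 32 bytes; the first 16 are plenty for a unique key
    return hashlib.sha256(normalized_url.encode("utf-8")).digest()[:16]

key = url_hash("https://cnn.com/news/story1")
print(key.hex())   # 32 hex characters, stored as the primary key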

Table 2: Domains - Information about each website

Column | What it stores | Example
domain | The domain name | cnn.com
robots_txt | Cached robots.txt content | User-agent: * ...
+ 7 more rows...

Content Storage: Not in the database!

The actual HTML content is stored separately in distributed file storage:

  • S3 or similar - Cheap, scalable, durable
  • Organized by hash - s3://bucket/{first-2-chars}/{url-hash}.html.gz
  • Compressed - 5x space savings
  • Immutable - We do not update files, we write new versions
Content storage path format:
s3://crawler-content/{date}/{domain-hash}/{url-hash}.html.gz
+ 14 more lines...
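
A sketch of how a crawler worker might build that storage key and compress a page before upload. The layout follows the path format above; the hash lengths are assumptions, and the actual upload call depends on whichever storage client you use:

import gzip
import hashlib
from datetime import datetime, timezone

def content_key(url_hash_hex, domain):
    # Mirrors the path format above: {date}/{domain-hash}/{url-hash}.html.gz
    today = datetime.now(timezone.utc).date().isoformat()
    domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:8]
    return f"{today}/{domain_hash}/{url_hash_hex}.html.gz"

def compress_page(html_text):
    # gzip typically shrinks HTML several-fold, matching the 5x estimate above
    return gzip.compress(html_text.encode("utf-8"))

key = content_key("abc123def456", "cnn.com")
blob = compress_page("<html>...</html>")
# upload blob to s3://crawler-content/{key} with your storage client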

Storage costs add up fast!

At 100 TB per day (compressed), monthly storage is 3 PB. At $0.02/GB/month, that is $60,000/month just for storage! Most crawlers delete old versions after 30-90 days and only keep the latest version of each page.

Distributed Crawling

One machine cannot crawl the whole internet. We need thousands of machines working together. Let me explain how they coordinate.

What to tell the interviewer

I will partition the work by domain. All URLs from cnn.com go to the same crawler group. This makes politeness easy - one group tracks timing for cnn.com. The tricky part is the shared state: Bloom filter and URL frontier must be accessible by all crawlers.

Distributed Crawler Architecture

How we partition the work:

  1. By domain hash - hash(domain) % num_partitions tells us which crawler group handles this domain
  2. Within each group - Multiple crawlers share the work, pulling from the same queue
  3. Politeness is per-partition - Each partition tracks timing only for its domains

The shared Bloom filter challenge:

All crawlers need to check the same Bloom filter. Options:

  1. Centralized service - One service with the Bloom filter in memory. Fast but a single point of failure.
  2. Replicated Bloom filter - Each partition has a copy. Updates broadcast to all. Uses more memory but no single point of failure.
  3. Partitioned Bloom filter - hash(url) % num_partitions tells us which Bloom filter to check. Best of both worlds.
// We have N Bloom filter servers, each holding part of the URL space
NUM_BLOOM_SERVERS = 10
+ 23 more lines...

Handling crawler failures:

What happens if a crawler machine dies?

  1. Heartbeat monitoring - Crawlers send heartbeats every few seconds. If no heartbeat, assume the machine is dead (see the sketch after this list).
  2. Reassign work - URLs assigned to the dead crawler go back to the queue.
  3. Checkpoint progress - Periodically save which URLs are being processed so we can resume.
  4. Idempotent operations - Crawling the same URL twice is wasteful but not wrong. Design for this.
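
A sketch of the heartbeat check from point 1, assuming a coordinator keeps the last heartbeat timestamp per crawler plus a record of which URLs each one has leased (the 30-second timeout and the requeue callback are assumptions):

import time

HEARTBEAT_TIMEOUT = 30.0      # assumed: declare a crawler dead after 30s of silence

last_heartbeat = {}           # crawler_id -> unix timestamp of last heartbeat
leased_urls = {}              # crawler_id -> set of URLs it is currently crawling

def record_heartbeat(crawler_id):
    last_heartbeat[crawler_id] = time.time()

def reap_dead_crawlers(requeue):
    # requeue is a callable that puts a URL back on the frontier
    now = time.time()
    for crawler_id, seen_at in list(last_heartbeat.items()):
        if now - seen_at > HEARTBEAT_TIMEOUT:
            for url in leased_urls.pop(crawler_id, set()):
                requeue(url)               # hand the work to another crawler
            del last_heartbeat[crawler_id]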

Real numbers: Google's crawler

Google's crawler (Googlebot) reportedly uses tens of thousands of machines and crawls billions of pages per day. They have been optimizing this for 25+ years. For most companies, a few hundred crawlers handling millions of pages per day is enough.

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

What breaks | What happens | How we fix it | Why this works
Website is slow or down | Crawler waits forever for a response | Set timeouts (30 seconds max). Mark domain as slow. Reduce crawl rate. | We do not waste time on broken sites. We retry later.
Bloom filter server dies | Cannot check if URL was seen. Might crawl duplicates. | Run 3 replicas. If one dies, others serve. Rebuild from URL database. | Redundancy means no single point of failure.
+ 5 more rows...

Handling the most common errors:

HTTP errors and what to do:

  • 200 OK - Success! Save the content.
  • 301/302 Redirect - Follow the redirect (but limit to 5 redirects max).
  • 403 Forbidden - We are not allowed. Skip this URL.
  • 404 Not Found - Page does not exist. Remove from queue.
  • 429 Too Many Requests - Slow down! Double the delay for this domain.
  • 500 Server Error - Website problem. Retry in an hour.
  • Timeout - Too slow. Mark domain as slow, try again later.

FUNCTION handle_fetch_result(url, response):
    domain = extract_domain(url)
    
+ 43 more lines...

Monitoring is critical

Set up dashboards to watch: pages crawled per second, error rates by type, domains being rate-limited, storage growth, and queue sizes. If any metric looks wrong, you want to know immediately!

Growing the System Over Time

What to tell the interviewer

This design handles billions of pages. Let me explain how we would start small and grow, and what advanced features we could add later.

How we grow step by step:

Stage 1: Starting out (millions of pages)
  • Single URL frontier service
  • 10-50 crawler machines
  • Single Bloom filter server with a replica
  • PostgreSQL for metadata
  • S3 for content storage
  • This handles 10-50 million pages per day

Stage 2: Medium scale (hundreds of millions of pages)
  • Partitioned URL frontier (by domain hash)
  • 100-500 crawler machines
  • Partitioned Bloom filter (10 servers)
  • Cassandra for metadata (better write throughput)
  • Same S3 for content (it scales automatically)

Stage 3: Google scale (billions of pages)
  • Multiple data centers for geographic distribution
  • Thousands of crawlers
  • Distributed Bloom filter with hundreds of partitions
  • Custom storage systems
  • Machine learning for priority and trap detection

Scaling stages

Advanced features we can add:

1. JavaScript rendering

Many modern websites need JavaScript to show content. To crawl these:
  • Use headless browsers (Chrome without a screen)
  • Much slower: 5-10 seconds per page vs 0.5 seconds
  • More expensive: needs more CPU and memory
  • Solution: Only render pages that need it (detect JavaScript-heavy sites)

FUNCTION should_render_javascript(url, initial_html):
    // Check if this page needs JavaScript rendering
    
+ 19 more lines...

2. Smart prioritization with machine learning

Instead of simple rules, use ML to predict:
  • How important is this page? (based on links, domain authority, content type)
  • How often does it change? (learn from past crawls)
  • Is this a spider trap? (detect patterns from training data)

3. Real-time crawling for news

For news sites, we want updates within minutes:
  • Subscribe to RSS feeds and sitemaps (see the sketch below)
  • Use websockets or webhooks if sites offer them
  • Monitor social media for trending URLs
  • Prioritize news domains in the frontier
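
A small sketch of the RSS idea using the third-party feedparser library (an assumed dependency; the feed URLs are placeholders, and frontier.add_high_priority stands in for however your frontier accepts urgent URLs):

import feedparser   # third-party: pip install feedparser

NEWS_FEEDS = [
    "https://example-news-site.com/rss",   # placeholder feed URL
]

def poll_news_feeds(frontier):
    for feed_url in NEWS_FEEDS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            # Push fresh article URLs straight into the high-priority queue
            frontier.add_high_priority(entry.link)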

4. Distributed crawling across regions

Crawl from multiple geographic locations:
  • Faster downloads (closer to the website)
  • Avoid IP blocks (requests come from different IPs)
  • See geo-specific content (some sites show different content by country)

Fun fact: Common Crawl

Common Crawl is a non-profit that crawls the web and makes the data available for free. Their archive is over 250 TB and contains billions of web pages. Many AI companies use Common Crawl data to train their models. You can build a search engine using their data without crawling yourself!

Design Trade-offs

The pros and cons below describe the simplest version of this system - a single-machine crawler - compared with the distributed design above.

Advantages

  • Simple to build and debug
  • No coordination needed
  • Easy to restart and maintain
  • Good for learning or small projects

Disadvantages

  • Cannot scale beyond one machine
  • Single point of failure
  • Limited to maybe 1 million pages per day

When to use

Use for crawling specific sites, personal projects, or prototypes. Many companies never need more than this.