Web crawlers are the backbone of search engines. Google’s crawler discovers and indexes billions of web pages, Bing crawls the web for its search results, and countless other crawlers power everything from price comparison sites to academic research tools. In a system design interview, this problem tests your ability to handle massive scale, distributed coordination, and graceful failure handling.
Let’s build one from scratch.
Understanding the Problem
Before writing any architecture, we need to understand what a web crawler actually does and what constraints it operates under.
Functional Requirements
- Crawl billions of web pages across the internet, starting from a set of seed URLs
- Discover new pages by extracting links from crawled content
- Recrawl pages for freshness so our index stays up to date
- Extract content from different formats: HTML, PDF, images, and other media
- Respect robots.txt rules that websites publish to control crawler behavior
- Store crawled content for downstream processing (indexing, analysis)
Non-Functional Requirements
- Scalability: Handle billions of URLs. The web has over 5 billion indexed pages, and the total is much larger.
- Politeness: Don’t overwhelm any single website. A crawler that sends 100 requests/second to one server is effectively a DDoS attack.
- Freshness: Important pages (news sites, stock tickers) should be recrawled more frequently than static pages (documentation, archived content).
- Robustness: Handle malformed HTML, server timeouts, infinite redirect loops, and spider traps gracefully.
- Efficiency: Avoid re-downloading pages we’ve already crawled that haven’t changed.
Scale Estimates
Let’s do some quick math to ground our design:
- Target: 1 billion pages per month
- Pages per second: 1B / (30 days * 86400 sec) = ~385 pages/sec
- Average page size: 500 KB (including images, scripts)
- Storage per month: 1B * 500 KB = 500 TB
- Metadata per URL: ~500 bytes (URL, timestamps, hash, priority)
- URL storage: 10B known URLs * 500 bytes = 5 TB
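As a quick sanity check, here is the same arithmetic as a runnable snippet:

```python
# Back-of-envelope check of the estimates above (Python as a calculator)
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 86_400
print(pages_per_month / seconds_per_month)   # ~385 pages/sec
print(pages_per_month * 500_000 / 1e12)      # 500 TB of content per month
print(10_000_000_000 * 500 / 1e12)           # 5 TB of URL metadata
```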
This is a genuinely large-scale system. Let’s design it.
Core Entities and APIs
Entities
| Entity | Description |
|---|---|
| URL | A normalized web address with metadata (priority, last crawl time, domain) |
| CrawlTask | A unit of work: fetch URL X at time T with priority P |
| Page | Downloaded content with metadata (HTTP status, headers, body, content hash) |
| URLFrontier | The scheduling system that decides which URLs to crawl next |
| DomainQueue | Per-domain queue that enforces politeness (crawl rate limits) |
APIs
```python
# Submit a URL for crawling
def submit_url(url: str, priority: int) -> CrawlTaskId

# Check the status of a crawl task
def get_crawl_status(task_id: CrawlTaskId) -> CrawlStatus

# Retrieve crawled content
def get_page_content(url: str) -> Page
```

These APIs are internal. The crawler is a batch processing system, not a user-facing service.
High-Level Design
The web crawler is fundamentally a pipeline. Each stage has a specific responsibility, and data flows through the system in a loop.
The Crawl Loop
1. Seed URLs bootstrap the process with a curated list of high-quality starting pages (Wikipedia, major news sites, government portals).
2. URL Frontier is the brain of the crawler. It decides which URL to crawl next based on priority and when to crawl it based on politeness rules.
3. DNS Resolver Cache translates domain names to IP addresses. Without caching, DNS lookups become a major bottleneck since each lookup takes 10-200 ms.
4. Fetcher Workers download pages over HTTP in parallel. They respect robots.txt rules and handle timeouts, redirects, and errors.
5. Parser extracts structured content (text, metadata) and discovers new links from the downloaded HTML.
6. Content Store persists the crawled content (S3, HDFS, or similar blob storage).
7. URL Extractor pulls all hyperlinks from the parsed page and normalizes them.
8. Dedup Filter checks each discovered URL against a Bloom filter (have we seen this URL before?) and checks fetched content against a fingerprint store (is this content a near-duplicate of something we already have?).
9. New, unique URLs are fed back into the URL Frontier, and the loop continues (a minimal code sketch of the loop follows below).
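To make the data flow concrete, here is a minimal sketch of the loop in Python. The component interfaces (`frontier`, `fetcher`, `parser`, `content_store`, `dedup`) are hypothetical stand-ins for the pieces described above, not a fixed API:

```python
# A minimal sketch of the crawl loop, assuming the components above exist
# with these hypothetical interfaces.
def crawl_loop(frontier, fetcher, parser, content_store, dedup):
    while True:
        url, domain = frontier.get_next_url()     # priority + politeness
        page = fetcher.fetch(url)                 # robots.txt, DNS cache, HTTP
        if page is None:
            continue                              # timeout or error; frontier handles retries
        text, links = parser.parse(page.body)
        if not dedup.is_duplicate_content(text):  # SimHash near-duplicate check
            content_store.save(url, page)
        for link in links:
            if dedup.is_new_url(link):            # Bloom filter check
                frontier.submit(link)
```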
Deep Dives
URL Frontier: The Heart of the Crawler
The URL Frontier is the most interesting component. It has two conflicting goals:
- Crawl high-priority pages first (priority)
- Don’t overwhelm any single website (politeness)
The solution is a two-tier queue architecture:
Front Queues (Priority)
URLs enter the frontier and are sorted into priority queues:
- P0 (Critical): Breaking news, trending pages, high-PageRank sites
- P1 (High): Popular sites, frequently updated content
- P2 (Normal): Newly discovered links from crawled pages
- P3 (Low): Deep links, rarely changing pages
The prioritizer assigns scores based on:
```python
def compute_priority(url: str, metadata: URLMetadata) -> int:
    score = 0
    score += pagerank_score(url)          # Historical importance
    score += freshness_score(metadata)    # How stale is our copy?
    score += change_frequency(metadata)   # How often does this page change?
    score += domain_authority(url)        # Is this a trusted domain?
    return priority_bucket(score)         # Map to P0-P3
```

Selector (Weighted Round-Robin)
The selector picks URLs from front queues using weighted round-robin. P0 gets 4x the throughput of P3. This ensures important pages are crawled first while still making progress on lower-priority URLs.
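As a sketch, a weighted round-robin schedule can be generated with a simple cycle. The 4/3/2/1 weights here are illustrative, not prescribed by the design:

```python
import itertools

# Illustrative weights: P0 gets 4x the throughput of P3
WEIGHTS = {"P0": 4, "P1": 3, "P2": 2, "P3": 1}

def selection_order():
    # Yields queue names in proportion to their weights
    schedule = [name for name, w in WEIGHTS.items() for _ in range(w)]
    return itertools.cycle(schedule)

order = selection_order()
print([next(order) for _ in range(10)])  # P0 appears 4 times per 10-slot cycle
```

The selector pops from whichever front queue `next(order)` names; empty queues are simply skipped, so lower priorities still make progress.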
Back Queues (Politeness)
After selection, URLs are routed to per-domain back queues. Each domain has exactly one queue, and the system enforces:
- At most one active request per domain at any time
- Minimum delay between requests to the same domain (typically 1-10 seconds)
- Respect the robots.txt Crawl-delay directive
Min-Heap Scheduler
A min-heap tracks the earliest allowed fetch time for each domain. When a fetcher worker is ready, it pops the domain with the earliest timestamp. If that timestamp is in the future, the worker waits.
```python
import heapq
import time
from collections import deque
from typing import Tuple

DEFAULT_DELAY = 1.0  # politeness floor, in seconds

class PolitenessScheduler:
    def __init__(self, robots_cache):
        self.heap = []           # (next_fetch_time, domain)
        self.domain_queues = {}  # domain -> deque of URLs
        self.robots_cache = robots_cache

    def get_next_url(self) -> Tuple[str, str]:
        while True:
            next_time, domain = heapq.heappop(self.heap)
            if time.time() < next_time:
                time.sleep(next_time - time.time())
            queue = self.domain_queues[domain]
            if not queue:
                continue  # no pending URLs; domain is re-added when new URLs arrive
            url = queue.popleft()
            # Schedule the next allowed fetch time for this domain
            delay = self.get_crawl_delay(domain)
            heapq.heappush(self.heap, (time.time() + delay, domain))
            return url, domain

    def get_crawl_delay(self, domain: str) -> float:
        robots_delay = self.robots_cache.get_crawl_delay(domain)
        return max(robots_delay, DEFAULT_DELAY)  # at least 1 second
```

Content Deduplication
The web is full of duplicate content. Mirror sites, syndicated articles, and pages with the same content but different URLs (query parameters, session IDs) are everywhere. Downloading and storing all of them wastes bandwidth and storage.
We use a two-layer deduplication strategy:
Layer 1: URL Deduplication (Bloom Filter)
Before fetching a URL, we check a Bloom filter. A Bloom filter is a space-efficient probabilistic data structure that tells us whether we’ve probably seen a URL before.
```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Common tracking parameters to strip during normalization
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid"}

class URLDeduplicator:
    def __init__(self, expected_items=10_000_000_000, fp_rate=0.01):
        # ~12 GB for 10 billion URLs at a 1% false positive rate.
        # BloomFilter is an assumed interface with add() and contains().
        self.bloom = BloomFilter(expected_items, fp_rate)

    def is_new(self, url: str) -> bool:
        normalized = self.normalize(url)
        if self.bloom.contains(normalized):
            return False  # Probably seen before
        self.bloom.add(normalized)
        return True

    def normalize(self, url: str) -> str:
        parsed = urlparse(url)
        # Lowercase scheme and host; drop default ports (80, 443)
        scheme = parsed.scheme.lower()
        host = (parsed.hostname or "").lower()
        if parsed.port and (scheme, parsed.port) not in (("http", 80), ("https", 443)):
            host = f"{host}:{parsed.port}"
        # Sort query parameters and remove tracking parameters
        query = urlencode(sorted(
            (k, v) for k, v in parse_qsl(parsed.query)
            if k not in TRACKING_PARAMS))
        # Drop the fragment (#section)
        return urlunparse((scheme, host, parsed.path, parsed.params, query, ""))
```

Why a Bloom filter? With 10 billion URLs, a hash set would require ~200 GB of memory. A Bloom filter achieves the same lookup in roughly 12 GB at a 1% false positive rate (about 9.6 bits per URL). The false positives mean we occasionally skip a URL we haven't seen, but that's acceptable since we can't crawl everything anyway.
Layer 2: Content Deduplication (SimHash)
Even after URL dedup, two different URLs can serve identical content. We compute a SimHash fingerprint of the page content and compare it against our fingerprint database.
SimHash is a locality-sensitive hash. Unlike cryptographic hashes (MD5, SHA-256) where a single character change flips ~50% of bits, SimHash produces similar hashes for similar content. Two pages with 95% identical content (same article, different sidebar ads) will have fingerprints that differ by only a few bits.
```python
import hashlib

def tokenize(content: str) -> list:
    # Simple word tokens; production systems use shingles (overlapping n-grams)
    return content.lower().split()

def compute_simhash(content: str, hash_bits: int = 64) -> int:
    v = [0] * hash_bits
    for token in tokenize(content):
        # Stable 64-bit token hash (Python's built-in hash() is salted per process)
        token_hash = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(hash_bits):
            if token_hash & (1 << i):
                v[i] += 1
            else:
                v[i] -= 1
    fingerprint = 0
    for i in range(hash_bits):
        if v[i] > 0:
            fingerprint |= (1 << i)
    return fingerprint

def is_near_duplicate(fp1: int, fp2: int, threshold: int = 3) -> bool:
    # Hamming distance: number of differing bits
    return bin(fp1 ^ fp2).count("1") <= threshold
```

If the Hamming distance between two fingerprints is at most 3 bits, we consider them near-duplicates and skip storing the new page.
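A quick usage check (on short snippets the exact distance depends on tokenization, so treat the output as illustrative):

```python
a = compute_simhash("markets rally on strong earnings reports this morning")
b = compute_simhash("markets rally on strong earnings reports this evening")
print(bin(a ^ b).count("1"))  # small Hamming distance for mostly identical text
```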
Handling Spider Traps
Spider traps are URLs that generate infinite content. A calendar application with URLs like /calendar/2026/04/09, /calendar/2026/04/10, /calendar/2026/04/11… could keep a crawler busy forever. Session-based URLs that mint a fresh URL on every visit are another common trap.
We defend against traps with multiple strategies:
```python
from collections import Counter
from urllib.parse import urlparse

class SpiderTrapDetector:
    MAX_URL_LENGTH = 2048           # Reject absurdly long URLs
    MAX_PATH_DEPTH = 16             # /a/b/c/d/.../16 levels max
    MAX_PAGES_PER_DOMAIN = 500_000  # Hard cap per domain per crawl cycle

    def is_trap(self, url: str, domain_stats: DomainStats) -> bool:
        parsed = urlparse(url)
        # Check URL length
        if len(url) > self.MAX_URL_LENGTH:
            return True
        # Check path depth
        if parsed.path.count('/') > self.MAX_PATH_DEPTH:
            return True
        # Check domain page count
        if domain_stats.pages_crawled > self.MAX_PAGES_PER_DOMAIN:
            return True
        # Detect repeating patterns: /a/b/a/b/a/b
        if self.has_repeating_pattern(parsed.path):
            return True
        return False

    def has_repeating_pattern(self, path: str, max_repeats: int = 2) -> bool:
        # Simple heuristic: a path segment that repeats more than max_repeats
        # times suggests a loop like /a/b/a/b/a/b
        counts = Counter(s for s in path.split('/') if s)
        return any(c > max_repeats for c in counts.values())
```

Crawl Scheduling and Freshness
Not all pages need to be recrawled at the same frequency. A news homepage changes every few minutes; a company’s “About Us” page changes once a year.
We use an adaptive recrawl strategy:
```python
from datetime import timedelta

def compute_recrawl_interval(url: str, history: CrawlHistory) -> timedelta:
    # If the page changed on recent crawls, crawl it more frequently.
    # Page importance matters too: a stale copy of nytimes.com is worse
    # than a stale copy of random-blog.xyz, so weight by importance as well.
    change_rate = history.changes / max(history.total_crawls, 1)
    if change_rate > 0.8:
        return timedelta(hours=1)   # Very dynamic: hourly
    elif change_rate > 0.5:
        return timedelta(hours=12)  # Moderately dynamic: twice daily
    elif change_rate > 0.2:
        return timedelta(days=3)    # Somewhat dynamic: every few days
    else:
        return timedelta(days=30)   # Mostly static: monthly
```

We also use HTTP conditional requests to avoid re-downloading unchanged content:
- If-Modified-Since: Send the last crawl timestamp. Server returns 304 Not Modified if nothing changed.
- ETag / If-None-Match: Send the page’s ETag. Server returns 304 if the ETag still matches.
This saves enormous bandwidth. Studies show that 40-60% of pages don’t change between crawl cycles.
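As a sketch with the `requests` library (the helper name and the `GyanByteBot/1.0` user-agent string are assumptions):

```python
from typing import Optional
import requests

def conditional_fetch(url: str, etag: Optional[str], last_modified: Optional[str]):
    # Hypothetical helper: ask the server to return the body only if it changed
    headers = {"User-Agent": "GyanByteBot/1.0"}
    if etag:
        headers["If-None-Match"] = etag               # ETag from the previous crawl
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # HTTP date of the last crawl
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None  # Not Modified: skip re-parsing and re-storing
    return resp
```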
Distributed Crawling
A single machine can’t crawl billions of pages. We need to distribute the work.
Partitioning Strategy
We partition by domain using consistent hashing. Each crawler node is responsible for a set of domains:
hash("google.com") → Node A
hash("wikipedia.org") → Node B
hash("github.com") → Node A
hash("reddit.com") → Node CThis has a critical benefit: politeness is naturally enforced because all URLs for a given domain are handled by the same node. No cross-node coordination is needed for rate limiting.
Coordination
A central coordinator service handles:
- Node assignment: Which crawler handles which domain ranges
- Health monitoring: Detect and redistribute work from failed nodes
- Global dedup: The Bloom filter is partitioned across nodes using the same consistent hash
- Seed URL distribution: New seed URLs are routed to the correct node
```python
class CrawlerCoordinator:
    def __init__(self, num_nodes: int):
        self.ring = ConsistentHashRing(num_nodes)
        self.node_health = {}

    def assign_url(self, url: str) -> NodeId:
        domain = extract_domain(url)
        return self.ring.get_node(domain)

    def handle_node_failure(self, failed_node: NodeId):
        # Reassign domains from the failed node to healthy nodes
        domains = self.ring.get_domains(failed_node)
        self.ring.remove_node(failed_node)
        # Domains automatically redistribute via consistent hashing:
        # only ~1/N of domains need to move
```

Robots.txt Handling
Each crawler node maintains a cache of robots.txt files for its assigned domains. The cache has a TTL of 24 hours, and we re-fetch robots.txt before it expires.
```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600  # Cache robots.txt for 24 hours

class RobotsCache:
    def __init__(self):
        self.cache = {}  # domain -> (rules, expiry_time)

    def get_rules(self, domain: str) -> RobotFileParser:
        entry = self.cache.get(domain)
        if entry is None or time.time() > entry[1]:
            rules = RobotFileParser(f"https://{domain}/robots.txt")
            rules.read()  # Fetch and parse the file
            self.cache[domain] = (rules, time.time() + ROBOTS_TTL)
        return self.cache[domain][0]

    def is_allowed(self, url: str, user_agent: str = "GyanByteBot") -> bool:
        domain = urlparse(url).netloc
        return self.get_rules(domain).can_fetch(user_agent, url)

    def get_crawl_delay(self, domain: str) -> float:
        delay = self.get_rules(domain).crawl_delay("GyanByteBot")
        return float(delay) if delay else 1.0  # Default: 1 second
```

Putting It All Together
Here’s how a single URL flows through the entire system:
1. A URL is discovered in a crawled page and submitted to the URL Frontier
2. The Dedup Filter checks the Bloom filter. If the URL is already known, it's discarded
3. The Prioritizer assigns a priority based on domain authority, freshness, and change history
4. The URL enters the appropriate Front Queue (P0-P3)
5. The Selector picks it based on weighted round-robin
6. It's placed in the Back Queue for its domain
7. The Min-Heap Scheduler waits until the domain's politeness delay has elapsed
8. A Fetcher Worker checks robots.txt, resolves DNS (from cache), and downloads the page
9. The Parser extracts content and links
10. SimHash checks for content duplicates
11. Unique content is stored; new URLs loop back to step 1
This loop runs continuously across hundreds of crawler nodes, processing thousands of pages per second, gradually building a comprehensive index of the web.
Key Trade-offs
| Decision | Trade-off |
|---|---|
| Bloom filter for URL dedup | Space efficient (~12 GB for 10B URLs) but has false positives (~1%) |
| SimHash for content dedup | Catches near-duplicates but misses completely rewritten duplicates |
| Consistent hashing for distribution | Natural politeness enforcement but hot domains can overload a node |
| Priority queues | Important pages are fresh, but low-priority pages may go stale |
| Adaptive recrawl intervals | Efficient bandwidth use but requires crawl history storage |
The web crawler is a beautiful example of a system where every component has clear responsibilities, the data flow forms a natural loop, and the most interesting engineering is in the scheduling and deduplication layers. When presenting this in an interview, spend most of your time on the URL Frontier design and deduplication strategy, as these are where the real depth lies.
