Web crawlers are the backbone of search engines. Google’s crawler discovers and indexes billions of web pages, Bing crawls the web for its search results, and countless other crawlers power everything from price comparison sites to academic research tools. In a system design interview, this problem tests your ability to handle massive scale, distributed coordination, and graceful failure handling.
Let’s build one from scratch.
Understanding the Problem
Before writing any architecture, we need to understand what a web crawler actually does and what constraints it operates under.
Functional Requirements
- Crawl billions of web pages across the internet, starting from a set of seed URLs
- Discover new pages by extracting links from crawled content
- Recrawl pages for freshness so our index stays up to date
- Extract content from different formats: HTML, PDF, images, and other media
- Respect robots.txt rules that websites publish to control crawler behavior
- Store crawled content for downstream processing (indexing, analysis)
Non-Functional Requirements
- Scalability: Handle billions of URLs. The web has over 5 billion indexed pages, and the total is much larger.
- Politeness: Don’t overwhelm any single website. A crawler that sends 100 requests/second to one server is effectively a DDoS attack.
- Freshness: Important pages (news sites, stock tickers) should be recrawled more frequently than static pages (documentation, archived content).
- Robustness: Handle malformed HTML, server timeouts, infinite redirect loops, and spider traps gracefully.
- Efficiency: Avoid re-downloading pages we’ve already crawled that haven’t changed.
Scale Estimates
Let’s do some quick math to ground our design:
- Target: 1 billion pages per month
- Pages per second: 1B / (30 days * 86400 sec) = ~385 pages/sec
- Average page size: 500 KB (including images, scripts)
- Storage per month: 1B * 500 KB = 500 TB
- Metadata per URL: ~500 bytes (URL, timestamps, hash, priority)
- URL storage: 10B known URLs * 500 bytes = 5 TB
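As a quick sanity check, here is the same arithmetic as a runnable snippet:

```python
# Back-of-envelope check of the estimates above (Python as a calculator)
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 86_400
print(pages_per_month / seconds_per_month)   # ~385 pages/sec
print(pages_per_month * 500_000 / 1e12)      # 500 TB of content per month
print(10_000_000_000 * 500 / 1e12)           # 5 TB of URL metadata
```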
This is a genuinely large-scale system. Let’s design it.
Core Entities and APIs
Entities
| Entity | Description |
|---|---|
| URL | A normalized web address with metadata (priority, last crawl time, domain) |
| CrawlTask | A unit of work: fetch URL X at time T with priority P |
| Page | Downloaded content with metadata (HTTP status, headers, body, content hash) |
| URLFrontier | The scheduling system that decides which URLs to crawl next |
| DomainQueue | Per-domain queue that enforces politeness (crawl rate limits) |
APIs
```python
# Submit a URL for crawling
def submit_url(url: str, priority: int) -> CrawlTaskId

# Check the status of a crawl task
def get_crawl_status(task_id: CrawlTaskId) -> CrawlStatus

# Retrieve crawled content
def get_page_content(url: str) -> Page
```

These APIs are internal. The crawler is a batch processing system, not a user-facing service.
High-Level Design
The web crawler is fundamentally a pipeline. Each stage has a specific responsibility, and data flows through the system in a loop.
The Crawl Loop
1. Seed URLs bootstrap the process with a curated list of high-quality starting pages (Wikipedia, major news sites, government portals).
2. URL Frontier is the brain of the crawler. It decides which URL to crawl next based on priority and when to crawl it based on politeness rules.
3. DNS Resolver Cache translates domain names to IP addresses. Without caching, DNS lookups become a major bottleneck since each lookup takes 10-200 ms.
4. Fetcher Workers download pages over HTTP in parallel. They respect robots.txt rules and handle timeouts, redirects, and errors.
5. Parser extracts structured content (text, metadata) and discovers new links from the downloaded HTML.
6. Content Store persists the crawled content (S3, HDFS, or similar blob storage).
7. URL Extractor pulls all hyperlinks from the parsed page and normalizes them.
8. Dedup Filter checks each discovered URL against a Bloom filter (have we seen this URL before?) and checks fetched content against a fingerprint store (is this content a near-duplicate of something we already have?).
9. New, unique URLs are fed back into the URL Frontier, and the loop continues (a minimal code sketch of the loop follows below).
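To make the data flow concrete, here is a minimal sketch of the loop in Python. The component interfaces (`frontier`, `fetcher`, `parser`, `content_store`, `dedup`) are hypothetical stand-ins for the pieces described above, not a fixed API:

```python
# A minimal sketch of the crawl loop, assuming the components above exist
# with these hypothetical interfaces.
def crawl_loop(frontier, fetcher, parser, content_store, dedup):
    while True:
        url, domain = frontier.get_next_url()     # priority + politeness
        page = fetcher.fetch(url)                 # robots.txt, DNS cache, HTTP
        if page is None:
            continue                              # timeout or error; frontier handles retries
        text, links = parser.parse(page.body)
        if not dedup.is_duplicate_content(text):  # SimHash near-duplicate check
            content_store.save(url, page)
        for link in links:
            if dedup.is_new_url(link):            # Bloom filter check
                frontier.submit(link)
```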
Deep Dives
URL Frontier: The Heart of the Crawler
The URL Frontier is the most interesting component. It has two conflicting goals:
- Crawl high-priority pages first (priority)
- Don’t overwhelm any single website (politeness)
The solution is a two-tier queue architecture:
Front Queues (Priority)
URLs enter the frontier and are sorted into priority queues:
- P0 (Critical): Breaking news, trending pages, high-PageRank sites
- P1 (High): Popular sites, frequently updated content
- P2 (Normal): Newly discovered links from crawled pages
- P3 (Low): Deep links, rarely changing pages
The prioritizer assigns scores based on:
```python
def compute_priority(url: str, metadata: URLMetadata) -> int:
    score = 0
    score += pagerank_score(url)          # Historical importance
    score += freshness_score(metadata)    # How stale is our copy?
    score += change_frequency(metadata)   # How often does this page change?
    score += domain_authority(url)        # Is this a trusted domain?
    return priority_bucket(score)         # Map to P0-P3
```

Selector (Weighted Round-Robin)
The selector picks URLs from front queues using weighted round-robin. P0 gets 4x the throughput of P3. This ensures important pages are crawled first while still making progress on lower-priority URLs.
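As a sketch, a weighted round-robin schedule can be generated with a simple cycle. The 4/3/2/1 weights here are illustrative, not prescribed by the design:

```python
import itertools

# Illustrative weights: P0 gets 4x the throughput of P3
WEIGHTS = {"P0": 4, "P1": 3, "P2": 2, "P3": 1}

def selection_order():
    # Yields queue names in proportion to their weights
    schedule = [name for name, w in WEIGHTS.items() for _ in range(w)]
    return itertools.cycle(schedule)

order = selection_order()
print([next(order) for _ in range(10)])  # P0 appears 4 times per 10-slot cycle
```

The selector pops from whichever front queue `next(order)` names; empty queues are simply skipped, so lower priorities still make progress.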
Back Queues (Politeness)
After selection, URLs are routed to per-domain back queues. Each domain has exactly one queue, and the system enforces:
- At most one active request per domain at any time
- Minimum delay between requests to the same domain (typically 1-10 seconds)
- Respect the robots.txt Crawl-delay directive
Min-Heap Scheduler
A min-heap tracks the earliest allowed fetch time for each domain. When a fetcher worker is ready, it pops the domain with the earliest timestamp. If that timestamp is in the future, the worker waits.
```python
import heapq
import time
from collections import deque
from typing import Tuple

DEFAULT_DELAY = 1.0  # politeness floor, in seconds

class PolitenessScheduler:
    def __init__(self, robots_cache):
        self.heap = []           # (next_fetch_time, domain)
        self.domain_queues = {}  # domain -> deque of URLs
        self.robots_cache = robots_cache

    def get_next_url(self) -> Tuple[str, str]:
        while True:
            next_time, domain = heapq.heappop(self.heap)
            if time.time() < next_time:
                time.sleep(next_time - time.time())
            queue = self.domain_queues[domain]
            if not queue:
                continue  # no pending URLs; domain is re-added when new URLs arrive
            url = queue.popleft()
            # Schedule the next allowed fetch time for this domain
            delay = self.get_crawl_delay(domain)
            heapq.heappush(self.heap, (time.time() + delay, domain))
            return url, domain

    def get_crawl_delay(self, domain: str) -> float:
        robots_delay = self.robots_cache.get_crawl_delay(domain)
        return max(robots_delay, DEFAULT_DELAY)  # at least 1 second
```

Content Deduplication
The web is full of duplicate content. Mirror sites, syndicated articles, and pages with the same content but different URLs (query parameters, session IDs) are everywhere. Downloading and storing all of them wastes bandwidth and storage.
We use a two-layer deduplication strategy:
Layer 1: URL Deduplication (Bloom Filter)
Before fetching a URL, we check a Bloom filter. A Bloom filter is a space-efficient probabilistic data structure that tells us whether we’ve probably seen a URL before.
```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Common tracking parameters to strip during normalization
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid"}

class URLDeduplicator:
    def __init__(self, expected_items=10_000_000_000, fp_rate=0.01):
        # ~12 GB for 10 billion URLs at a 1% false positive rate.
        # BloomFilter is an assumed interface with add() and contains().
        self.bloom = BloomFilter(expected_items, fp_rate)

    def is_new(self, url: str) -> bool:
        normalized = self.normalize(url)
        if self.bloom.contains(normalized):
            return False  # Probably seen before
        self.bloom.add(normalized)
        return True

    def normalize(self, url: str) -> str:
        parsed = urlparse(url)
        # Lowercase scheme and host; drop default ports (80, 443)
        scheme = parsed.scheme.lower()
        host = (parsed.hostname or "").lower()
        if parsed.port and (scheme, parsed.port) not in (("http", 80), ("https", 443)):
            host = f"{host}:{parsed.port}"
        # Sort query parameters and remove tracking parameters
        query = urlencode(sorted(
            (k, v) for k, v in parse_qsl(parsed.query)
            if k not in TRACKING_PARAMS))
        # Drop the fragment (#section)
        return urlunparse((scheme, host, parsed.path, parsed.params, query, ""))
```

Why a Bloom filter? With 10 billion URLs, a hash set would require ~200 GB of memory. A Bloom filter achieves the same lookup in roughly 12 GB at a 1% false positive rate (about 9.6 bits per URL). The false positives mean we occasionally skip a URL we haven't seen, but that's acceptable since we can't crawl everything anyway.
Layer 2: Content Deduplication (SimHash)
Even after URL dedup, two different URLs can serve identical content. We compute a SimHash fingerprint of the page content and compare it against our fingerprint database.
SimHash is a locality-sensitive hash. Unlike cryptographic hashes (MD5, SHA-256) where a single character change flips ~50% of bits, SimHash produces similar hashes for similar content. Two pages with 95% identical content (same article, different sidebar ads) will have fingerprints that differ by only a few bits.
```python
import hashlib

def tokenize(content: str) -> list:
    # Simple word tokens; production systems use shingles (overlapping n-grams)
    return content.lower().split()

def compute_simhash(content: str, hash_bits: int = 64) -> int:
    v = [0] * hash_bits
    for token in tokenize(content):
        # Stable 64-bit token hash (Python's built-in hash() is salted per process)
        token_hash = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(hash_bits):
            if token_hash & (1 << i):
                v[i] += 1
            else:
                v[i] -= 1
    fingerprint = 0
    for i in range(hash_bits):
        if v[i] > 0:
            fingerprint |= (1 << i)
    return fingerprint

def is_near_duplicate(fp1: int, fp2: int, threshold: int = 3) -> bool:
    # Hamming distance: number of differing bits
    return bin(fp1 ^ fp2).count("1") <= threshold
```

If the Hamming distance between two fingerprints is at most 3 bits, we consider them near-duplicates and skip storing the new page.
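A quick usage check (on short snippets the exact distance depends on tokenization, so treat the output as illustrative):

```python
a = compute_simhash("markets rally on strong earnings reports this morning")
b = compute_simhash("markets rally on strong earnings reports this evening")
print(bin(a ^ b).count("1"))  # small Hamming distance for mostly identical text
```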
Handling Spider Traps
Spider traps are URLs that generate infinite content. A calendar application with URLs like /calendar/2026/04/09, /calendar/2026/04/10, /calendar/2026/04/11… could keep a crawler busy forever. Session-based URLs that mint a fresh URL on every visit are another common trap.
We defend against traps with multiple strategies:
```python
from collections import Counter
from urllib.parse import urlparse

class SpiderTrapDetector:
    MAX_URL_LENGTH = 2048           # Reject absurdly long URLs
    MAX_PATH_DEPTH = 16             # /a/b/c/d/.../16 levels max
    MAX_PAGES_PER_DOMAIN = 500_000  # Hard cap per domain per crawl cycle

    def is_trap(self, url: str, domain_stats: DomainStats) -> bool:
        parsed = urlparse(url)
        # Check URL length
        if len(url) > self.MAX_URL_LENGTH:
            return True
        # Check path depth
        if parsed.path.count('/') > self.MAX_PATH_DEPTH:
            return True
        # Check domain page count
        if domain_stats.pages_crawled > self.MAX_PAGES_PER_DOMAIN:
            return True
        # Detect repeating patterns: /a/b/a/b/a/b
        if self.has_repeating_pattern(parsed.path):
            return True
        return False

    def has_repeating_pattern(self, path: str, max_repeats: int = 2) -> bool:
        # Simple heuristic: a path segment that repeats more than max_repeats
        # times suggests a loop like /a/b/a/b/a/b
        counts = Counter(s for s in path.split('/') if s)
        return any(c > max_repeats for c in counts.values())
```

Crawl Scheduling and Freshness
Not all pages need to be recrawled at the same frequency. A news homepage changes every few minutes; a company’s “About Us” page changes once a year.
We use an adaptive recrawl strategy:
```python
from datetime import timedelta

def compute_recrawl_interval(url: str, history: CrawlHistory) -> timedelta:
    # If the page changed on recent crawls, crawl it more frequently.
    # Page importance matters too: a stale copy of nytimes.com is worse
    # than a stale copy of random-blog.xyz, so weight by importance as well.
    change_rate = history.changes / max(history.total_crawls, 1)
    if change_rate > 0.8:
        return timedelta(hours=1)   # Very dynamic: hourly
    elif change_rate > 0.5:
        return timedelta(hours=12)  # Moderately dynamic: twice daily
    elif change_rate > 0.2:
        return timedelta(days=3)    # Somewhat dynamic: every few days
    else:
        return timedelta(days=30)   # Mostly static: monthly
```

We also use HTTP conditional requests to avoid re-downloading unchanged content:
- If-Modified-Since: Send the last crawl timestamp. Server returns 304 Not Modified if nothing changed.
- ETag / If-None-Match: Send the page’s ETag. Server returns 304 if the ETag still matches.
This saves enormous bandwidth. Studies show that 40-60% of pages don’t change between crawl cycles.
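As a sketch with the `requests` library (the helper name and the `GyanByteBot/1.0` user-agent string are assumptions):

```python
from typing import Optional
import requests

def conditional_fetch(url: str, etag: Optional[str], last_modified: Optional[str]):
    # Hypothetical helper: ask the server to return the body only if it changed
    headers = {"User-Agent": "GyanByteBot/1.0"}
    if etag:
        headers["If-None-Match"] = etag               # ETag from the previous crawl
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # HTTP date of the last crawl
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None  # Not Modified: skip re-parsing and re-storing
    return resp
```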
Distributed Crawling
A single machine can’t crawl billions of pages. We need to distribute the work.
Partitioning Strategy
We partition by domain using consistent hashing. Each crawler node is responsible for a set of domains:
hash("google.com") → Node A
hash("wikipedia.org") → Node B
hash("github.com") → Node A
hash("reddit.com") → Node CThis has a critical benefit: politeness is naturally enforced because all URLs for a given domain are handled by the same node. No cross-node coordination is needed for rate limiting.
Coordination
A central coordinator service handles:
- Node assignment: Which crawler handles which domain ranges
- Health monitoring: Detect and redistribute work from failed nodes
- Global dedup: The Bloom filter is partitioned across nodes using the same consistent hash
- Seed URL distribution: New seed URLs are routed to the correct node
```python
class CrawlerCoordinator:
    def __init__(self, num_nodes: int):
        self.ring = ConsistentHashRing(num_nodes)
        self.node_health = {}

    def assign_url(self, url: str) -> NodeId:
        domain = extract_domain(url)
        return self.ring.get_node(domain)

    def handle_node_failure(self, failed_node: NodeId):
        # Reassign domains from the failed node to healthy nodes
        domains = self.ring.get_domains(failed_node)
        self.ring.remove_node(failed_node)
        # Domains automatically redistribute via consistent hashing:
        # only ~1/N of domains need to move
```

Robots.txt Handling
Each crawler node maintains a cache of robots.txt files for its assigned domains. The cache has a TTL of 24 hours, and we re-fetch robots.txt before it expires.
```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600  # Cache robots.txt for 24 hours

class RobotsCache:
    def __init__(self):
        self.cache = {}  # domain -> (rules, expiry_time)

    def get_rules(self, domain: str) -> RobotFileParser:
        entry = self.cache.get(domain)
        if entry is None or time.time() > entry[1]:
            rules = RobotFileParser(f"https://{domain}/robots.txt")
            rules.read()  # Fetch and parse the file
            self.cache[domain] = (rules, time.time() + ROBOTS_TTL)
        return self.cache[domain][0]

    def is_allowed(self, url: str, user_agent: str = "GyanByteBot") -> bool:
        domain = urlparse(url).netloc
        return self.get_rules(domain).can_fetch(user_agent, url)

    def get_crawl_delay(self, domain: str) -> float:
        delay = self.get_rules(domain).crawl_delay("GyanByteBot")
        return float(delay) if delay else 1.0  # Default: 1 second
```

Putting It All Together
Here’s how a single URL flows through the entire system:
1. A URL is discovered in a crawled page and submitted to the URL Frontier
2. The Dedup Filter checks the Bloom filter. If the URL is already known, it's discarded
3. The Prioritizer assigns a priority based on domain authority, freshness, and change history
4. The URL enters the appropriate Front Queue (P0-P3)
5. The Selector picks it based on weighted round-robin
6. It's placed in the Back Queue for its domain
7. The Min-Heap Scheduler waits until the domain's politeness delay has elapsed
8. A Fetcher Worker checks robots.txt, resolves DNS (from cache), and downloads the page
9. The Parser extracts content and links
10. SimHash checks for content duplicates
11. Unique content is stored; new URLs loop back to step 1
This loop runs continuously across hundreds of crawler nodes, processing thousands of pages per second, gradually building a comprehensive index of the web.
Key Trade-offs
| Decision | Trade-off |
|---|---|
| Bloom filter for URL dedup | Space efficient (~12 GB for 10B URLs) but has false positives (~1%) |
| SimHash for content dedup | Catches near-duplicates but misses completely rewritten duplicates |
| Consistent hashing for distribution | Natural politeness enforcement but hot domains can overload a node |
| Priority queues | Important pages are fresh, but low-priority pages may go stale |
| Adaptive recrawl intervals | Efficient bandwidth use but requires crawl history storage |
The web crawler is a beautiful example of a system where every component has clear responsibilities, the data flow forms a natural loop, and the most interesting engineering is in the scheduling and deduplication layers. When presenting this in an interview, spend most of your time on the URL Frontier design and deduplication strategy, as these are where the real depth lies.
