A search engine is one of the most complex distributed systems ever built. Google processes over 8.5 billion searches per day, indexing hundreds of billions of web pages across a petabyte-scale inverted index. In an interview, you won’t design the full Google — but you need to demonstrate mastery of the core components: crawling, indexing, query processing, and ranking.
Understanding the Problem
Functional Requirements
- Web search by keywords — User types a query, gets ranked results
- Autocomplete suggestions — As-you-type query suggestions
- Paginated results — 10 results per page, with next/previous
- Spell correction — “Did you mean: running shoes”
- Snippet generation — Show relevant text excerpt from each result
- Freshness — New content indexed within hours, not days
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Query latency (p99) | < 200ms |
| Index freshness | New pages indexed within 2-4 hours |
| Index size | Billions of documents |
| Query throughput | 100K+ QPS |
| Availability | 99.99% |
| Relevance | First-page results satisfy 90%+ of queries |
Scale Estimation
- 100 billion indexed pages, average 50KB each = 5 PB raw content
- Inverted index is ~10% of raw content = 500 TB index data
- 100K QPS at peak, each query fanning out to ~100 index servers = 10M internal RPCs/sec
- Crawl rate: ~1 billion pages/day to maintain freshness
Core Entities and APIs
Data Model
-- Document metadata (stored separately from index)
CREATE TABLE documents (
doc_id BIGINT PRIMARY KEY,
url TEXT UNIQUE,
title VARCHAR(500),
content_hash CHAR(64), -- SHA-256 for dedup
page_rank FLOAT DEFAULT 0.0,
crawled_at TIMESTAMP,
indexed_at TIMESTAMP
);
-- The inverted index is NOT in SQL — it's a custom
-- data structure optimized for term lookups.
-- Conceptually:
-- term → [(doc_id, term_frequency, [positions]), ...]
API Design
# Search
GET /api/v1/search?q=distributed+systems&page=1&size=10
response:
total_results: 142000000
results:
- title: "Designing Data-Intensive Applications"
url: "https://example.com/ddia"
snippet: "A guide to the big ideas behind reliable, scalable..."
score: 0.94
- ...
suggestions:
spell_correction: null
related_searches: ["distributed systems book", "system design"]
# Autocomplete
GET /api/v1/suggest?prefix=distrib
response:
suggestions:
- "distributed systems"
- "distributed computing"
- "distributed database"
- "distributed lock"
# Index (internal API, not public)
POST /internal/v1/index
body: { url, title, content, crawled_at }
High-Level Design
A search engine has two distinct pipelines that share the inverted index.
Indexing Pipeline (Offline/Batch)
- Web Crawler — Fetches pages from the internet using BFS. A URL frontier manages priority (important sites first) and politeness (don’t DDoS anyone); a minimal frontier sketch follows this list.
- Document Processor — Parses HTML, extracts text, detects language, deduplicates via SimHash/MinHash.
- Indexer — Tokenizes text, applies stemming, builds inverted index entries, computes term frequencies.
- Index Storage — Writes sharded inverted index to disk. Computes PageRank from the link graph.
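To make priority + politeness concrete, here is a minimal single-process frontier sketch. The class shape and the one-second per-host delay are illustrative assumptions; a real crawler uses distributed per-host queues, robots.txt caching, and adaptive delays:
import heapq
import time
from urllib.parse import urlparse

class URLFrontier:
    """Toy frontier: highest-priority URL wins, but never hit the same
    host more often than once per `per_host_delay` seconds."""
    def __init__(self, per_host_delay=1.0):
        self.heap = []            # (-priority, url)
        self.next_allowed = {}    # host -> earliest allowed fetch time
        self.per_host_delay = per_host_delay

    def add(self, url, priority):
        heapq.heappush(self.heap, (-priority, url))

    def pop_ready(self):
        # Skip (and re-queue) URLs whose host is still rate-limited
        deferred, chosen = [], None
        while self.heap:
            prio, url = heapq.heappop(self.heap)
            host = urlparse(url).netloc
            if time.time() >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = time.time() + self.per_host_delay
                chosen = url
                break
            deferred.append((prio, url))
        for item in deferred:
            heapq.heappush(self.heap, item)
        return chosen  # None if every queued host is rate-limited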
Query Pipeline (Online/Real-Time)
- Query Parser — Tokenizes the query, applies spell check, removes stop words, stems terms.
- Index Lookup — Scatters the query to relevant index shards, gathers posting lists.
- Scorer/Ranker — Computes relevance score per document using BM25 + PageRank + freshness + ML signals.
- Result Builder — Fetches document metadata (title, URL), generates snippets, formats the response.
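As a deliberately naive illustration of snippet generation, the sketch below slides a fixed window over the raw text and keeps the window with the most query-term hits. The window and stride sizes are arbitrary assumptions; production snippeters work on tokenized, position-indexed text and respect sentence boundaries:
def make_snippet(content, query_terms, window=160, stride=40):
    """Return the ~160-character excerpt containing the most query terms.
    query_terms are assumed to be lowercased already."""
    text = content.lower()
    best_start, best_hits = 0, -1
    for start in range(0, max(1, len(text) - window), stride):
        chunk = text[start:start + window]
        hits = sum(chunk.count(term) for term in query_terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return content[best_start:best_start + window].strip() + "..."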
Deep Dive: Inverted Index
The inverted index is the heart of any search engine. It maps every term to the list of documents containing that term.
Structure
Term Dictionary:
"distributed" → PostingList
"system" → PostingList
"design" → PostingList
...
PostingList for "distributed":
[
{ doc_id: 1042, tf: 5, positions: [3, 15, 42, 89, 123] },
{ doc_id: 2891, tf: 2, positions: [7, 56] },
{ doc_id: 5420, tf: 8, positions: [1, 12, 28, 33, 67, 91, 104, 118] },
...
]
Each posting list entry contains:
- doc_id — Which document contains this term
- tf (term frequency) — How many times the term appears
- positions — Where in the document the term appears (enables phrase queries like “distributed systems”)
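To see why positions matter, here is a minimal phrase-check sketch for a two-term phrase; the dict-of-position-lists shape is illustrative, not the actual on-disk format:
def phrase_match(postings_a, postings_b):
    """Return doc_ids where term B appears immediately after term A.
    postings_*: dict of doc_id -> sorted list of positions."""
    matching_docs = []
    for doc_id in postings_a.keys() & postings_b.keys():
        positions_b = set(postings_b[doc_id])
        # For "distributed systems": B's position must be A's position + 1
        if any(p + 1 in positions_b for p in postings_a[doc_id]):
            matching_docs.append(doc_id)
    return matching_docs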
Building the Index
from collections import defaultdict, namedtuple

# A posting entry records where and how often a term occurs in one document
PostingEntry = namedtuple("PostingEntry", ["doc_id", "tf", "positions"])

def build_inverted_index(documents):
"""
Build an in-memory inverted index from a batch of documents.
In production, this writes to a sorted on-disk format (SSTable-like).
"""
index = defaultdict(list) # term → list of PostingEntry
for doc in documents:
tokens = tokenize(doc.content) # split on whitespace/punctuation
tokens = [t.lower() for t in tokens] # lowercase
tokens = remove_stop_words(tokens) # remove "the", "is", "a"
tokens = [stem(t) for t in tokens] # "running" → "run"
# Count term frequencies and record positions
term_positions = defaultdict(list)
for pos, token in enumerate(tokens):
term_positions[token].append(pos)
for term, positions in term_positions.items():
index[term].append(PostingEntry(
doc_id=doc.id,
tf=len(positions),
positions=positions
))
# Sort posting lists by doc_id for efficient intersection
for term in index:
index[term].sort(key=lambda e: e.doc_id)
    return index
Sharding the Index
There are two fundamental strategies for distributing the index across machines:
Document-Partitioned (what Elasticsearch uses):
- Each shard holds a complete index for a subset of documents
- Every query is broadcast to all shards (scatter-gather)
- Each shard returns its local top-K, coordinator merges
- Simpler to rebalance, easier to add capacity
Term-Partitioned:
- Each shard owns a subset of terms (e.g., A-M, N-Z)
- A single-term query hits exactly one shard
- Multi-term queries hit multiple shards, then intersect
- Better for single-term lookups, harder to manage
For most interviews, go with document-partitioned sharding. It’s what Elasticsearch, Solr, and most modern search systems use. The scatter-gather overhead is acceptable because each shard responds in parallel.
# Scatter-gather pseudocode
import asyncio
import heapq
async def search(query, num_shards=1000, top_k=10):
# Parse and prepare
terms = parse_query(query)
# Scatter: send to all shards in parallel
futures = []
for shard_id in range(num_shards):
futures.append(
shard_clients[shard_id].search(terms, top_k=top_k)
)
# Gather: collect local top-K from each shard
shard_results = await asyncio.gather(*futures)
# Merge: global top-K from all shard results
heap = []
for results in shard_results:
for doc_id, score in results:
heapq.heappush(heap, (score, doc_id))
if len(heap) > top_k:
heapq.heappop(heap)
    return sorted(heap, reverse=True)
Deep Dive: Query Processing Pipeline
A raw user query goes through several transformations before it hits the index.
Step 1: Tokenization
Split the raw query into individual terms, normalize case, handle special characters.
import re

def tokenize(query):
# "New York restaurants -fast-food" →
# ["new", "york", "restaurants", "-fast-food"]
tokens = re.split(r'\s+', query.lower().strip())
    return tokens
Step 2: Spell Correction
Check each term against a dictionary. For unknown terms, find the closest known term using edit distance or a pre-built n-gram index.
def spell_check(term, dictionary, max_edit_distance=2):
if term in dictionary:
return term
# Generate candidates within edit distance
candidates = []
for dict_term in dictionary.get_candidates(term[:3]): # prefix filter
dist = levenshtein_distance(term, dict_term)
if dist <= max_edit_distance:
candidates.append((dict_term, dist, dictionary.frequency(dict_term)))
if not candidates:
return term # Unknown term, keep as-is
# Rank by (edit_distance ASC, frequency DESC)
candidates.sort(key=lambda c: (c[1], -c[2]))
return candidates[0][0]
# "runing" → edit distance 1 from "running" → correctStep 3: Stemming
Reduce words to their root form so that “running”, “runs”, “ran” all match documents containing any of these forms.
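In practice you would call a library stemmer rather than hand-roll one. For instance, assuming the nltk package is available, its Porter implementation reproduces the examples below:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "computers", "arguing", "agreed"]:
    print(word, "→", stemmer.stem(word))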
# Porter Stemmer examples:
# "running" → "run"
# "computers" → "comput"
# "arguing" → "argu"
# "agreed" → "agre"Step 4: Query Expansion
Add synonyms and related terms to improve recall:
def expand_query(terms):
    # Keep original terms at full weight; add synonyms at reduced weight
    expanded = [(term, 1.0) for term in terms]
    for term in terms:
        synonyms = synonym_dict.get(term, [])
        # Add top 2 synonyms with reduced weight
        for syn in synonyms[:2]:
            expanded.append((syn, 0.5))
    return expanded
# "shoes" → [("shoes", 1.0), ("footwear", 0.5), ("sneakers", 0.5)]
Step 5: Index Lookup (Scatter-Gather)
The parsed query is broadcast to all index shards. Each shard performs a local search:
- Look up each query term in the shard’s inverted index
- Intersect posting lists (for AND queries) or merge them (for OR queries); see the two-pointer sketch after this list
- Score each matching document locally
- Return the top-K results to the coordinator
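Because posting lists are stored sorted by doc_id (see build_inverted_index above), the AND intersection is a linear two-pointer merge; a minimal sketch:
def intersect(postings_a, postings_b):
    """Two-pointer intersection of posting lists sorted by doc_id."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i].doc_id == postings_b[j].doc_id:
            result.append(postings_a[i].doc_id)
            i += 1
            j += 1
        elif postings_a[i].doc_id < postings_b[j].doc_id:
            i += 1
        else:
            j += 1
    return result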
Step 6: Scoring with BM25
BM25 (Best Matching 25) is the industry-standard relevance scoring function. It’s an improvement over TF-IDF that handles document length normalization:
import math
def bm25_score(query_terms, doc, index, corpus_stats):
"""
BM25 scoring for a single document against query terms.
k1 = 1.2 (term frequency saturation)
b = 0.75 (document length normalization)
"""
k1 = 1.2
b = 0.75
score = 0.0
N = corpus_stats.total_docs
avgdl = corpus_stats.avg_doc_length
dl = doc.length # number of terms in this document
for term in query_terms:
# df = number of documents containing this term
df = index.doc_frequency(term)
# IDF component: rare terms are more important
idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
# TF component with saturation and length normalization
tf = doc.term_frequency(term)
tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
score += idf * tf_norm
    return score
BM25 captures two crucial intuitions:
- IDF: A term appearing in 90% of documents (“the”) is nearly worthless. A term appearing in 0.01% of documents (“kubernetes”) is highly discriminative.
- Saturating TF: Mentioning “database” 50 times in a document doesn’t make it 50x more relevant than mentioning it once. The score saturates around tf=5-10.
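You can verify the saturation directly from the TF formula above (taking dl equal to avgdl so the length term drops out):
k1 = 1.2
for tf in (1, 5, 50):
    print(tf, round(tf * (k1 + 1) / (tf + k1), 2))
# 1 → 1.0, 5 → 1.77, 50 → 2.15: 50x the mentions, only ~2x the score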
Step 7: Combining Signals
BM25 alone isn’t sufficient. Modern search engines combine multiple signals:
def final_score(doc, query, bm25):
return (
0.40 * bm25 + # Text relevance
0.25 * doc.page_rank + # Link authority
0.15 * freshness_score(doc.crawled_at) + # Newer is better
0.10 * click_through_rate(doc, query) + # User engagement
0.10 * domain_authority(doc.url) # Trust signal
)
PageRank measures a page’s authority based on the link graph. A page linked to by many important pages gets a high PageRank. It’s computed offline during the indexing pipeline.
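For reference, here is a single-machine power-iteration sketch of PageRank. The damping factor of 0.85 is the conventional choice; production versions run as distributed batch jobs over the full link graph:
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links[i] = list of page indices that page i links to."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for page, links in enumerate(out_links):
            if links:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages
                for target in range(n):
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Three pages in a cycle (0 → 1 → 2 → 0): ranks converge to 1/3 each
print(pagerank([[1], [2], [0]]))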
Deep Dive: Autocomplete
Autocomplete must return suggestions within 50-100ms of each keystroke. The data structure of choice is a Trie (prefix tree) combined with pre-computed top-K results per prefix.
class TrieNode:
def __init__(self):
self.children = {}
self.top_suggestions = [] # Pre-computed top 10 for this prefix
self.is_end = False
class AutocompleteTrie:
def __init__(self):
self.root = TrieNode()
def insert(self, phrase, score):
node = self.root
for char in phrase:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
# Update top-K suggestions at each prefix node
self._update_top_k(node, phrase, score)
node.is_end = True
def suggest(self, prefix):
node = self.root
for char in prefix:
if char not in node.children:
return []
node = node.children[char]
return node.top_suggestions # O(1) lookup!
def _update_top_k(self, node, phrase, score, k=10):
node.top_suggestions.append((phrase, score))
node.top_suggestions.sort(key=lambda x: -x[1])
        node.top_suggestions = node.top_suggestions[:k]
The key optimization: pre-compute the top-K suggestions at every trie node. When a user types “dist”, we traverse to the “d”→“i”→“s”→“t” node and return its pre-computed list immediately — no DFS traversal needed at query time.
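Usage, with hypothetical phrases and query-log counts as scores:
trie = AutocompleteTrie()
trie.insert("distributed systems", 9500)
trie.insert("distributed computing", 7200)
trie.insert("distributed database", 4100)
print(trie.suggest("distrib"))
# [('distributed systems', 9500), ('distributed computing', 7200),
#  ('distributed database', 4100)]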
The trie is rebuilt periodically (e.g., hourly) from query logs, weighted by query frequency.
Deep Dive: Freshness vs. Completeness
A tension exists between two goals:
- Freshness: Index breaking news within minutes
- Completeness: Have a comprehensive index of the entire web
The solution is tiered indexing:
Tier 1 (Hot):
- Top 10M most important URLs
- Re-crawled every 1-4 hours
- Served from fastest SSD-backed shards
- Breaking news, popular sites
Tier 2 (Warm):
- Next 1B URLs
- Re-crawled every 1-7 days
- Served from standard storage
Tier 3 (Cold):
- Long-tail: 100B+ URLs
- Re-crawled every 30-90 days
- Served from high-density storage
- Only queried if tiers 1+2 don't satisfy the query
For breaking news, a separate real-time indexing pipeline processes new content from RSS feeds, social media, and news APIs. It pushes updates to Tier 1 within minutes.
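A sketch of how a crawl scheduler might consume these tiers; the intervals mirror the tier definitions above, while the field names are hypothetical:
from datetime import timedelta

RECRAWL_INTERVAL = {
    1: timedelta(hours=2),   # hot: top sites, breaking news
    2: timedelta(days=3),    # warm
    3: timedelta(days=60),   # cold: long tail
}

def next_crawl_time(page):
    # page.tier and page.last_crawled_at are assumed metadata fields
    return page.last_crawled_at + RECRAWL_INTERVAL[page.tier]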
Deep Dive: Index Serving at Scale
With 500 TB of index data and 100K QPS, you need careful capacity planning.
Index Size: 500 TB
Shard Size: ~50 GB per shard
Num Shards: 10,000 per full copy of the index
Rows: machines are grouped into ~100 "rows"; each row holds a
  full copy of the index, with each machine serving ~100 shards
  (~5 TB) from SSD with hot postings cached in RAM
Total Machines: 100 machines/row * 100 rows = 10,000 index servers
Query Fan-out: a query is routed to one row and fans out to one
  RPC per machine in that row = ~100 RPCs per query
At 100K QPS: 100K * 100 = 10M internal RPCs/sec
Per machine: 100K / 100 rows = ~1K queries/sec (manageable with SSD + RAM cache)
Bloom filters optimize this further: each machine maintains a Bloom filter of the terms in its shards. The query coordinator checks the Bloom filters first and skips machines guaranteed not to hold any of the query terms.
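A sketch of that pruning step, assuming Bloom-filter objects that support Python membership tests; because Bloom filters never produce false negatives, skipping is always safe (a false positive only costs one wasted RPC):
def prune_shards(query_terms, shard_blooms):
    """Return only the shards whose Bloom filter might contain a query term."""
    return [
        shard_id
        for shard_id, bloom in shard_blooms.items()
        if any(term in bloom for term in query_terms)
    ]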
Key Takeaways
| Component | Choice | Why |
|---|---|---|
| Index structure | Inverted index with posting lists | Standard for full-text search, supports AND/OR/phrase queries |
| Sharding | Document-partitioned | Simpler rebalancing, each shard is self-contained |
| Query pattern | Scatter-gather | Broadcast to all shards, merge top-K results |
| Scoring | BM25 + PageRank + freshness | BM25 for text relevance, PageRank for authority |
| Autocomplete | Trie with pre-computed top-K | O(1) suggestion lookup per prefix |
| Freshness | Tiered indexing (hot/warm/cold) | Important pages re-crawled hourly, long-tail monthly |
| Spell check | Edit distance + n-gram index | Fast candidate generation with frequency-weighted ranking |
| Crawling | BFS with priority + politeness | Crawl important pages first, respect robots.txt |
The search engine design tests your understanding of data structures at scale (inverted index, tries), distributed query processing (scatter-gather), and relevance engineering (BM25, PageRank). In the interview, start with the two-pipeline architecture, then let the interviewer guide which deep dive they want.
