A search engine is one of the most complex distributed systems ever built. Google processes over 8.5 billion searches per day, indexing hundreds of billions of web pages across a petabyte-scale inverted index. In an interview, you won’t design the full Google — but you need to demonstrate mastery of the core components: crawling, indexing, query processing, and ranking.
Understanding the Problem
Functional Requirements
- Web search by keywords — User types a query, gets ranked results
- Autocomplete suggestions — As-you-type query suggestions
- Paginated results — 10 results per page, with next/previous
- Spell correction — “Did you mean: running shoes”
- Snippet generation — Show relevant text excerpt from each result
- Freshness — New content indexed within hours, not days
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Query latency (p99) | < 200ms |
| Index freshness | New pages indexed within 2-4 hours |
| Index size | Billions of documents |
| Query throughput | 100K+ QPS |
| Availability | 99.99% |
| Relevance | First-page results satisfy 90%+ of queries |
Scale Estimation
- 100 billion indexed pages, average 50KB each = 5 PB raw content
- Inverted index is ~10% of raw content = 500 TB index data
- 100K QPS at peak, each query fanning out to ~100 index servers = 10M internal RPCs/sec
- Crawl rate: ~1 billion pages/day to maintain freshness
Core Entities and APIs
Data Model
-- Document metadata (stored separately from index)
CREATE TABLE documents (
doc_id BIGINT PRIMARY KEY,
url TEXT UNIQUE,
title VARCHAR(500),
content_hash CHAR(64), -- SHA-256 for dedup
page_rank FLOAT DEFAULT 0.0,
crawled_at TIMESTAMP,
indexed_at TIMESTAMP
);
-- The inverted index is NOT in SQL — it's a custom
-- data structure optimized for term lookups.
-- Conceptually:
-- term → [(doc_id, term_frequency, [positions]), ...]
API Design
# Search
GET /api/v1/search?q=distributed+systems&page=1&size=10
response:
total_results: 142000000
results:
- title: "Designing Data-Intensive Applications"
url: "https://example.com/ddia"
snippet: "A guide to the big ideas behind reliable, scalable..."
score: 0.94
- ...
suggestions:
spell_correction: null
related_searches: ["distributed systems book", "system design"]
# Autocomplete
GET /api/v1/suggest?prefix=distrib
response:
suggestions:
- "distributed systems"
- "distributed computing"
- "distributed database"
- "distributed lock"
# Index (internal API, not public)
POST /internal/v1/index
body: { url, title, content, crawled_at }
High-Level Design
A search engine has two distinct pipelines that share the inverted index.
Indexing Pipeline (Offline/Batch)
- Web Crawler — Fetches pages from the internet using BFS. A URL frontier manages priority (important sites first) and politeness (don’t DDoS anyone); a minimal frontier sketch follows this list.
- Document Processor — Parses HTML, extracts text, detects language, deduplicates via SimHash/MinHash.
- Indexer — Tokenizes text, applies stemming, builds inverted index entries, computes term frequencies.
- Index Storage — Writes sharded inverted index to disk. Computes PageRank from the link graph.
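To make priority + politeness concrete, here is a minimal single-process frontier sketch. The class shape and the one-second per-host delay are illustrative assumptions; a real crawler uses distributed per-host queues, robots.txt caching, and adaptive delays:
import heapq
import time
from urllib.parse import urlparse

class URLFrontier:
    """Toy frontier: highest-priority URL wins, but never hit the same
    host more often than once per `per_host_delay` seconds."""
    def __init__(self, per_host_delay=1.0):
        self.heap = []            # (-priority, url)
        self.next_allowed = {}    # host -> earliest allowed fetch time
        self.per_host_delay = per_host_delay

    def add(self, url, priority):
        heapq.heappush(self.heap, (-priority, url))

    def pop_ready(self):
        # Skip (and re-queue) URLs whose host is still rate-limited
        deferred, chosen = [], None
        while self.heap:
            prio, url = heapq.heappop(self.heap)
            host = urlparse(url).netloc
            if time.time() >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = time.time() + self.per_host_delay
                chosen = url
                break
            deferred.append((prio, url))
        for item in deferred:
            heapq.heappush(self.heap, item)
        return chosen  # None if every queued host is rate-limited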
Query Pipeline (Online/Real-Time)
- Query Parser — Tokenizes the query, applies spell check, removes stop words, stems terms.
- Index Lookup — Scatters the query to relevant index shards, gathers posting lists.
- Scorer/Ranker — Computes relevance score per document using BM25 + PageRank + freshness + ML signals.
- Result Builder — Fetches document metadata (title, URL), generates snippets, formats the response.
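As a deliberately naive illustration of snippet generation, the sketch below slides a fixed window over the raw text and keeps the window with the most query-term hits. The window and stride sizes are arbitrary assumptions; production snippeters work on tokenized, position-indexed text and respect sentence boundaries:
def make_snippet(content, query_terms, window=160, stride=40):
    """Return the ~160-character excerpt containing the most query terms.
    query_terms are assumed to be lowercased already."""
    text = content.lower()
    best_start, best_hits = 0, -1
    for start in range(0, max(1, len(text) - window), stride):
        chunk = text[start:start + window]
        hits = sum(chunk.count(term) for term in query_terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return content[best_start:best_start + window].strip() + "..."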
Deep Dive: Inverted Index
The inverted index is the heart of any search engine. It maps every term to the list of documents containing that term.
Structure
Term Dictionary:
"distributed" → PostingList
"system" → PostingList
"design" → PostingList
...
PostingList for "distributed":
[
{ doc_id: 1042, tf: 5, positions: [3, 15, 42, 89, 123] },
{ doc_id: 2891, tf: 2, positions: [7, 56] },
{ doc_id: 5420, tf: 8, positions: [1, 12, 28, 33, 67, 91, 104, 118] },
...
]
Each posting list entry contains:
- doc_id — Which document contains this term
- tf (term frequency) — How many times the term appears
- positions — Where in the document the term appears (enables phrase queries like “distributed systems”)
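To see why positions matter, here is a minimal phrase-check sketch for a two-term phrase; the dict-of-position-lists shape is illustrative, not the actual on-disk format:
def phrase_match(postings_a, postings_b):
    """Return doc_ids where term B appears immediately after term A.
    postings_*: dict of doc_id -> sorted list of positions."""
    matching_docs = []
    for doc_id in postings_a.keys() & postings_b.keys():
        positions_b = set(postings_b[doc_id])
        # For "distributed systems": B's position must be A's position + 1
        if any(p + 1 in positions_b for p in postings_a[doc_id]):
            matching_docs.append(doc_id)
    return matching_docs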
Building the Index
from collections import defaultdict, namedtuple

# A posting entry records where and how often a term occurs in one document
PostingEntry = namedtuple("PostingEntry", ["doc_id", "tf", "positions"])

def build_inverted_index(documents):
"""
Build an in-memory inverted index from a batch of documents.
In production, this writes to a sorted on-disk format (SSTable-like).
"""
index = defaultdict(list) # term → list of PostingEntry
for doc in documents:
tokens = tokenize(doc.content) # split on whitespace/punctuation
tokens = [t.lower() for t in tokens] # lowercase
tokens = remove_stop_words(tokens) # remove "the", "is", "a"
tokens = [stem(t) for t in tokens] # "running" → "run"
# Count term frequencies and record positions
term_positions = defaultdict(list)
for pos, token in enumerate(tokens):
term_positions[token].append(pos)
for term, positions in term_positions.items():
index[term].append(PostingEntry(
doc_id=doc.id,
tf=len(positions),
positions=positions
))
# Sort posting lists by doc_id for efficient intersection
for term in index:
index[term].sort(key=lambda e: e.doc_id)
    return index
Sharding the Index
There are two fundamental strategies for distributing the index across machines:
Document-Partitioned (what Elasticsearch uses):
- Each shard holds a complete index for a subset of documents
- Every query is broadcast to all shards (scatter-gather)
- Each shard returns its local top-K, coordinator merges
- Simpler to rebalance, easier to add capacity
Term-Partitioned:
- Each shard owns a subset of terms (e.g., A-M, N-Z)
- A single-term query hits exactly one shard
- Multi-term queries hit multiple shards, then intersect
- Better for single-term lookups, harder to manage
For most interviews, go with document-partitioned sharding. It’s what Elasticsearch, Solr, and most modern search systems use. The scatter-gather overhead is acceptable because each shard responds in parallel.
# Scatter-gather pseudocode
import asyncio
import heapq
async def search(query, num_shards=1000, top_k=10):
# Parse and prepare
terms = parse_query(query)
# Scatter: send to all shards in parallel
futures = []
for shard_id in range(num_shards):
futures.append(
shard_clients[shard_id].search(terms, top_k=top_k)
)
# Gather: collect local top-K from each shard
shard_results = await asyncio.gather(*futures)
# Merge: global top-K from all shard results
heap = []
for results in shard_results:
for doc_id, score in results:
heapq.heappush(heap, (score, doc_id))
if len(heap) > top_k:
heapq.heappop(heap)
    return sorted(heap, reverse=True)
Deep Dive: Query Processing Pipeline
A raw user query goes through several transformations before it hits the index.
Step 1: Tokenization
Split the raw query into individual terms, normalize case, handle special characters.
import re

def tokenize(query):
# "New York restaurants -fast-food" →
# ["new", "york", "restaurants", "-fast-food"]
tokens = re.split(r'\s+', query.lower().strip())
    return tokens
Step 2: Spell Correction
Check each term against a dictionary. For unknown terms, find the closest known term using edit distance or a pre-built n-gram index.
def spell_check(term, dictionary, max_edit_distance=2):
if term in dictionary:
return term
# Generate candidates within edit distance
candidates = []
for dict_term in dictionary.get_candidates(term[:3]): # prefix filter
dist = levenshtein_distance(term, dict_term)
if dist <= max_edit_distance:
candidates.append((dict_term, dist, dictionary.frequency(dict_term)))
if not candidates:
return term # Unknown term, keep as-is
# Rank by (edit_distance ASC, frequency DESC)
candidates.sort(key=lambda c: (c[1], -c[2]))
return candidates[0][0]
# "runing" → edit distance 1 from "running" → correctStep 3: Stemming
Reduce words to their root form so that “running”, “runs”, “ran” all match documents containing any of these forms.
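In practice you would call a library stemmer rather than hand-roll one. For instance, assuming the nltk package is available, its Porter implementation reproduces the examples below:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "computers", "arguing", "agreed"]:
    print(word, "→", stemmer.stem(word))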
# Porter Stemmer examples:
# "running" → "run"
# "computers" → "comput"
# "arguing" → "argu"
# "agreed" → "agre"Step 4: Query Expansion
Add synonyms and related terms to improve recall:
def expand_query(terms):
    # Keep original terms at full weight; add synonyms at reduced weight
    expanded = [(term, 1.0) for term in terms]
    for term in terms:
        synonyms = synonym_dict.get(term, [])
        # Add top 2 synonyms with reduced weight
        for syn in synonyms[:2]:
            expanded.append((syn, 0.5))
    return expanded
# "shoes" → [("shoes", 1.0), ("footwear", 0.5), ("sneakers", 0.5)]
Step 5: Index Lookup (Scatter-Gather)
The parsed query is broadcast to all index shards. Each shard performs a local search:
- Look up each query term in the shard’s inverted index
- Intersect posting lists (for AND queries) or merge them (for OR queries); see the two-pointer sketch after this list
- Score each matching document locally
- Return the top-K results to the coordinator
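Because posting lists are stored sorted by doc_id (see build_inverted_index above), the AND intersection is a linear two-pointer merge; a minimal sketch:
def intersect(postings_a, postings_b):
    """Two-pointer intersection of posting lists sorted by doc_id."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i].doc_id == postings_b[j].doc_id:
            result.append(postings_a[i].doc_id)
            i += 1
            j += 1
        elif postings_a[i].doc_id < postings_b[j].doc_id:
            i += 1
        else:
            j += 1
    return result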
Step 6: Scoring with BM25
BM25 (Best Matching 25) is the industry-standard relevance scoring function. It’s an improvement over TF-IDF that handles document length normalization:
import math
def bm25_score(query_terms, doc, index, corpus_stats):
"""
BM25 scoring for a single document against query terms.
k1 = 1.2 (term frequency saturation)
b = 0.75 (document length normalization)
"""
k1 = 1.2
b = 0.75
score = 0.0
N = corpus_stats.total_docs
avgdl = corpus_stats.avg_doc_length
dl = doc.length # number of terms in this document
for term in query_terms:
# df = number of documents containing this term
df = index.doc_frequency(term)
# IDF component: rare terms are more important
idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
# TF component with saturation and length normalization
tf = doc.term_frequency(term)
tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
score += idf * tf_norm
    return score
BM25 captures two crucial intuitions:
- IDF: A term appearing in 90% of documents (“the”) is nearly worthless. A term appearing in 0.01% of documents (“kubernetes”) is highly discriminative.
- Saturating TF: Mentioning “database” 50 times in a document doesn’t make it 50x more relevant than mentioning it once. The score saturates around tf=5-10.
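You can verify the saturation directly from the TF formula above (taking dl equal to avgdl so the length term drops out):
k1 = 1.2
for tf in (1, 5, 50):
    print(tf, round(tf * (k1 + 1) / (tf + k1), 2))
# 1 → 1.0, 5 → 1.77, 50 → 2.15: 50x the mentions, only ~2x the score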
Step 7: Combining Signals
BM25 alone isn’t sufficient. Modern search engines combine multiple signals:
def final_score(doc, query, bm25):
return (
0.40 * bm25 + # Text relevance
0.25 * doc.page_rank + # Link authority
0.15 * freshness_score(doc.crawled_at) + # Newer is better
0.10 * click_through_rate(doc, query) + # User engagement
0.10 * domain_authority(doc.url) # Trust signal
)
PageRank measures a page’s authority based on the link graph. A page linked to by many important pages gets a high PageRank. It’s computed offline during the indexing pipeline.
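For reference, here is a single-machine power-iteration sketch of PageRank. The damping factor of 0.85 is the conventional choice; production versions run as distributed batch jobs over the full link graph:
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links[i] = list of page indices that page i links to."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for page, links in enumerate(out_links):
            if links:
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages
                for target in range(n):
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Three pages in a cycle (0 → 1 → 2 → 0): ranks converge to 1/3 each
print(pagerank([[1], [2], [0]]))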
Deep Dive: Autocomplete
Autocomplete must return suggestions within 50-100ms of each keystroke. The data structure of choice is a Trie (prefix tree) combined with pre-computed top-K results per prefix.
class TrieNode:
def __init__(self):
self.children = {}
self.top_suggestions = [] # Pre-computed top 10 for this prefix
self.is_end = False
class AutocompleteTrie:
def __init__(self):
self.root = TrieNode()
def insert(self, phrase, score):
node = self.root
for char in phrase:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
# Update top-K suggestions at each prefix node
self._update_top_k(node, phrase, score)
node.is_end = True
def suggest(self, prefix):
node = self.root
for char in prefix:
if char not in node.children:
return []
node = node.children[char]
return node.top_suggestions # O(1) lookup!
def _update_top_k(self, node, phrase, score, k=10):
node.top_suggestions.append((phrase, score))
node.top_suggestions.sort(key=lambda x: -x[1])
        node.top_suggestions = node.top_suggestions[:k]
The key optimization: pre-compute the top-K suggestions at every trie node. When a user types “dist”, we traverse to the “d”→“i”→“s”→“t” node and return its pre-computed list immediately — no DFS traversal needed at query time.
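Usage, with hypothetical phrases and query-log counts as scores:
trie = AutocompleteTrie()
trie.insert("distributed systems", 9500)
trie.insert("distributed computing", 7200)
trie.insert("distributed database", 4100)
print(trie.suggest("distrib"))
# [('distributed systems', 9500), ('distributed computing', 7200),
#  ('distributed database', 4100)]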
The trie is rebuilt periodically (e.g., hourly) from query logs, weighted by query frequency.
Deep Dive: Freshness vs. Completeness
A tension exists between two goals:
- Freshness: Index breaking news within minutes
- Completeness: Have a comprehensive index of the entire web
The solution is tiered indexing:
Tier 1 (Hot):
- Top 10M most important URLs
- Re-crawled every 1-4 hours
- Served from fastest SSD-backed shards
- Breaking news, popular sites
Tier 2 (Warm):
- Next 1B URLs
- Re-crawled every 1-7 days
- Served from standard storage
Tier 3 (Cold):
- Long-tail: 100B+ URLs
- Re-crawled every 30-90 days
- Served from high-density storage
- Only queried if tiers 1+2 don't satisfy the query
For breaking news, a separate real-time indexing pipeline processes new content from RSS feeds, social media, and news APIs. It pushes updates to Tier 1 within minutes.
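A sketch of how a crawl scheduler might consume these tiers; the intervals mirror the tier definitions above, while the field names are hypothetical:
from datetime import timedelta

RECRAWL_INTERVAL = {
    1: timedelta(hours=2),   # hot: top sites, breaking news
    2: timedelta(days=3),    # warm
    3: timedelta(days=60),   # cold: long tail
}

def next_crawl_time(page):
    # page.tier and page.last_crawled_at are assumed metadata fields
    return page.last_crawled_at + RECRAWL_INTERVAL[page.tier]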
Deep Dive: Index Serving at Scale
With 500 TB of index data and 100K QPS, you need careful capacity planning.
Index Size: 500 TB
Shard Size: ~50 GB per shard
Num Shards: 10,000 per full copy of the index
Rows: machines are grouped into ~100 "rows"; each row holds a
  full copy of the index, with each machine serving ~100 shards
  (~5 TB) from SSD with hot postings cached in RAM
Total Machines: 100 machines/row * 100 rows = 10,000 index servers
Query Fan-out: a query is routed to one row and fans out to one
  RPC per machine in that row = ~100 RPCs per query
At 100K QPS: 100K * 100 = 10M internal RPCs/sec
Per machine: 100K / 100 rows = ~1K queries/sec (manageable with SSD + RAM cache)
Bloom filters optimize this further: each machine maintains a Bloom filter of the terms in its shards. The query coordinator checks the Bloom filters first and skips machines guaranteed not to hold any of the query terms.
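A sketch of that pruning step, assuming Bloom-filter objects that support Python membership tests; because Bloom filters never produce false negatives, skipping is always safe (a false positive only costs one wasted RPC):
def prune_shards(query_terms, shard_blooms):
    """Return only the shards whose Bloom filter might contain a query term."""
    return [
        shard_id
        for shard_id, bloom in shard_blooms.items()
        if any(term in bloom for term in query_terms)
    ]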
Key Takeaways
| Component | Choice | Why |
|---|---|---|
| Index structure | Inverted index with posting lists | Standard for full-text search, supports AND/OR/phrase queries |
| Sharding | Document-partitioned | Simpler rebalancing, each shard is self-contained |
| Query pattern | Scatter-gather | Broadcast to all shards, merge top-K results |
| Scoring | BM25 + PageRank + freshness | BM25 for text relevance, PageRank for authority |
| Autocomplete | Trie with pre-computed top-K | O(1) suggestion lookup per prefix |
| Freshness | Tiered indexing (hot/warm/cold) | Important pages re-crawled hourly, long-tail monthly |
| Spell check | Edit distance + n-gram index | Fast candidate generation with frequency-weighted ranking |
| Crawling | BFS with priority + politeness | Crawl important pages first, respect robots.txt |
The search engine design tests your understanding of data structures at scale (inverted index, tries), distributed query processing (scatter-gather), and relevance engineering (BM25, PageRank). In the interview, start with the two-pipeline architecture, then let the interviewer guide which deep dive they want.
