
Design a Distributed Cache

April 09, 2026

TL;DR

A distributed cache uses consistent hashing to distribute keys across nodes, LRU eviction backed by a doubly-linked list and hashmap for O(1) operations, and strategies like probabilistic early recomputation to prevent cache stampedes.

Every high-traffic system hits the same wall: the database cannot handle the read load. A distributed cache sits between your application and the database, serving frequently accessed data from memory at sub-millisecond latency. Redis and Memcached are the de facto standards, but understanding how to design one from scratch reveals the trade-offs that matter in a system design interview.

In this lesson, we will design a distributed cache that handles 100K+ operations per second per node, distributes keys across a cluster using consistent hashing, and handles the real-world problems — cache stampedes, hot keys, and node failures.

Understanding the Problem

Functional Requirements

  • GET / SET / DELETE: Core operations with key-value semantics
  • TTL support: Keys expire automatically after a configurable time-to-live
  • Key expiration: Both lazy expiration (on access) and active expiration (background sweep)
  • Data types: Support strings, hashes, lists, sets (like Redis)
  • Pub/Sub: Publish messages to channels, subscribers receive in real time

Non-Functional Requirements

  • Sub-millisecond latency: p99 < 1ms for GET/SET on a single node
  • High throughput: 100K+ operations per second per node
  • Horizontal scalability: Add nodes to increase capacity linearly
  • Fault tolerance: Node failure does not lose all cached data
  • Memory efficiency: Minimize overhead per key-value pair

Core Entities and APIs

Entities

interface CacheEntry {
  key: string;
  value: Buffer;          // Serialized value (string, hash, list, etc.)
  ttl: number | null;     // Seconds until expiry, null = no expiry
  createdAt: number;      // Unix timestamp
  accessedAt: number;     // For LRU tracking
  size: number;           // Bytes, for memory accounting
}

interface CacheNode {
  id: string;
  host: string;
  port: number;
  maxMemory: number;      // Bytes
  usedMemory: number;
  status: 'ACTIVE' | 'DRAINING' | 'DOWN';
}

interface HashRing {
  nodes: CacheNode[];
  virtualNodesPerNode: number;  // Typically 100-200
  ring: SortedMap<number, CacheNode>;  // Hash position -> node
}

APIs

GET    key                     — Retrieve value, return null if missing/expired
SET    key value [EX seconds]  — Store value with optional TTL
DEL    key                     — Remove key
EXPIRE key seconds             — Set TTL on existing key
KEYS   pattern                 — Find keys matching pattern (use sparingly)
TTL    key                     — Get remaining TTL
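A typical session against this API might look like the following (return values are illustrative):

SET    session:42 "alice" EX 60   → OK
GET    session:42                 → "alice"
TTL    session:42                 → 58
DEL    session:42                 → OK
GET    session:42                 → null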

High-Level Design

[Figure: Distributed cache architecture with client library, hash ring, and cache nodes]

The architecture has three layers:

  1. Client Library — Runs inside each application server. Implements consistent hashing to route requests to the correct cache node. Manages connection pools and serialization.

  2. Cache Cluster — Multiple cache nodes, each responsible for a portion of the key space. Nodes are arranged on a hash ring.

  3. Cluster Coordinator — Manages cluster membership. Detects node failures, triggers rebalancing, and maintains the authoritative ring configuration.

The client library is the critical piece. It hashes each key to determine which node owns it, then sends the request directly to that node. There is no central router — this is what makes the system scale.
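A minimal sketch of what the client library's routing might look like (the ring and connection-pool interfaces here are assumptions for illustration, not part of any specific library):

class CacheClient:
    """Hypothetical client library: routes each key straight to its owner."""

    def __init__(self, hash_ring, connection_pool):
        self.ring = hash_ring        # Consistent hash ring (next section)
        self.pool = connection_pool  # One pooled connection per cache node

    def get(self, key: str):
        node = self.ring.get_node(key)          # Hash the key, find its owner
        conn = self.pool.connection_for(node)   # Reuse a pooled connection
        return conn.send("GET", key)            # Talk to that node directly

    def set(self, key: str, value, ttl=None):
        node = self.ring.get_node(key)
        conn = self.pool.connection_for(node)
        return conn.send("SET", key, value, ttl)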

Deep Dive: Consistent Hashing

The naive approach to distributing keys is modulo hashing: node = hash(key) % num_nodes. This works until you add or remove a node — suddenly almost every key maps to a different node, the cluster-wide hit rate collapses, and nearly every request falls through to the database.
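A quick experiment makes the problem concrete: hash a batch of keys modulo 4 nodes, then modulo 5, and count how many keys change owners (the hash choice here is arbitrary; the ~80% figure follows from the arithmetic of moduli, not the hash):

import hashlib

def node_for(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys remapped")  # ~80% when going 4 → 5 nodes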

[Figure: Consistent hashing ring showing virtual nodes and key remapping when a node is added]

How Consistent Hashing Works

Instead of modulo, we place nodes on a ring (a circular hash space from 0 to 2^32). To find which node owns a key, hash the key and walk clockwise around the ring until you hit a node.

import hashlib
from sortedcontainers import SortedDict

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = SortedDict()       # hash_position -> node_id
        self.nodes = {}                # node_id -> node_info
    
    def _hash(self, key: str) -> int:
        """MD5 hash to 32-bit integer for ring position."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest[:8], 16)
    
    def add_node(self, node_id: str, node_info: dict):
        """Add a node with virtual nodes for balanced distribution."""
        self.nodes[node_id] = node_info
        for i in range(self.virtual_nodes):
            virtual_key = f"{node_id}:vnode:{i}"
            position = self._hash(virtual_key)
            self.ring[position] = node_id
    
    def remove_node(self, node_id: str):
        """Remove a node — only keys on this node are remapped."""
        positions_to_remove = [
            pos for pos, nid in self.ring.items() if nid == node_id
        ]
        for pos in positions_to_remove:
            del self.ring[pos]
        del self.nodes[node_id]
    
    def get_node(self, key: str) -> str:
        """Find the node responsible for this key."""
        if not self.ring:
            raise Exception("No nodes in ring")
        
        position = self._hash(key)
        
        # Find the first node position >= key position (clockwise)
        idx = self.ring.bisect_left(position)
        if idx == len(self.ring):
            idx = 0  # Wrap around to first node
        
        return self.ring.values()[idx]

Why Virtual Nodes?

Without virtual nodes, the key distribution is wildly uneven. If you have 3 nodes, one might end up responsible for 60% of the keys. Virtual nodes solve this by placing each physical node at 100-200 positions on the ring, creating a statistically even distribution.

The trade-off: more virtual nodes means more even distribution but more memory for the ring and slightly slower lookups. In practice, 150 virtual nodes per physical node gives a standard deviation of less than 5% in key distribution.
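You can sanity-check that claim against the ConsistentHashRing above by hashing a large batch of keys and measuring each node's share (a quick experiment sketch, not a benchmark):

from collections import Counter
from statistics import pstdev

ring = ConsistentHashRing(virtual_nodes=150)
for n in range(5):
    ring.add_node(f"node-{n}", {"host": f"10.0.0.{n}"})

counts = Counter(ring.get_node(f"user:{i}") for i in range(100_000))
shares = [count / 100_000 for count in counts.values()]
print(counts)                                     # Roughly 20K keys per node
print(f"stddev of shares: {pstdev(shares):.2%}")  # Small at 150 vnodes; grows as you lower them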

What Happens When a Node Fails?

When Node B is removed from the ring, only the keys that were assigned to Node B need to be remapped. They move to the next clockwise node (Node C). All other keys stay where they are. This means only about K/N keys are remapped (where K is the total number of keys and N is the number of nodes): with 1 million keys on 4 nodes, removing one node remaps roughly 250K keys, compared to modulo hashing where nearly all of them move.

Deep Dive: Eviction Policies

When memory is full and a new key needs to be stored, the cache must evict an existing key. The eviction policy determines which key to remove.

LRU (Least Recently Used)

The most common policy. Evict the key that has not been accessed for the longest time. The insight: if you have not needed it recently, you probably will not need it soon.

[Figure: LRU cache internals showing doubly-linked list and hashmap for O(1) operations]

The implementation uses two data structures working together:

  • HashMap: O(1) lookup by key, stores pointers to linked list nodes
  • Doubly-Linked List: Maintains access order. Most recent at the head, least recent at the tail.

class LRUNode:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.map = {}                    # key -> LRUNode
        self.head = LRUNode('', '')      # Dummy head
        self.tail = LRUNode('', '')      # Dummy tail
        self.head.next = self.tail
        self.tail.prev = self.head
    
    def _remove(self, node: LRUNode):
        """Remove node from its current position."""
        node.prev.next = node.next
        node.next.prev = node.prev
    
    def _add_to_head(self, node: LRUNode):
        """Add node right after the dummy head (most recent)."""
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node
    
    def get(self, key: str):
        """O(1) — lookup in hashmap, promote to head."""
        if key not in self.map:
            return None
        node = self.map[key]
        self._remove(node)           # Remove from current position
        self._add_to_head(node)      # Move to head (most recent)
        return node.value
    
    def set(self, key: str, value):
        """O(1) — insert at head, evict from tail if full."""
        if key in self.map:
            self._remove(self.map[key])
        
        node = LRUNode(key, value)
        self._add_to_head(node)
        self.map[key] = node
        
        if len(self.map) > self.capacity:
            # Evict least recently used (tail)
            evicted = self.tail.prev
            self._remove(evicted)
            del self.map[evicted.key]
            return evicted.key        # Return evicted key for logging
        return None

Every operation — GET, SET, DELETE — is O(1). This is critical for a cache that needs to handle 100K+ operations per second.
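A quick walk-through of the eviction behavior using the class above:

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")                 # Promotes "a" to most recently used
evicted = cache.set("c", 3)    # Over capacity, so the tail is evicted
print(evicted)                 # "b" (least recently used)
print(cache.get("b"))          # None, "b" is gone
print(cache.get("a"))          # 1, "a" survived because it was touched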

Other Eviction Policies

Policy  | How It Works                     | Best For
--------+----------------------------------+---------------------------------------
LRU     | Evict least recently accessed    | General purpose, most common
LFU     | Evict least frequently accessed  | Workloads with a stable hot set
Random  | Evict a random key               | Surprisingly effective, zero overhead
TTL     | Evict the key closest to expiry  | Time-sensitive data

Redis uses an approximated LRU by default. Instead of maintaining a global LRU list (two extra pointers per key is expensive at scale), it samples a handful of random keys (5 by default) and evicts the one with the oldest access time. This skips the linked-list overhead entirely and produces near-optimal results in practice.
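A minimal sketch of that sampling strategy, assuming entries carry a last-accessed timestamp like the accessedAt field defined earlier (real Redis additionally keeps a small pool of eviction candidates across rounds):

import random

def evict_one_sampled(entries: dict, sample_size: int = 5) -> str:
    """Approximated LRU: sample a few keys, evict the stalest one.

    `entries` maps key -> metadata dict with an 'accessedAt' timestamp.
    """
    candidates = random.sample(list(entries), min(sample_size, len(entries)))
    victim = min(candidates, key=lambda k: entries[k]['accessedAt'])
    del entries[victim]
    return victim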

Deep Dive: Cache Stampede Prevention

A cache stampede (also called “thundering herd”) occurs when a popular key expires and hundreds of concurrent requests all miss the cache simultaneously, overwhelming the database.

The Problem

t=0: key "top_products" expires
t=0: 500 requests arrive for "top_products"
t=0: All 500 miss cache → all 500 query the database
t=0: Database overloaded, latency spikes to 5 seconds
t=5: All 500 responses arrive, 499 of them were redundant

Solution 1: Distributed Locking

Only one request computes the value; others wait for it.

import time

def get_with_lock(key: str, compute_fn, ttl: int = 300):
    """Prevent stampede with distributed lock."""
    value = cache.get(key)
    if value is not None:
        return value
    
    # Try to acquire lock for this key
    lock_key = f"lock:{key}"
    acquired = redis.set(lock_key, "1", nx=True, ex=10)
    
    if acquired:
        try:
            # Winner: compute the value
            value = compute_fn()
            cache.set(key, value, ex=ttl)
            return value
        finally:
            # NOTE: production code should set a unique token as the lock
            # value and delete only if it still matches, so a slow winner
            # cannot remove a lock someone else acquired after ours expired
            redis.delete(lock_key)
    else:
        # Loser: wait for the winner to populate the cache
        for _ in range(50):  # Max 5 seconds
            time.sleep(0.1)
            value = cache.get(key)
            if value is not None:
                return value
        # Fallback: compute it ourselves
        return compute_fn()

Solution 2: Probabilistic Early Recomputation

The more elegant solution. Instead of waiting for the key to expire, recompute it slightly before expiry with increasing probability as the TTL gets closer to zero.

import math
import random

def get_with_early_recompute(key: str, compute_fn, ttl: int = 300):
    """XFetch algorithm — recompute before expiry to prevent stampede."""
    value, remaining_ttl = cache.get_with_ttl(key)
    
    if value is not None:
        # Probabilistic early recomputation (XFetch): recompute when
        # remaining_ttl + delta * beta * ln(rand) <= 0. Since ln(rand) < 0,
        # the probability of recomputing rises sharply as TTL approaches 0.
        beta = 1.0   # Tuning parameter; > 1 favors earlier recomputation
        delta = 1.0  # Estimated time to recompute, in seconds (ideally measured)
        
        expiry_gap = remaining_ttl + delta * beta * math.log(random.random())
        
        if expiry_gap > 0:
            return value  # Not time to recompute yet
        # Fall through to recompute
    
    # Compute and cache the value
    value = compute_fn()
    cache.set(key, value, ex=ttl)
    return value

This approach is statistically elegant: as hundreds of servers keep checking the same key while its TTL winds down, typically only one or two of them will trigger recomputation before it expires. No locking, no coordination, no waiting.

Deep Dive: Hot Key Problem

A “hot key” is a single key that receives a disproportionate amount of traffic. Think of a viral tweet, a flash sale product, or a breaking news article. Even a single node handling 100K ops/sec will buckle under 500K ops/sec for one key.

Solution: Local Caching + Key Splitting

import random
import time

class HotKeyHandler:
    def __init__(self, cache_cluster, local_cache_ttl=1):
        self.cluster = cache_cluster
        self.local_cache = {}           # In-process cache
        self.local_ttl = local_cache_ttl
        self.hot_keys = set()           # Tracked by monitoring
    
    def get(self, key: str):
        # Check local cache first for hot keys
        if key in self.hot_keys:
            local = self.local_cache.get(key)
            if local and time.time() - local['ts'] < self.local_ttl:
                return local['value']
        
        # Fall through to distributed cache
        value = self.cluster.get(key)
        
        if key in self.hot_keys:
            self.local_cache[key] = {
                'value': value, 
                'ts': time.time()
            }
        
        return value
    
    def set_hot(self, key: str, value, replicas: int = 3):
        """Split a hot key across multiple nodes."""
        for i in range(replicas):
            replica_key = f"{key}:replica:{i}"
            self.cluster.set(replica_key, value)
    
    def get_hot(self, key: str, replicas: int = 3):
        """Read from a random replica to distribute load."""
        replica_idx = random.randint(0, replicas - 1)
        return self.cluster.get(f"{key}:replica:{replica_idx}")

Two complementary strategies:

  1. Local (in-process) cache with a very short TTL (1-2 seconds). Serves most reads from application memory without touching the cache cluster. Acceptable staleness for hot data.

  2. Key splitting: Replicate the hot key as key:replica:0, key:replica:1, etc. Each replica lives on a different node, spreading the load. Reads pick a random replica.

Deep Dive: Data Replication

For fault tolerance, each key should exist on multiple nodes. There are two approaches.

Async Replication (Redis Default)

Writes go to the primary and are asynchronously replicated to replicas. This is fast but risks data loss if the primary crashes before replication completes.

Client → Primary Node (write ACK) → Async → Replica 1
                                   → Async → Replica 2

  • Latency: Low (write returns immediately)
  • Durability: Weak (up to ~1 second of data loss on primary failure)
  • Use case: Caching (data can be recomputed from the database)

Sync Replication

Writes are acknowledged only after at least one replica confirms. Higher latency but stronger durability.

Client → Primary Node → Sync → Replica 1 (ACK)
                       → Sync → Replica 2
       ← write ACK ←

  • Latency: Higher (~2x write latency)
  • Durability: Strong (data survives single node failure)
  • Use case: Session storage, feature flags (data that cannot be cheaply recomputed)

For a caching system, async replication is usually the right choice. The cache is a performance optimization — if a node fails and we lose some data, the worst case is a cache miss that gets refilled from the database.
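To make the contrast concrete, here is a minimal sketch of the two write paths (the node objects and their set method are assumptions for illustration):

import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def replicated_set(primary, replicas, key, value, sync=False):
    """Hypothetical write path contrasting async and sync replication."""
    primary.set(key, value)  # The primary always takes the write first
    
    if sync:
        # Sync: return only after at least one replica acknowledges
        pool = ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(replica.set, key, value) for replica in replicas]
        wait(futures, return_when=FIRST_COMPLETED)  # First ACK is enough
        pool.shutdown(wait=False)                   # Others finish in background
        return "OK"  # Durable on at least two nodes
    
    # Async: fire-and-forget, replicas catch up in the background
    for replica in replicas:
        threading.Thread(target=replica.set, args=(key, value), daemon=True).start()
    return "OK"  # Fast, but lost if the primary dies before replication completes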

Deep Dive: Memory Management

Efficient memory management is critical when you are storing millions of keys in RAM.

Slab Allocation (Memcached Approach)

Instead of using the system allocator (malloc/free) which causes fragmentation, Memcached pre-allocates memory in fixed-size “slabs”:

Slab Class 1: 96-byte chunks   (for items up to 96 bytes)
Slab Class 2: 120-byte chunks  (for items of 97-120 bytes)
Slab Class 3: 152-byte chunks  (for items of 121-152 bytes)
...
Slab Class 42: 1MB chunks      (for items up to 1MB)

Each value is stored in the smallest slab class that fits. This eliminates fragmentation at the cost of some internal waste (a 97-byte value uses a 120-byte chunk, wasting 23 bytes).
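The chunk sizes are not arbitrary: Memcached grows each class by a configurable factor (1.25 by default) with 8-byte alignment, which reproduces the 96 → 120 → 152 progression above. A sketch:

def slab_classes(min_chunk=96, growth=1.25, max_chunk=1024 * 1024):
    """Generate chunk sizes the way Memcached does: geometric growth, 8-byte aligned."""
    sizes, size = [], min_chunk
    while size < max_chunk:
        sizes.append(size)
        size = -(-int(size * growth) // 8) * 8  # Round up to 8-byte alignment
    sizes.append(max_chunk)
    return sizes

def slab_class_for(item_size: int, sizes: list) -> int:
    """Pick the smallest chunk size that fits the item."""
    for chunk in sizes:
        if item_size <= chunk:
            return chunk
    raise ValueError("item too large for any slab class")

sizes = slab_classes()
print(sizes[:4])                  # [96, 120, 152, 192]
print(slab_class_for(97, sizes))  # 120: a 97-byte item wastes 23 bytes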

Redis Memory Optimization

Redis takes a different approach, using specialized data structures based on the data’s size:

Small hash (<= 128 fields):   ziplist (compact, linear scan)
Large hash (> 128 fields):    hashtable (traditional, O(1) lookup)

List (any size):              quicklist (a linked list of ziplist nodes;
                              a small list fits in a single node)

Small set (<= 512 int members): intset (sorted array of integers)
Large set (or non-integers):    hashtable

This adaptive approach saves significant memory for the common case of small objects while still providing efficient operations for large ones.
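A sketch of the decision logic, using Redis's default thresholds (hash-max-ziplist-entries = 128, hash-max-ziplist-value = 64, set-max-intset-entries = 512; all configurable, and newer Redis versions use listpack in place of ziplist):

def hash_encoding(num_fields: int, max_field_len: int) -> str:
    """Choose the internal encoding for a Redis-style hash."""
    if num_fields <= 128 and max_field_len <= 64:
        return "ziplist"   # Compact sequential layout, O(n) field lookup
    return "hashtable"     # Classic hash table, O(1) field lookup

def set_encoding(members: list) -> str:
    """Choose the internal encoding for a Redis-style set."""
    if len(members) <= 512 and all(isinstance(m, int) for m in members):
        return "intset"    # Sorted array of integers, binary-searchable
    return "hashtable"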

Key Takeaways

  1. Consistent hashing is essential for distributed caches. It minimizes key remapping when nodes are added or removed — only K/N keys move, compared to nearly all keys with modulo hashing.

  2. LRU with a doubly-linked list + hashmap gives O(1) everything. GET, SET, DELETE, and eviction are all constant time. Approximated LRU (sampling) saves memory with minimal accuracy loss.

  3. Cache stampedes are a real threat. Probabilistic early recomputation is the most elegant solution — no locks, no coordination, statistically correct.

  4. Hot keys need special treatment. Local in-process caches with short TTLs and key splitting across replicas handle extreme read-heavy keys.

  5. Virtual nodes are not optional. Without them, consistent hashing produces wildly uneven key distribution. 150 virtual nodes per physical node is a good starting point.

  6. Choose replication based on your data’s recoverability. Async replication for cacheable data (fast writes, acceptable loss). Sync replication for session data (slower writes, no data loss).

A well-designed distributed cache is the single biggest performance lever in most architectures. Understanding these internals — from the hash ring to the LRU list to the eviction policies — is what separates a surface-level answer from a strong system design interview response.