Every high-traffic system hits the same wall: the database cannot handle the read load. A distributed cache sits between your application and the database, serving frequently accessed data from memory at sub-millisecond latency. Redis and Memcached are the de facto standards, but understanding how to design one from scratch reveals the trade-offs that matter in a system design interview.
In this lesson, we will design a distributed cache that handles 100K+ operations per second per node, distributes keys across a cluster using consistent hashing, and handles the real-world problems — cache stampedes, hot keys, and node failures.
Understanding the Problem
Functional Requirements
- GET / SET / DELETE: Core operations with key-value semantics
- TTL support: Keys expire automatically after a configurable time-to-live
- Key expiration: Both lazy expiration (on access) and active expiration (background sweep)
- Data types: Support strings, hashes, lists, sets (like Redis)
- Pub/Sub: Publish messages to channels, subscribers receive in real time
Non-Functional Requirements
- Sub-millisecond latency: p99 < 1ms for GET/SET on a single node
- High throughput: 100K+ operations per second per node
- Horizontal scalability: Add nodes to increase capacity linearly
- Fault tolerance: Node failure does not lose all cached data
- Memory efficiency: Minimize overhead per key-value pair
Core Entities and APIs
Entities
interface CacheEntry {
  key: string;
  value: Buffer;      // Serialized value (string, hash, list, etc.)
  ttl: number | null; // Seconds until expiry, null = no expiry
  createdAt: number;  // Unix timestamp
  accessedAt: number; // For LRU tracking
  size: number;       // Bytes, for memory accounting
}

interface CacheNode {
  id: string;
  host: string;
  port: number;
  maxMemory: number; // Bytes
  usedMemory: number;
  status: 'ACTIVE' | 'DRAINING' | 'DOWN';
}

interface HashRing {
  nodes: CacheNode[];
  virtualNodesPerNode: number; // Typically 100-200
  ring: SortedMap<number, CacheNode>; // Hash position -> node
}

APIs
- GET key — Retrieve value, return null if missing/expired
- SET key value [EX seconds] — Store value with optional TTL
- DEL key — Remove key
- EXPIRE key seconds — Set TTL on existing key
- KEYS pattern — Find keys matching pattern (use sparingly)
- TTL key — Get remaining TTL

High-Level Design
The architecture has three layers:

- Client Library — Runs inside each application server. Implements consistent hashing to route requests to the correct cache node. Manages connection pools and serialization.
- Cache Cluster — Multiple cache nodes, each responsible for a portion of the key space. Nodes are arranged on a hash ring.
- Cluster Coordinator — Manages cluster membership. Detects node failures, triggers rebalancing, and maintains the authoritative ring configuration.
The client library is the critical piece. It hashes each key to determine which node owns it, then sends the request directly to that node. There is no central router — this is what makes the system scale.
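To make that concrete, here is a minimal sketch of client-side routing, assuming the ConsistentHashRing built in the next section and a hypothetical ConnectionPool helper with acquire/release and a send_command RPC:

import pickle  # Stand-in serializer; real clients use msgpack or similar

class CacheClient:
    """Hypothetical client wrapper: route by hash, no central router."""

    def __init__(self, ring, pools):
        self.ring = ring    # ConsistentHashRing (see next section)
        self.pools = pools  # node_id -> ConnectionPool (assumed helper)

    def get(self, key: str):
        node_id = self.ring.get_node(key)        # Hash key -> owning node
        conn = self.pools[node_id].acquire()     # Reuse a pooled connection
        try:
            raw = conn.send_command("GET", key)  # Talk directly to the node
            return pickle.loads(raw) if raw is not None else None
        finally:
            self.pools[node_id].release(conn)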
Deep Dive: Consistent Hashing
The naive approach to distributing keys is modulo hashing: node = hash(key) % num_nodes. This works until you add or remove a node — suddenly almost every key maps to a different node, causing a cache stampede as every request hits the database.
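A quick back-of-the-envelope check makes the damage concrete. In this sketch, going from 4 to 5 nodes with modulo hashing remaps roughly 80% of keys:

keys = [f"user:{i}" for i in range(100_000)]

def modulo_node(key: str, num_nodes: int) -> int:
    # Built-in hash() is fine here: both sides run in the same process
    return hash(key) % num_nodes

moved = sum(1 for k in keys if modulo_node(k, 4) != modulo_node(k, 5))
print(f"{moved / len(keys):.0%} of keys remapped")  # ~80%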
How Consistent Hashing Works
Instead of modulo, we place nodes on a ring (a circular hash space from 0 to 2^32 − 1). To find which node owns a key, hash the key and walk clockwise around the ring until you hit a node.
import hashlib

from sortedcontainers import SortedDict

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = SortedDict()  # hash_position -> node_id
        self.nodes = {}           # node_id -> node_info

    def _hash(self, key: str) -> int:
        """MD5 hash to 32-bit integer for ring position."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest[:8], 16)

    def add_node(self, node_id: str, node_info: dict):
        """Add a node with virtual nodes for balanced distribution."""
        self.nodes[node_id] = node_info
        for i in range(self.virtual_nodes):
            virtual_key = f"{node_id}:vnode:{i}"
            position = self._hash(virtual_key)
            self.ring[position] = node_id

    def remove_node(self, node_id: str):
        """Remove a node — only keys on this node are remapped."""
        positions_to_remove = [
            pos for pos, nid in self.ring.items() if nid == node_id
        ]
        for pos in positions_to_remove:
            del self.ring[pos]
        del self.nodes[node_id]

    def get_node(self, key: str) -> str:
        """Find the node responsible for this key."""
        if not self.ring:
            raise Exception("No nodes in ring")
        position = self._hash(key)
        # Find the first node position >= key position (clockwise)
        idx = self.ring.bisect_left(position)
        if idx == len(self.ring):
            idx = 0  # Wrap around to first node
        return self.ring.values()[idx]

Why Virtual Nodes?
Without virtual nodes, the key distribution is wildly uneven. If you have 3 nodes, one might end up responsible for 60% of the keys. Virtual nodes solve this by placing each physical node at 100-200 positions on the ring, creating a statistically even distribution.
The trade-off: more virtual nodes means more even distribution but more memory for the ring and slightly slower lookups. In practice, 150 virtual nodes per physical node gives a standard deviation of less than 5% in key distribution.
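You can verify this empirically with the ring above. A short sketch that hashes a batch of keys and measures how evenly they land:

import statistics

ring = ConsistentHashRing(virtual_nodes=150)
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node, {})

counts = {node: 0 for node in ring.nodes}
for i in range(100_000):
    counts[ring.get_node(f"user:{i}")] += 1

mean = statistics.mean(counts.values())
stdev = statistics.stdev(counts.values())
print(f"stdev is {stdev / mean:.1%} of mean")  # A few percent with vnodes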
What Happens When a Node Fails?
When Node B is removed from the ring, only the keys that were assigned to Node B need to be remapped. They move to the next clockwise node (Node C). All other keys stay where they are. This means only K/N keys are remapped (where K is total keys and N is the number of nodes), compared to modulo hashing where nearly all keys are remapped.
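This, too, is easy to confirm with the ring class above. A sketch that removes a node and counts how many keys changed owners:

ring = ConsistentHashRing(virtual_nodes=150)
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node, {})

keys = [f"user:{i}" for i in range(100_000)]
before = {k: ring.get_node(k) for k in keys}

ring.remove_node("node-b")  # Node B fails
after = {k: ring.get_node(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # ~33% with 3 nodes, i.e. K/N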
Deep Dive: Eviction Policies
When memory is full and a new key needs to be stored, the cache must evict an existing key. The eviction policy determines which key to remove.
LRU (Least Recently Used)
The most common policy. Evict the key that has not been accessed for the longest time. The insight: if you have not needed it recently, you probably will not need it soon.
The implementation uses two data structures working together:
- HashMap: O(1) lookup by key, stores pointers to linked list nodes
- Doubly-Linked List: Maintains access order. Most recent at the head, least recent at the tail.
class LRUNode:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.map = {}  # key -> LRUNode
        self.head = LRUNode('', '')  # Dummy head
        self.tail = LRUNode('', '')  # Dummy tail
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node: LRUNode):
        """Remove node from its current position."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_head(self, node: LRUNode):
        """Add node right after the dummy head (most recent)."""
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key: str):
        """O(1) — lookup in hashmap, promote to head."""
        if key not in self.map:
            return None
        node = self.map[key]
        self._remove(node)       # Remove from current position
        self._add_to_head(node)  # Move to head (most recent)
        return node.value

    def set(self, key: str, value):
        """O(1) — insert at head, evict from tail if full."""
        if key in self.map:
            self._remove(self.map[key])
        node = LRUNode(key, value)
        self._add_to_head(node)
        self.map[key] = node
        if len(self.map) > self.capacity:
            # Evict least recently used (tail)
            evicted = self.tail.prev
            self._remove(evicted)
            del self.map[evicted.key]
            return evicted.key  # Return evicted key for logging
        return None

Every operation — GET, SET, DELETE — is O(1). This is critical for a cache that needs to handle 100K+ operations per second.
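A quick usage sketch showing the eviction order in action:

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # Promotes "a" to most recently used
cache.set("c", 3)      # Over capacity: evicts "b", the LRU key
print(cache.get("b"))  # None — evicted
print(cache.get("a"))  # 1 — still cached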
Other Eviction Policies
| Policy | How It Works | Best For |
|---|---|---|
| LRU | Evict least recently accessed | General purpose, most common |
| LFU | Evict least frequently accessed | Workloads with stable hot set |
| Random | Evict a random key | Surprisingly effective, zero overhead |
| TTL | Evict the key closest to expiry | Time-sensitive data |
Redis uses an approximated LRU by default. Instead of maintaining a global LRU list (expensive in memory), it samples a handful of random keys (5 by default) and evicts the one with the oldest access time. This avoids the per-key list pointers a true LRU would need and produces near-optimal results in practice.
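A minimal sketch of the sampling idea, assuming each entry records a last-accessed timestamp:

import random
import time

def evict_approx_lru(entries: dict, sample_size: int = 5) -> str:
    """entries: key -> last-accessed timestamp. Returns the evicted key."""
    sample = random.sample(list(entries), min(sample_size, len(entries)))
    victim = min(sample, key=lambda k: entries[k])  # Oldest access loses
    del entries[victim]
    return victim

entries = {f"k{i}": time.time() - random.random() * 60 for i in range(1000)}
print(evict_approx_lru(entries))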
Deep Dive: Cache Stampede Prevention
A cache stampede (also called “thundering herd”) occurs when a popular key expires and hundreds of concurrent requests all miss the cache simultaneously, overwhelming the database.
The Problem
t=0: key "top_products" expires
t=0: 500 requests arrive for "top_products"
t=0: All 500 miss cache → all 500 query the database
t=0: Database overloaded, latency spikes to 5 seconds
t=5: All 500 responses arrive, 499 of them were redundant

Solution 1: Distributed Locking
Only one request computes the value; others wait for it.
import time

def get_with_lock(key: str, compute_fn, ttl: int = 300):
    """Prevent stampede with distributed lock."""
    value = cache.get(key)
    if value is not None:
        return value

    # Try to acquire lock for this key
    lock_key = f"lock:{key}"
    acquired = redis.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        try:
            # Winner: compute the value
            value = compute_fn()
            cache.set(key, value, ex=ttl)
            return value
        finally:
            redis.delete(lock_key)
    else:
        # Loser: wait for the winner to populate the cache
        for _ in range(50):  # Max 5 seconds
            time.sleep(0.1)
            value = cache.get(key)
            if value is not None:
                return value
        # Fallback: compute it ourselves
        return compute_fn()

Solution 2: Probabilistic Early Recomputation
The more elegant solution. Instead of waiting for the key to expire, recompute it slightly before expiry with increasing probability as the TTL gets closer to zero.
import math
import random

def get_with_early_recompute(key: str, compute_fn, ttl: int = 300,
                             delta: float = 1.0, beta: float = 1.0):
    """XFetch algorithm — recompute before expiry to prevent stampede.

    delta: estimated time compute_fn takes, in seconds.
    beta: tuning parameter; values > 1 favor earlier recomputation.
    """
    value, remaining_ttl = cache.get_with_ttl(key)

    if value is not None:
        # Probabilistic early recomputation. log(random()) is negative,
        # so each caller subtracts a random "head start" from the TTL.
        # As the TTL approaches 0, the chance of recomputing increases.
        if remaining_ttl + delta * beta * math.log(random.random()) > 0:
            return value  # Not time to recompute yet
        # Fall through to recompute

    # Compute and cache the value
    value = compute_fn()
    cache.set(key, value, ex=ttl)
    return value

This approach is statistically elegant: as hundreds of servers keep checking the same key while its TTL winds down, only a handful trigger recomputation, staggered in time, rather than all of them stampeding the database at expiry. No locking, no coordination, no waiting.
Deep Dive: Hot Key Problem
A “hot key” is a single key that receives a disproportionate amount of traffic. Think of a viral tweet, a flash-sale product, or a breaking news article. Even a node that can sustain 100K ops/sec will buckle when a single key draws 500K ops/sec.
Solution: Local Caching + Key Splitting
import random
import time

class HotKeyHandler:
    def __init__(self, cache_cluster, local_cache_ttl=1):
        self.cluster = cache_cluster
        self.local_cache = {}  # In-process cache
        self.local_ttl = local_cache_ttl
        self.hot_keys = set()  # Tracked by monitoring

    def get(self, key: str):
        # Check local cache first for hot keys
        if key in self.hot_keys:
            local = self.local_cache.get(key)
            if local and time.time() - local['ts'] < self.local_ttl:
                return local['value']

        # Fall through to distributed cache
        value = self.cluster.get(key)
        if key in self.hot_keys:
            self.local_cache[key] = {
                'value': value,
                'ts': time.time()
            }
        return value

    def set_hot(self, key: str, value, replicas: int = 3):
        """Split a hot key across multiple nodes."""
        for i in range(replicas):
            replica_key = f"{key}:replica:{i}"
            self.cluster.set(replica_key, value)

    def get_hot(self, key: str, replicas: int = 3):
        """Read from a random replica to distribute load."""
        replica_idx = random.randint(0, replicas - 1)
        return self.cluster.get(f"{key}:replica:{replica_idx}")

Two complementary strategies:
- Local (in-process) cache with a very short TTL (1-2 seconds). Serves most reads from application memory without touching the cache cluster. Acceptable staleness for hot data.
- Key splitting: Replicate the hot key as key:replica:0, key:replica:1, etc. Each replica lives on a different node, spreading the load. Reads pick a random replica.
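Both strategies depend on knowing which keys are hot in the first place. The hot_keys set above is "tracked by monitoring"; a minimal sketch of one way to do that with a sliding-window counter (the HotKeyDetector class and its threshold are illustrative assumptions):

import time
from collections import defaultdict

class HotKeyDetector:
    """Hypothetical detector: flag keys exceeding a request-rate threshold."""

    def __init__(self, threshold_per_sec: int = 10_000, window: float = 1.0):
        self.threshold = threshold_per_sec
        self.window = window
        self.counts = defaultdict(int)
        self.window_start = time.time()
        self.hot_keys = set()

    def record(self, key: str):
        now = time.time()
        if now - self.window_start >= self.window:
            # Close the window: promote keys that exceeded the threshold
            self.hot_keys = {
                k for k, c in self.counts.items()
                if c / self.window >= self.threshold
            }
            self.counts.clear()
            self.window_start = now
        self.counts[key] += 1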
Deep Dive: Data Replication
For fault tolerance, each key should exist on multiple nodes. There are two approaches.
Async Replication (Redis Default)
Writes go to the primary and are asynchronously replicated to replicas. This is fast but risks data loss if the primary crashes before replication completes.
Client → Primary Node (write ACK) → Async → Replica 1
                                  → Async → Replica 2

- Latency: Low (write returns immediately)
- Durability: Weak (up to ~1 second of data loss on primary failure)
- Use case: Caching (data can be recomputed from the database)
Sync Replication
Writes are acknowledged only after at least one replica confirms. Higher latency but stronger durability.
Client → Primary Node → Sync → Replica 1 (ACK)
                      → Sync → Replica 2
       ← write ACK ←

- Latency: Higher (~2x write latency)
- Durability: Strong (data survives single node failure)
- Use case: Session storage, feature flags (data that cannot be cheaply recomputed)
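A simplified sketch of the two acknowledgment strategies on the primary, assuming hypothetical replica objects with a send(key, value) RPC and an in-process queue for the async path:

import queue
import threading

class Primary:
    """Sketch of a primary that replicates writes sync or async."""

    def __init__(self, replicas, sync: bool = False):
        self.store = {}
        self.replicas = replicas  # Assumed objects with send(key, value)
        self.sync = sync
        self.backlog = queue.Queue()
        threading.Thread(target=self._drain_backlog, daemon=True).start()

    def set(self, key, value):
        self.store[key] = value
        if self.sync:
            # Sync: at least one replica confirms before the client ACK
            self.replicas[0].send(key, value)
            rest = self.replicas[1:]
        else:
            # Async: ACK immediately; replication lags behind
            rest = self.replicas
        for replica in rest:
            self.backlog.put((replica, key, value))
        return "OK"  # Client ACK

    def _drain_backlog(self):
        while True:
            replica, key, value = self.backlog.get()
            replica.send(key, value)  # Crash before this = data loss window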
For a caching system, async replication is usually the right choice. The cache is a performance optimization — if a node fails and we lose some data, the worst case is a cache miss that gets refilled from the database.
Deep Dive: Memory Management
Efficient memory management is critical when you are storing millions of keys in RAM.
Slab Allocation (Memcached Approach)
Instead of using the system allocator (malloc/free) which causes fragmentation, Memcached pre-allocates memory in fixed-size “slabs”:
Slab Class 1:  96-byte chunks (for items < 96 bytes)
Slab Class 2:  120-byte chunks (for items 97-120 bytes)
Slab Class 3:  152-byte chunks (for items 121-152 bytes)
...
Slab Class 42: 1MB chunks (for items up to 1MB)

Each value is stored in the smallest slab class that fits. This eliminates fragmentation at the cost of some internal waste (a 97-byte value uses a 120-byte chunk, wasting 23 bytes).
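A sketch of how those class sizes arise and how an item picks its class, assuming Memcached's default 1.25 growth factor and 8-byte chunk alignment:

import bisect

def build_slab_classes(base: int = 96, factor: float = 1.25,
                       max_size: int = 1024 * 1024) -> list[int]:
    """Chunk sizes grow geometrically up to the 1MB page size."""
    sizes = []
    size = base
    while size < max_size:
        sizes.append(size)
        size = int(size * factor)
        size = (size + 7) // 8 * 8  # Align chunks to 8 bytes
    sizes.append(max_size)
    return sizes

def pick_class(item_size: int, sizes: list[int]) -> int:
    """Smallest chunk that fits; the difference is internal waste."""
    return sizes[bisect.bisect_left(sizes, item_size)]

sizes = build_slab_classes()   # [96, 120, 152, ...]
print(pick_class(97, sizes))   # 120 — 23 bytes of internal waste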
Redis Memory Optimization
Redis takes a different approach, using specialized data structures based on the data’s size:
Small hash (< 128 fields): ziplist (compact, linear scan)
Large hash (>= 128 fields): hashtable (traditional, O(1) lookup)
Small list (< 128 elements): ziplist (single-node quicklist)
Large list: quicklist (linked list of ziplists)
Small set (< 512 int members): intset (sorted array of integers)
Large set: hashtable

This adaptive approach saves significant memory for the common case of small objects while still providing efficient operations for large ones.
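A sketch of the encoding decision, with thresholds mirroring the defaults above (hash-max-ziplist-entries and hash-max-ziplist-value are real Redis config names; the functions themselves are illustrative):

def pick_hash_encoding(num_fields: int, max_field_len: int) -> str:
    """Mirrors hash-max-ziplist-entries / hash-max-ziplist-value."""
    if num_fields < 128 and max_field_len <= 64:
        return "ziplist"   # Compact, linear scan — fine when small
    return "hashtable"     # O(1) lookup, more memory per entry

def pick_set_encoding(members: list) -> str:
    if all(isinstance(m, int) for m in members) and len(members) < 512:
        return "intset"    # Sorted array of integers
    return "hashtable"

print(pick_hash_encoding(10, 32))            # ziplist
print(pick_set_encoding(list(range(1000))))  # hashtable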
Key Takeaways
- Consistent hashing is essential for distributed caches. It minimizes key remapping when nodes are added or removed — only K/N keys move, compared to nearly all keys with modulo hashing.
- LRU with a doubly-linked list + hashmap gives O(1) everything. GET, SET, DELETE, and eviction are all constant time. Approximated LRU (sampling) saves memory with minimal accuracy loss.
- Cache stampedes are a real threat. Probabilistic early recomputation is the most elegant solution — no locks, no coordination, statistically correct.
- Hot keys need special treatment. Local in-process caches with short TTLs and key splitting across replicas handle extreme read-heavy keys.
- Virtual nodes are not optional. Without them, consistent hashing produces wildly uneven key distribution. 150 virtual nodes per physical node is a good starting point.
- Choose replication based on your data's recoverability. Async replication for cacheable data (fast writes, acceptable loss). Sync replication for session data (slower writes, no data loss).
A well-designed distributed cache is the single biggest performance lever in most architectures. Understanding these internals — from the hash ring to the LRU list to the eviction policies — is what separates a surface-level answer from a strong system design interview response.
