E-commerce platforms handle some of the most demanding distributed systems challenges: inventory that must never oversell, flash sales with 100x traffic spikes, distributed transactions spanning multiple services, and search over millions of products with faceted filtering. This lesson covers how to design a platform that handles all of this reliably.
Understanding the Problem
Functional Requirements
- Product catalog — Browse products by category, view details, images, reviews
- Product search — Full-text search with filters (price range, brand, rating, category)
- Shopping cart — Add/remove items, persist across sessions
- Checkout and ordering — Convert cart to order, apply coupons, select shipping
- Payment processing — Charge credit card, handle refunds
- Order tracking — Real-time status from placed through delivered
- User reviews — Star ratings and text reviews on products
- Seller dashboard — Sellers manage their product listings and view analytics
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Availability | 99.99% (downtime = lost revenue) |
| Search latency (p99) | < 100ms |
| Checkout consistency | Zero overselling — never sell what you don’t have |
| Flash sale handling | Sustain 10-100x normal traffic for 30-60 minutes |
| Order processing | Handle 50K orders/minute at peak |
| Data durability | Zero order/payment data loss |
Scale Estimation
- 100 million products in the catalog
- 500 million registered users, 50 million daily active
- 5 million orders/day normally, 50 million during peak events
- Product search: 200K QPS at peak
- Cart operations: 500K QPS at peak (lots of browsing, adding, removing)
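These daily figures convert into per-second rates with simple arithmetic. A quick back-of-envelope check (the numbers come from the estimates above; the conversion itself is illustrative):

```python
# Back-of-envelope: convert the daily/minute figures above into QPS.
SECONDS_PER_DAY = 86_400

normal_orders_per_day = 5_000_000   # normal day
peak_orders_per_minute = 50_000     # from the NFR table

avg_order_qps = normal_orders_per_day / SECONDS_PER_DAY  # ~58 orders/s
peak_order_qps = peak_orders_per_minute / 60             # ~833 orders/s

print(f"average: {avg_order_qps:.0f} orders/s, peak: {peak_order_qps:.0f} orders/s")
```

The takeaway: peak order traffic is roughly 14x the daily average, which is why the design below treats spike absorption as a first-class concern.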
Core Entities and APIs
Data Model
-- Products
CREATE TABLE products (
    id UUID PRIMARY KEY,
    seller_id UUID REFERENCES sellers(id),
    title VARCHAR(500),
    description TEXT,
    category_id UUID REFERENCES categories(id),
    price DECIMAL(10,2),
    image_urls TEXT[],
    rating_avg DECIMAL(2,1),
    rating_count INT DEFAULT 0,
    status VARCHAR(20) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT NOW()
);

-- Inventory (separate from product — different update patterns)
CREATE TABLE inventory (
    product_id UUID PRIMARY KEY REFERENCES products(id),
    stock INT NOT NULL CHECK (stock >= 0),
    reserved INT NOT NULL DEFAULT 0, -- held by in-progress orders
    version INT NOT NULL DEFAULT 1,  -- for optimistic locking
    warehouse_id UUID,
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Orders
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) NOT NULL,
    -- PENDING → RESERVED → PAID → CONFIRMED → SHIPPED → DELIVERED → RETURNED
    -- PENDING → CANCELLED (at any pre-shipped stage)
    status_reason VARCHAR(255),  -- why an order was cancelled/returned
    total_amount DECIMAL(10,2),
    shipping_addr JSONB,
    coupon_code VARCHAR(50),
    discount_amount DECIMAL(10,2) DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    id UUID PRIMARY KEY,
    order_id UUID REFERENCES orders(id),
    product_id UUID REFERENCES products(id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    subtotal DECIMAL(10,2) NOT NULL
);

-- Payments
CREATE TABLE payments (
    id UUID PRIMARY KEY,
    order_id UUID REFERENCES orders(id),
    amount DECIMAL(10,2),
    status VARCHAR(20),        -- PENDING, CHARGED, REFUNDED, FAILED
    provider VARCHAR(20),      -- stripe, paypal
    provider_ref VARCHAR(255), -- external transaction ID
    created_at TIMESTAMP DEFAULT NOW()
);

-- Reviews
CREATE TABLE reviews (
    id UUID PRIMARY KEY,
    product_id UUID REFERENCES products(id),
    user_id UUID REFERENCES users(id),
    rating SMALLINT CHECK (rating BETWEEN 1 AND 5),
    title VARCHAR(200),
    body TEXT,
    verified_purchase BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(product_id, user_id)
);

API Design
# Product APIs
GET /api/v1/products?category=electronics&sort=rating&page=1
GET /api/v1/products/{id}
GET /api/v1/products/search?q=wireless+headphones&min_price=50&max_price=200&brand=Sony
# Cart APIs
GET /api/v1/cart
POST /api/v1/cart/items
body: { product_id, quantity }
PUT /api/v1/cart/items/{product_id}
body: { quantity }
DELETE /api/v1/cart/items/{product_id}
# Order APIs
POST /api/v1/orders/checkout
body: { shipping_address_id, payment_method_id, coupon_code? }
response: { order_id, status: "PENDING", estimated_delivery }
GET /api/v1/orders/{id}
GET /api/v1/orders/{id}/track
# Payment (called by order service internally)
POST /internal/v1/payments/charge
body: { order_id, amount, payment_method_id }
# Reviews
POST /api/v1/products/{id}/reviews
body: { rating: 5, title: "Great product", body: "..." }
GET /api/v1/products/{id}/reviews?sort=helpful&page=1

High-Level Design
The platform follows a microservices architecture where each service owns its data and communicates via APIs and events.
Product Service — CRUD for the catalog. Writes to PostgreSQL, syncs to Elasticsearch for search. Reads are cached heavily (products change infrequently).
Search Service — Wraps Elasticsearch. Handles full-text search, faceted filtering (brand, price range, category, rating), and sorting. Product data is denormalized into the search index for fast queries.
Cart Service — Uses Redis for active session carts (fast reads/writes, TTL for abandoned carts) backed by PostgreSQL for persistence (so carts survive if Redis flushes).
Order Service — Orchestrates the order lifecycle. Acts as the Saga orchestrator, coordinating across inventory, payment, and notification services.
Inventory Service — Source of truth for stock levels. Uses optimistic locking to prevent overselling. Completely separate from the product service because inventory has very different write patterns (high contention during sales).
Payment Service — Integrates with Stripe/PayPal. Handles charges, refunds, and payment status webhooks. All payment operations are idempotent (retry-safe).
Notification Service — Sends order confirmations, shipping updates, and delivery notifications via email, SMS, and push.
Why Separate Inventory from Product?
Product data is read-heavy (millions of page views, rare updates). Inventory data is write-heavy during checkout (many concurrent decrements). Separating them means:
- Product reads are served from cache/read-replicas without contention
- Inventory writes use optimistic locking on a focused table with minimal columns
- Each service scales independently
Deep Dive: Inventory Management (Preventing Overselling)
The single most critical correctness requirement: never sell more items than you have in stock. This is surprisingly hard in a concurrent distributed system.
The Problem
Two users simultaneously try to buy the last unit of a product:
- User A reads stock = 1
- User B reads stock = 1
- User A writes stock = 0 — success
- User B writes stock = 0 — success (BUG: item oversold!)
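The lost update is easy to reproduce. A minimal sketch, using a plain Python dict as a stand-in for the database row, with the two "transactions" interleaved by hand:

```python
# Simulate the interleaving: both "transactions" read before either writes.
stock = {"widget": 1}

read_a = stock["widget"]      # User A reads 1
read_b = stock["widget"]      # User B reads 1 (before A writes back)

stock["widget"] = read_a - 1  # User A writes 0, success
stock["widget"] = read_b - 1  # User B writes 0, "success"

units_sold = 2                # both users were told they bought the item
assert stock["widget"] == 0   # the row looks consistent...
# ...but units_sold exceeds the single unit that existed: oversold.
```

The row ends in a perfectly valid-looking state, which is what makes this bug so dangerous: nothing in the data itself reveals that two units were sold.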
The Solution: Optimistic Locking (CAS)
Add a version column. Every update is a Compare-And-Swap (CAS) operation: “update only if the version is what I last read.”
async def reserve_inventory(product_id: str, quantity: int, max_retries: int = 3):
    """
    Attempt to reserve inventory using optimistic locking.
    Returns True if reservation succeeded, False if out of stock.
    """
    for attempt in range(max_retries):
        # Step 1: Read current stock and version
        row = await db.fetchone("""
            SELECT stock, reserved, version
            FROM inventory
            WHERE product_id = $1
        """, product_id)

        available = row.stock - row.reserved
        if available < quantity:
            return False  # Not enough stock

        # Step 2: CAS update — only succeeds if version hasn't changed
        result = await db.execute("""
            UPDATE inventory
            SET reserved = reserved + $1,
                version = version + 1
            WHERE product_id = $2
              AND version = $3
              AND stock - reserved >= $1
        """, quantity, product_id, row.version)

        if result.rowcount == 1:
            return True  # Reserved successfully

        # Version conflict — another transaction modified this row.
        # Retry with exponential backoff.
        await asyncio.sleep(0.01 * (2 ** attempt))

    return False  # Exhausted retries

The key line is the WHERE version = $3 clause. If another request incremented the version between our read and our write, the UPDATE matches zero rows and we know to retry.
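The same compare-and-swap shape can be shown without a database. A toy in-memory version (illustrative only — single-threaded, so the version check never actually conflicts here; in production the check lives in the SQL UPDATE above):

```python
# Toy in-memory inventory row demonstrating the shape of the CAS loop.
row = {"stock": 5, "reserved": 0, "version": 1}

def cas_reserve(quantity: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        snapshot = dict(row)  # "read" stock, reserved, and version
        if snapshot["stock"] - snapshot["reserved"] < quantity:
            return False  # not enough stock
        # "UPDATE ... WHERE version = <version we read>"
        if row["version"] == snapshot["version"]:
            row["reserved"] += quantity
            row["version"] += 1
            return True
        # Version conflict: another writer got in first — loop and retry.
    return False

assert cas_reserve(2) is True   # reserves 2, version bumps to 2
assert cas_reserve(3) is True   # reserves the remaining 3
assert cas_reserve(1) is False  # 5 stock, 5 reserved: sold out
```

Every successful write bumps the version, so any reader holding a stale snapshot is forced back around the loop instead of silently clobbering the row.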
Why Not Pessimistic Locking?
-- Pessimistic approach: SELECT FOR UPDATE
BEGIN;
SELECT stock FROM inventory WHERE product_id = $1 FOR UPDATE;
-- This locks the row until the transaction commits
UPDATE inventory SET stock = stock - 1 WHERE product_id = $1;
COMMIT;

Pessimistic locking works but has a major downside: every concurrent request for the same product blocks, waiting for the lock. During a flash sale with 10K users trying to buy the same product, this creates massive contention and timeouts.
Optimistic locking has no blocking. Most requests succeed on the first try (low contention). Only when there’s a genuine conflict does a retry happen. Under normal conditions, it’s dramatically faster.
For Flash Sales: Redis + Queue
When a single product gets 100K simultaneous purchase attempts, even optimistic locking struggles (too many retries). The solution is to put a queue in front:
async def flash_sale_purchase(product_id: str, user_id: str):
    """
    Queue-based flow for flash sale items.
    """
    # Step 1: Atomic decrement in Redis
    remaining = await redis.decr(f"flash:{product_id}:stock")
    if remaining < 0:
        # Sold out — fast rejection without hitting the DB
        await redis.incr(f"flash:{product_id}:stock")  # undo
        return {"status": "sold_out"}

    # Step 2: User "won" a slot — enqueue for actual order processing
    await kafka.produce("flash-orders", {
        "product_id": product_id,
        "user_id": user_id,
        "timestamp": time.time()
    })
    return {"status": "queued", "message": "Order is being processed"}

Redis DECR is atomic and single-threaded — it naturally serializes access. It can handle 500K+ operations/second on a single key. The actual database writes happen asynchronously via the queue, at a controlled rate.
Deep Dive: Order Processing with the Saga Pattern
Placing an order involves multiple services: create the order, reserve inventory, charge payment, confirm. In a monolith, this would be a single database transaction. In microservices, each service has its own database, so a single ACID transaction isn't available, and two-phase commit (2PC) across services is too slow and fragile at this scale.
The Saga pattern solves this: a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
Orchestrator-Based Saga
The Order Service acts as the saga orchestrator, explicitly managing the sequence:
class OrderSaga:
    """
    Orchestrates the checkout flow across services.
    Each step has a corresponding compensation action.
    """
    async def execute(self, cart, user, payment_method):
        order = None
        reserved_items = []  # track per item, so a partial reservation can be released
        payment_charged = False
        try:
            # Step 1: Create order (PENDING)
            order = await order_service.create_order(
                user_id=user.id,
                items=cart.items,
                total=cart.total
            )

            # Step 2: Reserve inventory for each item
            for item in cart.items:
                success = await inventory_service.reserve(
                    product_id=item.product_id,
                    quantity=item.quantity,
                    order_id=order.id
                )
                if not success:
                    raise SagaException(f"Out of stock: {item.product_id}")
                reserved_items.append(item)
            await order_service.update_status(order.id, "RESERVED")

            # Step 3: Charge payment
            payment = await payment_service.charge(
                order_id=order.id,
                amount=cart.total,
                payment_method=payment_method,
                idempotency_key=f"order-{order.id}"
            )
            if payment.status != "CHARGED":
                raise SagaException(f"Payment failed: {payment.decline_reason}")
            payment_charged = True
            await order_service.update_status(order.id, "PAID")

            # Step 4: Confirm order
            await order_service.update_status(order.id, "CONFIRMED")
            await notification_service.send_order_confirmation(order)
            await cart_service.clear(user.id)
            return order

        except SagaException as e:
            # Compensating transactions — undo in reverse order
            await self._compensate(order, reserved_items, payment_charged, str(e))
            raise

    async def _compensate(self, order, reserved_items, payment_charged, reason):
        """Undo completed steps in reverse order."""
        if payment_charged:
            await payment_service.refund(order.id)
        # Release every item that was actually reserved. If the saga failed
        # partway through step 2, this covers the partial reservation too.
        for item in reserved_items:
            await inventory_service.release(
                product_id=item.product_id,
                quantity=item.quantity,
                order_id=order.id
            )
        if order:
            await order_service.update_status(
                order.id, "CANCELLED",
                reason=reason
            )

Why Orchestration Over Choreography?
There are two flavors of Saga:
Orchestration (what we use): A central coordinator calls each service in sequence. Easier to understand, debug, and add new steps.
Choreography: Each service publishes events, and the next service reacts. More decoupled, but the flow is hard to trace and debug.
For checkout — which is a critical, well-defined sequence — orchestration is the clear winner. The Order Service acts as the single place to understand “what happened to this order.”
Idempotency is Critical
Every service call in the saga must be idempotent. If the orchestrator crashes after step 2 and restarts, it may re-execute step 2. The inventory service must handle duplicate reserve requests:
async def reserve_inventory(product_id, quantity, order_id):
    # Check if already reserved for this order
    existing = await db.fetchone("""
        SELECT id FROM reservations
        WHERE order_id = $1 AND product_id = $2
    """, order_id, product_id)
    if existing:
        return True  # Already reserved — idempotent

    # Proceed with CAS reservation. A UNIQUE(order_id, product_id) constraint
    # on reservations backstops the check above, so a racing duplicate
    # request fails cleanly instead of double-reserving.
    return await _do_reserve(product_id, quantity, order_id)

Deep Dive: Product Search with Faceted Filtering
Users expect to search “wireless headphones” and filter by price ($50-$200), brand (Sony, Bose), rating (4+ stars), and see results in under 100ms.
Elasticsearch Index Design
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" },
      "description": { "type": "text", "analyzer": "standard" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" },
      "price": { "type": "float" },
      "rating_avg": { "type": "float" },
      "rating_count": { "type": "integer" },
      "tags": { "type": "keyword" },
      "seller_id": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}

Faceted Search Query
{
  "query": {
    "bool": {
      "must": [
        { "multi_match": {
            "query": "wireless headphones",
            "fields": ["title^3", "description", "tags^2"]
        }},
        { "term": { "in_stock": true } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "terms": { "brand": ["sony", "bose"] } },
        { "range": { "rating_avg": { "gte": 4.0 } } }
      ]
    }
  },
  "aggs": {
    "brands": {
      "terms": { "field": "brand", "size": 20 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "avg_rating": {
      "range": {
        "field": "rating_avg",
        "ranges": [
          { "from": 4.0, "key": "4_and_up" },
          { "from": 3.0, "key": "3_and_up" }
        ]
      }
    }
  },
  "sort": [
    { "_score": "desc" },
    { "rating_avg": "desc" }
  ],
  "size": 20
}

Faceted aggregations are computed alongside the search results. The aggs section returns counts per brand, price range, and rating bracket — these populate the filter sidebar on the UI. Elasticsearch computes these efficiently because keyword and numeric fields use doc-values (columnar storage).
Keeping Search in Sync
When a product is updated in PostgreSQL, we need to update Elasticsearch. This is done asynchronously via events:
Product Service → Kafka (product-updates) → Search Indexer → Elasticsearch

The indexer consumes product change events and upserts the Elasticsearch document. There's a brief inconsistency window (seconds), but search results don't need to be perfectly real-time.
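The heart of the indexer is a pure transform from the product-change event to the search document. A sketch, assuming the field names from the index mapping above; `product_to_doc` is a hypothetical helper, and the Kafka consumer loop and Elasticsearch bulk-upsert around it are omitted:

```python
def product_to_doc(row: dict) -> dict:
    """Map a product-change event payload to the search-document shape
    defined in the index mapping above (hypothetical helper)."""
    return {
        "title": row["title"],
        "description": row.get("description", ""),
        "category": row["category"],
        "brand": row.get("brand"),
        "price": float(row["price"]),
        "rating_avg": float(row.get("rating_avg") or 0.0),
        "rating_count": int(row.get("rating_count") or 0),
        # Denormalized from the inventory service's event payload
        "in_stock": row.get("stock", 0) > 0,
    }

doc = product_to_doc({
    "title": "Wireless Headphones", "category": "electronics",
    "brand": "sony", "price": "129.99", "stock": 12,
})
```

Keeping the transform pure makes it trivial to unit-test and to replay against a backlog of events when rebuilding the index from scratch.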
Deep Dive: Shopping Cart Design
The cart seems simple, but the design choices matter for performance:
class CartService:
    """
    Hybrid cart: Redis for speed, PostgreSQL for durability.
    """
    async def get_cart(self, user_id: str):
        # Try Redis first (fast path)
        cart_data = await redis.hgetall(f"cart:{user_id}")
        if cart_data:
            return self._deserialize(cart_data)

        # Fall back to PostgreSQL (user returning after days)
        cart = await db.fetchall("""
            SELECT product_id, quantity FROM cart_items
            WHERE user_id = $1
        """, user_id)

        # Warm the Redis cache
        if cart:
            await self._cache_cart(user_id, cart)
        return cart

    async def add_item(self, user_id: str, product_id: str, quantity: int):
        # Write-through: update both Redis and DB
        await redis.hset(f"cart:{user_id}", product_id, quantity)
        await redis.expire(f"cart:{user_id}", 86400 * 30)  # 30 day TTL

        await db.execute("""
            INSERT INTO cart_items (user_id, product_id, quantity)
            VALUES ($1, $2, $3)
            ON CONFLICT (user_id, product_id)
            DO UPDATE SET quantity = $3, updated_at = NOW()
        """, user_id, product_id, quantity)

Why Redis + DB?
- Redis handles the high QPS of cart operations (500K/s at peak)
- PostgreSQL ensures carts persist if Redis loses data
- The write-through pattern keeps both in sync
- Redis TTL handles abandoned cart cleanup
Deep Dive: Flash Sale Handling
A flash sale (e.g., Amazon Prime Day, Black Friday lightning deals) can spike traffic 100x in seconds. The system must not crash, and must not oversell.
Architecture for Spikes
                               ┌─── Rate Limiter ─── Queue ─── Order Processor
User → CDN → API Gateway → LB ─┤
                               └─── Static cache (product pages, images)

Layer 1: CDN + Static Cache
Pre-warm product pages in the CDN. Most flash sale traffic is people refreshing the product page — serve that from cache.
Layer 2: Rate Limiting
At the API Gateway, limit checkout requests per user to 1 per 5 seconds. This prevents a single user from monopolizing stock.
Layer 3: Redis Pre-Check
Before hitting the database, check Redis:
async def can_purchase(product_id, user_id):
    # Has user already purchased this flash sale item?
    already_bought = await redis.sismember(f"flash:{product_id}:buyers", user_id)
    if already_bought:
        return False, "Already purchased"

    # Is there stock remaining? Guard against a missing key (e.g. the sale
    # hasn't been pre-warmed yet) as well as zero stock.
    stock = await redis.get(f"flash:{product_id}:stock")
    if stock is None or int(stock) <= 0:
        return False, "Sold out"
    return True, "Proceed"

Layer 4: Queue-Based Processing
Users who pass the pre-check are enqueued. A pool of workers processes orders at a controlled rate, preventing database overload:
# Worker pool: 100 workers, each processing 1 order at a time
# Throughput: ~100 orders/second sustained (plenty for most flash sales)
async def flash_order_worker():
    while True:
        event = await kafka.consume("flash-orders")
        try:
            # This uses the normal Saga flow with optimistic locking
            await order_saga.execute(
                cart=event.cart,
                user=event.user,
                payment_method=event.payment
            )
        except SagaException:
            await notification_service.send(
                event.user.id,
                "Sorry, this item sold out while processing your order."
            )

Pre-Warming
Before the flash sale starts:
- Load product data into CDN and Redis caches
- Set flash:{product_id}:stock in Redis to the available quantity
- Scale up API Gateway, Order Service, and Inventory Service instances
- Alert the on-call team
Order State Machine
Orders follow a strict state machine. Only valid transitions are allowed:
ORDER_TRANSITIONS = {
    "PENDING":   ["RESERVED", "CANCELLED"],
    "RESERVED":  ["PAID", "CANCELLED"],
    "PAID":      ["CONFIRMED", "CANCELLED"],  # rare: cancel after payment = refund
    "CONFIRMED": ["SHIPPED", "CANCELLED"],    # cancel before ship = refund
    "SHIPPED":   ["DELIVERED"],               # no cancel after shipping
    "DELIVERED": ["RETURNED"],                # customer initiates return
    "CANCELLED": [],                          # terminal
    "RETURNED":  [],                          # terminal
}

async def update_order_status(order_id, new_status, reason=None):
    order = await db.fetchone("SELECT status FROM orders WHERE id = $1", order_id)
    if new_status not in ORDER_TRANSITIONS[order.status]:
        raise InvalidTransition(
            f"Cannot transition from {order.status} to {new_status}"
        )

    # Conditional update: the status predicate guards against a concurrent
    # transition between the read above and this write.
    result = await db.execute("""
        UPDATE orders
        SET status = $1, updated_at = NOW(), status_reason = $2
        WHERE id = $3 AND status = $4
    """, new_status, reason, order_id, order.status)
    if result.rowcount == 0:
        raise InvalidTransition("Order status changed concurrently; retry")

    # Emit event for downstream services
    await kafka.produce("order-status-changes", {
        "order_id": order_id,
        "old_status": order.status,
        "new_status": new_status,
        "timestamp": time.time()
    })

Key Takeaways
| Decision | Choice | Why |
|---|---|---|
| Inventory consistency | Optimistic locking (CAS) | High throughput, no blocking, retry on conflict |
| Flash sale throttling | Redis atomic decrement + Kafka queue | Fast rejection, controlled processing rate |
| Distributed transactions | Saga pattern (orchestrated) | No 2PC, clear flow, compensating actions on failure |
| Product search | Elasticsearch | Full-text search, faceted filtering, sub-100ms queries |
| Cart storage | Redis (hot) + PostgreSQL (durable) | Speed for active sessions, persistence for returning users |
| Order processing | State machine with Kafka events | Audit trail, decouple downstream services |
| Payment idempotency | Idempotency keys per order | Safe to retry on network failures |
| Traffic spikes | CDN + rate limit + queue + auto-scale | Each layer absorbs a portion of the spike |
The e-commerce design is a masterclass in consistency under concurrency (inventory), distributed transactions (Saga), and handling traffic spikes (flash sales). Interviewers love to probe the inventory overselling problem and the Saga failure/compensation flow — have your CAS SQL query and compensation logic ready to whiteboard.
