E-commerce platforms handle some of the most demanding distributed systems challenges: inventory that must never oversell, flash sales with 100x traffic spikes, distributed transactions spanning multiple services, and search over millions of products with faceted filtering. This lesson covers how to design a platform that handles all of this reliably.
Understanding the Problem
Functional Requirements
- Product catalog — Browse products by category, view details, images, reviews
- Product search — Full-text search with filters (price range, brand, rating, category)
- Shopping cart — Add/remove items, persist across sessions
- Checkout and ordering — Convert cart to order, apply coupons, select shipping
- Payment processing — Charge credit card, handle refunds
- Order tracking — Real-time status from placed through delivered
- User reviews — Star ratings and text reviews on products
- Seller dashboard — Sellers manage their product listings and view analytics
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Availability | 99.99% (downtime = lost revenue) |
| Search latency (p99) | < 100ms |
| Checkout consistency | Zero overselling — never sell what you don’t have |
| Flash sale handling | Sustain 10-100x normal traffic for 30-60 minutes |
| Order processing | Handle 50K orders/minute at peak |
| Data durability | Zero order/payment data loss |
Scale Estimation
- 100 million products in the catalog
- 500 million registered users, 50 million daily active
- 5 million orders/day normally, 50 million during peak events
- Product search: 200K QPS at peak
- Cart operations: 500K QPS at peak (lots of browsing, adding, removing)
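These daily figures convert into per-second rates with simple arithmetic. A quick back-of-envelope check (the numbers come from the estimates above; the conversion itself is illustrative):

```python
# Back-of-envelope: convert the daily/minute figures above into QPS.
SECONDS_PER_DAY = 86_400

normal_orders_per_day = 5_000_000   # normal day
peak_orders_per_minute = 50_000     # from the NFR table

avg_order_qps = normal_orders_per_day / SECONDS_PER_DAY  # ~58 orders/s
peak_order_qps = peak_orders_per_minute / 60             # ~833 orders/s

print(f"average: {avg_order_qps:.0f} orders/s, peak: {peak_order_qps:.0f} orders/s")
```

The takeaway: peak order traffic is roughly 14x the daily average, which is why the design below treats spike absorption as a first-class concern.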
Core Entities and APIs
Data Model
-- Products
CREATE TABLE products (
    id UUID PRIMARY KEY,
    seller_id UUID REFERENCES sellers(id),
    title VARCHAR(500),
    description TEXT,
    category_id UUID REFERENCES categories(id),
    price DECIMAL(10,2),
    image_urls TEXT[],
    rating_avg DECIMAL(2,1),
    rating_count INT DEFAULT 0,
    status VARCHAR(20) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT NOW()
);

-- Inventory (separate from product — different update patterns)
CREATE TABLE inventory (
    product_id UUID PRIMARY KEY REFERENCES products(id),
    stock INT NOT NULL CHECK (stock >= 0),
    reserved INT NOT NULL DEFAULT 0, -- held by in-progress orders
    version INT NOT NULL DEFAULT 1,  -- for optimistic locking
    warehouse_id UUID,
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Orders
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) NOT NULL,
    -- PENDING → RESERVED → PAID → CONFIRMED → SHIPPED → DELIVERED → RETURNED
    -- PENDING → CANCELLED (at any pre-shipped stage)
    status_reason VARCHAR(255),  -- why an order was cancelled/returned
    total_amount DECIMAL(10,2),
    shipping_addr JSONB,
    coupon_code VARCHAR(50),
    discount_amount DECIMAL(10,2) DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    id UUID PRIMARY KEY,
    order_id UUID REFERENCES orders(id),
    product_id UUID REFERENCES products(id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    subtotal DECIMAL(10,2) NOT NULL
);

-- Payments
CREATE TABLE payments (
    id UUID PRIMARY KEY,
    order_id UUID REFERENCES orders(id),
    amount DECIMAL(10,2),
    status VARCHAR(20),        -- PENDING, CHARGED, REFUNDED, FAILED
    provider VARCHAR(20),      -- stripe, paypal
    provider_ref VARCHAR(255), -- external transaction ID
    created_at TIMESTAMP DEFAULT NOW()
);

-- Reviews
CREATE TABLE reviews (
    id UUID PRIMARY KEY,
    product_id UUID REFERENCES products(id),
    user_id UUID REFERENCES users(id),
    rating SMALLINT CHECK (rating BETWEEN 1 AND 5),
    title VARCHAR(200),
    body TEXT,
    verified_purchase BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(product_id, user_id)
);

API Design
# Product APIs
GET /api/v1/products?category=electronics&sort=rating&page=1
GET /api/v1/products/{id}
GET /api/v1/products/search?q=wireless+headphones&min_price=50&max_price=200&brand=Sony
# Cart APIs
GET /api/v1/cart
POST /api/v1/cart/items
body: { product_id, quantity }
PUT /api/v1/cart/items/{product_id}
body: { quantity }
DELETE /api/v1/cart/items/{product_id}
# Order APIs
POST /api/v1/orders/checkout
body: { shipping_address_id, payment_method_id, coupon_code? }
response: { order_id, status: "PENDING", estimated_delivery }
GET /api/v1/orders/{id}
GET /api/v1/orders/{id}/track
# Payment (called by order service internally)
POST /internal/v1/payments/charge
body: { order_id, amount, payment_method_id }
# Reviews
POST /api/v1/products/{id}/reviews
body: { rating: 5, title: "Great product", body: "..." }
GET /api/v1/products/{id}/reviews?sort=helpful&page=1

High-Level Design
The platform follows a microservices architecture where each service owns its data and communicates via APIs and events.
Product Service — CRUD for the catalog. Writes to PostgreSQL, syncs to Elasticsearch for search. Reads are cached heavily (products change infrequently).
Search Service — Wraps Elasticsearch. Handles full-text search, faceted filtering (brand, price range, category, rating), and sorting. Product data is denormalized into the search index for fast queries.
Cart Service — Uses Redis for active session carts (fast reads/writes, TTL for abandoned carts) backed by PostgreSQL for persistence (so carts survive if Redis flushes).
Order Service — Orchestrates the order lifecycle. Acts as the Saga orchestrator, coordinating across inventory, payment, and notification services.
Inventory Service — Source of truth for stock levels. Uses optimistic locking to prevent overselling. Completely separate from the product service because inventory has very different write patterns (high contention during sales).
Payment Service — Integrates with Stripe/PayPal. Handles charges, refunds, and payment status webhooks. All payment operations are idempotent (retry-safe).
Notification Service — Sends order confirmations, shipping updates, and delivery notifications via email, SMS, and push.
Why Separate Inventory from Product?
Product data is read-heavy (millions of page views, rare updates). Inventory data is write-heavy during checkout (many concurrent decrements). Separating them means:
- Product reads are served from cache/read-replicas without contention
- Inventory writes use optimistic locking on a focused table with minimal columns
- Each service scales independently
Deep Dive: Inventory Management (Preventing Overselling)
The single most critical correctness requirement: never sell more items than you have in stock. This is surprisingly hard in a concurrent distributed system.
The Problem
Two users simultaneously try to buy the last unit of a product:
- User A reads stock = 1
- User B reads stock = 1
- User A writes stock = 0 — success
- User B writes stock = 0 — success (BUG: item oversold!)
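The lost update is easy to reproduce. A minimal sketch, using a plain Python dict as a stand-in for the database row, with the two "transactions" interleaved by hand:

```python
# Simulate the interleaving: both "transactions" read before either writes.
stock = {"widget": 1}

read_a = stock["widget"]      # User A reads 1
read_b = stock["widget"]      # User B reads 1 (before A writes back)

stock["widget"] = read_a - 1  # User A writes 0, success
stock["widget"] = read_b - 1  # User B writes 0, "success"

units_sold = 2                # both users were told they bought the item
assert stock["widget"] == 0   # the row looks consistent...
# ...but units_sold exceeds the single unit that existed: oversold.
```

The row ends in a perfectly valid-looking state, which is what makes this bug so dangerous: nothing in the data itself reveals that two units were sold.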
The Solution: Optimistic Locking (CAS)
Add a version column. Every update is a Compare-And-Swap (CAS) operation: “update only if the version is what I last read.”
async def reserve_inventory(product_id: str, quantity: int, max_retries: int = 3):
    """
    Attempt to reserve inventory using optimistic locking.
    Returns True if reservation succeeded, False if out of stock.
    """
    for attempt in range(max_retries):
        # Step 1: Read current stock and version
        row = await db.fetchone("""
            SELECT stock, reserved, version
            FROM inventory
            WHERE product_id = $1
        """, product_id)

        available = row.stock - row.reserved
        if available < quantity:
            return False  # Not enough stock

        # Step 2: CAS update — only succeeds if version hasn't changed
        result = await db.execute("""
            UPDATE inventory
            SET reserved = reserved + $1,
                version = version + 1
            WHERE product_id = $2
              AND version = $3
              AND stock - reserved >= $1
        """, quantity, product_id, row.version)

        if result.rowcount == 1:
            return True  # Reserved successfully

        # Version conflict — another transaction modified this row.
        # Retry with exponential backoff.
        await asyncio.sleep(0.01 * (2 ** attempt))

    return False  # Exhausted retries

The key line is the WHERE version = $3 clause. If another request incremented the version between our read and our write, the UPDATE matches zero rows and we know to retry.
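The same compare-and-swap shape can be shown without a database. A toy in-memory version (illustrative only — single-threaded, so the version check never actually conflicts here; in production the check lives in the SQL UPDATE above):

```python
# Toy in-memory inventory row demonstrating the shape of the CAS loop.
row = {"stock": 5, "reserved": 0, "version": 1}

def cas_reserve(quantity: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        snapshot = dict(row)  # "read" stock, reserved, and version
        if snapshot["stock"] - snapshot["reserved"] < quantity:
            return False  # not enough stock
        # "UPDATE ... WHERE version = <version we read>"
        if row["version"] == snapshot["version"]:
            row["reserved"] += quantity
            row["version"] += 1
            return True
        # Version conflict: another writer got in first — loop and retry.
    return False

assert cas_reserve(2) is True   # reserves 2, version bumps to 2
assert cas_reserve(3) is True   # reserves the remaining 3
assert cas_reserve(1) is False  # 5 stock, 5 reserved: sold out
```

Every successful write bumps the version, so any reader holding a stale snapshot is forced back around the loop instead of silently clobbering the row.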
Why Not Pessimistic Locking?
-- Pessimistic approach: SELECT FOR UPDATE
BEGIN;
SELECT stock FROM inventory WHERE product_id = $1 FOR UPDATE;
-- This locks the row until the transaction commits
UPDATE inventory SET stock = stock - 1 WHERE product_id = $1;
COMMIT;

Pessimistic locking works but has a major downside: every concurrent request for the same product blocks, waiting for the lock. During a flash sale with 10K users trying to buy the same product, this creates massive contention and timeouts.
Optimistic locking has no blocking. Most requests succeed on the first try (low contention). Only when there’s a genuine conflict does a retry happen. Under normal conditions, it’s dramatically faster.
For Flash Sales: Redis + Queue
When a single product gets 100K simultaneous purchase attempts, even optimistic locking struggles (too many retries). The solution is to put a queue in front:
async def flash_sale_purchase(product_id: str, user_id: str):
    """
    Queue-based flow for flash sale items.
    """
    # Step 1: Atomic decrement in Redis
    remaining = await redis.decr(f"flash:{product_id}:stock")
    if remaining < 0:
        # Sold out — fast rejection without hitting the DB
        await redis.incr(f"flash:{product_id}:stock")  # undo
        return {"status": "sold_out"}

    # Step 2: User "won" a slot — enqueue for actual order processing
    await kafka.produce("flash-orders", {
        "product_id": product_id,
        "user_id": user_id,
        "timestamp": time.time()
    })
    return {"status": "queued", "message": "Order is being processed"}

Redis DECR is atomic and single-threaded — it naturally serializes access. It can handle 500K+ operations/second on a single key. The actual database writes happen asynchronously via the queue, at a controlled rate.
Deep Dive: Order Processing with the Saga Pattern
Placing an order involves multiple services: create the order, reserve inventory, charge payment, confirm. In a monolith, this would be a single database transaction. In microservices, each service has its own database, so a single ACID transaction isn't available, and two-phase commit (2PC) across services is too slow and fragile at this scale.
The Saga pattern solves this: a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
Orchestrator-Based Saga
The Order Service acts as the saga orchestrator, explicitly managing the sequence:
class OrderSaga:
    """
    Orchestrates the checkout flow across services.
    Each step has a corresponding compensation action.
    """
    async def execute(self, cart, user, payment_method):
        order = None
        reserved_items = []  # track per item, so a partial reservation can be released
        payment_charged = False
        try:
            # Step 1: Create order (PENDING)
            order = await order_service.create_order(
                user_id=user.id,
                items=cart.items,
                total=cart.total
            )

            # Step 2: Reserve inventory for each item
            for item in cart.items:
                success = await inventory_service.reserve(
                    product_id=item.product_id,
                    quantity=item.quantity,
                    order_id=order.id
                )
                if not success:
                    raise SagaException(f"Out of stock: {item.product_id}")
                reserved_items.append(item)
            await order_service.update_status(order.id, "RESERVED")

            # Step 3: Charge payment
            payment = await payment_service.charge(
                order_id=order.id,
                amount=cart.total,
                payment_method=payment_method,
                idempotency_key=f"order-{order.id}"
            )
            if payment.status != "CHARGED":
                raise SagaException(f"Payment failed: {payment.decline_reason}")
            payment_charged = True
            await order_service.update_status(order.id, "PAID")

            # Step 4: Confirm order
            await order_service.update_status(order.id, "CONFIRMED")
            await notification_service.send_order_confirmation(order)
            await cart_service.clear(user.id)
            return order

        except SagaException as e:
            # Compensating transactions — undo in reverse order
            await self._compensate(order, reserved_items, payment_charged, str(e))
            raise

    async def _compensate(self, order, reserved_items, payment_charged, reason):
        """Undo completed steps in reverse order."""
        if payment_charged:
            await payment_service.refund(order.id)
        # Release every item that was actually reserved. If the saga failed
        # partway through step 2, this covers the partial reservation too.
        for item in reserved_items:
            await inventory_service.release(
                product_id=item.product_id,
                quantity=item.quantity,
                order_id=order.id
            )
        if order:
            await order_service.update_status(
                order.id, "CANCELLED",
                reason=reason
            )

Why Orchestration Over Choreography?
There are two flavors of Saga:
Orchestration (what we use): A central coordinator calls each service in sequence. Easier to understand, debug, and add new steps.
Choreography: Each service publishes events, and the next service reacts. More decoupled, but the flow is hard to trace and debug.
For checkout — which is a critical, well-defined sequence — orchestration is the clear winner. The Order Service acts as the single place to understand “what happened to this order.”
Idempotency is Critical
Every service call in the saga must be idempotent. If the orchestrator crashes after step 2 and restarts, it may re-execute step 2. The inventory service must handle duplicate reserve requests:
async def reserve_inventory(product_id, quantity, order_id):
    # Check if already reserved for this order
    existing = await db.fetchone("""
        SELECT id FROM reservations
        WHERE order_id = $1 AND product_id = $2
    """, order_id, product_id)
    if existing:
        return True  # Already reserved — idempotent

    # Proceed with CAS reservation. A UNIQUE(order_id, product_id) constraint
    # on reservations backstops the check above, so a racing duplicate
    # request fails cleanly instead of double-reserving.
    return await _do_reserve(product_id, quantity, order_id)

Deep Dive: Product Search with Faceted Filtering
Users expect to search “wireless headphones” and filter by price ($50-$200), brand (Sony, Bose), rating (4+ stars), and see results in under 100ms.
Elasticsearch Index Design
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" },
      "description": { "type": "text", "analyzer": "standard" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" },
      "price": { "type": "float" },
      "rating_avg": { "type": "float" },
      "rating_count": { "type": "integer" },
      "tags": { "type": "keyword" },
      "seller_id": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}

Faceted Search Query
{
  "query": {
    "bool": {
      "must": [
        { "multi_match": {
            "query": "wireless headphones",
            "fields": ["title^3", "description", "tags^2"]
        }},
        { "term": { "in_stock": true } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "terms": { "brand": ["sony", "bose"] } },
        { "range": { "rating_avg": { "gte": 4.0 } } }
      ]
    }
  },
  "aggs": {
    "brands": {
      "terms": { "field": "brand", "size": 20 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "avg_rating": {
      "range": {
        "field": "rating_avg",
        "ranges": [
          { "from": 4.0, "key": "4_and_up" },
          { "from": 3.0, "key": "3_and_up" }
        ]
      }
    }
  },
  "sort": [
    { "_score": "desc" },
    { "rating_avg": "desc" }
  ],
  "size": 20
}

Faceted aggregations are computed alongside the search results. The aggs section returns counts per brand, price range, and rating bracket — these populate the filter sidebar on the UI. Elasticsearch computes these efficiently because keyword and numeric fields use doc-values (columnar storage).
Keeping Search in Sync
When a product is updated in PostgreSQL, we need to update Elasticsearch. This is done asynchronously via events:
Product Service → Kafka (product-updates) → Search Indexer → Elasticsearch

The indexer consumes product change events and upserts the Elasticsearch document. There's a brief inconsistency window (seconds), but search results don't need to be perfectly real-time.
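The heart of the indexer is a pure transform from the product-change event to the search document. A sketch, assuming the field names from the index mapping above; `product_to_doc` is a hypothetical helper, and the Kafka consumer loop and Elasticsearch bulk-upsert around it are omitted:

```python
def product_to_doc(row: dict) -> dict:
    """Map a product-change event payload to the search-document shape
    defined in the index mapping above (hypothetical helper)."""
    return {
        "title": row["title"],
        "description": row.get("description", ""),
        "category": row["category"],
        "brand": row.get("brand"),
        "price": float(row["price"]),
        "rating_avg": float(row.get("rating_avg") or 0.0),
        "rating_count": int(row.get("rating_count") or 0),
        # Denormalized from the inventory service's event payload
        "in_stock": row.get("stock", 0) > 0,
    }

doc = product_to_doc({
    "title": "Wireless Headphones", "category": "electronics",
    "brand": "sony", "price": "129.99", "stock": 12,
})
```

Keeping the transform pure makes it trivial to unit-test and to replay against a backlog of events when rebuilding the index from scratch.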
Deep Dive: Shopping Cart Design
The cart seems simple, but the design choices matter for performance:
class CartService:
    """
    Hybrid cart: Redis for speed, PostgreSQL for durability.
    """
    async def get_cart(self, user_id: str):
        # Try Redis first (fast path)
        cart_data = await redis.hgetall(f"cart:{user_id}")
        if cart_data:
            return self._deserialize(cart_data)

        # Fall back to PostgreSQL (user returning after days)
        cart = await db.fetchall("""
            SELECT product_id, quantity FROM cart_items
            WHERE user_id = $1
        """, user_id)

        # Warm the Redis cache
        if cart:
            await self._cache_cart(user_id, cart)
        return cart

    async def add_item(self, user_id: str, product_id: str, quantity: int):
        # Write-through: update both Redis and DB
        await redis.hset(f"cart:{user_id}", product_id, quantity)
        await redis.expire(f"cart:{user_id}", 86400 * 30)  # 30 day TTL

        await db.execute("""
            INSERT INTO cart_items (user_id, product_id, quantity)
            VALUES ($1, $2, $3)
            ON CONFLICT (user_id, product_id)
            DO UPDATE SET quantity = $3, updated_at = NOW()
        """, user_id, product_id, quantity)

Why Redis + DB?
- Redis handles the high QPS of cart operations (500K/s at peak)
- PostgreSQL ensures carts persist if Redis loses data
- The write-through pattern keeps both in sync
- Redis TTL handles abandoned cart cleanup
Deep Dive: Flash Sale Handling
A flash sale (e.g., Amazon Prime Day, Black Friday lightning deals) can spike traffic 100x in seconds. The system must not crash, and must not oversell.
Architecture for Spikes
                               ┌─── Rate Limiter ─── Queue ─── Order Processor
User → CDN → API Gateway → LB ─┤
                               └─── Static cache (product pages, images)

Layer 1: CDN + Static Cache
Pre-warm product pages in the CDN. Most flash sale traffic is people refreshing the product page — serve that from cache.
Layer 2: Rate Limiting
At the API Gateway, limit checkout requests per user to 1 per 5 seconds. This prevents a single user from monopolizing stock.
Layer 3: Redis Pre-Check
Before hitting the database, check Redis:
async def can_purchase(product_id, user_id):
    # Has user already purchased this flash sale item?
    already_bought = await redis.sismember(f"flash:{product_id}:buyers", user_id)
    if already_bought:
        return False, "Already purchased"

    # Is there stock remaining? Guard against a missing key (e.g. the sale
    # hasn't been pre-warmed yet) as well as zero stock.
    stock = await redis.get(f"flash:{product_id}:stock")
    if stock is None or int(stock) <= 0:
        return False, "Sold out"
    return True, "Proceed"

Layer 4: Queue-Based Processing
Users who pass the pre-check are enqueued. A pool of workers processes orders at a controlled rate, preventing database overload:
# Worker pool: 100 workers, each processing 1 order at a time
# Throughput: ~100 orders/second sustained (plenty for most flash sales)
async def flash_order_worker():
    while True:
        event = await kafka.consume("flash-orders")
        try:
            # This uses the normal Saga flow with optimistic locking
            await order_saga.execute(
                cart=event.cart,
                user=event.user,
                payment_method=event.payment
            )
        except SagaException:
            await notification_service.send(
                event.user.id,
                "Sorry, this item sold out while processing your order."
            )

Pre-Warming
Before the flash sale starts:
- Load product data into CDN and Redis caches
- Set flash:{product_id}:stock in Redis to the available quantity
- Scale up API Gateway, Order Service, and Inventory Service instances
- Alert the on-call team
Order State Machine
Orders follow a strict state machine. Only valid transitions are allowed:
ORDER_TRANSITIONS = {
    "PENDING":   ["RESERVED", "CANCELLED"],
    "RESERVED":  ["PAID", "CANCELLED"],
    "PAID":      ["CONFIRMED", "CANCELLED"],  # rare: cancel after payment = refund
    "CONFIRMED": ["SHIPPED", "CANCELLED"],    # cancel before ship = refund
    "SHIPPED":   ["DELIVERED"],               # no cancel after shipping
    "DELIVERED": ["RETURNED"],                # customer initiates return
    "CANCELLED": [],                          # terminal
    "RETURNED":  [],                          # terminal
}

async def update_order_status(order_id, new_status, reason=None):
    order = await db.fetchone("SELECT status FROM orders WHERE id = $1", order_id)
    if new_status not in ORDER_TRANSITIONS[order.status]:
        raise InvalidTransition(
            f"Cannot transition from {order.status} to {new_status}"
        )

    # Conditional update: the status predicate guards against a concurrent
    # transition between the read above and this write.
    result = await db.execute("""
        UPDATE orders
        SET status = $1, updated_at = NOW(), status_reason = $2
        WHERE id = $3 AND status = $4
    """, new_status, reason, order_id, order.status)
    if result.rowcount == 0:
        raise InvalidTransition("Order status changed concurrently; retry")

    # Emit event for downstream services
    await kafka.produce("order-status-changes", {
        "order_id": order_id,
        "old_status": order.status,
        "new_status": new_status,
        "timestamp": time.time()
    })

Key Takeaways
| Decision | Choice | Why |
|---|---|---|
| Inventory consistency | Optimistic locking (CAS) | High throughput, no blocking, retry on conflict |
| Flash sale throttling | Redis atomic decrement + Kafka queue | Fast rejection, controlled processing rate |
| Distributed transactions | Saga pattern (orchestrated) | No 2PC, clear flow, compensating actions on failure |
| Product search | Elasticsearch | Full-text search, faceted filtering, sub-100ms queries |
| Cart storage | Redis (hot) + PostgreSQL (durable) | Speed for active sessions, persistence for returning users |
| Order processing | State machine with Kafka events | Audit trail, decouple downstream services |
| Payment idempotency | Idempotency keys per order | Safe to retry on network failures |
| Traffic spikes | CDN + rate limit + queue + auto-scale | Each layer absorbs a portion of the spike |
The e-commerce design is a masterclass in consistency under concurrency (inventory), distributed transactions (Saga), and handling traffic spikes (flash sales). Interviewers love to probe the inventory overselling problem and the Saga failure/compensation flow — have your CAS SQL query and compensation logic ready to whiteboard.
