Design a Payment System — Cracking the System Design Interview

Payment systems are the backbone of e-commerce. A bug here does not just cause a bad user experience — it loses real money. When Stripe processes a $100 payment, there is zero tolerance for charging the customer twice or losing the merchant’s settlement. This makes payment system design one of the most demanding system design problems, where correctness trumps performance.

In this lesson, we will design a payment system that handles credit card processing, refunds, merchant payouts, and multi-currency support — all while maintaining an accurate financial audit trail.

Understanding the Problem

Functional Requirements

Process payments: Support credit card, bank transfer, and wallet payments
Refunds: Full and partial refunds with automatic ledger reconciliation
Transaction history: Queryable history per user and per merchant
Multi-currency: Accept payments in one currency, settle in another
Merchant payouts: Batch settlements on configurable schedules (T+1, T+2)
Webhook notifications: Notify merchants of payment status changes in real time
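
The T+1 and T+2 schedules above mean "N business days after the transaction date". A minimal sketch of that date arithmetic, assuming only weekends are skipped; `payout_date` is an illustrative name, and a real schedule would also consult a bank holiday calendar:

```python
from datetime import date, timedelta

def payout_date(capture_date: date, n_business_days: int) -> date:
    """Return the T+N payout date, counting Monday to Friday only.

    Assumption: bank holidays are ignored in this sketch.
    """
    current = capture_date
    remaining = n_business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

For a payment captured on Friday 2024-03-01, `payout_date(date(2024, 3, 1), 1)` lands on the following Monday rather than on Saturday.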

Non-Functional Requirements

Exactly-once processing: No double charges — ever
Strong consistency: Account balances must never be stale
PCI DSS compliance: Never store raw card numbers
99.999% availability: Five-nines uptime for the payment path
Audit trail: Every money movement must be traceable

Core Entities and APIs

Entities

interface Payment {
  id: string;                    // UUID
  idempotencyKey: string;        // Client-provided, prevents duplicates
  amount: number;                // In smallest currency unit (cents)
  currency: string;              // ISO 4217 (USD, EUR, GBP)
  status: PaymentStatus;         // CREATED | PROCESSING | AUTHORIZED | CAPTURED | SETTLED | FAILED | REFUNDED
  paymentMethodId: string;       // Reference to stored payment method
  merchantId: string;
  customerId: string;
  metadata: Record<string, string>;
  createdAt: Date;
  updatedAt: Date;
}

enum PaymentStatus {
  CREATED = 'CREATED',
  PROCESSING = 'PROCESSING',
  AUTHORIZED = 'AUTHORIZED',
  CAPTURED = 'CAPTURED',
  SETTLED = 'SETTLED',
  FAILED = 'FAILED',
  REFUNDED = 'REFUNDED'
}

interface LedgerEntry {
  id: string;
  transactionId: string;        // Groups related debit/credit pairs
  accountId: string;
  type: 'DEBIT' | 'CREDIT';
  amount: number;
  currency: string;
  description: string;
  createdAt: Date;
}
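
Both entities store amounts as integers in the smallest currency unit, which avoids floating-point rounding. The sketch below shows one way to convert at the API boundary. The exponent table covers only a few ISO 4217 currencies, and `to_minor_units` and `from_minor_units` are illustrative helper names:

```python
from decimal import Decimal

# Number of decimal places per currency (ISO 4217 minor unit).
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "GBP": 2, "JPY": 0, "KWD": 3}

def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as '19.99' to integer minor units."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    scaled = Decimal(amount).scaleb(exponent)
    # Reject amounts with more precision than the currency supports
    # instead of rounding silently.
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimal places for {currency}")
    return int(scaled)

def from_minor_units(amount: int, currency: str) -> str:
    """Convert integer minor units back to a decimal string for display."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    return str(Decimal(amount).scaleb(-exponent))
```

Parsing from a string rather than a float matters here: a float such as `19.99` is already inexact before any conversion happens.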

APIs

POST   /v1/payments              — Create a payment intent
POST   /v1/payments/:id/capture  — Capture an authorized payment
POST   /v1/payments/:id/refund   — Refund a captured payment
GET    /v1/payments/:id          — Get payment details
GET    /v1/transactions          — List transaction history (paginated)
POST   /v1/payouts               — Trigger merchant payout

Every write endpoint requires an Idempotency-Key header. The same key with the same parameters always returns the same result, even if the client retries.
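
The "same key with the same parameters" rule implies a second case: the same key arriving with different parameters, which should be rejected rather than answered from the stored response. One way to detect it is to store a fingerprint of the request next to the response. This is a minimal sketch in which an in-memory dict stands in for a durable store; `IdempotencyStore`, `fingerprint`, and `IdempotencyConflict` are illustrative names, not part of the API above:

```python
import hashlib
import json

class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""

def fingerprint(params: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

class IdempotencyStore:
    """In-memory stand-in for a durable store (assumption for this sketch)."""

    def __init__(self):
        self._records = {}  # key -> (fingerprint, response)

    def lookup(self, key: str, params: dict):
        record = self._records.get(key)
        if record is None:
            return None  # First time this key is seen
        stored_fingerprint, response = record
        if stored_fingerprint != fingerprint(params):
            raise IdempotencyConflict(key)
        return response

    def save(self, key: str, params: dict, response: dict) -> None:
        self._records[key] = (fingerprint(params), response)
```

A handler would call `lookup` first, return the stored response if there is one, and map `IdempotencyConflict` to a client error.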

High-Level Design

[Figure: Payment system architecture showing services, databases, and event bus]

The architecture separates concerns into focused services, connected through Kafka for asynchronous event processing:

Payment Service — The gateway for all payment operations. Validates requests, enforces idempotency, and coordinates the payment flow.
Payment Processor Integration — Communicates with external processors (Stripe, Adyen) for card authorization and capture.
Ledger Service — Maintains the double-entry bookkeeping system. Every money movement creates at least two entries whose debits and credits balance.
Wallet Service — Manages internal account balances for customers and merchants.
Fraud Detection Service — Runs ML-based risk scoring before authorizing payments.
Reconciliation Service — Daily batch job that matches internal records against bank statements.

Deep Dive: Idempotency

The single most important property of a payment system is exactly-once processing. Network failures, client retries, and load balancer re-routes can all cause duplicate requests. Idempotency keys solve this.

How Idempotency Keys Work

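import json
import uuid

# Delete the lock only if it still holds our token (atomic compare-and-delete)
RELEASE_LOCK = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def process_payment(request):
    idempotency_key = request.headers['Idempotency-Key']
    cache_key = f"idempotency:{idempotency_key}"
    lock_key = f"lock:{idempotency_key}"

    # Step 1: Fast path. Return the stored response if this key already completed
    existing = redis.get(cache_key)
    if existing:
        return json.loads(existing)

    # Step 2: Acquire a distributed lock, tagged with a token unique to this request
    token = str(uuid.uuid4())
    acquired = redis.set(
        lock_key,
        token,
        nx=True,   # Only set if not exists
        ex=30      # 30 second expiry; execute_payment must finish well within it
    )
    if not acquired:
        return Response(status=409, body="Request in progress")

    try:
        # Step 3: Re-check under the lock. Another request may have finished
        # between step 1 and step 2, and processing again would double-charge
        existing = redis.get(cache_key)
        if existing:
            return json.loads(existing)

        # Step 4: Process the payment
        result = execute_payment(request)

        # Step 5: Cache the result (TTL = 24 hours)
        redis.setex(cache_key, 86400, json.dumps(result))
        return result
    finally:
        # Release the lock only if we still own it; after an expiry it may
        # belong to another request
        redis.eval(RELEASE_LOCK, 1, lock_key, token)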

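The key insight: idempotency is not just “deduplication.” The response to a retried request must be identical to the original response, so we store the full response, not just a flag. If Redis can lose data, persist the key and response in the primary database as well; otherwise a retry that arrives after a cache loss is processed as a new payment.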

Payment State Machine

[Figure: Payment flow state machine showing transitions from created to settled]

Each payment moves through a strict state machine. Transitions are validated — you cannot go from CREATED directly to SETTLED. This prevents invalid states even under race conditions.

VALID_TRANSITIONS = {
    'CREATED':     ['PROCESSING', 'FAILED'],
    'PROCESSING':  ['AUTHORIZED', 'FAILED'],
    'AUTHORIZED':  ['CAPTURED', 'FAILED'],
    'CAPTURED':    ['SETTLED', 'REFUNDED'],
    'SETTLED':     ['REFUNDED'],
    'FAILED':      ['CREATED'],  # Retry path
    'REFUNDED':    [],           # Terminal state: no further transitions
}

def transition_payment(payment_id, new_status):
    with db.transaction():
        # Lock the row so concurrent transitions on this payment queue up
        payment = db.query(
            "SELECT * FROM payments WHERE id = %s FOR UPDATE",
            payment_id
        )
        if payment is None:
            raise PaymentNotFoundError(payment_id)
        if new_status not in VALID_TRANSITIONS[payment.status]:
            raise InvalidTransitionError(
                f"Cannot go from {payment.status} to {new_status}"
            )
        db.execute(
            "UPDATE payments SET status = %s, updated_at = NOW() WHERE id = %s",
            new_status, payment_id
        )
        # Emit event for downstream consumers.
        # Caveat: this publish is not part of the DB transaction, so a failed
        # commit can leave an event for a change that never happened
        kafka.produce('payment.status_changed', {
            'payment_id': payment_id,
            'old_status': payment.status,
            'new_status': new_status
        })

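The FOR UPDATE lock serializes concurrent transitions on the same payment: the second transaction waits for the first to commit, then validates against the already-updated status. Combined with the state machine validation, this prevents two requests from both moving a payment out of the same state, even under high concurrency.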

Deep Dive: Double-Entry Ledger

Every financial system needs a double-entry ledger. The fundamental rule: every transaction creates at least two entries — a debit and a credit — that sum to zero. If the books do not balance, something is wrong.

[Figure: Double-entry ledger showing debit and credit entries for a payment]

Why Double-Entry?

Single-entry bookkeeping (just recording “+$100 to merchant”) seems simpler, but it breaks down fast:

No error detection: If a record is lost, you cannot tell
No audit trail: Where did the $100 come from?
No reconciliation: You cannot verify balances against external systems

With double-entry, every dollar is accounted for. If total debits do not equal total credits, you know there is an error — and you can find it.

Schema Design

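CREATE TABLE ledger_entries (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id  UUID NOT NULL,       -- Groups related entries
    account_id      VARCHAR(64) NOT NULL,
    entry_type      VARCHAR(6) NOT NULL CHECK (entry_type IN ('DEBIT', 'CREDIT')),
    amount          BIGINT NOT NULL CHECK (amount > 0),  -- In cents, always positive
    currency        VARCHAR(3) NOT NULL,
    description     TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Index for balance calculations
CREATE INDEX idx_ledger_account ON ledger_entries(account_id, created_at);

-- Ensure entries are immutable: reject every UPDATE and DELETE.
-- (A CHECK constraint cannot do this; it only validates row contents.)
CREATE OR REPLACE FUNCTION reject_ledger_mutation()
RETURNS TRIGGER AS $$
BEGIN
    RAISE EXCEPTION 'ledger_entries is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ledger_entries_immutable
BEFORE UPDATE OR DELETE ON ledger_entries
FOR EACH ROW EXECUTE FUNCTION reject_ledger_mutation();

-- Verify balance invariant
CREATE OR REPLACE FUNCTION check_transaction_balance()
RETURNS TRIGGER AS $$
BEGIN
    IF (
        SELECT SUM(CASE WHEN entry_type = 'DEBIT' THEN amount ELSE -amount END)
        FROM ledger_entries
        WHERE transaction_id = NEW.transaction_id
    ) != 0 THEN
        RAISE EXCEPTION 'Transaction % is not balanced', NEW.transaction_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Deferred to commit time, so all entries of a transaction can be inserted
-- before the balance is checked
CREATE CONSTRAINT TRIGGER ledger_entries_balanced
AFTER INSERT ON ledger_entries
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE FUNCTION check_transaction_balance();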

Ledger entries are immutable. You never update or delete a ledger entry. To correct an error, you create a new reversing entry. This guarantees a complete audit trail.
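
The reversing-entry rule is easiest to see with a small in-memory model. This is a sketch for illustration only, not the Ledger Service itself: `Ledger`, `Entry`, and the account names are hypothetical, and balances follow the same sign convention as the trigger above (debits positive, credits negative):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    transaction_id: str
    account_id: str
    entry_type: str  # 'DEBIT' or 'CREDIT'
    amount: int      # Minor units, always positive

class Ledger:
    def __init__(self):
        self._entries = []  # Append-only: nothing is ever updated or removed

    def post(self, transaction_id, lines):
        """Append a balanced set of (account_id, entry_type, amount) lines."""
        if not lines:
            raise ValueError("a transaction needs at least two entries")
        for _, entry_type, amount in lines:
            if entry_type not in ("DEBIT", "CREDIT") or amount <= 0:
                raise ValueError("invalid entry")
        total_debits = sum(a for _, t, a in lines if t == "DEBIT")
        total_credits = sum(a for _, t, a in lines if t == "CREDIT")
        if total_debits != total_credits:
            raise ValueError(f"transaction {transaction_id} is not balanced")
        self._entries.extend(Entry(transaction_id, acc, t, a) for acc, t, a in lines)

    def reverse(self, transaction_id, reversal_id):
        """Correct a transaction by posting its mirror image as a new one."""
        flipped = [
            (e.account_id, "CREDIT" if e.entry_type == "DEBIT" else "DEBIT", e.amount)
            for e in self._entries
            if e.transaction_id == transaction_id
        ]
        self.post(reversal_id, flipped)

    def balance(self, account_id):
        """Debits minus credits for one account."""
        totals = defaultdict(int)
        for e in self._entries:
            if e.account_id == account_id:
                totals[e.entry_type] += e.amount
        return totals["DEBIT"] - totals["CREDIT"]
```

After a reversal, both accounts return to zero while the original and the reversing entries remain in the history, which is what keeps the audit trail complete.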

Deep Dive: Handling Processor Failures

External payment processors (Stripe, Adyen) are third-party services. They will fail — network timeouts, rate limits, downtime. Your system must handle this gracefully without double-charging.

Retry with Exponential Backoff

import time
import random

def call_processor_with_retry(request, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Every attempt must carry the same idempotency key (assumed to be
            # part of `request`), so a retry after a timeout cannot authorize twice
            response = payment_processor.authorize(request)
            if response.status == 'success':
                return response
            if response.status == 'declined':
                return response  # Don't retry declines
            # Processor returned a transient error: fall through and retry
        except TimeoutError:
            pass  # Outcome unknown: the authorization may have gone through
        except ConnectionError:
            pass  # Retry on connection issues

        if attempt < max_retries - 1:
            # Exponential backoff with jitter; no sleep after the final attempt
            delay = min(2 ** attempt + random.uniform(0, 1), 30)
            time.sleep(delay)

    # All retries exhausted
    raise ProcessorUnavailableError("Payment processor unreachable")

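Critical rule: never retry on a decline. A decline means the bank rejected the transaction (insufficient funds, stolen card). Retrying will not change the outcome and may trigger fraud alerts. A timeout is different: the outcome is unknown, so a retry is safe only if the processor deduplicates requests on an idempotency key.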
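
To make the backoff formula concrete, the sketch below computes the range of delays that `min(2 ** attempt + random.uniform(0, 1), 30)` can produce for a given attempt. It also shows a "full jitter" variant for comparison; both function names are illustrative, and the variant is an alternative technique, not what the retry function above uses:

```python
import random

def backoff_bounds(attempt: int, cap: float = 30.0) -> tuple:
    """Smallest and largest delay for min(2**attempt + U(0, 1), cap)."""
    base = 2 ** attempt
    return (min(base, cap), min(base + 1, cap))

def full_jitter_delay(attempt: int, cap: float = 30.0, rng=random) -> float:
    """Alternative: draw the whole delay uniformly from [0, min(cap, 2**attempt)]."""
    return rng.uniform(0, min(cap, 2 ** attempt))
```

With the default of three attempts the delays fall in 1 to 2 seconds and then 2 to 3 seconds. Once `2 ** attempt` exceeds the cap (attempt 5 and later), both bounds equal 30, so the jitter disappears and clients that failed together retry together; the full-jitter variant keeps spreading them out.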

Circuit Breaker Pattern

When a processor is consistently failing, a circuit breaker prevents cascading failures:

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    # Note: not thread-safe; guard call() with a lock if shared across threads
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = 'CLOSED'       # CLOSED = normal, OPEN = blocking
        self.last_failure_time = 0

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'HALF_OPEN'  # Allow a test request
            else:
                raise CircuitOpenError("Processor circuit is open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed test request reopens the circuit immediately
            if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

        # Any success resets the count, so only consecutive failures trip the breaker
        self.state = 'CLOSED'
        self.failure_count = 0
        return result

When the circuit opens, the system can fail fast with a clear error instead of hanging on timeouts. This protects both the user experience and your system’s resources.

Deep Dive: Reconciliation

Reconciliation is the process of matching your internal records against external bank statements. This catches discrepancies like: a payment you thought succeeded was actually reversed by the bank, or a settlement amount does not match what you expected.

from datetime import date, timedelta

def daily_reconciliation(day=None):
    """Run daily to match internal ledger with bank statements."""
    # Default to yesterday: today's bank statement is still incomplete
    day = day or date.today() - timedelta(days=1)

    # Fetch the bank statement for that day
    bank_records = bank_api.get_settlements(date=day)

    # Fetch our internal records for the same period
    internal_records = db.query("""
        SELECT id AS payment_id, amount, currency, status
        FROM payments
        WHERE settled_at::date = %s AND status = 'SETTLED'
    """, day)
    
    # Build lookup maps
    bank_map = {r.reference_id: r for r in bank_records}
    internal_map = {r.payment_id: r for r in internal_records}
    
    discrepancies = []
    
    # Check each internal record has a matching bank record
    for payment_id, internal in internal_map.items():
        bank = bank_map.get(payment_id)
        if not bank:
            discrepancies.append({
                'type': 'MISSING_IN_BANK',
                'payment_id': payment_id,
                'amount': internal.amount
            })
        elif bank.amount != internal.amount:
            discrepancies.append({
                'type': 'AMOUNT_MISMATCH',
                'payment_id': payment_id,
                'internal_amount': internal.amount,
                'bank_amount': bank.amount
            })
    
    # Check for bank records we don't know about
    for ref_id, bank in bank_map.items():
        if ref_id not in internal_map:
            discrepancies.append({
                'type': 'MISSING_INTERNALLY',
                'reference_id': ref_id,
                'amount': bank.amount
            })
    
    if discrepancies:
        alert_finance_team(discrepancies)
    
    return len(discrepancies) == 0

In production, reconciliation runs as a scheduled batch job. Any discrepancy triggers an alert for manual review. The goal is to catch issues within 24 hours, not to fix them automatically — automated fixes in financial systems are risky.

Deep Dive: PCI DSS Compliance

PCI DSS (Payment Card Industry Data Security Standard) governs how you handle cardholder data. The most practical approach is tokenization: never store or transmit raw card numbers.

Tokenization Architecture

Customer enters card → Frontend sends directly to processor (Stripe.js)
                     → Processor returns a token (tok_xxx)
                     → Your backend only sees the token, never the card number

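With hosted or iframe-based card fields such as Stripe.js, this architecture can qualify you for SAQ A, the shortest PCI self-assessment questionnaire, because raw card data never touches your servers. Your backend stores only the token, a reference into the processor’s vault that is useless outside your processor account.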

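# Your backend never sees card numbers
def create_payment_method(customer_id, processor_token, card_summary):
    """Store a tokenized payment method.

    card_summary holds the non-sensitive display fields the processor
    returns alongside the token (assumed shape: last_four, brand,
    exp_month, exp_year).
    """
    return db.insert('payment_methods', {
        'customer_id': customer_id,
        'processor_token': processor_token,        # "tok_xxx" not "4242..."
        'last_four': card_summary['last_four'],    # Display only, e.g. '4242'
        'brand': card_summary['brand'],            # e.g. 'visa'
        'exp_month': card_summary['exp_month'],    # e.g. 12
        'exp_year': card_summary['exp_year'],      # e.g. 2028
    })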

Encryption at Rest

Even tokens and metadata must be encrypted at rest. Use AES-256 encryption with keys managed through a hardware security module (HSM) or a cloud KMS:

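import base64
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# AES-256-GCM (Fernet would use AES-128). The 32-byte key is stored in
# AWS KMS / HashiCorp Vault, NOT in code
cipher = AESGCM(kms.get_key('payment-encryption-key'))

def encrypt_sensitive_field(value: str) -> str:
    nonce = os.urandom(12)  # Fresh 96-bit nonce for every encryption
    ciphertext = cipher.encrypt(nonce, value.encode(), None)
    return base64.b64encode(nonce + ciphertext).decode()

def decrypt_sensitive_field(encrypted: str) -> str:
    raw = base64.b64decode(encrypted)
    return cipher.decrypt(raw[:12], raw[12:], None).decode()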

Key Takeaways

Idempotency is non-negotiable. Every write endpoint needs an idempotency key. Store the full response, not just a dedup flag.
Double-entry ledger ensures financial accuracy. Every money movement creates at least two entries that sum to zero. Entries are immutable; corrections create new reversing entries.
Payment state machines prevent invalid transitions. Use database-level locks (FOR UPDATE) to prevent concurrent state corruption.
Never retry declines. Only retry on network errors and timeouts. Use circuit breakers to fail fast when a processor is down.
Reconciliation catches what automation misses. Daily batch comparison between your records and bank statements catches discrepancies before they compound.
Tokenization simplifies PCI compliance. Never let raw card numbers touch your servers. Use processor-side tokenization (Stripe.js, Adyen Web Components) to minimize your compliance scope.

Payment systems require an unusual level of rigor. The cost of a bug is not a broken page — it is lost money and lost trust. Design defensively, test exhaustively, and always assume that anything that can fail will fail at the worst possible moment.