Payment systems are the backbone of e-commerce. A bug here does not just cause a bad user experience — it loses real money. When Stripe processes a $100 payment, there is zero tolerance for charging the customer twice or losing the merchant’s settlement. This makes payment system design one of the most demanding system design problems, where correctness trumps performance.
In this lesson, we will design a payment system that handles credit card processing, refunds, merchant payouts, and multi-currency support — all while maintaining an accurate financial audit trail.
Understanding the Problem
Functional Requirements
- Process payments: Support credit card, bank transfer, and wallet payments
- Refunds: Full and partial refunds with automatic ledger reconciliation
- Transaction history: Queryable history per user and per merchant
- Multi-currency: Accept payments in one currency, settle in another
- Merchant payouts: Batch settlements on configurable schedules (T+1, T+2)
- Webhook notifications: Notify merchants of payment status changes in real time
Non-Functional Requirements
- Exactly-once processing: No double charges — ever
- Strong consistency: Account balances must never be stale
- PCI DSS compliance: Never store raw card numbers
- 99.999% availability: Five-nines uptime for the payment path
- Audit trail: Every money movement must be traceable
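To make the five-nines target concrete, the downtime budget it implies can be worked out directly. This is a small arithmetic sketch; the function name is illustrative and a 365-day year is assumed.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of allowed downtime over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability allows about {budget:.2f} minutes of downtime per year")
```

At 99.999%, the budget is roughly 5.26 minutes per year, which is why the payment path cannot depend on manual recovery steps.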
Core Entities and APIs
Entities
interface Payment {
  id: string;                       // UUID
  idempotencyKey: string;           // Client-provided, prevents duplicates
  amount: number;                   // In smallest currency unit (cents)
  currency: string;                 // ISO 4217 (USD, EUR, GBP)
  status: PaymentStatus;            // CREATED | PROCESSING | AUTHORIZED | CAPTURED | SETTLED | FAILED | REFUNDED
  paymentMethodId: string;          // Reference to stored payment method
  merchantId: string;
  customerId: string;
  metadata: Record<string, string>;
  createdAt: Date;
  updatedAt: Date;
}
enum PaymentStatus {
  CREATED = 'CREATED',
  PROCESSING = 'PROCESSING',
  AUTHORIZED = 'AUTHORIZED',
  CAPTURED = 'CAPTURED',
  SETTLED = 'SETTLED',
  FAILED = 'FAILED',
  REFUNDED = 'REFUNDED'
}
interface LedgerEntry {
  id: string;
  transactionId: string;            // Groups related debit/credit pairs
  accountId: string;
  type: 'DEBIT' | 'CREDIT';
  amount: number;
  currency: string;
  description: string;
  createdAt: Date;
}
APIs
POST /v1/payments - Create a payment intent
POST /v1/payments/:id/capture - Capture an authorized payment
POST /v1/payments/:id/refund - Refund a captured payment
GET /v1/payments/:id - Get payment details
GET /v1/transactions - List transaction history (paginated)
POST /v1/payouts - Trigger merchant payout
Every write endpoint requires an Idempotency-Key header. The same key with the same parameters always returns the same result, even if the client retries.
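The "same key, same parameters, same result" contract can be sketched with a request fingerprint stored next to the response. This is a minimal in-memory illustration, not a production design: `IdempotencyStore`, `IdempotencyConflict`, and `create_payment` are hypothetical names, and rejecting a reused key with different parameters is an assumption, since the text only specifies the matching-parameters case.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with different request parameters."""


class IdempotencyStore:
    """In-memory stand-in for a durable store keyed by Idempotency-Key."""

    def __init__(self):
        self._records = {}  # key -> (request fingerprint, stored response)

    @staticmethod
    def fingerprint(params: dict) -> str:
        # Canonical JSON so that key order does not change the hash
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def execute(self, key: str, params: dict, handler):
        fp = self.fingerprint(params)
        if key in self._records:
            stored_fp, stored_response = self._records[key]
            if stored_fp != fp:
                raise IdempotencyConflict(f"Key {key!r} reused with different parameters")
            return stored_response  # Replay: the handler is not called again
        response = handler(params)
        self._records[key] = (fp, response)
        return response


calls = []


def create_payment(params):
    # Hypothetical handler; records each real execution
    calls.append(params)
    return {"id": f"pay_{len(calls)}", "amount": params["amount"], "status": "CREATED"}


store = IdempotencyStore()
first = store.execute("key-1", {"amount": 1000, "currency": "USD"}, create_payment)
retry = store.execute("key-1", {"currency": "USD", "amount": 1000}, create_payment)
print(first == retry, len(calls))
```

The retry returns the stored response without running the handler a second time, while a request that reuses `key-1` with a different amount would raise `IdempotencyConflict` instead of silently returning the old result.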
High-Level Design
The architecture separates concerns into focused services, connected through Kafka for asynchronous event processing:
- Payment Service: The gateway for all payment operations. Validates requests, enforces idempotency, and coordinates the payment flow.
- Payment Processor Integration: Communicates with external processors (Stripe, Adyen) for card authorization and capture.
- Ledger Service: Maintains the double-entry bookkeeping system. Every money movement creates at least two entries, a debit and a credit, that balance.
- Wallet Service: Manages internal account balances for customers and merchants.
- Fraud Detection Service: Runs ML-based risk scoring before authorizing payments.
- Reconciliation Service: Daily batch job that matches internal records against bank statements.
Deep Dive: Idempotency
The single most important property of a payment system is exactly-once processing. Network failures, client retries, and load balancer re-routes can all cause duplicate requests. Idempotency keys solve this.
How Idempotency Keys Work
import json

def process_payment(request):
    idempotency_key = request.headers['Idempotency-Key']
    # Step 1: Check if we've seen this key before
    existing = redis.get(f"idempotency:{idempotency_key}")
    if existing:
        return json.loads(existing)  # Return cached response
    # Step 2: Acquire a distributed lock
    lock = redis.set(
        f"lock:{idempotency_key}",
        "1",
        nx=True,  # Only set if not exists
        ex=30     # 30 second expiry
    )
    if not lock:
        return Response(status=409, body="Request in progress")
    try:
        # Re-check under the lock: another worker may have finished
        # between Step 1 and Step 2
        existing = redis.get(f"idempotency:{idempotency_key}")
        if existing:
            return json.loads(existing)
        # Step 3: Process the payment
        result = execute_payment(request)
        # Step 4: Cache the result (TTL = 24 hours)
        redis.setex(
            f"idempotency:{idempotency_key}",
            86400,
            json.dumps(result)
        )
        return result
    finally:
        redis.delete(f"lock:{idempotency_key}")
The key insight: idempotency is not just “deduplication.” The response to a retried request must be identical to the original response. We store the full response, not just a flag.
Payment State Machine
Each payment moves through a strict state machine. Transitions are validated — you cannot go from CREATED directly to SETTLED. This prevents invalid states even under race conditions.
VALID_TRANSITIONS = {
    'CREATED': ['PROCESSING', 'FAILED'],
    'PROCESSING': ['AUTHORIZED', 'FAILED'],
    'AUTHORIZED': ['CAPTURED', 'FAILED'],
    'CAPTURED': ['SETTLED', 'REFUNDED'],
    'SETTLED': ['REFUNDED'],
    'FAILED': ['CREATED'],  # Retry path
    'REFUNDED': [],         # Terminal state; listed so the lookup below never raises KeyError
}

def transition_payment(payment_id, new_status):
    with db.transaction():
        # Lock the row so concurrent transitions on this payment are serialized
        payment = db.query(
            "SELECT * FROM payments WHERE id = %s FOR UPDATE",
            payment_id
        )
        if new_status not in VALID_TRANSITIONS[payment.status]:
            raise InvalidTransitionError(
                f"Cannot go from {payment.status} to {new_status}"
            )
        db.execute(
            "UPDATE payments SET status = %s, updated_at = NOW() WHERE id = %s",
            new_status, payment_id
        )
    # Emit event for downstream consumers only after the transaction has committed
    kafka.produce('payment.status_changed', {
        'payment_id': payment_id,
        'old_status': payment.status,
        'new_status': new_status
    })
The FOR UPDATE lock prevents concurrent transitions on the same payment. Combined with the state machine validation, this keeps every status change serialized and valid, even under high concurrency.
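The transition table can also be exercised on its own, without a database, to check which status paths it accepts. The sketch below copies the table from this section; the helper names are illustrative, and treating REFUNDED as a terminal state with no outgoing transitions is an assumption.

```python
VALID_TRANSITIONS = {
    "CREATED": {"PROCESSING", "FAILED"},
    "PROCESSING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "FAILED"},
    "CAPTURED": {"SETTLED", "REFUNDED"},
    "SETTLED": {"REFUNDED"},
    "FAILED": {"CREATED"},   # Retry path
    "REFUNDED": set(),       # Assumed terminal: no outgoing transitions
}


def can_transition(current: str, new: str) -> bool:
    """True if the table allows moving from `current` to `new`."""
    return new in VALID_TRANSITIONS.get(current, set())


def first_invalid_step(path: list[str]):
    """Return the first (from, to) pair in `path` that the table rejects, or None."""
    for current, new in zip(path, path[1:]):
        if not can_transition(current, new):
            return (current, new)
    return None


happy_path = ["CREATED", "PROCESSING", "AUTHORIZED", "CAPTURED", "SETTLED", "REFUNDED"]
shortcut = ["CREATED", "SETTLED"]

print(first_invalid_step(happy_path))
print(first_invalid_step(shortcut))
```

The full lifecycle passes, while the CREATED to SETTLED shortcut is reported as the rejected step.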
Deep Dive: Double-Entry Ledger
Every financial system needs a double-entry ledger. The fundamental rule: every transaction creates at least two entries — a debit and a credit — that sum to zero. If the books do not balance, something is wrong.
Why Double-Entry?
Single-entry bookkeeping (just recording “+$100 to merchant”) seems simpler, but it breaks down fast:
- No error detection: If a record is lost, you cannot tell
- No audit trail: Where did the $100 come from?
- No reconciliation: You cannot verify balances against external systems
With double-entry, every dollar is accounted for. If total debits do not equal total credits, you know there is an error — and you can find it.
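A small in-memory sketch shows the invariant at work. The account names (`processor_receivable`, `merchant_payable`, `fee_revenue`) and the $3 fee are hypothetical, chosen only to illustrate a $100 payment split across accounts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    transaction_id: str
    account_id: str
    entry_type: str  # "DEBIT" or "CREDIT"
    amount: int      # Smallest currency unit, always positive


def signed(entry: Entry) -> int:
    """Debits count as positive, credits as negative."""
    return entry.amount if entry.entry_type == "DEBIT" else -entry.amount


def post(ledger: list, entries: list) -> None:
    """Append entries only if debits and credits cancel out."""
    if sum(signed(e) for e in entries) != 0:
        raise ValueError("Transaction is not balanced")
    ledger.extend(entries)


def balances(ledger: list) -> dict:
    totals = defaultdict(int)
    for entry in ledger:
        totals[entry.account_id] += signed(entry)
    return dict(totals)


ledger = []
post(ledger, [
    Entry("txn_1", "processor_receivable", "DEBIT", 10000),  # $100.00 owed by the processor
    Entry("txn_1", "merchant_payable", "CREDIT", 9700),      # $97.00 owed to the merchant
    Entry("txn_1", "fee_revenue", "CREDIT", 300),            # $3.00 platform fee
])
print(balances(ledger))
print(sum(balances(ledger).values()))
```

Because every posted transaction nets to zero, the sum over all account balances is always zero; a nonzero total means an entry was lost or corrupted.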
Schema Design
CREATE TABLE ledger_entries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id UUID NOT NULL,  -- Groups related entries
    account_id VARCHAR(64) NOT NULL,
    entry_type VARCHAR(6) NOT NULL CHECK (entry_type IN ('DEBIT', 'CREDIT')),
    amount BIGINT NOT NULL CHECK (amount > 0),  -- In cents, always positive
    currency VARCHAR(3) NOT NULL,
    description TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    -- Immutability cannot be expressed as a CHECK constraint. Enforce it
    -- separately, for example by revoking UPDATE and DELETE on this table
    -- from application roles.
);
-- Index for balance calculations
CREATE INDEX idx_ledger_account ON ledger_entries(account_id, created_at);
-- Verify balance invariant
CREATE OR REPLACE FUNCTION check_transaction_balance()
RETURNS TRIGGER AS $$
BEGIN
    IF (
        SELECT SUM(CASE WHEN entry_type = 'DEBIT' THEN amount ELSE -amount END)
        FROM ledger_entries
        WHERE transaction_id = NEW.transaction_id
    ) != 0 THEN
        RAISE EXCEPTION 'Transaction % is not balanced', NEW.transaction_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Run the check at commit time, once all entries of the transaction exist
-- (trigger name is illustrative)
CREATE CONSTRAINT TRIGGER trg_check_transaction_balance
    AFTER INSERT ON ledger_entries
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE FUNCTION check_transaction_balance();
Ledger entries are immutable. You never update or delete a ledger entry. To correct an error, you create a new reversing entry. This preserves a complete audit trail.
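A correction by reversal can be sketched in a few lines. This is an illustrative in-memory version, assuming a reversal simply mirrors each original entry with the opposite type under a new transaction ID; the account names and helper functions are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)  # Frozen: entries cannot be modified after creation
class LedgerEntry:
    transaction_id: str
    account_id: str
    entry_type: str  # "DEBIT" or "CREDIT"
    amount: int      # Smallest currency unit, always positive
    description: str = ""


_OPPOSITE = {"DEBIT": "CREDIT", "CREDIT": "DEBIT"}


def reversing_entries(original: list, reversal_id: str) -> list:
    """Build new entries that cancel `original` without touching it."""
    return [
        replace(
            entry,
            transaction_id=reversal_id,
            entry_type=_OPPOSITE[entry.entry_type],
            description=f"Reversal of {entry.transaction_id}",
        )
        for entry in original
    ]


def net_by_account(entries: list) -> dict:
    totals = {}
    for entry in entries:
        sign = 1 if entry.entry_type == "DEBIT" else -1
        totals[entry.account_id] = totals.get(entry.account_id, 0) + sign * entry.amount
    return totals


original = [
    LedgerEntry("txn_1", "processor_receivable", "DEBIT", 10000, "Payment"),
    LedgerEntry("txn_1", "merchant_payable", "CREDIT", 10000, "Payment"),
]
reversal = reversing_entries(original, "txn_2")
ledger = original + reversal
print(net_by_account(ledger))
```

After the reversal, every affected account nets to zero, yet both the mistaken transaction and its correction remain visible in the ledger.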
Deep Dive: Handling Processor Failures
External payment processors (Stripe, Adyen) are third-party services. They will fail — network timeouts, rate limits, downtime. Your system must handle this gracefully without double-charging.
Retry with Exponential Backoff
import time
import random

def call_processor_with_retry(request, max_retries=3):
    # Assumes the request carries the same idempotency key on every attempt,
    # so a timed-out call that actually succeeded is not charged twice
    for attempt in range(max_retries):
        try:
            response = payment_processor.authorize(request)
            if response.status == 'success':
                return response
            if response.status == 'declined':
                return response  # Don't retry declines
            # Processor returned an error: fall through and retry
        except TimeoutError:
            pass  # Retry on timeout
        except ConnectionError:
            pass  # Retry on connection issues
        if attempt < max_retries - 1:
            # Exponential backoff with jitter (skipped after the final attempt)
            delay = min(2 ** attempt + random.uniform(0, 1), 30)
            time.sleep(delay)
    # All retries exhausted
    raise ProcessorUnavailableError("Payment processor unreachable")
Critical rule: never retry on a decline. A decline means the bank rejected the transaction (insufficient funds, stolen card). Retrying will not change the outcome and may trigger fraud alerts.
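The delay formula used in the retry loop can be tabulated to see how quickly waits grow and where the cap applies. This sketch isolates the formula in a hypothetical `backoff_delay` helper; the `jitter` parameter is an addition for illustration, so the deterministic part can be printed on its own.

```python
import random


def backoff_delay(attempt: int, cap: float = 30.0, jitter: float = 1.0) -> float:
    """Delay before the next retry: 2**attempt seconds plus random jitter, capped."""
    return min(2 ** attempt + random.uniform(0, jitter), cap)


# With jitter disabled, the schedule is fully deterministic
schedule = [backoff_delay(attempt, jitter=0) for attempt in range(6)]
print(schedule)

# With jitter enabled, each delay lands within one second above the base value
jittered = [backoff_delay(attempt) for attempt in range(6)]
print(jittered)
```

Without jitter the delays are 1, 2, 4, 8, 16, and 30 seconds, the last one capped from 32. Jitter spreads retries from many clients so they do not hit a recovering processor at the same instant.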
Circuit Breaker Pattern
When a processor is consistently failing, a circuit breaker prevents cascading failures:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = 'CLOSED'  # CLOSED = normal, OPEN = blocking, HALF_OPEN = probing
        self.last_failure_time = 0

    def call(self, func, *args):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'HALF_OPEN'  # Allow one test request
            else:
                raise CircuitOpenError("Processor circuit is open")
        try:
            result = func(*args)
            if self.state == 'HALF_OPEN':
                self.state = 'CLOSED'  # Recovery successful
            # Any success resets the count, so only consecutive failures trip the breaker
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise
When the circuit opens, the system can fail fast with a clear error instead of hanging on timeouts. This protects both the user experience and your system’s resources.
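The state changes are easier to follow with a controllable clock. The sketch below is a compact, self-contained variant of the breaker, not the class above verbatim: the injectable `clock` parameter, `FakeClock`, and the two stub processor functions are assumptions added so the example runs without waiting for real time to pass.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # Injectable so the example can control time
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "HALF_OPEN"  # Let one probe request through
        try:
            result = func(*args)
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.state = "CLOSED"
        self.failure_count = 0
        return result


class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def failing_processor():
    raise ConnectionError("processor down")


def healthy_processor():
    return "authorized"


clock = FakeClock()
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60.0, clock=clock)

states = []
for _ in range(3):
    try:
        breaker.call(failing_processor)
    except ConnectionError:
        pass
    states.append(breaker.state)

try:
    breaker.call(healthy_processor)
    rejected = False
except CircuitOpenError:
    rejected = True  # Open circuit rejects calls without contacting the processor

clock.now += 61  # Move past the recovery timeout
recovered = breaker.call(healthy_processor)
print(states, rejected, recovered, breaker.state)
```

The third consecutive failure opens the circuit, calls are then rejected immediately, and after the recovery timeout a single successful probe closes it again.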
Deep Dive: Reconciliation
Reconciliation is the process of matching your internal records against external bank statements. This catches discrepancies like: a payment you thought succeeded was actually reversed by the bank, or a settlement amount does not match what you expected.
def daily_reconciliation():
    """Run daily to match internal ledger with bank statements."""
    # Fetch today's bank statement
    bank_records = bank_api.get_settlements(date=today())
    # Fetch our internal records for the same period
    internal_records = db.query("""
        SELECT id AS payment_id, amount, currency, status
        FROM payments
        WHERE settled_at::date = %s AND status = 'SETTLED'
    """, today())
    # Build lookup maps
    bank_map = {r.reference_id: r for r in bank_records}
    internal_map = {r.payment_id: r for r in internal_records}
    discrepancies = []
    # Check each internal record has a matching bank record
    for payment_id, internal in internal_map.items():
        bank = bank_map.get(payment_id)
        if not bank:
            discrepancies.append({
                'type': 'MISSING_IN_BANK',
                'payment_id': payment_id,
                'amount': internal.amount
            })
        elif bank.amount != internal.amount:
            discrepancies.append({
                'type': 'AMOUNT_MISMATCH',
                'payment_id': payment_id,
                'internal_amount': internal.amount,
                'bank_amount': bank.amount
            })
    # Check for bank records we don't know about
    for ref_id, bank in bank_map.items():
        if ref_id not in internal_map:
            discrepancies.append({
                'type': 'MISSING_INTERNALLY',
                'reference_id': ref_id,
                'amount': bank.amount
            })
    if discrepancies:
        alert_finance_team(discrepancies)
    return len(discrepancies) == 0
In production, reconciliation runs as a scheduled batch job. Any discrepancy triggers an alert for manual review. The goal is to catch issues within 24 hours, not to fix them automatically, because automated fixes in financial systems are risky.
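The matching logic can be tried on sample data by reducing it to a pure function over two dictionaries. The payment IDs and amounts below are made up for illustration, and the `reconcile` helper is a simplified stand-in that maps IDs directly to amounts in cents.

```python
def reconcile(internal: dict, bank: dict) -> list:
    """Compare {payment_id: amount} maps and list every discrepancy."""
    discrepancies = []
    for payment_id, amount in internal.items():
        if payment_id not in bank:
            discrepancies.append({"type": "MISSING_IN_BANK", "payment_id": payment_id})
        elif bank[payment_id] != amount:
            discrepancies.append({
                "type": "AMOUNT_MISMATCH",
                "payment_id": payment_id,
                "internal_amount": amount,
                "bank_amount": bank[payment_id],
            })
    for reference_id in bank:
        if reference_id not in internal:
            discrepancies.append({"type": "MISSING_INTERNALLY", "reference_id": reference_id})
    return discrepancies


internal_records = {"pay_1": 10000, "pay_2": 2500, "pay_3": 4200}
bank_records = {"pay_1": 10000, "pay_2": 2400, "pay_4": 999}

report = reconcile(internal_records, bank_records)
for item in report:
    print(item)
```

Here `pay_1` matches, `pay_2` differs by one dollar, `pay_3` never reached the bank, and `pay_4` appears only on the bank side, so the report contains three discrepancies for manual review.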
Deep Dive: PCI DSS Compliance
PCI DSS (Payment Card Industry Data Security Standard) governs how you handle cardholder data. The most practical approach is tokenization: never store or transmit raw card numbers.
Tokenization Architecture
Customer enters card → Frontend sends directly to processor (Stripe.js)
    → Processor returns a token (tok_xxx)
    → Your backend only sees the token, never the card number
This architecture can qualify you for SAQ A, the shortest PCI DSS self-assessment questionnaire, because raw card data never touches your servers. Your backend stores only the token, an opaque reference that cannot be reversed into the card number and is usable only through the processor.
# Your backend never sees card numbers
def create_payment_method(customer_id, processor_token):
    """Store a tokenized payment method."""
    # Card display details are hardcoded here for illustration; in practice
    # they come from the processor's response alongside the token
    return db.insert('payment_methods', {
        'customer_id': customer_id,
        'processor_token': processor_token,  # "tok_xxx" not "4242..."
        'last_four': '4242',                 # Display only
        'brand': 'visa',
        'exp_month': 12,
        'exp_year': 2028
    })
Encryption at Rest
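Even tokens and metadata should be encrypted at rest. Use authenticated symmetric encryption such as AES-256-GCM, with keys managed through a hardware security module (HSM) or a cloud KMS. The example below uses Fernet from the cryptography package for brevity; Fernet is built on AES-128-CBC with HMAC-SHA256, so pick an AES-256 primitive if your policy requires 256-bit keys:
from cryptography.fernet import Fernet
# Key stored in AWS KMS / HashiCorp Vault, NOT in code
# (kms.get_key is a placeholder for your key-management client)
cipher = Fernet(kms.get_key('payment-encryption-key'))

def encrypt_sensitive_field(value: str) -> str:
    return cipher.encrypt(value.encode()).decode()

def decrypt_sensitive_field(encrypted: str) -> str:
    return cipher.decrypt(encrypted.encode()).decode()
Key Takeaways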
- Idempotency is non-negotiable. Every write endpoint needs an idempotency key. Store the full response, not just a dedup flag.
- Double-entry ledger ensures financial accuracy. Every money movement creates at least two entries that sum to zero. Entries are immutable; corrections create new reversing entries.
- Payment state machines prevent invalid transitions. Use database-level locks (FOR UPDATE) to prevent concurrent state corruption.
- Never retry declines. Only retry on network errors and timeouts. Use circuit breakers to fail fast when a processor is down.
- Reconciliation catches what automation misses. Daily batch comparison between your records and bank statements catches discrepancies before they compound.
- Tokenization simplifies PCI compliance. Never let raw card numbers touch your servers. Use processor-side tokenization (Stripe.js, Adyen Web Components) to minimize your compliance scope.
Payment systems require an unusual level of rigor. The cost of a bug is not a broken page — it is lost money and lost trust. Design defensively, test exhaustively, and always assume that anything that can fail will fail at the worst possible moment.
