System Design Masterclass

Payments | Tags: payments, banking, distributed-transactions, saga-pattern, idempotency, advanced

Design Wire Transfer API

Design a wire transfer API for banking with strong consistency guarantees

Millions of transfers per day, trillions in volume | Similar to Wise, Stripe, Square, Western Union, JPMorgan, Goldman Sachs | 45 min read

Summary

A wire transfer API moves money between bank accounts with absolute correctness guarantees: money cannot be created, destroyed, or stuck in limbo. The core challenges are ensuring exactly-once semantics in a distributed system, handling partial failures gracefully, maintaining complete audit trails for compliance, and integrating with legacy banking systems. This question comes up in interviews at fintech companies such as Stripe, Wise, and Square, as well as at traditional banks.

Wire transfers are critical infrastructure for:
- Moving money between accounts (internal and external)
- Business payments like payroll, vendor payments, invoices
- International remittances across currencies and jurisdictions
- Settlement systems between financial institutions

Hidden requirements interviewers are testing:
- Do you understand why payments need stronger guarantees than typical systems?
- Can you design for exactly-once semantics?
- Do you know the Saga pattern for distributed transactions?
- Can you handle compliance and audit requirements?
- Do you understand idempotency in payment systems?

Key Takeaways

Core Problem

This is fundamentally a distributed transaction problem where the invariant (money conservation) must NEVER be violated, even during failures.

The Hard Part

Ensuring exactly-once semantics when network failures, timeouts, and retries are inevitable. Money cannot disappear or be duplicated.

Scaling Axis

Scale by partitioning accounts across database shards. Cross-shard transfers require distributed coordination.

Critical Invariant

Sum of all account balances must remain constant. Debit from source must equal credit to destination. No exceptions.

Compliance Requirement

Every transfer must have complete audit trail. Regulators can request transaction history going back years.

Key Tradeoff

We choose strong consistency over availability. A failed transfer is acceptable; an incorrect transfer is catastrophic.

Design Walkthrough

Problem Statement

The Question: Design an API for wire transfers between bank accounts that handles millions of transfers daily with absolute correctness guarantees.

What to say first

Before I design this, I need to understand the consistency requirements. In payments, we typically need strong consistency - eventual consistency is not acceptable when money is involved.

Clarifying Questions

Ask these questions to demonstrate you understand payment systems are different from typical CRUD applications.

Question 1: Scope

Are we designing internal transfers (same bank) or also external transfers (different banks via SWIFT/ACH)?

Why this matters: External transfers involve third-party systems we do not control.
Typical answer: Start with internal, discuss external as extension.
Architecture impact: External transfers need async processing and reconciliation.

Question 2: Consistency

What happens if a transfer partially fails? Can we ever show inconsistent balances to users?

Why this matters: Determines the transaction model.
Typical answer: Never show inconsistent state. A failed transfer should fully roll back.
Architecture impact: Need distributed transactions or the Saga pattern.

Question 3: Latency vs Consistency

Is it acceptable for a transfer to take a few seconds if it guarantees correctness?

Why this matters: Real-time vs batch processing.
Typical answer: Yes, correctness over speed. Users expect transfers to take seconds, not milliseconds.
Architecture impact: Can use synchronous coordination, no need for eventual consistency.

Question 4: Compliance

What audit and compliance requirements exist? AML screening? Transaction limits?

Why this matters: Compliance is not optional in banking.
Typical answer: Full audit trail, AML screening for large transfers, regulatory reporting.
Architecture impact: Immutable event log, compliance service integration.

Stating assumptions

I will assume: internal transfers initially, strong consistency required, correctness over latency, full audit trail needed, transfers under 1M dollars (no special AML holds).

The Hard Part

Say this out loud

The hard part here is ensuring exactly-once semantics for money movement in a distributed system where failures are inevitable.

Why this is genuinely hard:

1. The Double-Spend Problem: If a request times out, did it succeed or fail? Retrying might transfer twice. Not retrying might lose the transfer.
2. Distributed State: Source and destination accounts might be on different database shards or even different banks. How do you ensure atomic update across both?
3. Partial Failures: What if we debit source but crash before crediting destination? Money is stuck in limbo.
4. Network Unreliability: Between API call and response, anything can fail - client, server, database, network. Each failure mode needs handling.

The fundamental invariant

Sum of all balances must be constant. If account A loses $100, account B must gain exactly $100. This invariant can NEVER be violated, even temporarily.

The money conservation equation:

Before transfer: Balance(A) + Balance(B) = X
After transfer: Balance(A) + Balance(B) = X

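At no point, even during the transfer, should this equation be false. With a saga, this means funds that have been debited but not yet credited must be recorded explicitly (for example, as a pending transfer in the saga state), so they are accounted for rather than missing.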

Common mistake

Candidates often propose simple debit-then-credit without considering what happens if the process crashes between the two operations. Always ask: what if we fail here?

Scale and Access Patterns

Let me estimate the scale and understand access patterns for a major bank or fintech.

Dimension	Value	Impact
Daily transfers	10 million	Need distributed processing
Peak TPS	500-1000	Much lower than social media - can afford strong consistency

+ 4 more rows...

What to say

The scale is high but TPS is manageable - 1000 TPS is not Google scale. This means we CAN afford strong consistency mechanisms that would be too slow for social media apps.

Access Pattern Analysis:

- Write-heavy for transfers: Each transfer is a write to two accounts
- Read-heavy for balances: Users check balance much more than they transfer
- Hot accounts exist: Business accounts have much higher transaction volume
- Temporal patterns: Month-end, payroll days have 10x normal volume
- Audit reads: Rare but must return complete history

Daily transfers: 10M
Peak multiplier: 3x average
Peak TPS: 10M / 86400 * 3 = ~350 TPS

+ 9 more lines...
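The peak estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `peak_tps` is made up for illustration, and the 10x case borrows the month-end multiplier from the access pattern list rather than from the original calculation.

```python
SECONDS_PER_DAY = 86_400


def peak_tps(daily_transfers: int, peak_multiplier: float) -> float:
    """Average transfers per second, scaled by a peak multiplier."""
    return daily_transfers / SECONDS_PER_DAY * peak_multiplier


average = peak_tps(10_000_000, 1)        # about 116 TPS
typical_peak = peak_tps(10_000_000, 3)   # about 347 TPS, the ~350 figure above
month_end = peak_tps(10_000_000, 10)     # about 1157 TPS on payroll or month-end days
```

The 3x multiplier gives the ~350 TPS figure, while the 10x month-end spike lands a little above 1000 TPS, which is roughly in line with the upper end of the 500-1000 range quoted in the table.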

High-Level Architecture

Let me design a transfer system using the Saga pattern for distributed transactions.

What to say

I will use a Saga pattern with an orchestrator. This gives us distributed transactions without two-phase commit, which does not scale well. We scale by partitioning accounts across shards.

[Figure: Wire Transfer Architecture]

Component Responsibilities:

1. API Gateway: Rate limiting, authentication, request validation
2. Transfer Service (Orchestrator): Coordinates the saga, maintains transfer state machine, handles retries and compensation
3. Account Services: Own account balances, execute debits and credits, enforce business rules (minimum balance, daily limits)
4. Transfer DB: Stores transfer state, idempotency keys, saga state
5. Account DBs: Sharded by account ID, stores balances and transaction history
6. Audit Log: Immutable append-only log of all operations for compliance
7. AML/Fraud Services: Screen transfers before execution

Real-world reference

Stripe has published a blog post on idempotency that describes how idempotency keys make payment requests safe to retry. It is a useful reference for the exactly-once handling in this design.

Data Model and Storage

The data model must support strong consistency, audit trails, and idempotency.

What to say

I will use PostgreSQL for strong ACID guarantees. The key tables are transfers (saga state), accounts (balances), and ledger entries (immutable audit trail).

-- Transfer saga state
CREATE TABLE transfers (
    id UUID PRIMARY KEY,

+ 54 more lines...

Key Design Decisions:

1. Idempotency Key: Client provides unique key. Same key returns same result without re-executing.
2. Version Column: Optimistic locking detects concurrent updates to the same account; a writer whose version check fails must retry.
3. Ledger Entries: Immutable audit trail. Every balance change has a corresponding entry.
4. Double-Entry: For every debit there is a credit. Sum of all signed entries = 0.

Important detail

The ledger_entries table is append-only. We NEVER update or delete from it. This is critical for audit compliance - regulators need to see complete history.
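To make the double-entry and append-only ideas concrete, here is a minimal in-memory sketch. The class and field names (`Ledger`, `LedgerEntry`, `post_transfer`) and the `EXTERNAL` funding account are assumptions for illustration; they are not taken from the truncated schema above, and a real system would store these rows in the database.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)  # frozen: an entry cannot be modified once created
class LedgerEntry:
    transfer_id: str
    account_id: str
    amount: Decimal  # negative = debit, positive = credit


class Ledger:
    """Append-only ledger: entries are added, never updated or deleted."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def post_transfer(self, transfer_id: str, source: str,
                      destination: str, amount: Decimal) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Both legs are appended together, so the signed sum stays zero.
        self._entries.extend([
            LedgerEntry(transfer_id, source, -amount),
            LedgerEntry(transfer_id, destination, amount),
        ])

    def balance(self, account_id: str) -> Decimal:
        # Balances are derived by summing the account's entries.
        return sum((e.amount for e in self._entries
                    if e.account_id == account_id), Decimal("0"))

    def total(self) -> Decimal:
        return sum((e.amount for e in self._entries), Decimal("0"))


ledger = Ledger()
# "EXTERNAL" is a hypothetical account representing money entering the system.
ledger.post_transfer("t0", "EXTERNAL", "A", Decimal("500.00"))
ledger.post_transfer("t1", "A", "B", Decimal("100.00"))
```

After these two postings, account A holds 400.00, account B holds 100.00, and the signed sum over all entries is zero, which is the property an auditor or reconciliation job can check at any time.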

Transfer Flow Deep Dive

Let me walk through the complete transfer flow using the Saga pattern.

[Figure: Transfer Saga State Machine]

class TransferOrchestrator:
    def execute_transfer(self, request: TransferRequest) -> TransferResult:
        # Step 1: Idempotency check

+ 41 more lines...
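Because the orchestrator listing above is truncated, the following is a separate, simplified sketch of the debit, credit, and compensate sequence. It does not reproduce the original code. The state names, the `InMemoryAccounts` class, and `run_saga` are all assumptions for illustration, and persistence of saga state is reduced to a comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class TransferState(Enum):
    PENDING = "PENDING"
    DEBITED = "DEBITED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"      # nothing was debited
    REVERSED = "REVERSED"  # debit was compensated


class AccountError(Exception):
    pass


class InMemoryAccounts:
    """Stand-in for the account services; balances are integer cents."""

    def __init__(self, balances: Dict[str, int], closed: Iterable[str] = ()):
        self.balances = dict(balances)
        self.closed = set(closed)

    def debit(self, account_id: str, amount_cents: int) -> None:
        if self.balances.get(account_id, 0) < amount_cents:
            raise AccountError("insufficient funds")
        self.balances[account_id] -= amount_cents

    def credit(self, account_id: str, amount_cents: int) -> None:
        if account_id in self.closed or account_id not in self.balances:
            raise AccountError("destination unavailable")
        self.balances[account_id] += amount_cents


@dataclass
class Transfer:
    id: str
    source: str
    destination: str
    amount_cents: int
    state: TransferState = TransferState.PENDING


def run_saga(transfer: Transfer, accounts: InMemoryAccounts) -> TransferState:
    # Step 1: debit the source. A failure here needs no compensation.
    try:
        accounts.debit(transfer.source, transfer.amount_cents)
    except AccountError:
        transfer.state = TransferState.FAILED
        return transfer.state
    # A real orchestrator would persist this state before continuing.
    transfer.state = TransferState.DEBITED

    # Step 2: credit the destination; on failure, refund the source.
    try:
        accounts.credit(transfer.destination, transfer.amount_cents)
    except AccountError:
        accounts.credit(transfer.source, transfer.amount_cents)
        transfer.state = TransferState.REVERSED
        return transfer.state
    transfer.state = TransferState.COMPLETED
    return transfer.state


accounts = InMemoryAccounts({"A": 500, "B": 0, "C": 0}, closed={"C"})
completed = run_saga(Transfer("t1", "A", "B", 100), accounts)
reversed_state = run_saga(Transfer("t2", "A", "C", 100), accounts)
```

The second transfer targets a closed account, so the credit fails and the debit is refunded. The total across all accounts is 500 cents before and after both runs.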

The Debit Operation (with optimistic locking):

-- This runs in a transaction
BEGIN;

+ 31 more lines...

Why optimistic locking?

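Optimistic locking with version numbers avoids long-held locks while still detecting concurrent modification: if the version changed since we read the row, the update matches no rows and we retry. It works best when conflicts are rare. On hot accounts with heavy contention, retries can pile up, and a pessimistic row lock (SELECT ... FOR UPDATE) or the hot-account techniques discussed under scaling may perform better.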
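A version-checked debit can be sketched as follows. The design above uses PostgreSQL; this example swaps in Python's built-in `sqlite3` so it runs without a server. The table and column names (`accounts`, `balance_cents`, `version`) and the function `debit_with_version_check` are assumptions, since the original schema is truncated.

```python
import sqlite3


def debit_with_version_check(conn: sqlite3.Connection, account_id: str,
                             amount_cents: int, max_retries: int = 3) -> bool:
    """Return True if debited, False if funds are insufficient."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT balance_cents, version FROM accounts WHERE id = ?",
            (account_id,),
        ).fetchone()
        if row is None:
            raise LookupError(f"unknown account {account_id}")
        balance, version = row
        if balance < amount_cents:
            return False
        # The WHERE clause only matches if nobody changed the row since we read it.
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ?, "
            "version = version + 1 WHERE id = ? AND version = ?",
            (amount_cents, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True
        # rowcount == 0: the version moved, so re-read and try again.
    raise RuntimeError("too many concurrent updates")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance_cents INTEGER NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES ('A', 500, 0)")
conn.commit()

first = debit_with_version_check(conn, "A", 200)   # succeeds, balance becomes 300
second = debit_with_version_check(conn, "A", 400)  # rejected, only 300 left
final_row = conn.execute(
    "SELECT balance_cents, version FROM accounts WHERE id = 'A'"
).fetchone()
```

The key line is the `WHERE ... AND version = ?` predicate: a stale writer updates zero rows instead of overwriting a newer balance.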

Idempotency Deep Dive

Critical requirement

Idempotency is not optional for payments. If a client retries a timed-out request, we MUST return the same result, not execute the transfer twice.

How idempotency works:

1. Client generates unique idempotency key (UUID or hash of transfer details)
2. Server checks if key exists before processing
3. If exists, return stored response
4. If not, process and store response with key
5. Keys expire after reasonable time (24-48 hours)

class IdempotencyHandler:
    def handle_request(self, idempotency_key: str, 
                       process_fn: Callable) -> Response:

+ 26 more lines...

Scenario	Behavior	Result
First request	Process transfer, store response	New transfer executed
Retry with same key (success)	Return stored response	Same successful response
Retry with same key (failed)	Return stored failure	Same failure response
Concurrent duplicate	Second waits or fails	Only one executes
Different key, same details	Process as new transfer	Duplicate transfer (client error)
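The behaviors in the table can be sketched with a small in-memory store. This is not the truncated `IdempotencyHandler` above; `InMemoryIdempotencyStore` and its `handle` method are hypothetical names. The single lock is a deliberate simplification: it makes a concurrent duplicate wait, but it also serializes unrelated keys, and key expiry is left out. A production system would more likely rely on a unique constraint on the key column.

```python
import threading
from typing import Any, Callable, Dict


class InMemoryIdempotencyStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._responses: Dict[str, Any] = {}

    def handle(self, key: str, process_fn: Callable[[], Any]) -> Any:
        # Holding the lock across process_fn means a concurrent duplicate
        # waits, then finds the stored response instead of executing again.
        with self._lock:
            if key in self._responses:
                return self._responses[key]
            try:
                response = process_fn()
            except Exception as exc:
                # Failures are stored too, so a retry sees the same failure.
                response = {"status": "failed", "error": str(exc)}
            self._responses[key] = response
            return response


executions = []


def do_transfer() -> dict:
    executions.append("t1")
    return {"status": "completed", "transfer_id": "t1"}


def failing_transfer() -> dict:
    raise ValueError("insufficient funds")


store = InMemoryIdempotencyStore()
first = store.handle("key-123", do_transfer)
retry = store.handle("key-123", do_transfer)          # stored response, no re-execution
failed = store.handle("key-456", failing_transfer)
failed_retry = store.handle("key-456", failing_transfer)
```

The retry returns the stored response and `do_transfer` runs exactly once. Whether failures should be cached permanently or only for non-retryable errors is a policy choice this sketch does not settle.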

Client responsibility

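The client must reuse the same idempotency key for every retry of the same logical transfer. The usual approach is to generate a UUID once, persist it alongside the pending transfer, and send it on each attempt. Deriving the key from a hash of (user_id, destination, amount, timestamp rounded to minute) is sometimes used as a safety net against client bugs, but it has two failure modes: a retry that crosses a minute boundary produces a different key, and two intentional identical transfers in the same minute collapse into one.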

Consistency and Invariants

System Invariants

1. Sum of all account balances is constant (money conservation)
2. No account balance goes negative
3. Every balance change has a ledger entry
4. Completed transfers are irreversible without explicit reversal

Why strong consistency is required:

Unlike social media (where eventual consistency is fine), payments have real-world consequences:

Violation	Business Impact	Legal Impact
Double credit	Bank loses money	Fraud investigation
Double debit	Customer loses money	Lawsuit, regulatory fine
Stuck in limbo	Money inaccessible	Customer complaints, chargeback
Missing audit entry	Cannot prove transaction	Regulatory violation

What to say

In payments, we choose consistency over availability. If we cannot guarantee correctness, we fail the request. Users would rather see an error than lose money.

Ensuring consistency across shards:

When source and destination are on different shards, we cannot use a single database transaction. Options:

Approach	Pros	Cons
Two-Phase Commit	Strong consistency	Slow, coordinator is SPOF
Saga with compensation	Better availability	Temporary inconsistency window
Event sourcing	Complete audit trail	Complex, eventual consistency
Hybrid: same-shard atomic	Best of both	Requires careful shard assignment

Recommended approach: Saga with immediate compensation

The inconsistency window is minimal (milliseconds to seconds) and we can always recover to consistent state via compensation.

Failure Modes and Resilience

Proactively discuss failures

Let me walk through failure scenarios. In payments, every failure mode must have a defined recovery path.

Failure	Impact	Recovery	Prevention
Crash after debit before credit	Money in limbo	Saga recovery worker completes or reverses	Persistent saga state
Database unavailable	Transfers fail	Fail fast, client retries with same idempotency key	Multi-AZ database
Network timeout	Unknown state	Client retries, idempotency handles dedup	Idempotency keys
Destination account closed	Credit fails	Reverse debit automatically	Pre-validation
Insufficient funds race	Debit fails	Transaction isolation handles	FOR UPDATE locking

Saga Recovery Worker:

A background process that scans for stuck transfers and completes or reverses them:

class SagaRecoveryWorker:
    def run(self):
        while True:

+ 31 more lines...
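Since the worker listing above is truncated, here is an independent sketch of a single recovery pass. The names `TransferRecord`, `find_stuck`, and `recover`, the string state values, and the 60-second timeout are assumptions for illustration. The `try_credit` and `refund` callables stand in for calls to the account services and must themselves be idempotent, because the original crash may have happened after the credit succeeded but before the state was saved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TransferRecord:
    id: str
    state: str          # "PENDING", "DEBITED", "COMPLETED", or "REVERSED"
    updated_at: float   # seconds since epoch


def find_stuck(transfers: Iterable[TransferRecord], now: float,
               timeout_seconds: float = 60.0) -> List[TransferRecord]:
    """Transfers that were debited but have not progressed within the timeout."""
    return [t for t in transfers
            if t.state == "DEBITED" and now - t.updated_at > timeout_seconds]


def recover(transfer: TransferRecord,
            try_credit: Callable[[str], bool],
            refund: Callable[[str], None]) -> str:
    """Drive one stuck transfer to a terminal state."""
    if try_credit(transfer.id):
        transfer.state = "COMPLETED"
    else:
        refund(transfer.id)
        transfer.state = "REVERSED"
    return transfer.state


now = 1_000.0
transfers = [
    TransferRecord("t1", "DEBITED", now - 300),    # stuck, credit will succeed
    TransferRecord("t2", "DEBITED", now - 5),      # still within the timeout
    TransferRecord("t3", "COMPLETED", now - 300),  # already terminal
    TransferRecord("t4", "DEBITED", now - 300),    # stuck, credit will fail
]
stuck = find_stuck(transfers, now)
refunded: List[str] = []
results = {
    t.id: recover(t, try_credit=lambda tid: tid == "t1", refund=refunded.append)
    for t in stuck
}
```

In a real worker this pass would run in a loop against the transfer database, with alerting on any transfer that cannot be completed or reversed.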

Never fail silently

Unlike other systems where we might drop failed operations, payment failures must be tracked, alerted, and resolved. Every transfer must reach a terminal state.

Evolution and Scaling

What to say

This design handles millions of daily transfers. Let me discuss how it evolves for external transfers, global deployment, and 10x scale.

Evolution Path:

Stage 1: Internal Transfers (Current Design)
- Same-bank transfers only
- Single region
- PostgreSQL for strong consistency

Stage 2: External Transfers
- Integration with SWIFT, ACH, FedWire
- Async processing (external systems are slow)
- Reconciliation system for settlement

Stage 3: Multi-Currency
- FX rate service integration
- Currency conversion ledger entries
- Settlement in multiple currencies

Stage 4: Global Deployment
- Regional databases for latency
- Cross-region transfers via message queue
- Global reconciliation

[Figure: External Transfer Integration]

Scaling considerations:

1. Database Sharding: Shard by account_id. Cross-shard transfers use the saga pattern.
2. Read Replicas: Balance checks can hit replicas (with careful handling of replication lag).
3. Hot Accounts: High-volume merchant accounts need special handling - dedicated shards or pre-computed running balances.
4. Event Sourcing: At very high scale, consider event sourcing where ledger is the source of truth and balances are derived.
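Sharding by account_id can be sketched as a stable hash followed by a modulo, with a helper that decides whether a transfer stays on one shard or needs the saga. The shard count of 16 and the names `shard_for` and `needs_saga` are assumptions for illustration. `hashlib` is used instead of Python's built-in `hash()` because string hashes from `hash()` vary between processes.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(account_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an account ID to a shard index that is stable across processes."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def needs_saga(source: str, destination: str,
               num_shards: int = NUM_SHARDS) -> bool:
    """Same-shard transfers can use one local transaction; others need the saga."""
    return shard_for(source, num_shards) != shard_for(destination, num_shards)


source_shard = shard_for("acct-001")
same_account = needs_saga("acct-001", "acct-001")
single_shard = needs_saga("acct-001", "acct-002", num_shards=1)
```

Plain modulo hashing is the simplest option but makes resharding expensive, since changing the shard count remaps most accounts. A lookup table or consistent hashing are common alternatives when shards must be added over time.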

Alternative approach

If I needed even stronger consistency guarantees (like for inter-bank settlement), I would consider using a database with built-in distributed transactions like CockroachDB or Spanner, accepting the latency tradeoff.

Design Trade-offs

Option: Two-Phase Commit

Advantages

+ Strong consistency
+ Atomic across shards
+ Simple mental model

Disadvantages

- Coordinator is SPOF
- Blocking protocol
- High latency

When to use

Small scale, cannot tolerate any inconsistency window
