Design Walkthrough
Problem Statement
The Question: Design an API for wire transfers between bank accounts that handles millions of transfers daily with absolute correctness guarantees.
Wire transfers are critical infrastructure for:
- Moving money between accounts (internal and external)
- Business payments like payroll, vendor payments, invoices
- International remittances across currencies and jurisdictions
- Settlement systems between financial institutions
What to say first
Before I design this, I need to understand the consistency requirements. In payments, we typically need strong consistency - eventual consistency is not acceptable when money is involved.
Hidden requirements interviewers are testing:
- Do you understand why payments need stronger guarantees than typical systems?
- Can you design for exactly-once semantics?
- Do you know the Saga pattern for distributed transactions?
- Can you handle compliance and audit requirements?
- Do you understand idempotency in payment systems?
Clarifying Questions
Ask these questions to demonstrate you understand payment systems are different from typical CRUD applications.
Question 1: Scope
Are we designing internal transfers (same bank) or also external transfers (different banks via SWIFT/ACH)?
Why this matters: External transfers involve third-party systems we do not control.
Typical answer: Start with internal, discuss external as an extension.
Architecture impact: External transfers need async processing and reconciliation.
Question 2: Consistency
What happens if a transfer partially fails? Can we ever show inconsistent balances to users?
Why this matters: Determines the transaction model.
Typical answer: Never show inconsistent state. A failed transfer should fully roll back.
Architecture impact: Need distributed transactions or the Saga pattern.
Question 3: Latency vs Consistency
Is it acceptable for a transfer to take a few seconds if it guarantees correctness?
Why this matters: Real-time vs batch processing.
Typical answer: Yes, correctness over speed. Users expect transfers to take seconds, not milliseconds.
Architecture impact: Can use synchronous coordination, no need for eventual consistency.
Question 4: Compliance
What audit and compliance requirements exist? AML screening? Transaction limits?
Why this matters: Compliance is not optional in banking.
Typical answer: Full audit trail, AML screening for large transfers, regulatory reporting.
Architecture impact: Immutable event log, compliance service integration.
Stating assumptions
I will assume: internal transfers initially, strong consistency required, correctness over latency, full audit trail needed, transfers under 1M dollars (no special AML holds).
The Hard Part
Say this out loud
The hard part here is ensuring exactly-once semantics for money movement in a distributed system where failures are inevitable.
Why this is genuinely hard:
1. The Double-Spend Problem: If a request times out, did it succeed or fail? Retrying might transfer twice. Not retrying might lose the transfer.
2. Distributed State: Source and destination accounts might be on different database shards or even different banks. How do you ensure an atomic update across both?
3. Partial Failures: What if we debit the source but crash before crediting the destination? Money is stuck in limbo.
4. Network Unreliability: Between API call and response, anything can fail - client, server, database, network. Each failure mode needs handling.
The fundamental invariant
Sum of all balances must be constant. If account A loses $100, account B must gain exactly $100. This invariant can NEVER be violated, even temporarily.
The money conservation equation:
Before transfer: Balance(A) + Balance(B) = X
After transfer: Balance(A) + Balance(B) = X
At no point, even during the transfer, should this equation be false.
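The conservation check can be sketched in a few lines. This is a minimal single-process illustration, not the distributed design: the `transfer` function and the in-memory `balances` dict are hypothetical names introduced here, and `Decimal` is used so that currency amounts are not subject to binary floating-point rounding.

```python
from decimal import Decimal


def transfer(balances: dict[str, Decimal], src: str, dst: str, amount: Decimal) -> None:
    """Move `amount` from src to dst and verify that the total is unchanged."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    total_before = sum(balances.values())
    balances[src] -= amount
    balances[dst] += amount
    # Money conservation: the sum across all accounts must not change.
    assert sum(balances.values()) == total_before


# Hypothetical example accounts.
balances = {"A": Decimal("500.00"), "B": Decimal("200.00")}
transfer(balances, "A", "B", Decimal("100.00"))
```

In a single process the two balance updates cannot be separated by a crash that leaves state behind, which is exactly the property the rest of the design has to recreate across shards.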
Common mistake
Candidates often propose simple debit-then-credit without considering what happens if the process crashes between the two operations. Always ask: what if we fail here?
Scale and Access Patterns
Let me estimate the scale and understand access patterns for a major bank or fintech.
| Dimension | Value | Impact |
|---|---|---|
| Daily transfers | 10 million | Need distributed processing |
| Peak TPS | 500-1000 | Much lower than social media - can afford strong consistency |
What to say
The scale is high but TPS is manageable - 1000 TPS is not Google scale. This means we CAN afford strong consistency mechanisms that would be too slow for social media apps.
Access Pattern Analysis:
- Write-heavy for transfers: Each transfer is a write to two accounts
- Read-heavy for balances: Users check balance much more than they transfer
- Hot accounts exist: Business accounts have much higher transaction volume
- Temporal patterns: Month-end, payroll days have 10x normal volume
- Audit reads: Rare but must return complete history
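Daily transfers: 10M
Peak multiplier: 3x average
Peak TPS: 10M / 86400 * 3 = ~350 TPS
The table's 500-1000 TPS range sits above this estimate, leaving headroom for month-end and payroll spikes.
High-Level Architecture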
Let me design a transfer system using the Saga pattern for distributed transactions.
What to say
I will use a Saga pattern with an orchestrator. This gives us distributed transactions without two-phase commit, which does not scale well. We scale by partitioning accounts across shards.
Wire Transfer Architecture
Component Responsibilities:
1. API Gateway: Rate limiting, authentication, request validation
2. Transfer Service (Orchestrator): Coordinates the saga, maintains transfer state machine, handles retries and compensation
3. Account Services: Own account balances, execute debits and credits, enforce business rules (minimum balance, daily limits)
4. Transfer DB: Stores transfer state, idempotency keys, saga state
5. Account DBs: Sharded by account ID, store balances and transaction history
6. Audit Log: Immutable append-only log of all operations for compliance
7. AML/Fraud Services: Screen transfers before execution
Real-world reference
Stripe's public writing on idempotency is a useful reference here: it describes how idempotency keys make payment API requests safe to retry, which is the same building block this design relies on for exactly-once behavior.
Data Model and Storage
The data model must support strong consistency, audit trails, and idempotency.
What to say
I will use PostgreSQL for strong ACID guarantees. The key tables are transfers (saga state), accounts (balances), and ledger entries (immutable audit trail).
-- Transfer saga state
CREATE TABLE transfers (
    id UUID PRIMARY KEY,
    -- ... (remaining columns not shown)
);
Key Design Decisions:
1. Idempotency Key: Client provides a unique key. The same key returns the same result without re-executing.
2. Version Column: Optimistic locking prevents lost updates from concurrent writes to the same account.
3. Ledger Entries: Immutable audit trail. Every balance change has a corresponding entry.
4. Double-Entry: For every debit there is a credit. Sum of all entries = 0.
Important detail
The ledger_entries table is append-only. We NEVER update or delete from it. This is critical for audit compliance - regulators need to see complete history.
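Since the schema above is only partially shown, here is one possible shape for the three tables, as a sketch rather than the original definition. Every column name below is an assumption, SQLite (via Python's standard `sqlite3` module) stands in for PostgreSQL so the example runs anywhere, and amounts are stored as integer cents. The two triggers are one way to enforce the append-only rule at the database level.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not taken from the source.
SCHEMA = """
CREATE TABLE accounts (
    id            TEXT PRIMARY KEY,
    balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0),
    version       INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE transfers (
    id              TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    source_id       TEXT NOT NULL REFERENCES accounts(id),
    dest_id         TEXT NOT NULL REFERENCES accounts(id),
    amount_cents    INTEGER NOT NULL CHECK (amount_cents > 0),
    status          TEXT NOT NULL
);
CREATE TABLE ledger_entries (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    transfer_id  TEXT NOT NULL REFERENCES transfers(id),
    account_id   TEXT NOT NULL REFERENCES accounts(id),
    amount_cents INTEGER NOT NULL  -- negative = debit, positive = credit
);
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO accounts (id, balance_cents) VALUES (?, ?)",
    [("A", 50_000), ("B", 20_000)],
)
conn.execute(
    "INSERT INTO transfers VALUES ('t1', 'key-1', 'A', 'B', 10000, 'COMPLETED')"
)
# Double-entry: one debit and one credit per transfer.
conn.executemany(
    "INSERT INTO ledger_entries (transfer_id, account_id, amount_cents) VALUES (?, ?, ?)",
    [("t1", "A", -10_000), ("t1", "B", 10_000)],
)
conn.commit()

# The double-entry invariant: all ledger amounts sum to zero.
ledger_sum = conn.execute("SELECT SUM(amount_cents) FROM ledger_entries").fetchone()[0]
```

The `UNIQUE` constraint on `idempotency_key` is what lets the database itself reject a second insert for the same key, rather than relying on application code alone.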
Transfer Flow Deep Dive
Let me walk through the complete transfer flow using the Saga pattern.
Transfer Saga State Machine
class TransferOrchestrator:
    def execute_transfer(self, request: TransferRequest) -> TransferResult:
        # Step 1: Idempotency check
        ...  # remaining saga steps are not shown
The Debit Operation (with optimistic locking):
-- This runs in a transaction
BEGIN;
Why optimistic locking?
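Optimistic locking with version numbers avoids long-held locks while ensuring no concurrent modification goes unnoticed. If the version changed between our read and our write, we retry. This works well when conflicts on a single account are rare; on very hot accounts, frequent retries can erase the advantage, which is why those accounts get special handling later.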
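A version-checked debit might look like the following. This is a sketch under stated assumptions: the `debit` function, the `accounts` column names, and the retry limit are all hypothetical, and SQLite stands in for PostgreSQL. The key line is the `WHERE ... AND version = ?` clause, which makes the update a no-op if another writer got there first.

```python
import sqlite3


def debit(conn: sqlite3.Connection, account_id: str, amount_cents: int,
          max_retries: int = 3) -> bool:
    """Debit with optimistic locking. Returns False on insufficient funds."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT balance_cents, version FROM accounts WHERE id = ?",
            (account_id,),
        ).fetchone()
        if row is None:
            raise KeyError(f"unknown account {account_id}")
        balance, version = row
        if balance < amount_cents:
            return False
        # Only matches if nobody bumped the version since our read.
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = ?, version = ? "
            "WHERE id = ? AND version = ?",
            (balance - amount_cents, version + 1, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True
        # rowcount == 0: a concurrent writer won, so re-read and retry.
    raise RuntimeError("gave up after repeated version conflicts")


# Hypothetical setup for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance_cents INTEGER NOT NULL, version INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO accounts (id, balance_cents) VALUES ('A', 50000)")
conn.commit()
ok = debit(conn, "A", 10_000)
```

A real debit would also write the matching ledger entry in the same database transaction; that step is left out here to keep the locking logic visible.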
Idempotency Deep Dive
Critical requirement
Idempotency is not optional for payments. If a client retries a timed-out request, we MUST return the same result, not execute the transfer twice.
How idempotency works:
1. Client generates a unique idempotency key (UUID or hash of transfer details)
2. Server checks if the key exists before processing
3. If it exists, return the stored response
4. If not, process and store the response with the key
5. Keys expire after a reasonable time (24-48 hours)
class IdempotencyHandler:
    def handle_request(self, idempotency_key: str,
                       process_fn: Callable) -> Response:
        ...  # body not shown
| Scenario | Behavior | Result |
|---|---|---|
| First request | Process transfer, store response | New transfer executed |
| Retry with same key (success) | Return stored response | Same successful response |
| Retry with same key (failed) | Return stored failure | Same failure response |
| Concurrent duplicate | Second waits or fails | Only one executes |
| Different key, same details | Process as new transfer | Duplicate transfer (client error) |
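The first four rows of the table can be sketched with an in-memory store. This is an illustration only: `InMemoryIdempotencyStore` and `process_transfer` are hypothetical names, a single lock stands in for what would be a unique constraint on the key in the Transfer DB, and key expiry is omitted. Failures are modeled as returned responses, so they are stored and replayed just like successes.

```python
import threading
from typing import Any, Callable


class InMemoryIdempotencyStore:
    """Replays the stored response for a key instead of re-executing."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._responses: dict[str, Any] = {}

    def handle(self, key: str, process_fn: Callable[[], Any]) -> Any:
        # Holding the lock while processing means a concurrent duplicate
        # waits, then finds the stored response. This serializes all
        # requests, which is acceptable only for a sketch.
        with self._lock:
            if key in self._responses:
                return self._responses[key]
            response = process_fn()
            self._responses[key] = response
            return response


executions: list[str] = []


def process_transfer() -> dict[str, str]:
    """Hypothetical transfer; records each real execution."""
    executions.append("ran")
    return {"status": "COMPLETED", "transfer_id": "t1"}


store = InMemoryIdempotencyStore()
first = store.handle("key-1", process_transfer)
retry = store.handle("key-1", process_transfer)  # replayed, not re-executed
```

Note that an exception raised by `process_fn` is not cached in this sketch, so a crashed attempt can be retried under the same key.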
Client responsibility
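The client must generate idempotency keys so that every retry of the same logical transfer carries the same key. One common approach is a hash of (user_id, destination, amount, timestamp rounded to minute), which catches duplicate submissions even if the client has a bug. The trade-off is that a retry crossing a minute boundary gets a different key, and two intentional identical transfers within the same minute collapse into one, so an alternative is to generate a UUID once per user action and reuse it on every retry.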
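A hash-based key of the kind described above could be derived like this. The function name, field separator, and use of SHA-256 are assumptions made for illustration; the amount is passed as integer cents to avoid formatting differences between retries.

```python
import hashlib


def derive_idempotency_key(user_id: str, destination: str,
                           amount_cents: int, epoch_seconds: int) -> str:
    """Hash the transfer details with the timestamp rounded down to the minute."""
    minute_bucket = epoch_seconds // 60
    payload = f"{user_id}|{destination}|{amount_cents}|{minute_bucket}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two submissions within the same minute produce the same key...
key_a = derive_idempotency_key("user-1", "acct-9", 10_000, 120)
key_b = derive_idempotency_key("user-1", "acct-9", 10_000, 179)
# ...while one second later, in the next minute, the key changes.
key_c = derive_idempotency_key("user-1", "acct-9", 10_000, 180)
```

The last line shows the weak spot of time-bucketed keys: a retry that happens to land in the next minute is treated as a new transfer.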
Consistency and Invariants
System Invariants
1. Sum of all account balances is constant (money conservation)
2. No account balance goes negative
3. Every balance change has a ledger entry
4. Completed transfers are irreversible without explicit reversal
Why strong consistency is required:
Unlike social media (where eventual consistency is fine), payments have real-world consequences:
| Violation | Business Impact | Legal Impact |
|---|---|---|
| Double credit | Bank loses money | Fraud investigation |
| Double debit | Customer loses money | Lawsuit, regulatory fine |
| Stuck in limbo | Money inaccessible | Customer complaints, chargeback |
| Missing audit entry | Cannot prove transaction | Regulatory violation |
What to say
In payments, we choose consistency over availability. If we cannot guarantee correctness, we fail the request. Users would rather see an error than lose money.
Ensuring consistency across shards:
When source and destination are on different shards, we cannot use a single database transaction. Options:
| Approach | Pros | Cons |
|---|---|---|
| Two-Phase Commit | Strong consistency | Slow, coordinator is SPOF |
| Saga with compensation | Better availability | Temporary inconsistency window |
| Event sourcing | Complete audit trail | Complex, eventual consistency |
| Hybrid: same-shard atomic | Best of both | Requires careful shard assignment |
Recommended approach: Saga with immediate compensation
The inconsistency window is minimal (milliseconds to seconds) and we can always recover to consistent state via compensation.
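The debit, credit, and compensation steps can be sketched as a small state machine. This is a single-process illustration under assumptions: the state names, the `run_saga` function, and the `closed_accounts` set are hypothetical, and the in-memory dict stands in for the Account Services. In the real design each state change would be persisted before moving to the next step.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    DEBITED = "DEBITED"
    COMPLETED = "COMPLETED"
    REVERSED = "REVERSED"
    FAILED = "FAILED"


class CreditError(Exception):
    """Raised when the destination cannot accept the credit."""


@dataclass
class Transfer:
    source: str
    dest: str
    amount_cents: int
    state: State = State.PENDING


def run_saga(transfer: Transfer, balances: dict[str, int],
             closed_accounts: set[str]) -> State:
    # Step 1: debit the source, or fail before any money moves.
    if balances[transfer.source] < transfer.amount_cents:
        transfer.state = State.FAILED
        return transfer.state
    balances[transfer.source] -= transfer.amount_cents
    transfer.state = State.DEBITED

    # Step 2: credit the destination.
    try:
        if transfer.dest in closed_accounts:
            raise CreditError(f"account {transfer.dest} is closed")
        balances[transfer.dest] += transfer.amount_cents
        transfer.state = State.COMPLETED
    except CreditError:
        # Compensation: undo the debit so no money is left in limbo.
        balances[transfer.source] += transfer.amount_cents
        transfer.state = State.REVERSED
    return transfer.state


balances = {"A": 50_000, "B": 20_000, "C": 0}
happy = run_saga(Transfer("A", "B", 10_000), balances, closed_accounts={"C"})
reversed_state = run_saga(Transfer("A", "C", 5_000), balances, closed_accounts={"C"})
```

The window between `DEBITED` and a terminal state is the inconsistency window mentioned above; compensation guarantees it always closes.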
Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios. In payments, every failure mode must have a defined recovery path.
| Failure | Impact | Recovery | Prevention |
|---|---|---|---|
| Crash after debit before credit | Money in limbo | Saga recovery worker completes or reverses | Persistent saga state |
| Database unavailable | Transfers fail | Fail fast, client retries with same idempotency key | Multi-AZ database |
| Network timeout | Unknown state | Client retries, idempotency handles dedup | Idempotency keys |
| Destination account closed | Credit fails | Reverse debit automatically | Pre-validation |
| Insufficient funds race | Second debit fails | Row lock or version check serializes concurrent debits, so the loser sees the updated balance | FOR UPDATE locking or version check |
Saga Recovery Worker:
A background process that scans for stuck transfers and completes or reverses them:
class SagaRecoveryWorker:
    def run(self):
        while True:
            ...  # loop body not shown
Never fail silently
Unlike other systems where we might drop failed operations, payment failures must be tracked, alerted, and resolved. Every transfer must reach a terminal state.
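One pass of such a recovery worker might look like the sketch below. The `TransferRecord` fields, the `DEBITED` state name, and the 60-second threshold are assumptions for illustration. The sketch also assumes a transfer stuck in `DEBITED` has not been credited yet; a real worker would need the credit itself to be idempotent, since the orchestrator may have crashed after crediting but before recording it.

```python
from dataclasses import dataclass


@dataclass
class TransferRecord:
    id: str
    source: str
    dest: str
    amount_cents: int
    state: str
    updated_at: float  # seconds since epoch


def recover_once(transfers: list[TransferRecord], balances: dict[str, int],
                 closed_accounts: set[str], now: float,
                 stuck_after_seconds: float = 60.0) -> list[str]:
    """Drive every stuck transfer to a terminal state; return the ids touched."""
    recovered: list[str] = []
    for t in transfers:
        if t.state != "DEBITED" or now - t.updated_at < stuck_after_seconds:
            continue  # terminal, not yet debited, or still in progress
        if t.dest in closed_accounts:
            # Cannot complete: reverse the debit.
            balances[t.source] += t.amount_cents
            t.state = "REVERSED"
        else:
            # Complete the missing credit.
            balances[t.dest] += t.amount_cents
            t.state = "COMPLETED"
        t.updated_at = now
        recovered.append(t.id)
    return recovered


balances = {"A": 35_000, "B": 20_000, "C": 0}
transfers = [
    TransferRecord("t1", "A", "B", 10_000, "DEBITED", updated_at=0.0),    # stuck
    TransferRecord("t2", "A", "C", 5_000, "DEBITED", updated_at=0.0),     # stuck, dest closed
    TransferRecord("t3", "A", "B", 1_000, "DEBITED", updated_at=990.0),   # still fresh
]
touched = recover_once(transfers, balances, closed_accounts={"C"}, now=1000.0)
```

Running the pass on a schedule, and alerting on anything it touches, is one way to make sure no transfer stays non-terminal unnoticed.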
Evolution and Scaling
What to say
This design handles millions of daily transfers. Let me discuss how it evolves for external transfers, global deployment, and 10x scale.
Evolution Path:
Stage 1: Internal Transfers (Current Design)
- Same-bank transfers only
- Single region
- PostgreSQL for strong consistency
Stage 2: External Transfers
- Integration with SWIFT, ACH, FedWire
- Async processing (external systems are slow)
- Reconciliation system for settlement
Stage 3: Multi-Currency
- FX rate service integration
- Currency conversion ledger entries
- Settlement in multiple currencies
Stage 4: Global Deployment
- Regional databases for latency
- Cross-region transfers via message queue
- Global reconciliation
External Transfer Integration
Scaling considerations:
1. Database Sharding: Shard by account_id. Cross-shard transfers use the saga pattern.
2. Read Replicas: Balance checks can hit replicas (with careful handling of replication lag).
3. Hot Accounts: High-volume merchant accounts need special handling - dedicated shards or pre-computed running balances.
4. Event Sourcing: At very high scale, consider event sourcing where the ledger is the source of truth and balances are derived.
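To illustrate the event-sourcing option, balances can be derived by replaying the ledger. This is a minimal sketch: the `(account_id, amount_cents)` entry shape and the `funding` account used for the opening deposit are assumptions introduced here, and a production system would snapshot balances rather than replay the full history on every read.

```python
from collections import defaultdict


def derive_balances(ledger_entries: list[tuple[str, int]]) -> dict[str, int]:
    """Replay signed ledger entries (negative = debit) into per-account balances."""
    balances: defaultdict[str, int] = defaultdict(int)
    for account_id, amount_cents in ledger_entries:
        balances[account_id] += amount_cents
    return dict(balances)


# Hypothetical ledger: an opening deposit followed by one transfer.
ledger = [
    ("funding", -50_000), ("A", 50_000),  # deposit into A
    ("A", -10_000), ("B", 10_000),        # transfer A -> B
]
derived = derive_balances(ledger)
```

Because every entry has a matching counter-entry, the derived balances always sum to zero, which gives a cheap consistency check over the whole ledger.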
Alternative approach
If I needed even stronger consistency guarantees (like for inter-bank settlement), I would consider using a database with built-in distributed transactions like CockroachDB or Spanner, accepting the latency tradeoff.