System Design Masterclass
Category: Payments | Tags: payments, distributed-transactions, pci-dss, idempotency, exactly-once, advanced

Design Payment Processing System

Design a payment system like Stripe or PayPal

10M+ transactions/day, 99.99% availability, sub-second authorization
Similar to Stripe, PayPal, Square, Adyen, Braintree
45 min read

Summary

A payment system lets online stores charge your credit card when you buy something. The tricky parts are: making sure you never get charged twice for the same thing (even if something crashes), keeping your card number super safe, and making sure the store gets their money. When you click "Pay Now", your payment goes through 4 different companies (the store, a payment company like Stripe, the card network like Visa, and your bank) - and any of them could fail. Companies like Stripe, PayPal, Square, and Adyen ask this question in interviews.

Key Takeaways

Core Problem

When you pay online, your money goes through 4 different companies. Any of them could crash or lose connection. We need to make sure money never gets lost or charged twice.

The Hard Part

What if we charge your card, but then our computer crashes before saving that we charged you? When our computer restarts, it does not remember charging you. If it tries again, you get charged twice!

Scaling Axis

Each payment is separate - paying for pizza does not affect someone else paying for shoes. This means we can easily add more computers to handle more payments.

Critical Invariant

NEVER charge a customer twice for the same thing. NEVER lose a payment that worked. Money going in must equal money going out (every penny must be accounted for).

Performance Requirement

The payment must finish in less than 2 seconds. Nobody likes waiting at checkout! The system must work 99.99% of the time - when payments break, stores lose money.
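To make "99.99% of the time" concrete, here is the arithmetic as a small Python helper. It assumes a 365-day year. At 99.99%, the system can be down for only about 52 minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignores leap years)


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a system may be down and still meet `availability` (a fraction)."""
    return MINUTES_PER_YEAR * (1 - availability)


for target in (0.999, 0.9999):
    print(f"{target:.2%} available -> {downtime_budget_minutes(target):.1f} minutes down per year")
```

Each extra "9" cuts the allowed downtime by a factor of ten. This is why failover (see the failure table below) has to be automatic rather than manual.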

Key Tradeoff

We choose safety over speed. It is better to say "payment failed, try again" than to accidentally charge someone twice. We can always retry a failed payment, but fixing a double charge is a big mess.

Design Walkthrough

Problem Statement

The Question: Design a payment system like Stripe that can handle millions of credit card payments every day.

What the system needs to do (most important first):

  1. Accept payments - When someone clicks Pay Now, charge their credit card and give the money to the store.
  2. Keep card numbers safe - Credit card numbers are secret. If hackers steal them, people lose money. We need super strong security.
  3. Never charge twice - If something crashes, make sure we do not accidentally charge the same card twice for the same purchase.
  4. Handle refunds - Sometimes people want their money back. We need to return the money to their card.
  5. Catch fraud - Stop bad guys from using stolen credit cards.
  6. Work with different cards - Support Visa, Mastercard, American Express, and others.

What to say first

Let me first understand what we are building. Should I focus on the whole payment flow (charging, refunds, sending money to stores) or just the part where we charge the card? Once I know the features, I will ask about how many payments we need to handle.

What the interviewer really wants to see:
  • Do you know the payment journey? (Your card is checked by 4 different companies before the payment is approved.)
  • Can you make sure the same payment is never processed twice?
  • Do you understand why card numbers need special protection? (An industry security standard called PCI-DSS, which the card networks require, sets the rules.)
  • What happens when something crashes in the middle of a payment?

How a payment travels (4 companies involved)

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many payments per day do we need to handle? What about during big sales like Black Friday when everyone shops at once?

Why ask this: Handling 100 payments per second needs a different design than handling 10,000 payments per second.

What interviewers usually say: 10 million payments per day, which averages about 115 payments per second. Normal busy hours: 500 payments per second. Black Friday: 5,000 payments per second.

How this changes your design: We need many computers working together. But each payment is independent, so this is actually easier than it sounds.

Question 2: How fast should it be?

How quickly should the payment finish? Is this for online shopping or in-person at a store?

Why ask this: When you tap your card at a coffee shop, you expect it to beep in 1 second. Online, people will wait 2-3 seconds.

What interviewers usually say: Less than 2 seconds. Card networks like Visa and Mastercard also enforce strict time limits on authorization responses, so there is little room to spare.

How this changes your design: The main payment must be fast. We can do slower things (like updating reports) later in the background.

Question 3: What kinds of payments?

Just one-time payments? Or do we also need subscriptions (like Netflix charging you every month automatically)?

Why ask this: Subscriptions are more complicated. We need to remember cards and charge them again later.

What interviewers usually say: Start with one-time payments. Mention how subscriptions would work, but do not spend too much time on it.

How this changes your design: For subscriptions, we need to save card information (safely!) and have a system that wakes up and charges people on schedule.

Question 4: Which countries and currencies?

Are we handling just US dollars? Or do we need to support euros, yen, and other currencies?

Why ask this: Different countries have different rules and different payment methods.

What interviewers usually say: Start with US dollars and the main cards (Visa, Mastercard). Mention that we could add other currencies later.

How this changes your design: For multiple currencies, we need to convert money and work with banks in different countries.

Summarize your assumptions

Let me summarize what I will design for: 10 million payments per day, payments must finish in 2 seconds, US dollars only, Visa and Mastercard, full payment flow (charge, capture, refund), and we must follow the card industry security standard (PCI-DSS).

The Hard Part

Say this to the interviewer

The hardest part of payments is making sure we never charge someone twice, even when computers crash. Think about it: what if our system crashes AFTER charging the card but BEFORE saving that we charged it? When it restarts, it does not remember. If it tries again, the person pays twice!

Why this is really tricky (explained simply):

  1. Four companies are involved - Your payment goes through the store, our system, Visa/Mastercard, and your bank. Any of them can crash or lose its internet connection.
  2. The double-charge nightmare - We send the charge to Visa, our computer crashes, Visa approves it, but we never saved that. When we restart and try again... double charge!
  3. The lost payment nightmare - Visa approved the payment, but we crashed before telling the store. The store thinks the payment failed, but the customer was charged.
  4. Half-finished payments - What if the charge worked but the refund failed? Or we sent the money but did not record it? These mismatches are hard to fix.
Time    Our System              Visa                 Result
----    ----------              ----                 ------
1       Send charge request --> Received             
+ 15 more lines...

Common mistake candidates make

Many people only design the happy path (when everything works). Interviewers will ask: What if the database fails after charging the card? You MUST have an answer. The answer is: use idempotency keys (unique IDs) and a state machine.

How we solve these problems:

Solution 1: Idempotency Keys (unique IDs)
  • Every payment request has a unique ID.
  • If we see the same ID twice, we return the saved result instead of charging again.
  • Example: the store sends ID pay_abc123 twice → we charge only once and return the same result both times.

Solution 2: State Machine (tracking where we are)
  • Payments move through steps: CREATED → PROCESSING → APPROVED → CAPTURED.
  • Each step change is saved to the database.
  • If we crash, we can see exactly where we stopped and continue from there.

Solution 3: Write Before You Act
  • Before calling Visa, write down: "I am about to call Visa for $50".
  • If we crash after Visa says yes, we know to check with Visa when we restart.
  • We never lose track of what we were doing.
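Here is a minimal Python sketch of "write before you act". An in-memory SQLite table stands in for PostgreSQL, and `bank_charge` is a hypothetical callable standing in for the call to Visa. What matters is the order of operations. The PROCESSING row is committed before the bank is called. After a crash, we can therefore find every payment whose outcome we never recorded and ask the bank about it, instead of charging again.

```python
import sqlite3
from typing import Callable, List

# In-memory SQLite stands in for the real payments database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER, status TEXT)")


def charge(payment_id: str, amount_cents: int, bank_charge: Callable[[str, int], bool]) -> None:
    # 1. Write down what we are about to do, and commit it, BEFORE calling the bank.
    with db:
        db.execute("INSERT INTO payments VALUES (?, ?, 'PROCESSING')", (payment_id, amount_cents))
    # 2. Call the bank (hypothetical callable). A crash here leaves the row in PROCESSING.
    approved = bank_charge(payment_id, amount_cents)
    # 3. Record the outcome.
    with db:
        db.execute(
            "UPDATE payments SET status = ? WHERE id = ?",
            ("APPROVED" if approved else "FAILED", payment_id),
        )


def payments_to_check_on_restart() -> List[str]:
    """Payments with an unknown outcome: ask the bank about these, never blindly re-charge."""
    rows = db.execute("SELECT id FROM payments WHERE status = 'PROCESSING' ORDER BY id")
    return [row[0] for row in rows]


charge("pay_1", 5000, lambda pid, amount: True)
try:
    charge("pay_2", 5000, lambda pid, amount: 1 / 0)  # simulate dying mid-call
except ZeroDivisionError:
    pass
print(payments_to_check_on_restart())  # ['pay_2']
```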

Solution 4: Daily Check (Reconciliation)
  • Every day, compare our records with the bank records.
  • If anything is different, investigate and fix it.
  • This catches any mistakes that slipped through.

How idempotency keys prevent double-charges

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Daily payments | 10 million | About 115 payments per second on average - very manageable
Peak times (Black Friday) | 5,000 per second | Need extra computers ready for busy times

What to tell the interviewer

Each payment is independent - one person paying for shoes does not affect another person paying for pizza. This means we can easily add more computers to handle more payments. The challenge is not handling many payments, it is making sure each payment is correct.

How people use the payment system (from most common to least common):

  1. Process a payment - Someone clicks Pay Now. This is the #1 thing we do, and it must be fast.
  2. Check payment status - Did my payment go through? Stores and customers ask this a lot.
  3. Process a refund - Give money back. Less common but important.
  4. Look up old payments - Show payment history. Can be a bit slower since it is not urgent.
How much space does one payment need?
- Payment info (amount, time, status): about 500 bytes
- Card token (secret reference to saved card): about 50 bytes
+ 13 more lines...
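Here are the same back-of-the-envelope estimates as a few lines of Python. The storage figure covers only the two fields listed above (about 550 bytes per payment). The remaining fields would add to it.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(payments_per_day: int) -> float:
    return payments_per_day / SECONDS_PER_DAY


def storage_bytes(payments_per_day: int, bytes_per_payment: int, days: int = 1) -> int:
    return payments_per_day * bytes_per_payment * days


PAYMENTS_PER_DAY = 10_000_000
BYTES_PER_PAYMENT = 500 + 50  # payment info + card token, from the list above

print(f"average load: {average_per_second(PAYMENTS_PER_DAY):.0f} payments/second")
print(f"per day:  {storage_bytes(PAYMENTS_PER_DAY, BYTES_PER_PAYMENT) / 1e9:.1f} GB")
print(f"per year: {storage_bytes(PAYMENTS_PER_DAY, BYTES_PER_PAYMENT, 365) / 1e12:.1f} TB")
```

The average is 115.7 payments per second. The Black Friday peak of 5,000 per second is roughly 43 times that, which is why we plan capacity for the peak rather than the average.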

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will split the system into three security zones: (1) the public zone that talks to stores, (2) the main payment zone that processes payments, and (3) a super-secure zone just for card numbers. Keeping card numbers separate makes security easier.

Payment System - The Big Picture

What each part does and WHY it is separate:

Part | What it does | Why it is separate (what to tell interviewer)
API Server | Receives payment requests from stores. Checks if the store is real. Checks if the request makes sense. | This is the front door. It handles security and validation so the payment service can focus on processing.
Payment Service | The brain. Processes payments, manages the state machine, talks to banks. | This is where the magic happens. Keeping it focused on payments makes it easier to get right.

Common interview question: Why not just one big service?

Interviewers often ask: "Can't you just put everything in one service?" Your answer: Yes, for a small system that works fine. We split it because: (1) Card numbers need extra security, and separating them limits what hackers can steal. (2) Different parts need different speeds: fraud checking is slow, while payment authorization must be fast. (3) Different parts fail differently: if fraud checking breaks, we can still process payments.

Technology Choices - Why we picked these tools:

Database: PostgreSQL (Recommended)
  • Why: Great at keeping data safe and correct. Supports transactions (all-or-nothing operations). Most engineers know SQL.
  • Other options:
      - MySQL: Also good - pick what your team knows
      - MongoDB: Not ideal because payment data has lots of relationships

Fast Cache: Redis (Recommended)
  • Why: Super fast, can store idempotency keys, and widely used, so plenty of help is available.
  • We use it for: Remembering which payment IDs we have seen (to prevent double charges).

Message Queue: Kafka or RabbitMQ
  • Why: Some tasks can wait (like sending receipts). We put them in a queue and do them later.
  • Kafka: Better for big volume
  • RabbitMQ: Easier to set up

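Encryption: HSM (Hardware Security Module)
  • Why: Special tamper-resistant hardware that holds the encryption keys. Card numbers are encrypted with keys that never leave the device in readable form, so breaking into our servers does not expose the keys.
  • PCI-DSS requires stored card numbers to be unreadable (for example, strongly encrypted) and the encryption keys to be tightly protected. An HSM is a common way to meet those key-protection rules.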

Important interview tip

Pick technologies YOU know! If you have used MySQL at your job, use MySQL in your design. Interviewers care more about your reasoning than the specific tool. Say something like: I will use PostgreSQL because I have experience with it, but MySQL would work just as well.

Data Model and Storage

Now let me show how we organize the data in the database. Think of tables like spreadsheets - each one stores a different type of information.

What to tell the interviewer

I will use PostgreSQL as the main database because payments need strong guarantees - if we say the payment worked, it really worked. I will use Redis for the idempotency cache because we need to check every single payment very fast.

Table 1: Payments - The main table that stores every payment

When someone pays, we create a row here. We track where the payment is in the process (created, processing, approved, etc.).

Column | What it stores | Example
id | Unique ID for this payment | pay_abc123
idempotency_key | The ID the store sent (to prevent double charges) | order_456_charge

Database Index

We add an INDEX on (merchant_id, created_at). This makes finding a store's recent payments FAST - like a book index that helps you find pages quickly.
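To see the index at work, here is a small Python check using SQLite. Its CREATE INDEX syntax is the same as PostgreSQL's for this case. The column list is a trimmed-down assumption: only merchant_id and created_at come from the text. EXPLAIN QUERY PLAN reports whether the database searches the index or scans the whole table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE payments (
           id TEXT PRIMARY KEY,
           merchant_id TEXT NOT NULL,
           amount_cents INTEGER NOT NULL,
           status TEXT NOT NULL,
           created_at TEXT NOT NULL
       )"""
)
db.execute("CREATE INDEX idx_payments_merchant_created ON payments (merchant_id, created_at)")

# "Show this store's 20 most recent payments"
RECENT_PAYMENTS = """
    SELECT id, amount_cents, status FROM payments
    WHERE merchant_id = ?
    ORDER BY created_at DESC
    LIMIT 20
"""
plan = db.execute("EXPLAIN QUERY PLAN " + RECENT_PAYMENTS, ("store_xyz",)).fetchall()
for row in plan:
    print(row[-1])  # the plan's detail column; it should name idx_payments_merchant_created
```

The index is sorted by created_at within each merchant. As a result, the database can read a store's newest payments directly, with no separate sort step.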

Table 2: Payment Events - History of everything that happened to a payment

Every time a payment changes status, we save a record. This helps us understand what happened if something goes wrong.

Column | What it stores | Example
id | Unique ID for this event | evt_111
payment_id | Which payment this belongs to | pay_abc123

Table 3: Card Tokens - Saved cards (stored in the super secure zone)

When customers save their card for later, we store it here. The actual card number is encrypted and only the Card Safe can read it.

Column | What it stores | Example
id | Unique ID for this saved card | card_tok_789
merchant_id | Which store this card is saved for | store_xyz

Important: Card numbers are encrypted

The encrypted_pan column holds the card number (PAN, the primary account number), encrypted with a key that lives inside the HSM (encryption hardware). Even if someone steals our database, they cannot read the card numbers without access to the HSM. The HSM is tamper-resistant and never exports its keys in readable form.

How we use the Idempotency Cache (Redis)

Every payment request must have a unique idempotency key. Before processing any payment, we check if we have seen this key before.

FUNCTION check_idempotency_key(key, merchant_id):
    
    // Make a unique lookup key
+ 46 more lines...
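The pseudocode above is cut short, so here is a self-contained Python sketch of the same idea. A dictionary with expiry times stands in for Redis. In Redis itself, the claim step would be a single atomic `SET key value NX EX 86400`. The 24-hour lifetime comes from the rules table later in this section. The names `IdempotencyCache`, `claim`, `complete`, and `charge_once` are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

IN_PROGRESS = "IN_PROGRESS"  # marker stored while the first request is still running


class IdempotencyCache:
    """In-memory stand-in for the Redis idempotency cache (single-process illustration)."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # lookup key -> (expires_at, value)

    @staticmethod
    def _lookup_key(merchant_id: str, key: str) -> str:
        # Scope by merchant so two stores can use the same key without colliding.
        return f"idem:{merchant_id}:{key}"

    def claim(self, merchant_id: str, key: str) -> Tuple[bool, Optional[Any]]:
        """Like Redis SET ... NX EX: (True, None) if we claimed the key, else (False, stored value)."""
        k = self._lookup_key(merchant_id, key)
        entry = self._entries.get(k)
        if entry is not None and self._clock() < entry[0]:
            return False, entry[1]
        self._entries[k] = (self._clock() + self._ttl, IN_PROGRESS)
        return True, None

    def complete(self, merchant_id: str, key: str, result: Any) -> None:
        self._entries[self._lookup_key(merchant_id, key)] = (self._clock() + self._ttl, result)


def charge_once(cache: IdempotencyCache, merchant_id: str, key: str,
                do_charge: Callable[[], Any]) -> Any:
    claimed, stored = cache.claim(merchant_id, key)
    if not claimed:
        # Seen before: return the saved result (or IN_PROGRESS, meaning "retry later").
        return stored
    result = do_charge()  # if this raises, the key stays IN_PROGRESS until it expires
    cache.complete(merchant_id, key, result)
    return result


charges = []


def fake_charge():
    charges.append("charged")
    return {"payment_id": "pay_abc123", "status": "APPROVED"}


cache = IdempotencyCache()
first = charge_once(cache, "store_xyz", "order_456_charge", fake_charge)
second = charge_once(cache, "store_xyz", "order_456_charge", fake_charge)
print(first == second, len(charges))  # True 1
```

The IN_PROGRESS marker matters when two copies of the same request arrive at once. Only the first one claims the key. The second one is told to retry later instead of charging in parallel.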

Payment State Machine

A payment goes through several steps. We track these steps carefully so we always know where a payment is, even if our system crashes.

What to tell the interviewer

I use a state machine to track payments. A state machine is like a flowchart - the payment can only move along certain paths. This prevents impossible situations like a payment being refunded before it was charged.

Payment States - Where can a payment be?

What each status means (in simple terms):

  • Created: Payment request received, but we have not talked to the bank yet
  • Processing: We asked the bank to approve the payment and are waiting for an answer
  • Approved: The bank said the card is good and reserved the money. But we have not taken it yet!
  • Captured: We actually took the money from the customer's card
  • Cancelled: The store decided not to take the money (before capturing)
  • Failed: Something went wrong - card declined, fraud detected, etc.
  • Refunded: We gave the money back to the customer
  • Settled: Money was sent to the store's bank account. All done!

Why separate Approved and Captured?

Imagine you book a hotel. They check if your card is good (Approved) but do not charge until you check out (Captured). This is called "authorize and capture". If you cancel, they just release the approval - no money ever moved!

VALID MOVES:
    Created     --> Processing    (we started working on it)
    Processing  --> Approved      (bank said yes)
+ 43 more lines...
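Here is a Python sketch of the same state machine. Only the first two moves are visible in the listing above. The rest of the transition table is an assumption based on the status descriptions, for example that a refund is allowed after capture or after settlement.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    APPROVED = "approved"
    CAPTURED = "captured"
    CANCELLED = "cancelled"
    FAILED = "failed"
    REFUNDED = "refunded"
    SETTLED = "settled"


# Allowed moves (assumed beyond the first two); anything not listed is rejected.
VALID_MOVES = {
    Status.CREATED: {Status.PROCESSING, Status.FAILED},
    Status.PROCESSING: {Status.APPROVED, Status.FAILED},
    Status.APPROVED: {Status.CAPTURED, Status.CANCELLED},
    Status.CAPTURED: {Status.SETTLED, Status.REFUNDED},
    Status.SETTLED: {Status.REFUNDED},
    Status.CANCELLED: set(),  # terminal
    Status.FAILED: set(),     # terminal
    Status.REFUNDED: set(),   # terminal (partial refunds ignored in this sketch)
}


class InvalidTransition(Exception):
    pass


def move(current: Status, target: Status) -> Status:
    if target not in VALID_MOVES[current]:
        raise InvalidTransition(f"cannot go from {current.name} to {target.name}")
    return target


status = Status.CREATED
for step in (Status.PROCESSING, Status.APPROVED, Status.CAPTURED):
    status = move(status, step)
print(status.name)  # CAPTURED
```

Refunding a payment that was never charged (CREATED → REFUNDED) raises InvalidTransition instead of silently corrupting the record.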

Critical: Why we use FOR UPDATE

The FOR UPDATE in the query is super important. Without it, two requests could both read status = Approved at the same time, and both try to capture the payment. With FOR UPDATE, the database makes them wait in line - only one can change the payment at a time.
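SQLite has no SELECT ... FOR UPDATE, so this Python sketch swaps in a different technique with the same goal: a conditional UPDATE (compare-and-set). The update succeeds only if the row is still in the expected state. Of two competing captures, exactly one sees a row count of 1. The table and the `try_capture` name are hypothetical.

```python
import sqlite3

# SQLite has no SELECT ... FOR UPDATE, so this sketch uses a conditional UPDATE instead.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
with db:
    db.execute("INSERT INTO payments VALUES ('pay_abc123', 'APPROVED')")


def try_capture(payment_id: str) -> bool:
    """Move APPROVED -> CAPTURED in one statement; False means someone else already moved it."""
    with db:
        cur = db.execute(
            "UPDATE payments SET status = 'CAPTURED' WHERE id = ? AND status = 'APPROVED'",
            (payment_id,),
        )
    return cur.rowcount == 1


print(try_capture("pay_abc123"))  # True: the first capture wins
print(try_capture("pay_abc123"))  # False: the payment is no longer APPROVED
```

In PostgreSQL, this statement is also safe under concurrency at the default READ COMMITTED isolation level. The second UPDATE waits for the first one's row lock, then re-checks the WHERE clause against the updated row, and updates nothing.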

Processing a Payment Step by Step

Let me walk through exactly what happens when someone clicks Pay Now, including what we do when things go wrong.

Payment Flow (everything that happens)

FUNCTION process_payment(request):
    // request contains: idempotency_key, merchant_id, amount, card_token
    
+ 101 more lines...

Critical: Timeout Handling

When the bank does not respond in time, we MUST NOT assume the payment failed. The charge might have gone through! If we assume failure and let the store retry, we could double-charge. Instead, we mark the payment for manual review and tell the store to retry with the same idempotency key. Because the key is the same, the retry returns the existing payment instead of creating a second charge.
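Here is a minimal Python sketch of that rule. `bank_call` is a hypothetical client that raises TimeoutError when the bank is too slow. The 1.5-second timeout is an assumed value that leaves headroom inside the 2-second budget. The key point is that a timeout maps to a third outcome, UNKNOWN, never to DECLINED.

```python
from enum import Enum
from typing import Callable


class AuthResult(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    UNKNOWN = "unknown"  # no answer in time: the charge MAY have gone through


def authorize(bank_call: Callable[..., bool], payment_id: str, amount_cents: int,
              timeout_s: float = 1.5) -> AuthResult:
    """`bank_call` is a hypothetical client that raises TimeoutError when the bank is too slow."""
    try:
        approved = bank_call(payment_id, amount_cents, timeout=timeout_s)
    except TimeoutError:
        # Never treat a timeout as a decline. Mark the payment UNKNOWN so it is
        # resolved later (manual review or a status check), and let the store
        # retry with the SAME idempotency key.
        return AuthResult.UNKNOWN
    return AuthResult.APPROVED if approved else AuthResult.DECLINED


def slow_bank(payment_id: str, amount_cents: int, timeout: float) -> bool:
    raise TimeoutError(f"no response within {timeout}s")


print(authorize(slow_bank, "pay_abc123", 5000).name)  # UNKNOWN
```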

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

Common failures and how we handle them:

What breaks | What happens to users | How we fix it | Why this works
Database goes down | Cannot process payments | Keep backup databases ready + automatic switch | Backup takes over in less than 30 seconds
Redis (idempotency cache) goes down | Cannot check for double charges | Fail closed - reject all payments until fixed | Better to reject than risk double-charging

Important Decision: Fail Open vs Fail Closed

When something breaks, we have two choices:
  • Fail Open: Let the payment through anyway and hope for the best
  • Fail Closed: Reject the payment and tell them to try again

For payments, we almost always Fail Closed because:
  • A rejected payment can be tried again
  • A double charge requires a refund and an apology, and it damages customer trust
  • Stores prefer "declined" over "fraudulent charge"

// We connect to multiple bank networks
// If one is down, we try the next
+ 44 more lines...

What is a Circuit Breaker?

A circuit breaker is like the one in your house. If too much electricity flows, it flips off to prevent a fire. Our circuit breaker does the same for failing services - if a bank fails too many times, we stop trying for a while. This prevents one broken bank from slowing down our whole system.
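Here is a minimal Python circuit breaker. The thresholds (5 failures in a row, a 30-second cooldown) are assumed values, and the class is a hypothetical helper rather than a specific library. While the breaker is open, the payment service would route traffic to the backup bank connection. After the cooldown, it lets requests through again to test whether the bank has recovered.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing bank for `cooldown_s` seconds after `max_failures` failures in a row."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None = closed (calls allowed)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: block calls until the cooldown passes, then let traffic through
        # again to test whether the bank has recovered ("half-open").
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


now = [0.0]  # fake clock so the example does not have to sleep
breaker = CircuitBreaker(max_failures=3, cooldown_s=30.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # False: route to the backup bank instead
now[0] = 31.0
print(breaker.allow_request())  # True: cooldown over, try this bank again
```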

Making Sure Our Records Match the Bank

System Rules That Must Never Break

1. NEVER charge a customer twice for the same thing
2. NEVER lose a payment that went through
3. Every penny must be accounted for (money in = money out)
4. Card numbers must always be encrypted

Even with all our safety measures, mistakes can happen. What if a cosmic ray flips a bit in our database? What if a bank sends us wrong information? We need a safety net that catches ANY mistake.

The safety net is called Reconciliation - we compare our records with the bank records every day.

// This runs every day at 6 AM, after banks send us their daily report

FUNCTION daily_reconciliation():
+ 73 more lines...
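The daily_reconciliation pseudocode above is cut off, so here is a small Python sketch of its core comparison. It assumes both sides have already been reduced to a mapping from payment_id to amount in cents. Parsing the bank's actual report format is not shown. The `reconcile` function and the sample data are hypothetical. Each of the three checks corresponds to a real failure: a charge we think happened but the bank never saw, a charge the bank made that we lost track of, and a wrong amount.

```python
from typing import Dict, List


def reconcile(ours: Dict[str, int], bank: Dict[str, int]) -> List[str]:
    """Compare our captured payments with the bank's daily report.

    Both inputs map payment_id -> amount in cents. Returns one line per mismatch.
    """
    problems = []
    for pid in sorted(ours.keys() | bank.keys()):
        if pid not in bank:
            problems.append(f"{pid}: in our records, missing from bank report")
        elif pid not in ours:
            problems.append(f"{pid}: bank charged it, but we have no record")
        elif ours[pid] != bank[pid]:
            problems.append(f"{pid}: amount mismatch (ours {ours[pid]}, bank {bank[pid]})")
    return problems


ours = {"pay_1": 5000, "pay_2": 1999, "pay_3": 700}
bank = {"pay_1": 5000, "pay_2": 2099, "pay_4": 1250}
for problem in reconcile(ours, bank):
    print(problem)
```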

Why reconciliation matters

Reconciliation is our last line of defense. Even if our system has bugs, even if the bank makes mistakes, reconciliation catches them. Companies that handle money are expected to reconcile daily: auditors and regulators rely on it to confirm that the books are correct.

Rule | How we enforce it | How we verify it
Never charge twice | Idempotency keys with 24-hour memory | Check every request against cache before processing
Never lose payments | Write-ahead logging + state machine | Daily reconciliation catches any missing records
Money must balance | Double-entry accounting (credits = debits) | End-of-day totals must sum to zero
Cards always encrypted | HSM hardware + separate Card Safe zone | PCI-DSS audit + security testing
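The "Money must balance" rule can be sketched as a tiny double-entry ledger in Python. Every transaction is a set of entries that must sum to zero before it is accepted, so the totals across all accounts are always zero. The account names, the $1.75 fee, and the sign convention (debits positive, credits negative) are assumptions for illustration. Amounts are integer cents, which avoids floating-point rounding errors.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (account name, amount in cents); debits > 0, credits < 0


def post(ledger: List[Entry], entries: List[Entry]) -> None:
    """Append one transaction, refusing it unless its debits and credits cancel out."""
    if sum(amount for _, amount in entries) != 0:
        raise ValueError("unbalanced transaction: debits must equal credits")
    ledger.extend(entries)


def balances(ledger: List[Entry]) -> Dict[str, int]:
    totals: Dict[str, int] = defaultdict(int)
    for account, amount in ledger:
        totals[account] += amount
    return dict(totals)


ledger: List[Entry] = []
# A $50.00 capture with a hypothetical $1.75 fee: the card side owes us $50.00,
# we owe the store $48.25, and we keep $1.75.
post(ledger, [("card_network_receivable", 5000),
              ("merchant_payable", -4825),
              ("fee_revenue", -175)])
print(balances(ledger))
print(sum(balances(ledger).values()))  # 0: the books balance
```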

Growing the System Over Time

What to tell the interviewer

This design handles 10 million payments per day easily. Let me explain how it grows if we need to handle 100 times more, or add features like subscriptions and international payments.

How we grow step by step:

Stage 1: Starting out (up to 10 million payments per day)
  • One PostgreSQL database with backup copies
  • One Redis cluster for the idempotency cache
  • Primary + backup bank connections
  • All servers in one location (like US-East)
  • This handles most businesses easily

Stage 2: Medium scale (up to 100 million payments per day)
  • Add read-only database copies for reports and dashboards
  • Split payment processing across multiple server groups
  • Each group handles certain merchants
  • Still one main database for writes (strong consistency)
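One simple way to decide which server group handles which merchant is to hash the merchant ID. Here is a Python sketch. The group count of 8 and the `group_for_merchant` name are hypothetical. The mapping is stable, so every request from a given store lands on the same group.

```python
import hashlib

NUM_GROUPS = 8  # hypothetical number of payment-processing server groups


def group_for_merchant(merchant_id: str, num_groups: int = NUM_GROUPS) -> int:
    """Map a merchant to a server group; the same merchant always gets the same group."""
    # hashlib (not Python's built-in hash(), which is randomized per process for
    # strings) so every server computes the same answer.
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


for merchant in ("store_xyz", "store_abc", "store_123"):
    print(merchant, "-> group", group_for_merchant(merchant))
```

With plain modulo hashing, changing the number of groups moves most merchants to a new group. Consistent hashing, or an explicit merchant-to-group lookup table, keeps those moves small when groups are added.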

Stage 3: Big scale (up to 1 billion payments per day)
  • Multiple data centers in different countries
  • Each region handles local payments
  • Cross-region payments get routed to the right place
  • Large global processors such as Stripe and PayPal run multi-region setups along these lines

Multi-Region Architecture (Stage 3)

Cool features we can add later:

Feature | What it is | How hard to add
Subscriptions | Charge customers automatically every month (like Netflix) | Medium - need scheduler and saved cards
Multiple currencies | Accept euros, yen, pounds, not just dollars | Medium - need currency conversion rates

What if we needed super fast payments?

For things like stock trading where milliseconds matter, I would use in-memory processing instead of a database. The payment would be recorded in memory first (super fast), then written to disk later (safer but slower). This trades some safety for speed - good for trading, not good for regular purchases.

Key Takeaways for the Interview:

  1. Safety over speed: In payments, being correct is more important than being fast. Design to never charge twice.
  2. Idempotency keys are essential: Every payment must have a unique ID. If we see the same ID twice, return the saved result.
  3. State machines prevent bugs: Payments move through defined steps (created → processing → approved → captured). This prevents impossible states.
  4. Reconciliation is mandatory: Compare your records with the bank every day. This catches any mistakes that slipped through.
  5. Security rules shape the design: PCI-DSS demands strict protection of card data, and isolating card numbers in a separate secure zone is the usual way to meet it (it also shrinks the part of the system that must be audited). The card networks require PCI-DSS compliance, so this is not optional.

Design Trade-offs

Advantages

  • Simple to understand
  • Know right away if it worked
  • Easier to debug

Disadvantages

  • Slow if bank is slow
  • Harder to handle many payments
  • If one part breaks, everything waits
When to use

Use for the main payment - customer needs to know right away if their card worked.