Design Walkthrough
Problem Statement
The Question: Design a system that checks every payment transaction in real-time and decides if it is fraud or not.
What the system needs to do (most important first):
1. Score every transaction - When a payment comes in, give it a risk score from 0 (safe) to 100 (definitely fraud) in under 100 milliseconds.
2. Block or allow automatically - Based on the score, either approve the payment, block it, or send it for human review.
3. Learn from history - Use patterns from past transactions to spot suspicious behavior. If a card normally buys coffee in New York but suddenly buys electronics in another country, that is suspicious.
4. Update in real-time - Track things like how many purchases a card made in the last hour so we can spot unusual spikes.
5. Handle rules - Let fraud analysts add rules like "block all transactions over $5000 from new accounts" or "flag purchases from certain countries".
6. Minimize false positives - Do not block good customers. Every blocked legitimate purchase loses money and annoys customers.
7. Adapt to new fraud - Criminals keep inventing new tricks. The system must learn and adapt without engineers rewriting everything.
What to say first
Let me understand the requirements first. What types of transactions are we checking - credit cards, bank transfers, or both? What is our acceptable false positive rate? Do we need to support human review for borderline cases? And what is our latency budget - how fast must we decide?
What the interviewer really wants to see:
- Can you score a high-volume stream of transactions (tens of thousands per second) in real time?
- Do you understand the tradeoff between catching fraud and blocking good customers?
- Can you compute features (signals) like transaction velocity in real time?
- How do you combine rules (written by humans) with machine learning (learned from data)?
Clarifying Questions
Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.
Question 1: How big is this?
How many transactions per second do we need to handle? What is the peak load during sales events like Black Friday?
Why ask this: A system handling 100 transactions per second is very different from one handling 100,000.
What interviewers usually say: 10,000 transactions per second normally, up to 100,000 during peak times like Black Friday or big sales.
How this changes your design: At this scale, we need distributed processing, efficient caching, and pre-computed features. A single server cannot handle this load.
Question 2: How fast must we decide?
What is the maximum time we have to score a transaction? Is 100ms okay, or do we need it faster?
Why ask this: If we have more time, we can do more checks. If we need to be super fast, we must pre-compute everything.
What interviewers usually say: P99 latency under 100 milliseconds. This means 99% of requests must finish in 0.1 seconds.
How this changes your design: We cannot do complex database queries for every transaction. We need to pre-compute and cache everything we might need.
Question 3: What false positive rate is acceptable?
How many good transactions can we accidentally block? Is 1% okay, or do we need to be more accurate?
Why ask this: Every blocked good transaction loses money and annoys customers. But being too loose lets fraud through.
What interviewers usually say: False positive rate should be under 1-2%. We would rather let some small fraud through than block many good customers.
How this changes your design: We need a tiered system - auto-approve most transactions, auto-block obvious fraud, and send borderline cases for human review.
Question 4: Do we need human review?
Can humans review suspicious transactions, or must everything be fully automated?
Why ask this: Human review is expensive but catches edge cases that machines miss.
What interviewers usually say: Yes, we have a team of fraud analysts who review borderline cases. About 2-5% of transactions go to human review.
How this changes your design: We need a queue system for human review, tools for analysts, and a way to feed their decisions back into the machine learning model.
Summarize your assumptions
Let me summarize: 10,000 transactions per second normally, 100,000 at peak. P99 latency under 100ms. False positive rate under 2%. We have human reviewers for borderline cases. The system should learn from new fraud patterns automatically.
The Hard Part
Say this to the interviewer
The hardest part of fraud detection is computing features in real-time. A feature is a signal like how many times has this card been used in the last hour? We need dozens of these signals for every transaction, and we need them in milliseconds.
Why real-time features are tricky (explained simply):
1. Volume is huge - With 10,000 transactions per second, we cannot query the database for each one. That would be 10,000 database queries per second just for one feature.
2. Features need history - To know how many purchases a card made today, we need to remember all of today's purchases for every card. That is a lot of data to keep in memory.
3. Time windows overlap - We might need last 1 hour, last 24 hours, and last 7 days all at once. Each window needs separate tracking.
4. Must be consistent - If a fraudster makes 10 purchases in 1 second, each one must see the correct count. We cannot have race conditions.
5. Features expire - Data from 8 days ago is not needed for a 7-day window. We must clean up old data to save memory.
Common mistake candidates make
Many people say: just query the database to count transactions. This is wrong because: (1) querying millions of rows for each transaction is too slow, (2) the database would crash under load, (3) it does not give real-time counts - there is always a delay.
Two ways to compute real-time features:
Option 1: Pre-aggregated counters (Recommended)
- Keep running counters in Redis: card_123_purchases_1h = 5
- When a new purchase comes in, increment the counter
- Reading the feature is instant - just read from Redis
- Use TTL (time to live) to auto-expire old data (see the sketch after these options)

Option 2: Stream processing with Kafka/Flink
- Every transaction goes to a Kafka stream
- Flink computes rolling aggregates (sum, count, average) in real-time
- Results are stored in a fast database for lookup
- Better for complex features like standard deviation of purchase amounts
How we compute features in real-time
Scale and Access Patterns
Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.
| What we are measuring | Number | What this means for our design |
|---|---|---|
| Normal transactions per second | 10,000 | Need distributed processing - one server cannot handle this |
| Peak transactions per second | 100,000 | System must auto-scale 10x during Black Friday |
What to tell the interviewer
At 10,000 TPS normally and 100,000 at peak, we need a distributed system. The key insight is that the 100ms latency requirement means we CANNOT do database queries for features during scoring. Everything must be pre-computed and cached. I will use Redis for real-time features, with Kafka plus a stream processor like Flink computing the aggregates.
Understanding the class imbalance problem
Only 0.1% of transactions are fraud. If our model just said everything is safe it would be 99.9% accurate! But it would catch zero fraud. This is why we use special metrics like precision (of the ones we flagged, how many were actually fraud?) and recall (of all the fraud, how many did we catch?).
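A tiny worked example of why plain accuracy is misleading here and what precision and recall capture. All numbers are made up for illustration:

```python
# Out of 1,000,000 transactions, 1,000 are fraud (0.1%).
# Suppose the model flags 1,500 transactions and 800 of those really are fraud.
flagged = 1_500
true_positives = 800
total_fraud = 1_000
total_txns = 1_000_000

precision = true_positives / flagged        # 0.53 - of what we flagged, how much was fraud
recall = true_positives / total_fraud       # 0.80 - of all fraud, how much we caught
always_safe_accuracy = (total_txns - total_fraud) / total_txns   # 0.999, yet catches nothing

print(precision, recall, always_safe_accuracy)
```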
How the system is used (from most common to least common):
1. Score a transaction - A payment comes in, we score it in real-time. This happens 10,000+ times per second. Must be super fast.
2. Update features - After each transaction, update the running counters (purchases today, total spent this week, etc). Happens for every transaction.
3. Human review - Analysts look at flagged transactions and decide fraud or not fraud. Maybe 200-500 reviews per hour.
4. Add new rules - Fraud analysts add rules like "block transactions over $10,000 from accounts less than 7 days old". Happens a few times per day.
5. Retrain ML model - Feed new fraud examples to improve the model. Happens daily or weekly.
How much data flows through?
- 10,000 transactions/second x 86,400 seconds/day = 864 million transactions/day
- Each transaction: about 1 KB of data, so roughly 864 GB of new data per day

High-Level Architecture
Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.
What to tell the interviewer
I will break this into three paths: the real-time scoring path (must be fast), the feature update path (can be slightly delayed), and the learning path (runs in background). The scoring path touches only fast systems like Redis. Slow things like database writes happen asynchronously.
Fraud Detection System - The Big Picture
What each service does and WHY it is separate:
| Service | What it does | Why it is separate (what to tell interviewer) |
|---|---|---|
| Feature Service | Looks up pre-computed features for a card/user from Redis. Returns things like: 5 purchases today, $500 spent this week, normally buys in US. | Why separate? This is the most called service. We want it to do ONE thing fast - read from Redis. No complex logic here. |
| Rules Engine | Checks hard-coded rules like: block if amount > $10,000 AND account age < 7 days. Rules are written by fraud analysts. | Why separate? Rules change often (daily). We do not want to redeploy the whole system for a new rule. Rules can be updated without touching ML models. |
| ML Scoring Service | Runs the machine learning model on the features to get a fraud probability (like 0.73 means 73% chance of fraud). | Why separate? ML models need special hardware (GPUs sometimes). We want to scale ML servers independently. We can A/B test different models easily. |
| Decision Service | Takes the rule results + ML score and makes final decision: approve, block, or send to human review. | Why separate? This is where business logic lives. We can tune thresholds (block if score > 0.8) without changing ML model. |
| Stream Processing (Flink) | Updates the running counters in Redis as new transactions flow in. Also writes to database for storage. | Why separate? Updates can be slightly delayed (1-2 seconds is fine). We do not want slow writes to affect fast scoring path. |
Common interview question: Why not one service?
Interviewers often ask: Why so many services? Your answer: We separate by speed requirement. The scoring path must be under 100ms - it only touches fast things (Redis, cached models). The update path can take 1-2 seconds - it writes to database. The learning path takes hours - it trains models. Mixing fast and slow in one service means the slow parts would block the fast parts.
Technology Choices - Why we picked these tools:
Feature Store: Redis Cluster (Recommended)
- Why: Sub-millisecond reads, can handle millions of operations per second
- Alternative: Apache Cassandra - better for larger datasets but slightly slower
- Alternative: DynamoDB - managed service, good if you are on AWS

Stream Processing: Apache Flink (Recommended)
- Why: Built for exactly this use case - real-time aggregations over time windows
- Alternative: Kafka Streams - simpler, good if features are not too complex
- Alternative: Spark Streaming - good if you already use Spark

ML Serving: TensorFlow Serving or TorchServe
- Why: Optimized for low-latency ML inference
- Alternative: Custom Python service with the model in memory - simpler but less optimized
- Alternative: AWS SageMaker - managed service, good for smaller teams

Message Queue: Apache Kafka (Recommended)
- Why: High throughput, can replay events, industry standard
- Alternative: AWS Kinesis - managed streaming service that plays the same role as Kafka
- Alternative: Pulsar - newer, some nice features, less mature
Important interview tip
Pick technologies YOU know! If you have used RabbitMQ, explain why it could work. If you know Spark, use Spark instead of Flink. Interviewers care more about your reasoning than the specific tool.
Data Model and Storage
Now let me show how we organize the data. We have two types of storage: fast storage (Redis) for real-time scoring and slow storage (PostgreSQL/S3) for history and training.
What to tell the interviewer
The key insight is that we store data twice - once in Redis for fast lookups and once in PostgreSQL for history. This is intentional duplication because the access patterns are completely different. Scoring needs millisecond reads. Training needs to scan billions of rows.
Fast Storage: Redis - Features for Real-Time Scoring
This stores pre-computed signals about each card, user, and merchant. Organized as key-value pairs for instant lookup.
| Key Pattern | What it stores | Example |
|---|---|---|
| card:{id}:txn_count_1h | Number of transactions in last hour | card:abc123:txn_count_1h = 5 |
| card:{id}:txn_sum_24h | Total amount spent in last 24 hours | card:abc123:txn_sum_24h = 523.50 |
Redis TTL for automatic cleanup
Each key has a TTL (time to live). The 1h keys expire after 1 hour, 24h keys after 24 hours, etc. This way old data automatically disappears. No cleanup jobs needed.
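At scoring time, reading these keys is a single fast lookup. A minimal sketch with redis-py, assuming the two key patterns from the table above; a missing key just means no recent activity:

```python
import redis

r = redis.Redis(decode_responses=True)

def load_card_features(card_id: str) -> dict:
    """One round trip to Redis returns everything the scorer needs for this card."""
    count, spend = r.mget(
        f"card:{card_id}:txn_count_1h",
        f"card:{card_id}:txn_sum_24h",
    )
    return {
        "txn_count_1h": int(count) if count else 0,
        "txn_sum_24h": float(spend) if spend else 0.0,
    }
```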
Slow Storage: PostgreSQL - Transaction History
This stores every transaction permanently. Used for investigations, training ML models, and generating reports.
| Column | What it stores | Example |
|---|---|---|
| id | Unique transaction ID | txn_789xyz |
| card_id | Which card was used | card_abc123 |
Why save features_json?
When we train ML models, we need to know what features the model SAW when it made the decision. If we just recompute features later, they would be different (more transactions happened since then). Saving the snapshot lets us train on exactly what the model saw.
Table 2: Rules - What fraud analysts configure
Fraud analysts write rules without needing engineers. Rules are stored in a database and cached in Redis for fast access.
| Column | What it stores | Example |
|---|---|---|
| id | Unique rule ID | rule_001 |
| name | Human readable name | Block high-amount new accounts |
Important: Rules need monitoring
Every rule tracks its hit count and false positive rate. If a rule blocks too many good customers (high false positive rate), we need to tune or disable it. Bad rules can block millions of dollars of good transactions!
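One way to keep rules data-driven is to store each rule as a record and evaluate it against the feature dictionary at scoring time. This is a hypothetical sketch; the fields beyond id and name are assumptions, not the actual schema:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    id: str
    name: str
    action: str                        # "block" or "review"
    condition: Callable[[dict], bool]  # evaluated against the feature dict
    hit_count: int = 0                 # tracked so noisy rules can be tuned or disabled

rules = [
    Rule(
        id="rule_001",
        name="Block high-amount new accounts",
        action="block",
        condition=lambda f: f["amount"] > 10_000 and f["account_age_days"] < 7,
    ),
]

def apply_rules(features: dict) -> Optional[str]:
    """Return the first matching rule's action, or None to fall through to the ML score."""
    for rule in rules:
        if rule.condition(features):
            rule.hit_count += 1
            return rule.action
    return None
```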
Feature Engineering Deep Dive
Features are the signals we feed to the ML model. Good features are the difference between catching fraud and missing it. Let me explain what features we compute and how.
What to tell the interviewer
Features fall into three categories: transaction features (about this specific purchase), historical features (patterns from past behavior), and derived features (combining multiple signals). The tricky part is computing historical features fast because they need to look at past data.
Category 1: Transaction Features (easy - just look at current transaction)
These come directly from the current transaction. No history needed.
| Feature Name | What it measures | Why it helps catch fraud |
|---|---|---|
| amount | How much money | Very large amounts are riskier |
| is_international | Is card used in different country than it was issued? | Stolen cards often used abroad |
Category 2: Historical Features (hard - need to track over time)
These look at past behavior. They are the most powerful fraud signals but hardest to compute fast.
| Feature Name | What it measures | Why it helps catch fraud |
|---|---|---|
| txn_count_1h | Purchases in last 1 hour | Fraudsters try many purchases quickly before card is blocked |
| txn_count_24h | Purchases in last 24 hours | Unusual spike in activity |
Category 3: Derived Features (computed from combining other features)
These combine multiple signals to create more powerful indicators.
| Feature Name | How it is computed | Why it helps catch fraud |
|---|---|---|
| amount_vs_avg_ratio | amount / avg_amount_30d | Value of 5 means spending 5x more than usual - suspicious! |
| velocity_score | txn_count_1h / typical_hourly_rate | Spending much faster than normal pattern |
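For example, the two derived features in the table above could be computed from the current transaction plus the historical features like this (guarding against brand-new cards that have no history yet):

```python
def derived_features(txn: dict, hist: dict) -> dict:
    """Combine raw and historical signals into derived features."""
    avg_30d = hist.get("avg_amount_30d") or 1.0            # avoid divide-by-zero for new cards
    typical_hourly = hist.get("typical_hourly_rate") or 1.0
    return {
        "amount_vs_avg_ratio": txn["amount"] / avg_30d,     # 5.0 means spending 5x the usual
        "velocity_score": hist.get("txn_count_1h", 0) / typical_hourly,
    }

# A $500 purchase on a card that usually spends $100 gives a ratio of 5.0:
print(derived_features({"amount": 500.0}, {"avg_amount_30d": 100.0, "txn_count_1h": 4}))
```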
WHEN A NEW TRANSACTION COMES IN:
// Step 1: Get existing features from Redis (instant - 1ms)
// Step 2: Run rules + ML and return the decision
// Step 3: Asynchronously update the counters for the next transaction
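A minimal Python sketch of this per-transaction flow. The feature_store, scorer, and publisher objects are placeholders for the Feature Service, the rules + ML scoring path, and the Kafka producer described earlier, not a real API:

```python
def handle_transaction(txn: dict, feature_store, scorer, publisher) -> str:
    """Synchronous scoring path - everything here must fit in the latency budget."""
    # Step 1: read pre-computed features (Redis, ~1ms)
    features = feature_store.load(txn["card_id"])

    # Step 2: score and decide while the payment waits (rules + ML + thresholds)
    decision = scorer.decide(txn, features)

    # Step 3: fire-and-forget event so counters are updated in the background;
    # a slow update can never delay the payment because we already have our answer
    publisher.send("transactions", {**txn, "decision": decision})
    return decision
```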
Updating counters asynchronously
Notice that we UPDATE counters after scoring, not before. And we do it asynchronously (it does not block). This means if someone makes two purchases 100ms apart, both might see txn_count_1h = 5 instead of the second one seeing 6. This tiny inconsistency is acceptable because: (1) it is rare, (2) the ML model handles noise well, (3) keeping scoring fast is more important.
ML Model and Rules Engine
Say this to the interviewer
We use BOTH rules and ML models. Rules catch known fraud patterns with 100% certainty - if amount > $10,000 AND account is 1 day old, always block. ML catches subtle patterns humans cannot write rules for - like this combination of features looks like past fraud cases.
Why we need both rules AND machine learning:
Rules are good for:
- Hard limits: Always block transactions over $50,000
- Known fraud patterns: Fraudsters from country X buying gift cards
- Compliance: Laws say we must block certain transactions
- Instant updates: A fraud analyst sees a new attack, adds a rule immediately
- Explainable: We can tell the customer exactly why they were blocked

ML is good for:
- Subtle patterns: This combination of 50 features looks suspicious
- New fraud types: Catches patterns we have not seen before
- Nuance: Not just block or not block, but how suspicious on a scale of 0-100
- Scale: Can evaluate thousands of signals that humans cannot track
How Rules and ML Work Together
The ML Model - How it works (simplified):
1. Training: We feed the model millions of past transactions, each labeled fraud or not fraud
2. Learning: The model learns patterns like: high amount + new account + international + unusual time = probably fraud
3. Prediction: For a new transaction, it outputs a probability from 0.0 (definitely safe) to 1.0 (definitely fraud)

Common models used:
- Gradient Boosted Trees (XGBoost, LightGBM): Fast, works well with tabular data, most common choice
- Neural Networks: Can find complex patterns but need more data and compute
- Ensemble: Run multiple models and combine their scores for better accuracy
FUNCTION score_transaction(transaction, features):
// STEP 1: Check hard block rules first
// STEP 2: Run the ML model to get a fraud probability
// STEP 3: Apply thresholds - approve, send to review, or block
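A sketch of how the rule results and the ML probability might be combined, using the 0.3 and 0.8 thresholds from this walkthrough. rules_engine and model stand in for the Rules Engine and ML Scoring Service described above:

```python
def score_transaction(txn: dict, features: dict, rules_engine, model) -> str:
    """Combine hard rules and the ML score into approve / review / block."""
    # STEP 1: hard block rules win outright - no ML override
    if rules_engine.hard_block(txn, features):
        return "block"

    # STEP 2: the ML model returns a fraud probability between 0.0 and 1.0
    p_fraud = model.predict(features)

    # STEP 3: tiered thresholds (tunable - see below)
    if p_fraud >= 0.8:
        return "block"
    if p_fraud <= 0.3:
        return "approve"
    return "review"    # borderline cases go to human analysts
```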
Thresholds can be tuned
The thresholds (0.3 for approve, 0.8 for block) are not fixed. We tune them based on business needs. If fraud losses are high, lower the block threshold to catch more. If customer complaints are high, raise the block threshold to let more through. This is why we output a score, not just approve or block - it gives flexibility to tune later.
Real-Time Processing Pipeline
What to tell the interviewer
The scoring path and the feature update path are separate. Scoring is synchronous and must be fast - under 100ms. Feature updates are asynchronous - they can take 1-2 seconds. This separation is key to meeting our latency requirements.
The Two Paths:
Path 1: Scoring (Synchronous - blocks the payment)
1. Payment service calls our API with transaction details
2. We read features from Redis (1-2ms)
3. We run the rules engine (1-2ms)
4. We run the ML model (5-10ms)
5. We make the decision (1ms)
6. We return the decision to the payment service
7. Total: 10-20ms typically, 50ms worst case

Path 2: Feature Updates (Asynchronous - happens in the background)
1. After scoring, we publish the transaction to Kafka
2. Flink reads from Kafka
3. Flink updates all the counters (count, sum, unique values)
4. Flink writes the updates to Redis
5. Total: 500ms to 2 seconds
6. This does not block the payment - it happens after we have already returned the decision
Scoring Path vs Update Path
How Flink computes rolling aggregates:
Flink is a stream processing engine. It keeps track of windows of time and computes aggregates over those windows.
// Flink job that updates transaction counts
FLINK JOB: UpdateTransactionCounts
// For each incoming transaction, update the 1h / 24h / 7d counters and write them to Redis
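Conceptually, the job keeps a rolling window of events per card and recomputes the count as new transactions arrive. A toy in-memory version of that logic is below; in production Flink manages this state, the windowing, and the write back to Redis:

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 3600
events = defaultdict(deque)      # card_id -> timestamps of recent transactions

def update_count_1h(card_id: str, ts: Optional[float] = None) -> int:
    """Add one event and return the number of events in the last hour for this card."""
    now = ts if ts is not None else time.time()
    q = events[card_id]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:    # drop events that fell out of the window
        q.popleft()
    return len(q)                               # this value gets written back to Redis
```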
What if Flink is slow or down?
If feature updates are delayed, scoring still works! We just use slightly stale features. A fraudster might get 1-2 extra transactions through before the counts update. This is acceptable because: (1) it is rare, (2) most fraud is caught by other features, (3) we can review and claw back money later. Never let feature updates block scoring.
Handling high traffic spikes:
During Black Friday, traffic can spike 10x. How do we handle it?
1. Auto-scaling: More API servers spin up automatically when traffic increases
2. Redis cluster: Can handle millions of reads per second, way more than we need
3. ML model batching: Instead of scoring one at a time, batch 10-50 together for GPU efficiency
4. Kafka buffering: If Flink cannot keep up, transactions queue in Kafka. Features are slightly delayed but scoring still works
5. Graceful degradation: If really overloaded, skip some features and use a simpler model
Human Review and Feedback Loop
What to tell the interviewer
Borderline transactions (score between 0.3 and 0.8) go to human reviewers. Their decisions serve two purposes: (1) catch fraud the model missed, (2) provide labeled data to improve the model. This feedback loop is how the model gets smarter over time.
The human review process:
1. A transaction scores 0.5 - not clearly safe, not clearly fraud
2. It goes into a review queue
3. A fraud analyst sees it in their dashboard
4. They see: transaction details, features, similar past transactions, ML explanation
5. They decide: approve, block, or request more info from the customer
6. Their decision is recorded and fed back to train future models
The Feedback Loop
What analysts see in their tool:
- Transaction details: amount, merchant, location, time
- Customer history: how long the account has existed, past transactions, past fraud
- Risk signals: which features are unusual, what rules triggered
- Similar cases: past transactions that looked like this - were they fraud?
- ML explanation: why did the model give this score?
The training cycle:
1. Daily: Collect all labeled transactions (human decisions + confirmed fraud + confirmed good)
2. Weekly: Retrain the ML model with the new data
3. Testing: Compare the new model vs the old model on holdout data
4. Shadow mode: Run the new model in parallel but do not use its decisions
5. Gradual rollout: Route 10% of traffic to the new model, then 50%, then 100%
The label delay problem
We do not know if a transaction is fraud immediately. The customer might not notice for days or weeks. So we have to wait before we can use transactions for training. Typical process: wait 30-90 days for chargebacks and fraud reports to come in, then use that transaction for training. This means our training data is always 1-3 months old.
FOR EACH transaction that is 90 days old:
// Check all possible fraud signals (chargebacks, fraud reports, analyst decisions)
// Label it fraud or not-fraud and add it to the training set
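A sketch of that labeling job in Python. The field names (id, timestamp, features_json) and the chargeback / confirmed-fraud sets are assumptions for illustration:

```python
from datetime import datetime, timedelta

def label_old_transactions(transactions, chargebacks: set, confirmed_fraud: set):
    """Only label transactions old enough that chargebacks and fraud reports have arrived."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    labeled = []
    for txn in transactions:
        if txn["timestamp"] > cutoff:
            continue                             # too recent - a chargeback may still show up
        is_fraud = txn["id"] in chargebacks or txn["id"] in confirmed_fraud
        # train on the saved features_json snapshot - exactly what the model saw at decision time
        labeled.append((txn["features_json"], int(is_fraud)))
    return labeled
```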
What Can Go Wrong and How We Handle It

Tell the interviewer about failures
Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.
Common failures and how we handle them:
| What breaks | What happens to users | How we fix it | Why this approach |
|---|---|---|---|
| Redis is down | Cannot read features | Use default features + simpler rules-only model | Some protection is better than none. Mark for later review. |
| ML model service is slow | Scoring takes too long | Timeout after 50ms, fall back to rules-only | Rules catch 60% of fraud. We review the rest later. |
The most important principle: FAIL OPEN
If our fraud system is completely down, what should happen? Two options:
1. Fail closed: Block all transactions until we are back up
2. Fail open: Let all transactions through, flag them for later review

We choose fail open. Why?
- Blocking all transactions loses millions of dollars in sales
- Most transactions (99.9%) are legitimate
- We can review and catch fraud later
- Customer experience matters - we cannot make everyone wait
FUNCTION score_transaction_safely(transaction):
TRY:
// Try the full scoring pipeline (features + rules + ML)
CATCH any error or timeout:
// Fall back to rules-only scoring (fail open) and flag for later review
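A minimal fail-open wrapper in Python. full_pipeline and rules_only stand in for the normal scoring path and a cheap rules-only fallback:

```python
import logging

logger = logging.getLogger("fraud")

def score_transaction_safely(txn: dict, full_pipeline, rules_only) -> str:
    """Fail open: a broken scoring pipeline must never stop payments entirely."""
    try:
        return full_pipeline(txn)     # features + rules + ML, as usual
    except Exception:
        logger.exception("scoring pipeline failed - falling back to rules only")
        return rules_only(txn)        # degraded but still some protection; review these later
```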
Monitoring for model drift
ML models can go bad slowly. Maybe fraudsters changed tactics, or maybe the data distribution shifted. We monitor: (1) Average fraud score over time - sudden changes are suspicious, (2) False positive rate - are we blocking more good customers? (3) Fraud loss - are we missing more fraud? If metrics drift, we alert the team to investigate.
Scaling and Performance
What to tell the interviewer
The system is designed to scale horizontally. Each component can be scaled independently: more API servers for more traffic, bigger Redis cluster for more features, more Flink workers for faster updates. The bottleneck is usually ML inference - we solve this with batching and caching.
How each component scales:
API Servers: Stateless, just add more behind a load balancer
- Normal: 20 servers
- Black Friday: Auto-scale to 200 servers

Redis: Cluster mode, shard by card_id
- 6 nodes normally, each handles a different range of cards
- Can add nodes dynamically for more capacity

ML Model Servers: GPU servers for fast inference
- Batch multiple predictions together (10-50 at once)
- Cache predictions for identical feature sets (rare but helps)
- 10 servers normally, 50 during peak

Kafka: Partition by card_id for ordering
- 50 partitions normally
- Each partition can handle 10,000 messages/second

Flink: Parallel workers process different partitions
- 20 workers normally
- Scale to 100 workers for peak traffic
How we scale for Black Friday
Performance optimizations:
1. Batch ML predictions: Instead of calling the ML model once per transaction, collect 10-50 transactions and predict all at once. GPUs love batching.
2. Feature prefetching: When we see a card starting a session (browsing), we prefetch its features into a local cache before checkout even starts.
3. Model quantization: Convert the ML model to use smaller numbers (int8 instead of float32). Faster, with a tiny accuracy loss.
4. Feature hashing: For high-cardinality features like merchant_id, hash them into buckets. Reduces model size and lookup time.
5. Circuit breakers: If the ML service is slow, stop calling it for 30 seconds (circuit open). This prevents cascading failures.
// Instead of predicting one at a time:
// SLOW - 100 transactions = 100 model calls = 1000ms
// FAST - batch 10-50 transactions into one model call
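A minimal sketch of micro-batching with a queue: wait at most 5 ms to fill a batch, then score whatever has arrived. The batch size and timeout are the illustrative values from above:

```python
import queue

BATCH_SIZE = 50
BATCH_TIMEOUT_S = 0.005      # 5 ms - never make one request wait long just to fill a batch

def next_batch(pending: "queue.Queue") -> list:
    """Collect up to BATCH_SIZE requests, or whatever arrives within the timeout."""
    batch = [pending.get()]                      # block until at least one request exists
    try:
        while len(batch) < BATCH_SIZE:
            batch.append(pending.get(timeout=BATCH_TIMEOUT_S))
    except queue.Empty:
        pass                                     # timeout hit - score what we have
    return batch

# one model call for the whole batch instead of one call per transaction:
#   scores = model.predict([req["features"] for req in next_batch(pending)])
```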
The latency vs throughput tradeoff
Batching increases throughput (more predictions per second) but increases latency (each prediction waits for the batch to fill). We balance this by using small batch sizes (10-50) and short timeouts (5ms). If the batch does not fill in 5ms, we predict anyway. This gives us most of the throughput benefit with a small latency cost.
Advanced Topics
Mention these if you have time
If the interview is going well and you have time, mention these advanced topics. They show depth of knowledge.
1. Graph-based fraud detection
Fraudsters work in networks. One stolen card leads to many fake accounts. Graph analysis can find these connections.
- Build a graph: cards, users, devices, addresses as nodes
- Edges: same phone used by two accounts, same shipping address, etc.
- Look for clusters of connected nodes with high fraud rates
- If one node is confirmed fraud, flag all connected nodes (see the toy sketch below)
Fraud Ring Detection with Graphs
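A toy sketch of this idea using networkx: connect entities that share a device or address, then flag every connected cluster that contains a confirmed-fraud node. All identifiers are made up:

```python
import networkx as nx

g = nx.Graph()                       # nodes: cards, devices, addresses
g.add_edge("card_111", "device_A")
g.add_edge("card_222", "device_A")   # two cards sharing one device
g.add_edge("card_222", "addr_9")
g.add_edge("card_333", "addr_9")     # chained through a shared shipping address

confirmed_fraud = {"card_111"}

# If any node in a cluster is confirmed fraud, send the rest of the cluster to review.
for cluster in nx.connected_components(g):
    if cluster & confirmed_fraud:
        print("review:", cluster - confirmed_fraud)
```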
2. Explainable AI (XAI)
When we block a transaction, we need to explain why. Black box ML models are not enough.
- Use SHAP (SHapley Additive exPlanations) to explain each prediction
- Show which features contributed most to the score
- Example: Blocked because (1) the amount is 10x your usual, (2) new country, (3) late at night
- Helps analysts review decisions and helps customers understand blocks
3. Adversarial fraud - fraudsters gaming the system
Smart fraudsters study fraud detection systems and try to evade them.
- Problem: If they know we flag large amounts, they split into small transactions
- Problem: If they know we flag new accounts, they age accounts before using them
- Solution: Do not rely on any single feature. Use hundreds of features.
- Solution: Add randomness - sometimes do extra checks on random low-score transactions
- Solution: Honeypots - fake vulnerable-looking accounts that catch fraudsters
4. Consortium data - sharing fraud signals across companies
A card that commits fraud at Store A will likely try Store B too.
- Companies share fraud signals (with privacy protections)
- If Card X committed fraud at another company, we know to be careful
- Challenge: Privacy laws, competitive concerns
- Solution: Share hashed identifiers and risk scores, not raw data
5. Real-time model updates (online learning)
Normal ML: Train model weekly, deploy new version. Online learning: Update model weights with each new labeled transaction.
- Benefit: The model adapts to new fraud patterns within hours, not weeks
- Challenge: The model could become unstable; needs careful monitoring
- Solution: Use online learning for score adjustments, not the core model
How real companies do it
Stripe uses a system called Radar with ML + rules. PayPal uses graph analysis to find fraud rings. Visa processes 65,000 transactions per second with ML. All major payment companies use some combination of rules, ML, and human review.