System Design Masterclass
fraud-detection · real-time · machine-learning · streaming · risk-scoring · advanced

Design Fraud Detection System

Design a real-time system to catch fraudulent transactions with low false positives

Millions of transactions per second | Similar to Stripe, PayPal, Visa, Mastercard, Square, Amazon, Netflix, Uber | 45 min read

Summary

A fraud detection system checks every payment to see if it looks suspicious. When someone uses a credit card or sends money, we need to decide in less than 100 milliseconds: is this real, or is a thief trying to steal money? The tricky parts are: checking millions of payments every second super fast, not blocking good customers by mistake (false positives), catching new types of fraud we have never seen before, and keeping up with smart criminals who keep changing their tricks. Companies like Stripe, PayPal, Visa, and Mastercard, as well as most banks, ask this question in interviews.

Key Takeaways

Core Problem

The main job is to check every payment in under 100 milliseconds and give it a risk score. Too strict means you block good customers and lose business. Too loose means thieves steal money.

The Hard Part

We need to look at the current payment AND the customer's history to decide if something is fishy. This means computing dozens of signals, like how many times this card bought something today, in real time for every single payment.

Scaling Axis

During big sales like Black Friday, payments can spike 10x. The system must handle these spikes without slowing down or missing fraud.

Critical Invariant

Never let the fraud check become a bottleneck. If our system is slow or down, payments should still go through (fail open) but we flag them for review later.

Performance Requirement

P99 latency must be under 100ms. This means 99 out of 100 fraud checks finish in 0.1 seconds. The payment screen cannot make customers wait.

Key Tradeoff

Catching more fraud means blocking more good customers too (false positives). Finding the right balance is the hardest part. Usually we aim for 1-2% false positive rate.

Design Walkthrough

Problem Statement

The Question: Design a system that checks every payment transaction in real-time and decides if it is fraud or not.

What the system needs to do (most important first):

  1. Score every transaction - When a payment comes in, give it a risk score from 0 (safe) to 100 (definitely fraud) in under 100 milliseconds.
  2. Block or allow automatically - Based on the score, either approve the payment, block it, or send it for human review.
  3. Learn from history - Use patterns from past transactions to spot suspicious behavior. If a card normally buys coffee in New York but suddenly buys electronics in another country, that is suspicious.
  4. Update in real-time - Track things like how many purchases a card made in the last hour so we can spot unusual spikes.
  5. Handle rules - Let fraud analysts add rules like block all transactions over $5000 from new accounts or flag purchases from certain countries.
  6. Minimize false positives - Do not block good customers. Every blocked legitimate purchase loses money and annoys customers.
  7. Adapt to new fraud - Criminals keep inventing new tricks. The system must learn and adapt without engineers rewriting everything.

What to say first

Let me understand the requirements first. What types of transactions are we checking - credit cards, bank transfers, or both? What is our acceptable false positive rate? Do we need to support human review for borderline cases? And what is our latency budget - how fast must we decide?

What the interviewer really wants to see:

  • Can you process millions of transactions per second in real-time?
  • Do you understand the tradeoff between catching fraud and blocking good customers?
  • Can you compute features (signals) like transaction velocity in real-time?
  • How do you combine rules (written by humans) with machine learning (learned from data)?

Clarifying Questions

Before you start designing, ask questions to understand what you are building. Good questions show the interviewer you think before you code.

Question 1: How big is this?

How many transactions per second do we need to handle? What is the peak load during sales events like Black Friday?

Why ask this: A system handling 100 transactions per second is very different from one handling 100,000.

What interviewers usually say: 10,000 transactions per second normally, up to 100,000 during peak times like Black Friday or big sales.

How this changes your design: At this scale, we need distributed processing, efficient caching, and pre-computed features. A single server cannot handle this load.

Question 2: How fast must we decide?

What is the maximum time we have to score a transaction? Is 100ms okay, or do we need it faster?

Why ask this: If we have more time, we can do more checks. If we need to be super fast, we must pre-compute everything.

What interviewers usually say: P99 latency under 100 milliseconds. This means 99% of requests must finish in 0.1 seconds.

How this changes your design: We cannot do complex database queries for every transaction. We need to pre-compute and cache everything we might need.

Question 3: What false positive rate is acceptable?

How many good transactions can we accidentally block? Is 1% okay, or do we need to be more accurate?

Why ask this: Every blocked good transaction loses money and annoys customers. But being too loose lets fraud through.

What interviewers usually say: False positive rate should be under 1-2%. We would rather let some small fraud through than block many good customers.

How this changes your design: We need a tiered system - auto-approve most transactions, auto-block obvious fraud, and send borderline cases for human review.

Question 4: Do we need human review?

Can humans review suspicious transactions, or must everything be fully automated?

Why ask this: Human review is expensive but catches edge cases that machines miss.

What interviewers usually say: Yes, we have a team of fraud analysts who review borderline cases. About 2-5% of transactions go to human review.

How this changes your design: We need a queue system for human review, tools for analysts, and a way to feed their decisions back into the machine learning model.

Summarize your assumptions

Let me summarize: 10,000 transactions per second normally, 100,000 at peak. P99 latency under 100ms. False positive rate under 2%. We have human reviewers for borderline cases. The system should learn from new fraud patterns automatically.

The Hard Part

Say this to the interviewer

The hardest part of fraud detection is computing features in real-time. A feature is a signal like how many times has this card been used in the last hour? We need dozens of these signals for every transaction, and we need them in milliseconds.

Why real-time features are tricky (explained simply):

  1. Volume is huge - With 10,000 transactions per second, we cannot query the database for each one. That would be 10,000 database queries per second just for one feature.
  2. Features need history - To know how many purchases a card made today, we need to remember all of today's purchases for every card. That is a lot of data to keep in memory.
  3. Time windows overlap - We might need last 1 hour, last 24 hours, and last 7 days all at once. Each window needs separate tracking.
  4. Must be consistent - If a fraudster makes 10 purchases in 1 second, each one must see the correct count. We cannot have race conditions.
  5. Features expire - Data from 8 days ago is not needed for a 7-day window. We must clean up old data to save memory.

Common mistake candidates make

Many people say: just query the database to count transactions. This is wrong because: (1) querying millions of rows for each transaction is too slow, (2) the database would crash under load, (3) it does not give real-time counts - there is always a delay.

Two ways to compute real-time features:

Option 1: Pre-aggregated counters (Recommended) - see the sketch after Option 2

  • Keep running counters in Redis: card_123_purchases_1h = 5
  • When a new purchase comes in, increment the counter
  • Reading the feature is instant - just read from Redis
  • Use TTL (time to live) to auto-expire old data

Option 2: Stream processing with Kafka/Flink

  • Every transaction goes to a Kafka stream
  • Flink computes rolling aggregates (sum, count, average) in real-time
  • Results are stored in a fast database for lookup
  • Better for complex features like standard deviation of purchase amounts
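
A minimal sketch of Option 1, assuming a redis-py client; the key names and window lengths are illustrative, not a fixed schema.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_transaction(card_id: str, amount: float) -> None:
    """Increment rolling counters for a card; TTL expires idle keys automatically."""
    pipe = r.pipeline()
    for window, ttl_seconds in [("1h", 3600), ("24h", 86400)]:
        count_key = f"card:{card_id}:txn_count_{window}"
        sum_key = f"card:{card_id}:txn_sum_{window}"
        pipe.incr(count_key)                 # how many purchases in this window
        pipe.incrbyfloat(sum_key, amount)    # total spent in this window
        pipe.expire(count_key, ttl_seconds)  # auto-cleanup, no batch jobs needed
        pipe.expire(sum_key, ttl_seconds)
    pipe.execute()                           # one round trip for all updates

One caveat: a single counter whose TTL is refreshed on every purchase only approximates a sliding window. If exact windows matter, the Flink path in Option 2 (or per-minute bucket keys) is the usual fix.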

How we compute features in real-time

Scale and Access Patterns

Before designing, let me figure out how big this system needs to be. This helps us choose the right tools.

What we are measuring | Number | What this means for our design
Normal transactions per second | 10,000 | Need distributed processing - one server cannot handle this
Peak transactions per second | 100,000 | System must auto-scale 10x during Black Friday
+ 5 more rows...

What to tell the interviewer

At 10,000 TPS normally and 100,000 at peak, we need a distributed system. The key insight is that the 100ms latency requirement means we CANNOT do database queries for features during scoring. Everything must be pre-computed and cached. I will use Redis for real-time features and a streaming system like Kafka for computing aggregates.

Understanding the class imbalance problem

Only 0.1% of transactions are fraud. If our model just said "everything is safe" it would be 99.9% accurate! But it would catch zero fraud. This is why we use special metrics like precision (of the ones we flagged, how many were actually fraud?) and recall (of all the fraud, how many did we catch?).
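
To make this concrete, here is a toy calculation with made-up numbers: 1,000,000 transactions, 1,000 of them fraud, and a model that flags 2,000 transactions of which 800 really are fraud.

flagged = 2000
true_fraud_flagged = 800        # true positives
total_fraud = 1000
total_txns = 1_000_000

precision = true_fraud_flagged / flagged       # 0.40 - of what we flagged, 40% was fraud
recall = true_fraud_flagged / total_fraud      # 0.80 - we caught 80% of all fraud
accuracy_if_always_safe = (total_txns - total_fraud) / total_txns  # 0.999 - looks great, catches nothing

print(precision, recall, accuracy_if_always_safe)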

How the system is used (from most common to least common):

  1. Score a transaction - A payment comes in, we score it in real-time. This happens 10,000+ times per second. Must be super fast.
  2. Update features - After each transaction, update the running counters (purchases today, total spent this week, etc). Happens for every transaction.
  3. Human review - Analysts look at flagged transactions and decide fraud or not fraud. Maybe 200-500 reviews per hour.
  4. Add new rules - Fraud analysts add rules like block transactions over $10,000 from accounts less than 7 days old. Happens a few times per day.
  5. Retrain ML model - Feed new fraud examples to improve the model. Happens daily or weekly.

How much data flows through?
- 10,000 transactions/second x 86,400 seconds/day = 864 million transactions/day
- Each transaction: about 1 KB of data
+ 14 more lines...

High-Level Architecture

Now let me draw the big picture of how all the pieces fit together. I will keep it simple and explain what each part does.

What to tell the interviewer

I will break this into three paths: the real-time scoring path (must be fast), the feature update path (can be slightly delayed), and the learning path (runs in background). The scoring path touches only fast systems like Redis. Slow things like database writes happen asynchronously.

Fraud Detection System - The Big Picture

What each service does and WHY it is separate:

Service | What it does | Why it is separate (what to tell interviewer)
Feature Service | Looks up pre-computed features for a card/user from Redis. Returns things like: 5 purchases today, $500 spent this week, normally buys in US. | Why separate? This is the most called service. We want it to do ONE thing fast - read from Redis. No complex logic here.
Rules Engine | Checks hard-coded rules like: block if amount > $10,000 AND account age < 7 days. Rules are written by fraud analysts. | Why separate? Rules change often (daily). We do not want to redeploy the whole system for a new rule. Rules can be updated without touching ML models.
ML Scoring Service | Runs the machine learning model on the features to get a fraud probability (like 0.73 means 73% chance of fraud). | Why separate? ML models need special hardware (GPUs sometimes). We want to scale ML servers independently. We can A/B test different models easily.
Decision Service | Takes the rule results + ML score and makes final decision: approve, block, or send to human review. | Why separate? This is where business logic lives. We can tune thresholds (block if score > 0.8) without changing ML model.
Stream Processing (Flink) | Updates the running counters in Redis as new transactions flow in. Also writes to database for storage. | Why separate? Updates can be slightly delayed (1-2 seconds is fine). We do not want slow writes to affect fast scoring path.

Common interview question: Why not one service?

Interviewers often ask: Why so many services? Your answer: We separate by speed requirement. The scoring path must be under 100ms - it only touches fast things (Redis, cached models). The update path can take 1-2 seconds - it writes to database. The learning path takes hours - it trains models. Mixing fast and slow in one service means the slow parts would block the fast parts.

Technology Choices - Why we picked these tools:

Feature Store: Redis Cluster (Recommended)
  • Why: Sub-millisecond reads, can handle millions of operations per second
  • Alternative: Apache Cassandra - better for larger datasets but slightly slower
  • Alternative: DynamoDB - managed service, good if you are on AWS

Stream Processing: Apache Flink (Recommended)
  • Why: Built for exactly this use case - real-time aggregations over time windows
  • Alternative: Kafka Streams - simpler, good if features are not too complex
  • Alternative: Spark Streaming - good if you already use Spark

ML Serving: TensorFlow Serving or TorchServe
  • Why: Optimized for low-latency ML inference
  • Alternative: Custom Python service with model in memory - simpler but less optimized
  • Alternative: AWS SageMaker - managed service, good for smaller teams

Message Queue: Apache Kafka (Recommended)
  • Why: High throughput, can replay events, industry standard
  • Alternative: AWS Kinesis - managed Kafka alternative
  • Alternative: Pulsar - newer, some nice features, less mature

Important interview tip

Pick technologies YOU know! If you have used RabbitMQ, explain why it could work. If you know Spark, use Spark instead of Flink. Interviewers care more about your reasoning than the specific tool.

Data Model and Storage

Now let me show how we organize the data. We have two types of storage: fast storage (Redis) for real-time scoring and slow storage (PostgreSQL/S3) for history and training.

What to tell the interviewer

The key insight is that we store data twice - once in Redis for fast lookups and once in PostgreSQL for history. This is intentional duplication because the access patterns are completely different. Scoring needs millisecond reads. Training needs to scan billions of rows.

Fast Storage: Redis - Features for Real-Time Scoring

This stores pre-computed signals about each card, user, and merchant. Organized as key-value pairs for instant lookup.

Key Pattern | What it stores | Example
card:{id}:txn_count_1h | Number of transactions in last hour | card:abc123:txn_count_1h = 5
card:{id}:txn_sum_24h | Total amount spent in last 24 hours | card:abc123:txn_sum_24h = 523.50
+ 8 more rows...

Redis TTL for automatic cleanup

Each key has a TTL (time to live). The 1h keys expire after 1 hour, 24h keys after 24 hours, etc. This way old data automatically disappears. No cleanup jobs needed.
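
A minimal read-path sketch for scoring time, assuming redis-py and the key patterns above; the feature list here is a placeholder subset of the real one. A missing key simply means "no recent activity" and defaults to 0.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

FEATURE_KEYS = ["txn_count_1h", "txn_sum_24h"]   # illustrative subset

def load_card_features(card_id: str) -> dict:
    keys = [f"card:{card_id}:{name}" for name in FEATURE_KEYS]
    values = r.mget(keys)                        # one round trip, sub-millisecond in practice
    return {name: float(v) if v is not None else 0.0
            for name, v in zip(FEATURE_KEYS, values)}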

Slow Storage: PostgreSQL - Transaction History

This stores every transaction permanently. Used for investigations, training ML models, and generating reports.

Column | What it stores | Example
id | Unique transaction ID | txn_789xyz
card_id | Which card was used | card_abc123
+ 12 more rows...

Why save features_json?

When we train ML models, we need to know what features the model SAW when it made the decision. If we just recompute features later, they would be different (more transactions happened since then). Saving the snapshot lets us train on exactly what the model saw.

Table 2: Rules - What fraud analysts configure

Fraud analysts write rules without needing engineers. Rules are stored in a database and cached in Redis for fast access.

Column | What it stores | Example
id | Unique rule ID | rule_001
name | Human readable name | Block high-amount new accounts
+ 8 more rows...

Important: Rules need monitoring

Every rule tracks its hit count and false positive rate. If a rule blocks too many good customers (high false positive rate), we need to tune or disable it. Bad rules can block millions of dollars of good transactions!

Feature Engineering Deep Dive

Features are the signals we feed to the ML model. Good features are the difference between catching fraud and missing it. Let me explain what features we compute and how.

What to tell the interviewer

Features fall into three categories: transaction features (about this specific purchase), historical features (patterns from past behavior), and derived features (combining multiple signals). The tricky part is computing historical features fast because they need to look at past data.

Category 1: Transaction Features (easy - just look at current transaction)

These come directly from the current transaction. No history needed.

Feature Name | What it measures | Why it helps catch fraud
amount | How much money | Very large amounts are riskier
is_international | Is the card used in a different country than the one it was issued in? | Stolen cards are often used abroad
+ 4 more rows...

Category 2: Historical Features (hard - need to track over time)

These look at past behavior. They are the most powerful fraud signals but hardest to compute fast.

Feature Name | What it measures | Why it helps catch fraud
txn_count_1h | Purchases in last 1 hour | Fraudsters try many purchases quickly before card is blocked
txn_count_24h | Purchases in last 24 hours | Unusual spike in activity
+ 6 more rows...

Category 3: Derived Features (computed from combining other features)

These combine multiple signals to create more powerful indicators.

Feature Name | How it is computed | Why it helps catch fraud
amount_vs_avg_ratio | amount / avg_amount_30d | Value of 5 means spending 5x more than usual - suspicious!
velocity_score | txn_count_1h / typical_hourly_rate | Spending much faster than normal pattern
+ 4 more rows...

WHEN A NEW TRANSACTION COMES IN:

// Step 1: Get existing features from Redis (instant - 1ms)
+ 20 more lines...

Updating counters asynchronously

Notice that we UPDATE counters after scoring, not before. And we do it asynchronously (does not block). This means if someone makes two purchases 100ms apart, both might see txn_count_1h = 5 instead of the second one seeing 6. This tiny inconsistency is acceptable because: (1) it is rare, (2) the ML model handles noise well, (3) keeping scoring fast is more important.

ML Model and Rules Engine

Say this to the interviewer

We use BOTH rules and ML models. Rules catch known fraud patterns with 100% certainty - if amount > $10,000 AND account is 1 day old, always block. ML catches subtle patterns humans cannot write rules for - like this combination of features looks like past fraud cases.

Why we need both rules AND machine learning:

Rules are good for:

  • Hard limits: Always block transactions over $50,000
  • Known fraud patterns: Fraudsters from country X buying gift cards
  • Compliance: Laws say we must block certain transactions
  • Instant updates: Fraud analyst sees new attack, adds rule immediately
  • Explainable: We can tell customer exactly why they were blocked

ML is good for:

  • Subtle patterns: This combination of 50 features looks suspicious
  • New fraud types: Catches patterns we have not seen before
  • Nuance: Not just block or not block, but how suspicious on a scale of 0-100
  • Scale: Can evaluate thousands of signals that humans cannot track

How Rules and ML Work Together

The ML Model - How it works (simplified):

  1. Training: We feed the model millions of past transactions, each labeled fraud or not fraud
  2. Learning: The model learns patterns like: high amount + new account + international + unusual time = probably fraud
  3. Prediction: For a new transaction, it outputs a probability from 0.0 (definitely safe) to 1.0 (definitely fraud)

Common models used:

  • Gradient Boosted Trees (XGBoost, LightGBM): Fast, works well with tabular data, most common choice
  • Neural Networks: Can find complex patterns but needs more data and compute
  • Ensemble: Run multiple models and combine their scores for better accuracy
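
If helpful, here is a minimal training sketch assuming a gradient boosted tree model (XGBoost) trained on the saved feature snapshots. The column names, file name, thresholds, and hyperparameters are placeholders, not the production setup.

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Hypothetical export of labeled transactions with their features_json snapshots
df = pd.read_parquet("labeled_transactions.parquet")
X = df[["amount", "txn_count_1h", "txn_sum_24h", "amount_vs_avg_ratio", "is_international"]]
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    # Weight the rare fraud class more heavily (roughly negatives / positives)
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="aucpr",   # precision-recall AUC suits heavily imbalanced data
)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]   # fraud probability per transaction
preds = scores > 0.8                         # example block threshold
print(precision_score(y_test, preds), recall_score(y_test, preds))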

FUNCTION score_transaction(transaction, features):
    
    // STEP 1: Check hard block rules first
+ 44 more lines...

Thresholds can be tuned

The thresholds (0.3 for approve, 0.8 for block) are not fixed. We tune them based on business needs. If fraud losses are high, lower the block threshold to catch more. If customer complaints are high, raise the block threshold to let more through. This is why we output a score, not just approve or block - it gives flexibility to tune later.
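
A small sketch of how that tuning might look offline, assuming a labeled holdout set; the synthetic scores and the 50% precision target are made up purely for illustration.

import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)    # ~0.1% fraud, synthetic labels
# Synthetic model scores: fraud tends to score high, legit tends to score low
scores = np.where(y_true == 1, rng.beta(5, 2, 100_000), rng.beta(2, 8, 100_000))

precision, recall, thresholds = precision_recall_curve(y_true, scores)

target_precision = 0.5                       # assumed business target for auto-block
ok = precision[:-1] >= target_precision      # precision has one extra trailing entry
block_threshold = thresholds[ok][0] if ok.any() else 0.8
print(f"block if score > {block_threshold:.2f}")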

Real-Time Processing Pipeline

What to tell the interviewer

The scoring path and the feature update path are separate. Scoring is synchronous and must be fast - under 100ms. Feature updates are asynchronous - they can take 1-2 seconds. This separation is key to meeting our latency requirements.

The Two Paths:

Path 1: Scoring (Synchronous - blocks the payment)

  1. Payment service calls our API with transaction details
  2. We read features from Redis (1-2ms)
  3. We run rules engine (1-2ms)
  4. We run ML model (5-10ms)
  5. We make decision (1ms)
  6. We return decision to payment service
  7. Total: 10-20ms typically, 50ms worst case

Path 2: Feature Updates (Asynchronous - happens in background)

  1. After scoring, we publish transaction to Kafka
  2. Flink reads from Kafka
  3. Flink updates all the counters (count, sum, unique values)
  4. Flink writes updates to Redis
  5. Total: 500ms to 2 seconds
  6. This does not block the payment - it happens after we already returned the decision
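
A minimal sketch of how the two paths meet in code, assuming kafka-python, a placeholder score() function standing in for the synchronous steps, and an illustrative topic name.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def score(txn: dict) -> dict:
    """Placeholder for the synchronous pipeline: features + rules + ML + decision."""
    return {"risk_score": 0.1, "action": "approve"}

def handle_payment(txn: dict) -> dict:
    decision = score(txn)                 # Path 1: synchronous, must stay under 100ms
    # Path 2: fire-and-forget publish; Flink consumes this topic and updates Redis counters
    producer.send("transactions", value={"txn": txn, "decision": decision})
    return decision                       # returned before feature updates complete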

Scoring Path vs Update Path

How Flink computes rolling aggregates:

Flink is a stream processing engine. It keeps track of windows of time and computes aggregates over those windows.

// Flink job that updates transaction counts

FLINK JOB: UpdateTransactionCounts
+ 25 more lines...

What if Flink is slow or down?

If feature updates are delayed, scoring still works! We just use slightly stale features. A fraudster might get 1-2 extra transactions through before the counts update. This is acceptable because: (1) it is rare, (2) most fraud is caught by other features, (3) we can review and claw back money later. Never let feature updates block scoring.

Handling high traffic spikes:

During Black Friday, traffic can spike 10x. How do we handle it?

  1. Auto-scaling: More API servers spin up automatically when traffic increases
  2. Redis cluster: Can handle millions of reads per second, way more than we need
  3. ML model batching: Instead of scoring one at a time, batch 10-50 together for GPU efficiency
  4. Kafka buffering: If Flink cannot keep up, transactions queue in Kafka. Features are slightly delayed but scoring still works
  5. Graceful degradation: If really overloaded, skip some features and use a simpler model

Human Review and Feedback Loop

What to tell the interviewer

Borderline transactions (score between 0.3 and 0.8) go to human reviewers. Their decisions serve two purposes: (1) catch fraud the model missed, (2) provide labeled data to improve the model. This feedback loop is how the model gets smarter over time.

The human review process:

  1. Transaction scores 0.5 - not clearly safe, not clearly fraud
  2. It goes into a review queue
  3. A fraud analyst sees it in their dashboard
  4. They see: transaction details, features, similar past transactions, ML explanation
  5. They decide: approve, block, or request more info from customer
  6. Their decision is recorded and fed back to train future models

The Feedback Loop

What analysts see in their tool:

  • Transaction details: amount, merchant, location, time
  • Customer history: how long account exists, past transactions, past fraud
  • Risk signals: which features are unusual, what rules triggered
  • Similar cases: past transactions that looked like this - were they fraud?
  • ML explanation: why did the model give this score?

The training cycle:

  1. Daily: Collect all labeled transactions (human decisions + confirmed fraud + confirmed good)
  2. Weekly: Retrain the ML model with new data
  3. Testing: Compare new model vs old model on holdout data
  4. Shadow mode: Run new model in parallel but do not use its decisions
  5. Gradual rollout: Route 10% of traffic to new model, then 50%, then 100%
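
A small shadow-mode sketch under assumed interfaces (scikit-learn-style predict_proba models): the old model keeps deciding, while the new model's scores are only logged so we can compare them offline before any rollout.

import logging

logger = logging.getLogger("shadow")

def score_with_shadow(features, old_model, new_model):
    live_score = old_model.predict_proba([features])[0][1]   # this score drives the decision
    try:
        shadow_score = new_model.predict_proba([features])[0][1]
        logger.info("shadow_compare live=%.3f shadow=%.3f", live_score, shadow_score)
    except Exception:
        # A shadow failure must never affect the live scoring path
        logger.exception("shadow model failed")
    return live_score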

The label delay problem

We do not know if a transaction is fraud immediately. The customer might not notice for days or weeks. So we have to wait before we can use transactions for training. Typical process: wait 30-90 days for chargebacks and fraud reports to come in, then use that transaction for training. This means our training data is always 1-3 months old.

FOR EACH transaction that is 90 days old:
    
    // Check all possible fraud signals
+ 16 more lines...

What Can Go Wrong and How We Handle It

Tell the interviewer about failures

Good engineers think about what can break. Let me walk through the things that can go wrong and how we protect against them.

Common failures and how we handle them:

What breaks | What happens to users | How we fix it | Why this approach
Redis is down | Cannot read features | Use default features + simpler rules-only model | Some protection is better than none. Mark for later review.
ML model service is slow | Scoring takes too long | Timeout after 50ms, fall back to rules-only | Rules catch 60% of fraud. We review the rest later.
+ 5 more rows...

The most important principle: FAIL OPEN

If our fraud system is completely down, what should happen? Two options:

  1. Fail closed: Block all transactions until we are back up
  2. Fail open: Let all transactions through, flag them for later review

We choose fail open. Why?

  • Blocking all transactions loses millions of dollars in sales
  • Most transactions (99.9%) are legitimate
  • We can review and catch fraud later
  • Customer experience matters - cannot make everyone wait

FUNCTION score_transaction_safely(transaction):
    TRY:
        // Try the full scoring pipeline
+ 27 more lines...

Monitoring for model drift

ML models can go bad slowly. Maybe fraudsters changed tactics, or maybe data distribution shifted. We monitor: (1) Average fraud score over time - sudden changes are suspicious, (2) False positive rate - are we blocking more good customers? (3) Fraud loss - are we missing more fraud? If metrics drift, we alert the team to investigate.
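
One simple way to watch for this, sketched with made-up tolerances: compare the recent average score and block rate against a trailing baseline and raise alerts on big shifts.

def check_drift(recent_scores, baseline_avg_score, baseline_block_rate, block_threshold=0.8):
    avg_score = sum(recent_scores) / len(recent_scores)
    block_rate = sum(s > block_threshold for s in recent_scores) / len(recent_scores)
    alerts = []
    if abs(avg_score - baseline_avg_score) > 0.05:        # illustrative tolerance
        alerts.append(f"average score drifted: {avg_score:.3f} vs {baseline_avg_score:.3f}")
    if block_rate > 2 * baseline_block_rate:              # blocking twice as often as usual
        alerts.append(f"block rate doubled: {block_rate:.4f} vs {baseline_block_rate:.4f}")
    return alerts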

Scaling and Performance

What to tell the interviewer

The system is designed to scale horizontally. Each component can be scaled independently: more API servers for more traffic, bigger Redis cluster for more features, more Flink workers for faster updates. The bottleneck is usually ML inference - we solve this with batching and caching.

How each component scales:

API Servers: Stateless, just add more behind load balancer
  • Normal: 20 servers
  • Black Friday: Auto-scale to 200 servers

Redis: Cluster mode, shard by card_id
  • 6 nodes normally, each handles different card ranges
  • Can add nodes dynamically for more capacity

ML Model Servers: GPU servers for fast inference
  • Batch multiple predictions together (10-50 at once)
  • Cache predictions for identical feature sets (rare but helps)
  • 10 servers normally, 50 during peak

Kafka: Partition by card_id for ordering
  • 50 partitions normally
  • Each partition can handle 10,000 messages/second

Flink: Parallel workers process different partitions
  • 20 workers normally
  • Scale to 100 workers for peak traffic

How we scale for Black Friday

Performance optimizations:

  1. Batch ML predictions: Instead of calling ML model once per transaction, collect 10-50 transactions and predict all at once. GPU loves batching.
  2. Feature prefetching: When we see a card starting a session (browsing), we prefetch its features into local cache before they even start checkout.
  3. Model quantization: Convert ML model to use smaller numbers (int8 instead of float32). Faster with tiny accuracy loss.
  4. Feature hashing: For high-cardinality features like merchant_id, hash them into buckets. Reduces model size and lookup time.
  5. Circuit breakers: If ML service is slow, stop calling it for 30 seconds (circuit open). This prevents cascading failures.

// Instead of predicting one at a time:

// SLOW - 100 transactions = 100 model calls = 1000ms
+ 8 more lines...

The latency vs throughput tradeoff

Batching increases throughput (more predictions per second) but increases latency (each prediction waits for the batch to fill). We balance this by using small batch sizes (10-50) and short timeouts (5ms). If the batch does not fill in 5ms, we predict anyway. This gives us most of the throughput benefit with small latency cost.
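
A rough sketch of that micro-batching idea (size cap plus short timeout). The predict_batch() function is a stand-in for the real model server, and the numbers mirror the 10-50 items / 5ms example above.

import queue
import threading

request_queue = queue.Queue()

def predict_batch(feature_rows):
    # Placeholder: in reality this calls the GPU-backed model once per batch
    return [0.1 for _ in feature_rows]

def batcher(max_batch=50, max_wait_s=0.005):
    while True:
        features, reply = request_queue.get()          # block until the first request arrives
        batch = [(features, reply)]
        try:
            while len(batch) < max_batch:
                batch.append(request_queue.get(timeout=max_wait_s))
        except queue.Empty:
            pass                                        # timeout hit - predict with what we have
        scores = predict_batch([f for f, _ in batch])
        for (_, r), score in zip(batch, scores):
            r.put(score)                                # hand each caller its own score

threading.Thread(target=batcher, daemon=True).start()

def score_one(features):
    reply = queue.Queue(maxsize=1)
    request_queue.put((features, reply))
    return reply.get()                                  # waits for the batch result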

Advanced Topics

Mention these if you have time

If the interview is going well and you have time, mention these advanced topics. They show depth of knowledge.

1. Graph-based fraud detection

Fraudsters work in networks. One stolen card leads to many fake accounts. Graph analysis can find these connections.

  • Build a graph: cards, users, devices, addresses as nodes
  • Edges: same phone used by two accounts, same shipping address, etc.
  • Look for clusters of connected nodes with high fraud rates
  • If one node is confirmed fraud, flag all connected nodes
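
A toy sketch of the idea, assuming networkx and made-up links between cards, devices, and addresses.

import networkx as nx

links = [
    ("card_a", "device_1"), ("card_b", "device_1"),   # two cards seen on one device
    ("card_b", "addr_9"), ("card_c", "addr_9"),       # shared shipping address
    ("card_x", "device_7"),                            # unrelated card
]
confirmed_fraud = {"card_a"}

G = nx.Graph()
G.add_edges_from(links)

for component in nx.connected_components(G):
    if component & confirmed_fraud:
        suspects = {n for n in component if n.startswith("card_")} - confirmed_fraud
        print("flag for review:", suspects)   # card_b and card_c share links with card_a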

Fraud Ring Detection with Graphs

2. Explainable AI (XAI)

When we block a transaction, we need to explain why. Black box ML models are not enough.

  • Use SHAP (SHapley Additive exPlanations) to explain each prediction
  • Show which features contributed most to the score
  • Example: Blocked because (1) amount is 10x your usual, (2) new country, (3) late night
  • Helps analysts review decisions and helps customers understand blocks
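
A hedged sketch of explaining one prediction with SHAP, assuming the tree model and feature frame (model, X_test) from the training sketch earlier in this section.

import shap

explainer = shap.TreeExplainer(model)                  # works for tree-based models like XGBoost
shap_values = explainer.shap_values(X_test.iloc[[0]])  # explain a single transaction

# Pair each feature with its contribution and show the biggest drivers of the score
contributions = sorted(
    zip(X_test.columns, shap_values[0]),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
for feature, value in contributions[:3]:
    print(f"{feature}: {value:+.3f}")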

3. Adversarial fraud - fraudsters gaming the system

Smart fraudsters study fraud detection systems and try to evade them.

  • Problem: If they know we flag large amounts, they split into small transactions
  • Problem: If they know we flag new accounts, they age accounts before using them
  • Solution: Do not rely on any single feature. Use hundreds of features.
  • Solution: Add randomness - sometimes do extra checks on random low-score transactions
  • Solution: Honeypots - fake vulnerable-looking accounts that catch fraudsters

4. Consortium data - sharing fraud signals across companies

A card that commits fraud at Store A will likely try Store B too.

  • Companies share fraud signals (with privacy protections)
  • If Card X committed fraud at another company, we know to be careful
  • Challenge: Privacy laws, competitive concerns
  • Solution: Share hashed identifiers and risk scores, not raw data

5. Real-time model updates (online learning)

Normal ML: Train model weekly, deploy new version. Online learning: Update model weights with each new labeled transaction.

  • Benefit: Model adapts to new fraud patterns within hours, not weeks
  • Challenge: Model could become unstable, need careful monitoring
  • Solution: Use online learning for score adjustments, not core model

How real companies do it

Stripe uses a system called Radar with ML + rules. PayPal uses graph analysis to find fraud rings. Visa processes 65,000 transactions per second with ML. All major payment companies use some combination of rules, ML, and human review.

Design Trade-offs

Advantages of a rules-only approach

  • +Simple to build and understand
  • +Easy to explain decisions
  • +Instant updates - add a rule and it works immediately
  • +No training data needed

Disadvantages of a rules-only approach

  • -Cannot catch subtle patterns
  • -Fraudsters learn the rules and evade them
  • -Hard to maintain thousands of rules
  • -High false positive rates

When to use

Starting out with no historical data, or as a supplement to ML for known fraud patterns.