Patterns

35 items

System Design Pattern

Storagebloom-filterprobabilisticmembershipspace-efficientfalse-positiveintermediate

Bloom Filters Pattern

Probabilistic membership testing for space efficiency

Used in: Databases, Caches, Distributed Systems|20 min read

Summary

Bloom filters are space-efficient probabilistic data structures that test whether an element is a member of a set. They can definitively say "definitely not in set" but only probabilistically say "probably in set" - meaning false positives are possible but false negatives are not. This tradeoff enables them to use 10-100x less memory than exact data structures like hash sets, making them invaluable for reducing expensive disk lookups, network calls, and memory usage in distributed systems.

Key Takeaways

Space-Time Tradeoff

Bloom filters trade accuracy for space efficiency. With just 10 bits per element, you can achieve a 1% false positive rate - using 10-100x less memory than storing actual elements in a hash set.

Asymmetric Guarantees

The key property: if bloom filter says NO, the element is definitely not there (no false negatives). If it says YES, the element might be there (false positives possible). This asymmetry is what makes them useful.

Multiple Hash Functions

Use k independent hash functions (typically 3-10) to set/check k bits in the array. More hash functions reduce false positives but increase computation time and false positive rate if too many.

Consider common scenarios in distributed systems:

Database Query Problem: Before querying a database for a record, you want to know if it exists. Without a bloom filter, you query the database (10-50ms), discover the record does not exist, and wasted an expensive operation.

Cache Penetration: Attackers query for non-existent keys to bypass cache and overwhelm your database. Every query hits the database because cache does not know the key does not exist.

URL Deduplication: A web crawler needs to track 10 billion URLs already crawled. Storing all URLs in memory requires ~1TB (100 bytes per URL * 10B). This is prohibitively expensive.

Username Availability: An application with 100M users needs to check if usernames are available during registration. Checking against database every time is slow.

The Core Insight

In all these cases, we often check for items that DO NOT EXIST. A bloom filter lets us definitively rule out non-existent items instantly, only falling back to expensive checks for items that might exist.

The fundamental tradeoff:

Exact Set (Hash Set):
✓ No false positives
✓ Can delete elements
✗ 64-800 bits per element (pointers + data)
✗ 100M elements = 800MB - 10GB

Bloom Filter:
✓ 10-20 bits per element
✓ 100M elements = 120MB - 250MB
✗ 1% false positive rate
✗ Cannot delete elements

For most systems, 1% false positives are acceptable if they save 90% of memory and enable keeping the filter entirely in RAM.

Summary

Key Takeaways

Space-Time Tradeoff

Bloom filters trade accuracy for space efficiency. With just 10 bits per element, you can achieve a 1% false positive rate - using 10-100x less memory than storing actual elements in a hash set.

Asymmetric Guarantees

Multiple Hash Functions

Use k independent hash functions (typically 3-10) to set/check k bits in the array. More hash functions reduce false positives but increase computation time and false positive rate if too many.

Perfect for Negative Caching

Ideal for avoiding expensive operations on non-existent keys: checking if a username is taken, if a URL has been crawled, if a record exists before querying disk.

Immutable After Construction

Cannot remove elements from a standard bloom filter. Deletions would cause false negatives. Use counting bloom filters if deletions are needed, but they use 4x more space.

False Positive Rate

FP rate = (1 - e^(-kn/m))^k where k=hash functions, n=elements, m=bits. With 10 bits per element and optimal k, FP rate is ~1%. This is tunable based on your requirements.

Pattern Details

Consider common scenarios in distributed systems:

Cache Penetration: Attackers query for non-existent keys to bypass cache and overwhelm your database. Every query hits the database because cache does not know the key does not exist.

URL Deduplication: A web crawler needs to track 10 billion URLs already crawled. Storing all URLs in memory requires ~1TB (100 bytes per URL * 10B). This is prohibitively expensive.

Username Availability: An application with 100M users needs to check if usernames are available during registration. Checking against database every time is slow.

The Core Insight

The fundamental tradeoff:

Exact Set (Hash Set):
✓ No false positives
✓ Can delete elements
✗ 64-800 bits per element (pointers + data)
✗ 100M elements = 800MB - 10GB

Bloom Filter:
✓ 10-20 bits per element
✓ 100M elements = 120MB - 250MB
✗ 1% false positive rate
✗ Cannot delete elements

For most systems, 1% false positives are acceptable if they save 90% of memory and enable keeping the filter entirely in RAM.

Trade-offs

Aspect	Advantage	Disadvantage

Patterns

Horizontal Scaling Pattern

Retry with Backoff Pattern

Queue-based Load Leveling Pattern

Replication Pattern

Caching Strategies Pattern

Fan-out Pattern

Fan-in Pattern

Persistent Connections Pattern

Load Balancing Pattern

Circuit Breaker Pattern