SystemExpertsSystemExperts
Pricing

Patterns

35 items

Horizontal Scaling Pattern

15mbeginner

Retry with Backoff Pattern

15mbeginner

Queue-based Load Leveling Pattern

20mintermediate

Replication Pattern

25mintermediate

Caching Strategies Pattern

25mintermediate

Fan-out Pattern

20mintermediate

Fan-in Pattern

20mintermediate

Persistent Connections Pattern

20mintermediate

Load Balancing Pattern

20mintermediate

Circuit Breaker Pattern

20mintermediate

Bloom Filters Pattern

20mintermediate

Time-Series Storage Pattern

20mintermediate

Bulkhead Pattern

20mintermediate

Batch Processing Pattern

20mintermediate

Write-Ahead Log Pattern

20mintermediate

API Gateway Pattern

20mintermediate

Backend for Frontend Pattern

20mintermediate

Sidecar Pattern

20mintermediate

Idempotency Pattern

20mintermediate

Rate Limiting Pattern

20mintermediate

Backpressure Pattern

20mintermediate

Pub/Sub Pattern

25mintermediate

Eventual Consistency Pattern

25mintermediate

Sharding Pattern

25madvanced

Conflict Resolution Pattern

25madvanced

Strong Consistency Pattern

30madvanced

Leader Election Pattern

25madvanced

Consensus Protocols Pattern

30madvanced

Stream Processing Pattern

25madvanced

Change Data Capture Pattern

25madvanced

Distributed Locking Pattern

25madvanced

Two-Phase Commit Pattern

25madvanced

LSM Trees Pattern

25madvanced

Event Sourcing Pattern

30madvanced

CQRS Pattern

28madvanced
System Design Pattern
Storagebloom-filterprobabilisticmembershipspace-efficientfalse-positiveintermediate

Bloom Filters Pattern

Probabilistic membership testing for space efficiency

Used in: Databases, Caches, Distributed Systems|20 min read

Summary

Bloom filters are space-efficient probabilistic data structures that test whether an element is a member of a set. They can definitively say "definitely not in set" but only probabilistically say "probably in set" - meaning false positives are possible but false negatives are not. This tradeoff enables them to use 10-100x less memory than exact data structures like hash sets, making them invaluable for reducing expensive disk lookups, network calls, and memory usage in distributed systems.

Key Takeaways

Space-Time Tradeoff

Bloom filters trade accuracy for space efficiency. With just 10 bits per element, you can achieve a 1% false positive rate - using 10-100x less memory than storing actual elements in a hash set.

Asymmetric Guarantees

The key property: if bloom filter says NO, the element is definitely not there (no false negatives). If it says YES, the element might be there (false positives possible). This asymmetry is what makes them useful.

Multiple Hash Functions

Use k independent hash functions (typically 3-10) to set/check k bits in the array. More hash functions reduce false positives but increase computation time and false positive rate if too many.

Consider common scenarios in distributed systems:

Database Query Problem: Before querying a database for a record, you want to know if it exists. Without a bloom filter, you query the database (10-50ms), discover the record does not exist, and wasted an expensive operation.

Cache Penetration: Attackers query for non-existent keys to bypass cache and overwhelm your database. Every query hits the database because cache does not know the key does not exist.

URL Deduplication: A web crawler needs to track 10 billion URLs already crawled. Storing all URLs in memory requires ~1TB (100 bytes per URL * 10B). This is prohibitively expensive.

Username Availability: An application with 100M users needs to check if usernames are available during registration. Checking against database every time is slow.

The Core Insight

In all these cases, we often check for items that DO NOT EXIST. A bloom filter lets us definitively rule out non-existent items instantly, only falling back to expensive checks for items that might exist.

The fundamental tradeoff:

Exact Set (Hash Set):
✓ No false positives
✓ Can delete elements
✗ 64-800 bits per element (pointers + data)
✗ 100M elements = 800MB - 10GB

Bloom Filter:
✓ 10-20 bits per element
✓ 100M elements = 120MB - 250MB
✗ 1% false positive rate
✗ Cannot delete elements

For most systems, 1% false positives are acceptable if they save 90% of memory and enable keeping the filter entirely in RAM.

Summary

Bloom filters are space-efficient probabilistic data structures that test whether an element is a member of a set. They can definitively say "definitely not in set" but only probabilistically say "probably in set" - meaning false positives are possible but false negatives are not. This tradeoff enables them to use 10-100x less memory than exact data structures like hash sets, making them invaluable for reducing expensive disk lookups, network calls, and memory usage in distributed systems.

Key Takeaways

Space-Time Tradeoff

Bloom filters trade accuracy for space efficiency. With just 10 bits per element, you can achieve a 1% false positive rate - using 10-100x less memory than storing actual elements in a hash set.

Asymmetric Guarantees

The key property: if bloom filter says NO, the element is definitely not there (no false negatives). If it says YES, the element might be there (false positives possible). This asymmetry is what makes them useful.

Multiple Hash Functions

Use k independent hash functions (typically 3-10) to set/check k bits in the array. More hash functions reduce false positives but increase computation time and false positive rate if too many.

Perfect for Negative Caching

Ideal for avoiding expensive operations on non-existent keys: checking if a username is taken, if a URL has been crawled, if a record exists before querying disk.

Immutable After Construction

Cannot remove elements from a standard bloom filter. Deletions would cause false negatives. Use counting bloom filters if deletions are needed, but they use 4x more space.

False Positive Rate

FP rate = (1 - e^(-kn/m))^k where k=hash functions, n=elements, m=bits. With 10 bits per element and optimal k, FP rate is ~1%. This is tunable based on your requirements.

Pattern Details

Consider common scenarios in distributed systems:

Database Query Problem: Before querying a database for a record, you want to know if it exists. Without a bloom filter, you query the database (10-50ms), discover the record does not exist, and wasted an expensive operation.

Cache Penetration: Attackers query for non-existent keys to bypass cache and overwhelm your database. Every query hits the database because cache does not know the key does not exist.

URL Deduplication: A web crawler needs to track 10 billion URLs already crawled. Storing all URLs in memory requires ~1TB (100 bytes per URL * 10B). This is prohibitively expensive.

Username Availability: An application with 100M users needs to check if usernames are available during registration. Checking against database every time is slow.

The Core Insight

In all these cases, we often check for items that DO NOT EXIST. A bloom filter lets us definitively rule out non-existent items instantly, only falling back to expensive checks for items that might exist.

The fundamental tradeoff:

Exact Set (Hash Set):
✓ No false positives
✓ Can delete elements
✗ 64-800 bits per element (pointers + data)
✗ 100M elements = 800MB - 10GB

Bloom Filter:
✓ 10-20 bits per element
✓ 100M elements = 120MB - 250MB
✗ 1% false positive rate
✗ Cannot delete elements

For most systems, 1% false positives are acceptable if they save 90% of memory and enable keeping the filter entirely in RAM.

Trade-offs

AspectAdvantageDisadvantage

Premium Content

Sign in to access this content or upgrade for full access.