Patterns
35 items
35 items
Probabilistic membership testing for space efficiency
Bloom filters are space-efficient probabilistic data structures that test whether an element is a member of a set. They can definitively say "definitely not in set" but only probabilistically say "probably in set" - meaning false positives are possible but false negatives are not. This tradeoff enables them to use 10-100x less memory than exact data structures like hash sets, making them invaluable for reducing expensive disk lookups, network calls, and memory usage in distributed systems.
Bloom filters trade accuracy for space efficiency. With just 10 bits per element, you can achieve a 1% false positive rate - using 10-100x less memory than storing actual elements in a hash set.
The key property: if bloom filter says NO, the element is definitely not there (no false negatives). If it says YES, the element might be there (false positives possible). This asymmetry is what makes them useful.
Use k independent hash functions (typically 3-10) to set/check k bits in the array. More hash functions reduce false positives but increase computation time and false positive rate if too many.
Consider common scenarios in distributed systems:
Database Query Problem: Before querying a database for a record, you want to know if it exists. Without a bloom filter, you query the database (10-50ms), discover the record does not exist, and wasted an expensive operation.
Cache Penetration: Attackers query for non-existent keys to bypass cache and overwhelm your database. Every query hits the database because cache does not know the key does not exist.
URL Deduplication: A web crawler needs to track 10 billion URLs already crawled. Storing all URLs in memory requires ~1TB (100 bytes per URL * 10B). This is prohibitively expensive.
Username Availability: An application with 100M users needs to check if usernames are available during registration. Checking against database every time is slow.
In all these cases, we often check for items that DO NOT EXIST. A bloom filter lets us definitively rule out non-existent items instantly, only falling back to expensive checks for items that might exist.
The fundamental tradeoff:
Exact Set (Hash Set):
✓ No false positives
✓ Can delete elements
✗ 64-800 bits per element (pointers + data)
✗ 100M elements = 800MB - 10GB
Bloom Filter:
✓ 10-20 bits per element
✓ 100M elements = 120MB - 250MB
✗ 1% false positive rate
✗ Cannot delete elementsFor most systems, 1% false positives are acceptable if they save 90% of memory and enable keeping the filter entirely in RAM.