What is a Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure that tells you whether an element is definitely not in a set or possibly in a set. Invented by Burton Howard Bloom in 1970, it's one of the most elegant solutions to the set membership problem.
Think of it as a bouncer at an exclusive club:
- If the bouncer says "You're NOT on the list" → 100% accurate (you're definitely not getting in)
- If the bouncer says "You might be on the list" → Could be wrong (need to double-check)
This asymmetric behavior makes Bloom filters incredibly useful for eliminating unnecessary work.
Why Should You Care?
Bloom filters are everywhere in modern systems:
- Databases: Skip disk reads for non-existent keys (NexaDB, LevelDB, RocksDB, Cassandra)
- Web Browsers: Check if a URL is malicious (Chrome's Safe Browsing)
- CDNs: Determine if content is cached
- Distributed Systems: Reduce network calls (Akamai, Cloudflare)
- Spell Checkers: Quick dictionary lookup
- Bitcoin: SPV wallet transaction filtering
- Medium: Check if you've already read an article
The Problem Bloom Filters Solve
Imagine you have a database with 100 million keys. A user queries for a key that doesn't exist. Without optimization:
Query: GET user:xyz123
→ Check MemTable (memory): NOT FOUND
→ Check SSTable-1 (disk): NOT FOUND
→ Check SSTable-2 (disk): NOT FOUND
→ Check SSTable-3 (disk): NOT FOUND
→ ... (potentially dozens of disk reads)
→ Finally return: NOT FOUND
Each disk read takes ~10ms. That's hundreds of milliseconds wasted for a key that doesn't exist!
With Bloom filters:
Query: GET user:xyz123
→ Check Bloom filter: DEFINITELY NOT IN SET
→ Return: NOT FOUND (0ms disk I/O!)
NexaDB uses Bloom filters to achieve 95% reduction in unnecessary disk reads.
How Bloom Filters Work
The Basic Structure
A Bloom filter consists of:
- A bit array of size m (initially all zeros)
- k hash functions that map elements to bit positions
Initial state (m=16 bits, all zeros):
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Adding Elements
To add an element, we:
- Pass it through all k hash functions
- Set the resulting bit positions to 1
// Adding "hello" with k=3 hash functions
hash1("hello") % 16 = 3
hash2("hello") % 16 = 7
hash3("hello") % 16 = 11
// Bit array after adding "hello":
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
↑ ↑ ↑
Adding More Elements
// Adding "world"
hash1("world") % 16 = 1
hash2("world") % 16 = 7 // Already set!
hash3("world") % 16 = 14
// Bit array after adding "world":
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
↑ ↑ ↑ ↑ ↑
Notice that position 7 was already set by "hello". This is fine—bits can be set by multiple elements.
Checking Membership
To check if an element exists:
- Hash the element with all k functions
- Check if ALL corresponding bits are 1
// Checking "hello" (was added)
hash1("hello") % 16 = 3 → bit[3] = 1 ✓
hash2("hello") % 16 = 7 → bit[7] = 1 ✓
hash3("hello") % 16 = 11 → bit[11] = 1 ✓
Result: "POSSIBLY IN SET" ✓
// Checking "foo" (never added)
hash1("foo") % 16 = 2 → bit[2] = 0 ✗
Result: "DEFINITELY NOT IN SET" ✓
// Checking "bar" (never added, but false positive!)
hash1("bar") % 16 = 1 → bit[1] = 1 ✓
hash2("bar") % 16 = 3 → bit[3] = 1 ✓
hash3("bar") % 16 = 14 → bit[14] = 1 ✓
Result: "POSSIBLY IN SET" (FALSE POSITIVE!)
Understanding False Positives
Why Do False Positives Happen?
False positives occur when different elements happen to set the same bits. As the filter fills up, the probability of collision increases.
Bits set by different elements overlap:
"hello" sets bits: 3, 7, 11
"world" sets bits: 1, 7, 14
"bar" checks bits: 1, 3, 14 → All happen to be set!
False Positive Probability Formula
The probability of a false positive is:
P(false positive) ≈ (1 - e^(-kn/m))^k
Where:
k = number of hash functions
n = number of elements inserted
m = size of bit array
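The formula is a one-liner in Python. Plugging in the tiny 16-bit example from the walkthrough (k=3, n=2, m=16) next to a realistically sized filter shows how much the parameters matter:

```python
import math

def false_positive_rate(k: int, n: int, m: int) -> float:
    """P(false positive) ~ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

# Tiny filter from the walkthrough: k=3 hashes, n=2 items, m=16 bits
print(round(false_positive_rate(3, 2, 16), 3))              # ~0.031
# A realistically sized filter: k=7, n=1M, m=9,585,059 bits
print(round(false_positive_rate(7, 1_000_000, 9_585_059), 4))  # ~0.01
```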
Key Insight: No False Negatives!
Bloom filters never produce false negatives: adding an element guarantees that all k of its bits are set, so a later lookup for that element can never miss. False positives, caused by other elements collectively setting the same bits, are the only error mode.
If filter says "NOT IN SET" → Element is definitely not there (100%)
If filter says "IN SET" → Element is probably there (1-FPR)
The Mathematics of Bloom Filters
Optimal Number of Hash Functions
For a given m (bit array size) and n (expected elements), the optimal number of hash functions is:
k_optimal = (m/n) × ln(2) ≈ 0.693 × (m/n)
Optimal Bit Array Size
For a desired false positive rate p and n elements:
m_optimal = -n × ln(p) / (ln(2))²
Example: n=1,000,000 elements, p=1% (0.01) false positive rate
m = -1,000,000 × ln(0.01) / (ln(2))²
m = -1,000,000 × (-4.605) / 0.480
m ≈ 9,585,059 bits ≈ 1.14 MB
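You can verify the arithmetic with Python's math module:

```python
import math

n = 1_000_000  # expected elements
p = 0.01       # desired false positive rate

# Optimal bit array size: m = -n * ln(p) / (ln 2)^2
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
# Optimal hash count: k = (m / n) * ln 2
k = round(m / n * math.log(2))

print(m)                                 # 9585059 bits
print(f"{m / 8 / 1024 / 1024:.2f} MB")   # 1.14 MB
print(k)                                 # 7
```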
Bits Per Element
A useful rule of thumb:
bits_per_element = -ln(p) / (ln(2))²
For 1% false positive rate:
bits_per_element ≈ 9.6 bits per element
For 0.1% false positive rate:
bits_per_element ≈ 14.4 bits per element
Quick Reference Table
| False Positive Rate | Bits per Element | Hash Functions |
|---|---|---|
| 50% | 1.44 | 1 |
| 10% | 4.79 | 3 |
| 1% | 9.58 | 7 |
| 0.1% | 14.38 | 10 |
| 0.01% | 19.17 | 13 |
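The table can be regenerated from the two formulas above (note that rounding k to the nearest integer can differ by one from hand-tuned tables, since the true optimum is fractional):

```python
import math

# bits per element = -ln(p) / (ln 2)^2, k = bits_per_element * ln 2
for p in (0.5, 0.1, 0.01, 0.001, 0.0001):
    bits = -math.log(p) / math.log(2) ** 2
    k = max(1, round(bits * math.log(2)))
    print(f"{p:>7}  {bits:6.2f} bits/element  {k:2d} hashes")
```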
Implementation in Python
Here's a complete implementation in Python (it depends on the mmh3 package for MurmurHash3):
import math
import mmh3  # MurmurHash3 - fast, well-distributed (pip install mmh3)


class BloomFilter:
    """
    A space-efficient probabilistic set membership structure.

    Guarantees:
    - No false negatives (if it says "not in set", that is 100% correct)
    - Configurable false positive rate
    """

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        """
        Initialize a Bloom filter.

        Args:
            expected_items: Number of items you plan to add
            false_positive_rate: Desired FPR (default 1%)
        """
        self.expected_items = expected_items
        self.false_positive_rate = false_positive_rate
        # Calculate optimal size
        self.size = self._calculate_size(expected_items, false_positive_rate)
        # Calculate optimal hash count
        self.hash_count = self._calculate_hash_count(self.size, expected_items)
        # Initialize bit array (using bytearray for efficiency)
        self.bit_array = bytearray((self.size + 7) // 8)
        # Track insertions
        self.count = 0

    def _calculate_size(self, n: int, p: float) -> int:
        """Calculate optimal bit array size: m = -n * ln(p) / (ln 2)^2."""
        m = -n * math.log(p) / (math.log(2) ** 2)
        return int(math.ceil(m))

    def _calculate_hash_count(self, m: int, n: int) -> int:
        """Calculate optimal number of hash functions: k = (m/n) * ln 2."""
        k = (m / n) * math.log(2)
        return max(1, int(round(k)))

    def _get_bit(self, index: int) -> bool:
        """Get value of bit at index."""
        byte_index = index // 8
        bit_index = index % 8
        return bool(self.bit_array[byte_index] & (1 << bit_index))

    def _set_bit(self, index: int):
        """Set bit at index to 1."""
        byte_index = index // 8
        bit_index = index % 8
        self.bit_array[byte_index] |= (1 << bit_index)

    def _hash_functions(self, item: str) -> list[int]:
        """
        Generate k hash values using the double hashing technique.

        Uses two hash functions to simulate k functions:
            h_i(x) = (h1(x) + i * h2(x)) mod m

        Kirsch & Mitzenmacher showed this preserves the asymptotic
        false positive rate of k independent hash functions.
        """
        # Get two independent hash values
        h1 = mmh3.hash(item, seed=0) % self.size
        h2 = mmh3.hash(item, seed=42) % self.size
        # Generate k hash values
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str):
        """Add an item to the filter."""
        for index in self._hash_functions(str(item)):
            self._set_bit(index)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        """
        Check if item might be in the filter.

        Returns:
            False: Item is DEFINITELY not in the set
            True: Item is PROBABLY in the set
        """
        return all(self._get_bit(i) for i in self._hash_functions(str(item)))

    def might_contain(self, item: str) -> bool:
        """Alias for __contains__."""
        return item in self

    def definitely_not_contains(self, item: str) -> bool:
        """Return True if item is definitely not in the set."""
        return item not in self

    @property
    def current_false_positive_rate(self) -> float:
        """Calculate current FPR based on items added."""
        if self.count == 0:
            return 0.0
        k = self.hash_count
        n = self.count
        m = self.size
        return (1 - math.exp(-k * n / m)) ** k

    @property
    def memory_usage_bytes(self) -> int:
        """Return memory usage in bytes."""
        return len(self.bit_array)

    def __repr__(self):
        return (
            f"BloomFilter(expected={self.expected_items}, "
            f"fpr={self.false_positive_rate}, size={self.size} bits, "
            f"hashes={self.hash_count}, count={self.count})"
        )
Usage Example
# Create a filter for 1 million items with 1% false positive rate
bloom = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
print(bloom)
# BloomFilter(expected=1000000, fpr=0.01, size=9585059 bits, hashes=7, count=0)
print(f"Memory: {bloom.memory_usage_bytes / 1024 / 1024:.2f} MB")
# Memory: 1.14 MB (vs 100+ MB for a hash set!)
# Add some items
bloom.add("user:12345")
bloom.add("user:67890")
bloom.add("user:11111")
# Check membership
print("user:12345" in bloom) # True (correctly identified)
print("user:99999" in bloom) # False (correctly rejected)
# Check definitely not contains
print(bloom.definitely_not_contains("user:99999")) # True
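You can also measure the false positive rate empirically. This self-contained sketch swaps mmh3 for the stdlib's blake2b so it runs without extra dependencies, but uses the same double-hashing scheme:

```python
import hashlib
import math

class MiniBloom:
    """Minimal Bloom filter using stdlib hashing (illustrative, not optimized)."""

    def __init__(self, n: int, p: float):
        self.size = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        self.k = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Derive two hash values from one 16-byte digest, then double-hash
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big") % self.size
        h2 = int.from_bytes(digest[8:], "big") % self.size
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = MiniBloom(n=10_000, p=0.01)
for i in range(10_000):
    bloom.add(f"key:{i}")

# No false negatives: every inserted key is found
assert all(f"key:{i}" in bloom for i in range(10_000))

# Probe 50,000 keys that were never inserted and count false positives
fp = sum(f"other:{i}" in bloom for i in range(50_000))
print(f"Observed FPR: {fp / 50_000:.4f}")  # close to the configured 0.01
```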
Implementation in JavaScript
class BloomFilter {
  constructor(expectedItems, falsePositiveRate = 0.01) {
    this.expectedItems = expectedItems;
    this.fpr = falsePositiveRate;
    // Calculate optimal parameters
    this.size = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (Math.log(2) ** 2));
    this.hashCount = Math.round((this.size / expectedItems) * Math.log(2));
    // Initialize bit array
    this.bits = new Uint8Array(Math.ceil(this.size / 8));
    this.count = 0;
  }

  // FNV-1a hash function
  _hash(str, seed = 0) {
    let hash = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      hash ^= str.charCodeAt(i);
      hash = Math.imul(hash, 16777619);
    }
    return hash >>> 0; // Convert to unsigned
  }

  // Generate k hash positions using double hashing
  _getPositions(item) {
    const str = String(item);
    const h1 = this._hash(str, 0) % this.size;
    const h2 = this._hash(str, 42) % this.size;
    const positions = [];
    for (let i = 0; i < this.hashCount; i++) {
      positions.push((h1 + i * h2) % this.size);
    }
    return positions;
  }

  add(item) {
    const positions = this._getPositions(item);
    for (const pos of positions) {
      const byteIndex = Math.floor(pos / 8);
      const bitIndex = pos % 8;
      this.bits[byteIndex] |= (1 << bitIndex);
    }
    this.count++;
  }

  mightContain(item) {
    const positions = this._getPositions(item);
    for (const pos of positions) {
      const byteIndex = Math.floor(pos / 8);
      const bitIndex = pos % 8;
      if (!(this.bits[byteIndex] & (1 << bitIndex))) {
        return false; // Definitely not in set
      }
    }
    return true; // Possibly in set
  }

  definitelyNotContains(item) {
    return !this.mightContain(item);
  }
}
// Usage
const bloom = new BloomFilter(1_000_000, 0.01);
bloom.add("hello");
bloom.add("world");
console.log(bloom.mightContain("hello")); // true
console.log(bloom.mightContain("foo")); // false
Real-World Applications
1. Database Query Optimization (NexaDB)
class SSTable:
    """A sorted string table with Bloom filter optimization."""

    def __init__(self, filepath):
        self.filepath = filepath
        self.bloom_filter = BloomFilter(expected_items=100_000, false_positive_rate=0.01)
        self.index = {}

    def write(self, data: dict):
        """Write key-value pairs to the SSTable."""
        with open(self.filepath, 'wb') as f:
            for key, value in sorted(data.items()):
                # Add to bloom filter
                self.bloom_filter.add(key)
                # Write to file and record offset
                offset = f.tell()
                f.write(self._serialize(key, value))
                self.index[key] = offset

    def get(self, key: str):
        """Get value by key, using the Bloom filter to avoid unnecessary reads."""
        # Fast check with Bloom filter
        if self.bloom_filter.definitely_not_contains(key):
            return None  # Key definitely not here - skip disk read!
        # Bloom filter says "maybe" - need to check disk
        if key in self.index:
            return self._read_from_disk(self.index[key])
        return None

# This saves ~95% of disk reads for non-existent keys!
2. Web Cache Optimization
class CacheSystem:
    def __init__(self):
        self.bloom = BloomFilter(expected_items=10_000_000, false_positive_rate=0.001)
        self.cache = {}       # Actual cache (could be Redis)
        self.db = Database()  # Slow database

    def get(self, key: str):
        # Check bloom filter first
        if self.bloom.definitely_not_contains(key):
            return None  # Don't even check cache or DB
        # Might exist - check cache
        if key in self.cache:
            return self.cache[key]
        # Not in cache - check database
        value = self.db.get(key)
        if value:
            self.cache[key] = value
            return value
        return None

    def put(self, key: str, value):
        self.bloom.add(key)  # Track in filter
        self.cache[key] = value
        self.db.put(key, value)
3. Duplicate Detection
class WebCrawler:
    def __init__(self):
        # Track visited URLs without storing all of them
        self.visited = BloomFilter(
            expected_items=100_000_000,  # 100M URLs
            false_positive_rate=0.001    # 0.1% FPR
        )
        # Only ~180MB of memory for 100M URLs!

    def crawl(self, url: str):
        if url in self.visited:
            return  # Probably already visited (or false positive)
        self.visited.add(url)
        content = self.fetch(url)
        self.process(content)
        for link in self.extract_links(content):
            self.crawl(link)
4. Spell Checker
class SpellChecker:
    def __init__(self, dictionary_file: str):
        self.bloom = BloomFilter(expected_items=500_000, false_positive_rate=0.01)
        # Load dictionary into bloom filter
        with open(dictionary_file) as f:
            for word in f:
                self.bloom.add(word.strip().lower())

    def check(self, word: str) -> bool:
        """
        Return True if the word might be correctly spelled,
        False if it is definitely misspelled.
        """
        return word.lower() in self.bloom

    def check_document(self, text: str) -> list[str]:
        """Return a list of potentially misspelled words."""
        words = text.split()
        return [w for w in words if not self.check(w)]
5. Bitcoin SPV Wallets
class SPVWallet:
    """Simplified Payment Verification wallet using Bloom filters."""

    def __init__(self, addresses: list[str]):
        # Create filter for our addresses
        self.bloom = BloomFilter(
            expected_items=len(addresses) * 2,  # Account for growth
            false_positive_rate=0.0001          # Very low FPR for security
        )
        for addr in addresses:
            self.bloom.add(addr)

    def create_filter_message(self) -> bytes:
        """
        Send to full nodes to filter transactions.

        Full nodes only send transactions that match the filter.
        This provides privacy - a node doesn't know exactly which
        addresses are ours due to false positives.
        """
        return bytes(self.bloom.bit_array)

    def is_relevant_transaction(self, tx) -> bool:
        """Check if a transaction involves our addresses."""
        for output in tx.outputs:
            if output.address in self.bloom:
                return True
        return False
Advanced Techniques
1. Counting Bloom Filters
Standard Bloom filters don't support deletion. Counting Bloom filters use counters instead of bits:
class CountingBloomFilter:
    # Sizing and hashing helpers (_calculate_size, _calculate_hash_count,
    # _hash_functions) are the same as in BloomFilter above.

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        self.size = self._calculate_size(expected_items, false_positive_rate)
        self.hash_count = self._calculate_hash_count(self.size, expected_items)
        self.counters = [0] * self.size  # Counters instead of bits

    def add(self, item: str):
        for i in self._hash_functions(str(item)):
            self.counters[i] += 1

    def remove(self, item: str):
        """Remove an item (not possible with a standard Bloom filter!)."""
        if item not in self:
            return  # Don't decrement if not present
        for i in self._hash_functions(str(item)):
            self.counters[i] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._hash_functions(str(item)))
2. Scalable Bloom Filters
When you don't know how many items you'll add:
class ScalableBloomFilter:
    """Dynamically grows as items are added."""

    def __init__(self, initial_capacity: int = 1000,
                 false_positive_rate: float = 0.01,
                 growth_factor: int = 2):
        self.fpr = false_positive_rate
        self.growth_factor = growth_factor
        self.filters = [BloomFilter(initial_capacity, false_positive_rate)]

    def add(self, item: str):
        # Check if the current filter is at capacity
        current = self.filters[-1]
        if current.count >= current.expected_items:
            # Create new filter with tighter FPR to maintain overall FPR
            new_fpr = self.fpr * (0.5 ** len(self.filters))
            new_capacity = current.expected_items * self.growth_factor
            self.filters.append(BloomFilter(new_capacity, new_fpr))
        self.filters[-1].add(item)

    def __contains__(self, item: str) -> bool:
        return any(item in f for f in self.filters)
3. Cuckoo Filters (Alternative)
Cuckoo filters offer some advantages over Bloom filters:
| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | No (without counting) | Yes |
| Lookup | O(k) | O(1) |
| Space | ~1.44 log₂(1/ε) bits | ~1.05 log₂(1/ε) bits |
| Cache-friendliness | Good | Better |
# Simplified Cuckoo filter concept (uses Python's built-in hash for brevity;
# a real implementation needs a stable hash - see Pitfall 2 below)
class SimpleCuckooFilter:
    def __init__(self, size: int):
        self.buckets = [[] for _ in range(size)]
        self.bucket_size = 4
        self.size = size

    def fingerprint(self, item: str) -> int:
        return hash(item) & 0xFF  # 8-bit fingerprint

    def add(self, item: str) -> bool:
        fp = self.fingerprint(item)
        i1 = hash(item) % self.size
        i2 = (i1 ^ hash(str(fp))) % self.size
        # Try to insert into either bucket
        if len(self.buckets[i1]) < self.bucket_size:
            self.buckets[i1].append(fp)
            return True
        if len(self.buckets[i2]) < self.bucket_size:
            self.buckets[i2].append(fp)
            return True
        # Kick out an existing item and retry (cuckoo hashing)
        # ... (implementation continues)
        return False
Performance Comparison
Here's how Bloom filters compare to alternatives:
Dataset: 10 million strings (average 32 bytes each)
Structure | Memory | Insert | Lookup | Delete
------------------|---------|---------|---------|--------
HashSet | 763 MB | O(1) | O(1) | O(1)
TreeSet | 851 MB | O(log n)| O(log n)| O(log n)
Bloom (1% FPR) | 11.4 MB | O(k) | O(k) | No
Bloom (0.1% FPR) | 17.2 MB | O(k) | O(k) | No
Counting Bloom | 45.6 MB | O(k) | O(k) | O(k)
Cuckoo Filter | 10.5 MB | O(1)* | O(1) | O(1)
* Cuckoo insert can fail if filter is too full
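The Bloom filter rows in this table follow directly from the bits-per-element formula; a quick sanity check in Python (assuming 4-bit counters for the counting variant):

```python
import math

n = 10_000_000  # 10 million items

def bloom_mb(p: float) -> float:
    """Theoretical Bloom filter size in MB for n items at FPR p."""
    bits = -n * math.log(p) / math.log(2) ** 2
    return bits / 8 / 1024 / 1024

print(f"Bloom 1% FPR:   {bloom_mb(0.01):.1f} MB")      # ~11.4 MB
print(f"Bloom 0.1% FPR: {bloom_mb(0.001):.1f} MB")     # ~17.1 MB
# Counting Bloom with 4-bit counters = 4x the 1% filter
print(f"Counting Bloom: {bloom_mb(0.01) * 4:.1f} MB")  # ~45.7 MB
```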
NexaDB's Bloom Filter Implementation
NexaDB uses Bloom filters extensively in its LSM-Tree storage engine:
# How NexaDB uses Bloom filters (simplified)
class NexaDBStorage:
    def __init__(self):
        self.memtable = {}  # In-memory writes
        self.sstables = []  # Sorted string tables on disk

    def get(self, key: str):
        # 1. Check memtable (always in memory)
        if key in self.memtable:
            return self.memtable[key]
        # 2. Check SSTables with Bloom filter optimization
        for sstable in reversed(self.sstables):  # Newest first
            # Bloom filter check: O(k) memory operation
            if sstable.bloom_filter.definitely_not_contains(key):
                continue  # Skip this SSTable - key not here
            # Bloom filter says "maybe" - do disk read
            value = sstable.read_from_disk(key)
            if value is not None:
                return value
        return None  # Key doesn't exist

# Performance impact:
# - Without Bloom: ~50ms per SSTable check (disk read)
# - With Bloom: ~0.001ms per SSTable check (memory)
# - 95% of non-existent key lookups avoid disk entirely!
Common Pitfalls and Best Practices
Pitfall 1: Underestimating Capacity
# BAD: Underestimating leads to a high false positive rate
bloom = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(100_000):  # Adding 100x more items!
    bloom.add(str(i))
print(bloom.current_false_positive_rate)  # Nearly 1.0 - the filter is saturated!

# GOOD: Use ScalableBloomFilter or estimate generously
bloom = ScalableBloomFilter(initial_capacity=1000, false_positive_rate=0.01)
# Or
bloom = BloomFilter(expected_items=100_000, false_positive_rate=0.01)
Pitfall 2: Wrong Hash Function
# BAD: Using Python's built-in hash (not consistent across runs!)
def bad_hash(item, seed):
    return hash((item, seed))  # Changes between Python sessions!

# GOOD: Use a consistent hash like MurmurHash or xxHash
import mmh3

def good_hash(item, seed):
    return mmh3.hash(item, seed)
Pitfall 3: Ignoring False Positive Rate Trade-offs
# Very low FPR = more memory
bloom_paranoid = BloomFilter(1_000_000, 0.0001)  # ~2.4 MB
bloom_practical = BloomFilter(1_000_000, 0.01)   # ~1.2 MB

# For most applications, 1% FPR is fine
# Only use a lower FPR when false positives are very expensive
Best Practice: Always Verify Positive Results
def safe_lookup(key: str, bloom: BloomFilter, actual_store: dict):
    if bloom.definitely_not_contains(key):
        return None  # Trust the negative!
    # Bloom says "maybe" - ALWAYS verify with actual data
    return actual_store.get(key)  # Could be None (false positive)
Conclusion
Bloom filters are an elegant solution for space-efficient set membership testing. Key takeaways:
- Space efficiency: 10 bits per item for 1% false positive rate
- No false negatives: "Not in set" is always correct
- Tunable accuracy: Trade memory for lower false positive rate
- Wide applications: Databases, caching, networking, blockchain
The mathematics is beautiful, the implementation is straightforward, and the performance gains are dramatic. Whether you're building a database, a web crawler, or a spam filter, Bloom filters deserve a place in your toolkit.
NexaDB uses Bloom filters to achieve 95% reduction in disk reads. Try it yourself and see the performance difference.
Further Reading
- Original Paper by Burton Bloom (1970)
- Network Applications of Bloom Filters: A Survey
- Less Hashing, Same Performance - Kirsch & Mitzenmacher
Have questions about Bloom filters or database internals? Join our Discord community or check out the NexaDB source code.