Architecture · 25 min read

Bloom Filters: The Complete Guide to Probabilistic Data Structures

Master Bloom filters from theory to implementation. Learn the mathematics, understand false positives, see real-world applications in databases, caching, and distributed systems. Includes code examples and optimization techniques.

Krish
December 4, 2025

What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that tells you whether an element is definitely not in a set or possibly in a set. Invented by Burton Howard Bloom in 1970, it's one of the most elegant solutions to the set membership problem.

Think of it as a bouncer at an exclusive club:

  • If the bouncer says "You're NOT on the list" → 100% accurate (you're definitely not getting in)
  • If the bouncer says "You might be on the list" → Could be wrong (need to double-check)

This asymmetric behavior makes Bloom filters incredibly useful for eliminating unnecessary work.

Why Should You Care?

Bloom filters are everywhere in modern systems:

  • Databases: Skip disk reads for non-existent keys (NexaDB, LevelDB, RocksDB, Cassandra)
  • Web Browsers: Check if a URL is malicious (Chrome's Safe Browsing)
  • CDNs: Determine if content is cached
  • Distributed Systems: Reduce network calls (Akamai, CloudFlare)
  • Spell Checkers: Quick dictionary lookup
  • Bitcoin: SPV wallet transaction filtering
  • Medium: Check if you've already read an article

The Problem Bloom Filters Solve

Imagine you have a database with 100 million keys. A user queries for a key that doesn't exist. Without optimization:

Query: GET user:xyz123
  → Check MemTable (memory): NOT FOUND
  → Check SSTable-1 (disk): NOT FOUND
  → Check SSTable-2 (disk): NOT FOUND
  → Check SSTable-3 (disk): NOT FOUND
  → ... (potentially dozens of disk reads)
  → Finally return: NOT FOUND

Each disk read takes ~10ms. That's hundreds of milliseconds wasted for a key that doesn't exist!

With Bloom filters:

Query: GET user:xyz123
  → Check Bloom filter: DEFINITELY NOT IN SET
  → Return: NOT FOUND (0ms disk I/O!)

NexaDB uses Bloom filters to achieve 95% reduction in unnecessary disk reads.

How Bloom Filters Work

The Basic Structure

A Bloom filter consists of:

  1. A bit array of size m (initially all zeros)
  2. k hash functions that map elements to bit positions
Initial state (m=16 bits, all zeros):
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
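
To make the diagrams below concrete, here is a minimal toy version of this structure in Python (m=16 bits, k=3). It derives positions from seeded SHA-256 hashes, so the bits it actually sets will differ from the illustrative positions used in the walkthrough:

import hashlib

M = 16  # bit array size
K = 3   # number of hash functions

bits = [0] * M  # initially all zeros

def positions(item: str) -> list[int]:
    """Map an item to K bit positions using seeded hashes."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest(), "big") % M
        for seed in range(K)
    ]

def add(item: str) -> None:
    for p in positions(item):
        bits[p] = 1

def might_contain(item: str) -> bool:
    return all(bits[p] for p in positions(item))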

Adding Elements

To add an element, we:

  1. Pass it through all k hash functions
  2. Set the resulting bit positions to 1
// Adding "hello" with k=3 hash functions
hash1("hello") % 16 = 3
hash2("hello") % 16 = 7
hash3("hello") % 16 = 11

// Bit array after adding "hello":
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
              ↑               ↑               ↑

Adding More Elements

// Adding "world"
hash1("world") % 16 = 1
hash2("world") % 16 = 7  // Already set!
hash3("world") % 16 = 14

// Bit array after adding "world":
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │ 1 │ 0 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
      ↑       ↑               ↑               ↑           ↑

Notice that position 7 was already set by "hello". This is fine—bits can be set by multiple elements.

Checking Membership

To check if an element exists:

  1. Hash the element with all k functions
  2. Check if ALL corresponding bits are 1
// Checking "hello" (was added)
hash1("hello") % 16 = 3  → bit[3] = 1 ✓
hash2("hello") % 16 = 7  → bit[7] = 1 ✓
hash3("hello") % 16 = 11 → bit[11] = 1 ✓
Result: "POSSIBLY IN SET" ✓

// Checking "foo" (never added)
hash1("foo") % 16 = 2  → bit[2] = 0 ✗
Result: "DEFINITELY NOT IN SET" ✓

// Checking "bar" (never added, but false positive!)
hash1("bar") % 16 = 1  → bit[1] = 1 ✓
hash2("bar") % 16 = 3  → bit[3] = 1 ✓
hash3("bar") % 16 = 14 → bit[14] = 1 ✓
Result: "POSSIBLY IN SET" (FALSE POSITIVE!)

Understanding False Positives

Why Do False Positives Happen?

False positives occur when different elements happen to set the same bits. As the filter fills up, the probability of collision increases.

Bits set by different elements overlap:
         "hello" sets bits: 3, 7, 11
         "world" sets bits: 1, 7, 14
         "bar" checks bits: 1, 3, 14 → All happen to be set!

False Positive Probability Formula

The probability of a false positive is:

P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
  k = number of hash functions
  n = number of elements inserted
  m = size of bit array
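
A sketch of this formula in code, evaluated for a filter sized for one million items at a 1% target rate (the specific m and k values are just an example; the sizing formulas are covered in the next section):

import math

def false_positive_rate(k: int, n: int, m: int) -> float:
    """Approximate FPR after inserting n items into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# Example: 1,000,000 items in a filter of 9,585,059 bits with 7 hash functions
print(f"{false_positive_rate(k=7, n=1_000_000, m=9_585_059):.4f}")  # ≈ 0.0100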

Key Insight: No False Negatives!

Bloom filters never produce false negatives. Once an element has been added, its k bits are set, and bits are never cleared, so a later lookup for that element will always find every one of its bits set and report "possibly in set".

If filter says "NOT IN SET" → Element is definitely not there (100%)
If filter says "IN SET"     → Element is probably there (1-FPR)
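
As a sanity check, the sketch below (using the BloomFilter class implemented later in this post) adds many items and confirms that every one of them is reported as present; a false negative would make the assertion fail:

bloom = BloomFilter(expected_items=10_000, false_positive_rate=0.01)

items = [f"user:{i}" for i in range(10_000)]
for item in items:
    bloom.add(item)

# No false negatives: every inserted item must come back as "possibly in set"
assert all(item in bloom for item in items)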

The Mathematics of Bloom Filters

Optimal Number of Hash Functions

For a given m (bit array size) and n (expected elements), the optimal number of hash functions is:

k_optimal = (m/n) × ln(2) ≈ 0.693 × (m/n)

Optimal Bit Array Size

For a desired false positive rate p and n elements:

m_optimal = -n × ln(p) / (ln(2))²

Example: n=1,000,000 elements, p=1% (0.01) false positive rate
m = -1,000,000 × ln(0.01) / (ln(2))²
m = -1,000,000 × (-4.605) / 0.480
m ≈ 9,585,059 bits ≈ 1.14 MB
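
Both formulas are easy to wrap in a helper; here is a quick sketch that reproduces the worked example above:

import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hash functions) for n expected items and target FPR p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)                               # 9585059 bits, 7 hash functions
print(f"{m / 8 / 1024 / 1024:.2f} MB")    # ≈ 1.14 MB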

Bits Per Element

A useful rule of thumb:

bits_per_element = -ln(p) / (ln(2))²

For 1% false positive rate:
bits_per_element ≈ 9.6 bits per element

For 0.1% false positive rate:
bits_per_element ≈ 14.4 bits per element

Quick Reference Table

False Positive Rate | Bits per Element | Hash Functions
--------------------|------------------|---------------
50%                 | 1.44             | 1
10%                 | 4.79             | 3
1%                  | 9.58             | 7
0.1%                | 14.38            | 10
0.01%               | 19.17            | 13
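
The rows in this table follow directly from the formulas above; a short sketch that regenerates them:

import math

for p in (0.5, 0.1, 0.01, 0.001, 0.0001):
    bits = -math.log(p) / (math.log(2) ** 2)
    k = max(1, round(bits * math.log(2)))
    print(f"{p:>8.2%} | {bits:5.2f} bits/element | k = {k}")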

Implementation in Python

Here's a complete implementation. It uses MurmurHash3 via the mmh3 package (pip install mmh3):

import math
import mmh3  # MurmurHash3 - fast, well-distributed

class BloomFilter:
    """
    A space-efficient probabilistic set membership structure.

    Guarantees:
    - No false negatives (if says "not in set", 100% correct)
    - Configurable false positive rate
    """

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        """
        Initialize a Bloom filter.

        Args:
            expected_items: Number of items you plan to add
            false_positive_rate: Desired FPR (default 1%)
        """
        self.expected_items = expected_items
        self.false_positive_rate = false_positive_rate

        # Calculate optimal size
        self.size = self._calculate_size(expected_items, false_positive_rate)

        # Calculate optimal hash count
        self.hash_count = self._calculate_hash_count(self.size, expected_items)

        # Initialize bit array (using bytearray for efficiency)
        self.bit_array = bytearray((self.size + 7) // 8)

        # Track insertions
        self.count = 0

    def _calculate_size(self, n: int, p: float) -> int:
        """Calculate optimal bit array size."""
        m = -n * math.log(p) / (math.log(2) ** 2)
        return int(math.ceil(m))

    def _calculate_hash_count(self, m: int, n: int) -> int:
        """Calculate optimal number of hash functions."""
        k = (m / n) * math.log(2)
        return max(1, int(round(k)))

    def _get_bit(self, index: int) -> bool:
        """Get value of bit at index."""
        byte_index = index // 8
        bit_index = index % 8
        return bool(self.bit_array[byte_index] & (1 << bit_index))

    def _set_bit(self, index: int):
        """Set bit at index to 1."""
        byte_index = index // 8
        bit_index = index % 8
        self.bit_array[byte_index] |= (1 << bit_index)

    def _hash_functions(self, item: str) -> list[int]:
        """
        Generate k hash values using double hashing technique.

        Uses two hash functions to simulate k functions:
        h_i(x) = (h1(x) + i * h2(x)) mod m

        Kirsch and Mitzenmacher showed this preserves the asymptotic
        false positive rate of k truly independent hash functions.
        """
        # Get two independent hash values
        h1 = mmh3.hash(item, seed=0) % self.size
        h2 = mmh3.hash(item, seed=42) % self.size

        # Generate k hash values
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str):
        """Add an item to the filter."""
        for index in self._hash_functions(str(item)):
            self._set_bit(index)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        """
        Check if item might be in the filter.

        Returns:
            False: Item is DEFINITELY not in the set
            True: Item is PROBABLY in the set
        """
        return all(self._get_bit(i) for i in self._hash_functions(str(item)))

    def might_contain(self, item: str) -> bool:
        """Alias for __contains__."""
        return item in self

    def definitely_not_contains(self, item: str) -> bool:
        """Returns True if item is definitely not in the set."""
        return not (item in self)

    @property
    def current_false_positive_rate(self) -> float:
        """Calculate current FPR based on items added."""
        if self.count == 0:
            return 0.0
        k = self.hash_count
        n = self.count
        m = self.size
        return (1 - math.exp(-k * n / m)) ** k

    @property
    def memory_usage_bytes(self) -> int:
        """Return memory usage in bytes."""
        return len(self.bit_array)

    def __repr__(self):
        return (
            f"BloomFilter(expected={self.expected_items}, "
            f"fpr={self.false_positive_rate}, size={self.size} bits, "
            f"hashes={self.hash_count}, count={self.count})"
        )

Usage Example

# Create a filter for 1 million items with 1% false positive rate
bloom = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)

print(bloom)
# BloomFilter(expected=1000000, fpr=0.01, size=9585059 bits, hashes=7, count=0)

print(f"Memory: {bloom.memory_usage_bytes / 1024 / 1024:.2f} MB")
# Memory: 1.14 MB (vs 100+ MB for a hash set!)

# Add some items
bloom.add("user:12345")
bloom.add("user:67890")
bloom.add("user:11111")

# Check membership
print("user:12345" in bloom)  # True (correctly identified)
print("user:99999" in bloom)  # False (correctly rejected)

# Check definitely not contains
print(bloom.definitely_not_contains("user:99999"))  # True

Implementation in JavaScript

class BloomFilter {
  constructor(expectedItems, falsePositiveRate = 0.01) {
    this.expectedItems = expectedItems;
    this.fpr = falsePositiveRate;

    // Calculate optimal parameters
    this.size = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (Math.log(2) ** 2));
    this.hashCount = Math.round((this.size / expectedItems) * Math.log(2));

    // Initialize bit array
    this.bits = new Uint8Array(Math.ceil(this.size / 8));
    this.count = 0;
  }

  // FNV-1a hash function
  _hash(str, seed = 0) {
    let hash = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      hash ^= str.charCodeAt(i);
      hash = Math.imul(hash, 16777619);
    }
    return hash >>> 0;  // Convert to unsigned
  }

  // Generate k hash positions using double hashing
  _getPositions(item) {
    const str = String(item);
    const h1 = this._hash(str, 0) % this.size;
    const h2 = this._hash(str, 42) % this.size;

    const positions = [];
    for (let i = 0; i < this.hashCount; i++) {
      positions.push(Math.abs((h1 + i * h2) % this.size));
    }
    return positions;
  }

  add(item) {
    const positions = this._getPositions(item);
    for (const pos of positions) {
      const byteIndex = Math.floor(pos / 8);
      const bitIndex = pos % 8;
      this.bits[byteIndex] |= (1 << bitIndex);
    }
    this.count++;
  }

  mightContain(item) {
    const positions = this._getPositions(item);
    for (const pos of positions) {
      const byteIndex = Math.floor(pos / 8);
      const bitIndex = pos % 8;
      if (!(this.bits[byteIndex] & (1 << bitIndex))) {
        return false;  // Definitely not in set
      }
    }
    return true;  // Possibly in set
  }

  definitelyNotContains(item) {
    return !this.mightContain(item);
  }
}

// Usage
const bloom = new BloomFilter(1_000_000, 0.01);
bloom.add("hello");
bloom.add("world");

console.log(bloom.mightContain("hello"));  // true
console.log(bloom.mightContain("foo"));    // false

Real-World Applications

1. Database Query Optimization (NexaDB)

class SSTable:
    """A sorted string table with Bloom filter optimization."""

    def __init__(self, filepath):
        self.filepath = filepath
        self.bloom_filter = BloomFilter(expected_items=100_000, false_positive_rate=0.01)
        self.index = {}

    def write(self, data: dict):
        """Write key-value pairs to SSTable."""
        with open(self.filepath, 'wb') as f:
            for key, value in sorted(data.items()):
                # Add to bloom filter
                self.bloom_filter.add(key)

                # Write to file and record offset
                offset = f.tell()
                f.write(self._serialize(key, value))
                self.index[key] = offset

    def get(self, key: str):
        """Get value by key, using Bloom filter to avoid unnecessary reads."""
        # Fast check with Bloom filter
        if self.bloom_filter.definitely_not_contains(key):
            return None  # Key definitely not here - skip disk read!

        # Bloom filter says "maybe" - need to check disk
        if key in self.index:
            return self._read_from_disk(self.index[key])

        return None

# This saves ~95% of disk reads for non-existent keys!

2. Web Cache Optimization

class CacheSystem:
    def __init__(self):
        self.bloom = BloomFilter(expected_items=10_000_000, false_positive_rate=0.001)
        self.cache = {}  # Actual cache (could be Redis)
        self.db = Database()  # Slow database

    def get(self, key: str):
        # Check bloom filter first
        if self.bloom.definitely_not_contains(key):
            return None  # Don't even check cache or DB

        # Might exist - check cache
        if key in self.cache:
            return self.cache[key]

        # Not in cache - check database
        value = self.db.get(key)
        if value:
            self.cache[key] = value
            return value

        return None

    def put(self, key: str, value):
        self.bloom.add(key)  # Track in filter
        self.cache[key] = value
        self.db.put(key, value)

3. Duplicate Detection

class WebCrawler:
    def __init__(self):
        # Track visited URLs without storing all of them
        self.visited = BloomFilter(
            expected_items=100_000_000,  # 100M URLs
            false_positive_rate=0.001    # 0.1% FPR
        )
        # Only ~180MB of memory for 100M URLs!

    def crawl(self, url: str):
        if url in self.visited:
            return  # Probably already visited (or false positive)

        self.visited.add(url)
        content = self.fetch(url)
        self.process(content)

        for link in self.extract_links(content):
            self.crawl(link)

4. Spell Checker

class SpellChecker:
    def __init__(self, dictionary_file: str):
        self.bloom = BloomFilter(expected_items=500_000, false_positive_rate=0.01)

        # Load dictionary into bloom filter
        with open(dictionary_file) as f:
            for word in f:
                self.bloom.add(word.strip().lower())

    def check(self, word: str) -> bool:
        """
        Returns True if word might be correctly spelled.
        False if definitely misspelled.
        """
        return word.lower() in self.bloom

    def check_document(self, text: str) -> list[str]:
        """Return list of potentially misspelled words."""
        words = text.split()
        return [w for w in words if not self.check(w)]

5. Bitcoin SPV Wallets

class SPVWallet:
    """Simplified Payment Verification wallet using Bloom filters."""

    def __init__(self, addresses: list[str]):
        # Create filter for our addresses
        self.bloom = BloomFilter(
            expected_items=len(addresses) * 2,  # Account for growth
            false_positive_rate=0.0001  # Very low FPR for security
        )

        for addr in addresses:
            self.bloom.add(addr)

    def create_filter_message(self) -> bytes:
        """
        Send to full nodes to filter transactions.
        Full nodes only send transactions that match the filter.
        This provides privacy - node doesn't know exactly which
        addresses are ours due to false positives.
        """
        return self.bloom.bit_array

    def is_relevant_transaction(self, tx) -> bool:
        """Check if transaction involves our addresses."""
        for output in tx.outputs:
            if output.address in self.bloom:
                return True
        return False

Advanced Techniques

1. Counting Bloom Filters

Standard Bloom filters don't support deletion. Counting Bloom filters use counters instead of bits:

class CountingBloomFilter(BloomFilter):
    """Like BloomFilter, but stores counters instead of bits so items can be removed."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Reuse the sizing, hash-count, and hashing logic from BloomFilter above
        super().__init__(expected_items, false_positive_rate)
        self.counters = [0] * self.size  # Counters instead of bits

    def add(self, item: str):
        for i in self._hash_functions(str(item)):
            self.counters[i] += 1
        self.count += 1

    def remove(self, item: str):
        """Remove an item (not possible with a standard Bloom filter!)"""
        if item not in self:
            return  # Don't decrement if not present

        for i in self._hash_functions(str(item)):
            self.counters[i] -= 1
        self.count -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._hash_functions(str(item)))
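
A quick usage sketch, assuming the counting variant above (which reuses the BloomFilter sizing and hashing):

cbf = CountingBloomFilter(expected_items=10_000, false_positive_rate=0.01)
cbf.add("hello")
print("hello" in cbf)   # True
cbf.remove("hello")
print("hello" in cbf)   # False - its counters dropped back to zero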

2. Scalable Bloom Filters

When you don't know how many items you'll add:

class ScalableBloomFilter:
    """Dynamically grows as items are added."""

    def __init__(self, initial_capacity: int = 1000,
                 false_positive_rate: float = 0.01,
                 growth_factor: int = 2):
        self.fpr = false_positive_rate
        self.growth_factor = growth_factor
        self.filters = [BloomFilter(initial_capacity, false_positive_rate)]

    def add(self, item: str):
        # Check if current filter is at capacity
        current = self.filters[-1]
        if current.count >= current.expected_items:
            # Create new filter with tighter FPR to maintain overall FPR
            new_fpr = self.fpr * (0.5 ** len(self.filters))
            new_capacity = current.expected_items * self.growth_factor
            self.filters.append(BloomFilter(new_capacity, new_fpr))

        self.filters[-1].add(item)

    def __contains__(self, item: str) -> bool:
        return any(item in f for f in self.filters)
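
A brief usage sketch of the scalable variant, assuming the BloomFilter class from earlier; the exact number of underlying filters depends on the initial capacity and growth factor:

sbf = ScalableBloomFilter(initial_capacity=1_000, false_positive_rate=0.01)

for i in range(50_000):          # far more items than the initial capacity
    sbf.add(f"item-{i}")

print(len(sbf.filters))          # several filters were created as it grew
print("item-42" in sbf)          # True
print("never-added" in sbf)      # Almost certainly False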

3. Cuckoo Filters (Alternative)

Cuckoo filters offer some advantages over Bloom filters:

Feature            | Bloom Filter          | Cuckoo Filter
-------------------|-----------------------|----------------------
Deletion           | No (without counting) | Yes
Lookup             | O(k)                  | O(1)
Space              | ~1.44 log₂(1/ε) bits  | ~1.05 log₂(1/ε) bits
Cache-friendliness | Good                  | Better

# Simplified Cuckoo filter concept
class SimpleCuckooFilter:
    def __init__(self, size: int):
        self.buckets = [[] for _ in range(size)]
        self.bucket_size = 4
        self.size = size

    def fingerprint(self, item: str) -> int:
        return hash(item) & 0xFF  # 8-bit fingerprint

    def add(self, item: str) -> bool:
        fp = self.fingerprint(item)
        i1 = hash(item) % self.size
        i2 = (i1 ^ hash(str(fp))) % self.size

        # Try to insert in either bucket
        if len(self.buckets[i1]) < self.bucket_size:
            self.buckets[i1].append(fp)
            return True
        if len(self.buckets[i2]) < self.bucket_size:
            self.buckets[i2].append(fp)
            return True

        # Kick out existing item and retry (cuckoo hashing)
        # ... (implementation continues)
        return False

Performance Comparison

Here's how Bloom filters compare to alternatives:

Dataset: 10 million strings (average 32 bytes each)

Structure         | Memory  | Insert  | Lookup  | Delete
------------------|---------|---------|---------|--------
HashSet           | 763 MB  | O(1)    | O(1)    | O(1)
TreeSet           | 851 MB  | O(log n)| O(log n)| O(log n)
Bloom (1% FPR)    | 11.4 MB | O(k)    | O(k)    | No
Bloom (0.1% FPR)  | 17.2 MB | O(k)    | O(k)    | No
Counting Bloom    | 45.6 MB | O(k)    | O(k)    | O(k)
Cuckoo Filter     | 10.5 MB | O(1)*   | O(1)    | O(1)

* Cuckoo insert can fail if filter is too full
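
The absolute numbers above depend on the runtime and key sizes. As a rough sanity check in Python (a sketch, not the benchmark behind the table; scale n as needed):

import sys

n = 1_000_000
keys = [f"key-{i:027d}" for i in range(n)]    # ~32-character keys

bloom = BloomFilter(expected_items=n, false_positive_rate=0.01)
hash_set = set()
for key in keys:
    bloom.add(key)
    hash_set.add(key)

set_bytes = sys.getsizeof(hash_set) + sum(sys.getsizeof(k) for k in hash_set)
print(f"hash set : {set_bytes / 1024 / 1024:.1f} MB")
print(f"bloom    : {bloom.memory_usage_bytes / 1024 / 1024:.2f} MB")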

NexaDB's Bloom Filter Implementation

NexaDB uses Bloom filters extensively in its LSM-Tree storage engine:

# How NexaDB uses Bloom filters (simplified)
class NexaDBStorage:
    def __init__(self):
        self.memtable = {}  # In-memory writes
        self.sstables = []  # Sorted string tables on disk

    def get(self, key: str):
        # 1. Check memtable (always in memory)
        if key in self.memtable:
            return self.memtable[key]

        # 2. Check SSTables with Bloom filter optimization
        for sstable in reversed(self.sstables):  # Newest first
            # Bloom filter check: O(k) memory operation
            if sstable.bloom_filter.definitely_not_contains(key):
                continue  # Skip this SSTable - key not here

            # Bloom filter says "maybe" - do disk read
            value = sstable.read_from_disk(key)
            if value is not None:
                return value

        return None  # Key doesn't exist

# Performance impact:
# - Without Bloom: ~50ms per SSTable check (disk read)
# - With Bloom: ~0.001ms per SSTable check (memory)
# - 95% of non-existent key lookups avoid disk entirely!

Common Pitfalls and Best Practices

Pitfall 1: Underestimating Capacity

# BAD: Underestimating leads to high false positive rate
bloom = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(100_000):  # Adding 100x more items!
    bloom.add(str(i))
print(bloom.current_false_positive_rate)  # Could be 50%+ !

# GOOD: Use ScalableBloomFilter or estimate generously
bloom = ScalableBloomFilter(initial_capacity=1000, false_positive_rate=0.01)
# Or
bloom = BloomFilter(expected_items=100_000, false_positive_rate=0.01)

Pitfall 2: Wrong Hash Function

# BAD: Using Python's built-in hash (not consistent across runs!)
def bad_hash(item, seed):
    return hash((item, seed))  # Changes between Python sessions!

# GOOD: Use a consistent hash like MurmurHash or xxHash
import mmh3
def good_hash(item, seed):
    return mmh3.hash(item, seed)

Pitfall 3: Ignoring False Positive Rate Trade-offs

# Very low FPR = more memory
bloom_paranoid = BloomFilter(1_000_000, 0.0001)  # ~2.4 MB
bloom_practical = BloomFilter(1_000_000, 0.01)   # 1.2 MB

# For most applications, 1% FPR is fine
# Only use lower FPR when false positives are very expensive

Best Practice: Always Verify Positive Results

def safe_lookup(key: str, bloom: BloomFilter, actual_store: dict):
    if bloom.definitely_not_contains(key):
        return None  # Trust the negative!

    # Bloom says "maybe" - ALWAYS verify with actual data
    return actual_store.get(key)  # Could be None (false positive)

Conclusion

Bloom filters are an elegant solution for space-efficient set membership testing. Key takeaways:

  1. Space efficiency: 10 bits per item for 1% false positive rate
  2. No false negatives: "Not in set" is always correct
  3. Tunable accuracy: Trade memory for lower false positive rate
  4. Wide applications: Databases, caching, networking, blockchain

The mathematics is beautiful, the implementation is straightforward, and the performance gains are dramatic. Whether you're building a database, a web crawler, or a spam filter, Bloom filters deserve a place in your toolkit.


NexaDB uses Bloom filters to achieve 95% reduction in disk reads. Try it yourself and see the performance difference.

Have questions about Bloom filters or database internals? Join our Discord community or check out the NexaDB source code.
