Skip to main content

Probabilistic data structures: Why Bloom filters are one of the most elegant hack

· 14 min read
Pranjal Kumar
Software Engineer, @Bentley Systems

Bloom filter bit array visualization showing how hash functions map elements to bit positions for probabilistic set membership testing | Pranjal Kumar

Here's a question that sounds impossible: can you check if an element is in a set of 1 billion items using only 1.2 GB of RAM, with lookups completing in constant time, and never producing a false negative? Oh, and the data structure is a single contiguous array of bits; no pointers, no allocations, no garbage collection pauses.

The answer is yes. It's called a Bloom filter, and once you understand it, you'll see it everywhere; from the database engine under your app to Chrome's malicious URL checker to the network switches routing your packets.

This post goes deep. Not "here's a pretty diagram" deep. "I need to tune the false positive rate for a production system handling 400K QPS" deep.


Why probabilistic data structures exist

Let's start with a confession: deterministic data structures are expensive.

A HashSet<String> holding 1 billion URLs? That's roughly:

  • 1 billion pointers (8 bytes each) = 8 GB just for the pointer array
  • Plus the actual string data (average URL ~80 bytes) = 80 GB
  • Plus hash table overhead (load factor, tombstones, resizing) = another 30-40%

You're looking at ~120 GB of RAM for a set membership test. And every single read might chase a pointer into some random heap location; cache miss city. Your L3 cache is crying.

Probabilistic data structures trade a small, bounded error rate for massive savings in space and time. They answer a subtly different question: instead of "is X in the set?" they answer "is X definitely not in the set, or probably in the set?"

That sounds like a downgrade. But in practice? It's one of the most powerful engineering tradeoffs you can make.


The pool of probabilistic structures

Before we go deep on Bloom filters, let me give you the lay of the land:

StructureWhat it answersError typeSpace
Bloom filter"Is X in the set?"False positives (never false negatives)~10 bits per element for 1% FPR
Counting Bloom filterSame, but supports deletionSame~40 bits per element
Cuckoo filterSame, supports deletion, often better FPRFalse positives~12 bits per element for <1% FPR
HyperLogLog"How many distinct elements?"±0.8% error12 KB (!) for billions of elements
Count-Min Sketch"How many times has X appeared?"Over-estimates onlyO(1/ε × log(1/δ))
Quotient filter"Is X in the set?" (cache-friendly)False positives~10-12 bits per element
XOR filter"Is X in the set?" (immutable, faster)False positives~9 bits per element

Each one is beautiful in its own way. But Bloom filters are the gateway drug; once you grok them, the rest click into place.

FPR: False Positive Rate


Bloom filters from first principles

The simplest possible explanation

A Bloom filter is:

  1. A bit array of m bits, all initialized to 0
  2. k independent hash functions, each mapping an element to a position in [0, m)

To add element x: compute h₁(x), h₂(x), ..., hₖ(x) and set those bit positions to 1.

To query element x: compute the same hashes. If all k positions are 1, answer "probably yes." If any position is 0, answer "definitely no."

That's it. That's the whole thing. No linked lists, no trees, no rebalancing, no resizing. Just bits and hashes.

Why false negatives are impossible

Think about it: if we added element x, we set bits at positions h₁(x) through hₖ(x). We never unset bits. So when we query x later, those positions will still be 1. Always. The filter never forgets what it's seen.

Why false positives happen

But other elements might have also set those same positions. If elements a, b, and c collectively lit up the same bit positions that h₁(x) through hₖ(x) would need; then querying x returns "probably yes" even though x was never inserted. The bits don't remember who set them.

This is the tradeoff. And it's quantifiable.


The math; and why you should actually understand it

I know, I know; "just tell me what parameters to use." But if you don't understand the math, you can't tune these things in production. And you will need to tune them. Trust me.

False positive probability

For a Bloom filter with m bits, k hash functions, and n inserted elements:

The probability a specific bit is still 0 after all insertions:

P(bit=0)=(11m)knekn/mP(\text{bit} = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

The false positive probability (all k bits are 1 for an element not in the set):

P(FP)=(1ekn/m)kP(\text{FP}) = \left(1 - e^{-kn/m}\right)^k

Optimal number of hash functions

Given m and n, the optimal k that minimizes false positive rate:

kopt=mnln20.693mnk_{\text{opt}} = \frac{m}{n} \ln 2 \approx 0.693 \cdot \frac{m}{n}

This is a beautiful result. It says: for a given bits-per-element ratio, there's a sweet spot for the number of hashes. Too few hashes → not enough discrimination. Too many hashes → you fill up the bit array too fast.

Required bits per element

If you want a false positive rate of p:

mn=lnp(ln2)21.44log2p\frac{m}{n} = -\frac{\ln p}{(\ln 2)^2} \approx -1.44 \cdot \log_2 p

Let's make this concrete:

Desired FP rateBits per elementHash functions (k)
10%4.83
1%9.67
0.1%14.410
0.01%19.213
0.001%24.017

The takeaway: each additional order of magnitude in accuracy costs ~4.8 bits per element. That's incredibly space-efficient. A 1% false positive Bloom filter uses ~1.2 bytes per element regardless of element size; whether you're storing 10-byte integers or 10 KB URLs.


Implementation; the parts nobody tells you about

The hash function problem

The textbook says "use k independent hash functions." In practice, nobody does this. Computing k independent hashes is expensive, and you don't actually need independence.

Kirsch-Mitzenmacker optimization (2006): You can simulate k hash functions using only two hash functions:

gi(x)=h1(x)+ih2(x)modm,i=0,1,...,k1g_i(x) = h_1(x) + i \cdot h_2(x) \mod m, \quad i = 0, 1, ..., k-1

This gives the same asymptotic false positive rate as k truly independent hashes. In practice, people use:

  • MurmurHash3; fast, excellent distribution, 128-bit output gives you two 64-bit hashes
  • xxHash; even faster, modern SIMD-friendly design
  • SipHash; if you need cryptographic-grade collision resistance (usually you don't)

Here's what this looks like in practice:

fn get_bit_positions(item: &[u8], num_hashes: u32, num_bits: u64) -> Vec<u64> {
let hash = murmurhash3_x64_128(item, 0);
let h1 = hash as u64;
let h2 = (hash >> 64) as u64;

(0..num_hashes)
.map(|i| {
h1.wrapping_add((i as u64).wrapping_mul(h2)) % num_bits
})
.collect()
}

Two hashes, k bit positions. Simple, fast, and proven correct.

Bit array implementation

The bit array seems trivial; just malloc(m / 8) bytes, right? In theory, yes. In practice, there are a few gotchas that separate a toy implementation from a production one:

// Setting a bit
void bloom_set(uint8_t *filter, uint64_t pos) {
filter[pos >> 3] |= (1 << (pos & 7));
}

// Testing a bit
bool bloom_test(uint8_t *filter, uint64_t pos) {
return filter[pos >> 3] & (1 << (pos & 7));
}

Cache line awareness: On modern CPUs, a cache line is 64 bytes (512 bits). If your k hash outputs are spread uniformly across a large filter, each query potentially touches k different cache lines. For a 1 GB filter with k=7, you're looking at 7 cache misses per lookup; at ~100 ns each, that's 700 ns per query.

Blocked Bloom filters solve this: partition the bit array into cache-line-sized blocks. Hash the element to pick a block, then do all k hash probes within that block. You lose a bit of theoretical FP optimality, but you gain cache-friendliness that more than compensates in practice.

Traditional: k probes scattered across m bits → k cache misses
Blocked: 1 hash to select block + k probes within 512 bits → 1 cache miss

I've seen this take query time from ~500 ns to ~50 ns on real hardware. A 10x improvement just from thinking about cache lines.

Memory-mapped filters

For filters that don't fit in RAM (or that you want to persist across process restarts), mmap is your friend:

int fd = open("bloom.dat", O_RDWR | O_CREAT, 0644);
ftruncate(fd, filter_size_bytes);
uint8_t *filter = mmap(NULL, filter_size_bytes,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

The OS handles paging; hot parts stay in RAM, cold parts stay on disk. This is exactly how databases like RocksDB implement their Bloom filters: they're part of the SST file format and get mmap'd when the file is opened.


Where Bloom filters live in the real world

This isn't academic. Bloom filters are everywhere in production systems. Let me walk you through the ones I find most interesting.

LSM-tree databases (RocksDB, LevelDB, Cassandra, HBase)

This is probably the most impactful use. In an LSM-tree, data lives in sorted immutable files (SSTables) across multiple levels. A point lookup (GET key) might need to check every SSTable for the key.

Without Bloom filters: reading a non-existent key requires checking all SSTables → potentially dozens of disk reads → unacceptable latency.

With Bloom filters: each SSTable has an associated Bloom filter (stored in the file's metadata block). Before reading the SSTable's data blocks, check the filter. If it says "definitely not here" → skip the file entirely. Zero disk I/O for that SSTable.

Impact: RocksDB's documentation states that Bloom filters reduce read amplification by 10-100x for point lookups. Cassandra uses them on every SSTable, and a misconfigured bloom_filter_fp_chance (their tuning knob) is one of the most common causes of read latency spikes I've seen in production.

# Cassandra table options
CREATE TABLE users (
...
) WITH bloom_filter_fp_chance = 0.01; -- 1% FP rate, ~10 bits/key

Changing this from 0.01 to 0.001 doubles memory usage but reduces unnecessary disk reads by 10x. Whether that's worth it depends entirely on your read/write ratio.

Chrome's Safe Browsing

Your browser maintains a local Bloom filter of ~4 million malicious URLs. When you navigate to a page, Chrome checks the URL against this filter locally; no network request needed. If the filter says "definitely not malicious" (which is >99.99% of the time), you proceed instantly.

If the filter says "maybe malicious," Chrome then makes an API call to Google's Safe Browsing service for a definitive answer. The Bloom filter turned a network round-trip on every single navigation into a network round-trip on 0.01% of navigations.

CDN cache routing (Akamai, Cloudflare)

When a CDN edge node receives a request, it needs to know if the content is cached somewhere in the cluster before fetching from origin. The nodes share Bloom filter summaries of their cache contents. A request arrives → check all peer filters → if any say "maybe here" → ask that peer → otherwise fetch from origin.

This is why CDNs can maintain hit rates >95% even with object counts in the hundreds of billions.

Network routers (packet forwarding)

Hardware routers use Bloom filters in TCAM (Ternary Content-Addressable Memory) for longest-prefix match in routing tables. At 100 Gbps line rate, you have ~6.7 nanoseconds per packet. A Bloom filter in SRAM gives you constant-time membership checks at wire speed.

Bitcoin SPV clients

Lightweight Bitcoin wallets use Bloom filters to request only relevant transactions from full nodes without revealing which addresses they own (BIP 37). The wallet sends a Bloom filter of its addresses to the full node, which then only forwards transactions matching the filter. Privacy through false positives; the node can't tell which matches are real and which are noise.

Spell checkers (the OG use case)

The original 1970 Bloom filter paper by Burton Bloom was literally about spell checking. Store a dictionary in a Bloom filter. Check each word in the document. If the filter says "not in dictionary" → it's definitely misspelled. If it says "in dictionary" → it's probably spelled correctly (and you can do a more expensive exact check if needed).


The variants; because vanilla isn't always enough

Counting Bloom filters

The problem: standard Bloom filters don't support deletion. You can't unset a bit because you don't know if other elements also need that bit.

The solution: replace each bit with a counter (typically 4 bits). Insert increments, delete decrements. A position is "set" if its counter > 0.

Standard: [0][1][0][1][1][0][0][1]1 bit per position
Counting: [0][3][0][1][2][0][0][1]4 bits per position

Tradeoff: 4x the space (4 bits instead of 1 per position). And there's a counter overflow risk; if more than 15 elements hash to the same position, the 4-bit counter wraps. In practice, the probability of this is negligible for properly sized filters.

When to use: when your set changes over time and you need to remove elements. Think: "which users are currently online?"; users go offline, and you need to remove them.

Cuckoo filters

I think cuckoo filters are underrated. They were introduced by Fan et al. in 2014 and in many practical scenarios they're just... better.

How they work: instead of k bit positions, store a fingerprint (compact hash) of each element in one of two candidate bucket positions (using cuckoo hashing; hence the name).

Bucket 0: [fp₁][fp₂][ ][ ]4 slots per bucket
Bucket 1: [fp₃][ ][ ][ ]
Bucket 2: [fp₄][fp₅][fp₆][ ]
...

To insert: compute two candidate buckets via h₁(x) and h₂(x) = h₁(x) ⊕ hash(fingerprint(x)). If either has space, place the fingerprint. If both are full, evict an existing fingerprint to its alternative bucket (cuckoo displacement). Repeat until placed or max iterations reached.

To query: check both candidate buckets for the fingerprint.

Advantages over Bloom filters:

  • Supports deletion (just remove the fingerprint) without the 4x overhead of counting
  • Better space efficiency at FP rates below ~3%
  • Better cache performance; at most 2 cache misses per lookup

Disadvantages:

  • Insertion can fail if the table is too full (>95% load)
  • Implementation is more complex
  • Not as embarrassingly parallelizable

Here's the crossover point I use as a rule of thumb:

  • FP rate > 3%: Bloom filter wins on space
  • FP rate < 3%: Cuckoo filter wins
  • Need deletion: Cuckoo filter (or counting Bloom with 4x penalty)
  • Maximum simplicity: Bloom filter, always

Scalable Bloom filters

Standard Bloom filters require you to know n (number of elements) upfront. What if you don't? Scalable Bloom filters (Almeida et al., 2007) solve this:

  • Start with a small initial filter
  • When the estimated fill ratio exceeds a threshold, allocate a new filter with a tighter FP rate
  • Query checks all filters (OR of results)
  • Each successive filter has a geometrically tighter FP rate so the overall error stays bounded

This is what Redis uses in its BF.ADD/BF.EXISTS implementation; it auto-scales without requiring you to pre-declare the capacity.

XOR filters (and Ribbon filters)

If your set is static (built once, queried many times), XOR filters are the new hotness:

  • ~9 bits per element (vs. ~10 for Bloom filters at same FP rate)
  • Exactly 3 memory accesses per query (no variance)
  • Construction uses a peeling algorithm on a random 3-hypergraph

The downside: you can't add elements after construction. It's an immutable snapshot. Perfect for SSTable-level filters in databases (you build the filter when the SSTable is written and never modify it), but useless for streaming membership tracking.


Production tuning; the stuff they don't teach in school

Sizing for your workload

Here's my mental model for sizing a production Bloom filter:

import math

def bloom_filter_size(n_elements, fp_rate):
"""Calculate optimal Bloom filter parameters."""
m = -n_elements * math.log(fp_rate) / (math.log(2) ** 2)
k = (m / n_elements) * math.log(2)
return {
'bits': int(math.ceil(m)),
'bytes': int(math.ceil(m / 8)),
'megabytes': m / 8 / 1024 / 1024,
'hash_functions': int(math.ceil(k)),
'bits_per_element': m / n_elements
}

# Example: 100 million URLs, 0.1% false positive rate
params = bloom_filter_size(100_000_000, 0.001)
# → ~172 MB, 10 hash functions, 14.4 bits/element

Rule of thumb: 10 bits per element gives you 1% FP. Each additional 5 bits per element gives you ~10x better FP rate.

Monitoring false positive rate in production

Your Bloom filter's actual FP rate drifts as you insert more elements than planned. Monitor it:

# Approximate current FP rate based on fill ratio
def estimated_fp_rate(m_bits, k_hashes, n_inserted):
return (1 - math.exp(-k_hashes * n_inserted / m_bits)) ** k_hashes

Set an alert when the estimated FP rate exceeds your threshold. When it fires, you need to resize; which for a standard Bloom filter means building a new, larger filter and reinserting everything. Plan for this.

Serialization and network transfer

Bloom filters are just byte arrays. They serialize trivially:

  • Write the header (m, k, n, hash seed)
  • Write the raw bytes

But transferring a 100 MB filter over the network on every update is wasteful. Two tricks:

  1. Delta compression: if you're syncing filters between nodes, XOR the old and new filter. The delta is sparse (mostly zeros) and compresses beautifully with LZ4 or zstd.

  2. Partitioned filters: split the filter into segments. Only transfer segments that changed since the last sync. This is how distributed systems like Cassandra share filter updates during anti-entropy repair.

Thread safety

Bloom filters are naturally read-safe without locks; reading bits is idempotent. For concurrent writes, you have options:

  • Atomic bit-set operations: __sync_fetch_and_or(&filter[byte_pos], bit_mask); lock-free, correct, minimal overhead
  • Partitioned writes: each writer owns a segment, readers check all segments
  • Copy-on-write: build a new filter in the background, atomically swap the pointer

In my experience, atomic bit-set is the sweet spot for most workloads. The false positive rate increases very slightly due to non-atomic multi-bit operations (a query might see a partially-inserted element), but in practice this is within noise.


A war story: the Bloom filter that saved us $2M/year

A few years back, I was working on a system that processed event deduplication at ~400K events/second. Each event had a UUID, and we needed to detect duplicates within a 24-hour window.

The naive approach: a Redis SET with 24-hour TTL. At 400K/sec × 86400 seconds × ~40 bytes per UUID = ~1.4 TB of Redis. At AWS ElastiCache pricing, that's... not cheap.

The Bloom filter approach:

  • 24 rotating hourly filters (each covering ~1.44 billion events)
  • 0.01% FP rate per filter (19.2 bits/element)
  • Total memory: 24 × ~3.5 GB = ~84 GB

On the rare false positive (0.01% × 24 filters ≈ 0.24% worst case), we'd do a more expensive exact check against a backing store. In practice, the composite FP rate was much lower because events rarely hash identically across multiple hourly filters.

Result: 17x less memory, easily fit on a few commodity machines, and the latency per dedup check went from ~1 ms (Redis network round trip) to ~200 ns (in-process Bloom filter). The infra cost savings alone justified the engineering effort in the first sprint.


Common mistakes I've seen in production

1. Not accounting for growth

"We have 10 million users, so we'll size for 10 million."

Six months later, you have 25 million users. Your 1% FP rate is now 15%, and your database is getting hammered with false-positive-triggered lookups. Always size for 2-3x your expected growth, or use scalable Bloom filters.

2. Choosing k based on vibes

I've seen codebases that hardcode k=3 or k=7 without any justification. Use the formula: k = (m/n) × ln(2). If your bits-per-element ratio is 10, optimal k is 7. If it's 20, optimal k is 14. Getting k wrong by even 2-3 costs you 2-5x in FP rate for free.

3. Using a bad hash function

hashCode() in Java is a 32-bit hash with terrible distribution for Bloom filter use. I once saw a production Bloom filter using String.hashCode() with k=7; but since the hash is only 32 bits, the entropy was exhausted after ~4 hash functions. The last 3 were essentially correlated noise. FP rate was 10x worse than predicted.

Use a 128-bit hash (MurmurHash3-128 or xxHash-128) and split it into two 64-bit values for the Kirsch-Mitzenmacker trick. This gives you effectively unlimited k without entropy exhaustion.

4. Ignoring the "definitely not" guarantee in application logic

The whole point of a Bloom filter is the asymmetry: "no" means NO. "Yes" means MAYBE. I've seen code that treats both answers as probabilistic:

# WRONG: treating "not found" as probabilistic
if not bloom.might_contain(key):
result = "probably not in set" # NO! It's DEFINITELY not in set!

Build your system logic around the certainty of negatives. That's where the value is.

5. Not versioning your hash function

If you change the hash function (or its seed), every existing filter becomes invalid. All queries will return nonsense. Version your filters with a header that includes the hash algorithm and seed. This sounds obvious until you're doing a zero-downtime migration and half your nodes are running the old hash and half the new one.


The deeper insight

Here's what I want you to walk away with: Bloom filters aren't just a clever trick. They represent a fundamental insight about information theory.

Shannon taught us that information has an irreducible cost. If you need to distinguish N items with error rate ε, you need at minimum log₂(1/ε) bits per item. A Bloom filter uses 1.44 × log₂(1/ε) bits per item; only 44% above the information-theoretic minimum. That's astounding for a structure this simple.

The reason they're everywhere in systems engineering isn't because engineers love clever algorithms. It's because the tradeoff they offer; certainty of absence in exchange for ambiguity of presence; maps perfectly onto how distributed systems actually work:

  • A database doesn't need to know a key is present to open the file; it needs to know a key is absent to skip the file.
  • A CDN doesn't need to know content is cached; it needs to know content is not cached to go to origin.
  • A firewall doesn't need to prove a URL is safe; it needs to catch URLs that are definitely malicious.

In all these cases, the certain negative is more valuable than the uncertain positive. Bloom filters are the mathematical embodiment of this asymmetry.


TL;DR

Bloom filter = bit array + k hashes

Insert(x): set bits at h₁(x), h₂(x), ..., hₖ(x)
Query(x): if ALL bits at h₁(x)...hₖ(x) are 1"maybe"
if ANY bit is 0"definitely not"

Size it: m = -n·ln(p) / (ln2)²
Tune it: k = (m/n)·ln(2)
Hack it: gᵢ(x) = h₁(x) + i·h₂(x) mod m (only need 2 hashes)

Use it when: "definitely not" is more valuable than "definitely yes"

The next time someone proposes a HashMap for a membership check on a billion items, you'll know there's a better way. And now you know exactly how to size it, tune it, and keep it healthy in production.


Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13.7 (1970): 422-426. — Still worth reading. The paper is 4 pages long and changed systems engineering forever.


Further reading