
Let’s implement a Bloom Filter - onatm
https://onatm.dev/2020/08/10/let-s-implement-a-bloom-filter/
======
FreakLegion
Since the submitter is the author:

> 6 days after I started to write this post, it is proved that the math behind
> false positive probability was wrong for 50 years

This isn't true.

The "Bloom math is wrong!" clamor started at least 12 years ago with [1],
which brazenly cites Bloom, then proceeds to give an analysis of Knuth's
subtly different filter from TAOCP instead.

There are issues with [1], but its exact false positive formula was eventually
proven correct. Its argument is incompatible with Bloom's original filter,
though, and Knuth explicitly states that the false positive formula he gives
is approximate.

In other words, nothing was proved wrong. The analysis doesn't apply to Bloom
and doesn't contradict Knuth.

Now, [1] may be the first paper to give the exact false positive formula for
Knuth's filter (and Bloom's was given in 1979 by [2]), so it's not entirely
without merit. But in practice none of this matters anyway. At useful sizes
the approximations are accurate. Knuth's is a lower, Bloom's an upper bound,
and the two become functionally equivalent very quickly.
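
For context, the approximation in question is the textbook false positive estimate: with m bits, k hash functions, and n inserted keys,

```latex
p \approx \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}
```

The exact formulas for the two filters differ from this only by lower-order terms that vanish at useful sizes.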

1. [https://cglab.ca/~morin/publications/ds/bloom-submitted.pdf](https://cglab.ca/~morin/publications/ds/bloom-submitted.pdf)

2. Algorithm-A in [https://sci-hub.tw/10.1109/proc.1979.11543](https://sci-hub.tw/10.1109/proc.1979.11543)

~~~
onatm
I believe I jumped to a conclusion without properly reading the article I
linked or the first reference you gave. I will update my post so it doesn't
spread any false information. Thank you!

------
veselin
The article has a good implementation, but when reading about Bloom filters I
often see claims of optimality without saying optimal for what. In this
example, an optimal Bloom filter for 1M elements does 7 lookups in a
bitvector of 1.14MB, which is usually 3 to 5 L2 cache misses (with a 256KB or
512KB cache).

But looking up the 1M elements directly in a B-tree will usually incur 1-2
cache misses and, depending on the data element size, may even fit in the L3
cache. So what is this Bloom filter saving me from?

~~~
nwallin
Assuming your keys are short strings, 1M elements is 32MB plus btree overhead.
If your keys are longer strings, (URLs are typically too long to benefit from
short string optimization) 1M elements is still 32MB, but now it's also a
pointer to somewhere else in memory with more data at the other end. So
there's that; probably more than just 1-2 cache misses.

That being said, in the case of just 1M elements, a Bloom filter won't buy you
much. For a bloom filter to be truly effective, you need all of the following:

1. Many queries come back negative; that is, you spend a lot of time
searching for keys that are not in the DB. This is not the typical case: when
the typical web app searches for a _key_, the key is usually in the DB (as
opposed to "find all sales on November 7th" or whatever, which Bloom filters
cannot assist with).

2. The actual query is disproportionately expensive. Unless a Bloom filter
can save you a disk/network call, it's generally not worth the extra overhead.

3. You rarely/never remove items, and your data grows slowly.

Many problems that can be solved with a Bloom filter can also be solved by
simply buying more RAM. Many problems that can be solved with a Bloom filter
can also be solved by using a faster lookup. (Everybody overlooks hash tables;
I don't know why.) Bloom filters are _not_ a general-purpose algorithm. You
really need to grok your data before pulling the trigger on a Bloom filter.
(Don't forget to benchmark.)

~~~
cb321
> (everybody overlooks hash tables. I don't know why.)

It is probably worse than you think since a hash table can also replace the
Bloom filter in its "false positives allowed" filtration role. Not sure if
that's what you meant.

If all you save is the "fingerprint" (or a B-bit truncated hash code of the
keys) then you get a very similar probabilistic data structure. Further, if
you put these small integer fingerprints in an open addressed hash table with
linear probing then you get a (very likely) single-cache miss test for
membership. This alternate structure was, in fact, also analyzed by Burton
Bloom (in the very same paper he introduced his famous filter). He found
fingerprint filters to be slower only on very unrealistic memory systems with
neither cache nor "word fetch" structure where access to every bit had unit
cost no matter its location. If you can grab a big batch of bits in unit cost
the analysis changes a lot. Modern systems grab 512..1024 bits at the same
cost as 1...

This fingerprint filter alternative does take 2..4x more space depending on
scale and allowed false positive rates, but that space is organized better in
terms of caches. Most would choose to spend that space to save k-1 cache
misses of time in today's era of 60-120 ns memory and multi GHz CPUs. It is
slightly easier to implement a large bitvector than an array of B-bit numbers,
depending on B. { It's pretty trivial for B=8*m, though. ;-) } It's kind of
crazy that hash tables for this purpose fell into "also ran/obscure" status...
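
A minimal sketch of the idea (type and method names are illustrative, not from Bloom's paper): store an 8-bit fingerprint per key in an open-addressed, linear-probed table, so a membership test usually touches one cache line. False positives happen when fingerprints collide along a probe chain; there are still no false negatives.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A fingerprint filter: an open-addressed hash table that stores only
// B-bit fingerprints (here B = 8). Sizing and load factor management are
// omitted for brevity; a real version must keep the table from filling up.
struct FingerprintFilter {
    slots: Vec<u8>, // 0 = empty; fingerprints are forced to be nonzero
}

impl FingerprintFilter {
    fn new(capacity: usize) -> Self {
        // Power-of-two size lets us mask instead of mod.
        FingerprintFilter { slots: vec![0u8; capacity.next_power_of_two()] }
    }

    // Derive both the home slot and the fingerprint from one hash.
    fn slot_and_fp<T: Hash>(&self, item: &T) -> (usize, u8) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let v = h.finish();
        let idx = (v as usize) & (self.slots.len() - 1);
        let fp = ((v >> 56) as u8).max(1); // reserve 0 for "empty"
        (idx, fp)
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let (mut idx, fp) = self.slot_and_fp(item);
        // Linear probing: walk forward to an empty slot (or a matching
        // fingerprint, which we treat as already present).
        while self.slots[idx] != 0 && self.slots[idx] != fp {
            idx = (idx + 1) & (self.slots.len() - 1);
        }
        self.slots[idx] = fp;
    }

    fn may_contain<T: Hash>(&self, item: &T) -> bool {
        let (mut idx, fp) = self.slot_and_fp(item);
        while self.slots[idx] != 0 {
            if self.slots[idx] == fp {
                return true; // possibly present (fingerprint collision possible)
            }
            idx = (idx + 1) & (self.slots.len() - 1);
        }
        false // definitely absent
    }
}

fn main() {
    let mut filter = FingerprintFilter::new(1024);
    filter.insert(&"https://example.com/a");
    filter.insert(&"https://example.com/b");
    // No false negatives: inserted items always test positive.
    assert!(filter.may_contain(&"https://example.com/a"));
    println!("ok");
}
```

At a low load factor the probe chain is almost always a single slot, which is the one-cache-miss behavior described above.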

~~~
aidenn0
If you switch from linear probing to cuckoo hashing, then you end up with a
cuckoo filter, which is more compact than a bloom filter for a given false-
positive rate. The compactness comes from the fact that Cuckoo hashing allows
load factors of over 90% which are impractical for linear-probing.

~~~
cb321
Well, sure. You can also switch to Robin-Hood linear probing to work well at
high load factors and retain the almost never >1 DRAM hit. These are just
various hash table options to claw back some load factor ratios of space
usage.

------
_jsdw
“This looks really promising! A Bloom Filter that represents a set of million
items with a false-positive rate of 0.01 requires only 9585059 bits and 7 hash
functions.”

My immediate thought here was that this wasn't very space efficient, as it
requires over 9 bits per item.

I'd have to dig more into this, though, to confirm or deny my intuition!

Edit: this page agrees with the blog post and shows some interesting graphs
[https://hur.st/bloomfilter/?n=1000000&p=0.01&m=&k=](https://hur.st/bloomfilter/?n=1000000&p=0.01&m=&k=)
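
A quick back-of-the-envelope check (the function name here is illustrative, not from the post): the standard sizing formulas are m = -n ln p / (ln 2)^2 and k = (m/n) ln 2, so the cost per item depends only on p, roughly 1.44 * log2(1/p) bits, which is where the ~9.6 bits/item comes from.

```rust
// Sketch: derive the Bloom filter size m (in bits) and hash count k from
// the expected element count n and target false-positive rate p, using the
// standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
fn optimal_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64;
    let k = ((m as f64 / n as f64) * ln2).round() as u32;
    (m, k)
}

fn main() {
    let (m, k) = optimal_params(1_000_000, 0.01);
    // Reproduces the figures quoted in the post: 9585059 bits and 7 hashes.
    println!("m = {} bits (~{:.2} bits/item), k = {}", m, m as f64 / 1e6, k);
}
```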

~~~
joshuamorton
Yes and no. It requires 9 bits per item _in_ the filter, but the total
keyspace may be much larger. If your items are keyed by 64-bit identifiers,
you've compressed things by a factor of ~7.1.

~~~
dspillett
Furthermore, most uses of Bloom filters I've seen index something notably
larger than 64 bits, and usually variable length: arbitrary names, URLs/URIs,
etc.

If you are looking up a key value, then chances are it _does_ exist in the
DB, so the Bloom filter won't save you anything: you are then hitting slower
storage to read the item that the key represents anyway.

Bloom filters are suited to situations where you are looking for larger non-
keyed data such as URLs, where there is a fairly good chance that what you
are looking for isn't in the data yet, so the "no false negatives" property
is important for saving further accesses (which is why caches are where
you'll often find these structures used).

------
anonytrary
Bloom filters confused me at first, but they're really cool. I like to think
of Bloom filters with an (oversimplified) physics analogy. When you collide two
particles together, other smaller particles come flying out of the explosion.
You measure their tracks with a film. If you keep smashing particles together,
eventually the film will be saturated and you will basically lose the ability
to ascertain with confidence whether or not two specific particles were
smashed together.

------
hajile
Bing's BitFunnel paper is a pretty interesting read. The parts about scaling
Bloom filters by flipping their axis and using multiple filter layers are
particularly of note if you are actually interested in production use.

[https://danluu.com/bitfunnel-sigir.pdf](https://danluu.com/bitfunnel-sigir.pdf)

------
wruza
I've read that some old factories used Bloom filters to quickly look up where
a worker was. In the morning, workers pressed three buttons on a wall that
represented their workshop, and knowing those numbers one could check whether
a person was definitely not in a given workshop.

I've seen a similar technique, but in reverse, at a factory myself: the 5-7
people closest to the gate in the queue pressed their buttons, and a
controller on the other side collected their outstanding passes and then
checked them against their faces very quickly. That greatly saved time in the
queue.

------
kilotaras
> hashers: [DefaultHasher; 2],

Bikeshedding a little, but that probably shouldn't be an array: hashers[0]
and hashers[1] aren't semantically equivalent, and the only use the array
gets is indexing by a constant. Something like `startHasher` and `stepHasher`
would probably convey the intent better.
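
A sketch of what that naming suggestion might look like (struct, field, and method names are illustrative, not from the article), assuming the two hashers drive Kirsch-Mitzenmacher double hashing, where the i-th bit index is g_i(x) = h1(x) + i * h2(x) mod m:

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

// Two independently seeded hashers with distinct roles, named for those
// roles instead of stored as an anonymous two-element array.
struct BloomHashers {
    start_hasher: DefaultHasher, // h1: the base bit position
    step_hasher: DefaultHasher,  // h2: the stride added per extra hash function
}

impl BloomHashers {
    fn new() -> Self {
        // RandomState seeds each DefaultHasher with its own random keys.
        BloomHashers {
            start_hasher: RandomState::new().build_hasher(),
            step_hasher: RandomState::new().build_hasher(),
        }
    }

    // i-th derived bit index: g_i(x) = h1(x) + i * h2(x) (mod m).
    fn bit_index<T: Hash>(&self, item: &T, i: u64, m: u64) -> u64 {
        let mut h1 = self.start_hasher.clone();
        let mut h2 = self.step_hasher.clone();
        item.hash(&mut h1);
        item.hash(&mut h2);
        h1.finish().wrapping_add(i.wrapping_mul(h2.finish())) % m
    }
}

fn main() {
    let hashers = BloomHashers::new();
    // Simulate k = 7 hash functions over an m-bit filter from just two hashers.
    for i in 0..7 {
        println!("bit {}", hashers.bit_index(&"https://example.com", i, 9_585_059));
    }
}
```

With named fields, the call sites read as "start plus i steps" rather than an opaque `hashers[0]`/`hashers[1]`.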

------
gmanis
Very nice read. I knew Bloom filters already, but the detailed explanation
and the Rust code (almost Greek to me) were very nicely presented. I also
read through the linked page by Kiran.

------
hn_throwaway_99
I think most people who took CS classes in college are familiar with hash
table implementations, but perhaps not bloom filters.

Bloom filters are just chaining hash tables, without the chain.

------
ballooney
I like hand-drawn diagrams and use them a lot, especially since work gave me
an iPad, but whenever I see them in 'high-production-value', well-researched
and edited blog posts like this, I am reminded that it still seems to be a
hard problem to quickly produce good-quality vector diagrams (with FOSS
software).

Has the state of the art moved on recently, or is it still a case of fighting
TikZ?

~~~
onatm
Those are not hand-drawn diagrams. I used
[https://excalidraw.com](https://excalidraw.com).

