> 6 days after I started to write this post, it is proved that the math behind false positive probability was wrong for 50 years
This isn't true.
The "Bloom math is wrong!" clamor started at least 12 years ago with [1], which brazenly cites Bloom, then proceeds to give an analysis of Knuth's subtly different filter from TAOCP instead.
There are issues with [1], but its exact false positive formula was eventually proven correct. Its argument is incompatible with Bloom's original filter, though, and Knuth explicitly states that the false positive formula he gives is approximate.
In other words, nothing was proved wrong. The analysis doesn't apply to Bloom and doesn't contradict Knuth.
Now, [1] may be the first paper to give the exact false positive formula for Knuth's filter (and Bloom's was given in 1979 by [2]), so it's not entirely without merit. But in practice none of this matters anyway. At useful sizes the approximations are accurate. Knuth's is a lower, Bloom's an upper bound, and the two become functionally equivalent very quickly.
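For anyone who hasn't read the papers: the two approximations at play, for a filter of m bits with k hash functions and n inserted keys, are the usual pair (written in my own notation, without claiming which author wrote which form):

```latex
p_1 = \left(1 - \left(1 - \tfrac{1}{m}\right)^{kn}\right)^{k}
\qquad\text{vs.}\qquad
p_2 = \left(1 - e^{-kn/m}\right)^{k}
```

Since (1 - 1/m)^{kn} < e^{-kn/m}, the first expression is always slightly larger, and for any m in the millions of bits the gap is far smaller than anything you could measure, which is the "functionally equivalent" point above.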
I believe I jumped into conclusion without properly reading the article I linked and the first reference you gave. I will update my post not to give any false information. Thank you!
In many applications, such as filtering queries to a data store (Figure 1), the Bloom filter will start out empty and fill as items are added to the data store and inserted into the Bloom filter. The false positive rate will be very low and then increase non-linearly as the Bloom filter fills. The average false positive rate over the life of the filter is much lower than the design false positive rate for the full filter given by the math.
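To make that concrete, here is a quick sketch (mine, not from the article) that just evaluates the usual approximation p ≈ (1 - e^{-kn/m})^k as items are added, using the article's 1M-item / 1% parameters:

```rust
// Sketch: evaluate p ≈ (1 - e^{-kn/m})^k as the filter fills, using the
// article's parameters (m ≈ 9,585,059 bits, k = 7, sized for 1M items at ~1%).
fn main() {
    let m = 9_585_059.0_f64; // bits in the filter
    let k = 7.0_f64;         // hash functions
    let capacity = 1_000_000; // items the filter was sized for

    let fp = |n: f64| (1.0 - (-k * n / m).exp()).powf(k);

    for n in (0..=capacity).step_by(100_000) {
        println!("n = {:>9}  p ≈ {:.6}", n, fp(n as f64));
    }

    // Crude numeric average of the false positive rate over the whole fill.
    let steps = 1_000;
    let avg: f64 = (1..=steps)
        .map(|i| fp(capacity as f64 * i as f64 / steps as f64))
        .sum::<f64>() / steps as f64;
    println!("average over fill ≈ {:.6}", avg);
}
```

The rate stays near zero for most of the fill (roughly 0.00025 at half capacity) and only reaches ~0.01 right at the end, so the lifetime average comes out several times lower than the design value.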
The article has a good implementation, but when reading about Bloom filters I often see claims of optimality that never say optimal for what. In this example, an optimal Bloom filter for 1M elements does 7 lookups in a bitvector of 1.14MB, which is usually 3 to 5 L2 cache misses (with a 256KB or 512KB cache).
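(As a sanity check on those figures, here is a quick sketch of the usual sizing formulas, not the article's code:)

```rust
// Standard Bloom filter sizing: m = -n·ln(p) / (ln 2)^2,  k = (m/n)·ln 2,
// evaluated for n = 1,000,000 items at p = 0.01.
fn main() {
    let n = 1_000_000.0_f64;
    let p = 0.01_f64;

    let m = -n * p.ln() / 2.0_f64.ln().powi(2); // bits -> ~9,585,059
    let k = (m / n) * 2.0_f64.ln();             // hash functions -> ~6.64, round to 7

    println!("m ≈ {:.0} bits ≈ {:.2} MiB", m, m / 8.0 / (1024.0 * 1024.0));
    println!("k ≈ {:.2}, rounded to {}", k, k.round() as u64);
}
```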
But looking up the 1M elements directly in a BTree will usually lead to 1-2 cache misses and, depending on the data element size, the tree may even fit in the L3 cache. So what is this Bloom filter saving me from?
Assuming your keys are short strings, 1M elements is 32MB plus btree overhead. If your keys are longer strings (URLs are typically too long to benefit from the short string optimization), 1M elements is still 32MB, but now each entry is also a pointer to somewhere else in memory with more data at the other end. So there's that; probably more than just 1-2 cache misses.
That being said, in the case of just 1M elements, a Bloom filter won't buy you much. For a bloom filter to be truly effective, you need all of the following:
1. Many queries come back negative; that is, you spend a lot of time searching for keys that are not in the DB. This is not the common case: when a typical web app searches for a _key_, the key is usually in the DB (as opposed to "find all sales on November 7th" or whatever, which Bloom filters cannot assist with).
2. The actual query is disproportionately expensive. Unless a Bloom filter can save you a disk/network call, it's generally not worth the extra overhead.
3. You rarely/never remove items, and your data grows slowly.
Many problems that can be solved with a Bloom filter can also be solved by simply buying more RAM. Many problems that can be solved with a Bloom filter can also be solved by using a faster lookup. (everybody overlooks hash tables. I don't know why.) Bloom filters are _not_ a general purpose algorithm. You really need to grok your data before pulling the trigger on a Bloom filter. (don't forget to benchmark)
> (everybody overlooks hash tables. I don't know why.)
It is probably worse than you think since a hash table can also replace the Bloom filter in its "false positives allowed" filtration role. Not sure if that's what you meant.
If all you save is the "fingerprint" (or a B-bit truncated hash code of the keys) then you get a very similar probabilistic data structure. Further, if you put these small integer fingerprints in an open addressed hash table with linear probing then you get a (very likely) single-cache-miss test for membership. This alternate structure was, in fact, also analyzed by Burton Bloom (in the very same paper in which he introduced his famous filter). He found fingerprint filters to be slower only on very unrealistic memory systems with neither cache nor "word fetch" structure, where access to every bit had unit cost no matter its location. If you can grab a big batch of bits at unit cost the analysis changes a lot. Modern systems grab 512..1024 bits at the same cost as 1...
This fingerprint filter alternative does take 2..4x more space depending on scale and allowed false positive rates, but that space is organized better in terms of caches. Most would choose to spend that space to save k-1 cache misses of time in today's era of 60-120 ns memory and multi GHz CPUs. It is slightly easier to implement a large bitvector than an array of B-bit numbers, depending on B. { It's pretty trivial for B=8*m, though. ;-) } It's kind of crazy that hash tables for this purpose fell into "also ran/obscure" status...
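In case a concrete picture helps, here is a minimal sketch of that kind of fingerprint filter (illustrative code of mine, not from Bloom's paper): an open-addressed table of 8-bit fingerprints with linear probing, so a membership test normally touches a single cache line.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Open-addressed table of 8-bit fingerprints with linear probing.
// Slot value 0 means "empty"; stored fingerprints are forced nonzero.
struct FingerprintFilter {
    slots: Vec<u8>,
    mask: usize, // table size is a power of two, mask = len - 1
}

impl FingerprintFilter {
    fn new(slots_pow2: usize) -> Self {
        assert!(slots_pow2.is_power_of_two());
        FingerprintFilter { slots: vec![0; slots_pow2], mask: slots_pow2 - 1 }
    }

    // One 64-bit hash supplies both the home bucket and a nonzero fingerprint.
    fn bucket_and_fp<T: Hash>(&self, item: &T) -> (usize, u8) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let h = h.finish();
        ((h as usize) & self.mask, ((h >> 56) as u8).max(1))
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let (mut i, fp) = self.bucket_and_fp(item);
        // Assumes the table never fills up completely (keep the load factor sane).
        while self.slots[i] != 0 && self.slots[i] != fp {
            i = (i + 1) & self.mask; // linear probing
        }
        self.slots[i] = fp;
    }

    // "Probably present": false positives possible, no false negatives,
    // because a lookup only stops at an empty slot or a matching fingerprint.
    fn contains<T: Hash>(&self, item: &T) -> bool {
        let (mut i, fp) = self.bucket_and_fp(item);
        while self.slots[i] != 0 {
            if self.slots[i] == fp {
                return true;
            }
            i = (i + 1) & self.mask;
        }
        false
    }
}

fn main() {
    let mut f = FingerprintFilter::new(1 << 21); // ~2M slots = 2 MiB
    f.insert(&"https://example.com/some/long/url");
    assert!(f.contains(&"https://example.com/some/long/url"));
    assert!(!f.contains(&"almost certainly absent")); // could, rarely, be a false positive
}
```

Deletions and very high load factors need more care (tombstones, or the cuckoo/Robin Hood variants discussed in the replies), but for an insert-only filter this is about as simple as it gets.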
If you switch from linear probing to cuckoo hashing, you end up with a cuckoo filter, which is more compact than a Bloom filter for a given false-positive rate. The compactness comes from the fact that cuckoo hashing allows load factors of over 90%, which are impractical for linear probing.
Well, sure. You can also switch to Robin-Hood linear probing to work well at high load factors and retain the almost never >1 DRAM hit. These are just various hash table options to claw back some load factor ratios of space usage.
If you can store the entire BTree, a Bloom filter is useless. Bloom filters are compact caches for very large DBs with high negative lookup rates. Instead of thinking of it as N cache misses, think of it as querying a remote database vs. a local vector lookup.
The entire Bloom filter is a loss in terms of space. The Bloom filter may give me better lookup performance (by skipping lookups in the underlying data structure) at the expense of this space. But this is if the Bloom filter itself is fast.
I think you're getting a few things wrong. Bloom filters do not serve as data structures for faster lookups; on the contrary, other approaches probably perform much better. Bloom filters are only useful because of their low space complexity. E.g. look at genome assembly https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatic.... Storing and looking up every k-mer is super expensive (memory-wise). Bloom filters allow for efficient k-mer lookup without actually having to store all the possible k-mers in memory.
Bing's BitFunnel paper is a pretty interesting read. The parts about scaling Bloom filters by flipping their axis and using multiple filter layers are particularly of note if you are actually interested in production use.
“This looks really promising! A Bloom Filter that represents a set of million items with a false-positive rate of 0.01 requires only 9585059 bits and 7 hash functions.”
My immediate thought here was that this wasn't very space efficient as it requires over 9 bits per item?
I'd have to dig more into this though to confirm or deny my intuition!
Yes and no. It requires 9 bits per item in the filter, but the total keyspace may be much larger. If your items are keyed based on 64-bit identifiers, you've compressed things by a factor of 7.1.
Furthermore, most uses of Bloom filters I've seen are indexing something notably larger than 64 bits, and usually variable length: arbitrary names, URLs/URIs, etc.
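For what it's worth, the ~9.6 bits per item falls straight out of the usual sizing formula, and it sits about 44% above the information-theoretic floor for a 1% error rate (my arithmetic, not the article's):

```latex
\frac{m}{n} = \frac{-\ln p}{(\ln 2)^2} \approx 1.44\,\log_2\!\frac{1}{p}
\quad\Longrightarrow\quad
p = 0.01:\;\; \frac{m}{n} \approx 9.59 \text{ bits per item, vs. } \log_2\tfrac{1}{p} \approx 6.64 \text{ bits minimum.}
```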
If you are looking up a key value, then chances are it does exist in the DB, so the Bloom filter won't save you anything, as you are then hitting slower storage to read the item that the key represents anyway.
Bloom filters are suited to situations where you are looking for larger non-keyed data such as URLs, where there is a fairly good chance that what you are looking for isn't in the data yet, so the "no false negatives" property is important for saving further accesses (which is why cache arrangements are where you'll often find these structures used).
Bloom filters confused me at first, but they're really cool. I like to think of Bloom filters with an (oversimplified) physics analogy. When you collide two particles together, other smaller particles come flying out of the explosion. You measure their tracks with a film. If you keep smashing particles together, eventually the film will be saturated and you will basically lose the ability to ascertain with confidence whether or not two specific particles were smashed together.
I've read that some old factories used Bloom filters to quickly look up where a worker was. In the morning they pressed three buttons on a wall that represented a workshop, and with knowledge of these numbers one could check whether a given person was not in that workshop.
I've seen a similar technique, but in reverse, at a factory myself: the 5-7 people closest to the gate in the queue pressed their buttons, and a controller collected their outstanding passes from the other side and then quickly checked them against their faces. That saved a lot of time in the queue.
Bikeshedding a little, but that probably shouldn't be an array. hashers[0] and hashers[1] aren't semantically equivalent, and the only use that array gets is fetching an element by index. Something like `startHasher` and `stepHasher` would probably better convey intent.
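Agreed. Assuming the two hashers are combined with the usual double-hashing trick (the i-th bit index being start + i * step), an illustrative sketch of what the renamed fields could look like (my code, not the article's):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

// Two independently seeded hashers with intent-revealing names, combined via
// double hashing: the i-th bit index is start + i * step (mod m).
struct BloomHashes {
    start_hasher: RandomState, // picks the first bit index
    step_hasher: RandomState,  // picks the stride between subsequent indices
}

impl BloomHashes {
    fn indices<T: Hash>(&self, item: &T, k: usize, m: usize) -> impl Iterator<Item = usize> {
        let start = self.start_hasher.hash_one(item);
        let step = self.step_hasher.hash_one(item) | 1; // odd stride avoids a degenerate zero step
        (0..k as u64).map(move |i| (start.wrapping_add(i.wrapping_mul(step)) % m as u64) as usize)
    }
}

fn main() {
    let hashes = BloomHashes { start_hasher: RandomState::new(), step_hasher: RandomState::new() };
    let bits: Vec<usize> = hashes.indices(&"example key", 7, 9_585_059).collect();
    println!("{bits:?}"); // the 7 bit positions to set on insert / test on lookup
}
```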
Very nice read. I knew Bloom filters already, but the detailed explanation and the Rust code (almost Greek to me) were very nicely presented. I also read through the linked page of Kiran.
I like hand drawn diagrams and use them a lot, especially since work gave me an ipad, but whenever I see them in ‘high-production-value’, well researched and edited blog posts like this, I am reminded that it still seems to be a hard problem to quickly produce good quality vector diagrams (with foss software).
Has the state of the art moved on recently, or is it still a case of fighting TikZ?
1. https://cglab.ca/~morin/publications/ds/bloom-submitted.pdf
2. Algorithm-A in https://sci-hub.tw/10.1109/proc.1979.11543