
Bloom Filters for the Perplexed - kedmi
https://sagi.io/2017/07/bloom-filters-for-the-perplexed/
======
voidmain
Bloom filters are a nice data structure, and you should absolutely have them
in your toolbox, but if you go looking for a reason to use one you are likely
to wind up making things worse. The following is not valid reasoning: "Bloom
filters are efficient. Therefore if I can find a way to use a bloom filter, my
solution will be efficient."

The "SSH keys" protocol in the article seems like an example of this. It
doesn't make any sense. Why would the server send the client a Bloom filter if
the client has already told it what key it wants to check? The server only has
to send one bit back to the client! And if the goal is to not trust the server
with the client's (public) key, this protocol doesn't accomplish that either.

And if you do for some reason have to transmit the entire database of
compromised SSH keys in a way that permits only membership tests, a Bloom
filter isn't the most compact way to do it! For example, off the top of my
head, you could calculate a (15+N)-bit hash for each element of the list,
sort the hashes, and Rice-code the deltas. That would take very roughly 32768
* (N+2) bits and give about 1 in 2^N false positives. So for N=13 it is about
the size of the Bloom filter in the article but gives a false positive rate 8
times lower. This data structure isn't random access like a Bloom filter, but
that doesn't matter for something you are sending over the network (which is
always O(N)).
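
A rough Python sketch of that construction (function names and parameters are illustrative, not from the article): hash every key into a narrow range, sort, and Rice-code the gaps, which average about 2^N when the hash width is log2(count)+N bits.

```python
import hashlib

def rice_encode(deltas, k):
    """Rice-code each delta: quotient d >> k in unary, then k remainder bits."""
    bits = []
    for d in deltas:
        q, r = d >> k, d & ((1 << k) - 1)
        bits.extend([1] * q + [0])                      # unary quotient + terminator
        bits.extend((r >> i) & 1 for i in reversed(range(k)))
    return bits

def compact_membership_bits(keys, n_fp_bits):
    """Hash each key to (log2(len(keys)) + N) bits, sort, Rice-code the deltas.

    Gives roughly a 1-in-2^N false positive rate in ~(N+2) bits per key.
    """
    width = (len(keys) - 1).bit_length() + n_fp_bits
    hashes = sorted(
        int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % (1 << width)
        for key in keys
    )
    # Deltas between sorted hashes average ~2^N, so Rice parameter N is near-optimal.
    deltas = [hashes[0]] + [b - a for a, b in zip(hashes, hashes[1:])]
    return rice_encode(deltas, n_fp_bits)
```

Membership testing against this encoding means decoding the delta stream (it's sequential, not random access), which, as noted, is fine for data shipped over the network.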

~~~
gopalv
> The server only has to send one bit back to the client!

As much as the example is a bad one because it leaks server-side info to an
unauthenticated client, I've had scenarios where, if you have more than 3 SSH
keys in your keychain, all SSH login attempts fail once 3 keys have been tried
and rejected. I end up writing a lot of ~/.ssh/config entries so the client
remembers which key to try first.

My favourite real-life bloom-filter story is the "unsafe URLs" list that is in
Chrome - the "Safe Browsing Bloom" is a neat way to send out obscured
information about the bad URLs without actually handing out a list to a user.
The URLs or domains that find a match in the filter do need to be checked
upstream, but it avoids having to check every single request with a central
service.

On a similar note, I've been playing with a variant of Bloom filters at work
called a Bloom-1 filter [1], which is much faster than a regular Bloom filter,
since the latter incurs a lot of random memory accesses for 1-bit reads.

[1] -
[https://github.com/prasanthj/bloomfilter/blob/master/core/sr...](https://github.com/prasanthj/bloomfilter/blob/master/core/src/main/java/com/github/prasanthj/bloomfilter/Bloom1Filter.java#L179)

------
malkia
When I first heard of Bloom filters, I thought of the Bloom effect -
[https://en.wikipedia.org/wiki/Bloom_(shader_effect)](https://en.wikipedia.org/wiki/Bloom_\(shader_effect\))
\- and we used to call these things filters for a while - then a friend of
mine told me about the other meaning :)

~~~
QuercusMax
I think the first thing any article about Bloom filters should mention is that
they were invented by a guy named Burton Bloom. They don't have anything to do
with "blooming" of any sort.

Very similar with Shellsort, which was designed by Donald Shell.

~~~
jszymborski
I always thought it was called shell sort because it invoked the image of a
shell game and the pointer was like a shell covering the array element!!

Thanks for the edification!!!

------
captaintacos
Some weeks ago someone posted a very useful interactive demo of a bloom filter
(implemented in js) that you might want to play with after reading this
article:
[https://www.jasondavies.com/bloomfilter/](https://www.jasondavies.com/bloomfilter/)

------
tzs
How does a Bloom filter with k hash functions hashing to a shared table of m
bits compare to just using k hash functions each hashing into its own separate
hash table of m/k bits?

~~~
cmurphycode
Having many filters is worse, but not hugely so. I believe the problem reduces
to the fact that you'll have many smaller filters, so the random distribution
hurts you a bit more.
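
Under the usual independent-hash idealization, you can compare the two designs in closed form. This is a back-of-the-envelope check (not from the papers below): one shared m-bit table with k hashes versus k separate tables of m/k bits with one hash each.

```python
def fp_shared(m, k, n):
    """False-positive rate: one shared table of m bits, k hashes, n items."""
    p_set = 1 - (1 - 1 / m) ** (k * n)     # probability any given bit is set
    return p_set ** k

def fp_partitioned(m, k, n):
    """False-positive rate: k separate tables of m/k bits, one hash each."""
    p_set = 1 - (1 - k / m) ** n           # per-table fill is slightly higher
    return p_set ** k

# Example: 64 Kibit total, 5 hash functions, 8192 items.
m, k, n = 1 << 16, 5, 1 << 13
print(fp_shared(m, k, n))
print(fp_partitioned(m, k, n))
```

The partitioned rate is always a little higher, because (1 - k/m)^n < (1 - 1/m)^(kn), but at realistic sizes the gap is a fraction of a percent, matching "worse, but not hugely so".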

The key phrase to google is "blocked bloom filter", e.g. as proposed in
[http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)

Here's a nice paper with some improvements
[http://tfk.mit.edu/pdf/bloom.pdf](http://tfk.mit.edu/pdf/bloom.pdf)

We use blocked Bloom filters for a couple of reasons, but one major benefit is
memory locality (our "bloom filter" is 32GB or larger, so it's handy and fast
to be able to address it with separate "pages", which are really just
individual Bloom filters).
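
A minimal sketch of the blocked idea (an illustrative Python class, not the Java implementation linked above): one hash picks a cache-line-sized block, and all k probe bits land inside that block, so a lookup touches one cache line instead of k scattered ones.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line per block: the point of the scheme

class BlockedBloomFilter:
    def __init__(self, n_blocks, k=4):
        self.n_blocks, self.k = n_blocks, k
        self.blocks = [0] * n_blocks        # each int models one block's bits

    def _positions(self, item: bytes):
        h = int.from_bytes(hashlib.sha256(item).digest(), "big")
        block, h = h % self.n_blocks, h // self.n_blocks
        bits = []
        for _ in range(self.k):
            bits.append(h % BLOCK_BITS)     # all k probes stay inside one block
            h //= BLOCK_BITS
        return block, bits

    def add(self, item: bytes):
        block, bits = self._positions(item)
        for b in bits:
            self.blocks[block] |= 1 << b

    def __contains__(self, item: bytes):
        block, bits = self._positions(item)
        return all(self.blocks[block] >> b & 1 for b in bits)
```

Real implementations address one flat bit array in 64-byte strides; the list of ints here just makes the block addressing visible. The trade-off is a slightly higher false-positive rate, since items crowd into blocks unevenly.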

------
jason_slack
Can anyone tell me about a use case in game design?

~~~
notgood
This link[0] has an example. Basically, it fits almost anything in a game that
has to look up membership in massive amounts of data (e.g. logs). The example
in the article is quickly checking whether the user has already seen an item,
in a game with thousands of items (e.g. because those were pseudo-randomly
generated, think RPGs). But it's easy to think of further applications:
quickly checking whether the player has already played this exact chess game
before (all pieces in the same positions), to make sure the enemy does
something smarter this time (because last time the enemy lost), and so on.

[0] [https://blog.demofox.org/2015/02/08/estimating-set-membership-with-a-bloom-filter/](https://blog.demofox.org/2015/02/08/estimating-set-membership-with-a-bloom-filter/)

~~~
Retric
It's not clear to me that the added complexity (i.e. risk of bugs) is worth it
in your case for any reasonable game design.

In the chess example, knowing it's possible to lose from X positions is almost
meaningless most of the time. The issue is that search spaces are either too
large to be useful or small enough for brute force.

~~~
notgood
You can always use it for things that are not critical for the game to work,
and a good coder would put it in a separate class/container so most possible
bugs are isolated as well.

In the chess example, the most important thing is not for the machine to win
(or lose); it's that the user feels it's not the same game they already
played, or -dare I say- thinks that the enemy is "learning".

~~~
Retric
Even cellphones can run chess programs powerful enough to beat any human. The
normal way to tone them down is to make them more random, which prevents games
from repeating in the first place and makes this a non-issue.

Even ignoring that, and assuming you wanted to do a lookup against all games
played with someone, that's a tiny dataset, so looking it up directly has no
real downside.

------
oli5679
Every month or so there is a new version of an article like this posted on HN.

~~~
bradleyjg
I read a few of these articles a couple of years ago, along with a few of the
inevitable cuckoo filter rejoinders.

I think they are neat algorithms and I'm glad to have come across them. But
that said, I have yet to find a problem in my day to day work which required
set membership, with space at a premium, and where false positives were
acceptable. So I've never used either in anger.

~~~
arielweisberg
If you are bored with Bloom and cuckoo filters, then check out quotient
filters. Quotienting was one of those mind-blown things for me.

~~~
jason_slack
Thanks for the reading list! :-)

