
What are Bloom filters? (2015) - trumbitta2
https://medium.com/the-story/what-are-bloom-filters-1ec2a50c68ff#.x11blygds
======
bradleyjg
Whenever considering a bloom filter also look at cuckoo filters. There are
pros and cons of each approach.

See
[https://bdupras.github.io/filter-tutorial/](https://bdupras.github.io/filter-tutorial/)
and
[http://11011110.livejournal.com/327681.html](http://11011110.livejournal.com/327681.html)
as well as the corresponding HN discussions
[https://news.ycombinator.com/item?id=12124722](https://news.ycombinator.com/item?id=12124722)
and
[https://news.ycombinator.com/item?id=11795779](https://news.ycombinator.com/item?id=11795779)

~~~
lpage
For that matter, cuckoo hashing [1] in general is an interesting topic. Take a
look at its application to cache oblivious dictionaries [2]. They work quite
well in practice, especially for space constrained hash tables (e.g. caches).

[1]
[https://en.wikipedia.org/wiki/Cuckoo_hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing)

[2] [https://arxiv.org/abs/1107.4378](https://arxiv.org/abs/1107.4378)

------
gwbas1c
I read about halfway into this article, and when the author started talking
about Javascript, the "this" keyword, and the politics about the job, I
stopped.

What does that have to do with bloom filters? Get to the point!

~~~
aji
the post is how he's returning a favor, which he explains in the post, in the
"returning the favor" section

~~~
slang800
But that doesn't tell me how Bloom filters work. I don't care about his dinner
or what favors he owes, unless he's planning on turning those expository
details into a clever metaphor later in the article, or if they really are
something that's important for me to understand.

I'm all for educational pieces that are told through a conversation between
characters - you can get some creative writing out of that. However, the bug
that he's asked to fix, the Polish typographer friend, and the dinner scene
are barely connected to the subject. We don't even get a nice technical
explanation of why the bug is impossible to fix or what that means with
regards to the trade-offs in using Bloom filters.

------
hal9000xp
There is a cool and very quick explanation of Bloom filters:

[https://www.youtube.com/watch?v=-SuTGoFYjZs](https://www.youtube.com/watch?v=-SuTGoFYjZs)

------
s_q_b
Bloom filters are beautiful and highly underutilized.

I used a Bloom filter in an interview and they looked at me like it was some
form of arcane sorcery.

Isn't this standard in algorithms and data structures courses? Does anyone
know if it has been removed from the curriculum?

~~~
Retric
Bloom filters have a lot of problems for critical processes in practice. They
are best used when being wrong is a non-issue or as a type of cache, but
performing a hash is generally _really_ slow. Also, the way virtual memory
works, looking up an answer often makes storing the new value much faster.

So, while cool, they have generally been replaced by far more useful and less
complex topics.

One example is if you're looking for duplicates among huge data sets. You farm
out your filter to 1,000 machines, merge them, then re-run to get a list of
possible collisions. Then compare that list. But, well, merge sort can do the
same thing a little slower, it's a rare problem, and the pathological case for
Bloom filters is really bad. So, even if you know about Bloom filters, you may
go with the simpler idea and be done sooner.

PS: In that example hash speed might mean sorting is actually faster.
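
For what it's worth, the merge step in that example can be sketched in Python.
Filters built with the same size and the same hash functions combine with a
bitwise OR. Everything named here is an assumption for illustration (the
salted SHA-256 hashing, the 4096-bit size, the helper names), and so is the
choice to check each shard against the union of the *other* shards' filters,
which is only one reading of "merge them, then re-run".

```python
import hashlib


def positions(key, num_bits, num_hashes):
    """Return k bit positions for a key, from salted SHA-256 digests."""
    result = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        result.append(int.from_bytes(digest[:8], "big") % num_bits)
    return result


def build_filter(keys, num_bits=4096, num_hashes=4):
    """Build one shard's Bloom filter, stored as a Python int bitset."""
    bits = 0
    for key in keys:
        for pos in positions(key, num_bits, num_hashes):
            bits |= 1 << pos
    return bits


def merge_filters(filters):
    """Union of filters with identical parameters is a bitwise OR."""
    merged = 0
    for bits in filters:
        merged |= bits
    return merged


def might_contain(bits, key, num_bits=4096, num_hashes=4):
    """False means definitely absent; True means possibly present."""
    return all((bits >> pos) & 1 for pos in positions(key, num_bits, num_hashes))


def cross_shard_candidates(shards):
    """Keys that may also appear on another shard (may include false positives)."""
    filters = [build_filter(shard) for shard in shards]
    candidates = set()
    for i, shard in enumerate(shards):
        others = merge_filters(f for j, f in enumerate(filters) if j != i)
        candidates.update(k for k in shard if might_contain(others, k))
    return candidates
```

The candidate list still has to be compared exactly afterwards, since any of
its entries can be a false positive.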

~~~
jedberg
The place where bloom filters really shine is doing lookups on data where most
of the time the data isn't there.

For example, when you load a page on reddit, it has to check for a vote on
every single comment on the page to see if you voted on it. Chances are 99% of
the time you didn't vote on an item. Using the built-in Bloom filter in
Cassandra, it can very quickly tell if you voted on an item without having to
hit the data store most of the time, which saves a ton of time.
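
A minimal Python sketch of that pattern, for illustration only: this is not
reddit's or Cassandra's actual code. The `BloomFilter` and `VoteStore` classes,
the salted SHA-256 hashing, and the sizes (1024 bits, 3 hashes) are all
assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k bit positions per key over an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class VoteStore:
    """Hypothetical stand-in for a slow backing store guarded by a filter."""

    def __init__(self):
        self.votes = {}
        self.filter = BloomFilter()
        self.slow_lookups = 0  # counts how often the store is really hit

    def record_vote(self, user, comment_id, direction):
        self.votes[(user, comment_id)] = direction
        self.filter.add(f"{user}/{comment_id}")

    def get_vote(self, user, comment_id):
        if not self.filter.might_contain(f"{user}/{comment_id}"):
            return None  # definite miss: skip the store entirely
        self.slow_lookups += 1  # possible hit: confirm against the store
        return self.votes.get((user, comment_id))
```

Because the filter has no false negatives, a `False` from `might_contain` can
safely skip the store; a `True` still has to be confirmed there.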

(As a side note, there is an extra optimization that makes things even faster
-- reddit tracks the last time you voted on anything as an attribute of your
user, so if the page is newer than your last vote, it just assumes you
couldn't possibly have voted on it, skipping the lookups altogether).

~~~
Retric
A list of votes per user per page would also work. ~90% of the time that list
would be empty, and if it's not, you're likely going to eventually need to know
which specific items were voted on, so you might as well just load them all.

Further, voting records are just not that big a data structure. Sure, it's
much better than a poor implementation and might be a great optimization to
bolt onto a different design, but again it's niche with minimal benefits when
starting from scratch.

Edit: Now, I really want to know how they actually do this.

~~~
jedberg
> A list of votes per user per page would also work. ~90% of the time that
> list would be empty,

Not really. People tend to vote on one or two things on a page. So you'll have
a long list of people who voted on one or two of the 1000+ items, and now
you've had to do an extra lookup.

It's a lot faster to just go straight for the votes and take advantage of the
bloom filter.

~~~
zardeh
You work(ed) at reddit, d(o|idi)n't you?

~~~
jedberg
Worked, yes.

------
taeric
Bloom filters originally upset me because I expected the word "Bloom" to refer
to an action of the algorithm. Not to someone's name.

I'm fond of Concierge filtering as an easy description of how they work. I
view them as a brief conversation with the front desk staff at a hotel to
determine if someone is there. "Is anyone here wearing a hat? Is anyone here
over 6'? Did anyone come in carrying a briefcase?"

~~~
tnecniv
> Bloom filters originally upset me because I expected the word "Bloom" to
> refer to an action of the algorithm. Not to someone's name.

I thought Canny edge detection had to do with the fact that the edge detector
was canny.

~~~
abecedarius
There ought to be a list of surprisingly eponymous things like PageRank,
Killing vector, Poynting vector, ...

------
pklausler
If you like hash functions and Bloom filters, you'll also enjoy Jon Bentley's
description of the original UNIX spell(1) utility, which ran in 64KB by
representing its dictionary as a sparse bitmap.

~~~
drchickensalad
Is this in programming pearls? Otherwise where?

~~~
pklausler
I saw it in a column in Comm. of the ACM a long time ago, sorry.

~~~
dalke
McIlroy's "Development of a Spelling List" (1982) paper is
doi:10.1109/TCOM.1982.1095395, available at
[https://web.archive.org/web/20120324185456/http://unix-spell...](https://web.archive.org/web/20120324185456/http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf)
. I found it from
[https://code.google.com/archive/p/unix-spell/](https://code.google.com/archive/p/unix-spell/)
, which also says:

> It is interesting to note that one of the world's best compressor (paq8l)
> can compress the 250kb word list down to 48.5kb, less than the space taken
> by the lossy compression methods proposed in the paper! For comparison,
> regular modern compression methods (such as gzip, lzma etc) only achieve
> twice that size (85-90kb).

Bentley's commentary is in "A Spelling Checker", CACM 28(5): 456-462 (1985),
doi:10.1145/3532.315102 , behind the paywall at
[http://dl.acm.org/citation.cfm?id=315102&dl=ACM&coll=DL&CFID...](http://dl.acm.org/citation.cfm?id=315102&dl=ACM&coll=DL&CFID=803658785&CFTOKEN=18698707)
and in the book "Programming Pearls", Second Edition, according to
[http://www.linuxjournal.com/article/3846](http://www.linuxjournal.com/article/3846)
.

------
geophile
He takes a long time to get to the point, doesn't he?

------
wstrange
OpenAM uses Bloom filters to blacklist sessions. Neil has a nice write-up
here:

[https://neilmadden.wordpress.com/2016/02/25/stateless-sessio...](https://neilmadden.wordpress.com/2016/02/25/stateless-session-logout-in-openam-13/)

[Disclaimer, I work for ForgeRock]

------
m00dy
I have also researched this topic lately. You can look at my scientific
results:
[https://github.com/m00dy/BloomFilter](https://github.com/m00dy/BloomFilter)

------
looki
Brilliant! I never anticipated Bloom filters would be so simple. What comes to
my mind is that for "forgetful" Bloom filters, one could remove a hash from
the table, also erasing all equivalent entries from the memory. Is this a
useful trade-off in practice?

~~~
prashnts
No, it isn't.

The problem here would be that you'd be essentially invalidating arbitrary
hashes from the table with each operation where they share the same "row" (or
address in a bitmap) with other hashes.

Suppose we have a table such that "foo", "bar", "baz" all have an overlapping
entry -- if we attempt to remove any of them, we'd remove the rest also.

That being said, you might be interested in counting filters [0] instead,
which support delete operations.

[0] -
[https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters](https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters)
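
To make the difference concrete, here is a rough Python sketch of a counting
filter. Each slot holds a small counter instead of a single bit, so removing
one key only decrements the slots it shares with others. The class name, the
salted SHA-256 hashing, and the default sizes are assumptions for
illustration, not taken from any particular library.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant with one counter per slot, so keys can be removed."""

    def __init__(self, num_slots=1024, num_hashes=3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counts = [0] * num_slots

    def _positions(self, key):
        # Derive k slot indexes by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, key):
        for pos in self._positions(key):
            self.counts[pos] += 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.counts[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Only valid for keys that were really added. The check below catches
        # obvious misuse, but a false positive can still slip through and
        # corrupt the counters.
        if not self.might_contain(key):
            raise KeyError(key)
        for pos in self._positions(key):
            self.counts[pos] -= 1
```

With plain bits, clearing the slots for "foo" would also erase "bar" and "baz"
wherever they overlap; with counters, the shared slots stay above zero. The
price is several bits per slot instead of one.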

~~~
looki
Thanks. That's what I meant by deleting "equivalent entries", I was
wondering if there are cases where that would still be a net gain. But it
makes sense that there are more specialized versions for such things.

PS I don't understand why my comment was downvoted, is there something wrong
with asking such questions here?

~~~
nitrogen
_I don't understand why my comment was downvoted, is there something wrong
with asking such questions here?_

Not AFAIK, but sometimes people downvote questions that are based on
misunderstandings or false premises. Give it some time and you'll probably be
voted back up to where you started.

------
warrenmar
Understanding Bloom filters helps you understand Count-min sketch.
[https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch)

------
HammadB
This was a god-awful pain to read. Meanders around the point of bloom filters
endlessly.

------
jasode
FYI... the previous discussion about venison glazed in honey (or was it bloom
filters? I can't remember for sure):

(July 2015)
[https://news.ycombinator.com/item?id=9918365](https://news.ycombinator.com/item?id=9918365)

------
Kenji
Ah, another article about Bloom filters on HN.

That reminds me, does anyone know if Windows 10 uses bloom filters?
Specifically, if I have a folder with a lot of files inside and I paste a file
into it, if the file name is different from all the names of the files in the
folder, it is nearly instant (true negative, no further testing needed) but if
the file name matches one of the names, it takes significantly longer (true or
false positive of bloom filter -> check all files manually for a match).

It's just something that's been going through my mind lately when I moved
files.

