
Bloom Filters by example - gpsarakis
http://billmill.org/bloomfilter-tutorial/
======
javindo
Interesting anecdote: an alumnus of my university was chatting with me recently
about his work at Google. Essentially, he replaced what he described as a
"large, complex machine learning pipeline" with a "simple bloom filter" for
displaying product results at the top of a search page if it determines you
have searched for a product.

For example, if you search for "iPhone 5S", the filter determines whether to
show something like this
[http://i.imgur.com/Dp3y1Gi.png](http://i.imgur.com/Dp3y1Gi.png) (not sure if
sponsored makes a difference here, possibly a bad example).

~~~
yeukhon
Is EE shop sites you would have visited before the search?

> a "simple bloom filter" for displaying product results at the top of a
> search page if it determines you have searched for a product.

"have searched" is not clear to me. I was expecting Google would just spit out
a JSON response with all the search query for page 1 (including sponsored
ads). Can you please elaborate on this?

~~~
nicolethenerd
> Is EE shop sites you would have visited before the search?

No. I think you might be parsing his statement wrong - it's not "have
searched" as in searched for previously, it's "have searched for a PRODUCT" as
in, your search is for a product (something purchasable in a store, preferably
an online store) as opposed to something else.

The Bloom filter figures out if your search is for a product, and if it is,
puts that box w/ sponsored links to stores that sell that product up at the
top of your results.

------
mbostock
Jason Davies made a nice visual explanation of bloom filters:
[https://www.jasondavies.com/bloomfilter/](https://www.jasondavies.com/bloomfilter/)

~~~
llimllib
His demo is much smoother than mine, I'm jealous.

------
GhotiFish
I always loved bloom filters, because to me, they showed something that was
very fundamentally important.

That the amount of information required so that you can say whether an element
is in a set or not is significantly less than the set itself.

That is just weird, but amazing.

If someone asks me about weird and wonderful things in computer science, this
is what I talk about.

~~~
MichaelGG
Well, it doesn't tell you the element's membership. It can only tell you that
it might be in, or that it definitely isn't.

Like reviewing a passenger list where everyone's name is abbreviated to
initials.
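
To make the "might be in, or definitely isn't" distinction concrete, here is a minimal sketch in Python. It is not the tutorial's code: the class name, the one-byte-per-bit array, and the use of salted SHA-256 as the k hash functions are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Minimal sketch: m bits and k salted SHA-256 hashes (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, to keep the code readable

    def _indexes(self, item: str):
        # Assumption: k "different" hashes are made by prefixing a counter.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True only means "possibly added".
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter(m=1024, k=3)
for name in ["alice", "bob", "carol"]:
    bf.add(name)
```

Anything that was added always answers True, since its bits were set and are never cleared. A name that was never added usually answers False, but can answer True if other names happened to set all of its bits.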

------
fennecfoxen
Now we just need HyperLogLog by example.

(HyperLogLogs are vaguely similar data structures in that you hash things a
bunch to store a value inside, except that instead of answering set-membership
queries, they estimate cardinality.)
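
For a rough idea of what that looks like, here is a small HyperLogLog sketch in Python. It is a simplified version under stated assumptions: SHA-256 truncated to 64 bits as the hash, 2**p registers, and only the small-range (linear counting) correction. The class and method names are made up for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 10) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small sets: fall back to linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates never change a register
print(round(hll.estimate()))  # an estimate of the 10,000 distinct items
```

With p=10 the sketch keeps 1024 small counters regardless of how many items are added, and the standard error is roughly 1.04 / sqrt(1024), or about 3%.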

~~~
abolibibelot
Well, there you are: a nice explanation of HyperLogLog, along with graphs, and
a javascript demo to play with. And a tribute to Philippe Flajolet who came up
with the algorithm.

[http://blog.aggregateknowledge.com/2012/10/25/sketch-of-
the-...](http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-
hyperloglog-cornerstone-of-a-big-data-infrastructure/)

The Aggregate Knowledge blog is mind blowing for everything sketch related.

~~~
llimllib
Is that demo widget custom, or did they use a tool to generate it? Do you
know?

~~~
abolibibelot
According to this comment
[http://blog.aggregateknowledge.com/2012/10/25/sketch-of-
the-...](http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-
hyperloglog-cornerstone-of-a-big-data-infrastructure/#comment-789) , it's
custom made and based on d3.

~~~
llimllib
It would make the world a lot better if somebody made it easy to create that
widget in IPython Notebook or some similar tool.

#IShouldDoThatButIProbablyWont

------
makmanalp
Question, instead of using two hash functions, why not use just a single
function but use a = hash(foo) and b = hash(foo+"something")? A well
distributed hash function should give drastically different results for a and
b for different values of foo, which is what we want, correct?

~~~
kurige
It's standard to have at least two hashing functions. You can then simulate
having _x_ hash functions simply by combining those two.[1]

But your question is entirely valid, and I'm not really sure why you
_couldn't_ do that. It might simply be that the kinds of hash functions used in
bloom filters (non-cryptographic) are _not_ well distributed.

[1]:
[http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060...](http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060353E67A356EF9528D2C57C064F5A?doi=10.1.1.152.579&rep=rep1&type=pdf)
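
A sketch of the "combine two hashes" idea, assuming the usual double-hashing form g_i(x) = h1(x) + i * h2(x) mod m. Taking h1 and h2 from two halves of a single SHA-256 digest is a choice made for this example, not something from the thread, and the function name is hypothetical.

```python
import hashlib


def double_hash_indexes(item: str, k: int, m: int) -> list[int]:
    """Simulate k hash functions as g_i(x) = (h1(x) + i * h2(x)) mod m."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Force h2 odd: when m is a power of two, the k positions are then
    # guaranteed to be distinct (for k <= m).
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


positions = double_hash_indexes("foo", k=5, m=1024)
```

Only one digest is computed per item, however many bit positions are needed. The salted variant from the question, hash(foo + "something"), also works but recomputes the hash once per position.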

~~~
makmanalp
Fascinating, thank you! I'll take a look at the paper when I get back home.

------
rethab
Chapter 26 of 'Real World Haskell' shows how to design a library by
implementing a bloom filter: [http://book.realworldhaskell.org/read/advanced-
library-desig...](http://book.realworldhaskell.org/read/advanced-library-
design-building-a-bloom-filter.html)

------
pearkes
bloomd[1] is a "high performance c server for bloom filters" for those looking
for a low-overhead and deployable implementation with quite a few clients,
like Python and Ruby.

[1] [https://github.com/armon/bloomd](https://github.com/armon/bloomd)

------
ngcazz
Wow, I didn't realize they were this simple to implement and reason about!

------
sandwell
This looks very cool, however I was able to get lots of false positives using
the first few letters of the alphabet as test data. Could you reduce these by
using k hashes and k bit vectors with m bits, i.e. one bit vector for each
hash algorithm, at the expense of a little extra computation?

~~~
IanCal
That's probably because it's a tiny filter. In reality you'd calculate the
size more appropriately:

[http://hur.st/bloomfilter?n=4&p=1.0E-20](http://hur.st/bloomfilter?n=4&p=1.0E-20)

Here you can give it a false positive rate you're happy with and a number of
elements. It'll give you the optimal size and number of hashes to use. Your
suggestion would use far too much space.
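
The standard sizing formulas behind calculators like that are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hashes. A short Python sketch follows; the function names are invented here, and rounding m up and k to the nearest integer is one reasonable choice among several.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bit count and hash count for n items at target rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Usual approximation (1 - e^(-k*n/m))^k after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(n=1000, p=0.01)        # (9586, 7)
rate = expected_false_positive_rate(m, k, 1000)  # close to the 1% target
```

So 1000 items at a 1% false positive rate need a bit under 10 bits per item, which is why the tiny demo filter fills up so quickly.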

------
gngeal
Google Chrome not only uses bloom filters for malicious URL filtering but also
as a part of the CSS selector matching process. I'm not sure how exactly, but
allegedly it gives them a nice speed bump in the majority of cases without
compromising on generality.

------
roye
BFs are getting pretty popular in bioinformatics. The post mentions one
example, here is my favorite recent one:
[http://minia.genouest.org/files/minia.pdf](http://minia.genouest.org/files/minia.pdf)

~~~
sakai
Very cool – I hadn't seen this paper before (haven't done any _real_ work on
de novo assembly or otherwise requiring de Bruijn graphs).

Here's a paper using Bloom filters in metagenomic classification (my relative*
area):
[http://bioinformatics.oxfordjournals.org/content/26/13/1595....](http://bioinformatics.oxfordjournals.org/content/26/13/1595.long)

In this vein, a friend and I are researching/implementing a probabilistic key-
value store for some bioinformatics applications (one is metagenomic organism
and gene identification). It's fast and space-efficient, just like a BF
(though obviously less so as it stores keys and not single-set-membership).
Any use cases for that kind of thing in your sub-field? Always trying to
figure out interesting new applications (we aren't ready to write it all up
yet, but hope to at some point not too far down the road).

* I'm really just a dabbler / don't have a formal bioinformatics background. My friend is the genetics PhD.

~~~
roye
Can you explain "probabilistic key-value store?" Would it be that each gene
has some defined probability of belonging to a given organism, or is it
probabilistic in the sense of having a defined error rate as BFs do?

~~~
sakai
The latter - probabilistic in the sense of having a defined error rate. In one
of our use cases, the keys are simply kmers and the values are organism or
gene IDs.

~~~
roye
not sure, but would definitely take a look.

~~~
sakai
Contact info?

~~~
roye
You'll find it here: [http://tau.ac.il/~rozovr/](http://tau.ac.il/~rozovr/)

------
timruffles
Coincidence: used this page on Friday to implement bloom filters for a kata at
Software Craftsmanship 2013 (in Bletchley Park of all places).

Fun to implement, I'd recommend having a go. Here's our terrible clojure code:
[https://gist.github.com/timruffles/7195405](https://gist.github.com/timruffles/7195405)

------
aidos
Interesting. I'd never looked into how BFs work before and this is a nice
clear explanation.

One thing I don't quite get - as you're hashing into the same bit array for
each of the hashes - you must get false positives from combinations of hashes.
So say you chose a bad combination of hash algorithms where one key hashed
into bits a and b using the two different hashes respectively. Maybe some
other key might hash into b and a. With a totally suboptimal choice of hash
algorithms you could end up with a 50% error rate. Or am I misunderstanding
something?

~~~
chamblin
This is kind of a fundamental part of bloom filters. They're probabilistic
data structures, which is the trade you make for space efficiency.

If you get a false back, you are guaranteed that the item is not in the
filter. If you get a true back, you can be fairly certain it is. Once you have
that degree of certainty, you can do a more thorough check for membership,
having already cut out large numbers of negatives.

~~~
aidos
I get the concept - you're trading space for certainty. It's more the bit
where it says:

> "The more hash functions you have, the slower your bloom filter, and the
> quicker it fills up. *If you have too few, however, you may suffer too many
> false positives.*"

Just made me wonder - couldn't badly chosen hash functions actually give you
more false positives?
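
One way to check that suspicion is a small experiment. The Python sketch below is a constructed worst case, not anything from the tutorial: it compares two different salted SHA-256 hashes against "two" hashes that always return the same position, so the second filter really only has one hash. All names and sizes are assumptions.

```python
import hashlib


def h(item: str, salt: str, m: int) -> int:
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def false_positive_rate(salts: list[str], m: int = 10_000, n: int = 1_000) -> float:
    """Insert n keys, then probe 10 * n keys that were never inserted."""
    bits = bytearray(m)
    for i in range(n):
        for s in salts:
            bits[h(f"in-{i}", s, m)] = 1
    probes = 10 * n
    hits = sum(
        all(bits[h(f"out-{i}", s, m)] for s in salts) for i in range(probes)
    )
    return hits / probes


independent = false_positive_rate(["a", "b"])  # two different hash functions
degenerate = false_positive_rate(["a", "a"])   # two hashes that always agree
print(independent, degenerate)
```

The usual approximation predicts roughly 3% for the independent pair and roughly 9.5% for the degenerate pair at these sizes, so correlated hash functions do cost accuracy. The measured numbers will differ a little from those predictions.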

------
tropicalmug
Another great Bloom Filter library not mentioned in this article is Bitly's
implementation on GitHub[0]. It has the best of both worlds: a C
implementation and higher-level wrappers (Golang, PHP, Python, and Common
Lisp). If you don't want to play around with a fancier DBMS and really just
want the damn filters, I would look here.

[0] [https://github.com/bitly/dablooms](https://github.com/bitly/dablooms)

~~~
llimllib
It is elliptically mentioned, via the link to this wonderful pull request:
[https://github.com/bitly/dablooms/pull/19](https://github.com/bitly/dablooms/pull/19)

------
jbochi
There is a nice summary of other interesting probabilistic data structures here:
[http://highlyscalable.wordpress.com/2012/05/01/probabilistic...](http://highlyscalable.wordpress.com/2012/05/01/probabilistic-
structures-web-analytics-data-mining/)

------
joeblau
Last time I checked, the Cassandra implementation was really messed up.
Another dev and I at our last company spent a few weeks debugging, adding test
cases and fixing the code to get it to work correctly.

~~~
Elfan
Cassandra user in production here. Could you elaborate a little bit on what
was messed up (size, false positives etc.) and which version this was? Did you
have trouble with Cassandra itself that you tracked down to the bloom filters
or trying to re-use the bloom filters in another project?

~~~
sakai
I'm only familiar with a Python clone of the Cassandra implementation ("hydra",
which I used a while back), but two issues I do remember are: 1) I _believe_ it
only uses 32-bit ints for the bit array addressing, so you can overflow it
(and this also may be less-than-ideal from a hash distribution perspective,
but I don't know offhand); and 2) as someone coming from a different
background, I found the whole thing to have a bit too much "OO" magic, with
several helper classes to set up the filter that (to me) obfuscate what's
really going on.

------
dpcx
Anyone got pointers to "simple" real-world use cases?

~~~
supergirl
Don't know what you consider simple, but databases use bloom filters to do
faster joins for example. Chrome also uses them for malware detection I think.

~~~
throwaway1979
Specifically, a place where DBs use bloom filters is distributed joins. This
isn't a common scenario IMHO. Found a good description here:

[http://www.coolsnap.net/kevin/?p=19](http://www.coolsnap.net/kevin/?p=19)

P.S. Loved the tutorial.

~~~
supergirl
it's also useful on local hash joins because it's faster than a regular hash
table lookup. so you can use it to eliminate some stuff before going to the
big hash table.
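
A rough Python sketch of that pre-filtering step in a hash join. The function names, row format, and filter sizes are all made up for illustration. In Python the SHA-256 calls make the filter slower than a plain dict lookup, so this only shows the control flow, not the speedup.

```python
import hashlib


def bit_positions(key: str, m: int, k: int) -> list[int]:
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def bloom_hash_join(build_rows, probe_rows, m=4096, k=3):
    """Join two lists of (key, value) rows, skipping most non-matching probes."""
    bits = bytearray(m)
    table: dict[str, list[str]] = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)
        for pos in bit_positions(key, m, k):
            bits[pos] = 1

    joined, skipped = [], 0
    for key, value in probe_rows:
        if not all(bits[pos] for pos in bit_positions(key, m, k)):
            skipped += 1  # definitely no match: never touch the big table
            continue
        # The exact lookup removes any false positives the filter let through.
        for other in table.get(key, []):
            joined.append((key, other, value))
    return joined, skipped


users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u3", "lamp"), ("u2", "pen"), ("u1", "mug")]
joined, skipped = bloom_hash_join(users, orders)
```

The join result is exact either way. The filter only changes how many probe rows reach the hash table, which matters most when the table is large or lives on another machine.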

------
mrcactu5
I had just put down this bloom filter tutorial by Jason Davies, now another
one comes up. It's a sign. www.jasondavies.com/bloomfilter/

------
malkia
As an example of real usage: in my Chromebook home folder, there are several
"bloom" filter files for various predefined filtering in Chrome.

