

Ask HN: I want to understand Bloom Filters - tocomment

I tried reading the Wikipedia article but it's confusing the bejezus out of me.  Does anyone know a good tutorial, or perhaps would like to write a simple explanation here?
======
RiderOfGiraffes
Not sure what your confusion is, but here's an explanation ...

A Bloom Filter lets you know, for each object I feed you, whether you've seen
it before or not.

You could just take a copy of each one and compare the one I feed you against
all the ones you have, but that has some disadvantages. (I'll leave it as an
exercise to name some of them)

An alternative is to take a hash of each object I feed you and then keep the
hash. Then when I give you an item you hash it and see if you have the hash.
There's always a chance that you'll say "yes" because of a coincidence, but
you'll never say "No" when you shouldn't. That also has disadvantages. In
particular, for example, you don't know how big your storage space will have
to get, and lookup times can get very inconvenient.

So here's how a bloom filter works. Take a very long vector of 0/1, and set
them all to 0. Also choose 10 (for some value of 10) hash functions that map
an object to an apparently random place in the vector. When you get given an
object, set all those semi-random places to a 1.

Now when I give you an object, you check to see that all the places are set to
1. If not, then the object I've given you can't be one you've seen. On the
other hand, if they are all set to 1, then very likely that's because you've
seen this object before. It's not certain, but you can make the probability of
error really, really small.

With any luck that will let you read and understand the wikipedia entry.

<http://en.wikipedia.org/wiki/Bloom_filter>

~~~
tocomment
Thanks. My only remaining point of confusion is what the 10 hash functions are
doing? Does each one set one bit in your vector? Does that mean you need a
hash function that takes a string as input and returns 0 or 1?

~~~
RiderOfGiraffes
No, each hash function tells you a bit to set. Your hash functions each take
an input object and return a location in the vector. That location is set to
1.

With regards how big to make things, you need to do the math. If you have a
vector of 1000 bits in your vector and you have 10 hash functions then there
are 1000^10 = 10^30 possible collections of ten bits that can be set. If your
plausible collection is only of size a few billion then coincidences won't be
common or frequent.

256 bytes is 2048 bits, 16 hash functions, suddenly there are 2048^16 possible
combinations, or 2^176 which is about 10^50.

~~~
tocomment
So why is this an improvement over an ordinary hash table? Does having more
than one hash function help in some way?

~~~
RiderOfGiraffes
Er, I've just shown that you can store several billion records in a bit vector
of length 256 bytes and you ask why it's better than a hash table? There's
something that at least one of us isn't understanding ...

------
RiderOfGiraffes
OK, just as some code for you to play with, here's an explicit implementation.
You shouldn't use md5, the hash functions aren't very good, and it's only a
toy example.

    
    
        #!/usr/bin/python
    
        import md5
    
        def h0(obj): return ord(md5.md5(obj).digest()[0]) % size
        def h1(obj): return ord(md5.md5(obj).digest()[1]) % size
        def h2(obj): return ord(md5.md5(obj).digest()[2]) % size
        def h3(obj): return ord(md5.md5(obj).digest()[3]) % size
    
        def init_bloom(size): return size * [0]
    
        def insert_item(vector,obj):
            for fn in bloom_hash_fns:
                vector[fn(obj)] = 1
    
        def test_for_presence(vector,obj):
            for fn in bloom_hash_fns:
                if 0==vector[fn(obj)]:
                    return False
            return True
    
        # You decide what size your vector will be
    
        size = 32
    
        # The you create hash functions, each taking an
        # object and return a value in the range 0 to size-1.
    
        bloom_hash_fns = [ h0, h1, h2, h3 ]
    
        vector = init_bloom(size)
        print vector
        insert_item(vector,'test')
        print vector
        insert_item(vector,'text')
        print vector
        insert_item(vector,'vext')
        print vector
    
        print 'Should be 1', test_for_presence(vector,'test')
        print 'Should be 1', test_for_presence(vector,'text')
        print 'Should be 1', test_for_presence(vector,'vext')
        print 'Should be 0', test_for_presence(vector,'vest')
        print 'Should be 0', test_for_presence(vector,'abcd')
        print 'Should be 0', test_for_presence(vector,'wxyx')

~~~
tocomment
Thanks, this is perfect. (I wonder if you should add this to the Wikipedia
article? It would make it a lot clearer.)

I can definitely see how it works now. I'm still not sure on why it works. I
guess it's just a probability thing? I.e., each hashed insertion is very
unlikely to collide with another?

~~~
RiderOfGiraffes
It's really, really bad code, and although it serves its purpose of being
explicit, I wouldn't put it on WikiPedia.

By the way, if you really want to learn about learning, I strongly recommend
that you return to the description in the wikipedia article and read it again,
this time with your greater understanding. Ask youself: Why didn't I
understand this first time round, and what should I have done differently to
make sure I did?

It's only by such self-reflection like that you will improve your independent
learning abilities. Here's the section in question:

    
    
      An empty Bloom filter is a bit array of m bits, all
      set to 0. There must also be k different hash functions
      defined, each of which maps or hashes some set element
      to one of the m array positions with a uniform random
      distribution.
    
      To add an element, feed it to each of the k hash
      functions to get k array positions. Set the bits at all
      these positions to 1.
    
      To query for an element (test whether it is in the set),
      feed it to each of the k hash functions to get k array
      positions. If any of the bits at these positions are 0,
      the element is not in the set - if it were, then all the
      bits would have been set to 1 when it was inserted. If
      all are 1, then either the element is in the set, or the
      bits have been set to 1 during the insertion of other
      elements.
    

That descriptions seems to me to be complete and clear. What did you not
understand?

