Ask HN: I want to understand Bloom Filters

RiderOfGiraffes · on Nov 2, 2010

Not sure what your confusion is, but here's an explanation ...

A Bloom Filter lets you know, for each object I feed you, whether you've seen it before or not.

You could just take a copy of each one and compare the one I feed you against all the ones you have, but that has some disadvantages. (I'll leave it as an exercise to name some of them)

An alternative is to take a hash of each object I feed you and then keep the hash. Then when I give you an item you hash it and see if you have the hash. There's always a chance that you'll say "yes" because of a coincidence, but you'll never say "No" when you shouldn't. That also has disadvantages. In particular, for example, you don't know how big your storage space will have to get, and lookup times can get very inconvenient.

So here's how a bloom filter works. Take a very long vector of 0/1, and set them all to 0. Also choose 10 (for some value of 10) hash functions that map an object to an apparently random place in the vector. When you get given an object, set all those semi-random places to a 1.

Now when I give you an object, you check to see that all the places are set to 1. If not, then the object I've given you can't be one you've seen. On the other hand, if they are all set to 1, then very likely that's because you've seen this object before. It's not certain, but you can make the probability of error really, really small.

With any luck that will let you read and understand the wikipedia entry.

http://en.wikipedia.org/wiki/Bloom_filter

tocomment · on Nov 2, 2010

Thanks. My only remaining point of confusion is what the 10 hash functions are doing? Does each one set one bit in your vector? Does that mean you need a hash function that takes a string as input and returns 0 or 1?

RiderOfGiraffes · on Nov 2, 2010

No, each hash function tells you a bit to set. Your hash functions each take an input object and return a location in the vector. That location is set to 1.

With regards how big to make things, you need to do the math. If you have a vector of 1000 bits in your vector and you have 10 hash functions then there are 1000^10 = 10^30 possible collections of ten bits that can be set. If your plausible collection is only of size a few billion then coincidences won't be common or frequent.

256 bytes is 2048 bits, 16 hash functions, suddenly there are 2048^16 possible combinations, or 2^176 which is about 10^50.

jsarch · on Nov 2, 2010

Thanks for taking the time to explain this along with the 1000bits and 256bits example. I've known about bloom filters for quite some time, but never quite understood the statistics.

In my area of research, sequence alignments (aka. sequence comparisons), pruning the search space is immensely important; the absence of false negatives is a huge win.

tocomment · on Nov 2, 2010

So why is this an improvement over an ordinary hash table? Does having more than one hash function help in some way?

RiderOfGiraffes · on Nov 2, 2010

Er, I've just shown that you can store several billion records in a bit vector of length 256 bytes and you ask why it's better than a hash table? There's something that at least one of us isn't understanding ...

kd0amg · on Nov 2, 2010

Each hash function indicates a bit to set. If you have an array of 2^n bits, then the hash function must return an n-bit number.

tocomment · on Nov 2, 2010

Also how do you decide how long to make the vector?

mmastrac · on Nov 2, 2010

It's a tradeoff between space and probability that your bloom filter will return "yes" when it actually hasn't seen the item (and technically number of times you want to perform a hash function for each lookup).

There are a number of calculators and tables that let you pick the ratio of bits to number of items and the number of hashes and spit out a probability, eg:

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.htm...

RiderOfGiraffes · on Nov 2, 2010

OK, just as some code for you to play with, here's an explicit implementation. You shouldn't use md5, the hash functions aren't very good, and it's only a toy example.

    #!/usr/bin/python

    import md5

    def h0(obj): return ord(md5.md5(obj).digest()[0]) % size
    def h1(obj): return ord(md5.md5(obj).digest()[1]) % size
    def h2(obj): return ord(md5.md5(obj).digest()[2]) % size
    def h3(obj): return ord(md5.md5(obj).digest()[3]) % size

    def init_bloom(size): return size * [0]

    def insert_item(vector,obj):
        for fn in bloom_hash_fns:
            vector[fn(obj)] = 1

    def test_for_presence(vector,obj):
        for fn in bloom_hash_fns:
            if 0==vector[fn(obj)]:
                return False
        return True

    # You decide what size your vector will be

    size = 32

    # The you create hash functions, each taking an
    # object and return a value in the range 0 to size-1.

    bloom_hash_fns = [ h0, h1, h2, h3 ]

    vector = init_bloom(size)
    print vector
    insert_item(vector,'test')
    print vector
    insert_item(vector,'text')
    print vector
    insert_item(vector,'vext')
    print vector

    print 'Should be 1', test_for_presence(vector,'test')
    print 'Should be 1', test_for_presence(vector,'text')
    print 'Should be 1', test_for_presence(vector,'vext')
    print 'Should be 0', test_for_presence(vector,'vest')
    print 'Should be 0', test_for_presence(vector,'abcd')
    print 'Should be 0', test_for_presence(vector,'wxyx')

tocomment · on Nov 3, 2010

Thanks, this is perfect. (I wonder if you should add this to the Wikipedia article? It would make it a lot clearer.)

I can definitely see how it works now. I'm still not sure on why it works. I guess it's just a probability thing? I.e., each hashed insertion is very unlikely to collide with another?

RiderOfGiraffes · on Nov 3, 2010

It's really, really bad code, and although it serves its purpose of being explicit, I wouldn't put it on WikiPedia.

By the way, if you really want to learn about learning, I strongly recommend that you return to the description in the wikipedia article and read it again, this time with your greater understanding. Ask youself: Why didn't I understand this first time round, and what should I have done differently to make sure I did?

It's only by such self-reflection like that you will improve your independent learning abilities. Here's the section in question:

  An empty Bloom filter is a bit array of m bits, all
  set to 0. There must also be k different hash functions
  defined, each of which maps or hashes some set element
  to one of the m array positions with a uniform random
  distribution.

  To add an element, feed it to each of the k hash
  functions to get k array positions. Set the bits at all
  these positions to 1.

  To query for an element (test whether it is in the set),
  feed it to each of the k hash functions to get k array
  positions. If any of the bits at these positions are 0,
  the element is not in the set - if it were, then all the
  bits would have been set to 1 when it was inserted. If
  all are 1, then either the element is in the set, or the
  bits have been set to 1 during the insertion of other
  elements.

That descriptions seems to me to be complete and clear. What did you not understand?