
Bloom Filters by Example - gvenzl
https://llimllib.github.io/bloomfilter-tutorial/
======
natch
I disagree with the standard dogma around bloom filters that you need multiple
hash functions. Just use a simple incrementing salt value to modify the input
so you can hash the resulting salted input as many times as you need to, using
a different salt value each time.

Say you want to hash the string "abc" 8 times. Instead of having 8 hash
functions, just take the hash of, say, "abc-0", "abc-1", ... "abc-7". As long
as your hash function is of good quality and you're treating the output
properly, you don't have to worry about the results from these different
inputs having some significant relationship. For cryptographic types maybe I'm
using the term "salt" loosely. Whatever, think of another term if you like.
Anyway in practice this has worked fine for me.
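
A minimal sketch of the salted-input idea described above, assuming SHA-256 as the "good quality" hash and a prime-sized bit array. The function name `salted_positions` and the size 1,000,003 are illustrative choices, not something the comment specifies.

```python
import hashlib


def salted_positions(item: str, k: int, m: int) -> list:
    """Return k bit positions in [0, m), one per salted input."""
    positions = []
    for salt in range(k):
        # "abc" with salt 0 is hashed as b"abc-0", salt 1 as b"abc-1", and so on.
        digest = hashlib.sha256(f"{item}-{salt}".encode("utf-8")).digest()
        # Interpret the whole digest as one large integer, then reduce modulo m.
        positions.append(int.from_bytes(digest, "big") % m)
    return positions


# Hypothetical parameters: 8 positions in a bit array of prime length 1,000,003.
print(salted_positions("abc", 8, 1_000_003))
```

The same input always yields the same eight positions, which is all a Bloom filter needs from its "k hash functions".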

~~~
asdfaoeu
The only problem is this assumes that the hash unpredictably changes on input
modification. Which is true for cryptographic hashes but not others.

~~~
natch
It assumes nothing of the sort. [Edit: well I mean it RELIES on it yes but
it's not an assumption; it's verified by knowing the properties of the hash
function used.] But it perhaps hinges on your definition of "unpredictably."

If you're taking the modulus of a large integer with respect to a very much
smaller bit array with a length that is a prime number, there is plenty of
unpredictability with most decent hash functions (note I did have "good
quality" as a caveat above) even non-cryptographic ones. That being said, I do
use cryptographic hashes.

------
Freaky
That seems a roundabout way of sizing a bloom filter. Here's what I came up
with when I first implemented them for Newzbin:
[https://hur.st/bloomfilter](https://hur.st/bloomfilter)

    
    
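        // n = expected number of items, p = target false-positive probability
        m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0)))));  // bits in the filter
        k = round(log(2.0) * m / n);                               // number of hash functions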
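
For a worked number, here is the same pair of formulas transcribed into Python. The function name `bloom_size` and the sample inputs (one million items, 1% false-positive rate) are assumptions chosen for illustration.

```python
import math


def bloom_size(n: int, p: float) -> tuple:
    """Return (m, k): bits in the filter and number of hash functions."""
    # Same expression as above; the denominator equals -(ln 2)^2.
    m = math.ceil((n * math.log(p)) / math.log(1.0 / (2.0 ** math.log(2.0))))
    k = round(math.log(2.0) * m / n)
    return m, k


m, k = bloom_size(1_000_000, 0.01)
print(m, k)
```

For these inputs the result is a little under 9.6 million bits (about 9.6 bits per item) and 7 hash functions.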

~~~
assafmo
Thank you for this!

I actually used this site today. Found it on google...

~~~
Freaky
Glad you found it useful :)

------
tomc1985
Unrelated, but I've always had this question about bloom filters...

If testing for membership in the set is unreliable, but testing for
non-membership in the set works every time, then why can't you simply invert
the boolean for the non-membership test and call that a membership test?

E.g., if it's not not in the group, then it's in the group...

~~~
omlettehead
If you want to invert the boolean check, then you need to invert the input
space as well, which is not possible.

For example, consider a Bloom filter checking the availability of a username
during sign up. If you want an inverse Bloom filter that checks if it is not
not in the group, then you need to load it with all possible usernames.

~~~
tomc1985
Makes sense. OP's link is interactive enough that I was beginning to see that,
though couldn't articulate it. Thanks!

~~~
nandemo
If I was going to explain a bloom filter like you're 5... a bloom filter is
like a savant who never forgets a face -- maybe he's got a job in passport
control in Arstotzka -- if you show him someone's face or picture once, he'll
never forget it: if any time later you show him the same picture and ask him
"have you seen this face before?" he'll say "yes" without fail. If he replies
"no way", you can be 100% sure he's never seen it. But his memory isn't
photographic: he confuses people's faces after a while, recognizing faces he's
never seen, and that becomes worse the more people he's seen. So to be safe he
doesn't reply just "yes" but "hmm, I guess so".

In computing terms the interface has 2 methods:

- takeALookAtThisFace(x)

- yaSeenThisFaceBefore?(x) which returns either "no way" or "hmm, I guess
so".

"no way" means P(x is a member) = 0. Its negation is _not_ "hell yes" (P(x is
a member) = 1), it's "maybe" (P(x is a member) > 0).
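
The two-method interface above might look like this as a small sketch. The class name `BloomFilter`, the method names, and the use of salted SHA-256 to derive positions are all assumptions made for the example, not part of the analogy.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                            # number of bits
        self.k = k                            # positions set per item
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _positions(self, item: str):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        """takeALookAtThisFace: set every bit for this item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """yaSeenThisFaceBefore: False means "no way", True means "maybe"."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(m=1024, k=3)
for name in ["alice", "bob", "carol"]:
    seen.add(name)

print(seen.might_contain("alice"))    # True: added items are never missed
print(seen.might_contain("mallory"))  # almost certainly False, but True is possible
```

Note that `might_contain` only reads bits, so asking the question never changes what the filter remembers.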

~~~
meenzu
Thanks for this explanation, it was really helpful!

~~~
nandemo
Glad to help. But I noticed the analogy is a bit flawed; testing for
membership does not add anything to the set, but the analogy might imply
asking yaSeenThisFaceBefore(x) will make the savant remember x. I should
change the story and the 2 functions to "remember this terrorist" and "is this
a terrorist?" or something like that.

------
tzs
For implementations that use cryptographic hash functions, do you actually
need more than one hash function invocation per item?

For instance, suppose you were implementing a Bloom filter with 2^20 bits, and
you want to use 10 bits per item. Instead of hashing the item 10 times, could
you hash once with a 256 bit cryptographic hash, and then take the lower 200
bits, divide that into 10 bit strings of 20 bits each, and use those?
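
As a sketch of what that would look like, assuming SHA-256 and treating the digest as a big-endian integer (the function name `positions_from_one_hash` is made up for the example):

```python
import hashlib


def positions_from_one_hash(item: bytes, k: int = 10, bits_per_index: int = 20) -> list:
    """Cut the low k * bits_per_index bits of one SHA-256 digest into k indices."""
    if k * bits_per_index > 256:
        raise ValueError("not enough digest bits for that many indices")
    value = int.from_bytes(hashlib.sha256(item).digest(), "big")
    mask = (1 << bits_per_index) - 1
    # Index i uses bits [i * 20, i * 20 + 19], counting from the least significant bit.
    return [(value >> (i * bits_per_index)) & mask for i in range(k)]


print(positions_from_one_hash(b"abc"))
```

Each index lands directly in [0, 2^20) with no modulo step, because the filter size is a power of two.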

~~~
felipeerias
Yes, you can do that. Let me describe something similar that I implemented
recently.

To use your example, let's say that the Bloom filter has a size of 2²⁰ bits
(128 KB) and that we are using 10 bits per item. In other words, for each new
item we need to calculate ten positions in the range [0,2²⁰).

We start by using a cryptographic hash function on the item just once. For
example, SHA-256, which will give us a 256-bit value.

Now we need to extract ten blocks from these 256 bits. We can do as you
suggest and make each block 20 bits long, but I don't know of a reason why we
should not make them a bit longer. A length of 32 bits would be nice, so each
block could be neatly mapped to a long.

To get ten blocks of 32 bits out of a 256-bit value, we just calculate a
_step_ such that block _i_ starts at the position _i_ * _step_, counting
blocks from zero. In our example, with a step of 22, this means that the
first block is in positions [0,31], the second is in [22,53]… and the tenth
one is in [198,229].

At this point, we have ten blocks of 32 bits and we need ten values in the
range [0,2²⁰). So we just turn each block into a long and calculate modulo 2²⁰
for each of them.

And that's pretty much it.

To enter the item in the filter, we set to true the positions corresponding to
each of those 10 values.

To check whether an item may be in the filter, we do the same procedure and
check that all of the ten positions have been set to true. If any of them is
false, the item is not in the filter.
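
A rough sketch of the block extraction described above. It assumes bit position 0 is the most significant bit of the digest and uses a step of 22 to match the example's second block; the comment does not say how the step is computed, so `step` is left as a parameter. The function name `window_positions` is hypothetical.

```python
import hashlib


def window_positions(item: bytes, k: int = 10, step: int = 22,
                     block_bits: int = 32, m: int = 1 << 20) -> list:
    """Take k overlapping block_bits-wide windows of a SHA-256 digest, each mod m."""
    digest = hashlib.sha256(item).digest()
    total_bits = len(digest) * 8  # 256 for SHA-256
    if (k - 1) * step + block_bits > total_bits:
        raise ValueError("blocks would run past the end of the digest")
    value = int.from_bytes(digest, "big")
    mask = (1 << block_bits) - 1
    positions = []
    for i in range(k):
        start = i * step                         # block i starts at bit i * step
        shift = total_bits - start - block_bits  # distance from the low end
        block = (value >> shift) & mask          # the 32-bit window as an integer
        positions.append(block % m)              # reduce to a filter position
    return positions


print(window_positions(b"abc"))
```

With these particular numbers, the modulo keeps only the low 20 bits of each window, and with a step of 22 those 20-bit tails do not overlap, so the result is close in spirit to cutting the digest into separate 20-bit slices.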

------
peteretep
I am confused by:

    
    
        > cryptographic hashes such as sha1,
        > though widely used therefore are not
        > very good choices
    

I thought SHA1 was fast, and that was a reason to _not_ use it in applications
where brute-forcing might be an issue.

    
    
        > [the more times you hash it the fewer
        > false positives]
    

That doesn't match my understanding of how any even slightly reasonable hash
function should work, doubly so if it's uniformly distributed.

~~~
masklinn
SHA1 is fast compared to _key-derivation functions_ so you do not want to use
it to _hash passwords_.

However SHA1 is slow compared to _most hash functions_ (especially but not
solely non-cryptographic hashes) so you don't want to use it for _hash-based
collections_ like hash tables or bloom filters.

edit: here's a bit I saved from tptacek (sadly I didn't keep the link, only
the content):

* If you need random fixed-sized URLs, generate UUIDs; don't tie them to content, which can (a) change and (b) be predicted.

* For error detection, CRC schemes aren't weak. Against adversaries, MD5 is weak. For offline file integrity checking, or user-timescale online checking, use SHA256; at the very minimum, use an algorithm that hasn't been broken.

* Do not ever use the MD5(password) password scheme. MD5 is much faster than Unix crypt; even conventional Unix crypt is at least salted to defend against rainbow table attacks, and modern adaptive hashing can be tuned to make dictionary attacks infeasible.

* MD5 is too slow for in-memory hash tables; I cringe when I see people use it. You're probably just hashing a string: use Torek's 31/37 hash. Otherwise, use Jenkins.

* PRNG design is hard. Just running MD5 over trivially small internal state doesn't yield a secure PRNG. Again, a problem other people solved that you have no business hacking on yourself, at least if your code matters.

* If you are concerned about collision attacks on your cryptosystem, which you should be if you're this guy and you're using MD5, use an algorithm that hasn't been broken; don't just jumble up one that already has. Kerckhoffs's principle: look it up.

The bits you want are 3 and 4, but everything is good. Just sed s/MD5/SHA1/

~~~
j_s
[https://www.reddit.com/r/programming/comments/2fu8q/we_worsh...](https://www.reddit.com/r/programming/comments/2fu8q/we_worship_md5_the_god_of_hash/c2fwf2/)

------
rhizome
"Simply hash it a few times"?

~~~
bkanber
Definitely not clear from the article alone. You can use more than one hash
algorithm to reduce the chance of collisions. You set the values in the bit
array for both hashes, and then you check both again.

Practical, contrived example: the string "w" gives us "2" from fnv and "11"
from murmur. When you add that to the filter with both hashes, bits 2 and 11
are set.

The string "h" gives us 2 from fnv and 10 from murmur. If you were using only
the fnv hash, you'd get a "maybe exists" result. But since the murmur hash is
different for "h" than it is for "w", you get a "definitely not" result.
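
To put rough numbers on why extra hashes help, the usual approximation for the false-positive rate is (1 - e^(-kn/m))^k, which assumes the k hashes behave as independent, uniform functions. The sketch below evaluates it for a hypothetical filter with 10 bits per item; the sizes are illustrative, not taken from the article.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive rate for m bits, n items, k hashes per item."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000  # assumed sizes: 10 bits per stored item
for k in (1, 2, 4, 7, 10, 20):
    print(k, round(false_positive_rate(m, n, k), 4))
```

Under this approximation the rate drops from roughly 9.5% with one hash to under 1% with seven, then climbs again as more hashes fill the bit array faster than they help.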

Of course, you still have collisions, you just cut them down. Both fnv and
murmur return the same hashes for "w" and "woot", so adding "w" then checking
"woot" still gives you a "maybe" result, but at least checking "h" does not.

------
sk5t
sha1 is very, very fast. This article seems to confuse cryptographically
useful hash functions with key stretching, adaptive work hashing, and the
like.

