
Interactive Demo of Bloom Filters - tekromancr
https://www.jasondavies.com/bloomfilter/
======
geocar
If anyone just wanted to see a collision:

    a
    2
    c

collides with:

    f

~~~
taco_emoji
Another anomaly is that "hello" yields the same array position for all three
hashes, which makes it highly prone to collision.

~~~
geocar
Same with:

    The Answer to The Ultimate Question of Life, The Universe, and Everything.

which means that if

    FORTY-TWO.

has been inserted, you'll find that as a possible answer.

------
netcraft
Obligatory mention: if you are interested in this, you should also know
about cuckoo filters:

[https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)

[https://en.wikipedia.org/wiki/Cuckoo_hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing)

~~~
GordonS
Do you have a tldr for how/when it is better than bloom filters?

~~~
kappaloris
When you want an error rate lower than ~3%, and when you can size the filter
properly up front, since a cuckoo filter can fail insertions if you overfill
it.

------
ThePhysicist
If you're interested in using Bloom filters in the backend, we (at DCSO) have
written efficient and interoperable open-source implementations in Python and
Go, also using FNV-1 hashing:

[https://github.com/DCSO/bloom](https://github.com/DCSO/bloom) (Go version)

[https://github.com/DCSO/flor](https://github.com/DCSO/flor) (Python version)

The Go version comes with a command line tool that allows you to use Bloom
filters on the shell.
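
Since both libraries use FNV-1, here is roughly what that hash looks like; a
minimal sketch of the 32-bit variant in Python (the constants are the
standard FNV offset basis and prime; the libraries above have their own
tested implementations):

    def fnv1_32(data: bytes) -> int:
        # 32-bit FNV-1: multiply by the FNV prime, then XOR in each byte.
        h = 2166136261  # FNV-1 32-bit offset basis
        for b in data:
            h = (h * 16777619) & 0xFFFFFFFF  # FNV 32-bit prime
            h ^= b
        return h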

~~~
bane
One note: if you've done everything correctly under the hood, then instead of
reading and writing the entire filter into and out of memory, you can just
mmap the file and operate on it directly. It turns out to be pretty fast,
especially on SSDs. This way you can build many filters and test them
quickly, or build filters that are practically larger than RAM and swap
combined and represent absolutely enormous sets.
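
A minimal sketch of that idea in Python, assuming the file is nothing but the
raw bit array (real formats, including the one above, carry a header you'd
skip past first; `filter.bloom` and the positions are hypothetical):

    import mmap

    def bit_is_set(mm, pos):
        # Test one bit of the on-disk array; the OS pages it in on demand.
        return (mm[pos // 8] >> (pos % 8)) & 1

    with open("filter.bloom", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        positions = [12345, 67890]  # hypothetical hash positions for a key
        maybe_present = all(bit_is_set(mm, p) for p in positions)
        mm.close()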

~~~
ThePhysicist
Yes, good point; we actually have this on the roadmap!

------
cr0sh
Am I understanding this correctly?

1) For each key, you generate a "hash" - which is a bit position

2) The hash generation is such to ensure that the bit location in the filter
is probabilistically distributed for each key (so they are spread "evenly",
for lack of a better term, over the length of the filter)

3) You generate several locations per key - as many as the number of hash
generators you are using (so if you have 3 generators, you get 3 distributed
bit positions)

4) Some keys may cause overlaps/collisions - but this is ok

5) At some point, you fill up the filter with keys

6) To see if a key is -probably- in the filter, you run the same steps again
and see if the bits are set; if they are, then it -probably- is

7) You then run that key on your regular index search; some false positives
though will cause you to run that expensive operation and get back nothing,
but usually you'll get back something (true positive). But if that key wasn't
found in the filter, you definitely don't run the expensive lookup at all.

Is that correct?
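
(Step 7 as a sketch, with hypothetical `bf` and `expensive_lookup` stand-ins:)

    def lookup(key: bytes):
        # Gate the expensive index search behind the filter.
        if not bf.might_contain(key):    # any bit 0: definitely absent
            return None                  # skip the expensive lookup entirely
        return expensive_lookup(key)     # maybe present: confirm for real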

There must also be some way for the bit-position "hash" generators to span an
arbitrarily large bit range, in order to "store" more keys; this trivial
example looks like it would fill up rather quickly (once all bits are set to
"1", any key would generate a "positive" and you would always be running the
expensive lookup - whether it was in the index or not - basically, falling
back to a default state).

If I have all of that correct (or close) - or even if I don't - it seems like
a very powerful technique, limited only by the bit length (and by being able
to perform bitwise operations on it quickly), which (unless I am missing
something?) would need to be fairly long to accommodate a reasonable number
of keys (say, for a record lookup in an rdb table).

What might also be nice would be a way to detect that it is "filled up" and
bypass the test; you'd fall back to the worst case scenario (ie - no bloom
filter), but at least you wouldn't be running the bloom filter check on top
of that as well, incurring extra work. I'm thinking there's probably
something easy here - some bitwise operation (maybe take the inverse and
compare it to zero?).

~~~
tekromancr
Pretty close, except:

1) You generate multiple hashes for a particular value, thus setting multiple
bits per value.

5) When this happens, you use a larger filter.

6) You can also determine with 100% confidence that a value isn't in the set.

Other than that, I think you have a good idea of what it's about. Once I
learned about this, I got tons of great ideas. Right now, I think it would be
cool to construct a bloom filter for detected bot users on twitter. You could
make a plugin that marked tweets from probable bot-users as such without
storing the whole db on disk, and without making tons of network requests.

~~~
afonsotsukamoto
I do believe in the 100% confidence claim, but when you have a full filter,
is that still meaningful? As in - are there still any gains from using a
bloom filter in this case?

I know that is sort of a corner case, but in that situation the 100%
confidence case happens 0% of the time, which makes the filter a bit useless
- no?

Just trying to understand the limitations, as I've been super curious about
bloom filters for a long time.

~~~
tekromancr
You are correct.

The filter is completely useless if it is filled with 1s. In this situation,
you would just use a larger filter. I am not sure exactly what the ideal size
is - it depends on the desired probability of false positives - but an
optimally sized filter for the data you are using will still have some empty
bits.

~~~
bane
Here's a helpful calculator; it will even tell you the optimal number of hash
functions. All you have to know is the expected cardinality of the set.

[https://hur.st/bloomfilter](https://hur.st/bloomfilter)
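
For reference, the standard formulas behind that calculator: given n expected
items and a target false positive probability p, the optimal bit count is
m = -n ln p / (ln 2)^2 and the optimal hash count is k = (m/n) ln 2. In
Python:

    import math

    def optimal_params(n: int, p: float):
        # Standard Bloom filter sizing: bits m and hash count k
        # for n expected items at false positive probability p.
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    # e.g. one million items at a 1% error rate:
    # optimal_params(1_000_000, 0.01) -> (9585059, 7)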

~~~
afonsotsukamoto
ah! ok - I just had a "mind bending" moment after looking at the calculator +
tekromancr's response.

I thought we would want to control the number of hash functions (k) and the
number of items (n), but I guess the most common use case is to set the
allowed probability of a false positive for the expected number of items and
let the "system" derive k and m (the hash functions and the bits in the
filter).

Enlightening!

One question remains though - since the number of bits (m) is a function of
the probability of false positives (p) and the number of hash functions (k) is
a function of the number of bits and the number of items (n), does it mean
that once we start a bloom filter we can't update the number of items or the
desired probability of false positives?

My assumption is that if we change p, then m will change, which would make
the k functions "obsolete", as they're now being "mapped" onto a differently
sized m. Same for n, which would change the k functions, so a new item would
be hashed differently, making the existing hits/misses in m obsolete again.

Might be missing something here though.

Either way, I still find bloom filters a fascinating data structure.

~~~
tekromancr
Nope, I think you got it. My understanding is that if you want to change the
params, you need to rebuild the filter.

------
Assossa
If you want a simple explanation of bloom filters:

 _To add to filter:_

1) Get multiple hashes of the data. You can use the same hash function and
increment the data for each hash, or use multiple hash functions. You can use
any number of hashes from 1 upward; each filter size and dataset size has a
sweet spot.

2) Mod (remainder) each hash by the filter size. The filter can be any size,
from 1 bit up to the maximum value of your hash function[s]. The larger the
filter, the more accurate the results, but a filter larger than your dataset
is obviously useless.

3) Set each bit at the locations (from step 2) in the filter to 1.

 _To check filter:_

1) Do steps 1 & 2 from the previous procedure.

2) Check each location in the filter. If all locations are 1, the data might
be in the filter. If any of the locations are 0, the data is definitely not in
the filter.
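
A minimal sketch of both procedures in Python, using the "same hash function,
salted input" variant from step 1 (SHA-256 here just because it's in the
standard library; the demo uses FNV):

    import hashlib

    class BloomFilter:
        def __init__(self, m: int, k: int):
            self.m, self.k = m, k                # bits, hash count
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, data: bytes):
            # Steps 1 and 2: k salted hashes, each mod the filter size.
            for i in range(self.k):
                h = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
                yield int.from_bytes(h, "big") % self.m

        def add(self, data: bytes):
            # Step 3: set each derived bit to 1.
            for p in self._positions(data):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, data: bytes) -> bool:
            # All bits 1: maybe present. Any bit 0: definitely absent.
            return all((self.bits[p // 8] >> (p % 8)) & 1
                       for p in self._positions(data))

    bf = BloomFilter(m=1024, k=3)
    bf.add(b"hello")
    assert bf.might_contain(b"hello")    # never a false negative
    print(bf.might_contain(b"goodbye"))  # almost certainly False here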

~~~
taco_emoji
> a filter larger than your dataset is obviously useless

This isn't obvious to me, can you explain?

~~~
biggerfisch
At that point, it's harder to search/test the filter than to just look at
your dataset, so there's no point in making it that large.

~~~
irl_zebra
Well, the bloom filter only uses a tiny bit of space per data element and
doesn't actually store the data. Searching the dataset itself means going
through the actual data, which is more intensive and slow.

------
natch
You don't need more than one hash function. Just append an incrementing token
onto the input before each successive hash.
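
In code, the trick looks something like this sketch (SHA-256 is just a
stand-in; whether each salted call counts as a new hash function is exactly
the debate below):

    import hashlib

    def positions(data: bytes, k: int, m: int):
        # One base hash function; the incrementing token derives k
        # independent-looking bit positions from it.
        return [int.from_bytes(
                    hashlib.sha256(data + str(i).encode()).digest(),
                    "big") % m
                for i in range(k)]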

~~~
chrisseaton
> Just append an incrementing token onto the input before each successive
> hash.

But that's a new hash function then.

~~~
natch
Obviously not. It's a new input into the same hash function.

------
in9
What is the efficiency of this algorithm? Is this how one can check for used
usernames when a user is creating an account?

~~~
SatvikBeri
Wouldn't a database lookup (with an index) be fast enough in that case? IIRC
that's what Reddit does.

~~~
greenpizza13
Presumably you have indexed based on username or the hash of a username, so
the lookup is very inexpensive.

~~~
taeric
You could also send the bloom filter down to the client and filter out taken
usernames client-side with high probability.
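
(Sketched with the hypothetical `BloomFilter` from the sketch earlier in the
thread: the server ships the parameters and raw bits once, and the browser
reruns the same k hashes mod m locally for each candidate name.)

    # Server side: serialize the filter once; the client checks these
    # bits itself instead of making a network request per keystroke.
    payload = {"m": bf.m, "k": bf.k, "bits": bytes(bf.bits).hex()}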

------
thinkr42
This is great, but I think "Probably is in there" should be replaced with
"Might be in there".

~~~
tekromancr
For this toy example, sure. But extend it from a 2-byte table to, for
example, a 50MB table, and use many more hashes. The larger the table and the
more hash functions you use, the smaller the false positive rate becomes. If
you have a limited dataset that you want to test against, you can set these
params to get a near-0% false positive rate, at the expense of a much larger
table.
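
The standard approximation makes that tradeoff concrete: with m bits, k
hashes, and n inserted items, the false positive rate is roughly
(1 - e^(-kn/m))^k. A quick sketch:

    import math

    def false_positive_rate(m: int, k: int, n: int) -> float:
        # Standard approximation for m bits, k hashes, n items.
        return (1 - math.exp(-k * n / m)) ** k

    print(false_positive_rate(m=10_000, k=7, n=1000))   # ~0.008
    print(false_positive_rate(m=100_000, k=7, n=1000))  # ~6e-9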

------
avaer
I think CS education would be improved if these concepts didn't have
mysterious names.

Respect to Bloom, but the programming median might be raised if Bloom filters
were called something stupid obvious like Hash Array Filler-upper Tables. We
might not even need to spend time making visualizations to explain them.

~~~
middleorigin
I have no idea why you think the name removes the need for a visualization.
Besides, I think "filter" is more apt than "table": it doesn't store anything
per se.

~~~
avaer
I never needed a picture to understand a hash table once I knew what a hash
function was. If it were called a McCready table, that's one more trip to
Wikipedia, plus one more every time I forget.

Re: naming: It's a table of hash function results. It probabilistically
stores a set. I don't see any filter here; one use of the technique is indeed
filtering a list, but there are plenty more.

~~~
middleorigin
Right, but you're describing the implementation, which doesn't imply anything
about its use. I prefer to name things by use: the implementation is just a
detail of how the filter does its filtering.

------
callahad
> _Unfortunately I can't use the 64-bit trick in the linked post as
> JavaScript only supports bitwise operations on 32 bits._

This could be a good use case for WebAssembly, which supports 64-bit math
natively.

