
Show HN: Bloom Filter - sacert
https://sacert.github.io/Bloom-Filter/
======
aerovistae
The problem is this doesn't _really_ do much to actually show how it works. It
sort of shows what it does, but the utility of it isn't clear ("Why would I
want to know if something was _possibly_ found?") and the process by which it
determines whether something is found is not really elucidated.

For those who are interested, this page does a really great job explaining it:

[https://llimllib.github.io/bloomfilter-
tutorial/](https://llimllib.github.io/bloomfilter-tutorial/)

\---------------

Nonetheless I applaud the author for making it anyway. Takes effort to build
these things. Just offering a supplementary resource here.

My own note of explanation about how to use them: when a Bloom filter says
something _isn 't_ present, it's guaranteed to be true. But a lot of the time
it will instead basically say "I'm not sure," and it can never guarantee that
something is definitely present. So the point is to use them in contexts where
all that matters is to very quickly filter out as many "first occurrence"
cases or "this doesn't exist" cases as possible.

That is to say, contexts where you want to identify things that you haven't
seen before and which therefore aren't present in the dataset you're checking
against, but for which it's _okay_ to be wrong-- cases where being wrong slows
you down a little or slightly reduces content quality rather than actually
breaking anything. Bloom filters are about increasing quality or speed.
Examples here~
[https://en.wikipedia.org/wiki/Bloom_filter#Examples](https://en.wikipedia.org/wiki/Bloom_filter#Examples)

In all the examples you can see how it's _okay_ for the Bloom filter to be
wrong. But it saves a lot of time by being right in the majority of cases.

My initial thought when I learned about them was "why not just use a set and
be right 100% of the time?" And the answer, I believe, is that the set
requires you to store all the items in memory in order to tell whether you've
seen one before, which is expensive. A bloom filter is much, much, much
smaller. A set is for when you need a 100% accuracy rate, a Bloom filter is
not.

~~~
mrspeaker
It's especially confusing if, like me, you're only aware of bloom filters from
post-processing shaders ([https://github.com/cansik/processing-bloom-
filter](https://github.com/cansik/processing-bloom-filter)) and just keep
typing words waiting for a pretty halo to appear around the text or something.

------
orthoganol
With Cuckoo filters visualized too:

[https://bdupras.github.io/filter-tutorial/](https://bdupras.github.io/filter-
tutorial/)

------
joeblau
This was the key to our data analytics url de-deuping platform back in 2011.
We were pulling in 50k social media messages an hour and there were lots of
duplicate links running though our pipeline. We had a 100GB bloom filter
backed by Redis to keep a list of all links that came though our system and it
worked beautifully.

~~~
cjhanks
A 100GB bloom filter? What does that mean? How many hash functions.. how many
bits?

~~~
bane
I would guess that a 100GB bloom filter would have 800 gigabits.

You can make some guesses, with 7 hashes and a false probability of 1%, a
bloom filter designed on a cardinality of 100 billion elements is a little
over 100GB.

------
elliottcarlson
The "not found" message does not clear, so repeated not founds repeat the
message.

~~~
sacert
Oops - accidentally appended in the wrong spot. Fixed, thanks!

------
bane
You can get lots of mileage out of bloom filters as well by just using them
off of disk. If you're smart about it and use mmap'd files, you only ever end
up reading a few bytes at a time. On modern SSDs you can get some really
incredible performance not too far off using them in RAM, or you can use a
bunch of them for a simple classifier.

here's an implementation:
[https://axiak.github.io/pybloomfiltermmap/](https://axiak.github.io/pybloomfiltermmap/)

------
sacert
For anyone interested, the source code is here:
[https://github.com/sacert/Bloom-Filter](https://github.com/sacert/Bloom-
Filter)

------
ne0free
I add qwerty, fire, water, qwerty, land, air, bottle, laptop all this I search
for fig and figure and it does mess..

