

Show HN: BoomFilters – Probabilistic data structures for processing streams - tylertreat
https://github.com/tylertreat/BoomFilters

======
hurin
_/ / The InverseBloomFilter may report a false negative but can never report a
false positive._

This is actually not true and not possible. There is a good explanation here:
[http://cstheory.stackexchange.com/a/14455/43](http://cstheory.stackexchange.com/a/14455/43)

There will be a low probability of false positives ~ 2^32*k. (Also it's not an
inverse-bloom filter, it's a cache).

~~~
frankmcsherry
Well, I'm looking at the code and it looks like it doesn't report false
positives. I understand that you expect it to hash the content down and store
only the hash, but it doesn't do that.

EDIT: I suppose my point was just that there is a difference between "I think
you chose a bad name for your class" and "This is actually not true and not
possible". It seems both true and possible (and arguably inappropriately
named).

~~~
zeeboo
Then it doesn't do it in constant space. Constant space is the property that
makes it not a true inverse bloom filter, and instead is just a cache with an
interesting hashed based eviction method.

That said, this is just a silly definition issue. You can provide a data
structure with no false positives, and some false negatives, but you can't do
it in constant space. I'm not aware of if you can absolutely bound the rate of
false negatives.

~~~
hurin
I guess it's storing the pointer to the data? That sounds pretty bad.

At any-rate it's one of two things, either you store the whole item or you
store a hash. If you're storing the whole item it's not a filter and you don't
get constant space guarantees as you mentioned. If you store a hash - it's
possible to have false positives.

You can always get a bound for false-negatives (or none at all) same as it
with any hash-table. If you wanted to, you could Cuckoo hash before eviction
and you should be able to get a pretty-good theoretical bound for false-
negatives ([http://infoweekly.blogspot.com/2010/02/cuckoo-
hashing.html](http://infoweekly.blogspot.com/2010/02/cuckoo-hashing.html)).

------
wcdolphin
That InverseBloomFilter sounds like an LRU cache of hashes. A true inverse
bloom filter can be proven to be impossible, sadly.

------
Analog24
_With enough data, a traditional Bloom filter "fills up", after which it has a
false-positive probability of 1._

I don't think this correct. The false-positive probability wouldn't become 1,
it's just that every query would return positive, regardless if it's a true or
false positive. I think what you meant to say was that the positive
probability is 1, not the false-positive probability.

------
natch
Why does the readme use Boom and Bloom interchangeably? Is it just a typo that
for some reason is randomly scattered throughout the document (and in the repo
name itself)? Or is there meant to be a difference? I don't see anywhere that
the difference is explained, if it's intentional.

~~~
lstamour
Looks consistently used to me. The project is a collection of stream-optimized
bloom filters and the project or its filters, as a whole, are referred to as
"Boom Filters" in comparison to the "Classic Bloom Filters" referenced in the
paragraph directly above. Basically, use "bloom" except when referring to this
exact source code or library name.

See also: references section at the bottom of the readme for background on
bloom filters, examples, etc.

------
eternalban
You should checkout Tyler's blog. Good solid content for the concurrency
geeks.

