
Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters - rpcope1
https://bdupras.github.io/filter-tutorial/
======
armon
I've given a talk at Papers We Love about Bloom Filters
([http://paperswelove.org/2015/video/armon-dadgar-on-bloom-
fil...](http://paperswelove.org/2015/video/armon-dadgar-on-bloom-filters-and-
hyperloglog/)) for those wanting to learn more. There have been quite a few
follow on extensions to standard bloom filters that address some of the
shortcomings mentioned here.

For example, the cost of having K hash functions can be avoided with some
simple math. This is discussed in "Less Hashing, Same Performance"
([http://www.eecs.harvard.edu/%7Ekirsch/pubs/bbbf/esa06.pdf](http://www.eecs.harvard.edu/%7Ekirsch/pubs/bbbf/esa06.pdf)).
You can also avoid the FPP rate from saturating at 100% using a technique
called "Scalable Bloom Filters"
([http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf](http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf)).
The basic idea is you start with a small BF and layer on progressively larger
filters. This also reduces the memory you need to allocate up front without
worrying about the FPP rate saturating. As others have mentioned, Counting
Bloom Filters allow for counts and deletion as well. The basic point is that
there has been a _lot_ of additional work on bloom filters. One of my
favorites is "Adaptive Range Filters"
([http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf](http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf)).

At an advertising company I previously worked for, we relied on bloom filters
and hyperloglogs quite heavily for our realtime analytics pipeline. We open
sourced our C daemons
([https://github.com/armon/bloomd](https://github.com/armon/bloomd) and
[https://github.com/armon/hlld](https://github.com/armon/hlld)) which have
been running in production for several years.

------
dangoldin
Neat visualization. One of the nicest things about a Bloom Filter is that
you'll never get a false negative (saying something is not inside even if it
is). I'm not super familiar with Cuckoo filters but since it allows for
removal it's possible to get false negatives.

You can replicate that by entering "cat" 5 times and then removing it 4 times.
Cuckoo will indicate it's not in the filter despite it still showing up there
a single time. I guess that's the "limited counting" aspect of it. Really
cool.

~~~
grogers
Based on the explanation, cuckoo filters need to reject inserts that would
break the invariants, which the demo doesn't seem to do.

You are right that deletion from a probabilistic filter is sort of weird,
because it is possible to delete something that never existed if it happens to
exist as a false positive. If you can guarantee that the false deletion
doesn't happen, then it should work as expected. There are variants of bloom
filters that can do this as well.

~~~
dangoldin
Ah got it - thanks for the explanation. That makes a lot of sense now.

------
echlebek
I'm a little confused by the Big-Oh notation O(2). In my CS classes we were
taught that O(c) = O(1). So seeing something like 'O(1) to O(2)' is a bit odd.

But I love the visualization - it's very clear how things are working
conceptually.

~~~
malaggan
It would seem that the author of the article is abusing the Big-O notation.
When saying some operation is O(f(n)) they probably mean that the operation
takes exactly f(n) steps, rather than the usual meaning of asymptotic
complexity.

~~~
sweezyjeezy
That's not true, O(f(n)) means at most f(n) up to a constant. If you wanted to
say something took exactly f(n), surely you would just write f(n)...

~~~
lorenzhs
No, it means asymptotically equal to f(n). O(n^2 + n log n) = O(n^2), despite
_n log n_ not being a constant.

------
forgotpwtomain
This is cool visualization but the percentages for 140 bits aren't really
telling you much. If you need to use this in production, check the original
papers (there was an improved analysis recently [0]) and test your own
production loads (another point is that with cuckoo filters you can greatly
reduce false-positives, if your application doesn't tolerate them well by
increasing key size).

Also I notice this comparison isn't using easily comparable implementations --
e.g. this bloom filter does not support delete, but bloom filter
implementation which support delete operations do exist.

[0]
[https://news.ycombinator.com/item?id=11795779](https://news.ycombinator.com/item?id=11795779)

~~~
teraflop
Also the "Insertions may be rejected" thing would be a deal-breaker for a lot
of applications.

If you can't guarantee that an insert will be successful, then you have to be
willing to either tolerate false positives _and_ false negatives, or throw
away the cuckoo filter and re-create it from scratch.

~~~
bdupras
Yeah, absolutely. You get the same "feature" from Counting Blooms - i.e. if an
insertion would overflow any of the counters, the filter can knowingly reject
the insertion.

This is both a feature and a limitation. In our use case, our filter grows
predictably, and every few months a rejected insertion triggers a
resize/rebuild from source data. Because we could tolerate a full rebuild, we
chose for now not to implement the technique of growing the filter by adding
successively larger filter segments with lower fpp guarantees, though that is
an option.

------
malaggan
Bloom filters do support counting as opposed to what the article claims. See
[1] for an example.

[1]
[https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_int...](https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets)

~~~
lorenzhs
Counting Bloom filters exist (but use more space). But your link estimates the
number of elements inserted into the filter, which is something different.
"Counting" is usually used to refer to the count per key, as in "how often did
I insert key x?".

------
mankash666
Why is these data structures called "filters" rather than hashes.

~~~
forgotpwtomain
> Why is these data structures called "filters" rather than hashes.

I would guess because they are probabilistic data-structures which are
information-lossy. Bloom Filters and Cuckoo Filters aren't guaranteed to give
you a correct result and they cannot retrieve the original input. Cuckoo
Filters of course take the name from Cuckoo hashing.

