
Probabilistic Filters By Example - PokemonNoGo
https://bdupras.github.io/filter-tutorial/
======
ko27
Cuckoo filter is used in the upcoming Cache Digest HTTP Standard:

> HTTP/2 frame type to allow clients to inform the server of their cache’s
> contents. Servers can then use this to inform their choices of what to push
> to clients.

[http://httpwg.org/http-extensions/draft-ietf-httpbis-cache-digest.html](http://httpwg.org/http-extensions/draft-ietf-httpbis-cache-digest.html)

[https://github.com/httpwg/http-extensions/pull/413](https://github.com/httpwg/http-extensions/pull/413)

~~~
codetrotter
Speaking of which, does anyone know of any HTTP/2 servers that make use of
this feature? Are clients making use of it?

~~~
ko27
Not this standard, but the H2O server can use client-cache-aware HTTP/2 push:

[https://h2o.examp1e.net/configure/http2_directives.html#http2-casper](https://h2o.examp1e.net/configure/http2_directives.html#http2-casper)

------
plopilop
Strange coincidence: just this morning I was reading a paper [0] from 2008
about Bloom filters. Specifically, their false positive rate is higher than
what is usually advertised - even on Wikipedia.

Edit: also, as the document notes, cuckoo filters and Bloom filters are not
well suited to unbounded streams. Better options include, for instance, the
Stable Bloom Filter [1] and the block-decaying Bloom Filter [2].

[0]: [http://cglab.ca/~morin/publications/ds/bloom-submitted.pdf](http://cglab.ca/~morin/publications/ds/bloom-submitted.pdf)

[1]: [http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDetExt.pdf](http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDetExt.pdf)

[2]: [https://link.springer.com/article/10.1007/s11390-008-9192-1](https://link.springer.com/article/10.1007/s11390-008-9192-1)

~~~
FreakLegion
I've only glanced at [0] but I'm skeptical on the face of it. Everything about
Bloom filters falls right out of the binomial distribution, provided you
follow the assumptions for optimal number of bits and hash functions. Given
those,

    n = 1, k = 2, m = 2

from the top of page 2 isn't even a valid configuration. Among other things,
filters in which more than half the bits are set (within some tolerance)
should be rejected. In this example a third of the filter's possible states
are completely saturated and have to be tossed out. Otherwise you have a 100%
false positive rate for that filter.

I.e. if a saturated filter were acceptable, there are 3 states: 01, 10, 11.
These have, respectively, a 1 in 3, 1 in 3, and 3 in 3 false positive chance,
for a total of 5 in 9 (I assume the 5 in 8 from the paper is a typo). But
actually there are only 2 valid states: 01, 10. The valid inputs are still 01,
10, 11, though, so the real false positive chance is 1 in 3.

Maybe I'm missing something.

~~~
plopilop
"Valid" configurations do not exist, there are only optimal/suboptimal and
saturated/unsaturated ones. The classic FPR makes no assumptions about this
(indeed optimality is derived from FPR minimum), and as such should apply as
well to suboptimal saturated configuration, were it correct.

However looking at theorem 3 in [0] I think the difference between "classic"
and "real" FPR is small, especially for larger n and m. At the end of [0] is
mentioned a study in which the authors could not reach the classic FPR in
their simulations, but attributed this result to bad pseudorandomness of
elements. [0]'s author claim that the difference may instead come from the
real FPR value (which is actually very hard to compute, so it's hard to be
sure).

The 5/8 probability mentioned is indeed correct. The first element has four
equiprobable possible outputs for its two hashes: (0,0), (0,1), (1,0) and
(1,1). Same for the second element, which yields 16 different cases. Out of
these 16 cases, 10 lead to a false positive.
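
For anyone who wants to double-check that count, here's a quick brute-force
enumeration of the n = 1, k = 2, m = 2 case (a Python sketch, assuming both
hashes are independent and uniform over the two bits):

    from itertools import product
    
    m, k = 2, 2  # 2-bit filter, 2 hash functions, 1 inserted element
    
    fp = total = 0
    # Enumerate every equiprobable hash outcome for the inserted element x
    # and the queried (absent) element y.
    for x_hashes in product(range(m), repeat=k):
        set_bits = set(x_hashes)           # bits set after inserting x
        for y_hashes in product(range(m), repeat=k):
            total += 1
            if set(y_hashes) <= set_bits:  # every bit y checks is set
                fp += 1
    
    print(fp, "/", total)  # prints 10 / 16, i.e. 5/8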

~~~
FreakLegion
> The 5/8 probability mentioned is indeed correct. First element has four
> equiprobable possible outputs for its two hashes: (0,0), (0,1), (1,0) and
> (1,1)

Thanks, that makes sense. I was thinking of states the filter could be in and
ignored the different paths to get there. Now that I've had a chance to look
at [0], it's interesting, but doesn't seem to matter for Bloom filters used in
the normal way.

> "Valid" configurations do not exist

Sure, you can use any numbers you want, even negative ones. I take a more
practical view, since Bloom filters make assumptions about choice (or
derivation) of parameters. Given those assumptions, the formula criticized in
[0] seems to work. Specifically:

> [0]'s author claim that the difference may instead come from the real FPR
> value (which is actually very hard to compute, so it's hard to be sure).

We know the FPR of any Bloom filter that actually exists. For a load factor
_l_ and number of hash functions _k_, it's _l^k_. The problem is that the FPR
for the expected load factor of a set of parameters is lower than the expected
FPR for those parameters, _iff_ the expected load factor is greater than half.
But the whole point of a Bloom filter is to have half the bits set; that's the
magic that minimizes size and FPR.

In practice you won't often have an expected load factor of exactly half, due
to rounding. Depending on how you handle that, and whether you use asymptotics
or an exact calculation, some drift in FPR makes sense. It'll be negligible as
long as you're targeting half, though. _n = 1, k = 2, m = 2_ is noticeably off
because at 3/4 it has a much higher expected load factor.
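
To make the numbers concrete, here's a small sketch (my own illustration,
under the standard independence assumptions) of the two quantities - the
textbook formula for a set of parameters, and _l^k_ for a concrete filter
whose load landed at exactly half:

    import math
    
    def classic_fpr(n, m, k):
        # Textbook formula: (1 - (1 - 1/m)^(k*n))^k
        return (1 - (1 - 1 / m) ** (k * n)) ** k
    
    def realized_fpr(load, k):
        # FPR of a concrete filter whose fraction of set bits is `load`
        return load ** k
    
    n, m = 1000, 9585               # ~9.6 bits per entry
    k = round(m / n * math.log(2))  # optimal k, which targets a load of 1/2
    print(classic_fpr(n, m, k))     # ~0.0100
    print(realized_fpr(0.5, k))     # 0.5^7, ~0.0078

The gap between the two outputs is exactly the rounding drift mentioned
above: with k rounded to 7, the expected load here is about 0.52, not 0.50.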

------
bograt
I'm familiar with both cuckoo hashing and Bloom filters but had not until now
seen a demonstration of cuckoo filters; they look very useful. But there is
one stated fact that I'm finding hard to believe:

> Cuckoo filters improve on Bloom filters by supporting deletion

The page implies that this is achieved by removing the fingerprint from the
hash table, but presumably one cannot guarantee that another key doesn't share
the same fingerprint. This would result in a false negative for that key and
violate an essential characteristic of the data structure.

Perhaps there's a nuance of the implementation I've missed.

~~~
PokemonNoGo
I had the same thoughts and checked the paper[0]. They address it under 3.3
Delete

>Deletion does not have to clean the entry after deleting an item. It also
avoids the “false deletion”....

[0][https://www.cs.cmu.edu/~dga/papers/cuckoo-
conext2014.pdf](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)

~~~
bograt
Having quickly read the paper, the deletion guarantees seem slightly weak.

An item's position in the table is derived from two things: a fingerprint (a
constant-sized hash) and a second hash (ranging over the table). Nothing
prevents two or more items from colliding on both hashes and therefore being
indistinguishable from each other.

If the number of items in such a collision exceeds twice the fixed bucket size
then deletion may result in false negatives.

In most practical applications there will be no useful way to bound the number
of collisions. The paper shows results with bucket sizes of 4 and 8, but I
don't know what the real-world probabilities of breaching these limits would
be.
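
For a rough sense of scale, here's a back-of-envelope union bound (my own
estimate under uniform-hashing assumptions, not the paper's analysis) on the
chance that 2b + 1 items become mutually indistinguishable:

    from math import comb
    
    def p_breach(n, num_buckets, f_bits, b=4):
        # Chance that some 2b+1 items all share the same fingerprint and
        # the same pair of candidate buckets - the case where deletion can
        # cause a false negative. Two items fully collide with probability
        # about 2 / (num_buckets * 2^f_bits); union-bound over all
        # (2b+1)-element subsets of the n items.
        p_pair = 2 / (num_buckets * 2 ** f_bits)
        return comb(n, 2 * b + 1) * p_pair ** (2 * b)
    
    # E.g. one million items, 2^18 buckets of size 4, 8-bit fingerprints:
    print(p_breach(1_000_000, 2 ** 18, 8))  # on the order of 1e-12

So at typical sizes the limit seems unlikely to be breached, though this says
nothing about adversarial or heavily duplicated inputs.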

~~~
hinkley
Hashing items that present themselves as equivalent is a very old problem with
plenty of literature. It can be a pain in the butt to fix retroactively but it
does tend to get fixed when there are big enough performance or correctness
concerns in play.

------
throwawayReply
I was trying out the JavaScript example, but managed to get a false negative,
which goes against the introduction, where it says negative conclusions are
always definite.

My multiset was:

    a, b, c, d, e, f, g, h, j, abc, def, asd, asds, 3g46stb6vy6vsyosyvosfsdfsdsah, oooooooooooooooo

Then, trying oooooooooooooooo a second time, the bloom filter correctly says
it might be in the set; the cuckoo filter says it is not.

I assume this is just a bug in the JavaScript implementation, because
previously the filter said the item might be in the set, but actually adding
it to the set and then checking again gave a false negative.

~~~
bdupras
Yeah - pardon the crappy JavaScript and UX. Notice the colors on the cuckoo
filter when you insert this sequence. When you try to insert
`3g46stb6vy6vsyosyvosfsdfsdsah`, the fingerprint entries in the cuckoo turn
red, indicating an unsuccessful entry. All possible fingerprint slots in the
cuckoo for that entry were already occupied, and the recursive rearranging of
entries was unable to free up a slot.

This is a downside to cuckoos (and counting blooms): even when the filter has
reasonable capacity remaining, an entry can be denied due to collision with
previous entries and saturation of their slots.

The UX on this demo could be better - e.g. not inserting into the bloom
unless the cuckoo insert is successful (to keep them from diverging). A
message indicating an unsuccessful insertion would also be nice.

~~~
bloomer
I wish this was something that people brought up more prominently when
discussing cuckoo filters. They fail catastrophically. A bloom filter
gradually fails as the number of items inserted increases and the false
positive rate tends to one. But a cuckoo filter that fills up the buckets for
a fingerprint has to reject an insert meaning that either you start to have
false negatives or to guarantee that you don't have false negatives you have
to immediately return true for any query, which is a false positive rate of
exactly one with a sharp discontinuity in behavior from before the failed
insert. I think the salesmanship of cuckoo filters has been a little overdone
and in most applications a bloom filter is a better choice.

~~~
bdupras
Yes, cuckoo filters and counting bloom filters can start rejecting insertions
before the filter is fully loaded - cuckoos because of the inability to find
an open slot, and counting blooms because of saturation of a counter. Note
that this is only the case if deletion or counting support is required.

In some cases, like when there's a high likelihood of duplicate entries, this
can represent a catastrophic failure mode. This is a good point, and when I
get around to updating the tutorial, I'll make note of it.

Regarding the failure mode: an insertion can be knowingly rejected without
mutating the filter, so the filter remains useful for subsequent insertions
and all reads, retaining all its guarantees. The only loss is the rejection of
that item - in our case we used this as a signal to rebuild the filter at a
larger size.
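
For illustration, the pattern looks roughly like this (hypothetical names
throughout - `CuckooFilter`, `capacity`, and `all_items` are stand-ins, not
the tutorial's code):

    def add_with_rebuild(filt, item, all_items):
        # A rejected insert leaves `filt` unchanged, so treat it as a
        # signal to rebuild at a larger size. Cuckoo filters can't be
        # enumerated, so the rebuild re-reads the backing data.
        if filt.insert(item):
            return filt
        bigger = CuckooFilter(capacity=2 * filt.capacity)
        for existing in all_items:
            bigger.insert(existing)
        bigger.insert(item)
        return bigger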

------
chronolitus
"These filters can claim that a given entry is definitely not represented in a
set of entries, or might be represented in the set."

Unless the cuckoo filter has any two filled buckets, right? That's a pretty
critical failure mode not mentioned here. (What do you do then - accept the
small new false-negative likelihood? Start over with another filter?)

~~~
jetrink
If both buckets are filled, one of the existing occupants is evicted and
reinserted at its alternate location. If there is an item already at that
location, it is likewise evicted and reinserted. This process continues until
an open space is found or a loop is detected. If there is a loop, the hash
table needs to be rebuilt with more capacity or new hash functions.
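
A minimal sketch of that eviction loop, for the curious (the constants and
fingerprint hash are illustrative, not the paper's reference implementation):

    import random
    
    BUCKET_SIZE = 4
    NUM_BUCKETS = 1 << 16  # a power of two, so the XOR below stays in range
    MAX_KICKS = 500        # give up after this many evictions
    
    def fp_hash(fp):
        # Illustrative fingerprint-to-bucket hash.
        return (fp * 0x5bd1e995) % NUM_BUCKETS
    
    def insert(buckets, fp, i1):
        # Partial-key cuckoo hashing: the alternate bucket is derived from
        # the current bucket and the fingerprint alone, so entries can be
        # relocated without access to the original key.
        i2 = i1 ^ fp_hash(fp)
        for i in (i1, i2):
            if len(buckets[i]) < BUCKET_SIZE:
                buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))  # both buckets full: start evicting
        for _ in range(MAX_KICKS):
            j = random.randrange(BUCKET_SIZE)
            fp, buckets[i][j] = buckets[i][j], fp  # swap in, kick victim out
            i ^= fp_hash(fp)  # the victim's other candidate bucket
            if len(buckets[i]) < BUCKET_SIZE:
                buckets[i].append(fp)
                return True
        return False  # loop suspected: caller rebuilds with more capacity
    
    buckets = [[] for _ in range(NUM_BUCKETS)]
    print(insert(buckets, fp=0xAB, i1=12345))  # True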

~~~
bdupras
chronolitus, did jetrink's comment answer your question? I'm happy to clarify
further if needed.

~~~
chronolitus
ah, I understand. The website demo had me confused. thanks for your help!

------
jitans
Add five "a"s, then remove four "a"s. Cuckoo claims there are no "a" in the
multiset...

~~~
yeahboats
There is a comment elsewhere, from the creator I think, that explains this.
The fifth "a" never gets added - you'll see that the cells turn red. After
you remove four "a"s you're back at 0, and it is correctly reporting that
there aren't any.

~~~
jitans
Well, then the demo is indeed broken - it still shows an "a" inside the
multiset. Having said that, I would never use a cuckoo filter for a multiset;
it effectively becomes a bounded multiset. Also, what's said about the
k-hashing for a bloom filter is not exact: you just need a k-bit hashing
function. If the goal is to support deletion, then I would go for a counting
Bloom filter.

~~~
bdupras
Yeah, you're correct in saying that the demo is broken - sorry for the
confusion. I stopped working on this when it was good enough for an internal
talk at my company.

Re multisets: yes, cuckoo filters and counting blooms are bounded - cuckoos
by entries per bucket x 2, and blooms by bits per counter. A cuckoo's
counting capability is more a side effect of the design. It's really only
interesting in that you get limited counting in roughly the same bit space as
a non-counting bloom. I suppose you could add counting bits to a cuckoo to
support higher bounds, in a similar manner to counting blooms.
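
Since counting blooms keep coming up, here's a minimal sketch of one
(illustrative double hashing; a real implementation would pack 4-bit
saturating counters instead of plain ints):

    class CountingBloom:
        def __init__(self, m, k, cap=15):
            self.counters = [0] * m  # plain ints stand in for 4-bit counters
            self.k, self.cap = k, cap
    
        def _indexes(self, item):
            # Illustrative double hashing: derive k indexes from two hashes.
            h1, h2 = hash(item), hash((item, 0x9E3779B9))
            return [(h1 + i * h2) % len(self.counters) for i in range(self.k)]
    
        def add(self, item):
            idxs = self._indexes(item)
            if any(self.counters[i] >= self.cap for i in idxs):
                return False  # a counter is saturated: reject the insert
            for i in idxs:
                self.counters[i] += 1
            return True
    
        def remove(self, item):
            for i in self._indexes(item):
                self.counters[i] -= 1
    
        def might_contain(self, item):
            return all(self.counters[i] > 0 for i in self._indexes(item))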

I'm curious to hear more about your point on k-hashing in bloom filters -
would you expand on your statement about only needing a k-bit hashing
function? I'm happy to update the text on the tutorial to be more
clear/exact.

