
Cuckoo Filters, Practically Better Than Bloom [pdf] - chewxy
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
======
KarthikT
The big thing about this paper is deletions. This is the first thing I've seen
that seems to be an easy to understand and easy to implement filter that
allows deletions. That it is close to Bloom filters in performance is a bonus.

------
nkurz
Here's my attempt at summarizing the paper with the hope that someone who
understands it better can correct me.

Cuckoo hashes are a type of closed hash table where any given item has a fixed
number of places where it can be located. Unlike other closed hashes, if there
is not room for an item in one of its possible locations, an earlier item is
evicted (in the style of a cuckoo kicking an egg out of strangers nest and
quickly laying its own) and sent to its alternate spot. This repeats until an
empty slot is found, or some large number (500) attempts have been made. If no
slot is found, the table can be rehashed to try again, or enlarged to make
more space available.

Initially cuckoo hashes were proposed to have 2 slots for each item,
corresponding to 2 different hashes. Over time, for greater space efficiency
(load) it's evolved to have a variable number of hashes >= 2 and a variable
number of slots per bucket >= 1. This paper suggests that using 2 hashes and 4
slots per bucket is a good default (a 2,4 cuckoo hash). One of the innovations
in this paper is that instead of using 2 independent hashes (the theoretical
best case) they use the output of a single hash function and a shorter saved
"signature" that can be hashed and xor'ed to find the 'alternate' slot. In
this manner, the signature can be used to move an item to its alternate
location without actually knowing or saving the original value.

This 'cuckoo filter' can be queried (LOOKUP) to see if an item is already in
the hash (INSERT). If an item is present, a 'yes' answer is guaranteed. But if
an item is not present, there is a low percentage chance that a false positive
'yes' answer will be returned. The number of bits required to store each
signature is comparable to the number of bits required for a Bloom filter with
a comparable rate of false positives. Through various optimizations (such as
storing the items in a bucket in a known order) the number of bits per item
stored drops to slightly below that of a Bloom filter.

Unlike a Bloom filter, deletions are made possible by storing multiple copies
of the same signature. But since there only a set number of places that a
given value can be stored, only a limited number of copies (minus the number
of collisions) can be stored. In this case, that number is 2 * 4 = 8. If more
inserts than this are attempted, either the insert must fail or the delete
count will be out of sync.

Performance wise, at the same level of space efficiency, insert speeds are
better than a standard Bloom filter when the hash is mostly empty (low load),
but significantly worse than a Bloom filter at high load when the hash gets
filled and many 'evictions' are necessary for each insert. But if a tradeoff
for space is acceptable (to stay at ~50% load), the Cuckoo Hash inserts much
faster than its Bloom counterpart. Lookup speeds for existing keys are 2-3x
the speed of a standard Bloom filter, and about 1.5x than the "Blocked" Bloom
filter variant. Negative lookups (key not present) are about 2x faster than
Bloom at high load, but about 3/4 the speed at low load.

Overall, it's a nice technique that should probably be used more often.

------
sharth
Thoughts:

1\. It doesn't appear to have O(1) lookup or O(1) insertion.

2\. It appears that this filter can become "full", which is very different
from a bloom filter.

3\. It appears that this doesn't have false positives on lookup either.

4\. I'm not very confident that this has a constant space requirement either.

This seems like something that I wouldn't compare to a bloom filter.

~~~
yid
> It appears that this filter can become "full", which is very different from
> a bloom filter.

How is this different from a bloom filter?

~~~
sharth
A bloom filter's insert function is basically:

    
    
        def insert(x):
            for hash in hashes:
                bloom_filter[hash(x)] = true
    

The downside is that your false positive rate will increase as you insert
elements. But, on the other hand, you get the guarantee of no false negatives.

This cuckoo filter has no false negatives or false positives, but their
Insert() function clearly can hit a case where it returns Failure.

~~~
placeybordeaux
It has false positives.

