
Cuckoo filters and their analysis - robinhouston
http://11011110.livejournal.com/327681.html
======
jbapple
Note that the comparative experiments in the original Bloom filter paper are
very sensitive to the implementation details of the other Bloom filters they
are comparing to. For instance, in the implementation of blocked Bloom
filters, they get a false positive rate of 0.43% when using 13 bits per
element. I get 0.27%, perhaps because in each small Bloom filter I am using a
split Bloom filter.

I emailed one of the authors of the original paper, and, while their Cuckoo
filter implementation is available, they have lost their experimental setup
and comparison code.

------
lqdc13
There is also quotient filter that has theoretical analysis and implementation
(1).

Also smaller than bloom filter and lets you delete.

(1) [https://github.com/vedantk/quotient-
filter](https://github.com/vedantk/quotient-filter)

~~~
dgacmu
Quotient filters are cool, but are both modestly less memory-efficuent and
quite a bit slower than Cuckoo filters. Their advantage is primarily
theoretical -- Cuckoo filters force a minimum false positive rate that grows
with logN/4 (I.e., if you insert 2^32 items, you must use at least 8 bits per
item inserted). In practice, that's usually not much of a problem, but its why
we say the Cuckoo filter is best for applications where you want a 1% or lower
false positive rate.

(disclosure: I'm one of the inventors of Cuckoo filters. Happy to answer
questions about them, and very excited to see theoretical progress on them!)

------
prashnts
Is there any way of being able to store a corresponding value to the elements
in the set? I understand that hash tables do that -- however, what I wish to
ask is that do we have a space efficient data structure for key-value pairs?

~~~
dgacmu
Yes, many - but it depends how practical you want it to be and what tradeoffs
you can accept. Broadly speaking, you're asking about the class of data
structures for (almost) minimal perfect hashing. For example:

* ECT, or Entropy-Coded Tries, (paper: [http://www.cs.cmu.edu/~hl/papers/ect-alenex2013.pdf](http://www.cs.cmu.edu/~hl/papers/ect-alenex2013.pdf) ) are a static data structure (no insertions after you build it) that are very fast to construct, take about 7 microseconds of CPU time to lookup, and have 2.5 bits of overhead per item inserted. This is the best data structure I know of for _external_ perfect hashing, where you want to store the key/value pair itself on something expensive to access, like an SSD. Implementation available in SILT: [https://github.com/silt/silt](https://github.com/silt/silt)

* CMPH ([http://cmph.sourceforge.net/concepts.html](http://cmph.sourceforge.net/concepts.html)) and friends are very fast MPH functions when you want to store the key value pairs in memory.

* There was a really cool paper at SEA a few years ago on Retrieval and Perfect Hashing using Fingerprinting ([http://link.springer.com/chapter/10.1007%2F978-3-319-07959-2...](http://link.springer.com/chapter/10.1007%2F978-3-319-07959-2_12)) that came out of SAP. I don't know if they have code available. It's also internal, but uses the same fingerprint-centric idea as ECT, but in a faster way that doesn't require the expensive huffman coding that ECT uses.

There's a ton of very fun research in this area -- this is just the tip of the
iceberg (and I haven't really looked at it for two years, so there's probably
been more since).

~~~
prashnts
Thanks for these! I haven't given a thorough read yet, will do so later today.

Probably off topic, however, in the CMPH link you quote above, what I gather
about the MPH functions is that in any practical setting they probably would
require a combination of hash functions to avoid collisions -- right?

~~~
dgacmu
Because they're internal hash functions, they have access to the original key
being used, so they can tolerate a few collisions -- in the implementation of
CHD, there's a -t parameter for number of collisions in a bin to provide a bit
of flexibility. So they don't require a completely collision-free hash
function, just one that on the input dataset is not going to produce more than
a few.

------
B1FF_PSUVM
In comments:

"other things that do work well in practice but not in theory?"

"Levit's algorithm for shortest paths in a graph"

Hmm. Quick search, seems it's called Pape-Levit for independent discoveries,
twenty year old benchmark paper at:

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.8746&rep=rep1&type=pdf)

