
Show HN: Cuckoo Filter Implementation in Go, Better Than Bloom Filters - irfansharif
https://github.com/irfansharif/cfilter
======
_asummers
After the Bloom Filter post the other day, I've been doing a dive on
probabilistic data structures. This one was obviously on the list along with:
skip list, bloom filter, count-min sketch, linear counting, loglog,
hyperloglog. Are there any major ones I'm missing, or even better, any lesser
known ones I should be aware of? Are there any good resources/courses on this
class of data structure/algorithm?

~~~
dmit
_> Are there any major ones I'm missing, or even better, any lesser known ones
I should be aware of?_

MinHash.

I also highly recommend this blog:
[https://research.neustar.biz/tag/sketching/](https://research.neustar.biz/tag/sketching/)

~~~
senderista
MinHash works well if you need to approximate Jaccard similarity. To
approximate cosine similarity, you can use Charikar's SimHash:
[http://d3s.mff.cuni.cz/~holub/sw/shash/](http://d3s.mff.cuni.cz/~holub/sw/shash/).

------
lrem
There is also a C++ implementation by the original authors:

[https://github.com/efficient/cuckoofilter](https://github.com/efficient/cuckoofilter)

------
donatj
For the uninformed like me, what is the real world application for this? This
is 100% curiosity and not criticism.

Basically I'm curious in plain English, what kind of application would use
this and for what? Also, am I understanding correctly that this is simply a
sort of low fidelity hash table?

~~~
sluukkonen
Here's one example. Web browsers often use bloom filters or similar data
structures to check for malicious URLs. They first check the URL against the
local bloom filter, and if a match is found, they double check against an
online API to make sure it wasn't a false positive.

Shipping a full list of malicious URLs would be too expensive, but a bloom
filter fits in a fraction of the space and can still eliminate a vast majority
of the API calls.
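
A minimal Go sketch of that two-stage check, for illustration only. It is not what any browser actually ships: the `bloom` type, the `isMalicious` helper, and the `remote` callback standing in for the online API are all hypothetical names, and the filter derives its probes by double hashing over FNV-1a as an arbitrary choice.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy Bloom filter: m bits, k probes per item.
type bloom struct {
	bits []uint64
	m, k uint64
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// probes returns two base hashes; probe i is (h1 + i*h2) mod m.
func probes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the input to obtain a second hash
	h2 := h.Sum64() | 1
	return h1, h2
}

func (b *bloom) Add(s string) {
	h1, h2 := probes(s)
	for i := uint64(0); i < b.k; i++ {
		p := (h1 + i*h2) % b.m
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain never returns false for an added item, but may return
// true for an item that was never added (a false positive).
func (b *bloom) MayContain(s string) bool {
	h1, h2 := probes(s)
	for i := uint64(0); i < b.k; i++ {
		p := (h1 + i*h2) % b.m
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// isMalicious asks the local filter first and only falls back to the
// expensive remote check (hypothetical) when the filter reports a hit.
func isMalicious(local *bloom, url string, remote func(string) bool) bool {
	if !local.MayContain(url) {
		return false // definite negative, no API call needed
	}
	return remote(url) // confirm, since the hit may be a false positive
}

func main() {
	blocklist := map[string]bool{"http://evil.example/login": true}
	local := newBloom(1<<12, 3)
	for u := range blocklist {
		local.Add(u)
	}
	remoteCalls := 0
	remote := func(u string) bool { remoteCalls++; return blocklist[u] }
	for _, u := range []string{"http://evil.example/login", "https://example.org/"} {
		fmt.Println(u, isMalicious(local, u, remote))
	}
	fmt.Println("remote calls:", remoteCalls)
}
```

The final answer is always correct because every local hit is confirmed remotely; the filter only decides how often the remote call happens.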

~~~
loeg
They're useful in cases like this, where the set of entries is much smaller
than the total set size (e.g., all URLs). Once you approach sets with 50%
membership, a bitset becomes more efficient (accuracy per space) than a bloom
filter or any other probabilistic structure.
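
A rough way to see this, as a sketch under the usual textbook approximation for Bloom filters (false-positive rate about `(1 - e^(-k/b))^k` for `b` bits per stored element and `k` hash functions): with half of a universe of U items in the set, an exact bitset costs U bits, which is 2 bits per member. The hypothetical `bloomFPR` helper below evaluates what a Bloom filter gets out of the same 2 bits per member.

```go
package main

import (
	"fmt"
	"math"
)

// bloomFPR approximates the false-positive rate of a Bloom filter that
// spends bitsPerElem bits per stored element and uses k hash functions.
func bloomFPR(bitsPerElem float64, k int) float64 {
	return math.Pow(1-math.Exp(-float64(k)/bitsPerElem), float64(k))
}

func main() {
	// Same budget as an exact bitset at 50% membership: 2 bits per member.
	for _, k := range []int{1, 2} {
		fmt.Printf("k=%d approx fpr=%.3f\n", k, bloomFPR(2, k))
	}
}
```

Under that approximation both settings come out at roughly a 39 to 40 percent false-positive rate, against zero errors for the bitset at the same size.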

------
perlgeek
This is the paper the readme links to that explains how a Cuckoo filter
works: [https://www.cs.cmu.edu/%7Edga/papers/cuckoo-conext2014.pdf](https://www.cs.cmu.edu/%7Edga/papers/cuckoo-conext2014.pdf)

------
Lord_Nightmare
What is the situation with patents on the Cuckoo filter? I saw no fewer than
two pending applications with some quick googling, one by NetSpeed, one by TI,
and a quick USPTO search shows over a dozen patents.

Is the entire useful concept patented at this point?

------
Kubuxu
What I don't understand is the comment on the delete function:

    
    
      // tries deleting 'bonjour' from filter, may delete another element
      // this could occur when another byte slice with the same fingerprint
      // as another is 'deleted'
    

If it deletes another element, then the guarantee of no false negatives is no
longer kept. Is that right?

~~~
irfansharif
yes, you are correct. drawing from the example in the README, you could
'delete' "bonjour"; assuming it had the same fingerprint as "buongiorno",
subsequent lookups for "buongiorno" would then return negatives (assuming
it was entered only once). the original research paper makes no explicit
guarantee of _no false negatives_; the invariant can only be maintained if,
prior to deleting an element, it is ensured that the element is _definitely_
in the filter (e.g., based on records on external storage).

~~~
kobigurk
Even if the element is definitely in the filter, another element can have the
same fingerprint. Thus, deleting the element that is definitely in the filter
will also cause false negatives for that other element. Isn't that right?

~~~
irfansharif
are you assuming the other element is also in the filter? my assumption is
that your 'deletions' occur only when you know beforehand that the element is
already in the filter. if two elements with the same fingerprint are added to
the filter, the fingerprint is stored two separate times. deleting either one
of them removes only a single copy and so does not affect the other (you could
be deleting the fingerprint that _technically_ belongs to the other element,
but it's all the same), so no, in this case you do not get false negatives.
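
Both cases from this subthread can be reproduced with a toy model. This is only a sketch and not the cfilter code: `toyFilter` is a hypothetical type that keeps one stored copy of a fingerprint per insert and skips buckets and relocation entirely, and `fingerprint` is deliberately weak (the first byte of a non-empty item) so that "bonjour" and "buongiorno" are guaranteed to collide.

```go
package main

import "fmt"

// toyFilter stores one fingerprint copy per insert.
type toyFilter struct{ fps []byte }

// fingerprint is contrived: first byte only, so collisions are easy to show.
func fingerprint(item string) byte { return item[0] }

func (f *toyFilter) Insert(item string) {
	f.fps = append(f.fps, fingerprint(item))
}

func (f *toyFilter) Lookup(item string) bool {
	fp := fingerprint(item)
	for _, s := range f.fps {
		if s == fp {
			return true
		}
	}
	return false
}

// Delete removes a single stored copy of the item's fingerprint, if present.
func (f *toyFilter) Delete(item string) bool {
	fp := fingerprint(item)
	for i, s := range f.fps {
		if s == fp {
			f.fps = append(f.fps[:i], f.fps[i+1:]...)
			return true
		}
	}
	return false
}

func main() {
	// Case 1: delete an item that was never inserted.
	a := &toyFilter{}
	a.Insert("buongiorno")
	a.Delete("bonjour")                // removes buongiorno's only copy
	fmt.Println(a.Lookup("buongiorno")) // false: a false negative

	// Case 2: both colliding items were inserted, then one is deleted.
	b := &toyFilter{}
	b.Insert("bonjour")
	b.Insert("buongiorno")
	b.Delete("bonjour")                // removes one of two copies
	fmt.Println(b.Lookup("buongiorno")) // true: the other copy remains
}
```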

------
jzelinskie
How does this compare to the implementation in the "BoomFilters" library?

[https://godoc.org/github.com/tylertreat/BoomFilters#CuckooFi...](https://godoc.org/github.com/tylertreat/BoomFilters#CuckooFilter)

~~~
irfansharif
hadn't come across that implementation before, i'll try putting up benchmarks
as soon as i can.

------
pram
Can you persist your stored filter items with something like bolt? The struct
fields are all unexported.

------
ejbs2
How are fingerprints supposed to be generated? Just another hash?

~~~
irfansharif
essentially, yes.
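
For anyone curious what that can look like, here is a minimal sketch of the partial-key cuckoo hashing idea from the paper: the fingerprint is a few bits of a hash of the item, and the second candidate bucket is the first bucket XORed with a hash of the fingerprint. The names (`hash64`, `fingerprint`, `candidates`, `altIndex`), the 8-bit fingerprint, the bucket count, and the use of FNV-1a are all assumptions made for illustration and are not taken from cfilter.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets is a power of two so that masking keeps XOR results in range.
const numBuckets = 1 << 10

func hash64(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

// fingerprint keeps the top 8 bits of the item's hash. Zero is remapped
// to 1 so that 0 could be reserved to mean "empty slot".
func fingerprint(item []byte) byte {
	fp := byte(hash64(item) >> 56)
	if fp == 0 {
		fp = 1
	}
	return fp
}

// altIndex computes the other candidate bucket from a bucket index and
// a fingerprint alone, without needing the original item.
func altIndex(i uint64, fp byte) uint64 {
	return (i ^ hash64([]byte{fp})) & (numBuckets - 1)
}

// candidates returns the two buckets an item is allowed to occupy.
func candidates(item []byte) (i1, i2 uint64) {
	i1 = hash64(item) & (numBuckets - 1)
	i2 = altIndex(i1, fingerprint(item))
	return i1, i2
}

func main() {
	item := []byte("bonjour")
	fp := fingerprint(item)
	i1, i2 := candidates(item)
	fmt.Println("fingerprint:", fp, "buckets:", i1, i2)
	// Applying altIndex to either bucket yields the other one.
	fmt.Println(altIndex(i2, fp) == i1, altIndex(i1, fp) == i2)
}
```

Because the alternate bucket depends only on the current bucket and the stored fingerprint, an entry can be relocated during insertion even though the original item is no longer available.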

