Hacker News
Show HN: Cuckoo Filter Implementation in Go, Better Than Bloom Filters (github.com)
181 points by irfansharif on Aug 7, 2016 | 30 comments



After the Bloom Filter post the other day, I've been doing a deep dive into probabilistic data structures. This one was obviously on the list, along with: skip list, bloom filter, count-min sketch, linear counting, LogLog, HyperLogLog. Are there any major ones I'm missing, or even better, any lesser-known ones I should be aware of? Are there any good resources/courses on this class of data structure/algorithm?


>Are there any major ones I'm missing, or even better, any lesser known ones I should be aware of?

MinHash.

I also highly recommend this blog: https://research.neustar.biz/tag/sketching/


MinHash works well if you need to approximate Jaccard similarity. To approximate cosine similarity, you can use Charikar's SimHash: http://d3s.mff.cuni.cz/~holub/sw/shash/.
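To make the MinHash suggestion concrete, here's a toy sketch of my own (not from either link): it salts FNV-1a with the hash-function index to simulate k independent hash functions, and estimates Jaccard similarity as the fraction of signature positions on which two sets agree.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minhashSig computes a MinHash signature: for each of k hash functions
// (simulated here by salting FNV-1a with the function index), keep the
// minimum hash value seen over all elements of the set.
func minhashSig(set []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		sig[i] = ^uint64(0) // start at the maximum
		for _, s := range set {
			h := fnv.New64a()
			h.Write([]byte{byte(i)}) // per-function salt
			h.Write([]byte(s))
			if v := h.Sum64(); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// similarity estimates Jaccard similarity as the fraction of positions
// where the two signatures agree; for each independent hash function the
// minima coincide with probability equal to the Jaccard index.
func similarity(a, b []uint64) float64 {
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	return float64(match) / float64(len(a))
}

func main() {
	// True Jaccard similarity of these two sets is 3/5 = 0.6;
	// the estimate should land near that, within sampling error.
	a := minhashSig([]string{"a", "b", "c", "d"}, 128)
	b := minhashSig([]string{"a", "b", "c", "e"}, 128)
	fmt.Printf("estimated Jaccard similarity: %.2f\n", similarity(a, b))
}
```

The set contents and k=128 are arbitrary; more hash functions tighten the estimate at the cost of signature size.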



The treap, aka what people should learn instead of red black trees (because they're a lot simpler with similar performance).


You might be interested in reading the section at the end of v.3 of Knuth's "The Art of Computer Programming" on superimposed coding. It was first developed by Mooers in the late 1940s as a way to reduce the number of punches needed on a punched card.

Mooers was annoyed that computer scientists decades after him rediscovered the principles he first worked on, but without citing him. Knuth points out the connection.


There is also a C++ implementation by the original authors:

https://github.com/efficient/cuckoofilter


For the uninformed like me, what is the real world application for this? This is 100% curiosity and not criticism.

Basically I'm curious in plain English, what kind of application would use this and for what? Also, am I understanding correctly that this is simply a sort of low fidelity hash table?


Bloom filters are [one of?] the most used 'negative cache' structures. Consulting a bloom filter can give a decisive answer that a key is not in a collection (i.e., the key was never added). You can then avoid a (presumably more expensive) lookup. I used bloom filters often in my database work.
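That check-before-lookup pattern might look like the following minimal Bloom filter sketch (my own toy code, not any particular library; FNV-1a with double hashing is just one reasonable choice of hash scheme):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: m bits, k probe positions derived
// from one FNV-1a hash via double hashing (h1 + i*h2).
type bloom struct {
	bits []uint64
	m, k uint64
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) hashes(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // cheap derived second hash
	return h1, h2
}

func (b *bloom) Add(key []byte) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MightContain returning false is definitive: the key was never added.
// Returning true may be a false positive.
func (b *bloom) MightContain(key []byte) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add([]byte("alice"))
	// Only when the filter says "maybe" do we pay for the real lookup.
	if f.MightContain([]byte("alice")) {
		fmt.Println("maybe present: fall through to the expensive lookup")
	}
	if !f.MightContain([]byte("mallory")) {
		fmt.Println("definitely absent: skip the lookup entirely")
	}
}
```

The m=1024, k=4 parameters are illustrative; in practice you size them from the expected item count and target false-positive rate.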

I was not familiar with cuckoo filters (I am familiar with cuckoo hashing, though); using them as a 'better bloom' seems to be a relatively new development. A quick googling found 'Cuckoo Filter: Practically Better Than Bloom' [https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf].


The paper also cites the main advantage of cuckoo filters over bloom: the ability to remove a key. (Vanilla) bloom filters require a rebuild of the entire filter if a key is removed from the set. There are more details in the paper; it seems a good read (I'm going through it now).


Cuckoo. :)


Here's one example. Web browsers often use bloom filters or similar data structures to check for malicious URLs. They first check the URL against the local bloom filter, and if a match is found, they double check against an online API to make sure it wasn't a false positive.

Shipping a full list of malicious URLs would be too expensive, but a bloom filter fits in a fraction of the space and can still eliminate the vast majority of the API calls.


They're useful in cases like this, where the set of entries is much smaller than the total set size (e.g., all URLs). Once you approach sets with 50% membership, a bitset becomes more efficient (accuracy per space) than a bloom filter or any other probabilistic structure.


Interesting. I may in fact have a use for that.


Say you are writing a spider and want to determine whether you have to revisit a website today or tomorrow. The small chance of visiting it tomorrow instead of today, even when the crawler should have visited it today, is a trade-off you would take instantly if it improves look-up speed, cuts space requirements, and makes something like this feasible in the first place without a huge budget.

Everything related to caching when working with a huge dataset is a good use-case. Another case would be when the dataset does not fit into memory, but you need very fast lookups.

Edit: To clarify, you are obviously not caching the data itself, but a flag for whether to use the cache/do nothing or compute something.


I work at Google, and this is where I've seen them used the most, though I haven't used them directly (yet). There are plenty of documents published externally, especially on Bigtable and other structures, for example: https://www.google.com/search?q=bigtable+bloom or http://www.bing.com/search?q=bigtable+bloom


> For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimised Bloom filters. Some possible use-cases that depend on approximated set-membership queries would be databases, caches, routers, and storage systems where it is used to decide if a given item is in a (usually large) set, with some small false positive probability. Alternatively, given it is designed to be a viable replacement to Bloom filters, it can also be used to reduce the space required in probabilistic routing tables, speed longest-prefix matching for IP addresses, improve network state management and monitoring, and encode multicast forwarding information in packets, among many other applications.

from the original research paper and the repository itself, thank you for asking!


One could also use it to verify membership (or non-membership) in a set without having access to the set.

Suppose I have a secret list of websites from whom you shouldn't load ads. I can then just publish a bloom filter which you can use in your browser to see if a web request should be allowed or not, without having access to the original list. I know this is contrived, but hopefully you get the idea.


This is the paper that the readme links to, explaining how a cuckoo filter works: https://www.cs.cmu.edu/%7Edga/papers/cuckoo-conext2014.pdf


What is the situation with patents on the cuckoo filter? I saw no fewer than two pending applications with some quick googling, one by NetSpeed and one by TI, and a quick USPTO search shows over a dozen patents.

Is the entire useful concept patented at this point?


What I don't understand is the comment on the delete function:

  // tries deleting 'bonjour' from the filter, but may delete another
  // element; this can occur when a byte slice with the same fingerprint
  // as another is 'deleted'
If it deletes another element, then the guarantee of no false negatives is no longer kept. Is that right?


This is basically a hash collision: two keys hash to the same digest. You'd need to add a refcount to keep track, the same as counting Bloom filters [https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters].
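The refcount idea can be sketched like this (toy code of my own, not the linked implementation; uint8 counters and FNV are illustrative, and a real counting filter would also guard against counter overflow):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// countingFilter replaces each bit of a Bloom filter with a small
// counter, so a deletion decrements instead of clearing a shared bit.
type countingFilter struct {
	counts []uint8
	k      uint64
}

func newCounting(m, k uint64) *countingFilter {
	return &countingFilter{counts: make([]uint8, m), k: k}
}

// positions derives k probe positions from one FNV-1a hash
// via double hashing.
func (c *countingFilter) positions(key []byte) []uint64 {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	pos := make([]uint64, c.k)
	for i := range pos {
		pos[i] = (h1 + uint64(i)*h2) % uint64(len(c.counts))
	}
	return pos
}

func (c *countingFilter) Add(key []byte) {
	for _, p := range c.positions(key) {
		c.counts[p]++ // sketch ignores counter saturation
	}
}

func (c *countingFilter) Delete(key []byte) {
	for _, p := range c.positions(key) {
		c.counts[p]--
	}
}

func (c *countingFilter) MightContain(key []byte) bool {
	for _, p := range c.positions(key) {
		if c.counts[p] == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newCounting(1024, 4)
	f.Add([]byte("bonjour"))
	f.Add([]byte("bonjour"))
	f.Delete([]byte("bonjour"))
	// One copy's worth of counts remains, so the key is still reported.
	fmt.Println(f.MightContain([]byte("bonjour")))
}
```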


Yes, you are correct. Drawing from the example in the README, you could 'delete' "bonjour" assuming it had the same fingerprint as "buongiorno"; subsequent lookups for "buongiorno" would thereby return negatives (assuming it was entered only once). The original research paper makes no explicit guarantee of no false negatives; the invariant can only be maintained if, prior to deleting an element, it is ensured that the element is definitely in the filter (e.g., based on records in external storage).


Even if the element is definitely in the filter, another element can have the same fingerprint. Thus, deleting the element that is definitely in the filter will also cause false negatives for that other element. Isn't that right?


Are you assuming the other element is also in the filter? My assumption is that your 'deletions' occur only when you know beforehand that the element is already in the filter. If two elements with the same fingerprint are added to the filter, the fingerprint is stored in two separate slots. Deleting either one removes only a single copy, leaving the other intact (you could be deleting the fingerprint that technically belongs to the other element, but it's all the same), so no, in this case you do not get false negatives.
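That multiset behaviour can be illustrated with a toy bucket (the fingerprint values are hypothetical; a real cuckoo filter additionally bounds the bucket size and tracks two candidate buckets per key):

```go
package main

import "fmt"

// bucket models a cuckoo-filter bucket as a small multiset of
// fingerprints: equal fingerprints occupy separate slots, so deleting
// one copy leaves the other intact.
type bucket []uint16

func (b *bucket) insert(fp uint16) { *b = append(*b, fp) }

// delete removes at most one matching fingerprint, reporting whether
// a copy was found.
func (b *bucket) delete(fp uint16) bool {
	for i, v := range *b {
		if v == fp {
			(*b)[i] = (*b)[len(*b)-1]
			*b = (*b)[:len(*b)-1]
			return true
		}
	}
	return false
}

func (b bucket) contains(fp uint16) bool {
	for _, v := range b {
		if v == fp {
			return true
		}
	}
	return false
}

func main() {
	var bkt bucket
	// Pretend "bonjour" and "buongiorno" share fingerprint 0xBEEF.
	bkt.insert(0xBEEF)
	bkt.insert(0xBEEF)
	bkt.delete(0xBEEF) // removes one copy only
	fmt.Println(bkt.contains(0xBEEF))
}
```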


How does this compare to the implementation in the "BoomFilters" library?

https://godoc.org/github.com/tylertreat/BoomFilters#CuckooFi...


Hadn't come across that implementation before; I'll try putting up benchmarks as soon as I can.


Can you persist your stored filter items with something like bolt? The struct fields are all unexported.


How are fingerprints supposed to be generated? Just another hash?


Essentially, yes.
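For illustration, a fingerprint is typically just a few low bits of a hash of the key, with zero often reserved as the empty-slot marker (a sketch under those assumptions; FNV-1a and 12 bits are arbitrary choices here, and real implementations vary):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fingerprint derives a short f-bit fingerprint from a full hash of the
// key, remapping zero so it can serve as the empty-slot marker.
func fingerprint(key []byte, fbits uint) uint16 {
	h := fnv.New64a()
	h.Write(key)
	fp := uint16(h.Sum64() & ((1 << fbits) - 1))
	if fp == 0 {
		fp = 1 // keep 0 free to mean "empty slot"
	}
	return fp
}

func main() {
	fmt.Printf("fingerprint(%q) = %#04x\n",
		"bonjour", fingerprint([]byte("bonjour"), 12))
}
```

Shorter fingerprints save space but raise the collision (and hence false-positive) rate, which is the trade-off the paper's analysis quantifies.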



