Hacker News new | past | comments | ask | show | jobs | submit login
Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters (fastforwardlabs.com)
208 points by williamsmj on Nov 28, 2016 | hide | past | web | favorite | 44 comments



  > Since it’s possible to hash multiple items to the same 
  > indices, a membership test returns either false or maybe.
As someone who perpetually has trouble keeping the terminology of {false|no false} {positives|negatives} straight, I just wanted to remark that this is a very intuitive and straightforward way of explaining the guarantees of lookup in a Bloom filter.


True/False says if the test was correct. Positive/Negative is if the test passed.

True Positive: Test was correctly passed, hit. False Positive: Test was incorrectly passed, false alarm. True Negative: Test was correctly failed, reject. False Negative: Test was incorrectly failed, miss.

I find the terms to be too technical, so I remember a common situation. Medical tests are often False Positives. Which means the test was positive, but the test was wrong and you are okay (good!). From there is pretty easy to infer the rest.

Infact, that situation leads to a large issue with medical testing at a whole. If you were tested for 100 terrible diseases that you did not have, you would probably get a bunch of False Positives. Medical tests are set up for a high False Positive rate but low False Negative rate, because its less damaging to confirm a diagnosis with follow up testing than to miss a test. But confirming False Positive tests is expensive, along with the initial testing. So doctors try to limit tests to at-risk groups to spend less time on healthy patients' False Positive tests.


As an aside, this is how null is defined in many languages as well. Many behaviors of SQL, for instance, are much more intuitive if you remember that SQL sees null as "unknown".

If a boolean value is null in SQL, it conceptually means it could be either true or false - it's a maybe. If you ask "is maybe true equal to maybe true", the answer is "maybe". Likewise, asking SQL 'null = null", it answers "null", instead of true/false.


...which is why for determining if something is unknown, you ask SQL...are you indeed NULL? And SQL should say 'true' or 'false'.

    SQL null IS null


It's a much better notation because NOT(MAYBE) = MAYBE.

That's hard to explain when you work with binary booleans.


I try to think of these filters as the opposite of a cache. Bloom/Cuckoo has no false negatives and a cache has no false positives. Besides this they are very similar conceptually and used in the similar contexts.


For those who are new to Bloom filters, the article doesn't quite finish connecting the dots to a common use case:

Bloom filters give no false negatives but a controllable rate of false positives. If a Bloom filter indicates that an item has not been seen before, we can be certain that’s the case; but if it indicates an item has been seen, it’s possible that’s not the case (a false positive).

This style of probabilistic data structure can be used to achieve a significant speedup when the full test for set membership is very expensive. For example, run the test using the Bloom filter. A negative result can be taken as truth, while only positive results will have to be tested against the expensive algorithm to weed out false positives. Obviously, the expectation should be that negative results are far more common than positive results.


As a practical example of this, consider searching a corpus of texts for a phrase. You can build a bloom filter from each text ahead of time, using each individual word as the elements to put in the filter. Then when searching, you can split the search phrase into words and check if all the words occur in the bloom filters. If any word does not occur, then you know the phrase won't match that text. But if the bloom filter can't rule out any of the words, then you can do the more expensive full-text search. This means you can search the whole corpus much quicker, as you can skip any text that doesn't match the words.


I might be inclined to use a lightly massaged prefix tree (adding an "end of word" symbol) instead of a bloom filter in this particular case, depending on the size of dictionary and cost of a false-maybe.


The Cuckoo Hashing algorithm is quite fun. I've been reading quite a bit about probabilistic data structures like Bloom Filters, Cuckoo Filters, HLLs etc lately. Here's a paper where they're empirically compared to bloom filters of various types: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

> A single HyperLogLog++ structure, for example, can count up to 7.9 billion unique items using 2.56KB of memory with only a 1.65% error rate.

Incidentally, Redis has an implementation for HyperLogLogs that has an even smaller error rate of 0.83%, though it uses more memory if I recall correctly. :)


The HyperLogLog implementation in Redis is extremely well documented and amazingly quick. It drops down to non-probabilistic counting for smaller sample sizes, providing 100% accuracy with comparable speeds. As a fun exercise, which I wrote about [0], I estimated the number of unique words throughout the major works of Charles Dickens. Side note, that man had an amazing hairdo [1].

[0] https://blog.codeship.com/counting-distinct-values-with-hype... [1] https://upload.wikimedia.org/wikipedia/commons/a/aa/Dickens_...


Just so nobody is confused, HyperLogLog etc. are used for a different operation than Bloom/Cuckoo filters.

HLL is used to determine how many unique items you have, the cardinality of a set.

Bloom/Cuckoo is used to determine set membership, as in "have I seen this item before".

Not sure how much that helps but the distinction was lost on me when I started reading about them.


HyperLogLog can trade space for accuracy. One annoying thing about some implementations is that they don't give the programmer access to that trade-off.

As an example, a search engine might want to have several HyperLogLog counts for every URL on the Internet. In that case, a 64-bit count may be a good choice over 12 kilobytes.


Hi there, I've wrestled with Cuckoo filters myself, so I took a look at your lib and there's a some things that stood out.

Fingerprint size- It allows fingerprints that are too short, basically less than 4 bits doesn't allow a reasonable fill capacity. The paper authors only hinted at this, but check out the fill capacity graphs on page 6. This could be why your inserts are slowing down around 80% level when in my experience it doesn't happen till around 93%.

Modulo bias - Your filter capacity code doesn't seem to round the filter size to a power of two. This is a simple fix, but without it your array indexes will be skewed by modulo bias, possibly badly if someone picks a nasty number.

Alternate bucket positions - Your code seems to do a full hash for calculating alternate bucket positions. I know the paper mentions this, but I haven't seen anyone actually doing it :). Its a lot faster to just XOR with a mixing constant. TBH that's what most libraries are doing... whether it's a good idea is debatable :).

Fingerprint zero corner case - I'm not that great at python, but I didn't see any special handling for the rare case that the hash fingerprint is zero. In most implementations zero means empty bucket, so this could violate the "no false negatives" aspect of the filter by making items rarely disappear when they were supposed to be inserted. Most implementations either just add one to it, but I prefer to re-hash with a salt.

No victim cache - Didn't look too much into it, but I didn't see a victim slot used in your code. This will cause problems when the first insert fails. The problem is, by the time the first insert actually fails, you've relocated a bunch of different fingerprints like 500 times. It becomes unclear which fingerprint you originally tried to insert, and you're left holding a dangling item from a random place in the filter that you cannot insert. This violates the "no false negatives" mantra. Even though the filter is full it shouldn't break by deleting a random item when the first insert fails. You either need to store this last item or figure out a way to unwind your insert attempts to be able to reject the item that originally failed to insert.

Check out my Java library if you want to see how I dealt with these things. Also I have a bunch of unit tests there that I either came up with or borrowed from other Cuckoo libs. Should be pretty easy to convert some of those to python :) .

https://github.com/MGunlogson/CuckooFilter4J

Cheers, Mark


Hi Mark,

One of the authors here. Great points especially on the "No victim cache" aka insertion failure. This was mostly a test implementation for us to see how cuckoo filters work, but agree that it is critical to have these issues fixed.

I'll check out your library for improvements.

thanks!


No problem, The paper authors left a lot of the specifics off... I got burned by the victim cache too. I saw one in their reference code and thought it was pointless until weeks later when I got around to writing tests and had one of those aha moments.


> With the Cuckoo filter, we notice an insertion throughput increase of up to 85 percent as it fills up to 80 percent capacity

That's not a throughput increase, that's a throughput decrease. Throughput is operations per unit time, not time per operation.


Hmmm.... this might be an artifact of the python implementation?

The insert speed should be roughly constant until the filter begins moving items into their alternate buckets(when one of the two buckets for that item is full). In my implementation this doesn't usually happen until well over 90% capacity.

Insert speed will slow slightly as the filters fills up if the implementation uses a shortcut to determine the first empty bucket position by just checking if it's zero. Conceptually, if a bucket is empty you only need to check the first position before the insert. When the filter is half full you need to check ~%50 of bucket positions before an insert. In practice the bucket operations are so fast and buckets only hold 4 items each, I haven't noticed a difference.


Yes, I understand why it slows down (though the additional details you've provided are helpful). My point was that slowing down is correctly described as a decrease, not an increase, in throughput.

Since you've implemented one, maybe you can answer a couple more questions I have:

> Typically, a Cuckoo filter is identified by its fingerprint and bucket size. For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each bucket in the Cuckoo hash table can store up to 4 fingerprints.

Wouldn't that fingerprint length normally be two bytes rather than two bits? The Python sample code given appears to use bytes.

Also, when you start doing deletion, isn't there a nonzero probability of causing false negatives because of fingerprint collisions? I guess a 2 byte fingerprint would make that probability acceptably small, but it doesn't seem to be zero. Correct?


The code should use bits, the point of the filter is to increase space efficiency so you need to pack everything together as close as possible. Having fingerprint sizes of only multiples of 8 bits doesn't make sense.

Also, from the quote, it's probably more accurate to say a Cuckoo filter is identified by it's fingerprint size, bucket size, and capacity. In most implementations the bucket size is fixed at 4, so you can define a filter with only fingerprint size and capacity.

And yes, deleting can cause false negatives but only in a roundabout way. Deleting will only cause false negatives if you delete something that wasn't added previously. Because the filter has false positives, sometimes it will tell you it's seen an item that it hasn't. If you delete one of these false positives it will cause a false negative. If you only delete items that have been added before, and in the case of duplicate items, added the same number of times, you won't get any false negatives.

Essentially if you saved a log of all the inserts then replayed it as deletions you would have a perfectly empty filter every time with no false negatives. I actually have a unit test that does this in my java Cuckoo filter :), it exorcised quite a few demons from the code.

If you're curious the unit test is sanityFillDeleteAllAndCheckABunchOfStuff() in https://github.com/MGunlogson/CuckooFilter4J/blob/master/src...


I maintain that if you want to think of how a bloom filter works, think of it as the front desk staff at an office building. You can ask a series of questions such as "Has anyone wearing a hat come in?" "Any exceptionally tall people?" etc... It won't help answer if Peter from accounting is there, but can help give a good indication if you know enough about him today.

Obviously, if this is not a correct way to think of it, I'm open for more correct ways.


Not really, it's more like a guest list with only the first two letters of guests last names.

If someone shows you might learn they are not a guest, but having the 'correct' "La" letters does not mean they are a guest.

The advantage is a doorman can have a tiny list to direct people inside, the downside is it's poor validation and does not scale very far.

PS: You can obviously add more letters, but the same problem happens if two people have the same names.


This is just one of the hashes, though. If you consider a function of "color of shirt someone is wearing" to "are they here". Then "color of shirt" is now part of the filter. Same for "is wearing a hat", "is exceptionally tall", etc.

So, if you know that Peter is wearing a green shirt today, then you can rule out his being there if the guard says nobody wearing a green shirt has arrived.

Is this a rigorous model? No. It is just an easy mental model for me to conceptualize this.


Except the hash format needs to be decided ahead of time. If you ask a guard if they had seen someone with a peg leg they may remember this even without any other instruction.

If you really want to use a range of questions then police codes is a good proxy. The guard may say we have people in the drunk tank, but not anything else.


Yes, this is why I typically stick with easy and likely to be considered questions.

Again, not a rigorous explanation. Just an easy one. For the advanced algo, I typically word it as imagine if the front desk had a notesheet that they would tick next to "person carrying a backpack" and/or "female" and/or "male" and then I could run my candidate through the questions related to them and see if they all have non-zero quantities of people.

But, again, just a mental way of thinking about this in analogy. If you aren't the kind that needs/wants analogies, this is worthless. On that I fully agree.


That's a beautifully simple analogy.


Imagine a security guard who can't list from memory everyone who is in the building, but if you show him a picture of Peter from accounting, he can with certainty tell you that he hasn't seen Peter yet.

Not sure a bloom filter needs an analogy but there you go.


In this example, you have dropped it down to a single function in lookup. Namely, "is the person in this picture in the building?"

Now, if you are saying the guard internalizes this with "has someone with that hair color entered? What about of that height? With that facial hair? etc. Then, yes, this is the same.

Analogies can almost always help, for the class of people that are helped by analogies. :)


There are two things I don't fully understand:

1. If an item in the Cuckoo filter needs to be moved, how does one know its other hash/location if only a fingerprint of the original item is stored (i.e. it can't be hashed again)?

2. [From the linked pdf] "Insertions are amortized O(1) with reasonably high probability." In case of a rehash, every item needs to be hashed and inserted again. This seems very expensive and impractical for data on Twitter scale, even if it only happens seldomly. Or are there any workarounds to mitigate this?


To 1, they answer this in the paper; one knows it because

The xor operation in Eq. (1) ensures an important property: h1(x) can also be calculated from h2(x) and the fingerprint using the same formula. In other words, to displace a key originally in bucket i (no matter if i is h1(x) or h2(x)), we directly calculate its alternate bucket j from the current bucket index i and the fingerprint stored in this bucket by

    j = i ⊕ hash(fingerprint).


Thanks. That's quite nice, but also reduces available hashing functions to pairs with that property. But I guess that's not really a problem.


If I'm understanding your comment right, you mean that the underlying cuckoo hash table can only use 2 buckets per item, right? We actually generalized this, but only in Bin's thesis:

https://www.cs.cmu.edu/~binfan/thesis/thesis_ss.pdf

See section 4.5. It was a bit more computationally expensive, so we didn't push forward with it in our implementation, as "2,4" cuckoo hash tables (2 hash functions to identify buckets, 4 associative elements) performed so well already.


The thing with Cuckoo filters is that the item's alternate position is determined solely from it's current position and fingerprint.

Mathematically this is done by XOR'ing with a magic mixing constant borrowed from Murmur2 :) (or in my lib from Mumur3). The quality of the randomness of alternate bucket positions is...probably bad, but in practice it seems to work fine.

Because of this finding alternate positions doesn't require a rehash per se, just a few CPU instructions.

However, there IS a corner case that requires re-hash that comes up more frequently the shorter the fingerprints used. Conceptually a tag/fingerprint of zero means empty bucket, and occasionally the fingerprint hash for an item will end up being zero. In this case I've seen some libraries just add 1, which I didn't really like since it increases hash collisions slightly. Or you can do it like my lib does and just re-hash the item with a salt.


Hm, the empty/zero issue is quite interesting. How are buckets usually represented? I mean, if it were just an array of numbers, that problem wouldn't exist. If it's e.g. multiple bit sequences "bit-or-ed" into one integer, couldn't we use one additional bit per fingerprint to indicate whether it's there?

    3-bit fingerprints + sign, v: "signs"

    v      |v        |v      |v
    1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0


Depends on the language I guess. The most efficient way to represent buckets is to pack the buckets right next to eachother in a giant memory block made of bits. The other part of your question about using a flag, your method definitely works but adds an extra bit when you only need one value for zero.

Using bit flags: Using zero as empty, 4 bits could represent numbers 0-15, so 15 values plus empty/zero. With 5 bits you can represent 0-31, so 30 values plus empty/zero. If you're always using one of your bits for the flag it reduces the uniqueness of a fingerprint by almost 50%.

How the filter is built in memory: At least in Java, the best way to build the backing data structure is a giant array of longs. Really what you want is just a giant block of memory you can address by bit offset. Unfortunately modern CPU's can only index into a byte offset. Because of this, buckets and fingerprints can overlap byte boundaries. To make this happen as little as possible, it's best to go the other way and index into the largest primitive structure the CPU supports. In Java at least, this means you want to simulate bit-addressable memory using a long array.

My java filter uses a Bitset implementation borrowed from Apache Lucene because the Java builtin Bitset only allows a 32 bit index(same with java array indices). You would think this limits you to a 2GB filter but it's actually only around 270MB since the Bitset indexes are a 32 bit int. Anyways, the Lucene Bitset uses a long array to simulate bit addressable memory.

Using regular arrays of booleans doesn't work in higher level languages unfortunately. They tend to use CPU supported primitives underneath so for example a boolean could take up a byte underneath. Every Cuckoo filter library I know of besides the reference implementation and mine doesn't handle this properly and is extremely space-inefficient :(.


Does anybody have an example of an application where you need the counting/deletion properties of a counting bloom filter or cuckoo filter over a normal bloom filter? If you do, how do you deal with the filter rejecting your writes because a bucket is full?

It seems like for most applications, silently degrading (instead of rejecting the insertion) when the bloom filter is above capacity is a super useful property.


The failure mode doesn't matter much in practice. You need to track how many inserts you've done on both for practical reasons so with either you'll have a set cutoff before rebuild. Cuckoo filters are more likely to be used in place of counting bloom filters than vanilla ones.

You could always avoid duplicate inserts in cuckoo by checking contains before calling insert again. A modified insert-only-once routine would only have a small performance penalty. You can't use counting or deletion while doing this though, so its a trade-off. This same trade-off happens with counting bloom filters but they are much less space efficient.

Practically the use case for cuckoo filters over bloom probably lies in bi directional communication. Partner nodes can keep each other's state updated without needing to exchange a ton of data. Think distributed caches. So two data nodes exchange cuckoo filters of each other's data initially. As things fall out of their caches they can tell each other to delete those items from each other's filters by sending only the bucket index and fingerprint. Probably much smaller than the piece of data that represented originally. Since each data node independently knows what was in their cache there's no risk of false deletions. You can't really use bloom filters for this because you can't delete


If a writer doesn't know the difference between 'unique' and 'distinct', suspect their mathematics.


http://math.stackexchange.com/questions/24119/difference-bet...

It seems the article's use of unique does not matter much in the context. The level of math explanation seems an over-kill.


What's the difference?


A set of things are distinct if they are all different. A thing is unique if it is different from every other thing (in some implicit universe).

For example, the names of the three people in my office (Jack, Preston, Justin) are distinct, but my name (Justin) is not unique: many other people share the same name.


Isn't that (distinctness) implicit in the term 'set'? If you had duplicates it'd be some other sort of collection, no?


Rr, yes. I should have talked about a set of people with distinct names.


The election memes apparently really got to me, I was confused to see the headline 'Cuck filter' for a moment in HN frontpage.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: