
Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters - williamsmj
http://blog.fastforwardlabs.com/post/153566952648/probabilistic-data-structure-showdown-cuckoo
======
kibwen

      > Since it’s possible to hash multiple items to the same 
      > indices, a membership test returns either false or maybe.
    

As someone who perpetually has trouble keeping the terminology of {false|no
false} {positives|negatives} straight, I just wanted to remark that this is a
very intuitive and straightforward way of explaining the guarantees of lookup
in a Bloom filter.

~~~
jakewins
As an aside, this is how null is defined in many languages as well. Many
behaviors of SQL, for instance, are much more intuitive if you remember that
SQL sees null as "unknown".

If a boolean value is null in SQL, it conceptually means it could be either
true or false - it's a maybe. If you ask "is maybe true equal to maybe true",
the answer is "maybe". Likewise, if you ask SQL "null = null", it answers
"null" instead of true/false.

~~~
443633456
...which is why, to determine if something is unknown, you ask SQL: are you
indeed NULL? And SQL will say 'true' or 'false'.

    
    
        null IS NULL
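
This three-valued behavior is easy to check from Python's standard-library
sqlite3 module (a sketch; any SQL engine behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL ("maybe"), which Python surfaces as None.
print(conn.execute("SELECT NULL = NULL").fetchone()[0])

# NULL IS NULL is a definite test and evaluates to true (1 in SQLite).
print(conn.execute("SELECT NULL IS NULL").fetchone()[0])
```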

------
saidajigumi
For those who are new to Bloom filters, the article doesn't _quite_ finish
connecting the dots to a common use case:

 _Bloom filters give no false negatives but a controllable rate of false
positives. If a Bloom filter indicates that an item has not been seen before,
we can be certain that’s the case; but if it indicates an item has been seen,
it’s possible that’s not the case (a false positive)._

This style of probabilistic data structure can be used to achieve a
significant speedup when the full test for set membership is very expensive.
For example, run the test using the Bloom filter. A negative result can be
taken as truth, while only positive results will have to be tested against the
expensive algorithm to weed out false positives. Obviously, the expectation
should be that negative results are far more common than positive results.
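
The guard pattern described above can be sketched with a toy Bloom filter
(the `ToyBloom` class and `is_member` helper are made up for this sketch,
not taken from the article):

```python
import hashlib

class ToyBloom:
    """A minimal Bloom filter: k hash functions setting bits in an m-bit int."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _indices(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits |= 1 << idx

    def might_contain(self, item):
        return all(self.bits >> idx & 1 for idx in self._indices(item))

seen = ToyBloom()
seen.add("alice")

def is_member(item):
    # A "no" from the filter is definitive; only "maybe" answers pay for
    # the authoritative check (a set lookup here, standing in for a disk
    # or network query).
    if not seen.might_contain(item):
        return False
    return item in {"alice"}
```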

~~~
eridius
As a practical example of this, consider searching a corpus of texts for a
phrase. You can build a bloom filter from each text ahead of time, using each
individual word as the elements to put in the filter. Then when searching, you
can split the search phrase into words and check if all the words occur in the
bloom filters. If any word does not occur, then you know the phrase won't
match that text. But if the bloom filter can't rule out any of the words, then
you can do the more expensive full-text search. This means you can search the
whole corpus much quicker, as you can skip any text that doesn't match the
words.
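
A sketch of this corpus-search idea (`word_filter` and `may_contain_phrase`
are hypothetical names; a real implementation would size the filters based on
expected word counts):

```python
import hashlib

M, K = 4096, 3  # filter size in bits, and number of hash functions

def _bit(word, i):
    h = hashlib.sha256(f"{i}:{word}".encode()).digest()
    return int.from_bytes(h[:8], "big") % M

def word_filter(text):
    """Build a tiny per-document Bloom filter (an M-bit int) over its words."""
    bits = 0
    for word in text.lower().split():
        for i in range(K):
            bits |= 1 << _bit(word, i)
    return bits

def may_contain_phrase(filter_bits, phrase):
    """False means the phrase definitely cannot match this document."""
    return all(filter_bits >> _bit(word, i) & 1
               for word in phrase.lower().split()
               for i in range(K))

corpus = ["it was the best of times", "call me ishmael"]
filters = [word_filter(t) for t in corpus]
candidates = [t for t, f in zip(corpus, filters)
              if may_contain_phrase(f, "best of times")]
# Only the surviving candidates get the expensive full-text search.
matches = [t for t in candidates if "best of times" in t]
```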

~~~
sk5t
I might be inclined to use a lightly massaged prefix tree (adding an "end of
word" symbol) instead of a bloom filter in this particular case, depending on
the size of dictionary and cost of a false-maybe.

------
manish_gill
The Cuckoo Hashing algorithm is quite fun. I've been reading quite a bit about
probabilistic data structures like Bloom Filters, Cuckoo Filters, HLLs etc
lately. Here's a paper where they're empirically compared to bloom filters of
various types: [https://www.cs.cmu.edu/~dga/papers/cuckoo-
conext2014.pdf](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)

> A single HyperLogLog++ structure, for example, can count up to 7.9 billion
> unique items using 2.56KB of memory with only a 1.65% error rate.

Incidentally, Redis has an implementation for HyperLogLogs that has an even
smaller error rate of 0.83%, though it uses more memory if I recall correctly.
:)

~~~
pselbert
The HyperLogLog implementation in Redis is extremely well documented and
amazingly quick. It drops down to non-probabilistic counting for smaller
sample sizes, providing 100% accuracy with comparable speeds. As a fun
exercise, which I wrote about [0], I estimated the number of unique words
throughout the major works of Charles Dickens. Side note, that man had an
amazing hairdo [1].

[0] [https://blog.codeship.com/counting-distinct-values-with-
hype...](https://blog.codeship.com/counting-distinct-values-with-hyperloglog/)
[1]
[https://upload.wikimedia.org/wikipedia/commons/a/aa/Dickens_...](https://upload.wikimedia.org/wikipedia/commons/a/aa/Dickens_Gurney_head.jpg)

------
mgunlogson
Hi there, I've wrestled with Cuckoo filters myself, so I took a look at your
lib, and there are some things that stood out.

Fingerprint size - it allows fingerprints that are too short; basically, less
than 4 bits doesn't allow a reasonable fill capacity. The paper authors only
hinted at this, but check out the fill capacity graphs on page 6. This could
be why your inserts are slowing down around the 80% level, when in my
experience it doesn't happen till around 93%.

Modulo bias - Your filter capacity code doesn't seem to round the filter size
to a power of two. This is a simple fix, but without it your array indexes
will be skewed by modulo bias, possibly badly if someone picks a nasty number.
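
A minimal sketch of the power-of-two fix (`next_power_of_two` and
`bucket_index` are hypothetical helper names):

```python
def next_power_of_two(n):
    """Round a requested capacity up to a power of two so bucket indexes
    can be taken with a mask instead of a (biased) modulo, and so the
    XOR partner-bucket trick stays within the table."""
    return 1 << max(0, n - 1).bit_length()

size = next_power_of_two(1000)  # rounds up to 1024
mask = size - 1

def bucket_index(h):
    # h is assumed to be a uniform 64-bit hash; masking keeps the
    # bucket distribution uniform for any power-of-two table size.
    return h & mask
```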

Alternate bucket positions - Your code seems to do a full hash when
calculating alternate bucket positions. I know the paper mentions this, but I
haven't seen anyone actually doing it :). It's a lot faster to just XOR with a
mixing constant. TBH that's what most libraries are doing... whether it's a
good idea is debatable :).

Fingerprint zero corner case - I'm not that great at Python, but I didn't see
any special handling for the rare case that the hash fingerprint is zero. In
most implementations zero means an empty bucket, so this could violate the "no
false negatives" aspect of the filter by making items rarely disappear when
they were supposed to be inserted. Most implementations just add one to the
fingerprint, but I prefer to re-hash with a salt.
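
A sketch of the re-hash-with-a-salt approach described here (the
`fingerprint` helper name is made up):

```python
import hashlib

def fingerprint(item, fp_bits=8):
    """Derive a fingerprint that is never zero, since zero is commonly
    used to mark an empty slot. Re-hash with an incremented salt until
    a non-zero value comes out."""
    salt = 0
    while True:
        h = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        fp = int.from_bytes(h[:8], "big") & ((1 << fp_bits) - 1)
        if fp:
            return fp
        salt += 1  # rare (1-in-2^fp_bits per try): re-hash and try again
```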

No victim cache - Didn't look too much into it, but I didn't see a victim slot
used in your code. This will cause problems when the first insert fails. The
problem is, by the time the first insert actually fails, you've relocated a
bunch of different fingerprints like 500 times. It becomes unclear which
fingerprint you originally tried to insert, and you're left holding a dangling
item from a random place in the filter that you cannot insert. This violates
the "no false negatives" mantra. Even though the filter is full it shouldn't
break by deleting a random item when the first insert fails. You either need
to store this last item or figure out a way to unwind your insert attempts to
be able to reject the item that originally failed to insert.
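
A stripped-down sketch of an insert loop with a one-entry victim slot (all
names here are hypothetical; real libraries differ in detail):

```python
import hashlib
import random

BUCKET_SIZE = 4
MAX_KICKS = 500

def _hash(x):
    """Stand-in hash; a real filter would use something much cheaper."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

class ToyCuckooFilter:
    def __init__(self, num_buckets=1 << 10):  # power of two
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.victim = None  # (bucket index, fingerprint) of a stranded item

    def _fp(self, item):
        return (_hash(("fp", item)) & 0xFF) or 1  # 8-bit, never zero

    def insert(self, item):
        if self.victim is not None:
            return False  # full: refuse new inserts rather than drop data
        fp = self._fp(item)
        i = _hash(item) % self.num_buckets
        j = (i ^ _hash(fp)) % self.num_buckets
        for idx in (i, j):
            if len(self.buckets[idx]) < BUCKET_SIZE:
                self.buckets[idx].append(fp)
                return True
        # Both buckets full: kick fingerprints around. The fingerprint we
        # hold after MAX_KICKS is *not* the one we started with, so it is
        # parked in the victim slot instead of being thrown away.
        idx = random.choice((i, j))
        for _ in range(MAX_KICKS):
            slot = random.randrange(BUCKET_SIZE)
            fp, self.buckets[idx][slot] = self.buckets[idx][slot], fp
            idx = (idx ^ _hash(fp)) % self.num_buckets
            if len(self.buckets[idx]) < BUCKET_SIZE:
                self.buckets[idx].append(fp)
                return True
        self.victim = (idx, fp)
        return True  # the original item did land in the table

    def contains(self, item):
        fp = self._fp(item)
        i = _hash(item) % self.num_buckets
        j = (i ^ _hash(fp)) % self.num_buckets
        if self.victim in ((i, fp), (j, fp)):
            return True
        return fp in self.buckets[i] or fp in self.buckets[j]

f = ToyCuckooFilter()
for word in ["apple", "banana", "cherry"]:
    f.insert(word)
```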

Check out my Java library if you want to see how I dealt with these things.
Also I have a bunch of unit tests there that I either came up with or borrowed
from other Cuckoo libs. Should be pretty easy to convert some of those to
python :) .

[https://github.com/MGunlogson/CuckooFilter4J](https://github.com/MGunlogson/CuckooFilter4J)

Cheers, Mark

~~~
adebayoj
Hi Mark,

One of the authors here. Great points especially on the "No victim cache" aka
insertion failure. This was mostly a test implementation for us to see how
cuckoo filters work, but agree that it is critical to have these issues fixed.

I'll check out your library for improvements.

thanks!

~~~
mgunlogson
No problem. The paper authors left a lot of the specifics off... I got burned
by the victim cache too. I saw one in their reference code and thought it was
pointless until weeks later when I got around to writing tests and had one of
those aha moments.

------
ScottBurson
> With the Cuckoo filter, we notice an insertion throughput increase of up to
> 85 percent as it fills up to 80 percent capacity

That's not a throughput _increase_ , that's a throughput _decrease_.
Throughput is operations per unit time, not time per operation.

~~~
mgunlogson
Hmmm.... this might be an artifact of the python implementation?

The insert speed should be roughly constant until the filter begins moving
items into their alternate buckets (when one of the two buckets for an item
is full). In my implementation this doesn't usually happen until well over 90%
capacity.

Insert speed will slow slightly as the filter fills up if the implementation
uses a shortcut to determine the first empty bucket position by just checking
if it's zero. Conceptually, if a bucket is empty you only need to check the
first position before the insert. When the filter is half full you need to
check ~50% of bucket positions before an insert. In practice, bucket
operations are so fast, and buckets only hold 4 items each, that I haven't
noticed a difference.

~~~
ScottBurson
Yes, I understand why it slows down (though the additional details you've
provided are helpful). My point was that slowing down is correctly described
as a decrease, not an increase, in throughput.

Since you've implemented one, maybe you can answer a couple more questions I
have:

> Typically, a Cuckoo filter is identified by its fingerprint and bucket size.
> For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each
> bucket in the Cuckoo hash table can store up to 4 fingerprints.

Wouldn't that fingerprint length normally be two _bytes_ rather than two
_bits_? The Python sample code given appears to use bytes.

Also, when you start doing deletion, isn't there a nonzero probability of
causing false negatives because of fingerprint collisions? I guess a 2 byte
fingerprint would make that probability acceptably small, but it doesn't seem
to be zero. Correct?

~~~
mgunlogson
The code should use bits; the point of the filter is to increase space
efficiency, so you need to pack everything together as closely as possible.
Allowing fingerprint sizes of only multiples of 8 bits doesn't make sense.

Also, from the quote, it's probably more accurate to say a Cuckoo filter is
identified by its fingerprint size, bucket size, and capacity. In most
implementations the bucket size is fixed at 4, so you can define a filter with
only fingerprint size and capacity.

And yes, deleting can cause false negatives but only in a roundabout way.
Deleting will only cause false negatives if you delete something that wasn't
added previously. Because the filter has false positives, sometimes it will
tell you it's seen an item that it hasn't. If you delete one of these false
positives it will cause a false negative. If you only delete items that have
been added before, and in the case of duplicate items, added the same number
of times, you won't get any false negatives.
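
That failure mode can be demonstrated with a toy filter: find a
never-inserted item that collides with an inserted one in both fingerprint
and bucket, delete it, and the real item disappears (a contrived sketch; all
names are hypothetical):

```python
import hashlib

NUM_BUCKETS = 16  # deliberately tiny to make collisions easy to find
FP_BITS = 8

def _h(x):
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

def slots(item):
    """(fingerprint, bucket i, alternate bucket j) for an item."""
    fp = (_h(("fp", item)) & ((1 << FP_BITS) - 1)) or 1
    i = _h(item) % NUM_BUCKETS
    j = (i ^ _h(fp)) % NUM_BUCKETS
    return fp, i, j

buckets = {k: [] for k in range(NUM_BUCKETS)}

def insert(item):
    fp, i, _ = slots(item)
    buckets[i].append(fp)  # ignore bucket capacity for this demo

def contains(item):
    fp, i, j = slots(item)
    return fp in buckets[i] or fp in buckets[j]

def delete(item):
    fp, i, j = slots(item)
    for idx in (i, j):
        if fp in buckets[idx]:
            buckets[idx].remove(fp)
            return True
    return False

# Brute-force a never-inserted item that collides with x in both
# fingerprint and bucket.
x = "item-0"
fx, ix, jx = slots(x)
y = next(f"item-{n}" for n in range(1, 1_000_000)
         if slots(f"item-{n}")[0] == fx and slots(f"item-{n}")[1] in (ix, jx))

insert(x)
assert contains(y)      # false positive: y was never inserted
delete(y)               # this actually removes x's fingerprint
assert not contains(x)  # false negative: x has vanished
```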

Essentially if you saved a log of all the inserts then replayed it as
deletions you would have a perfectly empty filter every time with no false
negatives. I actually have a unit test that does this in my java Cuckoo filter
:), it exorcised quite a few demons from the code.

If you're curious the unit test is sanityFillDeleteAllAndCheckABunchOfStuff()
in
[https://github.com/MGunlogson/CuckooFilter4J/blob/master/src...](https://github.com/MGunlogson/CuckooFilter4J/blob/master/src/test/java/com/github/mgunlogson/cuckoofilter4j/TestCuckooFilter.java)

------
taeric
I maintain that if you want to think of how a bloom filter works, think of it
as the front desk staff at an office building. You can ask a series of
questions such as "Has anyone wearing a hat come in?" "Any exceptionally tall
people?" etc... It won't help answer if Peter from accounting is there, but
can help give a good indication if you know enough about him today.

Obviously, if this is not a correct way to think of it, I'm open for more
correct ways.

~~~
Retric
Not really; it's more like a guest list with only the first two letters of
guests' last names.

If someone shows up, you might learn they are not a guest, but having the
'correct' "La" letters does not mean they are a guest.

The advantage is a doorman can have a tiny list to direct people inside, the
downside is it's poor validation and does not scale very far.

PS: You can obviously add more letters, but the same problem happens if two
people have the same names.
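
The analogy in miniature (a deliberately silly sketch):

```python
# The doorman's list: only the first two letters of each guest's last name.
guests = ["Lambert", "Larson", "Nguyen"]
doorman_list = {name[:2] for name in guests}

def maybe_a_guest(last_name):
    # A miss is definitive; a hit only means "maybe": "Lawrence" shares
    # "La" with real guests without being one (a false positive).
    return last_name[:2] in doorman_list
```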

~~~
taeric
This is just one of the hashes, though. If you consider a function from "color
of shirt someone is wearing" to "are they here", then "color of shirt" is now
part of the filter. Same for "is wearing a hat", "is exceptionally tall", etc.

So, if you know that Peter is wearing a green shirt today, then you can rule
out his being there if the guard says nobody wearing a green shirt has
arrived.

Is this a rigorous model? No. It is just an easy mental model for me to
conceptualize this.

~~~
Retric
Except the hash format needs to be _decided ahead of time._ If you ask a guard
if they had seen someone with a peg leg they may remember this even without
any other instruction.

If you really want to use a range of questions, then police codes are a good
proxy. The guard may say we have people in the drunk tank, but not anything
else.

~~~
taeric
Yes, this is why I typically stick with easy and likely to be considered
questions.

Again, not a rigorous explanation. Just an easy one. For the advanced algo, I
typically word it as: imagine the front desk has a notesheet where they tick
next to "person carrying a backpack" and/or "female" and/or "male", and then I
can run my candidate through the questions related to them and see if they all
have non-zero counts.

But, again, just a mental way of thinking about this in analogy. If you aren't
the kind that needs/wants analogies, this is worthless. On that I fully agree.

------
blauditore
There are two things I don't fully understand:

1\. If an item in the Cuckoo filter needs to be moved, how does one know its
other hash/location if only a fingerprint of the original item is stored (i.e.
it can't be hashed again)?

2\. [From the linked pdf] "Insertions are amortized O(1) with reasonably high
probability." In case of a rehash, every item needs to be hashed and inserted
again. This seems very expensive and impractical for data at Twitter scale,
even if it only happens rarely. Or are there workarounds to mitigate this?

~~~
thenewwazoo
To 1, they answer this in the paper; one knows it because

 _The xor operation in Eq. (1) ensures an important property: h1(x) can also
be calculated from h2(x) and the fingerprint using the same formula. In other
words, to displace a key originally in bucket i (no matter if i is h1(x) or
h2(x)), we directly calculate its alternate bucket j from the current bucket
index i and the fingerprint stored in this bucket by_

    
    
        j = i ⊕ hash(fingerprint).
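
The involution is easy to check (a sketch; the hash and sizes here are
arbitrary choices, not the paper's):

```python
import hashlib

NUM_BUCKETS = 1 << 16  # power of two, so XOR stays in range under the mask

def h(x):
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

def alternate(i, fp):
    """Partner bucket: j = i XOR hash(fp), masked to the table size."""
    return (i ^ h(fp)) & (NUM_BUCKETS - 1)

item = "example"
fp = h(("fp", item)) & 0xFF
i = h(item) & (NUM_BUCKETS - 1)
j = alternate(i, fp)

# Applying the same formula from bucket j leads straight back to bucket i,
# so a stored fingerprint alone is enough to find an entry's other home.
assert alternate(j, fp) == i
```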

~~~
blauditore
Thanks. That's quite nice, but also reduces available hashing functions to
pairs with that property. But I guess that's not really a problem.

~~~
dgacmu
If I'm understanding your comment right, you mean that the underlying cuckoo
hash table can only use 2 buckets per item, right? We actually generalized
this, but only in Bin's thesis:

[https://www.cs.cmu.edu/~binfan/thesis/thesis_ss.pdf](https://www.cs.cmu.edu/~binfan/thesis/thesis_ss.pdf)

See section 4.5. It was a bit more computationally expensive, so we didn't
push forward with it in our implementation, as "2,4" cuckoo hash tables (2
hash functions to identify buckets, 4 associative elements) performed so well
already.

------
grogers
Does anybody have an example of an application where you need the
counting/deletion properties of a counting bloom filter or cuckoo filter over
a normal bloom filter? If you do, how do you deal with the filter rejecting
your writes because a bucket is full?

It seems like for most applications, silently degrading (instead of rejecting
the insertion) when the bloom filter is above capacity is a super useful
property.

~~~
throwbsidbdk
The failure mode doesn't matter much in practice. You need to track how many
inserts you've done on both for practical reasons so with either you'll have a
set cutoff before rebuild. Cuckoo filters are more likely to be used in place
of counting bloom filters than vanilla ones.

You could always avoid duplicate inserts in cuckoo by checking contains before
calling insert again. A modified insert-only-once routine would have only a
small performance penalty. You can't use counting or deletion while doing
this, though, so it's a trade-off. The same trade-off happens with counting
bloom filters, but they are much less space efficient.

Practically, the use case for cuckoo filters over bloom probably lies in
bidirectional communication. Partner nodes can keep each other's state updated
without needing to exchange a ton of data. Think distributed caches. So two
data nodes exchange cuckoo filters of each other's data initially. As things
fall out of their caches they can tell each other to delete those items from
each other's filters by sending only the bucket index and fingerprint -
probably much smaller than the piece of data it represented originally.
Since each data node independently knows what was in its cache there's no
risk of false deletions. You can't really use bloom filters for this because
you can't delete.

------
pklausler
If a writer doesn't know the difference between 'unique' and 'distinct',
suspect their mathematics.

~~~
neeleshs
What's the difference?

~~~
justinpombrio
A set of things are distinct if they are all different. A thing is unique if
it is different from _every_ other thing (in some implicit universe).

For example, the names of the three people in my office (Jack, Preston,
Justin) are distinct, but my name (Justin) is not unique: many other people
share the same name.

~~~
WaxProlix
Isn't that (distinctness) implicit in the term 'set'? If you had duplicates
it'd be some other sort of collection, no?

~~~
justinpombrio
Err, yes. I should have talked about a set of _people_ with distinct _names_.

------
stonewhite
The election memes apparently really got to me - for a moment I was confused
to see the headline as 'Cuck filter' on the HN front page.

