> Since it’s possible to hash multiple items to the same
> indices, a membership test returns either false or maybe.
True Positive: Test was correctly passed, hit.
False Positive: Test was incorrectly passed, false alarm.
True Negative: Test was correctly failed, reject.
False Negative: Test was incorrectly failed, miss.
I find the terms too technical, so I remember a common situation: medical tests are often False Positives, which means the test was positive, but the test was wrong and you are okay (good!). From there it's pretty easy to infer the rest.
In fact, that situation leads to a large issue with medical testing as a whole. If you were tested for 100 terrible diseases that you did not have, you would probably get a bunch of False Positives. Medical tests are set up for a high False Positive rate but a low False Negative rate, because it's less damaging to confirm a diagnosis with follow-up testing than to miss one. But confirming False Positive tests is expensive, along with the initial testing. So doctors try to limit tests to at-risk groups to spend less time on healthy patients' False Positive tests.
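The pile-up of false positives across many tests is easy to make concrete with a quick calculation. The 5% per-test false-positive rate below is a made-up illustrative number, just for the arithmetic:

```python
# Probability of at least one false positive across many independent tests,
# assuming a hypothetical 5% false-positive rate per test (illustrative only).
fpr = 0.05    # assumed per-test false-positive rate
tests = 100   # number of diseases screened for

p_at_least_one = 1 - (1 - fpr) ** tests
expected_false_positives = fpr * tests

print(f"P(>=1 false positive) = {p_at_least_one:.4f}")               # ~0.9941
print(f"Expected false positives = {expected_false_positives:.1f}")  # 5.0
```

So even with a modest per-test error rate, a healthy patient screened for 100 diseases is all but guaranteed at least one false alarm.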
If a boolean value is null in SQL, it conceptually means it could be either true or false - it's a maybe. If you ask "is maybe true equal to maybe true?", the answer is "maybe". Likewise, if you ask SQL "null = null", it answers "null" instead of true/false.
SQL does have an operator for exactly this: "null IS null" evaluates to true.
That's hard to explain when you only work with binary booleans.
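You can see this three-valued behavior with a throwaway in-memory database, using Python's built-in sqlite3 module:

```python
import sqlite3

# NULL = NULL evaluates to NULL (surfaced as None), not true/false,
# while NULL IS NULL is a definite true (1 in SQLite).
conn = sqlite3.connect(":memory:")
eq_result = conn.execute("SELECT NULL = NULL").fetchone()[0]
is_result = conn.execute("SELECT NULL IS NULL").fetchone()[0]
print(eq_result)  # None -> "maybe"
print(is_result)  # 1    -> true
```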
Bloom filters give no false negatives but a controllable rate of false positives. If a Bloom filter indicates that an item has not been seen before, we can be certain that’s the case; but if it indicates an item has been seen, it’s possible that’s not the case (a false positive).
This style of probabilistic data structure can be used to achieve a significant speedup when the full test for set membership is very expensive. For example, run the test using the Bloom filter. A negative result can be taken as truth, while only positive results will have to be tested against the expensive algorithm to weed out false positives. Obviously, the expectation should be that negative results are far more common than positive results.
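That workflow is easy to sketch. Here is a toy Bloom filter (not a tuned implementation; the bit-array size, hash count, and hashing scheme are arbitrary choices for illustration) used as the cheap pre-check:

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over a bit array."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = True

    def might_contain(self, item):
        # False means definitely absent; True means "maybe present".
        return all(self.bits[i] for i in self._indices(item))

bloom = ToyBloomFilter()
for word in ["apple", "banana", "cherry"]:
    bloom.add(word)

# No false negatives: everything added always tests positive.
assert all(bloom.might_contain(w) for w in ["apple", "banana", "cherry"])
# A negative here can be trusted; only positives need the expensive check.
print(bloom.might_contain("durian"))
```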
> A single HyperLogLog++ structure, for example, can count up to 7.9 billion unique items using 2.56KB of memory with only a 1.65% error rate.
Incidentally, Redis has an implementation for HyperLogLogs that has an even smaller error rate of 0.83%, though it uses more memory if I recall correctly. :)
HLL is used to determine how many unique items you have, the cardinality of a set.
Bloom/Cuckoo is used to determine set membership, as in "have I seen this item before".
Not sure how much that helps but the distinction was lost on me when I started reading about them.
As an example, a search engine might want to have several HyperLogLog counts for every URL on the Internet. In that case, a 64-bit count may be a good choice over 12 kilobytes.
Fingerprint size - It allows fingerprints that are too short; less than 4 bits doesn't allow a reasonable fill capacity. The paper's authors only hinted at this, but check out the fill-capacity graphs on page 6. This could be why your inserts are slowing down around the 80% level, when in my experience it doesn't happen until around 93%.
Modulo bias - Your filter capacity code doesn't seem to round the filter size to a power of two. This is a simple fix, but without it your array indexes will be skewed by modulo bias, possibly badly if someone picks a nasty number.
Alternate bucket positions - Your code seems to do a full hash for calculating alternate bucket positions. I know the paper mentions this, but I haven't seen anyone actually doing it :). It's a lot faster to just XOR with a mixing constant. TBH that's what most libraries are doing... whether it's a good idea is debatable :).
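A sketch of that XOR shortcut (the constant below is the Murmur2 multiplier mentioned elsewhere in the thread; treat the exact value as an illustration, not a recommendation). Note the table size is a power of two so masking works, which also ties into the modulo-bias point:

```python
# Alternate bucket via XOR with a mixed fingerprint, no re-hash of the item.
MIX = 0x5BD1E995        # Murmur2 multiplier, used here as a mixing constant
NUM_BUCKETS = 1 << 16   # must be a power of two for the mask to be unbiased
MASK = NUM_BUCKETS - 1

def alt_bucket(bucket, fingerprint):
    return (bucket ^ (fingerprint * MIX)) & MASK

# Applying it twice returns the original bucket, so from either copy of the
# (bucket, fingerprint) pair you can find its partner without the original key.
b, fp = 12345, 0x7C3
assert alt_bucket(alt_bucket(b, fp), fp) == b
```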
Fingerprint zero corner case - I'm not that great at Python, but I didn't see any special handling for the rare case that the hash fingerprint is zero. In most implementations zero means empty bucket, so this could violate the "no false negatives" aspect of the filter by making items rarely disappear when they were supposed to be inserted. Most implementations just add one to it, but I prefer to re-hash with a salt.
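A sketch of the salted re-hash approach (the hash choice, fingerprint width, and salt scheme here are arbitrary illustrations):

```python
import hashlib

def fingerprint(item, bits=12):
    """Derive a non-zero fingerprint; zero is reserved to mean 'empty slot'."""
    salt = 0
    while True:
        digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        fp = int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)
        if fp != 0:
            return fp
        salt += 1  # rare case: fingerprint collided with the 'empty' marker

assert fingerprint("hello") != 0
```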
No victim cache - Didn't look too much into it, but I didn't see a victim slot used in your code. This will cause problems when the first insert fails. The problem is, by the time the first insert actually fails, you've relocated a bunch of different fingerprints like 500 times. It becomes unclear which fingerprint you originally tried to insert, and you're left holding a dangling item from a random place in the filter that you cannot insert. This violates the "no false negatives" mantra. Even though the filter is full it shouldn't break by deleting a random item when the first insert fails. You either need to store this last item or figure out a way to unwind your insert attempts to be able to reject the item that originally failed to insert.
Check out my Java library if you want to see how I dealt with these things. Also I have a bunch of unit tests there that I either came up with or borrowed from other Cuckoo libs. Should be pretty easy to convert some of those to python :) .
One of the authors here. Great points especially on the "No victim cache" aka insertion failure. This was mostly a test implementation for us to see how cuckoo filters work, but agree that it is critical to have these issues fixed.
I'll check out your library for improvements.
That's not a throughput increase, that's a throughput decrease. Throughput is operations per unit time, not time per operation.
The insert speed should be roughly constant until the filter begins moving items into their alternate buckets (when one of the two buckets for that item is full). In my implementation this doesn't usually happen until well over 90% capacity.
Insert speed will slow slightly as the filter fills up if the implementation uses a shortcut to determine the first empty bucket position by just checking if it's zero. Conceptually, if a bucket is empty you only need to check the first position before the insert. When the filter is half full you need to check ~50% of bucket positions before an insert. In practice the bucket operations are so fast, and buckets only hold 4 items each, that I haven't noticed a difference.
Since you've implemented one, maybe you can answer a couple more questions I have:
> Typically, a Cuckoo filter is identified by its fingerprint and bucket size. For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each bucket in the Cuckoo hash table can store up to 4 fingerprints.
Wouldn't that fingerprint length normally be two bytes rather than two bits? The Python sample code given appears to use bytes.
Also, when you start doing deletion, isn't there a nonzero probability of causing false negatives because of fingerprint collisions? I guess a 2 byte fingerprint would make that probability acceptably small, but it doesn't seem to be zero. Correct?
Also, from the quote, it's probably more accurate to say a Cuckoo filter is identified by its fingerprint size, bucket size, and capacity. In most implementations the bucket size is fixed at 4, so you can define a filter with only fingerprint size and capacity.
And yes, deleting can cause false negatives but only in a roundabout way. Deleting will only cause false negatives if you delete something that wasn't added previously. Because the filter has false positives, sometimes it will tell you it's seen an item that it hasn't. If you delete one of these false positives it will cause a false negative. If you only delete items that have been added before, and in the case of duplicate items, added the same number of times, you won't get any false negatives.
Essentially if you saved a log of all the inserts then replayed it as deletions you would have a perfectly empty filter every time with no false negatives. I actually have a unit test that does this in my java Cuckoo filter :), it exorcised quite a few demons from the code.
If you're curious the unit test is sanityFillDeleteAllAndCheckABunchOfStuff() in
Obviously, if this is not a correct way to think of it, I'm open for more correct ways.
If someone shows up, you might learn they are not a guest, but having the 'correct' "La" letters does not mean they are a guest.
The advantage is a doorman can have a tiny list to direct people inside, the downside is it's poor validation and does not scale very far.
PS: You can obviously add more letters, but the same problem happens if two people have the same names.
So, if you know that Peter is wearing a green shirt today, then you can rule out his being there if the guard says nobody wearing a green shirt has arrived.
Is this a rigorous model? No. It is just an easy mental model for me to conceptualize this.
If you really want to use a range of questions, then police codes are a good proxy. The guard may say we have people in the drunk tank, but not anything else.
Again, not a rigorous explanation. Just an easy one. For the advanced algo, I typically word it as imagine if the front desk had a notesheet that they would tick next to "person carrying a backpack" and/or "female" and/or "male" and then I could run my candidate through the questions related to them and see if they all have non-zero quantities of people.
But, again, just a mental way of thinking about this in analogy. If you aren't the kind that needs/wants analogies, this is worthless. On that I fully agree.
Not sure a bloom filter needs an analogy but there you go.
Now, if you are saying the guard internalizes this with "has someone with that hair color entered? What about someone of that height? With that facial hair?" etc., then yes, this is the same.
Analogies can almost always help, for the class of people that are helped by analogies. :)
1. If an item in the Cuckoo filter needs to be moved, how does one know its other hash/location if only a fingerprint of the original item is stored (i.e. it can't be hashed again)?
2. [From the linked pdf] "Insertions are amortized O(1) with reasonably high probability." In case of a rehash, every item needs to be hashed and inserted again. This seems very expensive and impractical for data at Twitter scale, even if it happens seldom. Or are there any workarounds to mitigate this?
The xor operation in Eq. (1) ensures an important property: h1(x) can also be calculated from h2(x) and the fingerprint using the same formula. In other words, to displace a key originally in bucket i (no matter if i is h1(x) or h2(x)), we directly calculate its alternate bucket j from the current bucket index i and the fingerprint stored in this bucket by
j = i ⊕ hash(fingerprint).
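A quick numeric check of that symmetry (the hash function and table size below are stand-ins for illustration, not the paper's exact choices):

```python
import hashlib

NUM_BUCKETS = 1 << 12
MASK = NUM_BUCKETS - 1

def h(data):
    # Stand-in hash: first 8 bytes of SHA-256 as an integer.
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def buckets(x, fp):
    i1 = h(x.encode()) & MASK               # h1(x)
    i2 = (i1 ^ h(str(fp).encode())) & MASK  # h2(x) = h1(x) xor hash(fp)
    return i1, i2

fp = 0x3A7
i1, i2 = buckets("some key", fp)
# From either bucket, XOR with hash(fingerprint) recovers the other,
# so a displaced fingerprint never needs the original key re-hashed.
assert (i2 ^ h(str(fp).encode())) & MASK == i1
assert (i1 ^ h(str(fp).encode())) & MASK == i2
```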
See section 4.5. It was a bit more computationally expensive, so we didn't push forward with it in our implementation, as "2,4" cuckoo hash tables (2 hash functions to identify buckets, 4 associative elements) performed so well already.
Mathematically this is done by XOR'ing with a magic mixing constant borrowed from Murmur2 :) (or in my lib from Murmur3). The quality of the randomness of alternate bucket positions is... probably bad, but in practice it seems to work fine.
Because of this, finding alternate positions doesn't require a rehash per se, just a few CPU instructions.
However, there IS a corner case that requires re-hash that comes up more frequently the shorter the fingerprints used. Conceptually a tag/fingerprint of zero means empty bucket, and occasionally the fingerprint hash for an item will end up being zero. In this case I've seen some libraries just add 1, which I didn't really like since it increases hash collisions slightly. Or you can do it like my lib does and just re-hash the item with a salt.
3-bit fingerprints + sign bit, "v" marks the signs:

    v   |v   |v   |v
    1000|1101|0000|0000|0
Using bit flags:
Using zero as empty, 4 bits could represent numbers 0-15, so 15 values plus empty/zero. With 5 bits you can represent 0-31, so 31 values plus empty/zero. If you're always spending one of your bits on the flag, it reduces the number of distinct fingerprints by almost 50%.
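The count of usable fingerprints under each scheme is a one-liner to check:

```python
# Distinct usable fingerprints per slot encoding, for n-bit slots.
def zero_as_empty(bits):
    return 2 ** bits - 1    # every bit pattern except all-zeros

def flag_bit(bits):
    return 2 ** (bits - 1)  # one bit is spent on the empty/full flag

for n in (4, 5, 8, 12):
    print(n, zero_as_empty(n), flag_bit(n))
# e.g. 5 bits: 31 usable fingerprints vs 16 -- close to a 50% loss.
```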
How the filter is built in memory:
At least in Java, the best way to build the backing data structure is a giant array of longs. Really what you want is just a giant block of memory you can address by bit offset. Unfortunately, modern CPUs can only index at byte granularity. Because of this, buckets and fingerprints can overlap byte boundaries. To make this happen as little as possible, it's best to go the other way and index into the largest primitive structure the CPU supports. In Java, this means simulating bit-addressable memory with a long array.
My Java filter uses a Bitset implementation borrowed from Apache Lucene, because the built-in Java Bitset only allows a 32-bit index (same with Java array indices). You would think this limits you to a 2GB filter, but it's actually only around 270MB, since the 32-bit index addresses bits, not bytes. Anyway, the Lucene Bitset uses a long array to simulate bit-addressable memory.
Using regular arrays of booleans doesn't work in higher-level languages, unfortunately. They tend to use CPU-supported primitives underneath, so a boolean can take up a whole byte. Every Cuckoo filter library I know of, besides the reference implementation and mine, fails to handle this and is extremely space-inefficient :(.
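The same bit-packing idea can be sketched in Python with a bytearray (Python has no long-array tricks like Java, but the bit-offset arithmetic is identical; the slot count and 12-bit width below are arbitrary):

```python
class BitPackedSlots:
    """Fixed-width fingerprints packed at arbitrary bit offsets in a bytearray."""

    def __init__(self, num_slots, bits_per_slot):
        self.bits_per_slot = bits_per_slot
        total_bits = num_slots * bits_per_slot
        self.data = bytearray((total_bits + 7) // 8)

    def _span(self, slot):
        start = slot * self.bits_per_slot
        return start // 8, start % 8  # (first byte, bit offset within it)

    def set(self, slot, value):
        byte_i, bit_off = self._span(slot)
        width = self.bits_per_slot
        # Read enough bytes to cover the (possibly byte-straddling) slot.
        span = (bit_off + width + 7) // 8
        chunk = int.from_bytes(self.data[byte_i:byte_i + span], "big")
        shift = span * 8 - bit_off - width
        mask = ((1 << width) - 1) << shift
        chunk = (chunk & ~mask) | ((value << shift) & mask)
        self.data[byte_i:byte_i + span] = chunk.to_bytes(span, "big")

    def get(self, slot):
        byte_i, bit_off = self._span(slot)
        width = self.bits_per_slot
        span = (bit_off + width + 7) // 8
        chunk = int.from_bytes(self.data[byte_i:byte_i + span], "big")
        shift = span * 8 - bit_off - width
        return (chunk >> shift) & ((1 << width) - 1)

slots = BitPackedSlots(num_slots=100, bits_per_slot=12)  # 12-bit fingerprints
slots.set(0, 0xABC)
slots.set(1, 0x123)  # this slot straddles a byte boundary
assert slots.get(0) == 0xABC
assert slots.get(1) == 0x123
```

With 12-bit slots, a naive one-byte-per-bit (or per-boolean) layout would use several times this memory for the same data.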
It seems like for most applications, silently degrading (instead of rejecting the insertion) when the bloom filter is above capacity is a super useful property.
You could always avoid duplicate inserts in Cuckoo by checking contains before calling insert again. A modified insert-only-once routine would have only a small performance penalty. You can't use counting or deletion while doing this, though, so it's a trade-off. The same trade-off happens with counting Bloom filters, but they are much less space-efficient.
Practically, the use case for Cuckoo filters over Bloom probably lies in bidirectional communication. Partner nodes can keep each other's state updated without needing to exchange a ton of data. Think distributed caches. Two data nodes exchange Cuckoo filters of each other's data initially. As things fall out of their caches, they can tell each other to delete those items from each other's filters by sending only the bucket index and fingerprint, probably much smaller than the piece of data that represented originally. Since each data node independently knows what was in its cache, there's no risk of false deletions. You can't really use Bloom filters for this, because you can't delete items.
It seems the article's use of "unique" does not matter much in context. The level of mathematical explanation seems like overkill.
For example, the names of the three people in my office (Jack, Preston, Justin) are distinct, but my name (Justin) is not unique: many other people share the same name.