Hacker News new | comments | show | ask | jobs | submit login
A General-Purpose Counting Filter: Making Every Bit Count [pdf] (stonybrook.edu)
11 points by gbrown_ 74 days ago | hide | past | web | 1 comment | favorite

This is misleading for some aspects of the cuckoo filter.

> The cuckoo filter can also support a small number of duplicates of some items. In the authors’ reference implementation, each slot can actually hold 4 values, so the system can support up to 8 duplicates, although its not clear how this will impact the probability of failure during inserts of other items. One could attempt to add counting to the cuckoo filter by associating a counter with each fingerprint, but this would increase the space usage.

Cuckoo filters support buckets of arbitrary size. (They can also support an arbitrary number of candidate buckets per item, but that requires a different scheme for the partial-key cuckoo hashing. It also reduces how much compression is possible, as described in the note below.)

The upside of larger buckets is better load factors; the downside, in theory, is the need for more bits per item and longer query times. Cuckoo filters are amenable to compression via integer coding within buckets, though, so in practice reasonably large buckets are free.

> The space savings from these optimizations make the RSQF more space efficient than the Bloom filter for false-positive rates less than 1/64 and more space efficient than the cuckoo filter for all false-positive rates. In contrast, the original quotient filter is less space efficient than the cuckoo filter for all false-positive rates and the Bloom filter for false-positive rates larger than to 2^-36.

The reference semi-sorted cuckoo filter is more space-efficient than the RSQF at +2 bits per item, vs. +2.125 for the RSQF. The same can be true of a cuckoo filter with compressed buckets, e.g. Golomb coding gets you down to ~+2.54[1] regardless of bucket size, while also allowing the load factor to approach 1 thanks to larger buckets. For low false positive rates, that makes the cuckoo filter quite a bit smaller.

1. Assuming two candidate buckets per item. Specifically it's +(1 - Bth_root(1 - ε))^(-1) bits per item, where ε is the desired false positive rate and B is ε * the number of candidate buckets.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact