I did straight up HLL++ a while ago but had to do intersection with the inclusion–exclusion principle. It would be cool to try to fold this in somehow!
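For reference, inclusion-exclusion on HLLs only needs union (register-wise max) and cardinality estimation: |A ∩ B| ≈ |A| + |B| − |A ∪ B|. A minimal, illustrative sketch with a toy HLL (basic estimator plus linear counting only; class and function names are mine, not from any particular library):

```python
import hashlib
import math

class TinyHLL:
    """Toy HyperLogLog: standard estimator + small-range linear counting.
    Illustrative only; real libraries add more corrections (e.g. HLL++, Ertl)."""
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        # 64-bit deterministic hash (Python's hash() is salted per process)
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def union(self, other):
        out = TinyHLL(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if est <= 2.5 * self.m and zeros:      # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

def intersection_estimate(a, b):
    """Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|."""
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()

a, b = TinyHLL(), TinyHLL()
for i in range(10_000):
    a.add(f"user{i}")
for i in range(5_000, 15_000):
    b.add(f"user{i}")
print(intersection_estimate(a, b))  # roughly 5,000 (the true overlap)
```

Note the downside the parent comments run into: the three estimation errors add up, so the relative error on a small intersection of two large sets can be huge.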
I haven't messed around with the count-min augmentation for HLLs, but I've had a lot of success with plain HLLs in practice.
I've experimented with using HLL sketches for set operations in genome comparisons. The great thing is that the cost of these operations scales with the size of the sketch, not the size of the dataset being sketched. Add SIMD acceleration, and you have an extremely fast, compact, accurate way to perform approximate set operations.
Unfortunately, the naive intersection operation (a sketch consisting of the element-wise min of the registers in both sketches) does not perform well for cardinality estimation.
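For concreteness, the naive operation is just a register-wise min. One way to see the bias: each register only stores the max rank seen, so min(a, b) can be large even when no *shared* element set either register, and small intersections get overestimated. A sketch, assuming registers are plain integer arrays:

```python
def naive_intersection(regs_a, regs_b):
    """Register-wise min of two same-precision HLL register arrays.

    Tempting but biased: a large min can come from two unrelated elements
    (one per set) that each left a long run of zeros in the same register,
    so estimates from this sketch skew high for small intersections.
    """
    if len(regs_a) != len(regs_b):
        raise ValueError("sketches must have the same precision")
    return [min(x, y) for x, y in zip(regs_a, regs_b)]
```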
The recent Ertl paper  and associated code  put a lot of effort into more accurate estimation and set operations. In my experiments, his modified estimation formula is always more accurate than the standard method and stays accurate to much higher cardinalities for a given sketch size.
My SIMD-accelerated, threadsafe HyperLogLog implementation in C++ (with python bindings) using this improved estimation method is available at . Using this structure, I was able to perform over a billion full-genome Jaccard index calculations overnight using `bonsai dist` from .
As an aside, we also have a prototype Python implementation , but your implementation looks more efficient than ours, and I may start pointing people to your GitHub repo too.
What sort of loss do you see as you merge more HyperLogLogs? As in, if you had Set 1, 2, 3, 4, 5, …, n
HLLs can be merged together with the same result as if you had run the whole dataset through a single HLL.
The merge can't do better than that, but it can do that even if the HLL sizes differ a bit (e.g., HIVE-18079 is fixing p=14 bits to be p=10 accuracy by default, while retaining old HLLs as-is).
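For same-precision sketches, the merge is just a register-wise max, which is why the merged sketch is identical to one built from the concatenated streams no matter how the data was split (merging across precisions needs a fold-down step, which I'm not sketching here). Assuming registers are plain integer arrays:

```python
def merge(regs_a, regs_b):
    """Lossless union of two same-precision HLL register arrays.

    Each register stores the max rank seen; max is associative and
    commutative, so merge order and data partitioning don't matter.
    """
    if len(regs_a) != len(regs_b):
        raise ValueError("fold the higher-precision sketch down first")
    return [max(x, y) for x, y in zip(regs_a, regs_b)]
```

So the answer to the loss question is: merging Set 1 through Set n costs nothing beyond the baseline error of a single HLL at that precision.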
Generally in the big data space, libraries like this usually come out for Java first.
In practice, combining HLL and MinHash as they do for AdRoll gives the same kinds of estimation errors. However, a raw MinHash data structure requires O(log n) bits per bucket. If you are not space-constrained, I second nemothekid and suggest just using off-the-shelf HLL intersection + MinHash libraries, such as the ones he links to.
The HyperMinHash algorithm, which the OP implemented in Golang, combines the MinHash and HLL data structures so that each bucket needs only O(log log n) bits, where n is the size of the union. In practice, the difference is between buckets of log n + log log n bits for HLL-MinHash versus log log n + 10 bits for HyperMinHash.
Given the constants involved, for n = 2^32, this is only a space savings of about 60% (37 bits vs 15 bits). However, this increases as n grows, so for n=2^64, it's 77% (70 bits vs 16 bits).
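The bucket-size arithmetic above can be sanity-checked with a couple of throwaway helpers (the formulas come straight from the previous paragraph; the function names are mine):

```python
import math

def hll_minhash_bucket_bits(log2_n):
    # log n bits for the MinHash value + log log n bits for the HLL register
    return log2_n + math.ceil(math.log2(log2_n))

def hyperminhash_bucket_bits(log2_n, r=10):
    # log log n bits for the HLL part + r low-order hash bits (r = 10 here)
    return math.ceil(math.log2(log2_n)) + r

for log2_n in (32, 64):
    old = hll_minhash_bucket_bits(log2_n)   # 37 bits at n=2^32, 70 at n=2^64
    new = hyperminhash_bucket_bits(log2_n)  # 15 bits at n=2^32, 16 at n=2^64
    print(f"n = 2^{log2_n}: {old} vs {new} bits, {1 - new / old:.0%} savings")
```

This reproduces the ~60% and 77% savings figures quoted above.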