Hacker News new | comments | show | ask | jobs | submit login
Show HN: HyperLogLog intersections (github.com)
88 points by seiflotfy 7 months ago | hide | past | web | favorite | 16 comments

This is great! Probabilistic datastructures are amazing.

I did straight up HLL++ a while ago but had to do intersection with the inclusion–exclusion principle. It would be cool to try to fold this in somehow! (https://github.com/mynameisfiber/gohll).

I recently discovered probabilistic data structure and find it very interesting, bloom filter and hyperloglog is super cool, I am interested in learning more, is there recommend resource for this topic if one want to dive in deeper?

Great work!

I haven't messed around with the count-min augmentation for HLLs, but I've had a lot of success in practice.

I've experimented with using HLL sketches for set operations for genome comparisons. The great thing about these operations is that the cost of operations scales with the size of the sketch, not the size of the dataset being sketched. Add SIMD acceleration, and you have an extremely fast, compact, accurate way to perform approximate set operations.

Unfortunately, the naive intersection operation (a sketch consisting of an element-wise 'min' of the counts in both sketches) does not perform well in cardinality estimation.

The recent Ertl paper [0] and associated code [1] puts a lot of effort into more accurate estimation and set operations. In my experiments, I've found that his modified estimation formula is always more accurate than the standard method and remains accurate to much higher cardinalities for a given sketch size.

My SIMD-accelerated, threadsafe HyperLogLog implementation in C++ (with python bindings) using this improved estimation method is available at [2]. Using this structure, I was able to perform over a billion full-genome Jaccard index calculations overnight using `bonsai dist` from [3].

[0]: https://arxiv.org/pdf/1702.01284.pdf

[1]: https://github.com/oertl/hyperloglog-sketch-estimation-paper

[2]: https://github.com/dnbaker/hll

[3]: https://github.com/dnbaker/bonsai

How do you find the relevant papers? Is there any tool for this task?

To be honest, a combination of arxiv automailings, colleagues, and github. A lot of people I work with are interested in sketch data structures.


Thanks so much for producing a Golang implementation of our HyperMinHash paper (I'm the first author) [0] and pointing me to your Hackernews post!

As an aside, we also have a prototype Python implementation [1], but your implementation looks more efficient than ours, and I may start pointing people to your Github repo too.

[0]: https://arxiv.org/abs/1710.08436

[1]: https://github.com/yunwilliamyu/hyperminhash

the is supposed to show you how stupid i sometimes can be :P

Ok fixed it. My bad :(

Forgive me if this is a dumb question, but is this library a way of merging multiple HyperLogLogs together and still have a reasonable cardinality estimate?

What sort of loss do you see the more HyperLogLogs you merge? As in, if you had Set1,2,3,4,5.....n

> a way of merging multiple HyperLogLogs together and still have a reasonable cardinality estimate?

HLLs can be merged together with the same sort of result as if you ran the whole set over the same HLL.

The merge can't do better than that, but it can do that even if you mess around a bit with the HLL sizes (like, HIVE-18079 is fixing p=14 bits to be p=10 accuracy by default, while retaining old HLLs as-is).

its meant for merging and intersecting and it should be max 5% error

Useful library. I hope someone out there is writing one for Java as well.

HLL intersections using MinSketch has been around for a while now, I remember writing a Go implementation a couple years back based on a AdRoll blog post (http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html). They also open sourced their library (https://github.com/AdRoll/cantor) as well.

Generally in the big data space, libraries like this usually come out for Java first.

Hi there! I'm the author of the paper the OP implemented, so I thought I'd give a bit more context.

In practice, combining together HLL and MinHash as they do for AdRoll gives the same kinds of estimation errors. However, using a raw MinHash data structure requires O(log n) space. If you are not space constrained, I second nemothekid and suggest just using off-the-shelf HLL intersection + MinHash libraries, such as the ones he links to.

The HyperMinHash algorithm, which the OP implemented in Golang, combines together the MinHash and HLL data structures so that we only need O(log log n) space, where n is the size of the union. In practice, the difference is between buckets of size log n + loglog n bits for HLL-MinHash and loglog n + 10 bits for HyperMinHash.

Given the constants involved, for n = 2^32, this is only a space savings of about 60% (37 bits vs 15 bits). However, this increases as n grows, so for n=2^64, it's 77% (70 bits vs 16 bits).

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact