
Show HN: HyperLogLog intersections - seiflotfy
https://github.com/axiomhq/hyperminhash/tree/master
======
mynameisfiber
This is great! Probabilistic data structures are amazing.

I implemented straight HLL++ a while ago but had to do intersections via the
inclusion–exclusion principle. It would be cool to try to fold this in
somehow!
([https://github.com/mynameisfiber/gohll](https://github.com/mynameisfiber/gohll)).

~~~
bulldoa
I recently discovered probabilistic data structures and find them very
interesting; Bloom filters and HyperLogLog are super cool. I'm interested in
learning more. Are there any recommended resources for someone who wants to
dive in deeper?

------
stochastic_monk
Great work!

I haven't messed around with the count-min augmentation for HLLs, but I've had
a lot of success with plain HLL sketches in practice.

I've experimented with using HLL sketches for set operations in genome
comparisons. The great thing is that the cost of these operations scales with
the size of the sketch, not the size of the dataset being sketched. Add SIMD
acceleration, and you have an extremely fast, compact, accurate way to perform
approximate set operations.

Unfortunately, the naive intersection operation (a sketch consisting of an
element-wise 'min' of the counts in both sketches) does not perform well in
cardinality estimation.

The recent Ertl paper [0] and associated code [1] puts a lot of effort into
more accurate estimation and set operations. In my experiments, I've found
that his modified estimation formula is always more accurate than the standard
method and remains accurate to much higher cardinalities for a given sketch
size.

My SIMD-accelerated, thread-safe HyperLogLog implementation in C++ (with
Python bindings) using this improved estimation method is available at [2].
Using this structure, I was able to perform over a billion full-genome Jaccard
index calculations overnight using `bonsai dist` from [3].

[0]:
[https://arxiv.org/pdf/1702.01284.pdf](https://arxiv.org/pdf/1702.01284.pdf)

[1]: [https://github.com/oertl/hyperloglog-sketch-estimation-paper](https://github.com/oertl/hyperloglog-sketch-estimation-paper)

[2]: [https://github.com/dnbaker/hll](https://github.com/dnbaker/hll)

[3]: [https://github.com/dnbaker/bonsai](https://github.com/dnbaker/bonsai)

~~~
xstartup
How do you find the relevant papers? Is there any tool for this task?

~~~
stochastic_monk
To be honest, a combination of arXiv auto-mailings, colleagues, and GitHub. A
lot of the people I work with are interested in sketch data structures.

------
yunwilliamyu
Thanks so much for producing a Golang implementation of our HyperMinHash paper
(I'm the first author) [0] and pointing me to your Hacker News post!

As an aside, we also have a prototype Python implementation [1], but your
implementation looks more efficient than ours, and I may start pointing people
to your Github repo too.

[0]: [https://arxiv.org/abs/1710.08436](https://arxiv.org/abs/1710.08436)

[1]:
[https://github.com/yunwilliamyu/hyperminhash](https://github.com/yunwilliamyu/hyperminhash)

------
geocar
What is this[1] supposed to do?

[1]:
[https://github.com/axiomhq/hyperminhash/blob/master/hyperminhash.go#L102](https://github.com/axiomhq/hyperminhash/blob/master/hyperminhash.go#L102)

~~~
seiflotfy
That is supposed to show you how stupid I sometimes can be :P

------
djhworld
Forgive me if this is a dumb question, but is this library a way of merging
multiple HyperLogLogs together while still getting a reasonable cardinality
estimate?

What sort of loss do you see the more HyperLogLogs you merge? As in, if you
had Set1, Set2, Set3, ..., SetN.

~~~
gopalv
> a way of merging multiple HyperLogLogs together and still have a reasonable
> cardinality estimate?

HLLs can be merged together with the same result as if you had run the whole
combined stream through a single HLL.

The merge can't do better than that, but it can do that even if you mess
around a bit with the HLL sizes (e.g., HIVE-18079 moves the default from p=14
bits to p=10 accuracy while retaining old HLLs as-is).

------
kgdinesh
Useful library. I hope someone out there is writing one for Java as well.

~~~
nemothekid
HLL intersections using MinHash have been around for a while now; I remember
writing a Go implementation a couple of years back based on an AdRoll blog
post ([http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html](http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html)).
They also open-sourced their library
([https://github.com/AdRoll/cantor](https://github.com/AdRoll/cantor)) as
well.

Generally in the big data space, libraries like this usually come out for Java
first.

~~~
yunwilliamyu
Hi there! I'm the author of the paper the OP implemented, so I thought I'd
give a bit more context.

In practice, combining HLL and MinHash as AdRoll does gives the same kinds of
estimation errors. However, a raw MinHash data structure requires O(log n)
space. If you are not space constrained, I second nemothekid and suggest just
using off-the-shelf HLL intersection + MinHash libraries, such as the ones he
links to.

The HyperMinHash algorithm, which the OP implemented in Golang, combines the
MinHash and HLL data structures so that we only need O(log log n) space, where
n is the size of the union. In practice, the difference is between buckets of
log n + loglog n bits for HLL-MinHash and loglog n + 10 bits for HyperMinHash.

Given the constants involved, for n = 2^32, this is only a space savings of
about 60% (37 bits vs 15 bits). However, this increases as n grows, so for
n=2^64, it's 77% (70 bits vs 16 bits).

