
HyperMinHash: Bringing intersections to HyperLogLog - pjf
https://github.com/axiomhq/hyperminhash
======
stochastic_monk
The subtitle is misleading, as HyperLogLogs are excellent for intersections.
HyperMinHash claims to have some asymptotically superior behavior by mixing
concepts from HyperLogLogs and MinHash. However, I haven’t seen an optimized
implementation and corresponding benchmarks, which makes me wonder how
effective it really is.

~~~
yunwilliamyu
Hi there! I'm the lead author on the paper that Seif (expertly) implements.
You're right that my Python implementation isn't at all optimized, as I was
more interested in the theory, though I think Seif's Go implementation does
much better.

HyperLogLogs are normally used for intersections through inclusion-exclusion
(though there are more sophisticated methods out there). As such, they work
really well when the two sets A and B being intersected are about the same size
and the intersection is large. They perform poorly when |A ^ B| << |A v B|,
because inclusion-exclusion turns a small relative error in the union size into
a large relative error in the intersection size.
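
To make that concrete, here is a minimal Python sketch of the arithmetic. The
set sizes and the 1% union error are made-up numbers chosen for illustration,
and the sizes of A and B are assumed to be exact, which a real HLL would not
give you.

```python
def intersection_by_inclusion_exclusion(size_a, size_b, size_union):
    """|A ^ B| = |A| + |B| - |A v B|."""
    return size_a + size_b - size_union


def relative_error(estimate, truth):
    return abs(estimate - truth) / truth


# Hypothetical sizes: two sets of 1,000,000 items that share only 1,000.
size_a = size_b = 1_000_000
true_intersection = 1_000
true_union = size_a + size_b - true_intersection

# Assumption: the union estimate is 1% too high; |A| and |B| are exact.
union_estimate = true_union + true_union // 100

intersection_estimate = intersection_by_inclusion_exclusion(
    size_a, size_b, union_estimate
)

union_error = relative_error(union_estimate, true_union)
intersection_error = relative_error(intersection_estimate, true_intersection)

print(f"union relative error:        {union_error:.2%}")
print(f"intersection relative error: {intersection_error:.2%}")
```

With these numbers the intersection estimate even comes out negative: the
absolute error in the union is larger than the whole intersection.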

HyperLogLog uses 6 bits per bucket in normal use. For our purposes, the rule
of thumb we propose uses 16 bits per bucket. So we lose a factor of ~3 on the
number of buckets. In exchange, we don't lose as much accuracy when |A ^ B| <<
|A v B| and also do much better for multi-set intersections (where inclusion-
exclusion becomes unwieldy).
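
As a rough illustration of that trade-off, the snippet below counts how many
buckets fit in a fixed memory budget at 6 versus 16 bits per bucket. The 4 KiB
budget is an arbitrary assumption, and real implementations may pack buckets
differently.

```python
HLL_BITS_PER_BUCKET = 6   # typical HyperLogLog register width
HMH_BITS_PER_BUCKET = 16  # rule-of-thumb width mentioned above


def buckets_for_budget(budget_bits, bits_per_bucket):
    """Number of whole buckets that fit in the given number of bits."""
    return budget_bits // bits_per_bucket


budget_bits = 8 * 4096  # hypothetical 4 KiB sketch

hll_buckets = buckets_for_budget(budget_bits, HLL_BITS_PER_BUCKET)
hmh_buckets = buckets_for_budget(budget_bits, HMH_BITS_PER_BUCKET)
ratio = hll_buckets / hmh_buckets

print(hll_buckets, hmh_buckets, round(ratio, 2))
```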

As an aside, there are a bunch of other really cool Jaccard index
fingerprinting methods out there that are asymptotically better even than
HyperMinHash. But, they aren't as easy to work with and lack some of the other
nice properties of MinHash.

~~~
bruce_lipshitz
> _As an aside, there are a bunch of other really cool Jaccard index
> fingerprinting methods out there that are asymptotically better even than
> HyperMinHash. But, they aren't as easy to work with and lack some of the
> other nice properties of MinHash._

Would you be willing to list out these better methods and detail their pros /
cons versus MinHash? It would be super helpful for those of us who are
students in this field.

~~~
yunwilliamyu
Things like b-bit MinHash [Li, Konig, 2010]. A good summary of the problem
with asymptotics and some lower bounds is Pagh, Stockel, Woodruff, 2014, "Is
Min-Wise Hashing Optimal for Summarizing Set Intersection." There's been more
recent work, but that's a good place to start.

These other techniques have lower space complexity than MinHash (or even
HyperMinHash) but have the con that they lose the streaming and composability
properties of MinHash. One of the main advantages of a MinHash (or HLL) sketch
is that you can combine sketch(A) with sketch(B) to get sketch(A v B).
Alternatively, the data structures can be progressively updated as you see a
stream of items. The other fingerprinting techniques often require storing
auxiliary data during the generation process.
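
The composability and streaming properties can be sketched with a plain
k-hash-function MinHash in Python. This is a toy illustration, not the
HyperMinHash data structure or either implementation discussed here; the
function names, the seeded BLAKE2b hashing, and the choice of 64 hash functions
are all assumptions made for the example.

```python
import hashlib

NUM_HASHES = 64      # assumed sketch size, chosen arbitrarily
EMPTY_SLOT = 2**64   # sentinel larger than any 64-bit hash value


def hash_item(item, seed):
    """Seeded 64-bit hash of an item (illustrative choice of hash)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def update(sig, item):
    """Streaming update: keep the minimum hash seen in each slot."""
    for i in range(len(sig)):
        h = hash_item(item, i)
        if h < sig[i]:
            sig[i] = h


def sketch(items, num_hashes=NUM_HASHES):
    sig = [EMPTY_SLOT] * num_hashes
    for item in items:
        update(sig, item)
    return sig


def merge(sig_a, sig_b):
    """sketch(A) combined with sketch(B) gives sketch(A v B)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def jaccard_estimate(sig_a, sig_b):
    """Fraction of slots where the two sketches agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


set_a = set(range(0, 1000))
set_b = set(range(500, 1500))

merged = merge(sketch(set_a), sketch(set_b))
# The merged sketch is identical to sketching the union directly.
print(merged == sketch(set_a | set_b))
print(jaccard_estimate(sketch(set_a), sketch(set_b)))
```

Because each slot only ever keeps a minimum, merging is an element-wise `min`
and needs nothing beyond the two sketches themselves.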

------
pjungwir
Neat! I can definitely imagine cases where this would be helpful.

Another thing I've wished for is being able to _remove_ an element from an
hll. I don't think intersection quite gets me there; it would need to be a
separate operation.

Here is where I mentioned this once before, although that particular use case
was achievable without needing an hll at all:
https://news.ycombinator.com/item?id=14637903

