

HyperLogSandwich - boyd
https://github.com/chanian/hyperlogsandwich/wiki

======
t1m
Note that intersections using HLLs don't work in general. The error rates
become gigantic for sets that have small intersections, or whose cardinality
differs greatly.

A general solution to cardinality estimations for intersections is to just use
HLLs for union operations, and keep a MinHash sketch for each key as well to
perform intersections.

There is a very good analysis of this technique over at
[http://tech.adroll.com/blog/data/2013/07/10/hll-
minhash.html](http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html)

~~~
iamwil
HN is great. I was looking for something similar to HLL to do intersections,
but couldn't find it because I didn't know what it was called. MinHash,
apparently!

------
bpodgursky
It would be really great if this was implemented in stream-lib
([https://github.com/addthis/stream-lib](https://github.com/addthis/stream-
lib)), which is a pretty standard library for HLL counters.

Also, the community supporting that lib is very familiar with the mathematics
behind these structures, and would be be better able to critique the error
rates and such.

------
Hengjie
Can someone explain to me like I'm 5 how this is useful?

~~~
temuze
HyperLogLog is an algorithm that allows you to find an estimate of the
distinct elements of a dataset with very little memory. Here's a simple
explanation:

[http://opensourceconnections.com/blog/2015/02/04/its-log-
its...](http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-
big-its-hyper-its-good/)

HyperLogSandwich solves a similar problem. From the README:

> Unlike the HyperLogLog, which estimates the cardinality of unique items in a
> set, and unlike a CountMinSketch, which estimates the frequency of a
> specific item, the HyperLogSandwich estimates the number of unique items
> that have occurred for any frequency.

Real world examples include:

\- How many people viewed my Tweets 4+ times this month?

\- How many users watched this video 2 times this week?

\- How many users visited my website 3 times during an arbitrary time range?

\- How many log exceptions happened 10+ times last week?

