
Show HN: An improved version of HyperLogLog - seiflotfy
https://github.com/axiomhq/hyperloglog.git
======
antirez
Somewhat similar to the Redis implementation in many ways: Redis also uses loglog-
beta, a sparse representation, and a dense representation in packed format
without wasting a single byte. However, the Redis representation uses 6-bit registers.
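Roughly, the dense 6-bit packing works like this (a sketch in Go rather than the actual Redis C code, just to show the idea):

```go
package main

import "fmt"

// Dense 6-bit registers packed back to back into a byte slice, in the
// spirit of the Redis HLL dense encoding. Illustrative sketch only.
type dense6 []byte

func newDense6(m int) dense6 {
	return make(dense6, (m*6+7)/8) // ceil(6m/8) bytes for m registers
}

func (d dense6) get(i int) uint8 {
	bit := i * 6
	b, off := bit/8, uint(bit%8)
	v := uint16(d[b]) // a 6-bit value can straddle a byte boundary
	if b+1 < len(d) {
		v |= uint16(d[b+1]) << 8
	}
	return uint8((v >> off) & 0x3f)
}

func (d dense6) set(i int, val uint8) {
	bit := i * 6
	b, off := bit/8, uint(bit%8)
	v := uint16(d[b])
	if b+1 < len(d) {
		v |= uint16(d[b+1]) << 8
	}
	v = (v &^ (0x3f << off)) | uint16(val&0x3f)<<off
	d[b] = byte(v)
	if b+1 < len(d) {
		d[b+1] = byte(v >> 8)
	}
}

func main() {
	d := newDense6(16384) // 2^14 registers in 12 KB instead of 16 KB
	d.set(1000, 37)
	fmt.Println(d.get(1000)) // 37
}
```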

~~~
seiflotfy
It's heavily inspired by the Redis implementation. In fact, back then I looked at
how you did loglog-beta in Redis and made a repo for it. Now I just added
TailCut to it. Do you mind if I do a PR for Redis?

~~~
antirez
Cool! Please drop me an email if possible. I would like to explore how the
reduced register width affects Redis's ability to estimate very, very high
cardinalities: imagine millions of events per second running for years. The
problem with improving the current Redis implementation is backward
compatibility. I made the error of representing the HyperLogLog in Redis as a
string instead of an opaque data type, so it is not possible to do
conversion-on-load of old data, but it is possible to convert at the first access.

~~~
seiflotfy
If there is a version annotation it is very easy to convert the old
HyperLogLog to a 4-bit register representation. I will send you an email ASAP.

------
seiflotfy
Author here. The 3 main things that make this stand out are: 1) 4-bit registers
instead of 5 (HLL) or 6 (HLL++), where most implementations use 1-byte
registers out of convenience (rough sketch of the packing below); 2) a sparse
representation for lower cardinalities (like HyperLogLog++); 3) loglog-beta for
dynamic bias correction at medium and high cardinalities.
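To illustrate 1): two 4-bit registers fit in one byte, and with TailCut each register stores an offset from a shared base instead of the raw rank, so 4 bits are usually enough. A sketch of the idea (not the exact layout in the repo):

```go
package main

import "fmt"

// Two 4-bit registers per byte. With TailCut a register holds an offset
// from a shared base value; when offsets would overflow, the base is
// raised and the tail is cut. Sketch of the idea, not the repo's code.
type regs4 struct {
	data []byte
	base uint8 // register value = base + 4-bit offset
}

func newRegs4(m int) *regs4 { return &regs4{data: make([]byte, m/2)} }

func (r *regs4) get(i int) uint8 {
	if i%2 == 0 {
		return r.data[i/2] & 0x0f
	}
	return r.data[i/2] >> 4
}

func (r *regs4) set(i int, off uint8) {
	if off > 15 {
		off = 15 // saturate; handling this overflow is the subtle part
	}
	if i%2 == 0 {
		r.data[i/2] = r.data[i/2]&0xf0 | off
	} else {
		r.data[i/2] = r.data[i/2]&0x0f | off<<4
	}
}

// update records a newly observed rank for register i, keeping the max.
func (r *regs4) update(i int, rank uint8) {
	if rank > r.base && rank-r.base > r.get(i) {
		r.set(i, rank-r.base)
	}
}

func main() {
	r := newRegs4(16384) // 2^14 registers in 8 KB instead of 16 KB
	r.update(42, 7)
	fmt.Println(r.base + r.get(42)) // 7
}
```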

------
StreamBright
This is absolutely amazing. It seems like a great tradeoff: smaller space
usage and better accuracy on average at the same time. The related white paper
is also worth reading:
[http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf](http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf)

------
chimeracoder
This is awesome! We make heavy use of HyperLogLogs in our monitoring
systems[0], which are also written in Go and currently use Clark Duvall's
library, which this library is based on.

I'm excited to try this out on our systems and see what results we get.

[0] [https://github.com/stripe/veneur](https://github.com/stripe/veneur)

~~~
seiflotfy
I am happy to help if needed :D

------
jbapple
How does the Xiao et al. paper (which introduces TailCut, "Better with Fewer
Bits: Improving the Performance of Cardinality Estimation of Large Data
Streams",
[http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf](http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf))
compare with Ertl's "New cardinality estimation algorithms for HyperLogLog
sketches",
[https://arxiv.org/abs/1702.01284](https://arxiv.org/abs/1702.01284)?

~~~
oertl
The tailcut approach breaks one important property of cardinality estimation
algorithms: Adding the same value multiple times should change the state of
the sketch only once when adding it for the first time.
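
For reference, the classic HLL update is a pure max over ranks, which is exactly why re-inserting a value is a no-op (a minimal sketch, not the code under discussion):

```go
package main

import (
	"fmt"
	"math/bits"
)

// Classic HLL update: a register only ever takes the max of the ranks
// it has seen. The same input hashes to the same (index, rank), so a
// second insert cannot change the state.
func insert(registers []uint8, hash uint64, p uint) {
	idx := hash >> (64 - p) // first p bits select the register
	// rank = 1 + number of leading zeros in the remaining bits;
	// the sentinel bit caps the rank at 64-p+1 when they are all zero.
	rank := uint8(bits.LeadingZeros64(hash<<p|1<<(p-1))) + 1
	if rank > registers[idx] { // max() is idempotent
		registers[idx] = rank
	}
}

func main() {
	regs := make([]uint8, 1<<14)
	h := uint64(0xdeadbeefcafef00d)
	insert(regs, h, 14)
	before := regs[h>>50]
	insert(regs, h, 14) // same hash again: no change
	fmt.Println(before == regs[h>>50]) // true
}
```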

However, due to the chance of overflows, the same value can lead to multiple
changes under the tailcut approach. The introduced error depends on the
order and also on the frequency distribution of the inserted elements. It would
be interesting to know which assumptions about the input data were made for the
promised accuracy; I could not find any hint in the paper about what the input
data looked like in their experiments. It is fairly easy to obtain good results
for the trivial case where all input values are distinct.

The tailcut approach reminds me of the HyperBitBit algorithm, which also drops
the property described above and promises a better storage factor (less error
with the same amount of memory).

It is true that the traditional 5-bit or 6-bit representations of HyperLogLog
registers are suboptimal. Even with lossless compression (see
[https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b5f97dae6f940c8.pdf](https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b5f97dae6f940c8.pdf))
a significant amount of memory can be saved.

It is interesting that the standard error is reduced from 1.04/sqrt(m) down to
1.0/sqrt(m) despite the loss of information after the tailcut. Therefore, I
conclude that it must be the estimation method itself which is superior to
existing methods. I will need to check.

~~~
seiflotfy
Hey oertl, big fan of your paper
[https://arxiv.org/pdf/1706.07290.pdf](https://arxiv.org/pdf/1706.07290.pdf)
:D I'm going to start working on "intersections" and some improvements for
large-scale cardinalities based on your research. Can I hit you up by email
with some of my questions concerning the intersection work? How does it
compare to
[https://arxiv.org/pdf/1704.03911.pdf](https://arxiv.org/pdf/1704.03911.pdf)?

~~~
oertl
Sure, send me an email.

------
pjungwir
I was interested in using HLL for a project where I had ~20 million objects
and each had a "status" with about 6 possibilities. I wanted to get fast
counts of how many objects were in each status. But their status can change
over time, so I needed a way to also _remove_ items from the count. HLL
doesn't support this. Does anyone know of an algorithm that does? (Keeping a
separate HLL for "removed" doesn't help because objects can revisit the same
status they had before.)

~~~
teej
I'm sure you thought of this, but what are the drawbacks to simply INCR/DECR
some counters when an object changes state?

~~~
pjungwir
Indeed, after ruling out HLL I decided that was the way to go. :-) I guess an
HLL is more appropriate when you can't so easily keep track of which items
have already been added to each set. For example, if you were counting unique
visitors per day, you would have a lot of sets (one per day), so you wouldn't
want to store n visitor IDs for every day. But when an item can only be in one
set at a time (and the sets are finite), incr/decr is possible, so you might as
well go with it (sketch below).
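
Something like this, hypothetically:

```go
package main

import "fmt"

// Exact per-status counts via increment/decrement on each transition.
// Works because every object is in exactly one of a small, fixed set
// of statuses at a time, so no sketching is needed.
type statusCounts map[string]int64

func (s statusCounts) move(from, to string) {
	if from != "" {
		s[from]--
	}
	s[to]++
}

func main() {
	s := statusCounts{}
	s.move("", "new")       // object created
	s.move("new", "active") // status change: decrement old, increment new
	fmt.Println(s["new"], s["active"]) // 0 1
}
```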

------
lorenzhs
I don't understand what the result table measures. What are the numbers and
what is the " _Exact_ " column?

~~~
StreamBright
Exact is the actual cardinality: how many distinct elements are in the set.
HLL can estimate the cardinality using a small fraction of the space that
storing the unique elements would require; for example, 2^14 4-bit registers
take 8 KB no matter how large the set grows.

~~~
lorenzhs
Then wouldn't it be useful to add two columns for space used?

~~~
seiflotfy
It's always going to be half. The thing is, the space is allocated upon creation
of the data structure and does not increase afterwards.

~~~
lorenzhs
That doesn't match up with the tagline " _An improved version of HyperLogLog
for the count-distinct problem, approximating the number of distinct elements
in a multiset using 20-50% less space than other usual HyperLogLog
implementations._ "

~~~
seiflotfy
I am comparing the implementations from Influx and Axiom.

------
JensRantil
I'm looking for implementations of this in other languages. Specifically in
Python or Java. Please hook me up if you know of any such implementation.

------
buremba
Do you guys know any way to calculate the intersection of two HLL instances?
The algorithm allows merging two sets, but it's not that easy to take their
intersection. Inclusion-exclusion, |A| + |B| - |A ∪ B|, is not optimal, so I
would love to hear your suggestions.
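
For concreteness, the inclusion-exclusion route looks like this with a merge-capable HLL API (the New14/Insert/Estimate/Merge method names are my guess at this repo's API, so treat it as a sketch):

```go
package main

import (
	"fmt"

	"github.com/axiomhq/hyperloglog"
)

// Inclusion-exclusion estimate: |A n B| ~= |A| + |B| - |A u B|.
// Not optimal: the three estimation errors add up, which swamps the
// result when the true intersection is small relative to the sets.
func intersect(a, b *hyperloglog.Sketch) (int64, error) {
	union := hyperloglog.New14()
	if err := union.Merge(a); err != nil {
		return 0, err
	}
	if err := union.Merge(b); err != nil {
		return 0, err
	}
	return int64(a.Estimate()) + int64(b.Estimate()) - int64(union.Estimate()), nil
}

func main() {
	a, b := hyperloglog.New14(), hyperloglog.New14()
	for i := 0; i < 10000; i++ {
		a.Insert([]byte(fmt.Sprintf("user-%d", i)))
		b.Insert([]byte(fmt.Sprintf("user-%d", i+5000))) // 5000 shared
	}
	if n, err := intersect(a, b); err == nil {
		fmt.Println("estimated intersection:", n) // roughly 5000
	}
}
```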

~~~
seiflotfy
I have a way, will post it soon.

~~~
buremba
Would love to hear about the solution. Are you going to implement it or
publish an article?

------
helper
There really are a lot of HyperLogLog implementations in Go.

I'd be interested in seeing a comprehensive comparison between implementations
for speed, accuracy and space efficiency.

~~~
seiflotfy
I'll take care of this over the weekend :D

~~~
JensRantil
Any progress?

------
tmaly
Why use metro hash instead of xxhash?

~~~
seiflotfy
In my tests metro hash gave me higher throughput; it must be something with the
xxhash implementation in Go.
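
If anyone wants to reproduce the comparison, a minimal benchmark along these lines should do (the go-metro and xxhash import paths below are the ports I happen to know of, so adjust as needed):

```go
package hashbench

import (
	"testing"

	xxhash "github.com/cespare/xxhash"
	metro "github.com/dgryski/go-metro"
)

var buf = make([]byte, 64) // a typical short-key workload
var sink uint64            // keeps the compiler from eliding the calls

func BenchmarkMetro(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = metro.Hash64(buf, 0) // seed 0
	}
}

func BenchmarkXXHash(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = xxhash.Sum64(buf)
	}
}
```

Run with `go test -bench=.` and vary the buffer size, since relative throughput depends heavily on key length.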

~~~
dom0
I tested both 64-bit variants a while ago and there was no throughput
difference worth mentioning with the C++ and C implementations, respectively.
(I settled on xxHash since it's been around longer and the code is portable;
also, I didn't need to do a C++-to-C translation.)

If there is a large difference, perhaps one of the implementations is
vectorized or botched.

