Btw, Redis recently switched from HyperLogLog to the slightly better LogLog-Beta algorithm which is not listed in this publication AFAIK.
Interesting, has anything been written about that? I've learned most of what I know about cardinality estimation from Redis and your writing on the matter, so I'd love to see more. It would be great to re-visit an article I wrote about HLL from a new perspective as well.
There, I present two different algorithms based on theoretical considerations that are both accurate over the entire cardinality range. Unlike LogLog-Beta or HyperLogLog++, they do not need any empirically determined coefficients or bias correction data.
Found this from the Redis code (I was curious, too).
A simpler implementation of red-black trees. Came across it and enjoyed it.
HyperLogLog is definitely one of the most elegant algorithms I'm aware of. The only thing is that the problem it solves is very specific, and it's hard to generalize it to do other things.
From my quick calculations, I'm getting [1.03, 0.96, 0.96, 1.05] but the table specifically has [1.05, 1.03, 0.96, 1.03]
Am I missing something here? Specifically the 2M case has a positive ratio when the HBB estimate is higher than the actual count.
• Start with a working program (algorithm implementation).
• Develop mathematical model of its behavior.
• Use the model to formulate hypotheses on resource usage.
• Use the program to validate hypotheses.
• Iterate on basis of insights gained.
Knuth's methodology (slide 2) can be used to analyze programs we currently use, which have no obvious mathematical content.
e.g. how do these fine code linters spot errors in your code? https://github.com/showcases/clean-code-linters but also does it manage your keystroke events an present information to you in a handy interface? algorithms are managing all of that