Cardinality Estimation (2011) [pdf] (princeton.edu)
134 points by ingve on Dec 22, 2016 | 11 comments

Sedgewick is a very good writer IMHO. I learned algorithms with his famous book "Algorithms in C". Later I read many different algorithms books, and there are better books from specific points of view; however, as an entry-level book, I think AiC is one of the most accessible for students.

Btw, Redis recently switched from HyperLogLog to the slightly better LogLog-Beta algorithm which is not listed in this publication AFAIK.

> Redis recently switched from HyperLogLog to the slightly better LogLog-Beta

Interesting, has anything been written about that? I've learned most of what I know about cardinality estimation from Redis and your writing on the matter, so I'd love to see more. It would be great to re-visit an article I wrote about HLL[0] from a new perspective as well.

[0]: https://blog.codeship.com/counting-distinct-values-with-hype...

If you are interested in new cardinality estimation algorithms for HyperLogLog sketches, you could also have a look at the paper I am currently working on:


There, I present two different algorithms based on theoretical considerations that are both accurate over the entire cardinality range. Unlike LogLog-Beta or HyperLogLog++, they do not need any empirically determined coefficients or bias correction data.

See the arXiv paper[0] on it.

Found this from the Redis code[1] (I was curious, too).



Sedgewick also has a nice Coursera lecture on 2-3 trees leading to LLRB trees, covered in these slides:


A simpler implementation of red-black trees. Came across it and enjoyed it.

I haven't actually found LLRBs to be simpler to implement, and a couple of other sources agree with me. They're pedagogically simpler than standard RB trees, but if your goal is to introduce students to a balanced binary search tree, I'd argue that AVLs are easier to teach and implement than either RBs or LLRBs.
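For reference, here's roughly what I mean by "easier": a toy AVL insert in Python (my own sketch, not from Sedgewick's slides). The whole invariant is a height field per node, and an insert rebalances with at most one single or double rotation on the way back up.

```python
# Minimal AVL tree insertion: each node stores its height; after a
# recursive insert we recompute heights and fix any node whose
# subtree heights differ by more than 1 with a rotation.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def h(n):
    return n.height if n else 0

def update(n):
    n.height = 1 + max(h(n.left), h(n.right))
    return n

def balance_factor(n):
    return h(n.left) - h(n.right)

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    return update(x)

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    return update(y)

def insert(n, key):
    if n is None:
        return Node(key)
    if key < n.key:
        n.left = insert(n.left, key)
    elif key > n.key:
        n.right = insert(n.right, key)
    else:
        return n                        # no duplicates
    update(n)
    bf = balance_factor(n)
    if bf > 1:                          # left-heavy
        if key > n.left.key:            # left-right case: double rotation
            n.left = rotate_left(n.left)
        return rotate_right(n)
    if bf < -1:                         # right-heavy
        if key < n.right.key:           # right-left case: double rotation
            n.right = rotate_right(n.right)
        return rotate_left(n)
    return n
```

Four symmetric cases, no color flips, and deletion (not shown) follows the same pattern.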



I had an opportunity to dig into the details of hyperloglog recently, and I wrote up my results here - http://moderndescartes.com/essays/hyperloglog

HyperLogLog is definitely one of the most elegant algorithms I'm aware of. The only thing is that the problem it solves is very specific, and it's hard to generalize it to do other things.
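For anyone who hasn't seen why it's elegant: the core fits in a few lines. A toy Python sketch of textbook HyperLogLog (my own, using the standard alpha constant and the original paper's small-range correction, not Redis's tuned implementation):

```python
# Toy HyperLogLog: hash each item, use the first b bits to pick one of
# m = 2^b registers, and keep the per-register maximum of the position
# of the leftmost 1-bit in the remaining bits. A bias-corrected
# harmonic mean of 2^register becomes the cardinality estimate.
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.b)                       # first b bits: register index
        w = x & ((1 << (64 - self.b)) - 1)           # remaining 64-b bits
        rho = (64 - self.b) - w.bit_length() + 1     # leading zeros + 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m * z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros) # (linear counting)
        return e
```

With b=10 (1024 registers, 1 KB or so of state) the relative standard error is about 1.04/sqrt(1024), roughly 3%, which is the part that still feels like magic.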

In the HyperBitBit initial experiments slide, the table with the results and estimates is throwing me off with regard to the ratios.

From my quick calculations, I'm getting [1.03, 0.96, 0.96, 1.05] but the table specifically has [1.05, 1.03, 0.96, 1.03]

Am I missing something here? Specifically, the 2M case has a ratio above 1 when the HBB estimate is higher than the actual count.
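For clarity, this is the convention I assumed when computing my ratios (made-up numbers here, not the slide's):

```python
# Assumed convention: ratio = estimate / actual count, so an estimate
# above the true cardinality gives a ratio above 1, and an
# underestimate gives a ratio below 1. Numbers are hypothetical.
actual, estimate = 2_000_000, 2_100_000
ratio = estimate / actual
# ratio == 1.05, i.e. a 5% overshoot
```

If the slide used the inverse (actual / estimate), that would explain some of the mismatch but not the ordering.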

As a layperson, I found this to be a pretty accessible background that ultimately leads up to a useful description of HyperLogLog. I had never heard of Philippe Flajolet before, and now I'm glad I have.

This presentation is from 2016, not 2011.

Understood since Babbage: AofA (the analysis of algorithms) is a scientific endeavor.

• Start with a working program (algorithm implementation).

• Develop mathematical model of its behavior.

• Use the model to formulate hypotheses on resource usage.

• Use the program to validate hypotheses.

• Iterate on basis of insights gained.
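That loop is easy to try concretely. A toy doubling test (my own example, not from the slides): hypothesize T(N) ~ c N log N for sorting, then validate by timing the program at doubling sizes and comparing observed ratios to the model's prediction.

```python
# Sedgewick-style doubling test: if T(N) ~ c * N log N, then
# T(2N)/T(N) should approach (2N log 2N)/(N log N). We time Python's
# built-in sort at doubling sizes and print observed vs predicted.
import math
import random
import time

def measure(n):
    """Time one sort of n random floats, in seconds."""
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    sorted(data)
    return time.perf_counter() - start

def doubling_test(n0=100_000, steps=3):
    n, prev = n0, measure(n0)
    for _ in range(steps):
        n *= 2
        t = measure(n)
        predicted = 2 * math.log(n) / math.log(n // 2)
        print(f"N={n:>8}  observed ratio={t / prev:.2f}  model predicts {predicted:.2f}")
        prev = t

if __name__ == "__main__":
    doubling_test()
```

If the observed ratios settle near the predicted ~2.1, the hypothesis survives; if they drift toward 4, you'd suspect a quadratic term and iterate on the model.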

Knuth's methodology (slide 2) can be used to analyze programs we currently use, which have no obvious mathematical content.

e.g. how do these code linters spot errors in your code? https://github.com/showcases/clean-code-linters And how does your editor manage keystroke events and present information to you in a handy interface? Algorithms are managing all of that.
