Hacker News new | past | comments | ask | show | jobs | submit login
Counting Billion Distinct Objects Using Only 1.5KB of Memory (2012) (highscalability.com)
90 points by yagizdegirmenci on May 24, 2021 | hide | past | favorite | 11 comments



Just to be that guy, the title of the article should include the word "approximately" as HyperLogLog and other probabilistic sketches are all (albeit often quite accurate) approximation methods.


Nah we need that guy, because I'm sick of clickbait titles.


The article like you pointed out is comparing apples to oranges. I could use 0 bits of memory to count if I’m able to accept 100% error to take the extreme case


You could even be statistically better with a probabilistic guess from the size of the entry and still no memory used.


This article is old but really good at explaining HyperLogLog: http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...


> Specifically, by throwing out the 30% of buckets with the largest values, and averaging only 70% of buckets with the smaller values, accuracy can be improved from 1.30/sqrt(m) to only 1.05/sqrt(m)!

How do you do this without sorting? Or, do you approximate again by throwing out everything < 0.3 * max_hash? It’s not obvious to me how that approach would actually give any better results: the spacing between 0.3 * max_hash and the next lowest hash is just as prone to variance as the distance between 0 and the absolute lowest hash, and it’s that distance which feeds the estimate.


For Python, there's Bounter, "Efficient Counter that uses a limited (bounded) amount of memory regardless of data size."

https://github.com/RaRe-Technologies/bounter

Apart from HyperLogLog, Bounter implements a bunch of other nifty (fast) approximate counting algorithms like CountMinSketch.


These are in use within a lot of production systems: https://datasketches.apache.org/



> Here is an example: 4f67bfc603106cb2. These 16 characters represent 128 bits.

These look like hex digits. 16 hex digits is 8 bytes which is 64 bits. What am I missing?


The author was probably viewing those as some sort of byte-per-character encoding rather than hex.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: