Hacker News, 2017-04-04
Distributed count(distinct) with HyperLogLog on Postgres (citusdata.com)
HLL is great for a lot of reasons. The article's author gives the primary one as getting unique counts from randomly sharded data in a distributed system.

If your distributed system allows you to do hash- or range-based sharding, for example by user_id, then you can do an exact count(distinct user_id) across the system without a reshuffle of the data: all the data for a particular user lives on the same node, so each node counts its own distinct users and the per-node counts simply add up.
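To make that concrete, here is a rough Python sketch rather than anything from the article or from Citus. The names `shard_for` and `sharded_distinct_count` are made up for illustration, and the sets stand in for nodes. It assumes every row for a given user_id is routed by a hash of that user_id.

```python
import hashlib

def shard_for(user_id, num_shards):
    """Hypothetical routing rule: hash the sharding key, take it modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def sharded_distinct_count(user_ids, num_shards):
    """Exact count(distinct user_id) without moving rows between shards."""
    shards = [set() for _ in range(num_shards)]  # one set per simulated node
    for uid in user_ids:
        shards[shard_for(uid, num_shards)].add(uid)
    # Each user lives on exactly one shard, so the local distinct sets are
    # disjoint and their sizes can be summed directly.
    return sum(len(s) for s in shards)
```

This only works because the sharding key and the counted column are the same. Count distinct on any other column and the per-shard sets can overlap again.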

Great explanation of HLL in here - I hadn't fully understood how you can combine HLLs together, which is key to understanding why they can help distribute count distinct over multiple shards without needing to copy vast amounts of data around to check for uniqueness across multiple shards.
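For anyone else who was fuzzy on the combining part, below is a toy HyperLogLog in Python. It is a simplified sketch, not the postgresql-hll implementation. The class name `HLL`, the SHA-256 based hash and the default precision `p=12` are my own choices for illustration. The point is `merge`: taking the register-wise maximum of two sketches gives exactly the sketch you would have built from the union of their inputs.

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        width = 64 - self.p
        idx = h >> width
        rest = h & ((1 << width) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: equivalent to sketching the union of both inputs.
        assert self.p == other.p, "sketches must use the same precision"
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # standard constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw
```

So each shard only has to ship its fixed-size register array (a few KB at this precision) to the coordinator, which merges them and calls `estimate()` once. Adding up the per-shard estimates instead would double count every value that appears on more than one shard.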

