
Distributed count(distinct) with HyperLogLog on Postgres - anarazel
https://www.citusdata.com/blog/2017/04/04/distributed_count_distinct_with_postgresql/
======
Manozco
I still think the best explanation of HyperLogLog is the one done by antirez
(Redis Developer) here:
[http://antirez.com/news/75](http://antirez.com/news/75)

------
simonw
Great explanation of HLL in here - I hadn't fully understood how you can
combine HLLs together, which is key to understanding why they can help
distribute count distinct over multiple shards without needing to copy vast
amounts of data around to check for uniqueness across shards.
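
For a concrete picture, here's a minimal sketch of that merge step using the
postgresql-hll extension the article builds on (the table and column names
are hypothetical):

    -- Requires https://github.com/citusdata/postgresql-hll
    CREATE EXTENSION IF NOT EXISTS hll;

    -- Each shard independently folds its user ids into a small,
    -- fixed-size sketch.
    SELECT hll_add_agg(hll_hash_bigint(user_id)) AS sketch
    FROM events_shard_1;

    -- The coordinator unions the sketches instead of moving raw rows,
    -- then reads off the distinct-count estimate.
    SELECT hll_cardinality(hll_union_agg(sketch)) FROM per_shard_sketches;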

------
lima
Is citus + cstore_fdw a good choice for space-efficient archiving and
occasionally querying large amounts of (high cardinality) log and (lower
cardinality) time series data?

ElasticSearch works great, but it is unsuitable for long-term storage for
performance and storage reasons.

Great writeup by the 18F guys on this topic:
[https://github.com/18F/api.data.gov/issues/235](https://github.com/18F/api.data.gov/issues/235)

Unfortunately, they did not really consider Citus and they seem to have lots
of issues with the solution they chose (Kylin).

~~~
ozgune
TLDR: This depends on your requirements. We find that users with similar
requirements follow one of two approaches:

(1) You keep raw data for a certain time period and regularly roll it up (see
the sketch after this list):
[https://www.citusdata.com/blog/2016/11/29/event-aggregation-at-scale-with-postgresql](https://www.citusdata.com/blog/2016/11/29/event-aggregation-at-scale-with-postgresql)

(2) Citus + cstore_fdw if you're cool with doing some manual work. In
particular, our customers who follow this approach shard on one column
(automatic) and then manually partition by time. We're looking to leverage
upcoming partitioning improvements in Postgres 10 and further automate this
setup:
[https://github.com/citusdata/citus/issues/183](https://github.com/citusdata/citus/issues/183)
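
As a rough illustration of approach (1), the rollup step boils down to
something like this (the raw and rollup table names here are hypothetical):

    -- Aggregate yesterday's events into a daily summary table; once
    -- rolled up, the raw rows can be expired.
    INSERT INTO events_daily (day, customer_id, event_count)
    SELECT date_trunc('day', created_at) AS day,
           customer_id,
           count(*) AS event_count
    FROM events_raw
    WHERE created_at >= date_trunc('day', now() - interval '1 day')
      AND created_at <  date_trunc('day', now())
    GROUP BY 1, 2;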

For more on Citus + cstore, this recent HN comment could also be helpful:
[https://news.ycombinator.com/item?id=14039001](https://news.ycombinator.com/item?id=14039001)

Happy to chat more if you drop us a line. How much data did you have in mind?
Are you looking to power a B2B or B2C application?

~~~
lima
Internal analytics, may accidentally become a B2B at some point (who knows),
but for now, non business-critical with no availability requirements. About
30+ GB per day, but that's uncompressed and unprocessed.

Thanks for the pointers! Manual sharding sounds like an option.

------
brd529
HLL is great for a lot of reasons. The post gives the primary one as getting
uniques from randomly sharded data in a distributed system.

If your distributed system allows you to do a hash or range based sharding,
for example by user_id, then you can do an accurate count(distinct user_id)
across the system without a reshuffle of the data, knowing that all the data
for a particular user lives on the same node.
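
In Citus terms, a minimal sketch of that setup (the table name is from the
article; the rest is the standard Citus API):

    -- Distribute the table by user_id so all rows for a given user
    -- land on the same shard.
    SELECT create_distributed_table('github_events', 'user_id');

    -- Exact distinct count: each shard counts its own users and the
    -- coordinator sums the partial results; no data is reshuffled.
    SELECT count(DISTINCT user_id) FROM github_events;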

~~~
ozgune
(Ozgun from Citus Data)

Yup, good point. The example in the post shards the github_events table on
user_id and then shows running count(distinct user_id). Since the sharding
column and the count(distinct) column are the same, Citus can push down the
count(distinct user_id) to each shard and then sum up the results from those
shards. In this case, reshuffles don't come into the picture.

In this example, HLLs would be most useful if the user then issued, for
example, count(distinct repo_id). Citus would then ask for HLL sketches from
each shard and merge them on the coordinator node.
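
A minimal sketch of that second case, using Citus' count_distinct_error_rate
setting to turn on the HLL-based approximation:

    -- Allow approximate count(distinct) with a 0.5% error bound; with
    -- this set, Citus builds an HLL sketch on each shard and merges
    -- the sketches on the coordinator.
    SET citus.count_distinct_error_rate TO 0.005;

    SELECT count(DISTINCT repo_id) FROM github_events;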

