Ask HN: What database should I use for analytics data?
11 points by AdriaanvRossum on Nov 3, 2019 | 18 comments
Let me rephrase that. I'm running Simple Analytics where we collect page views, events and specific details about a page view (url, time, referrer, utm_source, screen width ...).

At the moment we use PostgreSQL, which serves us pretty well. We have some caching tables set up, and with our processing scripts the data is in customers' dashboards within 2 minutes. We roll our raw visits up into tables aggregated per hour and per day.

This is all working fine. But we want to start targeting bigger clients now and would love for the setup to hold up then as well.

A few things that are important to us when selecting a database:

- it is very popular (we want to solve issues fast)

- it is free to use (the license)

- it is easy to maintain (very important)

- it can run on my own servers

One that really appeals to us is Apache Cassandra with Apache Spark.

But we're not sure if we should just optimize our PostgreSQL workflow instead. I wouldn't mind investing 1 or 2 weeks of my time into setting up a new database, but it should also save time in the future. The common queries we will have are listed below (see our demo dashboard for what we show [1]; a rough SQL sketch of the first one follows the list):

- get (unique) page views for a website, grouped per 24 hours

- get a list of the most visited pages, with the percentage of referrals aggregated by day (/contact: twitter.com 20%, organic 80%)

- get conversions between events of sessions/list of events
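
For example, the first query looks roughly like this in our current Postgres setup (table and column names here are simplified, not our actual schema):

    -- daily (unique) page views for one site
    SELECT date_trunc('day', ts)       AS day,
           count(*)                    AS page_views,
           count(DISTINCT session_id)  AS unique_visitors
    FROM   page_views
    WHERE  site = 'example.com'
      AND  ts >= now() - interval '30 days'
    GROUP  BY 1
    ORDER  BY 1;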

There are probably many tools out there to help, but I wouldn't use a sledgehammer to hammer a nail.

[1] https://simpleanalytics.com/simpleanalytics.com




I would surely look at ClickHouse for such a use case.

If you're rolling your own solution (rather than using a DBaaS), ClickHouse is unparalleled when it comes to scaling and efficiency. Its SQL is somewhat restricted, but it should be able to do what you're looking for very well; it was born out of the needs of Yandex Metrica, which is quite a similar use case to yours.

If you're using Kubernetes, there is a great operator available from Altinity to run ClickHouse on it.


I’ve also had a great experience with clickhouse. Very easy to set up and maintain. Perhaps a little rough around the edges compared to Postgres, but I would look to clickhouse first for analytics.


> This is all working fine

Are there any specific reasons that you think Postgres won't be able to scale?

I don't have personal experience solving the problem you're describing but if you do choose to use something other than Postgres I think Apache Kafka might be interesting for some parts of your infrastructure at least.

Segment [0] has written quite a lot about their tech in a similar space.

[0] https://segment.com/blog/the-10m-engineering-problem/


Cloudflare documented their PostgreSQL to ClickHouse journey here: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

It covers a lot of the scaling bottlenecks in their original PG-based pipeline and how they eliminated them.

(Disclaimer: I work on ClickHouse.)


I think you are right; I'm kind of looking for a technical challenge when I actually should just improve the current setup. Thanks for the link!


I've built some dedicated analytics systems handling billions of events per month in the past, here are my comments:

1. Always know your limits. PG is impressive even at a large scale, so I would suggest you clone your existing database server and fill it with dummy data (a quick sketch of how to do that is below). After this, check how well it performs and when it breaks. Who knows, maybe you can live with 5x the traffic on your current setup.

2. I can personally recommend checking HBase[0] and thinking about OLAP[1] based analytics. The relational version of OLAP is very good, but it's not very scalable. However, you can build it on top of a NoSQL solution (like HBase) as I did in the past. It was very robust and performant.

3. Lastly, it doesn't matter which database you select, as long as you understand how it works and know its limits. But generally, you might want to move away from relational databases as they're not designed to deal with massive datasets (although you can always shard).

[0] https://hbase.apache.org/ [1] https://www.guru99.com/online-analytical-processing.html
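
For point 1, a minimal way to generate dummy data in Postgres is generate_series; something like this (table and column names are made up):

    -- insert ~50M fake page views spread over the last 90 days
    INSERT INTO page_views (site, url, referrer, ts)
    SELECT 'example.com',
           '/page-' || (random() * 500)::int,
           CASE WHEN random() < 0.2 THEN 'https://twitter.com' ELSE NULL END,
           now() - random() * interval '90 days'
    FROM   generate_series(1, 50000000);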


We also developed a product that collects user event data via SDKs (https://github.com/rakam-io/rakam). Depending on the data volume, users pick one of two deployment types. The first one is based on PostgreSQL 11 and takes advantage of partitioned tables (one table per configured time period, such as month or week), BRIN indexes, and materialized views.
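
Roughly, that Postgres 11 setup looks like this (illustrative names, not our actual schema):

    -- time-partitioned events table with a BRIN index on the timestamp
    CREATE TABLE events (
        event_type text        NOT NULL,
        properties jsonb,
        ts         timestamptz NOT NULL
    ) PARTITION BY RANGE (ts);

    CREATE TABLE events_2019_11 PARTITION OF events
        FOR VALUES FROM ('2019-11-01') TO ('2019-12-01');

    CREATE INDEX ON events_2019_11 USING brin (ts);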

That Postgres deployment is capable of handling 250M events per month without an issue, but beyond that we suggest users move to a data warehouse that supports horizontal scaling and a columnar storage engine, so we use Snowflake with Snowpipe, Kinesis or Kafka (pluggable), and S3.

I have also tried using ClickHouse, which is much cheaper, but it lacks tooling for automatic scaling.


I built Freshlytics using PipelineDB (Postgres extension) - https://github.com/sheshbabu/freshlytics

It uses continuous aggregations, so you don't need to manually run rollups every few minutes/hours. You don't need to store individual events, so it's better for users' privacy.

Here's the link to the sql - https://github.com/sheshbabu/freshlytics/blob/master/src/ser...
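
Roughly, a continuous aggregation looks like this with the PipelineDB 1.x extension (illustrative names, not the actual Freshlytics schema):

    -- events arrive via a stream (a foreign table) and are rolled up on write
    CREATE FOREIGN TABLE pageview_stream (site text, path text, ts timestamptz)
        SERVER pipelinedb;

    CREATE VIEW pageviews_daily WITH (action = materialize) AS
        SELECT site, path, date_trunc('day', ts) AS day, count(*) AS views
        FROM   pageview_stream
        GROUP  BY site, path, day;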


Have you tried to feed some extra generated data into your current setup? Maybe you'll be happy with the performance.

The queries sound like something that should be aggregated over time once and saved so you don't have to do any non-trivial query from the dashboard.

If you're actually pushing limits on the current database, then I don't think 2 weeks will be enough to get familiar enough with operations and failure modes of a new one.


I think I can still optimize the extra generated data. And I think you're right about the 2 weeks being too limited. I will optimize my processing scripts (those are the culprit now).


I use MySQL for running some analytics in my project. I thought about switching to a time-series database, but haven't done so because I could see it would give me a lot of trouble. I also ended up with the same solution as yours: running a worker that caches the different data for the dashboard plots. If you make the change, I would be happy to read about your implementation.


If you can meet your demands in PG currently, you will probably never need something as heavy as Spark.

If PG is currently working for you, then chances are it'll keep working if you optimise it (partitioning, clustering, etc).

If you do want something else, the particular use cases you described are pretty much exactly what Clickhouse was built for.
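
For reference, a minimal ClickHouse schema for this kind of data could look like this (names are made up):

    CREATE TABLE page_views
    (
        site     String,
        path     String,
        referrer String,
        ts       DateTime
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (site, ts);

    -- daily page views for one site
    SELECT toStartOfDay(ts) AS day, count() AS views
    FROM page_views
    WHERE site = 'example.com'
    GROUP BY day
    ORDER BY day;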


I've built a couple data warehouses around Google BigQuery - it's not free, but nothing is.

Think of it as petabyte scale SQL as a service - and a lot of things connect to it now, so you can use whatever tools you want for the visualization piece.


Thanks for your comment. It's not that I want a free product (I should have been clearer), but I would love a free-to-use license. I'm more than happy to run it on my own servers. Actually, that's a requirement.


Maybe CrateDB is worth a look.

https://crate.io/products/cratedb/


I use https://clickhouse.yandex for first-party analytics.


We also have very good experience with Google's BigQuery.


Look at Druid or Imply



