Ask HN: What database should I use for analytics data?
11 points by AdriaanvRossum on Nov 3, 2019 | 18 comments
Let me rephrase that. I'm running Simple Analytics where we collect page views, events and specific details about a page view (url, time, referrer, utm_source, screen width ...).

At the moment we use PostgreSQL, which serves us pretty well. We have some caching tables set up, and with our processing scripts the data is in customers' dashboards within 2 minutes. We roll our raw visits up into tables aggregated per hour and per day.

This is all working fine. But we want to start targeting bigger clients now and would love for the setup to hold up then as well.

A few things that are important to us when selecting a database:

- it is very popular (we want to solve issues fast)

- it is free to use (the license)

- it is easy to maintain (very important)

- it can run on my own servers

One that really appeals to us is Apache Cassandra with Apache Spark.

But we're not sure if we should just optimize our PostgreSQL workflow instead. I wouldn't mind investing 1 or 2 weeks of my time into setting up a new database, but it should also save time in the future. The common queries we will have are listed below (see our demo dashboard for what we show [1]; a rough SQL sketch of the first one follows the list):

- get (unique) page views for a website, grouped per 24 hours

- get a list of the most visited pages, with the percentage of referrals aggregated by day (/contact: twitter.com 20%, organic 80%)

- get conversions between events of sessions/list of events
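
For example, the first query looks roughly like this in our current Postgres setup (table and column names here are simplified, not our actual schema):

    -- daily (unique) page views for one site
    SELECT date_trunc('day', ts)       AS day,
           count(*)                    AS page_views,
           count(DISTINCT session_id)  AS unique_visitors
    FROM   page_views
    WHERE  site = 'example.com'
      AND  ts >= now() - interval '30 days'
    GROUP  BY 1
    ORDER  BY 1;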

There are probably many tools out there to help, but I wouldn't use a sledgehammer to hammer a nail.

[1] https://simpleanalytics.com/simpleanalytics.com




I would surely look at ClickHouse for such a use case.

If you're rolling your own solution (rather than using a DBaaS), ClickHouse is unparalleled when it comes to scaling and efficiency. Its SQL is somewhat restricted, but it should be able to do what you're looking for very well; it was born out of the needs of Yandex Metrica, which is quite a similar use case to yours.

If you're using Kubernetes, there is a great operator available from Altinity to run ClickHouse on it.


I’ve also had a great experience with clickhouse. Very easy to set up and maintain. Perhaps a little rough around the edges compared to Postgres, but I would look to clickhouse first for analytics.


> This is all working fine

Are there any specific reasons that you think Postgres won't be able to scale?

I don't have personal experience solving the problem you're describing but if you do choose to use something other than Postgres I think Apache Kafka might be interesting for some parts of your infrastructure at least.

Segment [0] has written quite a lot about their tech in a similar space.

[0] https://segment.com/blog/the-10m-engineering-problem/


Cloudflare documented their PostgreSQL to ClickHouse journey here: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

It covers a lot of the scaling bottlenecks in their original PG-based pipeline and how they eliminated them.

(Disclaimer: I work on ClickHouse.)


I think you are right; I'm kind of looking for a technical challenge when I actually should just improve the current setup. Thanks for the link!


I've built some dedicated analytics systems handling billions of events per month in the past, here are my comments:

1. Always know your limits. PG is impressive even at a large scale, so I would suggest you clone your existing database server and fill it with dummy data (a quick sketch of how to do that is below). After this, check how well it performs and when it breaks. Who knows, maybe you can live with 5x the traffic on your current setup.

2. I can personally recommend checking HBase[0] and thinking about OLAP[1] based analytics. The relational version of OLAP is very good, but it's not very scalable. However, you can build it on top of a NoSQL solution (like HBase) as I did in the past. It was very robust and performant.

3. Lastly, it doesn't matter which database you select, as long as you understand how it works and know its limits. But generally, you might want to move away from relational databases as they're not designed to deal with massive datasets (although you can always shard).

[0] https://hbase.apache.org/ [1] https://www.guru99.com/online-analytical-processing.html
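
For point 1, a minimal way to generate dummy data in Postgres is generate_series; something like this (table and column names are made up):

    -- insert ~50M fake page views spread over the last 90 days
    INSERT INTO page_views (site, url, referrer, ts)
    SELECT 'example.com',
           '/page-' || (random() * 500)::int,
           CASE WHEN random() < 0.2 THEN 'https://twitter.com' ELSE NULL END,
           now() - random() * interval '90 days'
    FROM   generate_series(1, 50000000);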


We also developed a product that collects user event data via SDKs (https://github.com/rakam-io/rakam). Depending on the data volume, users pick one of two deployment types. The first one is based on PostgreSQL 11 and takes advantage of partitioned tables (one table per configured time period, such as month or week), BRIN indexes, and materialized views.
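
Roughly, that Postgres 11 setup looks like this (illustrative names, not our actual schema):

    -- time-partitioned events table with a BRIN index on the timestamp
    CREATE TABLE events (
        event_type text        NOT NULL,
        properties jsonb,
        ts         timestamptz NOT NULL
    ) PARTITION BY RANGE (ts);

    CREATE TABLE events_2019_11 PARTITION OF events
        FOR VALUES FROM ('2019-11-01') TO ('2019-12-01');

    CREATE INDEX ON events_2019_11 USING brin (ts);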

That Postgres deployment is capable of handling 250M events per month without an issue, but beyond that we suggest users move to a data warehouse that supports horizontal scaling and a columnar storage engine, so we use Snowflake with Snowpipe, Kinesis or Kafka (pluggable), and S3.

I have also tried using ClickHouse, which is much cheaper, but it lacks tooling for automatic scaling.


I built Freshlytics using PipelineDB (Postgres extension) - https://github.com/sheshbabu/freshlytics

It uses continuous aggregations, so you don't need to manually run rollups every few minutes/hours. You don't need to store individual events, so it's better for users' privacy.

Here's the link to the sql - https://github.com/sheshbabu/freshlytics/blob/master/src/ser...
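
Roughly, a continuous aggregation looks like this with the PipelineDB 1.x extension (illustrative names, not the actual Freshlytics schema):

    -- events arrive via a stream (a foreign table) and are rolled up on write
    CREATE FOREIGN TABLE pageview_stream (site text, path text, ts timestamptz)
        SERVER pipelinedb;

    CREATE VIEW pageviews_daily WITH (action = materialize) AS
        SELECT site, path, date_trunc('day', ts) AS day, count(*) AS views
        FROM   pageview_stream
        GROUP  BY site, path, day;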


Have you tried to feed some extra generated data into your current setup? Maybe you'll be happy with the performance.

The queries sound like something that should be aggregated over time once and saved so you don't have to do any non-trivial query from the dashboard.

If you're actually pushing limits on the current database, then I don't think 2 weeks will be enough to get familiar enough with operations and failure modes of a new one.


I think I can still optimize the extra generated data. And I think you're right about the 2 weeks being too limited. I will optimize my processing scripts (those are the culprit now).


I use MySQL for running some analytics in my project. I thought about switching to a time-series database, but haven't done so because I could see it would give me a lot of trouble. I also ended up with the same solution as yours: running a worker that caches the different data for the dashboard plots. If you make the change, I would be happy to read about your implementation.


If you can meet your demands in PG currently, you will probably never need something as heavy as Spark.

If PG is currently working for you, then chances are it'll keep working if you optimise it (partitioning, clustering, etc).

If you do want something else, the particular use cases you described are pretty much exactly what Clickhouse was built for.
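
For reference, a minimal ClickHouse schema for this kind of data could look like this (names are made up):

    CREATE TABLE page_views
    (
        site     String,
        path     String,
        referrer String,
        ts       DateTime
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (site, ts);

    -- daily page views for one site
    SELECT toStartOfDay(ts) AS day, count() AS views
    FROM page_views
    WHERE site = 'example.com'
    GROUP BY day
    ORDER BY day;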


I've built a couple data warehouses around Google BigQuery - it's not free, but nothing is.

Think of it as petabyte scale SQL as a service - and a lot of things connect to it now, so you can use whatever tools you want for the visualization piece.


Thanks for your comment. It's not that I want a free product (I should have been clearer), but I would love a free-to-use license. I'm more than happy to run it on my own servers. Actually, that's a requirement.


Maybe CrateDB is worth a look.

https://crate.io/products/cratedb/


I use https://clickhouse.yandex for first-party analytics.


We also have very good experience with Google's BigQuery.


Look at Druid or Imply



