
Ask HN: What database should I use for analytics data? - AdriaanvRossum
Let me rephrase that. I'm running Simple Analytics where we collect page views, events, and specific details about a page view (URL, time, referrer, utm_source, screen width, ...).

At the moment we use PostgreSQL, which serves us pretty well. We have some caching tables set up, and with our processing scripts we have the data in customers' dashboards within 2 minutes. We move our raw visits into tables with visits aggregated per hour and per day.

This is all working fine. But we want to start targeting bigger clients now and would love to be fine then as well.

A few things that are important to us when selecting a database:

- it is very popular (we want to solve issues fast)

- it is free to use (the license)

- it is easy to maintain (very important)

- it can run on my own servers

One that really appeals to us is Apache Cassandra with Apache Spark.

But we're not sure if we should instead optimize our PostgreSQL workflow. I wouldn't mind investing 1 or 2 weeks of my time into setting up a new database, but it should also save time in the future. The common queries we will have are (see our demo dashboard to see what we show [1]):

- get (unique) page views from a website grouped per 24 hours

- get a list of most visited pages and show those with the percentages of referrals aggregated by day (/contact: twitter.com 20%, organic 80%)

- get conversions between events of sessions/list of events

There are probably many tools out there to help, but I wouldn't use a sledgehammer to hammer a nail.

[1] https://simpleanalytics.com/simpleanalytics.com
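
(The hourly/daily rollup described above can be sketched in plain SQL; the table and column names here — `visits`, `visits_daily`, `site_id`, `visitor_id`, `ts` — are hypothetical, not Simple Analytics' actual schema.)

```sql
-- Hypothetical raw table: visits(site_id, visitor_id, url, referrer, ts).
-- Daily rollup of total and unique page views:
INSERT INTO visits_daily (site_id, day, page_views, unique_visitors)
SELECT site_id,
       date_trunc('day', ts)      AS day,
       count(*)                   AS page_views,
       count(DISTINCT visitor_id) AS unique_visitors
FROM visits
GROUP BY site_id, date_trunc('day', ts);
```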
======
PeterZaitsev
I would surely look at ClickHouse for such a use case.

If you're rolling your own solution (rather than using DBaaS), ClickHouse is
unparalleled when it comes to scaling and efficiency. Its SQL is somewhat
restricted, but it should be able to do what you're looking for very well,
especially since it was born out of the needs of Yandex Metrica, which is
quite a similar use case to yours.

If you're using Kubernetes, there is a great Operator available from Altinity
to run ClickHouse on Kubernetes.
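
(For a flavor of what this looks like, here is a minimal sketch of a ClickHouse table for this kind of workload; the table and column names are made up, only the engine clause and `uniq()` are ClickHouse-specific.)

```sql
-- Hypothetical events table; MergeTree is ClickHouse's main analytics engine.
CREATE TABLE page_views (
    site_id    UInt32,
    visitor_id UInt64,
    url        String,
    referrer   String,
    ts         DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (site_id, ts);

-- Unique page views per day, computed on the fly over the columnar storage:
SELECT toDate(ts) AS day, count() AS views, uniq(visitor_id) AS uniques
FROM page_views
GROUP BY day;
```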

~~~
hwwc
I’ve also had a great experience with clickhouse. Very easy to set up and
maintain. Perhaps a little rough around the edges compared to Postgres, but I
would look to clickhouse first for analytics.

------
sixhobbits
> This is all working fine

Are there any specific reasons that you think Postgres won't be able to scale?

I don't have personal experience solving the problem you're describing but if
you do choose to use something other than Postgres I think Apache Kafka might
be interesting for some parts of your infrastructure at least.

Segment [0] has written quite a lot about their tech in a similar space.

[0] [https://segment.com/blog/the-10m-engineering-problem/](https://segment.com/blog/the-10m-engineering-problem/)

~~~
AdriaanvRossum
I think you are right; I'm kind of looking for a technical challenge when I
should actually just improve the current setup. Thanks for the link!

------
armis
I've built some dedicated analytics systems handling billions of events per
month in the past; here are my comments:

1\. Always know your limits. PG is impressive even at a large scale, so I
would suggest you clone your existing database server and fill it with dummy
data. After this, check how well it performs and when it breaks. Who knows,
maybe you can live with 5x the traffic on your current setup.
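
(In Postgres, filling a cloned schema with dummy rows is straightforward with `generate_series`; the `visits` table and its columns below are hypothetical.)

```sql
-- Insert 100M synthetic visits spread over the last year into a
-- hypothetical visits table, to see where queries start to hurt:
INSERT INTO visits (site_id, visitor_id, url, ts)
SELECT (random() * 100)::int,
       (random() * 1e7)::bigint,
       '/page-' || (random() * 1000)::int,
       now() - (random() * interval '365 days')
FROM generate_series(1, 100000000);
```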

2\. I can personally recommend checking HBase[0] and thinking about OLAP[1]
based analytics. The relational version of OLAP is very good, but it's not
very scalable. However, you can build it on top of a NoSQL solution (like
HBase) as I did in the past. It was very robust and performant.

3\. Lastly, it doesn't matter which database you select, as long as you
understand how it works and know its limits. But generally, you might want to
move away from relational databases, as they're not designed to deal with
massive datasets (although you can always shard).

[0] [https://hbase.apache.org/](https://hbase.apache.org/)
[1] [https://www.guru99.com/online-analytical-processing.html](https://www.guru99.com/online-analytical-processing.html)

------
buremba
We also developed a product that collects user event data via SDKs
([https://github.com/rakam-io/rakam](https://github.com/rakam-io/rakam));
depending on the data volume, users pick one of two deployment types. The
first one is based on PostgreSQL 11 and takes advantage of partitioned tables
(one table per event for the configured time period, such as month, week,
etc.), BRIN indexes, and materialized views.
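
(A sketch of the two PostgreSQL features mentioned above — declarative range partitioning by time plus a BRIN index; the names are illustrative, not Rakam's actual schema.)

```sql
-- Parent table partitioned by event time (PostgreSQL 10+ syntax):
CREATE TABLE events (
    event_name text,
    user_id    bigint,
    ts         timestamptz NOT NULL
) PARTITION BY RANGE (ts);

CREATE TABLE events_2019_09 PARTITION OF events
    FOR VALUES FROM ('2019-09-01') TO ('2019-10-01');

-- BRIN indexes are tiny and work well on append-only, time-ordered data:
CREATE INDEX events_2019_09_ts_brin ON events_2019_09 USING brin (ts);
```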

It's capable of handling 250M events per month without an issue, but beyond
that we suggest users use a data warehouse that supports horizontal scaling
and a columnar storage engine, so we use Snowflake with Snowpipe, Kinesis or
Kafka (pluggable), and S3.

I have also tried using ClickHouse, which is much cheaper, but it lacks
tooling for automatic scaling.

------
rkwz
I built Freshlytics using PipelineDB (Postgres extension) -
[https://github.com/sheshbabu/freshlytics](https://github.com/sheshbabu/freshlytics)

It uses continuous aggregations, so you don't need to manually run rollups
every few minutes/hours. You don't need to store individual events, so it's
better for users' privacy.

Here's the link to the sql -
[https://github.com/sheshbabu/freshlytics/blob/master/src/ser...](https://github.com/sheshbabu/freshlytics/blob/master/src/server/migrations/1561030715992_init.sql)
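
(A rough sketch of what PipelineDB's continuous aggregation looked like, from memory of the pre-1.0 `CONTINUOUS VIEW` syntax; the stream and column names are made up — see the linked SQL for Freshlytics' real schema.)

```sql
-- A stream receives events but never stores the raw rows:
CREATE STREAM page_view_stream (site_id int, path text, ts timestamptz);

-- The continuous view keeps only the rolled-up counts,
-- updated incrementally as events arrive:
CREATE CONTINUOUS VIEW daily_views AS
SELECT site_id, date_trunc('day', ts) AS day, count(*) AS views
FROM page_view_stream
GROUP BY site_id, day;
```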

------
viraptor
Have you tried to feed some extra generated data into your current setup?
Maybe you'll be happy with the performance.

The queries sound like something that should be aggregated over time once and
saved so you don't have to do any non-trivial query from the dashboard.

If you're actually pushing limits on the current database, then I don't think
2 weeks will be enough to get familiar enough with operations and failure
modes of a new one.

~~~
AdriaanvRossum
I think I can still optimize with some extra generated data. And I think
you're right about the 2 weeks being too limited. I will optimize my
processing scripts (those are the culprit now).

------
manceraio
I use MySQL for running some analytics in my project. I thought about
switching to a time series database, but haven't done so because I could see
it would give me a lot of trouble. I also ended up with the same solution as
you: running a worker that caches the different data for the dashboard plots.
If you make the change, I would be happy to read about your implementation.

------
missosoup
If you can meet your demands in PG currently, you will probably never need
something as heavy as Spark.

If PG is currently working for you, then chances are it'll keep working if you
optimise it (partitioning, clustering, etc).

If you do want something else, the particular use cases you described are
pretty much exactly what Clickhouse was built for.

------
mattbillenstein
I've built a couple data warehouses around Google BigQuery - it's not free,
but nothing is.

Think of it as petabyte scale SQL as a service - and a lot of things connect
to it now, so you can use whatever tools you want for the visualization piece.

~~~
AdriaanvRossum
Thanks for your comment. I don't want a free product (I should have been
clearer), but I would love a free-to-use license. I'm more than happy to run
it on my own servers. Actually, that's a requirement.

------
lusbuab
Maybe CrateDB is worth a look.

[https://crate.io/products/cratedb/](https://crate.io/products/cratedb/)

------
dylz
I use [https://clickhouse.yandex](https://clickhouse.yandex) for first-party
analytics.

------
saschaheyer
We also have very good experience with Google's BigQuery.

------
fauxfauxpas
Look at Druid or Imply.

