Let me rephrase that. I'm running Simple Analytics where we collect page views, events and specific details about a page view (url, time, referrer, utm_source, screen width ...).
At the moment we use PostgreSQL which serves us pretty well. We have some caching tables setup and with our processing scripts we have the data in customers dashboards within 2 minutes. We move our raw visits in tables with visits aggregated per hour and per day.
This is all working fine. But we want to start targeting bigger clients now and would love to be fine then as well.
A few things that are important to us when selecting a database:
- it is very popular (we want to solve issues fast)
- it is free to use (the license)
- it is easy to maintain (very important)
- it can run on my own servers
One that really appeals to us is Apache Cassandra with Apache Spark.
But we're not sure if we should optimize our PostgreSQL workflow. I wouldn't mind investing 1 or 2 weeks of my time into setting up a new database, but it should also save time in the future. The common queries we will have are (see our demo dashboard to see what we show [1]):
- get (unique) page views from a website grouped by per 24 hours
- get list of most visited pages and show those with the percentages of referrals aggregated by day (/contact: twitter.com 20%, organic 80%)
- get conversions between events of sessions/list of events
There are probably many tools out there to help, but I wouldn't use a sledgehammer to hammer a nail.
[1] https://simpleanalytics.com/simpleanalytics.com
If you're rolling your own solution (rather than using DBaaS) ClickHouse is unparalleled when it comes to scaling and efficiency. Its SQL is somewhat restricted but it should be able to do what you're looking very well, especially it was born out of the needs of Yandex Metrica which is quite similar use case of yours.
If you're using Kubernetes there is great Operator available from Altinity to run Clickhouse on Kubernetes.