I'm using VictoriaMetrics instead of Prometheus, am I doing something wrong?
I have Zabbix as well as node_exporter and Percona PMM for MySQL servers, because sometimes it is hard to configure the Prometheus stack for SNMP when Zabbix covers that case out of the box.
Prometheus itself is pretty simple and fairly robust, but it doesn't scale as well for long-term storage. Things like VictoriaMetrics, Mimir, and Thanos tend to be a bit more scalable for longer-term storage of metrics.
For a few hundred gigs of metrics, I’ve been fine with Prometheus and some ZFS-send backups.
Just to expand on some experiences with the software listed.
The architecture is quite different between Thanos and the others you've listed: unlike the others, Thanos fans queries out to the remote Prometheus instances for hot data, while older data (typically more than 2 hours old) is shipped via a sidecar to S3 storage.
Because query routing depends on Prometheus external labels, our developers' queries would often fan out unnecessarily to multiple Prometheus instances. This is because our developers usually search for metrics by a service name or some other service-related label, rather than by the external labels Thanos uses, which describe where the workload runs.
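To make the fan-out concrete, here's a toy Python sketch of that routing decision. The instance names and labels are made up and the real logic lives inside Thanos Query; this is just the shape of the problem: an instance can only be skipped when the query pins an external label to a value that instance doesn't have, so service-only selectors hit everything.

    # Toy model of external-label-based store pruning (hypothetical names/labels).
    prometheus_instances = {
        "prom-eu-1": {"cluster": "eu-1", "replica": "a"},
        "prom-us-1": {"cluster": "us-1", "replica": "a"},
        "prom-us-2": {"cluster": "us-2", "replica": "a"},
    }

    def stores_for_query(matchers):
        """Return the instances that cannot be ruled out by external labels alone."""
        selected = []
        for name, ext_labels in prometheus_instances.items():
            # Skip an instance only if the query pins an external label to a
            # value this instance does not have.
            if all(ext_labels[k] == v for k, v in matchers.items() if k in ext_labels):
                selected.append(name)
        return selected

    # Query filtered on an external label: routed to a single instance.
    print(stores_for_query({"cluster": "eu-1", "job": "node"}))  # ['prom-eu-1']

    # Query filtered only on service-level labels: fans out to every instance.
    print(stores_for_query({"service": "checkout"}))             # all three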
Upon identifying this, I migrated to Mimir, and we saw immediate drops in query response times for developer queries, which no longer have to wait for the slowest Prometheus instance before displaying the data.
We've also since adopted OpenTelemetry in our workloads and directly ingest OTLP into Mimir (which VictoriaMetrics also supports).
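For what it's worth, the ingest path looks roughly like this in Python. The gateway address and tenant name are hypothetical, and the /otlp/v1/metrics path is Mimir's OTLP ingest endpoint as I understand it (check your own gateway URL and auth); VictoriaMetrics exposes its own OTLP endpoint.

    from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
    from opentelemetry.metrics import get_meter, set_meter_provider
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

    # Hypothetical gateway URL; /otlp/v1/metrics is Mimir's OTLP ingest path.
    exporter = OTLPMetricExporter(
        endpoint="http://mimir-gateway:8080/otlp/v1/metrics",
        headers={"X-Scope-OrgID": "dev-tenant"},  # multi-tenant setups need a tenant header
    )
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
    set_meter_provider(MeterProvider(metric_readers=[reader]))

    # From here on, instrumentation is plain OpenTelemetry.
    meter = get_meter("checkout-service")
    requests_total = meter.create_counter("http_requests_total")
    requests_total.add(1, {"route": "/pay", "code": "200"})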
I wrote an extensive reply to this but unfortunately the HN servers restarted and lost it.
The TL;DR was that from where I stand, you’re doing nothing wrong.
At a previous client we ran Prometheus for months, then Thanos, and eventually we implemented VictoriaMetrics and everyone was happy. It became an order of magnitude cheaper because we could use spinning rust for storage, while still getting better performance. It was very easily scalable, mostly automatically, with no practical limit.
The “non-compliant” bits of the query language turned out to be fixes to the UX and other issues. Lots of new functions and features.
Support was always excellent.
I’m not affiliated with them in any way. Was always just a very happy freeloading user.
I have deployed lots of metrics systems, starting with Cacti and moving through Graphite, KairosDB (which used Cassandra under the hood), Prometheus, Thanos, and now Mimir.
What I've realised is that they're all painful to scale 'really big'. One Prometheus server is easy, and you can scale vertically and go pretty big. But you need to think about redundancy, and you want to avoid accidentally ending up running 50 Prometheus instances, because that becomes a pain for the Grafana people, unless you use an aggregating proxy like Promxy. Even then you have issues running aggregating functions across all of the instances. You need to think about expiring old data and possibly aggregating it down into a smaller set so you can still look at certain charts over long periods. What's the Prometheus solution here? MOAR INSTANCES. And reads need to be performant, or you'll have very angry engineers during the next SEV1 because their dashboards aren't loading. So you throw in an additional caching solution like Trickster (which rocks!) between Grafana and the metrics. Back in the KairosDB days you had to know a fair bit about running Cassandra clusters, but these days it's all baked into Mimir.
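As a concrete example of the "aggregating functions across all of the instances" pain, this is roughly what you end up doing by hand when there's no global query layer (instance addresses and the query are made up):

    import requests

    # Hypothetical setup: three Prometheus instances with no global query layer.
    PROMS = ["http://prom-a:9090", "http://prom-b:9090", "http://prom-c:9090"]
    QUERY = "sum(rate(http_requests_total[5m]))"

    def instant_query(base_url, query):
        resp = requests.get(f"{base_url}/api/v1/query", params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    # Sums can be merged client-side like this, but averages, quantiles, topk, etc.
    # cannot be combined correctly from per-instance results, which is why a proper
    # global query layer (Promxy, Thanos Query, Mimir, ...) exists.
    total = sum(instant_query(url, QUERY) for url in PROMS)
    print(f"global request rate: {total:.1f} req/s")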
I'm lucky enough to be working for a smaller company right now, so I don't have to spend a lot of time tending to the monitoring systems. I love that Mimir is basically Prometheus backed by S3, with all of the scalability and redundancy features built in (though you still have to configure them in large deployments). As long as you're small enough to run their monolithic mode, you don't have to worry about individually scaling half a dozen separate components. The actual challenge is getting the $CLOUD side of it deployed, and then passing roles and buckets to the nasty Helm charts while still making it easy to configure the ~10 things that you actually care about. Oh, and the Helm charts and underlying configs are still not rock solid, so upgrades can be hairy.
Ditto all of that for logging via Loki.
It's very possible that Mimir is no better than VictoriaMetrics, but unless it burns me really badly I think I'll stick with it for now.
Well, they claim superior performance (which might be true), but the costs are high and include a small community, low-quality APIs, best-effort correctness/PromQL compatibility, and FUD marketing, so I decided to go with the de facto standard, which has none of the issues above.
No costs if you're hosting everything. It does scale better and has better performance. Used it and have nothing bad to say about it. For the most part a drop-in replacement that just performs better. Didn't run into PromQL compatibility issues with off-the-shelf Grafana dashboards.
I am on mobile, so I cannot really link GitHub examples, but I'd recommend that anyone considering VM over Prometheus take a cursory look at how similar things are implemented in both projects, and at what shortcuts were made in the name of getting "better performance".
Performance-wise, for example, VictoriaMetrics' prometheus-benchmark only covered instant queries without lookback the last time I checked.
Regarding FUD marketing: all Prometheus community channels (mailing lists, StackOverflow, Reddit, GitHub, etc.) are full of VM devs pushing VM and bashing everything from the ecosystem without mentioning any of the tradeoffs. I am also not aware of VictoriaMetrics giving anything back to the Prometheus ecosystem (can you link some examples if I am wrong?), which is very similar to Microsoft's embrace, extend, and extinguish strategy.
As for recent concrete examples, here are two submissions of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531, https://news.ycombinator.com/item?id=39391208, but it's really hard to avoid all the rest in the places mentioned above.
> Performance-wise, for example, VictoriaMetrics' prometheus-benchmark only covered instant queries without lookback the last time I checked.
prometheus-benchmark ( https://github.com/VictoriaMetrics/prometheus-benchmark ) tests CPU usage, RAM usage, and disk usage for typical alerting queries. It doesn't test the performance of queries used for building graphs in Grafana, because the typical rate of alerting queries is multiple orders of magnitude higher than the typical rate of queries for building graphs, i.e. alerting queries generate most of the load on CPU, RAM, and disk IO in a typical production workload.
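To illustrate the two query shapes being compared here, this is what they look like against the standard Prometheus HTTP API (the address and the queries themselves are just examples):

    import time
    import requests

    PROM = "http://prometheus:9090"  # hypothetical address

    # Alert-rule-style evaluation: one instant query at "now"; this is the shape
    # issued at very high rates, so it dominates CPU, RAM and disk IO.
    alert = requests.get(
        f"{PROM}/api/v1/query",
        params={"query": "sum(rate(http_errors_total[5m])) > 10"},
        timeout=10,
    ).json()

    # Grafana-graph-style evaluation: a range query over the last hour with a step;
    # heavier per request, but issued far less often.
    now = time.time()
    graph = requests.get(
        f"{PROM}/api/v1/query_range",
        params={
            "query": "sum(rate(http_requests_total[5m]))",
            "start": now - 3600,
            "end": now,
            "step": 15,
        },
        timeout=10,
    ).json()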
This submission posts a link to the real-world experience of a long-term user of Grafana Loki. This user points to various issues in the applications he uses. For example:
As you can see, this user shares his extensive experience with Grafana Loki and continues using it despite the fact that a much better solution exists which is free from all of the Loki issues: VictoriaLogs. This user isn't affiliated with VictoriaMetrics in any way.