
Big Prometheus: Thanos, Cortex, M3DB and VictoriaMetrics at Scale - davidmr
https://monitoring2.substack.com/p/big-prometheus
======
cfors
Before anybody thinks that they need something like this at work: I have seen
single-node HA Prometheus setups handle metrics at one of the largest CDNs in
the country.

Reddit's own Kubernetes infrastructure team uses single-node (pod) Promethei
as well. [0]

If you look at all of the components that are required to run Thanos [1], the
operational complexity is incredibly high. I know it's a shiny tool that is
super cool, but please make sure you have an actual need for some of these
before devoting resources to them.

[0] [https://www.reddit.com/r/kubernetes/comments/ebxrkp/we_are_t...](https://www.reddit.com/r/kubernetes/comments/ebxrkp/we_are_the_reddit_infrastructure_team_ama_about/fbbk7jk/)

[1] [https://improbable.io/blog/thanos-prometheus-at-scale](https://improbable.io/blog/thanos-prometheus-at-scale)

~~~
jrv
Prometheus author here (Julius).

Generally agreed that you can get far with a single Prometheus server (or many
independent vanilla Prometheus servers, potentially also using Prometheus's
own federation). But I still recommend Thanos as an extension to a lot of
people. I like Thanos because it's so easy to deploy alongside an existing
Prometheus installation, while itself being mostly stateless (long-term state
is kept in object storage), and it gives people:

- a global view over multiple Prometheus servers
- a deduplicated view over servers in an HA pair
- durable long-term storage for little cost

The Thanos architecture diagrams (especially the one in their README.md) can
look a bit intimidating, but I find it sooo easy to get started with in
practice, since you don't even need to deploy all of the components to begin
with. I usually tell people to just drop in a Thanos sidecar next to each of
their Prometheus servers, so they get backups of all their Prometheus server
data (for those who are interested in long-term data retention). Then later
they can add the Querier component for an integrated view over multiple
servers. Then later still they can deploy the Store gateway to integrate
long-term data back into that view. And at some point, the compactor...
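
As an illustration of that first step, a minimal sidecar invocation looks
roughly like this (flag names are from the Thanos docs; the paths, URL and
bucket config file are placeholders):

    thanos sidecar \
      --prometheus.url http://localhost:9090 \
      --tsdb.path /var/prometheus \
      --objstore.config-file bucket.yml \
      --grpc-address 0.0.0.0:10901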

Without being a Thanos expert, it took me ~15 minutes to deploy all those
components (+ Minio for object storage) in front of a training audience that
wanted to know more about Thanos (while reading the Thanos + Minio docs). Of
course a proper production deployment always takes way more time, but I still
like how conceptually simple it is to integrate Thanos with Prometheus.
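
For reference, the bucket.yml mentioned above would point Thanos at an
S3-compatible store such as Minio; a sketch with placeholder values:

    type: S3
    config:
      bucket: thanos
      endpoint: minio.example.internal:9000
      access_key: <key>
      secret_key: <secret>
      insecure: true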

~~~
gnrl
Did you by any chance record this?

~~~
jrv
No, sorry, it was a private commercial training.

------
ibspoof
When my team agreed to use Prometheus from the client side, we looked at
Thanos, Cortex, and M3DB, but none of them gave us the flexibility and comfort
of adoption for a small team providing a service to tens of internal groups.
We have many private internal DCs and needed metrics to be stored in the
cloud; pulling data to the cloud seemed awkward and required access rights we
couldn't get.

We ended up using Postgres 10 w/ TimescaleDB and their Prometheus plugin, with
a simple emulated push gateway that converts a Prometheus-formatted HTTP POST
into a Postgres batch insert. Postgres is 3 nodes monitored with Patroni.

It's working great for us, handling 1,000+ metrics a second with ease, and we
get SQL both for real-time monitoring and for business analytics. We are using
about 10-15% of our systems' capacity, giving us room to grow.
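
Their gateway code isn't shown, but as a rough sketch of the idea - accept a
Prometheus text-format POST, parse it, and write all samples inside a single
transaction - something like this in Go (the /push endpoint, connection
string and "samples" table are invented; their real setup presumably uses the
TimescaleDB plugin's own schema):

    // Hypothetical stand-in for the emulated push gateway described above.
    package main

    import (
        "database/sql"
        "log"
        "net/http"
        "time"

        _ "github.com/lib/pq"
        "github.com/prometheus/common/expfmt"
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://metrics@localhost/metrics?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }

        http.HandleFunc("/push", func(w http.ResponseWriter, r *http.Request) {
            // Parse the standard Prometheus text exposition format.
            var parser expfmt.TextParser
            families, err := parser.TextToMetricFamilies(r.Body)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            // Batch all samples from one POST into a single transaction.
            tx, err := db.Begin()
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            now := time.Now()
            for name, mf := range families {
                for _, m := range mf.GetMetric() {
                    // Counters and gauges carry their value in different fields.
                    var v float64
                    switch {
                    case m.GetGauge() != nil:
                        v = m.GetGauge().GetValue()
                    case m.GetCounter() != nil:
                        v = m.GetCounter().GetValue()
                    default:
                        continue // other metric types omitted in this sketch
                    }
                    // "samples" is assumed to be a TimescaleDB hypertable:
                    //   (time timestamptz, name text, value double precision)
                    if _, err := tx.Exec(
                        `INSERT INTO samples (time, name, value) VALUES ($1, $2, $3)`,
                        now, name, v,
                    ); err != nil {
                        tx.Rollback()
                        http.Error(w, err.Error(), http.StatusInternalServerError)
                        return
                    }
                }
            }
            tx.Commit()
        })

        log.Fatal(http.ListenAndServe(":9091", nil))
    }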

~~~
jrv
> We have many private internal DCs and needed metrics to be stored in the
> cloud; pulling data to the cloud seemed awkward and required access rights
> we couldn't get.

You mean running a Prometheus server in the cloud, but then pulling on-prem
things from the cloud? I'm not sure how either Cortex or Thanos would require
that, as you'd still run on-prem Prometheus servers for them; the collected
data is then pushed to the cloud in the end. But maybe I'm misunderstanding
what you mean here.

> Working great for us and handling 1000+ metrics a second

Curious about this - I would expect any system to be able to do that easily,
as that's a tiny, tiny amount. A single big Prometheus server can do roughly
1000x that (I think someone once managed to ingest 1M samples/second).

------
valyala
VictoriaMetrics author here. I like the post, since it is cleanly written and
isn't biased toward a certain solution. I'd recommend that readers try all the
mentioned solutions - Thanos, Cortex, M3DB and VictoriaMetrics - and then
choose the one that fits them best.

Each solution has its own strengths and weaknesses. The main selling points
for VictoriaMetrics are:

* Operational simplicity. This is especially true for the single-node version, which is a single self-contained binary without any dependencies. It is configured by a few command-line flags, while the rest of the configs have sane defaults, so they shouldn't need to be touched in most cases (see the sketch after this list).

* Low resource usage (CPU, RAM, disk space and iops, network bandwidth).

* High performance.
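
As an illustration, starting the single-node version comes down to something
like this (values are placeholders; per the docs, -retentionPeriod is in
months):

    ./victoria-metrics-prod \
      -storageDataPath=/var/lib/victoria-metrics \
      -retentionPeriod=12 \
      -httpListenAddr=:8428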

See also an interesting talk from PromCon 2019, where all these solutions are
compared by the Adidas monitoring team [1].

[1] [https://promcon.io/2019-munich/talks/remote-write-storage-wars/](https://promcon.io/2019-munich/talks/remote-write-storage-wars/)

------
raisingtable
So here is my case. I'm running multiple Prometheus HA pairs to cover
different teams. At the moment, I'm using Thanos and VictoriaMetrics in
parallel to test them out.

Thanos was the first one I set up, as VM wasn't open-sourced yet. It wasn't
hard to set up, and I had it running in about a day together with Minio as an
S3 backend. To this day it's running without a problem, apart from an alarm
every now and then that the Store or Compactor couldn't get something done.
But I haven't looked too much into it, since everything graphing-wise seems to
work. Upgrades are also easy, and I love the global querier option. I
sometimes see people on Slack having OOMs on rather "large" servers, but the
Thanos team is supposed to be working on optimizing memory usage, and it's
getting better and better.

After the last PromCon, I also configured VictoriaMetrics. Installation was as
simple as it can be, way simpler than Thanos, but I'm using the single-node
version. It has worked really well for the last 3 months. Resource usage is a
lot lower than Thanos's.

Both solutions have their own Slack channels with developers and users there,
so it is easy to get help and resolve issues.

In the end, I think I'll go with VM in my case, since it has fewer moving
parts, doesn't need an S3 backend (we are on-prem and don't have production S3
storage) and has lower resource usage. It can also ingest InfluxDB metrics,
which is a massive bonus for me, since the NOC team is using a solution that
can only send metrics to InfluxDB (snmpcollector).
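
That works because VictoriaMetrics accepts the InfluxDB line protocol
directly on its /write endpoint; a quick smoke test against a single-node
instance (measurement, host and value invented):

    curl -d 'snmp,host=sw1 ifInOctets=123456' http://victoria-metrics:8428/write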

------
rektide
I enjoyed the post. Good links to a lot of relevant, recent stories & events.

Not the article's fault, but it cites the "ClickHouse Cost-Efficiency in
Action: Analyzing 500 Billion Rows on an Intel NUC" article that was published
January 1. It's a week old, and I kind of feel like I'm never going to get
away from it. It seems like a great, fun, interesting premise, but the authors
took what is a challenging, huge data set and, under the guise of making the
data look "realistic", drained all the entropy out of it, then claimed they
were 10-100x faster.

Well, yes, maybe - for some workloads. The changes they made might, in some
circumstances, be "realistic" for some IoT use cases, maybe.

But I feel like I'm going to see this article come up again, and again, and
again. And each time, I'll have these frustrations about how, while they may
still be running queries on the same number of rows, they are running queries
on many orders of magnitude less data. It's a fun read, and genuinely useful -
in some circumstances - tech, but I don't expect to see this nuance showing
up. I'm already weary of seeing this ClickHouse article again.

~~~
manigandham
The difference between rowstores (Scylla/Cassandra) and columnstores
(ClickHouse) comes down to the physical layout of data, together with
batch/vectorized processing and other techniques.

There will always be a 1-2 order-of-magnitude increase in performance
regardless of the data. They also used the same number of rows, just with
smaller cardinality in the measurements, which would make an insignificant
speed difference.

------
netingle
Cortex author here (Tom Wilkie). Great post that honestly highlights the
differences between these systems - thank you!

The biggest take-home here - and the first thing the post mentions - is that a
single HA pair of Prometheus servers is enough for 80-90% of people. TL;DR:
you probably don't need Cortex (or Thanos, etc.)...

...unless you run multiple, segregated networks (regions). Then something like
Thanos (or Cortex) is useful - not for the scale argument, but because you
need a way to "federate" queries and get that global view. IMO!

~~~
chucky_z
Isn't this the whole point of the federate endpoint? That you just run a
central Prometheus pair to federate metrics at low resolution from a ton of
places?

I only care about high-resolution metrics for alerts. Otherwise I can just
take a handful of them at 5m intervals, but from a lot of places.
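
As a sketch, the central pair's scrape config for this would look roughly
like the following (targets and the match[] expression are illustrative; the
5m interval follows the comment above):

    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 5m
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~"job:.*"}'  # only pre-aggregated recording rules
        static_configs:
          - targets:
            - 'prometheus-dc1:9090'
            - 'prometheus-dc2:9090'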

~~~
jrv
There are different tradeoffs... Prometheus's own federation is pretty simple
scrape-time federation: a Prometheus server pulls over the most recent samples
of a subset of another Prometheus server's metrics on an ongoing basis. Thanos
does query-time federation instead, rather than actually collecting and
persisting data for all "federated" servers in a central place (other than
e.g. the S3 bucket for long-term data). So with Prometheus federation you have
to choose pretty carefully which aggregated stats you'd like to pull into some
higher-up Prometheus layer, and then you only have access to those (in that
server). Thanos lets you query the data in multiple Prometheus servers at
once, in all its detail.

And I think Cortex is mostly useful for people who want to run a big
centralized, multi-tenant service in their org to hold all of the global-view
and long-term data (most people tend to use Thanos).

------
freeseacher
Let me explain my experience with TSDB selection. In 2018 we realized that we
needed something for long-term storage. The selection was between Thanos,
Elasticsearch and M3DB.

M3DB looked promising, but after reading the issues and docs I found that I
would have to test it like a database, not like a drop-in solution; see for
example this topic:
[https://groups.google.com/forum/#!topic/m3db/6iG2NL7hJ7A](https://groups.google.com/forum/#!topic/m3db/6iG2NL7hJ7A)
And for Cortex and Thanos there was this tweet:
[https://twitter.com/fredbrancz/status/1043060822988259333](https://twitter.com/fredbrancz/status/1043060822988259333)

Elasticsearch was disqualified because of no remote_read support. So I stopped
looking for anything for at least half a year and just updated the retention
policy in Prometheus to 150d.
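
On Prometheus 2.7+ that is just a flag (older versions spelled it
--storage.tsdb.retention):

    --storage.tsdb.retention.time=150d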

VictoriaMetrics was also ruled out because the source code wasn't available.
And [https://github.com/akumuli/Akumuli](https://github.com/akumuli/Akumuli)
was ruled out because nobody had heard of it. :(

Then, after some time, VictoriaMetrics became open source, and there were no
issues with the rate function and useless extrapolation.

So I tested it on a small setup, at about 7k metrics per second on a single
server, and it was amazing. Then 14k/s and 20k/s. Previously the volume for
Prometheus data was about 30 GB on the smallest install and 70 GB on the
largest. Moving from storing 30 days in Prometheus to 90 days in VM was a huge
benefit. On each of the three instances, with 7k, 14k and 20k metrics per
second, I could extend retention 3-5x on the same volume. With the same
dashboards. With the same alerts. I just added remote read and remote write.
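
That last step is literally a couple of lines in prometheus.yml; a sketch,
assuming a single-node VictoriaMetrics on its default port 8428:

    remote_write:
      - url: http://victoria-metrics:8428/api/v1/write
    remote_read:
      - url: http://victoria-metrics:8428/api/v1/read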

Then I decided to take it to real-life, web-scale production. So I started
with 11 f2-micro servers on GCP:

* 3 storage nodes
* 2 insert nodes
* 2 select nodes
* 2 promxy
* 1 Grafana
* 1 self-monitoring Prometheus
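
Wiring up the VictoriaMetrics roles in that list looks roughly like this
(host names are placeholders; 8400 and 8401 are the default ports vmstorage
exposes to vminsert and vmselect respectively):

    ./vmstorage-prod -storageDataPath=/var/lib/vmstorage
    ./vminsert-prod  -storageNode=storage-1:8400,storage-2:8400,storage-3:8400
    ./vmselect-prod  -storageNode=storage-1:8401,storage-2:8401,storage-3:8401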

I got lots of expected OOMs at 60k per second. Then I moved to n1-standard-1
for the storage and insert nodes. That handled an insert load of about 650k
per second for several weeks without OOMs or any unexpected behavior. It was
real-life data from prometheus-operator on one of our RC clusters:
node-exporters, application metrics and kubemetrics.

I then tuned the storage nodes up to n1-highmem-2 to get enough headroom for
background merges and so on.

I also copied my Prometheus rules from prod to promxy (about 200 in total).
That generates some read traffic: I got about 70 reads per second and roughly
90% CPU utilization on the promxy servers, but almost no additional CPU load
on the VM servers. Then I bumped the time ranges in the queries - seconds to
minutes, minutes to hours, and hours to days - in every query that has an
offset, rate or increase. That added some load to VM, but not as much as I
expected.
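
As a hypothetical example of such a bump (metric name invented):

    rate(http_requests_total[5m])  ->  rate(http_requests_total[5h])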

In summary, I'm amazed by the simplicity of the scheme I ended up with.
Performance is also great. My dashboards look the same in Prometheus and
VictoriaMetrics.

Oh, by the way, I have some experience asking questions in the issue trackers
of both Prometheus and VictoriaMetrics, and I honestly prefer Aliaksandr's
style of answering - long and with good under-the-hood info.

