Introducing Thanos: Prometheus at Scale (improbable.io)
on May 18, 2018

Really great project. Instead of a fragile centralized approach, Thanos embraced the federated nature of Prometheus. We can see ourselves using this for GitLab.com. I just discussed Thanos with Ben, who works on the Prometheus project itself and leads monitoring for GitLab: https://youtu.be/JzlwwGZ3yQ4

My interpretation is that Thanos is for massive scale, perhaps suitable for GitLab, but most organisations would manage with just Prometheus to start with and might scale up to Thanos later.

Is this a correct observation?

Either that, or just use a basic Thanos setup (as described in the blog post), which gives you better Prometheus HA support and a global view. You can always add long-term metric retention later on.

Yes, a single Prometheus server can probably handle the scale of most SaaS applications.

Great project - just wanted to mention Cortex which is a different take on long-term large-scale Prometheus.


Best to read the design document [1] to learn more about the different choices that Cortex makes.

Also no connotations of large-scale destruction :-)

(Note I work on Cortex)

1: http://goo.gl/prdUYV

Is Weaveworks Cortex completely open source? Does it require Weave Cloud subscription to post messages to Slack?

Yes, it is open source: the whole project is on GitHub, Apache 2 licenced.

Posting to Slack (and OpsGenie, PagerDuty, etc.) is a feature of the Prometheus Alertmanager, which Cortex builds upon; no subscription is required.

The commercial Weave Cloud product gives you a hosted instance of the same code, storage - we ingest all your metrics and store them for a year - and a nice GUI with user logins, team permissions, etc. Plus the Deploy and Explore features which are hosted versions of two more Open Source projects.

(disclaimer: I work at Improbable)

The reason why we built Thanos was to enable monitoring of large scale simulation systems, which are inherently stateful, such as the Survival demo (link https://youtu.be/lGWON5TtS04).

We use Thanos to provide observability features (monitoring in particular) for Workers (i.e. user processes we run on our Cloud) that perform the simulation. You can have multiple Workers collaborating on a simulation of economics or ecology, each exporting monitoring variables that you want to track.

Since the simulation is inherently dynamic and the number of Workers can change, Thanos helps us achieve the scale and retention necessary for a hosted platform like SpatialOS.

I would love to hear any experiences folks here have with this. We are seriously looking at it right now.

(disclaimer: Blog post co-author) You are welcome to join our growing community to learn more. (: Follow the Slack join button here: https://github.com/improbable-eng/thanos

I suggest the second part of this talk from the last KubeCon: https://www.youtube.com/watch?v=IpGfmmJ2hcw

Wow. Nice to see my talk here. If you have any questions, feel free to ask.

Fantastic, thanks!

Thanos seems like the closest thing to a silver bullet for the features Prometheus is missing (by design).

Quick question:

In a multi-Prometheus setup, if all the Thanos nodes are behind a load balancer (without sticky sessions), does a particular query from a dashboard interface like Grafana to that load balancer return the same dataset if run multiple times?

If by "Thanos nodes" you mean Thanos querier instances, then yes: it does not matter which one you actually ask. All of them have the same view, with access to both the old metrics (Store Gateway) and the fresh ones (Prometheus + Sidecar, i.e. the scraper).
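
A toy model may help show why a load balancer without sticky sessions is fine here. This is only a sketch under assumptions: the `Store` and `Querier` classes below are hypothetical stand-ins, not Thanos code, and the merge step is a simple set union rather than whatever Thanos actually does. The point it illustrates is that a querier holding no data of its own, and reading from the same set of sources, gives the same answer regardless of which instance is asked.

```python
class Store:
    """Hypothetical data source (stands in for a Store Gateway or Prometheus + Sidecar)."""

    def __init__(self, samples):
        self.samples = samples  # list of (timestamp, value) pairs

    def select(self, start, end):
        # Return the samples whose timestamp falls in [start, end].
        return [(t, v) for t, v in self.samples if start <= t <= end]


class Querier:
    """Hypothetical stateless querier: keeps references to stores, no data of its own."""

    def __init__(self, stores):
        self.stores = stores

    def query(self, start, end):
        # Fan out to every store, merge with a set union, and sort by timestamp.
        merged = set()
        for store in self.stores:
            merged.update(store.select(start, end))
        return sorted(merged)


old_metrics = Store([(1, 0.5), (2, 0.7)])    # e.g. data already in long-term storage
fresh_metrics = Store([(2, 0.7), (3, 0.9)])  # e.g. data still held by a scraper

# Two querier instances behind the (imaginary) load balancer, same stores.
querier_a = Querier([old_metrics, fresh_metrics])
querier_b = Querier([fresh_metrics, old_metrics])

assert querier_a.query(1, 3) == querier_b.query(1, 3)
```

In this model the order in which a querier contacts the stores does not affect the result, because the merged samples are sorted before being returned.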

Thanks, that's exactly what I was looking for

Storing the Prometheus data in long term storage raises one question for me... what is the process for upgrading the TSDB data format when it changes over time?

The format has already gone through one change since Thanos was started. The format encodes its own version, and Thanos simply supports reading multiple versions.
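
The general pattern described here (data that carries its own version, plus a reader that dispatches on it) can be sketched roughly as follows. Everything in this snippet is an assumption made for illustration: the `version` and `ulid` fields, the `read_v1`/`read_v2` functions and `open_block` are hypothetical names, not Thanos or Prometheus APIs.

```python
import json


def read_v1(meta):
    # Hypothetical reader for blocks written in format version 1.
    return {"format": 1, "ulid": meta["ulid"]}


def read_v2(meta):
    # Hypothetical reader for blocks written in format version 2.
    return {"format": 2, "ulid": meta["ulid"]}


# One reader per supported version; old blocks stay readable after a format change.
READERS = {1: read_v1, 2: read_v2}


def open_block(meta_json):
    """Pick a reader based on the version recorded in the block's own metadata."""
    meta = json.loads(meta_json)
    version = meta.get("version")
    if version not in READERS:
        raise ValueError(f"unsupported block format version: {version!r}")
    return READERS[version](meta)
```

With this shape, supporting a new format means registering another reader, and blocks already sitting in long-term storage do not have to be rewritten.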

What app did you use to make those beautiful diagrams?

My hands + Google Drawing (: (blog post co-author here)

Can Grafana query Thanos directly, or will a datasource plugin be required?

The Thanos query nodes have the same interface as Prometheus itself, including the web UI (with a few small changes), so you can just use the same Prometheus plugin pointed at Thanos.
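
To make "the same interface" a bit more concrete, here is a minimal sketch that builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The host names and ports are made up for the example, and `instant_query_url` is a hypothetical helper. The idea is that only the base URL differs between the two targets.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build a Prometheus HTTP API instant-query URL for the given base URL."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)  # optional evaluation timestamp
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical endpoints: the same request shape, different base URL.
prometheus_url = instant_query_url("http://prometheus.example:9090", "up")
thanos_url = instant_query_url("http://thanos-query.example:10902", "up")
```

A client such as Grafana's Prometheus datasource would, under this assumption, only need its URL setting changed to point at the Thanos query node.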

