Cortex author here (Tom Wilkie). Great post that honestly highlights the differences between these systems - thank you!
The biggest take-home here - and the first thing the post mentions - is that a single HA pair of Prometheus servers is enough for 80-90% of people. TLDR you probably don’t need Cortex (or Thanos, etc)...
...unless you run multiple, segregated networks (regions). Then something like Thanos (or Cortex) is useful - not for the scale argument, but because you need a way to “federate” queries and get that global view. IMO!
Isn't this the whole point of the federate endpoint? That you just run a central Prometheus pair to federate metrics at low resolution from a ton of places?
I only care about high resolution metrics for alerts. Otherwise I can just take a handful of them at 5m intervals, but from a lot of places.
There are different tradeoffs... Prometheus's own federation is a pretty simple scrape-time federation - a Prometheus server pulls over the most recent samples of a subset of another Prometheus server's metrics on an ongoing basis. Thanos does query-time federation rather than actually collecting and persisting data for all "federated" servers in a central place (other than, e.g., the S3 bucket for long-term data). So with Prometheus federation you have to choose pretty carefully which aggregated stats you'd like to pull into some higher-up Prometheus layer, and then you only have access to those (in that server). Thanos allows you to query over the data in multiple Prometheus servers at once, in all their detail.
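To make the scrape-time side concrete, here's a rough sketch of the kind of URL a higher-level Prometheus ends up scraping from a lower-level one. The `/federate` path and the `match[]` parameter are what Prometheus exposes; the `federate_url` helper, the hostname, and the selectors are made up for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a URL for a Prometheus server's /federate endpoint.

    Each selector is sent as its own match[] query parameter; only
    series matching at least one selector are returned.
    """
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical regional server and selectors: pull only pre-aggregated
# recording-rule series plus one job, not everything the server holds.
url = federate_url(
    "http://prometheus-eu.example:9090",
    ['{__name__=~"job:.*"}', '{job="node"}'],
)
print(url)
```

Whatever those selectors leave out simply isn't available in the central server, which is the "choose pretty carefully" part.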
And I think Cortex is mostly useful for people who want to run a big centralized, multi-tenant service in their org to keep the global view and all the long-term data. (most people tend to use Thanos)