
CNCF's Cortex v1.0: scalable, fast Prometheus implementation - netingle
https://grafana.com/blog/2020/04/02/cortex-v1.0-released-the-highly-scalable-fast-prometheus-implementation-is-generally-available-for-production-use/
======
nopzor
awesome job by the cortex team!

there are a lot of good questions, and some confusion, in this thread. here is
my view. note: i'm definitely biased; i'm the co-founder/ceo at grafana labs.

\- at grafana labs we are huge fans of prometheus. it has become the most
popular metrics backend for grafana. we view cortex and prometheus as
complementary. we are also very active contributors to the prometheus project
itself. in fact, cortex vendors in prometheus.

\- you can think of cortex as a scale-out, multi-tenant, highly available
"implementation" of prometheus itself.

\- the reason grafana labs has put so many resources into cortex is that it
powers our grafana cloud product (which offers a prometheus backend). like
grafana itself, we are also actively working on an enterprise edition of
cortex that is designed to meet the security and feature requirements of the
largest companies in the world.

\- yes, cortex was born at weaveworks in 2016. tom wilkie (vp of product at
grafana labs) co-created it while he worked there. after tom joined grafana
labs in 2018, we decided to pour a lot more resources into the project, and
managed to convince weave.works to move it to the cncf. this was a great move
for the project and the community, and cortex has come a long long way in the
last 2 years.

once again, a big hat tip to everyone who made this release possible. a big
day for the project, and for prometheus users in general!

[edit: typos]

~~~
Florin_Andrei
I'm worried about this statement:

> _Local storage is explicitly not production ready at this time._

[https://cortexmetrics.io/docs/getting-started/getting-starte...](https://cortexmetrics.io/docs/getting-started/getting-started-chunks-storage/)

But I want a scale-out, multitenant implementation of Prometheus with local
storage that's ready for prod. What are my options then? VictoriaMetrics?

~~~
gouthamve
The only one I know with "non-experimental" local-storage is VictoriaMetrics.
But the big thing there is that data in VM is not replicated, so when you lose
a disk/node, you lose that data.

Having said that, both Thanos and Cortex have experimental local-storage modes
that are pretty good. You could also try them for now while they get
production ready.

~~~
Florin_Andrei
> _data in VM is not replicated, so when you lose a disk/node, you lose that
> data_

The vmstorage component in VictoriaMetrics Server - is it RAID0-like
(striping) or RAID1-like (mirroring)?

[https://github.com/VictoriaMetrics/VictoriaMetrics/tree/clus...](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster)

~~~
valyala
It is easy to implement RAID1-like replication in VictoriaMetrics: just set up
independent VictoriaMetrics instances (single-node or clusters) and replicate
all the incoming data simultaneously to these instances. This can be done
either via providing multiple `remote_write->url` values in Prometheus configs
or via providing multiple `-remoteWrite.url` command-line flags in vmagent
[1]. Then query multiple VictoriaMetrics replicas via Promxy [2].

[1] [https://github.com/VictoriaMetrics/VictoriaMetrics/blob/mast...](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/README.md)

[2] [https://github.com/jacksontj/promxy](https://github.com/jacksontj/promxy)
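
As a minimal sketch of the dual-write setup described above, on the Prometheus side (the replica hostnames are placeholders; 8428 is the default single-node VictoriaMetrics port):

```yaml
# prometheus.yml -- RAID1-style replication by writing every sample
# to two independent VictoriaMetrics instances.
# The vm-replica-* hostnames are hypothetical.
remote_write:
  - url: http://vm-replica-1:8428/api/v1/write
  - url: http://vm-replica-2:8428/api/v1/write
```

The vmagent equivalent would be passing the `-remoteWrite.url` flag once per replica; Promxy then serves queries across the replicas.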

------
kapilvt
also props to [https://weave.works](https://weave.works) for creating cortex,
open-sourcing it and moving it under cncf, something this blog post leaves
out.

------
netingle
Hi! Tom, one of the Cortex authors here. Super proud of the team and this
release - let me know if you have any questions!

~~~
number101010
Hey Tom!

Can you outline how Cortex differs from some of the other available Prometheus
backends?

~~~
netingle
Sure, check out this talk from PromCon I did with Bartek, the Thanos author:
[https://grafana.com/blog/2019/11/21/promcon-recap-two-househ...](https://grafana.com/blog/2019/11/21/promcon-recap-two-households-both-alike-in-dignity-cortex-and-thanos/)

~~~
MetalMatze
Love that talk. :)

------
ones_and_zeros
Isn't prometheus an implementation and not an interface? I have "prometheus"
running in my cluster, if it's not cortex, what implementation am I using?

~~~
outworlder
You are using Prometheus.

However, Prometheus can use different storage backends. The TSDB that it comes
with is horrible.

I mean, it's workable. And can store an impressive amount of data points. If
you don't care about historical data or scale, it may be all you need.

However, if your scale is really large, or if you care about the data, it may
not be the right solution, and you'll need something like Cortex.

For instance, Prometheus' own TSDB has no 'fsck'-like tool. From time to time,
it does compaction operations. If your process (or pod in K8s) dies, you may
be left with duplicate time series. And now you have to delete some (or a
lot!) of your data to recover.

Prometheus documentation, last I checked, even says it is not suitable for
long-term storage.

~~~
ecnahc515
The TSDB it uses is actually pretty state-of-the-art. I think your pain point
is more that it's designed for being used on local disk, but that doesn't mean
it isn't possible to store the TSDB remotely. In fact, this is exactly how
Thanos works.

The docs say Prometheus is not intended for long term storage because without
a remote_write configuration, all data is persisted locally, and thus you will
eventually hit limits on the amount that can be stored and queried locally.
However, that is a limitation of how Prometheus is designed, not of how the
TSDB is designed, and it can be overcome by using a remote_write adapter.

------
Rapzid
Dat architecture tho:
[https://cortexmetrics.io/docs/architecture/](https://cortexmetrics.io/docs/architecture/)
. Holy bi-gebus.

~~~
netingle
That's the "microservices" mode - you can run it as a single process and the
architecture becomes super boring.

It's like looking at the module interdependencies of a reasonably large piece
of software; of course it's going to look complicated.

~~~
valyala
According to Cortex docs [1], a single-process Cortex isn't production ready.
It is intended for development and testing only.

[1] [https://cortexmetrics.io/docs/configuration/single-process-c...](https://cortexmetrics.io/docs/configuration/single-process-config/)

------
zytek
Congrats to Grafana Team!

If you're looking at scaling your Prometheus setup, also check out
VictoriaMetrics.

Operational simplicity and scalability/robustness are what drive me to it.

I used to send metrics from multiple Kubernetes clusters with Prometheus -
each cluster running Prometheus with a remote_write directive that sends
metrics to a central VictoriaMetrics service.

That way my "edge" Prometheus installations are practically "stateless" and
easily set up using prometheus-operator. You don't even need to add persistent
storage to them.
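
As a sketch, with prometheus-operator that setup is roughly a Prometheus resource whose spec forwards everything to the central service (the names and remote URL below are placeholders):

```yaml
# Edge-cluster Prometheus (prometheus-operator CRD) -- short local
# retention, with all samples forwarded to a central VictoriaMetrics.
# Resource names and the remote URL are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: edge
  namespace: monitoring
spec:
  retention: 2h   # keep little locally; long-term data lives centrally
  remoteWrite:
    - url: http://victoria-metrics.central.example.com:8428/api/v1/write
```

With such a short retention window, losing an edge pod's disk costs at most a couple of hours of unforwarded data, which is why persistent storage becomes optional.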

------
mmcclellan
New to Cortex, but looking at a comparison of Prometheus and InfluxDB (like
[https://prometheus.io/docs/introduction/comparison/#promethe...](https://prometheus.io/docs/introduction/comparison/#prometheus-vs-influxdb)),
it appears that Cortex offers horizontal scalability features similar to the
InfluxDB Enterprise offering. The linked comparison does note the difference
between event logging and metrics recording, but I am curious (choosy beggar
that I am) whether others consider them separate tooling or whether it is
possible to remain performant using one solution.

------
stuff4ben
This was a Weaveworks project right?

~~~
gouthamve
Yes, it was created at Weaveworks, but it was later donated to the CNCF and
now the community is much bigger! Having said that, Weaveworks is still a
major contributor!

------
mattmendick
Really exciting! Well done

------
rfratto
Great job Cortex team!

------
demilich
Good job, excited!

------
throwaway50203
Reminder: github star history is in no way a measure of quality.

