Grafana Mimir – Horizontally scalable long-term storage for Prometheus (grafana.com)
262 points by devsecopsify on March 30, 2022 | 119 comments



Grafana Labs needs to make a convincing comparison chart of some kind between Mimir, Thanos, and Cortex. Thanos and Cortex are both mature projects and are both CNCF Incubating projects. Why would anyone switch to a new prometheus long-term storage solution from those?

*EDIT*: I see from another reply there is a basic comparison to Cortex here: https://grafana.com/blog/2022/03/30/announcing-grafana-mimir... To the Mimir folks, I'd love to see something similar Mimir v. Thanos.


It looks like this is a fork of Cortex driven by the maintainers employed by Grafana Labs, done so they can change the license to one that will prevent cloud providers like Amazon from offering it without contributing changes back.

This is interesting, since Amazon offers both hosted Grafana and Cortex today. I was under the impression Amazon and Grafana Labs were successfully collaborating (unlike e.g. AWS and Elastic), but seems like that's not the case.


Does AWS provide managed Cortex? Is that just a part of the AWS managed prometheus thing?


Yes, Amazon's managed Prometheus is based on Cortex. See the first question at https://aws.amazon.com/prometheus/faqs/


Disclosure: I work for AWS, but I don't work on the Amazon Managed Service for Prometheus. I have my own very long held opinions about Free and Open Source software, and I am only speaking for myself.

To me, the AGPLv3 license isn't about forcing software users to "give changes back" to a project. It is about giving the permissions to users of software that are necessary for Software Freedom [1] when they access a program over a network. In practice, that means that changes often flow "upstream" to copyleft licensed programs one way or another. But it was never about obligating changes to be "given back" to upstream. In my personal opinion, you should be "free to fork" Free and Open Source Software (FOSS). Indeed, the Grafana folks seem to have decided to do that with Grafana Mimir.

Personally, I hope that they accept contributions under the AGPLv3 license, and hold themselves to the same obligations that others are held to with regard to providing corresponding source code of derivative works when it is made available to users over a network. In my personal opinion, too often companies use a contributor agreement that excuses them from those obligations, and also allows them to sell the software to others under licenses that do not carry copyleft obligations. See [2] for a blog post that goes into some detail about this.

If you look at the Cortex project MAINTAINERS file [3], you will see that there are two folks listed who currently work at AWS, and that today no company other than AWS and Grafana Labs is represented. I would love to see more diversity in maintainers for a project like this, as I think too many maintainers from any one company isn't the best for long-term project sustainability.

I think if you look at the Cortex Community Meeting minutes [4], you can see that AWS folks are regularly "showing up" in healthy numbers, and working collaboratively with anyone who accepts the open invitation to participate. There have been some pretty big improvements to Cortex that have merged lately, like some of the work on parallel compaction [5, 6].

TL;DR, I think it is easy to jump to conclusions about how things are going in a FOSS project that don't hold water under even cursory exploration. I think the best way to know what's going on in a project is to get involved!

--

[1] the rights needed to: run the program for any purpose; to study how the program works, and modify it; to redistribute copies; to distribute copies of modified versions to others

[2] https://meshedinsights.com/2021/06/14/legally-ignoring-the-l...

[3] https://github.com/cortexproject/cortex/blob/master/MAINTAIN...

[4] https://docs.google.com/document/d/1shtXSAqp3t7fiC-9uZcKkq3m...

[5] https://aws.amazon.com/blogs/opensource/scaling-cortex-with-...

[6] https://github.com/cortexproject/cortex/pull/4624


Their other AGPL projects all have a CLA, and they state you can buy them as part of Grafana Enterprise without the AGPL license (https://grafana.com/blog/2022/03/30/qa-with-our-ceo-about-gr...), so they are not offering symmetric terms to themselves.


Indeed. I was only saying what I would _like_ to see...


> Cortex is used by some of the world’s largest cloud providers and ISVs, who are able to offer Cortex at a lower cost because they do not invest the same amount in developing the project.

> ...

> All CNCF projects must be Apache 2.0-licensed. This restriction also prevents us from contributing our improvements back to Cortex.

I read this as "Amazon has destroyed the CNCF by not playing nice"


Holy crap I did not know CNCF discriminated against copyleft software.

This really discredits the Linux Foundation as an institution.


Seems like people should throw VictoriaMetrics into comparisons like this, as well?


Yea, although making benchmarks properly is no easy task and can be pretty time-consuming, especially if you involve all the contestants for fairness. And no vendor is interested in releasing a benchmark they don't look good in.


I agree! Which is why I put one in the blog post ;-) https://grafana.com/blog/2022/03/30/announcing-grafana-mimir...


I'm not seeing a comparison to Thanos


Why would you? The parent says it's a comparison of Mimir and Cortex.


Re-read the full thread...

>>Grafana Labs needs to make a convincing comparison chart of some kind between Mimir, Thanos, and Cortex.

>I agree! Which is why I put one in the blog post ;-)


You're forgetting VictoriaMetrics, which is presumably the best choice for Prometheus long-term storage.

Such a solid solution exists and yet another competitor? Not sure why they didn't just buy VictoriaMetrics and possibly rebrand it.


Agree with you that VictoriaMetrics works like a charm: fast, easy to configure, easy to recover from component crashes (last time I checked Cortex, it was a nightmare to recover the ingesters). For me, it is the better solution for Prometheus long-term storage if you start from a clean slate.

But Grafana Labs employs lots of people who have worked on Cortex since its inception at Weaveworks, and it has developed strong in-house knowledge about it. So Grafana is fully committed to Cortex (now Mimir) and has developed derivatives for logs (Loki) and traces (Tempo) heavily based on the Cortex model.


Folks looking for a solution for storing Prometheus metrics from multiple places should definitely consider exploring VictoriaMetrics.

I'm running a single VictoriaMetrics instance which has 230bn datapoints, consuming ~4GB of memory and barely 200m of CPU (it only spikes to ~1.5 cores when it flushes these datapoints from RAM to disk). I've previously[1] shared my experience of setting up VictoriaMetrics for long-term Prometheus storage back in 2020, and since then this product has just kept getting better.

Over time, I switched to `vmagent` and `vmalert` as well, which offer some nice little things (for example, did you know you can't break up Prometheus' scrape config into multiple files? `vmagent` does that happily). The whole setup is very easy to manage for an Ops person (compared to Thanos/Cortex; yet to check out Mimir though!). I've barely had to tweak any of VictoriaMetrics' default configs, and I even increased the retention of metrics from a month to multiple months after gaining confidence in prod.

[1]: https://zerodha.tech/blog/infra-monitoring-at-zerodha/


How does this stack up against https://github.com/thanos-io/thanos, which I've used with pretty good success?

The only criticism I have of Thanos though was the amount of moving pieces to maintain.


(Tom here; I started the Cortex project on which Mimir is based and lead the team behind Mimir)

Thanos is an awesome piece of software, and the Thanos team have done a great job building a vibrant community. I'm a big fan - so much so that we used Thanos' storage in Cortex.

Mimir builds on this and makes it even more scalable and performant (with a sharded compactor and query engine). Mimir is multi-tenant from day 1, whereas this is a relatively new thing in Thanos, I believe. Mimir has a slightly different deployment model to Thanos, but honestly even this is converging.

Generally: choosing Thanos is always going to be a good choice, but IMO choosing Mimir is an even better one :-p


Okay, but why? I am using Thanos today. It works. It's complex, and when it breaks it's a bit of a challenge to fix, but it doesn't break often.

It does the job. Mimir is based on Cortex, so whether I use Mimir or Cortex, what benefit am I getting?

I get asked every few months about moving off of Thanos to Cortex, and today now Mimir, and I don't have any substantial reason to do so. It feels like moving for the sake of moving.

I need to see some real reasoning as to why moving everything to Mimir would add value.


Sounds like Thanos is working well for you, so in your position I wouldn't change anything.

There are a bunch of other reasons why people might choose Mimir; perhaps they have outgrown some of the scalability limits, or perhaps they want faster high-cardinality queries, or a different take on multi-tenancy.

Do remember Cortex (on which Mimir is based) predates Thanos as a project; Thanos was started to pursue a different architecture and storage concept. Thanos storage was clearly the way forward, so we adopted it. The architectures are still different: Thanos is "edge"-style IMO, Mimir is more centralised. Some people have a preference for one over the other.


That's fair, thanks for the input. The only reason we implemented Thanos in the first place was a particular feature that we needed at the time of implementation. Now using it in an extremely large environment, I haven't seen any scalability limits. Speed of queries isn't a driver of anything.

Multi-tenancy certainly is, but we built our own custom multi-tenancy solution on top of it. I'd like to get rid of that ultimately, but we're not utilizing whatever multi-tenant features exist at the moment. Perhaps that will be a driver.

Appreciate your thoughts.


We were struggling with Cortex a couple of years ago, then we tried VictoriaMetrics and haven't looked back. It runs pretty much unattended; we just monitor disk space to make sure we still have room to keep pouring in metrics. When a component crashes (not often), it recovers pretty much without us noticing.


Multi-tenancy is something that shouldn't be underestimated. A lot of people think it's just a checklist item until (a) they need it or (b) they try to implement it in an existing system. Kudos for making it a day-one feature.


While I agree with your point in the general case, would you mind elaborating on the specific case of Prometheus?

My understanding is that the recommended best-practice for Prometheus is to deploy as many of them as necessary, as close to the monitored infrastructure as possible.

What use case would require deploying a single Mimir (and so, supposedly, a single Prometheus cluster) to serve multiple tenants? Why not just deploy a dedicated Prometheus/Mimir stack per client?


I don't know Prometheus, but I would imagine the answer depends on just how many clients you have. Probably doesn't matter if you're talking just a few. If it's a lot, then separate instances can be very expensive in terms of operational complexity and waste due to resource fragmentation. Multi-tenancy is good for bringing both of those back under control. Is there something about Prometheus that would negate that?


For one, it doesn't really support authentication (although it's on the roadmap).

I'm no Prometheus expert, but since you're pretty much expected to be running a bunch of servers anyway, the operational complexity has to be handled even for just one client.

You do have a point on resource fragmentation, but IME Prometheus' resource usage is fairly predictable, so you could probably mitigate that to a point.


(Bartek here: I co-started Thanos and maintain it with other companies)

Thanks for this - it's good feedback. It's funny you mention that, because we actively try to reduce the number of running pieces, e.g. while designing our query sharding (parallelization) and pushdown features.

As Cortex/Mimir shows, it's hard - if you want to scale out every tiny piece of functionality in your system, you end up with twenty different microservices. But it's an interesting challenge to have - eventually it comes down to the trade-offs we make in Thanos between simplicity, reliability, and cost versus ultra-max performance (Mimir/Cortex).


Mimir has a microservices architecture, but it supports two deployment modes: monolithic and microservices.

In monolithic mode you deploy Mimir as a single process, and all microservices (Mimir components) run inside that process. You then scale out by running more replicas. Deployment modes are documented here: https://grafana.com/docs/mimir/latest/operators-guide/archit...

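If it helps, monolithic mode is selected with Mimir's `-target` flag (per the linked docs, `all` is the default). A sketch, with a placeholder config path:

```
# Monolithic mode: every component runs inside this one process.
mimir -target=all -config.file=/etc/mimir/mimir.yaml

# Scale out by running more identical replicas behind a load balancer.
```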

There isn't a link to the project on the page (that I could find) so it almost looked like it's not open source. But here it is: https://github.com/grafana/mimir.


You have to find the "Download" button and click it, it's very non-obvious :< The entire page seems to be designed to funnel you into signing up for their paid service, which makes sense, but still doesn't feel great...


Recently switched from their cloud service back to on-premise. The cloud version wasn't being updated, and the entire setup experience left a lot to be desired with how you connect their on-premise Grafana agent, especially if you aren't using their easy-button deployment stuff. Also, billing for metrics is insane, as on any given day my metric load may vary by 5-7k series or more. This caused operational overhead, as I was constantly tweaking scrapers to drop useless metrics.

For $50/mo, you can self host everything easier, cheaper and with more control IMO.


> For $50/mo, you can self host everything easier, cheaper and with more control IMO.

Can you give an example as to how you could self host a grafana stack for $50/month? On AWS that buys you 4 cores, 8GB memory and 0 storage, and it's certainly not easier than clicking one button on the grafana website.


We are running Grafana and Prometheus on a single t3.xlarge instance with 150GB gp3 EBS.

Excluding traffic, it costs ~ $100 USD per month.

We are doing 10 second scrapes and currently have roughly 141k active time series. In Grafana Cloud it would cost...

15000 metrics for free. 126000/1000 * $8 = $1008

Now here's the real kicker: the pricing Grafana puts on their website assumes a 60-second scrape interval (1 data point per minute, or DPM). If you are doing 6 DPM, that's $8 * 6 per 1000 time series!

So final bill.. drum rolls

126000/1000 * $8 * 6 = $6048

Yes. That's a 60x.

Now, sure, we don't get the scale, the backups, the SLA.. but we can live without it. And when Prometheus will start acting slowly, we will just bump it to t3.2xl, or spend some time and filter out some of the noisy metrics we might have around.

Btw, if you try to find any information on Grafana's pricing page about what a "time series" or a "metric" actually is, good luck.

https://grafana.com/docs/grafana-cloud/metrics-control-usage...
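The arithmetic above is easy to sanity-check with a few lines. The prices, the 15k free tier, and the DPM multiplier are taken from this comment, not verified against Grafana's current pricing:

```python
def grafana_cloud_cost(active_series, dpm=1,
                       free_series=15_000, price_per_1k=8.0):
    """Estimate monthly Grafana Cloud metrics cost in USD.

    Pricing model as described in the comment above: $8 per 1,000
    active series per month at 1 data point per minute (DPM);
    a higher scrape frequency multiplies the price proportionally.
    """
    billable = max(active_series - free_series, 0)
    return billable / 1000 * price_per_1k * dpm

# 141k series at the advertised 60s interval (1 DPM)
print(grafana_cloud_cost(141_000))          # 1008.0
# The same series scraped every 10s (6 DPM)
print(grafana_cloud_cost(141_000, dpm=6))   # 6048.0
```

Against a ~$100/month self-hosted bill, the 6 DPM figure is indeed roughly a 60x difference.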


> Excluding traffic, it costs ~ $100 USD per month

I don't doubt that that's affordable, or cost-competitive with AWS, but that's about as cheap as you can do it, _and_ that's not including traffic. It's pretty much impossible to halve that bill.


I excluded the traffic because the price is basically 0. This is internal traffic and a bunch of HTTP requests. It doesn't cost us $3000 a month.


There are Helm charts available for all Grafana products so if you already run a Kubernetes cluster and have spare capacity you can just throw it up there. Loki supports shipping logs to GCS/S3 natively and Prometheus can use Cortex (also available as a Helm chart) to do the same. Once you throw Grafana behind SSO and implement a backup cronjob you're done until you reach scale and have to start deploying/scaling individual components separately.

I implemented most of the above using Terraform on a managed DigitalOcean cluster on a Saturday a few months back; it wasn't super-hard. Alternatively you could rent a few VPSes someplace and use k3s or similar to get an unmanaged cluster.


Suggestions for organizing a Helm + Terraform [+ k3s/k3d/MicroShift] provisioning and monitoring git repo with CI for job accounting? (without Ansible & AWX, which I'd create a role with for this too)

- [ ] ENH,BLD: A cookiecutter for this would be cool


> $50/month? On AWS that buys you 4 cores, 8GB memory and 0 storage

Self-hosting on AWS is kind of counterproductive. Look into "cloud" metal servers and the money will go much further.


Two low end Hetzner/OVH Boxes for redundancy should do the trick


That's why AWS charges so much for outgoing traffic.


The first CTA button on the page "Tutorial" links to a tutorial where the first step is to run the project with Docker. Doesn't really feel like an overly forced funnel to their paid service.


Still AGPL, which I guess makes sense given the rest of their stack is too: https://github.com/grafana/mimir/blob/mimir-2.0.0/LICENSE


How does this compare to https://www.timescale.com/promscale

I’m looking into choosing a backend for my metrics and always open for suggestions.


Hey!

Promscale PM here :)

Promscale is an open source observability backend for metrics and traces powered by SQL, whereas Mimir/Cortex is designed only for metrics.

Key differences:

1. Promscale is light in architecture: all you need is the Promscale connector plus TimescaleDB to store and analyse metrics and traces. Cortex, by contrast, has a highly scalable microservices architecture that requires deploying tens of services (ingester, distributor, querier, etc.).

2. Promscale offers storage for metrics, traces, and logs (in the future) - one system for all observability data - whereas Mimir/Cortex is purpose-built for metrics.

3. Promscale supports querying metrics using PromQL and SQL, and traces using Jaeger query and SQL, whereas in Cortex/Mimir all you can use is PromQL for metrics.

4. The observability data in Cortex/Mimir is stored in an object store like S3 or GCS, whereas in Promscale the data is stored in a relational database (TimescaleDB). This means Promscale can support more complex analytics via SQL, but Cortex is better for horizontal scalability at really large scales.

5. Promscale offers per-metric retention, whereas Cortex/Mimir offers a global retention policy across all metrics.

I hope this answers your question!


Hi. I'm a Mimir maintainer. I don't have hands-on/production experience with Promscale, so I can't speak about it. I'm chiming in just to add a note about the Mimir deployment modes.

> Cortex comes with highly scalable micro-services architecture this requires deploying 10's of services like ingestor, distributor, querier, etc.

Mimir also supports the monolithic deployment mode. It's about deploying the whole of Mimir as a single unit (e.g. a Kubernetes StatefulSet), which you then scale out by adding more replicas.

More details here: https://grafana.com/docs/mimir/latest/operators-guide/archit...


Thanks... how do we do reporting/dashboards/alerts with Promscale?

Also, any performance benchmarks?


Promscale supports ingestion of data via Prometheus remote-write for metrics and via OTLP (the OpenTelemetry protocol) for traces.

For dashboards, you can use Promscale as a Prometheus datasource for PromQL-based querying and visualising, as a Jaeger datasource for querying and visualising traces, and as a PostgreSQL datasource to query both metrics and traces using SQL. If you are interested in visualising data using SQL, we recently published a blog post on visualising traces with SQL (https://www.timescale.com/blog/learn-opentelemetry-tracing-w...)

Alerts need to be configured on the Prometheus end; Promscale doesn't support alerting at the moment. But expect native alerting from Promscale in upcoming releases.

We have internally tested Promscale at 1Mil samples/sec, here is the resource recommendation guide for Promscale https://docs.timescale.com/promscale/latest/installation/rec...

If you are interested in evaluating or setting up Promscale, reach out to us in the Timescale community Slack (http://slack.timescale.com/) in the #promscale channel.


Thanks... I will try it out. What we really need is SQL based/OTel based systems. It makes life so much easier.


Feel free to reach out to us in Timescale community slack :), Would love to help you in getting started with Promscale!


One interesting question I have is regarding global availability.

With our current Thanos deployment, we can tie a single geo regional deployment together with a tiered query engine.

Basically like this:

"Global Query Layer" -> "Zone Cluster Query Layer" -> "Prom Sidecar / Thanos Store"

We can duplicate the "Global Query Layer" in multiple geo regions with their own replicated Grafana instances. If a single region/zone has trouble we can still access metrics in other regions/zones. This avoids Thanos having any SPoFs for large multi-user(Dev/SRE) orgs.
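In Thanos terms, each layer is just another Querier: Queriers themselves expose the Store API, so they can be stacked. A sketch with placeholder addresses (flag names per Thanos v0.x):

```
# Zone-level querier, fanning out to Prometheus sidecars and a store gateway
thanos query --store=prom-sidecar-1:10901 --store=store-gateway:10901

# Global-level querier, fanning out to the zone-level queriers
thanos query --store=zone-a-query:10901 --store=zone-b-query:10901
```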


This is one of my favorite things about Thanos. We run Prometheus in multiple private datacenters, multiple AWS regions across multiple AWS accounts, and multiple Azure regions across multiple subscriptions. We have three global labels: cloud, region, and environment. With Thanos's Store/Querier architecture we have a single Datasource in Grafana where we can quickly query any metric from any environment across the breadth of our infrastructure.

It's really a shame that Loki in particular doesn't share this kind of architecture. Seems like Mimir, frustratingly, will share this deficiency.


The typical way to run Mimir is centralised, with different regions/datacenters feeding metrics in to one place. You can run that central system across multiple AZs.

If you run Mimir with an object store (e.g. S3) that supports replication then you can have copies in multiple geographies and query them, but the copies will not have the most recent data.

(Note I work on Mimir)


Sad news for Cortex: with most of the maintainers moving on to Mimir, I fear it's pretty much dead in the water.


We tried to address this question on the Q&A blog post: https://grafana.com/blog/2022/03/30/qa-with-our-ceo-about-gr...

It doesn't have to mean the end for Cortex, but others will have to step up to lead the project. We've tried to put other maintainers in place to kick start this.


I was going to ask what the migration path was from Cortex to Mimir, but I see you've documented that at https://grafana.com/docs/mimir/latest/migration-guide/migrat... . Thanks for the work you've done to make this easy.


This video also shows a live migration from Cortex to Mimir (running in Kubernetes): https://www.youtube.com/watch?v=aaGxTcJmzBw&ab_channel=Grafa...


If anything, this makes me less interested in moving from Thanos.


So many solutions to the same problem - how does it compare to VictoriaMetrics?


VictoriaMetrics co-founder here.

There are many features in common between Mimir and VictoriaMetrics: multi-tenancy, horizontal and vertical scalability, and high availability. Features like Graphite and Influx protocol ingestion and a Graphite query engine are already supported by VictoriaMetrics. I didn't find references to downsampling in Mimir's docs, but I believe it supports it too.

There are architectural differences. For example, Mimir stores the last 2h of data on the local filesystem (and mmaps it, I assume) and every 2h uploads it to object storage (long-term storage). VictoriaMetrics doesn't support object storage and prefers the local filesystem for the sake of query performance. Both VictoriaMetrics and Mimir can be used as a single binary (monolithic mode in Mimir's docs) and in cluster mode (microservices mode in Mimir's docs). The set of cluster components (microservices) is different, though.

It is hard to say anything about ingestion and query performance or resource usage so far. Since benchmarks from the project owners can't be 100% objective, I hope the community will perform unbiased tests soon.


Given that VictoriaMetrics is the only solution I've seen that makes data comparing it to other systems easily accessible as part of its official documentation, it's the only one I pay attention to.

I knew from reading the docs what VM excelled at and the areas where it was weak, long before I ever ran it (and my experience running it matched the documentation). I hate aspirational, marketing-saturated campaigns for deep tech projects, where standards should obviously be higher; it speaks more about the intended audience than it does about the solution. In that respect VM is automatically a cut above the rest.


Cortex, Thanos and Mimir all support "remote-read" protocol (documented in Prometheus: https://prometheus.io/docs/prometheus/latest/storage/#remote...), so external systems (eg Prometheus) can read data from them easily.


It would be great if you could provide a few practical examples for the "Prometheus remote-read" protocol, given its restrictions [1].

[1] https://github.com/prometheus/prometheus/issues/4456


Which restrictions do you have in mind?

A quick look at the issue suggests it wanted to avoid Prometheus using local storage, but that's a Prometheus-specific problem, not a remote-read problem.

Remote-read is a generic protocol (https://github.com/prometheus/prometheus/blob/a1121efc18ba15...), you pass query (start/end time and matchers), and get back data.


> Which restrictions do you have in mind?

You wrote in the previous comment:

> ... so external systems (eg Prometheus) can read data from them easily.

I pointed to an issue which prevents practical usage of the remote-read protocol from Prometheus itself.

As for interoperability with external systems, the Prometheus querying API [1] is better suited for this task than the Prometheus remote-read protocol, for the following reasons:

- Prometheus querying API is easy to use, since it is just JSON over HTTP (unlike the compressed protobuf used for Prometheus remote-read). E.g. humans can test and debug it either directly in a web browser or on the command line with curl.

- Prometheus querying API is already supported by popular external systems such as Grafana.

- Many Prometheus-compatible systems such as Thanos, Cortex, M3, VictoriaMetrics, etc. support Prometheus querying API out of the box.

[1] https://prometheus.io/docs/prometheus/latest/querying/api/
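To illustrate the "easy to use" point: the querying API returns a small, stable JSON shape. A minimal sketch consuming a made-up instant-query response (the sample data below is invented for illustration):

```python
import json

# A made-up /api/v1/query response in the documented shape:
# {"status": "success", "data": {"resultType": "vector", "result": [...]}}
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "node"}, "value": [1648646400, "1"]}
    ]
  }
}
""")

def vector_samples(resp):
    """Yield (labels, timestamp, float value) from an instant-query response."""
    assert resp["status"] == "success"
    for series in resp["data"]["result"]:
        ts, val = series["value"]       # value comes back as [ts, "string"]
        yield series["metric"], ts, float(val)

for labels, ts, val in vector_samples(sample):
    print(labels.get("__name__"), ts, val)   # up 1648646400 1.0
```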


Presumably AGPLv3 is why Grafana would rather develop this than Cortex?


Hi. I'm Marco, I work at Grafana Labs and I'm a Grafana Mimir maintainer. We just published a couple of blog posts about the project, including more details on your question: https://grafana.com/blog/2022/03/30/announcing-grafana-mimir... and https://grafana.com/blog/2022/03/30/qa-with-our-ceo-about-gr...


Thank you for your answer. That seems like a reasonable strategy.


The thing I need most right now is a confirmation that it's named after this tweet: https://twitter.com/mmoriqomm/status/1272552214658117638


I don't get why there's so much hate here.

Cortex is a pain to configure and maintain; it would be awesome to have Mimir address these issues!


This is about Prometheus, but Mimir makes it interesting. I can't find any other open source time series database except Mimir/Cortex that allows this much scale (clustering options in the open source version). Our use case will have high cardinality, and Mimir seems to fit very well.

Can we use Prometheus/Mimir as a general-purpose time series database? Prometheus is built for monitoring and may not suit general-purpose time series workloads the way InfluxDB does (I am hoping to be wrong). What are the disadvantages/limitations of using Prometheus/Mimir as a general-purpose time series database?


> I can't find any other open source time series database except Mimir/Cortex which allows this much scale (clustering options in their open source version)

The following open source time series databases also can scale horizontally to many nodes:

- Thanos - https://github.com/thanos-io/thanos/

- M3 - https://github.com/m3db/m3

- Cluster version of VictoriaMetrics - https://docs.victoriametrics.com/Cluster-VictoriaMetrics.htm... (I'm CTO at VictoriaMetrics)

> Can we use Prometheus/Mimir as general purpose time series database?

This depends on what you mean by "general purpose time series database". Prometheus/Mimir are optimized for storing (timestamp, value) series, where the timestamp is a unix timestamp in milliseconds and the value is a floating-point number. Each series has a name and can have an arbitrary set of additional (label=value) pairs. Prometheus/Mimir aren't optimized for storing and processing series of other value types, such as strings (aka logs) or complex data structures (aka events and traces).

So, if you need to store time series with floating-point values, then Prometheus/Mimir may be a good fit. Otherwise, take a look at ClickHouse [1] - it can efficiently store and process time series with values of arbitrary types.

[1] https://clickhouse.com/
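To make that data model concrete, a Prometheus-style series is essentially the following (a sketch for illustration, not any project's actual types):

```python
from dataclasses import dataclass, field

@dataclass
class Series:
    """A Prometheus-style time series: a label set plus float samples."""
    labels: dict                                  # e.g. {"__name__": "...", "job": "..."}
    samples: list = field(default_factory=list)   # (unix_ms_timestamp, float_value) pairs

    def append(self, ts_ms: int, value: float):
        # Only float64 values fit this model; strings (logs) or nested
        # structures (traces/events) would need a different kind of store.
        self.samples.append((ts_ms, float(value)))

s = Series({"__name__": "http_requests_total", "job": "api"})
s.append(1_648_646_400_000, 42)
print(s.samples)   # [(1648646400000, 42.0)]
```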


I meant all Prometheus-based solutions, including Thanos, M3, and VictoriaMetrics. Thank you for your answer.


I would love to see some benchmarks when making such a heavy claim. I would be interested in knowing performance of ingestion rate, query timings and resource usage.


It's hard to tell exactly how this works, but judging from the tutorial's docker-compose.yml [0] it looks like this runs as a separate API next to Prometheus, and you tell Prometheus to write [1] to Mimir. I'm unclear how reads work from it - or maybe there is no read path?

Maybe I'm completely misunderstanding.

[0] https://github.com/grafana/mimir/blob/main/docs/sources/tuto...

[1] https://github.com/grafana/mimir/blob/main/docs/sources/tuto...


Mimir exposes both a remote-write API and a Prometheus-compatible query API. The typical setup is to configure Prometheus (or Grafana Agent) to remote-write to Mimir, and then configure Grafana (or your preferred query tool) to query metrics from Mimir.

You may also be interested in a 5-minute introduction video, where I cover the overall architecture too: https://www.youtube.com/watch?v=ej9y3KILV8g


Cool! Personally I don't like watching videos, preferring to read prose or code or see an arch diagram. But good that it's available.


I'm the author of the video, but personally I also prefer to read prose instead of watching videos!

The architecture is covered here: https://grafana.com/docs/mimir/latest/operators-guide/archit...

There's also a hands-on tutorial here: https://grafana.com/tutorials/play-with-grafana-mimir/


It’s a centralised multi-tenant store supporting the Prometheus query API. So you can point clients directly at Mimir; they send in PromQL and get data back as JSON.

(Note I work on Mimir)


But who does the scraping of the prometheus agents? Mimir or still prometheus server?


Last year I wrote a blog post about this exact question: Who watches the watchers?

The general takeaway is that you run a minimal Prometheus/Alertmanager setup that only scrapes the agents, then use a dead-man's-switch-like system to ensure this pipeline keeps working.

Link: https://grafana.com/blog/2021/04/08/how-we-use-metamonitorin...
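The dead-man's-switch half of this is commonly implemented as an always-firing alert that an external service expects to keep receiving; when the notifications stop, the meta-monitoring pipeline itself is broken. A sketch (group and label names are illustrative):

```
groups:
  - name: meta-monitoring
    rules:
      - alert: DeadMansSwitch
        # vector(1) always returns a value, so this alert fires forever;
        # an external receiver pages when notifications *stop* arriving.
        expr: vector(1)
        labels:
          severity: none
```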


If you have systems exporting metrics in Prometheus style, then you can use Prometheus to scrape them and remote-write to Mimir.

You can alternatively use Prometheus Agent, to save storing the data and running a query engine at the leaf.

You can also use the OpenTelemetry suite to perform the same operation, though this is more appealing if you want some other OpenTelemetry features at the same time. Eg if you prefer the ‘pipeline’ style.


You configure with Remote Write [1] to the Mimir instance. Then the Prometheus agents will send the metrics to Mimir.

1: https://prometheus.io/docs/prometheus/latest/configuration/c...


Is there an example of running mimir without prometheus?


For example sending metrics from an OpenTelemetry pipeline.

Mimir accepts the Prometheus remote-write API, which is protobuf-over-HTTP; it can be generated by anything, really.


Coincidentally, "mimir" is a funny, baby-like way of saying "dormir" (to sleep) in Spanish.


Technical meetings are going to be fun with hispanic devs...

"And finally we sent the metrics to Mimir /giggles/"

Sadly they don't support encryption at rest (sorry, I really had to do one more pun)


So true!!! LOL I related to "Vamos a mimir!" when I read it!!! ROFL


What's the latency between sending a metric and being able to query it when using object storage (s3) instead of block storage?

How do the transfer/retrieval (GET/PUT) costs factor in as well?


Good question! Grafana Mimir guarantees read-after-write. If a write request succeeds, the metric samples you've written are guaranteed to be visible to any subsequent query.

Mimir employs write de-amplification: it doesn't write immediately to the object storage but keeps the most recently written data in memory and/or on local disk.

Mimir also employs several shared caches (supports Memcached) to reduce object storage (S3) access as much as possible.

You can learn more here in the Mimir architecture documentation: https://grafana.com/docs/mimir/latest/operators-guide/archit...
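To illustrate the idea (this is a toy sketch, not Mimir's actual code): samples are buffered in memory and only flushed to "object storage" as a block when the buffer fills, so one storage write covers many incoming samples, while queries still see the unflushed buffer (read-after-write).

```python
class ToyIngester:
    """Toy model of write de-amplification; not Mimir's real ingester."""

    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.buffer = []          # in-memory, recently written samples
        self.storage_writes = 0   # simulated object-storage PUTs

    def push(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) >= self.block_size:
            self._flush()

    def _flush(self):
        # In Mimir this step would upload a TSDB block to S3/GCS.
        self.storage_writes += 1
        self.buffer.clear()

    def query(self, predicate):
        # Read-after-write: queries see the in-memory buffer too,
        # not just flushed blocks.
        return [s for s in self.buffer if predicate(s)]


ing = ToyIngester(block_size=1000)
for i in range(2500):
    ing.push(i)

print(ing.storage_writes)  # 2 PUTs for 2500 samples
print(len(ing.buffer))     # 500 samples still buffered, yet queryable
```

The GET/PUT cost question above is exactly why this buffering (plus the caches) matters: far fewer object-storage operations per sample than writing each sample through directly.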


"the most scalable open source TSDB in the world"

You can be scalable, and still cost a lot of money to scale out. Unit economics are important.


How does it work with rules? So far I cannot tell if this can be a replacement for Prometheus, since I cannot see how we can re-use our Prometheus rules with Mimir. Does anyone know anything about that?


Mimir includes a ruler component, which is responsible for evaluating Prometheus recording and alerting rules. It also exposes a set of APIs to configure the rule groups.

For example, you can use this API to upload a rule group: https://grafana.com/docs/mimir/latest/operators-guide/refere...

Mimir is released with a CLI tool called "mimirtool" which, among other things, allows you to configure the rule groups (under the hood, it calls the Mimir API). Mimirtool documentation is here: https://grafana.com/docs/mimir/latest/operators-guide/tools/...
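The rule group files use the same YAML shape as Prometheus rule files, wrapped with a namespace (all names below are illustrative):

```yaml
# rules.yaml -- Prometheus-style rule group; names are illustrative
namespace: example
groups:
  - name: node-alerts
    rules:
      - record: instance:cpu_usage:rate5m
        expr: rate(node_cpu_seconds_total{mode!="idle"}[5m])
      - alert: HighCPU
        expr: instance:cpu_usage:rate5m > 0.9
        for: 10m
        labels:
          severity: warning
```

Something like `mimirtool rules load rules.yaml` (plus address/tenant flags) should then upload it, though see the mimirtool docs for the exact invocation.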


Thank you for the reply.


What is the best SaaS-based dashboard solution for Prometheus?


Grafana Cloud


thanks


Looks like an interesting alternative to Clickhouse with s3 backend...


Is this the project you guys referenced using Apache Arrow for?


Maybe you're thinking of this - the data structure used by datasources for Grafana dashboards:

https://grafana.com/docs/grafana/latest/developers/plugins/d...


I don't think so! I think that's being used in Tempo, but I'm not sure.


We are definitely investigating columnar formats in Tempo to store traces. We expect it to drastically accelerate search as well as open up more complex querying and eventually metrics from distributed tracing data.

However, we are currently primarily targeting Parquet as our columnar format in object storage.

Expect an announcement soon!


What is the relationship to Loki?


Sibling. Much of the architecture is similar; a number of components are shared in https://github.com/grafana/dskit.


More engineering effort going into reinventing things that already exist to upsell people on Grafana cloud.

What about focusing on the core value that Grafana provides, dashboards?

Grafana 8 alerting is still in my opinion at a beta level. Dashboards as code has made no meaningful progress outside of community attempts in the past 3 years. The documentation for Grafana 8 alerts is still subpar.

All of these things as a paid offering are more interesting than migrating my logging system or metrics system. Developers don't want to migrate their observability.


Seconded. While I like the idea of Grafana, and use it for some projects, it lacks features in the graphing and dashboarding part. I too presumed this is because they are spending more on backends, pipelines and collection.

I don't need more backends, pipelines or collections. I need a frontend to display the data that I have (in backends) already.

I need to:

* Be able to pipe KPIs into a storage. Doesn't need big-data, high-volume, or extreme granularity. OR

* Have grafana grab data from an API/HTTP endpoint. It does this with prometheus just fine.

* Have a way to insert some of my own figures. Currently I wire up some google-sheet to grafana and fill that. I always have some data that I cannot or will not (yet) grab automatically. Like "amount of hours spent working on project" or "MRR" or such.

It's possible with Grafana. But the experience is subpar, the tweaking and fiddling required is substantial, and the outcome is an OK-ish, but not too convincing, dashboard. I'm convinced an alternative that tackles this better (for niches) will eat into Grafana.


Understandable critique, but I absolutely love a lot of Grafana’s redundant offerings. For example, operationally speaking it is drastically simpler to set up a scalable Grafana Tempo instance than Jaeger, in my opinion. Grafana offering competent object storage backends for their software has made them dramatically easier to operate and maintain.

That’s also another thing: a decent amount of Grafana's software (Mimir, Loki, Tempo…) is OSS, so while they definitely use it in their paid offering, it absolutely still benefits OSS users. I’m messing with Tempo for telemetry in my (admittedly embarrassingly weak) home lab endeavors and it’s pretty cool.


Is there any competitor in the "primarily dashboards" space? Plenty things I know just use Grafana for small amounts of data where all this "5 new datastores!" isn't really useful, but dashboard improvements would be welcome.


Hey there! I work at Grafana on many of the dashboard components. Beyond dashboards as code and alerts, where are you feeling the pain?

I can say that a lot of effort is going into improving dashboards in a number of different dimensions and there are definitely some exciting things on the horizon.


What issues have you seen with Grafana alerting?

I'm curious because in my view it works so well that we abandoned alertmanager for Grafana alerts only well before v8.


Grafana alerts (before version 8) worked great. We use them, but the Grafana 8 alerting features are half-baked at best.

* Grafana 8 alerts removed the Image Preview, which was extremely useful during issues. [0]

* Grafana 8 alerts don't have any way of being stored as code. In fact the API that they provide in their docs [1][2] doesn't work, or isn't up to date.

* The expression languages have zero documentation about them, so aren't exactly useful for things that might get a developer out of bed in the middle of the night.

[0] https://github.com/grafana/grafana/discussions/38030#discuss...

[1] https://editor.swagger.io/?url=https://raw.githubusercontent...

[2] https://community.grafana.com/t/posting-an-alert-using-grafa...


How did you define alarms as code in a practical way before v8? and after?


Hi, I work on Grafana Alerting. Provisioning of alert rules (and other objects used for alerting) will be possible using a new API in Grafana 8.5 and we will update the Grafana Terraform provider right after to take advantage of this new API.


Great to hear! We are looking into jsonnet based approach but having an explicit and granular API and a Terraform provider would be miles and miles better. Thanks!


Did not tbh. We have an ops department that do not complain about menial tasks.

But of course IaC is the way we must follow.


Building a dashboard by clickety-clacking around is not a menial task; consistency across dashboards is a core unit of observability, ensuring cross-functional teams can discuss issues through a common language/viewpoint, which is only enforceable through a declarative dashboard syntax.


The question was regarding alerts, not dashboards. We obviously deploy dashboards from json.
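For what it's worth, Grafana's file-based provisioning can load those JSON dashboards from disk; a sketch (paths are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml -- illustrative paths
apiVersion: 1
providers:
  - name: team-dashboards
    folder: Ops
    type: file
    options:
      path: /var/lib/grafana/dashboards   # JSON dashboard files live here
```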

But I'm not aware of any way to deploy notification channels; probably that can be done now via API. But either way we need to deploy notification channels with webhooks and tokens, so that part is done manually. And then the alerts are also done manually.


No alerts possible with dashboards and variables.



