Grafana Labs launches free incident management tool in Grafana Cloud

nikolay · on Sept 13, 2022

All good except that Grafana Cloud is super expensive when you consider it per metric. This probably is the most expensive service per bit of data!

edumucelli · on Sept 13, 2022

Yes, I have started using their cloud for a personal project. Ended up going back to self-hosted. Great thing their product is open source.

divygoel · on Sept 13, 2022

Hey @edumucelli - Would love to learn more about your experience and identify where we can make improvements. If you're up for a 15 min chat, please send me a note at divy.goel@grafana.com or let me know how best to reach you!

divygoel · on Sept 13, 2022

Hey @nikolay! I work at Grafana Labs & focus on pricing - would you be up for a 15 minute chat to discuss this further? If so, feel free to either drop me a note at divy.goel@grafana.com or let me know how best to reach you :)

gaffneyc · on Sept 13, 2022

We recently swapped our metrics to Grafana Cloud and were really surprised (despite being documented) that pricing is based on samples per minute not metrics series. So, for example, if we send a metric every 15s (the Prometheus default) then we get charged as if that were four separate metrics. Support was very helpful explaining everything and they reversed the charge but it still feels weird.
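As a rough illustration of the arithmetic in this comment, here is a minimal sketch that assumes billing is normalized to one sample per minute per series. The function name and the model are hypothetical and are not taken from Grafana's pricing documentation; the sketch also says nothing about how intervals longer than 60 seconds are treated.

```python
def billed_series(active_series: int, scrape_interval_s: float) -> float:
    """Series-equivalents billed under an ASSUMED model where one billed
    series corresponds to one sample per minute (hypothetical, for illustration)."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    samples_per_minute = 60 / scrape_interval_s
    return active_series * samples_per_minute


# 1,000 series scraped every 15s count like 4,000 series at one sample per minute.
print(billed_series(1000, 15))
```

Under this assumed model, moving the scrape interval from 15s to 60s would bring the billed figure back in line with the raw series count.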

divygoel · on Sept 13, 2022

Thanks for the feedback! Agree that this can be made more clear and we will work on that.

hztar · on Sept 14, 2022

We experienced the same thing. Pulled the plug on using the cloud version

oxfordmale · on Sept 13, 2022

Recently you pulled one of our Grafana plugins without notice as you had upgraded this to an Enterprise license. We are more than happy to pay; however, pulling production support without reaching out to negotiate a license is a d** move. Our monitoring was down for several days while we scrambled to get the payment approved. Luckily we didn't suffer any major production outage in that window.

divygoel · on Sept 13, 2022

Hey @oxfordmale - Sorry to hear about your experience. Would you be open to a live discussion so I can understand your issue further?

Surprised to hear this as none of our Enterprise plugins had a change in licensing (i.e. going from free to paid, or shifting within paid tiers) as far as I am aware. Would love to dig into this further.

If you're up for it, feel free to send me a note at divy.goel@grafana.com or let me know how best to reach you!

oxfordmale · on Sept 14, 2022

It concerns this plugin. We had a previous version of this working for years. It was suddenly deactivated one day, and after we reached out to the support team, we were advised we had to upgrade to Grafana Enterprise to get it working again. This is fair enough; however, what shows an utter lack of respect is that you couldn't give us a grace period, and this resulted in our production monitoring being broken for several days. A company that cares would reach out to its customer repeatedly by email first, and if that got ignored, grant a grace period when the customer reaches out. Payment authorisations take time, and it is not great to have broken monitoring on a system that processes millions of dollars each day.

https://grafana.com/blog/2020/07/22/introducing-the-new-and-...

RichiH · on Sept 14, 2022

NB: I work for Grafana Labs as Director of Community.

As this is not how we (try to) operate, I also had a look.

From what I could find, it seems the account you are referring to is a very early Cloud account. For reasons I don't know and which might be lost to history, your account had an old and non-standard license attached to it. From the viewpoint of today, the license itself seems broken.

To be clear, that is neither your fault nor do I believe you could have caught it even if you looked. While there was no change in the licensing requirements on the plugin, an upgrade to a significant rewrite of the plugin "fixed" the problem of accepting a broken license. That "fix" meant the plugin stopped working.

Again, this is not your fault. But it was not a deliberate action by Grafana Labs nor caught by our testing, either.

I believe your company is in contact with David Dorman, our Head of Self-Service, about this. If you'd like me to ask him to follow up with you directly as well, please let me know how to best contact you.

bcjordan · on Sept 13, 2022

Interesting, I'm using Grafana Cloud for just a few Prometheus metrics at the moment and have found it reasonable so far so am interested in what scale up looks like.

I'm curious—what other sorts of services are you referring to in your comparison?

igetspam · on Sept 13, 2022

Are you planning any posts on comparing your new incident tool to other services? We currently use incident.io and are happy with it but we pay a lot for Grafana Cloud right now so it's worth considering if we can reduce spend elsewhere.

Edit: We're happy with incident.io but free is compelling if the product is good and having a single view for observability is useful

sjwhitworth · on Sept 13, 2022

Hey, incident.io CEO here. Glad to hear you’re happy with the product. The people at Grafana are great - congrats on the launch! Will have to take the product for a spin sometime :)

farhan0410 · on Sept 13, 2022

Thanks Stephen (product marketing lead for Incident) - we are also big fans of what y'all are building!

matryer · on Sept 13, 2022

They're both great tools :) Lots of similarities, and plenty of differences.

farhan0410 · on Sept 13, 2022

totally understand that and great shout.

happy to pull something together for ya if there are particular workflows you are most interested in comparing

farhan.manjiyani@grafana.com

CSMastermind · on Sept 13, 2022

Seems great if you're already on the Grafana platform.

One thing I'd say is that I find the "react with a robot emoji on Slack to add information to the timeline" mechanism a little kludgy; hopefully that's not the only way of doing it.

Also does this tool have a postmortem workflow? I didn't see one in the documentation and that seems like an important part of the incident response process.

drc · on Sept 13, 2022

Thanks for the feedback re: robot emoji. You can also use a slash command if you want to add a new piece of text to the timeline from Slack.

re: postmortem workflow. The timeline view is built to help with postmortems. One of the ways we're doing this is by making it easy to paste the info from the timeline, as rich text or Markdown, into your postmortem workflow. You'll see that option at the top right of the timeline view. We have a lot more ideas in this area and will be investing in this.

Curious if there are specific features you'd like for postmortems?

CSMastermind · on Sept 14, 2022

For postmortems the basis is having document templates that are linked to the incident. From there the ability to automatically add information to the individual postmortem documents with things like timeline and MTTA/MTTR. Add the ability to comment on, share, export, and collaborate on these documents.

The second piece of functionality is around action items. Almost all postmortems generate action items. I need a way to tie action items to a specific post mortem and integrate automatically with my project tracking software (Jira, Asana, Linear, whatever). Ideally there's some top level reporting functionality that shows me the status of post mortem action items.

A nice to have would be automatically scheduling the post mortem meeting with a calendar integration.

farhan0410 · on Sept 19, 2022

This is exactly how our post-mortem functions work -- we pre-generate a Google Doc with a template and we have the ability to copy (in rich text / Markdown) the critical timeline.

On the action item side: today, actions are tied to GitHub issues, and we are working to extend that to other trackers, namely Jira and ServiceNow. We do already have a view that gives you a quick look at current status, how long the incident took, and open action items, which you can filter by labels or severity (e.g. all severe production issues).

The postmortem doc, meeting link and Slack channel are all automatically created for every incident.

oxfordmale · on Sept 13, 2022

Are you going to upgrade this feature to an Enterprise license one day and then revoke access without a grace period? This happened to one of our Grafana plugins and resulted in a several day outage while we scrambled to sort out the payment.

As your company has shown zero respect for its customers, I will not be using any of your systems. To be absolutely clear, it is fair to charge for any of your products; however, if you change it from freemium to paid you can't just pull the plug without reaching out.

RichiH · on Sept 14, 2022

For anyone reading this, there's a reply and relevant context at https://news.ycombinator.com/item?id=32837295

lstamour · on Sept 13, 2022

Another suggestion re gathering data or threads from Slack: using Message Shortcuts for greater visibility/discoverability? https://api.slack.com/interactivity/shortcuts Might need to combine this with Slack modals for adding details (if it lets you do this)

xwowsersx · on Sept 13, 2022

I'm looking into Grafana Cloud currently. We run a few services across 4 different environments. I'd like to have a single place to view metrics as well as response times for various API endpoints, metrics related to RDS, etc. Also interested in the incident management tool. We have around 30 EC2 instances running currently but will be scaling that up further. What can I expect in terms of total pricing? Perhaps hopping on a call with someone from Grafana would make the most sense?

divygoel · on Sept 13, 2022

Hey @xwowsersx - Happy to help here! Would you mind sending me a note at divy.goel@grafana.com and we can coordinate a time to connect? Looking forward to it!

xwowsersx · on Sept 13, 2022

Emailed. Thanks!

aglazer · on Sept 13, 2022

This looks great and cool to see more innovation in the space.

We've been using Rootly https://rootly.com and love it.

jjtang1 · on Sept 13, 2022

Thank you for the kind words, Aaron. Been a pleasure partnering with the Taplytics team!

We work with 100s of companies like Canva, Grammarly, OpenSea and others to help build a consistent incident response process on Slack if you're interested. Happy to give you the no-BS sales demo.

FWIW - we are big fans of Grafana and have a native integration (think automatic Grafana metric/dashboard snapshots into the #incident channel).

donavanm · on Sept 13, 2022

How do you/users programmatically quantify MTTR (and related metrics) per incident, or in aggregate? Although it shades towards problem management this would seem necessary to achieve the claim of “reduces mean time to repair (MTTR).”

Bonus questions, are you tracking or driving improvement in the related times for detection/response/mitigate/recover?

Disclosure: Principal at AWS currently in a similar space. Though I ask in a personal capacity and interest.

matryer · on Sept 14, 2022

Hey, thanks for your question. The tool keeps track of declaration and resolution times by watching when the status is changed. It also lets you manually specify when the incident really started, and when it actually ended. We can use this data to measure a few things, and watching how this changes over time helps us figure out if we're getting better or worse, on average. We want to be careful what we incentivise by default, and we're actively working on this area. The data is going to be available for people to build their own visualisations (in Grafana).
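To make the measurement described above concrete, here is a minimal sketch of computing a mean incident duration from status-change timestamps, with optional manual overrides for when the incident really started and ended. The `Incident` class, its field names, and `mean_time_to_repair` are hypothetical and are not the tool's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    declared_at: datetime                    # when the status changed to "declared"
    resolved_at: datetime                    # when the status changed to "resolved"
    actual_start: Optional[datetime] = None  # manual override, if known
    actual_end: Optional[datetime] = None    # manual override, if known

    def duration(self) -> timedelta:
        # Prefer the manually specified times; fall back to the status changes.
        start = self.actual_start or self.declared_at
        end = self.actual_end or self.resolved_at
        return end - start


def mean_time_to_repair(incidents: List[Incident]) -> timedelta:
    """Average duration across incidents (one simple definition of MTTR)."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.duration() for i in incidents), timedelta())
    return total / len(incidents)
```

Tracking this value over successive time windows is one way to see whether the average is getting better or worse.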

I'd be very interested to hear your thoughts too?

donavanm · on Sept 19, 2022

Sorry for the delay, was traveling on holiday.

In short, yeah, using the incident status as the implied times makes sense for the bulk of cases. Totally agree on picking out signal from the user's inherent actions, but allowing them to provide more specific data when they know better.

Digging in a little further, I'm personally interested in moving past the incident data and inspecting the incoming alert(s) and related telemetry/metric/alarm data. For example, think of alarm definitions like “five 1m datapoints with a value above 0.1.” There’s a good argument to count impact (and incident duration) from that first datapoint > 0.1. Then there's the delta from metric processing to alert to incident creation. On the backend there's frequently a delta between mitigating impact and actual incident resolution. Again, I think getting back to the source alarm/alert/metric data would get us a more accurate view of operations and customer impact.
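A minimal sketch of the alarm example above, assuming evenly spaced one-minute datapoints given as `(minute, value)` pairs. The function `find_alarm` is hypothetical; it returns both the first breaching datapoint (a candidate impact start) and the datapoint at which an alarm of the form "N consecutive datapoints above a threshold" would fire.

```python
from typing import List, Optional, Tuple


def find_alarm(
    datapoints: List[Tuple[int, float]],
    threshold: float = 0.1,
    required: int = 5,
) -> Optional[Tuple[int, int]]:
    """Return (impact_start, alarm_fired_at) for the first run of `required`
    consecutive datapoints strictly above `threshold`, or None if no such run."""
    run_start = None
    run_len = 0
    for ts, value in datapoints:
        if value > threshold:
            if run_len == 0:
                run_start = ts  # first breaching datapoint of this run
            run_len += 1
            if run_len == required:
                return run_start, ts
        else:
            run_len = 0  # run broken; start counting again
    return None


values = [0.0, 0.2, 0.05, 0.3, 0.3, 0.3, 0.3, 0.3, 0.0]
result = find_alarm(list(enumerate(values)))
print(result)
```

The gap between the two returned timestamps is the detection lag that is hidden when incident duration is counted only from the alert or from incident declaration.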

matryer · on Sept 13, 2022

I work at Grafana, so AMA about the tool :)

buro9 · on Sept 13, 2022

I also work at Grafana Labs, and could just Slack you... but as you've asked...

Incident is available in the free tier, that's awesome... are there any limitations on that at all? Is the free tier version of Incident as fully featured as the paid tiers?

matryer · on Sept 13, 2022

Hello :) The free version is fully featured, so you're just limited by the number of users in the free tier (three).

farhan0410 · on Sept 13, 2022

Yes Grafana Incident is available (fully featured) in the free tier of Grafana Cloud

solarkraft · on Sept 13, 2022

Is there a chance we'll see it open sourced / a self hosting option?

matryer · on Sept 13, 2022

No plans currently, but a self hosted option seems reasonable. Although, most people like their emergency tech not hosted on their own tech :)

jrockway · on Sept 13, 2022

I dunno, I don't really mind self-hosting monitoring infrastructure. I basically pay for a website uptime checker to check that Alertmanager is working. If Alertmanager is down, obviously you have to manually check to see what else is down, but it doesn't fail open.

I wrote a little glue to make this straightforward for anyone else who uses Prometheus/Alertmanager: https://github.com/jrockway/alertmanager-status This ensures that the website check checks the health of the whole alerting pipeline; Prometheus has an always firing alert, Alertmanager is set to send that alert to alertmanager-status, and alertmanager-status starts failing its external health check if it isn't seeing that alert firing at the configured interval. If one of [Prometheus, Alertmanager, alertmanager-status] fails, then your website health check fails.
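The pattern described here is often called a dead man's switch. Below is a minimal sketch of the core logic only; it is not the actual alertmanager-status code, and the class and method names are hypothetical. It also assumes the check should report unhealthy until the first alert has been received, which the real tool may handle differently.

```python
import time
from typing import Callable, Optional


class DeadMansSwitch:
    """Reports healthy only while an always-firing alert keeps arriving."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen: Optional[float] = None

    def record_alert(self) -> None:
        # Called whenever the always-firing alert is delivered (e.g. via webhook).
        self._last_seen = self._clock()

    def healthy(self) -> bool:
        # Called by the external uptime checker. Fails if the alert has never
        # been seen (an assumption) or has gone quiet for too long.
        if self._last_seen is None:
            return False
        return self._clock() - self._last_seen <= self.max_silence_s
```

Because the health check depends on the alert arriving, a failure anywhere along the path that delivers it shows up as a failed external check.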

aalbertson · on Sept 13, 2022

Being not hosted on my same tech is one thing, still being self hosted so I can externalize it for a federal implementation is another. Definitely needs to be self hosted for ALL components. :)

matryer · on Sept 13, 2022

Yeah makes sense. I'll add your vote to the list :)

trog · on Sept 13, 2022

> No plans currently, but a self hosted option seems reasonable. Although, most people like their emergency tech not hosted on their own tech :)

FWIW our core application is hosted in AWS but we maintain our own Grafana infrastructure independently. So it's not hosted on our own tech, per se, though we're still responsible for keeping it online.

This looks great & I would also love to see a self-hosted option. Honestly, the more stuff like this gets rolled into OSS Grafana, the more likely I am both to try it and to eventually end up on the managed Grafana Cloud, as I will inevitably get sick of trying to maintain our own separate infra & the business case for centralising in Cloud makes more and more sense.

DrRobinson · on Sept 13, 2022

It looks really cool, and since we already use Grafana it would be a good fit for us, but for on-call purposes Slack isn't very useful. If we were to migrate from PagerDuty to this, we would need an app that can override Do Not Disturb and wake people up. Do you have any plans for any such app?

drc · on Sept 13, 2022

Hi, disclaimer: I work at Grafana.

Over the next few months, we plan to build a native mobile app for iOS & Android for OnCall that would let you achieve this.

OnCall is a separate product from Incident. It's available via OSS and Cloud. Incident and OnCall work well together, or you can use either as standalone!

thayne · on Sept 13, 2022

I'm a little curious how this will work with self-hosted OnCall. Will the user need to set up their own push notification accounts for Apple and Google, will it use a centralized service from Grafana, will it have a background service that polls the hosted OnCall service, or something else?

prepend · on Sept 13, 2022

Any plans to have this FedRAMP certified so it can be used in US federal government incident management?

SkoChippy · on Sept 13, 2022

No immediate plans for FedRAMP, but this may be on the roadmap in future quarters.

mrtimbo · on Sept 13, 2022

Any plans for Teams integration? We recently switched all our bots over from Slack.

matryer · on Sept 13, 2022

You don't need Slack to use the tool, but yeah, a Teams integration is on the list, and will probably drop early next year.

altdataseller · on Sept 13, 2022

Anyone replacing PagerDuty with this?

buro9 · on Sept 13, 2022

I think that would be Grafana OnCall https://grafana.com/products/oncall/

Internally we (Grafana Labs) already have replaced PagerDuty and are using it for our teams running critical systems.