
Are we any closer to having some kind of anomaly detection in Grafana (when using Prometheus)?



Have you seen this talk/blog from last year's Monitorama about anomaly detection with Prometheus:

https://about.gitlab.com/blog/2019/07/23/anomaly-detection-u...


Yes, I saw this last year. It's complicated and requires a lot of effort on the user's part compared to, say, CloudWatch anomaly alerting: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...


Just curious: what kind of anomaly detection are you looking for? Can you describe the kind of thing you'd want this system to detect?


We have all sorts of data flowing into Prometheus and charted in Grafana. Grafana lets me alert based on thresholds, and that works fine for many things.

EXAMPLE 1:

Imagine a service. A load average of 2 is unacceptable and means things will be slowing down. However, a heavy job runs every night and pushes load averages between 1.5 and 2.5 during a 30-minute window. I wanted to set a load average alert threshold at 2.0 but can't, since it would fire false alarms too often. So I end up setting it at 3, and end up getting a late alert when things get really bad.

Instead, anomaly detection would look for a number of standard deviations away from a mean, and ideally account for seasonality (i.e., compare the data with the same data over the last few weeks). Then things behave better: if load hits 2 during the day, the alert fires, but during that 30-minute window at night, load would need to exceed 3 before it fires.
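A rough sketch of that as a Prometheus alerting rule (roughly what you'd have to hand-roll today), assuming node_load1 as the load metric; the window sizes and the 3-sigma threshold are purely illustrative:

    groups:
      - name: load-anomaly
        rules:
          # Fire when the 1m load average sits more than 3 standard
          # deviations above its level at the same time last week.
          # clamp_min avoids dividing by a near-zero stddev.
          - alert: LoadSeasonalAnomaly
            expr: |
              (
                node_load1 - avg_over_time(node_load1[30m] offset 1w)
              )
              / clamp_min(stddev_over_time(node_load1[1w]), 0.1)
              > 3
            for: 10m
            labels:
              severity: warning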

EXAMPLE 2:

Let's say I'm tracking 30 metrics for a service. I understand the service well enough to set thresholds on 5 of them. However, a failure rarely shows up in just one place, and the other 25 metrics may show oddities that give an early warning that something is going wrong. Here, some kind of loose "anomaly alert" could help. If a certain metric throws too many false alerts, we find a better way to monitor it.
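One sketch of such a loose early-warning setup (metric names are placeholders): a per-metric z-score recording rule plus a low-severity alert on it.

    groups:
      - name: loose-anomaly
        rules:
          # One recording rule like this per metric you want a "loose"
          # alert on; myservice_queue_depth is a placeholder.
          - record: myservice_queue_depth:zscore
            expr: |
              (
                myservice_queue_depth - avg_over_time(myservice_queue_depth[1w])
              )
              / clamp_min(stddev_over_time(myservice_queue_depth[1w]), 1)
          # Early-warning signal, routed at low severity rather than as a page.
          - alert: MetricLooksAnomalous
            expr: abs(myservice_queue_depth:zscore) > 4
            for: 15m
            labels:
              severity: info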

EXAMPLE 3:

Let's say you have a service that auto-scales, so there's no hard limit on how many requests it handles and I can't really set thresholds on requests/5min. Here, some kind of "anomaly alert" would be useful to let the team know that something is going on that's causing a high level of requests. Again, it should account for seasonality, because things tend to change on weekends.
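A seasonality-aware sketch for that case, comparing the current request rate with the same time one week ago; myservice_http_requests_total is a placeholder counter and the 2x factor is arbitrary:

    groups:
      - name: traffic-anomaly
        rules:
          # Fire when the request rate runs well above what it was at the
          # same time last week.
          - alert: RequestRateAnomalous
            expr: |
              sum(rate(myservice_http_requests_total[5m]))
              > 2 * sum(rate(myservice_http_requests_total[5m] offset 1w))
            for: 15m
            labels:
              severity: warning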

At the end of the day, an anomaly alert is not some "magic solution", just an extra tool in the toolkit to use when it makes sense. Datadog and AWS both have built-in, easy-to-use anomaly alerting. It would be nice for Grafana/Prometheus to have something similar.



