
Show HN: Using machine learning to recommend dashboards during incidents - shinryudbz
https://beta.overseerlabs.io/ui/product/index.html
======
shinryudbz
Hi everyone, I'm one of the founders of Overseer. We built this tool because
we noticed that engineers had to dig through many dashboards when diagnosing
an incident. We felt this process could be streamlined through the use of
machine learning.

When we first started working on this project over a year ago, we weren't sure
if the algorithms would work, or if our insights would be of value to anyone.
We were also struggling to figure out how to make it easier for people to try
the product without having to change their existing workflow.

Since then, we've made huge improvements to the algorithms, deployed the tech
for several large customers, and demonstrated value. Now I'd love to get a bit
more feedback from you guys and see if we're going in the right direction!

So here's how the tool works:

1 - We pull down your dashboards from your existing monitoring tool (e.g.
Datadog/Wavefront/Librato) using your API key.

2 - We integrate with your PagerDuty account via a webhook that notifies us
when an incident has triggered.

3 - When our webhook is invoked, we use machine learning to rank your
dashboards, rank the metrics on those dashboards, and notify you via
Slack/email of the top dashboards and top metrics to look at (sketched
below).
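
To make that flow concrete, here's a minimal sketch of what such a
webhook-driven ranking service could look like. Everything here (the route,
the payload handling, the MAD-based scorer) is a hypothetical stand-in, not
our actual implementation:

    # Hypothetical sketch of the incident -> ranked-dashboards flow.
    # Not Overseer's real code; the scorer is an illustrative stand-in.
    from flask import Flask, request

    app = Flask(__name__)

    # Dashboard -> metric -> recent samples, pulled earlier from the
    # monitoring tool's API (Datadog/Wavefront/Librato) via the API key.
    DASHBOARDS: dict[str, dict[str, list[float]]] = {
        "checkout-service": {"p99_latency_ms": [120.0, 118.0, 125.0, 910.0]},
        "auth-service": {"error_rate": [0.01, 0.02, 0.01, 0.01]},
    }

    def anomaly_score(samples: list[float]) -> float:
        """Stand-in scorer: distance of the latest sample from the median,
        scaled by mean absolute deviation (robust to non-Gaussian noise)."""
        median = sorted(samples)[len(samples) // 2]
        mad = sum(abs(x - median) for x in samples) / len(samples) or 1.0
        return abs(samples[-1] - median) / mad

    @app.route("/webhook/pagerduty", methods=["POST"])
    def on_incident():
        incident = request.get_json(silent=True) or {}
        # Rank dashboards by their most anomalous metric, then notify
        # (a real system would post to Slack or send an email here).
        ranked = sorted(
            DASHBOARDS,
            key=lambda d: max(anomaly_score(s) for s in DASHBOARDS[d].values()),
            reverse=True,
        )
        print(f"Incident {incident.get('id', '?')}: look at {ranked[:3]}")
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)  # PagerDuty would POST to this endpoint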

For this demo, we only expose the Wavefront plugin, and you'll be able to
configure it on the initial page.

To integrate with PagerDuty, you'll need a URL to our endpoint, and we'll
need an email address where we can send the analysis. You can configure both
by clicking on your user name (top right) and doing the following:

1. Click "Generate API Key" and jot down the generated webhook URL.
PagerDuty will need that (there's a quick smoke test below).

2. Fill out the "Organization Email" text box. We will send our analysis
there!
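
If you want to sanity-check the hookup before wiring up PagerDuty itself,
something like this works. Note that the payload shape here is a made-up
test body, not PagerDuty's real webhook schema (consult their docs for
that):

    # Hypothetical smoke test for your generated webhook URL; the payload
    # is a made-up stand-in, not PagerDuty's actual webhook schema.
    import requests

    webhook_url = "https://<your-generated-webhook-url>"  # from "Generate API Key"
    resp = requests.post(
        webhook_url,
        json={"incident": {"id": "TEST-1", "status": "triggered"}},
    )
    print(resp.status_code)  # expect a 2xx if the event was accepted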

Given that we'll be dealing with potentially sensitive data, we reluctantly
decided to ask folks to register with us first - the extra layer of security
lets us protect your data better. My apologies for the inconvenience.

I'd love to see what the HN community thinks and how we can make it better!

~~~
rhizome
The thing that comes to mind is that maybe the incident should create a
notification that also generates its own dashboard specific to the components
that are (likely) involved. So rather than scanning or hunting for the right
collection of indicators, they're all in the notification (or a URL within
it). Note: I'm pretty sure this doesn't exist in a turnkey form.

~~~
josh_overseer
There are a couple of things you might be saying: (1) When an incident
happens, you should get notified about which components are likely involved.
That is basically what we do if you hook us up to PagerDuty - when PagerDuty
sends out an alert, we also send out an email listing which dashboards are
likely relevant, along with links to Overseer dashboards that show you which
metrics are probably relevant. (2) By "the incident should create a
notification" you might mean you'd like Overseer to generate the alert
itself. While that would be possible, Overseer really aims to shine as a
triage tool rather than as an alerting service.

------
mendeza
The fact that you put Bayes' rule on your page excites me. I am taking a
graduate course in Bayesian machine learning, so Bayes is all around me lol.
Any use of Gaussian processes or probabilistic graphical models? I'd also
love to hear whether you use a Bayesian treatment for automatic model
selection. It sounds ideal in lecture, but I am interested to see how it is
applied in real-world settings.

~~~
shinryudbz
So it turns out that these metrics don't exactly follow a Gaussian
distribution, so it's hard to get these algos to work right out of the box.
Additionally, speed was an important requirement for us (for both training
and evaluation), so we had to toss out a lot of the fancy, but slower,
techniques.
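
To illustrate the kind of thing that goes wrong (a generic demonstration,
not our model): on heavy-tailed data like latency, a Gaussian "3-sigma" rule
fires far more often than the ~0.1% it's supposed to, while a plain
empirical quantile gives you the alert rate you actually asked for.

    # Why a Gaussian assumption misleads on heavy-tailed operational
    # metrics; log-normal "latency" here is just an illustrative stand-in.
    import numpy as np

    rng = np.random.default_rng(0)
    latency = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)  # heavy right tail

    # Gaussian view: mean + 3*sigma "should" flag ~0.1% of points...
    three_sigma = np.mean(latency > latency.mean() + 3 * latency.std())

    # ...an empirical quantile flags the rate you asked for, by construction.
    quantile = np.mean(latency > np.quantile(latency, 0.999))

    print(f"3-sigma rule fired on {three_sigma:.2%} of points")  # roughly 2%
    print(f"99.9th percentile:    {quantile:.2%} of points")     # ~0.1%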

------
rosstex
I love the photos of equations. "We use enhanced Bayes' Rule and Joint
Summation algorithms to serve you the most relevant and useful dashboards."
Probably more truth to it than it seems :)

~~~
shinryudbz
Glad you liked it :) It turns out that modeling operational metrics is a lot
harder than I expected, so there was quite a bit of work we had to do to get
the algos to work.
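
For anyone wondering what the un-enhanced version of that slide looks like
in practice, here's a toy Bayes' rule ranking. The categories and numbers
are made up for illustration; this is not our actual algorithm:

    # Toy illustration of Bayes' rule for ranking dashboards:
    #   P(dashboard | evidence) is proportional to
    #   P(evidence | dashboard) * P(dashboard)
    # All numbers are made up; this is not Overseer's actual algorithm.

    prior = {"checkout": 0.5, "auth": 0.3, "billing": 0.2}       # base rates
    likelihood = {"checkout": 0.9, "auth": 0.1, "billing": 0.4}  # fit to evidence

    unnormalized = {d: likelihood[d] * prior[d] for d in prior}
    total = sum(unnormalized.values())
    posterior = {d: p / total for d, p in unnormalized.items()}

    for d, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"{d}: {p:.2f}")  # checkout 0.80, billing 0.14, auth 0.05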

~~~
nerdponx
I was wondering about this. I assume you're treating this info as proprietary,
but in case you aren't I'd love to hear more about how you actually implement
this.

~~~
shinryudbz
Well, there are a lot of details and I don't think I can cover them all
here, but if you're interested in the general framework I used to approach
this problem, take a look at this blog post I wrote last year:
https://medium.com/@upal/how-to-use-machine-learning-to-debug-outages-10x-faster-e480b7e2a907
Let me know if you have any feedback!

~~~
srean
I am a little surprised that you got some mileage out of procedures and
algorithms that rely on the Gaussian assumption (that includes k-means).
From experience, the raw metrics are, shall we say, as violently
non-Gaussian as it gets. OK, OK, you need not tell me what you are doing to
address that, as I myself am being rather economical with the truth. But I'm
so glad, so glad, that someone is using multivariate analysis - about time
too. From someone who dabbles in a similar space, I wish you well.

~~~
shinryudbz
Thank you :) Yes, the use of multivariate analysis was a crucial insight for
me, and I'm hoping these ideas will push the monitoring community forward!

------
bllguo
Hmm... I'm not convinced that this is a problem that actually needs solving.
You admit to having had these misgivings when starting out - could you
describe in general terms what changed your mind?

~~~
shinryudbz
Great question!

Being an engineer myself, this was a personal pain point and I wanted to solve
it, but the key question was whether or not machine learning would help. Thus,
most of the time was spent deploying the tech with early adopters, refining
the algos, and trying to better understand the value.

What I learned was that our message resonated with some companies more than
others. Working with those companies and getting some proof points on the
value is what kept us going!

------
singold
I like the idea. It's something we might buy at my job, but it being "cloud
based" is a deal-breaker for us. Any plans for a self-hosted version?

~~~
shinryudbz
Absolutely! Currently all our large deployments are on-prem. Please reach out
to me and we can discuss further: upal@overseerlabs.io.

------
capkutay
Looks interesting...but I don't want to sign up with an email before seeing
what it actually does.

~~~
josh_overseer
Hi, I'm also from Overseer :-). The site we link to has a summary, but here's
a more in-depth explanation for you since you're interested. Basically we pull
from wherever you keep your ops dashboards and metrics (right now we've only
exposed Wavefront as a source), and then pull down some history for the data
on those dashboards. We use machine learning to determine how the metrics on
your dashboards normally behave, and then start monitoring those dashboards in
real time. We use this info to generate "health scores" for your dashboards
which you can view any time. Also, if you hook up PagerDuty, when a PagerDuty
alert goes off, we can email you info about which dashboards are unhealthy and
what metrics on those dashboards are contributing to that strange state.
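
As a rough illustration of the "health score" idea (a toy stand-in, not our
actual scoring): learn a normal band per metric from its history, then score
the dashboard by the fraction of its metrics currently inside their bands.

    # Toy per-dashboard "health score": fraction of metrics whose latest
    # value sits inside a band learned from history. Illustrative only.
    import numpy as np

    def learn_band(history, lo_q=0.01, hi_q=0.99):
        """Normal operating band from empirical quantiles of the history."""
        return np.quantile(history, lo_q), np.quantile(history, hi_q)

    def health_score(dashboard):
        """1.0 = every metric in band; 0.0 = every metric out of band."""
        ok = 0
        for series in dashboard.values():
            lo, hi = learn_band(series[:-1])   # learn from all but the latest
            ok += lo <= series[-1] <= hi       # score the latest point
        return ok / len(dashboard)

    rng = np.random.default_rng(1)
    dash = {
        "p99_latency_ms": np.append(rng.lognormal(3, 0.3, 500), 400.0),  # spiking
        "requests_per_s": np.append(rng.normal(100, 5, 500), 101.0),     # steady
    }
    print(f"health: {health_score(dash):.2f}")  # 0.50 - one of two metrics off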

------
stephengillie
Dashboards are largely dangerous. They promote a "Star Trek Bridge" theater -
a play where one actor looks at a screen, becomes concerned, and takes action.
Often this action will include a report to a senior officer.

Business leaders often enjoy seeing their engineers act out this scene. It's
gratifying to see someone take expected action, and gratifying to simply look
at your direct reports to see if they're working.

But this theater play requires engineers to be idly staring at dashboards
first, so they are in the right place to see an issue and take action. This
leads to bored engineers and complacency, and delays issue resolution. It's
also inefficient to pay people to be bored.

Instead of having a dashboard display a subset of metrics, configure
alerting on those metrics. PagerDuty notifies you the same whether you're in
another app, another castle, another room, or another state - you can pay
people to do other things instead of staring at a dashboard all day.

A big TV full of metrics is a prop that your actors and engineers will ignore.

~~~
jamesmishra
Right, but when I get paged... the first thing I do is look at dashboards and
logs.

At a sufficiently large company, looking at dashboards can be difficult
because:

1. There are so many dashboards to look at... for so many different services.

2. Spurious correlations between two graphs showing unrelated events can lead
you down the wrong path, if you don't confirm your dashboard-generated
hypotheses with log statements or other information.

#2 can be solved by a stronger reliance on logs, distributed tracing in the
style of http://opentracing.io/, and other information.

Something like this Show HN would be useful for problem #1 though.
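
Point #2 is easy to demonstrate, by the way: two completely unrelated
random-walk-ish metrics will "correlate" alarmingly often, which is exactly
the trap a wall of dashboards sets during an incident.

    # Spurious correlation demo: independent random walks (a decent
    # stand-in for many operational metrics) correlate by accident.
    import numpy as np

    rng = np.random.default_rng(42)
    strong = 0
    trials = 1000
    for _ in range(trials):
        a = np.cumsum(rng.standard_normal(500))  # "CPU on service A"
        b = np.cumsum(rng.standard_normal(500))  # "latency on service B"
        if abs(np.corrcoef(a, b)[0, 1]) > 0.5:
            strong += 1

    # Typically a third or more of totally unrelated pairs exceed |r| = 0.5.
    print(f"|r| > 0.5 in {strong / trials:.0%} of unrelated pairs")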

~~~
stephengillie
So when paged, your first step is to look at something that is difficult to
look at, and possibly waste time on false correlations?

If certain metrics are known to be important to your app, monitor
specifically on those - that way you're alerted _because_ of low memory or
pooled connections or something you're watching, and you can lead your team
with that info. If the metrics aren't known to be important, why are you
wasting your time looking at them on a dashboard?

If you're paged, it's the wrong time to debug and fix a problem in code -
just as when it's raining is the wrong time to patch the leaky roof. Those
notes and todos should be pulled into the next sprint triage. In the moment,
restoring service should be the priority. If restoring service requires
modifying code, databases, routes, etc., then your testing environments and
change control policies need improvement.

