
Outlier Detection at Netflix - diab0lic
http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html
======
diab0lic
One of the authors here, I'll be around to answer any questions if anyone has
them. I'm sure my colleague will be around as well.

~~~
danso
At the risk of sparking off-topic debates about Netflix's content
selections...are the same event streaming/data processing frameworks also used
to detect trends and take action on the content/frontpage side? Such as
detecting an anomalous spike of users watching some obscure movie in the last
24 hours, and then updating the "Trending Now" queues for users systemwide? Or
is that something that happens (relatively) slowly enough not to require an
elaborate real-time framework?

~~~
diab0lic
I don't work on the recommendations side of things, but this is actually the
ONE technical topic at Netflix that I don't believe I am allowed to discuss.
Sorry!

------
boltzmannbrain
Have you heard of Numenta? Their machine intelligence algorithms are great for
anomaly detection in streaming data, including a product "Grok" for IT
analytics ([http://numenta.com/grok/](http://numenta.com/grok/)). And all open
source in NuPIC:
[https://github.com/numenta/nupic](https://github.com/numenta/nupic)

~~~
diab0lic
This was something I looked into when I performed the initial investigation
for this project. It was a bit difficult to locate supporting academic
material on the algorithm though. The white paper on the page seemed more like
marketing material than an academic paper, which I imagine serves their
business purposes better.

I will say that the outlier score and confidence score NuPIC produces are
things that would have been incredibly useful by the time this was brought
to its end users. Definitely worth a look for anyone looking to do real-time
stream processing for anomaly detection.

------
nightski
This is very interesting. Looking at the data, though, it is labeled
"Errors per server". It isn't really disclosed what variables this figure
entails, but I'd have to imagine that adding more information than simple
error counts would improve the separability of the data?

~~~
diab0lic
It would certainly help separability to observe the data in a
higher-dimensional space; however, when you're taking automated action against
the results, it's sometimes pertinent to know which specific metric is causing
the server to be an outlier.

For example, if network TCP retransmits are throwing it off, we probably just
want the system to kill it and let the autoscaling group bring up another
server. If it's memory usage, we probably want to page someone.
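As a rough illustration, here is a minimal sketch of per-metric flagging with
DBSCAN (the clustering approach the post describes); the metric names, values,
and eps thresholds below are made up for illustration:

```python
# Hypothetical sketch: flag outlier servers per metric with DBSCAN, so that
# automated actions can be keyed to the specific metric that fired.
import numpy as np
from sklearn.cluster import DBSCAN

def outlier_servers(values, eps):
    """values: 1-D array, one recent reading per server.
    Returns indices of servers that DBSCAN labels as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(values.reshape(-1, 1))
    return np.where(labels == -1)[0]

# One detector per metric keeps the cause interpretable.
metrics = {
    "tcp_retransmits": (np.array([1.0, 1.2, 0.9, 1.1, 9.5]), 0.5),
    "memory_used_pct": (np.array([61.0, 63.0, 60.0, 94.0, 62.0]), 3.0),
}
for name, (readings, eps) in metrics.items():
    for server in outlier_servers(readings, eps):
        # e.g. terminate on retransmit outliers, page on memory outliers
        print(f"server {server} is an outlier on {name}")
```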

~~~
pilooch
Yes, which is one reason why non-linear clustering techniques can be a
challenge when investigation follows outlier detection.

------
jordanthoms
This is very interesting, and it touches on a frustration I have with New
Relic and the other alerting services we use. New Relic uses a 5-minute
rolling average for error rates and alerts when that average goes above some
threshold. However, that means it takes ~5 minutes from a spike occurring to
an alert being created - even if the error rate has increased to 50%.

It would be much better for it to be doing this sort of outlier detection - a
gradual increase in error rate to 3% should not trigger a critical alert,
whereas a big jump in error rates should trigger an alert quickly.

Has anyone implemented a system like this?
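To make the lag concrete, here is a toy illustration (not New Relic's actual
implementation, and the 20% threshold is invented): a 5-minute rolling average
dilutes a sudden spike across the whole window, while a simple jump check
fires on the first bad minute.

```python
# Toy numbers, assuming one error-rate sample per minute.
rates = [0.01, 0.01, 0.01, 0.01, 0.50]  # spike to 50% in the last minute

rolling_avg = sum(rates[-5:]) / 5        # 0.108 -- under a 20% alert threshold
jump = rates[-1] - sum(rates[:-1]) / 4   # 0.49  -- the spike is obvious now

print(f"5-min rolling average: {rolling_avg:.3f}")  # threshold not crossed yet
print(f"last-minute jump:      {jump:.3f}")         # fires immediately
```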

------
optimali
It would be interesting to try fitting a distribution to error rates (some
type of counting process) and then monitoring the probability of having
observed the given number of errors within some period of time. Low
probability events might then indicate an outlier.
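For example, a minimal sketch assuming a Poisson counting model (one of
several plausible choices), with made-up historical counts:

```python
# Hypothetical sketch: fit a Poisson rate to historical per-minute error
# counts, then flag new counts whose tail probability is very small.
import numpy as np
from scipy import stats

history = np.array([3, 5, 4, 2, 6, 4, 3, 5, 4, 4])  # made-up past counts
lam = history.mean()                                 # MLE of the Poisson rate

def tail_probability(count):
    """P(X >= count) under Poisson(lam) -- small values suggest an outlier."""
    return stats.poisson.sf(count - 1, lam)

print(tail_probability(5))    # unremarkable
print(tail_probability(15))   # low probability -> candidate outlier
```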

~~~
diab0lic
We have another component of the same outlier detection system that does this
type of fitting, identifying low probability events using a Bayesian model +
Markov Chain Monte Carlo. It hasn't gained nearly as much traction internally
(yet) as the clustering approach here.
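For the curious, here is a minimal sketch of that flavor of approach
(illustrative only, not the Netflix implementation): a random-walk Metropolis
sampler for the rate of a Poisson error-count model, with a new observation
scored against the posterior.

```python
# Illustrative only: posterior over a Poisson rate via random-walk Metropolis.
# (With a Gamma prior this posterior has a closed form; MCMC is shown because
# it generalizes to models without conjugacy.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
counts = np.array([3, 5, 4, 2, 6, 4, 3, 5, 4, 4])  # made-up historical counts

def log_post(lam):
    if lam <= 0:
        return -np.inf
    prior = stats.gamma.logpdf(lam, a=2.0, scale=2.0)   # weak Gamma prior
    return prior + stats.poisson.logpmf(counts, lam).sum()

lam, samples = 4.0, []
for _ in range(5000):
    prop = lam + rng.normal(scale=0.5)                  # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(lam):
        lam = prop
    samples.append(lam)
post = np.array(samples[1000:])                         # drop burn-in

# Posterior predictive tail probability of a new count being >= 15:
print(stats.poisson.sf(14, post).mean())
```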

~~~
optimali
Cool. Have you found the MCMC approach to be less robust than clustering? It
seems like this may be the case due to probability model assumptions.

------
sobinator
Why not take this a step further and have another system try to diagnose the
issue with the identified outlier server before handing it over to the alert
system? Seems like you'd be generating a lot of alerts otherwise.

~~~
diab0lic
Our central alerting gateway (CAG) does much more than just send emails and
pages. When the system was first built we hooked into CAG directly, as it is
capable of taking many of the desired actions on its own.

------
HamSession
Reddit discussion on the same thing:
[https://www.reddit.com/r/MachineLearning/comments/3dbrum/outlier_detection_at_netflix/](https://www.reddit.com/r/MachineLearning/comments/3dbrum/outlier_detection_at_netflix/)

My thoughts are included there, and I thought they might be of interest to
anyone looking into this problem.

------
the_imp
The data looks like it could lend itself to an approach where you model the
error rate based on the prior data (at simplest, get a mean and variance out
of it) and then use a Chi-square critical range check to see if the last n
(degrees of freedom in the check) measurements are likely to have come out of
the modelled distribution. Is that something you've considered?
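Concretely, something like this (a sketch under a normal-model assumption,
with made-up numbers): with mean and variance estimated from prior data, the
sum of squared standardized deviations over the last n points is approximately
chi-square with n degrees of freedom.

```python
# Hypothetical sketch of the chi-square check described above.
import numpy as np
from scipy import stats

prior = np.array([4.1, 3.8, 4.0, 4.3, 3.9, 4.2, 4.0, 3.7])  # made-up history
mu, sigma2 = prior.mean(), prior.var(ddof=1)

def chisq_pvalue(recent):
    """P-value for the last n points coming from N(mu, sigma2)."""
    n = len(recent)
    statistic = np.sum((np.asarray(recent) - mu) ** 2) / sigma2
    return stats.chi2.sf(statistic, df=n)

print(chisq_pvalue([4.0, 4.1, 3.9]))   # consistent with the model
print(chisq_pvalue([6.5, 7.0, 6.8]))   # tiny p-value -> likely outliers
```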

~~~
diab0lic
Hey there,

We have considered modeling the distribution from which the data is typically
drawn and then calculating likelihood of newly observed data. Some of the
approaches that we use to detect anomalies on stream starts per second (SPS)
now depend on these services. Same software package, slightly different
solution.

A colleague of mine (Chris) implemented a data tagger which allows users to
annotate data that is typically fed into this system. We have plans to have
the backend automatically swap out the algorithm based on performance against
their tagged data.

We've written about SPS here:
[http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html](http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html)

------
z3t4
You could also have the client run some tests and auto-report problems, such
as the bad server's id.

You could also have a manual report button the user can click. And if you
want to get really advanced, have a system that learns from those manual
reports so that it can later warn when there is a high probability of a
"problem" occurring.

~~~
diab0lic
One of my colleagues designed and implemented a telemetry data tagger that
allows service owners to annotate the data reported into our primary telemetry
system, Atlas [0]. He has further worked on mechanisms for service-owner
feedback to automatically create training data sets for this implementation.

Great question, and I'm glad others are thinking along these lines -- we fully
believe that this is important for bridging the gap between service owners and
our analytics.

[0] [https://github.com/Netflix/atlas](https://github.com/Netflix/atlas)

------
TheLoneWolfling
I think the graphs would be better presented differently.

Instead of a line graph, try different opacities: one row per server, going
left to right, shading from white (no errors) to a saturated color (the
maximum number of errors overall). And perhaps a dot per server and a single
line indicating overall status.
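A rough sketch of that encoding with matplotlib (fabricated data):

```python
# Rough sketch: one row per server, time left-to-right, color intensity
# encoding error counts (white = none, darkest red = the overall maximum).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
errors = rng.poisson(2, size=(20, 120)).astype(float)  # 20 servers x 120 min
errors[7, 80:] += 15                                   # one misbehaving server

plt.imshow(errors, aspect="auto", cmap="Reds", interpolation="nearest")
plt.xlabel("time (minutes)")
plt.ylabel("server")
plt.colorbar(label="errors")
plt.show()
```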

------
wamatt
Regarding devops alert spam, on the one hand it's possible to setup rules and
filters and tweak notifications just right, but that is often an initial
hassle.

Possibly a startup opportunity to offer an easier option with error logs +
machine learning.

------
jebblue
In the audio world we would have run the data through an FFT, not sure if you
can do that with non-audio data.

~~~
hofstee
You absolutely can. The FFT is one cycle of a Discrete Fourier Transform, and
since this data is both discrete and sampled at regular intervals, it should
behave identically to sampling an audio signal.
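For instance, a small sketch treating a regularly sampled error-count series
exactly like an audio signal (fabricated data with a hidden hourly cycle):

```python
# Sketch: frequency content of a regularly sampled (1/min) error-count series.
import numpy as np

rng = np.random.default_rng(2)
minutes = np.arange(24 * 60)                       # one day, one sample/minute
hourly = 3 * np.sin(2 * np.pi * minutes / 60)      # hidden hourly oscillation
series = 10 + hourly + rng.normal(0, 1, minutes.size)

spectrum = np.abs(np.fft.rfft(series - series.mean()))  # drop the DC component
freqs = np.fft.rfftfreq(minutes.size, d=1.0)            # cycles per minute

peak = freqs[spectrum.argmax()]
print(f"dominant period: {1 / peak:.1f} minutes")       # ~60.0
```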

~~~
tanderson92
I am curious why you say "The FFT is one cycle of a DFT", since the FFT is
quite literally a DFT, only an algorithm in a better complexity class.

------
diab0lic
I should note that the team responsible for this (Insight Engineering) is
hiring a manager [0] and engineers [1]. If you're interested in solving this
kind of problem, come check out the job posts - I'm really excited to be
building out our team.

[0]
[https://jobs.netflix.com/jobs/2406/apply](https://jobs.netflix.com/jobs/2406/apply)

[1]
[https://jobs.netflix.com/jobs/2259/apply](https://jobs.netflix.com/jobs/2259/apply)

~~~
tayo42
How does someone get into doing work like this? It sounds like it would be
pretty cool. I started my first job working on infrastructure/ops for a
large-scale app, and it's been really interesting.

------
smaili
For those of you on computers, here's the link with better formatting -
[http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html](http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html)

~~~
dang
Ok, we changed to that from [http://techblog.netflix.com/2015/07/tracking-
down-villains-o...](http://techblog.netflix.com/2015/07/tracking-down-
villains-
outlier.html?utm_content=buffer26802&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer&m=1).

~~~
diab0lic
Thanks so much dang!

