
Practical and robust anomaly detection in time series - dsr12
https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
======
iandanforth
Another good package to look at is Etsy's Skyline -
[https://github.com/etsy/skyline](https://github.com/etsy/skyline)

Their introduction of the Kale stack (which includes Skyline) is a great read
- [https://codeascraft.com/2013/06/11/introducing-kale/](https://codeascraft.com/2013/06/11/introducing-kale/)

I spent a month or so evaluating anomaly detection systems, and I can tell you
a few things the Twitter post fails to mention:

1\. You can get a long way with an ensemble of simple techniques, and it's
always better than any single technique.

I wouldn't recommend trying to install Skyline, but re-implementing their
ensemble of anomaly classifiers might take you a day or two, and it will get
you 90% of the way there.
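For illustration, here's a minimal sketch of that kind of ensemble. This is
not Skyline's actual code; the three detectors and the two-vote consensus here
are stand-ins for its larger algorithm set (similar in spirit to Skyline's
CONSENSUS setting):

```python
import statistics

def three_sigma(series, point):
    """Flag points more than three standard deviations from the mean."""
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    return stdev > 0 and abs(point - mean) > 3 * stdev

def mad_outlier(series, point, threshold=6.0):
    """Median-absolute-deviation test: robust to outliers in the history."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    return mad > 0 and abs(point - med) / mad > threshold

def moving_average_break(series, point, window=10, sigmas=3.0):
    """Flag sharp deviations from the recent moving average."""
    tail = series[-window:]
    mean = statistics.mean(tail)
    stdev = statistics.pstdev(tail)
    return stdev > 0 and abs(point - mean) > sigmas * stdev

DETECTORS = (three_sigma, mad_outlier, moving_average_break)

def is_anomaly(series, point):
    """Majority vote over the ensemble of simple detectors."""
    return sum(f(series, point) for f in DETECTORS) >= 2

history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 10]
print(is_anomaly(history, 10))  # -> False: business as usual
print(is_anomaly(history, 50))  # -> True: all three detectors agree
```

Each detector is deliberately crude; the point is that requiring agreement
among crude detectors filters out a lot of false positives.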

2\. The false positive rate is probably your most important metric.

Detecting anomalies is good, but your ops team already has plenty of alerts to
deal with. If you throw false positives at them from a new system, they will
hate you and ignore it. Most papers report ROC curves as the classification
metric, which can be OK too.

3\. Don't build something complex when a threshold will do.

If a point anomaly is obvious to a human, you absolutely should not build a
complex system to detect it; just use thresholds. It's only when you want to
detect anomalies _before_ they cross a threshold that you should start on this
kind of task. That leads me to

4\. Almost all anomalies have a temporal component.

If your detector isn't ultimately looking at multiple sources of data and
finding patterns that initially look like normal behavior (or odd behavior
over time, like a change in frequency), then it's not adding as much value as
it could. Slow trends, increased predictability, absent spikes that are still
within threshold: _those_ are the kinds of anomalies your simple systems will
miss, and catching them will add a lot of value.
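As one toy example of such a temporal check (my own sketch, not from any
package mentioned here): flag an "absent spikes" condition by comparing recent
variability against history, which catches a normally spiky metric going
suspiciously quiet even though every individual point stays within thresholds:

```python
import statistics

def spikes_went_missing(series, window=20, ratio=0.25):
    """True when recent variability collapses relative to history."""
    if len(series) < 2 * window:
        return False
    historical = statistics.pstdev(series[:-window])
    recent = statistics.pstdev(series[-window:])
    return historical > 0 and recent / historical < ratio

healthy = [10, 30] * 20            # regular spikes: normal behavior
quiet = healthy[:-20] + [20] * 20  # last 20 samples flatline mid-range
print(spikes_went_missing(healthy))  # -> False
print(spikes_went_missing(quiet))    # -> True
```

Note that a plain threshold never fires on `quiet`: every point is a
perfectly ordinary value. Only the temporal pattern is wrong.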

Ultimately, anything that makes ops' life easier and alerts them sooner to
real problems is good. But in anomaly detection it is easy to fool yourself
into thinking you need something complex to start out with, and, once you've
built a complex thing, into thinking it is "working" because it finds 95% of
the obvious outliers.

~~~
Eridrus
Why would you not recommend installing skyline? I would like to have something
that watches metrics for me, and we're already using graphite for stats.

~~~
iandanforth
Getting the full package working was (several months ago) a painful process.
It was much easier to just grab the algos.

------
bglazer
One of my responsibilities is to do post-outage reporting to estimate how much
money we lost. Right now I use Holt-Winters to give a forecast starting just
before the outage began. I then estimate the loss to be the difference between
the forecast and the data points that fall outside of the forecast's
confidence interval.
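Roughly, the procedure looks like this. The sketch below uses a hand-rolled
additive Holt-Winters and made-up hourly order counts; a real analysis would
use a proper library implementation (e.g. statsmodels' ExponentialSmoothing)
and keep the confidence-interval filter, which is omitted here:

```python
import math

def holt_winters_forecast(series, m, horizon, alpha=0.3, beta=0.05, gamma=0.2):
    """Minimal additive Holt-Winters: fit on `series` with seasonal
    period `m`, then forecast `horizon` steps ahead."""
    # Crude initialization from the first two seasons.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [series[i] - level for i in range(m)]
    for i, y in enumerate(series):
        s = seasonal[i % m]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[i % m] = gamma * (y - level) + (1 - gamma) * s
    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % m]
            for h in range(horizon)]

# Two weeks of made-up hourly orders with a clean daily cycle.
history = [100 + 30 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]
forecast = holt_winters_forecast(history, m=24, horizon=6)

# A 6-hour outage during which we captured only half the expected traffic.
observed = [0.5 * (100 + 30 * math.sin(2 * math.pi * h / 24)) for h in range(6)]

# Estimated loss: shortfall of observed vs. forecast over the outage window.
loss = sum(max(0.0, f - o) for f, o in zip(forecast, observed))
print(round(loss, 1))  # ~half the forecast volume over those 6 hours
```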

Is this a statistically valid method? I chose Holt-Winters because I'm
inexperienced and it's appealingly intuitive. Should I be looking at anomaly
detection methods instead? Would they be able to tell me what "normal" would
have been for the duration of the anomaly?

~~~
slipperyp
I don't know if it's statistically valid, but I've used similar methods to do
the kind of calculations you're talking about, in a role similar to the one
you described.

There are some more subtleties that may be important for you to take into
account - primarily:

1) You need to sample fairly extensively before/after the outage to calibrate
more accurately against Holt-Winters (the Holt-Winters seasonal projection
should accurately project the trend, but the actual numbers are probably
running at some slight or significant rate above/below the projections).

2) When taking those samples, it's important to use data points you believe
are definitely not impacted by the outage. This is often quite challenging,
since outages can span low/peak traffic periods or ramp-up/down periods.

3) Finally, it can be hard to pinpoint the actual start/end of the event
(i.e., to identify the time samples you want to include in your measurement of
the outage cost). The end is especially tricky, since there's often pressure
from queued operations (by software, or by users who are itching to complete
what they were trying to do) that can make your samples fluctuate. That
backfill pressure can be substantial, and it's important not to ignore it in
your measurement of the actual cost of the issue. Say you're a retail site:
you have a 15-minute period where orders drop 50%, but in the first 5 minutes
after service is restored the order rate runs 50% above projections. Do you
count that as 15 minutes of 50% order drop, or 10 minutes of 50% order drop?
Both are legitimate, but it's important to know which metric you're measuring
yourself against so you're as correct/honest as you can be.

~~~
bglazer
Thanks for the reply!

1\. This is a good point. I haven't incorporated sampling after the outage
into the analysis, but that should be a good qualitative measure of the
accuracy of the forecast.

2\. I typically have good data from server logs of when an outage started.
Outages during low volume periods are quite difficult to analyze though. I
usually revert to just comparing the outage volume to the average volume for
the whole outage period.

3\. The end is typically more difficult to determine, as there's often a
period of instability while servers are restarted sporadically, followed by a
"recovery" caused by the backfill pressure you mentioned. My solution is to
count any samples _above_ the forecast's confidence interval as "recovery" and
to subtract the total recovery from the loss estimate.

------
falcolas
An interesting complement to Etsy's Skyline project [1], which was designed to
read from the incoming Graphite data streams. The two do seem like they could
work well together.

[1]
[https://github.com/etsy/skyline/wiki](https://github.com/etsy/skyline/wiki)

------
tjradcliffe
Does anyone have a clue what "ESD" stands for in this context? The article is
too buzzword-heavy to be very meaningful, even to a practitioner. (This is
surprisingly common in data analysis, where there seems to be a culture of
naming things in the most opaque way possible.)

~~~
jamessb
The abstract in reference 3 says it is the "extreme studentized deviate".

~~~
tjradcliffe
Thanks. I searched all over the place but was guessing the "D" stood for
"disaggregation".

In general, things that depend on (approximate) normality, which ESD
apparently does (unsurprisingly, given the name), don't count as "robust" in
my personal lexicon. There is relatively little reason in this day and age to
continue using parametric tests of any kind, particularly on data that are
almost certain to contain significant non-normal components.

~~~
bqe
Do you have any suggestions for a more robust method for anomaly detection in
seasonal time series data?

~~~
tjradcliffe
Every case is different (as the article says, generalizing these things is not
easy), but the basic approach always involves three things:

1) Build a model of the expected time series. There is no way of avoiding
this: to find an anomaly you _must_ define "that which is expected", whether
in terms of the actual data, differences, or moments.

2) Measure the distribution around the expected values based on past data.

3) Apply some test that answers some version of the question, "How plausible
is the belief that the new data are drawn from the combination of model plus
distribution?" The trick here is that you aren't interested in _all_
anomalies, just "significant" ones, which may have different temporal
behaviour, etc.

The important thing is to test relative to the distribution you actually have,
which is never going to be particularly normal, especially in the wings, and
the wings are exactly where you need accuracy when trying to be maximally
sensitive to real anomalies. Normal distributions almost always underestimate
the tails, which makes them prone to false triggers, which, as another poster
here has pointed out, you really do not want.

Robustness against "unknown unknowns" in the anomaly distribution is one thing
you want to be particularly careful about. Weird things happen all the time in
real data, and you are generally looking for anomalies of a particular kind,
with particular characteristics. The ideal anomaly detector will catch those
without going off on every odd thing that happens.

------
graycat
Observation (1).

The data they are looking at is essentially a univariate stochastic point
process, that is, an arrival process. The most important special case is the
Poisson process. There the times between arrivals are independent, identically
distributed random variables, exponentially distributed with the arrival rate
as parameter. The number of arrivals in an interval is random with a Poisson
distribution (compare with the terms of the Taylor series for exp(x)).

See early in

E. Cinlar, 'Introduction to Stochastic Processes'.

There, for the Poisson process, there is a 'qualitative, axiomatic' definition
-- an arrival process with stationary, independent increments. A cute
derivation from just this qualitative description yields the details of the
Poisson process.

One point about this qualitative approach is that, in practice, the
assumptions are commonly obvious just from intuition.

Another solid approach to a Poisson process is the renewal theorem; there is a
careful treatment in W. Feller's now classic volume II.

The theorem says that under mild assumptions a sum of independent renewal
processes converges to a Poisson process. Arrivals at Twitter look like a
nearly perfect example.

So, basically without any anomalies the Twitter data is a Poisson process.
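That superposition claim is easy to sanity-check numerically (hypothetical
numbers throughout). Superpose many independent low-rate renewal processes
with decidedly non-exponential gaps, then verify the Poisson signature that
per-interval counts have variance roughly equal to their mean:

```python
import random

random.seed(1)

def renewal_arrivals(horizon, mean_gap):
    """One source's event times: a renewal process with uniform gaps
    (deliberately non-exponential)."""
    t, times = 0.0, []
    while True:
        t += random.uniform(0, 2 * mean_gap)
        if t >= horizon:
            return times
        times.append(t)

# 500 independent slow sources, each averaging one event per 100 time units.
horizon = 1000
counts = [0] * horizon
for _ in range(500):
    for t in renewal_arrivals(horizon, mean_gap=100.0):
        counts[int(t)] += 1

# For a Poisson process, the variance/mean ratio of the counts is ~1.
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(round(mean, 2), round(var, 2))
```

No single source here is remotely Poisson, yet the superposed counts pass the
variance-to-mean check, which is the renewal theorem at work.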

Observation (2).

In part, the anomaly detector is based on the extreme Studentized deviate
(ESD) statistical hypothesis test as in

[http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3...](http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm)

but this test has a Gaussian assumption, not a Poisson assumption. So, there
should be some mention of justifying using a Gaussian assumption.

Point (1).

The work in the OP, that is, an anomaly detector, is basically, nearly
inescapably and necessarily, a statistical hypothesis test and thus faces the
usual issues of false alarm rate (the significance level of the test, i.e.,
the conditional probability of raising an alarm given no anomaly), detection
rate, and the classic Neyman-Pearson best possible result.

In particular, it is important and usually standard to have a means to adjust,
control, and know the false alarm rate, but in the OP I saw no mention of
false alarm rate, power of the test, etc.

On a server farm bridge or in a network operations center (NOC) with near real
time anomaly detection, false alarm rate too high is a serious concern. With
realistic detectors, false alarm rate too low means detection rate too low and
is also a concern.

More.

There are some tests that are both multi-dimensional and distribution-free,
with false alarm rate known exactly in advance and adjustable in small steps
over a wide range. Such tests might be good for monitoring for 'zero day'
problems, that is, ones never seen before, in serious server farms and
networks.

~~~
joeyo

> but this test has a Gaussian assumption, not a Poisson assumption.

In practice the Gaussian assumption might not be too badly abused (at least
for Twitter) since the Poisson distribution approaches the Gaussian when
lambda is large.
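A quick numerical check, with a rate picked purely for illustration: at lambda
= 10000, the exact Poisson upper-tail probability at a "3-sigma" excursion is
within a few percent of the continuity-corrected Gaussian value:

```python
import math

def poisson_sf(k, lam):
    """Exact P(X > k) for X ~ Poisson(lam), summing the upper tail
    with log-space pmf terms for numerical stability."""
    total, i = 0.0, k + 1
    while True:
        term = math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        total += term
        i += 1
        if i > lam and term < 1e-16:
            return total

def normal_sf(x, mu, sigma):
    """P(X > x) for a Gaussian, via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

lam = 10000.0
k = 10300  # lam + 3 * sqrt(lam): a "3-sigma" excursion
exact = poisson_sf(k, lam)
approx = normal_sf(k + 0.5, lam, math.sqrt(lam))  # continuity-corrected
print(exact, approx)  # both come out around 1.3e-3
```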

