
Statistical Anomaly Detection - anand-s
http://www.ebaytechblog.com/2015/08/19/statistical-anomaly-detection/
======
graycat
Been there, done that, found much better ideas, tested them out, and much
more.

I started in such things via artificial intelligence at IBM and gave a paper
on the work at an AAAI IAAI conference at Stanford.

Later I took a different approach:

Sure, we want:

(1) To exploit data on several variables jointly, easily seen as much more
powerful than processing the variables one at a time. Processing the variables
one at a time usually means that the geometrical region where we do not raise
an alarm has to be a box, just a box, with edges parallel to the coordinate
axes (see the toy sketch below this list). Bummer.

My work does great even if the region where we raise an alarm is a fractal,
e.g., the Mandelbrot set or its complement, whether in two dimensions or 20 or
more.

(2) To be able to select and set the false alarm rate and get that rate exactly.

(3) Know the 'severity' of a detection, say, the lowest false alarm rate for
which we would still have a detection.

(4) Do well on detection rate -- clearly we can't do as well as the best
possible Neyman-Pearson result, but there is still a powerful sense in which
we can be the best.

For the above, I worked out how and wrote the code.
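To make point (1) concrete, here's a toy sketch, in Python, of the contrast --
per-variable thresholds versus a joint rule. This is just an illustration, not
my method, and the data and thresholds are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(10000, 2))      # toy 2-D "healthy" history
    x = np.array([2.0, -2.0])                  # new observation

    # Per-variable thresholds: the no-alarm region is forced to be a
    # box with edges parallel to the coordinate axes.
    lo, hi = np.percentile(healthy, [0.5, 99.5], axis=0)
    alarm_box = bool(np.any((x < lo) | (x > hi)))

    # A joint rule can carve out a region of any shape, e.g., alarm
    # when the new point sits far from its nearest healthy neighbors.
    dists = np.sqrt(((healthy - x) ** 2).sum(axis=1))
    alarm_joint = np.sort(dists)[9] > 0.5      # distance to 10th neighbor

The box rule can never alarm on a point that is ordinary in each coordinate
separately but absurd jointly; the joint rule can.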

I tested the technology and code on some data from a cluster of servers at a
large, famous company. My work performed just as predicted.

~~~
jensgk
That sounds very interesting. Could you provide any links to papers or methods
used?

~~~
graycat
Sure, but we should take that _off-line_.

What is just crucial, really what it's all about, where the value necessarily,
inescapably is, is in the two rates: the false alarm rate and the detection
rate. If we can't say at least good things about the false alarm rate, then
all we have is a _definite maybe_.

The OP struggled with the false alarm rate -- my work solved that problem.

Really, essentially inescapably and necessarily, the core of the work has to
be a statistical _hypothesis test_.

So, what is that? We make an assumption, typically, that the system we are
monitoring is _healthy_. Commonly that assumption is called the _null
hypothesis_, that is, _null_ in the sense of _nothing wrong_.

That assumption leads to some mathematical assumptions that can be used as
hypotheses in some theorems with proofs.

Then, that assumption with some real data lets us calculate the probability
of getting data _like_ we just observed.

If that probability is relatively small, then either (A) the system is healthy
and we just observed something quite rare, or (B) the system is sick. If the
probability is really small and/or the health of the system is really
important, then we judge that case (A) is too rare to be believed, we reject
the assumption, the null hypothesis, and conclude (B): the system is sick.

That's a short description of _statistical hypothesis testing_, for 100+ years
a main pillar of pure and applied statistics and of a lot of science, both
social and physical.
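In code, the skeleton of such a test is tiny. Here's a minimal,
distribution-free sketch in Python -- just an illustration, not my method; all
the real work hides in the choice of test statistic and the theorems about it:

    import numpy as np

    def p_value(baseline, observed):
        # Empirical one-sided p-value: the probability, if the system
        # is healthy, of a value at least as extreme as the one observed.
        baseline = np.asarray(baseline)
        return (np.sum(baseline >= observed) + 1) / (len(baseline) + 1)

    alpha = 0.01                             # selected false alarm rate
    rng = np.random.default_rng(1)
    baseline = rng.exponential(size=999)     # scores from healthy history
    observed = 9.0                           # latest score

    if p_value(baseline, observed) <= alpha:
        print("reject the null hypothesis -- the system looks sick")

Because only the rank of the new score among the healthy scores matters, the
false alarm rate is exactly alpha (up to the granularity 1/(n+1)) no matter
what the healthy distribution is -- which is the sense in which a false alarm
rate can be selected and achieved exactly, as in claim (2) above.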

But for monitoring computer server farms, this context very much needs the
test to be both multidimensional and distribution-free. The world-wide
collection of those was somewhere between tiny and zero -- likely and
apparently zero. So, I invented a nice, large collection of them.
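Again, I won't post my construction, but for a concrete picture of what
'multidimensional and distribution-free' can look like, here is a sketch of
one standard recipe -- calibrate a nearest-neighbor score on held-out healthy
data, then use its rank as the p-value. Not my method, just an illustration:

    import numpy as np

    def knn_score(point, reference, k=10):
        # Distance from `point` to its k-th nearest neighbor in
        # `reference`; large means the point sits in an unusual region.
        d = np.sqrt(((reference - point) ** 2).sum(axis=1))
        return np.sort(d)[k - 1]

    rng = np.random.default_rng(2)
    healthy = rng.normal(size=(2000, 20))    # 20-dimensional healthy data
    train, calib = healthy[:1000], healthy[1000:]

    # Score held-out healthy data to calibrate what "normal" looks like.
    cal_scores = np.array([knn_score(p, train) for p in calib])

    def is_sick(x, alpha=0.01):
        s = knn_score(x, train)
        p = (np.sum(cal_scores >= s) + 1) / (len(cal_scores) + 1)
        return p <= alpha

Since only ranks enter, the false alarm rate holds for any healthy
distribution, and the no-alarm region follows the data instead of being a box.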

The core of the work was really some novel derivations, right, with theorems
and proofs, in, call it, _applied probability_.

Then I worked out some algorithms to make the computing fast.

The work is likely and apparently the best approach to _zero day_ monitoring
for _health and wellness_ in high-end server farms and networks. And there
may be some other applications.

_Zero day_ is the case where we are looking for problems never seen before.
So, it's sometimes called _behavioral_ monitoring.

Sure, we should monitor for all the problems we have seen before, and there we
should be able to do relatively well on the two rates. But, after those old
problems, we still need to look for new ones -- sorry 'bout that.

I did this work a long time ago and presented it to, or put it in front of, a
lot of people, but no one was much interested.

From time to time, people have, as in the OP, rediscovered the problem and
done this and that with it. E.g., there was a project of profs Fox and
Patterson at Stanford and Berkeley, funded by Google, Microsoft, and Sun, that
in part considered monitoring -- looking at their work, I thought that mine
was better.

Sure, there was

[http://www.sans.org/resources/idfaq/behavior_based.php](http://www.sans.org/resources/idfaq/behavior_based.php)

where it was claimed that behavioral monitoring had high false alarm rates.
Nope -- mine doesn't! Instead, with my work, you get to select the false alarm
rate over a wide range, including quite small.

I have another project, and for it I recently completed writing the version
1.0 software -- 18,000 programming language statements in 80,000 lines of text
(with comments, blank lines, etc.). I want to go live ASAP. This project is
much easier for people to like! E.g., I don't have to get people interested in
rates of false alarms and detections. If the project grows, then I will
implement and deploy my anomaly detection ideas for my server farm.

Otherwise, for now, I'm no longer interested in that anomaly work, mostly
because no one else is. Others are welcome, from time to time, to rediscover
the problem, struggle with the false alarm rate, see no good, general results
on detection rate, and think that the key is artificial intelligence or
machine learning or some such (I used neither). Uh, it's statistical
hypothesis testing, guys. Sorry 'bout that.

For more, see, say, some of the classics:

E. L. Lehmann, _Testing Statistical Hypotheses_.

E. L. Lehmann, _Nonparametrics: Statistical Methods Based on Ranks_.

Sidney Siegel, _Nonparametric Statistics for the Behavioral Sciences_.

Once I presented my work to Cisco. Their reaction was that my work was "just
statistical". Well, what the heck do you expect? Want something deterministic,
0 or 1, for healthy or sick? Do you realize what the heck you are asking for?

Another time I presented the work and the reaction was, "What we really want
is diagnosis." Gads. You are Sony -- or some other big organization, take your
pick from various headlines from current to going way back -- and your server
farm is under attack with gigabytes or terabytes on the way out the door to
hackers, your whole company is at risk of going belly up, and you don't want
to know that there's a serious problem until you can also be told just what
the heck to do about it? Sure that's your view?

Gee, you don't want to know that your car's about out of gas unless you are
also told that you can get more gas right away? Ever drive in rural Montana?
Like long walks? What about engine oil pressure or coolant temperature --
don't want to know about those, either? Or your blood pressure? Not until you
can be told just what to do about it? Sure about that?

It's an important problem. I did the work. It's good work. But if people are
not much interested, then so be it -- I'm not into pushing big rocks up long
hills for no good reason.

Suspicion: A guy has a sore on his foot. He knows it's there. It hurts. But he
lives with it. He's tried this and that, but he still has the sore. Someone
has a good cure, but he is so used to putting up with the sore that he ignores
the good solution. Besides, he didn't see a way to cure the sore so doesn't
believe that anyone else could either.

Even worse, my solution is not really _computer science_ but is, really, some
applied math, actually _applied probability_, and such work, especially new
such work, is rare in practice.

More generally, good, new stuff is rare, so rare that people are very
reluctant to consider that they should pay any attention to it.

Fine: Let them live with a sore foot.

------
huac
For those following along - this method attempts to do anomaly detection in 3
dimensions: time, 'metric' and query. I interpret query to be something like
'polo shirts' or 'PEZ dispensers,' and metric to be something like 'sales
volume' or 'median sale price.'

The 'aggregation' method just takes a preset number of changes (10
'percentile' or 'quantile' - the language is inconsistent) in a metric-query
pair for a given time. After that, I get really confused (if you fix T, how
can you have a timeseries in T?) but the final output is essentially the top
90th percentile of metrics, plotted over time (and independent of query). It's
difficult to see how this method is groundbreaking or novel.

At any rate - the final step in this post is to search for anomalies _by
metric_. So if you have more than _k_ anomalous metrics for a given time _t_,
you classify that as a disruption. I think this is calculated by median
absolute deviation, which is probably fine.
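For anyone who wants to play with that step, here's a minimal Python sketch of
my reading of it -- a MAD-based flag per metric, then 'disruption' when more
than k metrics are flagged at once. Illustrative only, not eBay's code, and
the thresholds are made up:

    import numpy as np

    def mad_anomalies(series, threshold=3.5):
        # Flag points whose robust z-score, computed via the median
        # absolute deviation (MAD), exceeds the threshold.
        series = np.asarray(series, dtype=float)
        med = np.median(series)
        mad = np.median(np.abs(series - med))
        z = 0.6745 * (series - med) / mad      # 0.6745: normal consistency
        return np.abs(z) > threshold

    rng = np.random.default_rng(3)
    metrics = rng.normal(size=(8, 100))        # 8 metrics over 100 times
    metrics[:5, 60] += 10.0                    # inject a disruption at t=60

    flags = np.array([mad_anomalies(m) for m in metrics])
    k = 3
    print(np.where(flags.sum(axis=0) > k)[0])  # times flagged -> [60]

Note the equal weight on every metric and the arbitrary k -- exactly the
things that aren't pinned down in the paper.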

What isn't clear in the paper or in the post is how many (or which!) anomalies
in the metrics are sufficient to qualify as a disruption. It's probably not a
great idea to assume equal weight on the features, but I don't know if that's
what's going on here. The associated paper also claims that 70% of the
'alerts' were on-target - but makes no mention of how many disruptions went
uncaught. You're always going to have a pretty good hit rate, but you're not
always going to catch a lot!

------
rodionos

      > median price of the returned items
    

I wonder what action could be taken in case of a median price anomaly. Would
this be an indicator of price gouging or auction flooding? Do queries weight
price by time-to-auction? I would expect price variance to be non-stationary.
If it's a true positive, how do you fix it?

Other notes:

(1) It's hard to determine whether the proposed algorithm was successful
without a feedback system which would tag true and false positives postmortem.

(2) I wish the raw data used for this project and similar studies were
available, e.g.
[http://webscope.sandbox.yahoo.com/](http://webscope.sandbox.yahoo.com/).
Yahoo datasets are permission-only and are not available unless you're a
researcher at a .edu.

~~~
huac
It doesn't look like the intent of this system is to detect price gouging,
etc. but rather to detect service outages/disruptions. Agree that high
variance makes 'jump-based' methods such as this fail a lot. The associated
paper claims a 30% false positive rate. But in the case of a true positive, if
you know which metric is failing, then that's a very good jumping-off point
for responding to the disruption.

Having benchmark datasets could mean more semi-supervised approaches, compared
to unsupervised approaches (such as this one). But what would the end result
be? Everyone has different metrics, so unless you merely use the datasets as
raw numbers to feed into an algorithm, you're unlikely to get meaningful
results.

------
techwizrd
As a Data Science intern at eBay, I'm really excited to see eBay on the front
page of HN. We have some other interesting anomaly detection algorithms, but I
guess they're saving those for another post.

------
fatpixel
Pardon me for being blunt, but this is really fucking cool.

