

Using Go for Anomaly Detection - bketelsen
http://blog.gopheracademy.com/birthday-bash-2014/using-go-for-anomaly-detection/

======
graycat
Anomaly detection in large server farms and networks -- important problem. OP
-- nice description of the problem!

But, there's more!

(1) Zero Day Problems.

If we see a problem at all often, or maybe even just once, hopefully we make
changes to protect against that problem so that the chances of seeing that
problem again fall to nearly nothing.

So, after we make such changes, net, the problems we really want to detect are
the ones we've never seen before. That is, we want to detect problems when
they are seen for the first time, on _day zero_, that is, _zero-day_ problems.

So, in this situation, we have not yet seen any of the problems we are trying
to detect, that is, we are trying to detect problems we have never seen
before and, thus, have no data on those problems.

And, for a problem we've seen before, even if we don't make changes to protect
against it, a good detector for just that problem is usually a comparatively
easy challenge, and one that we should meet.

(2) Good Data.

Call a system from which we are collecting data our _target_ system. So, if
our target system is reasonably _stable_, then maybe we can collect data from
that system for some hours, weeks, maybe months. If we let the data _age_ and
still see no symptoms of problems, then we regard this data, this _history_
data, as from a _healthy_ system, that is, _good_ data.

(3) Hypothesis Test.

So, continually in near real time, we collect data and do some calculation to
raise an alarm or not.

This work needs to be essentially a statistical _hypothesis test_ performed
continually as we receive data.

In such a test, we tentatively assume that the target system is healthy. This
assumption is our _null hypothesis_, that is, an assumption of a healthy
system, nothing wrong, a _null_ bad effect.

Then we use the assumption of this null hypothesis, the _good_ data, and the
data we just received (observed) to calculate a number we call a _test
statistic_, and then calculate the probability of getting a test statistic
that far from what it would have been for a healthy system. If that
probability is below some small value, usually called alpha, that we selected
in advance, then we reject the null hypothesis that the system was healthy,
conclude that the system is sick, and raise an alarm.

So, alpha is the probability of getting such a bad value for the test
statistic when the system is healthy. So, for a healthy system, alpha is the
probability of raising a false alarm. Alpha is also commonly called the
probability of Type I error.

Then a missed detection of a real problem is Type II error, and commonly its
probability is called beta. That is, beta is the probability of saying that
the system is healthy when it is sick.

The _detection rate_ , the probability of saying the system is sick when it
is, is one minus beta.

Commonly we call alpha the _rate_ of false alarms and beta the _rate_ of
missed detections.
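
To make that concrete, here is a minimal sketch in Go of such a continual
test. The names, data, and alpha are hypothetical; it assumes a single scalar
metric, say a response time, where only unusually large values suggest a sick
system, and it uses only the good history data:

    // A minimal sketch of the continual test: for each new observation,
    // estimate the probability of seeing a value that extreme from a
    // healthy system, using only the good history data, and raise an
    // alarm when that probability falls below alpha.
    package main

    import "fmt"

    // probAtLeast returns the fraction of good (healthy) history values
    // at least as extreme as x; the +1s keep the estimate above zero.
    func probAtLeast(good []float64, x float64) float64 {
        n := 0
        for _, g := range good {
            if g >= x {
                n++
            }
        }
        return float64(n+1) / float64(len(good)+1)
    }

    func main() {
        // History data collected while the target system showed no symptoms.
        good := []float64{12, 15, 11, 14, 13, 16, 12, 15, 14, 13}

        // With only 10 good points, the smallest probability we can report
        // is 1/11, so alpha cannot usefully be much below 0.1 here.
        alpha := 0.10

        for _, x := range []float64{14, 15, 42} { // newly observed values
            p := probAtLeast(good, x)
            if p < alpha {
                fmt.Printf("x=%v p=%.3f ALARM: reject the null hypothesis\n", x, p)
            } else {
                fmt.Printf("x=%v p=%.3f looks healthy\n", x, p)
            }
        }
    }

Notice that with only 10 points of good history the test cannot usefully have
alpha much below 0.1 -- more good data buys a lower achievable false alarm
rate, which is part of why good data matters in (2).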

(4) Detector Quality.

There can be many hypothesis tests. A _perfect_ test is one with both alpha
and beta zero; usually, on the shelf of reality, the box of perfect tests is
empty.

It's easy to have alpha, the rate of false alarms, be zero -- just turn off
the detector. But then beta, the rate of missed detections, will be 1.

It's easy to have beta be 0 -- just sound the alarm all the time. But then
alpha will be 1.

Generally, for a given detector that is not perfect, there is a trade-off --
the lower we insist that alpha be, the higher beta will be.

But not all detectors are the same: Some detectors are better than others,
that is, _closer_ to being a perfect detector, that is, for a given alpha they
give a smaller beta, that is, a better trade-off. And detectors, even ones
with the same alpha and beta, can differ on which real problems they detect.
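
A tiny sketch of that trade-off, in Go, with made-up numbers and, purely for
illustration, some "sick" observations of the kind we would not really have
for zero-day problems: as the alarm threshold rises, alpha falls while beta
rises.

    package main

    import "fmt"

    // fractionAbove returns the fraction of xs strictly greater than t.
    func fractionAbove(xs []float64, t float64) float64 {
        n := 0
        for _, x := range xs {
            if x > t {
                n++
            }
        }
        return float64(n) / float64(len(xs))
    }

    func main() {
        // Made-up samples purely for illustration.
        healthy := []float64{10, 11, 12, 12, 13, 13, 14, 15, 16, 18}
        sick := []float64{14, 16, 17, 19, 20, 22, 23, 25, 28, 30}

        for _, t := range []float64{12, 15, 18, 21, 24} {
            alpha := fractionAbove(healthy, t) // false alarm rate
            beta := 1 - fractionAbove(sick, t) // missed detection rate
            fmt.Printf("threshold %4.0f  alpha %.2f  beta %.2f\n", t, alpha, beta)
        }
    }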

(5) Best Detector.

The question of what would be the best possible detector was answered by the
Neyman-Pearson result. So, for a given alpha, the best possible detector gets
the lowest possible beta. A relatively general proof can be obtained from
measure theory and, there, the Hahn decomposition from the Radon-Nikodym
theorem.

Alas, usually in practice, the Neyman-Pearson result asks for more data than
we can have; in particular, when looking for zero-day problems we can't hope
to use Neyman-Pearson.

A high _quality_ detector is one with a relatively low beta for its alpha. In
practice a high quality detector saves the money spent chasing false alarms
and avoids the possibly serious consequences of missed detections. Of course,
the Neyman-Pearson result tells us how to create the highest quality detector
possible.
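
For what the Neyman-Pearson detector looks like when we do have the needed
data, here is a hedged sketch in Go. It assumes, just for illustration, that
we know both the healthy density f0 and the sick density f1 (two Gaussians
with made-up parameters); the detector raises an alarm exactly when the
likelihood ratio f1(x)/f0(x) exceeds a constant k, and in real use k would be
chosen so the alarm probability under f0 equals the selected alpha.

    package main

    import (
        "fmt"
        "math"
    )

    // gaussian returns the density of a normal distribution at x.
    func gaussian(x, mean, sd float64) float64 {
        z := (x - mean) / sd
        return math.Exp(-0.5*z*z) / (sd * math.Sqrt(2*math.Pi))
    }

    func main() {
        f0 := func(x float64) float64 { return gaussian(x, 10, 2) } // healthy
        f1 := func(x float64) float64 { return gaussian(x, 18, 4) } // sick

        // Neyman-Pearson form: alarm iff f1(x)/f0(x) > k. Here k is picked
        // by hand; properly, k is set so the alarm probability under f0
        // equals the alpha we selected.
        k := 1.0

        for _, x := range []float64{9, 12, 15, 20} {
            ratio := f1(x) / f0(x)
            if ratio > k {
                fmt.Printf("x=%4.1f likelihood ratio=%8.2f ALARM\n", x, ratio)
            } else {
                fmt.Printf("x=%4.1f likelihood ratio=%8.2f healthy\n", x, ratio)
            }
        }
    }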

(6) Adjusting Alpha.

Commonly in practice, we can select a value for alpha in advance and have our
detector achieve that value. But typically we have to get the corresponding
value of beta by empirical means, and since, when looking for zero-day
problems in a well run server farm or network, we stand to get relatively few
detections, we can have trouble getting an accurate estimate of beta.
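
A sketch in Go of selecting alpha in advance without any distribution
assumptions: put the alarm threshold at roughly the (1 - alpha) empirical
quantile of the good history data, so that about a fraction alpha of future
healthy observations should cross it. The data and function names are
hypothetical; beta, as noted, still has to be estimated empirically.

    package main

    import (
        "fmt"
        "sort"
    )

    // thresholdForAlpha puts the alarm threshold at roughly the (1 - alpha)
    // empirical quantile of the good history data.
    func thresholdForAlpha(good []float64, alpha float64) float64 {
        xs := append([]float64(nil), good...) // copy; don't reorder the caller's data
        sort.Float64s(xs)
        i := int(float64(len(xs)) * (1 - alpha))
        if i >= len(xs) {
            i = len(xs) - 1
        }
        return xs[i]
    }

    func main() {
        good := []float64{12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
            12, 17, 13, 14, 15, 12, 16, 13, 14, 15}
        t := thresholdForAlpha(good, 0.05)
        fmt.Printf("alarm threshold for alpha = 0.05: %v\n", t)
        // Beta, in contrast, we can only estimate from real or injected
        // problems, and with few detections that estimate will be rough.
    }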

(7) Data Distribution.

It can help us create a higher quality detector if we know the probability
distribution of the data we observe when the system is healthy. As in the OP,
maybe that distribution could be Gaussian, although after much time with real
data from networks and server farms, we expect to see Gaussian data only
rarely.

If we make no assumptions about the probability distribution of the data from
a healthy target, then our statistical hypothesis test is _distribution-free_.
For data from real server farms and networks, we are usually forced to use
distribution-free tests. A special case of _distribution-free_ is
_non-parametric_.

(8) Dimensions.

It is common in practice, from one target system, to be able to collect data
on each of several, say, n, variables at data rates from a point every few
seconds up to some hundreds of points a second.

Typically data from one variable is not independent of that from the other
variables.

So, with several variables, there is a multi-dimensional, n-dimensional,
region, the _critical_ region, such that we raise an alarm if and only if we
get data in that region.

For a high quality detector, that region should accurately fit where we want
to raise an alarm -- the Neyman-Pearson result, when we have data enough to
use it, can specify just what that region is.

If our detector is based on just _thresholds_ on the separate variables, then
we are forced to have our critical region be just some n-dimensional box, and
such a box gives us relatively little ability to get an accurate _fit_. With a
poor fit, for our selected alpha, we stand to get a relatively high beta and,
thus, lower detector quality.

Of course, for our n variables, the best detector that does good work with
all n variables jointly will be the best detector we can have. That is,
whatever can be done with the variables separately can also be done, along
with more, in an n-dimensional detector.

For an intuitive explanation, suppose n = 2 and we get data points on a
checkerboard. Suppose a point on a red square indicates a healthy target and a
point on a black square a sick one. If we consider the n = 2 variables
separately, then we will have a low quality detector, but if we consider the
n = 2 variables together, then we can have a perfect detector.
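
Here is a small Go sketch of that checkerboard example; the setup (unit
squares, black squares sick) is hypothetical but makes the point:

    package main

    import (
        "fmt"
        "math"
    )

    // sick reports whether the point (x, y) lands on a "black" square.
    func sick(x, y float64) bool {
        return (int(math.Floor(x))+int(math.Floor(y)))%2 == 1
    }

    func main() {
        points := []struct{ x, y float64 }{
            {0.5, 0.5}, // red square: healthy
            {0.5, 1.5}, // black square: sick
            {1.5, 0.5}, // black square: sick
            {1.5, 1.5}, // red square: healthy
        }
        for _, p := range points {
            fmt.Printf("(%.1f, %.1f) sick=%v\n", p.x, p.y, sick(p.x, p.y))
        }
        // Looked at separately, x alone and y alone each take the same
        // values on healthy and sick points above, so no box of separate
        // thresholds can split them; the joint rule splits them perfectly.
    }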

With n-dimensional data, usually we have to give up on knowing the probability
distribution of the data.

(9) Old Techniques.

Thresholds were long the workhorse of server monitoring. Later, _expert
systems_ tried to use _rules_ to determine when to raise an alarm, say,

    
    
         When I see A, B, and one of
         X, Y, or Z, it looks bad;
         raise an alarm.
    

Here we had no idea of detector quality or false alarm rate and no ability to
adjust the false alarm rate. In effect the work was still statistical
hypothesis testing, except it was being done poorly.
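
One way to see that last point: a hand-written rule is still a detector, so
at the least we can estimate its alpha by replaying good history data through
it. A hypothetical sketch in Go:

    package main

    import "fmt"

    type sample struct {
        a, b, x, y, z bool // whatever conditions A, B, X, Y, Z stand for
    }

    // rule encodes "when I see A, B, and one of X, Y, or Z, raise an alarm".
    func rule(s sample) bool {
        return s.a && s.b && (s.x || s.y || s.z)
    }

    func main() {
        // History samples recorded while the system was known to be healthy.
        good := []sample{
            {true, false, true, false, false},
            {true, true, false, false, false},
            {false, true, true, true, false},
            {true, true, true, false, false}, // the rule fires here: a false alarm
        }
        alarms := 0
        for _, s := range good {
            if rule(s) {
                alarms++
            }
        }
        fmt.Printf("estimated alpha for the rule: %.2f\n",
            float64(alarms)/float64(len(good)))
    }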

(10) Summary.

The OP was correct that false alarms are bad but missed detections can be
worse. So, we want high quality detectors, and for a given detector, to get
the lowest rate of missed detections we can, that is, the highest detection
rate, we set the false alarm rate at the highest value the operating staff is
willing to tolerate. Of course, in practice, if the false alarm rate is too
high, the staff may just ignore the detector and its alarms, thus giving a
zero detection rate.

So, what we want is a collection of statistical hypothesis tests that are both
n-dimensional and distribution-free where we can select and know the false
alarm rates and otherwise have good evidence of high quality detectors.

All of this discussion is now quite old material. My conclusion for some years
has been that people with large server farms and networks doing important work
really should be interested but that nearly no one is.

Apparently the OP has re-discovered this subject. Here I've tried to get
everyone caught up as of some years ago!

