
Probabilistic Programming for Anomaly Detection - n-s-f
http://blog.fastforwardlabs.com/post/143792498983/probabilistic-programming-for-anomaly-detection
======
vamin
It kind of seems like the example in the IPython notebook is a red herring.
Wouldn't fitting a multivariate Gaussian directly perform just as well as
probabilistic programming in this case?

~~~
nightski
Unless I am missing something, it is using a multivariate Gaussian. What the
probabilistic programming is doing is inferring the mean and variance from the
data:

    cons[i] ~ normal(mu_cons, sigma_cons) T[0.0,];

    latency[i] ~ normal(beta * cons[i], sigma_latency) T[0.0,]; // latency is linearly related to cons

These are both normal distributions.
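
For context, here is a minimal sketch of how a model like this might be run
end to end. I'm assuming PyStan 2's interface here, and the synthetic data and
sampler settings are illustrative, not taken from the post:

    import numpy as np
    import pystan  # PyStan 2.x interface assumed

    # Illustrative synthetic data standing in for the post's metrics.
    rng = np.random.default_rng(0)
    cons = np.abs(rng.normal(50.0, 10.0, size=200))           # connections
    latency = np.abs(0.5 * cons + rng.normal(0.0, 2.0, 200))  # ~linear in cons

    model_code = """
    data {
      int<lower=1> N;
      vector<lower=0>[N] cons;
      vector<lower=0>[N] latency;
    }
    parameters {
      real<lower=0> mu_cons;
      real<lower=0> sigma_cons;
      real<lower=0> beta;
      real<lower=0> sigma_latency;
    }
    model {
      for (i in 1:N) {
        cons[i] ~ normal(mu_cons, sigma_cons) T[0.0,];
        latency[i] ~ normal(beta * cons[i], sigma_latency) T[0.0,];
      }
    }
    """

    fit = pystan.StanModel(model_code=model_code).sampling(
        data={"N": len(cons), "cons": cons, "latency": latency},
        iter=2000, chains=4)
    print(fit)  # posterior summaries for mu_cons, sigma_cons, beta, sigma_latency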

------
elwell
Related: "Anomaly detection from server log data"
([http://www.vtt.fi/inf/pdf/tiedotteet/2009/T2480.pdf](http://www.vtt.fi/inf/pdf/tiedotteet/2009/T2480.pdf))

------
huac
How does this scale to high-dimensional data? The example uses 2 dimensions,
connections and latency, but real-world (performance) data will often have
many more associated variables. I don't know a lot about Bayesian methods, but
it seems like calculating joint posteriors would become slow.

But definitely a very interesting approach, and the Stan code is surprisingly
readable!

~~~
ppcsf
Stan uses Markov chain Monte Carlo for inference. MCMC algorithms can do very
well at approximating the high-dimensional integrals required to calculate the
posterior; estimating 1000-dimensional distributions with MCMC methods is not
unusual.
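
To make that concrete, here's a toy random-walk Metropolis sampler on a
1000-dimensional standard normal. This is purely illustrative; a naive sampler
like this would need careful tuning on a real posterior, which is the part
Stan's adaptive sampler handles for you:

    import numpy as np

    def metropolis(log_p, x0, steps=5000, scale=0.1):
        """Random-walk Metropolis: propose a Gaussian step, accept or reject."""
        x = np.asarray(x0, dtype=float)
        samples = np.empty((steps, x.size))
        lp = log_p(x)
        rng = np.random.default_rng(0)
        for t in range(steps):
            proposal = x + scale * rng.standard_normal(x.size)
            lp_proposal = log_p(proposal)
            if np.log(rng.random()) < lp_proposal - lp:  # Metropolis acceptance
                x, lp = proposal, lp_proposal
            samples[t] = x
        return samples

    d = 1000
    chain = metropolis(lambda x: -0.5 * np.dot(x, x), np.zeros(d))
    print(chain.mean(axis=0)[:5])  # each coordinate's mean should be near 0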

~~~
glial
Not unusual, but the sampling can take a long time and scales poorly with the
amount of data you have.

------
craigching
Could someone clarify what PPL they're using? Presumably something based on
Python, but beyond that? I hadn't come across PPLs before and had hoped to
learn something from the blog post, but there wasn't much information. Some
experience sharing with PPLs would be really awesome!

~~~
tmarkovich
From the IPython notebook associated with the blog post [1], they appear to be
using Stan [2]. Stan is a relatively mature PPL, designed with specific
priorities in mind, such as handling continuous variables well. Overall, it's
quite cool to see a startup putting a PPL to work in the wild!

[1]
[https://github.com/fastforwardlabs/anomaly_detection/blob/ma...](https://github.com/fastforwardlabs/anomaly_detection/blob/master/Anomaly%20Detection%20Post.ipynb)

[2] [http://mc-stan.org](http://mc-stan.org)

*edit, formatting to get line-breaks.

------
graycat
Some of the leading challenges in anomaly detection are:

(1) Goal. What is the goal? One goal is near _real-time_ monitoring for
_health and wellness_, with detection as soon as possible. This goal can be
appropriate when monitoring on-line systems, say, server farms, digital
communications networks, or process plants. For much of medicine, there is
more time for detection.

(2) Old Data. What is assumed about the input data, that is, the _old_ or
_historical_ data? One leading assumption is that the data is a second-order
stationary stochastic process. Another leading assumption is that the data is
independent and identically distributed.

(3) Anomalies. There are essentially two important cases about the anomalies.
First, we have seen the anomalies before, have some good data characterizing
them, and now are just looking for those old anomalies in new data. An example
here is looking for _signatures_ of malware.

Second, we have not seen the anomalies before and are trying to detect them
for the first time, that is, the _zero-day_ case.

(4) Real-Time Assumption. If doing real-time monitoring, we likely want to
assume that the real-time data is probabilistically the same as the _old_,
_historical_ data.

(5) False Alarm Rate. Once the _context_ of the work as in (1)-(4) is clear,
likely the most important goal is to be able to select a desired false alarm
rate and get that rate in practice (a small distribution-free sketch of one
way to do this follows after this list).

(6) Detection Rate. The next major challenge is, for whatever false alarm rate
we are willing to tolerate, how to get the highest possible detection rate. Of
course, we might like to achieve the highest possible detection rate in the
sense of the classic Neyman-Pearson result, e.g., in

[https://news.ycombinator.com/item?id=11440459](https://news.ycombinator.com/item?id=11440459)

(7) Nominal versus Interval Data. Another challenge is how to handle data that
is _nominal_ (i.e., with just a few values known in advance) or _interval_,
i.e., essentially any real numbers.

(8) Multi-Dimensional Data. Another challenge is how to get a higher detection
rate by jointly exploiting data on several variables.

(9) Distribution-Free. Another challenge is how to handle situations where
there is little or no chance of knowing the joint probability distribution of
the data on several variables. E.g., it is easy to get input data on 20
variables, but getting enough data on 20 variables to have an accurate
description of their joint probability distribution stands to be challenging,
say, 1000^20 data points or some such. So, we can want techniques that are
_distribution-free_, that is, that make no assumptions about probability
distributions.
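
To make (5) and (9) concrete, here is a tiny distribution-free sketch: set the
detection threshold at an empirical quantile of anomaly scores on the _old_
data, so that if new data is probabilistically like the old data, as in (4),
the false alarm rate lands near the chosen level. The scores and data here are
purely illustrative:

    import numpy as np

    def threshold_for_false_alarm_rate(historical_scores, alpha):
        """The (1 - alpha) empirical quantile of historical anomaly scores."""
        return np.quantile(historical_scores, 1.0 - alpha)

    rng = np.random.default_rng(1)
    old = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # stand-in history
    thresh = threshold_for_false_alarm_rate(old, alpha=0.001)

    # New data drawn from the same distribution: alarms fire at ~ alpha.
    new = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
    print("observed false alarm rate:", np.mean(new > thresh))  # ~0.001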

Anomaly detection is often close to _hypothesis testing_ in statistics, and
that is a large and mature field with a lot of good work.

~~~
huac
What do you think of approaching anomaly detection as a semi-supervised
classification problem?

~~~
graycat
I suggest approaching anomaly detection in ways to get good answers to the
nine challenges in my post and any additional important challenges.

What you are suggesting may be related to what I suggested in my post

[https://news.ycombinator.com/item?id=11440459](https://news.ycombinator.com/item?id=11440459)

There, for both the historical and the new data, we have only some discrete
cases, and only a few thousand or so of those. And in that example there was
data on both _healthy_ and _sick_. So, we could go fairly directly to the
classic Neyman-Pearson result, as in the derivations in that post. But that
example was quite special.

 _Classification_ seems usually to be for only some modest number of discrete
cases, "modest" enough that we can get, store, and manipulate data on all the
cases in advance (_supervised_) and before starting to detect anomalies. For
data on several variables, the number of cases stands to grow exponentially in
the number of variables, say, as in the 1000^20 for 20 variables in my post
above.

Another approach is to do some work with nearest neighbors; for that, under
some commonly justified assumptions, there is a way to set a desired false
alarm rate, get that rate in practice, and get some good properties for the
detection rate.
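
A hedged sketch of that nearest-neighbor idea (the scoring rule, the choice of
k, and the threshold here are illustrative, not a specific published method):
score each point by its distance to its k-th nearest neighbor among historical
data, and flag points whose score exceeds a quantile threshold fit on held-out
history.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(2)
    history = rng.normal(size=(5000, 2))        # "healthy" historical data
    fit_part, holdout = history[:4000], history[4000:]

    k = 10
    nn = NearestNeighbors(n_neighbors=k).fit(fit_part)

    def knn_score(points):
        """Distance to the k-th nearest historical neighbor."""
        dists, _ = nn.kneighbors(points)
        return dists[:, -1]

    alpha = 0.01  # desired false alarm rate
    thresh = np.quantile(knn_score(holdout), 1.0 - alpha)

    new_points = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])
    print(np.nonzero(knn_score(new_points) > thresh)[0])  # flags the outlier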

Again, I see little choice but to see much of _anomaly detection_ as part of
statistical hypothesis testing.

~~~
huac
It seems like a semi-supervised classification framework can address each of
those 9 issues. I agree, btw, that each is important, but I would emphasize
the importance of understanding collinearity/correlation in multidimensional
data. Maybe also consider online updating (adding new data to the old data).

I think I phrased my previous comment poorly; in the OP's example, and in a
pdf linked elsewhere in this thread
([https://news.ycombinator.com/item?id=11623647](https://news.ycombinator.com/item?id=11623647)),
the anomaly detection problem concerns server data. The motivation for finding
anomalies is then to prevent server outages/delays, and it's that problem I
was considering (rather than anomalies per se, which can be good or bad or
just special).

My thinking is that under a semi-supervised framework we would take some
historical data (where each point in time is associated with n readings of a
system), mark known anomalous periods (e.g., if there's an outage detected by
some other system), train a classification model on this previous data, and
then attempt to classify new points in time as anomalous or not. I make no
claims about which kind of classification model; nearest neighbors or logistic
regression would surely be fine.
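
A hedged sketch of that framework, with made-up readings and one marked outage
period (logistic regression here is just one possible choice of model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    n = 10_000
    X = rng.normal(size=(n, 4))       # n points in time, 4 readings each
    y = np.zeros(n, dtype=int)
    y[2000:2100] = 1                  # a known outage, marked anomalous
    X[2000:2100] += 3.0               # readings shift during the outage

    clf = LogisticRegression(max_iter=1000).fit(X, y)

    X_new = rng.normal(size=(5, 4))
    X_new[-1] += 3.0                  # one suspicious new point in time
    print(clf.predict_proba(X_new)[:, 1])  # anomaly probability per point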

With that understanding, I don't follow what you mean by 'modest number of
discrete cases' for classification. I follow most of your linked approach,
which I'm understanding as similar to a Naive Bayes method (which,
incidentally, doesn't appear at face value to deal with your point #7 about
continuous data); it's quite interesting how that eventually ends up
intimately connected with the hypothesis-testing math. But if we frame the
question this way, as having a desired result ('normal' data) and an undesired
result ('anomalous' data), then I don't see how a classification approach
wouldn't work.

~~~
jmmcd
Part of the problem in anomaly detection is that sometimes anomalies are of a
"novel" type. It can be impossible to collect labelled data on all possible
anomalies, so learning a model of normal behaviour turns out to work better.
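
One common way to learn a model of normal behaviour, as a minimal sketch (the
one-class SVM and its parameters are just one illustrative choice among many):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(4)
    normal_data = rng.normal(size=(2000, 3))       # "normal" history only

    model = OneClassSVM(nu=0.01).fit(normal_data)  # nu ~ tolerated outlier rate

    new = np.vstack([rng.normal(size=(4, 3)), [[6.0, 6.0, 6.0]]])
    print(model.predict(new))  # +1 = looks normal, -1 = flagged as novel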

But if we're talking about a specific application, where there's only one (or
a few) types of bad behaviour, then it's probably worthwhile to manually label
some data and use binary classification as you suggested.

~~~
graycat
Yes, for your first point, you seem to be considering _behavioral monitoring_,
where we attempt to characterize the _behavior_ of a _healthy_ or _normal_
system and declare everything else a _sick_ system or an _anomaly_.

Yes, for your second point, an example seems to be detecting malware in
communications data streams or hard disk files by looking for specific,
characterizing patterns of bits, that is, _signatures_ extracted from earlier
detections of such malware.

In behavioral monitoring, we might have challenges keeping the false alarm
rate low.

