
EGADS: A Scalable, Configurable, and Novel Anomaly Detection System - TomAnthony
http://yahoolabs.tumblr.com/post/118966433256/egads-a-scalable-configurable-and-novel-anomaly
======
graycat
In _anomaly detection_, that is, near real-time _monitoring_ of a system to see
whether it is sick or healthy, we collect data on the system, and in nearly all
practical, operational contexts there are, fundamentally and inescapably, two
ways for the detection effort to be wrong: (1) saying that the system is sick
when it is healthy and (2) saying that the system is healthy when it is sick.

Way (1) is a case of a _false alarm_, _false positive_, or _Type I error_,
and way (2) is a case of a _missed detection_, _false negative_, or _Type II
error_.

Or, near real-time _anomaly detection_ is basically a continuously applied
_statistical hypothesis test_.

Under mild assumptions, over time there will be a _rate_ of each of errors (1)
and (2). We want both rates to be low.

We would like to be able to adjust the rate of false alarms and know in
advance the rate we will be getting.

Moreover, since anomaly detectors differ, we want a detector that, for
whatever rate of false alarms we are willing to tolerate, gives us the lowest
achievable rate of missed detections.
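This tradeoff can be sketched numerically. The following is a minimal illustration, not anything from the OP or from my paper: it assumes Gaussian healthy readings near 100, a "sick" condition that shifts the mean upward, and a simple one-sided threshold detector whose threshold is set from a healthy sample to hit a chosen false alarm rate.

```python
import random

random.seed(0)

# Assumed healthy behavior for this sketch: readings near 100 with noise.
healthy = [random.gauss(100, 5) for _ in range(100_000)]

# Set the threshold so that roughly 1% of healthy readings exceed it,
# i.e., pick the false alarm rate in advance.
target_false_alarm = 0.01
threshold = sorted(healthy)[int((1 - target_false_alarm) * len(healthy))]

# Error rate (1), false alarms: alarms raised on fresh healthy data.
fresh_healthy = [random.gauss(100, 5) for _ in range(100_000)]
false_alarm_rate = sum(x > threshold for x in fresh_healthy) / len(fresh_healthy)

# Error rate (2), missed detections: no alarm on "sick" data (mean shifted up).
sick = [random.gauss(110, 5) for _ in range(100_000)]
missed_rate = sum(x <= threshold for x in sick) / len(sick)

print(false_alarm_rate, missed_rate)
```

Raising the threshold lowers the false alarm rate but raises the missed detection rate; that pair of rates is the whole story of the detector's quality.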

For current relatively reliable systems in server farms and networks, likely
we will have only a little data on the probability distribution of the data
when a system is sick.

So, we need a detector that is good for _zero day_ problems, that is, that
assumes nothing about the _sickness_ we are trying to detect.

For knowing the rate of false alarms, it can be very helpful to have the
conditional probability distribution of the data we are using assuming that
the system is healthy, but commonly computer systems are so complicated that
knowing that probability distribution in useful detail is not reasonable. So,
we would like a detector that is _distribution-free_ , that is, makes no
assumptions about probability distributions.
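One standard way to get that distribution-free property, sketched here on assumed, illustrative data (skewed, latency-like values; nothing from the OP): take the alarm threshold to be an order statistic, that is, an empirical quantile, of the healthy sample. The false alarm rate is then controlled without assuming Gaussian or any other distribution.

```python
import random

random.seed(1)

# Assumed healthy data for illustration: skewed, latency-like, far from Gaussian.
healthy = sorted(random.expovariate(1 / 20) for _ in range(200_000))

# Distribution-free threshold: an order statistic of the healthy sample.
alpha = 0.005  # desired false alarm rate
threshold = healthy[int((1 - alpha) * len(healthy)) - 1]

# The achieved false alarm rate on fresh healthy data tracks alpha,
# with no distributional assumption used anywhere above.
fresh = [random.expovariate(1 / 20) for _ in range(200_000)]
false_alarm_rate = sum(x > threshold for x in fresh) / len(fresh)
print(false_alarm_rate)
```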

Further, especially for monitoring systems as complicated as current server
farms and networks, we would like to be able to do well with input data on
several variables, not just one. That is, we want a detector that is
_multi-variate_.

So, we want to be able to adjust, and know in advance, the false alarm rate;
get for that false alarm rate a low rate of missed detections; be
distribution-free and multi-variate; and need essentially no data on the cases
of sickness we are trying to detect.
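A toy sketch with all of these properties at once (adjustable false alarm rate, distribution-free, multi-variate, no data on sickness): score each point by its distance to its nearest healthy neighbor and alarm when the score exceeds an empirical quantile of the healthy scores. Everything here is illustrative; it is not the OP's method and not the method in my paper.

```python
import math
import random

random.seed(2)

# Assumed healthy history in two variables (say, CPU load and request rate).
healthy = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

def nn_dist(p, sample):
    # Distance from p to its nearest neighbor in sample, excluding p itself.
    return min(math.dist(p, q) for q in sample if q is not p)

# Score healthy points by nearest-neighbor distance, then take an empirical
# quantile as the threshold: multivariate and distribution-free, and it uses
# no data at all about what "sick" looks like.
scores = sorted(nn_dist(p, healthy) for p in healthy)
alpha = 0.01  # rough false alarm rate
threshold = scores[int((1 - alpha) * len(scores)) - 1]

def is_anomalous(p):
    return nn_dist(p, healthy) > threshold

print(is_anomalous((0.1, -0.2)), is_anomalous((8.0, 8.0)))
```

The quantile controls the approximate false alarm rate; the nearest-neighbor score needs no distributional model and extends to any number of variables.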

From the OP, I was not able to evaluate their work on these criteria.

~~~
srean
In past conversations we had on machine learning, you mentioned your paper in
this domain. Unfortunately it is paywalled. Any chance you have a shareable
copy/link handy?

~~~
graycat
Yes, I have a PDF of my paper, but I might torque off the journal
(_Information Sciences_, 1999) if I make the PDF available in public. I don't
know the legal situation. The journal was nice enough to do a good job
handling my paper; I should not be nasty to them.

The point of my post here was to urge the authors of the OP and also others in
anomaly detection to address, for their work, carefully, the issues of the
error rates and being both distribution-free and multi-variate.

The paper of the OP starts with a description of the importance of anomaly
detection; I fully agree with their description; it is close to what I wrote
to VC firms for some years!

Alas, no VC firm was able to evaluate my work. Most firms just ignored my
contact. A few firms asked when all the associated infrastructure software
would be ready and when I would have early customers.

So, right, the VC firms were ignoring the work and looking just at the early
hints of _traction_.

And they had no clue about just how much work the infrastructure software
would be to be ready for a high end enterprise sale for a really important
function in their server farm or network.

Or, if I had all that work done, then why the heck would I still need their
cash? Because I had four kids, all ready for a freshman year in college and a
wife who had just given birth to triplets?

Or, why would a VC firm think that I was giving them part ownership of such a
nice body of work with paying customers, in high end enterprise computing?

Or, VCs like "warm introductions". Well, I was offering them a founder with a
STEM Ph.D. from one of the world's best research universities, with a
dissertation advisor who was President at CMU, with relevant experience in AI at IBM's
Watson lab, with a paper of original research and prototype software that had
already passed peer-review at a leading journal, and with more relevant
publications in the field.

And, still, what they wanted was a "warm introduction"? Did they know the
President at CMU? Likely not. They wanted a "warm introduction" from who, a
CEO in their portfolio? Why would they think that a founder with a good
project would know people like the CEOs in their portfolio? The arrogance of
VCs believing that they know the most knowledgeable people in technology. To
quote Bette Davis, "What a dump.", although what she was calling a "dump" was
actually quite nice.

Or, do I really want to report to a BoD of such clueless people?

That lesson was one of the early ones I got that VCs just will _not_ look at
unique, new, powerful, valuable technology.

So, some of the VC firms claimed, on their Web sites, that they have "deep
domain knowledge" in "enterprise software" and want "breakthrough technology"
(or some such) but, really, just want some good words back from some early
customers who are happy. What a scam. Can get good words back from happy
customers just from selling, say, lemonade.

The technology is what makes the work promising for a successful business, the
_technology_ , especially powerful technology with the power confirmed by
theorems and proofs in a peer-reviewed paper, not just some words from some
early customers.

The VCs seemed not to understand the work at all -- how the heck could I want
to report to such people on a BoD? I can't.

Net, to the VC firms, _technology_ is just flatly irrelevant. They believe in
essentially a Markov condition: Technology and the future financial value of
the business are conditionally independent given current _traction_.

If the US DoD had thought this way, then for the Manhattan Project they would
have said "You build and demonstrate one and then we will chip in for the
aviation gasoline for the Enola Gay". Similarly for the SR-71, GPS, Keyhole,
SOSUS, the B-2, the F-22, etc.

ROI? From a book of Richard Rhodes, the Manhattan Project cost about $3
billion but saved, according to some common estimates, 1 million US casualties
that would have resulted from the planned alternative, an invasion of the main
islands of Japan. So, that was $3000 per casualty. Fantastic ROI.

Thankfully for US national security, the US DoD is actually able to evaluate
technology, early on, just on paper, even on the back of an envelope. VCs? A
good peer-reviewed paper for an important problem, yes, with some prototype
software running on some real data -- nope. No wonder on average the ROI of
the VCs is poor.

VCs -- what a joke. What a total upchuck of a joke. "Deep domain knowledge"
-- what a scam.

Again, in strong contrast, the journal worked hard on my paper; I don't want
to kick dirt in the face of the journal.

For my paper, all you need is trip to a university library and a few coins for
a photocopy machine. For now, that's still the way such things work.

Heck, in part I still work off paper in books I bought and paid for. Just this
morning it finally dawned on me that one of the .NET collection classes
for _key, value_ pairs was storing the values without my ever saying what the
data type was. I was concerned about that, but I've programmed before with
memory address pointers with no associated _data type_ indications. Then,
finally, for some slightly more advanced usage of that collection class, I
discovered that .NET says I can't do that without enabling _late binding_.
Okay. So, I guessed, somewhere in .NET there is likely a solution. So, I got
out a big book by Francesco Balena, read a few pages, and saw that,
apparently, all I have to do to get what I really want is, for a class of
mine, say my Class A, and the .NET collection class List, type in something
like

Inherits List(Of A)

So, I still like books. And I have a lot of books I like. And copies of some
published papers.

There are some things I don't like: Did I mention VCs claiming "deep domain
knowledge" in enterprise software?

~~~
srean
Don't get me started on the draconian policies of journals that put findings
behind paywalls. Thanks to vocal complaints by academics, and in some cases
unvarnished mutiny, this is changing.

Information Sciences is owned by Elsevier, and Elsevier authors do enjoy the
right of forwarding preprints to colleagues:
http://cdn.elsevier.com/assets/pdf_file/0008/108674/AuthorUserRights.pdf
So if that is your only reluctance to share the PDF, you can share without
worrying about it.

@graycat Never mind, got the pdf. In any case my email is srean dot list at
gamil dot com (typo intended)

~~~
graycat
Thanks for the note from Elsevier.

So, if I can get an e-mail address for you, then, sure, I'll send a PDF.

