Several of the comments in this thread clarify what the Bloomberg machine learning work is about. With that clarification, maybe I see some issues with the work.
We can start with just the simplest case, ordinary, plain old regression with just one independent variable. So, for some positive integer n, we have pairs of real numbers (y_i, x_i) for i = 1, 2, ..., n. Here we are trying to predict the y_i from the x_i; we are trying to build a model that will predict y from a corresponding x where x is not in the data. Our model is to be a straight line, e.g., in high school form
y = ax + b
Okay, doing this, with the classic assumptions, we can draw a graph of the data, the fitted line, and the confidence (prediction) interval at each real number x.
For some intuition, suppose the x_i are all between 0 and 5. Maybe then the fit is quite good, the line is close to the data, and the confidence intervals are small.
But, IIRC, roughly or exactly, the confidence interval curves are actually hyperbolas. So, while the upper and lower curves are close for x between 0 and 5, for x outside the interval [0,5] the curves can grow far apart. So, if we are interested in the predictions of the model for, say, x = 20, the confidence intervals may be very wide, enough for us to conclude that, even though we have a straight line that fits our data closely, still our model is useless at x >= 20.
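To make that concrete, here is a minimal sketch with made-up data (not anything from the Bloomberg course): the textbook 95% prediction interval for simple linear regression. The (x0 - xbar)^2 term under the square root is what makes the band boundaries hyperbolic and what blows the interval up at x0 = 20.

    # Sketch: prediction interval width for simple linear regression,
    # synthetic data on [0, 5], evaluated inside and far outside that range.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 30
    x = rng.uniform(0, 5, n)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n)    # true line plus noise

    a, b = np.polyfit(x, y, 1)                   # least-squares fit of y = a*x + b
    resid = y - (a * x + b)
    s = np.sqrt(np.sum(resid**2) / (n - 2))      # residual standard error
    xbar, Sxx = x.mean(), np.sum((x - x.mean())**2)
    t = stats.t.ppf(0.975, n - 2)

    def prediction_interval(x0):
        """95% prediction interval for a new y at x0, classical normal-error model."""
        yhat = a * x0 + b
        half = t * s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)
        return yhat - half, yhat + half

    for x0 in (2.5, 20.0):
        lo, hi = prediction_interval(x0)
        print(f"x0 = {x0:5.1f}: interval width = {hi - lo:.2f}")

Run it and the interval at x0 = 20 comes out well over twice as wide as the one at x0 = 2.5, even though the fitted line itself looks fine.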
So, this little example illustrates a broad point about such curve fitting: The model might work well for independent variable x (likely a vector of several components) close to the training data but commonly be awful otherwise.
How serious this situation is can vary a lot depending on the application. E.g., if we are interested only in the values of y for x a lot like the training data, e.g., in the interval [0,5], then maybe we don't care about the y value or the confidence interval when x = 20.
But quite broadly there are applications where what we want such models to tell us is the value of y for some x not close to what we have seen in our training data.
Uh, let's see: IIRC Bloomberg is selling stock market and economic data, often in nearly real time, to investors, some of whom are traders and make trades quickly, within a few seconds, based on the data from Bloomberg.
I'm not an up-to-date expert on just what the Bloomberg customers are doing, but from 20,000 feet or so up, maybe the situation is something like this -- broadly, the investors vary:
(A) Some investors want a portfolio constructed much like in the work of H. Markowitz or W. Sharpe. In simple terms, they want the portfolio to have good expected return with low standard deviation of return and maybe, then, buy on margin to raise the rate of return while still having the risk relatively low.
(B) Some investors are interested in relationships between stocks and options -- e.g., the Black-Scholes work is an example of this. IIRC, a more general case is some stochastic process, maybe Brownian motion, reaching a boundary and a first exit. The exit has a value, so the investment problem is a boundary value problem. IIRC one can design and attempt to evaluate exotic options with such ideas (see the sketch just after this list).
(C) Some investors are just stock pickers and buy when they sense a sudden rise in price.
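On the first-exit idea in (B), here is a minimal illustration of my own (made-up boundaries and payoff, nothing from Black-Scholes specifically): simulate Brownian motion paths, stop each at its first exit from a band, and average the payoff assigned to the exit -- a crude Monte Carlo stand-in for solving the boundary value problem.

    # Sketch: expected payoff at the first exit of Brownian motion from a band,
    # estimated by Monte Carlo. Boundaries and payoff are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    lower, upper = -1.0, 2.0                 # exit boundaries
    x0, dt = 0.0, 1e-3
    n_paths, n_steps = 20_000, 50_000

    x = np.full(n_paths, x0)
    payoff = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)     # paths that have not exited yet

    for _ in range(n_steps):
        x[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
        hit_up = alive & (x >= upper)
        hit_dn = alive & (x <= lower)
        payoff[hit_up] = 1.0                 # exit through the top pays 1, bottom pays 0
        alive &= ~(hit_up | hit_dn)
        if not alive.any():
            break

    print("estimated value:", payoff.mean())
    # For driftless Brownian motion started at 0 this should be close to
    # (x0 - lower) / (upper - lower) = 1/3.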
But a theme in (A)-(C) is that the investors are looking for something unusual. So, in the model building, the unusual may have been rare in the training data and the testing data. In that case, without more assumptions, theories, or whatever, the prediction of the model for unusual input data may be poor.
That is, it appears that the model building techniques promise that the model will do poorly in just the application cases of greatest interest to investors -- the unusual cases that are not well represented in the training and test data.
So, maybe as a first cut, some of what is needed is anomaly detection.
So, we could use more information about the systems we are trying to model. A linearity assumption is one such. With Newton's second law and law of gravity, we can check them for falling apples. Next we can try them on the planets in our solar system and, nicely enough, see that they work. And then we can be pretty sure for a rocket at Mach 15 headed for a stationary orbit, etc. But with just empirical curve fitting, apparently we mostly don't have such additional information.
IIRC L. Breiman's first interest in empirical curve fitting was for clinical medical data. So, maybe in that data he was trying to predict some disease, but the independent variable data he was using was common in his training and testing data. I.e., he wasn't really looking to exploit some anomaly for some once-in-20-years way to get rich quick.
So the extrapolation-type problem you describe (an input not near any of your training examples) is an issue. Unless you have a world model you believe in (i.e. you've done some science -- not just statistics), hard to know if your prediction function works out there where you’ve never seen any examples. If you’ve seen some data out there, but relatively fewer than you see in deployment, then importance weighting or other approaches from covariate shift / domain adaptation could help.
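For readers, one common version of that importance weighting is the density-ratio-via-classifier trick; a sketch under the usual covariate shift assumptions (the variable names and the choice of logistic regression are mine):

    # Sketch: covariate-shift importance weighting.
    # Train a classifier to tell training x's from deployment x's; the odds
    # ratio estimates p_deploy(x) / p_train(x) and becomes a sample weight.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    def covariate_shift_weights(X_train, X_deploy):
        X = np.vstack([X_train, X_deploy])
        z = np.r_[np.zeros(len(X_train)), np.ones(len(X_deploy))]   # 0 = train, 1 = deploy
        clf = LogisticRegression(max_iter=1000).fit(X, z)
        p = clf.predict_proba(X_train)[:, 1]                        # P(deploy | x)
        return (p / (1 - p)) * (len(X_train) / len(X_deploy))       # odds, prior-corrected

    # Usage sketch: reweight the training loss toward the deployment distribution.
    # w = covariate_shift_weights(X_train, X_deploy)
    # model = LinearRegression().fit(X_train, y_train, sample_weight=w)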
Anomaly detection is definitely another important area, but I struggle to pull together a coherent unit on the topic. One issue is that it’s difficult to define precisely, at least partly because everybody means something a little bit different by it.
Also, based on classical hypothesis testing, I think that to some extent you have to know what you're looking for to be able to detect it (i.e., to have power against the alternatives/anomalies you care about)... For that reason, I think it's hard to separate anomaly detection from more general risk analysis/assessment, because you need to know the type of thing you care about finding.
So, for anomaly detection, before evaluating the model at x, we might want to know if x would be an anomaly in the training data x_i, i = 1, 2, ..., n. Sure, x is likely a vector with several to many components.
An anomaly detector should be at least as good as a statistical hypothesis test.
So, for the null hypothesis, assume that x is distributed like the training data.
Okay, except we don't really know the distribution of the training data.
"Ma! Help! What am I supposed to do now???"
So, we need a statistical hypothesis test that is both multi-dimensional and distribution-free.
Let's see: In ergodic theory we consider transformations that are measure preserving .... Yup, we can have a group (as in abstract algebra) of those, sum over the group, ..., and calculate the significance level of the test and, thus, get a real hypothesis test, multi-dimensional and distribution-free. For some of the details of the test, there are lots of variations, i.e., options, knobs to turn.
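I won't give my details, but for readers, here is one standard way to get an exact, multi-dimensional, distribution-free test of the null hypothesis that x is exchangeable with the x_i -- a sketch using a nearest-neighbor score and a rank (permutation) argument; the choice of score is mine, not necessarily what my work used.

    # Sketch: distribution-free test of "x is distributed like the x_i in R^d".
    # Score each point by its distance to its k-th nearest neighbor among the
    # other points; under the null (exchangeability), the rank of the new
    # point's score among all n+1 scores is uniform, giving an exact p-value.
    import numpy as np

    def knn_score(point, others, k=5):
        d = np.sort(np.linalg.norm(others - point, axis=1))
        return d[k - 1]                       # distance to k-th nearest neighbor

    def anomaly_p_value(x_new, X_train, k=5):
        Z = np.vstack([X_train, x_new])       # augmented sample of size n+1
        m = len(Z)
        scores = np.array([knn_score(Z[i], np.delete(Z, i, axis=0), k)
                           for i in range(m)])
        # Under the null all orderings are equally likely, so the rank of the
        # new point's score is uniform on {1, ..., n+1}.
        return np.sum(scores >= scores[-1]) / m

    # Usage sketch:
    # if anomaly_p_value(x, X_train) < 0.01:
    #     ...treat x as anomalous; don't trust the fitted model there...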
Detection rate? Hmm. Depends ...! We don't have enough data to use the Neyman-Pearson approach, but in a curious but still relevant sense the detection rate is the highest possible.
I just call this work statistics, but maybe it would also qualify according to some definitions as machine learning. But my work is not merely heuristic and has nothing to do with regression analysis or neural networks. So, again, my work is an example that there can be more to machine learning than empirical curve fitting.
So, before applying an empirically fitted model at x, we want x to be distributed like the training data, and at least we want an hypothesis test not to reject the null hypothesis that x is so distributed.
More generally, if we are looking for anomalies in the data, say, a rapid real-time stream, then when we see an anomaly, we investigate further. In this case, an anomaly detector is a first cut filter, an alarm, to justify further investigation.
Looking back on what I did, I suspect that more could be done and that some of what I did could be done better.
Of course, my interests now are my startup. Yes, there the crucial core is some applied math I derived.
Maybe I'll use my anomaly detection work for real-time monitoring for zero-day problems in security, performance, failures, etc. in my server farm.
As a very general but crude and blunt approach to show that the hypothesis tests were not trivial, I used the result of S. Ulam that Le Cam called "tightness", as in P. Billingsley, Convergence of Probability Measures. When doing both multi-dimensional and distribution-free, we are nearly way out in the ozone, so we get pushed into some abstract techniques! Meow!
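For readers, the Ulam result in question is, I believe, the one Billingsley states as: every probability measure on a complete separable metric space is tight, i.e.,

    % Ulam's theorem ("tightness"), as in Billingsley:
    % for any Borel probability measure P on a complete separable metric space S,
    \forall \varepsilon > 0 \;\; \exists K \subseteq S \text{ compact such that } P(K) \ge 1 - \varepsilon .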
Okay, I started watching your video lectures, at your 15 City Sense Data, and got through your application of maximum likelihood estimation.
So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" for your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two-step technique -- (i) use heuristics and then (ii) validate with the test data -- "machine learning". Okay.
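To make the two-step procedure concrete, here is roughly what I understand it to be, as a little sketch with made-up count data (the candidate distributions and the method-of-moments step for the negative binomial are my choices, not necessarily what the lecture does):

    # Sketch: heuristic distribution fitting plus a test-data check.
    # Fit Poisson (MLE = sample mean) and negative binomial (method of moments)
    # on training counts, then compare held-out log-likelihoods.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    data = rng.negative_binomial(3, 0.3, size=500)   # made-up counts
    train, test = data[:400], data[400:]

    lam = train.mean()                               # Poisson MLE of the rate
    ll_pois = stats.poisson.logpmf(test, lam).sum()

    m, v = train.mean(), train.var(ddof=1)           # neg. binomial, method of moments
    p = m / v                                        # requires v > m
    r = m * p / (1 - p)
    ll_nb = stats.nbinom.logpmf(test, r, p).sum()

    print(f"held-out log-likelihood: Poisson {ll_pois:.1f}, neg. binomial {ll_nb:.1f}")
    # Pick whichever fits the held-out data better -- the "heuristics, then validate" step.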
But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data.
An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopmans. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can do similar things for, say, arrivals at a Web site because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis was that the arrivals were as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far.
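For the Web site arrivals version of that, a minimal sketch of what I have in mind (all numbers made up): estimate the rate from ordinary history, then test a new 10-minute count against the Poisson null, with no "training" on anomalies at all.

    # Sketch: Poisson-process null for Web site arrivals.
    # Estimate the rate from ordinary history, then test whether a new
    # 10-minute count is surprisingly large under that Poisson null.
    from scipy import stats

    arrivals, minutes = 2400, 400.0     # made-up history
    lam = arrivals / minutes            # MLE of the arrival rate, per minute

    window = 10.0                       # minutes
    observed = 95                       # arrivals in the new window (made up)

    # One-sided p-value: P(N >= observed) when N ~ Poisson(lam * window)
    p_value = stats.poisson.sf(observed - 1, lam * window)
    print(f"rate {lam:.1f}/min, expected {lam * window:.0f} in the window, p = {p_value:.2e}")
    # A tiny p-value flags an anomalous burst we may never have seen before.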
So, generally we have two approaches:
(1) Machine learning where we get test data, use heuristics to build a model, test the model on test data, and declare success when we pass the test.
(2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not).
A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating.
E.g., if we have positive integers n and d, the set of real numbers R, and training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i is in R and x_i is in R^d, and we fit a model that predicts each y_i from each x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i.
So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model at x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check whether x in R^d is distributed like the x_i in R^d.
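In code, that check is just a gate in front of f; e.g., reusing the anomaly_p_value sketch from above (f, alpha, and the refusal behavior are placeholders):

    # Sketch: only trust the fitted model where x looks like the training data.
    def guarded_predict(f, x, X_train, alpha=0.01):
        if anomaly_p_value(x, X_train) < alpha:   # test from the earlier sketch
            return None                           # refuse to extrapolate; flag for review
        return f(x)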
I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d.
I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., we can't apply this approach to the problem of the submarine lifetimes) and (B) we are limited in the values of x that can be used in the model (we can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site).
From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing.