Several of the comments in this thread clarify what the Bloomberg machine learning work is about. With that clarification, maybe I see some issues with the work.
We can start with just the simplest case, ordinary, plain old regression with just one independent variable. So, for some positive integer n, we have pairs of real numbers (y_i, x_i) for i = 1, 2, ..., n. Here we are trying to predict the y_i from the x_i; we are trying to build a model that will predict y from a corresponding x where x is not in the data. Our model is to be a straight line, e.g., in high school form
y = ax + b
Okay, doing this, with the classic assumptions, we can draw a graph of the data, the fitted line, and the confidence (prediction) interval at each real number x.
For some intuition, suppose the x_i are all between 0 and 5. Maybe then the fit is quite good, the line is close to the data, and the confidence intervals are small.
But, IIRC, roughly or exactly, the confidence interval curves are actually hyperbolas. So, while the upper and lower curves are close for x between 0 and 5, for x outside the interval [0,5] the curves can grow far apart. So, if we are interested in the predictions of the model for, say, x = 20, the confidence intervals may be very wide, enough for us to conclude that, even though we have a straight line that fits our data closely, still our model is useless at x >= 20.
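To make that concrete, here is a minimal sketch with made-up data (not anything from the Bloomberg course): the textbook 95% prediction interval for simple linear regression. The (x0 - xbar)^2 term under the square root is what makes the band boundaries hyperbolic and what blows the interval up at x0 = 20.

    # Sketch: prediction interval width for simple linear regression,
    # synthetic data on [0, 5], evaluated inside and far outside that range.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 30
    x = rng.uniform(0, 5, n)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n)    # true line plus noise

    a, b = np.polyfit(x, y, 1)                   # least-squares fit of y = a*x + b
    resid = y - (a * x + b)
    s = np.sqrt(np.sum(resid**2) / (n - 2))      # residual standard error
    xbar, Sxx = x.mean(), np.sum((x - x.mean())**2)
    t = stats.t.ppf(0.975, n - 2)

    def prediction_interval(x0):
        """95% prediction interval for a new y at x0, classical normal-error model."""
        yhat = a * x0 + b
        half = t * s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)
        return yhat - half, yhat + half

    for x0 in (2.5, 20.0):
        lo, hi = prediction_interval(x0)
        print(f"x0 = {x0:5.1f}: interval width = {hi - lo:.2f}")

Run it and the interval at x0 = 20 comes out well over twice as wide as the one at x0 = 2.5, even though the fitted line itself looks fine.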
So, this little example illustrates a broad point about such curve fitting: The model might work well for independent variable x (likely a vector of several components) close to the training data but commonly be awful otherwise.
How serious this situation is can vary a lot depending on the application. E.g., if we are interested only in the values of y for x a lot like the training data, e.g., in the interval [0,5], then maybe we don't care about the y value or the confidence interval when x = 20.
But quite broadly there are applications where what we want such models to tell us is the value of y for some x not close to what we have seen in our training data.
Uh, let's see: IIRC Bloomberg is selling stock market and economic data, often in nearly real time, to investors, some of whom are traders and make trades quickly, within a few seconds, based on the data from Bloomberg.
I'm not an up-to-date expert on just what the Bloomberg customers are doing, but from 20,000 feet or so up, maybe the situation is something like this -- broadly, the investors vary:
(A) Some investors want a portfolio constructed much like in the work of H. Markowitz or W. Sharpe. In simple terms, they want the portfolio to have good expected return with low standard deviation of return and maybe, then, buy on margin to raise the rate of return while still having the risk relatively low.
(B) Some investors are interested in relationships between stocks and options -- e.g., the Black-Scholes work is an example of this. IIRC, a more general case is some stochastic process, maybe Brownian motion, reaching a boundary and a first exit. The exit has a value, so the investment problem is a boundary value problem. IIRC one can design and attempt to evaluate exotic options with such ideas (see the sketch just after this list).
(C) Some investors are just stock pickers and buy when they sense a sudden rise in price.
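On the first-exit idea in (B), here is a minimal illustration of my own (made-up boundaries and payoff, nothing from Black-Scholes specifically): simulate Brownian motion paths, stop each at its first exit from a band, and average the payoff assigned to the exit -- a crude Monte Carlo stand-in for solving the boundary value problem.

    # Sketch: expected payoff at the first exit of Brownian motion from a band,
    # estimated by Monte Carlo. Boundaries and payoff are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    lower, upper = -1.0, 2.0                 # exit boundaries
    x0, dt = 0.0, 1e-3
    n_paths, n_steps = 20_000, 50_000

    x = np.full(n_paths, x0)
    payoff = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)     # paths that have not exited yet

    for _ in range(n_steps):
        x[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
        hit_up = alive & (x >= upper)
        hit_dn = alive & (x <= lower)
        payoff[hit_up] = 1.0                 # exit through the top pays 1, bottom pays 0
        alive &= ~(hit_up | hit_dn)
        if not alive.any():
            break

    print("estimated value:", payoff.mean())
    # For driftless Brownian motion started at 0 this should be close to
    # (x0 - lower) / (upper - lower) = 1/3.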
But a theme in (A)-(C) is that the investors are looking for something unusual. So, in the model building, the unusual may have been rare in the training data and the testing data. In that case, without more assumptions, theories, or whatever, the prediction of the model for unusual input data may be poor.
That is, it appears that the model building techniques promise that the model will do poorly in just the application cases of greatest interest to investors -- the unusual cases that are not well represented in the training and test data.
So, maybe as a first cut, some of what is needed is anomaly detection.
So, we could use more information about the systems we are trying to model. A linearity assumption is one such. With Newton's second law and law of gravity, we can check them for falling apples. Next we can try them on the planets in our solar system and, nicely enough, see that they work. And then we can be pretty sure for a rocket at Mach 15 headed for a stationary orbit, etc. But with just empirical curve fitting, apparently we mostly don't have such additional information.
IIRC L. Breiman's first interest in empirical curve fitting was for clinical medical data. So, maybe in that data he was trying to predict some disease, but the independent variable data he was using was common in his training and testing data. I.e., he wasn't really looking to exploit some anomaly for some once-in-20-years way to get rich quick.
So the extrapolation-type problem you describe (an input not near any of your training examples) is an issue. Unless you have a world model you believe in (i.e. you've done some science -- not just statistics), hard to know if your prediction function works out there where you’ve never seen any examples. If you’ve seen some data out there, but relatively fewer than you see in deployment, then importance weighting or other approaches from covariate shift / domain adaptation could help.
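For readers, one common version of that importance weighting is the density-ratio-via-classifier trick; a sketch under the usual covariate shift assumptions (the variable names and the choice of logistic regression are mine):

    # Sketch: covariate-shift importance weighting.
    # Train a classifier to tell training x's from deployment x's; the odds
    # ratio estimates p_deploy(x) / p_train(x) and becomes a sample weight.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    def covariate_shift_weights(X_train, X_deploy):
        X = np.vstack([X_train, X_deploy])
        z = np.r_[np.zeros(len(X_train)), np.ones(len(X_deploy))]   # 0 = train, 1 = deploy
        clf = LogisticRegression(max_iter=1000).fit(X, z)
        p = clf.predict_proba(X_train)[:, 1]                        # P(deploy | x)
        return (p / (1 - p)) * (len(X_train) / len(X_deploy))       # odds, prior-corrected

    # Usage sketch: reweight the training loss toward the deployment distribution.
    # w = covariate_shift_weights(X_train, X_deploy)
    # model = LinearRegression().fit(X_train, y_train, sample_weight=w)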
Anomaly detection is definitely another important area, but I struggle to pull together a coherent unit on the topic. One issue is that it’s difficult to define precisely, at least partly because everybody means something a little bit different by it.
Also, based on classical hypothesis testing, I think that to some extent you have to know what you're looking for to be able to detect it (i.e., to have power against the alternatives/anomalies you care about)... For that reason, I think it's hard to separate anomaly detection from more general risk analysis/assessment, because you need to know the type of thing you care about finding.
So, for anomaly detection, before evaluating the model at x, we might want to know if x would be an anomaly in the training data x_i, i = 1, 2, ..., n. Sure, x is likely a vector with several to many components.
An anomaly detector should be at least as good as a statistical hypothesis test.
So, for the null hypothesis, assume that x is distributed like the training data.
Okay, except we don't really know the distribution of the training data.
"Ma! Help! What am I supposed to do now???"
So, we need a statistical hypothesis test that is both multi-dimensional and distribution-free.
Let's see: In ergodic theory we consider transformations that are measure preserving .... Yup, we can have a group (as in abstract algebra) of those, sum over the group, ..., and calculate the significance level of the test and, thus, get a real hypothesis test, multi-dimensional and distribution-free. For some of the details of the test, there are lots of variations, i.e., options, knobs to turn.
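I won't give my details, but for readers, here is one standard way to get an exact, multi-dimensional, distribution-free test of the null hypothesis that x is exchangeable with the x_i -- a sketch using a nearest-neighbor score and a rank (permutation) argument; the choice of score is mine, not necessarily what my work used.

    # Sketch: distribution-free test of "x is distributed like the x_i in R^d".
    # Score each point by its distance to its k-th nearest neighbor among the
    # other points; under the null (exchangeability), the rank of the new
    # point's score among all n+1 scores is uniform, giving an exact p-value.
    import numpy as np

    def knn_score(point, others, k=5):
        d = np.sort(np.linalg.norm(others - point, axis=1))
        return d[k - 1]                       # distance to k-th nearest neighbor

    def anomaly_p_value(x_new, X_train, k=5):
        Z = np.vstack([X_train, x_new])       # augmented sample of size n+1
        m = len(Z)
        scores = np.array([knn_score(Z[i], np.delete(Z, i, axis=0), k)
                           for i in range(m)])
        # Under the null all orderings are equally likely, so the rank of the
        # new point's score is uniform on {1, ..., n+1}.
        return np.sum(scores >= scores[-1]) / m

    # Usage sketch:
    # if anomaly_p_value(x, X_train) < 0.01:
    #     ...treat x as anomalous; don't trust the fitted model there...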
Detection rate? Hmm. Depends ...! We don't have enough data to use the Neyman-Pearson approach, but in a curious but still relevant sense the detection rate is the highest possible.
I just call this work statistics, but maybe it would also qualify according to some definitions as machine learning. But my work is not merely heuristic and has nothing to do with regression analysis or neural networks. So, again, my work is an example that there can be more to machine learning than empirical curve fitting.
So, before applying an empirically fitted model at x, we want x to be distributed like the training data, and at least we want an hypothesis test not to reject the null hypothesis that x is so distributed.
More generally, if we are looking for anomalies in the data, say, a rapid real-time stream, then when we see an anomaly, we investigate further. In this case, an anomaly detector is a first cut filter, an alarm, to justify further investigation.
Looking back on what I did, I suspect that more could be done and that some of what I did could be done better.
Of course, my interests now are my startup. Yes, there the crucial core is some applied math I derived.
Maybe I'll use my anomaly detection work for real-time monitoring for zero-day problems in security, performance, failures, etc. in my server farm.
As a very general but crude and blunt approach to show that the hypothesis tests were not trivial, I used the result of S. Ulam that Le Cam called "tightness", as in P. Billingsley, Convergence of Probability Measures. When doing both multi-dimensional and distribution-free, we are nearly way out in the ozone, so we get pushed into some abstract techniques! Meow!
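For readers, the Ulam result in question is, I believe, the one Billingsley states as: every probability measure on a complete separable metric space is tight, i.e.,

    % Ulam's theorem ("tightness"), as in Billingsley:
    % for any Borel probability measure P on a complete separable metric space S,
    \forall \varepsilon > 0 \;\; \exists K \subseteq S \text{ compact such that } P(K) \ge 1 - \varepsilon .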
Okay, I started watching your video lectures, at your 15 City Sense Data, and got through your application of maximum likelihood estimation.
So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" for your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two-step technique -- (i) use heuristics and then (ii) validate with the test data -- "machine learning". Okay.
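To make the two-step procedure concrete, here is roughly what I understand it to be, as a little sketch with made-up count data (the candidate distributions and the method-of-moments step for the negative binomial are my choices, not necessarily what the lecture does):

    # Sketch: heuristic distribution fitting plus a test-data check.
    # Fit Poisson (MLE = sample mean) and negative binomial (method of moments)
    # on training counts, then compare held-out log-likelihoods.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    data = rng.negative_binomial(3, 0.3, size=500)   # made-up counts
    train, test = data[:400], data[400:]

    lam = train.mean()                               # Poisson MLE of the rate
    ll_pois = stats.poisson.logpmf(test, lam).sum()

    m, v = train.mean(), train.var(ddof=1)           # neg. binomial, method of moments
    p = m / v                                        # requires v > m
    r = m * p / (1 - p)
    ll_nb = stats.nbinom.logpmf(test, r, p).sum()

    print(f"held-out log-likelihood: Poisson {ll_pois:.1f}, neg. binomial {ll_nb:.1f}")
    # Pick whichever fits the held-out data better -- the "heuristics, then validate" step.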
But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data.
An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopmans. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can do similar things for, say, arrivals at a Web site because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis was that the arrivals were as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far.
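For the Web site arrivals version of that, a minimal sketch of what I have in mind (all numbers made up): estimate the rate from ordinary history, then test a new 10-minute count against the Poisson null, with no "training" on anomalies at all.

    # Sketch: Poisson-process null for Web site arrivals.
    # Estimate the rate from ordinary history, then test whether a new
    # 10-minute count is surprisingly large under that Poisson null.
    from scipy import stats

    arrivals, minutes = 2400, 400.0     # made-up history
    lam = arrivals / minutes            # MLE of the arrival rate, per minute

    window = 10.0                       # minutes
    observed = 95                       # arrivals in the new window (made up)

    # One-sided p-value: P(N >= observed) when N ~ Poisson(lam * window)
    p_value = stats.poisson.sf(observed - 1, lam * window)
    print(f"rate {lam:.1f}/min, expected {lam * window:.0f} in the window, p = {p_value:.2e}")
    # A tiny p-value flags an anomalous burst we may never have seen before.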
So, generally we have two approaches:
(1) Machine learning where we get test data, use heuristics to build a model, test the model on test data, and declare success when we pass the test.
(2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not).
A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating.
E.g., if we have positive integers n and d, the set of real numbers R, and training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i is in R and x_i is in R^d, and we fit a model that predicts each y_i from each x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i.
So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model at x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check whether x in R^d is distributed like the x_i in R^d.
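In code, that check is just a gate in front of f; e.g., reusing the anomaly_p_value sketch from above (f, alpha, and the refusal behavior are placeholders):

    # Sketch: only trust the fitted model where x looks like the training data.
    def guarded_predict(f, x, X_train, alpha=0.01):
        if anomaly_p_value(x, X_train) < alpha:   # test from the earlier sketch
            return None                           # refuse to extrapolate; flag for review
        return f(x)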
I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d.
I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., we can't apply this approach to the problem of the submarine lifetimes) and (B) we are limited in the values of x that can be used in the model (we can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site).
From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing.