CNNs exploit spatial locality and LSTMs exploit temporal locality. The SOTA models are architected with even stronger assumptions about the nature of the task.
Methods like neural networks, random forests, and SVMs, when used as unconstrained universal function approximators on unstructured data, only learn some non-linear polynomial/exponential/logarithmic combination of the data itself, without much nuance.
It is critical to help a model out by constraining the space of models it searches over to find the right answer.
I think that unless we figure out a way to constrain architectures to exploit specific traits of the task they are trying to solve, (universal-function-approximator-type) ML won't succeed the way it has in vision / language.
As of now, the alternative is to use PGMs, where the model is fully interpretable as a graph-structured combination of explicitly parameterized random variables. PGMs work well with little data and give really good uncertainty estimates with which to evaluate the quality of a model.
PGMs of course suffer from being excruciatingly slow on large datasets, and they require a decent amount of prior knowledge about the problem to explicitly define the graph structure and the types of random variables we are going to use.
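To make the "explicitly parameterized random variables" point concrete, here's a minimal sketch of a two-node PGM with hand-specified conditional probability tables (the Rain/WetGrass variables and all probabilities are hypothetical, purely for illustration); inference is exact enumeration plus Bayes' rule:

```python
# Minimal two-node PGM: Rain -> WetGrass, with explicit CPTs.
# All numbers are made up for illustration.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: 0.9, False: 0.1}  # P(WetGrass=True | Rain)

def posterior_rain_given_wet():
    # Bayes' rule by exact enumeration:
    # P(Rain | Wet) is proportional to P(Wet | Rain) * P(Rain).
    joint = {r: P_rain[r] * P_wet_given_rain[r] for r in (True, False)}
    z = sum(joint.values())  # normalizing constant P(Wet)
    return {r: joint[r] / z for r in joint}

post = posterior_rain_given_wet()
print(round(post[True], 3))  # posterior probability of rain given wet grass
```

The posterior is an exact, interpretable quantity read straight off the graph, which is what gives PGMs their good uncertainty estimates; the flip side is that specifying those tables is exactly the "prior knowledge" burden mentioned above.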
I think ML is most certainly capable of solving this problem, but the community is probably waiting for another breakthrough along the lines of AlexNet/LSTMs before that is the case.
I think that's perfectly said. Humans are prone to the same thing, but we've developed better coping mechanisms.
In ML, techniques to avoid overfitting or reporting spurious relationships are a first-order, 101 topic, especially among the type of ML engineer a hedge fund might hire (they are not hiring data science hacks).
On the flip side, I worked in a quant finance firm before that mostly did factor investing with some twists, and the overall statistical rigor was embarrassing. Even with simple regressions, nobody was asking basic robustness questions, p-hacking was daily life, directly comparing t-stats from different univariate model fits was considered “advanced feature selection.”
If a firm is going to do bad stats, they don't need machine learning for that.
Have you ever known anyone to check whether the central limit theorem applies before taking an average? I mean, we did it once when learning what it was and why you might want to check, but ...
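For what it's worth, the failure mode is easy to demonstrate in a few lines of numpy: sample means of a Cauchy distribution (which has no finite mean, so the CLT doesn't apply) never settle down, while Gaussian means converge exactly as you'd hope. The sample sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# The sample mean of a Cauchy distribution never converges (no finite mean),
# so "just take the average" silently fails; a Gaussian behaves as expected.
cauchy_means = [rng.standard_cauchy(n).mean() for n in (100, 10_000, 1_000_000)]
normal_means = [rng.standard_normal(n).mean() for n in (100, 10_000, 1_000_000)]

print(cauchy_means)  # erratic, no convergence as n grows
print(normal_means)  # shrinking toward 0, as the CLT promises
```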
The problem with statistics is that, in theory, they don't work in the real world. For instance, you measure a statistical variable, great. Now you fix something in the real world and re-measure your variable. BZZT, wrong! You can't measure a variable after you've tried to influence it, because obviously you're no longer measuring the same thing. So there is (potentially) no relationship whatsoever between the measurement after the change and the measurement before. So ... strictly speaking, statistics CANNOT be used to improve things in the real world.
But ... have you ever known anyone to use statistics any other way? Also: we don't actually have anything better.
The thing is ... it mostly works in practice, though you can come up with examples where it doesn't.
And of course you can do things very wrong, as you're just adding, multiplying and so on. That works on any set of numbers.
The thing about machine learning is that a well-designed machine learning algorithm contains far fewer details about the problem than a statistical model does. So people far less versed in the problem being analyzed can improve things more using machine learning than by using statistics. But the maximum improvement you could ever hope to make is higher with statistics. Compare a second-degree regression to an LSTM for a time series. ASSUMING the statistical model applies at all, it'll beat the crap out of the LSTM. But the LSTM will sort-of succeed in nearly all cases. So if the data fits the assumptions you stuck into your statistical model (in this case, that the data is generated by a second-degree process with a not-too-close-to-zero determinant), there's no beating that model.
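To make the statistical side of that comparison concrete, here's a minimal numpy sketch (the coefficients and noise level are made up): when the data really is generated by a second-degree process, a plain least-squares polynomial fit recovers the process almost exactly, with no sequence model needed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data truly generated by a second-degree process plus a little noise.
# True coefficients (arbitrary, for illustration): a=2.0, b=-1.0, c=0.5.
x = np.linspace(-3, 3, 200)
y = 2.0 * x**2 - 1.0 * x + 0.5 + rng.normal(0, 0.1, x.size)

# When the model family matches the data-generating process, ordinary
# least squares recovers it almost exactly -- this is the "no beating
# that model" case.
a, b, c = np.polyfit(x, y, deg=2)
print(round(a, 2), round(b, 2), round(c, 2))
```

The catch, of course, is the ASSUMING: if the true process isn't second-degree, this fit is confidently wrong, whereas a generic learner degrades more gracefully.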
The issue for the future is that all interesting problems are beyond the comprehension of any human, so ... machine learning will win. That also means humans can't build statistical models for them. ML will win not because it is always the best solution, but because for so many problems (you might as well say all problems) we will never find anything remotely optimal, or understand enough to even figure out how to apply statistics.
I agree that there's a set of problems that are both beyond human comprehension and interesting to humans. Specifying them, measuring the results of algorithms to solve them, and paying for the results will probably have to remain within human analytical capability, or you wind up with Skynet (unlikely), or some analog to the 'gray goo' problem, where machines are optimizing with unintended consequences.
Is this partly a problem with interpretation? Let's say I do a binary (supervised) classification with an algorithm that is also capable of assessing probabilities. If I generate a data set consisting of a randomized bag of words, and randomly assign them to 0 and 1 categories, and run it through a supervised ML classifier, then yeah, everything in the test set will get assigned to something.
But if you look at the probability estimates resulting from the ML, you'd almost certainly see something that indicates a high degree of randomness in the assignments (various techniques such as cross validation, or probabilities that indicate a high degree of uncertainty for almost all of the predictions).
I'm not sure this is a problem with the algorithm itself, because the output from many of these algorithms does indicate low predictive value.
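Here's a minimal sketch of that point, using a nearest-centroid classifier as a hypothetical stand-in for "an algorithm that is also capable of assessing probabilities" (all sizes and the split are arbitrary, numpy only): on randomly assigned labels, held-out performance hovers around chance, which is exactly the low-predictive-value signal described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random features with randomly assigned 0/1 labels: nothing to learn.
X = rng.normal(size=(2000, 20))
y = rng.integers(0, 2, size=2000)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

# Nearest-centroid classifier: predict the class whose training-set
# mean is closest in Euclidean distance.
mu0 = X_train[y_train == 0].mean(axis=0)
mu1 = X_train[y_train == 1].mean(axis=0)

def predict(X):
    d0 = ((X - mu0) ** 2).sum(axis=1)
    d1 = ((X - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

train_acc = (predict(X_train) == y_train).mean()
test_acc = (predict(X_test) == y_test).mean()
print(train_acc, test_acc)  # held-out accuracy hovers around chance (0.5)
```

A cross-validated version of the same check (repeating this over several splits) would make the "high degree of randomness in the assignments" even more visible.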
Spoiler: the neural net thinks it's doing a really good job!
The paper certainly does appear to address the question of categorizing completely randomized input:
From the abstract:
"...our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."
My understanding is that cross validation does multiple combinations of splitting the input data into test and training sets... so if cross validation measures the generalization error, wouldn't this catch the low predictive value resulting from randomization of labels or input?
I'm not saying the paper doesn't have value, but I think it's more about the fact that neural nets can obtain a training error of zero on randomized data, not a testing error (or generalization error, which represents the difference between training error and testing error, as far as I can tell).
To be clear, I'm not an expert, and this is just what I gleaned from a first pass over the paper.
(Added as edit) Also keep in mind that datasets themselves often fail to generalize - overfitting to a particular set makes for domain error when moving to slightly different data. Cross validation won't help with that, but more "self-aware" algorithms might.
It's interesting to see that a neural net will reach a training error of zero on randomized data, and it's a worthwhile contribution to the literature to demonstrate this, test it, and measure it... but the outcome here doesn't surprise me. From experience I know that random forests will also show nearly 100% accuracy on a training set but show far lower accuracy for a testing set, so while I think it's great to measure it, the conclusion in this paper is not surprising.
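The random-forest behavior described here can be sketched with an even simpler memorizing model - 1-nearest-neighbour in plain numpy (a hypothetical stand-in, not the paper's setup): on random inputs with random labels, training accuracy is exactly 100% while held-out accuracy stays near chance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random inputs with randomly assigned labels, split into train/test.
X = rng.normal(size=(600, 10))
y = rng.integers(0, 2, size=600)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

def knn1(X_query, X_ref, y_ref):
    # 1-nearest-neighbour: a stand-in for any model with enough capacity
    # to memorize its training set (a forest or deep net behaves alike).
    d = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(-1)
    return y_ref[d.argmin(axis=1)]

# Each training point's nearest neighbour is itself, so train error is 0.
train_acc = (knn1(X_tr, X_tr, y_tr) == y_tr).mean()
test_acc = (knn1(X_te, X_tr, y_tr) == y_te).mean()
print(train_acc, test_acc)  # 1.0 on train, roughly 0.5 on held-out data
```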
In no way is that a knock on the paper, people weren't surprised that Fermat's last theorem turned out to be true, but that doesn't make the proof any less of an accomplishment!
Or, in the case of financial markets, the future might not look like the past. And the present might not look like the past... So you get datasets that are very time-specific, and thus prone to overfitting to local conditions and/or noise.
Secondly, you can absolutely overfit your cross validation set, same as p-hacking. Run experiments until you have a slight positive, statistically significant result, then tell your managers you've got some crazy new sliver of alpha. And then when it hits really new data, it falls to pieces, because repeated experiments on noise will eventually produce a statistically significant result.
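That "run experiments until something is significant" failure mode is easy to simulate (all parameters below are made up): generate hundreds of pure-noise "strategies", pick the best in-sample t-statistic, and watch it look significant - then evaporate on genuinely new data.

```python
import numpy as np

rng = np.random.default_rng(4)

# 500 "strategies" whose daily returns are pure noise, 250 in-sample days.
in_sample = rng.normal(0, 0.01, size=(500, 250))

# t-statistic of the mean daily return for each strategy.
t = in_sample.mean(axis=1) / (in_sample.std(axis=1, ddof=1) / np.sqrt(250))
best = t.argmax()
print(round(t[best], 2))  # best of 500 noise strategies looks "significant"

# ...but on genuinely new data, the selected strategy's edge evaporates.
out_of_sample = rng.normal(0, 0.01, size=250)
t_oos = out_of_sample.mean() / (out_of_sample.std(ddof=1) / np.sqrt(250))
print(round(t_oos, 2))
```

The in-sample winner clears the usual 1.96 threshold purely through selection over repeated experiments, which is exactly the multiple-comparisons trap described above.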
It's like the old saw about freshmen who don't know that they don't know... Our current ML models tend to be like freshmen, or freshmen with bandaids...
As another example, you might train a model to identify positive and negative movie reviews (a pretty common example in intro tutorials). Your original data set might just be reviews by Roger Ebert. Your model, based on a training set, thinks it's at 100% accuracy. Cross validation on the test set reveals it's at 85%. Not bad! Then you apply it to reviews from 100 film critics. It's down to 70%. Then you apply it to random reviews left anonymously on the internet by 1000s of people. It drops all the way down to 55%.
That's a good argument in favor of a robust data set, drawn from multiple independent sources.
However, here's where I'm not convinced. How would a low training error based on a narrow data set be any less misleading than a low cross validation error based on a narrow training and testing set? If you aren't drawing from a robust data source, it seems the problem would be just as bad either way.
As mlthoughts pointed out in a different comment, any kind of regression technique faces issues about goodness of fit. The thing is, there are techniques to show you that the fit isn't very good. A simple linear regression will fit randomized noise, but there are outputs that can show you that the fit isn't good and the regression may not be reliable.
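For example (a numpy-only sketch, with arbitrary sample sizes): regress one noise series on another, and R^2 - one of those goodness-of-fit outputs - immediately shows that the fitted line explains essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(5)

# x and y are independent noise: the regression will still "fit" a line.
x = rng.normal(size=500)
y = rng.normal(size=500)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Goodness of fit: R^2 = 1 - SS_res / SS_tot.
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # close to 0: the diagnostic flags the meaningless fit
```

The fit itself never refuses to run; it's the diagnostic alongside it that tells you not to trust it - which is the analogous check the question below asks about for ML models.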
The question I have here is whether ML techniques are failing in a different way - fitting randomized noise while appearing, by various tests, to be a very strong fit. If they're failing the same way regression would (i.e., someone applies it and fails to run basic goodness-of-fit tests), that's a problem, I suppose, but is it really a unique failing of ML or neural nets? It sounds more like a standard misapplication of predictive modeling...