
The Mixed Track Record of Machine Learning Algorithms - briatx
https://www.bloomberg.com/news/articles/2018-10-09/the-big-problem-with-machine-learning-algorithms
======
screye
A good ML system has to be architected to exploit the known structure of the
task it is attempting.

CNNs exploit spatial locality and LSTMs exploit temporal locality. The SOTA
models are architected with even stronger assumptions about the nature of the
task. Methods like neural networks, random forests and SVMs, when used as
unconstrained universal function approximators for unstructured data, only
learn some non-linear polynomial/exponential/logarithmic combination of the
data itself, without much nuance.

It is critical to help a model out by constraining the space of models it
searches over to find the right answer. I think that unless we figure out a
way to constrain architectures to exploit specific traits of the task they are
trying to solve, (universal function approximator type) ML won't succeed in
the same way that it has in vision / language.
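
As a rough illustration of what that constraint buys (a sketch of my own,
assuming PyTorch; the layer sizes are purely illustrative): a convolutional
layer bakes locality and weight sharing into the architecture, so it searches
a far smaller parameter space than a fully connected layer producing the same
output from the same input.

```python
# Sketch: an architectural constraint (local, shared 3x3 filters) shrinks the
# hypothesis space compared with an unconstrained fully connected layer that
# maps the same 28x28 input to the same 8x26x26 output volume.
import torch.nn as nn

def n_params(module):
    """Count trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Convolution: 8 filters of size 3x3, shared across all spatial positions.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)

# Fully connected layer producing the same output volume from the flattened
# 28x28 input, with no locality or weight sharing built in.
dense = nn.Linear(28 * 28, 8 * 26 * 26)

print(n_params(conv))   # 80 parameters
print(n_params(dense))  # 4,245,280 parameters
```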

As of now, the alternative is to use PGMs, where the model is fully
interpretable as a graph-structured combination of explicitly parameterized
random variables. PGMs work well with little data and give really good
uncertainty estimates for evaluating the quality of a model. PGMs of course
suffer from the problem that they are excruciatingly slow for large datasets
and require a decent amount of prior knowledge about the problem to explicitly
define the type of graph structure / random variables we are going to be
using.
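
To make "explicitly parameterized random variables on a graph" concrete, here
is a toy sketch of my own (the classic rain/sprinkler/wet-grass network with
made-up probability tables), not an example from the article:

```python
# Toy probabilistic graphical model: every random variable and edge is
# explicit, so the model is interpretable and a query can be answered by
# brute-force enumeration of the joint distribution.
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.4, False: 0.6}}
P_wet_given = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) factored along the graph structure."""
    p_wet = P_wet_given[(rain, sprinkler)]
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * (p_wet if wet else 1.0 - p_wet))

# Query: P(rain | grass is wet), summing out the sprinkler variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(num / den)  # ~0.41, and the whole calculation is readable off the graph
```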

I think ML is most certainly capable of solving this problem, but the
community is probably waiting for another breakthrough along the lines of
AlexNet/LSTMs before that is the case.

------
pjmorris
> “Machine learning algorithms will always identify a pattern, even if there
> is none,” he says.

I think that's perfectly said. Humans are prone to the same thing, but we've
developed better coping mechanisms.

~~~
mlthoughts2018
It’s a bit foolish though, because “regression” will always identify a pattern
too, as will many other simplistic models.

In ML, techniques to avoid overfitting or reporting spurious relationships are
a first-order, 101 topic, especially among the type of ML engineer a hedge
fund might hire (they are not hiring data science hacks).

On the flip side, I previously worked at a quant finance firm that mostly did
factor investing with some twists, and the overall statistical rigor was
embarrassing. Even with simple regressions, nobody was asking basic robustness
questions, p-hacking was daily life, and directly comparing t-stats from
different univariate model fits was considered “advanced feature selection.”
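
To make that failure mode concrete, here is a quick sketch of my own (assuming
NumPy and SciPy): fit enough univariate regressions to pure noise and some of
them will always look "significant."

```python
# Sketch of the p-hacking failure mode: regress a random target on 100
# random, unrelated features one at a time and keep whatever clears the
# usual significance bar. Roughly 5 will pass p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(size=250)              # the target is pure noise
false_hits = 0
for _ in range(100):
    x = rng.normal(size=250)          # each feature is unrelated noise
    if stats.linregress(x, y).pvalue < 0.05:
        false_hits += 1
print(false_hits)  # expect ~5 "discoveries" with nothing actually there
```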

If a firm is going to do bad stats, they don’t need machine learning for that.

~~~
pjmorris
I guess what I'm trying to say is that algorithms can't tell whether they're
fooling themselves. Someone has to apply the 101 techniques for testing fit,
etc. Humans at least have the opportunity, though, as you point out, they
don't always take it.

~~~
candiodari
Maybe it's just me, but it seems like 99%+ of humans don't check whether
they're fooling themselves, even when using statistics.

Have you ever known anyone to check whether the central limit theorem applies
before taking an average? I mean, we did it once when learning what it was and
why you might want to check, but ...

The problem with statistics is that, _in theory_, they don't work in the real
world. For instance, you measure a statistical variable: great. Now you fix
something in the real world and measure your variable again. _BZZT_, wrong!
You can't measure a variable after you've tried to influence it, because
obviously you're not measuring the same thing anymore. So there is
(potentially) no relationship whatsoever between the measurement after the
change and the measurement before. So ... statistics CANNOT correctly be used
to improve things in the real world.

But ... have you ever known anyone to use statistics any other way? Also: we
don't actually have anything better.

The thing is ... it mostly works in practice, though you can come up with
examples where it doesn't.

And of course you can do things _very_ wrong, as you're just adding,
multiplying and so on. That works on any set of numbers.

The thing about machine learning is that a well designed machine learning
algorithm contains far fewer details about the problem than a statistical
model does. So people far less versed in the problem being analyzed can
improve things more using machine learning than by using statistics. But the
maximum improvement you could ever hope to make with statistics is going to be
higher. Compare a second-degree regression to an LSTM for a time series.
ASSUMING the statistical model applies at all, it'll beat the crap out of the
LSTM. But the LSTM will sort-of succeed in nearly all cases. So if the
variable fits the information you built into your statistical model, there's
no beating that model (in this case, that the data is generated by a
second-degree process with a not-too-close-to-zero determinant).
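
As a rough sketch of the statistical side of that comparison (mine, assuming
NumPy; the LSTM half is left out), when the series really is generated by a
second-degree process, a plain three-parameter quadratic fit recovers it
almost exactly:

```python
# Toy time series generated by a second-degree process plus noise. When the
# modelling assumption is right, a 3-parameter quadratic regression roughly
# recovers the generating coefficients; a generic sequence model would have
# to learn the same structure from far more data and parameters.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200, dtype=float)
y = 0.03 * t**2 - 1.5 * t + 4.0 + rng.normal(scale=5.0, size=t.size)

coeffs = np.polyfit(t, y, deg=2)   # least-squares degree-2 polynomial fit
print(coeffs)                      # roughly [0.03, -1.5, 4.0]; intercept is noisiest
```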

Issue for the future is that all interesting problems are beyond the
comprehension of any human, so ... machine learning will win. This means
humans can't build statistical models for them either. It'll win not because
it is always the best solution, but because for so many problems (you might as
well say for all problems) we will never find anything remotely optimal or
understand enough to even figure out how to apply statistics to them.

~~~
pjmorris
> Issue for the future is that all interesting problems are beyond the
> comprehension of any human, so ... machine learning will win.

I agree that there's a set of problems that are both beyond human
comprehension and interesting to humans. Specifying them, measuring the
results of algorithms to solve them, and paying for the results will probably
have to remain within human analytical capability, or you wind up with Skynet
(unlikely), or some analog to the 'gray goo' problem, where machines are
optimizing with unintended consequences.

------
geebee
“Machine learning algorithms will always identify a pattern, even if there is
none”

Is this partly a problem with interpretation? Let's say I do a binary
(supervised) classification with an algorithm that is also capable of
estimating probabilities. If I generate a data set consisting of a randomized
bag of words, randomly assign the examples to the 0 and 1 categories, and run
it through a supervised ML classifier, then yeah, everything in the test set
will get assigned to _something_.

But if you look at the probability estimates resulting from the ML, you'd
almost certainly see something that indicates a high degree of randomness in
the assignments (through techniques such as cross validation, or through
probabilities that indicate a high degree of uncertainty for almost all of the
predictions).

I'm not sure this is a problem with the algorithm itself, because the output
from many of these algorithms does indicate low predictive value.
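
That check is easy to sketch (my own toy, assuming scikit-learn and NumPy):
with random features and random labels, cross-validated accuracy sits at
chance and the predicted probabilities cluster around 0.5 rather than making
confident calls.

```python
# Random "bag of words" counts with randomly assigned 0/1 labels. The
# cross-validated accuracy hovers around chance, and the held-out predicted
# probabilities stay centered near 0.5, which is the signal that there is no
# real pattern to find.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(1000, 200)).astype(float)   # fake word counts
y = rng.integers(0, 2, size=1000)                         # random labels

clf = LogisticRegression(max_iter=2000)
print(cross_val_score(clf, X, y, cv=5).mean())   # ~0.5, i.e. chance level

clf.fit(X[:800], y[:800])
proba = clf.predict_proba(X[800:])[:, 1]
print(np.mean(np.abs(proba - 0.5)))   # small: no confident predictions
```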

~~~
sdenton4
Check out this classic paper on deep learning with randomized labels:
[https://arxiv.org/abs/1611.03530](https://arxiv.org/abs/1611.03530)

Spoiler: the neural net thinks it's doing a really good job!

~~~
geebee
Thank you for the link. I'll read this paper. I'm hoping to reply but the
thread may be stale by the time I do. Right now, my thoughts are: if it is
easily fitting, what are the assignment probabilities? Are we getting 90%+, or
is it fitting easily, but to much lower probabilities. Also, is there a big
difference between neural nets and other algorithms like RF?

The paper certainly does appear to address the question of categorizing
completely randomized input:

From the abstract

"...our experiments establish that state-of-the-art convolutional networks for
image classification trained with stochastic gradient methods easily fit a
ran- dom labeling of the training data. This phenomenon is qualitatively
unaffected by explicit regularization, and occurs even if we replace the true
images by com- pletely unstructured random noise.

~~~
geebee
I'm going off a first pass through the paper, but it appears that what this
paper shows is that the training error can be zero on an entirely randomized
data set, while the generalization error (the difference between the error on
the _test_ set and the _training_ set) increases dramatically as label
corruption increases.

My understanding is that cross validation repeatedly splits the input data
into test and training sets... so if cross validation measures the
generalization error, wouldn't it catch the low predictive value resulting
from randomization of labels or input?

I'm not saying the paper doesn't have value, but I think it's more about the
fact that neural nets can obtain a _training_ error of zero on randomized
data, not a _testing_ error of zero (or a low generalization error, which
represents the difference between training error and testing error, as far as
I can tell).

To be clear, I'm not an expert, and this is just what I gleaned from a first
pass over the paper.

~~~
sdenton4
All true. The interesting thing here is that the neural network has /no idea/
that it sucks at generalization, though. Yes, we can do extra work to
calibrate outputs, but it would be much better to have some idea of
uncertainty from the network itself.

(Added as an edit) Also keep in mind that datasets themselves often fail to
generalize: overfitting to a particular set makes for domain error when moving
to slightly different data. Cross validation won't help with that, but more
"self aware" algorithms might.

~~~
geebee
But... isn't that the entire point of splitting your initial training data
into a training set and a separate testing set? Why is it better to have an
idea of uncertainty from the model itself when you can get the generalization
error through cross validation, or by setting aside a testing set?

It's interesting to see that a neural net will reach a training error of zero
on randomized data, and it's a worthwhile contribution to the literature to
demonstrate this, test it, and measure it... but the outcome here doesn't
surprise me. From experience I know that random forests will also show nearly
100% accuracy on a training set but show far lower accuracy for a testing set,
so while I think it's great to measure it, the conclusion in this paper is not
surprising.
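
That gap is easy to reproduce (my own sketch, assuming scikit-learn): a random
forest memorizes random labels on its training split while the held-out split
stays at chance.

```python
# Random forest on random labels: near-perfect training accuracy,
# chance-level test accuracy. The large train/test gap is the only evidence
# the model gives that it has learned nothing real.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)          # labels carry no signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_tr, y_tr))   # ~1.0: the training set is memorized
print(rf.score(X_te, y_te))   # ~0.5: no better than a coin flip
```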

In no way is that a knock on the paper; people weren't surprised that Fermat's
last theorem turned out to be true, but that doesn't make the proof any less
of an accomplishment!

~~~
sdenton4
First, because the dataset itself might be biased in subtle ways, in which
case your cross validation won't help. This happens All. The. Time. For
example, your training set for speech recognition might use a nice microphone
uniformly, and everything goes to hell once you deploy to cell phones, because
the microphones have different characteristics.

Or, in the case of financial markets, the future might not look like the past.
And the present might not look like the past... So you get datasets that are
very time-specific, and thus prone to overfitting to local conditions and/or
noise.

Secondly, you can absolutely overfit your cross validation set, same as
p-hacking. Run experiments until you have a slight positive, statistically
significant result, then tell your managers you've got some crazy new sliver
of alpha. And then when it hits really new data, it falls to pieces, because
repeated experiments on noise will eventually produce a statistically
significant result.
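
A toy version of that selection effect (mine, assuming NumPy): generate many
random "strategies", pick the one that looks best on your validation window,
and watch it revert to nothing on genuinely new data.

```python
# Overfitting the validation set by selection: among many random strategies
# with zero true edge, the validation winner looks like real alpha, then
# does nothing on fresh data.
import numpy as np

rng = np.random.default_rng(3)
n_strategies, n_days = 200, 250

# Daily returns of strategies that are pure noise (no real alpha).
val_returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
new_returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

best = np.argmax(val_returns.mean(axis=1))   # pick the validation winner
print(val_returns[best].mean() * 252)        # looks like strong annualized "alpha"
print(new_returns[best].mean() * 252)        # ~0 on genuinely new data
```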

It's like the old saw about freshmen who don't know that they don't know...
Our current ML models tend to be like freshmen, or freshmen with bandaids...

~~~
geebee
I'm certainly convinced that sampling from a narrow dataset can lead to
problems down the line. You don't remove possible bias in your data by
randomly setting aside a testing set and using it for cross validation. It's
an important practice, but yes, I agree that you can end up overtraining on
structures that exist in both the testing and training sets, since they're
drawn from the same, narrow source.

As another example, you might train a model to identify positive and negative
movie reviews (a pretty common example in intro tutorials). Your original data
set might just be reviews by Roger Ebert. Your model, based on a training set,
thinks it's at 100% accuracy. Cross validation on the test set reveals it's at
85%. Not bad! Then you apply it to reviews from 100 film critics. It's down to
70%. Then you apply it to random reviews left anonymously on the internet by
1000s of people. It drops all the way down to 55%.

That's a good argument in favor of a robust data set, drawn from multiple
independent sources.

However, here's where I'm not convinced. How would a low training error based
on a narrow data set be any less misleading than a low cross validation error
based on a narrow training and testing set? If you aren't drawing from a
robust data source, it seems the problem would be just as bad either way.

~~~
sdenton4
Very true! But part of knowing that you don't know is recognizing when you're
in unfamiliar territory. Being able to say "this example doesn't look like
what I've seen before, and therefore I'm not as sure of my response."
Algorithms which can express uncertainty /should be/ more robust to domain
errors.
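
One way to sketch "recognizing unfamiliar territory" (my own toy, assuming
scikit-learn; certainly not the only approach) is to fit a simple density
model on the training features and flag low-likelihood inputs before trusting
the classifier's answer.

```python
# Crude "do I recognize this input?" check: model the density of the
# training features and treat low-likelihood points as out-of-domain, where
# the classifier's predictions should not be trusted.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X_train = rng.normal(loc=0.0, scale=1.0, size=(2000, 5))   # familiar domain

density = GaussianMixture(n_components=3, random_state=0).fit(X_train)
threshold = np.quantile(density.score_samples(X_train), 0.01)

x_in = rng.normal(0.0, 1.0, size=(1, 5))    # looks like the training data
x_out = rng.normal(8.0, 1.0, size=(1, 5))   # shifted domain

print(density.score_samples(x_in)[0] > threshold)    # typically True: familiar
print(density.score_samples(x_out)[0] > threshold)   # False: flag as unfamiliar
```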

------
bitL
Most of those replies sound like "we are experts, we know better," stating
badly outdated facts that probably passed through the marketing department on
the way to those experts. I hope no progressive person wants to work for them,
but rather to compete with them and drive them out of business, a typical
pattern that repeats whenever somebody gets too cocky about their abilities.

------
pighive
Why are we seeing so many stories from bloomberg.com? What tier is bloomberg’s
credibility and their sources’? What part of technology, or business is their
journalism known for? I am new to US, but I think my questions are not
senseless. Thanks.

------
fathead_glacier
I find this article moot. They are basically saying that applying a blank ML
stamp can yield unsatisfying results on your data. That ought to be obvious
and seems unnecessarily repetitive.

