
Seven Myths in Machine Learning Research - crazyoscarchang
https://crazyoscarchang.github.io/2019/02/16/seven-myths-in-machine-learning-research/
======
dkislyuk
I appreciate the authors calling out #2. ImageNet and CIFAR are both "solved"
benchmarks, to the extent that state of the art algorithms by now are likely
overfitting to the specific dataset details. In particular, objects in
ImageNet are nearly always in canonical pose, with no occlusion, few
confounding objects, good illumination, and a semantically obvious
configuration. For some applications this is acceptable, but as an industry
benchmark ImageNet is not informative (user-generated photo distributions are
never this clean).

Even more damning is the recent BagNet paper (nice summary here:
https://blog.evjang.com/2019/02/bagnet.html),
which indicates that ImageNet can likely be solved with no global features
(i.e. model doesn't have to learn anything truly abstract, just configurations
of textures, shapes, colors). I thought the author of that blog post put it
nicely:

"As someone who is deeply interested in AGI, I find ImageNet much less
interesting now, precisely because it can be solved with models that have
little global understanding of images."

~~~
joshvm
This also happened (and is happening?) in stereo imaging research.

For years, Middlebury was what you tested on, and for years that's what got
you published. Nowadays Middlebury is viewed as solved by the top algorithms.
If you try those algorithms on your own data, good luck getting similar
performance; at least I've not seen any kind of advantage in using anything
other than SGM (outside of specific research contexts like my PhD).
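
For anyone who wants the baseline: SGM-style matching is available off the
shelf in OpenCV. A minimal sketch (parameter values are illustrative, not
tuned, and the file names are placeholders):

    import cv2

    # Rectified stereo pair (file names are placeholders).
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # StereoSGBM is OpenCV's semi-global matching variant.
    sgm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,  # must be a multiple of 16
        blockSize=5,
    )

    # compute() returns fixed-point disparities scaled by 16.
    disparity = sgm.compute(left, right).astype("float32") / 16.0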

I'm more concerned that everyone is using KITTI as a (often the only)
benchmark for deep-learning based stereo matchers, since those are all images
of roads*. At least with classical stereo you have some idea _what_ your cost
function is. The other one people are increasingly using is Scene Flow, which
is (entirely?) synthetic. Not a great situation.

* KITTI is a widely used dataset of driving imagery

------
simonster
#3 is actually wrong. The results of Recht et al. do not show that people are
performing validation on the test set. If this were true, one would expect a
poor correlation between accuracy on the original CIFAR-10 test set and the
new test set, whereas the authors observe an extremely high correlation. The
results actually indicate that attempting to follow the same dataset
collection procedures as the creators of CIFAR-10 results in a dataset that is
slightly harder than the original dataset (at least for models trained on the
original dataset). The follow-up paper
(http://people.csail.mit.edu/ludwigs/papers/imagenet.pdf)
makes this point explicitly. The fact that the relative ordering of models is
preserved on the new dataset suggests that the creators of the models didn't
cheat, or at least didn't cheat enough to invalidate CIFAR-10 test set
performance as an evaluation metric.

~~~
Majromax
> The results of Recht et al. do not show that people are performing
> validation on the test set.

As I read #3, the hypothesis doesn't depend on individual researchers acting
unethically by validating against the test set. Instead, I read it as an
analogy to significance bias in other sciences: machine learning models that
don't perform as well on the test set simply aren't published, so the field
_as a whole_ over-fits, as if validation were performed on the test set.

In the paper you link, the authors themselves do explicitly note some test-set
shenanigans:

> But this assumption [that the models are independent of the test set] is
> undermined by the common practice of tuning model hyperparameters directly
> on the test set, which introduces dependencies between the model $\hat{f}$
> and the test set S. In the extreme case, this can be seen as training
> directly on the test set.
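
To see how field-level selection alone can inflate the best reported number,
here's a tiny simulation I sketched (all numbers are made up): many "models"
with identical true accuracy, where only the best measured test score gets
"published".

    import numpy as np

    rng = np.random.default_rng(0)
    n_test = 10_000   # size of the shared test set
    true_acc = 0.90   # every model has the same true accuracy
    n_models = 500    # models the field evaluates over the years

    # Each measured accuracy is the true accuracy plus sampling noise.
    measured = rng.binomial(n_test, true_acc, size=n_models) / n_test

    print("mean measured accuracy:", measured.mean())    # ~0.90
    print("best 'published' accuracy:", measured.max())  # above 0.90

No individual cheated, yet the state of the art drifts above the true
accuracy.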

~~~
jph00
The authors made various claims or implications that are not backed up by
their experiment. In particular, the paper does not answer the question asked
in its own title, and the process they use is incapable of answering that
question.

If you go back and read the original CIFAR-10 paper, you'll see that the
process they carefully went through meant that they curated the most suitable
images for each category. By definition, what was left over (which is what the
Recht et al. paper drew from) consists of lower-quality images, which are
therefore harder to classify.

All the experiment really measures is how well they matched the distribution
of the original dataset. The answer, they discovered, is: not very.

------
gok
4-7 of these aren't so much "myths" as "ideas that one or two recent papers
have cast some doubt upon." Not to say that these won't turn out to be more
thoroughly debunked in the future, but it hasn't happened yet.

Browsing through papers from a few years ago, some of these myths circa 2011
might have been:

1. You need to pre-train large networks so that they converge

2. You need GPUs to train deep networks efficiently

#1 did indeed turn out to be false; pre-training has pretty convincingly been
debunked. But if anything #2 turned out very right; there's no serious
training happening today on CPUs.

A bit of conjecture here, but I suspect word embeddings are going to be the
next big thing that turns out not to be all that useful.

~~~
screye
Has #1 really been proven to be false?

Transfer learning using the first few layers pretrained on ImageNet or a
related task has consistently given 1-2% improvements in scores... as recently
as mid-2018.

This is especially true for complex tasks like VQA.
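
For concreteness, the usual recipe looks something like this (a minimal
PyTorch/torchvision sketch; num_classes is a placeholder for the downstream
task):

    import torch
    import torchvision

    num_classes = 10  # placeholder for the downstream task

    # Start from ImageNet-pretrained weights and freeze them.
    model = torchvision.models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier head; only it will be trained.
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)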

~~~
gok
Transfer learning (or student-teacher training) isn't really the same thing as
pre-training as it was being talked about 10 years ago. And in those days the
claim wasn't that it helped a bit, but that it made training deeper networks
possible at all.

------
princeofwands
To use the test set explicitly for evaluation is a deadly sin. When found out,
you'd face serious damage to your reputation (like Baidu did a few years ago)
[1]. What the decreased performance on the remade CIFAR-10 test set shows is
probably more akin to a subtle form of overfitting (because these datasets
have been around for a long time, the good results get published and the bad
results get discarded, leading to a feedback loop) [2]. It is also possible
the original test set was closer in distribution to the train set than the
remade one. The rankings stay too consistent for test-set evaluation cheating.

I also think the "do not trust saliency maps" conclusion is too strongly
worded. The authors of that paper used adversarial techniques to change the
saliency maps: not just random noise or slight variation, but carefully
crafted noise designed to attack saliency (feature importance) maps.
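
"Carefully crafted" here means gradient-based. The paper's attacks target the
saliency map itself while keeping the prediction fixed, but to show the
general flavor, here is a minimal FGSM-style sketch (model, input batch x, and
label tensor are assumed given):

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, label, eps=0.01):
        """One-step gradient-sign perturbation. x: (1, C, H, W) batch."""
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        loss.backward()
        # Noise aligned with the loss gradient: tiny, but targeted.
        return (x + eps * x.grad.sign()).detach()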

> For example, while it would be nice to have a CNN identify a spot on an MRI
> image as a malignant cancer-causing tumor, these results should not be
> trusted if they are based on fragile interpretation methods.

Interpretation methods are as fragile as the deep learning model itself, which
is susceptible to adversarial images too. If you allow for scenarios with
adversarial images, then not only should you not trust the interpretation
methods, but also the predictions themselves, destroying any pragmatic value
left. It is hard to imagine a realistic threat scenario where MRIs are altered
by an adversary _before_ they are fed into a CNN. When such a scenario is
realistic, all bets are off. It is much like blaming Google Chrome for
exposing passwords during an evil maid attack (when someone has access to your
computer, they can do all sorts of nasty stuff; it is nearly impossible to
guard against this). [3]

[1] https://www.technologyreview.com/s/538111/why-and-how-baidu-cheated-an-artificial-intelligence-test/

[2] http://hunch.net/?p=22

[3] https://www.theguardian.com/technology/2013/aug/07/google-chrome-password-security-flaw

EDIT (meta): I liked the article and am not arguing that it is wrong. It is
difficult for me to start a thread without finding one or two things to
nitpick or a point to expand upon, but this article was already a great
resource.

~~~
abidlabs
As one of the authors of the "Interpretation of Neural Networks is Fragile"
paper, I would agree with you.

To a certain extent, saliency maps can be perturbed even with random noise,
but the more dramatic attacks (and certainly the targeted attacks, in which we
move the saliency map from one region of the image to another specified
region) require carefully-crafted adversarial perturbations.

------
max_likelihood
Even though I don't know what a tensor is, I had a suspicion that TensorFlow
was really just "MatrixFlow". I felt validated after reading myth 1, but I'm
still trying to wrap my head around the difference between tensors and
matrices. I have a feeling that I'm missing out on something beautiful, like
Fourier transforms, and that when I finally get it a deep smile will spread
across my face.

~~~
antognini
The way "tensor" is typically used in machine learning it really is just an
n-dimensional generalization of a matrix.

In physics, however, a tensor has a more specific meaning. In this context,
certain 2-dimensional tensors can be represented as matrices, but a matrix is
a distinct concept. A bit more precisely, in physics a tensor is an object
that transforms in a particular way under coordinate transformations.
Intuitively this means that a tensor must be some physical "thing".

A classical example of a tensor is the moment of inertia tensor. Every 3-d
object has one. It tells you how torque relates to angular acceleration, and
it will in general be different across different axes of the object. Now, you
can choose any three (linearly independent) directions you want and write down
a matrix which represents the tensor in that basis, but this representation is
fundamentally coordinate-dependent. The moment of inertia tensor, by contrast,
is a _coordinate-independent_ entity. Just like a vector, it will have certain
values in certain reference frames, but the tensor itself transcends any
coordinate system. (Though this is a bit of a tautology, since a vector is a
1-dimensional tensor.)
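
To make "transforms in a particular way" concrete: under a rotation $R$ of the
coordinate axes, the components of a rank-2 tensor like the moment of inertia
transform as

    $I'_{ij} = R_{ik} R_{jl} I_{kl}$, i.e. $I' = R I R^T$

so the matrix of components changes from basis to basis, while invariants such
as the eigenvalues (the principal moments of inertia) do not.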

~~~
soVeryTired
> A bit more precisely, in physics a tensor is an object that transforms in a
> particular way under coordinate transformations

No offence, but that's a hideous definition :)

For me a (real) tensor is a function that takes an ordered set of N row
vectors and M column vectors as arguments, and spits back a real number as a
result. It has to be linear in each of its arguments. That's all, folks!

By this token a matrix A _is_ a tensor: it takes one row vector x, and one
column vector y, and returns a real number xAy.

Similarly, a row vector x is a tensor: feed it a column vector y and you get
the real number xy.

You can dress all this up in the language of linear functionals or n-forms,
but at core that's what's going on.
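
A tiny numpy sketch of this view, with made-up numbers, just to make the types
concrete:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])  # one row-vector slot, one column-vector slot
    x = np.array([1.0, -1.0])   # row vector (covector)
    y = np.array([2.0, 5.0])    # column vector

    print(x @ A @ y)  # xAy: a single real number
    print(x @ y)      # row vector applied to a column vector: also a number

    # Linearity in each argument, e.g. in the first slot:
    x2 = np.array([0.5, 3.0])
    assert np.isclose((x + 2 * x2) @ A @ y, x @ A @ y + 2 * (x2 @ A @ y))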

~~~
antognini
Yes, that definition is fine for machine learning, but it's not quite complete
for physics. To extend your definition for physics, a tensor is a function
that takes an ordered set of N row vectors and M column vectors as arguments
and spits back a real, _coordinate-invariant_ number as a result.

~~~
soVeryTired
I think you get coordinate invariance for free if you think of a vector as an
object in its own right, rather than as a tuple in a coordinate system. But
then I guess it's more accurate to speak of vectors and covectors than row
vectors and column vectors.

------
howlin
Myth 8: Machine learning == visual object classification

~~~
khaledh
I was expecting a list of general ML myths, but this list is laser focused on
neural networks, and in particular their use in object recognition in images.

------
edmack
Nice refreshing list :)

I didn't think "Attention > Convolution" was a prevalent myth, given how
integral convolutions are to SOTA image classifiers and GANs (if anything, I
believe attention is under-utilised here and due to grow in usage a lot).

~~~
sdenton4
Ugh... This list seems like a mishmash of actual bad practices and active
areas of discussion, all labeled "myths", which seems pretty harmful to me.

------
tanilama
> Myth 3: Machine Learning researchers do not use the test set for validation

Damn, this hits the nail on the head. People are essentially using the test
set as the validation set, and the validation set as an early-stopping helper
set.

------
andreilys
Why is using a test set for evaluation a deadly sin?

As I understand it, you fit() on the training set, then do hyperparameter
tuning with the validation set, and the best-tuned model is evaluated on the
test set.

Now I'm still a little confused as to why we don't just fit() and then do
hyperparameter tuning with the test set (best-tuned model wins, no need for a
separate test set). Why would calling predict() on a model cause it to update
its weights and overfit?

~~~
princeofwands
It is not the ML model that is updated with information, but the predictive
modeler herself is updated. She now finds parameters that make the model
perform well on that specific test set. This gives you overly optimistic
estimates of generalization performance (thus unsound science, and, in
business, it is better to report too low performance, than too high, because a
policy build on a model that is overfit like this can ruin a company or a
life). For smarter approaches to this problem, see the research on reusable
holdout sets.
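
A minimal sketch of the clean protocol (sklearn, with made-up hyperparameter
candidates): tune on the validation split, then touch the test split exactly
once.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_trval, X_test, y_trval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trval, y_trval, test_size=0.25, random_state=0)

    best_score, best_C = -np.inf, None
    for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter candidates
        model = LogisticRegression(C=C, max_iter=2000).fit(X_train, y_train)
        score = model.score(X_val, y_val)  # selection on validation only
        if score > best_score:
            best_score, best_C = score, C

    final = LogisticRegression(C=best_C, max_iter=2000).fit(X_trval, y_trval)
    print("test accuracy (reported once):", final.score(X_test, y_test))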

------
AndrewKemendo
The author doesn't even try to make a good case that these are "commonly
believed to be true." I don't know any serious researcher who would claim any
of them, and #3 isn't really a myth, it's just poor practice.

It would have been better titled "Things to avoid when doing ML research".

------
banachtarski
People reading point #3 as saying that researchers are intentionally cheating
and using test data for validation need to increase their reading
comprehension skills.

------
adamnemecek
What’s up with convolution and attention? Like, in general, what’s their
relationship?

------
dcbadacd
Seeing one of those saliency maps, one question arose: why do researchers not
rotate and wiggle the source images (crop a tiny amount, thus changing which
pixel sits at the center of the image)?
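
Some do, in effect: averaging saliency over small random perturbations is the
idea behind SmoothGrad-style methods. A minimal sketch with geometric jitter
(the function name is mine; model and a single (C, H, W) image tensor are
assumed given, as is a torchvision version whose functional rotate accepts
tensors):

    import torch
    import torchvision.transforms.functional as TF

    def jittered_saliency(model, image, label, n=16):
        """Average input-gradient saliency over small random rotations."""
        total = torch.zeros_like(image)
        for _ in range(n):
            angle = float(torch.empty(1).uniform_(-5, 5))
            x = TF.rotate(image, angle).unsqueeze(0).requires_grad_(True)
            model(x)[0, label].backward()
            # Rotate the gradient back so the maps (approximately) align.
            total += TF.rotate(x.grad[0], -angle)
        return total / n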

------
singularity2001
> These tricks make lightweight and dynamic convolutions several orders of
> magnitude more efficient than standard non-separable convolutions.

Is that backed up by any data?

------
kevintb
Excellent post! I found #3 pretty informative (and funny, in a sad way) in
particular.

------
mcilai
Number 6 is not a myth!!!

~~~
acdha
Can you explain why?

