Myths in Machine Learning Research (crazyoscarchang.github.io)
432 points by crazyoscarchang on Feb 25, 2019 | 53 comments

I appreciate the authors calling out #2. ImageNet and CIFAR are both "solved" benchmarks, to the extent that state of the art algorithms by now are likely overfitting to the specific dataset details. In particular, objects in ImageNet are nearly always in canonical pose, with no occlusion, few confounding objects, illuminated, and in a semantically-obvious configuration. For some applications this is acceptable but as an industry benchmark ImageNet is not informative (user generated photo distributions are never this clean).

Even more damning is the recent BagNet paper (nice summary here: https://blog.evjang.com/2019/02/bagnet.html), which indicates that ImageNet can likely be solved with no global features (i.e. model doesn't have to learn anything truly abstract, just configurations of textures, shapes, colors). I thought the author of that blog post put it nicely:

"As someone who is deeply interested in AGI, I find ImageNet much less interesting now, precisely because it can be solved with models that have little global understanding of images."

This also happened (and is happening?) in stereo imaging research.

For years, Middlebury was what you tested on, and for years that's what got you published. Nowadays Middlebury is viewed as solved by the top algorithms. If you try those algorithms on your own data, good luck getting similar performance; at least I've not seen any kind of advantage in using anything other than SGM (outside of specific research contexts like my PhD).

I'm more concerned that everyone is using KITTI* as an (often the only) benchmark for deep-learning based stereo matchers, since those are all images of roads. At least with classical stereo you have some idea what your cost function is. The other one people are increasingly using is Scene Flow, which is (entirely?) synthetic. Not a great situation.

* KITTI is a widely used dataset of driving imagery

#3 is actually wrong. The results of Recht et al. do not show that people are performing validation on the test set. If this were true, one would expect a poor correlation between accuracy on the original CIFAR-10 test set and the new test set, whereas the authors observe an extremely high correlation. The results actually indicate that attempting to follow the same dataset collection procedures as the creators of CIFAR-10 results in a dataset that is slightly harder than the original dataset (at least for models trained on the original dataset). The follow-up paper (http://people.csail.mit.edu/ludwigs/papers/imagenet.pdf) makes this point explicitly. The fact that the relative ordering of models is preserved on the new dataset suggests that the creators of the models didn't cheat, or at least didn't cheat enough to invalidate CIFAR-10 test set performance as an evaluation metric.

> The results of Recht et al. do not show that people are performing validation on the test set.

As I read #3, the hypothesis doesn't depend on individual researchers acting unethically by validating against the test set. Instead, I read it as an analogy to significance bias in other sciences: machine learning models that don't perform as well on the validation set simply aren't published, so the field as a whole over-fits, as if validation is performed on the test set.

In the paper you link, the authors themselves do explicitly note some test-set shenanigans:

> But this assumption [that the models are independent of the test set] is undermined by the common practice of tuning model hyperparameters directly on the test set, which introduces dependencies between the model $\hat{f}$ and the test set S. In the extreme case, this can be seen as training directly on the test set

The authors made various claims or implications not backed up by their experiment.

However, it absolutely does not answer the question that is asked in the title of the paper, and the process they use is incapable of answering that question.

If you go back and read the original CIFAR10 paper, you'll see that the process they carefully went through meant that they curated the most suitable images for each category. By definition, what's left over (which is what the Recht et al paper chose from) is less good images, which are of course therefore harder to classify.

All the experiment measures is how good they are at matching the distribution of the original dataset. The answer, they discovered, is: not very.

It's not that people are performing validation on the test set; that would be outright cheating.

What #3 has described is a totally normal workflow:

1. Conceptualize a new model idea

2. Implement and train it. This is entirely legitimate and doesn't involve the test set.

3. However, once finished, the model is evaluated on the test set, and its performance there decides whether the idea is worth publishing. If not, go back to the first step and repeat.

Such a loop essentially turns the test set into the actual validation set, if you think of the human as his/her own optimizer who takes a look at the test set periodically and decides whether to pursue the current idea or not. Sounds like early stopping, doesn't it?

Remember that back in 2015 there was a debacle at Baidu, where a researcher had created multiple accounts to run unlimited tests against ImageNet's own reserved test set, which the competition strictly forbade.

If a test set were a 'true' test set, it would work like a test in the real world: kept secret until it is revealed to the public and, once used for evaluation, its problems/examples would never appear in later tests. But such an approach would not be accepted because the cost is simply too high.

The whole thing is very similar to the replication crisis, which is caused by p being a random variable that sometimes falls below 0.05 for no reason. Similarly, any training or test set accuracy is also a random variable. Even if each researcher trains only a single model, the field is very crowded, which leads to many, many models being created. As test set accuracy is a random variable, some of the models will land above the threshold, and those are the models that can be published. Otherwise the researcher has no results to report. The extra step that eliminates negative results is the mechanism that causes the indirect validation on the test set. Yes, the results are correlated, but they are also inflated, which is shown by all of the new test accuracy points lying wildly outside the prediction interval.
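A rough simulation sketch of the selection effect described in this comment. The assumptions here are illustrative and not taken from the thread or from Recht et al.: every model has the same true accuracy (0.90), each one is scored on a 2,000-example test set, errors are independent across models (real models evaluated on a shared test set are correlated), and only the best-scoring model gets "published". All names and numbers are hypothetical.

```python
import random

def measured_accuracy(true_acc, n_test, rng):
    """Accuracy observed on a finite test set: a binomial proportion."""
    correct = sum(rng.random() < true_acc for _ in range(n_test))
    return correct / n_test

def best_of_field(n_models, true_acc, n_test, rng):
    """Score n_models equally good models and keep only the best score."""
    return max(measured_accuracy(true_acc, n_test, rng) for _ in range(n_models))

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_ACC = 0.90          # hypothetical true accuracy shared by every model
N_TEST = 2000            # hypothetical test set size

published = best_of_field(100, TRUE_ACC, N_TEST, rng)
fresh = measured_accuracy(TRUE_ACC, N_TEST, rng)

print(f"true accuracy:              {TRUE_ACC:.4f}")
print(f"published (best of 100):    {published:.4f}")
print(f"same quality, fresh sample: {fresh:.4f}")
```

Under these assumptions the published number sits above the true accuracy even though nobody tuned anything on the test set; the filtering step alone produces the optimism.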

I don't think machine learning suffers from the same kind of p-value-driven replication crisis as other fields. It is true that people don't generally perform proper statistics to compare machine learning models [1], but ML research has two things going for it that other scientific fields do not. First, comparing machine learning models on the same test set corresponds to a within-subjects analysis, generally with tens of thousands of subjects, so the noise level is low. Second, because ML researchers don't generally perform hypothesis tests, they care solely about effect size and not about significance. If my model gets all the same examples right as the previous state-of-the-art, plus 10 more, then my model is statistically significantly better, but on a test set of 10,000 examples this corresponds to a 0.1% accuracy improvement, which is generally not big enough to publish. By not doing hypothesis tests, ML researchers actually tend to be more conservative than their p-value-driven counterparts in other fields.

In the Recht et al. study, the reason the new test accuracy is wildly outside of a binomial confidence interval around the original test set accuracy is that the distribution is different. The CI only applies to data drawn from the same distribution.
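To make the size of such an interval concrete, here is a small sketch using the normal-approximation (Wald) interval for a binomial proportion. The 93% accuracy and the 10,000-example test set are hypothetical numbers chosen for illustration, not figures from the paper.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation (Wald) interval for accuracy measured on n i.i.d. examples."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

# Hypothetical numbers: 93% accuracy on a 10,000-example test set.
low, high = accuracy_interval(0.93, 10_000)
print(f"95% interval: [{low:.4f}, {high:.4f}]")
print(f"half-width:   {(high - low) / 2:.4f}")
```

The half-width comes out at roughly half a percentage point, so this interval only describes the noise from resampling the same distribution. A drop of several points on a re-collected test set falls outside it, which is consistent with the comment's point that the interval does not cover a shifted distribution.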

ML research still suffers from replication issues; such is the nature of the scientific incentive structure. However, these issues generally come in the form of poorly tuned baselines, buggy code, and claims with insufficient experimental/theoretical justification. Outside of some isolated cases, publication bias and cheating at hyperparameter tuning do not seem to be major factors.


[1] Statistically speaking, to compare two models on the same dataset, one does not care about the accuracy numbers but instead about the number of examples model A gets right that model B does not and vice versa; see McNemar's test.
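A minimal sketch of the exact (binomial) form of McNemar's test mentioned in the footnote, using only the standard library. The function name and the example counts are illustrative; the counts mirror the "same examples right, plus 10 more" scenario from the comment above.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: examples model A gets right and model B gets wrong
    c: examples model B gets right and model A gets wrong
    Under the null hypothesis each discordant example is equally
    likely to favor either model, so min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# New model fixes 10 extra examples and breaks none.
p_value = mcnemar_exact(10, 0)
print(f"p = {p_value:.5f}")
```

Only the discordant examples enter the test, which is why 10 extra correct answers out of 10,000 can be statistically significant while amounting to just a 0.1% accuracy gain.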

The problem is that outside of the machine learning community people don't hear "within-subjects analysis"; they hear (and are told) "better than human performance". Within the community I think you are right; people are working on a shared set of assumptions and have the same expectations about performance in the real world (that the results will not transfer without massive negative deltas), but that is definitely not what the tens of thousands of web developers downloading scikit-learn or tensorflow believe.

Thanks for that paper, I missed it when it came out (is it published? Where?).

Also the footnote on the first page cracks me up: "Authors ordered alphabetically. Ben did none of the work."

4-7 of these aren't so much "myths" as "ideas that one or two recent papers have cast some doubt upon." Not to say that these won't turn out to be more thoroughly debunked in the future, but it hasn't happened yet.

Browsing through papers from a few years ago some of these myths in 2011 might have been:

1. You need to pre-train large networks so that they converge

2. You need GPUs to train deep networks efficiently

#1 did indeed turn out to be false; pre-training has pretty convincingly been debunked. But if anything #2 turned out very right; there's no serious training happening today on CPUs.

A bit of conjecture here, but I suspect word embeddings are going to be the next big thing that turns out not to be all that useful.

Has #1 really been proven to be false?

Transfer learning using the first few layers pretrained on ImageNet or a related task has consistently given 1-2% improvements in scores... as recently as mid 2018.

This is especially true for complex tasks like VQA.

Transfer learning (or student-teacher training) isn't really the same thing as pre-training as it was being talked about 10 years ago. And in those days the claim wasn't that it helped a bit, but that it made training many deeper networks possible at all.

To use the test set explicitly for evaluation is a deadly sin. When found out, you'd face serious damage to your reputation (like Baidu did a few years ago). [1] What the decreasing performance on the remade CIFAR-10 test set shows is probably more akin to a subtle form of overfitting (due to these datasets being around for a long time, the good results get published and the bad results discarded, leading to a feedback loop). [2] It is also possible the original test set was closer in distribution to the train set than the remade one. The ranks stay too consistent for test set evaluation cheating.

I also think the "do not trust saliency maps" is too strongly worded. The authors of that paper used adversarial techniques to change the saliency maps. Not just random noise or slight variation, but carefully crafted noise to attack saliency feature importance maps.

> For example, while it would be nice to have a CNN identify a spot on an MRI image as a malignant cancer-causing tumor, these results should not be trusted if they are based on fragile interpretation methods.

Interpretation methods are as fragile as the deep learning model itself, which is susceptible to adversarial images too. If you allow for scenarios with adversarial images, not only should you not trust the interpretation methods, but also the predictions themselves, destroying any pragmatic value left. It is hard to imagine a realistic threat scenario where MRI's are altered by an adversary, _before_ they are fed into a CNN. When such a scenario is realistic, all bets are off. It is much like blaming Google Chrome for exposing passwords during an evil maid attack (when someone has access to your computer, they can do all sorts of nasty stuff, and it is nearly impossible to guard against this). [3]

[1] https://www.technologyreview.com/s/538111/why-and-how-baidu-...

[2] http://hunch.net/?p=22

[3] https://www.theguardian.com/technology/2013/aug/07/google-ch...

EDIT: meta(I liked the article. I do not want to argue it is wrong. It is difficult for me to start a thread without finding the one or two things to nitpick at, or to expand upon a point, but this article was already very resourceful)

As one of the authors of the "Interpretation of Neural Networks is Fragile" paper, I would agree with you.

To a certain extent, saliency maps can be perturbed even with random noise, but the more dramatic attacks (and certainly the targeted attacks, in which we move the saliency map from one region of the image to another specified region) require carefully crafted adversarial perturbations.

> To use the test set explicitly for evaluation is a deadly sin

I’ve seen tons of papers doing that and getting published, especially on CIFAR-10. Not saying it’s a good practice, just that it’s fairly common.

>"It is hard to imagine a realistic threat scenario where MRI's are altered by an adversary, _before_ they are fed into a CNN."

What about when people in the hospital who have a patient they suspect has cancer use the best machine to create that patient's scans, and tend to push patients they think are OK to the older, less good instrument? Or if they choose to reserve time on the best instrument for children?

What about when the MRIs done at night are done by one technician who uses a slightly different process from the technicians who created the MRI data set?

At the very least there is a significant risk of systematic error being introduced by this kind of bias, and as you say, it's really hard to guard against. But if a classifier that I produce is used where this happens and people die... well, whatever the cause, I feel I would be responsible.

Even though I don't know what a Tensor is, I had a suspicion that TensorFlow was really just "MatrixFlow". I felt validated after reading myth 1, but I'm still trying to wrap my head around the difference between Tensors & Matrices. I have a feeling that I'm missing out on something beautiful, like Fourier Transforms, and when I finally get it a deep smile will spread across my face.

The way "tensor" is typically used in machine learning it really is just an n-dimensional generalization of a matrix.

In physics, however, a tensor has a more specific meaning. In this context, certain 2-dimensional tensors can be represented as matrices, but a matrix is a distinct concept. A bit more precisely, in physics a tensor is an object that transforms a particular way during coordinate transformations. Intuitively this means that a tensor must be some physical "thing".

A classical example of a tensor is the moment of inertia tensor. Every 3-d object has a moment of inertia tensor. This tells you how the torque relates to angular acceleration, and it will in general be different across different axes of the object. Now, you can choose any three (linearly independent) directions you want and write down a matrix which represents the tensor in that basis, but this representation is fundamentally coordinate dependent. The moment of inertia tensor, by contrast, is a coordinate-independent entity. Just like a vector, it will have certain values in certain reference frames, but the vector itself transcends any coordinate system. (Though this is a bit of a tautology since a vector is a 1-dimensional tensor.)

> A bit more precisely, in physics a tensor is an object that transforms a particular way during coordinate transformations

No offence, but that's a hideous definition :)

For me a (real) tensor is a function that takes an ordered set of N row vectors and M column vectors as arguments, and spits back a real number as a result. It has to be linear in its arguments. That's all folks!

By this token a matrix A is a tensor: it takes one row vector x, and one column vector y, and returns a real number xAy.

Similarly, a row vector x is a tensor: feed it a column vector y and you get the real number xy.

You can dress all this up in the language of linear functionals or n-forms, but at core that's what's going on.
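A tiny sketch of that definition for the matrix case: the matrix A is treated as a function of a row vector x and a column vector y that returns the number xAy, and linearity in one argument is checked numerically. The function name and the values are illustrative only.

```python
def bilinear(A, x, y):
    """Evaluate x A y: row vector x, matrix A, column vector y -> real number."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))

A = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, -1.0]
x2 = [0.5, 2.0]
y = [2.0, 0.5]

# Linearity in the first argument: f(3x + x2, y) == 3 f(x, y) + f(x2, y)
combined = [3 * a + b for a, b in zip(x, x2)]
lhs = bilinear(A, combined, y)
rhs = 3 * bilinear(A, x, y) + bilinear(A, x2, y)
print(lhs, rhs)
```

The same check works for the second argument, which is what "linear in its arguments" amounts to here.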

Yes, that definition is fine for machine learning, but it's not quite complete for physics. To extend your definition for physics, a tensor is a function that takes an ordered set of N row vectors and M column vectors as arguments and spits back a real, coordinate-invariant number as a result.

I think you get coordinate invariance for free if you think of a vector as an object in its own right, rather than as a tuple in a coordinate system. But then I guess it's more accurate to speak of vectors and covectors than row vectors and column vectors.

How do you represent a coordinate-independent tensor? Don't you still need a basis?

You just call it something like T. If you want any numbers you need a coordinate basis.

For those interested, the first chapter of Kip Thorne's book has a good, though idiosyncratic, explanation of tensors: http://www.pmaweb.caltech.edu/Courses/ph136/yr2012/1201.1.K....

The first lectures here https://theoreticalminimum.com/courses/general-relativity/20... are mostly an introduction to tensor calculus.

I think Tensor in machine learning is more akin to array, just an n-dimensional collection of numbers. It faces the same confusion with a real tensor as "2-d array" does with "matrix". While a matrix can be represented as a 2-d array, and a tensor can be represented as an n-d array, they have different mathematical connotations. A matrix should be viewed as a 2-d array representing a linear map from one vector space (say R^n) to another (say R^m). We can say this because it is a theorem that any linear map from one (finite dimensional) vector space to another can be represented as a matrix and computed via matrix multiplication. Tensors are the same thing, except the input/output is no longer limited to plain vectors (it can be matrices or higher-dimensional arrays).

Suppose you have a vector and a bunch (let's say q) of matrices, and you take the matrix product of the vector with all those matrices. You will get q vectors, which you can stick together to form a matrix. This act of multiplying a vector by a bunch of matrices is clearly linear with respect to the input: if we multiply the input by a scalar X, every vector will be multiplied by X, so the resulting matrix will be multiplied by X. Suppose you do this operation, T, on v to get vectors T1(v), T2(v) ... Tq(v) and on w to get T1(w), T2(w), ... Tq(w). Since all T's are matrix products (linear), if we do the operation on v + w we will get T1(v) + T1(w), T2(v) + T2(w) ... Tq(v) + Tq(w), which is essentially T(v) + T(w). So now we know T is linear with respect to the input. Now, this all took a long time to describe, so let's simplify it: how about instead of a group of matrices, we just call this thing a 3-d tensor? We can let i and j index the regular matrix dimensions and make up a new dimension for the matrix we're on, call it k. Now at any coordinate we get a value, so it's basically a 3-d array, but it represents something much more specific than that. You can guess how this might generalize to mapping matrices x matrices to 3d tensors or 3d matrices x 3d tensors to 4d tensors and so on.
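The construction in this comment can be sketched directly: a stack of q matrices stored as a 3-d nested list T[k][i][j], applied to a vector to give q output vectors, followed by a check that the operation is additive. The names and numbers are hypothetical, and integers are used so the comparison is exact.

```python
def apply_stack(T, v):
    """T[k][i][j] holds q stacked matrices; return the q vectors T_k v as rows."""
    return [[sum(row[j] * v[j] for j in range(len(v))) for row in T_k]
            for T_k in T]

def add_matrices(X, Y):
    """Elementwise sum of two equally shaped nested lists."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# q = 3 matrices, each 2 x 2
T = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[2, 0], [0, 3]],
]
v = [1, 2]
w = [3, -1]

v_plus_w = [a + b for a, b in zip(v, w)]
left = apply_stack(T, v_plus_w)                            # T(v + w)
right = add_matrices(apply_stack(T, v), apply_stack(T, w))  # T(v) + T(w)
print(left)
print(right)
```

The 3-d array only stores the numbers; the linear-map interpretation comes from how `apply_stack` uses them.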

So now the question is, does TensorFlow conflate these? I think it does - somewhat. A convolution can be viewed as a tensor (a single filter maps matrices (images) x 3d-tensor (kernel) to matrices (another image)) so I'd call that a Tensor operation. But consider the input image itself. Is this truly a tensor? If we consider a simple situation, say we have some data vector and we're doing a matrix multiply to get the output of a linear model. Is the input a matrix? I would say no, because we don't think of it as acting on the model, we think of the model as acting on it, even though what we are doing is really equivalent to multiplying two matrices. Similarly, I would not call the input image, or any activation in a neural network, a true tensor, even though it is numerically equivalent. There are true tensors in TensorFlow, but if you're using high level functions (dense, conv2d) they are usually hidden from the user.

In this case I don't think what the article says is exactly right. Naively, tensors are just n-dimensional arrays, which TensorFlow supports. The paper linked in the article appears more to be talking about how derivatives of tensors are represented in TensorFlow. The difference doesn't seem to matter unless you are taking higher-order derivatives. This makes sense, since TensorFlow is focused on first-order derivatives needed for gradient descent, but traditional machine learning algorithms also rely on second-order derivatives to make use of more powerful optimization algorithms based on Newton's method. I'm not sure exactly where the difference arises, but it seems to come from a convenient notation for tensors used in physics, known as Einstein notation (Einstein invented this notation to make his life easier when deriving general relativity). In this notation, tensors are written in terms of their indexed scalar components, with summation over repeated indices implied. For example, matrix multiplication y = A x is expressed as

y_i = A_ij x_j.

If I understand correctly, the paper points out that an algorithm for computing derivatives based on this notation is faster for taking higher-order derivatives compared to using TensorFlow.
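For readers unfamiliar with the index notation used above, here is a plain-Python sketch that spells out the implied summation over the repeated index. It illustrates the notation only and is not the algorithm from the linked paper; the function names are made up for this example.

```python
def matvec(A, x):
    """y_i = A_ij x_j : the repeated index j is summed, the free index i remains."""
    return [sum(A[i][j] * x[j] for j in range(len(x)))
            for i in range(len(A))]

def matmul(A, B):
    """C_ik = A_ij B_jk : again j is summed, while i and k are free."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2],
     [3, 4]]
x = [5, 6]
swap = [[0, 1],
        [1, 0]]

y = matvec(A, x)
C = matmul(A, swap)
print(y)
print(C)
```

NumPy exposes the same convention through `numpy.einsum`, where the first expression would be written with the subscript string `'ij,j->i'`.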

Mathematically, tensors are more complicated objects. Basically, they are what you get when you take higher-order derivatives of a function. In particular, the first-order derivative of a function f: R^n -> R^m at a point x \in R^n is the best linear function A_x \in R^{m X n} that approximates the original function, i.e.,

f(x + dx) ~= f(x) + A_x dx.

A linear function is represented by a matrix, so a first-order derivative is a matrix. If I take the second-order derivative, I get a more complicated object B_x, which represents the quadratic term in the Taylor expansion:

f(x + dx) ~= f(x) + A_x dx + B_x(dx, dx)

where B_x(a, b) is a linear function (or more precisely, a "multilinear" function) of two vectors a, b (which are the same in the above formula). That is, whereas A_x is a (linear) function R^n -> R^m, B_x is a (multilinear) function R^n X R^n -> R^m. This mathematical object B_x is an example of a tensor. In R^n and R^m, tensors are pretty boring, but they become more interesting when dealing with functions on manifolds.
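A worked sketch of the expansion above for one hand-picked function, f(x) = (x0^2 + x1, x0 * x1), chosen here purely for illustration. The first-order term is the Jacobian applied to dx, and B_x is taken to be half the Hessian of each output component, so that it matches the formula as written (with the factor 1/2 absorbed into B_x). Because this f is quadratic, the second-order expansion is exact.

```python
def f(x):
    return [x[0] ** 2 + x[1], x[0] * x[1]]

def A(x, dx):
    """First-order term: the Jacobian of f at x applied to dx."""
    return [2 * x[0] * dx[0] + dx[1],
            x[1] * dx[0] + x[0] * dx[1]]

# Half the Hessian of each output component (constant for this particular f).
HALF_HESSIAN = [
    [[1.0, 0.0], [0.0, 0.0]],   # component 0: x0^2 + x1
    [[0.0, 0.5], [0.5, 0.0]],   # component 1: x0 * x1
]

def B(a, b):
    """Second-order term: bilinear in (a, b), one number per output component."""
    return [sum(H[i][j] * a[i] * b[j] for i in range(2) for j in range(2))
            for H in HALF_HESSIAN]

x = [1.0, 2.0]
dx = [0.1, 0.3]

exact = f([x[0] + dx[0], x[1] + dx[1]])
approx = [fx + a + b for fx, a, b in zip(f(x), A(x, dx), B(dx, dx))]
print(exact)
print(approx)
```

Here B takes two vectors from R^2 and returns a vector in R^2, so its components need three indices, which is the sense in which the second derivative is a higher-order tensor than the Jacobian.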

+1. Looking quickly at the backing paper, it's all about higher-order derivatives. As I see it, the grindy-axe is about where and how one makes the hand-off from algebraic notation to actual computation: keeping the calculation in algebraic form allows efficient algebraic manipulations, which can then be translated into low-level computations.

The question, then, is: a) whether the space of problems where you have good algebraic notation lines up well with the total scope of TF problems, and b) whether the extra complexity of supporting the full computer algebra system is 'worth it.'

For the latter, keep in mind that algebraic derivatives can get cumbersome/expensive when you have an exponentially complex piecewise linear space (eg: https://arxiv.org/pdf/1711.02114.pdf); the linked paper makes no mention of ReLUs... things might be fine with sigmoid activations, but they're the exception, these days...

I have often seen Tensors introduced in the context of Einstein's General Relativity. I read this article on HN: https://news.ycombinator.com/item?id=19055994 a few weeks back and found it really helpful.

Thanks for linking. My tldr from the top-rated answer: the components of a Tensor can be written in "matrix" form (i.e. a 2D array of numbers), but the Tensor is not that matrix. Ultimately, "a Tensor is what transforms like a Tensor".

Tensors are in effect a generalization of matrices to higher dimensions. Going to a tensor of dimension 3 is a step up similar to the step from a linear array to a matrix. They arise in all sorts of places, though fluids is where I met them first. Operations on tensors are also a bit more challenging than operations on matrices.

This is probably wrong, but I always think of tensors as n-dimensional generalizations of matrices with units attached.

Edit: After some wikipediaing, "bases" might be a better word than "units."

Well intuitively an n-dimensional generalization of matrices would just be a big multi-dimensional table. But a tensor is different in that you have some number of dimensions which are covariant and some number which are contravariant.

Additionally, you've sort of got it backwards. A matrix with units (and a set of basis vectors) attached is one representation of a rank (1, 1) tensor. But it's not really a unique representation of the tensor - you could choose a different set of basis vectors and come up with a different matrix representation of the exact same tensor. The tensor is an entity, while the matrix is a representation of an entity within a given coordinate system.
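A small numerical sketch of that point for a rank (1, 1) tensor in two dimensions. The assumptions are the standard transformation rules for a change of basis matrix P whose columns are the new basis vectors: the matrix becomes P^-1 A P, vector components become P^-1 v, and covector components become w P. All values are made up for illustration.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse_2x2(P):
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [[P[1][1] / det, -P[0][1] / det],
            [-P[1][0] / det, P[0][0] / det]]

def evaluate(w, A, v):
    """Feed a covector w (row) and a vector v (column) to the tensor: w A v."""
    return sum(w[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

A = [[2.0, 1.0],
     [0.0, 3.0]]
w = [3.0, 1.0]
v = [1.0, 2.0]

P = [[1.0, 1.0],
     [0.0, 1.0]]          # columns are the new basis vectors
P_inv = inverse_2x2(P)

A_new = matmul(matmul(P_inv, A), P)                                     # P^-1 A P
v_new = [sum(P_inv[i][j] * v[j] for j in range(2)) for i in range(2)]   # P^-1 v
w_new = [sum(w[i] * P[i][j] for i in range(2)) for j in range(2)]       # w P

print(A, "->", A_new)
print(evaluate(w, A, v), evaluate(w_new, A_new, v_new))
```

The two matrices contain different numbers, yet the value produced from the same covector and vector agrees in both bases, which is the "same entity, different representation" idea in the comment.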

It's not even MatrixFlow, but more ArrayFlow.

Myth 8: Machine learning == visual object classification

I was expecting a list of general ML myths, but this list is laser focused on neural networks, and in particular their use in object recognition in images.

Nice refreshing list :)

I didn't think "Attention > Convolution" was a prevalent myth, given how integral convolutions are to SOTA image classifiers and GANs (if anything, I believe attention is under-utilised here and due to grow in usage a lot)

Ugh... This list seems like a mishmash of actual bad practices and active areas of discussion labeled 'myths...' which seems pretty harmful to me.

The author doesn't even try to make a good case that these are "commonly believed to be true." I don't know any serious researcher who would claim any of those, and #3 isn't really a myth; it's just poor practice.

Would have been better titled, "Things to avoid when doing ML research"

> Myth 3: Machine Learning researchers do not use the test set for validation

Damn, this hits the nail on the head. People are essentially using the 'test' set as the validation set, and the validation set as the early-stopping helper set.

Why is using a test set for evaluation a deadly sin?

As I understand it, you fit() with training, then do parameter tuning with validation and the best parameter tuned model is used on test.

Now I'm still a little confused as to why we don't just fit() then do hyperparameter tuning with the test set (best-tuned model wins, no need for test). Why would calling predict() on a model cause it to update its weights and overfit?

It is not the ML model that is updated with information; it is the predictive modeler herself who is updated. She now finds parameters that make the model perform well on that specific test set. This gives you overly optimistic estimates of generalization performance (thus unsound science, and, in business, it is better to report too low performance than too high, because a policy built on a model that is overfit like this can ruin a company or a life). For smarter approaches to this problem, see the research on reusable holdout sets.

I think the idea is that "calling predict and tuning on a model with the test set" is the "overfitting". It's not actual overfitting like we know in ML; it's as if the researcher is performing "descent" to get the best hyper-parameters. Problem is, if we use the test set to find these hyper-parameters, we'll have no idea how well it does in the real-world/in general. We'd need another set to figure that out - and we're back where we started.

It's a deadly sin considering that many (most?) researchers do not treat it as a test set, but as a "validation set #2". Basically, you tune your hyperparameters (up to the random seed!) to fare better on the test set. So, as shown in the cited paper, the results are not generalization results anymore.

You could easily achieve perfect accuracy on the test set by just hardcoding the entire test set into your "model" and the entire model is "1. See image, 2. Look up image in test set, 3. Read off answer".
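The degenerate "model" described in this comment can be sketched in a few lines. The image identifiers, labels, and the fallback answer are hypothetical placeholders; the point is only that perfect test accuracy says nothing about unseen data.

```python
class LookupModel:
    """A 'model' that memorizes (image, label) pairs and learns nothing."""

    def __init__(self, labelled_examples, fallback="cat"):
        self.table = dict(labelled_examples)
        self.fallback = fallback  # arbitrary answer for anything unseen

    def predict(self, image):
        return self.table.get(image, self.fallback)

def accuracy(model, examples):
    return sum(model.predict(x) == y for x, y in examples) / len(examples)

# Hypothetical stand-ins for images: string identifiers.
test_set = [("img_001", "dog"), ("img_002", "ship"), ("img_003", "frog")]
fresh_set = [("img_101", "dog"), ("img_102", "cat"), ("img_103", "truck")]

model = LookupModel(test_set)
print("accuracy on memorized test set:", accuracy(model, test_set))
print("accuracy on fresh examples:    ", accuracy(model, fresh_set))
```

Tuning hyperparameters against the test set is a much milder version of the same leak: information about those specific examples ends up in the reported number.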

It would be interesting if someone would see whether they could sneakily (Sokal-style) publish a paper like the following: "We took (popular model X) and augmented it with an additional lexicon of specific lookup data, and the result blows away all the competition. This is deeply profound and implies that built-in lexicons could be the key to true general intelligence!" (When in fact all they did was hard-code the test set or part of the test set into their model.) Then see how many popular presses churn out sensational articles.

People reading point #3 as saying that researchers are intentionally cheating and using test data for validation need to increase their reading comprehension skills.

What’s up with convolution and attention? Like, in general, what’s their relationship?

Seeing one of those saliency maps, one question arose: why do researchers not rotate and wiggle (crop a tiny amount, thus changing which pixel is at the center of the image) the source images?

> These tricks make lightweight and dynamic convolutions several orders of magnitude more efficient than standard non-separable convolutions.

Is that backed up by any data?

Excellent post! I found #3 pretty informative (and funny, in a sad way) in particular.

Number 6 is not a myth!!!

Can you explain why?
