
Autopsy of a deep learning paper - ognyankulev
https://blog.piekniewski.info/2018/07/14/autopsy-dl-paper/
======
chillee
Cmon, wtf? Some of the criticisms here just aren't even close to valid. He
spends half of the blog post criticizing them for using 100 GPUs on the
ImageNet classification experiment,

> So they trained it using 100 GPUs (100 GPUs dear lord!), and got no
> difference until fourth decimal digit! 100 GPU's to get a difference on
> fourth decimal digit! I think somebody at Google of Facebook should
> reproduce this result using 10000 GPU's, perhaps they will get a difference
> at a third decimal digit. Or maybe not, but whatever, those GPU's need to do
> something right?

Wow. This is just a blatant mischaracterization of what's going on. First of
all, this result is in the appendix. It's not meant to be an important result
of the paper. In the appendix, they explicitly write:

>Of all vision tasks, we might expect image classification to show the least
performance change when using CoordConv instead of convolution, as
classification is more about what is in the image than where it is. This tiny
amount of improvement validates that.

In contrast, they compare against object detection (in which the spatial
location matters), and get substantially better results.

This is just a standard "negative" result, included to empirically validate
that what they think is happening is actually happening.

The fact that this blog post mocks them for that, and much of HN is laughing
along with the blog is seriously disappointing.

~~~
317070
The first commit of Lasagne in 2014 already had the 'untie bias' option [0],
which achieves the same effect as the paper's trick, but in a different way
(and is, in my opinion, more elegant). And while I cannot find a paper, I
think it is one of those tricks that have been around since the Schmidhuber
days. Moreover, it is one of the tricks that has been actively used for as
long as I have been involved with convolutional neural networks (since 2010).
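
For illustration, here is a rough sketch of the untied-bias idea (my own
PyTorch approximation, not the Lasagne implementation): each output location
gets its own learned bias, so the layer can encode absolute position directly.

```python
import torch
import torch.nn as nn

class UntiedBiasConv2d(nn.Module):
    """Convolution with one bias per (filter, row, column) instead of one
    scalar bias per filter -- the 'untie biases' idea in spirit."""
    def __init__(self, in_ch, out_ch, kernel_size, out_h, out_w):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        # one learned offset per output location, not one per filter
        self.bias = nn.Parameter(torch.zeros(out_ch, out_h, out_w))

    def forward(self, x):
        # broadcasting adds the per-location bias across the batch dimension
        return self.conv(x) + self.bias

y = UntiedBiasConv2d(3, 8, 3, 32, 32)(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 8, 32, 32])
```

A per-location bias gives the layer direct access to "where", which is
essentially the capability the CoordConv coordinate channels add, just
parameterized differently.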

So, the Uber paper is kind of silly, except that I now know where to point
for a confirmation of the effectiveness of the idea.

But I agree with you that the mischaracterisation is not appropriate. The main
criticisms in the blogpost are missing the point too. The paper is not
particularly interesting and might not be appropriate for the big conferences,
but in my opinion not for the reasons in the blog post.

Also, who cares about 100 GPUs? Nobody is complaining that algorithms these
days require at least one GPU and don't run on a smartphone, but suddenly 100
GPUs is too much? For some researchers (and I think Uber falls into that
category), 100 GPUs are pocket change. Science does not require that your
algorithm also run on your lab's DIY PhD GPU cluster. If these guys have the
GPUs available and using them allowed them to be home earlier to spend time
with the family, why would it be a problem for them to use the compute?

[0][https://github.com/Lasagne/Lasagne/commit/2f9147497493f71ab6...](https://github.com/Lasagne/Lasagne/commit/2f9147497493f71ab6d79f35a041150f0c881d9d)

------
cs702
The OP seemingly forgot to mention that using CoordConv with GANs results in
more realistic image generation, with smooth geometric transformations
(including translations and deformations) of objects. Examples:

* [https://eng.uber.com/wp-content/uploads/2018/07/image5.gif](https://eng.uber.com/wp-content/uploads/2018/07/image5.gif)

* [https://eng.uber.com/wp-content/uploads/2018/07/image11.gif](https://eng.uber.com/wp-content/uploads/2018/07/image11.gif)

* [https://eng.uber.com/wp-content/uploads/2018/07/image12.gif](https://eng.uber.com/wp-content/uploads/2018/07/image12.gif)

These and other examples suggest CoordConv can _significantly improve the
quality of the representations_ learned by existing architectures.

That doesn't seem so "trivial."

~~~
fwilliams
I haven't read the paper so I can't comment on the success of the method, but
most applied ML research shows its best results in the publication and leaves
out failure cases.

These images look impressive, but without a proper in-depth analysis, more
general claims of improvement on a task are hard to make. And while it's
totally possible that, in this case, the improvements are significant, it's
dangerous to extrapolate from just a few examples in a paper.

~~~
cs702
Yes... but no one's making "general claims." This work _suggests_ the
technique can significantly improve the quality of the representations learned
by existing architectures. Please don't resort to straw-man arguments.

~~~
fwilliams
Okay so I went and read the paper. They discuss generative modeling in section
5 and in the appendix (section 7.2).

Section 5 claims "the corresponding CoordConv GAN model generates objects that
better cover the 2D Cartesian space while using 7% of the parameters of the
conv GAN". There isn't really any quantitative analysis discussing this
further beyond a couple of small graphs. Sections 7.2 and 7.3 visually compare
the generators' outputs for interpolated noise vectors in the latent space.
The results look good, but without quantitative analysis they are very
preliminary.

Generative modeling is tricky, and I think the jump in your first comment
from a few nice images to CoordConv being able to "significantly improve the
quality of the representations" is a big one, given the sparsity of evidence
in the paper. I'm not saying that you're wrong, but your original comment
seemed a bit misleading to me.

~~~
cs702
Yes, the evidence is preliminary and not extensive. Yes, generative models can
be tricky (to say the least). No one's claiming otherwise.

Visual evidence is important for generative image tasks, given that we can't
measure any of these DNN generators against a "true" statistical model that
generates the data.

For a DNN to be able to generate more realistic transformations of generated
images from low-dimensional representations, it must learn higher quality
representations... or are you saying otherwise?

------
alew1
This doesn’t seem like a particularly fair criticism.

1\. As others have pointed out, the ImageNet experiment is presented as
evidence that (as you’d expect) adding coordinate channels doesn’t affect
performance on image classification tasks. That’s a good “sanity check”
experiment to have done.

2\. The paper proposes a simple idea, and it may not have been necessary to
give it a whole new name (CoordConv). But if you’d asked me if I thought that
adding coordinate data to the input would have led to significantly better
object detection, I wouldn’t have known the answer, so the results of their
experiments—that it _does_ help on tasks like object detection—is not trivial.
Not only that—a lot of people have tried to do object detection, and yet
nobody has reported adding input channels for storing coordinates before. A
lot of ideas seem simple after someone thinks of them.

3\. Toy examples are useful for testing intuition (and building intuition
about why this trick may be helpful and for what kinds of tasks). The fact
that we can easily imagine what sorts of weights we’d expect the network to
learn is one of the things that makes it a _good_ toy example. (Of course, the
paper wouldn’t be worth publishing if it only had the toy example.)

~~~
felippee
Yes it is certainly not fair that the network they spend one page explaining
and probably weeks training and researching can be hardwired in 30 lines of
python. This is very unfair. But this is the reality, and so the post states.

Also the idea to add coordinate as a feature has been used in the past without
giving even much thought.

Toy examples are great. As long as they are not trivial. Some guy, presumably
smart, once said that "things should be as simple as possible but not
simpler". The toy example they play with is just too simple.

~~~
ebalit
The interesting part is that this trivial toy problem is hard to learn for a
standard CNN.

They probably engineered the toy problem to be that simple, looking for the
simplest problem that still displays the phenomenon.

~~~
felippee
This may indeed be interesting, but that is not what this paper focuses on.

~~~
ebalit
From the abstract:

"For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on
a toy problem, at which point a simple fix becomes obvious."

[https://arxiv.org/abs/1807.03247](https://arxiv.org/abs/1807.03247)
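
For anyone skimming the thread: the "simple fix" the abstract alludes to
amounts to concatenating normalized coordinate channels onto the input before
convolving. A minimal sketch of that input trick (my own PyTorch illustration,
not the authors' code):

```python
import torch

def add_coord_channels(x):
    """Append normalized row and column coordinate channels to a batch of
    feature maps shaped (N, C, H, W), giving (N, C + 2, H, W)."""
    n, _, h, w = x.shape
    rows = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(n, 1, h, w)
    cols = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, rows, cols], dim=1)

out = add_coord_channels(torch.randn(4, 3, 64, 64))
print(out.shape)  # torch.Size([4, 5, 64, 64])
```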

------
mlthoughts2018
I think there is room for criticizing a lot of the hype around deep learning
papers, especially the semi-blog / semi-research stuff you often see in tech
company blogs, fastai, etc.

But this criticism falls a little flat to me. For instance,

> “Nevertheless the central point of a scientific paper is a relatively
> concisely expressible idea of some nontrivial universality (and predictive
> power) or some nontrivial observation about the nature of reality”

That’s an insanely high bar for published work. I also read lots of research
papers, and I think only a handful per year would meet these requirements. Yet
many others are extremely valuable for showing negative or partial results,
results with small effect sizes, and other things.

We absolutely should not disparage someone for publishing the results of a
failed or ineffectual approach. Otherwise we’ll just make things like file
drawer bias and p-hacking far worse, and create an even worse cultural
expectation that to make a career in science you must constantly publish
positive results with big, sexy implications. That expectation is what leads
to the whole disastrous hype-driven state of affairs (like in deep learning
right now) in the first place: ludicrous science journalism, funding battles
fought over demoware and vaporware, academics fleeing into corporate
sponsorship like yesterday’s article about Facebook, and so on.

~~~
felippee
Author of the post here. I totally agree that negative results should be
published, but without the fanfare. I think they could have changed the tone
of that paper and I would not have an issue with it. It is likely that if they
did that, they'd never get past some idiot reviewer who expects "a positive
result" or some similar silliness. This is not a perfect world. The paper as
is makes strong claims about the novelty and usefulness of their gimmick. If
it turns out your stuff is at least partially hollow and you take on a pompous
tone, you have to be ready to take some heat. Science is not about patting
friends on the back (which, BTW, is what happens a lot in the so-called "deep
learning community"). Science is about fighting to get to some truth, even if
that takes some heat. People so fragile that they cannot take criticism should
just not do it.

~~~
mlthoughts2018
I completely agree regarding fanfare in deep learning. There are lots of
“incremental improvement” papers, GitHub repos, blog posts, etc., and these
are totally fine in principle — but they are without a doubt branded as “state
of the art”, often with messy or incomplete code and little ability to
reproduce the results.

An additional frustration point I always have is when network architectures
are not even fully specified.

Try reading the MTCNN face detection paper. How, exactly, is the input image
pyramid calculated? By what mechanism, exactly, can the network cascade
produce multiple detections (i.e. can it only produce one detection per input
scale, and if more than one, how)? In the Inception paper dealing with
factorized convolutions, just google around to see the deep, deep confusion
over the exact mechanics by which the two-stage, smaller convolutions end up
saving operations over a one-stage larger convolution. The highest-upvoted
answers on Stack Overflow, Reddit, and Quora are often wrong.
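
(For what it's worth, the back-of-the-envelope count I believe that paper
intends, assuming stride 1, 'same' padding, and identical channel counts at
every stage, is just this:)

```python
# Multiply-accumulates per output position, per input/output channel pair,
# assuming stride 1 and the same channel width at every stage:
single_5x5 = 5 * 5          # 25: one 5x5 convolution
two_3x3 = 3 * 3 + 3 * 3     # 18: two stacked 3x3 convs, same 5x5 receptive field
print(1 - two_3x3 / single_5x5)  # 0.28 -> roughly a 28% saving
```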

And these examples are from reasonably interesting mainstream papers that
deserve some fanfare. Just imagine how much worse it is for extremely
incremental engineering papers trying to milk the hype by claiming state of
the art performance.

Still though, at the end of the day, I’d rather that more papers are published
and negative / incremental results are not penalized, because the alternative
file drawer bias would be much worse for science overall.

------
arnioxux
> So they trained it using 100 GPUs (100 GPUs dear lord!), and got no
> difference until fourth decimal digit! 100 GPU's to get a difference on
> fourth decimal digit!

That's hilarious!

But I found the criticism of their toy task less convincing. Algorithmic toy
tasks can always be solved "without any training whatsoever".

For example, with RNNs there's a toy task of adding two numbers that are far
apart in a long sequence. This can be solved deterministically with a
one-liner, but that's not the point. It's still useful for demonstrating RNNs'
failure with long sequences. Would you then call the subsequent development to
make RNNs work for long sequences just feature engineering with no
universality?
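
(To make the toy task concrete: the version I have in mind is the classic
'adding problem' from the RNN literature; a minimal sketch of how one instance
might be generated, details assumed:)

```python
import numpy as np

def make_adding_example(seq_len=100, rng=np.random.default_rng()):
    """One instance of the 'adding problem': the target is the sum of the
    two values whose marker bit is set, however far apart they are."""
    values = rng.uniform(0, 1, seq_len)
    markers = np.zeros(seq_len)
    i, j = rng.choice(seq_len, size=2, replace=False)
    markers[[i, j]] = 1.0
    x = np.stack([values, markers], axis=1)  # (seq_len, 2) input sequence
    return x, values[i] + values[j]          # the "one liner" target

x, target = make_adding_example()
print(x.shape, target)
```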

In that sense, I think their choice of toy task is fine. They're just pointing
out that position is a feature that's currently overlooked in many
architectures that are heavily position dependent (they showed much better
results on Faster R-CNN, for example).

~~~
fwilliams
Somewhat tangentially, some recent work showed that a lot of problems with
images (e.g. denoising, upsampling, inpainting, etc...) can be solved very
efficiently with no training set at all:
[https://dmitryulyanov.github.io/deep_image_prior](https://dmitryulyanov.github.io/deep_image_prior)

This work shows that the network architecture is a strong enough prior to
effectively learn this set of tasks from a single image. Note that there is no
pretraining here whatsoever.

More to your point, I think a big problem with toy tasks is not so much the
tasks themselves as the datasets. A lot of datasets (particularly in my field
of geometry processing) have a tremendous amount of bias towards certain
features.

A lot of papers will show their results trained and evaluated on some toy
dataset. Maybe their claim is that using such-and-such a feature as input
improves test performance on such-and-such problem and dataset.

The problem with these papers often comes when you try to generalize to data
that is similar but not from the toy dataset. A lot of applied ML papers fail
to even moderately generalize, and the authors almost never test or report
this failure. As a result, I think we can spend a lot of time designing
overfitted solutions to certain problems and datasets.

On the flip side, there are plenty of good papers which do careful analysis of
their methods' ability to generalize and solve a problem, but when digging
through the literature it's important to be judicious. I've wasted time
testing methods that turn out to work very poorly.

------
stared
Frankly, I have mixed opinions about this blog post. It's good for its
discussion of the types of papers, and for pointing out that for the toy
problem you can write the convolutions by hand (which, IMHO, is by no means an
argument against CoordConv!). I adore the toy problem they (the authors of the
paper) picked, and if anything, being able to hand-write the solution is an
argument for their choice of toy problem (unsolvable by a typical conv net,
trivial once x and y channels are added).

In science it is crucial to try many approaches that fail, not only things we
are sure will work. So yes, it's good that they burnt 100 GPUs on a problem
where the trick didn't help. In fact, that is a much better standard than most
deep learning papers I read, which focus mostly or only on problems where
their architecture is better.

Plus, it works for object detection, so it's not a "MNIST-only trick".

------
bhouston
I've participated a bit in academic paper reviews over the years for some ACM
journals/conferences in the computer graphics area. Initially I was pretty
green and I often would not catch some of the problems that the more
experienced reviewers would catch. I embarrassingly recommended acceptance for
some papers that other, more experienced reviewers said were clearly crap.
Over time, though, I learned to be more critical by example from the more
experienced reviewers. And eventually I sometimes would be one of the assholes
on the review committee who wrecked people's dreams of publication.

I wonder if the rapid growth of ML recently has diluted the reviewer pool
dramatically. Are there so many papers submitted, and so many green reviewers,
that crap gets through more easily? I also wonder if there is a growth limit
to fields such that the paper review teams do not get overly diluted with
green researchers.

(Has this paper even been peer-reviewed? If it hasn't been peer-reviewed,
there is a good chance it is crap just by the law of averages -- most
"academic" papers are crap. There is a reason the top venues I was involved
with have rejection rates upwards of 80%.)

~~~
throwawaymath
Some machine learning conferences are now recruiting graduate students to be
reviewers.

~~~
glup
This is a common practice in a number of other fields (e.g. cognitive science
and psycholinguistics), at least for conference submissions. In general I
don't see a huge difference -- when a grad student has sufficient standing and
expertise to be chosen as a reviewer, they generally know the domain very well
and are often up to date with the research in a way that more senior reviewers
often aren't. And because they have fewer demands on their time and feel a
greater need to substantiate their critiques, they tend to write more in-depth
reviews. And of course there's still the safeguard that the meta-reviewer can
indicate to the authors in various ways that a review is garbage, or even
throw it out / seek another reviewer.

~~~
throwawaymath
I think a lot of what you're saying is valid, but to be completely clear:
these are graduate students who are reviewing for tier 1 conferences while
having yet to publish in a tier 1 conference themselves. There is a legitimate
argument that they simply don't have the academic maturity, experience or
competency to be reviewing other researchers' submissions, regardless of how
well intentioned they are or how much they want to prove themselves.

------
throwawaymath
Some discussion is happening concurrently at /r/MachineLearning:
[https://reddit.com/r/MachineLearning/comments/90n40l/dautops...](https://reddit.com/r/MachineLearning/comments/90n40l/dautopsy_of_a_deep_learning_paper_quite_brutal/)

In my opinion the result wasn’t significant enough to publish, but writing
takedown pieces like this feels petty and contemptuous to me.

~~~
charmides
I agree with this. The article reads like a hit piece, and I feel sorry for
the scientists who were attacked (even if the scientific merit of their paper
is questionable).

The author could have made his point in a more diplomatic way.

~~~
felippee
Author here (of the post, not the paper). I think you don't understand how
science works. The whole point of the exercise (which indeed may have been
forgotten these days) is to attack ideas/papers. The first line of attack
should be your friends, to make sure you don't put anything out there that is
silly. The second line of attack is the reviewers, who may or may not be
idiots themselves, but who in a perfect world should serve the same purpose.
The third line of attack is independent readers, people like me. I found the
paper to be trivial and took the liberty of attacking it. It is not personal
and should not be taken so. These guys may in the future publish the most
amazing piece of research ever. But this one is not it. They should realize
this, and my blog post serves that purpose. If somebody gets offended and
takes it personally, so be it. I think people should have a bit thicker skin,
especially in science. I took quite a bit of bullshit myself (and I'm sure I
will have to take more) and never complained. So relax, read the paper, read
the post, learn something from both, and go on.

~~~
dgacmu
Don't hide behind the pretense of doing science to justify being a jerk. Look
at your own language, in this reply, and in your blog post:

"you don't understand how science works" \- this is attacking a person, not an
idea.

The blog post:

"Perhaps this would be less shocking, if they'd sat down and instead of
jumping straight to Tensorflow, they could realize" [...]

"They apparently have spent to much time staring into the progress bars on
their GPU's to realize they are praising something obvious, obvious to the
point that it can be constructed by hand in several lines of python code."

This makes assumptions about the authors, and all but calls them idiots. Those
paragraphs drip with sarcasm, of which one can only assume you're smart enough
to be aware and to have intended. You made it personal, and that's exactly
what the GP is noting when they term your blog post a "hit piece".

Yes, people have used explicit coordinates as features before. No, this paper
isn't going to radically change the world, but if you're arguing from
"science", that _doesn't matter_ at all. Science is full of rediscovery and
duplication, and tolerates it just fine. What matters most is that we filter
out things that are wrong -- and I don't think that's obviously the case with
this paper. "Trivial" is a subjective determination, and while one part of the
job of refereeing a journal or conference is to try to rank things as a
service to the audience, it's not the most important aspect of a reviewer's
job.

Just because you took a lot of bullshit doesn't mean it's OK. It's not OK if
people were jerks to you in this way, and it's not OK to pass it on.

~~~
felippee
Oh, somebody got triggered here! Yes, there is sarcasm in this post! And if
you don't like it, fine. But please, don't give me bullshit about being a
jerk. I think you probably have not seen a real jerk in your life yet.

------
madmax108
While I understand the OP's issue with the paper, I also feel that there is
scope for the "We tried this and the improvement we got was minimal, so you
should probably try a different approach" kind of paper.

But OTOH, I agree that the current "hype" around deep learning, accompanied by
the beginning of a "DL winter" for revolutionary papers, means that academics
and companies which operate in a "publish or perish" state of mind end up in a
rush to publish even the smallest of modifications/enhancements.

I understand that I'm arguing both sides of the table here, but at the end of
the day I'd rather have these papers published than not, as long as they end
up in the public domain and can somehow be viewed more as experimental papers
than purely theoretical ones.

~~~
maym86
> We tried this and the improvement we got was minimalistic, so you should
> probably try a different approach

I can try a lot of approaches that won't improve results :) There needs to be
a strong justification for thinking an approach would work, and a high cost to
trying it, for this to be a useful kind of paper to write.

~~~
yorwba
The paper also has some positive results (which TFA conveniently ignores), so
publishing a null result alongside them is quite nice.

------
grizzles
> 100 GPU's to get a difference on fourth decimal digit!

So now we know what not to do. That's valuable.

So what if it's not the best theoretical paper? This screed rehashes
criticisms that are well known among researchers in the field. Overall it
reads like a kind of egotistical hit piece. Personally, I'm glad Uber
published it.

~~~
throwaway080383
Yeah, I'm not a deep learning researcher, but I didn't understand the
incredulity at 100 GPUs. Surely Uber has that many lying around, so why not go
overkill on the hardware when testing? This leaves no doubt that the feature
is worthless for the given task, and does not leave open the question of
whether more hardware could produce better results.

~~~
stigsb
If training requires N operations, using more GPUs (that would otherwise sit
idle) simply means you finish (and iterate) faster.

------
bborud
I've spent a good portion of my life in the border region between software
engineering and pure science. I wish more people from either side would spend
more time in this region. It makes for better scientists and it definitely
makes for better programmers.

My experience is that when the two are combined you get much faster scientific
progress, coupled with software engineers who have much better problem-solving
vocabularies. Engineering seems to inject more imagination and urgency into
the scientific bits of the work. And you need engineers who have the
scientific vocabulary to lift their work to a more scholarly level.

Much scientific publishing is junk. It doesn't carry its own weight in that it
provides an insufficient delta in knowledge to be worth the time it takes to
read.

Likewise, much code that is written is junk, in that the developer used the
first (or only) method they could think of to solve a given problem, due to
having a limited toolchest for problem solving -- often without even knowing
which exact problem they are solving.

Don't shit on engineering papers. They benefit both those who think of
themselves as pure scientists and those who think of themselves as engineers.

------
_cs2017_
Maybe I'm confused. The blog makes a big deal out of the fact that the neural
network can be hard coded. How is this relevant? I thought the whole point of
the paper is whether our standard training process can learn the weights, not
whether it's easy to create a NN with perfect weights if we already know those
weights.

------
vernie
ReLU's pretty trivial; I hope nobody tried to publish a paper about that.

~~~
felippee
Author of the post here: I think their paper would have been much better if
they included the piece of code which I wrote in python to explain that the
transformation they are learning is obviously trivial and the fact that it
works is not in question. This would leave them a lot more space to focus on
something interesting, perhaps explore the GAN's a little further, cause what
they did is somewhat rudimentary. But that omission (and lack of context for
previous use of such features in the literature) left a vulnerability which I
have the full right to exploit in a blog post.
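
(For the curious, the hand-coded mapping in question is roughly of this
flavor; a minimal illustrative sketch, not the exact code from the blog post:)

```python
import numpy as np

def coords_to_onehot(x, y, size=64):
    """The toy 'coordinate transform' done by hand: map an (x, y) pair
    straight to a one-hot pixel image, with no training involved."""
    img = np.zeros((size, size), dtype=np.float32)
    img[y, x] = 1.0
    return img

img = coords_to_onehot(10, 20)
print(img.sum(), img[20, 10])  # 1.0 1.0 -> hot pixel at row 20, column 10
```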

------
ModernMech
I sort of skimmed past the part where it was noted that the critique was of
Uber AI. I had the impression that this was a critique of a student's
conference paper or something like that, and started to feel a little bad for
the author of the paper.

But then I got to this "Why is Uber AI doing this? What is the point? I mean
if these were a bunch of random students on some small university somewhere,
then whatever. They did something, they wanted to go for a conference, fine.
But Uber AI?" and had to wake myself up. Seriously? This is from Uber? This
just screams cargo cult AI.

------
yeukhon
You know what? I think every research paper should come with a video
explaining the result.

