Hacker News
Autopsy of a deep learning paper (piekniewski.info)
249 points by ognyankulev 7 months ago | 54 comments

C'mon, wtf? Some of the criticisms here aren't even close to valid. He spends half of the blog post criticizing them for spending 100 GPUs on the ImageNet classification experiment,

> So they trained it using 100 GPUs (100 GPUs dear lord!), and got no difference until fourth decimal digit! 100 GPU's to get a difference on fourth decimal digit! I think somebody at Google or Facebook should reproduce this result using 10000 GPU's, perhaps they will get a difference at a third decimal digit. Or maybe not, but whatever, those GPU's need to do something right?

Wow. This is just a blatant mischaracterization of what's going on. First of all, this result is in the appendix. It's not meant to be an important result of the paper. In the appendix, they explicitly write:

>Of all vision tasks, we might expect image classification to show the least performance change when using CoordConv instead of convolution, as classification is more about what is in the image than where it is. This tiny amount of improvement validates that.

In contrast, they compare against object detection (in which the spatial location matters), and get substantially better results.

This is just a standard "negative" result, to validate the fact that what they think is happening is actually happening empirically.

The fact that this blog post mocks them for that, and much of HN is laughing along with the blog is seriously disappointing.

The first commit of Lasagne in 2014 already had the 'untie bias' option [0], which achieves the same effect as the paper, but in a different way (and is in my opinion more elegant). And while I cannot find a paper, I think it is one of those tricks which have been around since the Schmidhuber days. Moreover, it is one of the tricks which has been actively used since as long as I have been involved with convolutional neural networks (since 2010).
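For concreteness, an untied bias replaces the usual single per-channel bias of a convolution with a full per-location bias map, which lets the layer encode position directly. A minimal numpy sketch of the idea (my own illustrative names, not Lasagne's actual API):

```python
import numpy as np

def add_untied_bias(feature_maps, bias):
    """Add an untied (per-location) bias to conv feature maps.

    feature_maps: array of shape (batch, channels, height, width)
    bias: array of shape (channels, height, width) -- one bias per
          channel *and* spatial position, unlike the usual (channels,)
          bias shared across all locations.
    """
    return feature_maps + bias[None, :, :, :]

# A tied bias cannot encode position; an untied bias trivially can:
batch, c, h, w = 2, 3, 8, 8
maps = np.zeros((batch, c, h, w))
untied = np.random.randn(c, h, w)
out = add_untied_bias(maps, untied)
assert out.shape == (batch, c, h, w)
```

Since the bias varies across (height, width), the network gets location information "for free", which is the same effect CoordConv achieves by appending coordinate channels.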

So the Uber paper is kind of silly, other than that I now know where to point for a confirmation of the effectiveness of the idea.

But I agree with you that the mischaracterisation is not appropriate. The main criticisms in the blog post miss the point too. The paper is not particularly interesting and might not be appropriate for the big conferences, but in my opinion not for the reasons given in the blog post.

Also, who cares about 100 GPUs? Nobody complains that all algorithms these days require a GPU and don't run on a smartphone, but suddenly 100 GPUs is too much? For some researchers (and I think Uber falls into that category), 100 GPUs are pocket change. Science does not require that your algorithm also run on your lab's DIY PhD GPU cluster. If these guys have the GPUs available and using them let them get home earlier to spend time with the family, why would it be a problem for them to use the compute?


The 100 GPUs remark is an aside, but still somewhat significant. It’s a tremendous amount of computing power to show a negative result that the authors don’t even really acknowledge as a negative result.

There is a common issue with deep learning papers: deep learning seems to excuse thinking rigorously about the underlying problem at hand. We trust a deep network to do the legwork for us, and this invites an intellectually-lazy approach to research. We’re graduating PhDs who have spent four years twisting knobs.

The paper itself is fine, no better or worse than the average paper today. The troubling part is the large number of high-profile ML practitioners praising it within minutes/hours. This suggests to me that notable results in DNN might be drying up and name recognition (Uber AI) is driving attention more than substance.

I think it's fair criticism.

The paper explores the effects of known models (convolution functions that exist even outside the NN field) without explaining them or doing the math to understand where the error comes from. You could, for example, take the CNN layer and work out exactly what it does. Instead they prefer to pass the full signal they expect to recover through the network and then find that it works.

Taking sanctuary in computation and random experimentation against known datasets, without fully explaining what is actually happening on the modelling side, shouldn't be acceptable.

The OP is flagging the mental model and attitude towards research in his post. The paper being discussed is just an example confirming that proper rigour is being replaced by the convenience of either computation or parameter fiddling (human or experimental). None of these is going to take the field very far.

The OP suggests the scientific field should focus on things that matter and change the world in an irreversible way. Provide a theorem or something that can be proved.

The post mocks them primarily for learning the trivial coordinate transform. That is the core of the paper, and ridiculing this piece leaves very little left on the table. The ImageNet test is just an appendix, a cherry on the cake, a curiosity one might say.

The OP seemingly forgot to mention the fact that using CoordConv with GANs results in more realistic generation of images, with smooth geometric transformations (including translation and deformations) of objects. Examples:

* https://eng.uber.com/wp-content/uploads/2018/07/image5.gif

* https://eng.uber.com/wp-content/uploads/2018/07/image11.gif

* https://eng.uber.com/wp-content/uploads/2018/07/image12.gif

These and other examples suggest CoordConv can significantly improve the quality of the representations learned by existing architectures.

That doesn't seem so "trivial."

I haven't read the paper so I can't comment on the success of the method, but most applied ML research will show their best results in the publication and leave out failure cases.

These images look impressive, but without a proper in-depth analysis, more general claims of improvement on a task are hard to make. And while it's totally possible that, in this case, the improvements are significant, it's dangerous to extrapolate from just a few examples in a paper.

A bit related: on top of that, papers in top venues frequently don't publish data, let alone their implementations. How hard is it to just dump a zip of your data in 2018? (I don't want to single out any particular papers.)

I am talking about data that should not be that huge. Does no one else feel frustrated by this?

Exactly this. Science is supposed to be reproducible. Computer science is much more easily reproducible than many other disciplines and it is just sad that many papers don't publish their data/code to reinforce their conclusions.

I could understand Facebook/Google/BigCo. doing so when the data in question might be internal implementation/tech but the trend is far more prevalent.

Sharing experiment data/parameters would not only help people verify your results but also help students learning how to design and carry out experiments.

Yes... but no one's making "general claims." This work suggests the technique can significantly improve the quality of the representations learned by existing architectures. Please don't resort to straw-man arguments.

fwilliams: looking at this comment, in hindsight, I wish I could soften it. I did make what can only be described as "general claims," and now feel more than a bit dumb for saying otherwise... sorry about that. (What I was trying to say, but evidently did a poor job of articulating, is that the authors of the paper are not making general claims, even though they do speculate about the potential usefulness of their work.)

Okay so I went and read the paper. They discuss generative modeling in section 5 and in the appendix (section 7.2).

Section 5 claims "the corresponding CoordConv GAN model generates objects that better cover the 2D Cartesian space while using 7% of the parameters of the conv GAN". There isn't really any quantitative analysis beyond a couple of small graphs. Sections 7.2 and 7.3 visually compare the generator's output for interpolated noise vectors in the latent space. The results look good, but without quantitative analysis they are very preliminary.

Generative modeling is tricky and I think in your first comment, the jump from a few nice images to CoordConv can "significantly improve the quality of the representations" is a big one given the sparsity of evidence in the paper. I'm not saying that you're wrong but your original comment seemed a bit misleading to me.

Yes, the evidence is preliminary and not extensive. Yes, generative models can be tricky (to say the least). No one's claiming otherwise.

Visual evidence is important for generative image tasks, given that we can't measure any of these DNN generators against a "true" statistical model that generates the data.

For a DNN to be able to generate more realistic transformations of generated images from low-dimensional representations, it must learn higher quality representations... or are you saying otherwise?

I'm not sure how to interpret these pictures. They don't suggest anything to me. And certainly don't suggest anything about the quality of representations. And BTW how do you measure quality of representations?

It is mentioned in the video

This doesn’t seem like a particularly fair criticism.

1. As others have pointed out, the ImageNet experiment is presented as evidence that (as you’d expect) adding coordinate channels doesn’t affect performance on image classification tasks. That’s a good “sanity check” experiment to have done.

2. The paper proposes a simple idea, and it may not have been necessary to give it a whole new name (CoordConv). But if you’d asked me if I thought that adding coordinate data to the input would have led to significantly better object detection, I wouldn’t have known the answer, so the results of their experiments—that it does help on tasks like object detection—is not trivial. Not only that—a lot of people have tried to do object detection, and yet nobody has reported adding input channels for storing coordinates before. A lot of ideas seem simple after someone thinks of them.

3. Toy examples are useful for testing intuition (and building intuition about why this trick may be helpful and for what kinds of tasks). The fact that we can easily imagine what sorts of weights we’d expect the network to learn is one of the things that makes it a good toy example. (Of course, the paper wouldn’t be worth publishing if it only had the toy example.)

Yes, it is certainly not fair that the network they spend a page explaining and probably weeks training and researching can be hardwired in 30 lines of Python. This is very unfair. But this is the reality, as the post states.

Also, the idea of adding coordinates as features has been used in the past without much thought being given to it.

Toy examples are great. As long as they are not trivial. Some guy, presumably smart, once said that "things should be as simple as possible but not simpler". The toy example they play with is just too simple.
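The "hardwired" solution the blog refers to is easy to picture: the toy task maps Cartesian (x, y) coordinates to a one-hot pixel grid, which can be written by hand with no training at all. A minimal sketch (my own code, not the blog's exact snippet):

```python
import numpy as np

def coords_to_onehot(x, y, size=64):
    """Hand-coded Cartesian-to-one-hot transform: place a single 1 at
    pixel (y, x) of a size x size grid. This is the mapping the toy
    task asks a network to learn."""
    grid = np.zeros((size, size))
    grid[y, x] = 1.0
    return grid

img = coords_to_onehot(10, 20, size=64)
assert img.sum() == 1.0 and img[20, 10] == 1.0
```

Whether a standard conv net can *learn* this mapping by gradient descent is a separate question from whether it can be written down by hand, which is where the two sides of this thread disagree.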

I highly doubt they spent weeks training on the toy example. More like five minutes, probably. Again, that the weights learned (quickly) for the toy example can also be set by hand is not surprising and is evidence of a good toy example. The paper’s main result is not the toy example, but the real experiments (for which I doubt you could hand-code the network weights).

The interesting part is that this trivial toy problem is hard to learn for a standard CNN.

They probably engineered the toy problem to be that simple, looking for the simplest problem that still displays the phenomenon.

This may indeed be interesting, but that is not what this paper focuses on.

From the abstract:

"For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious."


I’m surprised that they put the toy example, ImageNet, and RL results in the body of the paper, while the object detection results are relegated to the appendix. Is their object detection result significant? If so, why not give it more attention?

I think there is room for criticizing a lot of the hype around deep learning papers, especially the semi-blog / semi-research stuff you often see in tech company blogs, fastai, etc.

But this criticism falls a little flat to me. For instance,

> “Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality”

That’s an insanely high bar for published work. I also read lots of research papers, and I think only a handful per year would meet these requirements. Yet many others are extremely valuable to show negative or partial results, results with small effect sizes, and other things.

We absolutely should not disparage someone for publishing results of a failed or ineffectual approach. Because otherwise we’ll just make things like file drawer bias and p-hacking far worse, and create an even worse cultural expectation that to make a career in science, you must constantly publish positive results with big, sexy implications — which is what leads to the whole disastrous hype-driven state of affairs, like in deep learning right now, in the first place, and ludicrous science journalism, funding battles fought over demoware and vaporware, academics fleeing into corporate sponsorship like yesterday’s article about Facebook, etc.

Author of the post here. I totally agree that negative stuff should be published. But without the fanfare. I think they could have changed the tone of that paper and I would not have an issue with it. It is likely that if they did that, they'd never get past some idiot reviewer who expects "a positive result" or some similar silliness. This is not a perfect world. The paper as is makes strong claims about the novelty and usefulness of their gimmick. If it turns out your stuff is at least partially hollow and you take on a pompous tone, you have to be ready to take some heat. Science is not about patting friends on the back (which BTW is what is happening a lot in the so-called "deep learning community"). Science is about fighting to get to some truth, even if that takes some heat. People so fragile that they cannot take criticism should just not do it.

I completely agree regarding fanfare in deep learning. There are lots of “incremental improvement” papers, GitHub repos, blog posts, etc. and these are totally fine in principle — but they are without a doubt branded as “state of the art” often with messy or incomplete code and little capability to reproduce results.

An additional frustration point I always have is when network architectures are not even fully specified.

Try reading the MTCNN face detection paper. How, exactly, is the input image pyramid calculated? By what mechanism, exactly, can the network cascade produce multiple detections (i.e. can it only produce one detection per input scale? If more than one, how?)? In the Inception paper dealing with factorized convolutions, just google around to see the deep, deep confusion over the exact mechanics by which the two-stage, smaller convolutions end up saving operations over a one-stage larger convolution. The highest-upvoted answers on Stack Overflow, Reddit, and Quora are often wrong.

And these examples are from reasonably interesting mainstream papers that deserve some fanfare. Just imagine how much worse it is for extremely incremental engineering papers trying to milk the hype by claiming state of the art performance.

Still though, at the end of the day, I’d rather that more papers are published and negative / incremental results are not penalized, because the alternative file drawer bias would be much worse for science overall.

> Yet many others are extremely valuable to show negative or partial results, results with small effect sizes, and other things.

(the evil?) Elsevier has a similar opinion => https://www.elsevier.com/connect/scientists-we-want-your-neg...

> So they trained it using 100 GPUs (100 GPUs dear lord!), and got no difference until fourth decimal digit! 100 GPU's to get a difference on fourth decimal digit!

That's hilarious!

But I found the criticism on their toy task less convincing. Algorithmic toy tasks can always be solved "without any training whatsoever".

For example in RNNs, there's a toy task that adds two numbers that are far apart in a long sequence. This can be solved deterministically with a one liner, but that's not the point. It's still useful for demonstrating RNN's failure with long sequences. Would you then call the subsequent development to make RNNs work for long sequences just feature engineering with no universality?
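That adding task is easy to generate, and trivially solvable in one line once you can see the whole sequence, which is exactly why it is a useful probe of long-range memory rather than a useful problem in itself. A sketch of the data generation (my own variable names; the task follows the standard formulation):

```python
import numpy as np

def make_adding_example(length, rng):
    """Long-range 'adding problem': a sequence of (value, marker)
    pairs where exactly two markers are 1; the target is the sum of
    the two marked values. An RNN must remember both marked values
    across the whole sequence to answer correctly."""
    values = rng.uniform(0.0, 1.0, size=length)
    markers = np.zeros(length)
    i, j = rng.choice(length, size=2, replace=False)
    markers[i] = markers[j] = 1.0
    target = values[i] + values[j]
    return np.stack([values, markers], axis=1), target

rng = np.random.default_rng(0)
seq, target = make_adding_example(100, rng)
# The deterministic one-liner that solves it without any training:
assert np.isclose((seq[:, 0] * seq[:, 1]).sum(), target)
```

The existence of that one-liner doesn't make the benchmark useless; it makes the benchmark a clean diagnostic, which is the parent comment's point about the CoordConv toy task as well.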

In that sense, I think their choice of toy task is fine. They're just pointing out that position is a feature currently overlooked in the many architectures that are heavily position dependent (they showed much better results on Faster R-CNN, for example).

Somewhat tangentially, some recent work showed that a lot of problems with images (e.g. denoising, upsampling, inpainting, etc...) can be solved very efficiently with no training set at all: https://dmitryulyanov.github.io/deep_image_prior

This work shows that the network architecture is a strong enough prior to effectively learn this set of tasks from a single image. Note that there is no pretraining here whatsoever.

More to your point, I think a big problem with toy tasks is not so much the tasks but the datasets. A lot of datasets (particularly in my field of geometry processing) have a tremendous amount of bias towards certain features.

A lot of papers will show their results trained and evaluated on some toy dataset. Maybe their claim is that using such-and-such a feature as input improves test performance on such-and-such problem and dataset.

The problem with these papers often comes when you try to generalize to data that is similar but not from the toy dataset. A lot of applied ML papers fail to even moderately generalize, and the authors almost never test or report this failure. As a result, I think we can spend a lot of time designing over-fitted solutions to certain problems and datasets.

On the flip side, there are plenty of good papers which do careful analysis of their methods' ability to generalize and solve a problem, but when digging through the literature it's important to be judicious. I've wasted time testing methods that turn out to work very poorly.

Frankly, I have mixed opinions about this blog post. Good for discussing types of papers, and for showing that the toy problem's convolutions can be written by hand (which, IMHO, is by no means an argument against CoordConv!). I adore the toy problem they (the authors of the paper) picked, and if anything, it is an argument for their choice of toy problem (unsolvable by a typical conv, trivial with added x and y channels).

In science it is crucial to try many approaches that fail, not only things we are already sure will work. So yes, it's good that they burnt 100 GPUs on a problem where it didn't help. And in fact that is a much better standard than most deep learning papers I read, which focus mostly or only on problems where their architecture is better.

Plus, it works for object detection, so it's not a "MNIST-only trick".

I've participated a bit in academic paper reviews over the years for some ACM journals/conferences in the computer graphics area. Initially I was pretty green and I often would not catch some of the problems that the more experienced reviewers would catch. I embarrassingly recommended acceptance for some papers that other more experienced reviewers said were clearly crap. Over time, though, I learned to be more critical by example from the more experienced reviewers. And eventually I sometimes would be one of the assholes on the review committee that wrecked people's dreams of publication.

I wonder if the rapid growth of ML recently has diluted the reviewer pool dramatically? There are so many papers submitted but so many of the reviewers are green that crap gets through more easily? I wonder if there is a growth limit to fields such that the paper review teams do not get overly diluted with green researchers?

(Has this paper even been peer-reviewed? If it hasn't been peer reviewed there is a good chance it is crap just by the law of averages -- most "academic" papers are crap. There is a reason the top venues that I was involved with have a rejection rate upwards of 80%.)

>I wonder if the rapid growth of ML recently has diluted the reviewer pool dramatically?

There was a blow-up a couple months ago around this issue. Someone posted to r/machinelearning saying they were starting grad school in the fall, and had been accepted as a reviewer for NIPS 2018.

Twitter convo about it [0] generated some press, with the pull quote of "It's 'peer review', not 'person who did 5 TensorFlow tutorials review'".


Some machine learning conferences are now recruiting graduate students to be reviewers.

This is a common practice in a number of other fields (e.g. cognitive science and psycholinguistics), at least for conference submissions. In general I don't see a huge difference -- when a grad student has sufficient standing and expertise to be chosen as a reviewer, they generally know the domain very well and are often up to date with the research in a way that more senior reviewers often aren't. And because they have fewer demands on their time and feel a greater need to substantiate their critiques, they tend to write more in-depth reviews. And of course there's still the safeguard that the meta-reviewer can indicate to the authors in various ways that a review is garbage, or even throw it out / seek another reviewer.

I think a lot of what you're saying is valid, but to be completely clear: these are graduate students who are reviewing for tier 1 conferences while being yet to publish in a tier 1 conference themselves. There is a legitimate argument that they simply don't have the academic maturity, experience or competency to be reviewing other researchers' submissions, regardless of how well intentioned they are or how much they want to prove themselves.

Have you read the paper or just the review? I'm interested in your opinion as an experienced reviewer whether this article appropriately evaluates the paper.

I knew a specific sub-field of computer graphics well enough to be a paper reviewer and even then I wasn't initially very good at it. I am not competent to review anything outside of that specific sub-field.

Some discussion is happening concurrently at /r/MachineLearning: https://reddit.com/r/MachineLearning/comments/90n40l/dautops...

In my opinion it wasn’t particularly significant enough a result to publish, but writing takedown pieces like this feels petty and contemptuous to me.

I agree with this. This article is like a hit piece and I feel sorry for the scientists who were attacked (even if the scientific merit of their paper is questionable).

The author could have made his point in a more diplomatic way.

Author here (of the post, not the paper). I think you don't understand how science works. The whole point of the exercise (which indeed may have been forgotten these days) is to attack ideas/papers. The first line of attack should be your friends, to make sure you don't put anything out there that is silly. The second line of attack is the reviewers, who may or may not be idiots themselves, but who in a perfect world should serve the same purpose. The third line of attack is independent readers, people like me. I found the paper to be trivial and took the liberty to attack it. It is not personal and should not be taken so. These guys may in the future publish the most amazing piece of research ever. But this one is not it. They should realize this, and my blog post serves that purpose. If somebody gets offended and takes it personally, so be it. I think people should have a bit thicker skin, especially in science. I have taken quite a bit of bullshit myself (and I'm sure I will have to take more) and never complained. So relax, read the paper, read the post, learn something from both, and move on.

Don't hide behind the pretense of doing science to justify being a jerk. Look at your own language, in this reply, and in your blog post:

"you don't understand how science works" - this is attacking a person, not an idea.

The blog post:

"Perhaps this would be less shocking, if they'd sat down and instead of jumping straight to Tensorflow, they could realize" [...]

"They apparently have spent to much time staring into the progress bars on their GPU's to realize they are praising something obvious, obvious to the point that it can be constructed by hand in several lines of python code."

This makes assumptions about the authors, and all but calls them idiots. Those paragraphs drip with sarcasm, of which one can only assume you're smart enough to be aware and to have intended. You made it personal, and that's exactly what the GP is noting when they term your blog post a "hit piece".

Yes, people have used explicit coordinates as features before. No, this paper isn't going to radically change the world, but if you're arguing from "science", that _doesn't matter_ at all. Science is full of rediscovery and duplication, and tolerates it just fine. What matters most is that we filter out things that are wrong -- and I don't think that's obviously the case with this paper. "Trivial" is a subjective determination, and while one part of the job of refereeing a journal or conference is to try to rank things as a service to the audience, it's not the most important aspect of a reviewer's job.

Just because you took a lot of bullshit doesn't mean it's OK. It's not OK if people were jerks to you in this way, and it's not OK to pass it on.

Oh, somebody got triggered here! Yes, there is sarcasm in this post! And if you don't like it, fine. But please, don't give me bullshit about being a jerk. I think you probably have not seen a real jerk in your life yet.

While I understand the OP's issue with the paper, I also feel that there is scope for the "We tried this and the improvement we got was minimalistic, so you should probably try a different approach" kind of paper.

But OTOH, I agree that the current "hype" around deep learning, accompanied by the beginning of a "DL winter" in revolutionary papers, means that academics and companies set up in a "publish or perish" state of mind end up rushing to publish even the smallest of modifications/enhancements.

I understand that I'm arguing both sides of the table here, but at the end of the day I'd rather have these papers published than not, as long as they end up in public domain and can somehow be viewed more as experimental papers than purely theoretical ones.

> We tried this and the improvement we got was minimalistic, so you should probably try a different approach

I can try a lot of approaches that won't improve results :) There needs to be a strong justification for thinking it would work and a high cost in trying it for this to be a useful approach to writing papers.

The paper also has some positive results (which TFA conveniently ignores), so also publishing a null result is quite nice.

> 100 GPU's to get a difference on fourth decimal digit!

So now we know what not to do. That's valuable.

So what if it's not the best theoretical paper. This screed rehashes criticisms that are well known among researchers in the field. Overall it reads like a kind of egotistical hit piece. Personally I'm glad Uber published it.

Yeah, I'm not a deep learning researcher, but I didn't understand the incredulity at 100 GPUs. Surely Uber has that many lying around, so why not overkill on the hardware when testing? It leaves no doubt that the feature is worthless for the given task, and does not leave open the question of whether more hardware could produce better results.

If training requires N operations, using more GPUs (that otherwise would idle) simply means you finish (and iterate) faster.

It's a genuine scientific question: adding a useless feature slows down NN training, but how much? Is the impact as negligible as can be expected?

There's no reason to spend lots of brain cycles on low-value theory and possibly wrong estimates instead of lots of GPU cycles on a conclusive experiment.

I've spent a good portion of my life in the border region between software engineering and pure science. I wish more people from either side would spend more time in this region. It makes for better scientists, and it definitely makes for better programmers.

My experience is that when the two are combined you'll get much faster scientific progress coupled with software engineers that have much better problem solving vocabularies. Engineering seems to inject more imagination and urgency to the scientific bits of the work. And you need engineers that have the scientific vocabulary to lift their work to a more scholarly level.

Much scientific publishing is junk. It doesn't carry its own weight in that it provides an insufficient delta in knowledge to be worth the time it takes to read.

Likewise, much code that is written is junk in that the developer used the first method (or only method) they could think of to solve a given problem due to having a limited toolchest for problem solving. Often not even knowing which exact problem they are solving.

Don't shit on engineering papers. It benefits both those who think of themselves as pure scientists and engineers.

Maybe I'm confused. The blog makes a big deal out of the fact that the neural network can be hard coded. How is this relevant? I thought the whole point of the paper is whether our standard training process can learn the weights, not whether it's easy to create a NN with perfect weights if we already know those weights.

ReLU's pretty trivial; I hope nobody tried to publish a paper about that.

Author of the post here: I think their paper would have been much better if they had included the piece of code I wrote in Python, to make clear that the transformation they are learning is obviously trivial and that whether it works is not in question. That would have left them a lot more space to focus on something interesting, perhaps exploring the GANs a little further, because what they did is somewhat rudimentary. But that omission (and the lack of context for previous uses of such features in the literature) left a vulnerability which I have every right to exploit in a blog post.

I sort of skimmed past the part noting the critique was of Uber AI. I got the impression that this was a critique of a student's conference paper or something like that, and started to feel a little bad for the author of the paper.

But then I got to this "Why is Uber AI doing this? What is the point? I mean if these were a bunch of random students on some small university somewhere, then whatever. They did something, they wanted to go for a conference, fine. But Uber AI?" and had to wake myself up. Seriously? This is from Uber? This just screams cargo cult AI.

You know what? I think every research paper should come with a video explaining the result.
