> So they trained it using 100 GPUs (100 GPUs dear lord!), and got no difference until fourth decimal digit! 100 GPU's to get a difference on fourth decimal digit! I think somebody at Google of Facebook should reproduce this result using 10000 GPU's, perhaps they will get a difference at a third decimal digit. Or maybe not, but whatever, those GPU's need to do something right?
Wow. This is just a blatant mischaracterization of what's going on. First of all, this result is in the appendix. It's not meant to be an important result of the paper. In the appendix, they explicitly write:
>Of all vision tasks, we might expect image classification to show the least performance change when using CoordConv instead of convolution, as classification is more about what is in the image than where it is. This tiny amount of improvement validates that.
In contrast, they compare against object detection (in which the spatial location matters), and get substantially better results.
This is just a standard "negative" result, there to validate empirically that what they think is happening is actually happening.
The fact that this blog post mocks them for that, and that much of HN is laughing along with it, is seriously disappointing.
So, the Uber paper is kind of silly; other than that, I now know where to point for confirmation that the idea is effective.
But I agree with you that the mischaracterisation is not appropriate. The main criticisms in the blog post miss the point too. The paper is not particularly interesting and might not be appropriate for the big conferences, but in my opinion not for the reasons given in the blog post.
Also, who cares about 100 GPUs? Nobody complains that most algorithms these days require a GPU and don't run on a smartphone, but suddenly 100 GPUs is too much? For some researchers (and I think Uber falls into that category), 100 GPUs are pocket change. Science does not require that your algorithm also run on your lab's DIY PhD GPU cluster. If these people had the GPUs available and using them let them get home earlier to spend time with their families, why would it be a problem for them to use the compute?
There is a common issue with deep learning papers: deep learning seems to excuse us from thinking rigorously about the underlying problem at hand. We trust a deep network to do the legwork for us, and this invites an intellectually lazy approach to research. We're graduating PhDs who have spent four years twisting knobs.
The paper itself is fine, no better or worse than the average paper today. The troubling part is the large number of high-profile ML practitioners praising it within minutes or hours. This suggests to me that notable DNN results might be drying up, and that name recognition (Uber AI) is driving attention more than substance.
The paper explores the effects of known models (convolution operators exist well outside the NN field) without explaining them or doing the math to understand where the error comes from. They could have taken, for example, the CNN layer and worked out exactly what it does. Instead they prefer to pass the full signal they expect to recover at the end through the network, and then find that it works.
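That analysis is straightforward to do concretely: a bias-free convolution commutes with translation, so a stack of convolutions cannot, by itself, produce position-dependent output (boundary and padding effects aside). A minimal numpy sketch of that equivariance, using a naive "valid" cross-correlation:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 2-D cross-correlation with 'valid' padding, no bias."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

# Shifting the input by one pixel shifts the output by one pixel:
# the layer is translation-equivariant, so on its own it cannot tell
# *where* in the image a pattern sits -- which is exactly why explicit
# coordinate channels matter for position-dependent tasks.
shifted = np.roll(img, 1, axis=1)
a = conv2d_valid(img, k)
b = conv2d_valid(shifted, k)
print(np.allclose(a[:, :-1], b[:, 1:]))  # True
```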
Finding sanctuary in computation and random experimentation against known datasets, without fully explaining what is actually happening on the modelling side, shouldn't be acceptable.
The OP is flagging the mental model and attitude towards research. The paper being discussed is just an example confirming that proper rigour is being replaced by the convenience of computation or parameter fiddling (human or experimental). None of these will take the field very far.
The OP suggests the scientific field should focus on things that matter and change the world in an irreversible way. Provide a theorem or something that can be proved.
> These and other examples suggest CoordConv can significantly improve the quality of the representations learned by existing architectures.
That doesn't seem so "trivial."
These images look impressive, but without a proper in-depth analysis, more general claims of improvement on a task are hard to make. And while it's entirely possible that the improvements are significant in this case, it's dangerous to extrapolate from just a few examples in a paper.
I am talking about data that should not be that huge. Does no one else feel frustrated by this?
I could understand Facebook/Google/BigCo. doing so when the data in question might be internal implementation/tech, but the trend is far more prevalent than that.
Sharing experiment data/parameters would not only help people verify your results but also help students learning how to design and carry out experiments.
Section 5 claims "the corresponding CoordConv GAN model generates objects that better cover the 2D Cartesian space while using 7% of the parameters of the conv GAN". There isn't really any quantitative analysis discussing this further, beyond a couple of small graphs. Sections 7.2 and 7.3 visually compare the generators' outputs for interpolated noise vectors in the latent space. The results look good, but without quantitative analysis they are very preliminary.
Generative modeling is tricky, and I think the jump in your first comment from a few nice images to "CoordConv can significantly improve the quality of the representations" is a big one, given the sparsity of evidence in the paper. I'm not saying you're wrong, but your original comment seemed a bit misleading to me.
Visual evidence is important for generative image tasks, given that we can't measure any of these DNN generators against a "true" statistical model that generates the data.
For a DNN to be able to generate more realistic transformations of generated images from low-dimensional representations, it must learn higher quality representations... or are you saying otherwise?
1. As others have pointed out, the ImageNet experiment is presented as evidence that (as you’d expect) adding coordinate channels doesn’t affect performance on image classification tasks. That’s a good “sanity check” experiment to have done.
2. The paper proposes a simple idea, and it may not have been necessary to give it a whole new name (CoordConv). But if you'd asked me whether I thought adding coordinate data to the input would lead to significantly better object detection, I wouldn't have known the answer, so the results of their experiments—that it does help on tasks like object detection—are not trivial. Not only that—a lot of people have tried to do object detection, and yet nobody has reported adding input channels for coordinates before. A lot of ideas seem simple after someone thinks of them.
3. Toy examples are useful for testing intuition (and building intuition about why this trick may be helpful and for what kinds of tasks). The fact that we can easily imagine what sorts of weights we’d expect the network to learn is one of the things that makes it a good toy example. (Of course, the paper wouldn’t be worth publishing if it only had the toy example.)
Also, the idea of adding coordinates as features has been used in the past without even being given much thought.
Toy examples are great. As long as they are not trivial. Some guy, presumably smart, once said that "things should be as simple as possible but not simpler". The toy example they play with is just too simple.
They probably engineered the toy problem to be that simple, looking for the simplest problem that still displays the phenomenon.
"For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious."
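For readers unfamiliar with the trick, the CoordConv augmentation itself can be sketched in a few lines of numpy. The function name and the [-1, 1] normalization here are my choices (the paper's reference implementation is in TensorFlow); the idea is only to append per-pixel coordinate channels before the convolution:

```python
import numpy as np

def add_coord_channels(batch):
    """Append normalized y/x coordinate channels to an image batch.

    batch: array of shape (N, H, W, C); returns (N, H, W, C + 2).
    The two extra channels encode each pixel's position, scaled to
    [-1, 1], so subsequent convolutions can condition on location.
    """
    n, h, w, _ = batch.shape
    ys = np.linspace(-1.0, 1.0, h).reshape(1, h, 1, 1)
    xs = np.linspace(-1.0, 1.0, w).reshape(1, 1, w, 1)
    y_chan = np.broadcast_to(ys, (n, h, w, 1))
    x_chan = np.broadcast_to(xs, (n, h, w, 1))
    return np.concatenate([batch, y_chan, x_chan], axis=-1)

# Example: a batch of two 4x4 single-channel images.
imgs = np.zeros((2, 4, 4, 1))
out = add_coord_channels(imgs)
print(out.shape)  # (2, 4, 4, 3)
```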
But this criticism falls a little flat to me. For instance,
> “Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality”
That’s an insanely high bar for published work. I also read lots of research papers, and I think only a handful per year would meet these requirements. Yet many others are extremely valuable to show negative or partial results, results with small effect sizes, and other things.
We absolutely should not disparage someone for publishing results of a failed or ineffectual approach. Because otherwise we’ll just make things like file drawer bias and p-hacking far worse, and create an even worse cultural expectation that to make a career in science, you must constantly publish positive results with big, sexy implications — which is what leads to the whole disastrous hype-driven state of affairs, like in deep learning right now, in the first place, and ludicrous science journalism, funding battles fought over demoware and vaporware, academics fleeing into corporate sponsorship like yesterday’s article about Facebook, etc.
An additional frustration point I always have is when network architectures are not even fully specified.
Try reading the MTCNN face detection paper. How, exactly, is the input image pyramid calculated? By what mechanism, exactly, can the network cascade produce multiple detections (can it only produce one detection per input scale? If more, how?)? In the Inception paper dealing with factorized convolutions, just Google around to see the deep, deep confusion over the exact mechanics by which stacking two smaller convolutions ends up saving operations over a single larger convolution. The highest-upvoted answers on Stack Overflow, Reddit, and Quora are often wrong.
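The image-pyramid question is a good example of detail that readers end up reverse-engineering from third-party code. Here is the schedule that reference implementations of MTCNN typically use; note that the 0.709 shrink factor and the 20 px minimum face size are conventions from those implementations, not constants pinned down by the paper itself, which is exactly the complaint:

```python
def pyramid_scales(h, w, min_face=20, min_size=12, factor=0.709):
    """A typical MTCNN-style image-pyramid schedule (assumed defaults).

    Scales the image so the smallest detectable face (min_face px) maps
    onto the P-Net input size (min_size px), then shrinks by `factor`
    until the shorter side would drop below min_size.
    """
    scales = []
    scale = min_size / min_face          # initial scale: 12/20 = 0.6
    m = min(h, w) * scale                # shorter side after scaling
    while m >= min_size:
        scales.append(scale)
        scale *= factor
        m *= factor
    return scales

scales = pyramid_scales(480, 640)
print(len(scales), scales[0])
```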
And these examples are from reasonably interesting mainstream papers that deserve some fanfare. Just imagine how much worse it is for extremely incremental engineering papers trying to milk the hype by claiming state of the art performance.
Still though, at the end of the day, I’d rather that more papers are published and negative / incremental results are not penalized, because the alternative file drawer bias would be much worse for science overall.
(the evil?) Elsevier has a similar opinion => https://www.elsevier.com/connect/scientists-we-want-your-neg...
But I found the criticism on their toy task less convincing. Algorithmic toy tasks can always be solved "without any training whatsoever".
For example, in RNNs there's a toy task that adds two numbers far apart in a long sequence. This can be solved deterministically with a one-liner, but that's not the point; it's still useful for demonstrating RNNs' failure on long sequences. Would you then call the subsequent development to make RNNs work on long sequences just feature engineering with no universality?
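Concretely, that adding task and its deterministic "one-liner" look something like this (the exact formulation varies across papers; this is a sketch of the common value/marker version):

```python
# The "adding problem" toy task for RNNs: each timestep is a
# (value, marker) pair, and the target is the sum of the two values
# whose marker is 1. The deterministic solution really is a one-liner:
def adding_task(seq):
    return sum(v for v, m in seq if m == 1)

# A short example sequence; an RNN must carry the two marked values
# across the whole sequence, which is what makes it a useful probe.
seq = [(0.5, 1), (0.7, 0), (0.1, 0), (0.25, 1), (0.9, 0)]
print(adding_task(seq))  # 0.75
```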
In that sense, I think their choice of toy task is fine. They're just pointing out that position is a feature that's currently overlooked in the many architectures that are heavily position-dependent (they showed much better results on Faster R-CNN, for example).
This work shows that the network architecture is a strong enough prior to effectively learn this set of tasks from a single image. Note that there is no pretraining here whatsoever.
More to your point, I think a big problem with toy tasks is not so much the tasks as the datasets. A lot of datasets (particularly in my field, geometry processing) have a tremendous amount of bias towards certain features.
A lot of papers will show their results trained and evaluated on some toy dataset. Maybe their claim is that using such-and-such a feature as input improves test performance on such-and-such problem and dataset.
The problem with these papers often comes when you try to generalize to data that is similar but not from the toy dataset. A lot of applied ML papers fail to even moderately generalize, and the authors almost never test or report this failure. As a result, I think we can spend a lot of time designing over-fitted solutions to certain problems and datasets.
On the flip side, there are plenty of good papers that do careful analysis of their method's ability to generalize and solve a problem, but when digging through the literature it's important to be judicious. I've wasted time testing methods that turn out to work very poorly.
In science it is crucial to try many approaches that fail, not only approaches we are sure will work. So yes, it's good that they burnt 100 GPUs on a problem where the method didn't help. In fact, that's a much better standard than most deep learning papers I read, which focus mostly or only on the problems where their architecture wins.
Plus, it works for object detection, so it's not a "MNIST-only trick".
I wonder if the recent rapid growth of ML has dramatically diluted the reviewer pool. Are there so many papers submitted, and so many green reviewers, that crap gets through more easily? Is there a growth limit for a field beyond which its review teams inevitably become overly diluted with green researchers?
(Has this paper even been peer-reviewed? If it hasn't been peer reviewed there is a good chance it is crap just by the law of averages -- most "academic" papers are crap. There is a reason the top venues that I was involved with have a rejection rate upwards of 80%.)
There was a blow-up a couple of months ago around this issue. Someone posted to r/machinelearning saying they were starting grad school in the fall and had been accepted as a reviewer for NIPS 2018.
The Twitter convo about it generated some press, with the pull quote "It's 'peer review', not 'person who did 5 TensorFlow tutorials review'".
In my opinion it wasn’t particularly significant enough a result to publish, but writing takedown pieces like this feels petty and contemptuous to me.
The author could have made his point in a more diplomatic way.
"you don't understand how science works" - this is attacking a person, not an idea.
The blog post:
"Perhaps this would be less shocking, if they'd sat down and instead of jumping straight to Tensorflow, they could realize" [...]
"They apparently have spent too much time staring into the progress bars on their GPUs to realize they are praising something obvious, obvious to the point that it can be constructed by hand in several lines of python code."
This makes assumptions about the authors, and all but calls them idiots. Those paragraphs drip with sarcasm, of which one can only assume you're aware and which you intended. You made it personal, and that's exactly what the GP is noting when they call your blog post a "hit piece".
Yes, people have used explicit coordinates as features before. No, this paper isn't going to radically change the world, but if you're arguing from "science", that _doesn't matter_ at all. Science is full of rediscovery and duplication, and tolerates it just fine. What matters most is that we filter out things that are wrong -- and I don't think that's obviously the case with this paper. "Trivial" is a subjective determination, and while one part of the job of refereeing a journal or conference is to try to rank things as a service to the audience, it's not the most important aspect of a reviewer's job.
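For reference, the hand-built solution to the paper's coordinate-transform toy task (the "several lines of python" the blog invokes) really is short. The grid size and the (row = y, col = x) convention below are my assumptions; the paper's point is that a deconvolutional network struggles to *learn* this mapping, not that it is hard to write down:

```python
import numpy as np

def coords_to_onehot(x, y, size=64):
    """Map integer Cartesian coordinates to a one-hot pixel grid.

    A sketch of the hand-constructed solution to the toy task:
    a size-by-size grid with a single 1.0 at (row y, column x).
    """
    grid = np.zeros((size, size))
    grid[y, x] = 1.0
    return grid

g = coords_to_onehot(3, 5, size=8)
print(g.shape, g[5, 3])  # the single hot pixel sits at row y, column x
```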
Just because you took a lot of bullshit doesn't mean it's OK. It's not OK if people were jerks to you in this way, and it's not OK to pass it on.
But OTOH, I agree that the current "hype" around deep learning, accompanied by the beginning of a "DL winter" in revolutionary papers, means that academics and companies stuck in a "publish or perish" state of mind end up in a rush to publish even the smallest of modifications/enhancements.
I understand that I'm arguing both sides of the table here, but at the end of the day I'd rather have these papers published than not, as long as they end up in the public domain and can be viewed more as experimental papers than purely theoretical ones.
I can try a lot of approaches that won't improve results :) There needs to be a strong justification for thinking it would work and a high cost in trying it for this to be a useful approach to writing papers.
So now we know what not to do. That's valuable.
So what if it's not the best theoretical paper. This screed rehashes criticisms that are well known among researchers in the field. Overall it reads like a kind of egotistical hit piece. Personally I'm glad Uber published it.
There's no reason to spend lots of brain cycles on low-value theory and possibly wrong estimates instead of lots of GPU cycles on a conclusive experiment.
My experience is that when the two are combined you'll get much faster scientific progress coupled with software engineers that have much better problem solving vocabularies. Engineering seems to inject more imagination and urgency to the scientific bits of the work. And you need engineers that have the scientific vocabulary to lift their work to a more scholarly level.
Much scientific publishing is junk. It doesn't carry its own weight in that it provides an insufficient delta in knowledge to be worth the time it takes to read.
Likewise, much code that is written is junk in that the developer used the first method (or only method) they could think of to solve a given problem due to having a limited toolchest for problem solving. Often not even knowing which exact problem they are solving.
Don't shit on engineering papers. It benefits both those who think of themselves as pure scientists and engineers.
But then I got to this "Why is Uber AI doing this? What is the point? I mean if these were a bunch of random students on some small university somewhere, then whatever. They did something, they wanted to go for a conference, fine. But Uber AI?" and had to wake myself up. Seriously? This is from Uber? This just screams cargo cult AI.