

Surpassing Human-Level Performance on ImageNet Classification [pdf] - fchollet
http://arxiv.org/pdf/1502.01852v1.pdf

======
svantana
Interesting, I've long wondered why parametric nonlinearities aren't used
more often. They add very little to the overall parameter count (the number of
units is most often dwarfed by the number of connections), but they should
increase expressiveness a lot (e.g. adaptively using soft or hard
nonlinearities). Taken to its extreme, I've been toying with the idea of
combining DNNs and genetic programming: using a large number of arbitrary
symbolic expressions with high connectivity and many layers.
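For reference, the parametric nonlinearity in the linked paper (PReLU) is about the simplest case: a single learned slope on the negative half-line. A minimal numpy sketch — the 0.25 initialization follows the paper, everything else here is illustrative:

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: identity for x > 0, learned slope `a` for x <= 0."""
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    """Gradient of PReLU w.r.t. the slope parameter: x on the negative side, 0 elsewhere."""
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
a = 0.25  # the paper initializes the learnable slope at 0.25
print(prelu(x, a))  # negative inputs scaled by a, positive inputs passed through
```

Because the slope gradient is just the (negative-side) input, the extra training cost per unit is negligible, which is part of the paper's argument.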

~~~
fchollet
Genetic programming is an inefficient search method, and will require many
evaluations of the cost function to optimize anything. In the case of DCNNs,
evaluating an architecture can keep a modern GPU busy for days, so genetic
algorithms are pretty much out of the question.

I think an easy way to improve our models in the short term is to make more of
the parameters we use be learnable: the parameters of the non-linearity are a
good place to start with, and another would be the parameters of the data
augmentation transformations.
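As a toy illustration of the second idea, here is a hypothetical sketch where a gain/bias input transform t(x) = g*x + b is treated as a learnable "augmentation" and fitted by gradient descent on the task loss (pure numpy; the setup and names are made up for illustration, not taken from the paper):

```python
import numpy as np

# Learnable "augmentation": t(x) = g*x + b, with g and b trained jointly
# with the task loss while the downstream weights stay fixed (toy setup).
rng = np.random.default_rng(1)
x = rng.normal(size=(64, 4))           # toy inputs
w = np.array([1.0, -2.0, 0.5, 3.0])    # fixed downstream linear "model"
y = (2.0 * x + 1.0) @ w                # targets as seen through an ideal transform

g, b = 1.0, 0.0                        # augmentation parameters, learned below
lr = 0.01
for _ in range(500):
    err = (g * x + b) @ w - y          # residual of the squared loss
    g -= lr * np.mean(err * (x @ w))   # d loss / d g
    b -= lr * np.mean(err) * np.sum(w) # d loss / d b
print(round(g, 2), round(b, 2))        # converges toward g=2, b=1
```

A real augmentation pipeline (crops, flips, color jitter) would need each transform to be differentiable in its parameters, but the training loop is the same shape.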

One could consider that learned data augmentation schemes implement a form of
guided visual attention.

------
Animats
That's an impressive result, considering how simple the algorithm really is.
(The learning algorithm isn't obvious, but it's not a lot of code.)

Can this algorithm be run in reverse, to generate an image from the network?
That's been done with one of the other deep neural net classifiers. "School
bus" came out as a yellow blob with horizontal black lines.
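The usual way to "run the network in reverse" is gradient ascent on the input: hold the weights fixed and optimize the pixels to maximize a chosen class score. A toy numpy sketch with a stand-in linear "network" (not the method from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))   # stand-in "network": 8-pixel image -> 3 class scores

def class_score(img, c):
    return W[c] @ img

# Gradient ascent on the input image to maximize the score of class 0.
img = np.zeros(8)
for _ in range(100):
    grad = W[0]               # d(score)/d(img) for this linear model
    img += 0.1 * grad
print(class_score(img, 0))    # score grows as the "image" is optimized
```

With a real deep net the gradient comes from backprop to the input instead of a closed form, and some regularization on the image is needed to get anything recognizable.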

An interesting question is whether there's a bias in the data set because
humans composed the pictures, and humans like to take pictures of certain
things. (Cats are probably over-represented.) Images taken by humans tend to
have a primary subject, and that subject is usually roughly centered in the
image. It might be useful to test against a data set taken from Google
StreetView images, which lack such a composition bias.

~~~
dEnigma
_Can this algorithm be run in reverse, to generate an image from the network?
That's been done with one of the other deep neural net classifiers. "School
bus" came out as a yellow blob with horizontal black lines._

Do you have a link for that? Sounds interesting (nothing turned up in a quick
Google search).

edit: I found something similar to what you were talking about, is this[1]
what you meant?

[1][http://www.evolvingai.org/fooling](http://www.evolvingai.org/fooling)

~~~
Houshalter
That's the paper he's referring to. They used another NN to generate images,
and selected the ones that the first NN predicted to be school buses the most.

------
dwiel
It seems that this might mean that ImageNet is becoming less useful as a
benchmark dataset. Some of the images have labels I would never have guessed
myself, and not because I don't know the difference between two types of
stingray, but because I would never have said that the topic of the image was
a seatbelt. It would be interesting to know how many errors are actually due
to outputting a label which does exist in the image but isn't in the labeled
truth data, and how many are plainly not in the image.

~~~
fchollet
The accuracy reported is top-5 accuracy, meaning that a model is considered
correct on a test image if it includes the expected label in its top 5
predicted labels. This does mitigate the multi-object issue quite a bit.
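Concretely, top-5 accuracy can be computed like this (a minimal numpy sketch, not the official evaluation code):

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (n, num_classes); labels: (n,). A prediction counts as correct
    if the true label appears among the 5 highest-scoring classes."""
    top5 = np.argsort(scores, axis=1)[:, -5:]
    return np.mean([labels[i] in top5[i] for i in range(len(labels))])

scores = np.array([[0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.05, 0.6],
                   [0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.60, 0.7]])
labels = np.array([4, 1])  # first true label is in its top 5, second is not
print(top5_accuracy(scores, labels))  # 0.5
```

So an image of a restaurant that the model labels "plate, table, wine, diner, menu" is still scored wrong only if "restaurant" misses all five slots.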

~~~
dwiel
Most of the examples cited in the paper that their algorithm got wrong, I
would also have gotten 'wrong.' I don't think I would guess restaurant for any
of the images with that label; I might have gotten the middle spotlight and
the first letter opener right, but I'm not sure.

How do you explain that their performance is better than human? Is it in the
obscure examples?

~~~
greeneggs
On the other hand, some of their correct images have questionable captions as
well. For example, they labeled the geyser picture correctly, but their top
labels also included "sandbar", "breakwater" and "leatherback turtle". A
better scoring function, perhaps including hierarchies to account for the very
vague "restaurant" photos and very specific dog breed photos, might be
helpful. Otherwise, it seems like we might be overfitting to the peculiarities
of this dataset.

