
A New, Deep-Learning Take on Image Recognition
http://research.microsoft.com/en-us/news/features/spp-102914.aspx
======
karpathy
The fact that RCNN does a wasteful and naive feature re-computation for every
object-like region in the image is well-known in the community, and there are
several papers addressing the shortcoming in very similar ways (i.e.
precomputing pool-5 features for the entire image once, instead of recomputing
from scratch for each individual box). The Overfeat detection paper [0] was
among the first, for example, and it achieved similar speedups to those
discussed in the article, though the accuracy is not as great for various
other reasons.

This article makes it sound like Spatial Pyramid Pooling is something new and
amazing, but in fact this idea has been around in Computer Vision for a very
long time (starting with this 2006 paper [1], which has now been cited almost
4000 times) and it has been successfully applied to pool descriptors over the
image into fixed-size representations, ready to go into an SVM.

But this was the first time the basic concept was used as a layer in a ConvNet
(by having variable-sized pooling bins) and it's nice to see the numbers.

[0] [http://arxiv.org/abs/1312.6229](http://arxiv.org/abs/1312.6229)

[1]
[http://web.engr.illinois.edu/~slazebni/publications/cvpr06b.pdf](http://web.engr.illinois.edu/~slazebni/publications/cvpr06b.pdf)
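The "variable-sized pooling bins" idea is easy to see in a toy sketch: bin edges scale with the input, so any feature-map size pools down to the same fixed-length vector. A minimal NumPy illustration (the function name and pyramid levels here are made up, not taken from the SPP-net paper):

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a CxHxW feature map into a fixed-length vector,
    regardless of H and W, via a pyramid of adaptive grids."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:  # an n x n grid at each pyramid level
        # bin edges adapt to the input size (variable-sized bins)
        h_edges = np.linspace(0, h, n + 1).astype(int)
        w_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, h_edges[i]:h_edges[i + 1],
                                        w_edges[j]:w_edges[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # C values per bin
    return np.concatenate(pooled)  # length = C * sum(n*n for n in levels)

# Two different spatial sizes yield the same output length,
# which is what lets arbitrary crops feed a fixed-size classifier:
a = spp_pool(np.random.rand(256, 13, 13))
b = spp_pool(np.random.rand(256, 9, 17))
assert a.shape == b.shape == (256 * (1 + 4 + 16),)
```

This is the same trick as the 2006 spatial pyramid over hand-crafted descriptors, just applied to conv-feature maps inside the network.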

~~~
strebler
My thoughts exactly - not to mention the fact that [1] was based on the
earlier Spatial Pyramid work of Grauman & Darrell.

The language in the article is a bit zealous - Microsoft did well, but didn't
"win" any challenge in ImageNet. VGG and Google seem to be the ones people are
really watching / emulating.

~~~
tsiki
While Google's submission this year was good, it didn't seem to contain any
particularly inspiring ideas, as it was mostly just a collection of small
improvements on previous work. Personally, I find SPP-nets the most
interesting idea to come out of this year's competition.

------
amelius
I'm curious: what kind of computational challenge are we talking about? Does
image recognition take 1 second on a smartphone, or 2 hours on a
supercomputer?

~~~
kastnerkyle
Yann LeCun gave a webinar for NVidia a few weeks ago[1] which used a fairly
high-powered laptop to do real time recognition of objects from a video feed.
Overfeat[2] has also had this for a while.

There are some projects which can run pretrained deep nets in a few seconds on
a smartphone, but they are definitely not "real-time" to my knowledge, though
NVidia's upcoming unification of Tegra with their desktop line (and CUDA
support!) may change this in a hurry.

sklearn-theano[3] doesn't support the "video stream" classification that
Overfeat does, but we are planning to add it in the near future because it is
awesome :)

[1]
[http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=LeCun&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=+](http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=LeCun&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=+)

[2]
[http://cilvr.nyu.edu/doku.php?id=code:start](http://cilvr.nyu.edu/doku.php?id=code:start)

[3] [http://sklearn-theano.github.io/](http://sklearn-theano.github.io/)

