
A Revolutionary Technique That Changed Machine Vision - lelf
http://www.technologyreview.com/view/530561/the-revolutionary-technique-that-quietly-changed-machine-vision-forever/
======
kastnerkyle
I think it is key to remember that the accessibility of huge amounts of
labeled data is behind most of this. ImageNet is 1.2TB (~1.2 million images!).
Convolutional neural nets have been around for a long, long time (early 90s?
or maybe late 80s) and they just needed more data (and a few training tricks
:) ).

As we enter the world of "big data", more and more companies are locking data
of this size (or bigger) away in closed vaults. This is a competitive edge in
business, but it also strangles the ability of academics to do "real world"
research - the lack of which is often a huge criticism of academia at large.

Academic research can have a role in tech R&D if given a chance, so I hope
that companies will open their data to private researchers to continue this
kind of improvement - even if under secrecy clauses or some other safeguard.

All that said, the technological improvements have been astounding, and I hope
this is only the beginning! There have been some incredible results using
related techniques in NLP for machine translation recently, and speech has
always been a great playpen.

If anyone is interested in playing with these things in Python, another
researcher and I have recently created a library for using these types of
neural networks as black boxes, _without any training_, in a style similar to
scikit-learn. See [1] for more. We currently support only a few nets, but plan
to integrate support for caffe and pylearn2 in the near future.

[1] [http://sklearn-theano.github.io/](http://sklearn-theano.github.io/)
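
For a concrete flavor, here is a minimal sketch of the black-box idea. Hedged
heavily: only the OverfeatTransformer name comes from the library itself (it
is mentioned again below); the input shape/dtype conventions and the default
arguments are assumptions on my part.

    import numpy as np
    from sklearn_theano.feature_extraction import OverfeatTransformer

    # two stand-in RGB images at OverFeat's 231x231 input size (an assumption)
    X = np.random.randint(0, 255, (2, 231, 231, 3)).astype('float32')

    # the weights are pretrained and downloaded for you - no fit step needed
    features = OverfeatTransformer().transform(X)
    print(features.shape)  # one feature vector per input image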

~~~
agibsonccc
Great work guys! Are these going to only be computer vision nets?

I think the huge problem with most deep learning frameworks out there is that
they are either hard to use or limited in scope.

There's nothing wrong with that per se, but I'd be curious to see what you
guys intend. It's great that you guys are giving an sklearn interface to it.

Edit: I should add my criticisms come from a biased perspective: I write
[http://deeplearning4j.org/](http://deeplearning4j.org/)

I love comparing notes with others regardless.

Either way: Wish you the best of luck with it.

~~~
kastnerkyle
Support is planned for audio and _hopefully_ text - I am working on building a
million-song dataset to recreate the work Sander Dieleman did for Spotify, and
may have the weights of a trained speech network available in October! So yes,
feature extraction from other domains should be on the horizon.

We are specifically trying to make it easy to say: I want to transform an
image (or audio, etc.) using a pretrained net. Download the weights, extract
the features for me, and give me the feature vectors so I can do something
with those. This seems to be really, really, really hard in _all_ the tools I
have used, and usually involves training the network yourself, which is not
practical for things on the scale of ImageNet.

Of course, having nice examples and good docs is one of the great parts of
scikit-learn, and is actually one of the things I have been working on most
recently. Our docs aren't to that level yet, but I hope they can be one day.

DeCAF (precursor to Caffe) and OverFeat binaries were really kind of the first
in this regard (about 1 year ago now), but IMO one of the limitations is that
they lose interaction with the rest of the Python ecosystem for data munging,
and simple algorithms for exploration. By wrapping the weights, we hope to
leverage the support of the Python ML ecosystem easily, while still being able
to use the power of these networks.

Right now the most compelling use case is as part of a scikit-learn pipeline,
i.e. make_pipeline(OverfeatTransformer(), LinearSVC()) or whatever. Feed
images in, get predictions out. I am also working on a demo of "writing your
own twitter bot" similar to
[https://twitter.com/id_birds](https://twitter.com/id_birds), written by
Daniel Nouri. I like sloths, so it will definitely be a slothbot.
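
Roughly, that pipeline would look like this (a sketch: the class names are the
ones mentioned above, but the stand-in data and default arguments are
assumptions):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn_theano.feature_extraction import OverfeatTransformer

    # stand-in data: a few random "images" with binary labels
    X = np.random.randint(0, 255, (4, 231, 231, 3)).astype('float32')
    y = np.array([0, 0, 1, 1])

    # the pretrained net is a fixed feature extractor; only the SVM is fit
    clf = make_pipeline(OverfeatTransformer(), LinearSVC())
    clf.fit(X, y)
    print(clf.predict(X))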

I also hope to support recurrent architectures from Groundhog
([https://github.com/lisa-groundhog/GroundHog](https://github.com/lisa-groundhog/GroundHog))
as several researchers here at the LISA lab have been
using it to get pretty amazing results in NLP and audio, both of which are
potential targets in the future. If we can leverage their work, it would be a
very nice way for people to immediately play with SOTA architectures in
different applications.

In any case, just loading in weights and extracting image features easily is
nice, and was a benefit for both Michael and myself in a research project this
summer.

~~~
agibsonccc
Great to hear! That's exactly what I'm trying to replicate as well. I'm mainly
trying to do it for industry myself. Not a lot of people like making this
stuff for the JVM ecosystem (understandable, of course... I love Python as
well).

I agree about caffe as well. The Python ecosystem is amazing and should be
leveraged, which also increases adoption.

As I said before, being able to do this at scale for people whose data is
stored on the JVM should help make it more accessible to a lot of people.

Re: Twitter bot. This looks really cool!

I'll be keeping an eye on developments here. Good stuff!

------
pavlov
It's interesting that, in the image showing the classes easiest and hardest
for the algorithm, all the easy ones are animals and all the difficult ones
are human-made artifacts.

Does nature tend to create forms which are easily detected by relatively
simple neural networks? The evolutionary explanation could be that this allows
animals with primitive neural systems to more easily distinguish other members
of their species.

~~~
ma2rten
I think it's because these human-made objects have no single form. For
instance, "letter opener" describes the function of the object, not the form.
Compare that to "red fox", which is always gonna look more or less the same.

~~~
kastnerkyle
Yes, but the network is still differentiating between specific breeds, which
means it is learning traits of the animal that are unique to the breed. And
training at a higher level is perfectly doable, say "fox" instead of the fox
breed.

Ultimately, if you are able to look at an image and say "letter opener" there
are features which differentiate this from a knife/can opener/whatever - these
are the exact things a convolutional neural network should be able to use (in
theory) and has nothing to do with the label, which is typically unimportant
as long as it is unique and accurate.

We could flip all the labels around and still get unique answers - the network
is just learning a mapping from input -> some integer, and I would argue the
variance in dog breeds and lighting in natural scenes is much trickier than
the angle/shape of a letter opener.

I still think it comes down to the composition of this particular dataset.
Augmenting this with images scraped from online stores would be very
interesting as it is fairly trivial to get huge numbers of images for anything
that is typically sold online - I think Google is way ahead on this one!

~~~
bunderbunder
It's impossible to tell from the examples given in the article, but I wouldn't
be surprised if the same classifier that gets 100% on "Blenheim Spaniel" and
"Flat-coated Retriever" gets less than 100% on "Dog".

It's a question of how visually coherent the category you're trying to learn
is. From a purely visual perspective, the first two categories are relatively
tightly bunched in the state space, whereas "dog" covers a diffuse cloud of
appearances whose total range might even encompass the area where many non-dog
animals also lie. Humans may rely on some additional semantic knowledge about
different kinds of animal to produce an accurate classification. It's not
entirely unlike how determining the meaning of the words in the phrase "eats
shoots and leaves" can't be done reliably without incorporating contextual
clues such as whether we were just talking about pandas or a murder in a
restaurant.

There may also be issues around how distinct the categories are from each
other. A couple years ago yours truly picked up a letter opener off the table
and used it to spread butter on his toast, much to the amusement of his hosts.

~~~
kastnerkyle
In practical use, you can simply search for anything in the "dog" subclass
using the WordNet hierarchy... so there is no loss in accuracy unless you have
confusion _across_ the search groups! We actually support this in sklearn-
theano - if you plug in 'cat.n.01' and 'dog.n.01' for an OverfeatLocalizer we
return all matched points in that subgroup.
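
Something like the following sketch. The WordNet traversal uses NLTK, which I
am sure of; the match_strings keyword for OverfeatLocalizer is an assumption
on my part, as is the stand-in image:

    import numpy as np
    from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')
    from sklearn_theano.feature_extraction import OverfeatLocalizer

    # collect every synset under dog.n.01 via the hyponym relation
    dog = wn.synset('dog.n.01')
    dog_synsets = set(dog.closure(lambda s: s.hyponyms())) | {dog}
    print(wn.synset('golden_retriever.n.01') in dog_synsets)  # True

    # stand-in image; match_strings= is an assumed argument name
    image = np.random.randint(0, 255, (231, 231, 3)).astype('float32')
    points = OverfeatLocalizer(match_strings=['dog.n.01']).predict(image)
    # points: pixel locations matched to anything in the dog subgroup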

In general, if you misclassify "dog" for a fixed architecture you will most
certainly misclassify "Blenheim Spaniel" and "Flat-coated Retriever" - the
two other classes are subsets of the first. The "eats shoots and leaves"
sentence is analogous to a "zoomed in" picture of fur - we don't know what it
_is_ but we are pretty sure what it isn't! This is still useful, and would
already get most of the way there for large numbers of fur colors/patterns.

I think the concerns you have are more important at training time, but I have
not seen a scenario where it has mattered very much. In general having good
inference about these nets is really hard, but I think your initial thought
about "dog space" ties in nicely to a post by Christopher Olah
([http://christopherolah.wordpress.com/2014/04/09/neural-networks-manifolds-and-topology/](http://christopherolah.wordpress.com/2014/04/09/neural-networks-manifolds-and-topology/))
- maybe you will find it interesting?

And yes it becomes really fascinating to extend your last thought to "optical
illusions" and other tricks of the mind - even our own processing has paths
are easily deceived and sometimes flat out wrong... so it is no surprise when
something far inferior and less powerful also has trouble :)

------
drpgq
Not to denigrate the techniques used here, but as a computer vision researcher
(face recognition in my case) I find it interesting how important good
labelled training and testing sets are. Some of my successes over the years
have come more from figuring out where to get good data than from good
computer vision techniques.

~~~
tzs
Can you ever generate artificial training and test data?

For instance, suppose I would like to be able to take a photo of a chess game
and turn that into a diagram of the position. I have no idea where I would get
natural photos of thousands or millions of chess games to use for training and
testing a chess piece identifying vision system.

Could I instead make 3D models of common chess set designs, and then generate
and render photorealistic images of chess positions to use for training and
test data for the vision system?

~~~
ertdfgcb
Probably, but it might not work very well under different conditions, like a
picture of a chessboard in Central Park vs. a chessboard in a library. This
could probably be solved with more renders, and even if it can't, the renders
would probably be a good starting point.
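
A toy sketch of the "more renders" idea (everything here is hypothetical: it
assumes you already have rendered piece crops with transparency and a folder
of background photos, and uses brightness jitter as a crude stand-in for
varied lighting):

    import random
    from PIL import Image, ImageEnhance

    def synthesize(piece_path, background_path, size=(64, 64)):
        # composite a rendered chess piece onto a real-world background
        bg = Image.open(background_path).convert('RGBA').resize(size)
        piece = Image.open(piece_path).convert('RGBA').resize(size)
        sample = Image.alpha_composite(bg, piece).convert('RGB')
        # random brightness jitter approximates lighting variation
        return ImageEnhance.Brightness(sample).enhance(random.uniform(0.5, 1.5))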

------
AndrewOMartin
There are two irresponsible sentences in the article, "In other words, it is
not going to be long before machines significantly outperform humans in image
recognition tasks." and "Or put another way, It is only a matter of time
before your smartphone is better at recognizing the content of your pictures
than you are.".

The irresponsibility is in seeing the existence of a technique to solve a more
complex version of a toy problem (e.g. find the location in this photo that
exactly matches pattern x), and inferring that the same technique, given more
power, will exhibit super-human behaviour.

It's a reasonable claim that, in the task as described, the technique
outperforms humans but that's about as exciting a claim as saying how much
quicker the latest supercomputer is at arithmetic than a human. The point
being that human object recognition isn't about labelling a scene with nouns,
but somehow instinctively knowing the relevant objects for a situation and, if
required, the appropriate situation-specific noun.

I therefore request the final sentence of the article be rewritten as "Or put
another way, it may only be a matter of time, funding and motivation before
your smartphone (with equivalent computing resources of the likes of Google
Inc. and University of Tokyo) can ascribe one or more nouns from a set of size
of order 1000 to regions of a photo, given that huge amounts of pre-processing
and man hours have been dedicated to the pre-processing of that exact set of
nouns to create a training set, that you take the photo in lighting conditions
similar to those of the training set, that you don't apply any filters, and
that the objects referred to by the appropriate nouns are neither small nor
thin, better than you.".

~~~
blauwbilgorgel
It is just extrapolating the error-rate reduction of the last few years. Spam
filters have become better than moderators at labeling spam only in the last
decade or so.

When computers first started to become faster than mathematicians this was
really a breakthrough. The same is happening now with object and speech
recognition.

The computer successfully completes a task. That this is not how humans
intuitively approach the same tasks is irrelevant to the accomplishment. What
if the results were only half as good, but the system behaved more like
humans - whom would that satisfy?

The state-of-the-art is capable of detecting far more than 1000 objects, does
not need labeled data, is robust to changes in light and does not care about
the camera used. No preprocessing of the data is needed; features are
generated automatically (preprocessing the target labels is a bit silly, BTW).

So yes, in the very near future, algorithms will be better security guards
than well... security guards.

~~~
AndrewOMartin
My point is that extrapolating error rate reduction only applies to this
tightly defined task.

You can only make claims about machines being better at "general" pattern
recognition when we make progress on the issue that's stopped all Cognitivist
General AI projects dead, which is that of situational awareness.

Arithmetic operations, spam detection and the task described in the article
have a much smaller, and static, problem space than most human activities. You
can demonstrably already knock up an automated-barrier style security guard.
However, I'd argue that there does not exist an algorithm or appropriately
weighted n-layer network that can handle all the ambiguity, countermeasures
and ill-defined or contradictory situations that human security guards, or
even just their object recognition capabilities, handle largely instinctively.

~~~
blauwbilgorgel
Do you think that computers are better at chess than humans? If yes, how does
this relate to pattern recognition? If not, what makes someone or something
better at chess while still losing against a computer? Is that a beautiful
move? Tactics? Irrational sacrifices to cause confusion?

Do you think that a machine's situational awareness cannot achieve or surpass
the level of a human? If not, what is holding machines back?

Why do you think that instinct produces more rational, consistent and correct
predictions? Are 100 security guards better than a single security guard at
dealing with ambiguities? Do you think an algorithm to detect fights, drug
dealers, and pickpockets from street cams cannot exist? What if an NN could
detect these cases faster and flag them to a human security guard for
action/no-action?

------
a_e_k
Can any machine vision researchers recommend a survey/overview paper or two
that would make a good technical introduction to these particular techniques?

~~~
kastnerkyle
I gave a talk at EuroScipy 2014; the back half has a ton of references and
resources to get started. See [1].

[1]
[https://speakerdeck.com/kastnerkyle/euroscipy2014](https://speakerdeck.com/kastnerkyle/euroscipy2014)

------
oftenwrong
Here is the challenge site:
[http://image-net.org/challenges/LSVRC/](http://image-net.org/challenges/LSVRC/)

The datasets can be found by clicking the links for each year's challenge.

------
m_ke
If you want to see a convolutional net in action, check out the demo at
[http://clarifai.com/](http://clarifai.com/)

------
graycat
What I see in the OP, via my web browser (Firefox 27.0.1), is just an ad in
the middle of the screen that I can't make go away. So, without some special
effort, say, grabbing and parsing the HTML, I can't read the content. Anyone
else have this issue?

~~~
inetsee
I have Firefox 32 on Kubuntu 14.04, and I had no problem clicking on the close
button on the ad.

~~~
graycat
Thanks. I couldn't see a button for closing the window. Apparently asking for
higher magnification of the Firefox window didn't give higher magnification of
the ad window; thus I didn't see the close button.

With your feedback, I tried again and did see the big X close button outside
of the main part of the ad window. The thing did close.

Read the article. Cute.

Thanks.

~~~
ars
Try ESC next time, it works on many of those.

