
Building a deeper understanding of images - xtacy
http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html
======
karpathy
I am one of the people who helped analyze the results of the mentioned ILSVRC
challenge. In particular, a week ago I ran an experiment comparing Google's
performance to that of a human, and wrote up the results in this blog post:

[http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/](http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)

TLDR is that it's very exciting that the models are starting to perform on par
with humans (on ILSVRC classification, at least), and doing so in a matter of
milliseconds. The post also links to our annotation interface, where you can
try to compete against their model yourself and see its predictions and
mistakes.

~~~
possibilistic
I hope I'm not too late to your thread to ask questions. I find your field
absolutely fascinating, though I've never taken my interest further than basic
undergrad image processing or simple machine learning. I wish I had more time
to follow all of my interests.

Anyway, this one particular gray area has been bugging me, and I've been
hoping to run into a researcher or someone that has the appropriate context to
clarify it for me. It's not a technical question, per se--it's a set of
questions that build up to an uncertainty related to model training, reuse,
and sharing. (I hope I'm asking the relevant questions...) I don't expect
answers to all of this as it would take way too much of anyone's time, but
maybe you can pick a few key points and discuss. I'm also really dying to get
feedback on the last part as to its feasibility.

1\. Do computer vision / modeling researchers typically share common training
and evaluation data sets, or is it mostly kept proprietary? If good data sets
do exist out in the open, do they undergo continual improvement, or are they
frozen? Are these training sets typically amenable to only one type of
training--i.e., black box, offline vs. online, etc.? Does it take a lot of
practice/skill to know how much to hold back for post-training evaluation? Do
you have an estimate for how much time and effort is involved in manually
annotating and curating these data sets?

2\. Once an algorithm is developed, does a model get trained under different
parameters? Does tweaking those parameters lead to vastly different results?
Are there typically distinct optima for a given classification task, or can it
vary? Does the training set have a big impact on the performance of the
algorithm? Which is more important, the training data or the algorithm?

3\. Once models undergo offline training, are they fast to run? Is there a
typical runtime complexity, or do different types of models operate
significantly differently under the hood (by principle or by computational
complexity)? Can you run an extremely complicated and robust model with a ton
of classification outputs in sub-millisecond time on commodity hardware?

4\. Can trained models be packaged and redistributed as a kind of "open
source"? Are there any obvious barriers preventing this, such as the existence
(or proliferation) of patents in your field? Do computer vision researchers
like to share their code / results? If it's not a common practice to share
code, would a large number of researchers be downright opposed to having their
(patentable) algorithms and (copyrightable?) models made available to others?

5\. Are trained models too complicated for the layman to use and produce good,
consistent results? (For our purposes, I would consider a layman to be someone
with at least some understanding of basic computer science, including exposure
to data structures and algorithms, and a little bit of mathematical
ability--none of it necessarily deep.)

6\. Highly related to the last question, would there be a lot of specialized
knowledge required to tweak model input parameters? Would these parameters
correlate to mathematical operations? Would there be any arcane and seemingly
arbitrary weights to adjust that are deeply encoded into the model itself?
(Not sure if I'm out in left field here or not.)

I think what I'm ultimately hitting at is that it would be _freaking awesome_
if there were a good set of robust pre-trained classifier models available for
the layman programmer. Models that continually undergo development as
improvements are made to training sets, algorithms, the literature--what have
you. Please tell me if this kind of thing already exists.

Anyway, I feel like I'm about to go on a long-winded talk about a technology
I've been imagining. It's something along the lines of a specification that
allows for the broad spectrum sharing of reusable, generic, containerized ML
and classifier training results with others. In the world I envision, you
would share trained models and import them just as you would code libraries.

Let me reiterate: if this sort of thing happens to exist in the wild already,
_please point me to it!_ I can think of tons of uses. :)

 _Note: replied to my own comment because original text was too long._

~~~
possibilistic
_(Note: Posted as self-reply because original comment was too long)_

I picture a website kind of like Github, or perhaps a language package index
like npm's. Except this website is concerned with classification instead of
code. You would see the following kinds of downloads:

    
    
      * Highly optimized, performant, pre-trained classifier models. 
        Perhaps numbering in the thousands if the site were popular. 
        You'd see a wide variety of classifiers: general ones,
        specific things like "dog breeds" and "celebrities" and "car types".
    
      * Common library code capable of running the models against user data.
        Available in a variety of languages. The model classifier format would 
        have to be generic enough that it can run under libraries in any 
        environment. 
    

There might also be downloads directed at supporting researchers. Stuff like:

    
    
      * Well put-together training data sets
    
      * Human-curated annotations, categorizations, ontologies, etc. 
        available as metadata that can be paired with the training sets in any 
        way desired. Not all of it may be useful for any given classifier. 
    

Do you know if there is already an existing "standard format" for encoding or
serializing pre-trained models for the purpose of sharing and exchanging them?
There are likely a few algorithm-specific serialization formats for persisting
internal graphs and weights and so forth. But some of these results are left
trapped entirely within the confines of the internal data structures of a
particular implementation...

In any case, I'm not aware of the existence of one universal format encoding
all the things. Because why would there be? What would be "standard" about it?
The algorithm space is rather wide and different algorithms encode different
things, so there wouldn't even be cross-cutting similarities to take advantage
of. It would be like inventing a file format for "text files and PNG images
and fonts together!". Arbitrary, pointless. An absurd idea.

Assume it's not pointless, though, and start forming a picture of a universal
data format or scheme for encoding and sharing all possible training results,
irrespective of the algorithm that produced them or the one required to
evaluate them. We can't just encode the "training result" alone, because we've
already established how useless that would be by itself. Instead, the
universal scheme would have to encode at least three things: a " _language
descriptor_ " written in an abstract machine language, the computer-generated
" _training result_ ", and the predetermined set of " _classification
results_ ". The descriptor's task is to bridge the training result and the
classification result when a user input is provided.

    
    
        descriptor(input, perception encoding) => classification result 
    
    

Apart from the user input, everything else is encoded into our data
serialization format. The "descriptor" would ordinarily have been some C or
MATLAB (or whatever) code. It's the part that would have told us "this picture
is of a KITTEN" or "this text was written by STEPHEN KING" given all the other
inputs. Now it is an encoding of an abstract circuit, state machine, or some
other language grammar. Notice also how this has become entirely self-hosting.

If there are other arguments, then,

    
    
        descriptor(input, A, B, C..., perception encoding) => classification result
    
    

Where metadata concerning the purpose, names, types, ranges, and defaults for
`A, B, C...` are also encoded in the data format. Classification types,
ranges, etc. must also be encoded,

    
    
        classification result ∈ (class P, Q, R...)
    
    

Instead of being compiled to a reduced representation and inlined into the
body of the " _descriptor_ ", they could be provided as a parameter, adding a
further degree of indirection. I won't show any further notation.

To quickly summarize again: the language descriptor parses the model, accepts
an arbitrary input (a classification set, possible parameters, and the subject
material), and ultimately produces an output.
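
To make that concrete, here's a minimal sketch in Python of the container I'm
imagining. Every name and field here is hypothetical--it's just the three
pieces above spelled out, with the descriptor's interpreter left to the client
library:

    
    
        from dataclasses import dataclass, field
    
        @dataclass
        class ModelContainer:
            descriptor: bytes            # abstract "machine language" program
            perception_encoding: bytes   # the computer-generated training result
            classes: list                # the predetermined classification results
            parameters: dict = field(default_factory=dict)  # name -> (type, range, default)
    
        def classify(container, user_input, **params):
            """Interpret the descriptor against the encoding and the
            input, returning one of container.classes."""
            raise NotImplementedError  # the client library supplies the interpreter
    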

By now you've likely noticed that none of this is technically different from
stuffing the executable of the classifier program itself into the
serialization of the training results. You'd be correct, of course. It might
seem arbitrary here, but I think I'll demonstrate a few nifty results later
on. Besides, I'm not really suggesting they be contained in the same file.

To make use of these abstractions, there would need to be some _client
libraries_ (C++, Java, Python, etc.) provided that make it trivially easy to
load and evaluate any of the classifiers from your own code. Since we went to
the trouble of encoding the aforementioned " _language descriptor_ " as an
abstract grammar, the whole classifier (training and all) can be hosted and
run from anywhere there is a client library provided, essentially making the
classifier available from any language. What's more, the client libraries
would not require constant updates to support new algorithm variations--the
algorithm is baked into the data format, so we get new capabilities for free
simply by swapping files.

    
    
               import pyclassifier 
    
               #include <classification>
    
               etc...
    
    
               classifier = classifiers.standardSvm()
    
    
               classifier = classifiers.load("my_novel_algorithm")
    
    
               classifier = classifiers.load("clustering_doe_et_al_09")
    
    

Another cool thing we could do is define shared sets of "classification
results". Instead of defining classes and categories and whatnot on a
situation-to-situation basis, perhaps we could draw from a global pool of
concepts and ideas pulled from the world. We can impart stable names to as
many different things and concepts as possible: classes like "CAR", "PERSON",
"BOY", "DOG", "TANK", etc. -- all designed to be globally unique and robust
identifiers that can continue to evolve over time without breaking our
classifier algorithms. A side benefit is that all classifiers would begin to
speak the same shared language. (Granted, not all classification outputs would
be amenable to this. Some would. Photos seem like a great use case.)
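
For instance (all names invented), the identifiers might be namespaced and
versioned so they can evolve without breaking classifiers that reference them:

    
    
        # Hypothetical namespaced, versioned class identifiers.
        CLASSES = {
            "core/PERSON@2":   "A human being",
            "core/DOG@2":      "Domestic dog, any breed",
            "akc/CHIHUAHUA@1": "Chihuahua, per the AKC breed standard",
        }
    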

Now, if we were to build that category database as an ontology database
instead... Perhaps you could begin to semantically infer things?

    
    
              {Dexter's Lab} implies {Cartoon}
    
              Might we infer the person who uploaded it is a 90's kid? 
    
    

Or for a more graph topology-based, semantic kind of result,

    
    
             {Velociraptor} implies {Dinosaur}
                            implies {Predator}
                            implies {Extinct Animals}
                            implies {Seen on film} {Jurassic Park}
    
             Coincident occurrence with
    
             {Person}, -> 
             {Man}, -> 
             {Sam Neill} 
    
             {Sam Neill} was {Seen on film} {Jurassic Park}
    
             We can probably be sure that you're looking at a still from {Jurassic Park} at this point.
    
    

I'm not claiming searching the graph like that would be efficient, of course.
But if you've got ontological overlap that is cheap to check, it might be
fun...
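
Here's a toy sketch of the kind of lookup I mean, with an invented implication
graph; a real ontology would be far bigger, but the overlap check itself stays
cheap:

    
    
        # Toy implication graph; every edge and label is made up.
        IMPLIES = {
            "Velociraptor": ["Dinosaur", "Predator", "Extinct Animals",
                             "Seen on film: Jurassic Park"],
            "Sam Neill":    ["Person", "Man",
                             "Seen on film: Jurassic Park"],
        }
    
        def closure(label, seen=None):
            """Everything a label transitively implies."""
            seen = set() if seen is None else seen
            for implied in IMPLIES.get(label, []):
                if implied not in seen:
                    seen.add(implied)
                    closure(implied, seen)
            return seen
    
        # Two detected labels implying the same film is strong evidence:
        print(closure("Velociraptor") & closure("Sam Neill"))
        # -> {'Seen on film: Jurassic Park'}
    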

The classifier ontology would be versioned, but probably very slow moving in
terms of changes. It might be impractical to package the entire database with
an app. With ontologies, you can plug in subsets and wire them to other ones
later on,

    
    
              "AKC-STANDARD-DOG-BREEDS_CLASSES-1.0.1"
    
              "CARTOONS-OF-THE-90S-0.1.0"
    
    

The pre-trained classifier data can be versioned. Say that someone else did
all the work of training it to recognize Bob Ross paintings or whatever.

    
    
              classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-0.1")
    
              classifier.classify(new Image("http://i.imgur.com..."));
    
    

Oops, someone forgot to include happy little trees. But we can fix that,

    
    
              // Happy Little Trees edition.
    
              classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-1.0")  
    
              classifier.classify(new Image("http://i.imgur.com..."));
    
              Produces results:
    
                 95% Bob Ross
                 100% Happy Little Trees
                 75% Happy Little Clouds
    
    

If this kind of tooling and ecosystem existed, _do you even know how much fun
I could have on Reddit?_

But in all seriousness, think of the practicality of reusability. Downloading
and running classifiers other people trained, from the language of your
choice? That's powerful and empowering. It takes the tech out of the realm of
"Google playtoy" and puts it in our collective hands.

Think of the kinds of novel apps that the Average Joe programmer could
develop. And if this type of thing truly got the support of the image
processing crowd, I can't fathom how much improvement we'd witness on a year-
to-year basis.

Does anything like this exist in your field for researchers now? If so, could
it be made to be usable by laymen? (Or does it already exist for general
audiences? Am I living under a rock?)

If this kind of thing doesn't exist or isn't shared, what steps could be taken
toward making something like this a reality? Are there critical gating pieces
that need to come together first in order to make all of it work? Or
conversely, do you feel strongly that something like this just isn't feasible?

One possible complication in building an "open source" set of classifiers is
the deep knowledge required to contribute. And what about patents? It's my
understanding that universities like to patent research (e.g., SIFT), and
AFAIK there must be broad coverage of this space by universities. That would
be a major setback.

Anyway, I've rambled on far too much. If anyone managed to read all of that,
please forgive me for inundating you with such a crazy, ill-informed, and
long-winded diatribe. I hope I made sense.

~~~
tlarkworthy
The neural architectures change, so a set of parameters for one network won't
run on another. Image size changes too, which changes the NN behind it. There
is not much work on mapping one set of parameters onto another. Transfer
learning might have a little applicability, but it's unlikely at the bleeding
edge of vision research.

Worth noting a neural network IS a general purpose function encoding.
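
For example (weights invented purely to make the point concrete), a trained
feed-forward net is nothing more than a fixed function of its inputs:

    
    
        import numpy as np
    
        # Made-up weights standing in for a trained network.
        W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
        W2, b2 = np.array([[1.0], [-1.0]]), np.array([0.5])
    
        def net(x):
            h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
            return h @ W2 + b2              # linear output
    
        print(net(np.array([1.0, 2.0])))
    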

------
colanderman
Now if only Google could develop a way to serve static text content without
using JavaScript!

(All I get is a B with twirling gears in it...)

~~~
gavinpc
This is a longstanding Blogger bug that happens when cookies are blocked. They
haven't fixed it because you and I are the only people on the planet who
whitelist cookies.

~~~
sp332
It's also incompatible with the Readability plugin. :(

------
botman
"typical incarnations of which consist of over 100 layers with a maximum depth
of over 20 parameter layers)" Anyone know exactly what that means? I'm
guessing that there are 100 layers total, 20 of which have tunable
parameters, and the other 80 of which don't--e.g., max pooling and
normalization.

------
mrfusion
That's pretty amazing. It seems like we're at a point where we could build
really practical robots with this?

Robots to do dishes, weed crops, pick fruit? Why isn't this being applied to
more tasks?

~~~
drcode
I would eat a "hat with a wide brim" if Google isn't going to release a robot
that can do basic household chores (laundry, dishes, dusting) within the next
3 years.

Google has been gobbling up robotics startups, and given how Google also loves
gobbling up personal data, having robot "boots on the ground" in every home
must be extremely appealing to them.

~~~
robotresearcher
I doubt it. The easy parts of laundry and dishes are already done by simple
robots sold in every white goods department. The remaining parts are very
demanding indeed.

Research labs are not robustly demonstrating these capabilities yet, even with
very expensive robots.

~~~
bfung
I had this thought the other day - not robots washing the laundry, but
folding it. After thinking about it, folding is actually not that simple.

~~~
robotresearcher
[https://www.youtube.com/watch?v=gy5g33S0Gzo](https://www.youtube.com/watch?v=gy5g33S0Gzo)

Work is progressing. That's a $300K robot.

(edit: no affiliation. Video shows PR2 robot at Berkeley folding towels
competently but very slowly in 2010)

------
hyperion2010
I wonder whether some of the intermediate layers in these models might
correspond to something like "living room" or other locations that provide
additional information about the objects that might be in the scene. For
example, I suspect it was much easier for me to identify the preamp and the
wii in one of the pictures because I knew it was a living room/den instead of
an office or study.

~~~
pdenya
In this case, no, they didn't have pre-labeled location/setting available. You
can see one of the datasets they used here:
[http://image-net.org/challenges/LSVRC/2013/#data](http://image-net.org/challenges/LSVRC/2013/#data).

Generally speaking, neural networks are black boxes. The layers interact with
each other, but not in a defined categorical manner like that. Layer size and
depth are parameters you provide when setting up the network; they trade off
result accuracy, memory, and training time, much like JPEG quality settings.
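
For instance (hypothetical sizes, and a deliberately tiny network), the
parameter count--and with it memory and training time--scales directly with
the layer widths you pick up front:

    
    
        import numpy as np
    
        # Layer widths are a setup choice made before training.
        layer_sizes = [784, 256, 256, 10]  # input, two hidden, output
        weights = [np.random.randn(m, n) * 0.01
                   for m, n in zip(layer_sizes, layer_sizes[1:])]
        print(sum(w.size for w in weights))  # total tunable parameters
    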

------
Someone1234
I wish this were available as a translation app. You point your phone at a
fruit stand and it names every single item, and you can then ask the vendor
for the item by name.

It isn't that crazy; in fact, that's exactly what they have right now, just in
English only.

~~~
Igglyboo
Not exactly what you're talking about, but you should take a look at Word
Lens (which Google just bought).

Basically you hold your phone up and position it over a piece of text using
the camera. It then OCRs the text, translates it, and replaces it in realtime
in the camera feed. It's pretty remarkable.

------
MichaelAza
These classifications are amazing but the fact that the first image in the
article is classified as "a dog wearing a wide-brimmed hat" and not as "a
chihuahua wearing a sombrero" is telling of how far we are from true
understanding of images.

Only a human possessed of the relevant cultural stereotypes (chihuahua
implies Mexican; ergo, the hat must be a sombrero) could draw that conclusion.

Even so, I firmly believe that at this rate of improvement, we're not far from
that kind of deep understanding.

~~~
wodenokoto
I'm not sure how your example is more true.

------
joelthelion
How big is the model? Training these kinds of networks is expert work and
requires enormous infrastructure; but if they released the model, I'm sure
people like us could come up with all sorts of very useful applications.

~~~
gcr
If you're interested, Caffe
[http://caffe.berkeleyvision.org/](http://caffe.berkeleyvision.org/) comes
with some pre-trained models for ImageNet, which were close to
state-of-the-art a year or two ago.
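
A rough sketch of what using one of those models looks like, following Caffe's
bundled classification example (the file paths are placeholders for the deploy
definition and the downloaded weights):

    
    
        import caffe
    
        net = caffe.Classifier('deploy.prototxt',
                               'bvlc_reference_caffenet.caffemodel',
                               channel_swap=(2, 1, 0),  # RGB -> BGR
                               raw_scale=255,
                               image_dims=(256, 256))
    
        image = caffe.io.load_image('cat.jpg')
        probs = net.predict([image])[0]  # one score per ImageNet class
        print(probs.argmax())            # index of the top class
    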

