From UofT: http://arxiv.org/pdf/1411.2539v1.pdf
From Baidu/UCLA: http://arxiv.org/pdf/1410.1090v1.pdf
From Google: http://googleresearch.blogspot.com/2014/11/a-picture-is-wort...
From Stanford (our work): http://cs.stanford.edu/people/karpathy/deepimagesent/
Edit: And seemingly also from Berkeley http://arxiv.org/pdf/1411.4389v1.pdf
At least from my perspective, the main motivation behind this work is in thinking of natural language as a rich label space. As humans, we've very successfully used language for communication and knowledge representation, and I think our algorithms should do the same. This also enables much more straightforward I/O interactions between a human and machine because humans speak text fluently. They don't speak a set of arbitrarily devised and assigned labels.
So the hope is that in the future you can, for example, search your photos based on text queries, or ask the computer questions about images, and get answers back in natural language. There's a lot of exciting work to be done!
Have you thought about generating images by flipping the model?
This is really great work. This area may be the final step needed to bring us "Rosie the Robot".
Edit: Generating images is much harder and out of scope. What might be feasible is to perhaps stitch parts of existing images together. I'm not sure, that's an open problem and would make for an excellent SIGGRAPH paper.
Given images from a body-worn camera, it could provide a near real-time narrative of the scene. Maybe not to the same level of descriptive commentary as Amélie, but still potentially useful:
"There is a particularly memorable scene in the French movie “Amélie” in which the title character [...] sees an older man with a white cane waiting to cross a busy street. “Let me help you.”"
And they’re off, moving quickly through the neighborhood as Amélie provides a running commentary: The golden bust of a horse hanging over a shop is missing an ear; that laughing is the florist, who has crinkly eyes; there are lollipops in the bakery window. “Smell that! They’re giving out melon slices. We’re passing the park butcher, ham is 70 francs. Now the cheese shop — picadors are 12.90. Now we’re at the kiosk by the metro. I’ll leave you here. Bye!”
More likely, "glass, what am I looking at?", but still!
Do you have a comment on the diversity of approaches across the papers? (did everyone take more or less the same path to do the task?)
These works all present a different model, but it's not clear which works best, because they all popped up so quickly that we haven't really had a chance to settle and agree on common evaluation criteria. Since the task is so new, I chose the simplest/cleanest way to extend a Recurrent Network to condition on image data (through additive interactions). The others chose somewhat fancier models, but they don't have experiments showing that these necessarily perform better than the simpler models, or by how much. I don't yet know what Google did, but so far it looks like a model very similar to mine.
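To make the "additive interactions" concrete, here is a minimal sketch of one way an RNN step can be conditioned on a CNN image feature purely through addition: the image vector is projected and summed into the hidden-state update, with no multiplicative gating. All dimensions, weight names, and the initialization here are hypothetical illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 512-d hidden state, 256-d word embedding, 4096-d CNN feature.
H, X, V = 512, 256, 4096
Whh = rng.standard_normal((H, H)) * 0.01  # hidden-to-hidden weights
Wxh = rng.standard_normal((H, X)) * 0.01  # word-to-hidden weights
Wih = rng.standard_normal((H, V)) * 0.01  # image-to-hidden weights (the additive conditioning)
b = np.zeros(H)

def rnn_step(h_prev, x_word, v_image):
    # The image feature enters as just another additive term in the update.
    return np.tanh(Whh @ h_prev + Wxh @ x_word + Wih @ v_image + b)

h = np.zeros(H)
v = rng.standard_normal(V)      # stand-in for a CNN feature of the image
for t in range(3):              # unroll a few steps of caption generation
    x = rng.standard_normal(X)  # stand-in for the embedding of word t
    h = rnn_step(h, x, v)
print(h.shape)  # (512,)
```

The appeal of this form is exactly what the comment says: it is the smallest change to a vanilla RNN that lets image information influence every word of the output.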
Also, all of the other works operate on the level of entire images (as a single giant blob of 256x256x3 pixels), while my paper is the only one that also breaks images down into parts and objects, and there are parts of my paper dealing with that.
All that being said, a large fraction of queries like this will work _today_, with what we have; the challenge is simply implementing it. However, telling the difference between whether the wood logs are inside your backpack, on your head, or in your hands might take a few more years :)
Can you add an attribute to every potential item detection that includes a proportionality measure to something standardized?
Wouldn't this be a "simple" (that is, definable) calculation if every item that can be detected (meaning the item is already cataloged) has a standardized dimensional measurement in relation to the camera?
That's not the only information that you have to work with though.
If I understand correctly, object recognition is trained from a data set. That means that every classification is a map from a cluster of images to a single label.
This cluster of images can be quantified in terms of proportionality, in relation to other objects identified in the scene. We are beginning with the assumption we can compute the identification of individual objects when they are grouped. The next thing to train on is groups of objects as a unit structure, where we infer and learn a proportion relation.
Consider a macro shot of an ant with a baseball in the background and perhaps someone's finger, and a second shot of all three with only the distance to the camera altered. In the first, all three objects appear to be the same size. In the other, each pair of objects expresses a ratio. Then we can normalize vectors based on the three computed ratios (object a : object b). Absolute distance is not solved, but relative distance is.
Then, if you can build a catalog of standardized measurements (common objects with a defined distance from a camera, like a face 3 feet away from a camera), then you can start training for actual distances.
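The geometry behind this idea is the pinhole camera model: apparent size in pixels is roughly focal length times real size divided by distance. A sketch of both halves of the proposal, assuming the catalog numbers below (face height, focal length, the 3-foot calibration) are made-up illustrative values:

```python
# Pinhole model: apparent_size_px ~= focal_px * real_size_m / distance_m

def estimate_distance(real_size_m, apparent_size_px, focal_px):
    # Absolute distance, once we have a calibrated focal length.
    return focal_px * real_size_m / apparent_size_px

# Calibrate focal_px from one "standardized measurement" in the catalog:
# say a face ~0.16 m tall appears 200 px tall at a known 0.91 m (3 feet).
focal_px = 200 * 0.91 / 0.16

def relative_distance(size_a_px, size_b_px, real_a_m, real_b_m):
    # Ratio d_a / d_b for two recognized objects in the same frame.
    # The focal length cancels, so no calibration is needed at all.
    return (real_a_m / size_a_px) / (real_b_m / size_b_px)
```

This is consistent with the comment's point: relative distance falls out of apparent-size ratios alone, while absolute distance needs at least one cataloged reference measurement.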
I'm thinking about this like how I would reason about the objects in a raycaster: their definition in the machine versus what gets projected onto a 2d plane, and how moving the objects and camera affects the final projected plane. I'm certain I'm glossing over a lot of details that would actually be difficult in implementation, not to mention computational complexity.
> So there is less information in those locations than you might think.
I agree, but the only way I understand machine learning at all, as a computer scientist, is "gradual accumulation, definition, and relational organization of humanly defined atomic units". I don't think there are any magic tricks, just a lot of carefully constructed computation, even if it spans over generations of computers, groups, people.
This may make the NN more resilient.
I understand that this problem is interesting to work on, but seriously, this is obviously going to fall into the hands of people who will use it to make the world worse. If you don't think computer vision is going to be used to kill people and to take away freedom, you are lying to yourself.
For every positive use this could have there are far worse negative uses and it is so clear that they will happen.
There are so many other fields you could work in that make use of machine learning algorithms. Why would you work to advance a field that has so many obviously evil applications that will definitely happen?
If you don't like war or mass surveillance, stop voting for it. It's like blaming the Wright brothers for the city bombings of world war II.
Do you really believe that if these researchers didn't work on it, the technology just wouldn't be invented at all?
And I think the military applications are sort of overstated. We already have drones and effective mass surveillance. Machine vision can help, but it doesn't make or break it.
Furthermore, since this also opens some very positive perspectives, why not try and work together towards them? Whatever well-intentioned people do, the ill-intentioned would still have researched this and devoted it maybe more exclusively towards evil purposes.
That is like saying that the people who were working on nuclear fission are as responsible as the people who invented the process for creating the metal casing of the nuclear weapons used in Japan.
"Whatever well-intentioned people do, the ill-intentioned would still have researched this and devoted it maybe more exclusively towards evil purposes."
That is like saying you can do whatever you want because if you didn't do it someone else would. Maybe bad people would do this eventually, but maybe they wouldn't. I can tell you that most if not all of the people who want to use these technologies to do awful things are completely incapable of developing these technologies themselves. They have to get people like us to make them.
There is no question that computer vision is being used for military purposes and any advance will definitely be used by the CIA. It is completely ridiculous to think otherwise.
Sure this does have some positive uses, but the very clear danger that is posed by advanced AI and AI imaging techniques really makes this a field that should not be pursued. There is a high chance that this tech will have a net negative effect on the world.
There are many other techs that are much less dangerous to pursue that will provide positive benefits.
If surveillance were entirely automated, one could argue that it would be less likely to be corrupted by political or social influences. One could argue the other way too, but I'm certainly not persuaded that it is more negative than manual surveillance. David Brin has written on this and related topics extensively, and I think there are more subtleties here than it is popular to acknowledge.
As for autonomous robots: I think there are plenty of dangerous tasks that could be given to more intelligent robots, and I'm fairly sure this would be a net good.
Additionally there are many fields where this technology will be used that are almost completely positive. Medical imagery, search and rescue, and many fields of science are obvious applications.
Like most scientific fields, this undoubtedly has both positive and negative outcomes. How those are judged probably says as much about a person's political and social views as it does about the absolute moral view of the technology.
I believe technology can be a force of good in the world. The fact that evil people do evil things with it does not negate the potential beauty of a world of abundance. Fear of the negative impacts of technology will not promote progress towards a better world.
TL;DR: I believe in techno-utopia.
I felt I was moving among two groups [literary intellectuals and scientists] comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who had almost ceased to communicate at all, who in intellectual, moral and psychological climate had so little in common that instead of going from Burlington House or South Kensington to Chelsea, one might have crossed an ocean. — Baron C.P. Snow
The Two Cultures: The Rede Lecture (1959)
There is only one major application large enough to take advantage of large-scale research into video interpretation of this genre, and it is surveillance. If you look at the changes in recent society, we have little alternative but to expect further global militarization of air, space, and sea under the benevolent governance of large corporations and pretend-democratic nation-states. As population rises, quality of life deteriorates, and pressure on resources expands still further, in all probability this work will power their tools of oppression, first and foremost.
Nuclear fission, genetic engineering of viruses, identifying a gay gene/gay fetuses, advanced AI approximating human intelligence. These are very dangerous or ethically questionable avenues of research.
You don't have infinite time on this earth. There are paths of research that have a high chance of making the world worse. Why would you want to look back on your contribution when you are older and think, "well, I totally fucked the world up for everyone"?