
Hi all, I'm the author of the Stanford article, so I can answer any more detailed/technical questions. The article doesn't mention this, but for anyone interested, closely related work I've only very recently become aware of includes:

From UofT: http://arxiv.org/pdf/1411.2539v1.pdf

From Baidu/UCLA: http://arxiv.org/pdf/1410.1090v1.pdf

From Google: http://googleresearch.blogspot.com/2014/11/a-picture-is-wort...

From Stanford (our work): http://cs.stanford.edu/people/karpathy/deepimagesent/

Edit: And seemingly also from Berkeley http://arxiv.org/pdf/1411.4389v1.pdf

At least from my perspective, the main motivation behind this work is in thinking of natural language as a rich label space. As humans, we've very successfully used language for communication and knowledge representation, and I think our algorithms should do the same. This also enables much more straightforward I/O interactions between a human and machine because humans speak text fluently. They don't speak a set of arbitrarily devised and assigned labels.

So the hope is that in the future you can, for example, search your photos based on text queries, or ask the computer questions about images, and get answers back in natural language. There's a lot of exciting work to be done!

How feasible would it be to run this neural network algorithm on a single machine?

Have you thought about generating images by flipping the model?

This is really great work. This area may be the final step needed to bring us "Rosie the Robot".

Extremely feasible. The model predicts a sentence for an image in ~3ms on a GPU, and ~100ms on CPU. Convolutional Networks, Recurrent Networks, in general most neural nets are very efficient at test time. The training is the somewhat expensive part, but for example I've trained all of the models in the paper on an average (CPU) cluster machine in about one day. Of course, this assumes a pretrained image Convolutional Network that was trained previously on ImageNet - that part can take a day to get to 90%, and then a week more to get the last 10% of cutting edge performance.

Edit: Generating images is much harder and out of scope. What might be feasible is to perhaps stitch parts of existing images together. I'm not sure, that's an open problem and would make for an excellent SIGGRAPH paper.

What kind of GPU - Intel integrated or high-end Nvidia?

A CUDA capable GPU.

This sort of work seems like it could go hand-in-hand with the recently reported Microsoft 3D Soundscape for blind people [1].

Taking images from a body-worn camera, it could provide a near real-time narrative of the scene. Maybe not at the same level of descriptive commentary as in Amélie, but still potentially useful:

"There is a particularly memorable scene in the French movie “Amélie” in which the title character [...] sees an older man with a white cane waiting to cross a busy street. “Let me help you."

And they’re off, moving quickly through the neighborhood as Amélie provides a running commentary: The golden bust of a horse hanging over a shop is missing an ear; that laughing is the florist, who has crinkly eyes; there are lollipops in the bakery window. “Smell that! They’re giving out melon slices. We’re passing the park butcher, ham is 70 francs. Now the cheese shop — picadors are 12.90. Now we’re at the kiosk by the metro. I’ll leave you here. Bye!”

More likely, "glass, what am I looking at?", but still!


Why did all of these papers come out today? Did some sort of deadline/embargo lifting pass today?

This Friday was the deadline for a big Computer Vision conference (CVPR), so effectively you're seeing what everyone has been working on over the last half a year :) Some authors choose to publish right away to arxiv, which is why you're seeing this influx right now. Some authors choose to upload to arxiv later, and some choose to never upload and wait for the whole review process to finish in some number of (long) months. I think the community is slowly deciding that the field and the ideas evolve too fast, much faster than the length of the review process. For example, I'm about to present my work from March at NIPS, happening in a few weeks. I've almost forgotten what I did, and I deprecated my NIPS model 3-4 times. It's not a chance to present cutting edge research, it's a chance to talk to your friends/colleagues and stand next to your poster awkwardly, trying to sell a model that you now know doesn't work very well.

Ah, no wonder, thanks! It still seems like quite the coincidence that so many groups were independently (?) working on describing images with CNNs hooked up to RNNs, though. I guess it's just an idea whose time has come?

So there goes my productivity this week. Amazing work, all of it! Is there any kind of simple setup with which we can test any of this? Python, numpy, anything along those lines?

Not yet. I don't know about the other papers but I really like to release code and make as much of my stuff reproducible as possible. I still have to clean a lot of it up (when you're doing research and rapidly iterating things get a bit messy), but it's based on Python and numpy and I plan to make it available sometime in December.

How can we be notified of your release? Do you have a github that I can follow? Thanks!

This might be the platform for my re-entry into machine learning. So I appreciate your efforts to publish code ahead of time.

It's incredible to see how useful LSTM networks are proving in so many different experiments!

From a bystander point of view this looks amazing!

Do you have a comment on the diversity of approaches across the papers? (did everyone take more or less the same path to do the task?)

Thanks! It's really not that crazy. Normally you're given an image and predict a label. For example ImageNet has 1000 labels. Here the idea is exactly the same, but we're predicting one of ~20,000 words one by one in sequence until a special terminating word is predicted (think of it as predicting the dot at the end of a sentence), at which point the RNN stops generating. There's a little more to it but that's the gist.

These works all present a different model, but it's not clear what works best because they all popped up so quickly and I think we didn't really have a chance to settle and agree on common evaluation criteria. Since the task is so new, I chose the simplest/cleanest way to extend a Recurrent Network to condition on image data (through additive interactions). The others chose somewhat fancier models but don't have experiments showing that they necessarily perform better than the simpler models, or by how much. I don't yet know what Google did, but so far it looks like a model very similar to mine.
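As a rough illustration of the additive-interaction idea described above, here is a toy sketch (not the paper's actual code): a vanilla RNN whose hidden-state update simply sums in a bias computed from the image features, greedily emitting one word at a time until a special END token. All sizes, weights, and token indices here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, D = 10, 8, 6                 # vocab size, hidden size, CNN feature size
Wxh = rng.normal(0, 0.1, (H, V))   # word embedding -> hidden
Whh = rng.normal(0, 0.1, (H, H))   # hidden -> hidden (recurrence)
Whi = rng.normal(0, 0.1, (H, D))   # image feature -> hidden (the additive term)
Why = rng.normal(0, 0.1, (V, H))   # hidden -> word scores
END = 0                            # index of the terminating token

def generate(img_feat, max_len=10):
    """Greedy decode: sum the image bias into each update, predict words
    one by one, and stop when the END token is predicted."""
    h = np.zeros(H)
    word = np.eye(V)[1]            # a made-up START token (index 1)
    out = []
    for _ in range(max_len):
        # additive interaction: image information enters as a summed term
        h = np.tanh(Wxh @ word + Whh @ h + Whi @ img_feat)
        scores = Why @ h
        idx = int(np.argmax(scores))
        if idx == END:
            break
        out.append(idx)
        word = np.eye(V)[idx]      # feed the predicted word back in
    return out

sentence = generate(rng.normal(size=D))
print(sentence)                    # a short list of word indices
```

A real model would use learned word embeddings and sample from a softmax rather than argmax, but the control flow is the same.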

Also, all of the other works operate on the level of entire images (as a single giant blob of 256x256x3 pixels), while my paper is the only one that also breaks images down into parts and objects, and there are parts of my paper dealing with that.

ArXiv abstract links, which are preferable to direct links to the PDF:

http://arxiv.org/abs/1411.2539

http://arxiv.org/abs/1410.1090

http://arxiv.org/abs/1411.4389
When you get an output, is there a measure of certainty? Can you have the algorithm point out images that it's having trouble with? Specifically, I was wondering if you can add the marginal cases to your training set, to increase the amount of training data around edge cases.

Just wanted to say thanks again for the Neural Networks book; eagerly awaiting the next chapters.

What are your thoughts on using LSTM? The google paper is using it, but it seems you're not.

I have experiments with LSTMs as well, but I didn't include them in the paper because I couldn't get them to work significantly better than the RNN, which I consider to be a simpler model (they work comparably). That is, a simple RNN with RMSProp, gradient clipping, dropout on non-recurrent connections and careful cross-validation of hyperparameters has so far given me the best results. This might be because sentences are not actually very long structures, so the advantages of LSTM might not be as large. I'm not certain on this point. I'm currently running more extensive experiments with LSTMs and it might turn out that once I tweak everything properly they might work better, but I think it's still valuable to have the RNN numbers as a baseline if nothing else.
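For the curious, two of the training tricks mentioned above (element-wise gradient clipping and RMSProp) can be sketched in a few lines of numpy. The hyperparameter values here are illustrative, not the ones used in the paper.

```python
import numpy as np

def clip_gradient(grad, clip=5.0):
    """Clip each gradient component into [-clip, clip]; a common way to
    tame exploding gradients in the recurrence."""
    return np.clip(grad, -clip, clip)

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.99, eps=1e-8):
    """One RMSProp step: keep a running average of squared gradients and
    scale the step size per parameter."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

w = np.zeros(3)
cache = np.zeros(3)
g = clip_gradient(np.array([0.1, -10.0, 2.0]))   # -10.0 gets clipped to -5.0
w, cache = rmsprop_update(w, g, cache)            # each weight steps against its gradient
```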

Does it work on ANY image or just on a few thousand test images ?

Thanks. Got the answer.

Hi, I am new to machine learning, with some basics of the field. I would highly appreciate it if you could kindly explain, in the simplest terms, the working principle of your paper.

What are the remaining obstacles to being able to search our photos with text queries? E.g., "photos of me wearing a backpack with wood logs"?

Queries like this already work well in my experiments. For a large fraction of useful queries it's mostly a matter of engineering. My current model (which is right now at state-of-the-art performance) doesn't yet take into account the spatial relationships of things, but it would be good at retrieving pictures that have you, a backpack, and wood logs somewhere in them. Reasoning about relations is harder because spatial relationships don't make much sense in the image space of x,y coordinates, and we haven't yet figured out good ways to do spatial, 3D understanding from images (this is in large part a lack-of-data problem). All we can do is "see" that there is a backpack detection near a person detection, and that it seems to be of relatively compatible size in the image plane.

All that being said, a large fraction of queries like this will work _today_, with what we have, and the challenge is simply implementing it. However, telling the difference between whether the wood logs are inside your backpack, on your head, or in your hands might take a few more years :)
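To illustrate the "mostly engineering" point: once every photo has a predicted description (or a set of detected labels), a text query reduces to ordinary keyword matching. A deliberately simple sketch, with invented photo data:

```python
# Toy photo index: filename -> predicted caption (made up for this example)
photos = {
    "img1.jpg": "a person wearing a backpack carrying wood logs",
    "img2.jpg": "a dog playing in the park",
    "img3.jpg": "a man with a backpack on a mountain trail",
}

def search(query, photos):
    """Rank photos by how many query words appear in their description."""
    terms = query.lower().split()
    scored = []
    for name, caption in photos.items():
        words = caption.lower().split()
        hits = sum(t in words for t in terms)
        if hits:
            scored.append((hits, name))
    # most matching words first
    return [name for hits, name in sorted(scored, reverse=True)]

print(search("backpack wood logs", photos))  # ['img1.jpg', 'img3.jpg']
```

A production system would match in an embedding space rather than on exact words, but the pipeline shape is the same: caption once, then query cheaply.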

Is it because of what the items are, specifically?

Can you add an attribute to every potential item detection that includes a proportionality measure to something standardized?

Wouldn't this be a "simple" (that is, definable) calculation if every item that can be detected (meaning the item is already cataloged) has a standardized dimensional measurement in relation to the camera?

I'm not sure if I fully understand, but consider that "a person with backpack" tells you extremely little about the actual positions of those two items in the 2-D coordinate system of the image: It could be an extreme closeup where only the torso/head of the person is visible and a bit of the backpack, or it could be a tiny person anywhere in the image with a small blob sticking out from their back. The backpack could also be large, small, very occluded, etc. So there is less information in those locations than you might think.
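The size/distance ambiguity can be made concrete with a standard pinhole camera model: apparent size scales with real size divided by distance, so a large far object and a small near one can project to exactly the same pixels. The numbers below are illustrative.

```python
def projected_size(real_size_m, distance_m, focal_px=1000):
    """Apparent size in pixels under a simple pinhole camera model."""
    return focal_px * real_size_m / distance_m

big_far = projected_size(real_size_m=0.6, distance_m=12.0)    # large backpack, far away
small_near = projected_size(real_size_m=0.1, distance_m=2.0)  # small backpack, close up
print(big_far, small_near)  # both 50.0 px: the 2-D image alone can't tell them apart
```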

> I'm not sure if I fully understand, but consider that "a person with backpack"

That's not the only information that you have to work with though.

If I understand correctly, object recognition is trained from a data set. That means that every classification is a map from a cluster of images to a single label.

This cluster of images can be quantified in terms of proportionality, in relation to other objects identified in the scene. We are beginning with the assumption we can compute the identification of individual objects when they are grouped. The next thing to train on is groups of objects as a unit structure, where we infer and learn a proportion relation.

Consider a macro shot of an ant with a baseball in the background and perhaps someone's finger, and a shot of all three with only their distance to the camera changed. In the first, all three objects appear to be the same size. In the other, object one and object two express a ratio. Then we can normalize vectors based on the three computed ratios (object a : object b). Distance is not solved, but relative distance is.

Then, if you can build a catalog of standardized measurements (common objects with a defined distance from a camera, like a face 3 feet away from a camera), then you can start training for actual distances.
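A minimal sketch of that catalog idea, assuming co-detected objects sit at roughly the same depth (which, per the discussion above, is the hard part) and using an invented reference size for a face:

```python
# Assumed reference catalog: real-world heights in meters (invented values)
KNOWN_SIZES_M = {"face": 0.16}

def estimate_sizes(detections_px):
    """detections_px: {label: apparent height in pixels}, assumed at similar
    depth. Anchors the pixel-to-meter scale on any cataloged object and
    propagates it through the size ratios."""
    anchor = next((l for l in detections_px if l in KNOWN_SIZES_M), None)
    if anchor is None:
        return {}  # only ratios are recoverable; no absolute scale
    scale = KNOWN_SIZES_M[anchor] / detections_px[anchor]  # meters per pixel
    return {label: px * scale for label, px in detections_px.items()}

print(estimate_sizes({"face": 80, "backpack": 250}))
# {'face': 0.16, 'backpack': 0.5}
```

The same-depth assumption is exactly what breaks in the macro-shot example: without depth, a ratio of apparent sizes says nothing about a ratio of real sizes.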

I'm thinking about this like how I would reason about the objects in a raycaster: their definition in the machine versus what gets projected onto a 2d plane, and how moving the objects and camera affects the final projected plane. I'm certain I'm glossing over a lot of details that would actually be difficult in implementation, not to mention computational complexity.

> So there is less information in those locations than you might think.

I agree, but the only way I understand machine learning at all, as a computer scientist, is "gradual accumulation, definition, and relational organization of humanly defined atomic units". I don't think there are any magic tricks, just a lot of carefully constructed computation, even if it spans over generations of computers, groups, people.

Andrej, I totally get it, but I think what you have so far may already be good enough to form the foundation of a product that has a clear/big need in the market. If you'd like to dig in a bit more and bounce ideas, ping me at ilya@robinlabs.com.

Oh, and congrats - this is super-exciting!

How does the system react to stylised images, e.g. a photo with a cartoon effect or even a hand-drawn stick figure?

As is usually the case with Machine Learning, its reactions are whatever the training data says they should be, or mostly undefined otherwise. For example, if there is some picture in the training data of a cartoon and it says "cartoon of donald duck", then you can expect to see something of that type in the predictions. If there are no cartoon images in the training data whatsoever and you show it one, you will get a funny result and have a good laugh. At best, you'll have an "I can kind of see it if I squint my eyes a bit" result. At worst, it will say it's a cat.

Probably not well, but it's hard to say. One way of getting around this would be to train the NN on not just the input training set, but also on certain permutations of each image in the training set. So don't just train it on set(image -> labels), but also set(cartoonize(image) -> labels), set(smooth_filter(image) -> labels).

This may make the NN more resilient.
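A numpy sketch of that augmentation idea; "cartoonize" is approximated here by color posterization and the smooth filter by a small box blur (a real pipeline would use an image library). Each variant keeps the original image's labels.

```python
import numpy as np

def posterize(img, levels=4):
    """Quantize pixel values to a few levels (a crude cartoon effect)."""
    step = 256 // levels
    return (img // step) * step

def box_blur(img, k=3):
    """Mean filter over a k x k neighborhood (valid region only)."""
    h, w = img.shape[:2]
    out = np.stack([img[i:h - k + 1 + i, j:w - k + 1 + j]
                    for i in range(k) for j in range(k)]).mean(axis=0)
    return out.astype(img.dtype)

def augment(img):
    """Original plus stylised variants that would share its labels."""
    return [img, posterize(img), box_blur(img)]

img = np.random.default_rng(1).integers(0, 256, (32, 32, 3), dtype=np.uint8)
variants = augment(img)
print([v.shape for v in variants])  # [(32, 32, 3), (32, 32, 3), (30, 30, 3)]
```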

So, so, cool. Congrats to you and Li!

I don't understand why you are doing this and I don't understand why other people in this thread are praising you.

I understand that this problem is interesting to work on but seriously, this is obviously going to fall into the hands of people who are going to use it to make the world worse. If you don't think computer vision is going to be used to kill people and to take away freedom you are lying to yourself.

For every positive use this could have there are far worse negative uses and it is so clear that they will happen.

There are so many other fields you could work in that make use of machine learning algorithms. Why would you work to advance a field that has so many obviously evil applications that will definitely happen.

The main application of computer vision is robotics and automation.

If you don't like war or mass surveillance, stop voting for it. It's like blaming the Wright brothers for the city bombings of World War II.

Do you really believe that if these researchers didn't work on it, the technology just wouldn't be invented at all?

And I think the military applications are sort of overstated. We already have drones and effective mass surveillance. Machine vision can help, but it doesn't make or break it.

Mass surveillance is almost worthless without computers to monitor all of the video/audio feeds. You would need to have like a million people listening to the feeds and even then you wouldn't be able to monitor like 99% of the data.

It seems like you are unfairly singling out one particular advance in technology among many which could be equally responsible for killing or surveillance.

Furthermore, since this also opens some very positive perspectives, why not try and work together towards them? Whatever well-intentioned people do, the ill-intentioned would still have researched this and devoted it maybe more exclusively towards evil purposes.

That is ridiculous. This is clearly a key advance needed for automated surveillance and autonomous robotics.

That is like saying that the people who were working on nuclear fission are as responsible as the people who invented the process for creating the metal casing of the nuclear weapons used in Japan.

"Whatever well-intentioned people do, the ill-intentioned would still have researched this and devoted it maybe more exclusively towards evil purposes."

That is like saying you can do whatever you want because if you didn't do it someone else would. Maybe bad people would do this eventually, but maybe they wouldn't. I can tell you that most if not all of the people who want to use these technologies to do awful things are completely incapable of developing these technologies themselves. They have to get people like us to make them.

There is no question that computer vision is being used for military purposes and any advance will definitely be used by the CIA. It is completely ridiculous to think otherwise.

Sure this does have some positive uses, but the very clear danger that is posed by advanced AI and AI imaging techniques really makes this a field that should not be pursued. There is a high chance that this tech will have a net negative effect on the world.

There are many other techs that are much less dangerous to pursue that will provide positive benefits.

Your assumption that automated surveillance and autonomous robotics are wholly negative needs to be tested.

If surveillance was entirely automated one could argue that it would be less likely to be corrupted by political or social influences. One could argue the other way too, but I'm certainly not persuaded that it is more negative than manual surveillance. David Brin has written on this and related topics extensively[1], and I think there are more subtleties on the topic than it is popular to acknowledge.

As for autonomous robots: I think there are plenty of dangerous tasks that could be given to more intelligent robots, and I'm fairly sure this would be a net good.

Additionally there are many fields where this technology will be used that are almost completely positive. Medical imagery, search and rescue, and many fields of science are obvious applications.

Like most scientific fields, this undoubtedly has both positive and negative outcomes. How those are judged probably says as much about a person's political and social views as it does about the absolute moral view of the technology.

[1] http://www.davidbrin.com/transparentsociety.html

I have another response to this post too, but I feel strongly enough about it to add this:

I believe technology can be a force for good in the world. The fact that evil people do evil things with it does not negate the potential beauty of a world of abundance. Fear of the negative impacts of technology will not promote progress towards a better world.

TL;DR: I believe in techno-utopia.

You are, of course, correct. Though here (HN's character being clear enough to those of us who have been around awhile) we can now observe that you will be downvoted remorselessly. Some people 'get it': actions in technology have consequences. In sympathy, allow me to contribute a quote for further reflection. I found it hard to choose between one analyzing the moral engagement of earlier science and technology figures, and one on the hopeless social situation encountered when moral concerns are discarded amongst otherwise learned people. Without further ado, then, a dated but parallel quote in the British tradition:

I felt I was moving among two groups [literary intellectuals and scientists] comparable in intelligence, identical in race, not grossly different in social origin, earning about the same incomes, who had almost ceased to communicate at all, who in intellectual, moral and psychological climate had so little in common that instead of going from Burlington House or South Kensington to Chelsea, one might have crossed an ocean. — Baron C.P. Snow, The Two Cultures: The Rede Lecture (1959)

Would we be better had we never harnessed electricity, built a computer, or designed a camera? If the social conscience of technology depends on whether or not something can be used for bad intentions, why pursue anything? Farming advances raise armies, should we blight the land? Your response is erudite, elitist, and short-sighted. The poster is not "of course, correct" -- fire fell into the wrong hands as well and yet the world has continued to be an awe-inspiringly amazing place.

I assure you, your optimism is entirely misplaced.

There is only one major application large enough to take advantage of large-scale research into video interpretation of this genre, and it is surveillance. If you look at the changes in recent society, we have little alternative but to expect further global militarization of air, space, and sea under the benevolent governance of large corporations and pretend-democratic nation-states. As population rises, quality of life deteriorates, and pressure on resources expands still further, this work will, in all probability, power their tools of oppression first and foremost.

Can you give me an example of a research project that couldn't be used to make the world worse?

There are lines of research that are clearly much more dangerous than others. This isn't a black and white situation.

Nuclear fission, genetic engineering of viruses, identifying a gay gene/gay fetuses, advanced AI approximating human intelligence. These are very dangerous or ethically questionable avenues of research.

You don't have infinite time on this earth. There are paths of research that have a high chance of making the world worse. Why would you want to look back on your contribution when you are older and think, "well, I totally fucked the world up for everyone"?
