Descriptive Camera: A camera that prints a description, not an image (mattrichardson.com)
250 points by ColinWright on Apr 25, 2012 | 78 comments

  | |                             | |
  | |    This is a picture of     | |
  | |  30-60% off top brand name  | |
  | |    shoes like Nike, Vans,   | |
  | |    Adidas, Toms, and more.  | |
  | |   Call 1-800-TURK-GAMER to  | |
  | |   find your perfect pair.   | |
  | |                             | |
  | ------------------------------- |
  |                                 |
  |                                 |

Why does the human race have to have griefers? Why can't we all just get along and have cool sci-fi tech without people trying to game it?

Well, let's draw a distinction here. The above isn't griefing, it's simple capitalism. It exists for some reason other than aggravating people and/or wasting their time. You might also have griefers who would put out text-photographs like:

    /                                \
    ||  This is a picture of how I  ||
    ||  just want to tell you how   ||
    ||  I'm feeling, gotta make you ||
    ||  understand, never gonna     ||
    ||  give you up, never gonna    ||
    ||  let you down, never gonna   ||
    ||  turn around and desert you, ||
    ||  never gonna make you cry,   ||
    ||  never gonna say goodbye,    ||
    ||  never gonna tell a lie and  ||
    ||  hurt you.                   ||
    |                                |

It's annoying, but I'd rather have cool stuff with people abusing it than not have that stuff at all.

Except that might not be sent by 'griefers'; instead it could be a love note between two lovers.

'Griefers' is just a subjective term depending on which chair you're sitting in to watch the performance.

Now you've ruined it for everybody. On the other hand, I shall get rich offering SAAAS (SpamAssassin As A Service) on Mechanical Turk...

Could be helped by adding a secondary task, asking Turkers whether the description matches the photograph.

Amazing - the meta concepts behind the "Descriptive Camera" are limited only by the imagination, the resolution of the sensing devices, the ubiquity/bandwidth of the data pipe, and the ability to manage/train the backend human workforce and maintain a consistent level of quality. I'm pretty certain that within 3-5 years, a billion dollar company will emerge using the basic concepts captured by the "descriptive camera." That is, a remote sensing device feeding data into a human-backed work-management queue, and returning some type of structured higher-level processing of the remote-sensing device's data in close to real-time.

Just off the top of my head:

  o Flora/Fauna identification
  o Scene analysis (Car Crashes/Building Wreckages)
  o Medical analysis
  o Substance analysis
  o Structural integrity review
  o Online Translators.  Today.
I'm pretty certain companies would be more than happy to pay $1000+ for 3-4 hours of a real-time translator that worked through a smartphone on business trips. Ubiquitous high-bandwidth wireless (LTE), high-resolution mics and cameras, and smartphones are the enabling technologies that trigger this class of application.

Medical analysis

I don't think I would rely on a description from a Mechanical Turk human for describing what they see in a picture.

Is that a tibia or a fibula?!?

Right - you don't send to Mechanical Turk. You instead send it to a highly trained backend workforce, presumably highly skilled in the vertical that you are doing analysis on. And, it's not just a picture, but video/sound/other sensing. Who knows what the next advances in remote telemetry will be.

You can always require Mechanical Turks to pass qualification tests before they work on your HITs. There are also pre-existing qualifications like language and location.

Of course, you need to pay well enough to get enough people to bother, but I imagine one could build a reasonably skilled workforce for many areas if your training material is good and you have an objective measure of the correct answer.

This image description application is both perfect and terrible for Mechanical Turk though - it's an ideal task for a human rather than a computer, but it's also impossible to score the result objectively, so you'll have to pay everyone, all the time, or introduce another level of scoring - "Is this an accurate description?" "Does this description read fluently in $LANGUAGE" etc.
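The second-level scoring layer described above ("Is this an accurate description?") could be sketched as a simple majority vote over verification answers. This is a minimal sketch with hypothetical data; actually fetching answers from the MTurk API is out of scope here.

```python
# Accept a description only if a majority of verification-HIT answers
# approve it. The vote lists below are made up for illustration.

def accept_description(votes, threshold=0.5):
    """votes: list of booleans, one per verification HIT answer."""
    if not votes:
        return False  # no verifiers responded yet
    return sum(votes) / len(votes) > threshold

# A description approved by 2 of 3 verifiers passes:
print(accept_description([True, True, False]))   # True
# One approved by only 1 of 3 does not:
print(accept_description([False, False, True]))  # False
```

In practice you would still pay the verifiers, but verification HITs are cheaper and faster than the original description HITs, so the scoring layer reduces rather than eliminates the cost of unscored work.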

Indeed, we do this kind of work today in MobileWorks, using exactly this fact.

Once your workforce is skilled all kinds of new opportunities of this nature open up.

Nice. I guess we don't need to look too far to see that soon-to-be-billion dollar company then. Have you considered doing real-time feedback to a device that has remote telemetry (video/sound/pictures/etc...)?

We've done some work in this vein already, as a matter of fact: real-time communication between the crowd and cameras / robots. We're currently supporting a few fast camera-driven smartphone applications that developers have built on the platform.

I think what you're after, though, is more comprehensive integration: a remote telemetry system that has crowd intelligence baked into its circuits along several media, rather than one: analyzing simultaneous audio and video. I don't think this would be difficult for a developer to build, and it's a great idea.

So why is this "technology" ground-breaking then? I thought the point of this is that you 1. take a picture 2. send it over the internets 3. wait 3~6 min to get a response on what the picture is about.

Try scaling this out..."highly-trained" means poor scalability and lots of liability.

The technology isn't ground breaking - it's the combination of five or six technologies that are maturing at the same time that make this sort of application (real time human assessment of remote telemetry) work.

Lots of US hospitals already use offshore radiology services that interpret various medical imaging tests taken in the overnight hours and report back the findings.

Citation? (note - not trying to be snarky... more so curious)

I think the fauna/flora possibility is the strongest. There are lots of amateur gardeners (me:-) who would love to snap something and find out what it is (and hence how best to feed, water and shade it). I've often thought this should be possible with an android/iphone app.

Have you seen Snapseed or Leafsnap?

No I hadn't. Thanks!

This already exists- remember Google Goggles? Except in this implementation it's a little more hipster/artsy since it prints it out for you on receipt paper.

Honestly, as a camera, it is a step backwards. Photographs efficiently capture many times more information than a simple text description.

However, what it is a good demonstration of is the growing cognitive abilities of image-reading machines.

It's not trying to be hipster by printing descriptions; the pictures are described by humans via Mechanical Turk. That's why it's interesting.

Ah, well fuck me for not reading the writeup more thoroughly.

You were right that it is artsy though...

(And in your defense, having humans manually describe images in the image recognition domain seems so...unmodern, that your assumption was probably fair)

I was hoping that the description would be derived algorithmically rather than via human intervention. I guess that really was a bit optimistic though :(

Agreed, the concept's great but really you're just taking a picture then asking someone to describe it & printing that; that tech's been around for a while.

A more practical approach may be to develop a camera with access to the web. When pictures have been taken and the camera can get a 3g/4g/wifi network connection it uploads the pics to the cloud (optionally removing or retaining to save space on the camera's local memory or allow the images to be accessed on the camera in offline mode). You can then put in some serious server power to run through all uploaded images to describe the contents (including tagging people in the photos and using GPS data to help identify content based on location context) and have this data automatically sent back to the camera (if desired) as well as stored against the image online. Randomly picked images, or those with a low accuracy rating (e.g. the algorithm has rated its description as tenuous) could be sent to a service such as mechanical turk so that the results can be verified / improved upon / to ensure the algorithm has feedback to learn from.
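The confidence-gated routing proposed above can be sketched in a few lines. The captioner here is a stand-in (a real one would be a computer-vision model), and all names and scores are invented for illustration:

```python
# Sketch: auto-caption every upload, then route low-confidence results
# to a human queue (e.g. Mechanical Turk) for verification/improvement.

def describe(image_id, auto_captioner, human_queue, confidence_floor=0.8):
    caption, confidence = auto_captioner(image_id)
    if confidence < confidence_floor:
        # Tenuous description: queue it for human review.
        human_queue.append((image_id, caption))
    return caption

# Stand-in captioner that is confident about "img1" but not "img2":
def fake_captioner(image_id):
    if image_id == "img1":
        return ("a beach at sunset", 0.95)
    return ("unclear", 0.4)

queue = []
describe("img1", fake_captioner, queue)
describe("img2", fake_captioner, queue)
print(queue)  # only img2 was routed to humans
```

The human-verified answers would then feed back into the model as training data, which is the feedback loop the comment describes.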

Heh you're describing the whole implementation except for that pesky but critical computer vision part.

It is a cute idea but unfortunately (or fortunately?) this particular implementation of the camera comes with (a) opinions, (b) typos, and (c) probably trolls, if they know that their text will automatically be accepted and printed out.

I guess these factors could probably be reduced with more money? I don't mean paying more for the job: I mean using the API to ask a second human to verify the results of the first human and/or fix typos.

No need for (b) typos and (c) trolls, but (a) opinions are output that computers currently fail to provide effectively, yet much of human society depends on them. I was going to cite medicine and marketing as examples, and act like science doesn't depend on opinion, but that is actually very false. The scientific process is basically 1) experiment/observe (i.e. take pictures), 2) form opinions (calling them "hypotheses") about what we observe and 3) decide what to do next to convert that opinion into a truth value.

I think you're missing the point of it.

What IS the point of it? It seems like a parlor trick to me, especially with the "send it to my friend and have him describe it" option.

I don't see how this gadget is particularly useful in any way, any more than the first "automated chess device" with a midget inside moving chess pieces around was actually revolutionary.

The comments on how it would be useful for the blind (with braille or speech synthesis) are as close as I've seen to a reasonable justification for it. Even then it's a stretch. Beyond that, basing it on Mechanical Turk means every "photo" costs money, and frankly doesn't provide much value.

If you want to caption your photos it would be much better to do it AFTER you've pruned the ones you don't care about, so you're not paying for descriptions of 100 throwaway photos for every keeper. In which case the fact that the camera itself is hooked up to Mechanical Turk is just a gimmick, since it would be easier to run a script on your desktop or on a server.

Am I? Would the point somehow be served if you took a self-portrait and the response back was "haha ur so ugly"...?

(I should add that I know that Amazon offers a review process, but I'm not sure how it integrates with the above camera, so I'm assuming that the camera automatically accepts a description as correct before printing it out. This might not be the case; I don't know.)

You are. This isn't meant to be a completely realistic project ready to be shipped; it's a glimpse of the future, or maybe an art project. I think it's wonderful to see out-of-the-box stuff like this appearing on Hacker News.

My apologies; I suck at communicating.

I think I agree. This camera is cute but this particular implementation has flaws -- that's what I said before and what I hoped would be my take-home message, and I guess I didn't highlight that enough. I certainly didn't intend, "oh, fix this before it goes into production!" or anything like that.

Yep, Matt Richardson (my classmate at NYU/ITP, but I wasn't in the course he made this for) makes some pretty cool stuff: http://mattrichardson.com/tech/ He made a project for physical computing class that took hashtagged tweets and displayed them via laser onto phosphorescent vinyl, which caused them to fade away over time.

He's also a writer for Make: http://blog.makezine.com/author/makemattr/

I think the key is that he's one of those people who are experimenting with mashing up not only different technologies but also different generations of technologies.

I guess one could call this stuff "parlor tricks" as one other person said but I agree that it's totally within the spirit of hackernews.

That raises another good point - privacy. I take a pic of my girlfriend on the beach, and some random guy on the Turk gets a copy to distribute / use as they like. This means building this functionality into a standard camera would need an option to turn off the "auto-annotate" feature selectively if using the human method. My example's obviously quite tame, but there are a plethora of more serious ones (including people who'd want to use the service to keep electronic copies of legal documents and have the annotation feature provide a quick cataloguing system).

Add speech synthesis and this would be an awesome camera for the blind.


> VizWiz is an iPhone app that allows blind users to receive quick answers to questions about their surroundings. VizWiz combines automatic image processing, anonymous web workers, and members of the user's social network in order to collect fast and accurate answers to their questions.

Just had a go: it's great!

Even with a six-minute latency?

Absolutely. for example, a big challenge for blind people is interacting with devices in the environment that report status via visual signals. For example, a blind coffee fiend I know can't tell what state his fancy espresso machine is in ("brewing", "out of beans") without sighted help to read the little display. He would happily wait a few minutes to get someone to tell him - it's quicker than waiting for his wife to come home.

Thanks for this. Now I'm trying to think of ways that you could actuate thin metal rods, so that you could transform a greyscale picture into a sort of texture-photograph, so that you could "see" details by placing a hand atop it. If the rods are needle-thin you could probably get a reasonable resolution on it, and even have some brightness/contrast sliders on the side to help -- but it seems like the real problem is just moving all of those little rods automatically.
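The grayscale-to-rod mapping is easy to sketch: quantize each pixel's brightness into one of a few discrete actuator positions. The 8 levels and 10 mm of travel below are made-up numbers; brightness/contrast sliders would simply rescale pixel values before this quantization step.

```python
# Sketch: map a grayscale image (pixel values 0-255) onto discrete pin
# heights, so brighter pixels raise their rod further. Levels and travel
# distance are invented for illustration.

def to_rod_heights(gray, levels=8, max_height_mm=10.0):
    """gray: 2D list of 0-255 values. Returns heights in millimetres."""
    step = 256 / levels  # width of each quantization bucket
    return [[round((int(v) // step) * max_height_mm / (levels - 1), 2)
             for v in row]
            for row in gray]

# Black stays flat, white rises to full height:
print(to_rod_heights([[0, 255]]))  # [[0.0, 10.0]]
```

The mechanical side (actuating a dense grid of needle-thin rods) remains the hard part, as the comment notes; refreshable braille displays do something similar but at far coarser resolution.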

If you built it, I think you'd find the real problem is having enough resolution in your fingertips to work out what the hell the picture is about.

Lots of people who become visually impaired later in life can't read Braille as their fingertips aren't sensitive enough, especially if they did manual work earlier in their life. Working out a picture is even harder.

As a side note about this, William Moon invented 'Moon script', a kind of simplified embossed alphabet to help visually impaired people read. Even after Braille's raised dot patterns became popular, Moon still had its niche for those who didn't have the tactile resolution for Braille. As Moon was based on the Roman alphabet, it was also easier to learn for people who had been able to read and then lost their sight.

If he has an iPhone, ask him to check out: http://vizwiz.org/

Maybe Amazon could add a new category of HITs, "real-time HITs".

Someone (the "turk", or whatever Amazon calls them) keeps an open window (and is optionally paid for each minute with the window open), so new HITs would appear immediately, and the "turk" would be rewarded based on the speed of the response.
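A payout scheme for such hypothetical "real-time HITs" might combine the three components the comment lists: base pay, per-minute pay for keeping the window open, and a bonus that decays with response latency. Every rate below is invented for illustration:

```python
# Sketch: speed-weighted payout for a hypothetical real-time HIT.
# base: flat fee per HIT; per_minute: pay for keeping the window open;
# bonus decays linearly to zero over bonus_window seconds.

def payout(minutes_online, response_seconds,
           base=0.05, per_minute=0.01, bonus=0.10, bonus_window=60):
    speed_bonus = bonus * max(0.0, 1 - response_seconds / bonus_window)
    return round(base + per_minute * minutes_online + speed_bonus, 4)

# A worker online 10 minutes who answered in 15 seconds:
print(payout(minutes_online=10, response_seconds=15))  # 0.225
```

The decaying bonus is what turns a batch marketplace into a latency market: it pays workers to sit ready, which is exactly what sub-minute responses would require.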

Why speech synthesis? Just have the person recording instead of typing.

I'm disappointed. I was thinking someone had optimized a template-matching algorithm and implemented it in an ASIC that you could slot into a camera.

Let me see if I understand this technology correctly: you take a picture of something, mail the picture to someone, somewhere, they write a quick summary of the picture and mail that back to be printed out on what appears to be a thermal receipt printer.


Yep, and if you can internalize why that makes the front page of HN you can get a sense of what Web 3.0 is going to look like.

Welcome to the world of tomorrow!

It is interesting in its cuteness though; it seems like it could have real applications depending on how 'powerful' the mechanical turk is.

  It is pitch black. You are likely to be eaten by a grue.
Ooops, lens cap is on.

Here is an idea, translate the text into a picture. An API will take text and generate a representation of the description using images from the web via Google image search.

There have been some interesting efforts at that already, e.g.:

WordsEye constructs 3d images from text descriptions http://www.wordseye.com/

Sketch2Photo takes a crude annotated sketch, and creates a composite image: http://cg.cs.tsinghua.edu.cn/montage/main.htm

The output of wordseye isn't great - it looks a bit 90's POV-Ray. Sketch2Photo does a nicer job - more like automated photoshopping - but needs more assistance on the placement of objects in the scene.

I'm sure there are others.

The Wordseye renderer actually happens to be a 90s vintage ray tracer.

Maybe not too practical as presented here, but if you start with Mechanical Turk workers doing the recognition, the resulting image/description pairs could be an invaluable source when developing and training automatic algorithms that do this kind of image analysis.

There are already systems that can assess wounds automatically by analyzing pictures of them.

My first thought was "oh great, another silly art project", but I've been wanting something like the Mechanical Turk part of it for awhile.

My least favorite part of photography is organizing and finding the best photos. I'd happily pay to have a bunch of people rate all my photos (if the results are meaningful enough).

You'd probably need to show each photo to multiple Turkers to get good data. You could run analysis on the votes to kick out the Turkers who give ratings that deviate significantly from other Turkers. Does MTurk have anything like that built in?
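Whether or not MTurk offers it built in, the deviation check is easy to run yourself once you have multiple ratings per photo. A sketch with hypothetical ratings, flagging workers whose scores stray far from the per-photo consensus:

```python
# Sketch: flag raters whose scores deviate strongly from the per-photo
# consensus (mean of all workers). All ratings below are made up.
from statistics import mean

def flag_outliers(ratings, max_avg_deviation=1.5):
    """ratings: {worker: {photo: score}}. Returns workers to review."""
    photos = {p for scores in ratings.values() for p in scores}
    # consensus score per photo
    consensus = {p: mean(scores[p] for scores in ratings.values()
                         if p in scores)
                 for p in photos}
    flagged = []
    for worker, scores in ratings.items():
        deviation = mean(abs(s - consensus[p]) for p, s in scores.items())
        if deviation > max_avg_deviation:
            flagged.append(worker)
    return flagged

ratings = {
    "w1": {"a": 4, "b": 5},
    "w2": {"a": 4, "b": 4},
    "w3": {"a": 1, "b": 1},  # far from the other two raters
}
print(flag_outliers(ratings))  # ['w3']
```

One caveat: outliers drag the consensus toward themselves, so with very few raters per photo you would want a robust center (e.g. the median) instead of the mean.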

And then comes pictureless pinterest, oh wait here it is: http://twitter.com/picturelesspins

The source of their content are submissions from their website http://picturelesspinterest.tumblr.com

Like reverse Dwarf Fortress?

This is like the DARPA Mind's Eye project, except the DARPA project does this computationally. http://www.wired.com/dangerroom/2011/01/beyond-surveillance-...

> Here is an idea, translate the text into a picture. An API ...

This one uses a combination of computer vision and optional human vision. http://www.iqengines.com/omoby/. There is an API allowing for computer vision training.

This is brilliant. It reminds me of Sascha Pohflepp's Blinks & Buttons project, which swaps your photo with someone else's taken at the same time. There's even an iPhone app.


Reminds me of Swapshot.


"Exactly what humans do."

Isn't that because the real work is being done by a human?

You're right. I just skimmed the post before making my comment; I thought this was done with an algorithm. That's why I deleted it. False alarm. I won't say it's useless, but it's not a revolution. With this we will never get a real-time answer.

A camera for the blind - if the printer is adapted to print Braille.

Oh god, the Searle Prophecy is coming true!

It's too bad TinEye doesn't do a better job of matching similar images; you could feed the picture in, find the closest match, and then scrape the description off the corresponding pages where it was found.

Of course, these should be saved in JTEG format.


Now I know how Thicknesse felt.

Can someone build a version of this that prints snarky descriptions of certain pictures? Give it a libertarian political slant and a vegan diet. Only good things can come of this.

I wonder how it holds up against a human being. They should run a test on that.

Did you read the article? It is being done by a human being, through Amazon's Mechanical Turk service.

Yes I did, Mr. serious face.

My camera gives me descriptions too, but they're in terms of quantized frequency coefficients of 8x8 squares.
