Show HN: API for detecting people, cars, and everyday objects in images (dextrorobotics.com)
179 points by jluan on Jan 11, 2013 | 72 comments

This reminds me of how my visual psychology professor was attempting to help people with poor vision 15 years ago, but didn't appear to get anywhere with it at the time.

The idea was a simple (but clever) one - use virtual reality to segment the world into solid blocks of identified objects. The solid blocks are identifiable to those with poor vision in a way that the real world is not.

Essentially this meant processing an image, identifying items, e.g. cars, fences, roads, etc., and then colouring them solid. So instead of a confusing scene of blur, you have a blurred but still identifiable scene of a solid strip of grey for the road, a solid blob of red for the car, another solid yellow strip for a fence, etc. A poorly sighted person could still identify from this something that made sense in a way that they couldn't in the real world.

What was required was an input, real time visual processing, and then display back to the user - all of which was fantasy 15 years ago.

However, attempt this today with a visual feed, real time processing like this, and then near instantaneous display of the results back to the person with e.g. google glass, and you might have a viable way to show the world categorised in a visual way that will help those with poor vision. Interesting times.

"However, attempt this today"

According to http://news.ycombinator.com/item?id=4985100 it won't be today, perhaps tomorrow.

I'm guessing that many severely sight-impaired people would be willing to take the latency to have vision that is significantly more useful.

Looks like this is using training data from the PASCAL VOC object detection challenge [1], which is the standard benchmark for evaluating object detection performance in computer vision.

Object detection is an extremely tough problem (some would say it is the computer vision problem ;-)), and while we've made a lot of progress in the past decade, the best methods are still terrible [2] -- average detection precision between 30-50%. For reference, most consumer applications require an AP of 90+% to be considered usable.
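For anyone unfamiliar with how AP is scored: it is the area under the precision-recall curve over confidence-ranked detections. Below is a toy sketch of the VOC-2007-style 11-point interpolated version, with made-up detections -- not the benchmark's official scorer.

```python
# Sketch: PASCAL VOC-style 11-point interpolated average precision (AP).
# Detections are (confidence, is_true_positive) pairs; n_gt is the number
# of ground-truth objects. Toy data, not the benchmark's official scorer.

def average_precision(detections, n_gt):
    # Rank detections by descending confidence.
    detections = sorted(detections, key=lambda d: -d[0])
    tp = 0
    precisions, recalls = [], []
    for i, (_, is_tp) in enumerate(detections, start=1):
        tp += is_tp
        precisions.append(tp / i)
        recalls.append(tp / n_gt)
    # 11-point interpolation: at each recall level r in {0, 0.1, ..., 1.0},
    # take the max precision achieved at any recall >= r.
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        p = max((p for p, rec in zip(precisions, recalls) if rec >= r),
                default=0.0)
        ap += p / 11
    return ap

# 3 detections against 4 ground-truth objects: two correct, one false positive.
print(average_precision([(0.9, 1), (0.8, 0), (0.7, 1)], n_gt=4))  # → 0.4545...
```

An AP in the 30-50% range means that, averaged over recall levels, most reported boxes at useful operating points are wrong or missing.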

So if this is a completely automated solution, it's not going to be able to do much better, unless the creators can make massive (I mean orders-of-magnitude) improvements on the state-of-the-art.

But that being said, there are some applications where lower performance is acceptable. And if you add some manual verification, you could conceivably make this much better (with an increase in latency, though). Another possibility is to specialize on a certain type of input image (e.g., if you're a company taking photos in your warehouse, where all your photos look very similar and/or you can control the lighting and environment).

Still, I'm excited to see companies attempting to take object detection out to the real world. All the best to these guys!

[1] http://pascallin.ecs.soton.ac.uk/challenges/VOC/

[2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2011/resu...

One of my main hobbies is photography. I do mainly outdoor shots, and really enjoy macros of flowers. The problem being that "oh last weekend I took an amazing shot of a purple flower" isn't all that helpful for someone who is trying to find a picture of an iris. When someone comes up with an algorithm that can take my shot, compare it to a library, and tell me what wildflower it is, I will be a happy camper. I suspect Flickr and 500px will also become more valuable places since it would be possible to correlate geotagged shots with flora to document what seems to be there.

It's not quite what you want, but I worked on Leafsnap [1], which automatically identifies trees by their leaves, using computer vision techniques. We focused on leaves since they are present throughout much more of the year than flowers. Our free apps also include high-resolution, high-quality photos of all aspects of the species we cover -- leaves, flowers, fruits, bark, etc. So you can at least browse through and compare the flowers you're looking at with those in the app.

Our current coverage is of the trees of the northeast US (about 200 species), but we are working on expanding that.

[1] http://leafsnap.com

Isn't the 30-50% only applicable to object recognition, i.e. multi-class classification?

In this case, you have to tell it which object you're looking for.

The relevant table on the results page is Table 3, which is detection performance. Classification is actually an easier problem (see Table 1), in part because the types of scenes in which different classes appear are often quite different, making it easy to avoid some "easy" mistakes.

This is not Pedro Felzenszwalb's discriminative part-model algorithm. This is simple AdaBoost. The authors have labeled a bunch of datasets (thousands of them) and are able to detect whatever object they've trained for. AdaBoost (Viola/Jones) is the most popular yes/no detector; there is an OpenCV API for it. It is used for detecting faces and license plates in commercial applications. A full-person detector is nothing but an SVM+HOG descriptor.
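For anyone curious what "simple AdaBoost" means here: a boosted ensemble of weak yes/no classifiers, which in Viola/Jones are thresholded Haar-like image features. A minimal pure-Python sketch on toy 1-D data -- illustrative only, nothing like OpenCV's actual implementation:

```python
# Minimal AdaBoost with decision stumps on 1-D data -- a toy illustration
# of the boosted yes/no classifiers behind Viola-Jones-style detectors
# (the real thing boosts over Haar-like image features, not raw scalars).
import math

def train_adaboost(xs, ys, rounds=10):
    """xs: floats, ys: +1/-1 labels. Returns a list of weighted stumps."""
    n = len(xs)
    w = [1.0 / n] * n                # per-example weights
    stumps = []                      # (threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        for thr in xs:
            for pol in (1, -1):
                # Stump predicts pol if x >= thr, else -pol.
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((thr, pol, alpha))
        # Re-weight: boost the examples this stump got wrong.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return stumps

def predict(stumps, x):
    score = sum(a * (pol if x >= thr else -pol) for thr, pol, a in stumps)
    return 1 if score >= 0 else -1

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [-1, -1, -1, 1, 1, 1]           # "object present" iff x is large
model = train_adaboost(xs, ys)
print([predict(model, x) for x in [0.5, 7.5]])  # → [-1, 1]
```

The cascade trick in Viola/Jones is then to chain many such boosted classifiers so that easy negatives are rejected cheaply early on.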

As a computer vision researcher, I am not impressed by this. It is primarily an API for smartphone app makers who want a binary result for detection. It does not help with scene context analysis. For instance, if I have a big picture of an airplane on a wall, it will detect the airplane. Does it know that this airplane is in the sky, or on a wall? There are a thousand failure cases.

It got zero of six airplanes for the link below, even though the images are not overlapping and are against a blue sky background:


But it did find one potted plant for that image. I could not see it (bottom left hand corner).

"Not again."

A bit off-topic, but... can you recommend a library or a service to recognize license plates? Thanks!

Failed completely for me across a half dozen tries. I wonder how cheaply you could get results via Mechanical Turk. I bet you could get much more accurate results for a very low price but with some added latency.

One to two cents a task. Anytime you have a language agnostic task (identifying/classifying objects, etc), the tasks can be done very cheaply. Just make sure you do triplicate validation.

Language dependent/creative tasks run much higher (smaller worker pool, more brain power needed).
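For clarity, "triplicate validation" just means sending each item to three independent workers and keeping the majority answer. A minimal sketch, with made-up worker answers:

```python
# Sketch of "triplicate validation": send each image to three workers and
# keep the majority label. The worker answers below are made-up toy data.
from collections import Counter

def majority_label(answers):
    """answers: labels from independent workers for one item."""
    label, count = Counter(answers).most_common(1)[0]
    # Demand a strict majority; otherwise flag for another round of review.
    return label if count > len(answers) / 2 else None

print(majority_label(["car", "car", "truck"]))   # → car
print(majority_label(["car", "dog", "truck"]))   # → None (no agreement)
```

With three workers at one to two cents each, an agreed-upon label costs a few cents and disagreements surface automatically.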

I've never used mechanical turk before and don't understand what you mean by language agnostic. I'd want someone to tell me that it's a "car" and not a 汽車. And I'd want to give the instructions for the task in English.

If I had to guess what they meant: A car is a car, regardless if you call it 'car' or 'das auto.'

Use icons!

I agree, crowdsourcing is the way to go if you need to understand images. Image recognition is a very tough problem, especially if you're trying to detect anything nuanced.

We've developed RTFM at CrowdFlower to handle the similar task of moderating images and providing detailed reasons for why they are flagged. It's a common problem that the computers can't solve well enough yet.


Sheep were detected as horses and faces, but not as cats or cars. This seems to be the current state of the art for general-purpose classification. I haven't seen anything better yet (unless you specialize in sheep detection).

Can you give a bit more technical background? Tell me how this is better than, e.g., out-of-the-box OpenCV filters.

It is probably based on "Object Detection with Discriminatively Trained Part-Based Models" by Pedro F. Felzenszwalb et al. Somebody took http://people.cs.uchicago.edu/~rbg/latent/ , made a REST API, and hooked up a payment system.

Hey everybody, OP here. Thanks for the great feedback! We're really happy that so many people have checked this out.

One thing that I want to mention: our service was built favoring precision over recall; we reasoned that we'd rather have a low number of false positives and make sure that when we do report a detection, it actually is one. Thus, our service may occasionally miss instances.

I'm going to implement a button on the Experiment page that lets you flag a detection as something that we need to work on; we will use your feedback to improve the accuracy.

You might want to let the user decide if it is more important to have a false positive or a false negative. For some applications a false alarm is a minor nuisance but a false negative is catastrophic, but for some applications it is flipped. In the past I have let the end user define the balance (i.e. "a false negative is 10X as bad as a false positive") and the decision results were scaled by their decision rule. It's not always easy to do as many machine learning algorithms are nonlinear but at least you can cast a wider net of potential customers.
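Concretely, a user-specified cost ratio translates into a shifted decision threshold on the detector's (assumed calibrated) confidence score. A sketch of that scaling:

```python
# Sketch: turn a user-specified cost ratio into a decision threshold.
# If a missed detection (false negative) costs fn_cost and a false alarm
# (false positive) costs fp_cost, accept a candidate detection when the
# expected cost of rejecting it exceeds the expected cost of accepting it:
#   p * fn_cost > (1 - p) * fp_cost  =>  p > fp_cost / (fp_cost + fn_cost)
# where p is the detector's (assumed calibrated) confidence score.

def decision_threshold(fp_cost, fn_cost):
    return fp_cost / (fp_cost + fn_cost)

def accept(p, fp_cost=1.0, fn_cost=1.0):
    return p > decision_threshold(fp_cost, fn_cost)

# "A false negative is 10x as bad": the threshold drops from 0.5 to ~0.09,
# so even low-confidence detections get reported.
print(decision_threshold(1.0, 10.0))            # → 0.0909...
print(accept(0.2, fp_cost=1.0, fn_cost=10.0))   # → True
```

The nonlinearity problem you mention is exactly that many models don't emit a calibrated p, so the threshold has to be tuned empirically per decision rule rather than computed in closed form like this.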

This is a Dutch street, therefore it has many bikes in it: http://i.imgur.com/qQwAS.jpg .

Your application detects none of them... Is it because my ancient phone camera's pics are too grainy? Or do the bikes need to be in profile to be detected properly? Or maybe it's trained to detect bikes with people on them, instead of bikes parked in the street?

http://www.dauntless-soft.com/products/freebies/airbus380/a3... detected 0 planes; there should be at least 5.

When I used http://www.airbus.com/fileadmin/media_gallery/aircraft_pages... it detected 2 planes, when there was only 1.

but hope with additional training images, it would improve.

Over a dozen experiments, the recognition rate for faces seems to be about 70%. Example of failure: only 2 faces detected here (in particular NOT the one in focus) http://iamdaveknockles.files.wordpress.com/2011/03/meeting_j...

This is worse than OpenCV (I thought you were using OpenCV but apparently aren't?)

Similar result with the image below. It got four of six faces, missing the most important of the faces.


If you read the description on the site, you'd know that the face detection stops after finding 4 faces.

How would I know that? I don't read documentation until absolutely necessary. If it claims to find faces, well, then let's see it work! And then we'll count the faces found.

In any case, the documentation is wrong if it says that. E.g., the software found all seven SEGs in the photo below:


As a long time CV enthusiast, I applaud the tech and the way you guys make it "just work". However, for any serious application I feel a few things are missing:

- your pricing won't work for video (even at only 5fps)

- I can't really use the data without a confidence level for each detection, because for some applications I'd rather discard a bounding box that is below a threshold I set.

Other than that, congrats for the great work :)
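To illustrate the second point: with a per-box confidence in the response, clients could set their own threshold. The response shape below is hypothetical -- the current public API does not expose a score:

```python
# Sketch of client-side filtering, assuming the API returned a confidence
# with each bounding box. The field names ("class", "box", "confidence")
# are hypothetical; the current public response does not include a score.

def filter_detections(detections, min_confidence=0.7):
    """Keep only boxes whose (hypothetical) confidence clears the threshold."""
    return [d for d in detections if d["confidence"] >= min_confidence]

response = [
    {"class": "car", "box": [10, 20, 110, 90], "confidence": 0.92},
    {"class": "car", "box": [200, 40, 260, 80], "confidence": 0.41},
]
print(filter_detections(response))  # keeps only the 0.92 box
```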

Hey steeve, thanks for the really kind feedback! We're aware of the video pricing issue, and it's something that we're thinking hard about how to solve for makers and developers.

In the meantime, if you want to experiment with Dextro for video, shoot us an email at team@dextrorobotics.com and we will hook you up!

With regard to confidence level, that's something we provide with the enterprise-class service; if this is a critical feature, we can potentially offer it to everyone as well.

Tried detecting Airplanes on this image with 18 airplanes, but it only detected 4 of them.


The browser demo page says "Dextro supports up to 4 objects to be detected concurrently when used via the full API.", I think that's the problem.

Hey tunnuz and limejuice, sorry to hear we only picked up on 4 of the planes. We've biased our service towards precision rather than recall; thus, we try to be wrong about detected objects as little of the time as possible at the expense of perhaps missing a few object instances.

I want to clarify: the 4 object concurrent detection refers to 4 classes of objects. On the Experiment page, you can only choose one class to detect on (whether that is person, bottles, cars, etc). However, by using the API, you can simultaneously search for cars, planes, people, and motorcycles, for example.

Ok, so the two 4s have nothing to do with each other. Thanks for the clarification.

Tried to detect Cats in a whole room full of cats, and it detected zero cats.


Wow, I just tried this image of faces:


It got almost all of them, but with so many errors. It can't detect sheep either.

I was really impressed at first, but as I tried out more and more images, it became apparent that the API isn't mature enough to be worth even one or two cents. The algorithm detects objects correctly maybe 90% of the time, but sometimes it doesn't detect the entire object. For example, I used another image of two jets, and it only found one of them even though the jets were identical; one was just smaller than the other.

Interesting technology. It got a couple correct for me, but failed on a bunch as well. Here are a few horses it failed to find correctly.

2 horses / detected 0: http://images4.fanpop.com/image/photos/23500000/horse-horses...

4 horses / detected all as 1: http://4.bp.blogspot.com/-Rso9vw4BmSE/TqZU6vHl3kI/AAAAAAAACL...

Did a few tests and it works pretty well! No false positives at least.

Any plans to increase the number of objects you can search for at once? Very interested in using this but I'd want to be able to scan for ~20 objects.

It detects a false positive in this picture for "full-body person": http://farm9.staticflickr.com/8232/8366217251_972624d84b_b.j...

The man is detected, but also a shape above the umbrella.

Edit: direct link to the result picture: https://s3.amazonaws.com/dextro_detection_results/debug13579...

Yes! Scanning for more than 4 objects at once (which is currently supported) is something that we definitely want to enable in the future. The only constraint is the number of GPU machines we can afford.

Concerning the API: (On page https://www.dextrorobotics.com/api)

* The documentation is pretty weak.

* I am not sure what a classID is, and I don't see any links to where the numbers come from.

* The example request is posting to an insecure http address, but the secret api key is required?

* The example request doesn't fit on one line? It took me a while to see it was in the "GET / HTTP/1.1" style.

* How do errors work? Having clearly specified error responses would be really useful.

If you're trying to sell me on your API, show it to me.
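As an example of what clearer docs could show: a complete request, over HTTPS, built end to end. Everything here (the /detect path and the api_key / class_id / image_url fields) is hypothetical, since the real parameter list isn't documented:

```python
# Sketch of a self-documenting example request. The endpoint path and
# field names ("api_key", "class_id", "image_url") are hypothetical
# illustrations; the request is built but never sent.
import json
import urllib.request

payload = json.dumps({
    "api_key": "YOUR_SECRET_KEY",     # secrets belong on HTTPS, not plain HTTP
    "class_id": 7,                    # docs should link a table of class IDs
    "image_url": "http://example.com/photo.jpg",
}).encode("utf-8")

req = urllib.request.Request(
    "https://www.dextrorobotics.com/api/detect",  # hypothetical path
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
```

A matching table of error responses (HTTP status, machine-readable code, human-readable message) would answer the last bullet.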

It didn't recognise the cat in this picture (http://i.imgur.com/TgFaJh.jpg), so I'm doubtful of its practicality.

Very interesting application, though I can't see real-life usage via a web API. To my knowledge, this kind of thing is meant for realtime applications, and a web-based approach might not serve that purpose.

BTW, it can find only two airplanes in this photo http://www.q8.com/SiteCollectionImages/Gatwick%20Airport.jpg

Hey - I found a bug - no cat was detected here: http://i.imgur.com/wGxWy.jpg

It seems to have some trouble finding cats: http://i.imgur.com/ONFis.jpg

I have been tinkering with a similar side project which you can read about here:


It's still in the development stage because I can only fiddle with it when I have the time and impetus to do so. Criticisms/comments welcome.

Works well! Found a few it didn't work on. For example, it didn't detect an airplane in this image (but it's a fighter jet, so maybe not part of the training set): http://cdn-www.airliners.net/aviation-photos/photos/2/8/0/20...

Hmm... It's a fantastic idea and a really great website, but the actual algorithm is very imprecise.

See this: http://i.imgur.com/ulith.png?1

You need to get a higher percentage of actual matches before you can use this for anything.

Shameless plug: libccv supports a RESTful API as of version 0.4; it is open source and free: http://libccv.org/doc/doc-http/. Trained pedestrian / car / face detectors are included.

Didn't work for me. That said, image recognition via an API will be huge once things mature a little more.

I've been searching lately for a post-face.com API and have been following a few for a while, but they seem to have similar issues with poor results.

It'd be great if you could use this to detect nudity. Any plans for that? I'm assuming the balls on the "in the works" list are of the sport variety? ;)

In the works: shoes, balls, smartphones and tablets, dogs, keyboards, cups and glasses, doors, keys


detected 3 planes... there is only 1 plane and a car

That's great! I like that you guys are offering a small free usage tier!

It didn't detect a face here: http://i.imgur.com/c34dX.jpg The algorithm probably got distracted and raised an exception.


there's a face there?

Is there a good way for submitting recommendations for improvements? http://i.imgur.com/yLSHW.jpg

That is amazingly awesome. Glad it can integrate with Ruby and Python. I haven't even read all the info and I've already signed up.

Isn't there a risk for this service to be used as an image proxy? The analyzed images are rehosted on their S3...

Over/underdetection of bicycles: http://imgur.com/a/tIS6c

Very nice. Would y'all consider offering an embeddable solution that does not need to be run over the net?

Are you willing to pay for this service but need higher accuracy? I'd love to hear from you.


Does a good job with paintings too, but it did find the phantom neighbor peeping in as well:


detected 2 planes, there are 7 http://iskin.co.uk/wallpapers/imagecache/1280x800/jet_plane_...

Seems too buggy to pay for just yet.

Seems pretty good, but my first test found a potted plant in the aeroplane demo picture -- a 100-story potted plant :P Very cool idea and a super hard problem, so mad respect regardless!

I tried to search for a cow in a picture of a horse, but it failed.

Hi, would love to have this API on Mashape.com

