This reminds me of how my visual psychology professor was attempting to help those with poor vision 15 years ago, but didn't appear to get anywhere with it at the time.
The idea was a simple (but clever) one - use virtual reality to segment the world into solid blocks of identified objects. The solid blocks are identifiable to those with poor vision in a way that the real world is not.
Essentially this meant processing an image, identifying items (e.g. cars, fences, roads, etc.) and then colouring them solid. So instead of a confusing scene of blur, you have a blurred but still identifiable scene of a solid strip of grey for the road, a solid blob of red for the car, another solid yellow strip for a fence, etc. A poorly sighted person could still identify from this something that made sense in a way that they couldn't in the real world.
What was required was an input, real-time visual processing and then display back to the user - all of which was fantasy 15 years ago.
However, attempt this today with a visual feed, real-time processing like this, and then near-instantaneous display of the results back to the person with e.g. Google Glass, and you might have a viable way to show the world categorised in a visual way that will help those with poor vision. Interesting times.
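A rough sketch of the colouring idea, assuming you already have a per-pixel class-label mask from some segmentation or detection model (the model itself is out of scope here, and the class ids and colours are purely illustrative):

```python
import numpy as np

# Hypothetical class ids -> solid display colours (BGR), chosen for contrast.
CLASS_COLOURS = {
    0: (128, 128, 128),  # road  -> solid grey strip
    1: (0, 0, 255),      # car   -> solid red blob
    2: (0, 255, 255),    # fence -> solid yellow strip
}

def solid_colour_view(label_mask: np.ndarray) -> np.ndarray:
    """Replace every pixel of each identified class with one flat colour.

    label_mask: HxW array of integer class ids (from any segmentation model).
    Returns an HxWx3 image where each object class is a single solid block,
    which is far easier to parse with low vision than the original scene.
    """
    out = np.zeros((*label_mask.shape, 3), dtype=np.uint8)
    for class_id, colour in CLASS_COLOURS.items():
        out[label_mask == class_id] = colour
    return out
```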
Looks like this is using training data from the PASCAL VOC object detection
challenge [1], which is the standard benchmark for evaluating object detection
performance in computer vision.
Object detection is an extremely tough problem (some would say it is the computer vision
problem ;-)), and while we've made a lot of progress in the past decade, the best
methods are still terrible [2] -- average precision (AP) between 30% and 50%.
For reference, most consumer applications require an AP of 90+% to be considered
usable.
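For concreteness, the AP figures quoted above are computed roughly like this on PASCAL VOC; a sketch of the classic 11-point interpolated average precision (the real benchmark code also handles bounding-box matching, which is omitted here):

```python
import numpy as np

def voc_11point_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated average precision, as used in PASCAL VOC.

    recall, precision: arrays over the ranked detections for one class.
    For each recall level t in {0.0, 0.1, ..., 1.0}, take the maximum
    precision achieved at recall >= t, then average the 11 values.
    """
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```

A detector whose precision collapses as recall rises ends up with an AP in the 30-50% range mentioned above, even if its top-ranked detections look great.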
So if this is a completely automated solution, it's not going to be able to do
much better, unless the creators can make massive (I mean orders-of-magnitude)
improvements on the state-of-the-art.
But that being said, there are some applications where lower performance is
acceptable. And if you add some manual verification, you could conceivably
make this much better (with an increase in latency, though). Another possibility
is to specialize on a certain type of input image (e.g., if you're a company
taking photos in your warehouse, where all your photos look very similar and/or
you can control the lighting and environment).
Still, I'm excited to see companies attempting to take object detection out to
the real world. All the best to these guys!
One of my main hobbies is photography. I do mainly outdoor shots, and really enjoy macros of flowers. The problem being that "oh last weekend I took an amazing shot of a purple flower" isn't all that helpful for someone who is trying to find a picture of an iris. When someone comes up with an algorithm that can take my shot, compare it to a library, and tell me what wildflower it is, I will be a happy camper. I suspect Flickr and 500px will also become more valuable places since it would be possible to correlate geotagged shots with flora to document what seems to be there.
It's not quite what you want, but I worked on Leafsnap [1], which automatically identifies trees by their leaves, using computer vision techniques. We focused on leaves since they are present throughout much more of the year than flowers. Our free apps also include high-resolution, high-quality photos of all aspects of the species we cover -- leaves, flowers, fruits, bark, etc. So you can at least browse through and compare the flowers you're looking at with those in the app.
Our current coverage is of the trees of the northeast US (about 200 species), but we are working on expanding that.
The relevant table on the results page is Table 3, which is detection performance. Classification is actually an easier problem (see Table 1), in part because the types of scenes in which different classes appear are often quite different, making it easy to avoid some "easy" mistakes.
This is not Pedro Felzenszwalb's discriminative part-model algorithm. This is simple AdaBoost. The authors have labeled a bunch of datasets (1000s of them) and are able to detect whatever object they've trained on.
AdaBoost (Viola/Jones) is the most popular yes/no detector, and there is an OpenCV API for it. It is used for detecting faces and license plates in commercial applications.
The full-person detector is nothing but an SVM+HOG descriptor.
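For reference, that HOG+SVM person detector ships pre-trained with OpenCV; a minimal sketch (the image file name is just an example):

```python
import cv2

# HOG descriptor with OpenCV's bundled, pre-trained linear SVM for people.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")  # hypothetical input image
# Slide the detector over the image at multiple scales; returns boxes + SVM scores.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), padding=(8, 8), scale=1.05)

for (x, y, w, h), score in zip(boxes, weights):
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```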
As a computer vision researcher, I am not impressed by this. It is primarily an API for smartphone app makers who want a binary result for detection. It does not help with scene context analysis. For instance, if I have a big picture of an airplane on a wall, it will detect the airplane. Does it know whether that airplane is in the sky or on a wall?
There are a thousand failure cases.
Failed completely for me across a half dozen tries. I wonder how cheaply you could get results via Mechanical Turk. I bet you could get much more accurate results for a very low price but with some added latency.
One to two cents a task. Anytime you have a language agnostic task (identifying/classifying objects, etc), the tasks can be done very cheaply. Just make sure you do triplicate validation.
Language dependent/creative tasks run much higher (smaller worker pool, more brain power needed).
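A minimal sketch of the "triplicate validation" idea mentioned above, assuming each image is sent to three workers and you keep the majority answer (the function and labels are just illustrative):

```python
from collections import Counter

def majority_label(worker_labels):
    """Return the most common of the three worker answers, or None on a 3-way tie.

    worker_labels: e.g. ["car", "car", "truck"] from three independent workers.
    """
    (label, count), = Counter(worker_labels).most_common(1)
    return label if count >= 2 else None
```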
I've never used mechanical turk before and don't understand what you mean by language agnostic. I'd want someone to tell me that it's a "car" and not a 汽車. And I'd want to give the instructions for the task in English.
I agree, crowdsourcing is the way to go if you need to understand images. Image recognition is a very tough problem, especially if you're trying to detect anything nuanced.
We've developed RTFM at CrowdFlower to handle the similar task of moderating images and providing detailed reasons for why they are flagged. It's a common problem that the computers can't solve well enough yet.
Sheep were detected as horses and faces, but not as cats or cars. This seems to be the current state of the art for general-purpose classification. I haven't seen anything better yet (unless you specialize in sheep detection).
It is probably based on "Object Detection with Discriminatively Trained Part Based Models" by Pedro F. Felzenszwalb... Somebody took http://people.cs.uchicago.edu/~rbg/latent/, made a REST API, and hooked up a payment system.
Hey everybody, OP here. Thanks for the great feedback! We're really happy that so many people have checked this out.
One thing that I want to mention: our service was built favoring precision over recall; we reasoned that we'd rather have a low number of false positives and make sure that when we do report a detection, it actually is one. Thus, our service may occasionally miss instances.
I'm going to implement a button on the Experiment page that lets you flag a detection as something that we need to work on; we will use your feedback to improve the accuracy.
You might want to let the user decide if it is more important to have a false positive or a false negative. For some applications a false alarm is a minor nuisance but a false negative is catastrophic, but for some applications it is flipped. In the past I have let the end user define the balance (i.e. "a false negative is 10X as bad as a false positive") and the decision results were scaled by their decision rule. It's not always easy to do as many machine learning algorithms are nonlinear but at least you can cast a wider net of potential customers.
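A minimal sketch of that kind of user-defined trade-off, assuming the detector exposes a raw score per candidate detection: sweep the decision threshold and pick the one minimising a cost where a miss is, say, 10x as bad as a false alarm (the weights and data layout here are purely illustrative):

```python
import numpy as np

def pick_threshold(scores, labels, fn_cost=10.0, fp_cost=1.0):
    """Choose the score threshold minimising the user's weighted error cost.

    scores: detector confidence for each candidate detection.
    labels: 1 if the candidate is a true object, 0 otherwise.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best_t, best_cost = 0.0, float("inf")
    for t in np.unique(scores):
        predicted = scores >= t
        fp = np.sum(predicted & (labels == 0))   # false alarms
        fn = np.sum(~predicted & (labels == 1))  # missed objects
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```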
Your application detects none of them... Is it because my ancient phone camera's pics are too grainy? Or do the bikes need to be in profile to be detected properly? Or maybe it's trained to detect bikes with people on them, instead of bikes parked in the street?
How would I know that? I don't read documentation until absolutely necessary. If it claims to find faces, well, then let's see it work! And then we'll count the faces found.
In any case, the documentation is wrong if it says that. E.g., the software found all seven SEGs in the photo below:
As a long-time CV enthusiast, I applaud the tech and the way you guys make it "just work". However, for any serious application I feel a few things are missing:
- your pricing won't work for video (even at only 5fps)
- I can't really use the data without a confidence level for each detection, because for some applications I'd rather discard a bounding box that is below a threshold I set (a minimal filtering sketch is below).
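Something as simple as this would do, assuming each detection came back with a score field (the response format here is hypothetical, not Dextro's actual API):

```python
def keep_confident(detections, threshold=0.7):
    """Drop bounding boxes whose confidence falls below a user-chosen threshold.

    detections: e.g. [{"box": [x, y, w, h], "score": 0.91}, ...]  (hypothetical format)
    """
    return [d for d in detections if d["score"] >= threshold]
```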
Hey steeve, thanks for the really kind feedback! We're aware of the video pricing issue, and it's something we're thinking hard about to come up with a solution for makers and developers.
In the meantime, if you want to experiment with Dextro for video, shoot us an email at team@dextrorobotics.com and we will hook you up!
With regard to confidence level, that's something we provide with the enterprise-class service; if this is a critical feature, we can potentially offer it to everyone as well.
Hey tunnuz and limejuice, sorry to hear we only picked up on 4 of the planes. We've biased our service towards precision rather than recall; thus, we try to be wrong about reported detections as rarely as possible, at the expense of perhaps missing a few object instances.
I want to clarify: the 4-object concurrent detection refers to 4 classes of objects. On the Experiment page, you can only choose one class to detect (whether that is people, bottles, cars, etc.). However, by using the API, you can simultaneously search for cars, planes, people, and motorcycles, for example.
It got almost all of them, but with so many errors. It can't detect sheep either.
I was really impressed at first, but as I tried out more and more images, it became apparent that the API isn't mature enough to be worth one or two cents per call. Around 90% of the time the algorithm detects the object correctly, but sometimes it doesn't detect the entire object. For example, I used another image of two jets, and it only found one of them even though the jets were identical apart from one being smaller than the other.
Yes! Scanning for more than 4 objects at once (which is currently supported) is something that we definitely want to enable in the future. The only constraint is the number of GPU machines we can afford.
Very interesting application, but I can't see real-life usage via a web API. As far as I know, this kind of thing is meant for real-time applications, and a web-based approach might not serve that purpose.
Shameless plug: libccv supports a RESTful API as of version 0.4; it is open source and free: http://libccv.org/doc/doc-http/. Trained pedestrian / car / face detectors are included.
It'd be great if you could use this to detect nudity. Any plans for that? I'm assuming the balls on the "in the works" list are of the sport variety? ;)
In the works:
Shoes
Balls
Smartphones and tablets
Dogs
Keyboards
Cups and glasses
Doors
Keys
Seems pretty good, but my first test found a potted plant in the aeroplane demo picture -- a 100-story potted plant :P Very cool idea, super hard problem, so mad respect regardless!