It reminds me of early approaches for robot walking, which tried to plan everything out, and more recent approaches of incorporating feedback - which turned out to be much simpler and work better. Sort of waterfall vs. agile.
It seems a tad unreliable (his "mouse pointer" was lost a few times while still on screen), but this is still a prototype. It's really impressive how the panda was tracked through a full 360° of orientation - probably helped by its distinctive colouring.
New input devices (this, Kinect, multi-touch), and applications that can really use them, may be a major source of disruptive innovation in computing for the next decade or two.
The tracking improved during the mouse pointer example (which I found incredibly impressive). The point he was making during that example was that it learns the different scales/rotations of objects on the fly and tracking improves automatically.
btw: my comment originally included praise for his work - but I thought his merit was obvious and that it distracted from my comment's point, so I deleted it. Instead, I'll just note that the first telephone also had lots of room for improvement - practically, the first of anything does. The cool thing was the idea of the telephone, and then making it real. How good it was matters far less than the fact that it became real. That doesn't take away from the immense task of doing something that had not been done before, or even imagined. Quality is not as crucial, because once you have the basic thing, it's (relatively) easy to iterate to improve it. I think having the idea and making it real deserves far, far greater admiration than the quality of the prototype algorithm and implementation. Just as with the telephone.
Sounds like you're used to bad algorithms. I think there is a serious disconnect between the state of the art in computer vision and what's used in industry.
The demo was cool, but the techniques are not that revolutionary. From a cursory glance through the papers, it is basically AdaBoost (for detection) and Lucas-Kanade (for tracking), with a few extensions.
Not to discount the guy's work at all - it's very cool and does a good job of pulling together existing algorithms. But it's not groundbreaking in the way that, say, Viola-Jones was for object detection.
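For anyone curious what the Lucas-Kanade half looks like in practice, here's a minimal sketch using OpenCV's pyramidal LK optical flow. This is just the stock library call, not his code; the corner selection stands in for the points inside the object's bounding box.

```python
import cv2

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Pick some corners to follow (in his system these would sit inside the
# object's bounding box rather than being taken from the whole frame).
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: estimate where each point moved between frames.
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good = new_pts[status.flatten() == 1]
    for x, y in good.reshape(-1, 2):
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("LK sketch", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
        break
    prev_gray, pts = gray, good.reshape(-1, 1, 2)
```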
The point is of course that being broadly familiar with a number of things can help you put together a novel thing out of a previously unknown combination of those things.
There's a lot of current work that effectively splits computer vision into multiple parallel tasks for better results but uses previously well-known techniques (PTAM is another good example).
As an aside, I read through the paper, and it doesn't look like this could track, say, your index finger separately from your other fingers if your hand was occluded for a moment. That pretty much rules out using this on its own for a Minority Report-style interface (you'd need hand-pose tracking like the stuff Kinect does). Though I'm just reiterating your point that this isn't the second coming of computer vision.
That being said, there are some really good ideas here.
A trackpad with a separate screen would be optimal (so you don't have to look at your hands).
By now, it's practically a 145-minute tech concept video with a plot starring Tom Cruise.
So, according to him, it is lightweight enough to run on mobile devices. I'd imagine there are also several optimizations that could be made (leveraging multi-core chips or GPUs, for instance) to get performance significantly better than the prototype he's demonstrating now. Also, taking Moore's Law into account, we may not be able to run this on today's mobile devices, but surely could on tomorrow's. Given that research is generally a few years ahead of industry, I would expect that by the time this comes to market, the devices will be more than capable.
Clearly it could be applied immediately to robotic manufacturing. Tracking parts, understanding their orientation, and manipulating them all get easier when it's 'cheap' to add additional tracking sensors.
Three systems sharing data (front, side, top) would give some very good expressive options for motion-based UIs or control.
Depending on how well the computational load can be reduced to hardware, small systems could provide head-mounted tracking. (See the CMUcam for an example of 'small'.)
The training aspect seems to be a weak link, in that some applications would need to have the camera 'discover' what to track and then track it.
A number of very expensive object tracking systems used by law enforcement and the military might get a bit cheaper.
Photographers might get a mode where they can specify 'take the picture when this thing is centered in the frame' for sports and other high-speed activities (a rough sketch of that trigger logic is below).
Very nice piece of work.
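The shutter logic for that photographer mode would be a thin layer on top of whatever tracker you use. A rough sketch - here `bbox` is any tracker's per-frame output and `fire_shutter()` is a placeholder, not anything from his code:

```python
def centered(bbox, frame_w, frame_h, tolerance=0.05):
    """True when the tracked box's centre is within `tolerance`
    (as a fraction of frame size) of the frame centre."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    return (abs(cx - frame_w / 2.0) / frame_w < tolerance and
            abs(cy - frame_h / 2.0) / frame_h < tolerance)

# Per frame: if centered(bbox, frame_w, frame_h): fire_shutter()
```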
He's clearly developing it on his laptop with a shitty webcam. That's why this is amazing. Screw robotic manufacturing, this is for my phone.
His website says he's running it on an "Intel Core 2 Duo CPU 2.4 GHz, 2 GB RAM". As a good rule of thumb, computer vision runs about an order of magnitude slower (10x) on a phone (like an iPhone) than on a desktop/laptop.
Also - a crappy webcam actually makes things computationally easier because there's less data to deal with. In a lot of computer vision algorithms the first step is to take input and resize it to something that can be computed on in a reasonable time frame.
I bet he isn't using the GPU though.
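For what it's worth, the resize step is usually a one-liner. A trivial sketch with OpenCV; the 320x240 working resolution is arbitrary:

```python
import cv2

frame = cv2.imread("frame.png")                  # or a frame grabbed from the webcam
small = cv2.resize(frame, (320, 240))            # detection/tracking runs on far fewer pixels
gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
# Any coordinates found in `small` get scaled back up before drawing on the full frame.
```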
> Also - a crappy webcam actually makes things computationally easier because there's less data to deal with
Perhaps, but lens distortion, motion blur and a rolling shutter don't make things easier.
Anyway, the inventor himself claims a phone implementation is feasible.
Also, I completely agree that camera blur would worsen the accuracy of the algorithm; I was just trying to point out that it would run faster on a lower-quality camera (with the caveat that it might not work nearly as well).
I have only a vague understanding of the math behind how it works, yet I'm very successfully using it in an art project I'm playing with. An afternoon's Googling found me the OpenCV plugins for Processing and some face detection examples, and I've got a prototype that really disturbs my girlfriend - I call it "Death Ray" for extra creepiness factor. It's an infra-red-capable camera mounted on a pair of servos to steer it, and another pair of servos aiming a low-power laser. An Arduino drives the servos and switches the laser, with Processing just "magically" calling OpenCV for face detection in the video stream - _all_ the "heavy lifting" has been done for me. Vive l'open source!
The thing that _really_ creeps the girl out is when I sit it all on top of the TV, and have it find faces watching the TV and paint "Predator-style aiming dots" onto people's foreheads...
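For anyone who wants to build their own, the shape of it is roughly this. This is my guess at the detection/aiming loop in Python rather than Processing, using OpenCV's stock Haar cascade and pyserial; the serial port and the two-byte pan/tilt protocol are made up for the sketch, not his actual setup:

```python
import cv2
import serial

# Stock frontal-face Haar cascade that ships with OpenCV.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
arduino = serial.Serial("/dev/ttyUSB0", 9600)    # Arduino drives the servos and the laser
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        # Map the face centre to rough servo angles (0-180) and send them over serial.
        pan = int(180 * (x + w / 2) / frame.shape[1])
        tilt = int(180 * (y + h / 2) / frame.shape[0])
        arduino.write(bytes([pan, tilt]))
```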
Wait a year or two and phones will be as powerful as today's laptops.
Also, computer vision demos are trivially easy to fake, and it's even easier to make an impressive demo video. You can have the guy who invented it spend a couple hours in front of the camera trying it over and over, then edit it down to three minutes of the system working perfectly. It wouldn't be nearly as impressive when you have an untrained user trying it live, in the field.
Also, the message where he stated the source code was under GPL 2.0 has disappeared. Seems that he chose to leave Richard Stallman empty-handed and go to the dark side.
Actually, the guy who invented the Minority Report interface commercialized it and has been selling it for years. Product website: http://oblong.com Edit: better video: http://www.ted.com/talks/john_underkoffler_drive_3d_data_wit...
Edit: I sent him an email :)
But this technology doesn't seem new to me - technology already exists for surveillance cameras in police and military helicopters to track an object like a car and keep it in view as the helicopter turns and maneuvers.
Likewise, facial recognition - both statically and within a video stream - isn't new either.
Not taking anything away from the guy, but just wondering what it is I'm not getting that is new/amazing with this particular implementation?
But facial recognition aside, the uses are endless. If it can be brought to the same level Kinect drivers are at, but with finger tracking and no custom hardware, this could change everything.
Absolutely amazing stuff!
He shows how he trained a restricted Boltzmann machine to recognize handwritten numbers, and how he can run it in reverse as a generative model: in effect, the machine 'dreams' up all kinds of numbers it has not been trained on, yet makes up properly formed, legible digits.
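For anyone wondering what "running it in reverse" means mechanically: it's alternating Gibbs sampling between the visible and hidden units of an already-trained RBM. A bare-bones sketch - the weights here are random placeholders, so this produces noise rather than digits; swap in trained parameters to see the 'dreaming':

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Placeholder parameters: 28x28 visible units, 500 hidden units.
# In the real demo these would come from training on handwritten digits.
W = rng.normal(scale=0.01, size=(784, 500))
b_v = np.zeros(784)
b_h = np.zeros(500)

v = rng.random(784) < 0.5                  # start from a random visible vector
for _ in range(1000):                      # alternating Gibbs sampling
    p_h = sigmoid(v @ W + b_h)             # hidden given visible
    h = rng.random(500) < p_h
    p_v = sigmoid(h @ W.T + b_v)           # visible given hidden: the "dreaming" direction
    v = rng.random(784) < p_v

fantasy = p_v.reshape(28, 28)              # with trained weights, a made-up but legible digit
```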
> That is not straightforward.
Anyone know why not?
So the question is, can Predator be used to improve mapping? AFAIK, that would require a) automatically selecting trackable objects and b) tracking many of them simultaneously. The PTAM technique tracks thousands of points, but with tracking this reliable, you might get by with far fewer.
So, more work is required to apply it to mapping, but I have to imagine it could be done. And seeing how well Predator adapts to changes in scale, orientation, and visibility, I suspect it could improve mapping considerably.
I'm not really sure I understood you, but these two problems are already solved. Hugin, for example, has automatic control point generation for photo stitching. Were you talking about something else?
A good addition would be an algorithm that automatically delineated "objects" in the visual field, then passed them to Predator.
Which raises another question: how many "objects" can Predator simultaneously track (with given horsepower)?
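No idea on the exact number, but the brute-force way to find out is to run N independent tracker instances and watch the frame rate. A sketch using OpenCV's MIL tracker as a stand-in for Predator (not what he ships); per-frame cost grows roughly linearly with the number of trackers:

```python
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()

# Draw one box per object to follow; spin up an independent tracker for each.
boxes = cv2.selectROIs("pick objects", frame)
trackers = []
for box in boxes:
    t = cv2.TrackerMIL_create()            # stand-in for Predator/TLD
    t.init(frame, tuple(int(v) for v in box))
    trackers.append(t)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    for t in trackers:                     # cost scales with the number of trackers
        found, bbox = t.update(frame)
        if found:
            x, y, w, h = map(int, bbox)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("multi-track", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break
```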