Predator Object Tracking Algorithm (gottabemobile.com)
460 points by helwr on Apr 3, 2011 | 76 comments

The key thing seems not to be the specific algorithm, but the idea of using images obtained during performance for training - and an algorithm that can actually do that. It's an early prototype with lots of room for tweaking, and there are likely radically different learning algorithms, as yet untried or undiscovered, that work better. It seems that in the past, performance images have been religiously kept separate from training images.

It reminds me of early approaches to robot walking, which tried to plan everything out, versus the more recent approaches that incorporate feedback - which turned out to be much simpler and to work better. Sort of waterfall vs. agile.

It seems a tad unreliable (his "mouse pointer" was lost a few times while still on screen), but this is still a prototype. It's really impressive how the panda was tracked through a full 360 degrees of rotation - probably helped by the distinctive colouring.

New input devices (this, Kinect, multi-touch) and applications that can really use them may be a major source of disruptive innovation in computers for the next decade or two.

The tracking was incredibly good. The first few seconds of each example are a little off because the algorithm doesn't have ANY training data to work with. It's 100% ad hoc.

The tracking improved during the mouse pointer example (which I found incredibly impressive). The point he was making during that example was that it learns the different scales/rotations of objects on the fly and tracking improves automatically.

Before commenting, I viewed it a second time, pausing it several times (it's at 1:20, if you'd like to confirm the following). After about 10 seconds of training data (500 frames), he starts to draw with it. It loses tracking four times (not counting the two times his hand goes off-screen), and it's not obvious to me why - there don't seem to be major rotations or changes of his finger placement. In practical use, that would be annoying.

btw: my comment originally included praise for his work - but I thought his merit was obvious and that it distracted from my comment's point, so I deleted it. Instead, I'll just note that the first telephone also had lots of room for improvement - practically, the first of anything does. The cool thing was the idea of the telephone, and then making it real. How good it was is irrelevant compared with the fact that it became real. It doesn't take away from the immense task of doing something that had not been done before or even imagined. Quality is not as crucial, because once you have the basic thing, it's (relatively) easy to iterate to improve it. I think having the idea and making it real deserves far, far greater admiration than the quality of the prototype algorithm and implementation. Just as with the telephone.

No offense, but I'm gonna trust the guy who did this for his PhD when he says that it was novel and difficult :)

Not to dismiss its importance, but the same idea is used in Ensemble Tracking: http://www.merl.com/projects/ensemble-tracking/

This is massively groundbreaking. You'll get it if you've used motion tracking on several game interfaces and had to make perfectly white backgrounds with bright lights to make it work. This is incredibly accurate - really game-changing stuff.

>This is massively groundbreaking.

Sounds like you're used to bad algorithms. I think there is a serious disconnect between the state of the art in computer vision and what's used in industry.

The demo was cool, but the techniques are not that revolutionary. From a cursory glance through the papers, it is basically AdaBoost (for detection) and Lucas-Kanade (for tracking), with a few extensions.
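
For anyone who hasn't played with those pieces: below is a minimal sketch of just the Lucas-Kanade half, using OpenCV's stock pyramidal LK tracker on webcam frames. It is not the author's TLD code - there's no detection and no learning - just the classic tracker those papers build on. The webcam index and parameter values are arbitrary assumptions.

    # Minimal Lucas-Kanade tracking sketch with OpenCV (not the TLD/Predator code).
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                     # assumes a webcam at index 0
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    # Pick corners to follow (TLD instead samples points inside a user-drawn box).
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.01, minDistance=7)

    while True:
        ok, frame = cap.read()
        if not ok or pts is None or len(pts) == 0:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Pyramidal LK: estimate where each point moved between frames.
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, pts, None, winSize=(15, 15), maxLevel=2)
        good = new_pts[status.flatten() == 1]
        for x, y in good.reshape(-1, 2):
            cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)

        cv2.imshow("LK tracking sketch", frame)
        if cv2.waitKey(1) & 0xFF == 27:           # Esc quits
            break
        prev_gray, pts = gray, good.reshape(-1, 1, 2)

    cap.release()
    cv2.destroyAllWindows()

TLD's contribution, as the comment says, is what sits on top of this: a detector trained online so the tracker can be re-initialised after failure.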

Not to discount the guy's work at all - it's very cool and does a good job of pulling together existing algorithms. But it's not groundbreaking in the sense that, say, Viola-Jones was for object detection.

Valid comment, but it's analogous to claiming peanut butter cups aren't novel because peanut butter and chocolate were both well known. There is novelty in being able to synthesize new systems from known elements, which frankly I don't believe gets quite the credit it deserves. But that's just me.

The point is of course that being broadly familiar with a number of things can help you put together a novel thing out of a previously unknown combination of those things.

Spot on.

There's a lot of current work going on that effectively splits computer vision into multiple parallel tasks for better results but uses previously well-known techniques (PTAM is another good example).

As an aside, I read through the paper and it doesn't look like this could track, say, your index finger separately from your other fingers if, for a moment, your hand was occluded. This pretty much bars using it exclusively in a Minority Report-style interface (you would need hand pose tracking like the stuff Kinect does). Though I'm just reiterating your point that this isn't the second coming of computer vision.

That being said, there are some really good ideas here.

I don't understand why everyone seems to have such a hardon for Minority Report-style systems. Gorilla arm pretty much rules that out from the start, and a tablet is more natural anyway.

A trackpad with a separate screen would be optimal (so you don't have to look at your hands).

Gorilla arm would prevent people from using that kind of system to replace mouse and keyboard, but I don't see why it could not work for some applications. I can think of several use cases where I would like a UI that does not require me to touch the hardware (think cooking or watching videos in the bath tub).

Minority Report is tremendously well-remembered as an interface concept and almost forgotten as a Spielberg movie. The Wikipedia article's longest section is "Technology."

By now, it's practically a 145-minute tech concept video with a plot starring Tom Cruise.

Yes, using 2D displays in Minority Report was a huge mistake, but imagine it in 3D. Also, that doesn't mean you would have to keep your arms out in front of your eyes. Ideally you wouldn't have to sit in front of your computer the whole day using only a keyboard and a mouse, when you could have so much more freedom. Think of opening a book, ironing, or placing Lego blocks, etc.

Tony Stark in Iron Man: JARVIS - that was his greatest invention. Without JARVIS he couldn't have made his later-generation suits.

Since you're familiar with the topic: does this look lightweight enough for, say, mobile applications, or does it require massive processing power?

According to his website [1] for this, he says that "TLD has been tested using standard hardware: webcam, Intel Core 2 Duo CPU 2.4 GHz, 2 GB RAM, no GPU processing is used and runs in a single thread. The demands of the algorithm depend on required accuracy of the algorithm. Implementation for mobile devices is feasible." in response to "What kind of hardware it was running on?"

So, according to him, it is lightweight enough to run on mobile devices. I'd imagine there are also several optimizations that can be done (leveraging multi-core chips or GPUs, for instance) to make the performance significantly better than the prototype he's demonstrating now. Also, taking into account Moore's Law, we may not be able to run this on today's mobile devices, but surely could on tomorrow's. Given that research is generally a few years ahead of industry, I would expect that, by the time this would come to market, the devices will be more than capable.

[1]: http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html

I think it has an online-training algo for its 2-bit binary patterns as well. Haven't checked out the paper yet, though.

Yeah, pruning and growing random forests (whatever that's formally called)
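
For anyone curious what that looks like in code, here's a toy numpy sketch of an online fern-style classifier over random pixel comparisons, roughly the flavour of the "2-bit binary pattern" idea mentioned above. It's an illustration only, not the paper's actual feature set or implementation; every size and constant is made up.

    # Toy online fern classifier: each fern hashes a patch into a small binary
    # code via random pixel comparisons, keeps positive/negative counts per code,
    # and updates those counts on the fly as new examples arrive.
    import numpy as np

    rng = np.random.default_rng(0)
    PATCH = 15                                    # patches resampled to 15x15
    N_FERNS, N_BITS = 10, 8

    # Each bit compares two random pixel positions within the patch.
    cmp_a = rng.integers(0, PATCH * PATCH, size=(N_FERNS, N_BITS))
    cmp_b = rng.integers(0, PATCH * PATCH, size=(N_FERNS, N_BITS))
    pos = np.ones((N_FERNS, 2 ** N_BITS))         # Laplace-smoothed counts
    neg = np.ones((N_FERNS, 2 ** N_BITS))

    def codes(patch):
        flat = patch.reshape(-1)
        bits = (flat[cmp_a] > flat[cmp_b]).astype(int)
        return bits.dot(1 << np.arange(N_BITS))   # one integer code per fern

    def posterior(patch):
        c = codes(patch)
        p, n = pos[np.arange(N_FERNS), c], neg[np.arange(N_FERNS), c]
        return np.mean(p / (p + n))

    def update(patch, is_object):
        """Online learning step: grow the counts for the labelled class."""
        c = codes(patch)
        (pos if is_object else neg)[np.arange(N_FERNS), c] += 1

    # Tiny usage example on synthetic "patches":
    obj, bg = rng.random((PATCH, PATCH)), rng.random((PATCH, PATCH))
    for _ in range(20):
        update(obj + 0.05 * rng.random((PATCH, PATCH)), True)
        update(bg + 0.05 * rng.random((PATCH, PATCH)), False)
    print(posterior(obj), posterior(bg))          # object patch scores higher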

Agreed about the disconnect between the state of the art and industry application of CV.

Also, the applications are enormous. Watch the end of the video for ideas. For example, it will be a huge enabler in robotics and some of the quadcopter applications you've been seeing.

Keep in mind that there's no report of the processing power required to do this in the video. It very well could be an algorithm that is extremely accurate, but at the expense of many CPU cycles. While it's obvious, remember that for use in games, you have to perform the detection and run the game in realtime. Whether or not this can be done with current hardware is what I'm interested in.

On the website he claims it's a fairly standard dual-core setup, and you can see the number of frames per second in the video. I noticed at one point it was staying around 15fps. It may not work at this point for an FPS (especially because it would have to run alongside graphics and other game-related processes), but it would likely be fine with other, less fast-paced games.

At least some of the processing could also be off-loaded to an accessory device. How much work is the Kinect doing vs. the actual 360?

Yesterday I bodged together something that's maybe half as good by gluing various OpenCV components together with numpy, and that runs at 15fps on an Atom netbook. I get the impression that this is just a lot more clever with the algorithms it uses, rather than specifically relying on the CPU grunt.

As this doesn't seem like an April Fools' joke (some of the papers were published last year :-)), it's interesting to think about it in the context of what it might change. That being said, I don't doubt for a minute that the university has locked up as much of the technology as possible in patents, but that is another story. We can speculate about what it will be like in 20 years when people can do this without infringing :-)

Clearly it could be applied immediately to robotic manufacturing. Tracking parts, understanding their orientation, and manipulating them all get easier when it's 'cheap' to add additional tracking sensors.

Three systems sharing data (front, side, top) would give some very good expressive options for motion based UIs or control.

Depending on how well the computational load can be reduced to hardware, small systems could provide for head-mounted tracking systems. (See CMUCam [1] for "small".)

The training aspect seems to be a weak link, in that some applications would need to have the camera 'discover' what to track and then track it.

A number of very expensive object tracking systems used by law enforcement and the military might get a bit cheaper.

Photographers might get a mode where they can specify 'take the picture when this thing is centered in the frame' for sports and other high-speed activities.
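
That mode is basically a few lines on top of anything that returns a bounding box per frame. The sketch below uses one of OpenCV's stock trackers as a stand-in for Predator (assumes opencv-contrib is installed; the 5% centering tolerance and webcam index are arbitrary):

    # "Take the picture when this thing is centered" sketch, with a stock
    # OpenCV tracker standing in for Predator/TLD.
    import cv2

    cap = cv2.VideoCapture(0)                         # assumes webcam at index 0
    ok, frame = cap.read()
    box = cv2.selectROI("select subject", frame)      # user draws the initial box
    tracker = cv2.TrackerKCF_create()                 # stand-in tracker, not TLD
    tracker.init(frame, box)

    shot = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, (x, y, w, h) = tracker.update(frame)
        if found:
            cx, cy = x + w / 2, y + h / 2
            fh, fw = frame.shape[:2]
            # "Centered" = box centre within 5% of the frame centre.
            if abs(cx - fw / 2) < 0.05 * fw and abs(cy - fh / 2) < 0.05 * fh:
                cv2.imwrite("shot_%03d.png" % shot, frame)
                shot += 1
        cv2.imshow("frame", frame)
        if cv2.waitKey(1) & 0xFF == 27:               # Esc quits
            break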

Very nice piece of work.

[1] http://www.cs.cmu.edu/~cmucam/

Depending on how well the computational load can be reduced to hardware, small systems could provide for head-mounted tracking systems

He's clearly developing it on his laptop with a shitty webcam. That's why this is amazing. Screw robotic manufacturing, this is for my phone.

Phone != Laptop.

It says he's running on an "Intel Core 2 Duo CPU 2.4 GHz, 2 GB RAM" according to his website. As a good rule of thumb, computer vision runs about an order of magnitude slower (10x) on a phone (like an iPhone) than on a desktop/laptop.

Also - a crappy webcam actually makes things computationally easier because there's less data to deal with. In a lot of computer vision algorithms the first step is to take input and resize it to something that can be computed on in a reasonable time frame.
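
For the curious, that first step usually looks something like this (the scale factor is arbitrary and "input.png" is a hypothetical frame):

    # Downscale before doing any heavy vision work, then map results back up.
    import cv2

    frame = cv2.imread("input.png")                   # hypothetical input frame
    scale = 0.25                                      # 16x fewer pixels overall
    small = cv2.resize(frame, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    # ...run detection/tracking on `gray`; a box (x, y, w, h) found there maps
    # back to full resolution as (x/scale, y/scale, w/scale, h/scale).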

It says he's running on an "Intel Core 2 Duo CPU 2.4 GHz, 2 GB RAM" according to his website

I bet he isn't using the GPU though.

Also - a crappy webcam actually makes things computationally easier because there's less data to deal with

Perhaps, but lens distortion, motion blur and a rolling shutter don't make things easier.

Anyway, the inventor himself claims a phone implementation is feasible.

Yep, I'm sure he isn't. I don't doubt that you could optimize this algorithm to run on a phone, but that takes an insane amount of effort and expertise and is a feat in and of itself. The Word Lens guys, for example, spent about a year porting from an optimized C implementation on i386 to ARM for the iPhone - they even initially used the GPU but decided that the overhead of shuffling data between buffers wasn't worth the advantage gained by the iPhone's measly GPU (which only had 2 fragment shaders at the time, I think).

Also, I completely agree that camera blur would worsen the accuracy of said algorithm; I was trying to point out that it would run faster on a lower-quality camera (with the caveat that it might not work nearly as well).

Specialized processing hardware != general-use CPU. Face tracking and image stabilization in dirt-cheap cameras is a good example, as are hardware video decoders and graphics cards. If a market emerges, specialized hardware will be built, and it'll be embeddable in just about anything.

Face tracking is a remarkably well-solved problem these days.

I have only a vague understanding of the math behind how it works, yet I'm very successfully using it in an art project I'm playing with. An afternoon's Googling found me the OpenCV plugins for Processing and some face detection examples, and I've got a prototype that really disturbs my girlfriend - I call it "Death Ray" for extra creepiness factor[1]. I've got an infra-red-capable camera mounted on a pair of servos to steer it, and another pair of servos aiming a low-power laser. An Arduino drives the servos and switches the laser, with Processing just "magically" calling OpenCV for face detection in the video stream - _all_ the "heavy lifting" has been done for me - vive l'open source!

[1] The thing that _really_ creeps the girl out is when I sit it all on top of the TV, and have it find faces watching the TV and paint "predator-style aiming dots" onto people's foreheads...
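
For anyone wanting to replicate the non-laser part, a rough Python equivalent of the Processing + OpenCV piece looks like the sketch below: detect a face, map its centre to pan/tilt angles, send them to the Arduino over serial. The port name, angle mapping, and wire protocol are made up for illustration; the original uses Processing, not Python.

    # Face detection -> pan/tilt servo angles over serial (illustrative only).
    import cv2
    import serial                                     # pyserial

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    arduino = serial.Serial("/dev/ttyUSB0", 9600)     # hypothetical port
    cap = cv2.VideoCapture(0)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
        if len(faces):
            x, y, w, h = faces[0]
            fh, fw = frame.shape[:2]
            pan = int(180 * (x + w / 2) / fw)         # crude linear mapping
            tilt = int(180 * (y + h / 2) / fh)
            arduino.write(("%d,%d\n" % (pan, tilt)).encode())  # Arduino parses this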

> Phone != Laptop

Wait a year or two and phones will be as powerful as today's laptops.

That 'first step' is so dangerous it's mind-blowing. One thing that is seriously holding academic CV back is datasets made for slow computers. Eyes take advantage of every possible input, and the idea that you should start your CV task by throwing away data to make it 'easier' is so dumb it's laughable. While I admit industry demands speed, if you have the luxury of doing pure research today and you're using black-and-white images, you're not even wrong.

Interesting that TFA mentions "Minority Report-like interfaces" several times when: 1.) The Minority Report interface is the canonical example of a UI that is very impressive visually, and is beautifully mediagenic; but is hideously fatiguing and impractical in a real world scenario. (Hold your hand out at arm's length. Okay, now hold that pose for eight hours.) 2.) The MR UI has actually been commercialized, and has entirely failed to take the world by storm.

Also, computer vision demos are trivially easy to fake, and it's even easier to make an impressive demo video. You can have the guy who invented it spend a couple hours in front of the camera trying it over and over, then edit it down to three minutes of the system working perfectly. It wouldn't be nearly as impressive when you have an untrained user trying it live, in the field.

From his webpage at Surrey: "We have received hundreds of emails asking for the source code ranging from practitioners, students, researchers up to top companies. The range of proposed projects is exciting and it shows that TLD is ready to push the current technology forward. This shows that we have created something "bigger" than originally expected and therefore we are going to postpone the release of our source code until announced otherwise. Thank you for understanding."

Also, the message where he stated the source code is under GPL 2.0 disappeared. Seems that he chose to leave Richard Stallman empty-handed and go to the dark side.

Did anybody get a GPLv2'd copy?

With something like this we could have a truly "Minority Report"-style human-computer interface.

Actually, the guy who invented the Minority Report interface commercialized it and has been selling it for years. Product website: http://oblong.com Edit: better video: http://www.ted.com/talks/john_underkoffler_drive_3d_data_wit...

Predator doesn't need gloves

Well, he mentioned a "Minority Report"-style interface, and there it is. At least they could use cooler gloves: http://singularityhub.com/2010/05/28/mits-ridiculously-color... :)

Which is presumably what Microsoft means when they say "The next PC is the room".

Technical details here, with links to relevant papers at the bottom. http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html

Unfortunately you cannot download the source code; the link is disabled. And the GPL license he says he is using requires that the source code be available for download without any restrictions like "send me an email" or "create an account".

Edit: I sent him an email :)

Doesn't the GPL only require you to release the code if you publicly release the software binaries? I was under the impression that if you only released the results (aka his research and papers) you aren't required to make the source open.

He's licensing the code to you under the GPL. He's free to use his own code however he likes.

...and you're free to redistribute it!

Consider it done.

Done? Done where?

The GPL says nothing of the sort. Besides which, it's his code. He's not bound by the GPL.

OK, so the fact that he has produced this himself, using off-the-shelf commodity laptops etc., is really great.

But this technology doesn't seem new to me - technology already exists for surveillance cameras in police and military helicopters to track an object like a car and keep it in view as the helicopter turns and maneuvers.

Likewise, facial recognition - both statically and within a video stream - isn't new either.

Not taking anything away from the guy, but just wondering what it is I'm not getting that is new/amazing with this particular implementation?

The face recognition part seemed almost too good, given that it didn't pick up other people's faces. Or was it just detecting the most similar face?

But facial recognition aside, the uses are endless. If it can be brought to the same level Kinect drivers are at, but with finger tracking and no custom hardware, this could change everything.

Bah! I was hoping to download the source (from here: http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html) and check out his algorithm, but he requires you to email him with your project. If anyone knows how the algorithm works, or where it is described in detail, I'd love to read that!

Absolutely amazing stuff!

What if you just... email him and ask for it?

Don't really have a project, just curiosity...

I certainly intend to drop him an email and see if he's willing to share his stuff with an interested compsci undergrad. It's always worth asking - he's probably just doing it for metrics.

Let us know what happens please, I'm also interested.

Well, you might join a plethora of other HN readers who also email him asking for a copy, and he might just remove the GPLv2 notice and "postpone the release of [the] source code until announced otherwise"

He's published 4 papers on the subject (they're available at the bottom of the page) - why not start there?

Hopefully he will put it back, with a donation button ;)

Every time something like this comes out, I feel us taking a step away from "video camera mounted on a robot where the eyes should be" and a step toward real perception. I always wonder though, if a computer can one day recognize all different types of hands, could it draw a new one?

To answer your question, you can watch this presentation by Prof. Hinton: http://www.youtube.com/watch?v=AyzOUbkUf3M

He shows how he trained a restricted Boltzmann machine to recognize handwritten digits and how he can run it in reverse as a generative model; in effect the machine 'dreams' up all kinds of digits that it has not been trained on, but which are nonetheless properly formed and legible.
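
If you want to see the mechanism without watching the whole talk, here's a toy numpy RBM trained with one-step contrastive divergence on made-up binary data, then run "in reverse" with Gibbs sampling to generate samples. Purely didactic; it is not Hinton's code or his digit experiment.

    # Toy binary RBM: CD-1 training, then Gibbs sampling to "dream" new samples.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr = 20, 10, 0.1
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample(p):
        return (rng.random(p.shape) < p).astype(float)

    # Fake "data": repeated copies of two binary prototypes.
    protos = rng.integers(0, 2, size=(2, n_visible)).astype(float)
    data = protos[rng.integers(0, 2, size=200)]

    for epoch in range(100):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b_h)               # positive phase
            h0 = sample(ph0)
            pv1 = sigmoid(h0 @ W.T + b_v)             # one Gibbs step (CD-1)
            v1 = sample(pv1)
            ph1 = sigmoid(v1 @ W + b_h)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            b_v += lr * (v0 - v1)
            b_h += lr * (ph0 - ph1)

    # "Dreaming": start from noise and alternate sampling hidden/visible units.
    v = sample(np.full(n_visible, 0.5))
    for _ in range(100):
        h = sample(sigmoid(v @ W + b_h))
        v = sample(sigmoid(h @ W.T + b_v))
    print(v)    # tends to settle near one of the training prototypes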

The world becomes a better place with code like this available for the public to build on, and not just for the military to use in homing seeker heads. I guess it's one point for "make something free that was initially available for pay" - just like Plenty of Fish is doing... http://www.vti.mod.gov.rs/vti/lab/e-tv.htm

> Can Predator be used to stabilize and navigate a Quadcopter?

> That is not straightforward.

Anyone know why not?

Probably because this is just 2D tracking rather than 3D mapping. But tracking can be applied to mapping, for example: http://www.robots.ox.ac.uk/~gk/PTAM/

So the question is, can Predator be used to improve mapping? AFAIK, that would require a) automatically selecting trackable objects and b) tracking many of them simultaneously. That PTAM technique tracks thousands of points, but with tracking this reliable, you might get by with far fewer.

So, more work is required to apply it to mapping, but I have to imagine it could be done. And seeing how well predator adapts to changes in scale, orientation, and visibility, I suspect it could improve mapping considerably.

"AFAIK, that would require a) automatically selecting trackable objects and b) tracking many of them simultaneously."

I'm not really sure I understood you, but these two problems are already solved. Hugin[1] for example has automatic control point generation for photo stitching. Were you talking about something else?


Yeah, there are many ways to detect features, but I haven't read the paper yet, so I don't know what kind of features it wants or whether there are any problems with choosing them automatically. Like, can it group features into distinct objects without a human pointing them out?
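
One crude way to get the "automatically selecting trackable objects" part with off-the-shelf pieces: detect corners, cluster them spatially, and hand each cluster's bounding box to a tracker. A sketch follows; the fixed cluster count and file names are arbitrary assumptions, and real scenes would need something smarter than plain k-means.

    # Candidate object boxes from corner features + spatial clustering.
    import cv2
    import numpy as np

    img = cv2.imread("scene.png")                     # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    corners = cv2.goodFeaturesToTrack(gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=5)
    pts = corners.reshape(-1, 2).astype(np.float32)

    K = 4                                             # assume ~4 objects in view
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 50, 1.0)
    _, labels, _ = cv2.kmeans(pts, K, None, criteria, 10, cv2.KMEANS_PP_CENTERS)

    boxes = []
    for k in range(K):
        cluster = pts[labels.flatten() == k]
        x, y, w, h = cv2.boundingRect(cluster)
        boxes.append((x, y, w, h))
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite("candidates.png", img)                # each box could seed a tracker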

There are much easier existing algorithms that are already used by, for example, the Parrot AR Drone, so using this for that would not be optimal.

The video where the system tracks Roy from The IT Crowd sucking his fingers is epic :) http://www.youtube.com/user/ekalic2#p/u/2/tKXX3A2WIjs

It must be shown what to track. That is, you (or some other external system) define the "object" to be tracked by clicking on a bounding box.

A good addition would be an algorithm that automatically delineated "objects" in the visual field, then passed them to Predator.

Which raises another question: how many "objects" can Predator simultaneously track (with given horsepower)?
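
On the "how many objects" question: with one tracker per box, the cost grows roughly linearly in the number of objects, since each tracker runs independently. Below is a sketch of the bounding-box initialisation plus several simultaneous trackers, using OpenCV's stock KCF tracker as a stand-in for Predator (assumes opencv-contrib and a webcam at index 0).

    # Multiple user-drawn boxes, one independent tracker per box.
    import cv2

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    boxes = cv2.selectROIs("draw one box per object, Esc when done", frame)
    trackers = []
    for box in boxes:
        t = cv2.TrackerKCF_create()                   # stand-in tracker, not TLD
        t.init(frame, tuple(int(v) for v in box))
        trackers.append(t)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for t in trackers:                            # cost grows per object
            found, (x, y, w, h) = t.update(frame)
            if found:
                cv2.rectangle(frame, (int(x), int(y)),
                              (int(x + w), int(y + h)), (0, 255, 0), 2)
        cv2.imshow("multi-object tracking sketch", frame)
        if cv2.waitKey(1) & 0xFF == 27:               # Esc quits
            break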

Wow, this is pretty amazing stuff. I sincerely hope this guy makes a pile of money off this.

This looks impressive. I've written tracking systems previously, so can appreciate that it's not an easy problem to solve.

Extremely impressive. Can't wait to see how this is applied to everyday problems. Kudos to this gentleman.

This looks awesome! I want to build this into apps for iPad 2.

Uhhh... 'Predator?' What's his next project, SkyNet Resource Planning? This seems like an April Fools' joke to me. I mean, I'm sure he's done work in the area... but the article is dated April 1 and the previous literature didn't mention 'Predator.' I could be wrong, but it seems too advanced, and scary.

So nobody else found it suspicious that the source code was promised but not actually available? And now it's not even promised.

