According to his website, he's running on an "Intel Core 2 Duo CPU 2.4 GHz, 2 GB RAM". As a good rule of thumb, computer vision runs about an order of magnitude (10x) slower on a phone (like an iPhone) than on a desktop/laptop.
Also, a crappy webcam actually makes things computationally easier because there's less data to deal with. In a lot of computer vision algorithms the first step is to take the input and downscale it to a size that can be processed in a reasonable time frame.
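To put a rough number on the "less data" point, here is a minimal sketch in plain Python (no imaging library) of downscaling a grayscale frame by averaging blocks of pixels. The `downscale` and `pixel_count` helpers and the 640x480 frame are assumptions made up for illustration, not anything taken from his code.

```python
def downscale(frame, factor):
    """Shrink a 2D grayscale frame by averaging factor x factor pixel blocks.

    `frame` is a list of rows of ints. Edge pixels that do not fill a whole
    block are dropped (a simplification assumed for this sketch).
    """
    out_h = len(frame) // factor
    out_w = len(frame[0]) // factor
    small = []
    for by in range(out_h):
        row = []
        for bx in range(out_w):
            total = 0
            for y in range(by * factor, (by + 1) * factor):
                for x in range(bx * factor, (bx + 1) * factor):
                    total += frame[y][x]
            # Integer average of the block.
            row.append(total // (factor * factor))
        small.append(row)
    return small


def pixel_count(frame):
    """Number of pixels in a 2D frame."""
    return len(frame) * len(frame[0])


# Hypothetical 640x480 frame filled with a simple gradient.
frame = [[(x + y) % 256 for x in range(640)] for y in range(480)]
small = downscale(frame, 4)
reduction = pixel_count(frame) / pixel_count(small)
```

Shrinking each side by 4 leaves a 160x120 frame, so anything that touches every pixel has 16x less work to do. That is the speed side of the trade; the detail lost in the averaging is the accuracy side.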
Yep, I'm sure he isn't. I don't doubt that you could optimize this algorithm to run on a phone, but that takes an insane amount of effort and expertise and is a feat in and of itself. The Word Lens guys, for example, spent about a year porting from an optimized C implementation on i386 to ARM for the iPhone. They even initially used the GPU but decided that the overhead of shuffling data between buffers wasn't worth the advantage gained from the iPhone's measly GPU (which only had 2 fragment shaders at the time, I think).
Also, I completely agree that camera blur would worsen the accuracy of said algorithm. I was trying to point out that it would run faster on a lower quality camera (with the caveat that it might not work nearly as well).
Specialized processing hardware != general-use CPU. Face tracking and image stabilization in dirt-cheap cameras are a good example, as are hardware video decoders and graphics cards. If a market emerges, specialized hardware will be built, and it'll be embeddable in just about anything.
Face tracking is a remarkably well-solved problem these days.
I have only a vague understanding of the math behind how it works, yet I'm very successfully using it in an art project I'm playing with. An afternoon's Googling found me the OpenCV plugins for Processing and some face detection examples, and I've got a prototype that really disturbs my girlfriend - I call it "Death Ray" for extra creepiness factor. I've got an infrared-capable camera mounted on a pair of servos to steer it, and another pair of servos aiming a low-power laser. An Arduino drives the servos and switches the laser, with Processing just "magically" calling OpenCV for face detection in the video stream - _all_ the "heavy lifting" has been done for me - viva le open source!
The thing that _really_ creeps the girl out is when I sit it all on top of the TV, and have it find the faces watching the TV and paint "predator style aiming dots" onto people's foreheads...
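For anyone wondering what is left to write once the detector hands back a face rectangle, the glue is mostly a coordinate mapping. Below is a rough sketch, in Python rather than Processing, of turning a face bounding box into pan/tilt angles for the camera servos. Everything in it is an assumption for illustration: the function name, the 60x45 degree field of view, the 90 degree "straight ahead" position, and the sign of each axis, which depends on how the servos are mounted. It is not the actual code from the project.

```python
def clamp(angle, low=0.0, high=180.0):
    """Keep an angle inside the range of a typical hobby servo."""
    return max(low, min(high, angle))


def face_to_servo_angles(face_box, frame_size, fov_deg=(60.0, 45.0), center_deg=90.0):
    """Map a face bounding box to (pan, tilt) servo angles in degrees.

    face_box   -- (x, y, w, h) in pixels, as a typical detector reports it
    frame_size -- (width, height) of the video frame in pixels
    fov_deg    -- assumed horizontal/vertical camera field of view
    center_deg -- assumed servo angle that points straight ahead
    """
    x, y, w, h = face_box
    frame_w, frame_h = frame_size
    # Offset of the face centre from the frame centre, in the range -0.5..0.5.
    off_x = (x + w / 2) / frame_w - 0.5
    off_y = (y + h / 2) / frame_h - 0.5
    # Linear approximation: fraction of the frame ~ fraction of the field of view.
    pan = center_deg + off_x * fov_deg[0]
    tilt = center_deg + off_y * fov_deg[1]
    return clamp(pan), clamp(tilt)


# A face centred in a 640x480 frame leaves both servos pointing straight ahead.
centered = face_to_servo_angles((270, 190, 100, 100), (640, 480))
```

The linear mapping ignores lens distortion, so it gets less accurate toward the edges of the frame, and the second pair of servos would need its own calibration because it does not sit at the same spot as the camera.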
That 'first step' is so dangerous it's mind-blowing. One thing that is seriously holding academic CV back is datasets made for slow computers. Eyes take advantage of every possible input, and the idea that you should start your CV task by throwing away data to make it 'easier' is so dumb it's laughable. While I admit industry demands speed, if you have the luxury of doing pure research today and you're using black and white images, you're not even wrong.
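A small illustration of what "throwing away data" means for grayscale in particular: a sketch, assuming the common Rec. 601 luma weights (one of several conversions in use), showing two clearly different colors landing on the same gray level. The `to_gray` helper and the two sample colors are picked for illustration only.

```python
def to_gray(rgb):
    """Convert an (R, G, B) pixel to a single 8-bit luma value.

    Uses the Rec. 601 weights, assumed here as a typical conversion.
    """
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


red = (255, 0, 0)    # pure red
green = (0, 130, 0)  # a medium-dark green

gray_red = to_gray(red)
gray_green = to_gray(green)
```

Both pixels come out as gray level 76, so any algorithm working on the converted image can no longer tell a red object from a green one of matching brightness.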