I wonder why it's a hard problem? I'm sure it is, but it would be interesting to know the key issues if anyone here has experience with it.
It seems like w/ an image sensor you can track eye vectors and head location relative to screen w/ a degree of certainty. And—much like tracking a rocket position—you could use a Kalman/Particle Filter to get a screen position pretty close to wherever I'm looking on the screen. I'd guess within 3 characters.
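To make the filtering idea concrete, here's roughly what I have in mind: a constant-velocity Kalman filter over the noisy gaze samples (plain NumPy; every noise number below is a placeholder guess, not a tuned value):

```python
# Rough sketch: constant-velocity Kalman filter smoothing noisy gaze samples.
# All noise parameters here are made-up placeholders, not tuned values.
import numpy as np

dt = 1 / 30.0                      # assume ~30 Hz gaze samples
F = np.array([[1, 0, dt, 0],       # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
H = np.array([[1, 0, 0, 0],        # we only measure screen position (x, y)
              [0, 1, 0, 0]])
Q = np.eye(4) * 5.0                # process noise (how jumpy the eye is)
R = np.eye(2) * 50.0               # measurement noise (camera/pupil-fit error, px^2)

x = np.zeros(4)                    # state estimate
P = np.eye(4) * 1000.0             # state covariance

def update(measurement):
    """Feed one noisy (x, y) gaze sample, get a smoothed screen position back."""
    global x, P
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the new measurement
    y_res = np.asarray(measurement) - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ y_res
    P = (np.eye(4) - K @ H) @ P_pred
    return x[:2]                   # smoothed (x, y) in screen pixels
```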
Feels like the kind of thing Apple should invest in and revolutionize...
First you have the quality of the optics. Most computers have very small cameras that are low resolution and prone to noise in anything short of ideal lighting. That alone makes the eyes hard to capture clearly.
Then you need to figure out eye direction. Eyes flit around a lot (saccades), but you could perhaps smooth that out. Pupils are also hard to see through glasses: you'd better hope people wear large glasses with skinny frames and don't suffer from very poor eyesight or astigmatism, both of which lead to strong refraction.
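To be fair, the smoothing part is tractable. The usual trick is velocity-threshold fixation filtering: classify samples as fixation vs. saccade by speed and only report the running fixation centroid. A rough sketch, with made-up thresholds:

```python
# Sketch of velocity-threshold (I-VT style) fixation filtering.
# Threshold and timing values are illustrative guesses, not calibrated.
SACCADE_VELOCITY_PX_PER_S = 1000.0   # above this, treat the movement as a saccade

def fixation_points(samples, dt=1 / 30.0):
    """samples: iterable of (x, y) gaze points at a fixed rate.
    Yields the running centroid of the current fixation, resetting on saccades."""
    fix = []
    prev = None
    for x, y in samples:
        if prev is not None:
            vx, vy = (x - prev[0]) / dt, (y - prev[1]) / dt
            speed = (vx * vx + vy * vy) ** 0.5
            if speed > SACCADE_VELOCITY_PX_PER_S:
                fix = []             # saccade: start a new fixation group
        fix.append((x, y))
        prev = (x, y)
        cx = sum(p[0] for p in fix) / len(fix)
        cy = sum(p[1] for p in fix) / len(fix)
        yield (cx, cy)               # stable point while the eye is fixating
```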
There are actually good products for this, like Tobii, where you can wear prescription lenses and have IR tracking for your eyes.
But even if you get over the technical issues, there's the UX issue. How do you account for something getting your user's attention without changing the input focus there? Let's say they're listening to music and a track changes, showing a notification.
And even if you figure all of that out, there’s the privacy angle. People don’t like being monitored constantly.
I've been trying to flit my eyes around the screen to get a sense of how fast you'd need to track. Definitely seems pretty fast! At least 30Hz sampling, I'd bet. I could see how doing all the image processing that quickly might be tricky w/out custom hardware (even w/out the glasses problem you mentioned).
For the UX, I was imagining you'd have to press a button to tell the computer, "Hey, I'd like you to move my cursor via eye-tracking." That way the cursor only moves when you want it to (same as today w/ a mouse) and isn't constantly jumping around as you look around. Press down to have it move the cursor to the eye-tracked position and stop when you release (rough sketch below).
Could possibly split the space bar to make room for that button. Like have 3 mouse buttons where the right side of the space bar is: (a) track my eye movement while I press down and stop when I release, (b) left-click, (c) right-click. Then you don't have to leave the home row on your keyboard.
Or add one of those IBM ThinkPad mouse nubs somewhere on the keyboard and use that instead of a mouse itself.
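Roughly the loop I'm picturing, wherever the button ends up living (get_gaze_estimate, move_cursor and key_is_down are stand-ins for whatever tracker and OS calls you'd actually have):

```python
# Sketch of the press-to-track interaction: the cursor only follows the gaze
# while the dedicated key is held. get_gaze_estimate(), move_cursor() and
# key_is_down() are hypothetical stand-ins for real tracker/OS calls.
import time

TRACK_KEY = "right-space"            # the imagined extra button on the space bar

def run_loop(get_gaze_estimate, move_cursor, key_is_down, hz=60):
    period = 1.0 / hz
    while True:
        if key_is_down(TRACK_KEY):
            x, y = get_gaze_estimate()   # smoothed screen position from the tracker
            move_cursor(x, y)            # cursor warps toward the gaze point
        # When the key is up, the cursor stays put, same as a mouse at rest.
        time.sleep(period)
```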
So here’s the other rub with using the keyboard to enter the command. Most people, even experienced touch typists, will flit their eyes towards the thing they’re trying to interact with.
Therein lies a big part of the problem with eye-based interaction. Our brains move our eyes for a lot of different tasks: saccades give us a constant read of the scene (away from the fovea, the eye has very poor resolving power, so it needs to move a lot), and eye movements also signal what you’re thinking (there are a lot of studies in neurolinguistics about eye direction indicating how you’re thinking, but at a base level, you tend to look up or away when you’re pondering).
Anyway, not to say it can’t be done. But it’s a fascinating domain at the intersection of UX, technology and neuroscience.
For what it’s worth, there are VR headsets with dedicated eye trackers built in (PSVR2, certain Vive Pro models, Varjo, etc.) and there have been cameras from Canon (in the film days, even!) that used eye tracking to select autofocus points.
It’ll be interesting to see how things shape up. Meta have their big keynote on Tuesday where the Quest Pro / Cambria is meant to have eye tracking sensors.
Maybe you could place 3-4 cheapish 30-60 FPS CMOS cams around the edge of the screen and stagger their frame captures? You'd get different angles (better for estimating eye vectors) and increase the effective sampling rate.
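Back-of-the-envelope for the staggering, assuming you could actually trigger the cameras with a fixed phase offset (hardware sync is its own problem):

```python
# Sketch: interleaving N staggered cameras into one higher-rate sample stream.
# Assumes the cameras can be triggered with a fixed phase offset, which is
# a real hardware-sync problem in itself.
def capture_schedule(num_cams=4, fps=30, duration_s=0.1):
    frame_period = 1.0 / fps
    offset = frame_period / num_cams     # 4 cams at 30 fps -> shots ~8.3 ms apart
    events = []
    t = 0.0
    while t < duration_s:
        for cam in range(num_cams):
            events.append((t + cam * offset, cam))
        t += frame_period
    return sorted(events)                # effective rate: num_cams * fps (120 Hz here)

for timestamp, cam in capture_schedule():
    print(f"t={timestamp * 1000:6.2f} ms  camera {cam}")
```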
(Can't reply to your other comment for some reason)
Why is the IR part of the spectrum better for the cameras? Is it that if I take an image of my eye in IR, it's just easier to see the parts of the eye that determine where it's looking?
I've worked a little with commercial eye tracking software. It's typically paired with an infrared-sensitive camera and IR lighting because the pupil stands out more against the iris under IR light. It also benefits from a rectilinear camera lens and/or software-calibrated lens distortion correction.
Environmental challenges like lighting conditions, glare, and whether the user has glasses or hair obscuring their eyes need to be controlled for. If the user is looking downwards towards the screen, their eye appears more closed, making it hard to accurately find the iris location.
You also need an accurate measure of the position of the head/eyes relative to the camera. So dedicated hardware like a depth camera might be needed for high precision tracking. Depth cameras come with their own set of issues.
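One common way to get that head pose with only an RGB camera is to fit a generic 3D face model to detected 2D landmarks with solvePnP. Very rough sketch (the model points and intrinsics below are generic guesses, nowhere near a calibrated setup):

```python
# Sketch: head pose (rotation + translation relative to the camera) from 2D facial
# landmarks via cv2.solvePnP. Model points and camera intrinsics are rough generic
# values, not a calibrated rig.
import cv2
import numpy as np

# Approximate 3D positions (mm) of a few facial landmarks on a generic head model.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),        # nose tip
    (0.0, -63.6, -12.5),    # chin
    (-43.3, 32.7, -26.0),   # left eye outer corner
    (43.3, 32.7, -26.0),    # right eye outer corner
    (-28.9, -28.9, -24.1),  # left mouth corner
    (28.9, -28.9, -24.1),   # right mouth corner
], dtype=np.float64)

def head_pose(image_points, frame_w, frame_h):
    """image_points: 6x2 array of matching 2D landmarks from a face detector."""
    focal = frame_w                      # crude guess at focal length in pixels
    camera_matrix = np.array([[focal, 0, frame_w / 2],
                              [0, focal, frame_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))       # pretend there's no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    return rvec, tvec                    # head rotation and translation vs. the camera
```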
The resolution of even a high-res camera versus the change in pupil location for small eye movements means that by the time you crop out the eyes, you might only be working with a very small image (under 100 pixels across). Even with subpixel hinting, there's not a lot of detail left. Small errors here and in the head-tracking location can cause large errors in the screen-position estimate.
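To give a flavour of the pupil side of the pipeline, a bare-bones "dark pupil" baseline looks something like this (OpenCV 4, assuming you already have an IR crop of one eye; the threshold is a made-up placeholder, and real trackers also do glint detection, ellipse fitting and per-user calibration):

```python
# Simplified "dark pupil" centroid sketch for an IR eye crop (OpenCV 4.x,
# where findContours returns (contours, hierarchy)). Real trackers do much more.
import cv2
import numpy as np

def pupil_center(eye_gray):
    """eye_gray: small grayscale IR crop of one eye. Returns (x, y) or None."""
    blurred = cv2.GaussianBlur(eye_gray, (5, 5), 0)
    # Under IR illumination the pupil is the darkest region; threshold it out.
    _, mask = cv2.threshold(blurred, 40, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    pupil = max(contours, key=cv2.contourArea)   # assume the biggest dark blob is the pupil
    m = cv2.moments(pupil)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])   # sub-pixel centroid
```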
Do our eyes ever lock onto (or centre on) a pixel, or even a small group of pixels for any useful amount of time?
I would think our brain is doing some level of motion tracking on a point while our eyes build up a mental image of the local environment.
Pressing virtual UI buttons with eye tracking makes sense, but if my eyes flick to the wrong character in a text file, it's going to wreck my concentration and end up being more effort than typing a vim chord.
In this comment thread, for example, you could have your eyes target the [+] or [-] icons and then click to expand/collapse them. If you had the "eye-track button", you'd move your eye to [+], tap the button, and left-click.
The main use case for me is jumping the cursor close to some position while text editing. I'm a Vimmer and sometimes I just want to look somewhere and instantly move my cursor to where I'm looking. EasyMotion-style jumping is nice (I use it in fzf menus, for instance), but there's still that slight mental overhead of figuring out what I've got to type to jump. Or if I use a mouse, I take my hands off the keyboard, move the mouse, then put my hands back on the keyboard. Which sounds pretty simple but is so slow compared to just jumping around in vim w/ hands always on the keyboard.
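If the tracking were accurate enough, the editor side is almost trivial; something like this (the cell size and the Vim command are placeholder assumptions for a monospace grid):

```python
# Sketch: snap a gaze estimate to the nearest character cell in a terminal editor.
# CELL_W/CELL_H and the jump command are placeholder assumptions for a monospace grid.
CELL_W, CELL_H = 9, 18          # pixel size of one character cell (font-dependent)

def gaze_to_cell(gx, gy, origin_x=0, origin_y=0):
    """Convert a smoothed gaze point (pixels) into a (row, col) in the editor."""
    col = int((gx - origin_x) // CELL_W)
    row = int((gy - origin_y) // CELL_H)
    return row, col

def jump_command(row, col):
    # e.g. send this to Vim via --remote-send or a plugin's RPC channel
    return f":call cursor({row + 1}, {col + 1})\n"   # Vim is 1-indexed
```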