On-Device, Real-Time Hand Tracking with MediaPipe (googleblog.com)
306 points by neversaydie 9 months ago | 67 comments

Remember that HN post the other day about someone going on a massive mission into voice recognition because they can't use a mouse due to pain?

Stuff like this makes me hopeful even if it seems like a gimmick when viewed in isolation.

A combination voice recognition + gesture recognition + eye tracking would make mobile computers better productivity tools.

> A combination voice recognition + gesture recognition + eye tracking would make

... retail surveillance much more effective and privacy eroding...

If they do it all on device and don't transmit it out, then the privacy concern seems squelched. On retail tracking, yeah, it's messed up; practically no one knows their movements are tracked via their phones: cell tower triangulation, GPS, and then the kicker, NFC. I love / hate tech. It's powerful stuff. [edit] The cell tower triangulation is forever, given the nature of the tech.

GPS is passive and NFC is very short range. I think it'd be easier to track people by bluetooth or wifi MAC.

> all on device and do not transmit out

Like Apple? Rhetorical question. My point is: "this stays on device" is just propaganda. It can be done but will never happen.

Cell tower tracking is trilateration rather than triangulation, right?

I believe the scale is too small for height to make an impact.

Remember that HN post the other day

No. Link? Or can you remember any other details?

Close to 400 comments. It was a good discussion.



If I'm thinking of the same thread, the mentioned user created a speech-to-text accessibility platform for disabled users with a partner developer (a two-person operation).

The underlying project MediaPipe looks pretty cool: https://github.com/google/mediapipe/blob/master/mediapipe/do...

I wonder why they didn't build it on top of one of the gazillion flow-based visual programming languages instead?

Probably because it uses TensorFlow (lite) and there aren't many (any?) VPLs supporting C++ integration.

And also Google has NIH syndrome, it's hard to think of any outside projects they use besides the Linux software stack and LLVM.

> Probably because it uses TensorFlow (lite) and there aren't many (any?) VPLs supporting C++ integration.

Which VPLs don't support C++? All the major ones have externals / plug-ins / modules that can be written in C++: Max/MSP, PureData, LabVIEW, Vuo, Blueprints, vvvv...

Google depends on thousands of external software projects.

They heavily support a VPL, Scratch.

AFAICT the relationship for Scratch is that Google wrote a new VPL (Blockly) and the MIT Media Lab released it as Scratch 3.0.

I guess the "thousands" is from https://opensource.google.com/? In a few minutes of browsing, I couldn't find anything besides Bullet that wasn't just Google releasing something as open source.

Python? Git?

Git is mainly for external consumption. They've got their own VCS called Piper that comes with a client called CitC (sit-see).

This is neat. I'm curious about a limitation though.

Was this trained on people with missing or partially missing digits? Like, if someone is missing the top part of their third and fourth fingers, does it always predict Spiderman or Rock for an open hand?

I don't think this is a necessary thing for an openly released piece of software that's not aimed at edge cases like these. I'm just curious about limitations and how it deals with edge cases. I also don't currently have a friend with a missing finger to test with.

(Also you could probably fine tune this model to pick up those cases. Would be curious how good results would be because I imagine it'd be difficult)

I've started to get really interested in lower-budget and/or easier motion tracking for special effects. I've been looking at optical motion tracking with a multi camera setup, and optical facial tracking. With the right math and assumptions, you could capture a full performance with little to no specialized equipment. I've been wondering if ML could output enough detail to make it feasible.
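The simplest piece of "the right math" in a multi-camera setup is depth from disparity with a rectified stereo pair. A toy sketch, with every number (focal length, baseline, pixel coordinates) invented purely for illustration:

```python
# Toy depth-from-disparity for a rectified stereo pair: with two parallel
# cameras, a matched keypoint's depth is z = f * baseline / disparity.
# All values here are made-up examples, not from any real rig.

def triangulate(x_left, x_right, focal_px, baseline_m):
    """Recover the depth (metres) of a keypoint matched across both views."""
    disparity = x_left - x_right  # horizontal pixel shift between the images
    if disparity <= 0:
        raise ValueError("matched point must shift left in the right image")
    return focal_px * baseline_m / disparity

# Keypoint at x=640 in the left image and x=600 in the right,
# with an 800 px focal length and a 10 cm baseline:
print(round(triangulate(640, 600, 800, 0.10), 3))  # 2.0 metres
```

Real rigs generalise this to least-squares triangulation across several calibrated views, but the basic trade-off (baseline and resolution against depth precision) is already visible here.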

With the right models, and gluing it all together yeah.

For some commercials we've dropped Xsens suits for OpenPose. Facial capture from afar needs too much in the way of exaggerated movement, but for mouth capture, audio processing gave more pleasing results. 3D models still aren't there, but for cameras that stay basically in one plane it's good.

We used OpenPose initially to correct capture-suit drift live, but with some math (I'm a game & computer vision dev) translating to 3D was pretty good. As always, you just have to fix outliers.

I’m always keeping an eye on tracking results to see if they’re doing ML (or another technique) to extrapolate the muscle movements so that those movements can be translated directly into a character. I'm especially watching for this in facial tracking, since those “skin-stretch” animations are so jarring.

> We are excited to see what you can build with it!

The killer app is typing. Qwerty would be nice for a transition, but someone please invent a gesture "keyboard" better suited to a free-floating hand. Because of the lack of feedback, I imagine it couldn't be as good as an actual keyboard. But it could be brilliant as an away-from-keyboard keyboard.

There's still ergonomics. Say we teach computers to recognize casual ASL (which is a big job for people learning ASL, but whatever)... You're not going to be able to spend as much time using that input method as you could using a keyboard, because of simple fatigue.

I wonder if we're too pessimistic about feedback for that application: why not map physical chord-like movements to different actions (delete characters, words, sentences), or assign distinct handsigns to the most-used phonemes and the less-used ones to chords of several finger joints (index / thumb, etc.)? It would be weird at first but, as learning vim taught me, building a coherent grammar can definitely make a big difference.
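To make that concrete, here's a toy sketch of what such a chord table might look like. The finger names and the mappings are entirely invented for illustration; a real system would derive the contact sets from tracked hand landmarks:

```python
# Hypothetical chord "keyboard": map sets of touching fingers to characters
# or editing actions. The table below is made up purely for illustration.

CHORDS = {
    frozenset(): " ",                    # open hand -> space
    frozenset({"thumb", "index"}): "e",  # commonest letter on the easiest pinch
    frozenset({"thumb", "middle"}): "t",
    frozenset({"thumb", "ring"}): "a",
    frozenset({"thumb", "index", "middle"}): "<del-word>",  # chord -> action
}

def decode(chord_sequence):
    """Translate a sequence of finger-contact sets into text and actions."""
    return [CHORDS.get(frozenset(chord), "?") for chord in chord_sequence]

print(decode([{"thumb", "index"}, {"thumb", "middle"}, set()]))
# -> ['e', 't', ' ']
```

The "coherent grammar" point would show up in how the table is laid out: frequent symbols on low-effort chords, and editing verbs on chords that are hard to hit by accident.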

As an engineer, I can see what a truly amazing feat this is. As a human being, I'm staggered that it took so much effort in AI and machine learning to do so little.

> so little

Nature took 85 million years to perfect the hand, and dexterous use takes 1-2 years of training for babies. Interpreting the hands of others takes longer.

Try drawing hands. (If you're a non-artist like myself.) It's impressively hard --- they are complex artefacts we can't quite see clearly because we are so used to them. There are a lot of, as it were, polygons.

If, after staring at these things for several decades, I still can't draw them with my eyes closed, I will assume that an AI would not find it easy to think about them either.

I think it’s the opposite problem. I can see an AI very easily drawing what it sees but struggling to interpret it. When you learn to draw, one of the first steps is relearning to see without interpretation. A better example is how easily kids draw stick figures and how hard that is for computers: for decades we had to put reflective markers on people, interpreted as dots, just to render stick figures.

What a great point, thanks!

That's a very good point. I would correct one thing: a baby can learn sign language (i.e. interpret others' hand gestures) by 4 to 6 months old.

Can't wait for the Pixel 4. No doubt this will be put to use.

Next Pixel will have Soli built in which should provide a much more detailed signal than a camera.

I'd like to see this kind of technology employed in VR headsets to allow for more natural interaction using my hands and fingers instead of controllers.

You can buy a LeapMotion and stick it to the front of Rift right now. I've done it, it's amazingly immersive and feels fantastic.

I really wonder how this compares to the LeapMotion tracking. My suspicion is that the leap tracking is now hardened by years of real world experience, so it's probably ahead of anything that's still in the R&D stage. But hard to know without testing it.

The article doesn't actually say so anywhere, but it implies it's doing all this with a regular camera. LeapMotion and others use more complex sensors. The ML approach is really impressive, but getting cleaner input would seem to be a more reliable approach.

The LeapMotion actually uses ultrasonics for measurements and then guesses to translate that to hand tracking. Doing it fully from a camera may actually improve on it if done right.

That’s the Magic Leap. The original Leap Motion uses stereoscopic infrared cameras, AFAIK.

Nope, you’re totally right. Don’t know where I got the idea that the little Leap Motion was ultrasonic. Apologies to the readers.

Pose detection is a mixed bag because it encourages always-watching devices (like how Alexa is the killer app for always-listening mics).

That said, incredibly useful for interacting w/ technology in physical space. Could imagine this doing really well for handheld drone landings or hybrid human / robot factories.

Could be activated by Bluetooth based presence detection, leading to face recognition to verify it is the right user, and then activate gesture detection.
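As a minimal sketch of that staged activation, with all the predicate names being hypothetical placeholders:

```python
# Staged activation: cheap Bluetooth presence gates face verification,
# which gates the camera-hungry gesture stage. Names are placeholders.

def should_track_gestures(bt_device_nearby, face_matches_owner):
    """Run each stage only if the cheaper one before it passed."""
    if not bt_device_nearby:        # stage 1: low-power presence detection
        return False
    if not face_matches_owner():    # stage 2: verify it's the right user
        return False
    return True                     # stage 3: gesture detection may run

print(should_track_gestures(True, lambda: True))   # True
print(should_track_gestures(False, lambda: True))  # False
```

The point of the ordering is that the expensive, privacy-sensitive stages only ever run after a cheap local signal says someone who opted in is actually present.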

Thinking of all the baseball applications here: catcher signals, third base coach, head coach etc.

Why would it be useful to have technology detect signals? Aren't people already going to be doing it?

Genuine question, I don't know how baseball works.

Hand signals are commonly used in baseball to surreptitiously communicate intent to teammates without giving away your strategy to opponents. I think what GP was getting at is that this technology could be used to automate the reading of hand signals. I'm not sure it would be effective, as pro baseball players are already quite sophisticated at both reading and obfuscating hand signals, at least at the highest levels of the game.

I'm not sure if the complexity of the codes is significantly higher in the pro teams, but I found this video on baseball code decoding pretty fun.


Combined with this, it certainly seems like there's potential for a fully automated pipeline

This is amazing! Google could now build a chording keyboard, without the extra device needed.

First skim of the title: "What are they tracking now? Wow, very clear language on the tracking part of the business for a googleblog.com link. Oh... hand tracking."

But this is very neat.

Wasn't this posted here about 2 days ago?

11 days ago, but with no comments: https://news.ycombinator.com/item?id=20739577

Also 11 days ago, with comments: https://news.ycombinator.com/item?id=20743575

Might have been, sorry - came across this via a fresh story on Gamasutra this evening, just realised the original blog post is from last week.

FWIW, thank you for (re)posting. This is very intriguing and I did not know about it.

realtime hard tracking tech. Typo.

Given the company behind it, I can't get out of my head the thought that this will find a lot of other applications...

Please give a thumbs-up to acknowledge your engagement with this ad

Sorry, the middle finger is not acceptable. Please give a thumbs-up.

Yeah, I hate to be cynical, but I'm not buying the stated use cases as the main motivator here for Google. It's cool they are releasing it. Interested to see what other folks come up with.

I wonder if this could be used to identify people based on hand movements alone? Like some sort of movement "fingerprint" or something.

I've got to imagine we all have somewhat different patterns of moving our hands. Is it possible an AI could be trained to study existing footage of a person and identify them this way? Maybe akin to facial recognition, but with hand movements instead?

Or maybe they are all too similar to be able to tell one person from another. Hell if I know, but interesting to think about.
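For illustration only, a toy nearest-neighbour version of the idea, with made-up feature vectors (say, average speed, jerk, and pause frequency extracted from tracked hand motion):

```python
# Toy "movement fingerprint": summarise each person's hand motion as a
# feature vector and identify a sample by its nearest known profile.
# The feature definitions and numbers below are entirely invented.

def identify(sample, profiles):
    """Return the known person whose movement features are closest to sample."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(profiles, key=lambda name: dist(sample, profiles[name]))

profiles = {
    "alice": [0.8, 0.2, 0.5],  # invented [speed, jerk, pause-frequency]
    "bob":   [0.3, 0.9, 0.1],
}
print(identify([0.75, 0.25, 0.45], profiles))  # alice
```

Whether real people's movement features are separable enough for this to work at scale is exactly the open question; a toy like this only shows the shape of the approach.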

Hand identification has been used to convict at least one person:


My thoughts exactly. Ten or 15 years ago I would think "cool, they are going to make something awesome with it".

Three years ago: "they'll shut it down when they get another exciting idea".

Today: "userbinator beat me to it, only I am worried they will come up with something worse".

Sorry to all Googlers here. I'm not trying to pick on you individually, but the company you work for has worked long and hard to erode mountains of trust, and it is starting to pay off :-/

It’s an open source library, so you can’t “shut it down”; it’s a gift to the community, and people can continue to use it if Google doesn’t.

Honestly, these kinds of comments are getting silly and applied out of context. It's almost like there needs to be a Godwin's law for dredging up Reader's cancellation or conspiracy theories about how finger tracking is going to be used for ad recommendation.

Edit: upvoted for the point that it is open source.

I never cared about Reader.

For me it was 5 or 10 or even more different other things like

- Killing XMPP through embrace, extend, extinguish

- Creating Google+ without making it open

- Killing Google+

- ReCaptcha v2 and v3

- Constantly pushing the boundaries for tracking, now also buying credit card logs

- etc etc

Google did not kill XMPP; the number of people who actually used the Google Chat client was tiny. What killed XMPP was ICQ, MSN, Yahoo, Facebook, and text messaging.

Your comments on federation are mistaken, IIRC. Google Chat supported federation from 2005 to 2014; it was dropped in a Google I/O announcement because none of the other players mentioned above, who were gobbling up the messaging market, reciprocated. They instead took advantage of Google's support of federation to onboard users onto their proprietary networks. Google Chat's user base was declining, and the network effect of the proprietary social networks was pulling everyone in.

A similar thing happened with OpenID. Google initially supported it, and social networks used it for new-user account creation, but did not allow it for sign-in.

Thanks, facts are always appreciated; you might very well be right about those two. I still think Google has earned its current reputation.

(And don't for a second think I think less of Google than of the rest; it's probably just that I used to trust Google somewhat more than the others.)

I work for Google, and I definitely think the behavior of Apple, Microsoft, Yahoo, Facebook, and others hurt the idea that "open wins". Google tried hard with XMPP, OpenID, OAuth, ActivityStreams, WebFinger, OpenSearch, FOAF, OpenSocial, et al. to build distributed, federated solutions. Even Google Wave was designed with federation support. At some point, even they had to admit they were losing.

The problem is, tech geeks love it (open), but consumers don't care, and after the smartphone revolution, it's far easier to build siloed apps, and get people into things like iMessage, WeChat, or WhatsApp, than it is to get people to adopt a federated protocol.

If email had been invented in the modern era, with consumers controlling what wins, we would not have SMTP. We'd have proprietary, platform-specific mailboxes, and people would have to create an account on each platform to send someone mail.

The internet had a brilliant run in the 80s, and early 90s, before the great masses arrived, back when protocols were designed by people interested in technical capabilities, not money, when these things were hashed out at IETF meetups and mailing lists.

Consumer behavior and investment decisions today inherently force centralization I think, and it's hard to build a truly open system these days.
