Show HN: YoHa – A practical hand tracking engine (handtracking.io)
292 points by b-3-n 8 days ago | hide | past | favorite | 64 comments

The demo really sells it here [1]. It's amazingly intuitive and easy to use, it should be a part of video-conferencing software.

[1] https://handtracking.io/draw_demo/

Thank you for the feedback. Such an integration would be nice indeed.

You could integrate it with all video-conferencing software by building it as a virtual camera plug-in that works with OBS.

It's like an initial beta of the software - it's not production ready. I can't imagine this adding value to a meeting _yet_. Seems promising though.

This is a GREAT website, I can understand what it does with zero clicks, zero scrolls.

Really great, congratulations, I hope that I can find a way to apply this lesson to my SaaS.

Agreed, but also nature of the beast. It's really easy to explain hand tracking software in a single media element. It's a lot harder to explain some crypto AEAD encapsulation format the same way.

I assume YoHa means Your Hands... I don't think I could have resisted OhHi for hand tracking.

I've been working on a couple of chording keyboard designs and was thinking I might be able to create a virtual keyboard using this library. It would be nice to also be able to recognize the hand from the back. A virtual keyboard would also obviously require tracking two hands at a time.

How does the application deal with different skin-tones?

That's an interesting idea. I have not tried to build something similar, but a humble word of caution that I want to put out: no matter what kind of ML you use, the mechanical version of the instrument will always be more precise (you are likely aware of this, I just want to make sure). However, you might be able to approximate the precision of the mechanical version.

Two hand support would be nice and I would love to add it in the future.

The engine should work well with different skin tones as the training data was collected from a set of many and diverse individuals. The training data will also grow further over time making it more and more robust.

Similar thought here. I'd like to track two hands above a real keyboard (size/position established via 2 X QR stickers) to feed into a VR piano training simulation.

The tech is all there, really it's just having the time and effort to get all the pieces together!

Was wondering how easy it'd be to port to native mobile, so went looking for the source code, but doesn't appear to actually be open source. The meat is distributed as binary (WASM for "backend" code and a .bin for model weights).

Aside from being a cool hand tracker, it's a very clever way to distribute closed source JavaScript packages.

Thank you for the feedback. You are right that the project is not open source right now. It's "only" MIT licensed. That's why I also don't advertise it as open source (if you see the word open source somewhere it would be a mistake on my end, feel free to tell me if you see it somewhere). I wanted to start out from just an API contract so that it is easier to manage and get started. In general I have no problem open sourcing the JS part. But first there is some refactoring to do so it is easier to maintain upon open sourcing. Stay tuned!

As a side note: The wasm files are actually from the inference engine (tfjs).

Please let me know if you have any more questions in that regard.

This architecture was also used in the link referenced when bringing up alternative implementations:


An "undo" gesture seems necessary, it was a bit too easy to accidentally wipe the screen. Aside from that, this is fantastic! Love to see what WASM is enabling these days on the web.

Thank you for the feedback. Indeed such a functionality would be nice. One could solve this via another hand pose or in some way also with the existing hand poses. E.g. make a fist for say 2 seconds to clear the whole screen. Anything shorter will just issue an "undo".
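That hold-to-clear idea can be sketched as a tiny state machine. Note that the 2-second threshold and the per-frame `isFist` input are assumptions for illustration, not part of YoHa's actual API:

```typescript
// Tracks a fist pose over time and, when the fist is released, emits
// "undo" for a short hold or "clear" once the hold exceeds a threshold.
class FistGestureTracker {
  private fistStart: number | null = null;

  constructor(private clearThresholdMs = 2000) {}

  // Feed one detection result per frame; returns an action on release.
  update(isFist: boolean, nowMs: number): 'undo' | 'clear' | null {
    if (isFist) {
      if (this.fistStart === null) this.fistStart = nowMs;
      return null;
    }
    if (this.fistStart === null) return null;
    const heldMs = nowMs - this.fistStart;
    this.fistStart = null;
    return heldMs >= this.clearThresholdMs ? 'clear' : 'undo';
  }
}
```

The nice property of this design is that a single pose serves both actions, so no extra gesture has to be learned.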

YoHa uses TensorFlow.js (tfjs), which provides several backends for computation. One indeed uses WASM; the other one is WebGL based. The latter is usually the more powerful one.
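The preference order implied here (WebGL over WASM over plain CPU) can be expressed as a small helper. The helper itself is an illustrative sketch, not YoHa's code; with real tfjs one would pass the result to `tf.setBackend()` and then await `tf.ready()`:

```typescript
// Pick the most capable tfjs-style backend from whatever the platform supports.
// Preference order mirrors typical performance: WebGL > WASM > CPU.
const BACKEND_PREFERENCE = ['webgl', 'wasm', 'cpu'] as const;

function pickBackend(available: string[]): string {
  for (const b of BACKEND_PREFERENCE) {
    if (available.includes(b)) return b;
  }
  throw new Error('no supported backend available');
}

// In a real app: await tf.setBackend(pickBackend(availableBackends)); await tf.ready();
```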

Hi, I'm not sure if you've looked into this or not, but another area that is interested in this sort of thing and might be very excited is musical gesture recognition.

Hey, I believe there are multiple things you could have meant. Off the top of my head, one thing that might be interesting would be an application that allows conductors to conduct a virtual orchestra. But there are other possibilities in this space too, I'm sure! If you had something else in mind, feel free to share.

I have not explored this space much so far as my focus is rather to build the infrastructure that enables such applications rather than building the applications myself.

Also look at Leap Motion. https://www.ultraleap.com/product/leap-motion-controller/ (tip: Mouser has them in stock and usually the best price) with midipaw http://www.midipaw.com/ (free)

Latency is very low which is very important for this use case. Look on YouTube for demos.

What would be nice is a version that can be used to paint on the screen with your fingers, such that the lines are visible on a remotely shared screen. The use-case is marking up/highlighting on a normal desktop monitor (i.e. non-touch) while screen-sharing, which is awkward using a mouse or touchpad (think circling stuff in source code and documents, drawing arrows etc.). That would mean (a) a camera from behind (facing the screen), so that the fingers can touch (or almost touch) the screen (i.e. be co-located to the screen contents you want to markup), and (b) native integration, so that the painting is done on a transparent always-on-top OS window (so that it's picked up by the screen-sharing software); or just as a native pointing device, since such on-screen painting/diagramming software already exists.

Thank you for sharing this creative idea. "so that the fingers can touch (or almost touch) the screen" — I think this is a big advantage of this approach, since you can only achieve this with a back-facing camera. On the flip side, with a back-facing camera you either have to place the camera between yourself and the screen, which might be awkward, or you have to ensure a placement of the camera behind you that isn't prone to occlusions (e.g. your head or chair might occlude your hands from the camera's point of view). The latter might also make calibration more difficult or impact precision, since you might have to mount the camera with some elevation, causing a less optimal camera angle.

This looks great! Recently I've been wanting to make a hand-tracking library for video editing. I'd make a hand gesture like an OK with my index and thumb to begin recording, and when I was done I'd make a thumbs up to keep the take or thumbs down to delete a bad take. That way, I could very easily record stuff while only keeping the good takes, to sort out later.

Hell, the library could even stitch the takes together, omitting the times when my hand started/finished doing the gestures.
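The record/keep/discard flow described above is essentially a state machine over per-frame pose labels. A minimal sketch, assuming a classifier that labels each frame's pose (the pose names are placeholders, not YoHa output):

```typescript
// Collect takes from a stream of per-frame pose labels:
// 'ok' starts a take; 'thumbs_up' ends and keeps it; 'thumbs_down' ends and discards it.
type Pose = 'ok' | 'thumbs_up' | 'thumbs_down' | 'none';

interface Take { startFrame: number; endFrame: number; keep: boolean }

function collectTakes(poses: Pose[]): Take[] {
  const takes: Take[] = [];
  let start: number | null = null;
  for (let frame = 0; frame < poses.length; frame++) {
    const pose = poses[frame];
    if (start === null && pose === 'ok') {
      start = frame; // OK sign begins a new take
    } else if (start !== null && (pose === 'thumbs_up' || pose === 'thumbs_down')) {
      takes.push({ startFrame: start, endFrame: frame, keep: pose === 'thumbs_up' });
      start = null; // ready for the next take
    }
  }
  return takes;
}
```

Stitching would then amount to trimming each kept take's boundary frames, so the gesture itself never appears in the final cut.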

This reminds me of TAFFI [0], a pinching gesture recognition algorithm that is surprisingly easy to implement with classical computer vision techniques.

[0] https://www.microsoft.com/en-us/research/publication/robust-...

Bit of feedback: the home page is pretty sparse. The video is great, but it wasn't obvious how to find the repo or where to get the package (or even what language it can be used with). I had to open the Demo, wait for it to load, and then click the Github link there, and then the readme told me it was available on NPM.

Otherwise looks pretty impressive! I've been looking for something like this and I may give it a whirl

Thank you for the feedback. You are right, the home page should probably be enriched with more information and maybe I can make the information you were looking for stand out better. As a side note: There is a link to GitHub in the footer. The language ("TypeScript API") is also mentioned in the body of the page. But I see that these two can quickly go unnoticed.

The demo doesn't seem to work on my chromebook. Maybe it's too underpowered?

Web page doesn't say anything after `Warming up...` and the latest message in the browser console is:

    Setting up wasm backend.
I expected to see a message from my browser along the lines of "Do you want to let this site use your camera", but I saw no such message.

Thank you for the feedback. I would like to fix this but I neither own a Chromebook nor does it seem like I can use a platform like browserstack to reproduce the issue (didn't find Chromebook as available device there). If you would like to help debugging the issue you can open a GitHub issue here: https://github.com/handtracking-io/yoha/issues

I wish there was a nice open-source model for tracking hands and arms with multiple viewpoints (multiple cameras), similar to commercial software like this: https://www.ipisoft.com/


Just note that in the demo video, the user is 'writing' everything mirrored.

The video itself could be mirrored.

My god, you're right. Unless he's wearing a women's shirt.

BTW this would be great for spaced repetition foreign character learning (Chinese, Arabic, Japanese, Korean, etc.): if the drawn figure is similar enough to the character the student is learning mark it as studied.

Congrats again

Thank you for your feedback and for sharing this potential use case. I think it is a very creative idea.

My first question is whether this has the capability of being adapted to interpret/translate American Sign Language (ASL)?

Thank you for this inspiring question. For interpreting sign language you need multi-hand support which YoHa is currently lacking. Apart from that you likely also need to account for the temporal dimension which YoHa also does not do right now. If those things were implemented I'm confident that it would produce meaningful results.

It's worth noting that movements of the mouth are extremely important in ASL (and other sign languages) and so this probably isn't as useful as it might seem at first.

Thank you for pointing this out. I overlooked this. I presume that on top of that, what could also be relevant are the movements of the arms, facial expressions and maybe also general body posture. Please correct me if I'm wrong, as I'm not too familiar with sign language.

Signs also tend to be expressed by the hands' position/movement in relation to _other_ body parts.

Edit: OTOH fingerspelling (https://en.m.wikipedia.org/wiki/Fingerspelling) might be a more feasible usecase!

Very impressive.

I want something like this so I can bind hand gestures to commands.

For example scroll down on a page by a hand gesture.

One can build this pretty easily for a website that you are hosting with the existing API (https://github.com/handtracking-io/yoha/tree/master/docs).

However, you likely want this functionality on any website that you are visiting for which you probably need to build a browser extension. I haven't tried incorporating YoHa into a browser extension but if somebody were to try I'd be happy to help.
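A gesture-to-command binding like the one described above mostly comes down to mapping detections to actions. A minimal sketch, where the pose names and the detection shape are assumptions for illustration, not YoHa's actual API:

```typescript
// Map a per-frame pose detection to a page command.
// Pose names and the Detection shape are hypothetical placeholders.
interface Detection { pose: 'pinch' | 'fist' | 'open' | 'none' }

type Command = 'scroll_down' | 'scroll_up' | null;

function poseToCommand(d: Detection): Command {
  switch (d.pose) {
    case 'fist': return 'scroll_down';
    case 'open': return 'scroll_up';
    default: return null;
  }
}

// In a browser detection loop one would then do something like:
//   const cmd = poseToCommand(detection);
//   if (cmd === 'scroll_down') window.scrollBy({ top: 200, behavior: 'smooth' });
```

Keeping the mapping pure like this also makes it easy to rebind gestures to different commands later.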

That's nice but I'd also want it for general desktop stuff.

So I guess it would have to be sitting on my machine.

For example hand gestures to switch the desktop workspace.

Swipe left/right motion to switch desktop workspace. That would be the dream :)
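A swipe could be detected from a short history of the hand's horizontal position. A rough sketch, assuming normalized x-coordinates in [0, 1] and an illustrative travel threshold:

```typescript
// Detect a left/right swipe from recent normalized hand x-positions
// (0 = left edge of frame, 1 = right edge). The threshold is illustrative.
function detectSwipe(xs: number[], minTravel = 0.4): 'left' | 'right' | null {
  if (xs.length < 2) return null;
  const delta = xs[xs.length - 1] - xs[0];
  if (delta >= minTravel) return 'right';
  if (delta <= -minTravel) return 'left';
  return null;
}
```

A desktop integration would then fire a workspace-switch shortcut whenever a swipe is detected, with a short cooldown so one motion doesn't trigger twice.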

Could it be used to translate sign language signs to written or spoken words, I wonder.

Thank you for the question. Since a similar question was asked already let me refer you to this comment: https://news.ycombinator.com/item?id=28830943

Thank you for the pointer that I had overlooked.

Demo keeps crashing on iOS.

Thank you for the feedback. I can confirm there seems to be an issue with iOS/Chrome and created a GitHub issue for it (https://github.com/handtracking-io/yoha/issues/5).

Note that if you were trying iOS/Safari and not iOS/Chrome there is nothing that can be done due to a limitation that is documented in the section "Discussion" here: https://developer.apple.com/documentation/webkitjs/canvasren... Will document this.

Great idea which is brilliantly executed.

So many educational uses, well done.

Thank you for the feedback.

Would you provide the related paper to this approach?

In contrast to similar works there is no dedicated paper that presents e.g. the neural network or the training procedure. Of course ideas from many papers influenced this work and I can't list them all here. Maybe it helps that the backbone of the network is very similar to MobileNetV2 (https://arxiv.org/abs/1801.04381). Let me know if you have any more questions in that regard.

Thanks for your reply! I just thought that SIGCHI is around the corner and it will be presented there! Awesome work!

Great work!

Thank you for the feedback.

Great demo.

Thank you for your feedback.

I think these tools are super interesting, but I worry tools like this marginalize users with a non-standard number of limbs or fingers.

So does the real world. Things are hard to do with disabilities. That's what the word means. This has great potential, and it's not worth shutting down because some people aren't able to use it.

I can also see this being very helpful for people who have cerebral palsy, for example. Larger movements are easier, this might help someone use the web more easily.

What if a bank used this for authentication and disabled people can't use their custom interface devices? Does that mean that disabled people shouldn't have access to their bank accounts?

Maybe if this was the input device that interacts with the standard web, then there is potential here, but it would be unfortunate if a company used this as a primary means of input.

That's the bank's mistake, not this library's.

I feel the opposite. The more computer interfacing takes place in software, the better for disabled users. If you have a device that expects a keyboard scancode to respond, then you need to build a physical keyboard to talk to it. Building a physical keyboard that doesn't suck is expensive, and so disabled people pay crazy prices for gear tailored to them.

Tailoring software that can use very general-purpose input equipment is much cheaper. Training a neural net to recognize one-handed gestures, for instance, could be done by one developer then deployed worldwide. Making a decent one-hand keyboard is way less easy and way harder to scale.

What do you suggest be done about it?

I'm fine with using them, as long as alternatives are available so that people with disabilities are able to participate as well.

Imagine if your bank started using these to access your account and suddenly disabled customers could no longer use their adaptive input devices to interact with their account.

So we can't give people nice things unless we can give everyone nice things?

This is a very valid point, but as a counter argument the technique implemented here could be adapted to help users with other needs like say, a browser extension that can help you navigate back and forward with the blink of an eye.

This all gets complicated, because not everyone has 2 eyes :-/.

You end up with complicated systems trying to cover all of the edge cases.
