Agreed, but also nature of the beast. It's really easy to explain hand tracking software in a single media element. It's a lot harder to explain some crypto AEAD encapsulation format the same way.
I assume YoHa means Your Hands... I don't think I could have resisted OhHi for hand tracking.
I've been working on a couple of chording keyboard designs and was thinking I might be able to create a virtual keyboard using this library. It would be nice to also be able to recognize the hand from the back. Tracking two hands at a time would also obviously be necessary for a keyboard.
How does the application deal with different skin-tones?
That's an interesting idea. I have not tried to build something similar, but a humble word of caution I want to put out: no matter what kind of ML you use, the mechanical version of the instrument will always be more precise (you are likely aware of this, just want to make sure). However, you might be able to approximate the precision of the mechanical version.
Two hand support would be nice and I would love to add it in the future.
The engine should work well with different skin tones as the training data was collected from a large and diverse set of individuals. The training data will also grow over time, making the engine more and more robust.
Similar thought here. I'd like to track two hands above a real keyboard (size/position established via two QR stickers) to feed into a VR piano training simulation.
The tech is all there, really it's just having the time and effort to get all the pieces together!
Was wondering how easy it'd be to port to native mobile, so went looking for the source code, but doesn't appear to actually be open source. The meat is distributed as binary (WASM for "backend" code and a .bin for model weights).
Aside from being a cool hand tracker, it's a very clever way to distribute closed source JavaScript packages.
Thank you for the feedback. You are right that the project is not open source right now. It's "only" MIT licensed. That's why I also don't advertise it as open source (if you see the words "open source" anywhere, that's a mistake on my end; feel free to point it out). I wanted to start out from just an API contract so that it is easier to manage and get started. In general I have no problem open sourcing the JS part, but there is some refactoring to do first so that it is easier to maintain once open sourced. Stay tuned!
As a side note: The wasm files are actually from the inference engine (tfjs).
Please let me know if you have any more questions in that regard.
An "undo" gesture seems necessary, it was a bit too easy to accidentally wipe the screen. Aside from that, this is fantastic! Love to see what WASM is enabling these days on the web.
Thank you for the feedback. Indeed, such functionality would be nice. One could solve this via another hand pose, or in some way with the existing hand poses, e.g. make a fist for, say, 2 seconds to clear the whole screen; anything shorter just issues an "undo".
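A minimal sketch of that timing idea (all names and the 2-second threshold here are hypothetical app-side code, not part of the YoHa API, which only reports poses per frame):

```typescript
// Decide what a completed fist gesture should trigger based on how
// long the fist was held. Thresholds are illustrative assumptions.
const CLEAR_HOLD_SECONDS = 2;

type FistAction = 'undo' | 'clear';

function actionForFistHold(holdSeconds: number): FistAction {
  // Holding the fist for the full threshold wipes the screen;
  // anything shorter is treated as a single undo.
  return holdSeconds >= CLEAR_HOLD_SECONDS ? 'clear' : 'undo';
}

// Tracks the fist state across per-frame pose results.
class FistTracker {
  private fistStart: number | null = null;

  // Call once per frame with whether a fist is currently detected and
  // the current time in seconds; returns an action when the fist ends.
  update(isFist: boolean, nowSeconds: number): FistAction | null {
    if (isFist) {
      if (this.fistStart === null) this.fistStart = nowSeconds;
      return null;
    }
    if (this.fistStart === null) return null;
    const held = nowSeconds - this.fistStart;
    this.fistStart = null;
    return actionForFistHold(held);
  }
}
```

In an app you would feed `update` from whatever per-frame pose callback the tracking library provides and dispatch the returned action to the drawing canvas.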
YoHa uses TensorFlow.js (tfjs), which provides several backends for computation. One indeed uses WASM; the other is WebGL based. The latter is usually the more powerful one.
Hi, I'm not sure if you've looked into this or not, but another area that is interested in this sort of thing and might be very excited is musical gesture recognition.
Hey, I believe there are multiple things you could have meant. Off the top of my head, one thing that might be interesting would be an application that allows conductors to conduct a virtual orchestra. But I'm sure there are other possibilities in this space too! If you had something else in mind, feel free to share.
I have not explored this space much so far, as my focus is on building the infrastructure that enables such applications rather than building the applications myself.
What would be nice is a version that can be used to paint on the screen with your fingers, such that the lines are visible on a remotely shared screen. The use-case is marking up/highlighting on a normal desktop monitor (i.e. non-touch) while screen-sharing, which is awkward using a mouse or touchpad (think circling stuff in source code and documents, drawing arrows etc.). That would mean (a) a camera from behind (facing the screen), so that the fingers can touch (or almost touch) the screen (i.e. be co-located to the screen contents you want to markup), and (b) native integration, so that the painting is done on a transparent always-on-top OS window (so that it's picked up by the screen-sharing software); or just as a native pointing device, since such on-screen painting/diagramming software already exists.
Thank you for sharing this creative idea. "so that the fingers can touch (or almost touch) the screen": I think this is a big advantage of this approach, since you can only achieve it with a back-facing camera. On the flip side, with a back-facing camera you either have to place the camera between yourself and the screen, which might be awkward, or you have to ensure a placement of the camera behind you that isn't prone to occlusions (e.g. your head or chair might occlude your hands from the camera's point of view). The latter might also make calibration more difficult or impact precision, since you might have to mount the camera with some elevation, causing a less optimal camera angle.
This looks great! Recently I've been wanting to make a hand-tracking library for video editing. I'd make a hand gesture like an OK with my index and thumb to begin recording, and when I was done I'd make a thumbs up to keep the take or thumbs down to delete a bad take. That way, I could very easily record stuff while only keeping the good takes, to sort out later.
Hell, the library could even stitch the takes together, omitting the times when my hand started/finished doing the gestures.
This reminds me of TAFFI [0], a pinching gesture recognition algorithm that is surprisingly easy to implement with classical computer vision techniques.
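For context, the core TAFFI observation is that when thumb and index finger touch, the hand silhouette closes into a loop, so the binary hand mask suddenly contains a hole. A toy illustration of that idea (not TAFFI's actual implementation) that counts background regions not connected to the image border:

```typescript
// Pinch detection sketch in the spirit of TAFFI: a thumb-index pinch
// encloses a hole in the binary hand mask (1 = hand, 0 = background).
// Count background components that never touch the border.
function countHoles(mask: number[][]): number {
  const h = mask.length, w = mask[0].length;
  const seen = mask.map(row => row.map(() => false));

  // Flood-fill a background region (4-connected), reporting whether
  // it touches the image border.
  const fill = (sy: number, sx: number): boolean => {
    let touchesBorder = false;
    const stack: [number, number][] = [[sy, sx]];
    seen[sy][sx] = true;
    while (stack.length > 0) {
      const [y, x] = stack.pop()!;
      if (y === 0 || x === 0 || y === h - 1 || x === w - 1) touchesBorder = true;
      for (const [dy, dx] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
        const ny = y + dy, nx = x + dx;
        if (ny >= 0 && ny < h && nx >= 0 && nx < w &&
            !seen[ny][nx] && mask[ny][nx] === 0) {
          seen[ny][nx] = true;
          stack.push([ny, nx]);
        }
      }
    }
    return touchesBorder;
  };

  let holes = 0;
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (mask[y][x] === 0 && !seen[y][x] && !fill(y, x)) holes++;
    }
  }
  return holes;
}

// A pinch is "on" while the hand blob contains at least one hole.
const isPinching = (mask: number[][]): boolean => countHoles(mask) > 0;
```

In a real pipeline the mask would come from segmenting the hand out of the camera frame; the hole test itself stays this simple, which is part of what made TAFFI appealing.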
Bit of feedback: the home page is pretty sparse. The video is great, but it wasn't obvious how to find the repo or where to get the package (or even what language it can be used with). I had to open the Demo, wait for it to load, and then click the Github link there, and then the readme told me it was available on NPM.
Otherwise looks pretty impressive! I've been looking for something like this and I may give it a whirl
Thank you for the feedback. You are right, the home page should probably be enriched with more information and maybe I can make the information you were looking for stand out better. As a side note: There is a link to GitHub in the footer. The language ("TypeScript API") is also mentioned in the body of the page. But I see that these two can quickly go unnoticed.
Thank you for the feedback. I would like to fix this, but I neither own a Chromebook nor does it seem like I can use a platform like BrowserStack to reproduce the issue (didn't find Chromebook as an available device there). If you would like to help debug the issue, you can open a GitHub issue here: https://github.com/handtracking-io/yoha/issues
I wish there was a nice open-source model for tracking hands and arms with multiple viewpoints (multiple cameras), similar to commercial software like this: https://www.ipisoft.com/
BTW this would be great for spaced repetition foreign character learning (Chinese, Arabic, Japanese, Korean, etc.): if the drawn figure is similar enough to the character the student is learning mark it as studied.
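One simple way to score such a match (an illustrative assumption, not how any existing app does it): rasterize both the student's drawing and the reference glyph onto equally sized binary grids and compare them with intersection-over-union, marking the card studied above a threshold.

```typescript
// Compare two same-sized binary grids (nonzero = inked pixel) with
// intersection-over-union. Grids, threshold, and function names are
// hypothetical; a real app would rasterize strokes and glyphs first.
function iou(a: number[][], b: number[][]): number {
  let inter = 0, union = 0;
  for (let y = 0; y < a.length; y++) {
    for (let x = 0; x < a[y].length; x++) {
      const pa = a[y][x] !== 0, pb = b[y][x] !== 0;
      if (pa && pb) inter++;
      if (pa || pb) union++;
    }
  }
  return union === 0 ? 0 : inter / union;
}

const MATCH_THRESHOLD = 0.7; // assumed; would need tuning per script

function markStudied(drawn: number[][], reference: number[][]): boolean {
  return iou(drawn, reference) >= MATCH_THRESHOLD;
}
```

IoU is crude for handwriting (it ignores stroke order and is sensitive to alignment), so a real version would likely normalize position/scale first or use a stroke-based distance instead.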
Thank you for this inspiring question. For interpreting sign language you need multi-hand support which YoHa is currently lacking. Apart from that you likely also need to account for the temporal dimension which YoHa also does not do right now. If those things were implemented I'm confident that it would produce meaningful results.
It's worth noting that movements of the mouth are extremely important in ASL (and other sign languages) and so this probably isn't as useful as it might seem at first.
Thank you for pointing this out. I overlooked this. I presume that, on top of that, arm movements, facial expressions, and perhaps general body posture could also be relevant. Please correct me if I'm wrong as I'm not too familiar with sign language.
However, you likely want this functionality on any website that you are visiting for which you probably need to build a browser extension. I haven't tried incorporating YoHa into a browser extension but if somebody were to try I'd be happy to help.
In contrast to similar works there is no dedicated paper that presents e.g. the neural network or the training procedure. Of course ideas from many papers influenced this work and I can't list them all here. Maybe it helps that the backbone of the network is very similar to MobileNetV2 (https://arxiv.org/abs/1801.04381). Let me know if you have any more questions in that regard.
So does the real world. Things are hard to do with disabilities. That's what the word means. This has great potential, and it's not worth shutting down because some people aren't able to use it.
I can also see this being very helpful for people who have cerebral palsy, for example. Larger movements are easier, this might help someone use the web more easily.
What if a bank used this for authentication and disabled people couldn't use their custom interface devices? Does that mean that disabled people shouldn't have access to their bank accounts?
Maybe if this was the input device that interacts with the standard web, then there is potential here, but it would be unfortunate if a company used this as a primary means of input.
I feel the opposite. The more computer interfacing takes place in software, the better for disabled users. If you have a device that expects a keyboard scancode to respond, then you need to build a physical keyboard to talk to it. Building a physical keyboard that doesn't suck is expensive, and so disabled people pay crazy prices for gear tailored to them.
Tailoring software that can use very general-purpose input equipment is much cheaper. Training a neural net to recognize one-handed gestures, for instance, could be done by one developer and then deployed worldwide. Making a decent one-hand keyboard is far harder and scales far worse.
I'm fine with using them, as long as alternatives are available so that people with disabilities are able to participate as well.
Imagine if your bank started using these to access your account and suddenly disabled customers could no longer use their adaptive input devices to interact with their account.
This is a very valid point, but as a counterargument, the technique implemented here could be adapted to help users with other needs: say, a browser extension that lets you navigate back and forward with the blink of an eye.
[1] https://handtracking.io/draw_demo/