Getting Alexa to Respond to Sign Language Using Your Webcam and TensorFlow.js (medium.com)
213 points by hardmaru 43 days ago | 27 comments

ASL student here. I just wanted to highlight how many challenges there are to translating, even for humans. Here are two examples.

First is dialect. One would think the Internet/youtube/videochat would vaporize geography differences, and it's starting to happen, but that's not (yet) the case. Lots of signs have "synonyms" based on signer's preference, etymology, who they learned from and who they hang out with, etc.

Another is grammar. What OP is doing in his video, for example, is English grammar, where signs are directly substituted for words. But ASL has its own sign order, modifiers like facial expressions, and idioms for brevity. For example "have you ever been to San Francisco before?" might be "SF TOUCH YOU FINISH" with raised eyebrows at the end, as a modifier. Note also there are no conjugations or articles like some languages, but there are pronouns. In fact, there are local bindings where you make up a sign name for someone on the fly and then use it during a conversation.

I think with large enough training sets, this will all be mitigated, like Google needed years of speech samples to get Translate working okay.

I took 2.5 years of ASL in University, and I've always been annoyed by this, but had a hard time describing what I mean to others. You have just summed up the problem perfectly in a few sentences, and I thank you for that.

That said, I think the OP is still on the right track here:

> I put it together so you can train it on your own set of word and sign/gesture combos.

A Deaf person should be able to train the system on each command they want so that it works for them automatically in the dialect they want.
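To make "train it on your own signs" concrete, here is a minimal sketch of one way per-user training could work: a tiny k-nearest-neighbour classifier over feature vectors. This is an illustration only, not necessarily how OP's project does it. The class name, the 2-D feature vectors, and the command labels are all made up; a real system would get its features from a vision model.

```python
from collections import Counter
from math import dist


class PersonalSignClassifier:
    """Toy k-nearest-neighbour classifier trained on one signer's own examples.

    Hypothetical helper for illustration; not taken from OP's code.
    """

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (feature_vector, command_label)

    def add_example(self, features, label):
        # Each signer records their own examples, in their own dialect.
        self.examples.append((tuple(features), label))

    def predict(self, features):
        if not self.examples:
            raise ValueError("no training examples yet")
        # Take the k stored examples closest to the query and let them vote.
        nearest = sorted(self.examples, key=lambda ex: dist(ex[0], features))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


clf = PersonalSignClassifier(k=3)
# Made-up feature vectors standing in for whatever a real model extracts per frame.
for point in [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15)]:
    clf.add_example(point, "weather")
for point in [(0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]:
    clf.add_example(point, "timer")

prediction = clf.predict((0.12, 0.18))
print(prediction)
```

Because the classifier only stores the user's own examples, a regional variant of a sign is just another training point for the same label.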

> there are local bindings where you make up a sign name for someone on the fly and then use it during a conversation

My very favourite thing about ASL, having learned programming beforehand, was that I could assign people to variables/registers in space. "John point at spot to my left said to Susan spot on right bla bla bla" and then later be able to just point at my left and everyone knows I mean John. And then if John moves to the right, I've just moved him into that register and can reassign the left one to someone or something else. And with a group of experienced Signers, everyone just comprehends this perfectly.
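For the programmers reading along, the register analogy can be written down as a toy sketch. `SigningSpace` and the location names are hypothetical, invented for illustration; this models the analogy, not ASL itself.

```python
class SigningSpace:
    """Toy model: locations in signing space act as reassignable registers."""

    def __init__(self):
        self.registers = {}  # location -> referent

    def assign(self, location, referent):
        # "John <point left>" binds John to the left location.
        self.registers[location] = referent

    def move(self, src, dst):
        # Moving a referent frees the source location for reuse.
        self.registers[dst] = self.registers.pop(src)

    def point(self, location):
        # Pointing at a location dereferences whoever is bound there.
        return self.registers[location]


space = SigningSpace()
space.assign("left", "John")
space.assign("right", "Susan")
who = space.point("left")      # John

space.move("left", "center")   # John moves; "left" is free again
space.assign("left", "Mary")   # reuse the freed location
print(who, space.point("center"), space.point("left"))
```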

Ah, pointing at sign space for someone... I like to think of that like a pronoun (there's probably an official term).

There's also this thing where you introduce a character, then make up a "temporary name sign" on the fly. Don't know what that's called either :)

The term is 'deixis'. Not sure about the latter.

> ASL [...] programming

Hand pose tracking is becoming increasingly viable, both with rapidly advancing ML of camera video[1], and perhaps with VR gloves. So as someone interested in expert UIs for programming inside VR/AR, I ask...

Any suggestions for ASL linguistics resources to mine for non-novice UI idioms/vocabularies/grammars/etc?

Your "3-space as namespace" as one example.

Gaze tracking will similarly be available. Facial expression, at least while wearing HMDs, regrettably not so much (despite prototypes). Is there signing experience with hand-held objects? - fast fine-control "fiddling with a pencil" is available with sub-millimeter few-degrees precision 6DOF, but it's unclear what vocabulary/grammar to use with it.

Big picture: VR/AR seems an opportunity to leverage accumulated insights from signing.

[1] eg, https://github.com/xinghaochen/awesome-hand-pose-estimation

Student, not expert, but.. the definitive reference is [1]. The front matter establishes a system for taxonomy of signs: one hand or two, handshapes, location, motion, facial expression, modifiers, classifiers, etc. and then has a big dictionary. That front matter should be very helpful for translation authors. Note however, that's their dialect/accent, and other regions will have variations. Bill Vicars [2] is pretty good about cataloging many variations but it's all video material, not reference.

1. https://www.amazon.com/Gallaudet-Dictionary-American-Sign-La...

2. http://asluniversity.com


As you intimate, it's a very common misunderstanding that sign languages are forms of spoken languages rather than languages in their own right.

This page (https://signly.co/, for a sign app that gives prerecorded interpretation of public info) has an example of BSL vs English grammar - see "Show Example".

It appears the user is doing an ASL form of Signed English.

A part of me wishes that written sign languages caught on. With a large enough corpus, they'd be way easier to translate into other languages, and it'd also be easier to translate between signing and written sign language. The way things stand, ML solutions need to perform both steps of the translation, which is so much harder.

That would be interesting. It seems there's going to be a tradeoff between entropy (information content) in a human sign versus the written form.

An example might be the sign RAIN, which could mean anything from drizzle to monsoon depending on how energetic and exaggerated the motion is. Telling them apart would need some context: a baseline of how animated the signer is in the conversation. If you wrote that down you might need RAIN... for one and RAIN!!!!! for the other.
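A rough sketch of that baselining idea, assuming some hypothetical per-sign "energy" number (how large or fast the motion is) is already available. The function name, the threshold, and the numbers are all invented for illustration.

```python
from statistics import mean, pstdev


def intensity_label(sign_energy, recent_energies, threshold=1.0):
    """Label one sign's energy relative to the signer's own recent baseline."""
    baseline = mean(recent_energies)
    spread = pstdev(recent_energies) or 1.0  # avoid dividing by zero
    z = (sign_energy - baseline) / spread
    if z > threshold:
        return "heavy"    # written as RAIN!!!!!
    if z < -threshold:
        return "light"    # written as RAIN...
    return "neutral"


# The same raw energy reads differently for a calm signer and an animated one.
calm = intensity_label(2.0, [1.0, 1.2, 0.8, 1.0])
animated = intensity_label(2.0, [2.0, 2.4, 1.6, 2.0])
print(calm, animated)
```

The point is only that the written form would have to carry something relative to the signer, not an absolute measure of the motion.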

Hey, I'm the guy who made this! Happy to answer any questions that the blog didn't address

Do you think you could get this to work for Google Assistant as well?

Does ASL-to-text already exist outside of voice assistants, or is that part totally new?

Whoa, this is really cool! My wife knows sign language and I showed her this and she was impressed, and had so many questions about how it worked =) I love the innovation — Thanks for sharing!

Interesting. I think if I were approaching BSL sign interpretation I'd start with recognising hand shapes, and then move on to other elements of sign notation ... in theory you could then feed the system a dictionary and recognise all signs.

I think BSL uses a variant of Stokoe notation.

Presumably the system of the OP could be modified to recognise dance/skating/snowboarding/martial arts moves.

It seems that once you have the ability to model every aspect of body position from video (as they're doing for enhanced live-action sequences in movies now), this sort of thing becomes much easier.

This looks interesting. I'd be interested to see how the set-up works when it is trained with someone that signs natively.

Most of my experience in the deaf community has been at trivia nights where the signing is incredibly fast (to me) and grammatically loose (just like everyone that's voicing at the bar).

In the end I think it's an invaluable avenue to explore, the tech is important no matter what the end device looks like. Great work.

Love the concept, great work. Came across these guys who are working on making travel more accessible for the deaf community. I would see some overlap in using Alexa to make more voice inclusive content accessible


This is similar to an AWS DeepLens community project called ASLens https://aws.amazon.com/deeplens/community-projects/ASLens/

Although off topic, I loved your paper with Schmidhuber, hardmaru. :)


How many people are still using Alexa/Echo these days? Mine were turned off months ago and I never felt the need to switch them on again. My phone does pretty much everything I need; actually, it does more than I need.

Plus I don't feel comfortable that there is a silent listening device lurking 24x7 for all family members, so it is not just that they're useless; I intentionally do not want to have them on at all 99% of the time.

the only amazon device I use is FireTV for netflix and Amazon prime video, all amazon ereaders, echo, alexa are replaced by my cellphone...

to that extent, i feel amazon stock price is too high, too much hype for the alexa/echo/kindles at least in my opinion

> I don't feel comfortable that there is a silent listening device lurking 24x7 for all family members

They are listening for the wake word. If they were streaming this data anywhere, people would notice. Heck, I would notice as I keep a close eye on them.

I find it amusing that you are not ok with Echo, but you are ok with a cellphone, which not only is listening all the time too ("Ok Google", "Hey Siri") but also has your location information, emails, photos, all sort of messages AND has its own backchannel to upload this information (the cellular data network), which is much more difficult to keep track of.

My Echos are not useless, when you add some home automation they become much more useful. They are still pretty "dumb" though.

We've yet to see any useful, no-nonsense functionality from these sound boxes, I suppose. All the functions available today, like "weather" or "play a song", are definitely done faster on the phone, by hand, with no voice command at all.

I use mine every day I am at home, for weather, timers, lists.

and phone does it all, plus being portable on-the-go

Yes and we even did all these things before a phone and life was fine. The question wasn't if I use the Echo to do things that are impossible otherwise.

I cook a lot and I find it a lot easier to just say "Alexa set a chicken timer for 15 minutes" or "Alexa add black pepper" than it is to get out my phone and do those things.

agreed -- being able to verbally set timers is very valuable to me. Also I have Lifx bulbs in my living room and it is much easier to tell Alexa to turn them on/off/change brightness than it is to get out my phone. I only use the phone app for that when I want to set up some custom color thing (which is pretty rare tbh)
