Designing for Voice User Interfaces (snappymob.com)
41 points by allending 12 days ago | 9 comments

To me, there are three areas where voice user interfaces fail quite badly:

- spelling out a word that is not understood (mainly proper names)

- teaching new commands vocally (hey assistant, when I say "hit it", actually play that song and turn on the living room lights)

- understanding words from a different language as part of a sentence.

That last one makes Alexa quite a pain to use in some countries. For example, if Alexa is set to French and you wanna listen to an English song, you need to pronounce the title with a French accent, otherwise it won't get it. The same is true if Alexa is set to English and you ask for a French or Chinese song title.

It makes it so frustrating that it’s unusable.

I just want to be able to spell things out with the NATO alphabet, the same way I would when talking to a human.

This applies the other way around, too. Please, some satnav app, give me the option of having directions read out as "take exit 52 Bravo" instead of "take exit 52bee". (To me, "52bee" and "53" sound far too similar, and require a glance at the screen to disambiguate!) I would switch apps for that feature, as long as the navigation was at least okay and it still read out street names.
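The fix is tiny, which is what makes its absence so frustrating. Here's a minimal sketch of the idea (the function name and announcement format are invented, not from any real satnav app):

```python
import re

# Map exit-suffix letters to NATO phonetic words so a TTS engine reads
# "52B" as "52 Bravo" instead of the easily-misheard "52 bee".
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta",
    "E": "Echo", "F": "Foxtrot", "G": "Golf", "H": "Hotel",
}

def announce_exit(exit_label: str) -> str:
    """Turn an exit label like '52B' into a TTS-friendly announcement."""
    m = re.fullmatch(r"(\d+)([A-H]?)", exit_label)
    if not m:
        return f"take exit {exit_label}"
    number, letter = m.groups()
    if letter:
        return f"take exit {number} {NATO[letter]}"
    return f"take exit {number}"

print(announce_exit("52B"))  # take exit 52 Bravo
print(announce_exit("53"))   # take exit 53
```

The substitution happens on the text fed to the TTS engine, so no change to the speech synthesis itself is needed.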

You could likely customize OsmAnd to do this. The announcements are assembled using JavaScript, and it is straightforward to add a new script.

https://yingtongli.me/blog/2020/01/25/osmand.html does it to customize something else.

Yup. My friend rides a bike and would sometimes like Maps to navigate him to his street, Włodarzewska. Good luck.

I have one concern with voice user interfaces: how difficult or feasible is it to localize them? I can easily translate the UI of a text-based application into my minority language in an afternoon. Especially if it uses gettext: it's a matter of editing a single text file. Will I be able to do that as easily for a voice user interface?

No. Translation is not easy.

I speak Norwegian. There are about five million speakers. The Norwegian median wage is perhaps the highest in the world, certainly among the top five, so there is a lot of pull to support Norwegian users. But only a few voice assistants have Norwegian interfaces (Siri was the first; Google may have joined since). I work in IT, and building a conversational interface is rarely on the table, and never if you exclude chatbots.

To get an idea of the work involved: I know the guy who translated Siri through friends. Allegedly, it took at least six months of full-time work for one person. I don't know whether he only did the translation or also a lot of the AI work, but either way that's a lot.

The result isn't too impressive, either. Siri has trouble understanding my girlfriend because of her Bergen dialect, which differs from my Oslo dialect mainly in how the R sound is pronounced.

My two cents.

A good speech recognition model takes more than 10k hours of labeled speech data to train, and the more diversity the better. Most production speech models (Siri, Google) also need domain-specific training, e.g. for offline or low-resource inference. Additionally, the language models that help improve recognition accuracy need a fairly large amount of unsupervised text. If you are building a conversational speech model, also consider the data needed to label intents for a large number of phrases.

For a language like Norwegian, you are unlikely to find that much labeled and unlabeled data without forking out a large amount of money to literally pay some Norwegians to help. I suspect the work is being done, but it's extremely time-consuming, and researchers are more likely spending their time on models that learn high-quality, high-level representations of language that can be transferred to different scripts/languages with a small amount of data.
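To make the intent-labeling point concrete, here is a toy sketch of what that data looks like (the intent and slot names are invented for illustration). A production system needs thousands of such phrasings per intent, in every supported language:

```python
from collections import Counter

# Each utterance is paired with an intent name and extracted slot values.
# Collecting and annotating enough of these, per language, is a large
# part of the cost of supporting a new language.
labeled = [
    {"text": "play some jazz", "intent": "PlayMusic", "slots": {"genre": "jazz"}},
    {"text": "put on some jazz music", "intent": "PlayMusic", "slots": {"genre": "jazz"}},
    {"text": "turn on the living room lights", "intent": "LightsOn", "slots": {"room": "living room"}},
]

counts = Counter(example["intent"] for example in labeled)
print(counts)  # Counter({'PlayMusic': 2, 'LightsOn': 1})
```

Note that the two "PlayMusic" utterances phrase the same request differently; covering that variation is exactly what makes the labeling effort scale so badly.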

That depends a bit on whether the language you want to support is already supported by your platform (Alexa, say). If it is, it can be as easy as gettext. It can also be a bit harder because you want to take cultural behavior for conversations into account.

If your language isn't supported by the platform, you're just out of luck. The best you can do is ask the platform vendor to add it, but that is - like vages described - a lot of work and will take time.

My biggest problem with voice user interfaces is that everyone can bloody hear you using them! When you type, you can clickety clack, but nobody knows what your clickety clacking means. When you write and do work by hand, nobody can see what you write unless they're looking over your shoulder. But when you tell your voice computer, "Create new document, titled 'Reasons why Tom is being fired,'" the whole office will hear you! If they could make a voice computer where you wouldn't have to talk, that'd be a winner.

(On a more serious note, maybe some kind of throat mic?)

