Thoughts on Voice Interfaces (ianbicking.org)
153 points by ingve 10 months ago | 64 comments

> Many systems use a “wakeword” or “keyword spotting” to start the interaction. What if we used keyword spotting to determine the end as well? “Please” might be a good choice. It’s like the Enter key.

This would solve so many problems. Sometimes I like to think a bit about what level to set the lights at, but by that time the machine thinks I'm done with my request.

We need richer interaction models, and this is an area where it'd be nice if we could focus on programmers first to find the appropriate mechanisms.

I think modal systems that set contexts would also be amazing.

"Robot, let's adjust the lights"

Which could set the system into dealing only with lighting-related requests.
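A rough sketch of what the end-keyword idea might look like, assuming a streaming recognizer that hands us partial transcripts (the function and keyword list here are invented for illustration):

```python
# Sketch: treat a spoken keyword like "please" as the Enter key.
# Assumes some streaming recognizer feeds us partial transcripts.

TERMINAL_KEYWORDS = {"please", "over", "thank you"}

def utterance_complete(partial_transcript: str) -> bool:
    """Return True once the transcript ends with a terminal keyword."""
    text = partial_transcript.lower().rstrip(".!? ")
    return any(text.endswith(kw) for kw in TERMINAL_KEYWORDS)
```

With something like this, "set the lights to... hmm... 40 percent, please" only triggers the endpoint when you actually say "please", instead of whenever you pause to think.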

We took a small stab at this at Alexa a few years ago with Follow-up Mode (default off; you have to enable it in the settings, since that continued listening should be up to the customer). It lets you keep interacting with Alexa with follow-up utterances, without needing the wakeword for each turn. It's not quite the same as allowing a single run-on utterance or a set of commands in quick succession, though; it's just turn-taking without a wakeword for the next interaction. Alexa decides you're done based on set keywords like "thank you" and stops listening, and also uses ML to understand when you're no longer talking to her.

That's really cool!

Do you find that different types of users interact with the product differently? Children, engineers, laypeople...?

A combination of Star Trek and CB Radio gives us some pretty good starting points:

Computer, set lights at... 50%. Over.

Isn't it significantly faster and more accurate to just turn a knob or move a slider to the middle? 500ms and no conversational interruption vs. 5-10 seconds where everyone has to be quiet.

If you're near the knob, and you have a clean hand, and a free hand, and your motor control is fine, and your lights are wired to switches how you want to control them (some people want individually controlled bulbs, but that's a lot of switches and knobs).

BTW, always on the lookout for an offline-only, voice-activated clock with multiple timers. WiFi for NTP only is also acceptable; no cloud allowed. 98% of what we use the voice assistants for is task timers; it's nice to have those voice-activated so everyone knows they were set.

There seem to be some efforts to make Mycroft run without an internet connection here https://community.mycroft.ai/t/the-mycroft-personal-server-s...

A little offtopic, but am I the only person that is driven absolutely batty by stuff like google maps talking over conversations out of nowhere and without warning (usually telling me about something of no value)?

I really wish these text-to-speech interfaces would begin notifications with an unobtrusive ding, wait a moment so you can stop talking... or even better, listen and slightly delay their notice so they're not talking over you if at all possible.

You're not alone, it drives me insane. With driving directions, why does it need to tell me to make a right off my street onto the main road when I've done it a thousand times? No matter where you're going from my house, the first three directions are always the same unless you've got a helicopter.

That's not to mention the overly verbose, repetitive, and late instructions whenever you're taking some ramp from one highway to another. God forbid you try to have a conversation during that.

Try hearing it as a computer, not a person.

I say "try" because I've failed at this myself a couple of times. But it may help.

It's about the speaking itself. I have a hard time with auditory discrimination, so I have difficulty speaking if others are speaking as well, because I can't properly listen to myself speak.

I wish there was a level of detail slider for the navigation system, because by default I think it's overly detailed and verbose in many instances. An option to skip the first n instructions or any instructions within an x mile radius of my home would be nice too. I don't need the navigation to tell me how to get to the major highways near my home, just which one to go to.

Yes, especially when listening to a podcast in the car.

"And so now begins the most interesting twi--"

"Get ready to make a left turn up ahead"

"--and it was all because--"

"Left turn in 100 meters"

"--but what really--"

"Turn left"

"--and that would change things forever."

Not so with CarPlay. I’ve noticed that while using the Podcasts app and Apple Maps with CarPlay, it pauses the podcast while the maps app is speaking and then resumes it.

I think it depends on the specific app. Does what you’re saying in Pocketcasts, but the Economist app does what parent is saying.

On iOS some combination of the navigation app and the app playing audio can decide whether to interrupt the audio or to talk over it.

The Overcast podcast player tries to backup to complete a word or phrase when it is interrupted.

>or even better, listen and slightly delay their notice so they're not talking over you if at all possible.

With iOS adding an indicator to let the user know whether an app is listening in or not, I'm sure this is going to cause a lot of pushback from users in the form of "omg google is eavesdropping on everything we say!!!".

> I really wish these text-to-speech interfaces would begin notifications with an unobtrusive ding,

At least the Amazon and Google produced devices have notification sounds in accessibility options. One sound when it wakes up, and another when it stops listening. Sometimes you can hear it figure out that it mis-woke and cancel itself (like when the La Cucaracha car horn in Ant-Man triggered a Google Home? I dunno, sometimes it wakes on the weirdest stuff).

> Many systems use a “wakeword” or “keyword spotting” to start the interaction. What if we used keyword spotting to determine the end as well? “Please” might be a good choice. It’s like the Enter key.

I hate saying "Alexa, dismiss" for Reminder notifications rather than accepting the natural conversational "OK", "Thanks", "Got it", or "OK, thanks, I got it"...

Polite language sometimes conveys information: "please" signals an expectation of action, "thank you" can act as an acknowledgement or mark the end of a conversation, etc.

  Me: Alexa, text my spouse please
  Alexa: What would you like to say?
  Me: Thank you for reminding me to pack my lunch, the cafeteria options were terrible today!
  Alexa: what would you like to say?

Discovery is IMO the biggest reason why voice interfaces will never take off.

Mapping the space of possible solutions is delegated to your brain. You are responsible for remembering what's possible, what you can do, and what you should say to make that happen. The majority of the time you will fail, because you forgot what the magic spell actually was.

A visual interface has an unobtrusive way of showing you what you can do. More importantly, it can show you multiple options at once. Voice feedback can only tell you one thing at a time, and very slowly.

Voice assistants are called that for a reason - the core assumption is that you share your intent, and the assistant is smart enough to make it happen.

But that does not work with people either! Nobody can read your mind (yet) - that is why project requirements and management exists. I would love to tell my team one sentence of what I want done, but I have to review that at some point.

I am not using a voice assistant to do human things badly. I want it to do computer things well.

Looks like a good article. I'll look more closely in a bit.

Re. your comment about discovery: I use an old version of a voice recognition software package (Dragon). From its help:

"Say "What Can I Say?" to see a list of sample commands you can say in the current application you are using"

So not a problem, at least in voice transcription software.

From the article, criticising how brevity hurts recognition, which is very true:

> “Paste” is almost never detected correctly.

Yup, so Dragon simply expands the phrase to "paste that". It uses such redundancy to fix the problem, and because words are cheap (if you have no voice disability), this IME works well.

I have to say, having used Dragon a long time now, I don't see much value in it primarily as a voice command interface, but it totally shines in plain old text transcription.
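The redundancy trick is easy to picture in code. This is just an illustrative sketch, not Dragon's actual grammar; the phrase table is invented:

```python
# Sketch of Dragon-style redundancy: expand terse commands into longer,
# more acoustically distinct phrases so short words like "paste" aren't
# misrecognized. The phrase table here is illustrative, not Dragon's.

COMMANDS = {
    "paste that": "paste",
    "copy that": "copy",
    "scratch that": "undo",
}

def match_command(transcript: str):
    """Map a recognized phrase back to the underlying editor action."""
    return COMMANDS.get(transcript.lower().strip())
```

Because "paste that" has more syllables and a more distinctive rhythm than "paste", the recognizer has much more signal to work with, and near-misses simply fail to match rather than triggering the wrong command.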

There's a lot of really nicely summarized points here. I think it's a good "makes you think about things" bit to read if you're going to work in this space.

It's a really interesting writing style.

- Feels more like an outline than an article or blog post

- Covers all the essential points

- Wastes no space on superfluous language

- I think I like it

This is just the natural evolution of a thought process.

This is the bullet-list phase, which is essentially just gathering data. The next phase would be to digest this data to reveal patterns and identify best practices; that article would be more interesting - worthy of a good essay. The phase after that would be to synthesize new facts, bring a new perspective, and fix old issues. That article would be worthy of a research paper.

I fed the beginning of this post into GPT-3 and it came up with these additional thoughts. I’m extremely interested in how voice interfaces can be improved with a language model like this.

Unedited GPT-3 output: “““ I think this problem is especially bad when you have to use a specific vocabulary, and when you have to speak in a specific order. It’s not just the laggy UI, it’s the idea that your vocabulary and phrasing is not flexible. The laggy UI is an artifact of that lack of flexibility. I don’t think it’s hard to understand how frustrating that can be. A lot of people in the voice community are concerned with “flow”. This is the idea that the conversation should just flow, that we shouldn’t have to interrupt it to do work. Flow can be achieved in other interfaces by turning off prompts, and putting in defaults. But it’s hard to do in voice. First, if you have to wait too long, then you are in trouble. Second, you can’t do defaults, because the only way to get the answer is to wait for the user to speak. I think that is true. But there are ways to achieve a kind of flow, by turning off prompts and being very responsive. We don’t have to show you all the information on the screen at once. We can use highlights to let you know the computer is paying attention, or make use of non-visual modalities to let you know something happened. If we show you something on the screen, it’s because we are trying to convey something important, and you’ll be able to act on it, but we can’t expect you to act on everything. Even with this in mind, flow is a design principle, not a technical requirement. I don’t think we should design systems that are trying to achieve flow, because it’s very hard to do. I do think we should design systems that are responsive and forgiving. Voice systems have a unique capability: they can communicate with you without being able to see you. This is the key to being able to make an AI feel more like a person. If I could have a computer read my mind, that would be much more powerful than if it just read my words. This is the reason that voice is so compelling, and why it’s so interesting to implement it. 
Flow is a principle of interaction design, and it’s about putting in a lot of work to make a system seem less work. Voice does not make interaction feel less work. In fact, voice can be the opposite. It is work. Voice requires us to work harder, and because it is so personal, we feel that more. Voice is a more intimate interface. You are using your voice to speak to a computer. There is a large class of problems which are better solved with a screen. I believe this to be ”””

That sounds like incoherent rambling, so perhaps it could replace a manager but not a thinking human.

What a mass of verbiage. It sounds like an upspeak guy had a double shot of Asperger's. I truly hope that is not the future of communication.

How has nobody connected GPT-3 up to a voice interface yet? It wouldn't reveal anything new, but I think the level of conversation GPT-3 is capable of, combined with actually talking to an AI, would be fairly mind blowing.

Isn't GPT-3 just a language model? Where would the part that parses input and then figures out a response come from?

Language models can take input and figure out a response. They're designed to predict words - you just set it up so they predict the response. In fact I think that's the only thing their API lets you do at the moment.


How do you get access to GPT-3? I applied to the beta with a world-changing idea and didn't even get a reply.

AIDungeon (https://play.aidungeon.io/) in "custom" mode lets you enter any prompt. If you go for the paid version, you can use the higher quality "Dragon" model and tweak some of the parameters.

Shouldn't be too difficult to pipe GPT-3's output into a text-to-speech tool.
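Something like this would do for the piping, assuming an `espeak` binary on PATH (or `say` on macOS); the text being spoken is just a placeholder:

```python
# Sketch: pipe generated text to a local text-to-speech command.
# Assumes an `espeak` (or macOS `say`) binary is installed.
import shutil
import subprocess

def speak(text: str, dry_run: bool = False):
    """Speak `text` aloud; with dry_run, just return the command."""
    tts = shutil.which("espeak") or shutil.which("say")
    cmd = [tts or "espeak", text]
    if dry_run or tts is None:
        return cmd  # no TTS binary available; show what would run
    subprocess.run(cmd, check=True)
    return cmd
```

Swap the dry run out for a real call and feed it the model's completions, and you have the crudest possible talking GPT-3.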

There are so many great thoughts in here... Only halfway through, but one of the most substantive UI pieces (or any piece) I've seen in a bit. Lots of this is relevant beyond voice, to the widening range of immersive interfaces around us (text bots, dynamic alerts and recommendations, etc.)

One thing on my mind with voice is that humans generate useful shorthand very dynamically. It would be great if I could conversationally set up more context with Alexa about tasks I do often or music I like, and have that improve efficiency and accuracy when I repeat. Instead, I experience the opposite: memorized commands often stop working because the outside world has changed; for example a song I like getting shadowed by a crappy (but newer) cover.

How about a high-bandwidth vocal signal language in place of natural language? Here is an example: https://youtu.be/sG3PWet8fDk?t=4545

I've heard a trick that acrobatic flight squads like the Blue Angels use when coming into really tight formations is each pilot hums at a specific frequency/tone. By varying their pitch up or down, they can signal to the other pilots what they're about to do, kinda like an audio turn signal. And because each pilot is humming at a different frequency, everyone in the squad can be communicating at the same time -- in the same way you can pick out different instruments in a song, our ears can separate different simultaneous humming at different frequencies.
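A toy illustration of why that works: two simultaneous hums at different pitches stay separable in the frequency domain, the same way your ear picks instruments out of a chord. The frequencies and helper below are made up for the demo:

```python
# Toy demo: mix two "hums" at different pitches, then recover both
# frequencies with a Fourier transform.
import numpy as np

def dominant_frequencies(signal, sample_rate, n=2):
    """Return the n strongest frequency components, ascending."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    return sorted(freqs[np.argsort(spectrum)[-n:]])

rate = 8000
t = np.arange(rate) / rate  # one second of audio
mix = np.sin(2 * np.pi * 220 * t) + np.sin(2 * np.pi * 330 * t)
```

Both tones survive the mixing intact, which is the whole basis of the "everyone hums at once on different frequencies" scheme.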

Not sure this is what the author meant in the first section when he wrote:

> I suspect you can push the user into a conversational skeuomorphism if you think that’s best, and the user will play along, but it’s no more right than another metaphor. It’s a question of quality of interaction, not ease or familiarity.

...but it's an interesting example of non-conversational multiplexed voice communication.

I think what he's saying is that you can push people towards "natural" conversation but if it doesn't really work well, people will get frustrated. Which is sort of the problem with voice assistants today for the most part. You really do have to get the right wizard incantations a lot of the time.

I haven't seen that video before, I'm excited to watch it in its entirety.

I've always felt a bit disappointed in myself that I haven't created the kind of optimized experience that Engelbart put forward – like maybe I'm not just being inefficient, but leaving room to overengage with my own distraction.

OTOH, there are limits to our own ability to be engaged and efficient. I am a good typist, which means I can write faster than my own thoughts... do I need to type faster? I could switch to Dvorak, but I don't think that's the limiting factor. Some of the user interfaces Engelbart advocated for feel like the Dvorak of interfaces, optimizing for something that is already past human ability. Looking at some of the interfaces in the Mother Of All Demos, it feels like they would require years of living in the tool to make it worth it, to make it natural and comprehensible and focusing instead of distracting. (I haven't done it, so I might be wrong!)

A high-bandwidth vocal signal feels similar, optimizing past our mental speed, and single-system-focused. At least that's my take on it.

> like maybe I'm not just being inefficient, but leaving room to overengage with my own distraction

I feel that way when I'm writing text and code.

But when doing visual design, or playing with a demoscene setup, I tend to feel limited by the number of variables I can simultaneously control. I have tried using MIDI knobs and sliders, but the bottleneck is how many fingers I have, rather than the number of input devices in front of me.

I feel like you'd lose most of the advantage of a voice interface if you had to learn a new (and presumably difficult) high-bandwidth vocal signal language. It might still be useful for some niche scenarios where it's very important to have your hands free or to operate over an audio channel, but the real notable advantage of a voice interface is that most humans already use speech as their primary method of communication with other humans.

The learning curve would certainly be a disadvantage. But just like learning to play an instrument, the payoff could be worth it.

One thing that is not mentioned is that voice is horrible for internationalization: the language of the UI is far more deeply entangled with the language of the content.

And it's definitely not future-proofed. Languages can change a lot in decades, from pronunciation to vocabulary to grammar.

People think it's crazy that we still have people editing 50 year old COBOL programs today. Imagine how annoying it'd be to have to use a 50 year old computer that hasn't had "accent updates" and needing to emulate the speech patterns of your grandparents to be recognized.

Or spatially-proofed: https://www.youtube.com/watch?v=TqAu-DDlINs "Please? Pathetic."

(Autocorrected SMS was unpopular in my country when it was first introduced, because it corrected to Germany-German spellings, and ever since the mid-twentieth century we've been proud of regional orthographies. I imagine francophones who use "SMS language" also have to turn off autocorrect.)

> Some people think it is important that we not abuse assistants. They believe abuse will make us cold or abusive to each other. I do not agree.

And then he provides four bullet points explaining exactly why we should not abuse assistants...

He says we have the opposite problem. People feel judged by their talking computers and feel sorry for tools. He says we should make systems we don’t anthropomorphize.

Now, my sample size is small, but I've noticed that people react to voice interfaces with what I can only think to call "dumb foreigner syndrome" where they actively slow down and simplify their speech as if they were talking to someone that barely understood the language. I think this comes from familiarity with bad voice interfaces, but maybe we should indulge this expectation of basic but explicit commands, rather than rich but imprecise language.

Dare I say it, but maybe we should look to the likes of Zork for inspiration for voice input.

> Dare I say it, but maybe we should look to the likes of Zork for inspiration for voice input.

On that note, there's nothing intrinsic here, really, to why these need to be voice interfaces. Based on watching the video[1] and my experience using voice interfaces, if my hands are already on the keyboard, I'd wager I can type some of these things as quickly as if I'd spoken them and waited for the agent to keep up (even assuming they can).

I'm also surprised that the Awesomebar-/Sublime Text-/VSCode-style command palette hasn't by now become a ubiquitous feature of user interfaces. I'm really surprised they haven't replaced, for many use cases, the combo of Bash running in a terminal emulator, given how well-known a problem it is to have to remember the args to infrequently used commands. And I'd rather type "it" (as in "delete it") than to ever type "$_" even once—let alone ask someone with a straight face to remember that this is the way to do things when you're talking to the computer.

1. https://www.youtube.com/watch?v=3sqKsfj8WRE

> I'm also surprised that the Awesomebar-/Sublime Text-/VSCode-style command palette hasn't by now become a ubiquitous feature of user interfaces.

This makes me think of programs on the Mac[1] like Alfred and LaunchBar. While they're usually called "application launchers," in a lot of ways they're more like simple, always-available command lines. I use Alfred for launching nearly every program, but also use it to do various web searches (like "g foobar" for Google[2], "wp foobar" for Wikipedia, etc.), perform simple calculations ("=123+(45/67)"), and even actions like converting Markdown on the clipboard to BBCode or creating Jira URLs for pasting into GitHub PRs. The Mac's native Spotlight has evolved to do some of these tasks, but Alfred, LaunchBar, et al. are more powerful in part because they're more explicit -- Spotlight makes educated guesses about what you're searching for, but Alfred operates more like, well, a command line.

[1] I presume programs like these exist for Linux, too, although I haven't found any when I've gone looking. The lack of one would actually be a pretty major stumbling block for me for platform switching. (As would the lack of anything quite like Keyboard Maestro, but that's a different topic, and maybe that's out there, too!)

[2] Actually I'm using DuckDuckGo but still type "g <thing>" because of years of muscle memory.
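The dispatch logic in those launchers is tiny, which is part of the appeal. A hedged sketch of the idea (keywords and URL templates here are made up, not Alfred's actual configuration):

```python
# Sketch of an Alfred-style "simple, always-available command line":
# a prefix keyword routes the rest of the query.

SEARCHES = {
    "g": "https://www.google.com/search?q={}",
    "wp": "https://en.wikipedia.org/wiki/Special:Search?search={}",
}

def dispatch(query: str) -> str:
    if query.startswith("="):  # inline calculator, "=123+(45/67)"
        return str(eval(query[1:], {"__builtins__": {}}))
    keyword, _, rest = query.partition(" ")
    if keyword in SEARCHES and rest:
        return SEARCHES[keyword].format(rest.replace(" ", "+"))
    return f"launch:{query}"  # fall back to launching an app
```

The explicitness is the point: "wp foobar" always means a Wikipedia search, with no guessing, which is exactly the contrast with Spotlight described above.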

My comment about Zork comes from Don Gentner and Jakob Nielsen's "The Anti-Mac Interface", which mentioned a simple search bar that could take more-or-less natural language instructions such as "make a new document called outlines" and the computer would make a new word processor document with that file name and maybe automatically title it for you. It sounds like it may be a similar thing to your command palette, or maybe something like the Unity HUD or those Rofi launchers I keep seeing i3 users showing off.

Come to think of it, I usually type into my Google Assistant app on my phone, rather than talk to them with voice, so there's no need to make these voice exclusive.

Most of the voice interfaces are basically Zork, but Zork came with a manual and HELP would tell you the keywords. If you ask the question in the right form, it will work, and if the form is wrong, it won't, but the forms are hard to find (and they change).

People are very adaptable and computers are hard to adapt. Tell me what I'm supposed to say, and I'll say it.
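A Zork-style voice grammar with a discoverable HELP is not much code. The vocabulary below is invented for illustration:

```python
# Sketch of a Zork-style command grammar for voice: a small fixed set
# of verb + object forms, plus a "help" that lists what you can say.

GRAMMAR = {
    ("turn on", "lights"): "lights_on",
    ("turn off", "lights"): "lights_off",
    ("set", "timer"): "timer_set",
}

def parse(utterance: str):
    """Return an action name, a help listing, or None for no match."""
    text = utterance.lower().strip()
    if text == "help":
        return sorted(f"{verb} {obj}" for verb, obj in GRAMMAR)
    for (verb, obj), action in GRAMMAR.items():
        if text.startswith(verb) and obj in text:
            return action
    return None
```

The fixed forms are the manual; "help" is the HELP command. The user never has to guess the magic spell, which addresses the discovery complaint upthread.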

I always do this. It's because there are only bad voice interfaces. Siri, Alexa, etc.: it's utter crap.

Has anyone yet tried programming in Inform 7 via voice?

This comment probably doesn't add to the conversation, so I apologise in advance.

We make extensive use of Sonos and Amazon Echo products in our household and have two young children (3 and 5). The older one doesn't like issuing instructions to the Echo devices, because he pauses occasionally and it somewhat aggressively replies "SORRY, I CAN'T HELP RIGHT NOW".

A bigger problem we've noticed is that things "drift". We've got muscle memory of asking "Alexa, play classical lullabies" or "Echo, play the skeleton dance song". Occasionally though these commands will be "hijacked" by some new content in Spotify or somewhere else.

"Alexa, play music for children" used to play nice kids songs, now it plays some murder-metal album called (one assumes ironically) "music for children", prior to that match, it seems like it was a search, or a playlist?

And there seems to be no way to deny-list music like this. It's driving us off of Spotify in our household, "skeleton dance" plays either screaming metal artists or the cutesey kids song with about 50% accuracy, it seems.

FWIW I had the same problem with Amazon FreeTime where for 5 bucks a month you get access to 10,000 kids shows and games, which is about 9950 more than I had time to vet to check if they were appropriate for my kids.

Please, hold your criticism if you want to berate me for raising kids in a household with voice assistants and tablets, I am trying to balance exposing them to useful technology without exposing them to the underworld of utter shite that is to be found beneath those interfaces.

So many great points. I'm very happy to hear that someone is thinking of the UX. I only wish they worked on Siri.

The only addition I can think of is during the wake word section. I'd like to remove the wake word requirement if I'm in the middle of an interaction.

For example, sending a text on Siri.

"Hey Siri text my wife. Tell her I'm on my way"

Siri shows me the text and then I have to again say "Hey Siri send it" when just "Send it" should do.

Trying your example starts an interaction with Siri for me. Siri verifies the message content, asks if I’m ready to send it, and listens for my reply.

Using AirPods Pro is supposed to allow somewhat reduced vocal utterances to Siri, but I haven’t spent time learning that.

Agreed. This article is so loaded with not-so-obvious observations that can be used as a source of ideas for mini research projects. It's a goldmine for things to try. Kudos to the author for sharing their thoughts on this space.

A few of my own thoughts from when I was working on this full-time:

"Dialogue designers replace graphic designers when creating voice interfaces"


I’m surprised neither Google nor Apple have taken voice reading of webpages out of the accessibility ghetto and made them full fledged interfaces yet. Why can’t I walk the dog and say “hey dingus, read Hacker News” and get a decent experience without having to look at my phone or put it in a voice UI only mode.

Great blogpost, thanks for sharing. I'm hoping the next 5 years will lead to enough improvement that accuracy levels will get high enough that users will feel more comfortable using voice interfaces. We'll see. It does seem to me like as long as it can only be used for commands it will never take off. But being able to do conversations consistently is a very hard problem.

Making voice input like talking to a person is like trying to make a GUI similar to an oven interface.

Don't try to use full sentences to command a computer. People are happy to learn a few words if accuracy increases to >99% and it is more convenient.

Also, incorporate humming in tones.

The most important point Firefox people forget: it is better to be a good browser than a bad Alexa.

I think all this AI is overkill.

If those things were more active, they would work much better.
