
Thoughts on Voice Interfaces - ingve
https://www.ianbicking.org/blog/2020/08/thoughts-on-voice-interfaces.html
======
echelon
> Many systems use a “wakeword” or “keyword spotting” to start the
> interaction. What if we used keyword spotting to determine the end as well?
> “Please” might be a good choice. It’s like the Enter key.

This would solve so many problems. Sometimes I like to think a bit about what
level to set the lights at, but by that time the machine thinks I'm done with
my request.

We need richer interaction models, and this is an area where it'd be nice if
we could focus on programmers first to find the appropriate mechanisms.

I think modal systems that set contexts would also be amazing.

"Robot, let's adjust the lights"

Which could set the system into dealing only with lighting-related requests.
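
Roughly what I have in mind, as a Python sketch. The "please" end keyword and
the lighting mode are just illustrations, and every name here is made up:

    # Sketch: "please" acts like the Enter key, and a trigger phrase puts
    # the session into a mode that only handles lighting requests.
    END_KEYWORDS = {"please"}
    MODE_TRIGGERS = {"let's adjust the lights": "lighting"}

    class Session:
        def __init__(self):
            self.mode = None   # e.g. "lighting" narrows the grammar
            self.words = []    # words heard since the wakeword

        def hear(self, word):
            if word.lower().strip(".,!?") in END_KEYWORDS:
                self.dispatch(" ".join(self.words))
                self.words = []
            else:
                self.words.append(word)

        def dispatch(self, utterance):
            if utterance in MODE_TRIGGERS:
                self.mode = MODE_TRIGGERS[utterance]
                print("entering", self.mode, "mode")
            elif self.mode == "lighting":
                print("lighting request:", utterance)
            else:
                print("general request:", utterance)

    s = Session()
    for w in "let's adjust the lights please".split():  # wakeword stripped
        s.hear(w)
    for w in "dim them to thirty percent please".split():
        s.hear(w)  # dispatched as a lighting request, no re-triggering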

~~~
dbish
We took a small stab at this at Alexa a few years ago with Follow-up Mode
(off by default; you have to enable it in the settings, since that kind of
continued listening should be up to the customer). It lets you keep
interacting with Alexa with "follow-up" utterances, without the need for a
wakeword on the next turn. It's not quite the same as allowing a single
run-on utterance or a set of commands in quick succession, though; it's just
turn-taking. Alexa determines you're done with some set of keywords like
"thank you" and stops listening, as well as using ML to understand when
you're not talking to her anymore.
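
(Not the actual implementation, obviously; just the shape of it as a Python
sketch, with an assumed ML score for whether speech is device-directed.)

    # Sketch only: decide whether to stop listening after an utterance.
    END_PHRASES = {"thank you", "thanks", "that's all"}

    def conversation_over(utterance, directedness, threshold=0.5):
        if utterance.lower().rstrip(".,!") in END_PHRASES:
            return True   # explicit end keyword
        return directedness < threshold   # model: not talking to her anymore

    print(conversation_over("Thank you!", directedness=0.9))            # True
    print(conversation_over("anyway, as I was saying", directedness=0.2))  # True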

~~~
echelon
That's really cool!

Do you find that different types of users interact with the product
differently? Children, engineers, laypeople...?

------
nullc
A little off-topic, but am I the only person who is driven absolutely batty
by stuff like Google Maps talking over conversations out of nowhere and
without warning (usually telling me about something of no value)?

I really wish these text-to-speech interfaces would begin a notification with
an unobtrusive ding and wait a moment so you can stop talking... or, even
better, listen and slightly delay their notice so they're not talking over
you if at all possible.
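
Even a crude version of this would help. A Python sketch of the flow I mean,
with stubbed-out primitives for the chime, speech detection, and TTS (all
hypothetical):

    import time

    def chime(): print("*ding*")          # stub: unobtrusive alert sound
    def is_speech_active(): return False  # stub: a real system would use VAD
    def speak(text): print(text)          # stub: text-to-speech

    def announce(message, max_delay_s=10.0):
        chime()                  # warn that a notice is coming
        time.sleep(1.0)          # give people a moment to stop talking
        waited = 0.0
        while is_speech_active() and waited < max_delay_s:
            time.sleep(0.5)      # hold the notice while someone is talking
            waited += 0.5
        speak(message)

    announce("In 200 meters, turn left.")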

~~~
programmarchy
Yes, especially when listening to a podcast in the car.

"And so now begins the most interesting twi--"

"Get ready to make a left turn up ahead"

"\--and it was all because--"

"Left turn in 100 meters"

"\--but what really--"

"Turn left"

"\--and that would change things forever."

~~~
woadwarrior01
Not so with CarPlay. I’ve noticed that while using the Podcasts app and Apple
Maps with CarPlay, it pauses the podcast while the maps app is speaking and
then resumes it.

~~~
nullspace
I think it depends on the specific app. It does what you're saying in
Pocketcasts, but the Economist app does what the parent describes.

------
sradman
> Many systems use a “wakeword” or “keyword spotting” to start the
> interaction. What if we used keyword spotting to determine the end as well?
> “Please” might be a good choice. It’s like the Enter key.

I hate saying "Alexa, dismiss" for Reminder notifications rather than
accepting the natural conversational "OK", "Thanks", "Got it", or "OK, thanks,
I got it"...

Polite language sometimes conveys information: "please" signals an expectation
of action, "thank you" can act as an acknowledgement or mark the end of a
conversation, etc.

~~~
jldugger

      Me: Alexa, text my spouse please
      Alexa: What would you like to say?
      Me: Thank you for reminding me to pack my lunch, the cafeteria options were terrible today!
      Alexa: What would you like to say?
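
The fix presumably has to be state-dependent: end keywords can't mean "stop
listening" while the assistant is taking dictation. A hypothetical Python
sketch:

    # Sketch: suspend end-keyword spotting while collecting a message body,
    # or a "thank you" inside the text would end the conversation early.
    END_PHRASES = {"thank you", "thanks"}

    def handle(utterance, state):
        if state == "dictating":
            return ("message_body", utterance)  # take the words verbatim
        if utterance.lower() in END_PHRASES:
            return ("stop_listening", None)
        return ("command", utterance)

    print(handle("thank you for reminding me to pack my lunch", "dictating"))
    # -> ('message_body', ...), not the end of the conversation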

------
artpi
Discovery is IMO the biggest reason why voice interfaces will never take off.

Mapping the space of possible solutions is delegated to your brain. You are
responsible for remembering what's possible, what you can do, and what you
should say to make that happen. The majority of the time you will fail,
because you've forgotten what the magic spell actually was.

A visual interface has an unobtrusive way of showing you what you can do.
More importantly, it can show you multiple options at once. Voice feedback
can only tell you one thing at a time, and very slowly.

Voice assistants are called that for a reason: the core assumption is that
you share your intent, and the assistant is smart enough to make it happen.

But that doesn't work with people either! Nobody can read your mind (yet) -
that is why project requirements and management exist. I would love to give
my team one sentence describing what I want done, but I still have to review
the result at some point.

I am not using a voice assistant to do human things badly. I want it to do
computer things well.

~~~
throwaway_pdp09
Looks like a good article. I'll look more closely in a bit.

Re. your comment about discovery: I use an old version of voice recognition
software (Dragon). From the help:

"Say "What Can I Say?" to see a list of sample commands you can say in the
current application you are using"

So not a problem, at least in voice transcription software.

From the article, criticizing how brevity hurts recognition (which is very true):

> “Paste” is almost never detected correctly.

Yup, so Dragon simply expands the phrase to "paste that". It uses such
redundancy to fix the problem, and because words are cheap (if you have no
voice disability), this IME works well.
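
The expansion trick is easy to picture. I don't know how Dragon implements it
internally; this is just the idea as a Python sketch:

    # Sketch: never listen for a bare, easily-misheard verb; map each one
    # to a longer, acoustically distinct canonical phrase.
    CANONICAL = {"paste": "paste that", "cut": "cut that", "copy": "copy that"}
    PHRASE_TO_ACTION = {v: k for k, v in CANONICAL.items()}

    def grammar():
        return set(PHRASE_TO_ACTION)   # phrases the recognizer listens for

    def to_action(heard):
        return PHRASE_TO_ACTION.get(heard)

    print(to_action("paste that"))  # -> 'paste'
    print(to_action("paste"))       # -> None: too short, not in the grammar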

I have to say, having used Dragon for a long time now, I don't see much value
in voice command interfaces, but it totally shines in plain old text
transcription.

------
ocdtrekkie
There are a lot of really nicely summarized points here. I think it's a good
"makes you think about things" piece to read if you're going to work in this
space.

~~~
Ajedi32
It's a really interesting writing style.

- Feels more like an outline than an article or blog post

- Covers all the essential points

- Wastes no space on superfluous language

- I think I like it

~~~
sytelus
This is just the evolution of the thought process.

This is the bullet-list phase, which is essentially just gathering data. The
next phase would be to digest this data to reveal patterns and identify best
practices; that article would be more interesting, worthy of a good essay.
The phase after that would be to synthesize new facts, bring a new
perspective, and fix old issues. That article would be worthy of a research
paper.

------
cookingrobot
I fed the beginning of this post into GPT-3 and it came up with these
additional thoughts. I’m extremely interested in how voice interfaces can be
improved with a language model like this.

Unedited GPT-3 output: “““ I think this problem is especially bad when you
have to use a specific vocabulary, and when you have to speak in a specific
order. It’s not just the laggy UI, it’s the idea that your vocabulary and
phrasing is not flexible. The laggy UI is an artifact of that lack of
flexibility. I don’t think it’s hard to understand how frustrating that can
be. A lot of people in the voice community are concerned with “flow”. This is
the idea that the conversation should just flow, that we shouldn’t have to
interrupt it to do work. Flow can be achieved in other interfaces by turning
off prompts, and putting in defaults. But it’s hard to do in voice. First, if
you have to wait too long, then you are in trouble. Second, you can’t do
defaults, because the only way to get the answer is to wait for the user to
speak. I think that is true. But there are ways to achieve a kind of flow, by
turning off prompts and being very responsive. We don’t have to show you all
the information on the screen at once. We can use highlights to let you know
the computer is paying attention, or make use of non-visual modalities to let
you know something happened. If we show you something on the screen, it’s
because we are trying to convey something important, and you’ll be able to act
on it, but we can’t expect you to act on everything. Even with this in mind,
flow is a design principle, not a technical requirement. I don’t think we
should design systems that are trying to achieve flow, because it’s very hard
to do. I do think we should design systems that are responsive and forgiving.
Voice systems have a unique capability: they can communicate with you without
being able to see you. This is the key to being able to make an AI feel more
like a person. If I could have a computer read my mind, that would be much
more powerful than if it just read my words. This is the reason that voice is
so compelling, and why it’s so interesting to implement it. Flow is a
principle of interaction design, and it’s about putting in a lot of work to
make a system seem less work. Voice does not make interaction feel less work.
In fact, voice can be the opposite. It is work. Voice requires us to work
harder, and because it is so personal, we feel that more. Voice is a more
intimate interface. You are using your voice to speak to a computer. There is
a large class of problems which are better solved with a screen. I believe
this to be ”””

~~~
IshKebab
How has nobody connected GPT-3 up to a voice interface yet? It wouldn't
reveal anything new, but I think the level of conversation GPT-3 is capable
of, combined with actually talking to an AI, would be fairly mind-blowing.

~~~
Kiro
How do you get access to GPT-3? I applied to the beta with a world-changing
idea and didn't even get a reply.

~~~
orangecat
AIDungeon ([https://play.aidungeon.io/](https://play.aidungeon.io/)) in
"custom" mode lets you enter any prompt. If you go for the paid version, you
can use the higher quality "Dragon" model and tweak some of the parameters.

------
evrydayhustling
There are so many great thoughts in here... Only halfway through, but this is
one of the most substantive UI pieces (or any piece) I've seen in a while.
Lots of this is relevant beyond voice, to the widening range of immersive
interfaces around us (text bots, dynamic alerts and recommendations, etc.)

One thing on my mind with voice is that humans generate useful shorthand very
dynamically. It would be great if I could conversationally set up more
context with Alexa about tasks I do often or music I like, and have that
improve efficiency and accuracy when I repeat them. Instead, I experience the
opposite: memorized commands often stop working because the outside world has
changed; for example, a song I like gets shadowed by a crappy (but newer)
cover.
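
What I want is basically a per-user alias table, checked before the global
resolver and pinned to whatever it matched the first time, so the outside
world can't shadow it. A sketch (the IDs and names are made up):

    # Sketch: personal shorthand pinned to concrete targets, so a newer
    # cover can't silently replace the song I taught it.
    aliases = {}   # phrase -> pinned target, e.g. a specific track ID

    def teach(phrase, target):
        aliases[phrase.lower()] = target

    def resolve(utterance, global_search):
        return aliases.get(utterance.lower()) or global_search(utterance)

    teach("my running mix", "spotify:playlist:abc123")   # made-up ID
    print(resolve("my running mix", lambda q: "search:" + q))
    # -> 'spotify:playlist:abc123', not whatever search ranks first this week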

------
AriaMinaei
How about a high-bandwidth vocal signal language in place of natural language?
Here is an example:
[https://youtu.be/sG3PWet8fDk?t=4545](https://youtu.be/sG3PWet8fDk?t=4545)

~~~
ianbicking
I haven't seen that video before, I'm excited to watch it in its entirety.

I've always felt a bit disappointed in myself that I haven't created the kind
of optimized experience that Engelbart put forward – like maybe I'm not just
being inefficient, but leaving room to overengage with my own distraction.

OTOH, there are limits to our own ability to be engaged and efficient. I am a
good typist, which means I can write faster than my own thoughts... do I need
to type faster? I could switch to Dvorak, but I don't think that's the
limiting factor. Some of the user interfaces Engelbart advocated for feel like
the Dvorak of interfaces, optimizing for something that is already past human
ability. Looking at some of the interfaces in the Mother Of All Demos, it
feels like they would require years of living in the tool to make it worth it,
to make it natural and comprehensible and focusing instead of distracting. (I
haven't done it, so I might be wrong!)

A high-bandwidth vocal signal feels similar, optimizing past our mental speed,
and single-system-focused. At least that's my take on it.

~~~
AriaMinaei
> like maybe I'm not just being inefficient, but leaving room to overengage
> with my own distraction

I feel that way when I'm writing text and code.

But when doing visual design, or playing with a demoscene setup, I tend to
feel limited by the number of variables I can simultaneously control. I have
tried using MIDI knobs and sliders, but the bottleneck is how many fingers I
have, rather than the number of input devices in front of me.

------
zokier
One thing that is not mentioned is that voice is horrible for
internationalization: the language of the UI is far more deeply entangled
with the language of the content.

~~~
fiblye
And it's definitely not future-proofed. Languages can change a lot in decades,
from pronunciation to vocabulary to grammar.

People think it's crazy that we still have people editing 50-year-old COBOL
programs today. Imagine how annoying it'd be to have to use a 50-year-old
computer that hasn't had "accent updates" and needing to emulate the speech
patterns of your grandparents to be recognized.

~~~
082349872349872
Or spatially-proofed: [https://www.youtube.com/watch?v=TqAu-DDlINs](https://www.youtube.com/watch?v=TqAu-DDlINs) "Please? Pathetic."

(Autocorrected SMS was unpopular in my country when it was first introduced,
because it corrected to German German spellings, and ever since the
mid-twentieth century we've been proud of our regional orthographies. I
imagine francophones who use "SMS language" also have to turn off
autocorrect.)

------
jedberg
> Some people think it is important that we not abuse assistants. They believe
> abuse will make us cold or abusive to each other. I do not agree.

And then he provides four bullet points explaining exactly why we should not
abuse assistants...

~~~
pseudalopex
He says we have the opposite problem. People feel judged by their talking
computers and feel sorry for tools. He says we should make systems we don’t
anthropomorphize.

------
dafoex
Now, my sample size is small, but I've noticed that people react to voice
interfaces with what I can only think to call "dumb foreigner syndrome" where
they actively slow down and simplify their speech as if they were talking to
someone that barely understood the language. I think this comes from
familiarity with _bad_ voice interfaces, but maybe we should indulge this
expectation of basic but explicit commands, rather than rich but imprecise
language.

Dare I say it, but maybe we should look to the likes of Zork for inspiration
for voice input.
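
A Zork-style grammar could be tiny and, crucially, enumerable, so the system
can recite exactly what it accepts when asked. A rough Python sketch with an
invented vocabulary:

    # Sketch: a small, explicit verb-noun grammar for voice commands.
    VERBS = {"turn", "set", "play"}
    NOUNS = {"lights", "thermostat", "music"}

    def parse(utterance):
        words = utterance.lower().split()
        if not words or words[0] not in VERBS:
            return None   # unknown verb: recite the vocabulary instead
        noun = next((w for w in words if w in NOUNS), None)
        return (words[0], noun)

    print(parse("turn on the lights"))   # ('turn', 'lights')
    print(parse("illuminate the room"))  # None -> "You can say: turn, set, play"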

~~~
cxr
> Dare I say it, but maybe we should look to the likes of Zork for inspiration
> for voice input.

On that note, there's nothing intrinsic here, really, that requires these to
be _voice_ interfaces. Based on watching the video[1] and my experience using
voice interfaces, if my hands are already on the keyboard, I'd wager I can
type some of these things as quickly as if I'd spoken them and waited for the
agent to keep up (even assuming it can).

I'm also surprised that the Awesomebar-/Sublime Text-/VSCode-style command
palette hasn't by now become a ubiquitous feature of user interfaces. I'm
_really_ surprised it hasn't, for many use cases, replaced the combo of Bash
running in a terminal emulator, given how well-known a problem it is to have
to remember the args to infrequently used commands. And I'd rather type "it"
(as in _delete it_) than ever type "$_" even once, let alone ask someone with
a straight face to remember that that's the way to do things when you're
talking to the computer.

1. [https://www.youtube.com/watch?v=3sqKsfj8WRE](https://www.youtube.com/watch?v=3sqKsfj8WRE)
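
To be concrete about the palette idea, a toy Python sketch (the command names
are invented): you type fragments and pick from matches instead of
remembering exact syntax.

    # Sketch: a bare-bones command palette with subsequence fuzzy matching.
    COMMANDS = {
        "delete it": lambda: print("deleted last result"),
        "extract tar archive": lambda: print("tar -xf ..."),
        "show listening ports": lambda: print("ss -tlnp"),
    }

    def fuzzy(query, name):
        """True if the query's characters appear in order within the name."""
        it = iter(name)   # assumes lowercase command names
        return all(ch in it for ch in query.lower())

    def palette(query):
        return [name for name in COMMANDS if fuzzy(query, name)]

    print(palette("del"))  # ['delete it']
    print(palette("tar"))  # ['extract tar archive']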

~~~
dafoex
My comment about Zork comes from Don Gentner and Jakob Nielsen's "The
Anti-Mac Interface", which mentioned a simple search bar that could take
more-or-less natural language instructions such as "make a new document
called outlines"; the computer would make a new word processor document with
that file name and maybe automatically title it for you. It sounds like it
may be similar to your command palette, or maybe something like the Unity HUD
or those Rofi launchers I keep seeing i3 users show off.

Come to think of it, I usually type into the Google Assistant app on my phone
rather than talk to it with voice, so there's no need to make these
voice-exclusive.

~~~
cxr
See also
[https://en.wikipedia.org/wiki/Ubiquity_(Firefox)](https://en.wikipedia.org/wiki/Ubiquity_\(Firefox\))

------
codebeaker
This comment probably doesn't add to the conversation, so I apologise in
advance.

We make extensive use of Sonos and Amazon Echo products in our household and
have two young children (3 and 5). The older one doesn't like issuing
instructions to the Echo devices, because he pauses occasionally and it
somewhat aggressively cuts him off with "SORRY, I CAN'T HELP RIGHT NOW".

A bigger problem we've noticed is that things "drift". We've got muscle memory
of asking "Alexa, play classical lullabies" or "Echo, play the skeleton dance
song". Occasionally though these commands will be "hijacked" by some new
content in Spotify or somewhere else.

"Alexa, play music for children" used to play nice kids songs, now it plays
some murder-metal album called (one assumes ironically) "music for children",
prior to that match, it seems like it was a search, or a playlist?

And there seems to be no way to deny-list music like this. It's driving us
off of Spotify in our household; "skeleton dance" plays either screaming
metal artists or the cutesy kids' song with about 50% accuracy, it seems.

FWIW I had the same problem with Amazon FreeTime, where for 5 bucks a month
you get access to 10,000 kids' shows and games, which is about 9,950 more
than I had time to vet for appropriateness for my kids.

Please, hold your criticism if you want to berate me for raising kids in a
household with voice assistants and tablets, I am trying to balance exposing
them to useful technology without exposing them to the underworld of utter
shite that is to be found beneath those interfaces.

------
killion
So many great points. I'm very happy to hear that someone is thinking about
the UX. I only wish they worked on Siri.

The only addition I can think of is to the wake word section: I'd like the
wake word requirement removed when I'm in the middle of an interaction.

For example, sending a text on Siri.

"Hey Siri text my wife. Tell her I'm on my way"

Siri shows me the text, and then I have to say "Hey Siri, send it" again,
when just "Send it" should do.

~~~
jagged-chisel
Trying your example starts an interaction with Siri for me. Siri verifies the
message content, asks if I’m ready to send it, and listens for my reply.

Using AirPods Pro is supposed to allow somewhat reduced vocal utterances to
Siri, but I haven’t spent time learning that.

------
lkrubner
A few of my own thoughts from when I was working on this full-time:

"Dialogue designers replace graphic designers when creating voice interfaces"

[http://www.smashcompany.com/technology/script-designers-repl...](http://www.smashcompany.com/technology/script-designers-replace-graphic-designers-when-creating-voice-interfaces)

------
earthboundkid
I'm surprised neither Google nor Apple has taken voice reading of webpages
out of the accessibility ghetto and made it a full-fledged interface yet. Why
can't I walk the dog and say "hey dingus, read Hacker News" and get a decent
experience without having to look at my phone or put it in a voice-UI-only
mode?

------
Nimitz14
Great blog post, thanks for sharing. I'm hoping the next 5 years will bring
enough improvement in accuracy that users will feel more comfortable using
voice interfaces. We'll see. It does seem to me that as long as it can only
be used for commands, it will never take off. But being able to handle
conversations consistently is a very hard problem.

------
david_draco
Making voice input like talking to a person is like trying to make a GUI
similar to an oven interface.

Don't try to use full sentences to command a computer. People are happy to
learn a few words if accuracy increases to >99% and it is more convenient.

Also, incorporate humming in tones.

------
nshm
The most important point, which the Firefox people forget: it is better to be
a good browser than a bad Alexa.

------
k__
I think all this AI is overkill.

If these things were more active, they would work much better.

