
Voice Interfaces
http://dcurt.is/voice-interfaces
======
untog
_When I open to the home screen on my phone, I should be able to just say
“Instagram” and have that app open._

Is it me, or is the benefit of that extremely small? I can move my finger and
tap the 'Instagram' icon quicker than I can say "Instagram". Never mind that
the app itself is highly visual, so there's no point launching it without
already looking at the screen and holding the phone in your hand.

 _When I’m inputting my home address in a web browser (on mobile or desktop),
I should be able to tap the “State” dropdown and just say “California” and
have it select that option for me._

Or it should just already know your home address. That's why I don't think
Google Now should be included in this criticism - the idea of it is to preempt
any need for you to issue voice commands, because it's already presented you
with the information you need.

 _On iOS, when I get a notification that covers the top of the screen, I
should be able to just say “ignore” and have the notification instantly
disappear._

I really don't think voice is a great interface at all, because it involves
invading other people's personal space. I don't want to be sat on the train
with people saying "Instagram", "Facebook", "ignore", "ignore" all around me. I
don't see what's so bad about the current interfaces, nor do I see anything in
this post that is particularly new or insightful.

~~~
haroldp
> Is it me, or is the benefit of that extremely small?

It's not even a small benefit. It's a horrible interface. Streets and offices
filled with people jabbering commands at machines would be just horrible.

It's a gimmick to sell phones that looks cool in a demo and grows old after a
little actual use.

~~~
mtrimpe
_Right now_ it's probably not worth it yet unless you're already using
speech dictation for a disability (including RSI and the like).

Once we get eye tracking by default on both desktop and AR there will be
nothing more natural than the look-and-speak interface though...

But yes; ideally we would quickly get sub-vocalization detection too so that
we can address your noisiness concern as well.

------
kajecounterhack
I think his post demonstrates a poor understanding of the state of voice
technology.

> That being said, current voice recognition technology is incredibly good at
> certain things. It’s great at detecting and transcribing words, listening
> for specific commands, and making matches against expected inputs.

Current state of the art is not actually great at transcription or detection.
In fact, only Google's and Apple's algos are any good, and they both involve
sending all your voice recordings to their servers, where huge models are used
to transcribe -- meaning lag time, and it's not efficient to always be sending
every minute of audio to them. If you want to do continuous listening or use
local voice transcription, it's possible, but you are limited by your hardware.

Listening for activation words is also a hard problem. In fact, the Moto X has
a chip designed just to listen for "Ok Google", so you can see some of the
problems: 1. it uses lots of power, 2. it has no flexibility.
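An editorial aside: the always-on constraint can be sketched in miniature. A hotword spotter has to evaluate every frame of input, forever, which is where the power cost comes from. The toy below matches over a text token stream purely for illustration; a real spotter like the Moto X's runs over acoustic features in dedicated hardware.

```python
from collections import deque

def spot_hotword(token_stream, hotword=("ok", "google")):
    """Yield the index at which the hotword phrase completes.

    The matcher inspects every incoming token -- the streaming analogue
    of why an always-on spotter costs power. It is also inflexible:
    the phrase is baked into the comparison."""
    window = deque(maxlen=len(hotword))
    for i, token in enumerate(token_stream):
        window.append(token.lower())
        if tuple(window) == hotword:
            yield i
```

For example, `list(spot_hotword("hey ok google open maps".split()))` yields `[2]`, the position where the phrase completes.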

Other examples he gives assume the phone won't pick up outside noise or
will differentiate between voices. This is unfortunately not the case -- you
could maybe apply training and build a custom voice model, but it's still just
an unrealistic idea at this point.

> The reason current voice interfaces suck is because they force the speaker
> to consciously enter a “voice” mode and then create context around the
> action they want the computer to perform. This makes no sense; the computer
> should just always be listening for potential commands within the context of
> whatever the user is doing.

Yes, this is true. But it's all a side effect of power, bandwidth, and
processing-power constraints given today's algos / models. Today's voice
processing algos rely on lots of data. And there's also the fact that voices
aren't differentiated right now, and it'd be havoc if everyone were setting off
each other's phones.

~~~
eghad
Your post is a bit ironic, seeing as the Moto X does differentiate between
voices (if you're good at imitating someone's voice you can set it off, but
sending a command is unlikely) and the power usage is not as bad as you assume
(it's got average battery life but can easily make it through the day). But I
will agree that "Ok Google Now" being inflexible is a niggling point.

The new Moto X is always listening as well, but allows you to change the
trigger phrase via some magic they've worked out.

~~~
kajecounterhack
I see, thanks. I didn't know these things about the Moto X. It's really a
spectacular feat of engineering!

But in the broader picture my point stands: he still made a number of false
assumptions regarding voice tech ("right now it's good at x y z," and it's not
by any means). It's as if he were imagining today's voice tech to be at the
equivalent of capacitive touch screens when in reality it isn't -- those aren't
quite invented yet, and our current level of tech is still resistive. So of
course sensitivity is poor and multi-touch hasn't been implemented. It's not a
design oversight, it's a technological limitation.

This isn't to say things aren't changing -- that's what the Moto X represents,
improvement on the bleeding edge :)

------
justinsb
I think this just shows that Siri has fallen behind Android. On Android:

"Who invented the light bulb?": lists the 3 inventors

"Open web-page The Economist": goes to theeconomist.com

"Launch Instagram": launches instagram app

"Send email to <person>": starts an email to <person>

There are lots more, though you do have to guess/remember the magic words if
you don't want to do a web-search. And yes, it is annoying that you can't
enter California in a drop-down by voice (but really, your browser should be
auto-completing that for you anyway). But the hard problems have been solved,
and I see a bright future here.

~~~
refulgentis
With the exception of the light bulb query, these work just fine with iOS too
– the thrust of his post is that these actions should be available without
having to specifically request a voice interaction mode to be turned on, which
Android requires as well.

~~~
justinsb
I certainly take your point; that would be awesome. And thanks for the iOS
information!

I was more confused by the original article, though; he seemed to be
suggesting that you would click on a particular UI component first: Want to
open a webpage? Go to the home screen, then say "Web Browser", then click on
the browser bar, then say 'The Economist'.

But why touch on small areas of the screen at all, if Siri is just a button-
hold away and takes you right there? I think the take-home message is that
many of those UI components could go away - we don't need a browser bar, we
don't need a home screen, if we choose to use voice instead.

~~~
jodrellblank
The whole point of the article is that the button hold is bad because Siri has
no context to understand what you are about to say, and isn't very intelligent,
which means you end up making up weird phrasings to try to indicate that
you want the website "The Economist" to load in a browser, rather than
directions to The Economist office building, or facts about The Economist
website traffic from Wolfram Alpha, or The Economist's latest issue in
iBooks ...

By being in a browser and selecting the URL bar, the context is narrowed from
"anything you can say in English" to "a website, if it's not a website, search
for it".

By dropping a dropdown, the context is "one item from this list".

N.b. he also talks about desktops, not just mobile.
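The narrowing can be made concrete with a toy sketch: once a dropdown is focused, "recognition" reduces to fuzzy-matching the heard word against a short candidate list instead of transcribing open English. This is a hypothetical illustration using Python's difflib; a real system would match on acoustics, not strings.

```python
import difflib

def match_in_context(heard, candidates):
    """Resolve a (possibly misheard) word against a small allowed list.

    With the context narrowed to "one item from this list", even a
    mangled transcription usually lands on the right option."""
    hits = difflib.get_close_matches(
        heard.lower(), [c.lower() for c in candidates], n=1, cutoff=0.6)
    if not hits:
        return None  # not plausibly any allowed option
    for c in candidates:
        if c.lower() == hits[0]:
            return c
```

Even a garbled `match_in_context("californya", ["California", "Colorado", "Connecticut"])` resolves to `"California"`, while unrelated input returns `None` instead of triggering anything.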

------
nateparrott
It's not just voice input that's severely neglected by UI designers—it's text-
based interface in general. Consider his example of choosing "California" from
a list of states—most people would find it way easier to just type "CA" than
to scroll and hunt for "California" in a list—but how often do you see an
interface that prods you to do this, or even makes it apparent that it's
possible? (In some web apps that use custom menus, it isn't!) Why, when I want
to apply a 10px blur in Photoshop, can't I just type "blur 10 pixels" instead
of digging through 2 nested menus and a modal? WIMP interfaces are great for
_discovering_ what an app can do, but terrible at letting you do a specific
thing fast.
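A hybrid text UI of the kind described is mostly parsing. A minimal sketch, with made-up command names (not Photoshop's actual API):

```python
import re

# Hypothetical mapping from typed verbs to operations; illustrative only.
COMMANDS = {
    "blur": "gaussian_blur",
    "sharpen": "unsharp_mask",
}

def parse(command_line):
    """Turn 'blur 10 pixels' into an (operation, amount) pair, or None."""
    m = re.match(r"(\w+)\s+(\d+)\s*(?:px|pixels?)?$", command_line.strip(),
                 re.IGNORECASE)
    if not m:
        return None
    verb, amount = m.group(1).lower(), int(m.group(2))
    op = COMMANDS.get(verb)
    return (op, amount) if op else None
```

`parse("blur 10 pixels")` returns `("gaussian_blur", 10)`; an unknown verb returns `None`, so the command box can fall back to menu search.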

Things like Spotlight and Wolfram Alpha are a step forward, but we're still
light-years behind where we could be if we took text-based UI—or hybrid
WIMP/text UI—seriously.

------
jjcm
I worked with TellMe (a voice reco company) right after they were acquired by
Microsoft. My job was mainly to work on VUIs (voice user interfaces) for the
Windows Phone, Xbox, and Cortana during its early days.

There are two things blocking voice interfaces from being more prevalent. The
first is the action word. As you said yourself, all voice systems today require
the user to enter into a voice mode before actions can be recognized. It's
sadly a necessity, as devices today can't differentiate between when they're
being talked to and when a user is talking to someone else. Even if you only
allow voice commands within a specific context (i.e. an alert pops up, the
user says "ignore", the pop-up goes away), you're playing a very dangerous
game. Take this scenario for instance:

Bob on his computer: "Have you seen this cat video? Look at how the cute kitty
cat will completely..."

Phone: _POP UP WITH ALERT ALL YOUR ICLOUD NUDES ARE BEING STOLEN_

Bob on his computer: "...ignore..."

Phone: _pop up goes away, never seen by Bob (unlike his nudes, which are now
everywhere)_

Bob on his computer "...everything around him! I would never ignore things
that were important."

See the problem there? While the chance of this happening is slim, the effects
of misinterpreting a command can be dire. The potential negatives outweigh the
convenience. Maybe if everyone actively used voice commands, the net sum would
be positive despite it destroying a select few people's lives, but that brings
us to our other problem:

Voice commands are socially looked down upon and are rarely used by most
people aged 25-50.

This group of people have two things in common - they typically know how to
use a keyboard and trust its input, and they've all used the terrible voice
commands from old phone systems in the past, which leads them not to trust
voice. The lack of trust means few people use it, making it seem weird to do
in public. This is getting better, and certainly more people are using voice
commands now than they were five years ago. But it's still an extremely small
percentage of users that are voice power users. Ask yourself this: how many
people do you know that send text messages with voice? What percentage of your
friend group is that? I know two people - one of them was a voice engineer who
had to test voice input on phones constantly.

Voice will get better, but it's going to take time. More people need to use it
before we can take more risks, and we need to develop systems that know when
they're being addressed before we can get rid of action words.
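One way to frame the misfire problem above as policy: context-free voice input is never trusted with irreversible actions. A hypothetical decision rule (the 0.9 threshold is invented purely for illustration, not from any shipping system):

```python
def handle_command(confidence, irreversible):
    """Decide what to do with a possibly-overheard voice command.

    Low-confidence audio is dropped as likely background speech; even a
    high-confidence command can't destroy state without confirmation --
    which is what would have saved Bob's notification."""
    if confidence < 0.9:
        return "IGNORE_INPUT"
    if irreversible:
        return "ASK_CONFIRMATION"
    return "EXECUTE"
```

The trade-off is exactly the one described above: the confirmation step blunts the convenience that made voice attractive in the first place.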

~~~
hnha
Something like this already happens all the time when programs steal focus. I
have no idea how often I've been happily typing only to have some dialog pop
up and accept my Enter keypress as confirmation for whatever it wanted. Things
are better now that I've switched to Linux and enabled all the prevention
measures, but it still happens every now and then. Sometimes I end up typing
my _passwords_ into a thieving application.

------
themodelplumber
> Siri and Google Now are simply not yet ready to exist

After reading the article: Dustin's ideas are simply not ready to exist.

And somehow he hand-waves toward the ever-enlarging group that uses Siri and
Google Now all the time. Sports, weather, directions, etc.

~~~
VLM
"ever-enlarging group that uses Siri and Google Now"

I was curious where we are on the stereotypical fad graph - the upswing, the
plateau of disillusionment, the decline, whatever. Turns out it's VERY hard to
find usage stats.

"In the wild" I've never seen any human being use siri or now, other than
fooling around a couple years back when it was new. My wife and I both have
Now equipped phones and its mostly just an annoyance when it detects an upward
swipe.

Anecdotally I find voice UI much like having printer support on my phone.
Something I imagine would be insanely useful, so of course I have an app that
can print to my printer. What an amazing network of possibilities. But in
practice I never use it. Ever. Once to verify it works. Yup, it works. Never
printed anything again. To some extent this is just "owning a laser printer"
in general, which I really don't need and will not replace when it eventually
breaks.

------
trishume
One source of this problem is that only Google and Nuance have voice
recognition engines that are any good, and they are very closed up (unless you
pay tons of money to Nuance like Apple does for Siri).

Most developers' only option for voice technology is to use Nuance's API,
which requires uploading the voice sample and waiting for a response; this is
nowhere near fast enough for pleasant interaction. Things will only get better
when Apple, Google, and Microsoft open up really high-quality on-device speech
APIs for their operating systems.

~~~
cbr

        In addition to using voice actions to launch
        activities, you can also call the system's
        built-in Speech Recognizer activity to obtain
        speech input from users. This is useful to obtain
        input from users and then process it, such as
        doing a search or sending it as a message.
    

[https://developer.android.com/training/wearables/apps/voice....](https://developer.android.com/training/wearables/apps/voice.html#FreeFormSpeech)

------
matthew-wegner
My personal hunch is that Apple is going to heavily roll out physical/location
context in the next few years. If my friend has a future Apple TV, I should be
able to say "Hey, Siri"* and have Siri recognize that it's me and make my
entire media library available for playback.

This isn't as hard as it sounds, I don't think--Apple only has to search my
friend's contacts' voices, I'm probably on my friend's wifi already, etc.

* iOS 8 includes the ability to say "Hey, Siri" at any time when your iPhone is charging, but I've had crap luck with it during the betas (trying to use it in my car).

------
smacktoward
_> Because they are still so frustratingly limited, Siri and Google Now are
simply not yet ready to exist._

In fairness, Google Now's pitch is a lot more about location-awareness than it
is about speech recognition, innit?

~~~
dragonwriter
> In fairness, Google Now's pitch is a lot more about location-awareness than
> it is about speech recognition, innit

Google Now's pitch is a lot more about all-around awareness -- not just
"location" awareness. Voice isn't really even part of it (the voice actions
are all part of the Google app, which Now integrates, but they aren't really
part of Now -- and actually predate it.)

------
ar7hur
Dustin is right: a truly pleasant user experience would necessarily involve
continuous listening (without a hotword à la "OK Google"), as well as
leveraging the user's context much more than Siri or Google Now do today. See
for instance the vision depicted in the movie Her [1].

Beyond Google or Apple, a few startups (like us at wit.ai) are hard at work
designing building blocks to help developers solve these problems.

[1] [https://wit.ai/blog/2014/02/24/her-the-
movie](https://wit.ai/blog/2014/02/24/her-the-movie)

------
anigbrowl
_When I highlight the browser address bar, I should be able to just say “The
Economist” and have it automatically find the address in my favorites and go
there._

Would be nice, but not as good as 'Computer_name...anything new in the
Economist?' 'There's a provocative analysis of the arms trade and some
particularly egregious punning of the sort you claim to despise.'

I have a marvelous scheme to stimulate market demand for such services, but
this comment is insufficiently well-funded to contain it.

------
mwcampbell
This post is right about one thing: Speech recognition technology is good at
context-specific recognition, i.e. with a small grammar, as in VoiceXML IVR
applications (anyone else remember those?). This has been true on typical PC
hardware since at least 2000, so it should be easy to run that kind of speech
recognition locally on a smartphone.
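The small-grammar point can be sketched: an IVR-style recognizer only has to decide among a handful of phrases, so even crude scoring works. A toy with invented menu phrases (real VoiceXML grammars are XML documents, and matching happens on audio, not text):

```python
# Invented IVR-style menu grammar: phrase -> action.
GRAMMAR = {
    "check balance": "BALANCE",
    "transfer funds": "TRANSFER",
    "speak to an agent": "AGENT",
}

def recognize(utterance):
    """Score each grammar phrase by word overlap and keep the best.

    Tiny vocabularies like this are what made grammar-constrained
    recognition tractable on 2000-era PC hardware."""
    words = set(utterance.lower().split())
    best, best_score = "NOMATCH", 0.0
    for phrase, action in GRAMMAR.items():
        pwords = set(phrase.split())
        score = len(words & pwords) / len(pwords)
        if score >= 0.5 and score > best_score:
            best, best_score = action, score
    return best
```

`recognize("check my balance")` resolves to `"BALANCE"`; anything outside the grammar falls through to `"NOMATCH"`, which an IVR would handle with a re-prompt.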

But, last time I did anything serious with that kind of speech recognition, it
still required a push-to-talk button or the like. Maybe a trigger phrase would
work now.

------
AVTizzle
Of the examples Dustin offers here, the ones that have voice replacing typing
make the most sense:

\- _When I’m inputting my home address in a web browser (on mobile or
desktop), I should be able to tap the “State” dropdown and just say
“California” and have it select that option for me._

\- _When I highlight the browser address bar, I should be able to just say
“The Economist” and have it automatically find the address in my favorites and
go there._ ...

\- _When I click the “To” field in a mail app or in Gmail, I should be able to
just say a person’s name and have it fill in automatically (and maybe show me
a dropdown to select which email address to send to)._

Dustin's examples where voice recognition replaces a series of swipes and/or
taps seem, to me, a bit frivolous:

\- _When I open to the home screen on my phone, I should be able to just say
“Instagram” and have that app open._

\- _On iOS, when I get a notification that covers the top of the screen, I
should be able to just say “ignore” and have the notification instantly
disappear._

I know exactly where Instagram is on my phone. It's a swipe and a tap. And to
ignore a notification on iOS, all one has to do is swipe up on it.

The incremental time spent vs voice control seems negligible, and not
something I'd fret about as a user nor a UX designer.

------
antidaily
_The reason current voice interfaces suck is because they force the speaker to
consciously enter a “voice” mode and then create context around the action
they want the computer to perform_

Isn't Google testing this right now? E.g.
[http://i.imgur.com/fJrQZ0H.jpg](http://i.imgur.com/fJrQZ0H.jpg)

~~~
jon-wood
They still require the utterly ridiculous "Ok, Google" activation though. I
know it's irrational, but I really don't want to have to use phrases I would
never use in day-to-day life to communicate with my computer.

~~~
jeffgreco
Isn't that the whole point of an obscure catchword? To avoid the computer
intercepting normal conversation?

~~~
smacktoward
Sure, but that's just an artifact of the crude capabilities of current speech
recognition. A sufficiently advanced speech recognition engine wouldn't need
you to insert a stopword to know that you're talking to it. It would just...
_know._

~~~
untog
Sometimes I, as a human being, don't know if someone is talking to me unless
they preface their statement with my name. How are computers supposed to be
any better?

~~~
jodrellblank
Long term, because it can monitor way more world state and be designed with as
much processing power as necessary; you have one brain and don't want to spend
100% of your time working out if someone is talking to you.

------
dllthomas
I've given a small amount of thought to attempting a shell designed for spoken
interaction, and also - relatedly - a scripting language designed to be
spoken. I've not really gotten anywhere, though.

This doesn't really sound like the same thing as either of those, but it's
close enough that I wanted to mention it.

------
angersock
So, let's assume that the issue is context (based on discussion, that seems to
be a major consensus point).

Obvious solution is to allow more context to better tune understanding of
voice, right?

Allowing better context is pretty much predicated on constant audio
surveillance.

Is this worth it, truly?

------
jmonegro
You can open apps with Siri on iOS.

------
serve_yay
Okay? Noted? I'm not sure how much there is to respond to here.

