
Speech Is 3x Faster Than Typing for English and Mandarin on Mobile Devices - rezist808
http://hci.stanford.edu/research/speech/
======
gok
(disclaimer: I work on speech recognition for mobile devices)

The experiment setup is a little weird here. The participants were given a set
of pre-created phrases to type or speak. At least for English, the phrase set
[1] contains utterances like

"circumstances are unacceptable"

…which contains rare but in-vocabulary words. That makes it hard for keyboard
input (hard for humans to spell long words, hard for touch input to predict
unlikely words) but very easy for speech recognition (no other word sounds
like "circumstances"). And that test set is so old that it's very likely that
the speech recognizer from the experiment (or any state of the art speech
recognizer) has already been trained on those sentences.

The utterances being pre-selected is also unfortunate. When users are given a
sentence to speak ahead of time, they tend not to hesitate or stutter. They
also speak faster than when they're trying to think of something to say on the
fly, which is more typical of text input on mobile devices.

All that being said, it's certainly true that you can often input text very
quickly with speech recognition, and it's getting better every day. :)

[1]
[http://www.yorku.ca/mack/PhraseSets.zip](http://www.yorku.ca/mack/PhraseSets.zip)

~~~
kentlyons
(similar disclaimer: I've done a bunch of text input work for mobile)

That phrase set was explicitly designed for text entry (typing) experiments.
While not optimal, it does allow for more direct comparison to a large body of
previous studies using the same phrase set (and similar procedures).

Having said that, keyboard input methods that provide suggestions/corrections
at the multi-character or word level probably also benefit from longer words,
since the recognizer has more signal to work with. Revisiting the assumptions
that went into that initial phrase set (character-at-a-time input) in light of
modern text input techniques might be a good thing.
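For reference, text-entry studies built on this phrase set conventionally report speed in words per minute, where a "word" is any five characters (spaces included) and the first character is not counted because timing starts when it is entered. Whether this particular study computed WPM exactly this way is an assumption, and `words_per_minute` is a hypothetical helper written only to illustrate the convention:

```python
def words_per_minute(transcribed: str, seconds: float) -> float:
    """WPM under the common text-entry convention (1 word = 5 characters).

    One character is dropped because timing usually starts when the first
    character is entered, so that character takes no measured time.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    chars_timed = max(len(transcribed) - 1, 0)
    return chars_timed / seconds * 60 / 5


# "circumstances are unacceptable" is 30 characters long.
print(round(words_per_minute("circumstances are unacceptable", 10.0), 1))  # 34.8
```

Because the unit is characters rather than dictionary words, a long word like "circumstances" counts as more than two and a half "words" regardless of how it was entered.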

~~~
kbutler
> That phrase set was explicitly designed for text entry (typing) experiments.

I'm sure that set was designed for <del>typing on full-sized, physical
keyboards,</del> not touch-screen mobile devices. (Thanks for the correction!)

Also, even though speech is faster than typing on a touch-screen mobile
device, the errors that inevitably happen are a lot easier to correct by
typing.

Sometimes it's impossible to verbally correct errors or enter unrecognized
words (or names!).

~~~
kentlyons
It does predate capacitive touchscreen keyboards. However, it wasn't made
specifically for full-size physical keyboards; they explicitly had soft
keyboards in mind. Here is the paper that introduced the phrase set [1], and it
refers to a soft keyboard they made previously [2].

[1]
[http://www.yorku.ca/mack/chi03b.html](http://www.yorku.ca/mack/chi03b.html)
[2]
[http://www.yorku.ca/mack/p25-mackenzie.pdf](http://www.yorku.ca/mack/p25-mackenzie.pdf)

------
eknkc
I couldn't make a habit of using speech recognition. It somehow feels weird in
public. You need to speak like a news anchor, hold the phone in an unnatural
way (never liked video calls due to that too) and keep an eye on the
recognition so it's more effort than just instinctively typing something.

Am I getting old or is this common?

~~~
drzaiusapelord
I'm getting comfortable with talking to my watch. It's pretty discreet if you
do it correctly. The mics on these things are very sensitive, so you don't
have to yell. In fact, whispering works just fine. The problem is that no one
tells people this, and as we have seen from the popularity of Siri and other
tech, people just automatically start yelling commands like they're using a
computer from a 1950s sci-fi movie.

Even in places with loud background noise, it's a non-issue, considering
everything nowadays has two mics for noise reduction. I'm very, very surprised
at how well Google's speech-to-text works. When it fails, it's almost always
because I have a poor signal from T-Mobile or am saying something that's just
too difficult for machines to parse correctly.

~~~
TheOneTrueKyle
This has been my problem with talking to the watch or phone. It is usually the
internet in the area that is the problem.

Also, one thing that is relatively hard for voice recognition to handle is
switching between two languages. I am always cooking, and words like dashi,
kombu, and gnocchi are hard to parse. Not every use of a word from another
language means "translate X into English."

------
jbombadil
Based on what they show on the video alone, I feel this study is unfair. They
are comparing "one of the best commercial speech recognizers out there" and in
absolutely ideal conditions (no external noise, no echo etc.) against a normal
on screen keyboard with no prediction enabled. I'm no pro and I can type much
faster than that with SwiftKey.

I'm not saying the study doesn't have merit, but "Speech _Is_ 3x Faster than
Typing for English and Mandarin Text Entry on Mobile Devices" sounds a bit of
a stretch.

~~~
treehau5
Agreed. I type 3-4x faster than most using Swype on Samsung phones. I have
become so used to it I can also type without looking at the screen in most
cases.

~~~
lqdc13
To provide a counterexample, I definitely type fewer than 50 wpm (their
median) on a mobile device. More like 10, especially if there are symbols
involved that are not on the first two screens. On a BlackBerry, maybe 50.

On the other hand, speech is a lot worse. A lot of times you can't afford a
10%, 5%, or even 1% error rate, since messages are usually short and the
intended meaning can't easily be inferred. So my WPM, once I account for
correcting speech errors with Swype, is under 10.
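The arithmetic behind this can be sketched simply: effective speed is message length divided by dictation time plus correction time. The numbers below are hypothetical, chosen only to show why one or two errors hurt a short message so much, and `effective_wpm` is an illustrative helper, not something from the study:

```python
def effective_wpm(words: int, entry_seconds: float,
                  errors: int, seconds_per_fix: float) -> float:
    """Words per minute once time spent fixing recognition errors is counted."""
    total_seconds = entry_seconds + errors * seconds_per_fix
    return words * 60 / total_seconds


# Hypothetical 10-word message dictated in 4 seconds.
print(effective_wpm(10, 4.0, errors=0, seconds_per_fix=0.0))   # 150.0
print(effective_wpm(10, 4.0, errors=1, seconds_per_fix=20.0))  # 25.0
print(effective_wpm(10, 4.0, errors=2, seconds_per_fix=30.0))  # 9.375
```

For a short message the fixed cost of each correction dominates, so the raw dictation speed barely matters once any error occurs.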

------
anonova
It's silly they only compared it to a normal on-screen keyboard and not any
other input methods, predictive or not (Swype, etc.).

It should also be noted that their speech tests were done in a controlled,
silent environment. I'd expect the error rate and the time to complete a phrase
to increase dramatically in a noisy room.

~~~
caffinatedmonk
They should have tried it with a full size physical keyboard and larger on
screen keyboards too.

------
Piskvorrr
What this says is "software keyboards suck." That's the elephant in the room.
"Better than utter crap" does not mean "wonderful." I do wonder how the SR
test would stack up against a hardware keyboard - a device-sized one, and a
full-scale one.

(Well of course on-screen keyboards suck: they're a skeuomorphic ugly hack
that has been bolted onto a touchscreen. With a slideout hardware QWERTY
keyboard on an Xperia Pro, I was typing slower than on a full-sized kb, but
still several times faster than any onscreen input - predictive or not, swipe
or not.)

~~~
pluma
There's zero tactile feedback on software keyboards and the way most people
are using them is basically the good old "hunt and peck" style that's the
least efficient way to use a hardware keyboard. I'm not sure the results would
be any better or worse on a desktop on-screen keyboard with mouse input.

------
minouye
Speech-to-text seems like a technology that suffers from the 9x effect[1].
Creators overvalue the impact of voice transcription, and users overvalue
their existing input options.

Even for the desktop, speech will be roughly 2X faster than typing, but I have
no desire to buy a copy of Dragon because I like/overvalue my keyboard.

[1] - [https://hbr.org/2006/06/eager-sellers-and-stony-buyers-under...](https://hbr.org/2006/06/eager-sellers-and-stony-buyers-understanding-the-psychology-of-new-product-adoption)

~~~
ryao
Next someone is going to suggest that we all learn to program by voice:

[https://www.extrahop.com/community/blog/2014/programming-by-...](https://www.extrahop.com/community/blog/2014/programming-by-voice-staying-productive-without-harming-yourself/)

I cannot find the article at the moment, but one was discussed on Hacker News
recently that mentioned a privacy device for telephones from before the 1950s
that achieved the same effect as cupping your hands over the receiver, so that
others in the room could not hear what you were saying into it. If programming
by voice takes off, I imagine such a thing would be a necessity to keep office
environments sane. The same goes for regular text input by voice.

~~~
schoen
I'm sure the device you're thinking of was the Hush-a-Phone, which was also
important for competition policy history.

[https://en.wikipedia.org/wiki/Hush-A-Phone_Corp._v._United_S...](https://en.wikipedia.org/wiki/Hush-A-Phone_Corp._v._United_States)

------
jpalomaki
With speech, the main problem (I have found) is that mistakes are difficult to
correct.

I haven't yet seen an input system that combines speech with touch in a nice
way.

~~~
rossjudson
Don't correct. Just keep talking.

I found speech recognition to be useful mostly for brain dumps. I've found I
tend to think best when explaining or talking. I used to bring along a voice
recorder on long drives, capture what I was thinking, then run it through
voice recognition later. It was often a garbled mess, but usefully captured a
_lot_ of thinking. Modern recognition systems would do a lot better. Yeah, I
should try that again.

~~~
jpalomaki
This is a good point. I've been using the same strategy for taking notes in
meetings where there is a lot of talk: just write stuff down without
concentrating on format or content, then go through the notes after the
meeting and clarify. Trouble is, writing even quick notes is difficult if you
also need to talk and think at the same time.

Probably you could build some neat solution around this, like a
speakerphone-style device with an array of microphones to make it easier to
identify who is speaking and pick up the words. Or maybe a regular smartphone
is enough. The device/app would then just transcribe what is spoken and
annotate it with names.

Compared to audio recording, the benefit would be that going through the
written raw stuff is much faster, and if you were present, you can probably
recall the stuff even if the transcription is not perfect. Also, it might be
more socially acceptable to use this solution than to record the meetings.

------
douche
But is it 3x as valuable? Anyone can talk a mile a minute and let the words
flow out of their face-hole as fast as their brain can string words together.

Slowing down to think about and review what you're communicating is a feature,
not a bug, of text.

~~~
oopsies49
I think it's pretty valuable. I would rather talk to my phone to send a text
message while driving than pick it up and cause an accident.

"Ok Google, text Jim Traffic is bad, I'll be late"

~~~
emodendroket
While voice commands may be better than using the keyboard, they're still
distracting and dangerous. It is best to avoid interacting with your phone at
all while you drive.

------
1812Overture
I don't stink that speech rank addition is completely ready to real place tie
pin. To mulch time is spent core wrecking what you rowed.

~~~
schoen
This reminds me of D. J. Enright's poem "The Typewriter Revolution":

[https://trialbysteam.com/2010/03/09/d-j-enright-the-typewrit...](https://trialbysteam.com/2010/03/09/d-j-enright-the-typewriter-revolution/)

(I think people in the comments there missed and/or misunderstood some of the
poem's references, a few of which are scatological.)

------
nibs
I think a world of all-speech interfaces would be flawed (if that is the
natural conclusion of this line of thinking). I may be able to speak 3x faster
than I type, but I read 10x faster than I can listen to someone speak.
Speech-to-text is good for typing, but to me that doesn't mean it should
supersede visuals.

~~~
S_Daedalus
Why would embracing one destroy the other? I think a combination of gestural
interface, subvocalization voice recognition, and AR could be the big winner
in our lifetimes. The text won't be bound to a screen, you don't give anything
up, you just gain.

When you need to get into serious writing or bulk data entry, maybe it would
be a keyboard.

------
willvarfar
Soldiers often use throat microphones, so they can speak so quietly it's
silent.

I wonder if a phone's front-facing camera could capture your throat in enough
detail to decipher what you say, even if you speak silently, when you hold the
phone in your palm as people do when browsing?

------
nathan_f77
Here's something positive: This kind of research might indirectly make people
healthier and save a lot of lives.

Calorie tracking is a proven way to lose weight, but all the tracking apps
that I tried feel like they take too much effort. So I came up with an idea a
few years ago: you should be able to just say out loud what you've eaten, and
then use speech-to-text and NLP to search for each item and count up the
calories.
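A toy sketch of that idea, assuming the speech-to-text step has already produced a plain transcript: split it into items, read a leading quantity, and look each item up in a table. The calorie figures are placeholders rather than nutrition data, the naive plural stripping is an assumption, and nothing here reflects how any real tracking app works:

```python
import re

# Placeholder calories per unit: illustrative values only, not nutrition data.
CALORIES_PER_UNIT = {"egg": 70, "slice of toast": 80, "banana": 100, "coffee": 5}
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}


def parse_meal(transcript: str) -> list[tuple[int, str]]:
    """Split e.g. 'two eggs and a banana' into (quantity, item) pairs."""
    items = []
    for chunk in re.split(r",|\band\b", transcript.lower()):
        words = chunk.split()
        if not words:
            continue
        quantity = 1
        if words[0].isdigit():
            quantity, words = int(words[0]), words[1:]
        elif words[0] in NUMBER_WORDS:
            quantity, words = NUMBER_WORDS[words[0]], words[1:]
        name = " ".join(words)
        if name not in CALORIES_PER_UNIT:
            # Naive plural stripping on the first word: 'slices of toast' -> 'slice of toast'.
            name = re.sub(r"^(\w+)s\b", r"\1", name)
        items.append((quantity, name))
    return items


def total_calories(transcript: str) -> int:
    """Sum calories for recognized items; unknown items count as 0."""
    return sum(q * CALORIES_PER_UNIT.get(name, 0) for q, name in parse_meal(transcript))


meal = "Two eggs, a banana and coffee"
print(parse_meal(meal))      # [(2, 'egg'), (1, 'banana'), (1, 'coffee')]
print(total_calories(meal))  # 245
```

The free-form step is the part that saves time: the user never fills in separate quantity and item fields.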

I never got around to building that, but these guys did:
[https://www.nutritionix.com/app](https://www.nutritionix.com/app)

It works amazingly well. Not only is speech 3x faster than typing, it's also
much faster to have a free-form text field that is automatically parsed. And
they integrate with Amazon Echo, which I look forward to trying out.

I've been thinking about many other ways to automate calorie tracking. For a
while I thought the answer would be an AI that recognizes photos of food, but
that doesn't feel important any more. I think speech-to-text takes
approximately the same amount of time as opening the camera and snapping a
photo. There have been a few minor errors with Apple's voice dictation, but
so far I've seen 100% accuracy from the actual text searches.

So anyway, speech-to-text research. It's all important.

~~~
r-w
Although I kind of wish Apple hadn’t baked speech recognition into their
parsing algorithm. Not everything I say will be in their dictionary, and there
are times I’d rather just be able to type the query in myself than have Siri
misunderstand it and then have to manually retype the parts that were
misinterpreted.

------
AIMunchkin
One of the amazingly dumb things about my Nexus 6 is that while it's
listening for my voice, it still plays incoming SMS tones, which it then picks
up as part of the speech input, messing up the recognition. Why? Really, just
why?

I end up having to correct it with typing anyway.

------
contingencies
I am a native English speaker and a second language Mandarin speaker who
learned at early adult age.

I personally consider typing English and Mandarin to be very different. There
are linguistic, psychological, and cultural issues at play.

Firstly, the dominant Chinese input system is phonetic ( _pinyin_ ), which
perhaps implies some kind of different mental state when typing.

Secondly, it is the case for adult learners like me, but reportedly also for
many native speakers, that precise Chinese characters are easy to forget.
People have visual memories of 3,000 to 10,000 characters, but may be able to
write confidently as few as half of them from memory. The phonetic input system
presents a context-based shortlist of suggested characters, and the user is
asked to select the one they intended.
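A rough sketch of how such a shortlist might be ranked, assuming a tiny lexicon keyed by toneless pinyin plus made-up frequency counts. The three readings are real (时间 shíjiān "time", 事件 shìjiàn "event", 实践 shíjiàn "practice"), but the counts, the `CONTEXT_WEIGHT` value, and the scoring rule are invented for illustration; production input methods use much richer context models:

```python
# Toy lexicon keyed by toneless pinyin. The readings are real, but the
# counts below are invented for illustration.
LEXICON = {"shijian": ["时间", "事件", "实践"]}  # time, event, practice
UNIGRAM_COUNT = {"时间": 50, "事件": 20, "实践": 10}
BIGRAM_COUNT = {("没有", "时间"): 30, ("这个", "事件"): 15}
CONTEXT_WEIGHT = 10  # how strongly the previous word outweighs raw frequency


def shortlist(pinyin: str, previous_word: str = "") -> list[str]:
    """Candidates for a pinyin entry, best guess first; the user picks one."""
    def score(word: str) -> int:
        context = BIGRAM_COUNT.get((previous_word, word), 0)
        return CONTEXT_WEIGHT * context + UNIGRAM_COUNT.get(word, 0)

    return sorted(LEXICON.get(pinyin, []), key=score, reverse=True)


print(shortlist("shijian"))                        # ['时间', '事件', '实践']
print(shortlist("shijian", previous_word="这个"))  # ['事件', '时间', '实践']
```

The same keystrokes produce a different first choice depending on the preceding word (这个, "this"), which is why the user still has to recognize and confirm the intended character.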

Sometimes, in more extreme cases, particularly for native speakers with heavy
accents or new second language speakers, users may be unsure which character
to select or may even input an incorrect but close phoneme, scan for visual
recognition of the correct character, fail to find it, then type a different
phoneme.

Frequently, typing Chinese is the only major creative interaction that
Mandarin speakers have with modern Chinese text, since writing is becoming
increasingly rare outside of a school or government-form context.

------
imjustsaying
Yeah and it's even slower to listen to. I still have a 50+ second long
recording from someone that I haven't bothered to listen to, because if
they're too lazy to type it out, why should I be expected to wait nearly a
minute to listen to it?

------
brokenmachine
Whenever I try to use speech recognition on my Android phone, it's frustrating
because when it recognizes the wrong word, there's no way (that I know of) to
delete a word or move the cursor.

Also when I try to put a period in (we call them "full stops" in Australia),
half the time it inserts a period, and the other half it literally writes
"period".

I would like to see some documentation on how to use it better. I had a go at
googling for some a while back but couldn't find anything useful, and I ended
up with the impression that such editing commands weren't implemented.

I still have a go every now and then to see if it's improved with updates, but
it's basically unusable in its current state.
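For illustration, here is a minimal sketch of the kind of post-processing being asked for: scan the transcript for spoken punctuation commands and replace them with marks. The command vocabulary (including "full stop") is a hypothetical choice, and this says nothing about how Android's dictation is actually implemented:

```python
# Hypothetical command vocabulary; real dictation engines define their own.
CLOSING = {"period": ".", "full stop": ".", "comma": ",",
           "question mark": "?", "close bracket": ")"}
OPENING = {"open bracket": "("}


def match_command(words: list[str], i: int) -> tuple[str, str, int]:
    """Return (kind, mark, words_used) for a command at words[i],
    otherwise ("", "", 1) so the caller copies one plain word."""
    for n in (2, 1):  # prefer two-word commands such as "full stop"
        if i + n > len(words):
            continue
        phrase = " ".join(words[i:i + n]).lower()
        if phrase in CLOSING:
            return "close", CLOSING[phrase], n
        if phrase in OPENING:
            return "open", OPENING[phrase], n
    return "", "", 1


def apply_spoken_punctuation(transcript: str) -> str:
    words = transcript.split()
    out = ""
    glue = True  # True when the next token attaches without a space
    i = 0
    while i < len(words):
        kind, mark, used = match_command(words, i)
        if kind == "close":
            out += mark                      # ".", ",", ")" hug the previous word
            glue = False
        elif kind == "open":
            out += ("" if glue else " ") + mark
            glue = True                      # the next word hugs "("
        else:
            out += ("" if glue else " ") + words[i]
            glue = False
        i += used
    return out


print(apply_spoken_punctuation(
    "I'll be late open bracket traffic close bracket period"))
# I'll be late (traffic).
```

Treating every "period" as a command means the literal word can never be dictated; a real engine has to guess from context, which is where the inconsistent behavior comes from.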

------
jwtadvice
Unfortunately, speech recognition has been used for mass surveillance and is
likely to be abused in the future. There needs to be a hardware control on
microphones that lets the user control whether they are being listened to.

~~~
skoocda
I agree 100%. Right now we don't even know the number of microphones our
devices have, much less the contexts in which they may be active. Speech
recognition protocols should also integrate low-level encryption standards to
ensure access is only provided to trusted parties.

------
Houshalter
Is this surprising to anyone? I can speak way faster than I can type even
under the best of conditions, let alone on a shitty mobile keyboard. I'm
surprised it's _only_ a factor of 3. I tend to avoid writing stuff on mobile
and wait until I get home, because it just feels so painfully slow compared to
a real keyboard.

I see a bunch of comments disputing the result. To those people, do you really
think that typing on mobile is as fast or faster than speaking?

~~~
ramblerman
Nobody is disputing that speaking is faster.

It's comparing the two as viable inputs on a smartphone. Speech recognition is
far from perfect and doesn't yet allow you to talk like you would to a human.
Did you read the article?

------
flukus
I can type anywhere but there are a lot of places where speech is
impossible/inconvenient. I wouldn't want to work in an office with everyone
shouting commands at their computer. Audio feedback doesn't work if I'm
listening to music from another device. The computer can't hear me if I'm
playing music, etc.

------
nitwit005
I have to type a single number into Google Maps for it to auto-complete to my
home address. That's not really a beatable speed.

And there is the tiny problem that I can't even pronounce some of the things
I've typed into Google. Examples include: Japanese manga names, company names,
and foreign names from news articles.

~~~
monsieurbanana
Those are specific needs, speech recognition isn't supposed to be a 100%
replacement for a keyboard.

------
spatten
From the paper, the IST on the graph stands for "Initial Speech
Transcription", and those data points are for the speech-input text before any
corrections were made. The other "Speech" data-points include time to make
corrections either by using the keyboard or using speech recognition.

------
teddyh
Tap-type vs. Swype vs. Speech Recognition:

[http://www.wastedtalent.ca/comic/text-what-i-mean-not-what-i...](http://www.wastedtalent.ca/comic/text-what-i-mean-not-what-i-swype)

------
siliconc0w
Error correction is more intuitive for typing - typing errors are usually
obvious transpositions or missing characters which our brains easily correct.
When a speech recognition engine makes a mistake, it's less obvious to the
user and usually more jarring: the word used is an actual word, just the wrong
one in context. This makes it more difficult for the user to decipher. So even
if speech is more accurate or faster, the cost of inaccuracy seems higher.

------
aaron695
People will never do it, it's like saying control your phone with your
genitalia in public. I don't think it'll break social bounds before better
methods are available.

But at the end of the day, typing speed doesn't matter. It's not what's
limiting us (in English, at least; not necessarily in all languages).

But I think many people equate speech recognition with language parsing, which
is why people seem to be obsessed with it.

------
minikomi
As an aside, I use 9-key swipe input for Japanese and it's really great once
you're used to it. I feel like there's still some way to input English we
haven't found yet that really suits the language.

[https://www.youtube.com/watch?v=ClDenxOxeeM](https://www.youtube.com/watch?v=ClDenxOxeeM)

~~~
ojii
You can get it for your PC too
[https://www.google.co.jp/ime/furikku/](https://www.google.co.jp/ime/furikku/).
Very good for coding.

------
voltagex_
I once saw someone using a "radial" keyboard on an Android device.
Unfortunately at the time I saw it, the company had vanished.

I'm using SwiftKey (the non-online version), and I can type pretty quickly
after the keyboard has been "trained". I think a combination of prediction and
a better keyboard layout would help input speed quite a lot.

------
pbhjpbhj
I just tried Google voice and was surprised hour fast it was. The only problem
for me read with punctuation. It always seems to type the word rather than
using the correct punctuation mark.

So far this has taken me 1.25 minutes using swipe input on Google b keyboard.
Couple of errors and issues (probably because i use my fat thumbs for entry).

[3:07 minutes.]

~~~
pbhjpbhj
I I just trying to Google Voice and was surprised how fast it was. The only
problem for me was with punctuation. It always seems to type the word, rather
than using the correct punctuation mark. So far this has taken me 37 seconds
using voice input from Google keyboard period couple of errors and issues open
bracket probably because I use my fat from the entry for my voice close
bracket period

[1:15 minutes]

------
f_allwein
I use Swype, which is significantly faster than typing - maybe a good
alternative if you don't want to speak to your phone in public.

[http://www.swype.com](http://www.swype.com)

Then I guess we'll get used to people speaking to their phones very soon, just
as we got used to people talking on mobile phones back in the day.

~~~
montibbalt
I'll have to try when I get home, but I'm curious how speech recognition will
do against the message chosen for the touchscreen record. I'm not 100% sure
how to even pronounce a couple of the words:
[http://www.guinnessworldrecords.com/news/2014/5/fastest-touc...](http://www.guinnessworldrecords.com/news/2014/5/fastest-touch-screen-text-message-record-officially-broken-with-fleksy-keyboard-57380/)

------
samfisher83
I wonder what the result would be with a normal keyboard. I think I can type
faster than I can talk. I never learned the keyboard, but my fingers know where
all the keys are. I can't tell you the order of the letters on a QWERTY
keyboard, but my brain knows where all the keys are.

~~~
skoocda
You can't type faster than you can talk. Trained stenotypists come close, but
there's nobody who can consistently hit 130-150 wpm on a QWERTY or Dvorak
keyboard.

~~~
fsiefken
You can - if 150 wpm is what you speak, with Plover and training you can reach
120-225 wpm on a qwerty keyboard with NKRO capability. But it's indeed steno
(on a regular keyboard). With regular strokes and autocomplete you might reach
70 wpm or more [https://www.youtube.com/watch?v=Wpv-Qb-dB6g](https://www.youtube.com/watch?v=Wpv-Qb-dB6g)

------
ryan-allen
When I'm at home or in my car I use speech to write text messages and perform
google searches all the time. It's so much more convenient now that the voice
recognition is so reliable these days :)

------
a_c
It would be great if I could register a pronunciation for a certain word,
regardless of language, e.g. pawned => pwned, snafu => snafu, or a
pronunciation from one language mapped to a foreign word.

------
dboreham
Hmm...for me typing AND speech recognition on mobile devices are horrible. To
know that one is slightly more horrid than the other is small consolation.

------
neves
I'd like to know how it compares with swipe methods. I feel faster with
swipe, but don't know if I really am.

------
ww520
Dictation-style input would be great in situations where hand dexterity is
limited, like in a car.

------
NicoJuicy
I don't type, I swipe... How much faster would speech be then?

------
dominotw
What about emoji, GIFs, pics, hyperlinks? Texting is way more efficient than
talking.

~~~
tree_of_item
Talking can be used to send texts. People really don't seem to be getting
this.

