

Whatever Happened to Voice Recognition? - dmytton
http://www.codinghorror.com/blog/2010/06/whatever-happened-to-voice-recognition.html

======
gahahaha
He is wrong about the accuracy of speech recognition. I know he is wrong
because I have actually used Speech Recognition to get real work done. AND to
make it worse I have a crap Norwegian accent.

MAYBE it is as bad as he describes when you do not train the system and just
start speaking. But that is like complaining about Emacs after using it for
only 10 minutes. Retarded.

I wouldn't be able to code with it, but for producing prose in English:
It. Actually. Works.

(I recognize that the problems described in the post are about recognition "in
the wild" and not with one person in one quiet room with one microphone like I
use it)

~~~
WalterGR
_He is wrong about the accuracy of speech recognition. I know he is wrong
because I have actually used Speech Recognition to get real work done._

As have I. The dictation capability _built in_ to Vista / Windows 7 is really
quite amazing. (They both also have voice control.)

I'm an extremely fast typist, but dictating - even with time spent correcting
mistakes - is quite a bit faster.

~~~
ableal
> dictation capability built in to Vista / Windows 7

I'll be ... yep, tucked well away under Accessories/Ease of Access. Who knew?
(Six month old laptop w/ Win7, successor to an XP machine.)

Needs setup, and the so-far-unused built-in mic is not cooperating. It does
say it allows text dictation, besides OS voice control.

~~~
WalterGR
_I'll be ... yep, tucked well away under Accessories/Ease of Access. Who
knew?_

And try this one on for size: Microsoft Office _2003_ came with built-in voice
control and dictation, which could also be used in other applications!

~~~
ableal
Thanks. I was just thinking "why don't they demo this" and then remembered -
they did, back in 2006, and it went embarrassingly wrong:
[http://en.wikipedia.org/wiki/Windows_Speech_Recognition#cite...](http://en.wikipedia.org/wiki/Windows_Speech_Recognition#cite_ref-5)

~~~
WalterGR
Yes, it did. But be careful:

" _In 2007, reports surfaced that Windows Speech Recognition could be used to
remotely access and/or control a user's computer. Theoretically, playing a
pre-recorded message containing Windows Speech Recognition commands could
allow one to execute tasks on another computer remotely._ "

Why, that reads like a Wikipedia article about a Microsoft product!

~~~
ableal
I think that anyone who ever gave a non-rigged demo will feel a painful twinge
of sympathy for the engineer in that one - "There, but for the grace of God,
go I" ...

(And, yes, the "security" issue is pretty far-fetched but is factual, I
remember it being raised.)

------
demallien
The promise of Voice Recognition has never been the actual transcription of
the spoken word - although even that would be quite useful if a high degree of
accuracy is achieved - but rather the idea of controlling a computer by
simply talking to it using natural language. People want the computer to
behave as if it is a colleague, not a machine.

Of course we're a _long_ way from that goal. Firstly, people don't understand
just how difficult it is to express clearly what it is that you want. I have a
good friend who bought an iPhone whilst she was living abroad for 6
months. On returning home, she wanted to be able to add new songs to her
iPhone without losing the stuff already on there. That's how she described the
task to me, as the go-to person for computer problems. So I started asking a
bunch of questions - do you still have the same computer that you were using
abroad? Do you want to copy the music already on the iPhone onto your computer?
What about other data on the iPhone, do you need to recover that too, or can I
blow it away? She got very frustrated with all of these questions and finally
just snapped "Oh, you know what I mean, I just want to be able to use my
iPhone normally, including adding new stuff to it!"

Sigh. She really didn't (doesn't) understand that this just isn't enough
information for me to work with - and that's me, as a walking talking human
being with a strong understanding of what the computer is doing. How is a
computer, with problems in transcribing the spoken words, let alone
understanding the underlying meaning of those words, and not having any idea
of the real world context of the problem, supposed to figure out what she
wanted?

~~~
kristiandupont
There are some cases where this makes sense but there are also a lot of
situations where I want my computer to be just that - not a colleague. Simply
because many of the tasks that I perform on it are trivial and repetitive and
easy to trigger with keystrokes and mouse clicks. It would be annoying to me
and the rest of the office if I was speaking each command.

------
dazzawazza
I worked with a coder about 10 years ago who had been looking into speech
recognition for video games. He came up with some pretty interesting
prototypes that he ultimately had to throw away. His company's legal
department told him speech recognition is a legal minefield with most
algorithms patented and some very trigger happy lawyers defending those
patents.

In the end they went with an inferior solution from one of the big hardware
manufacturers who coincidentally were also mentioned by the legal team as a
patent holder.

So maybe real research is being hampered?

~~~
arethuza
Some games do have voice control - Rainbow Six: Vegas on the Xbox 360 allows
you to give orders to your computer-controlled teammates.

~~~
Tycho
I remember Rainbow 6 III Black Arrow or something had very usable voice
recognition for the solo game. You basically had a few commands like
'regroup,' 'go go go,' 'hold this position' which worked out better than
button commands. It seemed to disappear in subsequent Tom Clancy games but
glad to hear they brought it back for Vegas.

I think the key with speech recognition and natural language processing is to
forget the HAL9000/StarTrek 'hello computer' nonsense, and focus on
constrained domains augmented by other technologies.
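
A rough sketch of what "constrained domain" could mean in practice: rather than trying to understand arbitrary speech, snap whatever the recognizer heard onto a tiny list of known commands. This is only an illustration, not how any Rainbow Six game actually works. The `COMMANDS` list (taken from the examples above), the `match_command` helper and the 0.6 cutoff are all assumptions, and the fuzzy matching is plain string similarity from Python's standard `difflib`, standing in for a real speech engine's output.

```python
import difflib

# Hypothetical command set, borrowed from the examples in the comment above.
COMMANDS = ["regroup", "go go go", "hold this position"]


def match_command(transcript, commands=COMMANDS, cutoff=0.6):
    """Return the known command closest to a noisy transcript, or None.

    `transcript` is assumed to be text already produced by some recognizer;
    this function only does the "constrained domain" part.
    """
    # Normalize case and whitespace so "Regroup " and "regroup" compare equal.
    normalized = " ".join(transcript.lower().split())
    # Keep at most one candidate whose similarity ratio is >= cutoff.
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ["Regroup", "hold this positon", "xyzzy"]:
        print(repr(heard), "->", match_command(heard))
```

With only a handful of valid targets, a slightly garbled transcript still lands on the right command, and anything far from every command is rejected instead of guessed at.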

Touchpad keyboards are blatantly less usable than hard keyboards but by the
time you factor in a) portability and b) auto-completion of words/urls/search-
strings, then they become a more than viable alternative.

People have incredibly adaptive linguistic skills; developers should harness
this more IMO rather than be overly ambitious with NLP.

That said I was very impressed with the accuracy of Google's voice search app.
But like Jeff Atwood says, it's easier just to search normally.

------
robryan
I think we get the impression at times that no matter what there is always an
army of people quietly working away on this stuff and as time passes it will
get better and better.

The article Jeff linked talks about how the performance has pretty much
levelled off except for small incremental improvements and most of the money
and research was shifted away from the area. I guess the trend will continue,
gaining small improvements through applying more and more data as training for
current methods until someone comes up with a completely different approach to
push it forward.

------
garribas
Robert Fortner's article "Rest in Peas: The Unrecognized Death of Speech
Recognition" was discussed recently on HN. It's a must-read for those
interested in speech recognition.

<http://news.ycombinator.com/item?id=1313679>

------
superdavid
So many "new interface" ideas seem to forget that a large percentage of what
people use computers for is in offices, producing documentation. Using your
voice, or spinning 3D interfaces around with your hands, just isn't practical
in an office environment with dozens of people working, where you are
essentially outputting text.

Until that basic reality changes, I don't think mainstream computing needs
will change. Along the same lines, I don't think the reverse is feasible, with
computing interface changes bringing about social changes. The shifts we've
seen from hand-writing to type-writing to computer-based word-processing have
been happening in roughly the same environment for a very long time.

------
iamdave
Not to state that the exception proves the rule, but my company actually makes
use of two racks of IVR servers on a daily basis and we love them. Speeds up
the inbound call process tenfold from 2009.

~~~
CWuestefeld
I despise these things, and I don't know of anyone who disagrees. If you're
seeing an improvement, then I guess most of the world treats them differently
than I do.

As soon as I hear "If you want X, please say...", I start pounding the zero
button. The thing is, relying on buttons is not only faster in itself, but
also doesn't force me through so many "you said 'X'; is that correct?"
questions.

If it were something straightforward I would already have helped myself on
your website. The fact that I'm on the phone means that it's an open-ended
question requiring a real human on the other end (or that your website is
useless).

------
jimmyjim
I think there have been numerous acquisitions of voice/speech-recognition
startups by both Google and Apple
([http://www.enterprisemobiletoday.com/news/article.php/387924...](http://www.enterprisemobiletoday.com/news/article.php/3879241/Apple-Buys-Voice-Recognition-Firm-Siri-Hints-At-Future-iPhone-Mobile-Computing-Functionality.htm)),
because indeed it's very clear that it's going to be a
thing of the future, with NLP.

And we're probably not much into it just because of how impenetrably tortuous
it is. Almost everything related to it is variable -- culture is always
evolving language, slang falls out of use over time, and there are accents to
worry about. On top of all of these problems, there is a parallel demand for
progress in NLP, where really some of the most difficult
challenges lie (
[http://en.wikipedia.org/wiki/Natural_language_processing#Con...](http://en.wikipedia.org/wiki/Natural_language_processing#Concrete_problems)
). With all that said, I have my money down on Google. They seem to be doing a
lot of work in areas where handling this kind of variability is required.
Presently, the voice transcription feature on YouTube seems fairly impressive,
and I've noticed in the past that Google search's NLP abilities are curiously
good, certainly further ahead than any other search engine today.

------
edo
Can anyone tell me why we aren't able to progress into better levels of
recognition? Am I correct to assume it has nothing to do with computing power,
and everything to do with (semantic/linguistic) software?

~~~
compay
It's more than just linguistic software. Our knowledge of linguistics itself
is currently very limited; it's a nascent science and there's still a great
deal of debate about how to even approach the study of language. Even leaving
aside the difficulties in just _transcribing_ speech, linguists are still a
long ways off from coming up with any formalism of human syntax that could
help create software to syntactically parse normal human speech.

~~~
CWuestefeld
...because most of our linguistic formalism is derived from written language,
which is generally an artificial approximation of _real_ language.

Remember grade school, with parts of speech, _Conjunction Junction_ , and
diagramming sentences? Well, think back to the last (non-trivial) sentence you
actually said out loud. I challenge you to try to diagram that sentence.

Although there's quite a lot of research that's been done, surprisingly little
of it deals with the way most of us really communicate.

~~~
compay
Modern linguistic formalism is not derived from written language at all. I
studied at MIT and UConn in the mid-'90s and the separation between
prescriptive grammar and linguistics is pretty clear.

That's not to say that current approaches to linguistics are correct; I don't
know enough to make that judgement. But I think the problem lies more in the
complexity of the domain - not because modern linguists are so foolish as to
base theory on written language.

~~~
CWuestefeld
It's quite possible I'm behind the times.

 _not because modern linguists are so foolish as to base theory on written
language_

That was actually what I was trying to get at. While there are clearly some
rules we follow when speaking, they are much more open than those governing
our writing.

It's odd, though, or at least counterintuitive, that our comprehension of
these less-structured spoken communications is higher than that of
more-structured written ones.

------
ashbrahma
And I thought only us Indians and Chinese were having issues with voice
recognition software..

