

Rest in Peas: The Unrecognized Death of Speech Recognition - robertfortner
http://robertfortner.posterous.com/the-unrecognized-death-of-speech-recognition
Speech recognition seems good on a cell phone but accuracy for conversational speech flatlined in 2001--and no one noticed. Computer understanding of language was supposed to lead to artificial intelligence. Now we need AI to get computers to understand language. Catch-22. We just haven't recognized it yet. (Sorry, Ray Kurzweil.)
======
nostrademons
This comment is voice posted to my nexus 1, without ed its.

I find that the speech recognition on my next 1 is adequate 4 basic search
queries. I tried old freezes listed in the article as search query. Rest in
peace high st correctly. Sb inspiration came out of sudan inspiration. Serve
as the installation, remarkably, king out exactly correct. Saving 1 into the
phone give me a number instead of a word. Saying recognize speech came out
okay.

The problem with speech recognition of long passages things to beat that there
is a large amount of information beyond the worst insults. This looks like
that speak for example. Humans are also very sensitive to misplaced woods.
That would be in the last sentence completely changes the meaning of this. I
also found the speaking twin machine feels very natural. I have to stop and
pause between each sentence because i can't remember what i'm thinking about.

As you can see from descon and, speech recognition has a long way to go to it.
But you can at least sort of get the gist of the conversation.

~~~
nostrademons
I'm going to forget what I actually meant to say above by morning, so here's
the translation typed out:

"This comment is voice-posted from my Nexus One, without edits.

"I find that the speech recognition on my Nexus One is adequate for basic
search queries. I tried all the phrases listed in the article as search
queries. 'Rest in peace' parsed correctly. 'Serve as the inspiration' came out
as 'Sudan inspiration'. 'Serve as the installation', remarkably, came out
exactly correct. Saying 'one' into the phone gave me a number instead of a
word. Saying 'recognize speech' came out okay.

"The problem with speech recognition of long passages seems to be that there
is a large amount of information beyond the words themselves. This looks like
netspeak, for example. Humans are also very sensitive to misplaced words. The
'woods' in the last sentence completely changes the meaning of it. I also
found that speaking to a machine feels very unnatural. I have to stop and
pause between each sentence because I can't remember what I'm thinking about.

"As you can see from this comment, speech recognition has a long way to go
before it becomes practical. But you can at least sort of get the gist of the
conversation."

~~~
FluidDjango
> you can at least sort of get the gist of the conversation.

Or: you can get the _opposite_ of the intended meaning, e.g., when you said
'UNnatural' it heard 'natural'.

~~~
nostrademons
The unfortunate part is that I didn't even know it made that error until I
posted and re-read my comment from my laptop. The Nexus One's text boxes seem
to have issues with a lot of text...when the page is big enough to scroll and
the text is also big enough to scroll, it's hard to scroll the text without
moving the page. So everything after the second paragraph basically happened
off-screen with no visual feedback.

------
chime
For those who haven't dug into the audio analysis and synthesis field, it is
really hard to understand why speech recognition is so complex. As living
beings, we interpret sounds in a fundamentally different way than computers.
Our brain has the ability to interpret sound-waves and provide meaning and
context to them without us even realizing it. Once you hear a cat's meow, even
if it happens in a different frequency range later, you can still associate it
as a cat's meow.

This is what computers "hear": <http://www.ling.ed.ac.uk/images/waveform.gif>
\- in fact, this is ALL they can hear. When you hear 44.1kHz audio, that is
44100 bytes of information (a value between -128 and +127 per byte) per second.
There is no hidden metadata behind the waveform. At 100% zoom-level, the
waveform IS the audio. Theoretically, you can take a screenshot of a 100%
zoomed waveform and convert that to actual sound with absolutely no loss of
data (of course, you'd need a really high resolution to show a graph 44100x256
pixels in size). Now, given such a waveform, how would you convert that to
plain-text?

As an example, try recording "ships" and "chips" into a mic and view the
waveform. See if there are any patterns you can identify between the two
waveforms. I've done it over a hundred times. There isn't an easy way to
discern if the letter was "sh" or "ch". Yet our brain does it so very easily
thousands of times every day. So, failing easy pattern recognition, we have to
use frequency analysis, DFT, and tons of AI.
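
That frequency analysis can itself be sketched in a few lines (my own
illustration, not from the thread; the tones, window size, and sample rate are
arbitrary choices): a naive DFT cleanly separates two signals whose raw sample
streams both look like undifferentiated wiggles.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT; returns the magnitude of each frequency bin (first half)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]

n, rate = 256, 8000  # short analysis window at an 8 kHz sample rate
tone_a = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]  # 1 kHz
tone_b = [math.sin(2 * math.pi * 2000 * t / rate) for t in range(n)]  # 2 kHz

mags_a = dft_magnitudes(tone_a)
mags_b = dft_magnitudes(tone_b)

# In the time domain the two lists are just different wiggles; in the
# frequency domain each has a single dominant bin (bin k = k*rate/n Hz).
peak_a = max(range(len(mags_a)), key=mags_a.__getitem__)
peak_b = max(range(len(mags_b)), key=mags_b.__getitem__)
print(peak_a * rate / n, peak_b * rate / n)
```

Real recognizers run windows like this continuously over the signal; "sh"
versus "ch" shows up as differing energy distributions across bins rather than
as anything visible in the raw waveform.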

My unscientific gut-feeling is that we're going about all of this the wrong
way. We are using the wrong tools. Discrete, digital computers will never be
able to tackle problems like pattern recognition in their current state.
Switching to analog isn't going to improve anything either. I don't know what
the correct instruments/devices will be but I know programming them will be
very different.

~~~
viraptor
As much as I agree with you in general... why this? "_So, failing easy
pattern recognition, we have to use frequency analysis, DFT, and tons of AI._"

It's not like we really hear the waves. We mainly hear pitch and volume
(demonstrated nicely by playing simple impulses at >10Hz). So it's not like we
fail at easy pattern recognition on waves; I'd say the raw wave is something we should not
be looking at. Ever. Especially when you can have many representations which
sound the same (they sound the same, because they look really similar in the
frequency domain).

Using the frequency analysis, DFT, etc. is _the_ way to approach it.

~~~
chime
> Using the frequency analysis, DFT, etc. is the way to approach it.

Indeed. I meant to say there is no shortcut that immediately makes speech
recognition easy to solve. Hence, we have to use a lot of advanced math that
works pretty well but like the article says, not as well as a typical human.

However, I don't think humans are that great either. The proof of that becomes
evident when working on projects with people from around the world. Ask 10
people from around the world to dictate 10 different paragraphs that the other
nine have to write down. I doubt they'll have the 98% accuracy that the
article states, especially if there is no domain-constraint. Understanding
what the other person says is hard. Put a Texan and a South Indian in the same
room and see what I mean. Of course, this doesn't mean it's not fun and
interesting for computer scientists. It's just hard for most people to realize
why the stupid automated voice-attendant can't understand they said "next" and
not "six".

~~~
viraptor
It works quite well if you don't have to care about the accent though.

Trivia: " _Automatic computer speech recognition now works well when trained
to recognize a single voice, and so since 2003, the BBC does live subtitling
by having someone re-speak what is being broadcast._ "
(<http://en.wikipedia.org/wiki/Closed_captioning>)

But I'm pretty sure they started using it more widely at some point, because
you can see the recognition quality fall when there's an interview with
someone from outside the UK.

------
barrkel
I think this article conflates two quite different things: speech recognition
and understanding language.

The problem of speech recognition, it seems clear _now_ , is unsolvable
without understanding language, because only with an understanding of language
are the ambiguities inherent in speech properly resolved.

But having a workable understanding of language would seem to rely on us
having some kind of model of a mind. Mere statistical relations between words
or sentence structures aren't enough to map human language into mental models
of the world, but having a consistent internal model of the world seems to be
the key to understanding language. We disambiguate by minimizing the internal
inconsistency.

Having a computational model for such an internal view of the world seems to
be a Hard AI problem. We probably won't make much further progress with speech
recognition until we've made headway there.
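
For contrast, the purely statistical approach barrkel doubts can be sketched
in a few lines (the bigram counts below are invented for illustration, not
drawn from any corpus): word co-occurrence alone picks "peace" over "peas",
with no model of the world at all.

```python
# Toy bigram counts standing in for statistics gathered from a corpus.
bigram_counts = {
    ("rest", "in"): 500,
    ("in", "peace"): 900,
    ("in", "peas"): 3,
}

def score(words):
    """Product of bigram counts: a crude stand-in for sentence probability."""
    total = 1
    for pair in zip(words, words[1:]):
        total *= bigram_counts.get(pair, 1)  # unseen pairs get a floor count
    return total

candidates = [["rest", "in", "peace"], ["rest", "in", "peas"]]
best = max(candidates, key=score)
print(" ".join(best))
```

Which is also its limit: nothing in the table lets the system prefer "peas"
when the speaker really is talking about food. That is barrkel's point that
disambiguation ultimately needs a model of what is being said, not just of
which words tend to co-occur.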

~~~
mstoehr
Actually most research effort in speech is more on the language side rather
than the signal processing of the speech signal. So I think many people have a
similar intuition as yourself.

Bear in mind, though, that humans significantly outperform machines in tasks
where isolated or streamed nonsense syllables are spoken: e.g., "badagaka" is
said and humans can pick out the syllables, whereas computers can have a lot
of difficulty (in noise in particular).

Computers start approaching human performance most when there is a lot of
linguistic context to an utterance. So it appears that humans are doing
something other than using semantics.

~~~
fauigerzigerk
Good points, but I think we underestimate how much situational context humans
use when they interpret language. Sometimes we can communicate with very
little language simply because we know what the purpose of the interaction is.

Another thing I keep wondering about is why so little emphasis is put on
dialog. When humans don't understand something, they ask, or offer an
interpretation and ask whether it's the right one.

Speech recognition systems don't seem to do that. They say "Sorry, I could not
understand what you said. Please repeat". That's not very helpful for the
computer of course. It should say: "Huh, Peas? Why would anyone rest in peas
for heaven's sake??". Then the human could sharpen his S's and say "PeaCCCEE!!!
not peas. I'm not talking about food, I'm talking about dying!".

~~~
lutorm
Context is _huge_ for human interpretation. If you've ever had someone
address you in a different language than you were expecting, you know what I
mean. It's almost like you can imagine the search just going deeper and deeper
without finding anything that makes sense until it swaps in the other language
and goes: Ah, you said "good morning"! :-)

~~~
eru
Especially embarrassing when somebody addresses you in your native language,
and you expected something different.

------
ebiester
Two anecdotes:

First, I worked with a quadriplegic engineer who relied on speech recognition.
While it wasn't perfect, he certainly did well with it. The trick is that he
trained himself as much as he did the computer. He spoke more slowly, and
enunciated the places where he knew of problem points. I don't even know how
much of this was conscious - I only watched a few times and never asked him.

If we treat our voice recognition similarly, we get much better results.

Second, due to a few too many loud concerts and bars, I have partial hearing loss
and can miss words myself. The difference between myself and a computer is
that the computer is expected to output immediately rather than being allowed
to wait for more context clues, and the computer isn't allowed to interrupt to
ask about words.

These two strategies, allowing for a more conversational flow, may be what we
need to improve speech recognition.

------
TravisAllison
A surprising fact, despite what your smartphone tells you: speech recognition
in general situations hasn't improved since 2001. The article is a good
exposition of why Norvig and Google's approach of using statistical analysis
may be approaching a dead end.

~~~
Tichy
I think something must be wrong with Norvig's approach, because humans can
learn a lot faster. We don't need one billion voice samples to learn to
understand speech. So there must be a better way than the ginormous-dataset
approach.

~~~
eru
Why? Perhaps your genes have been trained (=evolved) on big datasets to
produce good speech recognizers.

------
xxzz
But Hinton recently made significant advances in speech recognition using
RBMs.

Google Tech Talk: <http://www.youtube.com/watch?v=VdIURAu1-aU>

~~~
mstoehr
He didn't make any advance that has made its way into a full word recognizer;
he's merely recognizing phonemes (which are linguistic subunits of words), and
several researchers in the field have criticized his methods. Additionally,
none of the top five phoneme recognizers has ever been deployed as a word
recognizer, and there is little chance that they will be in the next few
years.

~~~
woodson
The concept of phonemes isn't undisputed either. When analyzing actual speech,
it becomes clear that there are no real steady states, but much coarticulation
between the "segments". Of course, part of it could be attributed to the fact
that speech sounds are produced by articulatory gestures, which necessarily
overlap in time. On the other hand, these coarticulation patterns are not
language-independent, so a purely (articulatory/auditory) phonetic
explanation of why these differences exist is rather unlikely. I know this
seems rather off-topic with regard to speech recognition, but the question of
the basic building blocks of language is kind of at the heart of the problem.

~~~
mstoehr
I agree that it's at the heart of it (and I'm presently writing a paper where
I'm using articulatory-phonetic features rather than phonemes). Unfortunately,
there is no large-vocabulary speech recognizer that uses articulatory
phonetics (yet!). Every large scale speech recognizer and most small scale use
phonemes and are trained using speech that has been transcribed into phonemes.
There is almost no data that is annotated with articulatory phonetics (a
problem I'm working on right now).

~~~
woodson
I guess that's in part because it's even more difficult to (manually)
transcribe speech into articulatory-phonetic elements based on the acoustic
signal (laryngeal gestures?? Clearly they are there in articulation, but their
acoustic correlates are masked to some extent).

Automatic alignment methods are probably quite hard to implement, given the
various coarticulation patterns in the signal depending on context/prosodic
position etc.

Could you provide a link to papers or other materials dealing with
articulatory features in speech recognition?

I guess I should take another look at Browman/Goldstein's Articulatory
Phonology.

------
ephermata
While the article focuses on speech recognition for arbitrary speech, it
misses the fact that speech interfaces for specific situations are now
actually useful. Today, I told my car "play track 'severed head'," and it
actually played the correct song. I asked my phone "what is my next
appointment?" and heard my calendar. I then said "dial 206 421 8989," my phone
dialed properly, and so on and so forth. This is all without any explicit
training on my part. No need to read Mark Twain or anything like that.

There are still problems here, but the technology for speech interfaces has
gone from terrible to OK in the last seven years. I'm looking forward to
seeing where it goes next.

~~~
jerf
"But sticking to a few topics, like numbers, helped. Saying “one” into the
phone works about as well as pressing a button, approaching 100% accuracy. But
loosen the vocabulary constraint and recognition begins to drift, turning to
vertigo in the wide-open vastness of linguistic space."

"As with speech recognition, parsing works best inside snug linguistic boxes,
like medical terminology, but weakens when you take down the fences holding
back the untamed wilds."

No, it did not miss it. It was a core point of its argument; the entire arc of
the article is about how we made steady progress on the small cases but
crapped out on the general case.

------
discipline
Totally anecdotal, but I thought I'd pass this on.

Whenever I use Goog 411 or Dragon, my results improve if I speak like a
standard radio announcer. You know, that kind of jokey, fake lilting voice
that they use on car commercials? I hate doing it because I sound like such an
ass, but whenever I've done so, Google understands what I'm saying. YMMV.

------
JasonAllison
Speech recognition is pattern recognition. Computers are inefficient and
relatively limited in their capacity to discern relevant data from complex
data sets when the size of these sets begins to soar exponentially. That's why
weather prediction in many areas isn't reliable more than about a week out. Consider
how well the human brain can synthesize disparate data in the relatively
simple form of sentences with scrambled word spellings. Add to this, the
nearly infinite combinations of vocabulary, grammar, voice tone and context of
speech and we can begin to see the scale of the task of matching with programs
and algorithms, what the human mind does unconsciously in real time with a
high degree of accuracy.

<http://simplebits.com/notebook/2004/01/16/mipellssed-wdors/>

Mipellssed Wdors Posted on January 16th, 2004 at 1:44 pm

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht
oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist
and lsat ltteer be at the rghit pclae. The rset can be a total mses and you
can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not
raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh?


------
MichaelGG
MSR's MindNet, mentioned in the article, is online:

<http://stratus.research.microsoft.com/mnex/Main.aspx>

------
donaldc
It's not dead. It is in fact used a lot more than ten years ago. I've yet to
get a voice message that google voice was not able to decode sufficiently, and
most of the transcribed messages have been 100% accurate.

General speech recognition has a long way to go, but of all the speech
recognition problems, the general case will be solved _last_. Meanwhile, in
the real world, speech recognition is noticeably better for everyday use than
it was a decade ago.

------
tjpick
I have conflicting anecdotes: 1\. It's used heavily and successfully in
medical dictation situations, i.e. medical professionals dictating case notes.
2\. I was so excited about speech control in OSX, and I tried it, and it was
shit because I had to speak with a terrible impression of an American accent
to get it to do anything.

~~~
kowen
I paid my way through college doing medical transcription. Throughout my time
there, the hospital was experimenting with speech recognition, but the success
rate was very low, and the doctors spent a lot of time going over and
correcting the output. I believe they finally ditched the voice recognition
software (though I do not know for sure).

I'm guessing that the reason the experiment failed at this hospital is that
the language was non-English. In addition, most of the doctors were foreign,
speaking with wildly differing accents. Also, many of the doctors would mix
two languages when dictating, in addition to the usual Latin terms.

Basically, it was easier to just pay some humans to parse it.

~~~
tjpick
The ones where it is successful invest quite a bit into training the software,
and the doctors use macros heavily.

------
billswift
I think the core difficulty for further progress in speech recognition is
semantic understanding.

One of the reasons computers aren't good at it yet is sheer lack of computing
power, which forces them to use less context to decide word meanings. There is
a reason humans think and remember things mostly in the form of stories: it
provides more context for memory and decoding cues.

The article talks about how much computing power has increased over the last
decade without any increase in transcription accuracy ("freakish" is the word
it uses), without mentioning the fact that it is still enormously behind human
processing capabilities.

------
noisedom
Is the problem in guessing phonemes or guessing which word is being said based
on the phonemes? I'd assume that phonemes are easy to guess at since it's such
a small set compared to the number of possible words.
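
Mostly the latter. A toy sketch (mine, with ARPAbet-style symbols and an
invented four-word lexicon) of why the word step stays hard even given perfect
phonemes: the same phoneme stream can segment into entirely different word
sequences.

```python
# The classic "I scream" / "ice cream" ambiguity, in ARPAbet-like symbols.
phonemes = ["AY", "S", "K", "R", "IY", "M"]

# Tiny invented lexicon: word -> phoneme sequence.
lexicon = {
    "I":      ["AY"],
    "scream": ["S", "K", "R", "IY", "M"],
    "ice":    ["AY", "S"],
    "cream":  ["K", "R", "IY", "M"],
}

def segmentations(phones):
    """All ways to carve a phoneme list into words from the lexicon."""
    if not phones:
        return [[]]  # one way to segment nothing: the empty sentence
    results = []
    for word, pron in lexicon.items():
        if phones[:len(pron)] == pron:  # word matches a prefix
            for rest in segmentations(phones[len(pron):]):
                results.append([word] + rest)
    return results

for words in segmentations(phonemes):
    print(" ".join(words))
```

Both readings are legal, so even a flawless phoneme recognizer still needs
context or statistics to choose between them, and real lexicons have tens of
thousands of words to multiply the ambiguity.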

------
kvh
I disagree. This is another example of underestimating how powerful the human
brain is: it is an exquisitely designed piece of dedicated hardware with more
power than even our super-compute-clusters. Hardware that's dedicated, to a
large extent, to language processing. I think current statistical approaches,
ramped up on future computing resources, will continue to chip away at 'word
error rate'.

------
wglb
My favorite hard-to-understand phrase is "Isle of Lucy", which is pretty tough
to get right when spoken human-to-human.

------
jheriko
I wouldn't draw so much from what looks like a local minimum - especially when
the abilities of people imply that there is a suitable algorithm. The problem
is it's no longer a successful marketing gimmick - it won't help to sell
software/hardware anymore until it works as well as people do.

------
motters
So the question remains as to how to get that last 10-15% of transcription
accuracy. It sounds as if existing methodology has run out of steam and that a
paradigm shift is required to get to human-like speech recognition.

------
RyanMcGreal
Was speech recognition software ever any good for anything other than having
fun with your kids?

------
korch
"It's hard to wreck a nice beach"

------
J3L2404
Well, so much for HAL reading lips - he can't even hear what I'm saying. I
call bullshit - HAL's way smarter than that!

~~~
eru
Perhaps they had better (and now forgotten) techniques back in 2001?

------
dmfdmf
Great article.... Like a lot of things it all hinges on P=NP.

~~~
robertfortner
Thanks for reading. But I think a key take-away is that even with infinite
computing power, you only increase the chance of being right a little. It's
still a guess.

That's the real surprise. It's not a question of computing power.

~~~
ahk
Your articles seem to follow a theme of "no new ideas/breakthroughs". The
anecdote of the NASA head imploring people to come up with ideas was striking.
Any guesses as to why this is happening now? After all, we _were_ having
critical breakthroughs through much of the last century.

~~~
robertfortner
To some degree, I'm critiquing Ray Kurzweil who generalizes about progress.
So, for better or worse, I take a lesson from that, which is to be an empiricist:
look closely at each field to see whether it's vaulting forward or not.

It's not happening in space (although hypersonics seem to be getting more
real). It's not happening in AI or speech recognition/understanding. It's not
happening in medicine.

So the _net_ empirical situation looks like the opposite of what Kurzweil
says. Only IT and Moore's Law are going totally nuts. One can conjecture about
why, but I think that's conjecture and there aren't deep fundamental
principles about scientific and technological progress, at least that we've
found so far.

------
Tichy
A total horseshit article... I mean, to conclude from a lack of progress over
a couple of years that AI is impossible.

~~~
Tichy
That was a quote from the article btw.

