

Speech Recognition Leaps Forward - Garbage
http://research.microsoft.com/en-us/news/features/speechrecognition-082911.aspx

======
brandonb
The really great thing about deep networks isn't that they're more accurate.
It's that they're radically simpler.

Current speech recognizers are basically layer upon layer of tricks discovered
by researchers over the course of decades. Chop up the input signal. Then take
a Fourier transform. Take the log to even the signal out. Do another transform
to de-correlate different components of the audio. Add noise to the input.
Project down to a subspace. Switch objective functions halfway through
training to trade off different kinds of errors. Use more Gaussians here. Use
fewer there. Pump it into a language model.
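
To make the first few steps concrete, here is a minimal sketch of the "chop, Fourier transform, log, de-correlate" front end, using only the Python standard library. It assumes a Hamming window and a DCT-II as the de-correlating transform, and it leaves out the mel filterbank, noise, and subspace projection that a real recognizer would add. The function names and the frame sizes are made up for illustration, not taken from any particular system.

```python
import cmath
import math

def frames(signal, size, hop):
    """Chop the signal into overlapping fixed-size frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    """Naive DFT of a Hamming-windowed frame; returns power per frequency bin."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]

def dct2(values, keep):
    """DCT-II of the log spectrum, keeping only the first `keep` coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(keep)]

def cepstral_features(signal, size=64, hop=32, keep=13):
    """One feature vector per frame: frame -> spectrum -> log -> DCT."""
    feats = []
    for frame in frames(signal, size, hop):
        # The small constant avoids log(0) on silent bins.
        log_spec = [math.log(p + 1e-10) for p in power_spectrum(frame)]
        feats.append(dct2(log_spec, keep))
    return feats

# Toy input: 512 samples of a 440 Hz tone at an assumed 8 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(512)]
features = cepstral_features(tone)
print(len(features), len(features[0]))  # 15 frames, 13 coefficients each
```

Every constant in there (frame size, hop, number of coefficients kept) is a hand-tuned choice, which is roughly the point being made above.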

It works, and it's a marvel of engineering, but it's not "artificial
intelligence." It's pretty much a big stack of statistical hacks piled up over
the years.

The nice thing is that a deep belief network can figure out a lot of this
structure automatically, much closer to how the brain works.

This paper is actually incremental, not a "leap forward." They've basically
replaced two of the middle layers of a speech recognizer (the Gaussian mixture
model and hidden Markov model) with a modified neural network. But the
exciting thing is that the neural network can start there, and slowly eat its
way toward the outer layers, replacing a big stack of hacks with one simple
algorithm.

~~~
exit
> _It works, and it's a marvel of engineering, but it's not "artificial
> intelligence." It's pretty much a big stack of statistical hacks piled up
> over the years._

i'm not sure about this attitude. it reminds me of a quote by dijkstra:

"The question of whether Machines Can Think... is about as relevant as the
question of whether Submarines Can Swim."

why demand that intelligence proceed from a single parsimonious gesture?

~~~
jamesrcole
I'm not sure I agree with that quote.

We can agree that what computers can do these days is quite different to
human-style thinking.

But to imply what machines can do is somehow different from thinking, just as
submarine propulsion is different from swimming, implies that thinking is
different from computation.

Whether computation, a mechanical process, encompasses what we consider
thinking is not a settled question, but I think there's a pretty good case for
believing that computation does encompass thinking.

[EDIT: wording]

~~~
exit
? you aren't disagreeing with the quote.

just as our sense of what swimming entails is wrongly constrained by our
familiarity with specific implementations, beyond "moving about under water",
we shouldn't limit "thinking" to mean "activity in a neural network", etc.

------
kondro
Is it just me or does 18% seem like a high error rate - and this is after
improvement?

I've used technologies (Nuance??) that have significantly lower error rates
than this, even for systems I have not trained personally. Is there something
I'm missing?

~~~
romanows
The difference in error rates is in large part due to the difference
between dictated speech and spontaneous, informal conversational speech.

Switchboard
([http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62))
is a set of telephone conversations between two people. Speakers tend to say a
lot of "ums", abruptly restart an utterance in progress, talk past the
telephone handset, etc. Dictated speech, especially when speakers know they're
talking to a computer, has less acoustic and linguistic noise.
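
For reference, error rates like these are usually word error rates: the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the recognizer output, divided by the number of reference words. A rough sketch, assuming plain whitespace tokenization and a hypothetical function name, with made-up example sentences:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# A conversational reference with a filler and a restart that the
# recognizer drops: 3 deletions out of 8 reference words.
print(word_error_rate("um i i think we should go now",
                      "i think we should go"))  # 0.375
```

Fillers and restarts count as reference words if the transcript includes them, so conversational speech gives the recognizer many more chances to lose a word.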

~~~
danmaz74
It will be very interesting to see how this approach will work with dictated
speech.

Also let's not forget that the word-level error rate can be reduced by using
statistical data about word sequences.
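
A toy sketch of that idea: combine each candidate transcript's acoustic score with a bigram language model score and pick the best total. The corpus, the acoustic scores, and the function names below are all invented for illustration, and the model uses simple add-one smoothing rather than anything a production recognizer would ship.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams and their left contexts; '<s>' marks sentence start."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.split()
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
        vocab.update(words[1:])
    return contexts, bigrams, len(vocab)

def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
               for a, b in zip(words, words[1:]))

def rescore(hypotheses, model, lm_weight=1.0):
    """Pick the (text, acoustic_log_score) pair with the best combined score."""
    contexts, bigrams, vocab_size = model
    return max(hypotheses,
               key=lambda h: h[1] + lm_weight * log_prob(h[0], contexts,
                                                         bigrams, vocab_size))

# Hypothetical training text and hypothetical acoustic scores.
model = train_bigrams(["it is hard to recognize speech",
                       "we recognize speech with a computer",
                       "a nice beach"])
candidates = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]
print(rescore(candidates, model)[0])                 # recognize speech
print(rescore(candidates, model, lm_weight=0.0)[0])  # wreck a nice beach
```

With the language model switched off, the slightly better acoustic score wins; with it on, the more plausible word sequence does. Note that an unnormalized sum like this also favors shorter hypotheses, which real systems correct for.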

------
runjake
The speech recognition on Windows Phone 7 is really, really, really good.

I suspect Bill Gates went on a chair-throwing rampage after that infamous
speech recognition demo flop for Vista [1].

[1] <http://video.google.com/videoplay?docid=-1123221217782777472>

------
urlwolf
Does anyone know if this will impact applications soon enough to matter to the
typical startup that could benefit from better speech recognition?

~~~
hollerith
Probably not. This web page is from the PR department of Microsoft Research.
The probability is low enough even if it had come from researchers, not PR
types.

~~~
StavrosK
Hmm, what's stopping anyone from just implementing these solutions?

~~~
gdahl
1\. Nothing. People ARE implementing similar things. It takes time, effort,
and lots of computation.

2\. People often prefer to implement their own ideas and compete (especially
researchers).

3\. Potentially lack of patents might discourage other firms from doing it.

