
Researchers achieve speech recognition milestone - gzweig
http://blogs.microsoft.com/next/2016/09/13/microsoft-researchers-achieve-speech-recognition-milestone/
======
josho
For those not familiar with the NIST 2000 Switchboard evaluation[1], it is a
series of 8kHz audio recordings (i.e., crappy phone-quality samples) of
conversations, including things like "uh-huh" and other pause words. So, 6%
seems pretty good.

[1]
[http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3....](http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3.html)

~~~
cs702
Judging by my everyday interactions, a 6% error rate is lower than human error
rates in casual conversation.

People regularly ask each other, "sorry, what did you say?", "wait, what did
she say?", "would you repeat that please?", "huh?", etc.

~~~
chipperyman573
The difference is that people are OK with a human asking for clarification,
but systems like Siri need to have a near-zero error rate before people will
consider them good (a person who has to repeat themselves once every 20 times
will consider it bad, or at least not good enough).

~~~
a3_nm
I'm not sure people expect super-human performance out of Siri. An important
difference is that a human who doesn't understand will say so, and ask you to
repeat the relevant part (or to choose between two alternatives),
conversationally; or they will pick an interpretation which is not the
intended one but was an understandable misunderstanding.

Contrast this with speech recognition, which will often substitute words that
are nonsensical in context, making it look silly from a human perspective...

~~~
click170
I think another important difference is that humans won't get stuck in a loop
asking you for clarification in the same way several times; after 2 or 3 times
they'll typically change behaviors. E.g., they'll ask you to spell the word,
or repeat the word they didn't understand with a questioning tone to signal
that they don't know what it means.

~~~
jobigoud
This could be implemented, though. Based on the part of the sentence that is
understood, figure out the most likely words for the missing part and ask a
specific question to fill the gap.

~~~
legolas2412
See, it's not about hard-coding such behavior. I would say that it reaches a
human level of understanding if it automatically learns these ways of solving
the problem. Asking relevant questions can be hard-coded, but that doesn't
equal "understanding" the problem.

I think the Chinese room experiment overlooks this part of "understanding".

------
gok
6.3% on Switchboard. This is of course in response to IBM getting 6.6%, which
was in turn in response to Baidu getting...

Switchboard is kind of a lame evaluation set. It's narrowband, old, and
doesn't contain all that much training data (100s of hours, whereas many newer
systems are trained on 1000s or 10Ks of hours). And the quest for a lower
Switchboard WER to publish means teams are now throwing extra training data at
the problem, or using frankly unlikely-to-be-deployed techniques like speaker
adaptation, impractically slow language models, or bidirectional acoustic
models (which require the entire utterance before they can emit any results).
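
To make that last point concrete, here's a toy sketch of the streaming
problem (PyTorch purely for illustration):

```python
# Toy illustration of why bidirectional acoustic models can't stream:
# the backward direction needs the last frame before the first frame's
# output is valid. (PyTorch here purely for illustration.)
import torch
import torch.nn as nn

frames = torch.randn(100, 1, 40)  # (time, batch, features): 1s of 10ms frames

# Unidirectional: output is usable frame by frame, as audio arrives.
uni = nn.LSTM(input_size=40, hidden_size=128)
state = None
for t in range(frames.size(0)):
    out, state = uni(frames[t:t + 1], state)  # emit results immediately

# Bidirectional: nothing can be emitted until the whole utterance is in.
bi = nn.LSTM(input_size=40, hidden_size=128, bidirectional=True)
out, _ = bi(frames)  # valid only after all 100 frames are buffered
```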

I really wish they had stuck to just publishing a paper explaining what was
actually new here (ResNet for acoustic models? Cool!) rather than a
"let's see how low we can push this 20-year-old benchmark" paper.

~~~
meepmorp
I'm not sure what your complaint is. The paper (on arxiv, linked in the blog
post) describes the general techniques used.

Are you saying the benchmark is useless? It's old, yes, but it's extremely
valuable to have a benchmark that allows one to assess system performance over
time. It gives a good idea of the rate of progress and the distance still to
go to match people - after ~16 years, computers are still about a third worse
than humans, error-rate-wise.

~~~
gok
Surely not "useless", but it doesn't reflect the way speech recognition is
used today. Unless you're routinely listening in on two humans having a phone
conversation that they don't expect a computer to be hearing, which is what
the test set actually contains.

If you looked at modern performance on test sets from the 1980s (like Resource
Management or TIDIGITS) you might be under the impression that we'd achieved
human-level accuracy levels years ago, but we clearly haven't. And similarly,
what users expect from speech recognition today is in many ways much more
demanding than it was in 2000: vocabularies are huge (think about all the
words you could say to Google), latency needs to be very low, and no one
thinks it's acceptable to require users to perform enrollment anymore.

So yes, just like other benchmarks, we should retire them after a few years.
The fact that a modern computer could get 100,000 FPS on a video game from
2000 wouldn't be considered a "milestone."

------
dmreedy
I would love to see a breakdown of the kinds of errors these systems make. WER
is an interesting broad stroke, but it doesn't necessarily tell me how useful
a given system will be for some given application[0] (unless, of course, it is
0). It'd be even more interesting to see comparative error analysis across the
selection of these systems. A 0.06 point improvement is certainly impressive,
especially this close to the end of the scale, but I'd be curious to see if it
lost anything in getting there. It's one thing if this system is strictly
better than its predecessor. It's entirely another if it is now 10% better at
recognizing instances of the word 'it', but has lost the ability to
distinguish 'their' and 'they're'[1].

---

[0] It is likely that any viability analysis would be done on a by-application
basis, so I don't pretend I'm asking for an insignificant amount of work
here!

[1] a crude, toy, and likely inaccurate example. Not trying to belittle the
work.
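
For concreteness, WER is just the word-level edit distance (substitutions +
insertions + deletions) divided by the number of reference words; a minimal
sketch, using the example from [1]:

```python
# Minimal WER: word-level Levenshtein distance over reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("their dog ate it", "they're dog eight it"))  # 2/4 = 0.5
```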

~~~
hiddencost
IBM was asked this at Interspeech, and answered that a lot of their errors
were on very short words ('a', 'the', 'of', etc.).

------
nshm
Open-source Kaldi gives you 7.8%, so Microsoft didn't go all that much
further.

Also, a major issue with this kind of research is that they combined several
systems in order to get the best results. Most practical systems don't use
combinations; they are too slow.

~~~
timgws
There are 33% fewer errors with the Microsoft solution than with Kaldi... one
could say that is quite significant.

~~~
afsina
A relative decrease in WER is not so significant at these low percentages. How
about "we make 6 errors on 100 words but Kaldi makes 8"?

------
rngesus
The paper itself can be found here:
[https://arxiv.org/pdf/1609.03528v1.pdf](https://arxiv.org/pdf/1609.03528v1.pdf).
Quite interesting to see that the failure rate is lower than the average human
failure rate; can't wait to see how this will improve over the coming years.

------
random42
Speech-to-text still has a long way to go when it comes to foreign accents.
Google Now's "OK Google" trigger has about a 3/10 hit rate with my Indian
accent.

~~~
EricBurnett
'OK Google' is a different problem, though: it needs to be a low-resource,
always-on listener with a low false-positive rate. That's quite a different
problem space from general speech-to-text.

I think I get about a 50% hit rate with 'OK Google', and I'm a native English
speaker :).
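
Roughly, a hotword detector runs a tiny scorer over every audio frame and only
wakes the full recognizer when a smoothed score clears a high threshold; a
hypothetical sketch (score_frame stands in for a small neural net):

```python
# Hypothetical always-on hotword loop: score every frame with a tiny
# model, smooth over a short window, and trigger only above a high
# threshold (trading some recall for a low false-positive rate).
from collections import deque

def score_frame(frame) -> float:
    # Stand-in: a real detector runs a small neural net on the frame.
    return frame  # pretend the input is already a hotword posterior

def detect(stream, window=30, threshold=0.9):
    recent = deque(maxlen=window)
    for t, frame in enumerate(stream):
        recent.append(score_frame(frame))
        if sum(recent) / len(recent) > threshold:  # smoothed score
            return t  # trigger: hand off to the full recognizer
    return None

# Mostly silence, then a burst of high hotword posteriors.
print(detect([0.05] * 100 + [0.97] * 40))  # fires during the burst
```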

~~~
DiabloD3
I had the weirdest conversation with my phone this morning, and I'm a native
English speaker with nothing really identifiable as an accent.

"Okay, Google." Nothing.

"Okay, Google." Nothing.

"Okay, Google fucking work or I am taking a hammer to this fucking phone."
"DING!"

Apparently threats of violence still work against our machine overlords.

------
mintplant
How about the inverse process -- speech synthesis? Anyone know what the state
of the art is in that field? The tech has been getting steadily better but we
still seem a ways away from passable machine-generated audiobooks, for
example.

~~~
Devid2014
Like this one?

WaveNet: A Generative Model for Raw Audio:
[https://deepmind.com/blog/wavenet-generative-model-raw-audio/](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

~~~
mintplant
That's very cool, thanks for the link! From the comparison, the audio quality
is clearly improved and they eliminated the sort of digital "wobble" that I
usually associate with TTS. Intonation is still a bit off, though. Will check
out their paper.

------
cbasoglu
From the linked arXiv paper
([http://arxiv.org/abs/1609.03528](http://arxiv.org/abs/1609.03528)), this is
a very interesting use of CNTK to adapt image CNN techniques to speech
recognition. Surprising that CNNs worked so well on speech audio. Full
disclosure: I am an MSFT employee.
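
The core idea is treating the time-frequency representation as a one-channel
image, so standard image-style convolutions apply; a rough sketch (in PyTorch
purely for illustration; the actual models were built with CNTK):

```python
# Rough sketch: a spectrogram-like input (frequency x time) treated as a
# one-channel image, so image-style CNN layers apply directly.
# (PyTorch purely for illustration; the paper's models used CNTK.)
import torch
import torch.nn as nn

# (batch, channels, freq, time): a 40-bin feature map over 300 frames.
spectrogram = torch.randn(1, 1, 40, 300)

acoustic_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # local time-freq patterns
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),             # pool in frequency only,
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # keeping time resolution
    nn.ReLU(),
)

features = acoustic_cnn(spectrogram)
print(features.shape)  # torch.Size([1, 64, 20, 300]): per-frame features
```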

~~~
dave168
For the record, CNTK delivered the best speech recognition system and
TensorFlow delivered the best Go player (switched from Torch) :-)

------
0xdeadbeefbabe
If these speech-to-computer interfaces are so important, why don't we develop
a dialect for humans to speak to computers more efficiently, kind of like the
Graffiti alphabet on the Palm Pilot but for speech?

~~~
pmontra
Or the way we google. It would mean acknowledging that the listener (the
computer) is severely limited. Actually, it already happens when speaking with
foreigners who have low proficiency in our language.

~~~
0xdeadbeefbabe
Acknowledging that the computer is severely limited seems like a good first
step. Some people hate thinking of it that way though, and I have no idea why.
It's annoying.

------
yalogin
Didn't Google announce some speech breakthrough last week?

~~~
dredmorbius
Generation, not recognition.

[https://deepmind.com/blog/wavenet-generative-model-raw-audio/](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

------
Dwolb
Are speech recognition systems also paired with vision recognition systems to
determine intent? Seems like that would be where research would be headed.

~~~
copperx
That's an interesting one. Not only for intent, but for getting the WER down.
For example, my mother often mumbles, but if I'm in front of her and can see
her face I understand her perfectly; if she's out of my sight where I can't
read her lips, I have trouble understanding her.

------
wodenokoto
Why is this score a milestone?

------
ausjke
Does Microsoft's speech cloud API support this so that we can use it?

~~~
visarga
Others, such as Google, do the same. They develop some system in their labs,
but put a worse system into production because it has lower resource usage.

------
danielvf
The best [speech] recognition engine in the world, and it hears the wrong word
more than 6% of the time. Ouch.

I would have thought the state of the art would be better, given anecdotal
evidence from friends who write with speech-to-text programs, and love them.

Perhaps some of this is due to the deliberately bad audio quality of the
Switchboard samples.

~~~
fma
Mistakes are part of real life, just like how you spelled "speach" instead of
"speech".

~~~
escapecharacter
The parent comment has a failure rate of 1/38 words, which means a 2.6%
failure rate, and this is one of the best commentators in the world, too.

