
Researchers reach human parity in conversational speech recognition - jonbaer
http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition/#sm.0001iqs780it2ea2s2f1rb1k04g5c
======
Eridrus
The actual paper has a section on error analysis that is particularly
enlightening:
[https://arxiv.org/abs/1610.05256](https://arxiv.org/abs/1610.05256)

On the CallHome dataset, humans confuse words 4.1% of the time but delete
6.5% of words, most commonly the word "I".

Their ASR system confuses 6.5% of words on this dataset, but deletes only
3.3%. So depending on how you view this, their claim about being better than
humans isn't definitively true, if you consider the task to be speech
recognition rather than transcription.
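
For anyone curious how these substitution/deletion numbers are computed: they
fall out of an edit-distance alignment between the reference transcript and
the hypothesis. A minimal sketch in Python (my own illustration, not the
paper's actual scoring pipeline):

    # Minimal WER breakdown via edit-distance alignment (illustrative;
    # real evaluations use NIST-style scoring tools).
    def wer_counts(ref, hyp):
        r, h = ref.split(), hyp.split()
        # dp[i][j] = (edits, subs, dels, ins) aligning r[:i] with h[:j]
        dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(1, len(r) + 1):
            dp[i][0] = (i, 0, i, 0)  # only deletions
        for j in range(1, len(h) + 1):
            dp[0][j] = (j, 0, 0, j)  # only insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                if r[i - 1] == h[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                    continue
                c_sub, c_del, c_ins = dp[i-1][j-1], dp[i-1][j], dp[i][j-1]
                if c_sub[0] <= c_del[0] and c_sub[0] <= c_ins[0]:
                    dp[i][j] = (c_sub[0] + 1, c_sub[1] + 1, c_sub[2], c_sub[3])
                elif c_del[0] <= c_ins[0]:
                    dp[i][j] = (c_del[0] + 1, c_del[1], c_del[2] + 1, c_del[3])
                else:
                    dp[i][j] = (c_ins[0] + 1, c_ins[1], c_ins[2], c_ins[3] + 1)
        edits, subs, dels, ins = dp[len(r)][len(h)]
        n = len(r)
        return subs / n, dels / n, ins / n

    # One deletion ("i") and one substitution ("right" -> "write"):
    print(wer_counts("i think that is right", "think that is write"))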

Also, while the overall "word error rate" is lower than the humans', it's not
clear whether that's just because the transcription service they used aims
for good-enough rather than perfect output. And the errors the transcription
service makes may not be as bad as the errors the ASR system makes, in terms
of how well you can recover the original meaning from the transcription.

It's clearly great work, but reaching human parity is marketing fluff.

~~~
joe_the_user
I was recently at a bar where they showed a movie with incomprehensible
subtitles (English to English). I assume this was because they skimped and
bought automatic subtitling.

I think one important aspect is that while humans miss words, they often get the
sentence meaning correct. When computers miss words, they tend to substitute
words that sound similar. That's readable if you have time but not necessarily
as a stream of text going by...

~~~
ChrisClark
It might also have been a 'bootleg' DVD from China, or a version downloaded
from there. I've had quite a few where the subtitles had already been horribly
translated into Chinese, and then literally machine-translated back to
English.

The character on the screen said "Hello", the English subtitle said "You
good." Which would be the literal translation of nihao.

~~~
Arkaad
See also the infamous Star Wars Episode 3 Chinese bootleg, "Do not want".

------
jpm_sd
I look forward to being able to converse with Microsoft's research team as
easily as I can with humans. I hope that one day, journalists can learn to
write headlines with similarly low rates of error.

~~~
hashkb
It's on purpose... journalists learned to do it this way.

~~~
avodonosov
[http://www.smbc-comics.com/comics/20090830.gif](http://www.smbc-comics.com/comics/20090830.gif)

~~~
Namrog84
Linking the image instead of the site will cause more people to miss out on
the red-button extra comic frame, and on the hover text (xkcd style) that
there usually is.

[http://www.smbc-comics.com/?id=1623](http://www.smbc-comics.com/?id=1623)

~~~
mastazi
Wait a second... I just realised I've been reading SMBC for years now and I've
never noticed the red button! I'm mad at how much I must have missed out on,
but at the same time I'm glad you pointed that out!

------
uvesten
Good for them! I'm a bit surprised that the researchers didn't already possess
human-level speech recognition, though.

~~~
jameshart
I wasn't sure which way round to parse it (ironically). Are they saying that
Microsoft researchers are now almost as good at recognizing conversational
speech as humans? Or that Microsoft researchers can now almost pass as human
in conversational speech? Either way, good news for Microsoft Research, I
think.

------
radarsat1
The term "human parity" refers to a comparison of the error rate, which is a
single scalar summarizing performance in terms of mistakes made. It says
nothing about the _kind_ of mistakes, and I can easily imagine that machines
qualitatively do not make at all the same kind of mistakes as humans. I'd be
curious to know if the kinds of mistakes machines make might strike human
listeners as quite stupid, but maybe not; many algorithms are getting better
at taking context and prior knowledge into account.

~~~
saidajigumi
In my mind, this is analogous to the reasons why evaluation of lossy audio
compression codecs _requires_ human listening tests. Simply running some
simplistic signal analysis like SNR (Signal-to-Noise Ratio) completely fails
to capture the as-perceived quality of a compression implementation.

To explore that analogy: In the case of lossy audio compression, the
compressor deliberately introduces quantization noise into the signal. It does
so by running a "psychoacoustic model", which attempts to capture a broad
quality of human hearing called auditory masking. There are a number of
different kinds of masking[1]: a strong tonal sound creates an "umbrella"
across nearby frequencies that can mask quieter noise-like sounds. Similarly
there's noise-vs-noise masking, as well as forwards- and backwards- temporal
masking. (Yes, backwards. A sound can mask perception of a sound that
_occurred before it_.)
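
To make the masking "umbrella" concrete, here's a toy sketch of a tonal
masking threshold. The spreading slopes and offset are rough illustrative
numbers, not a real standardized model (which would be level-dependent and
far richer):

    import math

    # Zwicker's approximation of the Bark critical-band scale.
    def bark(f_hz):
        return (13.0 * math.atan(0.00076 * f_hz)
                + 3.5 * math.atan((f_hz / 7500.0) ** 2))

    # Crude triangular "umbrella": the threshold falls off at roughly
    # 27 dB/Bark below the masker and ~15 dB/Bark above it, minus a
    # fixed offset for tonal maskers.
    def masking_threshold_db(masker_hz, masker_db, probe_hz):
        dz = bark(probe_hz) - bark(masker_hz)
        slope = 27.0 if dz < 0 else 15.0
        return masker_db - 10.0 - slope * abs(dz)

    # How loud can noise at 1.2 kHz be and still hide under a 70 dB
    # tone at 1 kHz? Anything below this threshold is inaudible.
    print(masking_threshold_db(1000, 70, 1200))  # ~42 dB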

In the audio compression case, we've built an algorithm that attempts to
characterize exploitable phenomena of human hearing. These masking
characteristics aren't perceived quite the same by human listeners as the
model, nor even the same between individual human listeners. Thus the need for
human listening tests. These, due to the experimental care and human subjects
required, are expensive.

Back to speech transcription. Say the end goal is "how well does a human
comprehend this transcribed speech"? (e.g. vs some standard, such as the
original speech, vs. the original speech transcribed by a skilled specialist,
etc.) The problem starts to look pretty similar. We can cite numerical, word-
centric error rates, but that fails to capture how well meaning is preserved
and transmitted. Imagine a perverse algorithm that did a perfect
transcription, but then dropped or altered words for maximum meaning
obfuscation. It might equal or even beat the cited error rates but be _much_
harder to actually comprehend.
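
A toy example of that: both hypotheses below have exactly one substitution
out of seven reference words (~14% WER), but very different consequences for
the meaning:

    ref  = "send the report to alice not bob"
    hyp1 = "send a report to alice not bob"    # harmless
    hyp2 = "send the report to alice and bob"  # inverts the instruction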

[1]
[https://en.wikipedia.org/wiki/Auditory_masking](https://en.wikipedia.org/wiki/Auditory_masking)

~~~
ArkyBeagle
I've always evaluated audio codecs on differential signals - subtract A from
A'. For telephony codecs, there are formal tests for MOS and/or PESQ.

~~~
saidajigumi
Which isn't substantially different than evaluating a codec on introduced
SNR.

These classic signal processing analyses tell you nothing about the correct
operation of a psychoacoustic-model codec design (e.g. MP3, AAC, Vorbis, etc.).
An analysis of A (original signal) vs. A' (signal passed through a
compression-decompression cycle) is just extracting the quantization noise
introduced by the codec. That provides no information about how effective the
codec was in masking that noise with the original signal content.

To illustrate, imagine a perversely designed codec: it runs two models, the
first "good model" is a normal psychoacoustic model. The second "bad model" is
the one the compressor uses: it applies the total amount of quantization noise
allowed by the good model, but applies it in ways that are maximally annoying
to human listeners. This isn't just avoiding masking; it's things like using
noise correlated to the original signal, which is generally more obtrusive
than uncorrelated noise, etc.

A codec using just the "good model" and one using the "good + bad model" would
have (by definition) exactly the same introduced noise, but the latter would
sound FAR worse to a human listener.
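
To put a number on that: both codecs' residuals (A' - A) have exactly the
same energy, so a differential measurement can't tell them apart. A quick
numpy sketch with toy stand-ins (uncorrelated noise vs. a distortion-like
error correlated with the signal), not real codec output:

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 48000
    t = np.arange(fs) / fs
    a = np.sin(2 * np.pi * 440 * t)  # original signal A

    # Two residuals scaled to exactly the same energy.
    r1 = rng.normal(0.0, 1.0, a.shape)  # uncorrelated noise
    r2 = np.sign(a) * np.abs(a) ** 3    # correlated, distortion-like
    r1 *= 0.01 / np.sqrt(np.mean(r1 ** 2))
    r2 *= 0.01 / np.sqrt(np.mean(r2 ** 2))

    def snr_db(signal, residual):
        return 10 * np.log10(np.sum(signal ** 2) / np.sum(residual ** 2))

    # Identical SNR (~37 dB), but A + r1 and A + r2 need not sound
    # equally degraded; the differential number can't distinguish them.
    print(snr_db(a, r1), snr_db(a, r2))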

~~~
ArkyBeagle
I am not familiar with the term "introduced SNR"... Googling.

No joy.

I can't really follow your thinking above; sorry.

In audio, there is distortion, which is correlated with the signal. Noise is
uncorrelated. Codec error would seem very much to me to be at least much more
correlated than uncorrelated. MP3 artifacts _sound_ at least more "phase-ey"
than they sound anything like quantitization noise ( at least to me ). This
may be because I have heard badly aligned tape machines make that sort of
error happen. "Phase-ey" also triggers my (feeble) mind into thinking about
allpass filters as the model for the error.

In terms of intelligibility, it's possible to improve intelligibility by
_adding_ noise, and by adding clipping; aviation comms does this at times.
What destroys intelligibility is phonemes being destroyed by phase changes and
bad amplitude errors (there are actually good amplitude errors too).

I've run an ABACUS voice quality analyzer several times, and I don't think
it's purely a distortion analyzer - adding clipping at least can _improve_
MOS/PESQ score surprisingly. Even more surprisingly, there's no mechanism
available for calibrating gain staging on one.

~~~
saidajigumi
Ah, that's a mistake, that should have read "introduced [quantization] noise",
which reduces the SNR in A' vs A.

If I understand you correctly, your discussion around intelligibility
primarily applies to signal processing of _voice_ vs _general audio_. Codecs
such as MP3, AAC, etc. don't have the luxury of making assumptions about the
signal content, and so aren't designed along those principles. E.g. speech
codecs can generally run well at much lower bitrates than general audio codecs
because they operate on a constrained domain of audio (i.e. speech).

Regarding distortion management, see Rate-distortion optimization[1] for lossy
audio codecs: _where the purpose is to manage distortion within the limits of
the bit rate supported by a communication channel or storage medium._

[1]
[https://en.wikipedia.org/wiki/Quantization_(signal_processin...](https://en.wikipedia.org/wiki/Quantization_\(signal_processing\)#Rate.E2.80.93distortion_optimization)

~~~
ArkyBeagle
Thanks for clarifying.

------
Animats
Very nice. How long before something this good is available as open source?

A tough test would be to hook this up to a police/fire scanner, or air traffic
control radio.

~~~
ChuckMcM
They did post the code on github
([https://github.com/Microsoft/CNTK](https://github.com/Microsoft/CNTK)) with
the Microsoft open source license.

Presumably you could feed it speech from a running instance of GNU Radio.

~~~
ar15saveslives
CNTK is just a toolkit, like TensorFlow or Theano. The code for the paper was
not published.

~~~
azinman2
Let alone the datasets used to train, which are worth a lot of money.

~~~
zump
Aussie company (Appen) provides the datasets :D

~~~
mtrimpe
That's the one that seems to be using illegal GCHQ wiretaps ... see the
(Dutch) article below about an Appen translator getting a private voicemail of
her ex to translate.

[http://www.volkskrant.nl/tech/privegesprekken-van-duizenden-...](http://www.volkskrant.nl/tech/privegesprekken-van-duizenden-nederlanders-in-handen-van-tech-bedrijf~a4386302/)

~~~
zump
No, it's GCHQ contracting Appen. They only deliver the platform for
distributed outsourced transcription.

------
grzm
I don't have a background in this area, so I'm likely easily impressed, but
this seems really impressive. I also appreciate the acknowledgement that
there's a lot of work to be done, such as discriminating between speakers and
recognition in adverse environments. Yeah, it's Microsoft writing about their
own technology, but they addressed in the text the questions I already had in
mind from just reading the title. It didn't leave me with the feeling that
it's just a marketing piece.

> Still, he cautioned, true artificial intelligence is still on the distant
> horizon

It's frustrating when technologies like image and speech recognition and
robotics are conflated with AI.

~~~
gremlinsinc
I don't get why that's frustrating; without image and speech recognition, AI
isn't possible. How intelligent could people be without sensory perception?
Helen Keller is the exception: take away the ability of humans as a species to
process sound and images, and we wouldn't be nearly as advanced as we are
today, even if we still had the same brain structure and mental capacity.

Ray Kurzweil understood this, which is why he paved the way for some of the
first speech/image recognition platforms like OCR/fax/etc.

To me speech/image recognition is a precursor/adjunct of AI. You can have the
former without the latter, but the latter will never be realized without
those.

~~~
ghurtado
> I don't get why that's frustrating; without image and speech recognition,
> AI isn't possible

Wait, what? The most commonly known (and perhaps oldest) AI test in no way
requires either image or speech recognition.

Millions of blind and deaf human beings would like to disagree with your claim
that they are not intelligent or sentient beings.

Seriously, what is the basis of this claim?

~~~
gremlinsinc
I'm not saying one can't be smart/intelligent without being able to see/hear.
I'm saying that if you take away all senses from ALL human beings, there won't
be any way to communicate at all, or to know that others exist and learn from
them. Learning is what makes intelligence possible, along with the passing of
information through generations. Helen Keller fought extremely hard to
overcome her sensory issues, but she had a good teacher, who presumably had
someone else help or guide them. But to organically learn to speak when nobody
can hear you, or to read when you can't even feel the braille, would be nearly
impossible.

~~~
nercht12
> Learning is what makes intelligence possible

Intelligence (at least in this case) is being able to take data and draw
conclusions from it, but you just can't see that potential when there's no
input. Is a computer not a calculator when there's no software installed? No;
it's still a calculator, it just doesn't have inputs. One day we may be able
to give sight to the blind, but for now, regarding such people as without ANY
learning capacity, as "unintelligent", is still wrong.

------
windlep
I'll admit I'm not very interested in speech recognition of this nature when
it can't disambiguate the speaker, i.e. the way Amazon Echo and other voice
recognition systems can't tell the difference between a human in the room and
the TV, even when one is clearly a female voice vs. a male one.

None of the voice recognition systems on the market learn my voice distinctly
from my wife's or son's, and I don't want their speech triggering things by
accident (especially my son's), so I don't use any of them.

I'll be more impressed when I can restrict Amazon Echo or one of these
assistants to ignoring any voice that isn't at least rather similar to my own,
not merely recognizing the words I'm speaking.

~~~
gusmd
For what it's worth, the "OK Google" functionality on my Android phone is
trained against my voice and does a pretty good job of rejecting my wife's and
coworkers' commands.

~~~
sean2
For what it's worth, even after some training, none of the phones in my office
seem to pick up their owner's voice rather than someone else telling their own
phone to run a search (we're all mid-30s males).

My phone rarely listens to me until I hold down the home button, but one guy I
know, who has a slow, deep voice, triggers Google to start listening in normal
conversation all the time.

------
Maarten88
Nice to read this as someone who uses lots of Microsoft products, but I have
mixed feelings: after all these years, Cortana still understands 0.0% of my
native language (Dutch). Very disappointing, especially seeing that Google has
no problem understanding Dutch.

~~~
jcoffland
What do you think the chances are that Philips is working on this?

~~~
Maarten88
Zero. They focus on medical equipment these days, so their research is
probably in different areas.

~~~
jcoffland
They have a dictation product which sports some sort of voice recognition.

------
eb0la
Knowing Microsoft, this will be part of Cortana in a few weeks.

I hope it will be integrated soon with the Speech API as well
([https://msdn.microsoft.com/en-us/library/hh361633(v=office.1...](https://msdn.microsoft.com/en-us/library/hh361633\(v=office.14\).aspx)).

~~~
bpicolo
It's not necessarily the case that they can do it in near-real-time (or in
near-real-time at reasonable scale). I would expect the true state of the art
to take a while to run. Maybe real-time is part of the definition, but the
article doesn't specify that.

------
Kenji
I have read too many human parity claims that left me disappointed to believe
this one. Call me a pessimist or a cynic. I'll be very excited when I have the
code running on my machine and when I can compose this comment verbally
without a hassle.

------
cellis
Ok, when's the next 2GB Xbox One update and will this fix the problem of me
saying "Xbox watch NBC", and it 'hearing' "Xbox watch TV"?

------
wbhart
So Microsoft finally have an AI that can "wreck a nice beach". Along with text
autocompletion, we are all set for a decade of irritating miscommunication.

~~~
_kst_
I thought it was "wreck a nice peach".

------
raimue
> [...] a speech recognition system that makes the same or fewer errors than
> professional transcriptionists.

How low would the error rate be for humans who can fully concentrate on
listening instead of writing at the same time? Unfortunately, that cannot be
tested.

------
nattyice
Meaning will always escape us when it comes to language. Not only will there
always be a disconnect between the speaker and his or her audience, there will
always be a subjective perspective that cannot be tapped into. Can AI ever
really be compared to a subjective perspective?

Although the article recognizes that perfection has not been assumed, parity
might not even be a capacity.

Conversation is difficult to measure. Take a look at the philosophical
viewpoint of Deconstruction. Food for thought.

[http://www.iep.utm.edu/deconst/](http://www.iep.utm.edu/deconst/)

~~~
aisofteng
In that case, it's already 100% accurate.

Don't confuse handwaving with science.

------
chris_st
Perhaps the folks at Microsoft's Lync (named after this gentleman [1], no
doubt), or maybe it's Skype for Business now, could get some of this research.

We have this at work (alas), and it does "transcription" of voicemail, which
it sends as an email. It's easily 90% wrong, regardless of speaker, unless
there's a slightly bad connection, in which case it's worse.

[1]
[https://www.youtube.com/watch?v=NV9fKUkx76Q](https://www.youtube.com/watch?v=NV9fKUkx76Q)

------
dalys
I think it's more impressive when you actually hear a sample from the
Switchboard task:
[https://catalog.ldc.upenn.edu/desc/addenda/LDC97S62.wav](https://catalog.ldc.upenn.edu/desc/addenda/LDC97S62.wav)

From
[https://catalog.ldc.upenn.edu/LDC97S62](https://catalog.ldc.upenn.edu/LDC97S62)

------
braindead_in
I run a human-powered transcription service, and I get really excited about
news like this. Typing is the first step of our process (of four), and any
ASR system that can generate even an ~80% accurate transcript of a file would
be incredibly useful. We have tried several systems, but unfortunately none
have been able to get there yet.

------
swagtricker
Time files like an arrow, but fruit flies like a banana.

Wake me up when they can match human recognition of context.

~~~
tempestn
At first I wondered whether you meant to type it that way to make a deeper
point, but I'm assuming it's just a typo.

------
andulus
I wonder if this success helps the advancement of other neural network
applications? Do these achievements translate easily to other domains, or is
it just an isolated case?

------
loup-vaillant
Great. Now Microsoft has the means to store every Skype conversation
indefinitely: it's only text, now.

Seriously, great work, but just like facial recognition, this will cut both
ways.

~~~
dlubarov
Compressed speech doesn't take much space anyway. Narrowband AMR uses around
7 kbit/s (depending on the desired quality), or ~1 megabyte for a 20-minute
call. The quality isn't great, but it's adequate for most purposes, including
speech recognition with reasonable accuracy.
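
Sanity check on that figure:

    7 kbit/s × 20 min × 60 s/min = 8,400 kbit = 1,050 kB ≈ 1 MB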

------
jarboot
How long do you think it is until captioning companies / TRSs such as Captel
downsize significantly because of tech like this?

~~~
aab0
Supply generates its own demand; by making captioning even cheaper, tech like
this can increase the demand for transcription services and for people to
check the output. There are a lot of podcasts and YT videos that could benefit
from transcriptions, but it's too expensive now.

------
mirekrusin
Wouldn't just a simple "word after a list of words" probability help? A toy
sketch of what I mean below.
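
(By that I mean something like an n-gram language model; a toy bigram sketch,
purely illustrative:)

    from collections import Counter, defaultdict

    # Toy bigram model: P(next word | previous word) from corpus counts.
    corpus = ("please recognize speech please recognize speech "
              "please recognize a nice beach").split()
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def p_next(prev, nxt):
        total = sum(bigrams[prev].values())
        return bigrams[prev][nxt] / total if total else 0.0

    # Rescoring acoustically confusable continuations:
    print(p_next("recognize", "speech"))  # 2/3
    print(p_next("recognize", "a"))       # 1/3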

------
nicklovescode
Is there a demo or video of them using this? Would enjoy playing with it.

------
mirekrusin
...still waiting for an English-to/from-dolphin translator.

------
plussed_reader
Do I have to use Windows to leverage this new software setup?

------
botw
Off topic but related: is the speech-to-text engine in Android open source?
Can it work entirely offline?

------
dfgonzalez
Has someone already put it up as SaaS?

~~~
skoocda
We're close to an alpha release of Spreza, which might be relevant to your
question. Look us up! DM me if you've got questions.

~~~
braindead_in
Very cool. How do you compare to Speechmatics?

~~~
skoocda
Very similar accuracy, timing, and alignment. We don't do speaker diarization
at all because the results seem consistently weak, even among competitors such
as Speechmatics. I'd hazard to say our web editor is much better for providing
an end-to-end solution where accuracy and verification matter.

If I may ask, what software does your team use at Scribie?

------
EGreg
In English, probably.

------
ahmetyas01
Any video or audio to get an idea of how close they are?

