
How Baidu is improving Mandarin voice recognition with deep learning - oska
https://medium.com/s-c-a-l-e/how-baidu-mastered-mandarin-with-deep-learning-and-lots-of-data-1d94032564a5
======
kargo
"To put that in context, this is in my opinion — and to the best of our lab’s
knowledge — the best system at transcribing Mandarin voice queries in the
world."

There is some data about this claim here:
[https://gigaom.com/2014/12/18/baidu-claims-deep-learning-bre...](https://gigaom.com/2014/12/18/baidu-claims-deep-learning-breakthrough-with-deep-speech/)

Has anyone here actually used their system and compared it to others?

The data looks impressive, but I am a bit skeptical of this claim, because I
compared the Baidu OCR API for Chinese character recognition with the one from
Microsoft recently, and I was disappointed. In many cases it was only as good
as Microsoft's little OCR DLL that ships with Windows 10 [1]. Of course,
OCR and speech recognition are somewhat different fields, but if anyone is an
expert for Chinese language, then it should be Baidu.

[1] [http://blog.a9t9.com/2015/09/baidu-ocr-api.html](http://blog.a9t9.com/2015/09/baidu-ocr-api.html)

~~~
oska
_> Of course, OCR and speech recognition are somewhat different fields, but if
anyone is an expert for Chinese language, then it should be Baidu._

It's not about being experts in Mandarin. The basis of their approach is that
it doesn't encompass any expert design. It's an end-to-end deep learning
approach. From the article:

> Our system is different than that system in that it’s more what we call
> end-to-end. Rather than having a lot of human-engineered components that have
> been developed over decades of speech research — by looking at the system
> and saying what features are important or which phonemes the model should
> predict — we just have some input data, which is an audio .WAV file on which
> we do very little pre-processing. And then we have a big, deep neural
> network that outputs directly to characters. We give it enough data that
> it’s able to learn what’s relevant from the input to correctly transcribe
> the output, with as little human intervention as possible.

> One thing that’s pleasantly surprising to us is that we had to do very
> little changing to it — other than scaling it and giving it the right data —
> to make this system we showed in December that worked really well on English
> work remarkably well in Chinese, as well.

So they've quickly trained their Deep Speech engine [1] to process Mandarin
after first training it to transcribe English, _without_ injecting specific
language expertise into the engine.
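
To make "outputs directly to characters" a bit more concrete, here is a minimal sketch of how per-frame network outputs can be turned into text. The quote above doesn't describe this step; I'm assuming a CTC-style output layer as in the Deep Speech paper [1], and using the simplest possible (greedy) decoding. The alphabet, the scores, and the function name are all made up for illustration, and none of this is Baidu's actual code.

```python
def greedy_ctc_decode(frame_scores, alphabet, blank=0):
    """Collapse per-frame character scores into a string (greedy CTC-style decoding)."""
    # Pick the highest-scoring symbol index for every audio frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


# Hypothetical toy alphabet: index 0 is the blank symbol.
ALPHABET = ["_", "你", "好"]

# Made-up scores for five audio frames (rows) over the three symbols (columns).
FRAMES = [
    [0.10, 0.80, 0.10],  # 你
    [0.20, 0.70, 0.10],  # 你 again (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # 好
    [0.10, 0.20, 0.70],  # 好 again (repeat, collapsed)
]

print(greedy_ctc_decode(FRAMES, ALPHABET))
```

Nothing in there knows anything about English or Mandarin. Switching languages means swapping the alphabet (a few thousand characters instead of a few dozen) and the training data, which fits what they say in the quote.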

Finally, I strongly doubt the OCR and speech recognition teams are the same. I
don't know about the OCR team but their speech recognition team is based in
California [2] and includes Andrew Ng and Awni Hannun from Stanford
University.

[1] [http://arxiv.org/abs/1412.5567](http://arxiv.org/abs/1412.5567)

[2] [http://usa.baidu.com/deep-speech-lessons-from-deep-learning/](http://usa.baidu.com/deep-speech-lessons-from-deep-learning/)

------
cageface
_There are a couple of differences with Mandarin that made us think it would
be very difficult to have our English speech system work well with it. One is
that it’s a tonal language, so when you say a word in a different pitch, it
changes the meaning of the word, which is definitely not the case in English._

This is fascinating. I've been living in Vietnam for a while now and
struggling to learn Vietnamese. Mastering any tonal language is quite
difficult for native English speakers so I've wondered how hard it would be to
develop speech recognition algorithms for Chinese or Vietnamese. Sounds like
it's actually not that difficult but it requires a different approach than
non-tonal languages.

------
aianus
I wonder why tonal languages ever evolved.

It doesn't seem to make any sense to have the same words mean different things
if you can just make a different word...

~~~
ketralnis
I wonder why vowels ever evolved. It doesn't seem to make any sense to have
the same words mean different things if you can just make a different word.
After all, what's really the difference between bat and bet and bit and bot
and but? It's clearly all the same word!

Sarcasm aside though, a whole lot of languages are tonal. Most languages
spoken in Asia and many African ones are. And you use tone too: what's the
difference between "You there." and "You there?" and "You there!". It just
serves a different purpose in English than it does in Chinese so it seems less
mysterious to an English speaker.

The only tonal language I can really speak about is Mandarin (although I
encourage you to read about other systems). The way the tones work in Mandarin
isn't as complicated as it sounds. The classic example is this[1]:

1\. mā (妈, mother). This sounds like a somewhat high-pitched, steady tone

2\. má (麻, hemp). This is called a rising tone, and it sounds sort of like the
end of a question in English

3\. mǎ (马, horse). This tone falls and then rises (sort of like a #4 then a
#2). Imagine a whiny child saying "mooooom!" to get their mom's attention

4\. mà (骂, scold). This tone falls quickly, it sounds like in the movies when
someone shouts "Hey!" at a robber that's getting away

5\. ma (吗, an interrogative particle). This is a neutral tone, its sound
depends mostly on the syllable before it but in isolation it either falls
slightly or sounds like an English "toneless" syllable.

Mandarin only has those 5 tones; it's not like "ma" can sound fifteen
different ways. They aren't differentiated by the pitch itself exactly, but by
whether the pitch is rising or falling.
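
If it helps to see "rising or falling, not the pitch itself" spelled out, here is a rough sketch that guesses tones 1 to 4 from the shape of a pitch contour. Everything in it is an assumption for illustration: the function name, the 5% tolerance, and the made-up pitch values in Hz. Real tone recognition is much messier (and the neutral tone is left out because it depends on the syllable before it).

```python
def classify_tone(pitch_hz, flat_tolerance=0.05):
    """Guess a Mandarin tone number (1-4) from the shape of a pitch contour."""
    start, end, lowest = pitch_hz[0], pitch_hz[-1], min(pitch_hz)
    # Work with relative changes so the speaker's absolute pitch does not matter.
    change = (end - start) / start
    dip = (min(start, end) - lowest) / start
    if dip > flat_tolerance:
        return 3  # falls, then rises
    if change > flat_tolerance:
        return 2  # rising
    if change < -flat_tolerance:
        return 4  # falling
    return 1  # roughly level


# Made-up contours, one pitch sample (Hz) per time step within a syllable.
print(classify_tone([220, 221, 220, 219, 220]))  # level
print(classify_tone([180, 195, 210, 230]))       # rising
print(classify_tone([200, 170, 160, 175, 205]))  # dipping
print(classify_tone([240, 210, 180, 150]))       # falling
print(classify_tone([120, 105, 90, 75]))         # falling again, much deeper voice
```

The last two contours sit an octave apart but come out as the same tone, because only the relative movement is compared.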

A Chinese speaker might describe the difference between má and mà the same way
you'd describe the difference between "bet" and "bit" to a speaker of a
language that doesn't differentiate those vowels. Sort of like rhyming, but
not quite. As an easier example, Spanish doesn't really have a contrastive /ɪ/
as in "bit", so without some practice a Spanish speaker might have the same
trouble differentiating or producing it that you might have with má or mǎ.

[1] [https://en.wikipedia.org/wiki/Tone_(linguistics)#Mechanics](https://en.wikipedia.org/wiki/Tone_\(linguistics\)#Mechanics)

~~~
aianus
> I wonder why vowels ever evolved.

Is Chinese missing vowels?

Usually complexity increases because it confers some kind of evolutionary
advantage.

~~~
PepeGomez
How is pitch more complex than vowels?

~~~
rspeer
In fact, acoustically, tones are a lot simpler than vowels.

