
DeepSpeech: Scaling up end-to-end speech recognition - cbcase
http://arxiv.org/abs/1412.5567
======
cbcase
Thought it best to post the arXiv link, but there's some press coverage as
well:

\- [https://gigaom.com/2014/12/18/baidu-claims-deep-learning-bre...](https://gigaom.com/2014/12/18/baidu-claims-deep-learning-breakthrough-with-deep-speech/)

\- [http://www.forbes.com/sites/roberthof/2014/12/18/baidu-annou...](http://www.forbes.com/sites/roberthof/2014/12/18/baidu-announces-breakthrough-in-speech-recognition-claiming-to-top-google-and-apple/)

~~~
cbcase
I should add that I had the opportunity to work on this project and am happy
to answer questions.

~~~
hyperbovine
Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how
wide to make the internal layers? Are there some organizing principles behind
these design decisions, or is it just trial and error?

~~~
cbcase
As in many things, it's a combination of both. For example:

\- We wanted no more than one recurrent layer, as it's a big bottleneck to
parallelization.

\- The recurrent layer should go "higher" in the network: it's more effective
at propagating long-range context over the network's learned feature
representation than over the raw input values.

Other decisions are guided by a combination of trial+error and intuition. We
started on much smaller datasets which can give you a feel for the
bias/variance tradeoff as a function of the number of layers, the layer sizes,
and other hyperparameters.
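
To make that layout concrete, here's a minimal sketch of a stack with this
shape (three non-recurrent layers over spectrogram context windows, one
bidirectional recurrent layer, then one more non-recurrent layer into a
character softmax), written as PyTorch for illustration. The layer sizes,
plain ReLU activations, and vanilla RNN cell below are placeholder
assumptions, not the configuration from the paper:

    import torch
    import torch.nn as nn

    class DeepSpeechSketch(nn.Module):
        """Hypothetical 5-layer stack: 3 feedforward, 1 bidirectional RNN, 1 output."""

        def __init__(self, n_input=494, n_hidden=1024, n_chars=29):
            super().__init__()
            # First three layers are non-recurrent, applied per time step.
            self.ff = nn.Sequential(
                nn.Linear(n_input, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            )
            # A single bidirectional recurrent layer sits "higher" in the
            # network, on top of the learned feature representation.
            self.rnn = nn.RNN(n_hidden, n_hidden, bidirectional=True,
                              batch_first=True)
            # Final non-recurrent layer produces per-character scores
            # (trained with a CTC-style loss, not shown here).
            self.out = nn.Linear(2 * n_hidden, n_chars)

        def forward(self, x):  # x: (batch, time, features)
            h = self.ff(x)
            h, _ = self.rnn(h)
            return self.out(h)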

------
pesenti
To put it in perspective, my team at IBM Watson has already published better
numbers (10.4% WER vs 13.1% WER for Baidu) on the SWB dataset. We haven't run
our model on the CH part so we can't compare on the full test set. Paper here:
[http://www.mirlab.org/conference_papers/International_Confer...](http://www.mirlab.org/conference_papers/International_Conference/ICASSP%202014/papers/p5609-soltau.pdf).
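
For anyone comparing these numbers: word error rate (WER) is conventionally
the word-level edit distance (substitutions + deletions + insertions) between
the hypothesis and the reference transcript, divided by the number of
reference words. A minimal Python sketch of the metric, not taken from either
system's scoring code:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate via word-level Levenshtein distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # e.g. wer("the cat sat", "the cat sat down") -> 0.333...
    # (one insertion against three reference words)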

~~~
cbcase
Hi Jerome, those are great results! We got an email this morning from someone
else on the Watson team pointing out that we didn't include the latest IBM
number -- we'll be sure to update the results in the next version of the paper
(three cheers for arXiv).

Of course, we openly say in the paper that we don't have the best result on
the easy subset of Hub5'00 (we had it as 11.5%). We're more interested in
advancing the state of the art on challenging, noisy, varied speech. Of course
we'll be working to push the SWB number down too :)

~~~
pesenti
The team is already working on seeing what we get with CH. We'll let you know
where we land. But your results are definitely impressive. We love to see new
published innovation in the field. Kudos to the team!

~~~
ogrisel
What is the average and standard deviation of the performance level on this
dataset?

------
brandonb
This is very fast progress from Baidu's Silicon Valley AI lab! Andrew Ng only
joined Baidu in May, and (nearly?) all of the co-authors of this paper have
joined him since then: [http://www.technologyreview.com/news/527301/chinese-
search-g...](http://www.technologyreview.com/news/527301/chinese-search-giant-
baidu-hires-man-behind-the-google-brain/)

Congrats to Carl, Sanjeev, Andrew, and the others.

~~~
cbcase
Thanks for the kind words, Brandon! Been a busy couple of months :)

------
greeneggs
Very nice. I wonder if training can be simplified by training pieces of the
model separately, instead of training all together. For example, the
DeepSpeech model has three layers of feedforward neurons (where the inputs to
the first layer are overlapping contexts of audio), followed by a bi-
directional recurrent layer, followed by another feedforward layer. What would
the results be if we trained the first layers (perhaps all three) on a
different problem, such as autoencoding or fill-in-the-blank (as in word2vec),
and then fixed those network weights to train the rest of the network?

Breaking the network up like this would reduce training time and perhaps
reduce the needed training data. Since the first layers could be trained
without supervision, less labeled data would be needed to train the last two
layers. It would also facilitate transferring models between problems; the
output of the first few layers, like a word2vec, could be fed into arbitrary
other machine learning problems, e.g., translation.

If this does not work, then how about training the whole model together, but
only once? The final results are reported for an ensemble of six independently
trained networks. What if we started by training one network, and then fixed
the first three layers to train the other networks? (Instead of fixing the first
layers, you could also just give them a slower training rate, although it
isn't clear whether that would save you much.)
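
A minimal sketch of the freeze-the-lower-layers variant, in PyTorch, with the
same three-part split as the architecture sketch above (lower non-recurrent
layers, a bidirectional recurrent layer, an output layer). The sizes and
learning rates are placeholders, and none of this is taken from the paper's
training setup:

    import torch
    import torch.nn as nn

    # Stand-in modules for the three parts of the network (placeholder sizes).
    ff = nn.Sequential(nn.Linear(494, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU())
    rnn = nn.RNN(1024, 1024, bidirectional=True, batch_first=True)
    out = nn.Linear(2048, 29)

    # Option 1: freeze the lower layers after the first network is trained,
    # then optimize only the recurrent and output layers in the remaining
    # ensemble members.
    for p in ff.parameters():
        p.requires_grad = False
    trainable = list(rnn.parameters()) + list(out.parameters())
    optimizer_frozen = torch.optim.SGD(trainable, lr=1e-3)

    # Option 2 (the parenthetical above): keep the lower layers trainable
    # but give them a smaller learning rate than the rest.
    optimizer_slow_lower = torch.optim.SGD([
        {"params": ff.parameters(), "lr": 1e-4},
        {"params": rnn.parameters()},
        {"params": out.parameters()},
    ], lr=1e-3)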

------
gok
So with 300 hours of training data it does worse on SWB than a DNN-HMM, or
even a GMM-HMM system? But when they give it 2300 hours of training data, it
can beat those 300-hour-trained systems?

This is still very cool, but that comparison doesn't seem fair at all.

~~~
sherjilozair
Why not? DNN-HMM and GMM-HMM wouldn't have done any better even if trained for
2300 hours.

~~~
cbcase
Mostly this, though it's not so black-and-white. The paper discusses results
from a DNN-HMM system (Maas et al., using Kaldi) trained on 2k hours, and the
extra data does yield a small generalization improvement over 300 hours.

Much of the excitement about deep learning -- which we see as well in
DeepSpeech -- is that these models continue to improve as we provide more
training data. It's not obvious a priori that results will keep getting better
after thousands of hours of speech. We're excited to keep advancing that
frontier.

~~~
gok
That was an even weirder comparison. They compare a system trained on 2000
hours of acoustic data mismatched with the testing data to their system, which
was trained on 300 hours of matched data _in addition_ to the 2000 hours of
mismatched acoustic data.

