
Building an end-to-end Speech Recognition model in PyTorch - makaimc
https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch
======
zerop
Good article. Speech recognition for real-time use cases really needs a
working open source solution. I have been evaluating DeepSpeech, which is
okay, but there is a lot of work needed to get it working close to the Google
Speech engine. Apart from a good deep neural network, a good speech
recognition system needs two important things:

1\. Tons of diverse data sets (real world)

2\. Solution for Noise - Either de-noise and train OR train with noise.

There are lots of extra challenges that the voice recognition problem has to
solve which are not common to other deep learning problems:

1\. Pitch

2\. Speed of conversation

3\. Accents (can be solved with more data, I think)

4\. Real time inference (low latency)

5\. On the edge (i.e. Offline on mobile devices)

~~~
ALittleLight
Your point about needing a dataset made me think about how a post on
hackernews like this may be a good way to get data. How many people would
contribute by reading a prompt if they visited a link like this and had the
option to donate some data? That would get many distinct voices and
microphones and some different conditions.

The article mentions that they used a dataset composed of 100 hours of
audiobooks. A comment thread here [1] estimates 10-50k visitors from a
successful hackernews post. Call it 30k visitors. If 20% of visitors donated
by reading a one minute prompt, that's another 6,000 minutes, or, oddly, also
100 hours.
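
A quick back-of-envelope check of those numbers (the visitor count, donation
rate and prompt length are just the assumptions above, not measurements):

    # Rough sanity check of the estimate; all inputs are guesses from this comment.
    visitors = 30_000
    donation_rate = 0.20
    prompt_minutes = 1

    minutes = visitors * donation_rate * prompt_minutes   # 6,000 minutes
    print(minutes / 60)                                    # 100.0 hours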

Seems like a potentially easy way to double your dataset and make it more
diverse.

1 -
[https://news.ycombinator.com/item?id=20612717](https://news.ycombinator.com/item?id=20612717)

~~~
Isn0gud
You might be interested in a project doing exactly that:
[https://voice.mozilla.org/](https://voice.mozilla.org/)

Audio data of people reading prompts is quite common; what is missing for
robust voice recognition is plenty of data of, e.g., people screaming it
across the room. There is only so much physics simulations can do.

~~~
ALittleLight
That is interesting. I made my contribution!

------
albertzeyer
This seems to be a CTC model. CTC is not really the best option for a good
end-to-end system. Encoder-decoder-attention models and RNN-T models are both
better alternatives.

There is also not really a problem with available open source code. There are
countless open source projects which already have this mostly ready to use,
for all the common DL frameworks, like TF, PyTorch, Jax, MXNet, whatever. For
anyone with a bit of ML experience, this should really not be too hard to set
up.

But then to get good performance on your own dataset, what you really need is
experience. Taking some existing pipeline will probably get you some model
with an okay-ish word error rate. But then you should tune it. In any case,
even without tuning, encoder-decoder-attention models will probably perform
better than CTC models.
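
For reference, PyTorch ships a built-in CTC loss, so the core of such a model
is just an encoder producing per-frame log-probabilities. A minimal sketch
(the toy encoder and shapes are illustrative, not the article's actual
network):

    import torch
    import torch.nn as nn

    # Toy setup: 80-dim spectrogram frames in, 29 output classes
    # (28 characters + 1 blank). Illustrative only.
    batch, input_len, target_len, n_classes = 4, 50, 12, 29
    encoder = nn.LSTM(input_size=80, hidden_size=n_classes, batch_first=True)

    feats = torch.randn(batch, input_len, 80)
    out, _ = encoder(feats)                              # (N, T, C)
    log_probs = out.log_softmax(dim=-1).transpose(0, 1)  # CTCLoss expects (T, N, C)

    targets = torch.randint(1, n_classes, (batch, target_len))
    input_lengths = torch.full((batch,), input_len, dtype=torch.long)
    target_lengths = torch.full((batch,), target_len, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)   # index 0 reserved for the CTC blank symbol
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()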

~~~
tasubotadas
It would seem that the best practical approach is to use RNN-T, as it still
lets you do streaming predictions (while attention won't really allow that).

~~~
albertzeyer
If you need streaming, then yes, RNN-T is a good option. If not, encoder-
decoder-attention performs a bit better than RNN-T.

Note that there are also approaches to make encoder-decoder-attention
streaming-capable, e.g. MoChA, hard attention, etc.

Google uses RNN-T on-device. But they are researching extending it with
another encoder-decoder-attention model on top, to get better performance.

This is quite an active research area, and it has not really settled. But CTC
is not really that relevant anymore, as RNN-T is just a better variant.
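
If you want to play with the transducer loss itself, recent torchaudio
releases expose an rnnt_loss function (assuming your torchaudio version ships
it); a minimal sketch with illustrative shapes, where a real model would have
separate encoder, prediction and joint networks:

    import torch
    import torchaudio.functional as Fa

    # Illustrative shapes only; `logits` stands in for a joint network's output.
    batch, enc_len, target_len, n_classes = 2, 40, 10, 29

    # Joint-network output: (batch, encoder frames, target length + 1, classes)
    logits = torch.randn(batch, enc_len, target_len + 1, n_classes, requires_grad=True)
    targets = torch.randint(1, n_classes, (batch, target_len), dtype=torch.int32)
    logit_lengths = torch.full((batch,), enc_len, dtype=torch.int32)
    target_lengths = torch.full((batch,), target_len, dtype=torch.int32)

    loss = Fa.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
    loss.backward()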

~~~
woodson
Recent work on transformer transducers with limited right (and left) context
seems to give decent results as well:
[https://arxiv.org/abs/2002.02562](https://arxiv.org/abs/2002.02562)

~~~
p1esk
I opened the pdf, did ctrl+F for 'github', got zero results. Have you
reproduced their "decent results"?

~~~
woodson
ESPnet apparently already has implementations of versions of transformer
transducers with RNN-T loss (though with a different network architecture). At
least the paper reports results on a freely available dataset, right?

~~~
p1esk
Why would you not expect Google to post their code? E.g. the transformer paper
had it: [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

------
spzb
This is probably really good but the linked Colab notebook is failing on the
first step with some unresolvable dependencies. This does seem to be a bit of
a common theme whenever I try running example ML projects.

Edit: I think I've fixed it by changing the pip command to:

!pip install torchaudio comet_ml==3.0

~~~
zanew101
Hah, classic. But in all seriousness, I think it's a pretty interesting issue. A
lot of ML and data science sees people coming in who do not have formal
computer science and software development backgrounds. We build tools and
methodologies around abstracting away some of the code development process and
hope that they land us in an environment that's easy to share with others. This
is unfortunately rarely the case.

It's a problem that, as an industry, I think we are in the middle of "solving"
(it probably can't be solved fully, but things are getting better). I'm really
excited to see what kinds of tools and tests will be developed to bring better
practices to ML projects.

------
option
Have a look at NeMo ([https://github.com/nvidia/NeMo](https://github.com/nvidia/NeMo));
it comes with QuartzNet (only 19M weights and better accuracy than
DeepSpeech2), pretrained on thousands of hours of speech.
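
For what it's worth, a hypothetical sketch of pulling down a pretrained
QuartzNet with NeMo (the model identifier and the transcribe() call are from
memory and may differ between NeMo versions):

    # Hypothetical usage sketch, assuming a recent nemo_toolkit[asr] install.
    import nemo.collections.asr as nemo_asr

    # Download a pretrained QuartzNet CTC model (~19M parameters).
    model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

    # Transcribe a local 16 kHz mono WAV file.
    print(model.transcribe(["sample.wav"]))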

~~~
sniper200012
really interesting repo

------
coder543
Mentioned once in the other comments here without any link, but another open
source speech recognition model I heard about recently is Mozilla DeepSpeech:

[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

[https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/](https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/)

I haven't had a chance to test it, and I wish there were a client-side WASM
demo of it that I could just visit on Mozilla's site.

~~~
mikaelphi
Author here! Deep Speech is an excellent repo if you just want to pip install
something. We wanted to do a comprehensive writeup to give devs the ability to
build their own end-to-end model.

------
komuher
Dunno why (probably the dataset), but open source speech recognition models
perform very poorly on real-world data compared to Google Speech-to-Text or
Azure Cognitive Services.

~~~
dylanbfox
One of the main factors here is probably dataset size. Commercial STT models
are trained on tens of thousands of hours of real-world data. Even a decent
model architecture is going to perform pretty well with that much data.

Most open source models are trained on LibriSpeech, Switchboard (SWB), etc.,
which are not really big or diverse enough for real-world scenarios.

But to max out results, the devil is in the details IMO (network architecture,
optimizer, weight initialization, regularization, data augmentation,
hyperparam tuning, etc.), which requires a lot of experiments.
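
On the data augmentation point, a minimal sketch of SpecAugment-style
time/frequency masking with torchaudio (the mask parameters are just
illustrative):

    import torch
    import torchaudio.transforms as T

    # Illustrative SpecAugment-style masking; mask sizes are arbitrary examples.
    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=15),  # zero out up to 15 consecutive mel bins
        T.TimeMasking(time_mask_param=35),       # zero out up to 35 consecutive frames
    )

    mel = T.MelSpectrogram(sample_rate=16000, n_mels=80)
    waveform = torch.randn(1, 16000)             # stand-in for 1 second of 16 kHz audio
    spec = mel(waveform)                         # (channel, n_mels, time)
    augmented = augment(spec)                    # apply during training only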

~~~
bginsburg
There are new, very large public English speech datasets (Mozilla Common
Voice, the National Speech Corpus) which can be combined with LibriSpeech to
train large models.

~~~
solidasparagus
Amazon worked with 7k hours of labeled data + 1 million hours of unlabeled
data -
[https://arxiv.org/pdf/1904.01624.pdf](https://arxiv.org/pdf/1904.01624.pdf)

