
Towards an ImageNet Moment for Speech-to-Text - lelf
https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
======
eindiran
One interesting point: "1,000 hours is also a good start, but given the
generalization gap (discussed below) you need around 10,000 hours of data in
different domains."

Many of the companies working in this space who aren't Google/Amazon target
less general domains, using domain-specific language models to skirt around
the lack of annotated acoustic data. But the lack of annotated acoustic data
is a real problem, so any effort to provide data to everyone (particularly for
a non-English language) is extremely admirable. I applaud the OpenSTT team for
their work.

~~~
brian_herman
Yeah, we have an automated payment system at work with Amazon. I helped
prototype it and an intern we hired finished it. People use it all the
time, and the intern's code has fewer errors than mine.

------
adyavanapalli
The author wrote a follow-up article as well:
[https://thegradient.pub/a-speech-to-text-practitioners-criti...](https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/)

~~~
lunixbochs
And a corresponding HN thread here:
[https://news.ycombinator.com/item?id=22790188](https://news.ycombinator.com/item?id=22790188)

------
andreyk
TLDR, this is about an effort to collect a new large dataset for speech-to-
text and an exploration of model design choices to enable programmers without
huge computational resources to work in the space (it mostly does not touch on
fine-tuning, so 'ImageNet moment' is a bit of an overclaim, but then again it does
say 'Towards').

I found the "Why not share this in an academic paper" part interesting (and I
expect it to be interesting to HN). As more practitioners enter the field, a
lot of empirical knowledge will be gained but will be tricky to share, since
practitioners will likely not know the conventions of academic writing, LaTeX, etc. (to
be clear, I think when done well papers are a good format to share
information). Perhaps stuff like this and distill.pub will become more common?
Seems like a good development IMO (as is researchers writing blog posts in
addition to papers).

------
devit
I don't get the claim about lack of training data.

Most TV stations broadcast with closed captioning and most movies have
subtitles available, which should give millions of hours of training data.

It's technically copyrighted, but as long as you don't distribute the video as
well, they aren't going to care about it.

Also, the complaint that the data lacks compression artifacts is completely
ridiculous: if you want compression artifacts, just compress and
decompress the speech yourself!
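
For what it's worth, that round trip is easy to script. A minimal sketch in Python (the filenames and bitrate are placeholders, and it assumes the ffmpeg CLI is installed):

    import subprocess

    def add_codec_artifacts(in_wav, out_wav, bitrate="16k"):
        """Round-trip a clean WAV through a low-bitrate MP3 to pick up codec artifacts."""
        tmp_mp3 = out_wav + ".tmp.mp3"
        # encode to a lossy, low-bitrate format...
        subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-b:a", bitrate, tmp_mp3], check=True)
        # ...then decode back to WAV; the compression artifacts survive the round trip
        subprocess.run(["ffmpeg", "-y", "-i", tmp_mp3, out_wav], check=True)

    add_codec_artifacts("clean.wav", "degraded.wav")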

You can also pay people to transcribe audio or read text (or even do it
personally), and since this is something anyone can do, it can be paid at very
low rates.

~~~
PeterisP
There are lots of resources for English and far fewer for many other languages -
this article is about the non-English case.

Closed captioning and subtitles are often used, but they are 'dirty data' -
they are usually not a one-to-one match with the audio, and the differences
mess up training. And licensing issues are a pain; "they aren't going to care
about it" is not a solution, and things like background music in movies (the
rightsholders very much _do_ care about it, even if just to make a point) make
it pretty much impossible for e.g. some university to legally distribute a
dataset that includes movie audio tracks; so they can run some experiments on
it themselves, but as soon as you want any collaboration, that data is taboo.

Paying people to transcribe audio or read text works, but it's not cheap. Reading
10,000 hours takes at least 10,000 hours of paid work and generally a bit more than
that - mistakes matter, so you need review and correction. If you try
transcribing things yourself, a tiny 100-hour dataset is going to take you at
least something like 400 hours, which is months of work.

So that's the point - getting a usable dataset for some language costs
hundreds of thousands of dollars if you're frugal, and millions if you want
good results. And there are very many languages in the world.
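
To make the scale concrete, a quick back-of-envelope using the 4x transcription overhead implied by the 100-hour/400-hour estimate above; the $15/hour rate is my own assumption, not a figure from the article:

    # Back-of-envelope dataset cost (assumed numbers; real rates vary by language and vendor)
    hours_of_audio = 10_000
    work_per_audio_hour = 4        # ~4 hours of labor per transcribed hour, incl. review
    hourly_rate_usd = 15           # assumed wage

    labor_hours = hours_of_audio * work_per_audio_hour
    cost_usd = labor_hours * hourly_rate_usd
    print(f"{labor_hours:,} labor hours, roughly ${cost_usd:,}")  # 40,000 hours, ~$600,000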

------
sandreas
Here's a project that is usable for speech-to-text via an API and Docker image:

[https://github.com/gooofy/zamia-speech#download](https://github.com/gooofy/zamia-speech#download)

[https://github.com/mpuels/docker-py-kaldi-asr-and-model](https://github.com/mpuels/docker-py-kaldi-asr-and-model)

(it also supports German pretty well, but it is not production ready)

------
lunixbochs
Great writeup! I like the detail, especially in the "Making a Great Speech To Text
Model" section.

It looks like their research is against an older wav2letter model (likely the
conv_glu model from the original 2018 Gated ConvNet paper). Facebook has
released a lot of interesting architectures since then, my favorite of which
is the Streaming ConvNet arch [1].

Ideas from the post compared to my experience with wav2letter:

1. Model Stride - increased model stride to 8x

Streaming ConvNet increased the stride to 8, while the original conv_glu model
had stride 2 (cite: [2]), so Facebook agrees on this one.
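
As a toy illustration of what a larger stride buys you (not the actual wav2letter architecture), three stride-2 convolutions give an overall stride of 8, so the decoder sees 8x fewer frames:

    import torch
    import torch.nn as nn

    # Toy front-end with three stride-2 convolutions -> overall stride 8
    encoder = nn.Sequential(
        nn.Conv1d(80, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    )

    feats = torch.randn(1, 80, 1000)   # (batch, mel bins, frames), e.g. 10s of 10ms frames
    print(encoder(feats).shape)        # torch.Size([1, 256, 125])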

2. Compact Regularized Networks - use separable convolutions, add skip
connections, and attention modules

Streaming ConvNets use Time-Depth Separable Convolutions, but no skip
connections or attention. Facebook's "state of the art 2019" research [3] tried a
lot of architectures, the most accurate of which was a transformer model
trained with attention and a seq2seq criterion. However, in my experience it was
very large and slow to train.
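
For reference, a depthwise-separable 1D convolution in PyTorch looks roughly like this; a generic sketch, not Facebook's TDS block:

    import torch
    import torch.nn as nn

    class SeparableConv1d(nn.Module):
        """Depthwise + pointwise 1D convolution: far fewer parameters than a full Conv1d."""
        def __init__(self, channels, kernel_size):
            super().__init__()
            # depthwise: one filter per channel (groups == channels)
            self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2, groups=channels)
            # pointwise: 1x1 convolution mixes information across channels
            self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    layer = SeparableConv1d(channels=256, kernel_size=9)
    out = layer(torch.randn(1, 256, 100))   # shape preserved: (1, 256, 100)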

3. Using Byte-Pair Encoding

The Streaming ConvNets model uses 10,000 word pieces instead of the ~26-letter
alphabet from the original conv_glu model, which is a similar idea (the token
set FB uses is here: [4], but I've had better luck with a sentencepiece model
trained on more text data).
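
If you want to try the word-piece route yourself, the sentencepiece library makes it a few lines; the filenames here are placeholders and the 10,000 vocab just mirrors FB's recipe:

    import sentencepiece as spm

    # Train a 10k-piece unigram model on a plain-text corpus (one utterance per line).
    spm.SentencePieceTrainer.train(
        input="transcripts.txt", model_prefix="wordpiece",
        vocab_size=10000, model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="wordpiece.model")
    print(sp.encode("turn the volume down", out_type=str))
    # e.g. ['▁turn', '▁the', '▁vol', 'ume', '▁down']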

4. Better Encoder

The blog post's conclusion here isn't very clear; it seems to hand-wave that
you could maybe use a transformer architecture, but doesn't go into specifics.

5. Balance Capacity

I don't completely understand what they changed here but it seems useful.

6. Stabilize the Training in Different Domains, Balance Generalization

When adding domains to wav2letter, I've basically trained entirely from
scratch using all available data. I long for a catastrophic-forgetting
mitigation like EWC, but for now I've had a lot of success by just making the
data funnel bigger and more varied. Data matters quite a bit: in this post [5]
I compare a model I trained on a wide variety of data with Facebook's
equivalent model as well as Facebook's state-of-the-art model. The wide-
variety model generalized significantly better than both of Facebook's models,
which had only been trained on audiobooks.
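
For what that can look like in practice, here's a generic PyTorch-style sketch of sampling across domains with weights; the dataset names and weights are made up, and this isn't necessarily how wav2letter's own pipeline handles it:

    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

    # Stand-ins for real per-domain datasets (hypothetical names and sizes).
    audiobooks  = TensorDataset(torch.randn(1000, 80))
    phone_calls = TensorDataset(torch.randn(200, 80))
    commands    = TensorDataset(torch.randn(100, 80))

    combined = ConcatDataset([audiobooks, phone_calls, commands])
    weights = torch.cat([
        torch.full((len(audiobooks),), 1.0),
        torch.full((len(phone_calls),), 3.0),   # made-up up-weighting for smaller domains
        torch.full((len(commands),), 3.0),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    loader = DataLoader(combined, batch_size=16, sampler=sampler)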

7. Make a very fast decoder

The beam search decoder in this branch [6] of Talon's wav2letter fork was, at
the time, significantly faster than Facebook's decoder. It was hand-optimized
and even includes some code for threaded decoding (which I'm not using at the
moment because single-threaded is fast enough for my workload). With single-
threaded decoding it can hit a realtime factor in the ballpark of 0.01x, or
100 seconds of audio per CPU second. With my smallest model (a significantly
reduced-size conv_glu wav2letter model), which is also in the 0.01x ballpark
for encoding, we can consistently hit around 0.02x end to end on a consumer
CPU.
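
For context, realtime factor here is just wall-clock processing time divided by audio duration; a rough sketch of measuring it around any recognize call (`recognize` is a stand-in, not a real wav2letter API):

    import time

    def timed_recognize(recognize, audio, audio_ms):
        """Wrap any recognize() call and report a realtime factor like the log below."""
        start = time.perf_counter()
        text = recognize(audio)                       # stand-in for the emit + decode steps
        total_ms = (time.perf_counter() - start) * 1000
        print(f"[audio]={audio_ms:.3f}ms  [total]={total_ms:.3f}ms ({total_ms / audio_ms:.2f}x)")
        return text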

To demonstrate, here is a random sample of debug timing information from my
machine when recently decoding command phrases with wav2letter during real
interactive use (the very long audio samples aren't the model's fault; I was
testing a mic in a noisy environment):

    
    
        [audio]=1260.000ms  [emit]=16.164ms (0.01x)  [decode]=11.326ms (0.01x)  [total]=27.490ms (0.02x)
        [audio]=1260.000ms  [emit]=16.164ms (0.01x)  [decode]=11.326ms (0.01x)  [total]=27.490ms (0.02x)
        [audio]=930.000ms  [emit]=12.853ms (0.01x)  [decode]=3.716ms (0.00x)  [total]=16.569ms (0.02x)
        [audio]=1080.000ms  [emit]=14.936ms (0.01x)  [decode]=3.296ms (0.00x)  [total]=18.231ms (0.02x)
        [audio]=1020.000ms  [emit]=12.976ms (0.01x)  [decode]=10.058ms (0.01x)  [total]=23.034ms (0.02x)
        [audio]=930.000ms  [emit]=13.836ms (0.01x)  [decode]=9.998ms (0.01x)  [total]=23.834ms (0.03x)
        [audio]=1140.000ms  [emit]=14.535ms (0.01x)  [decode]=3.589ms (0.00x)  [total]=18.124ms (0.02x)
        [audio]=1650.000ms  [emit]=20.516ms (0.01x)  [decode]=5.281ms (0.00x)  [total]=25.797ms (0.02x)
        [audio]=3480.000ms  [emit]=31.386ms (0.01x)  [decode]=32.966ms (0.01x)  [total]=64.352ms (0.02x)
        [audio]=29760.000ms  [emit]=253.671ms (0.01x)  [decode]=325.198ms (0.01x)  [total]=578.869ms (0.02x)
        [audio]=8850.000ms  [emit]=70.844ms (0.01x)  [decode]=76.267ms (0.01x)  [total]=147.111ms (0.02x)
        [audio]=960.000ms  [emit]=13.924ms (0.01x)  [decode]=8.135ms (0.01x)  [total]=22.060ms (0.02x)
        [audio]=1770.000ms  [emit]=19.760ms (0.01x)  [decode]=18.612ms (0.01x)  [total]=38.372ms (0.02x)
        [audio]=930.000ms  [emit]=12.488ms (0.01x)  [decode]=6.842ms (0.01x)  [total]=19.330ms (0.02x)
        [audio]=1050.000ms  [emit]=15.726ms (0.01x)  [decode]=7.645ms (0.01x)  [total]=23.371ms (0.02x)
        [audio]=930.000ms  [emit]=13.899ms (0.01x)  [decode]=8.714ms (0.01x)  [total]=22.613ms (0.02x)
        [audio]=40410.000ms  [emit]=292.916ms (0.01x)  [decode]=289.109ms (0.01x)  [total]=582.025ms (0.01x)
    

[1]
[https://github.com/facebookresearch/wav2letter/tree/master/r...](https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/streaming_convnets)

[2]
[https://github.com/facebookresearch/wav2letter/issues/543#is...](https://github.com/facebookresearch/wav2letter/issues/543#issuecomment-588062135)

[3]
[https://github.com/facebookresearch/wav2letter/tree/master/r...](https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/sota/2019)

[4]
[https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets...](https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets/librispeech/librispeech-train-all-unigram-10000.tokens)

[5]
[https://github.com/facebookresearch/wav2letter/issues/577](https://github.com/facebookresearch/wav2letter/issues/577)

[6]
[https://github.com/talonvoice/wav2letter/tree/decoder](https://github.com/talonvoice/wav2letter/tree/decoder)

