
A speech-to-text practitioner’s criticisms of industry and academia - hprotagonist
https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/
======
totorovirus
I work at a startup doing STT. We have reached SOTA (lowest CER) for our
language among industry players. The main reason we are doing well is not that
we have smart engineers tuning fancy models, but that we developed a novel
method to collect a tremendous amount of usable data from the internet
(crawling speech with text transcripts, using subtitles from movies, etc.).
Implementing an interesting paper improves results by 1%, but pouring in more
data improves them by 10%. I guess this is why the big guys aren't exposing
what data they've used. It costs a fortune to collect even 100 hours of clean
labeled speech-to-text data, and 100 hours will never meet user expectations
in the market.

~~~
totorovirus
Also, we have developed an internal framework that eases the pipeline of
pretraining and then finetuning on subtasks. After months of usage it required
a lot of refactoring to match the needs of all forms of DNN models. I would
like to hear from other ML engineers whether they have an internal framework
which generalizes to at least one subfield of DNNs (NLP, vision, speech, etc.).
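
One common pattern such frameworks converge on (a minimal sketch of the
general idea, not the internal framework described above) is a shared backbone
with swappable task heads, so pretraining and finetuning reuse one training
loop:

    # Shared backbone + swappable heads: pretrain once, finetune per subtask.
    import torch
    import torch.nn as nn

    class BackboneWithHead(nn.Module):
        def __init__(self, backbone: nn.Module, head: nn.Module):
            super().__init__()
            self.backbone = backbone
            self.head = head

        def forward(self, x):
            return self.head(self.backbone(x))

    def finetune_setup(model, new_head, freeze_backbone=True):
        """Swap in a task-specific head, optionally freezing the backbone."""
        model.head = new_head
        if freeze_backbone:
            for p in model.backbone.parameters():
                p.requires_grad = False
        # Optimize only the parameters that still require gradients.
        return torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=1e-4)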

~~~
snakers41
Hi, my name is Alexander, I am an author of both Gradient pieces, Open STT and
silero.ai

Interesting, we did mostly the same. Did you open source your data as well?

~~~
totorovirus
I don't think my superiors are going to open-source our code. Thanks for
letting me know about your project.

------
lunixbochs
I agree with many points of the article. I do think we’re closer to an English
STT ImageNet equivalent than you think. For most other languages I don’t think
it’s possible until data is collected/released, or some kind of automatic data
collection process becomes standard that can generate 10k hours of usable
training data for each of a bunch of arbitrary languages.

Seeing better “state of the art” results on librispeech is far less
interesting to me than two recent developments:

\- Facebook’s streaming convnet wav2letter model/arch, pretrained on 60k hours
of audio, which I’ve been using for transfer learning with great results. It’s
fast, not too huge, runs on CPU, and has excellent results without a language
model.

\- The librilight dataset. I aligned 30k hours of it (on CPU, on a single
large server, in about a week) and my resulting models are generalizing very
well.

~~~
yorwba
> The librilight dataset. I aligned 30k hours of it (on CPU, on a single large
> server, in about a week) and my resulting models are generalizing very well.

Was this using the pretrained model from Facebook? If so, how much custom code
did you need to write to get the alignment? I've been looking for a way to
take an arbitrary position in a given text and look up where in the
corresponding audio it appears, but I'd prefer not having to train a speech
recognition model to do that.

~~~
lunixbochs
I train my own models here:
[https://talonvoice.com/research/](https://talonvoice.com/research/)

I used my wav2train project, which is based on Mozilla’s DSAlign:
[https://github.com/talonvoice/wav2train](https://github.com/talonvoice/wav2train)

DSAlign only generates roughly sentence-level alignment, which may still be a
good start for you. It works by segmenting the audio at pauses, transcribing
the segments with a strict language model, then figuring out where in the text
each segment occurs based on its transcript. wav2train then generates audio
segments using the alignment, but you could edit it to stop there and just
look at the tlog file that tells you where the sentences are in the original
file.
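
The core loop is roughly this (a minimal sketch of the idea, not DSAlign's
actual code; `audio_chunks` and `transcribe` stand in for a real VAD and STT
model):

    # Pause-based alignment: transcribe each silence-delimited chunk,
    # then locate that chunk's text inside the reference transcript.
    import difflib

    def align(audio_chunks, transcribe, reference_text):
        """audio_chunks: (start_sec, end_sec, samples) triples from a VAD.
        Returns (start, end, matched_text) triples, i.e. a crude tlog."""
        matcher = difflib.SequenceMatcher(autojunk=False)
        results = []
        for start, end, samples in audio_chunks:
            hypothesis = transcribe(samples)
            matcher.set_seqs(hypothesis, reference_text)
            m = matcher.find_longest_match(
                0, len(hypothesis), 0, len(reference_text))
            results.append((start, end, reference_text[m.b:m.b + m.size]))
        return results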

I’ve also used the Gentle Forced Aligner and Montreal Forced Aligner. MFA
wasn’t too hard to set up for English, but I don’t recommend running it on
large batches of audio, as it was slow and unreliable at large scale for me.

------
woodson
As another practitioner, I find it odd that the author mentions frameworks
such as OpenNMT but not Kaldi when comparing STT toolkits/frameworks. Overall,
I get the feeling that the author hasn't worked with speech data for very
long.

I agree with some of the general points the author makes, such as papers
having tons of equations just for equations' sake, poor reproducibility, not
enough details to reproduce results (in a conference paper..), and chasing
SOTA on some dataset, but these aren't limited to speech processing research.
Large companies don't release their internal datasets for voice search (Google
Assistant, Alexa, etc.), call center, or clinical applications, as there is no
incentive to do so and also because they likely can't due to licensing and
data rights issues. By the way, what's the situation for the author's OpenSTT
corpus? Did the people speaking on the thousands of hours of phone calls
release them under a free license?

~~~
qayxc
> As another practitioner, I find it odd that the author mentions frameworks
> such as OpenNMT but not Kaldi when comparing STT toolkits/frameworks.

[unpopular opinion ahead] Maybe that's because Kaldi represents the ultimate
culmination of the author's critique:

• the provided scripts are useless to anyone without access to the academic
datasets they use

• Kaldi is very cumbersome and overly complex to use and adapt (most of the
toolchain relies on __exact__ copies of entire directory trees)

• Kaldi is a __research__ tool by researchers for researchers, and not in any
way, shape, or form aimed at practitioners in search of a deployable solution

• Kaldi is very poorly documented (from a user perspective) and focuses on
recipes for your own experiments, not on writing applications and rollout

In summary, Kaldi isn't a practical toolkit. It's a framework for R&D on
speech-to-text.

~~~
nshm
> the provided scripts are useless to anyone without access to the academic
> datasets they use

The Kaldi team shares all datasets when possible on
[http://openslr.org](http://openslr.org), a major collection of speech
datasets. LibriSpeech and TED-LIUM were major breakthroughs in their time.
When everyone in research trained on the 80-hour WSJ corpus and Google trained
on a 2000-hour private dataset, the 1000 hours of LibriSpeech were a game
changer.

> Kaldi is very cumbersome and overly complex to use and adapt (most of the
> toolchain relies on exact copies of entire directory trees)

On the other hand, the Kaldi API is very stable. You can still run 4-year-old
recipes with a simple run.sh. Any TensorFlow software becomes obsolete every 3
months when the TF API changes.

Kaldi recipe results are usually reproducible down to the last digit, and you
can even see the tuning history (which features improved things, which
didn't).

> Kaldi is a research tool by researchers for researchers and not in any way
> shape or form aimed at practitioners in search of a deployable solution

There are hundreds of companies all over the world building practical
solutions with Kaldi. The only thing the Kaldi team should do is ask users to
mention it.

> Kaldi is very poorly documented (from a user perspective) and focuses on
> recipes for your own experiments, not writing applications and roll out

There are also dozens of projects on GitHub which enable the use of Kaldi
with a simple pip install or docker pull: kaldi-gstreamer-server,
kaldi-active-grammar, vosk, and many others.
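
vosk in particular boils Kaldi down to a few lines. A minimal sketch, assuming
a downloaded model unpacked into a directory named "model" and a 16 kHz mono
PCM WAV file:

    # Transcribe a WAV file with a pretrained Kaldi model via vosk.
    import json
    import wave
    from vosk import Model, KaldiRecognizer  # pip install vosk

    wf = wave.open("speech.wav", "rb")  # expects 16 kHz mono PCM
    rec = KaldiRecognizer(Model("model"), wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    print(json.loads(rec.FinalResult())["text"])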

------
shadowgovt
Interesting (possible) additional piece of the puzzle on the industry:

As opposed to the explosion of CV training data, big players may find
themselves thinking they _can't_ expose tagged raw audio data the same way
that tagged raw image data was exposed to the public. The privacy backlash
over the way some previous public-release datasets have been used was notable.

Has the ecosystem changed so that the kind of mass collaboration that led to
ImageNet can't be repeated, due to privacy concerns over how the audio could
be (ab)used?

------
nutanc
Very good explanation of the current state of the art in STT. I have also
personally gone down this rabbit hole (especially the CTC bit) and agree 100%
with what the author says. My personal viewpoints:

1\. To have an ImageNet moment, we need an ImageNet. LibriSpeech doesn't cut
it. We need an equivalent SpeechNet. The problem is that visual language is
one, while spoken languages are many. So should we have an ImageNet for every
language? And train for every language?

2\. What is an ImageNet moment, anyway? Personally, though ImageNet has
contributed a lot, vision applications, just like speech and NLP, have
promised more and delivered less in real-world scenarios. And just like in
speech and NLP, only the big boys actually provide solutions in vision, which
shows that in vision, too, they have access to better data which they don't
share.

~~~
ragebol
I'm in no way an expert in any of this, but I'd expect that there would be a
lot of features common to many languages, akin to the International Phonetic
Alphabet [0]. Pre-training on _all_ languages to get those shared features
could perhaps make it easier to fine-tune e.g. English on top. Or not, just
pondering here.

[0]
[https://en.wikipedia.org/wiki/International_Phonetic_Alphabe...](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet)

~~~
tasogare
Good luck finding records of the 6000+ existing languages...

~~~
nshm
You can check [https://github.com/festvox/datasets-CMU_Wilderness](https://github.com/festvox/datasets-CMU_Wilderness),
it has recordings of 700 languages created from New Testament readings from
[http://www.bible.is/](http://www.bible.is/)

------
nmfisher
TL;DR:

\- the big boys (Google, Baidu, Facebook etc - basically FAANG) are getting
SOTA using private data without explicitly stating it

\- their toolkits are too complicated and not suited to non-FAANG-scale
datasets/problems

\- published results aren't reproducible, and academic papers often gloss over
important implementation details (e.g. which version of CTC loss is used).

\- SOTA/leaderboard chasing is bad, and there's a general over-reliance on
large compute, so non-FAANG players inevitably end up overfitting on publicly
available datasets (LibriSpeech).

I'm far from an expert, but having spent the last 6 months familiarising
myself with the STT landscape, I would say I mostly agree.

First, the author wants more datasets like Mozilla's Common Voice. Not sure if
he/she is aware, but I think the M-AILABS speech dataset
([https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/))
will go a long way toward "democratizing" STT.

Second, I agree that the toolkits are all complicated and weighted in favour
of highly parameterized NNs. This works great if you have FAANG-style data
(and hardware), but really isn't pushing the field forward in other respects
(compute/training time, parameterization, etc.). In all fairness, though,
HMM-based frameworks like Kaldi are notoriously complex and take significant
effort to wrap your head around. I think speech is just inherently more
complex than image or text.

Third, and this applies to ML in general, I agree that papers all too often
gloss over the important details. I see no reason why publications shouldn't
all be accompanied by code and dataset hashes.

I think we're seeing FAANG all converge on a non-optimal solution, simply
because we've effectively removed resource constraints. When your company
doesn't blink at spending a few million on creating its own datasets, and you
have massive TPU farms at your disposal, there's no incentive to focus on
sample- or compute-efficient models.

What's more, they also have the resources to market their successes much more
effectively. Think open source frameworks, PR pieces, new product releases,
and so on.

This effect is two-fold. First, it creates a lot of noise that crowds out the
slower, smaller contributors. Second, it draws a lot of junior research
attention: who would want to risk 4 years of their life on a pathway that
might not work, when NNs are here today and there's clear evidence that they
_work_ (albeit only for certain problems in certain environments with certain
datasets)?

~~~
gok
Facebook released a pretty impressive SOTA system built on public data sources
a few weeks ago.

~~~
triztian
What does "SOTA" mean in this context; all I could find is "Software Over the
Air" but I'm unsure of what it means in a Speech to Text context

~~~
wodenokoto
“State of the art”.

But I think your original question is still great:

What _does_ state of the art mean in this context?

------
Causality1
On a somewhat related note, why does the Google STT engine have a grudge
against the word "o'clock"?

If you say "start at four" it gives you the text "start at 4." So far so good,
but if you say "start at four o'clock" you just get "start at 4" again, not
"start at 4:00" like you wanted. In fact you can say the word o'clock over and
over for several seconds and it will type absolutely nothing.

~~~
yorwba
Having recently experienced a similar issue with "filler" words that
mysteriously disappeared in a transliteration pipeline, I'd guess that this is
due to an artifact in the training data. Most likely the STT engine was
trained on transcriptions created by lazy humans who simply wrote "4" when
they heard "four o'clock", so the engine learned that "o'clock" is a speech
disfluency just like "umm" that usually doesn't appear in writing.

------
pheme1
As someone who tinkers with text-to-speech (TTS), I can say this applies to
TTS as well. Good models such as Tacotron 2 rarely scale beyond clean (good
text and speech alignment), large (> 12 hours) datasets.

------
fxtentacle
I see the same issue in optical flow. Everyone and their grandma are
overfitting Sintel.

But everyday phenomena, like changing light conditions or trees with many
small branches, cause the SOTA to fail miserably.

I even approached research teams with my new dataset that they could use to
measure generalization, but the response has been frosty or hostile. Nobody
likes to confirm that their amazing new paper is worthless for real-world
usage.

~~~
Tade0
This is one of the reasons why I never got myself to finish my master's
thesis.

My advisor advised me to use synthetic data, because the work I was supposed
to build upon did as well.

I insisted on real data, which I generated myself, and, unsurprisingly, it
showed how undoable this whole thing (using stereo audio data to localise
speakers) was with the suggested methods.

------
joe_the_user
This is quite interesting, looking at chip technologies, algorithms, actual
papers, and industry processes together.

Neither debunking nor selling AI, but painting a big picture.

------
PaulHoule
A lot of it is solving the wrong problems.

'Text to speech' is good for tapping people's phones. Voice user interfaces
won't be good enough until they can hold a conversation, that is, ask for
clarification if they don't understand what people say. Superhuman performance
at TTS isn't good enough to replace a keyboard with a backspace button.

~~~
mkl
Are you mixing up STT and TTS here? Maybe I'm misunderstanding what you're
trying to say.

~~~
PaulHoule
Yep, that's the kind of mistake that a human can catch... remember, some of
those errors are inserted at the transmitter, some along the channel, and some
at the receiver. If a receiver is going to get great accuracy, it has to catch
errors introduced anywhere!

------
eximius
I feel like there is a project waiting to happen (or that I'm unaware of) to
use movies/TV and captions to extract well-annotated audio.

While for copyright reasons this dataset likely can't be released publicly,
the method to generate it from source materials could be released, so that IF
you have the source, you can get the annotated audio.
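
The core of such a method is small. A minimal sketch, assuming ffmpeg on the
PATH and the pysrt package (file names are placeholders):

    # Cut a video into (audio clip, caption text) pairs using its subtitles.
    import subprocess
    import pysrt  # pip install pysrt

    def extract_clips(video_path, srt_path, out_prefix):
        for i, sub in enumerate(pysrt.open(srt_path)):
            start = sub.start.ordinal / 1000.0  # ms -> seconds
            duration = (sub.end.ordinal - sub.start.ordinal) / 1000.0
            clip = f"{out_prefix}_{i:05d}"
            # 16 kHz mono WAV is a typical STT training format.
            subprocess.run([
                "ffmpeg", "-loglevel", "error", "-y",
                "-ss", str(start), "-t", str(duration), "-i", video_path,
                "-ar", "16000", "-ac", "1", clip + ".wav",
            ], check=True)
            with open(clip + ".txt", "w") as f:
                f.write(sub.text_without_tags.replace("\n", " "))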

~~~
kd5bjo
Audiobooks could be another resource, but they’ll only give you a narrator’s
intonation. On the other hand, at some point it may become easier to train
people to speak in a certain way to computers instead of training computers to
understand the unclear mumblings of casual speech.

~~~
est31
> Audiobooks could be another resource

Yes, that's in fact the source of the commonly cited LibriSpeech dataset. It
used publicly available recordings from the LibriVox project and applied some
cleaning steps, as well as alignment of the transcripts with the audio, to
arrive at 1000 hours of cleaned-up audio.

The M-AILABS dataset uses LibriVox as well, among other sources, to arrive at
1000 hours of cleaned-up data in various European languages.

Overall there's still a large untapped potential in LibriVox data:
[https://gist.github.com/est31/1e195c55fab8f95a72393db1519da1...](https://gist.github.com/est31/1e195c55fab8f95a72393db1519da107)

But you'll have to give up properties like gender balance in your dataset,
which then has to be counteracted during the learning process, e.g. by having
gender-specific learning rates.
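
A related and perhaps simpler counterweight (a sketch of the general idea, not
est31's exact proposal) is to oversample the underrepresented group during
training:

    # Oversample an underrepresented group so batches are roughly balanced.
    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    def balanced_loader(dataset, group_labels, batch_size=32):
        """group_labels: one label per sample, e.g. "male"/"female"."""
        counts = {g: group_labels.count(g) for g in set(group_labels)}
        # A sample's weight is inversely proportional to its group's size.
        weights = torch.tensor([1.0 / counts[g] for g in group_labels],
                               dtype=torch.double)
        sampler = WeightedRandomSampler(weights, num_samples=len(dataset),
                                        replacement=True)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)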

~~~
lunixbochs
Facebook’s librilight taps this potential. They have a script for easily
fetching an entire language from librivox, and I have a pipeline on top of
that for book fetching and transcript alignment that I’ve successfully trained
from.

I think the next step here is auto-finding LibriVox source books that aren't
in e.g. Gutenberg. I only auto-found books for 75% of the English audio, and
only 20% of the German. I don't actually know where to look beyond Gutenberg
for books that don't have easily linked text. Common Crawl? The LibriVox
forums?
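
The matching itself can start crude. A minimal sketch (the catalog of e-text
titles is assumed to come from e.g. the Gutenberg catalog dump, and the cutoff
is a guess to tune):

    # Fuzzy-match a LibriVox book title against a catalog of e-text titles.
    import difflib

    def find_source_text(librivox_title, catalog_titles, cutoff=0.85):
        """Return the best-matching catalog title, or None."""
        lowered = {t.lower(): t for t in catalog_titles}
        matches = difflib.get_close_matches(
            librivox_title.lower(), list(lowered), n=1, cutoff=cutoff)
        return lowered[matches[0]] if matches else None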

(Or you can take Facebook's semi-supervised approach and generate a transcript
from thin air instead, but IMO that will work better for English than for
languages that don't have good baseline models yet.)

------
jancsika
What is the hacker's "good enough" tiny lib to take even a small set of single
words from my own voice as input through my laptop mic?

~~~
smcameron
Pocketsphinx[1]. I used it, along with pico2wave and some old-school
interactive-fiction-style code, to create a "computer" for my open source
space game[2], so I can drive the spaceship around by talking to it[3][4].

[1] [https://cmusphinx.github.io/wiki/tutorialpocketsphinx/](https://cmusphinx.github.io/wiki/tutorialpocketsphinx/)

[2] [https://spacenerdsinspace.com](https://spacenerdsinspace.com)

[3] [https://www.youtube.com/watch?v=tfcme7maygw](https://www.youtube.com/watch?v=tfcme7maygw)

[4] [https://scaryreasoner.wordpress.com/2016/05/14/speech-recognition-and-natural-language-processing-in-space-nerds-in-space/](https://scaryreasoner.wordpress.com/2016/05/14/speech-recognition-and-natural-language-processing-in-space-nerds-in-space/)
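
For the single-word case from the original question, pocketsphinx's keyphrase
spotting mode is roughly this simple. A minimal sketch using the pocketsphinx
Python bindings (the keyphrase and threshold are placeholders to tune):

    # Listen on the default mic and spot a single keyphrase.
    from pocketsphinx import LiveSpeech  # pip install pocketsphinx

    speech = LiveSpeech(
        lm=False,              # disable the full language model
        keyphrase='computer',  # the word to spot (placeholder)
        kws_threshold=1e-20,   # detection sensitivity, tune per word
    )
    for phrase in speech:
        print('heard:', phrase)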

~~~
jancsika
Ooh, thanks for the links. I'll check these out.

------
z3t4
When this becomes a solved problem, government agencies will have a field day.
So it has to be delayed until most of the world has free speech.

~~~
mkl
That sounds absurd. Government agencies are not holding back an entire
industry of researchers.

And surely the countries without free speech would be among the most
interested in computer-based eavesdropping.

~~~
z3t4
We live in a world where some people have immense power. We, as in engineers
and scientists, are the enablers.

------
snakers41
Hi, my name is Alexander, I am one of the authors of both Gradient pieces,
Open STT, and silero.ai.

We are planning on adding 2-3 new languages soon (English, German, maybe
Spanish), so if you would like us to fully open source everything, please
support us here:
[https://opencollective.com/open_stt](https://opencollective.com/open_stt)

A lot of feedback here, very nice!

As with the Russian speech community, for some reason people are willing to
share their feedback on third-party forums but do not bother to write to the
author directly.

If you have something to say, please email me at aveysov@gmail.com or telegram
me (@snakers4 or @snakers41) if you would like a faster reply.

Also, it seems to me that some readers missed that there were actually 2 parts
to the article:
[https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/](https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/).

With that all said, to answer some common criticisms / points raised below:

> Kaldi cool, NNs not cool

Though it is true that production solutions CAN be built with Kaldi (and I
have even spoken with people who built them, as well as with the vosk author),
we purposely omitted Kaldi because we believe that it is a technological dead
end. It also depends a lot on antiquated technology, is very difficult for
newcomers, etc.

Another problem is that it requires a lot of specific knowledge to add
languages, whereas our approach is just plug-and-play: just add more data!

For example - our best model in production is just 300-500 lines of code in
PyTorch.

With so many resources poured into tools like PyTorch, it just makes sense to
use tools that are simple / robust / offer a lot of future-proofing and cool
features. Also, you can of course switch DL frameworks if you wish =)

Also, you can do CV, NLP, etc. with PyTorch, unlike Kaldi.

As someone noted, it does not really matter what you are using, if it suits
your compute. Fully e2e approaches are still GAFA-scale only, and I just
realized why Google even bothers with them; please see my post here:
[https://t.me/snakers4/2445](https://t.me/snakers4/2445)

> ML APIs changing w/o backwards compatibility

TF does that, and it is a joke. Paid marketing says that TF is cool, but the
real practitioners (that I know personally) use PyTorch. It is a holy war, but
I believe TF just has a lot of captive audience...

The PyTorch core API, on the other hand, has been mostly stable since 0.4 (it
is now on 1.4)! No one speaks about this for some reason.

> Deep Speech is a bad starting point

Yes and no. Vanilla huge LSTMs are really GAFA-scale only. But our networks
are at least an order of magnitude faster. Please see this:
[https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/](https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/)

> cool open source datasets

Check out Open STT! We have not been able to open source everything, because
Russian corporations would use it without even crediting us. But you can
support us and change that. The same goes for other languages.

\- caito.de is really cool, albeit very small

\- Common Voice is very cool, but on the smaller side

\- libri-light: you have to align it yourself, so no real difference; it is
easier to make it from scratch

Also, I do not understand why people use FLAC when Opus exists.

~~~
woodson
Hi Alexander, I appreciate you posting here; I didn't feel strongly enough
about this to sign up for an account and comment directly on TheGradient.

> Kaldi cool, NNs not cool

It seems you're misrepresenting things. Kaldi models are NNs; they are trained
using a sequence-level objective function (LF-MMI, not entirely unlike CTC),
and they can also be trained completely end-to-end (without using HMM-GMM
systems to generate alignments for CE regularization and to obtain lattices).
The Kaldi authors are currently working on moving NN training to PyTorch.

The fact is that, if you need a complete, easy-to-use, off-the-shelf solution
("Just give me the transcription of that audio file!"), Kaldi is not it. You
need to train models, and you need to use other projects, such as vosk,
kaldi-gstreamer-server, etc., or build your own solution. People have done so
and use it in production successfully. Use the right tool for the job, a poor
workman blames his tools, yadda yadda.

Perhaps it's just that speech recognition IS difficult, there is no
one-size-fits-all-use-cases solution, and when people can't figure out Kaldi,
they seem to think they need to throw out everything and pray that CTC will
fix it somehow. Get sufficient data in your domain and things might work
better. Don't use a couple of audiobooks for training and expect the result to
perfectly transcribe Indian-accented English call-center conversations. I
really appreciate your effort in releasing OpenSTT, in particular that you
provide data from different conditions and domains (phone calls, lectures,
broadcast radio, etc.). This is useful no matter what particular tool (Kaldi,
DeepSpeech, w2l, RNN-T, transformers) people are using.

