Seems like the title should really be "criticisms of STT groups in industry," since work from academic groups is ignored.
The original article was kind of silly anyway: it was all engineering work, very little research. Engineering work is important, but you're not going to get an ImageNet moment in ASR by twiddling with hyperparameters or reducing training time. Picking DeepSpeech as a framework was also a bizarre decision; of course your models will take forever to train and won't even be that good in the end. It's end-to-end and hasn't been SOTA for a while.
> It is hard to include the original Deep Speech paper in this list, mainly because of how many different things they tried, popularized, and pioneered.
Had to laugh at this. What exactly did they "popularize and pioneer" other than the practice of training and reporting results on very large private datasets? [This, among others, was a much more important end-to-end paper, I think.](https://arxiv.org/pdf/1303.5778.pdf)
There are some good points about using private datasets, overfitting on a read-speech dataset, using very large models, etc., but I really wish they had used their data to train a Kaldi system. I can guess why they didn't: they have no background in speech and found it too hard to use. Still disappointing.
Anyway, I would argue the reason an "ImageNet moment" hasn't arrived for ASR is that vision is universal, but speech splits into different languages, making it much harder to build a single model that everyone else can use as a seed model. I believe multilingual models are the future.