
Wav2vec: Unsupervised Pre-Training for Speech Recognition - gk1
https://arxiv.org/abs/1904.05862
======
harisankarh
The authors propose unsupervised pre-training of an encoder for ASR. The
encoder is trained on a contrastive pretext task: given the current context,
distinguish the true future audio segment from distractor segments sampled
elsewhere in the signal. The authors report strong accuracy results, even
surpassing the Deep Speech 2 model, which was trained on far more transcribed
data, on certain datasets. They also perform insightful characterization and
ablation studies.
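The contrastive pretext task can be sketched roughly as follows. This is a
minimal NumPy illustration, not the paper's implementation: the step offset
`k`, the number of distractors, and the sigmoid/log scoring follow the general
noise-contrastive form, but the shapes and the negative-sampling details here
are simplified assumptions.

```python
import numpy as np

def contrastive_loss(c, z, k=1, n_negatives=5, seed=0):
    """Rough sketch of wav2vec-style pretraining: each context vector c[t]
    must score the true future latent z[t + k] above distractor latents
    drawn from other time steps of the same sequence (simplified here)."""
    rng = np.random.default_rng(seed)
    T = len(z)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    losses = []
    for t in range(T - k):
        pos = sigmoid(c[t] @ z[t + k])               # score of the true future step
        candidates = np.delete(np.arange(T), t + k)  # distractors exclude the positive
        neg_idx = rng.choice(candidates, size=n_negatives)
        neg = sigmoid(-(z[neg_idx] @ c[t]))          # want low scores for distractors
        losses.append(-np.log(pos) - np.log(neg).sum())
    return float(np.mean(losses))
```

Note that the distractors come from the same sequence, so the encoder cannot
solve the task with speaker- or channel-level cues alone; it has to model
short-term structure of the signal.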

The approach provides a significant accuracy boost when the available
supervised training set is small, e.g., less than 10 hours. The relative
improvement is modest over a baseline supervised model trained on tens of
hours of transcribed audio, and the trend suggests the improvement is probably
minimal once hundreds of hours of supervised training data are available.

The authors report improvements over Deep Speech 2 on certain datasets. Deep
Speech 2 uses a 5-gram language model. When the proposed model also uses an
n-gram language model, its performance is significantly lower (albeit with a
smaller supervised training set); improvements over Deep Speech 2 appear only
when a convolutional language model is used. It is therefore possible that the
gains over Deep Speech 2 are driven mainly by the convolutional language
model. Comparing against Deep Speech 2 combined with a convolutional language
model would give a better apples-to-apples assessment of the proposed
unsupervised pre-trained acoustic model.

The gains also show diminishing returns as the amount of unsupervised training
data grows: the improvement is marginal even with a 10x increase in
unsupervised training data.

