Edit: found a link that works https://github.com/facebookresearch/wav2letter
I think I remember a similar thing happening with previous wav2letter releases.
I would love a simple tutorial on just using a pretrained model, but that feels unlikely to ever happen.
The framework should be generalizable, but the models they are making available are only for English. Actually adapting this for any other language would be a huge amount of additional work.
A small amount of additional work, and a huge amount of money to pay for dataset collection and compute for training.
That, and (at least for English) the results are the most accurate I've ever seen.
And does anyone know when Mozilla will release the updated Common Voice dataset from https://voice.mozilla.org ?
I haven't used wav2letter, but I can run DeepSpeech on my (low-powered) laptop and get faster-than-real-time transcription on the CPU alone.
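For anyone who wants to try the same, this is roughly what CPU-only transcription looks like with DeepSpeech's Python bindings (API as of the 0.7.x releases; the model, scorer and wav file names are placeholders you'd swap for your own):

    import wave
    import numpy as np
    from deepspeech import Model

    # Load the released acoustic model and, optionally, the external scorer (language model).
    ds = Model("deepspeech-0.7.4-models.pbmm")
    ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")

    # DeepSpeech expects 16 kHz mono 16-bit PCM audio.
    with wave.open("speech_16khz_mono.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio))  # prints the transcription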
All modern models get to ~human level when tested on individual phrases or sentences.
None yet get to human level when the source is many paragraphs long, because the human benefits from context and 'getting used to the accent', which ML has so far not achieved.
Interestingly, the baselines are all systems that model graphemes directly instead of acoustic units (phonemes).
Sure, the average HN reader will tell you that they don't see the point, etc. But Amazon and Google have sold hundreds of millions of those little voice devices.
There's speculation about whether they're currently doing this, but it could well be an active area of research. I'd imagine efficient on-device transcription would help in this regard.
The volume of real users that want to sign up to my site, and are blind, and don't have a Google account, and clear their cookies frequently enough to get an extra recaptcha challenge, and can't just call support to make an account for them, is probably zero.
Yet the number of spammers that come in that way must be in the millions by now.
I haven't kept up with developments over the last two years--anyone have a sense of whether this is close to being a reality?
(I've taken a bunch of Stanford's graduate AI courses on NLP and speech recognition; I can read documentation and deploy/configure models but don't have much appetite for getting into the weeds.)
Found the paper on it:
If so have you compared with using Mechanical Turk?
For reference, our system runs at 0.1 RTF on iPhone 10 using the Accelerate framework with FP16 precision. INT8 should be better, but we haven't benchmarked it yet!
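For anyone unfamiliar with the metric, RTF is just processing time divided by audio duration, so 0.1 means a clip is transcribed in a tenth of its own length (the numbers below are made up purely for illustration):

    # Real-time factor: wall-clock decode time / audio duration.
    audio_seconds = 60.0    # length of the clip
    decode_seconds = 6.0    # time the model took to transcribe it
    rtf = decode_seconds / audio_seconds
    print(rtf)  # 0.1 -> 10x faster than real time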
word2vec learns good feature representations for words by looking at a context window: it adjusts weights so the word that actually appears in that window scores high, while simultaneously pushing scores down for randomly sampled words (negative sampling).
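A toy sketch of that update (skip-gram with negative sampling, heavily simplified; vocabulary size, dimensions and learning rate here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim, lr = 1000, 64, 0.025
    W_in = rng.normal(scale=0.1, size=(vocab, dim))   # embeddings for center words
    W_out = rng.normal(scale=0.1, size=(vocab, dim))  # embeddings for context words

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(center, context, num_negatives=5):
        # One positive (center, context) pair plus a few randomly sampled negatives.
        targets = [context] + list(rng.integers(0, vocab, num_negatives))
        labels = [1.0] + [0.0] * num_negatives
        v = W_in[center]
        grad_v = np.zeros(dim)
        for t, y in zip(targets, labels):
            g = sigmoid(v @ W_out[t]) - y   # logistic-loss gradient for this pair
            grad_v += g * W_out[t]
            W_out[t] -= lr * g * v
        W_in[center] -= lr * grad_v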
wav2vec does something similar, learning good feature representations of audio by distinguishing a "real" sample frame from randomly sampled frames. It operates on the raw waveform rather than a Fourier transform or MFCCs, but the underlying principle is the same.
These learned feature vectors have been shown to dramatically improve downstream tasks.
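Roughly, the contrastive objective looks something like this (a simplified PyTorch sketch; the tensor names and shapes are illustrative, not the actual wav2vec code):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, true_future, distractors):
        # context:     (batch, dim)    output of the context network at time t
        # true_future: (batch, dim)    encoded "real" frame at time t+k
        # distractors: (batch, n, dim) frames sampled from elsewhere in the audio
        candidates = torch.cat([true_future.unsqueeze(1), distractors], dim=1)
        scores = torch.einsum("bd,bnd->bn", context, candidates)  # similarity per candidate
        labels = torch.zeros(context.size(0), dtype=torch.long)   # the real frame sits at index 0
        return F.cross_entropy(scores, labels)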