
Online speech recognition with wav2letter anywhere - moneil971
https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere/
======
tingletech
All I see is """Sorry, this content isn't available right now The link you
followed may have expired, or the page may only be visible to an audience
you're not in. Go back to the previous page · Go to News Feed · Visit our Help
Center"""

Edit: found a link that works
[https://github.com/facebookresearch/wav2letter](https://github.com/facebookresearch/wav2letter)

~~~
vineelkpratap
Here is the direct link - [https://ai.facebook.com/blog/online-speech-recognition-with-...](https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere)

~~~
rgj
Same error (from NL)

~~~
vineelkpratap
Fixed now !

------
dvduval
So by open sourced I assume this means there are absolutely no Facebook
dependencies where the voice is passing through a Facebook server? Sorry, have
to ask, as my trust level is low. Otherwise, awesome!

~~~
notduncansmith
The repository
([https://github.com/facebookresearch/wav2letter](https://github.com/facebookresearch/wav2letter))
claims to come with pre-trained models for automated speech recognition.

~~~
throwawayhhakdl
> Trained models: ...

I think I remember a similar thing happening with previous wav2letter
releases.

I would love a simple tutorial on just using a pretrained model, but that
feels unlikely to ever happen.

~~~
snippyhollow
The models are here:
[https://github.com/facebookresearch/wav2letter/tree/master/r...](https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/sota/2019)
[https://github.com/facebookresearch/wav2letter/tree/master/r...](https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/streaming_convnets/librispeech)

------
gliese1337
Online speech recognition _for English_.

The framework should be generalizable, but the models they are making
available are only for English. Actually adapting this for any other language
would be a huge amount of additional work.

~~~
londons_explore
> would be a huge amount of additional work.

A small amount of additional work, and a huge amount of money to pay for
dataset collection and compute for training.

~~~
gliese1337
The dataset collection is going to be work for _someone_.

------
Jnr
How does this compare to Mozilla's DeepSpeech?

And does anyone know when Mozilla will release the updated Common Voice
dataset from [https://voice.mozilla.org](https://voice.mozilla.org) ?

~~~
zachruss92
I've been following DeepSpeech for a while. They have a WER in the 7% range,
and wav2letter's SOTA model is at around 5%.

I haven't used wav2letter, but I can run DeepSpeech on my (low powered) laptop
with faster than real-time transcription with just the CPU.

~~~
londons_explore
Word error rate depends heavily on dataset.

All modern models get to ~human level when tested on individual phrases or
sentences.

None yet get to human level when the source is many paragraphs long, because
the human benefits from context and 'getting used to the accent', which ML has
so far not achieved.

------
jwineinger
I'd love a tutorial that shows a normal guy like me how to use this tool with
the pre-trained models to transcribe my audio files. Not finding anything of
that kind included there.

~~~
vineelkpratap
Check out the tutorial here -
[https://github.com/facebookresearch/wav2letter/wiki/Inferenc...](https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples)

------
gok
The preprint: [https://research.fb.com/wp-content/uploads/2020/01/Scaling-u...](https://research.fb.com/wp-content/uploads/2020/01/Scaling-up-online-speech-recognition-using-ConvNets.pdf)

Interestingly, the baselines are all systems that model graphemes directly
instead of acoustic units (phonemes).

------
faitswulff
Speaking as a Facebook user, I'm a bit confused - where do they use speech
recognition? Or is this just purely research oriented?

~~~
melling
Voice user interfaces are becoming more common. Ignoring this technology is a
bad idea. FB has VR devices and their portal device:

[https://www.theverge.com/2019/9/18/20870866/facebook-portal-...](https://www.theverge.com/2019/9/18/20870866/facebook-portal-mini-new-price-release-date-whatsapp-support)

Sure, the average HN reader will tell you that they don't see the point, etc.
But Amazon and Google have sold hundreds of millions of those little voice
devices.

------
isoos
I'd be really interested in the accuracy of this tool to solve Google audio
captchas. I'm assuming the price of solving captchas will go further down.

~~~
londons_explore
I wish recaptcha would let me disable audio captchas - I'm pretty sure all the
spammers solve them that way.

The volume of real users that want to sign up to my site, and are blind, and
don't have a Google account, and clear their cookies frequently enough to get
an extra recaptcha challenge, and can't just call support to make an account
for them, is probably zero.

Yet the number of spammers that come in that way must be in the millions by
now.

~~~
Semaphor
I’m thankful they don’t. Recaptcha already usually makes me close the tab
instead; without the audio captcha it would make me close it 100% of the time.
I’m not spending 20 minutes hunting for dumb images.

------
jonathanleroux
If I may insert a relevant plug: we (MERL) just put out a paper last week with
a SOTA 7.0% WER on LibriSpeech test-other (vs. wav2letter@anywhere's 7.5%) at
590 ms theoretical latency, using a joint CTC-Transformer with parallel time-
delayed LSTM and triggered attention. Check it out:
[https://arxiv.org/abs/2001.02674](https://arxiv.org/abs/2001.02674)

~~~
jonathanleroux
Correction: no PTDLSTM here but time-restricted self-attention (duh). PTDLSTM
was our previous encoder setup published at ASRU.

------
cproctor
I'm about to start as a professor in CS education, and am hoping we're getting
close to the point where I can easily transcribe interviews and high-quality
dialogue audio using open-sourced models running on machines in my lab. I'm
tired of paying $1/minute for human transcription that's not great anyway, and
would love to undertake research that would require processing a lot more
audio than is affordable on those terms.

I haven't kept up with developments over the last two years--anyone have a
sense of whether this is close to being a reality?

(I've taken a bunch of Stanford's graduate AI courses on NLP and speech
recognition; I can read documentation and deploy/configure models but don't
have much appetite for getting into the weeds.)

~~~
kick
Earlier this year the Media Lab did an absolutely _ginormous_ automated
transcription project. Off the top of my head, it was ~2.8 billion words.
13.1% error rate (vs. ~7% error rate for Google's proprietary solution).

Found the paper on it:

[https://arxiv.org/pdf/1907.07073.pdf](https://arxiv.org/pdf/1907.07073.pdf)

~~~
woodson
Sadly they don’t (well, can’t) release the audio+transcripts as dataset, as
they clearly don’t own the rights.

~~~
kick
They did release it as a dataset. I have a copy of it. It's massive. I'd
recommend reading the paper, it has a link to a place where you can download
it, and aside from that, it's also fascinating.

~~~
woodson
Including the audio? I downloaded the transcripts from their bucket, but
couldn’t find any information on how to obtain the corresponding recordings.
At Interspeech 2019, the author basically told me that they couldn’t share it.

------
ColanR
So what's the efficiency of this model? Can I use it instead of pocketsphinx
on a raspberry pi?

~~~
sp332
According to [https://research.fb.com/wp-content/uploads/2020/01/Scaling-u...](https://research.fb.com/wp-content/uploads/2020/01/Scaling-up-online-speech-recognition-using-ConvNets.pdf)
the benchmarks were run on "Intel Skylake CPUs with 18 physical cores and
64GB of RAM."

~~~
yorwba
The real-time factor for a single audio stream looks like 0.1 (based on
eyeballing the graph), so it should be possible to achieve acceptable speeds
even with a slower CPU (maybe not a Pi). The memory requirements for the
intermediate results are likely to be substantial, though. They say they have
"carefully optimized memory use", but don't give any figures.
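As a back-of-envelope illustration of what that real-time factor implies: the
0.1 figure is eyeballed from the paper's graph, and the 5x slowdown below is a
made-up assumption (real CPU scaling is rarely this linear), but the arithmetic
shows why a slower machine could still keep up with live audio.

```python
# Real-time factor (RTF) = processing time / audio duration.
# An RTF of 0.1 means 1 s of audio takes 0.1 s to process.

def processing_time(audio_seconds, rtf):
    """Seconds of compute needed to decode `audio_seconds` of speech."""
    return audio_seconds * rtf

benchmark_rtf = 0.1   # eyeballed from the paper's single-stream graph
slowdown = 5.0        # hypothetical CPU ~5x slower than an 18-core Skylake
slower_rtf = benchmark_rtf * slowdown  # naive linear-scaling assumption

print(processing_time(60, slower_rtf))  # compute needed for a minute of audio
print(slower_rtf < 1.0)  # an RTF below 1.0 still keeps up with live speech
```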

~~~
sp332
They benchmarked several configurations, but I couldn't match up which
configuration of models produced which results. I was trying to figure out
whether, if you drop the CPU power so that throughput just barely handles one
person talking at a normal pace, latency would necessarily get crazy high.

------
rexreed
Given that this uses a beam search decoder to find the most likely word
pattern, is it possible small perturbations in audio could cause it to
improperly decode certain word strings? Sort of like the audio equivalent of
adversarial attacks, but on ASR?
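For intuition behind the question, a toy beam-search decoder makes it easy to
see how a small shift in per-frame scores can flip the decoded string. The
vocabulary and probabilities below are made up and have nothing to do with
wav2letter's actual decoder:

```python
import numpy as np

def beam_search(log_probs, beam_width=2):
    """Toy beam search over per-frame token log-probabilities.
    log_probs: (frames, vocab) array; returns the highest-scoring sequence."""
    beams = [((), 0.0)]
    for frame in log_probs:
        candidates = []
        for seq, score in beams:
            for token, lp in enumerate(frame):
                candidates.append((seq + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the top-k hypotheses
    return beams[0][0]

vocab = ["cat", "hat", "sat"]  # made-up 3-word vocabulary
clean = np.log(np.array([[0.6, 0.3, 0.1],
                         [0.4, 0.35, 0.25]]))
# A small perturbation nudges the second frame's probabilities:
perturbed = np.log(np.array([[0.6, 0.3, 0.1],
                             [0.34, 0.41, 0.25]]))

print([vocab[t] for t in beam_search(clean)])      # ['cat', 'cat']
print([vocab[t] for t in beam_search(perturbed)])  # ['cat', 'hat']
```

A 0.06 shift in one frame's probability mass is enough to change the decoded
word, which is the same mechanism audio adversarial attacks on ASR exploit.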

------
yellow_lead
The name must be a nod to Word2Vec[1]. A cool naming scheme IMO.

[1]
[https://en.m.wikipedia.org/wiki/Word2vec](https://en.m.wikipedia.org/wiki/Word2vec)

~~~
nmfisher
Facebook Research actually have another toolkit called wav2vec that's based on
the same principle as word2vec (self-supervised discriminative pretraining).

word2vec learns good feature representations for words by looking at a context
window and adjusting weights for the ngram that fits within that window,
simultaneously doing so in the opposite direction for randomly sampled ngrams.

wav2vec does something similar, learning good feature representations of audio
by distinguishing a "real" sample frame from a randomly sampled frame. I can't
remember whether wav2vec operates on the raw waveform, the Fourier transform
or MFCCs, but the underlying principle is the same.

These learned feature vectors have been shown to dramatically improve
downstream tasks.
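The contrastive idea described above can be sketched as: given a context
embedding, the model learns to score the true next frame higher than randomly
sampled distractors. Below is a minimal numpy sketch of that objective; the
dimensions, toy loss, and all variable names are illustrative assumptions, not
wav2vec's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_loss(context, positive, negatives):
    """InfoNCE-style loss: reward identifying the real frame embedding
    among randomly sampled distractor frames."""
    pos_score = context @ positive    # similarity to the true frame
    neg_scores = negatives @ context  # similarities to distractors
    scores = np.concatenate(([pos_score], neg_scores))
    # softmax cross-entropy with the positive at index 0
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])

dim = 16
context = rng.normal(size=dim)
positive = context + 0.1 * rng.normal(size=dim)  # real frame: near the context
negatives = rng.normal(size=(5, dim))            # random frames: unrelated

# Low loss when the real frame is the positive; high loss when a random
# frame is (wrongly) treated as the positive and the real one is a distractor.
loss_easy = contrastive_loss(context, positive, negatives)
loss_hard = contrastive_loss(context, negatives[0],
                             np.vstack([positive, negatives[1:]]))
print(loss_easy, loss_hard)
```

Training then adjusts the encoder so that true (context, frame) pairs get the
low-loss configuration, which is what produces the useful representations.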

------
starpilot
Do the pretrained models work decently on landline phone quality recordings? I
can see massive value for this if it can transcribe corporate call center
audio.

~~~
woodson
They can’t, because they are trained on more-or-less high-quality recordings
of people reading books out loud. Phone conversations are very different, not
just in audio quality but in the way people speak.

------
z3t4
For any project like this, please post exactly the sound configuration used
for the model, e.g. the sample rate (Hz), channels, and format.

------
amluto
I wonder if this would be a good engine to plug in to rhasspy.

------
phkahler
Which OSS license?

~~~
trakout
BSD

