Online speech recognition with wav2letter anywhere (facebook.com)
234 points by moneil971 on Jan 13, 2020 | hide | past | favorite | 63 comments

All I see is """Sorry, this content isn't available right now The link you followed may have expired, or the page may only be visible to an audience you're not in. Go back to the previous page · Go to News Feed · Visit our Help Center"""

Edit: found a link that works https://github.com/facebookresearch/wav2letter

Same error (from NL)

Fixed now!

So by open sourced I assume this means there are absolutely no Facebook dependencies where the voice is passing through a Facebook server? Sorry, have to ask, as my trust level is low. Otherwise, awesome!

The repository (https://github.com/facebookresearch/wav2letter) claims to come with pre-trained models for automated speech recognition.

That's cool! I wonder how it works on podcasts.

> Trained models: ...

I think I remember a similar thing happening with previous wav2letter releases.

I would love a simple tutorial on just using a pretrained model, but that feels unlikely to ever happen.

Yes, nothing goes through Facebook servers. The model will be run locally on the machine.

Not only does it mean that, it means you can look at the source code yourself rather than asking the question on HN.

Online speech recognition for English.

The framework should be generalizable, but the models they are making available are only for English. Actually adapting this for any other language would be a huge amount of additional work.

> would be a huge amount of additional work.

A small amount of additional work, and a huge amount of money to pay for dataset collection and compute for training.

The dataset collection is going to be work for someone.

This is one of the reasons why we're using Azure Cognitive Services. https://docs.microsoft.com/en-us/azure/cognitive-services/sp...

That, and (at least for English) the results are the most accurate I've ever seen.

How does this compare to Mozilla's DeepSpeech?

And does anyone know when Mozilla will release the updated Common Voice dataset from https://voice.mozilla.org ?

I've been following DeepSpeech for a while. They have a WER in the 7% range, and wav2letter's SOTA model is at around 5%.

I haven't used wav2letter, but I can run DeepSpeech on my (low powered) laptop with faster than real-time transcription with just the CPU.

Word error rate depends heavily on dataset.

All modern models get to ~human level when tested on individual phrases or sentences.

None yet get to human level when the source is many paragraphs long, because the human benefits from context and 'getting used to the accent', which ML has so far not achieved.

It is misleading to use librispeech WER as a general guide to real world WER. Don't do that.
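For concreteness, WER is just word-level edit distance (substitutions, insertions, deletions) divided by the reference length, which is why it's so sensitive to what's in the test set. A minimal self-contained sketch (not any toolkit's scorer; assumes a non-empty reference):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count.

    Assumes the reference is non-empty. Uses a rolling-row dynamic
    program over the hypothesis words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match, cost 0)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)
```

On clean read speech like LibriSpeech the references are short, fluent sentences, so this number comes out flattering; conversational or accented audio pushes it up fast.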

wav2letter outperforms, in large part because it seems to make better use of more training data. Facebook’s original paper shows that wav2letter outperforms when trained with their very large internal dataset, and they included a reproduction on a smaller open source dataset with worse overall accuracy but again wav2letter outperforming.

I'd love a tutorial that shows a normal guy like me how to use this tool with the pre-trained models to transcribe my audio files. Not finding anything of that kind included there.

The preprint: https://research.fb.com/wp-content/uploads/2020/01/Scaling-u...

Interestingly, the baselines are all systems that model graphemes instead of modeling acoustic units (phonemes) directly.

Speaking as a Facebook user, I'm a bit confused - where do they use speech recognition? Or is this just purely research oriented?

Voice user interfaces are becoming more common. Ignoring this technology is a bad idea. FB has VR devices and their Portal device.


Sure, the average HN reader will tell you that they don't see the point, etc. But Amazon and Google have sold hundreds of millions of those little voice devices.

Subtitles for videos is my guess.

Didn't they "get caught" for having their apps listening all the time and everyone was wondering how they kept getting ads for things they only ever talked about?

They could use microphone audio for ad targeting.

There's ongoing speculation around whether they're currently doing this, but it could be an ongoing area of research. I'd imagine efficient on-device transcription would help in this regard.



better ads targeting based on user-generated audio/video?

I'd be really interested in the accuracy of this tool to solve Google audio captchas. I'm assuming the price of solving captchas will go further down.

I wish recaptcha would let me disable audio captchas - I'm pretty sure all the spammers solve them that way.

The volume of real users that want to sign up to my site, and are blind, and don't have a Google account, and clear their cookies frequently enough to get an extra recaptcha challenge, and can't just call support to make an account for them, is probably zero.

Yet the number of spammers that come in that way must be in the millions by now.

I’m thankful they don’t. Recaptcha already makes me usually close the tab instead, without audio captcha it would make me close it 100% of the time. I’m not spending 20 minutes hunting for dumb images.

I'm not a spammer but I always use the audio option. Typing one word is way less onerous than clicking through multiple screens hunting for level crossings, fire hydrants, etc

If I may insert a relevant plug: we (MERL) just put out a paper last week with SOTA 7.0% WER on LibriSpeech test-other (vs wav2letter@anywhere's 7.5%) with 590 ms theoretical latency using joint CTC-Transformer with parallel time-delayed LSTM and triggered attention. Check it out: https://arxiv.org/abs/2001.02674

Correction: no PTDLSTM here but time-restricted self-attention (duh). PTDLSTM was our previous encoder setup published at ASRU.

I'm about to start as a professor in CS education, and am hoping we're getting close to the point where I can easily transcribe interviews and high-quality dialogue audio using open-sourced models running on machines in my lab. I'm tired of paying $1/minute for human transcription that's not great anyway, and would love to undertake research that would require processing a lot more audio than is affordable on those terms.

I haven't kept up with developments over the last two years--anyone have a sense of whether this is close to being a reality?

(I've taken a bunch of Stanford's graduate AI courses on NLP and speech recognition; I can read documentation and deploy/configure models but don't have much appetite for getting into the weeds.)

Earlier this year the Media Lab did an absolutely ginormous automated transcription project. Off the top of my head, it was ~2.8 billion words. 13.1% error rate (vs. ~7% error rate for Google's proprietary solution).

Found the paper on it:


Sadly they don’t (well, can’t) release the audio+transcripts as dataset, as they clearly don’t own the rights.

They did release it as a dataset. I have a copy of it. It's massive. I'd recommend reading the paper, it has a link to a place where you can download it, and aside from that, it's also fascinating.

Including the audio? I downloaded the transcripts from their bucket, but couldn’t find any information on how to obtain the corresponding recordings. At Interspeech 2019, the author basically told me that they couldn’t share it.

If you have access to the corresponding audio recordings, would you be able to share them?

Just curious: $1/min sounds like quite a bit of money, are you paying for some professional to do this?

If so have you compared with using Mechanical Turk?

That’s minute of recorded audio, and that’s a pretty standard transcription rate. Using anything less than a professional service will show in the quality of the output, and even many services don’t produce high quality transcripts, especially those that use temp (often undergraduate/graduate student) labor.

Minutes of recorded audio make a lot of sense, thanks!

So what's the efficiency of this model? Can I use it instead of pocketsphinx on a raspberry pi?

According to https://research.fb.com/wp-content/uploads/2020/01/Scaling-u... the benchmarks were run on "Intel Skylake CPUs with 18 physical cores and 64GB of RAM."

Yes, in the paper we discuss our benchmarks on Intel CPUs. But, as we mention in the final section, we also made the system work efficiently on iOS and Android; we just haven't open-sourced those in this release. This will be in our future work.

For reference, our system runs at 0.1 RTF on iPhone 10 using the Accelerate framework under FP16 precision. INT8 should be better, but we haven't benchmarked it yet!

The real-time factor for a single audio stream looks like 0.1 (based on eyeballing the graph), so it should be possible to achieve acceptable speeds even with a slower CPU (maybe not a Pi). The memory requirements for the intermediate results are likely to be substantial, though. They say they have "carefully optimized memory use", but don't give any figures.
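The arithmetic behind that claim is simple: real-time factor is processing time divided by audio duration, so anything under RTF 1.0 keeps up with live speech. A toy calculation (the 0.1 figure is the thread's eyeballed estimate, not a published number):

```python
def processing_time(audio_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to transcribe `audio_seconds` of speech."""
    return audio_seconds * rtf

# At RTF 0.1, one minute of speech needs ~6 s of compute.
# A CPU roughly 5x slower lands at RTF 0.5 -- 30 s of compute per
# minute of audio, still comfortably faster than real time.
fast_cpu = processing_time(60, 0.1)
slow_cpu = processing_time(60, 0.5)
```

Whether a Pi actually lands under RTF 1.0 is exactly the open question, since the published benchmarks are on much beefier Skylake cores.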

They benchmarked several configurations, but I couldn't match up which configuration of models produced which results. I was trying to figure out if you drop the CPU power so throughput just handles one person talking at a normal pace, whether latency would necessarily get crazy high.

So no, lol

They say they're coming out with Android and iOS versions soon, so maybe take a look after that point to see how they've tweaked the models and if the error rates are a lot higher.

FWIW, we already have working versions on Android and iOS, but didn't have time to open-source them with the current release. This is certainly in our future work.

Given that this uses a beam search decoder to find the most likely word pattern, is it possible small perturbations in audio could cause it to improperly decode certain word strings? Sort of like the audio equivalent of adversarial attacks, but on ASR?
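That fragility is easy to see even in a toy decoder. The sketch below is a generic beam search over per-frame token probabilities (not wav2letter's actual decoder, which also folds in a language model and CTC blank/repeat collapsing): nudging one frame's distribution slightly is enough to flip the winning hypothesis.

```python
import math

def beam_search(frames, beam_width=3):
    """Toy beam search: keep the `beam_width` best partial hypotheses.

    `frames` is a list of dicts mapping token -> probability for one
    time step. Scores are cumulative log-probabilities.
    """
    beams = [("", 0.0)]
    for frame in frames:
        candidates = [(hyp + tok, score + math.log(p))
                      for hyp, score in beams
                      for tok, p in frame.items() if p > 0]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# A small perturbation in the middle frame flips the decoded word:
clean     = [{"c": 0.9, "b": 0.1}, {"a": 0.6, "o": 0.4}, {"t": 1.0}]
perturbed = [{"c": 0.9, "b": 0.1}, {"a": 0.45, "o": 0.55}, {"t": 1.0}]
```

Here `beam_search(clean)` decodes to "cat" while the perturbed frames decode to "cot", which is the shape of the adversarial concern: the attack only has to move the acoustic scores past a decision boundary, and the search does the rest.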

The name must be a nod to Word2Vec[1]. A cool naming scheme IMO.

[1] https://en.m.wikipedia.org/wiki/Word2vec

Facebook Research actually have another toolkit called wav2vec that's based on the same principle as word2vec (self-supervised discriminative pretraining).

word2vec learns good feature representations for words by looking at a context window and adjusting weights for the ngram that fits within that window, simultaneously doing so in the opposite direction for randomly sampled ngrams.

wav2vec does something similar, learning good feature representations of audio by distinguishing a "real" sample frame from a randomly sampled frame. I can't remember whether wav2vec operates on the raw waveform, the Fourier transform or MFCCs, but the underlying principle is the same.

These learned feature vectors have been shown to dramatically improve downstream tasks.
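The objective being described is essentially word2vec's negative-sampling loss: score the true (context, target) pair high and randomly sampled pairs low, via a logistic loss on dot products. A schematic sketch with toy vectors (this is the shape of the objective, not wav2vec's actual architecture or feature extractor):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(context, positive, negatives):
    """Negative-sampling logistic loss.

    Pushes dot(context, positive) up and dot(context, negative) down.
    word2vec applies this to word/ngram vectors; wav2vec applies the
    same idea to learned representations of audio frames.
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    loss = -math.log(sigmoid(dot(context, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(context, neg)))
    return loss

ctx = [0.5, 1.0]
aligned = [0.6, 0.9]        # similar to context -> small loss
negatives = [[-0.5, -1.0]]  # dissimilar negatives -> small loss
```

Minimizing this forces the encoder to make the true next frame distinguishable from random frames, which is what produces representations useful for downstream ASR.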

___2___ has been a common naming scheme for conversion utilities for a long time.

Do the pretrained models work decently on landline phone quality recordings? I can see massive value for this if it can transcribe corporate call center audio.

They can’t, because they are trained on more-or-less high-quality recordings of people reading books out loud. Phone conversations are very different, not just in audio quality but in the way people speak.

For any project like this, please post exactly the audio configuration used for the model, e.g. the sample rate (Hz), channels, and sample format.

I wonder if this would be a good engine to plug in to rhasspy.

Which OSS license?

