
Facebook open-sources a speech-recognition system and a machine learning library - runesoerensen
https://code.fb.com/ai-research/wav2letter/
======
mark_l_watson
Very nice. In my work, as much as I love working with RNNs, convolutional
models are faster to train and use. Years ago, FB blew away text modeling
speed with fasttext, and it is good to see these projects made publicly
available too.

As much as I sometimes criticize FB and Google over privacy issues, they also
do a lot of good by releasing open source systems. Most of my work involves
using TensorFlow and Keras, and a little over two years ago I replaced a
convolutional text classification model with fasttext, with good results.

~~~
stochastic_monk
I haven’t done any significant projects using neural networks for sequence
modeling or analysis, but when I do, I plan to start with a temporal
convolutional network, based on [0]. They argue that RNNs being standard is
likely an artefact of the history of the field, while they get superior
performance from TCNs.

[0] [https://github.com/locuslab/TCN](https://github.com/locuslab/TCN)
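The core building block of a TCN is a causal, dilated 1-D convolution: the output at time t only sees inputs at t, t-d, t-2d, and so on, and doubling the dilation per layer grows the receptive field exponentially. A minimal NumPy sketch of that idea (an illustration, not code from the linked repo):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """Causal dilated 1-D convolution: y[t] = sum_i w[i] * x[t - i*dilation].

    Left-pads with zeros so y[t] never depends on future inputs, which is
    what lets a stack of these layers model sequences autoregressively.
    """
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# Kernel size 2 with dilations 1, 2, 4, ... covers 2^L timesteps in L layers.
y1 = causal_dilated_conv([1, 2, 3, 4], w=[1, 1], dilation=1)  # [1, 3, 5, 7]
y2 = causal_dilated_conv([1, 2, 3, 4], w=[1, 1], dilation=2)  # [1, 2, 4, 6]
```

A real TCN layer adds learned multi-channel kernels, residual connections, and normalization on top of this, but the causal-padding trick is the essential difference from an ordinary convolution.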

~~~
yablak
The truth is that the best architecture depends on the problem, and you get a
lot out of "graduate student gradient descent", i.e., hyperparameter search
and fiddling with fine-grained architecture details. You'll want to
experiment with RNN, TCN, and transformer (T2T) models to find the best one
for your problem.

Disclaimer, I wrote the TF RNN api.

~~~
alexcnwy
Totally agree - it's hard to say which sequence architecture works best
without knowing your problem (are there long-term dependencies, do they exist
at multiple levels of scale, etc.) and your dataset size.

Convolutions are cheap to compute and can be more efficient on smaller data,
but it's also possible that a CNN outperforms on a small dataset while an RNN
wins once the dataset grows.
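To put rough numbers on "convolutions are cheap": a back-of-the-envelope multiply count per timestep for a 1-D conv layer versus an LSTM layer of comparable width (hypothetical layer sizes, ignoring biases and nonlinearities):

```python
# Per-timestep multiply counts, to make the cost intuition concrete.

def conv1d_mults(kernel, c_in, c_out):
    # each output channel at each timestep sums kernel * c_in products
    return kernel * c_in * c_out

def lstm_mults(x_dim, h_dim):
    # 4 gates, each a matmul over the concatenated [input, hidden] vector
    return 4 * h_dim * (x_dim + h_dim)

conv = conv1d_mults(kernel=3, c_in=256, c_out=256)  # 196,608 multiplies
lstm = lstm_mults(x_dim=256, h_dim=256)             # 524,288 multiplies
```

And unlike the LSTM recurrence, the conv outputs for all timesteps can be computed in parallel, which is a large part of the training-speed win regardless of the raw multiply counts.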

------
jimmychangas
People are quick to criticize Facebook here on HN, but this release is
awesome. I believe open source speech recognition is still lacking, and any
contribution is very welcome. CMU Sphinx and Kaldi are great, but it feels
like the most recent advances in the field are still hidden behind paid
services.

~~~
pbalau
You would be amazed at how few comments an article about FB doing "this tech
thing" attracts, versus a generic "FB is bad" piece. Terribly few people can
comment on a tech subject, but everyone and their dog has an opinion about
how FB destroys humanity.

~~~
mempko
People confuse Facebook the org with Facebook the workers. Don't confuse the
great work its workers do with what management decides.

~~~
chrisbrandow
That logic doesn’t generalize too well.

------
ToFab123
So how good is that speech recognition system at, let's say, listening in on
a phone conversation and using that information in, let's say, the Facebook
news feed? You know, the precise thing many, me included, suspect Facebook is
actively doing. And if you say they don't do anything like that, why do they
need a speech recognition system in the first place, given that Facebook is a
text-based system?

~~~
varenc
They need speech recognition for their Alexa/Google Home like hardware
product: [https://portal.facebook.com](https://portal.facebook.com)

(Notice how much they emphasize privacy and security in their marketing...)

~~~
akhilcacharya
For the record, I do think it's telling that they currently use Amazon's
Alexa Voice Service for this (or at least most of it, after the hotword
recognition) instead of building on Project M.

~~~
varenc
Alexa is optional, and Facebook has its own voice recognition for Portal
features. You can use the wake word “Hey Portal” for a limited set of Portal
commands, mostly around calling and messaging, or “Alexa” for all the Alexa
capabilities. When using Alexa functionality, Facebook isn't supposed to
“listen” at all. (Though I'm not sure whether those queries still flow
through their servers or not.)

[https://portal.facebook.com/help/2149102838698668/](https://portal.facebook.com/help/2149102838698668/)

------
m0zg
Where are the pre-trained models? It's worthless without them. Edit: NM,
someone hunted down the AWS links

    
    
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout.bin
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout-cpu.bin
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-lowdropout.bin

------
maddyboo
Link to the GitHub repo, for the lazy:
[https://github.com/facebookresearch/wav2letter/](https://github.com/facebookresearch/wav2letter/)

~~~
motivated_gear
As someone behind a corporate firewall on a Friday, thank you

------
Yoric
Does anyone know how this compares with Mozilla's DeepSpeech?
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

~~~
yorwba
From [https://arxiv.org/abs/1812.06864](https://arxiv.org/abs/1812.06864) "On
Librispeech, we report state-of-the-art performance among end-to-end models,
including Deep Speech 2 trained with 12 times more acoustic data and
significantly more linguistic data."

Specifically they claim word error rates that are 1 to 2 percentage points
lower, 3.44% on "clean" and 11.24% on "other".
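For anyone unfamiliar with the metric: word error rate is just word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("this is a test", "this is test"))  # 0.25 (one deletion / 4 words)
```

So a drop from ~4.9% to ~3.4% on test-clean means roughly one fewer error per 70 reference words.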

~~~
StudentStuff
It'll be interesting to see if anyone can reproduce their results; so far
it's been troublesome:
[https://github.com/facebookresearch/wav2letter/issues/88](https://github.com/facebookresearch/wav2letter/issues/88)

------
dlojudice
I wonder whether all these open-source releases happening this month
[1][2][3] are related to improving team morale internally...

[1]
[https://github.com/facebookresearch/DeepFocus](https://github.com/facebookresearch/DeepFocus)

[2]
[https://github.com/facebookresearch/nevergrad](https://github.com/facebookresearch/nevergrad)

[3]
[https://github.com/facebookresearch/pytext](https://github.com/facebookresearch/pytext)

~~~
rock_hard
Probably just an end-of-year push to get things out.

~~~
benbayard
Or they're in a production code freeze, so engineers can spend more time
pushing these final projects out and finally get them released.

~~~
xpaulbettsx
It's Perf Review time, gotta get that Impact without affecting production

------
ZainRiz
Did they release a model to go with this (so that the average dev can actually
use this in their app) or is this just a tool for researchers?

~~~
nh2
[https://github.com/facebookresearch/wav2letter/issues/88](https://github.com/facebookresearch/wav2letter/issues/88)
mentions a "pre-trained model named librispeech-glu-highdropout.bin", but I
couldn't find it anywhere.

[https://github.com/facebookresearch/wav2letter/issues/93](https://github.com/facebookresearch/wav2letter/issues/93)
also mentions a pre-trained model but without any reference which one or where
to find it.

Googling "librispeech-glu-highdropout.bin" still shows the text "luajit
~/wav2letter/test.lua ~/librispeech-glu-highdropout.bin -progress -show -test
dev-clean -save -datadir ~/librispeech-proc/ -dictdir ~/librispeech-proc/
-gfsai ..." for
[https://github.com/facebookresearch/wav2letter/blob/master/R...](https://github.com/facebookresearch/wav2letter/blob/master/README.md),
but following the link, the file is gone.

But the Google Cache still has the result, including 3 pre-trained models:

    
    
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout.bin
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout-cpu.bin
        wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-lowdropout.bin
    

The cache also includes a much more detailed README on how to use the
software.

~~~
taf2
Thank you! I found that some of the forks also have the more detailed README
files for example:
[https://github.com/19ai/wav2letter](https://github.com/19ai/wav2letter)

~~~
nh2
Thanks.

It would be great if anybody could build it all and check whether the
out-of-the-box experience with the pretrained model is good.

I've tried Mozilla's DeepSpeech a few times, but so far it hasn't reliably
recognised "this is a test" without mistakes out of the box, even from a
good microphone.

------
ocdtrekkie
Always excited to see more speech recognition releases. Way too many solutions
are "just point your microphone's feed at our cloud service", and a lot of
those that aren't have somewhat lagged behind.

------
sfilargi
Interesting, why does FB need a speech recognition system?

~~~
karmasimida
Why not?

Facebook has a research arm dedicated to playing Go/StarCraft too; what do
you think their reason is for doing that?

They can use a speech recognition system to transcribe videos, just like
YouTube does, to improve ad targeting and recommendation. Why is this hard
to understand?

~~~
nurettin
That would be a bit creepy and opportunistic if FB did that. YouTube videos
are by and large uploaded for profit. FB videos are uploaded to store
birthdays and family vacations.

~~~
dqpb
Could be used for indexing videos for search

------
freyir
Any idea why they developed Flashlight for this, instead of using PyTorch?

------
ngngngng
> developing in modern C++ is not much slower than in a scripting language.

Is this accurate? I haven't written C++ since freshman year of college, and
it was very cumbersome then.

~~~
AlexCoventry
It's improved a lot in terms of expressiveness, but it's still plagued by
memory errors and obscure error messages.

Who said that, anyway? I don't see it in the text linked by the OP link.

~~~
ngngngng
Sorry I should have mentioned, it's part of the research paper announcing the
release. It's on the right column of the first page.

[https://arxiv.org/pdf/1812.07625.pdf](https://arxiv.org/pdf/1812.07625.pdf)

~~~
AlexCoventry
Thanks.

------
nshm
I like the way they politely skipped Kaldi's WER on test-clean (4.31) in the
results table. Their best WER (4.91) would not look so attractive next to it.

~~~
woodson
Even more so if they compared it to Kaldi TDNN-LSTM with RNNLM lattice
rescoring (test-clean 3.22%, apparently): [https://github.com/kaldi-
asr/kaldi/blob/master/egs/librispee...](https://github.com/kaldi-
asr/kaldi/blob/master/egs/librispeech/s5/local/chain/tuning/run_tdnn_lstm_1b.sh)

------
IshKebab
This is excellent. Modern free speech recognition software is hard to come by.
Everything except Kaldi has laughable error rates, and Kaldi is a huge pain to
set up.

Will be interesting to see what people can do with this and the available data
sets.

~~~
faitswulff
> available data sets

Reminder that Mozilla's Common Voice project accepts voice donations!
[https://voice.mozilla.org/](https://voice.mozilla.org/)

~~~
darkpuma
Given the value of open source training data (or the scarcity of it), has
anybody attempted to use LibriVox for training?

[https://en.wikipedia.org/wiki/LibriVox](https://en.wikipedia.org/wiki/LibriVox)

[https://librivox.org/](https://librivox.org/)

The recordings are public domain audio books of public domain books, so the
licensing should be fine. The audio isn't annotated, but given the value
involved I think it would be worth attempting to use forced alignment to
annotate the recordings with their public domain source texts. Forced
alignment using the sort of speech recognizer you're trying to train in the
first place may be a bit "chicken and the egg", but from some experiments I've
run myself existing open source speech recognizers can do it reasonably well.
Humans could manually tune up the alignment to improve the quality if
necessary.
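The forced-alignment idea above can be sketched as plain word-level sequence alignment: run an existing recognizer to get (word, timestamp) pairs, then align those to the known reference text with edit-distance backtracking, leaving gaps where the recognizer missed words. A toy version (assumes you already have recognizer output with timestamps; real forced aligners work at the phone/frame level, not the word level):

```python
def align_words(ref, hyp):
    """Align reference words to recognized (word, time) pairs.

    Returns a list of (ref_word, time_or_None); reference words the
    recognizer missed get None and could be interpolated or fixed by hand.
    """
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and the words of hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1][0] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitute
                          d[i - 1][j] + 1,         # ref word missed
                          d[i][j - 1] + 1)         # spurious hyp word
    # Backtrack, attaching hypothesis times to aligned reference words.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1][0])):
            out.append((ref[i - 1], hyp[j - 1][1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out.append((ref[i - 1], None))
            i -= 1
        else:
            j -= 1
    return out[::-1]

ref = "the quick brown fox".split()
hyp = [("the", 0.0), ("quick", 0.4), ("fox", 1.1)]  # recognizer missed "brown"
print(align_words(ref, hyp))
```

The None gaps are exactly the spots where the "read along" UI could ask a human listener to nudge the alignment.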

As for motivating people to actually do that mundane work... well these are
audio books so maybe the work isn't so mundane after all! The LibriVox
recording of Tom Sawyer (read by John Greenman: [https://librivox.org/tom-
sawyer-by-mark-twain/](https://librivox.org/tom-sawyer-by-mark-twain/)) is
pretty great and has been listened to by millions of people. If somebody
created a "read along" web app that showed you the text of the book from
Project Gutenberg getting highlighted as the audiobook from LibriVox was
played, users who have an interest in reading/hearing the book could have
their attention held by Mark Twain and with the right UI provide fine tuning
for the forced alignment at the same time.

~~~
lingz
LibriVox is commonly used as a training corpus; however, its main weakness is
that read speech differs quite a lot from conversational speech.

~~~
darkpuma
I see. I hadn't considered that might be a weakness because, to be honest,
creating forced alignments between LibriVox recordings and the source texts
was my objective. (It's a feature that exists on some Kindles when pairing
ebooks with Audible audiobooks. I believe the feature makes literature more
accessible, a noble enough cause, although I can understand why more effort
is being spent on recognizing conversational speech.)

------
honkycat
I like to take recordings of my thoughts on my cellphone similar to Dale
Cooper. Unfortunately, I do not have a Diane on the other end to translate my
thoughts to text, I have to do that myself.

I've been looking into things like mozilla/deepspeech and other open source
libraries for automatically converting my messages to text. I'll have to take
a look at this project as well!

~~~
halfjew22
Hey, me too! A while ago I was looking at trying to figure out how to hack
something like this together myself when I came across what is now one of my
top three apps: Otter.

Sounds like a shill and I don’t really care. I’m a premium member with 6,000
minutes of transcription time per month (and sometimes I’ve used almost all
of it) and I couldn’t be happier.

You can export everything, support and head of product are kind and
responsive, and you can click in the transcription anywhere and it will play
the audio at that point.

Exactly what I need.

My main complaint is that it’s geared towards corporate environments for
conferences, meetings, etc., so the grouping isn’t exactly what I like, but
I use my text editor to keep the notes more to my liking.

Being able to search hundreds of hours of my thoughts by word is a
fantastically empowering experience, and I hope you find the same.

Let me know what you think! Shoot me an email if you want to chat about it
ever. If you can't tell I'm a pretty big fan.

------
buboard
How good is this library? Is this good and fast enough to see wider adoption
in embedded devices or phones? Would be awesome to be able to voice-enable
apps without the need for a cloud provider. How would this compare with a C++
pytorch-based approach?

~~~
woodson
It has dependencies on CUDA and/or Intel MKL, so not really suitable as-is for
embedded/phones.

------
SpaceManNabs
From what I understand, this is less of a machine learning advancement and
more of an engineering advancement? Trying to see if any of the bleeding-edge
stuff has been implemented. Still waiting for SELUs to be standard!

------
EGreg
Is it better than openEars? I want something that will work onboard without
sending to a server.

------
wpdev_63
It would be interesting to see this benchmarked against mozilla's deepspeech.

------
suyash
Is there a demo of a working example app using this library?

------
g_delgado14
Management: "Let's keep open sourcing all of the things so that the informed
community will overlook our transgressions!"

------
vbuwivbiu
Is it ethical to use their code ?

~~~
darkpuma
I think you should consider it ethical to do ethical things using a tool
created by unethical people. Consider, for instance, Fritz Haber, the
so-called "father of chemical warfare", who contributed to the development
of the Haber–Bosch process for artificial nitrogen fixation. That process
facilitated the mass production of explosives by Germany during the Great
War, but it also provides as much as 50% of the nitrogen in the body of an
average human today due to its role in the production of fertilizer.

Whether it's ethical to contribute back to the project, knowing that the
unethical creator might derive unethical utility from your contributions, is
perhaps slightly more complicated. However, the same could be said of any
open source project: you could create something new wholly from scratch, and
if you release it publicly, somebody else could use it for something
unethical.

I commend your consideration of ethical concerns, which I think is lacking in
the tech industry today. But in this particular case I don't believe there is
too much cause for concern.

~~~
adjkant
I think your points generally stand, but Facebook open source raises some
specific issues at least worth considering:

If your use of it contributes to its popularity, perhaps becoming the standard
of X area, does that give Facebook the company more power and possibly enable
other unethical actions?

I think it's probably not as much of a worry given the narrowness of the area,
but I do think this is something to consider when it comes to React for
example.

