
Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Dataset - Vinnl
https://blog.mozilla.org/blog/2017/11/29/announcing-the-initial-release-of-mozillas-open-source-speech-recognition-model-and-voice-dataset/
======
muxator
I am very grateful for this release from Mozilla, and more generally for the
broad vision of their effort.

As time passes, the quest for openness and freedom in software moves higher up
the stack. Thanks to the last ~30 years of effort, we have basically reached a
point where we have free OSes, basic infrastructure, build tools, and end-user
applications.

In the last ~10 years the paradigm changed: autonomous desktop computing
progressively transitioned to mobile, with a lot of functionality offloaded to
"the cloud".

What I feel is needed, going forward, is working towards building a viable
free replacement for these distributed services. DeepSpeech is a step in the
right direction.

Edit: I'm just speaking about software here. Hardware is worth a separate
topic, and probably poses even more challenges.

~~~
sitkack
This period is only a phase while CNNs offer the best bang for the buck. As
algorithms improve, the amount of data needed to train a model will drop
1000x, making the big five's data hoards not worth nearly as much. In five
years, the power will shift away from petabyte-sized datasets.

~~~
charlysl
My understanding is that the data needed (a.k.a. sample complexity) is
independent of the algorithm. It depends only on model complexity (roughly,
the number of free parameters in the model). AFAIK this is a fundamental
principle of ML that has been proven mathematically and which is inescapable.

[https://en.m.wikipedia.org/wiki/VC_dimension](https://en.m.wikipedia.org/wiki/VC_dimension)
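
For context, the textbook PAC-learning form of that claim (a standard bound,
nothing specific to speech models) says that to learn a hypothesis class of VC
dimension $d$ to error $\epsilon$ with confidence $1-\delta$, it suffices to
have on the order of

    m = O\left( \frac{d \ln(1/\epsilon) + \ln(1/\delta)}{\epsilon} \right)

i.i.d. samples in the realizable case, or
$O\left( (d + \ln(1/\delta)) / \epsilon^2 \right)$ in the agnostic case. The
bound depends on the hypothesis class being fit, not on which algorithm
searches it.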

~~~
ekidd
> _AFAIK this is a fundamental principle of ML that has been proven
> mathematically and which is inescapable._

Well, the human brain manages to master several complex tasks using much
smaller data sets than current machine learning algorithms. Natural language
acquisition, for example, seems to require fewer than 10 million spoken words
per year, and even academically successful 12-year-olds might be reading 1 to
4 million words per year. These are not exactly tiny data sets, but they don't
require Google's scale to recreate, either.

Sure, the human genome probably "knows" what set of models to try when
building the brain, which gives it an advantage. But I can't think of any
reason why machine learning couldn't ultimately try similar techniques with
similarly-sized data sets, and get competitive results.

~~~
charlysl
You are right.

I didn't phrase it correctly. What is algorithm independent is the theoretical
performance bound for a given amount of data.

Sure, not all algorithms are the same, but the best performance that can
theoretically be achieved with a given i.i.d. sample of a given size is
algorithm independent.

So yes, a better algorithm can reach the same performance with less data, but
the limit stays the same.

------
StavrosK
This is very timely, as today I was thinking of caving in and getting an Echo
Dot so I can control my smart home devices by voice.

I would love an open-hardware microphone array that I could use with a Pi or
something similar, to write my own Alexa. Not only would I love this, I would
store all my commands and send them to Mozilla to help with their speech
recognition models.

I don't want to be the guy who wishes someone else would do all the work and
then give it away for free, so I'll do what I can to help (which is probably
limited to writing a bunch of code and documentation on how people can set
this up more easily). Congratulations and thanks to Mozilla for this.

~~~
bronco21016
Not affiliated in any way, but have you seen the MATRIX Creator or MATRIX
Voice? Both contain microphone arrays that can interface with a Raspberry Pi.

I’d love to see one of these meshed together with this new Mozilla voice
project for an open source Echo or Google Home. The only missing piece at this
point, it seems, is NLP and all of the glue that converts commands to API
calls.

~~~
StavrosK
Oh, I hadn't; this is fantastic (and it integrates with an ESP32), thanks! I
wonder if it includes software to do some live DSP to reduce noise... The fact
that it can just connect to the Raspberry Pi's GPIOs and provide great sound
is ideal, though. I'm very glad someone has made this, I wish I had known
about it before so I could back it.

------
gok
This is super cool, but I'd be cautious about the usefulness of this data set.

Both this data set and LibriSpeech are read speech, where the speaker was
prompted with a transcription and asked to say it out loud. In practice it's
very rare that you're trying to transcribe speech that's already been
transcribed. Speech patterns for computer-directed speech (e.g. for voice
activated user interfaces) or human-to-human speech (e.g. for meeting
transcription) are quite different.

~~~
punchingwater
Yup, this is an excellent point. We have explored, and will continue to
explore, ways to allow Common Voice users to speak more organically (for
instance by answering a question, or responding free-form to some other sort
of prompt). The problem with this approach is that it requires an extra step,
transcription, which at the scale we are trying to achieve is pretty costly in
either money or time (i.e. tedium for our users). Eventually we hope that
speech engines can take care of the transcription part, but for now we need
people.

That said, we will definitely be exploring ways to build organic speech, and
perhaps transcription, into the Common Voice app. This will solve another
problem for us too, which is getting public domain material for people to
read. Doing this obviously requires a much more complex user experience, and
we have more work to do to figure out how to make something that people will
want to use and contribute to. Stay tuned for that :)

On the flip side, we hope that these datasets, models, and tools (i.e.
DeepSpeech) can get more people (researchers, start-ups, hobbyists) over the
hump of building an MVP of something useful in voice. Once you have people
using your products, collecting useful in-context voice data becomes much
easier.

On that note, another approach we are working on is partnering with
universities and socially-aware startups like MyCroft, SNIPS, and Mythic.
Imagine if voice products on the market allowed their users to _opt in_ to
contributing their utterances to an open resource similar to Common Voice. Of
course, sharing your voice publicly is not for everyone, or every product
scenario. But it does work for some. And if we pool our resources, our hope is
to indeed commoditize speech-to-text so that we can focus on more interesting
challenges like building voice experiences people want to use. (For instance,
could voice somehow be a "progressive enhancement" to the web?).

~~~
visarga
> could voice somehow be a "progressive enhancement" to the web?

I have created my own TamperMonkey plugin that adds TTS to web pages. It finds
text, makes it clickable, and when a user clicks a word, it starts reading
from there, highlighting text as it reads it, skipping menus and chrome. I
find this helps me better focus on reading. Unfortunately I can only stand one
single voice and it's been stagnating for years (Alex from Mac OS). Can't wait
to hear the WaveNet voice Google has been threatening to give us.

~~~
beagle3
Is it available for the rest of us perchance?

------
bobajeff
Awesome, we're so close to having a speech-to-text system I can trust.

I really wish Mozilla would release a keyboard app for Android. It would
instantly be the single most trusted keyboard available.

~~~
wslh
Regarding Android keyboards, it is horrific that the Google keyboard sends all
your key presses, except passwords, to them.
~~~
yjftsjthsd-h
... it does what? Even if you disable the "Share Snippets" option?

~~~
wslh
There is no official statement from Google about this, but there is a generic
Android warning saying "This method can collect all of the text that you enter
except passwords including personal data and credit card numbers.". I wonder
why Google doesn't explicitly say what they are doing. I don't want to spread
FUD, but this is a critical component.

~~~
beojan
That's because any keyboard could, theoretically, be a keylogger.

------
hardwaresofton
For those who don't know, open source speech recognition that does not depend
on AI already exists:

[http://cmusphinx.sourceforge.net/](http://cmusphinx.sourceforge.net/)

[http://julius.osdn.jp/en_index.php](http://julius.osdn.jp/en_index.php)

Maybe with this data set released, all that additional data will eventually be
used to improve those tools as well.

~~~
punchingwater
Don't forget Kaldi!

[https://github.com/kaldi-asr/kaldi](https://github.com/kaldi-asr/kaldi)

~~~
skykooler
The problem with Kaldi is that it's virtually impossible to get a dictation
model working unless you have a doctorate in speech recognition. There is no
"I know basic programming, but little about speech recognition" documentation
for Kaldi.

~~~
yjftsjthsd-h
Between the learning curve and dependency hell, I've never managed good
results with Kaldi, Simon, or Sphinx. It's unfortunate; hopefully we'll get an
easy-to-use option soon.

~~~
hardwaresofton
When was the last time you tried Sphinx? The library has changed a LOT. Their
guides, new website, and other resources basically walk you from zero
knowledge to a working demo.

~~~
yjftsjthsd-h
Oh, I will have to try again :) Thanks for the tip; I always thought Sphinx
_should_ be ideal, it was just too much work to get it working.

------
richdougherty
If you want to help them out you can visit
[https://voice.mozilla.org/](https://voice.mozilla.org/) and record some
sentences.

~~~
melling
They have an iOS app too: [https://itunes.apple.com/us/app/project-common-
voice-by-mozi...](https://itunes.apple.com/us/app/project-common-voice-by-
mozilla/id1240588326?mt=8)

It’s a great idea to crowdsource this. I wonder if this project can turn voice
recognition into a solved problem.

I just set a daily reminder so I can do 10 minutes a day.

~~~
punchingwater
Thank you so much!

I also want to emphasize the importance of listening (validating) as well as
recording. Validation is a big part of the puzzle for building data that is
viable for machine learning.

~~~
RubenSandwich
Can I suggest encouraging users to get recordings from their children as well?
Most speech recognition libraries are pretty poor with children's voices. (IMO
Alexa Voice Service is by far the best with children's voices.)

~~~
yjftsjthsd-h
Is that okay legally? Maybe parental permission is enough.

------
ocdtrekkie
Really happy to see a major player looking at open source speech recognition.
Due to the lack of a self-hosted or on-device solution, my
assistant/automation software is basically designed for speech but isn't
currently doing recognition, because I haven't found a non-cloud option that
does what I need it to do.

------
jononor
Wowow. Time to build an open microphone array for improved speech pickup that
can be connected to an RPi for voice control that respects privacy.

~~~
mediocrejoker
Sort of like this one?

[https://www.matrix.one/products/voice](https://www.matrix.one/products/voice)

~~~
jononor
That is a very nice board! The inclusion of an FPGA and ESP32 makes it very
capable, and 65 USD is a good price for such a package. And I found
beamforming code for the microphone array (running on the host computer) at
[https://github.com/matrix-io/matrix-creator-hal/blob/master/cpp/driver/microphone_array.cpp](https://github.com/matrix-io/matrix-creator-hal/blob/master/cpp/driver/microphone_array.cpp)
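
For the flavor of what such code does (whether or not the MATRIX driver works
exactly this way): delay-and-sum beamforming time-aligns each microphone's
signal for a chosen look direction and averages, reinforcing sound from that
direction. A minimal numpy sketch, where the array geometry, sample rate, and
sign convention are illustrative assumptions rather than MATRIX specifics:

    import numpy as np

    def delay_and_sum(channels, delays):
        # channels: (n_mics, n_samples); delays: per-mic delay in samples
        # for the chosen look direction. np.roll wraps around, which is
        # fine for a sketch; a real implementation would pad instead.
        out = np.zeros(channels.shape[1])
        for ch, d in zip(channels, delays):
            out += np.roll(ch, -d)  # advance each channel so wavefronts align
        return out / len(channels)

    # Illustrative setup: 8 mics on a 5 cm circle, 16 kHz audio,
    # steering toward azimuth 0.
    fs, c, r = 16000, 343.0, 0.05
    angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
    delays = np.round(fs * r * np.cos(angles) / c).astype(int)

    mics = np.random.randn(8, fs)  # stand-in for real captures
    enhanced = delay_and_sum(mics, delays)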

------
kljuka
How does Mozilla's 6.5% error rate on LibriSpeech’s test-clean dataset compare
to Google's, Apple's, Amazon's, and others' voice recognition? I couldn't
easily find any chart of comparisons.

~~~
TD-Linux
In the Baidu Deep Speech 2 paper, the Baidu implementation is able to get
5.33%, and a human 5.83%.
[https://arxiv.org/pdf/1512.02595v1.pdf](https://arxiv.org/pdf/1512.02595v1.pdf)

------
asabjorn
Does this relate to or help at all with speaker identification? The Microsoft
speech API also provides speaker recognition which is useful for many
applications: [https://azure.microsoft.com/en-us/services/cognitive-
service...](https://azure.microsoft.com/en-us/services/cognitive-
services/speaker-recognition/)

~~~
woodson
No, this is for speech recognition only.

------
Dowwie
> _the world’s second largest publicly available voice dataset, which was
> contributed to by nearly 20,000 people globally_

Well done, people!

------
tardygrad
This is tangential, but I wonder if something like this could be (mis)used to
break captchas, by feeding in the disabled-friendly audio captcha and passing
the results back to the captcha server.

As voice recognition becomes more sophisticated, I think captchas are going to
have to evolve to keep up as well.

~~~
candiodari
Captchas are utterly beaten; moreover, it's not about the technology or the
difficulty: they are a lost cause. For pretty much any problem you might
present in a captcha, machine learning performs better than humans.

So today, failing a captcha is actually an indication that the other end is
human.

------
tmzt
How does this compare with Snips.co, which can do offline speech recognition
on a Raspberry Pi 3?

Could this be used to train a model/engine that can be used that way?

~~~
reubenmorais
The model we released today is not yet optimized for smaller devices like
that, but our plan is to make it usable on targets like the RPi3.

~~~
RubenSandwich
Are you releasing any prebuilt models, so people can play with your work
without training? I searched but couldn't find any.

Edit: NM, found them under releases:
[https://github.com/mozilla/DeepSpeech/releases](https://github.com/mozilla/DeepSpeech/releases).
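
For anyone who wants to try it, here is a minimal inference sketch against the
released 0.1.0 Python package. The constructor arguments mirror that release's
example client; later releases may change the signature, so treat the exact
API as an assumption:

    import scipy.io.wavfile as wav
    from deepspeech.model import Model

    # Hyperparameters from the 0.1.0 example client (illustrative;
    # later releases changed the constructor).
    N_FEATURES = 26   # MFCC features per audio frame
    N_CONTEXT = 9     # frames of context on each side
    BEAM_WIDTH = 500  # beam width for the CTC decoder

    ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
               'models/alphabet.txt', BEAM_WIDTH)

    # The released model expects 16-bit, 16 kHz, mono WAV input.
    fs, audio = wav.read('my_audio.wav')
    print(ds.stt(audio, fs))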

~~~
knocte
OMG, C++??? I thought Mozilla folks would give Rust a spin.

~~~
TD-Linux
It is TensorFlow-based, and TensorFlow has a C++ API. Though it looks like
they provide Rust bindings for the DeepSpeech library itself.

------
monsieurgaufre
I don't see why Mozilla would do this kind of thing, except to spread
resources thin. I know that it is anecdotal, but I don't know anyone who uses
any kind of speech-to-text, in part because they all suck if you don't speak
English, and even then...

~~~
khedoros1
Personal assistant devices are selling like hotcakes, and there are tons of
options for creating smart lighting and such. Maybe I hang around tinkerers
too much, but I think it would be nice to have a method of speech control that
doesn't rely on someone's remote speech recognition API.

~~~
monsieurgaufre
As I stated above, it might be anecdata or a cultural bias, as I live in
Montreal.

While I do believe that's interesting, I don't understand why that would be
Mozilla's job.

~~~
khedoros1
Why not? They build free software and promote open standards; speech
recognition is a prominent area that doesn't have any really good open
solutions right now.

------
pmoriarty
Does this foreshadow a day when Firefox starts spying on what I'm talking
about?

~~~
jimktrains2
Wouldn't this be the opposite? By bundling the speech model so that it's on
the client device instead of their server, your speech need not leave your
device.

