
Project Common Voice - mhr_online
https://voice.mozilla.org/
======
albertzeyer
The terminology is a bit confusing. They say they want to build voice
recognition, but it seems like they actually might want to build a speech
recognition engine. Speech recognition is about recognizing the speech, i.e.
the spoken words. Voice recognition is about recognizing the speaker's voice,
i.e. identifying the speaker. Maybe they also want to build a text-to-speech
(TTS) system, but I'm not sure.

No matter what, the collected data might be useful for all of that, except
perhaps voice recognition, because I guess the data will be collected
anonymously?

Note that there are some other existing big open speech corpora such as
LibriSpeech ([http://www.openslr.org/12/](http://www.openslr.org/12/)) which
could already be used right now to build a quite good speech recognition
system.

~~~
punchingwater
I can tell from your comment (and its responses) that the language on our
homepage is a bit confusing, so thank you for the feedback.

To answer your question: Common Voice is about building a collection of
labelled voice data (i.e. sentence clips with transcripts) that can be used
to, for instance, train speech-to-text algorithms. Part of the goal of this
project, though, is to figure out how this data can best help people build
voice technology. So it's pretty open ended at this point.

Mozilla does have an open source speech-to-text engine [1] we are developing,
and we hope one day to use the Common Voice data to train this engine.
DeepSpeech and Common Voice are related, but separate projects, if that makes
sense.

As for LibriSpeech, the DeepSpeech team at Mozilla does use this data for
training. However, the language is pretty antiquated, and we only get about 1K
hours of data, whereas you need about 10K hours to get to a decent accuracy
(WER of 10% and below). Common Voice is about adding to public corpora like
LibriSpeech, not replacing them.
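For context, WER (word error rate) is the word-level edit distance between the
recognizer's hypothesis and the reference transcript, normalized by the
reference length. A minimal sketch in Python:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words,
    normalized by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for the word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

So "WER of 10%" means roughly one word in ten is substituted, inserted, or
deleted relative to the reference.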

1.)
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

~~~
albertzeyer
Very interesting. I was not aware of Mozilla DeepSpeech (which implements the
model from Baidu's paper of the same name, in TensorFlow). Note that the
issue with DeepSpeech (the CTC model from the Baidu
paper) is that it really needs a lot of training data to perform well (that is
a generic property of CTC). If you use more conventional models (hybrid NN/HMM
models), you can get very decent word-error-rate performance with only a few
hundred hours of data. The advantage of DeepSpeech of course is that it is
simpler and you don't need a lexicon (mapping words to their pronunciations,
i.e. sequences of phonemes).
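For context, CTC's many-to-one output mapping (usually written B) just merges
repeated labels and then drops the blank symbol; a minimal sketch:

```python
def ctc_collapse(path, blank="-"):
    """CTC's collapse function B: merge consecutive repeated labels,
    then drop blanks, turning a frame-level path into a label sequence."""
    out = []
    prev = None
    for sym in path:
        # Append only when the symbol changes and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)
```

Many different frame-level paths collapse to the same transcript, which is
part of why the model needs so much data to learn good alignments on its own.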

I would also not use voice technology as the generic term for speech
recognition, text-to-speech, and whatever else you want to do with this data.
Rather, speech technology is the common term to cover all of this
([https://en.wikipedia.org/wiki/Speech_technology](https://en.wikipedia.org/wiki/Speech_technology)).

~~~
punchingwater
Noted. Again thanks for the feedback :)

------
apeddle
This looks great! I use voice control to program on occasion due to an rsi
injury. The standard stack for this is a mess due to closed source systems
that aren't designed for voice programmers. A good open solution could really
save me from a lot of headaches.

~~~
oulipo
You can take a look at what we're building at
[https://snips.ai](https://snips.ai); we will open-source the platform later
this year.

~~~
Jayakumark
Cool. Just curious: what is the voice engine behind Snips, and who provides
the training data? Also, do you have plans to support additional languages,
or can it be trained on them once you open-source it?

------
cooper12
If they're planning to make a voice recognition system, why are they using
example statements that are clearly taken from novels? [0] That's not how real
people talk. They use a lot more slang, a lot more stopping and starting,
filler words, etc. Instead you have people saying things like "irresolute",
"rumbling", and other complex words. It would be useful for training a novel
dictation system, but it's not how people would speak to their browser for
example.

[0]: An example sentence is "a thin circle of bright metal showed between the
top and the bottom of the body of the cylinder", which is from H. G. Wells'
_War of the Worlds_.

~~~
jpalomaki
Maybe there aren't yet good open datasets available for this kind of material?

This gives Amazon, Apple and Google a nice advantage since they are able to
collect huge sample sets of actual voice commands used by people and to some
extent also correlate them with the actual action taken by the person.

How could we collect such a dataset? It's a bit of a chicken-and-egg problem:
I don't want to talk to some open source system unless it has a fairly good
chance of understanding me. Should we try to semi-manually (through
crowdsourcing) come up with potential requests like "Check news from CNN.com"
or "Order me a quattro stagioni", which could then be fed to a platform like
Common Voice?

Or should we work at a higher level? Come up with task descriptions ("You
want to order a taxi to get to the airport for your morning flight at 7am")
and then let people record how they would actually request this from a
computer by voice. This might more accurately capture the language we
actually use when speaking. Through some simple automation you could generate
variations of the requests, and at least part of the same base material could
be used for different languages (task given in English, person asked to make
the request in Finnish).
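The "generate variations" step could be as simple as slot-filling templates.
A toy sketch in Python (the templates and slot values here are invented for
illustration, not from any real Common Voice tooling):

```python
import itertools
import re

# Hypothetical slot fillers -- made up for illustration.
SLOTS = {
    "verb": ["check", "read", "get"],
    "source": ["CNN.com", "the BBC"],
    "item": ["quattro stagioni", "taxi"],
}

def expand(template, slots):
    """Yield every filled-in variant of one slot-filling template."""
    names = re.findall(r"{(\w+)}", template)
    for combo in itertools.product(*(slots[n] for n in names)):
        yield template.format(**dict(zip(names, combo)))
```

For example, `expand("{verb} me a {item}", SLOTS)` yields six variants,
including "get me a taxi".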

~~~
saurik
If you want people carefully reading books, it is pretty easy to get a hold of
that kind of data in the form of audio books and the work of Recording for the
Blind and Dyslexic. Sure, it isn't chunked into sentences, but since you have
all of the source text you could do a quite reasonable job automating the
slicing, throw out the places you aren't sure about, and still have a
near-infinite amount of great data. (Note that it isn't like these sentences
are perfect
anyway, hence the filtering process with volunteers: while I was judging some
audio files one of the issues was "person turned off microphone a little too
soon".)

~~~
icc97
Perhaps that's one of the points of using text from books: you can compare
how people speak with a recording by someone who was specifically tasked with
reading the book aloud for the audiobook.

------
glandium
Sadly, in Demographic Data, only native English accents can be selected.

~~~
a3_nm
I have reported this and it looks like they intend to fix it:
[https://github.com/mozilla/voice-web/issues/242](https://github.com/mozilla/voice-web/issues/242)

~~~
breakingcups
If I read the issue right they don't intend to fix it at all.

------
pebers
Is the data going to be freely available as well? It's a little unclear
whether they intend to make it separately available or not.

~~~
hardwaresofton
It looks like the database will be open sourced later this year:
[https://voice.mozilla.org/faq](https://voice.mozilla.org/faq)

I'm wondering if the format will be easily translatable to the kinds of
models that software like CMUSphinx and Julius use.

[https://cmusphinx.github.io/](https://cmusphinx.github.io/)

[http://julius.osdn.jp/en_index.php](http://julius.osdn.jp/en_index.php)

~~~
lucb1e
It's weird to me that they publish a project about having an open dataset of
voice data with only a promise to open it up later.

~~~
zie
It's Mozilla, they likely just don't have the infrastructure/code in place
yet. They definitely have my trust that they actually will. They have
been very good in the past about keeping things open. They make mistakes
sometimes, but their goals are all about being open.

~~~
punchingwater
Thanks for the vote of confidence! Yes, we will absolutely open this data up;
it's just a matter of collecting enough data to be useful, and then building
the UI. We have a goal of achieving this by the end of 2017, so stay tuned!

------
therealunreal
Any plans for languages other than English?

~~~
a3_nm
This seems planned:
[https://github.com/mozilla/voice-web/issues/213](https://github.com/mozilla/voice-web/issues/213)

------
jldugger
And... 503'd. I didn't catch what the intended use case was before it died,
but I'm guessing computer generated voice?

Most of the computer generated stuff I've seen uses trained actors. Which
neatly avoids the problem of trying to reconcile a myriad of accents and
dialects, which was immediately apparent from the first two samples I tried.

edit: back up, seems to be about voice recognition, which this could help with
no problem.

~~~
popinman322
Actually, based on the site content I think they're using it to create an
archive of speech data to train speech recognition systems.

~~~
mbebenita
That is correct. The DeepSpeech project,
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)
will use this data to train and validate open source / freely available speech
to text models. The training data, along with the trained models, will be
made available for free to users and researchers alike.

------
eatbitseveryday
It would be useful to collect data from non-native speakers of a language.
More and more such individuals are appearing in all countries, and devices
that accept spoken words should not break because of someone's level of
command of a spoken language. For example, a Swiss speaking German
(Hochdeutsch), or more clearly, a Brit speaking French, etc. Some children who
grow up in multi-lingual families also intermix words from multiple languages
into their sentences. We can still understand them.

~~~
punchingwater
This is a bug with our website [1]. We are actually trying to collect
non-native speakers (as well as native ones). We are looking into clarifying
this on the site.

1.)
[https://github.com/mozilla/voice-web/issues/242](https://github.com/mozilla/voice-web/issues/242)

------
giancarlostoro
I wonder if implementing a new type of reCAPTCHA with these kinds of projects
in mind would make sense. The data wouldn't be going to some data center in
Google land, but instead to an open dataset that anyone would be able to get
their hands on. Also, a free and open source reCAPTCHA alternative would be
nice. The trick is keeping it complex enough that bots cannot just reuse the
existing public dataset. Maybe withhold some of the data from the public for
a few years until it's deemed 'retired'.

------
ZoomZoomZoom
I hope this data will be used purely for voice recognition purposes and not
for voice generation, or we'll be stuck with robots talking with this horrible
gurgling and clicking accent due to poor recording conditions of most
participants!

~~~
sirlantis
According to their FAQ they actually want those poor conditions to be present
in the corpus.

> We want the audio quality to reflect the audio quality a speech-to-text
> engine will see in the wild. Thus, we want variety. This teaches the speech-
> to-text engine to handle various situations—background talking, car noise,
> fan noise—without errors.

[https://voice.mozilla.org/faq](https://voice.mozilla.org/faq)
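For a corpus that turned out too clean, that kind of variety can also be
simulated by mixing noise into recordings at a chosen signal-to-noise ratio.
A minimal pure-Python sketch over raw sample lists:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean signal with noise, scaling the noise so the result
    has the requested signal-to-noise ratio in dB."""
    noise = noise[: len(clean)]
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Solve 10 * log10(clean_power / (scale**2 * noise_power)) == snr_db.
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]
```

Recorded background noise (car, fan, chatter) mixed this way is a common
data-augmentation trick, though it still can't reproduce every real-world
microphone artifact.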

------
tangue
Cool project, really aligned with the mission of Mozilla, and with a pleasant
UX. And if you're a non-native English speaker like me, validating sentences
is a nice way of improving your comprehension.

~~~
TomasSedovic
Yep! Though we non-native speakers should really be recording, too, so we're
not left behind in voice recognition.

~~~
punchingwater
Exactly! Part of the goal of Common Voice is to make voice recognition work
better for everyone other than North American men (which is where the vast
majority of the training data currently comes from).

If you are a non-native speaker, we need your voice!

~~~
tangue
Okay, but what should foreigners submit for country and accent in the form?
The only options are anglophone countries.

~~~
punchingwater
Good question. Sounds like we should add an "Other" option to that dropdown,
and make it clear that we are looking for _all_ accents?

~~~
tangue
Yes - imho "non-native English speaker" would be the field where I would look
first. (edit: typed on phone...)

------
sexydefinesher
Should I have any privacy concerns about contributing? I don't want just
anyone to have the data needed to recreate my voice digitally.

------
olegkikin
Man, most people have horrible microphones.

~~~
mbebenita
Indeed, and a big problem for STT models, which is why we're trying to collect
this type of data.

------
timwaagh
This is an important development. Voice control has good potential. It would
be cool if they used it as an alternative way to control Firefox and/or
Servo.

------
ibotty
Any idea why the duplicate detection did not work for this link:
[https://news.ycombinator.com/item?id=14786881](https://news.ycombinator.com/item?id=14786881)

Anyhow: these should be merged (even though there is no discussion on the
other submission).

~~~
pvinis
I think that's why: when the previous thread is not very active, dupes are
allowed. Not sure though.

~~~
tomhoward
Yes. The system is designed to allow multiple chances for good content to get
exposure.

[https://hn.algolia.com/?query=dang%20deliberately%20porous&s...](https://hn.algolia.com/?query=dang%20deliberately%20porous&sort=byPopularity&prefix&page=0&dateRange=all&type=comment)

------
kgdinesh
Can't believe it's down already.

