
Mozilla Common Voice Dataset: More data, more languages - dabinat
https://discourse.mozilla.org/t/common-voice-dataset-release-mid-year-2020/62938
======
echelon
Data in ML is critical, and this release from Mozilla is absolute gold for
voice research.

This dataset will help the many independent deep learning practitioners
such as myself who aren't working at FAANG and have only had access to
datasets such as LJS [1] or self-constructed datasets that have been cobbled
together and manually transcribed.

Despite the limited materials available, there's already some truly amazing
stuff being created. We've seen a lot of visually creative work being produced
in the past few years, but the artistic community is only getting started with
voice and sound.

[https://www.youtube.com/watch?v=3qR8I5zlMHs](https://www.youtube.com/watch?v=3qR8I5zlMHs)

[https://www.youtube.com/watch?v=L69gMxdvpUM](https://www.youtube.com/watch?v=L69gMxdvpUM)

Another really cool thing popping up is TTS systems trained from non-English
speakers reading English corpora. I've heard Angela Merkel reciting
copypastas, and it's quite amazing.

I've personally been dabbling in TTS as one of my "pandemic side projects" and
found it to be quite fun and rewarding:

[https://trumped.com](https://trumped.com)

[https://vo.codes](https://vo.codes)

Besides TTS, one of the areas I think this data set will really help with is
the domain of Voice Conversion (VC). It'll be awesome to join Discord or
TeamSpeak and talk in the voice of Gollum or Rick Sanchez. The VC field needs
more data to perfect non-aligned training (where source and target speakers
aren't reciting the same training text that is temporally aligned), and this
will be extremely helpful.

I think the future possibilities for ML techniques in art and media are nearly
limitless. It's truly an exciting frontier to watch rapidly evolve and to
participate in.

[1] [https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/)

~~~
indogooner
Curious to know why researchers don't use audiobooks/videos and their
transcripts when data is not available. Is it because these don't capture
different dialects/accents?

~~~
petargyurov
I suppose there might be some copyright issues with such content (?)

~~~
mjepronk
Well, there is LibriVox...

~~~
est31
Indeed, that's what one of the famous established datasets, LibriSpeech, is
based on.

------
lunixbochs
This is great! I'm always excited to see new Common Voice releases.

As someone actively using the data, I wish I could more easily see (and
download lists for?) the older releases, as there have been 3-4 dataset updates
for English now. If we don't have access to versioned datasets, there's no way
to reproduce old whitepapers or models that use Common Voice. And at this
point I don't remember the statistics (hours, accent/gender breakdown) for
each release. It would be neat to see that over time on the website.

I’m glad they’re working on single word recognition! This is something I’ve
put significant effort into. It’s the biggest gap I’ve found in the existing
public datasets - listening to someone read an audiobook or recite a sentence
doesn’t seem to prepare the model very well for recognizing single words in
isolation.

My model and training process have adapted for that, though I’m still not sure
of the best way to balance training of that sort of thing. I have maybe 5
examples of each English word in isolation but 5000 examples of each number
(Speech Commands), and it seems like the model will prefer e.g. “eight” over
“ace”, I guess due to training balance.

Maybe I should be randomly sampling 50 of the 5000 examples of the
over-represented words each epoch, so the model still has a chance to learn
from them without overtraining?
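
Something like this minimal sketch of per-epoch subsampling, assuming a
word-to-clips mapping (the names here are just for illustration, not anyone's
actual code):

    import random

    def build_epoch_samples(clips_by_word, cap=50, seed=None):
        """Cap over-represented words at `cap` clips per epoch so rare words
        aren't drowned out; resample a fresh subset every epoch."""
        rng = random.Random(seed)
        epoch = []
        for word, clips in clips_by_word.items():
            epoch.extend(rng.sample(clips, cap) if len(clips) > cap else clips)
        rng.shuffle(epoch)
        return epoch

    # rebuild at the start of each epoch, e.g.
    # samples = build_epoch_samples(clips_by_word, cap=50, seed=epoch_idx)

Resampling with a different seed each epoch would let the model eventually see
most of the 5000 clips without any single epoch being dominated by them.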

~~~
scribu
What if you first trained a classifier that told you whether the utterance is
a single word vs. multiple words? Then, based on that prediction, you would
use one of two separate models.

The technique you're thinking of is called oversampling and there are many
other general techniques for dealing with imbalanced datasets, as it's a very
common situation.
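
For the oversampling route, a minimal PyTorch sketch using a weighted sampler
(the toy labels below are just for illustration; WeightedRandomSampler is a
standard torch.utils.data class):

    from collections import Counter
    from torch.utils.data import WeightedRandomSampler

    # one word label per training clip (hypothetical toy data)
    labels = ["eight"] * 5000 + ["ace"] * 5

    counts = Counter(labels)
    # weight each clip inversely to its class frequency so rare words
    # get drawn about as often as common ones
    weights = [1.0 / counts[label] for label in labels]

    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    # then: DataLoader(dataset, batch_size=32, sampler=sampler)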

~~~
lunixbochs
Thanks, the oversampling mention gives me a good reference to start.

The model itself has generalized pretty well to handle both single- and
multi-word utterances, I think, without a separate classifier, but I'm
definitely not going to rule out multi-model recognition in the long run.

My main issues with single words right now are:

- The model sometimes plays favorites with numbers (ace vs eight)

- Collecting enough word-granularity training data for words-that-are-not-numbers
(I've done a decent job of this so far, but it's a slow and painful
process. I've considered building a frontend to turn sentence datasets into
word datasets with careful alignment)

~~~
nmstoker
For that last point, forced alignment tools may be useful.

An issue to watch for, though, is elision: a word in a sentence is often said
differently from the word in isolation. For example, saying "last" and "time"
separately, one typically includes the final t in "last", yet said together it
commonly comes out more like "las time".
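
A rough sketch of the idea, assuming you already have (word, start, end)
timestamps out of a forced aligner (the function and file names below are
hypothetical):

    import soundfile as sf

    def extract_word_clips(wav_path, word_alignments, out_dir, pad=0.05):
        """Cut one clip per aligned word.
        word_alignments: list of (word, start_sec, end_sec) tuples."""
        audio, sr = sf.read(wav_path)
        for i, (word, start, end) in enumerate(word_alignments):
            lo = max(0, int((start - pad) * sr))
            hi = min(len(audio), int((end + pad) * sr))
            sf.write(f"{out_dir}/{i:04d}_{word}.wav", audio[lo:hi], sr)

    # extract_word_clips("sentence.wav",
    #                    [("last", 0.31, 0.62), ("time", 0.62, 0.95)],
    #                    "word_clips")

Because of the elision issue, clips cut this way may not sound like carefully
pronounced isolated words, which is worth keeping in mind before training on
them.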

~~~
lunixbochs
Yeah, I'm familiar with forced alignment. This is slightly nicer than generic
forced alignment, because my model has already trained on the alignment of all
of my training data. My character-based models already have pretty good
guesses for the word alignment.

I think I'd be very cautious about it: use a model with a different
architecture than the aligner to validate extracted words, and probably play
with training on the data a bit to see whether the resulting model makes sense
or not. I do have examples of most English words to compare extracted words
against.

~~~
nmstoker
Sounds like a great approach

------
jointpdf
Does this dataset include people with voice or speech disorders (or other
disabilities)? I don’t see any mention of it in this announcement or the
forums, though I haven’t looked thoroughly (yet).

Examples: dysphonias of various kinds, dysarthria (e.g. from ALS / cerebral
palsy), vocal fold atrophy, stuttering, people with laryngectomies / voice
prosthesis, and many more.

Altogether, this represents millions of people for whom current speech
recognition systems do not work well. This is an especially tragic situation,
since people with disabilities depend more heavily on assistive technologies
like ASR. Data/ML bias is rightfully a hot topic lately, so I feel that the
voices of people w/ disabilities need to be amplified as well (npi).

~~~
sagz
There's g.co/euphonia for those projects

~~~
daanzu
Google is certainly doing some great work with this, both Project Euphonia and
other research [0]. However, as far as I know, the Euphonia dataset is closed
and only usable by Google. A Common Voice disordered speech dataset would
(presumably) be open to all, allowing independent projects and research. (I
would love to have access to such a dataset.)

[0] [https://ai.googleblog.com/2019/08/project-euphonias-personalized-speech.html](https://ai.googleblog.com/2019/08/project-euphonias-personalized-speech.html)

------
intopieces
I would love to work for Mozilla on this effort full time. I have experience
in voice data collection / annotation / processing at 2 FAANG companies.
Anyone have an in? I'm thinking of reaching out directly to the person who
wrote this post.

~~~
nmstoker
The people on those related projects seem like a great bunch to work with.

I don't have "an in", but it's probably worth having a look over the Common
Voice and Deep Speech forums on Discourse to see who the main people are. They
also hang out in their Matrix chat groups, so you might be able to get in
touch that way. Links are below.

[https://discourse.mozilla.org/c/deep-speech/247](https://discourse.mozilla.org/c/deep-speech/247)

[https://discourse.mozilla.org/c/voice/239](https://discourse.mozilla.org/c/voice/239)
[https://chat.mozilla.org/](https://chat.mozilla.org/)

------
Polylactic_acid
Why on earth are they using mp3 for the dataset? It's absolutely ancient and
probably the worst choice possible. Opus is widely used for voice because it
gets flawless results at minuscule bitrates. And don't tell me it's because
users find mp3 simpler, because if you are doing machine learning I expect you
know how to use an audio file.

~~~
lunixbochs
Probably because they're uploading (and playing back) from a webpage and Web
Audio is weird and inconsistent, so sticking to a builtin codec is probably
more reliable. As someone who trains on their data, it seems usable anyway.
Training on 1000 hours of Common Voice makes my model better in very clear
ways.

[https://caniuse.com/#search=mp3](https://caniuse.com/#search=mp3)

[https://caniuse.com/#search=opus](https://caniuse.com/#search=opus)

I got FLAC working for speech.talonvoice.com with an asm codec, so they could
do whatever in theory, but I do get some audio artifacts sometimes.

~~~
est31
Yeah especially compatibility with Apple browsers was very important for them.
I'd added functionality to normalize audio for verification but they removed
it multiple times because it didn't work on Safari for various reasons.

I ended up building an extension for Firefox that normalizes the audio on the
website if installed: [https://github.com/est31/vmo-audio-
normalizer](https://github.com/est31/vmo-audio-normalizer)
[https://addons.mozilla.org/de/firefox/addon/vmo-audio-
normal...](https://addons.mozilla.org/de/firefox/addon/vmo-audio-normalizer/)

In general, I don't think normalization should happen on the backend. It's
useful for training data to have multiple loudness levels, so that the network
can learn to handle them all.
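
The playback-side normalization is conceptually just a gain applied before
playing; a rough sketch in Python terms (peak normalization, hypothetical
names, not the extension's actual code):

    import numpy as np

    def peak_normalize(samples, target_peak=0.9):
        """Scale a float audio buffer so its loudest sample hits target_peak.
        Applied only at playback, so stored clips keep their original levels."""
        peak = np.max(np.abs(samples))
        return samples if peak == 0 else samples * (target_peak / peak)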

------
dang
If curious, see also

2019
[https://news.ycombinator.com/item?id=19270646](https://news.ycombinator.com/item?id=19270646)

------
tumetab1
To contribute
[https://voice.mozilla.org/en/speak](https://voice.mozilla.org/en/speak)

~~~
pjfin123
They make it really easy to contribute! You don't need to make an account (you
can, though), and you read/review short sentences. I just added 75 recordings
and it only took ~30 minutes. Also, if you speak other languages, you can
contribute in them. It would be really great if there were a comprehensive
public voice dataset for people to do all sorts of interesting things with.

------
j45
This is really encouraging. It's especially nice to see languages included
that have more speakers than the most commonly translated languages.

------
stergro
The whole project is very exciting. I hope this really is a game changer that
enables private individuals and startups to create new neural networks without
a big investment in data collection.

I worked on the Esperanto dataset of Common Voice over the last year, and we
have now collected over 80 hours in Esperanto. I hope that in a year or two
we'll have collected enough data to create the first usable neural network for
a constructed language, and maybe the first voice assistant in Esperanto too. I
will train a first experimental model with this release soon.

------
user764743
This is interesting. As someone who always has tons of interview data to
transcribe for academic research, what speech-to-text systems should I be
looking into to help me save some time? Is Deep Speech suited to this use?

------
villgax
Nice, now we need the CTC-based models to run offline on low-powered devices,
and then pretty much all speech-to-text APIs are done for.

~~~
lunixbochs
I've been working on this. I think I can reliably hit the quality ballpark of
STT APIs at the acoustic model level, but not yet at the language model level
(word probabilities) in a low-powered way.

Also, non-English models are _way_ behind still.
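
For context on the acoustic-model vs. language-model split: without an LM, a
CTC model can be decoded greedily, which is cheap enough for low-powered
devices; the expensive part is a beam search that mixes in word probabilities.
A minimal greedy-decode sketch (the alphabet and blank index are assumptions):

    import numpy as np

    def ctc_greedy_decode(logits, alphabet, blank=0):
        """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
        No language model involved."""
        best = np.argmax(logits, axis=-1)  # shape (time,)
        out, prev = [], None
        for idx in best:
            if idx != prev and idx != blank:
                out.append(alphabet[idx])
            prev = idx
        return "".join(out)

    # alphabet = ["<blank>", " ", "a", "b", ...]; logits has shape (time, len(alphabet))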

~~~
villgax
The way google & now Apple manage to run these on device is pretty neat.
Google has a blog post for the same too.

The recently updated Mozilla Voice dataset still lacks non-EN languages sadly.

