
Mozilla releases the largest to-date public domain transcribed voice dataset - Vinnl
https://blog.mozilla.org/blog/2019/02/28/sharing-our-common-voices-mozilla-releases-the-largest-to-date-public-domain-transcribed-voice-dataset/
======
sgc
I love this.

I do a lot of dictation on mobile devices for work, with middling, and perhaps
more importantly, frustrating results (needless to say we are working on
programming our way out of that hole). It is an area ripe for open source
progress, given the failure of larger companies with large proprietary data
sets to make _basic common sense_ decisions in their transcription algorithms,
and the lack of any way to provide impactful feedback.

If anybody is interested, there is definitely a market for a more robust
dictation library that can be integrated into apps and works offline. It just
needs to be professional - e.g. allow for preferences including the ability to
indicate a _strong_ preference for standard language and grammar over all
slang, not forcing Title Case for anything resembling a brand name, having a
training mode for words and phrases of the user's choosing, blacklisting
certain word or phrase results that are false positives, and proper learning
from user corrections during use, so the tedium of correcting the same phrase
100 times disappears.
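
To make that last part concrete, the per-user layer could be a simple post-processing pass over the recognizer's output. This is a purely hypothetical sketch (no real dictation library exposes exactly this API), with a correction map built from past user fixes and a blacklist of known false-positive phrases:

    # Hypothetical per-user post-processing layer; all names here are invented.
    learned_corrections = {"pier review": "peer review"}  # built up from user fixes
    blacklisted_phrases = {"Hey Siri"}                     # known false-positive results

    def postprocess(transcript: str) -> str:
        """Apply remembered corrections, then strip blacklisted false positives."""
        for wrong, right in learned_corrections.items():
            transcript = transcript.replace(wrong, right)
        for phrase in blacklisted_phrases:
            transcript = transcript.replace(phrase, "")  # or trigger a re-decode instead
        return " ".join(transcript.split())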

~~~
99052882514569
There's also a huge need for better transcription software in radiology.
Existing ones are expensive and are just not good enough to be an actual time-
saver.

(A radiologist friend describes switching from a human medical
transcriptionist to one of these "AI" software thingies as a cost-cutting
measure by his hospital, or more accurately as a cost-offload measure.
Hospital offloads salary, radiologist spends more time correcting stupid
transcription mistakes for no extra pay).

~~~
MattSayar
I thought it would be faster to correct a transcript than type it up fresh. I
can type pretty damn quickly, but still slower than the average person talks.
I would rather follow along to an AI transcript and correct a few words than
type it all new.

~~~
jcims
The problem is that transcription errors never include a misspelling and
usually sound fine in your inner voice. You have to really pay attention to
find them (reading the text backwards a sentence at a time sometimes helps
with this).

~~~
derefr
Imagine a professional “assisted stenography” software package. Rather than
automatic full transcription + proof-reading after the fact, it would
transcribe “live” but with any words where its model gave a low-confidence
output highlighted with the expectation that the human stenographer, listening
to the audio at the same time the machine is, will type in the correct
transcription (or just hit tab to accept, like in autocomplete.)
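
A minimal sketch of that confidence-gating step, assuming the recognizer exposes per-word confidences (the threshold and names here are invented, not any particular engine's API):

    LOW_CONFIDENCE = 0.85  # words scoring below this get flagged for the human

    def flag_for_review(words):
        """words: list of (text, confidence) pairs from a live decoder."""
        return [(text, conf < LOW_CONFIDENCE) for text, conf in words]

    hypothesis = [("the", 0.99), ("witness", 0.97), ("pled", 0.61), ("guilty", 0.95)]
    for text, needs_review in flag_for_review(hypothesis):
        print(("*" + text + "*") if needs_review else text, end=" ")
    # prints: the witness *pled* guilty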

~~~
jcims
I would love it if I could always see the top three possible interpretations
of what I just said to my phone. It's interesting watching the speech
recognition think about what's being said and flip words around once it has
enough context.

------
abakker
I have a (probably dumb) question: why don't we just use audiobooks for this?
There are thousands and thousands of hours where the transcripts were written,
and then read aloud. Some of them are now public domain. I'm sure some
validation would need to be done, but it seems like there would be an endless
validation set there. Am I missing something obvious?

~~~
punchingwater
Audiobooks are definitely possible for ASR training. Indeed the largest open
ASR training dataset before Common Voice was LibriSpeech
([http://www.openslr.org/12/](http://www.openslr.org/12/)). Also note, the
first release of Mozilla's DeepSpeech models were trained and tested with
LibriSpeech: [https://hacks.mozilla.org/2017/11/a-journey-to-10-word-
error...](https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/)

But as others have mentioned, there are several problems with audiobooks as an
ASR training dataset. First, the language used in literature is often very
different from how people actually speak, especially if that language comes
from very old texts (and many public domain books are indeed quite old).

Then there is the sound profile, which includes background noise, quality of
the microphone, the speaker's distance to the device, etc. For recorded
audiobooks, the speaker is often using a somewhat sophisticated setup to make
the audio quality as clean as possible. This type of setup is obviously unusual
when people want to speak to their devices.

Third, the tone and cadence of read speech is different than that of
spontaneous speech (the Common Voice dataset also has this problem, but they
are coming up with ideas on how to prompt for spontaneous speech too).

But the goal of Common Voice was never to replace LibriSpeech or other open
datasets (like TED talks) as training sets, but rather to complement them. You
mention transfer learning. That is indeed possible. But it's also possible to
simply put several datasets together and train on all of them from scratch.
That is what Mozilla's DeepSpeech team has been doing since the beginning (you
can read the above hacks blog post from Reuben Morais for more context there).

~~~
olejorgenb
> Then there is the sound profile, which includes background noise, quality of
> the microphone, the speaker's distance to the device, etc. For recorded
> audiobooks, the speaker is often using a somewhat sophisticated setup to make
> the audio quality as clean as possible. This type of setup is obviously
> unusual when people want to speak to their devices.

It shouldn't be that hard to degrade the quality synthetically? And with a
clean source you can synthesize different types of noise/distortions.
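
For the simplest kind of degradation (additive noise at a chosen SNR), a rough sketch could look like the following; real augmentation pipelines would also simulate reverb, codecs, microphone responses, and so on:

    import numpy as np

    def add_noise(clean, snr_db):
        """Mix white noise into a clean waveform at a target SNR in dB."""
        signal_power = np.mean(clean ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=clean.shape)
        return clean + noise

    # e.g. turn a clean audiobook-style sample into a noisy 10 dB SNR version
    t = np.linspace(0, 1, 16000)
    clean = np.sin(2 * np.pi * 440 * t)  # stand-in for a real recording
    noisy = add_noise(clean, snr_db=10)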

~~~
vidarh
I can't speak for voice data, as I've not worked with voice, but I did my MSc
on various approaches for reducing error rates for OCR. I used a mix of
synthetically degraded data ranging from applying different kinds of noise to
physically degrading printed pages (crumpling, rubbing sand on them, water
damage), and while it gave interesting comparative results between OCR
engines, the types of errors I got never closely matched the types of errors I
got from finding genuine degraded old books. I've seen that in other areas
too.

My takeaway from that was that while synthetic degradation of inputs can be
useful, and while it is "easy", the hard part is making it match real
degradation closely enough to be representative. It's often _really hard_ to
replicate natural noise closely enough for those kinds of methods to be
sufficient.

Doesn't mean it's not worth trying, but I'd say that unless voice is very
different it's the type of thing that's mostly worth doing if you can't get
your hands on anything better.

~~~
jfoutz
You might say that if you can identify and simulate all cases of real-life
degradation, your problem is basically solved: just reverse the simulation on
your inputs.

I'm not saying OCR isn't hard. I'm saying that normalizing all those
characters basically is the problem.

~~~
dbdjfjrjvebd
This isn't quite true if e.g. there are degenerate cases.

------
16bytes
It's great to see innovation in the space of open data.

There have recently been a number of assertions that better quality ML data
will outperform better ML algorithms, and this has certainly been true in my
experience as well, especially in domains like speech recognition.

There's going to be a long road to catch up to the big players, however. Even
15 years ago there were companies who were doing 1M minutes of labeled voice
data _per year_.

The data gap between established players and newcomers to the market will
continue to grow unless we invest in efforts like this.

~~~
novaRom
The original Deep Speech 2 paper, released a few years ago, mentioned one to
two hundred thousand hours, and the amount of data has increased significantly
since then.

Still, quite a lot of languages have very tiny datasets of transcribed data.

------
Wowfunhappy
I decided to spend 10-ish minutes validating voices, because why not.

I did not hear any lines that were flat-out spoken incorrectly, at least as
far as I could tell. However, I did come across a ton of _really_ poor
samples, to the point of being somewhat difficult to understand. Things like:

• _Really_ strong accents

• Horrible, muffled microphones

• Background noise

• Super quiet

• A couple "robotic" samples I legitimately think were generated via text-to-
speech software

All of these types of samples (save the last) constitute possible real-world
scenarios. But do they make for good training data? I know very little about
machine learning, but it makes logical sense to me that you'd want to teach
the computer with "clean" data: something with a high signal-to-noise ratio
which is as close to the "average" of the real world as possible. Is this
completely wrong?

Separately, they ought to provide some instruction on what to do with
borderline samples. If I legitimately can't tell for sure whether a word was
spoken correctly, what should I do?

~~~
opportune
Yes, you do want the bad voice samples, precisely because they correspond to
actual input that a model might receive. A dataset with only clear samples
would likely have a stronger "signal" overall, meaning it might be easier for
an academic testing a model to get higher test accuracy training on that data.
The bad data makes the model more robust to non-ideal input types.

~~~
XMPPwocky
In fact, it's not uncommon to take a dataset and deliberately, randomly
distort it: for images, things like scaling, rotating, cropping, blurring,
altering color balance and gamma, flipping, adding random noise...

The idea is to make the model resistant to that "bad" input and effectively
enlarge your dataset for free: if you have a picture of a cat, you can
automatically also get loads of pictures that you know should still be
classified as a cat: rotated 15 degrees clockwise, noisy (like in low-light
conditions), the tip of its tail out of frame, the camera's automatic white
balance screwed up...
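
A rough torchvision-flavoured sketch of that trick (the particular transforms and parameters are just illustrative; any augmentation library works the same way):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),              # rotated a few degrees
        transforms.RandomResizedCrop(224),                   # tail ends up out of frame
        transforms.ColorJitter(brightness=0.3, hue=0.05),    # white balance / low light
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Each epoch the same cat photo yields a slightly different "new" training
    # image that should still be labelled "cat":
    # noisy_cat = augment(cat_image)  # cat_image: a PIL image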

Also, the robotic samples may be real human voices mangled by LPC; think a
lossy VoIP call.

------
Theodores
In reading the comments here I see that some people do use voice with their
hand rectangles for other purposes than speaking to someone. I never do even
though there is some Google Assistant icon staring at me. If I accidentally
start recording my voice by fumbling buttons, I instinctively try to stop it
so I can continue pecking at the keyboard.

Is it a generational thing and will future generations find typing into a
search box as anachronistic as I find using a land line with a rotary dial?

I am also British and therefore not as loud as some people in the
English-speaking world. Talking to my phone on the train would make me cringe.
Clearly people like me will die off soon enough; however, is adoption a problem for
these voice technology things? How do people get into changing habits from
pecking at a keyboard to the evidently easier voice driven way of doing
things? Is there one use case, e.g. in the car, where the habit of speaking to
a gadget is learned?

~~~
MattSayar
When I was one-handed for a while after surgery, I would use my phone's
speech-to-text function often to send emails. Far faster than one-handed
typing.

------
dabinat
I’ve been contributing to Common Voice for several months now. If anyone else
is thinking of making contributions, it’s worth mentioning that there are a
lot more speakers than validators, and English currently has a one-year
validation backlog, so new validators are more useful right now than new
speakers.

~~~
gdfasfklshg4
Interesting! I started doing some validation a while ago then stopped because
I figured it would be the other way around and I wasn't prepared to speak. I
will start validating again!

------
elektor
My voice is in this dataset!

For the past few months, I've been on the iOS CommonVoice app reading
sentences out loud. It's great fun, I'd recommend it.

------
a3_nm
I'm puzzled by the "You agree to not attempt to determine the identity of
speakers in the Common Voice dataset".

On some level it's a good idea to want to request this, but as the dataset is
public-domain, isn't it going to get mirrored and retrieved by people who
won't have to agree to anything? ...

~~~
lucb1e
It has no legal meaning whatsoever since, indeed, it's public domain. Or
_maybe_ it can affect the downloader, but certainly not anyone who got the
data from the downloader (without such promise). I think the point is to
remind people that they aren't cool with it, and that just because you _can_
(CC0) doesn't mean you _should_.

------
vkaku
Great. Is there an open voice assistant project yet? Or a privacy focussed
one?

~~~
intopieces
Yes! Here’s a recent HN discussion about it:
[https://news.ycombinator.com/item?id=19152561](https://news.ycombinator.com/item?id=19152561)

------
microcolonel
Do they have the audio in a different format than MP3? MP3 is really garbage
and introduces weird biases for voice.

~~~
microcolonel
As an update, in looking at the code it seems they hardcode the choice of MP3
for storage in the server.

Also very weird that they serve the tarball of MP3s gzipped, which seems
mostly pointless to me, as it amounts to a reduction of maybe ~4% for a
tremendous amount of time spent uncompressing the tarball (which itself has a
bunch of useless macOS-specific headers, three apiece on each file).

Many of the MP3 files are literally just empty (zero size) or partially
written (corrupt), it seems. I wonder if that issue comes from their choice to
package the tarballs on macOS, or some underlying issue on the server side.
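
For anyone re-checking, a quick way to spot the empty clips after extraction (the path here is illustrative, not the dataset's actual layout):

    import glob
    import os

    # Find zero-size MP3s under wherever the tarball was unpacked.
    bad = [p for p in glob.glob("cv_corpus/**/*.mp3", recursive=True)
           if os.path.getsize(p) == 0]
    print(len(bad), "zero-size clips")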

~~~
nshm
I think it's on purpose, to avoid giving too much of an advantage to
downloaders; there was a discussion a long time ago, but they still continue
with MP3. There are also other not-so-nice points:
[https://github.com/kaldi-asr/kaldi/issues/2141](https://github.com/kaldi-asr/kaldi/issues/2141)

------
stunt
Sadly, this is an area that big companies are not contributing to, in order to
keep their competitive advantage.

Really glad to see Mozilla is trying to change that.

------
echelon
This is amazing! I've been using LJS trained models and then cross-training
them to target speakers, but this looks like it may produce even higher
quality results.

I've previously implemented concatenative TTS using unit selection [1]. The
quality is spotty, so I'm throwing it out and going with the ML approach,
which produces higher fidelity voices even in my own experiments.

My next steps are taking an end-to-end synthesis model and porting it to run
cheaply on the CPU.

Thanks so much for making this data available, Mozilla! You're helping
democratize this technology for individual engineers and researchers that
don't have Google's resources.

[1] [http://trumped.com](http://trumped.com)

------
The_Amp_Walrus
I'm interested in using this for text to speech, rather than speech to text.
Is wavenet still the state of the art for training on a dataset like this?

~~~
1ris
The biggest feature of this data set is that it includes lots of different
accents and non-native speakers. This should improve voice recognition in
these areas, but I'm not sure if it is that useful for voice generation.

------
robax
It would be super rad if someone plugged this dataset into CMU Sphinx. Google's
is the only decent ASR, and a competitive open-source alternative would be
awesome. A big thank you to Mozilla for this dataset.

