
Common Voice: A Massively-Multilingual Speech Corpus - ArtWomb
https://arxiv.org/abs/1912.06670v1
======
est31
> M-AILABS data contains 9 language varieties with a modified BSD 3-Clause
> License, however there is no community-driven aspect.

There is certainly an _aspect_ , as M-AILABS sources its data mainly from
the LibriVox project, which is community-driven.

To give ballpark numbers on what it would have cost if you had had to pay
people for providing the data instead of getting it for free:

It's low-skilled labour, so you'll likely find people to do it at slightly
above minimum wage. Let's take Germany, as I'm most familiar with its rules and
because that's where Common Voice is headquartered. Minimum wage here will be
€9.35/hr starting on Jan 1st. Let's say you pay them €11/hr. There are various
employer contributions (Arbeitgeberanteile) which you have to pay as well, so
say your per-employee expense amounts to €15/hour. Assume you can record and
verify at 70% efficiency and you use two people to verify each clip. Then you
need 3 / 0.7 ≈ 4.29 employee-hours per final result hour.

And you wouldn't get only German at this rate: Berlin is one of the cities
with the greatest language diversity in Germany.

This would give you a price of about €64.35 per result hour. You'd have to pay
roughly €64k for 1000 hours of validated training data, and roughly €128k for
the 2000 hours of data currently achieved.
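In code, the back-of-the-envelope calculation looks like this (a rough sketch; the €15/hour all-in cost, 70% efficiency, and two verifiers per clip are the assumptions stated above, not official Common Voice figures):

```python
# Back-of-the-envelope cost of paid speech data collection,
# using the assumptions from the post above.
hourly_cost = 15.0   # euros per employee-hour, incl. employer contributions
people = 3           # one speaker + two verifiers per clip
efficiency = 0.7     # fraction of paid time that yields usable audio

hours_per_result_hour = people / efficiency                  # ~4.29
cost_per_result_hour = hourly_cost * hours_per_result_hour   # ~64.3 euros

for dataset_hours in (1000, 2000):
    total = cost_per_result_hour * dataset_hours
    print(f"{dataset_hours} validated hours: ~€{total:,.0f}")
```

Running this gives roughly €64k for 1000 hours and €129k for 2000, matching the per-hour figure above within rounding.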

These €128k are probably on the same order of magnitude as what Mozilla spends
on the project (employee time to design, build, and run it), and if the
project scales, the economics look even better. From a business point of view,
going open source was thus a great idea.

To put the 2000 hours into perspective: the Deep Speech 2 paper [1] used
roughly 10k-hour datasets per language, whereas Common Voice's 2k hours are
distributed amongst multiple languages. The record holder is probably Amazon
with 1 million hours (although it's unlabeled) [2].

It's possible though that future breakthroughs will remove the need for tons
of training data. So even if the restricted amount of training data can't
create practical models in niche languages for now, it might very well be able
to in the future.

[1]:
[https://arxiv.org/pdf/1512.02595.pdf](https://arxiv.org/pdf/1512.02595.pdf)

[2]:
[https://arxiv.org/pdf/1904.01624.pdf](https://arxiv.org/pdf/1904.01624.pdf)

~~~
melling
I’m not sure why we’d bother to figure the cost? People have been donating
their time to open source for decades, without compensation.

I’ve donated some time to the Mozilla Common Voice project:
[https://voice.mozilla.org/en](https://voice.mozilla.org/en)

If a lot of people donate a little time, we’ll come much closer to making
voice recognition a solved problem.

Google has a great project where they’re trying to improve voice recognition
for people with disabilities:

[https://blog.google/outreach-initiatives/accessibility/how-tim-shaw-regained-his-voice/](https://blog.google/outreach-initiatives/accessibility/how-tim-shaw-regained-his-voice/)

~~~
est31
It was meant to demonstrate that open-sourcing a project can help a company
realize it, compared with doing nothing or keeping it closed.

------
peterjussi
The actual dataset from the paper:
[https://voice.mozilla.org/en/datasets](https://voice.mozilla.org/en/datasets)

------
zerop
I have trained with Common Voice data and it certainly performs very well.
However, decent speech recognition still has a long way to go. These models
work well in a controlled environment with a good mic, but real-world use
cases involve noise, dynamic environments, varied pitches, etc. On top of
that, everyone expects your voice recognition to be as good as Google's or
Alexa's. I am still looking for a decent deep-learning-based solution that can
work in a real environment.

~~~
Jnr
They actually ask you to use whatever microphone you have, in whatever
environment you are in.

I usually record my clips on my phone in the car while sitting in traffic. It
is not even close to studio quality.

Currently my contribution is small (about 1500 clips recorded and 1000
validations), but I still participate when I get some time.

------
lunixbochs
This dataset has been a huge boon for my English acoustic models that
recognize many accents at once.

------
option
This is a great dataset. We already pre-trained some models on it
[https://nvidia.github.io/NeMo/asr/quartznet.html](https://nvidia.github.io/NeMo/asr/quartznet.html)

~~~
zerop
How is the performance?

