
Common Voice – Mozilla's initiative to help teach machines how real people speak - mjlee
https://voice.mozilla.org/
======
gok
The goal of making a large, publicly available training corpus for ASR is
incredibly admirable, but this approach is problematic. People speak entirely
differently when reading from a script. Models trained on read speech (like
LibriSpeech) generally don't perform well on spontaneous speech test sets
(like Switchboard). Transcribing speech that was read from a script isn't a
particularly interesting problem.

This effort would be more interesting if it could collect speech data in a
more specific domain, like web search queries.

~~~
cco
Why do you view this as admirable? I view it similarly to creating a large
cache of explosives and munitions and then giving them away for free. Sure, a
lot of people might have some fun with them on the weekend, but this
technology will mostly be used against people, not for them.

~~~
staktrace
With your line of reasoning, would you prefer a world where all the munitions
were held by megavillains, or a world where both megavillains and regular
people had access to the munitions?

~~~
minikites
The latter strategy doesn't seem to work out that well for the USA because it
results in regular people using the munitions on other regular people.

~~~
zenography
To defend themselves, hundreds of thousands to millions of times per year.

[https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3194685](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3194685)

~~~
williamdclt
> indicate that defensive uses of guns by crime victims are far more common
> than offensive uses by criminals

That's literally the worst-case scenario: victims escalating the violence.

~~~
zenography
Are you saying that people defending themselves is worse than killing?

------
pornel
I've tried to contribute to new languages, but they all have a "join us"
button that subscribes you to a generic Mozilla mailing list, and the Voice
team never sent any instructions on how to contribute! No wonder these
languages have been stuck unfinished for years.

~~~
Krasnol
Sometimes it's beyond me how tech-savvy people struggle with the easiest
things online.

The page itself has only a few links. Randomly clicking through it while
logged in, you would very likely hit something that leads you straight to the
place where you can CONTRIBUTE:
[https://voice.mozilla.org/de/speak](https://voice.mozilla.org/de/speak), your
dashboard (which also links to the CONTRIBUTE part), or just the huge
microphone symbol.

~~~
pornel
I'm talking about new languages, which haven't launched yet because they
don't have enough text to record.

The actual method to contribute to them is finding Mozilla's internal
translation tool, proposing translations, then finding a Mozillian to approve
them. Then there's an obscure webapp for sentence submission and voting, but
it doesn't give any feedback on whether the sentences you've submitted were
accepted or went to /dev/null.

(You see people struggling with the easiest things because you assume that's
what is happening, rather than considering that you've misunderstood the
problem.)

~~~
ascii_only
There is feedback in the Sentence Collector. If your sentence is too long, it
will tell you straight away. If your sentence is rejected by people, it will
appear in the "rejected sentences" tab. If your sentences are approved, you
can see them in the GitHub repo.

~~~
pornel
What GitHub? Please realize that none of this is obvious to an outsider who
doesn't know how your pipeline is built. None of this is explained on the
Voice website. These tools aren't even linked to on the official website.

I've submitted 200 sentences, and none of the progress counters shown to me
increased by 200, so I assumed they were lost, and gave up.

~~~
ascii_only
In mozilla/voice-web/server/data/. But if your profile tab in Sentence
Collector shows 0 sentences added, then your sentences were never added to
Sentence Collector for approval in the first place.

------
punnerud
I love the project and have participated. What I feel is lacking is
cases/words that kids are able to pronounce (with help from parents saying
what they have to repeat). My 3-year-old son loves to use speech translation
and to search for videos, but it frustrates both me and him that he pronounces
things in a way that "all" humans would understand, yet voice-to-text gets his
speech wrong in 3/4 of cases.

~~~
bluGill
Collecting data from children is nearly impossible for legal reasons. I
mostly agree with the reasons, but the side effect is that nobody has good
data on children. Thus children are forever doomed to a bad experience, like
the time my son asked "Hey Mycroft, how do you spell Kansas" and got
"C-A-N-V-A-S", which he knew was wrong.

~~~
lunixbochs
Maybe we could specifically target voice actors who convincingly voice
children in animated stuff

Edit:

- There are already paid child actors; you could somewhat-manually collect
their speech from e.g. movies and TV shows to have _something_

- Even if there's some copyright issue with distributing their audio
directly, it's not clear (uncertain but dubious?) that a model trained on
their audio would have any copyright concerns, as long as it can't be used to
reproduce the original audio

- What is Mozilla going to do if <1% of their dataset is already children who
didn't put an age in? Is that a COPPA violation? There's even the defense of
"an adult can sound like a child, and we also don't know who this child is, so
how is it personal information" (I have no idea how useful any of that is)

------
akie
It's going to be tricky.

I did some of the "listen" exercises to validate how people pronounced some
sentences, and I got a few people who spoke with very strong (Indian,
Nigerian, UK, ...) accents. How would you take these things into account? Just
average all of them and hope for the best? Not sure how to approach that. I
don't think it's very straightforward. Interesting problem though.

However, you can't do anything if you don't even have the data, so props to
Mozilla for starting this.

~~~
yorwba
In the worst case, you might have to treat different accents as different, but
related languages. The current trend for low-resource languages seems to be
about using one giant model for all languages to make use of shared features
(e.g. for translation [1]), so adding even more languages might not be that
expensive in terms of training data required.

[1] [https://ai.googleblog.com/2019/10/exploring-massively-multil...](https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html)
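
As a concrete illustration of the "one shared model for many languages/accents"
idea: a common trick in multilingual systems is to tag every training example
with a language or accent token so a single model can condition on it. The
sketch below is purely illustrative; the field names and token format are made
up, not anything Common Voice or the linked Google work actually uses:

```typescript
// Illustrative sketch: tag each training example with a language/accent token
// so one shared model can be trained across all of them. Names are made up.
interface Example {
  audioPath: string;   // path to the recording
  transcript: string;  // reference transcription
  language: string;    // e.g. "en", "cs"
  accent?: string;     // e.g. "en-IN" (hypothetical accent label)
}

// Prepend a pseudo-token, in the spirit of the target-language tokens used in
// multilingual translation models, so accents behave like "related languages".
function tagExample(ex: Example): Example {
  const tag = ex.accent ? `<${ex.accent}>` : `<${ex.language}>`;
  return { ...ex, transcript: `${tag} ${ex.transcript}` };
}

console.log(tagExample({
  audioPath: "clip_0001.mp3",
  transcript: "the quick brown fox",
  language: "en",
  accent: "en-IN",
}).transcript); // "<en-IN> the quick brown fox"
```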

~~~
taneq
Isn't this literally what locales are for? Instead of "English" you have
"en-US", "en-GB", "en-IN", etc.

~~~
jobigoud
I think locales are for dialects, where you can have different terms used for
the same concept. Here you can have someone speak the en-NZ dialect, but with
a French accent.

Also, we would need en-FR, en-ES, en-IT, etc.: every language as spoken by
native speakers of every other language. And obviously the strength of the
accent varies.

~~~
TomMarius
Yeah, also a Moravian (ancient nation united with the Czechs more than 1000
years ago) person will speak English differently than a Czech (as in nation,
not state) person even though we all speak the Czech language.

~~~
joshuaissac
A solution could be to tag both the dialect and the accent with language
codes. Native speakers of Moravian Czech will probably have similar accents
when they speak New Zealand English. Using Glottolog IDs as tags for example,
this might be represented as { dialect:"newz1240", accent:"czec1259" }. If the
program can already recognise the New Zealand English dialect and the Moravian
Czech accent, it might then leverage both of those to recognise the speech of
a Moravian person speaking New Zealand English.
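
For illustration only, a record like that might look as follows; the field and
function names are hypothetical, not part of any actual Common Voice or
Glottolog tooling:

```typescript
// Sketch of tagging a clip with separate dialect and accent codes
// (Glottolog IDs, as suggested above). All names are illustrative.
interface SpeechClip {
  clipId: string;
  dialect: string; // what is being spoken, e.g. "newz1240" (New Zealand English)
  accent: string;  // how it is pronounced, e.g. "czec1259" (Czech)
}

const clip: SpeechClip = {
  clipId: "clip_42",
  dialect: "newz1240",
  accent: "czec1259",
};

// A recogniser could then fall back gracefully: try an exact
// (dialect, accent) match first, then dialect-only, then accent-only.
function modelKeyCandidates(c: SpeechClip): string[] {
  return [`${c.dialect}+${c.accent}`, c.dialect, c.accent];
}

console.log(modelKeyCandidates(clip));
// ["newz1240+czec1259", "newz1240", "czec1259"]
```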

------
squarefoot
I've validated some speech and, like others, I sometimes found strong foreign
accents from non-native English speakers (just like I am), so I tried to be
neutral, at least with the sentences I could understand, which luckily were
the vast majority.

If I may offer a suggestion to improve the service: some information should be
given about how to produce good quality recordings before the user starts
contributing; what mic to use and how, sound levels, equalization, background
noise, etc. Some of the recordings were truly awful quality-wise and would
probably generate false negatives (OK, an AI should learn to sort those out,
but maybe later).

Also, some recordings, although correct, were stuttering badly, probably due
to network congestion on the contributor's side; it would come in handy to
have a way to tag those sentences as "correct but stuttering", so that in the
future the AI could also learn how a formally well-recited text sounds when it
comes over a problematic connection.

Tagging (or maybe scoring) could also be useful for sentences where just about
everything is correct except for a single word or part of one. For example,
one sentence was "The party was a Sikh-centered political party in the Indian
state of Punjab." but the woman said "in _this_ Indian state". I didn't mark
it because either way it would not have been entirely accurate.

Nice initiative though.

------
est31
If you are listening to the recordings and it's annoying to you that they have
different loudness levels, you can try out my add-on that normalizes the
levels: [https://addons.mozilla.org/de/firefox/addon/vmo-audio-normal...](https://addons.mozilla.org/de/firefox/addon/vmo-audio-normalizer)

You can also see a demo here (requires git clone):
[https://github.com/est31/js-audio-normalizer](https://github.com/est31/js-audio-normalizer)
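
For anyone curious what loudness normalization in the browser roughly
involves, here is a minimal sketch with the Web Audio API: decode the clip,
measure its RMS level, and apply a compensating gain. This is not the add-on's
actual implementation, just the general idea, and the target level is an
arbitrary placeholder:

```typescript
// Minimal sketch of loudness normalization with the Web Audio API.
// NOT the add-on's real code; the targetRms value is an arbitrary example.
async function playNormalized(url: string, targetRms = 0.1): Promise<void> {
  const ctx = new AudioContext();
  const data = await (await fetch(url)).arrayBuffer();
  const buffer = await ctx.decodeAudioData(data);

  // Compute the RMS level over the first channel.
  const samples = buffer.getChannelData(0);
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);

  // Apply a gain that scales the measured RMS up or down to the target.
  const gainNode = ctx.createGain();
  gainNode.gain.value = rms > 0 ? targetRms / rms : 1;

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(gainNode).connect(ctx.destination);
  source.start();
}
```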

------
SamBam
It is unclear to me what we're supposed to be validating in the "Listen"
section.

1. That the words say what is written?

2. That the words are clear and easy to understand?

3. That the speech is fluent and natural and easy to listen to?

I would guess 1 and maybe 2, and not 3, but it's only a guess because I don't
see it written down anywhere.

Many of the clips are very stilted, spoken slowly and unnaturally. (Also,
sometimes the text doesn't make sense: "On a normal Hajj, it would be around
to walk." But I'm guessing this doesn't matter.)

~~~
dabinat
This may help: [https://discourse.mozilla.org/t/discussion-of-new-guidelines...](https://discourse.mozilla.org/t/discussion-of-new-guidelines-for-recording-validation/36465)

~~~
SamBam
That does help. And it should be linked clearly.

For those who don't want to read it, the only real rule is: all the words are
read exactly as written, and no other words are heard.

Clarity, stiltedness, background noise, pronunciation (so long as it's
acceptable English), accent, etc. don't matter.

------
dang
Related from 2018:
[https://news.ycombinator.com/item?id=17436958](https://news.ycombinator.com/item?id=17436958)

~~~
plibither8
And 2017:
[https://news.ycombinator.com/item?id=14794654](https://news.ycombinator.com/item?id=14794654)

------
intopieces
I work in data collection for speech recognition systems and would love to
work on this project full time. I wish they had openings.

------
terrycody
But how can this project benefit us all?

