
Mozilla Overhauls Speech-To-Text Contribution Interface - blendergeek
https://voice.mozilla.org/speak
======
ipsum2
This is a nicely designed interface. Well done, Mozilla. Validating sentences
is quite fun, listening to different accents from around the world. Try it out
if you haven't already:
[https://voice.mozilla.org/en/listen](https://voice.mozilla.org/en/listen)

It's awesome that the dataset is offered with a CC-0 license:
[https://voice.mozilla.org/en/data](https://voice.mozilla.org/en/data), does
anyone know if it includes the answers from the survey? I have limited
internet bandwidth, so I haven't checked it out yet. In particular, I'm
wondering if there's per-user data on whether they picked yes or no, which
could be used to implement "troll detection" - catching people who click one
option all the time.
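
As a rough illustration of that idea, here is a minimal sketch that flags voters whose answers are almost entirely one-sided. It assumes a per-user vote log of `(user_id, is_yes)` pairs, which is hypothetical: nothing here confirms that the published dataset contains per-user votes. The function name, `min_votes`, and `threshold` are likewise made up for the example.

```python
from collections import defaultdict


def flag_one_sided_voters(votes, min_votes=20, threshold=0.95):
    """Return sorted user ids whose votes are almost all 'yes' or almost all 'no'.

    votes: iterable of (user_id, is_yes) pairs (a hypothetical vote log).
    """
    counts = defaultdict(lambda: [0, 0])  # user_id -> [yes_count, total_count]
    for user_id, is_yes in votes:
        counts[user_id][0] += 1 if is_yes else 0
        counts[user_id][1] += 1

    flagged = []
    for user_id, (yes, total) in counts.items():
        if total < min_votes:
            continue  # too few votes to judge this user
        yes_ratio = yes / total
        if yes_ratio >= threshold or yes_ratio <= 1 - threshold:
            flagged.append(user_id)
    return sorted(flagged)


# Example: "a" always clicks yes, "b" is balanced, "c" has too few votes.
example_votes = (
    [("a", True)] * 30
    + [("b", True), ("b", False)] * 15
    + [("c", False)] * 5
)
suspects = flag_one_sided_voters(example_votes)
```

A flag like this is only a hint: if most clips really are valid, an honest reviewer could also end up clicking "yes" nearly every time, so it would probably need to be combined with how often a user disagrees with other reviewers.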

Trolling of crowdsourced data isn't unheard of. (NSFW language:
[https://www.reddit.com/r/pics/comments/cygfx/4chan_is_using_...](https://www.reddit.com/r/pics/comments/cygfx/4chan_is_using_googles_captcha_technology_to_get/))

~~~
archgoon
> It's awesome that the dataset is offered with a CC-0 license:
> [https://voice.mozilla.org/en/data](https://voice.mozilla.org/en/data), does
> anyone know if it includes the answers from the survey?

I'm downloading it now, I'll have an answer in a half hour. Does anyone know
if there is a torrent for it?

~~~
archgoon
Well, download took longer than expected :).

Anyhow, here's a sample from the csv file:

    
    
      filename,text,up_votes,down_votes,age,gender,accent,duration
      cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away like a shriveled dandelion,1,0,thirties,male,england,
    

Not sure how some of these fields are being populated, but yeah, there are
several additional folders, including one of invalid mp3s, a split-out train
set (not sure how it was selected) and a test set folder.
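
For anyone who wants to poke at the file, here is a small sketch of reading rows in that format with Python's standard `csv` module. It only relies on the column names shown in the sample above; the `load_clips` helper and the choice to turn empty fields (like the blank `duration`) into `None` are assumptions made for the example.

```python
import csv
import io

# The header and row quoted above, inlined so the example is self-contained.
SAMPLE = """filename,text,up_votes,down_votes,age,gender,accent,duration
cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away like a shriveled dandelion,1,0,thirties,male,england,
"""


def load_clips(csv_file):
    """Parse rows, converting vote counts to ints and empty fields to None."""
    clips = []
    for row in csv.DictReader(csv_file):
        row = {key: (value if value != "" else None) for key, value in row.items()}
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        clips.append(row)
    return clips


clips = load_clips(io.StringIO(SAMPLE))

# Keep only clips with more up votes than down votes.
liked = [clip for clip in clips if clip["up_votes"] > clip["down_votes"]]
```

With the real file you would pass an open file object (opened with `newline=""`) instead of the `io.StringIO` wrapper.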

Here's the README.txt. Looks cool! Have happy hacky fun! :)

[https://gist.github.com/cwgreene/f7f4df4ddcd9da017b9f4694b3f...](https://gist.github.com/cwgreene/f7f4df4ddcd9da017b9f4694b3f46d51)

Interestingly, many of the 'invalid' mp3s are actually (mostly) correct.
It's interesting to listen to them and guess why they were downvoted.

~~~
punchingwater
We also keep the README in the repo:
[https://github.com/mozilla/voice-web/blob/master/docs/corpus...](https://github.com/mozilla/voice-web/blob/master/docs/corpus_readme.txt)

~~~
archgoon
Thanks! Couldn't find the source of the Readme in the zipfile. Can you talk
about what the update process for this file is? How often is it updated? Is
there a way to just download the new files? Is there a tarball script for this
in the repo somewhere?

I see that you have instructions for s3. Are the files actually backed by s3?
Is it possible to download them with s3 (possibly using requester pays)?

~~~
punchingwater
We have no plans to allow users to download the "raw" data from s3 (i.e.,
before we perform the train/dev/test split). But we want to eventually build
some tools to automate this. See here for some background:

[https://discourse.mozilla.org/t/the-mozilla-guarantee-publis...](https://discourse.mozilla.org/t/the-mozilla-guarantee-publishing-multi-language-voice-data/29649)

------
makmanalp
This is wonderful and addictive. One thing that comes to mind is that the UI
allows for very little metadata - for example in some cases the audio has a
slight mispronunciation even though the intended word was clear - wouldn't it
be helpful to mark "difficult" cases like this? In other cases the volume is
just super low or there is background noise.

The other thing is that it's very cool to see the "you helped us reach our x%
goal" thing, but it locks up all the previous / next shortcuts, which means I
have to switch back to the mouse after 5 entries.

~~~
mgkimsal
> for example in some cases the audio has a slight mispronunciation even
> though the intended word was clear

Had a similar issue/concern. Ideally, if enough people mark something as
correct, the variations and slight differences will get merged together. It
did still bother me a bit, as being able to add a bit of extra data would
probably be helpful. But... maybe they can add some geo-ip data - respondents
from various areas would probably mark more stuff 'correct' from their own
region?

Being able to mark something 'close', or rate it (1-5, maybe) would help. Just
heard an Indian accent reading "It's such an unfair world, innit?" The words
are... correct, but 'innit' is somewhat idiomatic (especially spelled out that
way - seems more UK-oriented text). The pronunciation was "correct" but
"awkward".

Also... (too lazy to check right now) - if I create an account, can I see the
'yes/no' ratings of my own submissions?

~~~
punchingwater
> Also... (too lazy to check right now) - if I create an account, can I see
> the 'yes/no' ratings of my own submissions?

Not yet, but this is something in the works. You can explore our new
experience with the evergreen link:
[http://bit.ly/cv-desktop-ux](http://bit.ly/cv-desktop-ux)

------
redfast00
Am using Brave on Android (basically Chrome with an adblocker). I accidentally
mislabeled some (probably) correct samples because the waves move when you
click the play button, even if the audio hasn't loaded yet, and I thought it
was just a blank recording.

~~~
punchingwater
Would you mind filing an issue?
[https://github.com/mozilla/voice-web/issues](https://github.com/mozilla/voice-web/issues)

------
childintime
On my 6th try, just to get an idea of what to expect, the sentence: "Birds
feed their offspring with spiders, worms, slugs and bugs", was presented to
me. Upon clicking play I heard "Fuck fuck fuck, shit shit shit, fuck fuck shit
shit fuck".

No I shit you not.

Who was it? Miss, present thyself.

------
microcolonel
My biggest issue with this project is that more than half of the contributions
are from people who have failed to record correctly, or who are not fluent in
English.

~~~
wiml
That's what the second pass is for, right, to screen out actually
unintelligible or misrecorded entries.

The English (in)fluency is more of a feature, though, than a bug. The goal
isn't to produce a speech-to-text system that can recognize a perfectly miked
BBC announcer. It's to be able to recognize a wide variety of people speaking
fairly naturally in imperfect conditions, using whatever accent they use for
casual speech.

~~~
JackCh
> _" The goal isn't to produce a speech-to-text system that can recognize a
> perfectly miked BBC announcer."_

Wait what? The headline is about text-to-speech aka speech synthesis, not
speech recognition (speech-to-text). Are they trying to do both? It seems to
me that you'd train the two using different sorts of datasets. If you wanted
TTS to be intelligible to the greatest number of people, training it to speak
like a 'perfectly miked BBC announcer' is probably exactly what you'd want to
do.

Train it to _recognize_ many regional accents, but train it to _speak_ with
the most prevalent and universally understood accent you can find. So either
BBC English or Californian/Hollywood English.

Although traditionally TTS engines have shipped with numerous voices, such
that you can select either a British or an American accent for the English
voice. It may be worthwhile to have other English accents too, maybe one for
India (125 million speakers.) But if you trained a TTS engine to have a
computer amalgamation of all possible English accents I really doubt the
result will be considered high quality by anybody.

~~~
wiml
The headline was inaccurate (now fixed) — the Mozilla Voice project is about
speech recognition aka STT not TTS.

It would be kinda interesting to have a TTS system learn from a neighboring
STT system so that it gradually adopts your accent, though. I'm not sure if
that would be more _usable_ but it would be an interesting experience.

------
polygot
Really nice app. It would be a nice feature to add volume normalization as
some microphones/speakers are very soft, and I can't hear what they are
saying, while others are much too loud.
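
To make the suggestion concrete, here is a minimal sketch of peak normalization, the simplest form of volume normalization. It assumes the audio has already been decoded into a list of float samples in the range -1.0 to 1.0; the function name and the `target_peak` value are made up for the example, and this is not how the site handles audio.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one has absolute value target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# A very quiet clip and a very loud clip end up with the same peak level.
quiet_clip = [0.01, -0.02, 0.005]
loud_clip = [1.0, -0.5, 0.25]
quiet_out = normalize_peak(quiet_clip)
loud_out = normalize_peak(loud_clip)
```

Peak scaling is easy but crude: a single click or pop sets the gain for the whole clip, so a loudness-based measure (such as RMS) would likely track perceived volume better.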

~~~
punchingwater
There has been some discussion around this, but no real movement yet:
[https://github.com/mozilla/voice-web/issues/336](https://github.com/mozilla/voice-web/issues/336)

------
userbinator
From the description:

 _Common Voice is a project to help make voice recognition open to everyone.
Now you can donate your voice to help us build an open-source voice database
that anyone can use to make innovative apps for devices and the web._

I'll be the first to note that here's another piece of personally identifying
information you just "donated"...

~~~
a_imho
I won't contribute because speech interfaces are imo terrible and I'm against
the proliferation of Echo- and Duplex-like services, but the data collected is
listed here:

[https://voice.mozilla.org/en/privacy](https://voice.mozilla.org/en/privacy)

~~~
supuun
Speech interfaces are developed by large corporations and there is nothing you
can do about that. With Common Voice you are helping to create an open-source
alternative, which is a positive thing, I guess.

------
microcolonel
People seem to have fun with things like this, so I can see this drastically
increasing the contribution volume.

------
_r_o_y_
Even though it's Mozilla, I don't like the fact that they force you to receive
emails if you want to add another language.

Edit: My bad, it's only for the unavailable languages.

~~~
punchingwater
Just to note, we will never _require_ your email address to contribute. There
will always be an anonymous contribution workflow.

But adding new languages to Common Voice is a bit complicated at the moment,
and we haven't built a way to do this through the website yet. So for now, we
are doing this through a very manual process, and we plan to use email
addresses to communicate.

------
amelius
Did they consider using MTurk for this?

~~~
nmstoker
If you think about it, this is a Mechanical Turk approach, just done for free
with volunteers not with the financial impetus of MTurk. That's why they've
given particular attention to the second pass over the data.

If you want to read more about it, the GitHub repo and in particular the
issues cover a lot of the obvious questions like this. They're here:
[https://github.com/mozilla/voice-web/issues](https://github.com/mozilla/voice-web/issues)

