This dataset will help the many independent deep learning practitioners like me who aren't working at FAANG and have only had access to datasets such as LJSpeech or self-constructed datasets that have been cobbled together and manually transcribed.
Despite the limited materials available, there's already some truly amazing stuff being created. We've seen a lot of visually creative work being produced in the past few years, but the artistic community is only getting started with voice and sound.
Another really cool thing popping up is TTS systems trained on non-English speakers reading English corpora. I've heard Angela Merkel reciting copypastas, and it's quite amazing.
I've personally been dabbling in TTS as one of my "pandemic side projects" and found it to be quite fun and rewarding.
Besides TTS, one of the areas I think this dataset will really help with is Voice Conversion (VC). It'll be awesome to join Discord or TeamSpeak and talk in the voice of Gollum or Rick Sanchez. The VC field needs more data to perfect non-aligned training (where the source and target speakers aren't reciting the same, temporally aligned training text), and this will be extremely helpful.
I think the future possibilities for ML techniques in art and media are nearly limitless. It's truly an exciting frontier to watch rapidly evolve and to participate in.
As someone actively using the data, I wish I could more easily see (and download the clip lists for) the older releases, as there have been 3-4 dataset updates for English now. If we don't have access to versioned datasets, there's no way to reproduce old whitepapers or models that use Common Voice. And at this point I don't remember the statistics (hours, accent/gender breakdown) for each release. It would be neat to see that over time on the website.
I’m glad they’re working on single word recognition! This is something I’ve put significant effort into. It’s the biggest gap I’ve found in the existing public datasets - listening to someone read an audiobook or recite a sentence doesn’t seem to prepare the model very well for recognizing single words in isolation.
I've adapted my model and training process for that, though I'm still not sure of the best way to balance training for that sort of thing. I have maybe 5 examples of each English word in isolation but 5000 examples of each number (from Speech Commands), and it seems like the model will prefer e.g. "eight" over "ace", presumably due to the training imbalance.
Maybe I should randomly sample 50 of the 5000 examples of each over-represented word every epoch, so the model still has a chance to learn from them without overtraining on them?
The technique you're describing is undersampling (capping the over-represented classes each epoch); oversampling the rarer classes is the complementary approach. Both belong to a large family of general techniques for dealing with imbalanced datasets, as it's a very common situation.
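For what it's worth, here's a minimal sketch of the per-epoch capping idea (random undersampling of the over-represented words). The data layout, cap value, and helper names are just assumptions for illustration:

```python
import random
from collections import defaultdict

def balanced_epoch(examples, cap=50, seed=None):
    """Build one epoch's worth of (clip, label) pairs, keeping at most `cap`
    randomly chosen examples per label so frequent words (e.g. the Speech
    Commands digits) don't drown out rare ones."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip, label in examples:
        by_label[label].append((clip, label))
    epoch = []
    for items in by_label.values():
        epoch.extend(rng.sample(items, cap) if len(items) > cap else items)
    rng.shuffle(epoch)
    return epoch

# Re-draw each epoch so a different 50/5000 subset of "eight" is seen,
# while every example of a rare word like "ace" is always kept:
# for epoch in range(num_epochs):
#     train_one_epoch(model, balanced_epoch(dataset, cap=50, seed=epoch))
```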
The model itself has generalized pretty well to handle both single- and multi-word utterances, I think, without a separate classifier, but I'm definitely not going to rule out multi-model recognition in the long run.
My main issues with single words right now are:
- The model sometimes plays favorites with numbers (ace vs eight)
- Collecting enough word-granularity training data for words-that-are-not-numbers (I've done a decent job of this so far, but it's a slow and painful process. I've considered building a frontend to turn sentence datasets into word datasets with careful alignment)
An issue to watch for, though, is elision: a word in a sentence is often said differently from the word in isolation. For example, when saying "last" and "time" separately one typically includes the final "t" in "last", yet said together it commonly comes out more like "las time".
I think I'd be very cautious about it: use a model with a different architecture than the aligner to validate the extracted words, and probably play with training on the data a bit to see whether the resulting model makes sense. I do have examples of most English words to compare the extracted words against.
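A rough sketch of that validation step might look something like the following; the aligner and classifier interfaces, the sample rate, and the confidence threshold are hypothetical assumptions, not any particular tool's API:

```python
def extract_validated_words(sentence_clips, aligner, classifier,
                            sample_rate=16000, min_conf=0.8):
    """sentence_clips: iterable of (samples, transcript) pairs.
    aligner(samples, transcript) -> [(word, start_sec, end_sec), ...]  # assumed interface
    classifier(word_samples)     -> (predicted_word, confidence)       # assumed interface
    """
    kept, rejected = [], []
    for samples, transcript in sentence_clips:
        for word, start, end in aligner(samples, transcript):
            clip = samples[int(start * sample_rate):int(end * sample_rate)]
            pred, conf = classifier(clip)
            if pred == word and conf >= min_conf:
                kept.append((word, clip))
            else:
                # Held back for manual review; elision cases like "las(t) time"
                # will often land here rather than in the clean set.
                rejected.append((word, clip, pred, conf))
    return kept, rejected
```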
Examples: dysphonias of various kinds, dysarthria (e.g. from ALS / cerebral palsy), vocal fold atrophy, stuttering, people with laryngectomies / voice prosthesis, and many more.
Altogether, this represents millions of people for whom current speech recognition systems do not work well. This is an especially tragic situation, since people with disabilities depend more heavily on assistive technologies like ASR. Data/ML bias is rightfully a hot topic lately, so I feel that the voices of people w/ disabilities need to be amplified as well (npi).
However, this means that the gains to be had from personalized training are greater for disordered speech than for "average" speech. I develop kaldi-active-grammar, which specializes the Kaldi speech recognition engine for real-time command & control with many complex grammars. I am also working on making it easier to train personalized speech models, and to fine-tune generic models with training for an individual. I have posted basic numbers on some small experiments. Such personalized training can be time-consuming (depending on how far one wants to take it), but as my parent comment says, disabled people may need to rely more on ASR, which means they have that much more to gain by investing the time for training.
Nevertheless, a Common Voice disordered speech dataset would be quite helpful, both for research, and for pre-training models that can still be personalized with further training. It is good to see (in my sibling comment) that it is being discussed.
The Common Voice FAQs say the right words about the mission of the project:
>”As voice technologies proliferate beyond niche applications, we believe they must serve all users equally.”
>”We want the Common Voice dataset to reflect the audio quality a speech-to-text engine will hear in the wild, so we’re looking for variety. In addition to a diverse community of speakers, a dataset with varying audio quality will teach the speech-to-text engine to handle various real-world situations, from background talking to car noise. As long as your voice clip is intelligible, it should be good enough for the dataset.”
However, your data validation criteria both implicitly and explicitly exclude entire classes of people from the dataset, and allow for the validators to impose an arbitrary standard of purity regarding what constitutes “correct” speech. In so doing, you are influencing who is and isn’t understood by systems built upon this data. Examples from the docs (https://discourse.mozilla.org/t/discussion-of-new-guidelines...)
>”You need to check very carefully that what has been recorded is exactly what has been written - reject if there are even minor errors.”
As currently stated, this criterion leads to the categorical exclusion of people for whom speaking without "even minor errors" is not possible (e.g. lalling and other phonological disorders, where certain phonemes can't be formed), based on the validators' subjective perception of data cleanliness.
>”Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”
Please watch this example of a person you are defining out of your dataset: (https://m.youtube.com/watch?v=5HgD0PXq0E4)
Look at this kid’s face light up and tell me that’s not his new natural voice. An electrolarynx is not a computer-synthesized voice (you manipulate the muscles in your neck to generate vibrations—like an external set of vocal cords). Although it would almost definitely be mistaken for one, and summarily sent to the “clip graveyard” (https://voice.mozilla.org/en/about).
>”I tend to click ‘no’ and move on for extreme mispronounced words. I’m of the opinion that soon enough, another speaker from their nationality will submit a correct recording.”
Again, the use of the word “correct” here is problematic. Rejecting borderline cases and waiting for “cleaner” samples is a severe trap to fall into, regardless of the domain.
>”I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.”
Again, this almost categorically excludes people with a stutter and other types of speech disorders.
@dabinat gets it right with this comment:
>”There are uses for CV and DeepSpeech beyond someone directly dictating to their computer. In my opinion, CV’s voice archive should contain as many different ways to say something as possible.”
>”You may well be right. I’d be interested to hear what the programmers’ expectations are.”
>”I will ping @kdavis and @josh_meyer for feedback on the ML expectations (in terms of what’s good/bad for deepspeech).”
Yikes. So the data is being selected to improve performance benchmarks of the speech recognition model, and not to better reflect the nuances and variety of speech in the real world (which was the stated goal of Common Voice). It could very easily be the case that cherry-picking data to improve test benchmarks will decrease the generalizability of the model in other applications. Narrowing the range of human speech to make the problem easier (i.e. simpler to build a model that functions well for most people) is antithetical to your stated mission. We can't keep measuring AI progress in parameters and petabytes. It has to be about the people it helps.
>”I agree that we don’t want to scare off new contributors off by presenting the guidelines up-front as an off-putting wall of text that they have to read.”
Limiting the amount of documentation/training available to data annotators in an effort not to scare them is a surefire way to end up with inconsistently labeled data.
Although I find the above examples to be dismaying, I do not mean to ascribe any ill intent to your team or the volunteers. I understand the complexities at play here. But the outright dismissal of certain types of voices as out-of-scope or not “correct” is causing real harm to real people, because ASR systems simply do not work well for people with various disabilities. I could find no direct mention or acknowledgement of the existence of speech disorders anywhere* on the website or forum.
I believe there needs to be a more deliberate effort to construct a more representative dataset in order to meet your stated mission (which I am willing to volunteer my time towards). Just some initial ideas:
- Augment the dataset by folding in samples from external datasets (e.g. https://github.com/talhanai/speech-nlp-datasets). I’m not sure on the approach, but if movie scripts can be adapted, presumably so can other voice datasets.
- Retain samples with speech errors like mispronunciations and stutters (perhaps with a flag indicating the error). In fact, why not retain all samples, flagging those that are unintelligible? At least keep them available for data provenance purposes, so it is known what was excluded and the decision can be reversed.
- Establish a relationship with speech-language pathologists to collect or validate samples (eg: universities or the VA, who have many complex/polytrauma voice patients). Sessions with SLPs often involve having patients read sentences aloud, so it’s a familiar task. This is probably the best way to collect data from people with voice disorders, so volunteer annotators aren’t responsible for analyzing a complex subset.
- Use inter-annotator agreement measures to characterize uncertainty about sample accuracy, rather than binary accept/reject criteria (see the sketch after this list).
- Collect/solicit more samples from people >70yrs old, since they are currently underrepresented in your data. Is there anyone over the age of 80 in your dataset at all?
- Improve your documentation and standards to be more explicitly transparent about the ways in which the dataset does not currently represent everyone, and about plans for bridging these gaps.
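To make the inter-annotator-agreement suggestion above concrete, here is a minimal sketch; the "accept"/"reject" vote format is just an assumed export format, not how Common Voice actually stores validations:

```python
from collections import Counter

def clip_agreement(votes):
    """votes: list of 'accept'/'reject' decisions from independent reviewers
    for one clip. Returns (majority_label, agreement), where agreement is the
    fraction of reviewers who chose the majority label -- a soft confidence
    score that can ship with the clip instead of a hard keep/discard verdict."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Example: 2 of 3 reviewers accepted -> ('accept', about 0.67) instead of a binary verdict.
print(clip_agreement(["accept", "accept", "reject"]))
```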
But as I mentioned, this has been discussed, including the ability for users to add flags to their profiles to indicate disordered speech.
IMO it might be better to include disordered speech in a separate dataset with separate validation requirements, which would require new features on the site. But the new “target segments” feature is a step towards achieving such a thing.
>”Common Voice is part of Mozilla’s initiative to make voice recognition technologies better and more accessible for everyone.”
If people with speech/voice disorders are not represented in the dataset, then this is not inclusive of everyone.
I have been working on the Esperanto dataset of Common Voice over the last year, and we have now collected over 80 hours in Esperanto. I hope that in a year or two we'll have collected enough data to create the first usable neural network for a constructed language, and maybe the first voice assistant in Esperanto too. I will train a first experimental model with this release soon.
I don't have "an in" but it's probably worth having a look over the Common Voice and Deep Speech forums on Discourse to see who the main people are. They also hang out in their Matrix Chat groups, so might be able to get in touch that way. Links are below.
I got flac working for speech.talonvoice.com with an asm codec so they could do whatever in theory, but I do get some audio artifacts sometimes.
I ended up building an extension for Firefox that normalizes the audio on the website if installed: https://github.com/est31/vmo-audio-normalizer https://addons.mozilla.org/de/firefox/addon/vmo-audio-normal...
In general I don't think normalization should happen at the backend. It's useful for training data to have multiple loudness levels, so that the network can understand them all.
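As a side note, loudness variety can also be simulated at training time with random gain augmentation; a minimal sketch, where the dB range is just an illustrative choice:

```python
import numpy as np

def random_gain(samples, min_db=-12.0, max_db=6.0, rng=None):
    """Scale a float waveform (values in [-1, 1]) by a random gain in dB so the
    model sees a range of loudness levels during training."""
    rng = rng or np.random.default_rng()
    gain = 10.0 ** (rng.uniform(min_db, max_db) / 20.0)
    return np.clip(samples * gain, -1.0, 1.0)  # keep within valid float-audio range
```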
Also, non-English models are _way_ behind still.
The recently updated Mozilla Voice dataset still lacks non-EN languages sadly.