I do a lot of dictation on mobile devices for work, with middling, and perhaps more importantly, frustrating results (needless to say we are working on programming our way out of that hole). It is an area ripe for open source progress, given that larger companies with large proprietary data sets have failed to make basic common-sense decisions in their transcription algorithms, and offer no way to provide impactful feedback.
If anybody is interested, there is definitely a market for a more robust dictation library that can be integrated into apps and works offline. It just needs to be professional, e.g. allow for preferences such as:
• a strong preference for standard language and grammar over all slang
• not forcing Title Case for anything resembling a brand name
• a training mode for words and phrases of the user's choosing
• blacklisting of certain word or phrase results that are false positives
• proper learning from user corrections during use, so the tedium of correcting the same phrase 100 times disappears
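To make that wishlist a bit more concrete, here is a rough sketch of what a preferences object for such a library could look like. Everything here (the class, field names, defaults) is hypothetical, not an existing API:

    from dataclasses import dataclass, field

    @dataclass
    class DictationPreferences:
        """Hypothetical user-tunable settings for an offline dictation engine."""
        prefer_standard_grammar: bool = True    # bias output toward standard language over slang
        force_brand_title_case: bool = False    # never Title Case things that merely resemble brand names
        custom_vocabulary: list = field(default_factory=list)  # words/phrases added via a training mode
        blacklist: set = field(default_factory=set)            # results to never emit (known false positives)
        learn_from_corrections: bool = True     # update the model whenever the user fixes its output

    prefs = DictationPreferences(
        custom_vocabulary=["LibriVox", "Common Voice"],
        blacklist={"some recurring false positive"},
    )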
(A radiologist friend describes switching from a human medical transcriptionist to one of these "AI" software thingies as a cost-cutting measure by his hospital, or more accurately as a cost-offload measure. Hospital offloads salary, radiologist spends more time correcting stupid transcription mistakes for no extra pay).
In the case described above, the radiologist does none of the typing and none of the correction.
I don't know, maybe confusing 'canine' with 'benign'?
But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially if that language comes from very old texts (and many public domain books are indeed quite old).
Then there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a fairly sophisticated setup to make the audio quality as clean as possible. That type of setup is obviously unusual when people want to speak to their devices.
Third, the tone and cadence of read speech is different than that of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too).
But the goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply put several datasets together and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (you can read the Hacks blog post from Reuben Morais mentioned above for more context).
It shouldn't be that hard to degrade the quality synthetically? And with a clean source you can synthesize different types of noise/distortions.
My takeaway from that was that while synthetic degradation of inputs can be useful, and while it is "easy", the hard part is making it match real degradation closely enough to be representative. It's often really hard to replicate natural noise closely enough for it to be sufficient to use those kinds of methods.
Doesn't mean it's not worth trying, but I'd say that unless your voice data is very different, it's the type of thing that's mostly worth doing if you can't get your hands on anything better.
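For what it's worth, the "easy" part really is easy; it's matching reality that's hard. A minimal sketch of synthetic degradation with numpy, assuming you already have a clean mono signal loaded as a float array in [-1, 1] (this only covers additive white noise and a crude gain drop, not the room acoustics and codec artifacts discussed above):

    import numpy as np

    def degrade(clean, snr_db=10.0, gain=0.3):
        """Add white noise at a target SNR and attenuate the signal (a crude 'far from the mic' effect)."""
        signal_power = np.mean(clean ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=clean.shape)
        return np.clip(gain * clean + noise, -1.0, 1.0)

    # e.g. clean, sr = soundfile.read("sample.wav"); noisy = degrade(clean, snr_db=5)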
I'm not saying OCR isn't hard. I'm saying normalizing all those characters basically is the problem.
You'd have to know what the most common types of noise are, how they interact with the signal, etc. This method of collecting data can provide useful info on what that noise actually is.
I don't think most people speak to their phone the same way they normally speak.
For example, I always speak slowly, with perfect pronunciation and intonation when talking to Siri.
The problem with the 'problem' you're describing is that the scope of speech recognition is being defined too narrowly.
If all you care about is creating an open source Alexa/Siri knockoff, then yes, conversational speech is what you need to recognize. But what if you want to recognize scripted, rehearsed speech? What if you want a speech recognizer that can auto-transcribe movies, news broadcasts, or in fact audiobooks? Wouldn't it be nice if all audiobooks came with aligned text? That's an experience you can get right now with Kindle/Audible, but as far as I'm aware no FOSS ebook reading software supports it. If I have a public domain reading of Tom Sawyer from LibriVox and a text copy of Tom Sawyer from Project Gutenberg, how many hoops do I currently have to jump through to get the speech highlighted on my screen as the audiobook plays?
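As far as I know, the usual answer today is a forced aligner. Something like aeneas takes an audio file plus a plain-text file and emits a sync map of time ranges to text fragments; the sketch below follows its documented usage from memory, so the details may need adjusting, and the paths are placeholders:

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # plain text input, one fragment per line; output is a JSON sync map of time ranges -> fragments
    config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config)
    task.audio_file_path_absolute = "/path/to/tom_sawyer_librivox.mp3"
    task.text_file_path_absolute = "/path/to/tom_sawyer_gutenberg.txt"
    task.sync_map_file_path_absolute = "/path/to/syncmap.json"

    ExecuteTask(task).execute()
    task.output_sync_map_file()

An ebook reader could then use that sync map to highlight the current fragment as the audio plays, which is basically the Kindle/Audible experience, but the point stands that no FOSS reader wires this up for you.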
Recognizing all forms of speech should be the goal, not just one narrow albeit trendy sliver of speech.
Regarding conversational speech, I get that. Books are definitely not conversational.
I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>
In any case, for read speech in particular there are several corpora out there already, including the moderately large LibriSpeech corpus (1000hr). The state-of-the-art accuracy on read speech is also very good -- for example, domain-specific dictation systems have been commercially viable for quite some time. So while it's true that Audiobooks are a large untapped source, I think that there are other large-scale and richer options like YouTube or movies (i.e. videos with speech for which subtitles are available) that would be more useful to make progress towards good speech recognition systems.
The subtitles often don't match what is spoken exactly.
1. We have a lot of training data for the voices of white men reading stuff.
2. We have good models that already exist for removing background noise.
3. We might be able to build good models that could identify accents, gender, age variation.
4. We have good models for style transfer that work in the audio domain.
Could we take an audiobook read by a white guy, use a style transfer model to give him a German accent, and then use the German-accented version as training data back into the speech recognition model? Could you use a reverse style transfer model to turn accented audio into non-accented audio (i.e. normalize it all to the place where we have the most training data)? Could we use a combination of style transfer models to vastly expand the training data set, and then train the conversational systems?
Or are the style transfer models not good enough? Or do we not have training data for style transfers to turn the voices of white men into the voices of white men with German accents?
I don't want to trivialize, but I'm genuinely curious how professionals are actually trying to solve this now?
Understanding all words is not the problem. I don't know if it's universal, but frequently, a speech-to-text model is actually two models: A voice model (mapping raw audio to phonemes) and a language model (which models what the language looks like, i.e. what sentences are likely and which words exist). So if you want the STT system to understand novels, include novels in the training data for the language model. You can then combine it with a voice model suitable for conversational speech/the user's accent/background noise.
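A toy illustration of that split, in case it's useful: the voice model proposes candidate transcriptions with acoustic scores, and the language model re-scores them, so swapping in an LM trained on novels changes which candidate wins without touching the acoustic side. The weights and function names below are made up purely for illustration:

    def rescore(candidates, lm_logprob, alpha=0.5, beta=0.1):
        """Pick the best transcript by combining acoustic and language-model scores (simple shallow fusion).

        candidates: list of (transcript, acoustic_logprob) pairs from the voice model
        lm_logprob: function mapping a transcript to its log-probability under the language model
        """
        best_score, best_text = float("-inf"), ""
        for text, acoustic in candidates:
            score = acoustic + alpha * lm_logprob(text) + beta * len(text.split())
            if score > best_score:
                best_score, best_text = score, text
        return best_text

    # An LM trained on novels would rank "it was a dark and stormy night" above an
    # acoustically similar but nonsensical hypothesis, with no change to the voice model.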
One of the advantages of CommonVoice is the breadth of voices. By having more, and different types of, voices, you can build a more robust system, one that works for more people.
For instance, my wife is a nonnative English speaker, and had trouble for years using Siri. Alexa can’t understand small children. And of course, elevators can’t understand Scottish accents. https://m.youtube.com/watch?v=NMS2VnDveP8
Not so much the case on LibriVox. Accents, age and levels of voice professionalism vary greatly.
Also (I am not a lawyer but) just because the source material is out of copyright doesn't mean the audiobook is out of copyright.
I'm also sure you could add subtitles for the hard of hearing as a very noisy data set.
There have recently been a number of assertions that better quality ML data will outperform better ML algorithms, and this has certainly been true in my experience as well, especially in domains like speech recognition.
There's going to be a long road to catch up to the big players, however. Even 15 years ago there were companies doing 1M minutes of labeled voice data per year.
The data gap between established players and newcomers to the market will continue to grow unless we invest in efforts like this.
Still, quite a lot of languages have only very tiny datasets of transcribed speech.
I did not hear any lines that were flat-out spoken incorrectly, at least as far as I could tell. However, I did come across a ton of really poor samples, to the point of being somewhat difficult to understand. Things like:
• Really strong accents
• Horrible, muffled microphones
• Background noise
• Super quiet
• A couple "robotic" samples I legitimately think were generated via text-to-speech software
All of these types of samples (save the last) constitute possible real-world scenarios. But do they make for good training data? I know very little about machine learning, but it makes logical sense to me that you'd want to teach the computer with "clean" data: something with a high signal-to-noise ratio which is as close to the "average" of the real world as possible. Is this completely wrong?
Separately, they ought to provide this type of instruction on what to do with borderline samples. If I legitimately can't tell for sure whether a word was spoken correctly, what should I do?
The idea is to make the model resistant to that "bad" input and effectively enlarge your dataset for free: if you have a picture of a cat, you automatically also get loads of pictures that you know should still be classified as a cat (rotated 15 degrees clockwise, noisy as in low-light conditions, the tip of its tail out of frame, the camera's automatic white balance screwing up...).
Also, the robotic samples may be real human voices mangled by LPC (think a lossy VoIP call).
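In image land that augmentation step is usually just a transform pipeline. A quick sketch with torchvision, with the parameters chosen arbitrarily for illustration:

    import torchvision.transforms as T

    # Each pass sees a slightly different version of the same labelled cat photo.
    augment = T.Compose([
        T.RandomRotation(degrees=15),                 # "rotated 15 degrees"
        T.ColorJitter(brightness=0.4, hue=0.1),       # low light / white balance gone wrong
        T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # tip of the tail out of frame
        T.ToTensor(),
    ])

    # augmented = augment(pil_image)  # still labelled "cat"

The audio analogue is the synthetic degradation discussed further up the thread (added noise, gain changes, codec artifacts), which is presumably why those "bad" CommonVoice samples are still useful as training data.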
Is it a generational thing and will future generations find typing into a search box as anachronistic as I find using a land line with a rotary dial?
I am also British and therefore not as loud as some people in the English-speaking world. Talking to my phone on the train would make me cringe. Clearly people like me will die off soon enough; however, is adoption a problem for these voice technology things? How do people change habits from pecking at a keyboard to the evidently easier voice-driven way of doing things? Is there one use case, e.g. in the car, where the habit of speaking to a gadget is learned?
For the past few months, I've been on the iOS CommonVoice app reading sentences out loud. It's great fun, I'd recommend it.
On some level it's a good idea to want to request this, but as the dataset is public-domain, isn't it going to get mirrored and retrieved by people who won't have to agree to anything? ...
Also very weird that they serve the tarball of MP3s gzipped, which seems mostly pointless to me, as it amounts to a reduction of maybe ~4% for a tremendous amount of time spent uncompressing the tarball (which itself has a bunch of useless macOS-specific headers, three apiece on each file).
Many of the MP3 files are literally just empty (zero size) or partially written (corrupt), it seems. I wonder if that issue comes from their choice to package the tarballs on macOS, or some underlying issue on the server side.
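If anyone wants to check their own copy, a quick sanity pass over the extracted directory is enough to spot the obvious cases, i.e. zero-length files and files that don't start with an ID3 tag or an MP3 frame sync; the directory name here is just a placeholder:

    import os

    def find_suspect_mp3s(root="cv_corpus"):
        """Report MP3 files that are empty or lack an ID3 tag / MP3 frame sync at the start."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if not name.lower().endswith(".mp3"):
                    continue
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    print("empty:", path)
                    continue
                with open(path, "rb") as f:
                    head = f.read(3)
                is_id3 = head.startswith(b"ID3")
                is_frame_sync = len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0
                if not (is_id3 or is_frame_sync):
                    print("no ID3/MP3 header:", path)

    find_suspect_mp3s()

(This won't catch files truncated partway through, but it flags the empty and obviously mangled ones.)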
Really glad to see Mozilla is trying to change that.
I've previously implemented concatenative TTS using unit selection. The quality is spotty, so I'm throwing it out and going with the ML approach, which produces higher-fidelity voices even in my own experiments.
My next step is taking an end-to-end synthesis model and porting it to run cheaply on the CPU.
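In case it's useful to anyone on the same path, one possible route is exporting the trained PyTorch model to ONNX and running it with onnxruntime on the CPU. This is just the generic export recipe with a stand-in module, not any particular synthesis model, and the names and shapes are placeholders:

    import torch
    import onnxruntime as ort

    class TinyNet(torch.nn.Module):  # stand-in for the real synthesis model
        def __init__(self):
            super().__init__()
            self.proj = torch.nn.Linear(16, 80)
        def forward(self, x):
            return self.proj(x)

    model = TinyNet().eval()
    dummy_input = torch.randn(1, 16)

    torch.onnx.export(model, dummy_input, "tts.onnx", opset_version=13,
                      input_names=["text_feats"], output_names=["mel"])

    session = ort.InferenceSession("tts.onnx", providers=["CPUExecutionProvider"])
    mel = session.run(None, {"text_feats": dummy_input.numpy()})[0]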
Thanks so much for making this data available, Mozilla! You're helping democratize this technology for individual engineers and researchers that don't have Google's resources.