Hacker News

That's what the second pass is for, right? To screen out genuinely unintelligible or misrecorded entries.

The English (in)fluency is more of a feature, though, than a bug. The goal isn't to produce a speech-to-text system that can recognize a perfectly miked BBC announcer. It's to be able to recognize a wide variety of people speaking fairly naturally in imperfect conditions, using whatever accent they use for casual speech.



> "The goal isn't to produce a speech-to-text system that can recognize a perfectly miked BBC announcer."

Wait, what? The headline is about text-to-speech, aka speech synthesis, not speech recognition (speech-to-text). Are they trying to do both? It seems to me that you'd train the two using different sorts of datasets. If you wanted TTS to be intelligible to the greatest number of people, training it to speak like a 'perfectly miked BBC announcer' is probably exactly what you'd want to do.

Train it to recognize many regional accents, but train it to speak with the most prevalent and universally understood accent you can find. So either BBC English or Californian/Hollywood English.

Although traditionally TTS engines have shipped with numerous voices, such that you can select either a British or an American accent for the English voice. It may be worthwhile to have other English accents too, maybe one for India (125 million speakers). But if you trained a TTS engine to speak with a computer amalgamation of all possible English accents, I really doubt the result would be considered high quality by anybody.


The headline was inaccurate (now fixed) — the Mozilla Voice project is about speech recognition (STT), not TTS.

It would be kinda interesting to have a TTS system learn from a neighboring STT system so that it gradually adopts your accent, though. I'm not sure if that would be more usable but it would be an interesting experience.


You're quite right that, in the way Mozilla is using this, it's suited to speech-to-text, but running it locally is actually a reasonable way to record a single-speaker dataset for use in TTS training.

That's exactly what I've been doing recently, and using it with https://github.com/r9y9/deepvoice3_pytorch/blob/master/READM... is producing reasonably good results; it definitely has my intonation (if somewhat crossed with a Dalek!!)
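For anyone wanting to try the same thing: most single-speaker TTS training pipelines (deepvoice3_pytorch included) can consume data laid out like the LJ Speech corpus, i.e. a folder of WAV clips plus a pipe-delimited metadata.csv. Here's a minimal sketch of packaging your own recordings that way; the function name and directory layout are just illustrative, not anything from the repo:

```python
from pathlib import Path

def write_ljspeech_metadata(clips, out_dir):
    """Write an LJSpeech-style metadata.csv for a single-speaker dataset.

    clips: list of (wav_stem, transcript) pairs, one per recorded clip.
    Each line is 'id|raw_text|normalized_text'; here we reuse the same
    transcript for both text columns for simplicity.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"{stem}|{text}|{text}" for stem, text in clips]
    (out_dir / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Two example clips; the WAVs themselves would live alongside as
# my_voice_dataset/wavs/clip_0001.wav etc.
lines = write_ljspeech_metadata(
    [("clip_0001", "The quick brown fox."), ("clip_0002", "Jumps over the lazy dog.")],
    "my_voice_dataset",
)
```

From there you point the trainer's data-preprocessing step at that directory and let it run; a few hundred clips is enough to hear your own intonation emerge, even if quality is rough.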


Oops. My bad. I accidentally got that wrong. Mozilla's goal here is speech recognition.


> But if you trained a TTS engine to have a computer amalgamation of all possible English accents I really doubt the result will be considered high quality by anybody.

For what it's worth, they do ask you to create a profile after your fifth sample, and that profile includes an "accent" section.



