Hacker News

That's what the second pass is for, right? To screen out genuinely unintelligible or misrecorded entries.

The English (in)fluency is more of a feature, though, than a bug. The goal isn't to produce a speech-to-text system that can recognize a perfectly miked BBC announcer. It's to be able to recognize a wide variety of people speaking fairly naturally in imperfect conditions, using whatever accent they use for casual speech.



> "The goal isn't to produce a speech-to-text system that can recognize a perfectly miked BBC announcer."

Wait, what? The headline is about text-to-speech, aka speech synthesis, not speech recognition (speech-to-text). Are they trying to do both? It seems to me that you'd train the two using different sorts of datasets. If you wanted TTS to be intelligible to the greatest number of people, training it to speak like a 'perfectly miked BBC announcer' is probably exactly what you'd want to do.

Train it to recognize many regional accents, but train it to speak with the most prevalent and universally understood accent you can find. So either BBC English or Californian/Hollywood English.

Although traditionally TTS engines have shipped with numerous voices, such that you can select either a British or an American accent for the English voice. It may be worthwhile to have other English accents too, maybe one for India (125 million speakers). But if you trained a TTS engine to speak with a computer amalgamation of all possible English accents, I really doubt the result would be considered high quality by anybody.


The headline was inaccurate (now fixed) — the Mozilla Voice project is about speech recognition (STT), not TTS.

It would be kinda interesting to have a TTS system learn from a neighboring STT system so that it gradually adopts your accent, though. I'm not sure if that would be more usable but it would be an interesting experience.


You're quite right that, in the way Mozilla is using this, it's suited to speech-to-text, but running it locally is actually a reasonable way to record a single-speaker dataset for use in TTS training.

That's exactly what I've been doing recently, and using it with https://github.com/r9y9/deepvoice3_pytorch/blob/master/READM... is producing reasonably good results; it definitely has my intonation (if somewhat crossed with a Dalek!!)
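For anyone wanting to try the same thing: most single-speaker TTS training pipelines (deepvoice3_pytorch included) can consume data laid out like the LJ Speech corpus, i.e. a folder of WAV clips plus a pipe-delimited metadata.csv. Here's a minimal sketch of packaging your own recordings that way; the function name and directory layout are just illustrative, not anything from the repo:

```python
from pathlib import Path

def write_ljspeech_metadata(clips, out_dir):
    """Write an LJSpeech-style metadata.csv for a single-speaker dataset.

    clips: list of (wav_stem, transcript) pairs, one per recorded clip.
    Each line is 'id|raw_text|normalized_text'; here we reuse the same
    transcript for both text columns for simplicity.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"{stem}|{text}|{text}" for stem, text in clips]
    (out_dir / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Two example clips; the WAVs themselves would live alongside as
# my_voice_dataset/wavs/clip_0001.wav etc.
lines = write_ljspeech_metadata(
    [("clip_0001", "The quick brown fox."), ("clip_0002", "Jumps over the lazy dog.")],
    "my_voice_dataset",
)
```

From there you point the trainer's data-preprocessing step at that directory and let it run; a few hundred clips is enough to hear your own intonation emerge, even if quality is rough.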


Oops. My bad. I accidentally got that wrong. Mozilla's goal here is speech recognition.


> But if you trained a TTS engine to have a computer amalgamation of all possible English accents I really doubt the result will be considered high quality by anybody.

For what it's worth, they do ask you to create a profile after your fifth sample, and that profile includes an "accent" section.



