> Writing a vocaloid for japanese is a high school programming class project. Writing a TTS for english takes thousands of Google developers years.
It didn't take "thousands of Google developers years" to teach a computer English spelling rules. Indeed, even a programming class assignment could do that: you can cover most cases by just looking the pronunciations up in a dictionary. (The number of quirks and inconsistencies makes English spelling quite hard for humans to memorize, but computers are rather good at lookup tables.)
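As a rough sketch of that lookup-table idea (the table entries below are hand-typed, ARPAbet-style illustrations without stress marks, and `PRONUNCIATIONS` and `to_phonemes` are made-up names, not any real library's API):

```python
# Toy pronunciation table. A real system would load a full pronouncing
# dictionary; these four entries are illustrative only.
PRONUNCIATIONS = {
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
    "tough": ["T", "AH", "F"],
    "cough": ["K", "AO", "F"],
}


def to_phonemes(text):
    """Look up each word; unknown words come back as None."""
    result = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")  # drop surrounding punctuation
        if not word:
            continue
        result.append(PRONUNCIATIONS.get(word))
    return result
```

Four words ending in "-ough" get four different pronunciations, which is painful for a human learner and a non-issue for a dictionary. Words missing from the table still need some fallback, which is where the remaining cases come from.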
Even if you consider the actual Vocaloid software, which was developed by a team at a large corporation, there are two factors differentiating it from English TTS that make the latter much harder:
1. Japanese has much simpler phonetics than English, with a smaller set of phonemes and (somewhat oversimplifying) only using open syllables. So it's easier to consume and produce, for both computers and humans, but at the cost of being a less efficient encoding: Japanese tends to require a lot more syllables than English to express the same concept, and there are a lot of homophones.
2. Vocaloid sounds robotic. It's gotten a bit less so over time, but it still doesn't come close to passing as human. If you're okay with robotic, English TTS software has existed for a long time, starting many decades before Google was founded.
The hard part, the part that requires neural networks and massive computational power and Google and has yet to be perfected, is making it sound human.
By the way, although Vocaloid software would be given phonetic input, normal Japanese writing uses kanji (i.e. Chinese characters), most of which have multiple unrelated possible pronunciations. Determining which pronunciation applies to each character in a given piece of text is nontrivial, and sometimes even depends on context or meaning.
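To make the kanji point concrete, here is a minimal sketch that assigns readings by greedy longest-match lookup against a word table. The table and the `read` function are hypothetical and tiny; real systems use a full morphological analyzer, and this is not how any particular one works.

```python
# Hand-picked word-level readings. Note that the character 生 is read
# differently in each word it appears in.
WORD_READINGS = {
    "学生": "がくせい",      # 生 read as "sei"
    "生ビール": "なまビール",  # 生 read as "nama"
    "誕生日": "たんじょうび",  # 生 read as "jou"
}

MAX_WORD_LEN = max(len(w) for w in WORD_READINGS)


def read(text):
    """Replace known words with their readings, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_READINGS:
                out.append(WORD_READINGS[chunk])
                i += n
                break
        else:
            # No entry covers this position: pass the character through.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Even with a complete table, lookup alone is not enough: a word like 今日 can be read きょう or こんにち depending on the sentence, so something has to look at the surrounding context.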