* Licensing (MIT)
* Quality (judge for yourself: https://rhasspy.github.io/larynx/)
* Speed (faster than real-time on amd64/aarch64)
* Voices/language support (9 languages, 50 voices)
I'm working now on integrating Larynx with speech-dispatcher/Orca. The next version of Larynx will also support a subset of SSML :)
EDIT: despite those errors I can create output.wav. However, interactive mode crashes with "No such file or directory: 'play'".
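For what it's worth, the 'play' command is the player that ships with SoX; my assumption is that Larynx simply shells out to it for interactive playback, so installing SoX should at least make the binary available:

    # 'play' is provided by the SoX package (Debian/Ubuntu shown; adjust for your distro)
    sudo apt install sox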
For speech-dispatcher, I'd start a Larynx HTTP server and use curl to get audio. I have an undocumented --daemon flag that does something like this.
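Roughly this pattern, once the server is up (the port and endpoint below are assumptions from memory; check the routes your Larynx version actually exposes):

    # ask the running Larynx HTTP server for a WAV of the given text;
    # localhost:5002 and /api/tts are assumptions, not gospel
    curl -G --data-urlencode 'text=Hello from speech-dispatcher' \
         -o hello.wav 'http://localhost:5002/api/tts'

A speech-dispatcher output module would then only need to play the returned hello.wav.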
I was going to comment that you didn't have any en_gb listed, but it seems there's a bunch under en_us :)
Some rather good brit'ish accents in there me old mate!
The cloud-based systems from Google, Microsoft, Amazon and IBM are much better than anything else, and within them, the neural-network-based systems, which appear to be a sort of different product category, are far and away the best of all. The neural voices are approaching natural voice intonation and have an almost believable ability to read text.
The ones that sounded most natural were IBM Watson and Google's neural voices.
Amazon Polly appeared to be the furthest behind of all the cloud systems… a really average-sounding product.
Of the local TTS systems, the one built into macOS sounds the best… but they were all very average at best. All the Linux ones frankly sounded like garbage relative to the state of the art.
Things might have advanced with the cloud systems over the past couple of years but I didn’t get the impression the cloud companies were putting much effort into research and development.
They're not all easy to set up, however:
Because you need data, trained models, etc.
Because data scientists aren't typically product people or software engineers with UX in mind.
Because ML packages are brittle and tied to specific hardware configurations.
Because the ML world is evolving rapidly. It's quick, dirty, and messy.
View these as stepping stones for research and product development.
(I created https://vo.codes using a lot of these, fwiw, in an attempt to make it easy.)
They would have to release a great many different ones or alternatively bundle all libraries with it.
I once saw a comparison with LibreOffice that showed that the package Debian itself provided was 20% of the size of the package LibreOffice provided targeting Debian — which would not receive the same benefits of security bugfixes to libraries, but of course also not the same problems that often arise on Debian when they arrogantly patch libraries they barely understand and create their own unique security problems.
You have eSpeak, which is GPL V3, so including it in your own software is a problem. RH Voice can be compiled without GPL code, but its language support is pretty limited. There's also SAM, which is incredibly easy to port and incredibly light on resources, but its licensing status is unknown, it's English only and it just sounds bad, even to somebody used to robotic synths.
If you're developing for a popular platform, it probably has something built-in, but if you're developing for embedded, you need to pay thousands of dollars to Cerence (formerly Nuance) to even get started.
Can't you just include it as a separate module and provide any improvements to it specifically upstream?
If the device you're developing for uses some proprietary firmware, a custom module might not even be an option.
As an aside, because of Vocalizer's use in automotive, it will probably be the only high-ish quality speech engine that won't become fully cloud-based. VFO's claims about the continued use of Vocalizer in JAWS seem to confirm that.
Regarding Eloquence itself, its status is not really known. I would be extremely surprised if it was owned by Microsoft, though. There's a hypothesis that nobody really knows who actually owns it; multiple companies assisted in its development, including IBM. The product was so unimportant to Nuance by that point that they might not even have considered it when doing the spinoff, leaving its ownership uncertain. If this hypothesis is untrue, though, I'd strongly suspect that Cerence is the owner, not Microsoft.
Stuffing an existing model into a .deb is of course fairly easy.
Work like, say, https://arxiv.org/abs/1806.04558 [paper]
Edit: Sample of ETI-Eloquence at my preferred speed: https://mwcampbell.us/audio/eloquence-sample-2021-09-25.mp3 (yes, it mispronounces "espeak")
Edit 2: To elaborate on what I mean by "mostly dead": In 2009 I was tasked with adding support for ETI-Eloquence to a Windows screen reader I developed. At that time, Nuance was still selling Eloquence to companies like the one I worked for back then. When I got the SDK, the timestamps on the files, particularly the main DLLs, were from 2002. As far as I know, an updated SDK for Windows was never released. I'm thankful for Windows's legendary emphasis on backward compatibility, particularly compared to Apple platforms and even Android.
Finally, a sample of espeak-ng (in the NVDA screen reader) at my preferred speed: https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3 I use the default British pronunciation even though I'm American, because the American pronunciation is noticeably off.
This is exactly the speech synthesizer I use daily. I've gotten so used to it over the years that switching away from it is painful.
On Apple platforms, though, using it is not an option. So I use Karen. Used to use Alex, but Karen appears to be slightly more responsive and tries to do less human stuff when reading. Responsiveness is a very important factor, actually. Probably more so than people might realize. Eloquence and ESpeak react pretty much instantly whereas other voices might take 100 ms or so. This is a very big deal for me. Just like how one would like instant visual feedback on their screen, it's the same for me with speech. The less latency, the better.
My problem with ESpeak is that it sounds very rough and metallic whereas Eloquence has a much warmer sound to it. I pitch mine down slightly to get an even warmer sound. Being pleasant on the ears is super important if you listen to the thing many, many hours a day.
Edit: or on the online demo; select "HMM-based method (HTS 2011) - Combilex" > "SLT (English American female)".
There are of course great benefits to something simple to use. I remember cross-compiling flite to run in a custom Android/Windows/Linux project based on SDL, generating voice lines for an in-game robot companion (nothing came of it, though). It probably would not be nearly as feasible to do the same with some dependency-heavy machine learning library.
Now, I haven't done any research to find better examples of projects. I was just surprised at how closely the options the article describes match what was available 12 years ago.
Either we're just not there yet technologically (hard to believe), or there isn't a will to make good speech synthesis available to commoners.
I tried Festival, but it was too complicated and my version was too old to run the better voice models.
Instead I've used this repo to get an upgraded flite: https://github.com/kastnerkyle/hmm_tts_build/
I've mapped keyboard shortcuts: Win+1 for normal speed, Win+2 for faster, and Win+3 for really fast reading. I can use it while reading, to enhance my focus. Neat.
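A sketch of how such a binding can work; espeak-ng's -s flag stands in here for the flite build mentioned above because its interface is well documented, and xclip plus the wpm values are assumptions rather than the setup described:

    #!/bin/sh
    # speak.sh <wpm> -- read the current X primary selection aloud at the given rate
    RATE="${1:-175}"                          # 175 wpm is espeak-ng's default speed
    xclip -o -selection primary | espeak-ng -s "$RATE"

Win+1/2/3 would then be bound to 'speak.sh 175', 'speak.sh 300', and 'speak.sh 450' in whatever keybinding tool the desktop provides.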
For local, Mozilla TTS was best from a quality standpoint, but GPU inference was a bit dicey and (possibly) not really supported at all.
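(For anyone wanting a quick taste: the Coqui fork of Mozilla TTS ships a command-line entry point; the model name below is one of its published examples, so treat this as a sketch rather than the exact setup referred to above.)

    # pip install TTS            (the Coqui fork of mozilla/TTS)
    # synthesize a sentence with a published LJSpeech model and write a WAV
    tts --text "Hello from Mozilla TTS" \
        --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
        --out_path hello.wav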
For more complex and bespoke applications the Nvidia (I know, I know) NeMo toolkit is very powerful but requires more effort than most to get up and running. However, it provides the ability to do very interesting things with additional training and all things speech.
In the Nvidia world there's also their Riva (formerly Jarvis) solution that works with Triton to build out an architecture for extremely performant and high-scale speech applications with things like model management, revision control, deployment, etc.
You can hear it in this video: https://www.youtube.com/watch?v=tfcme7maygw&t=131s
The Italian voice sounds great.
Which is simply not the case.
Artificial speech is to human speech what typography is to handwriting.
For example espeak is by far my number one choice for reading anything, because the voice models it uses can be sped up to 1k wpm and still be understandable. This is basically a superpower when skimming boring documentation of any type. Throw in basic tesseract OCR and in a 45-minute sitting I can go through 30k words of any document that can be displayed on a computer screen.
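A rough sketch of that loop, where the tool choices are assumptions (any screenshot tool works, and the 450 wpm below is just espeak-ng's documented upper bound; rates near 1k wpm are beyond what the man page promises):

    # grab the whole screen, OCR it, and read the text back at speed
    import -window root /tmp/screen.png                    # ImageMagick screenshot
    tesseract /tmp/screen.png stdout 2>/dev/null | espeak-ng -s 450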
It's not that I'm stuck with a terrible robotic voice, it's that I don't want anything "better" in the same way that I don't see much value going past the command line for most tools when you can use ncurses.
So probably most people here researched this topic not for accessibility reasons but for "commercial" stuff like creating some kind of service where chat bots could speak to you, or transcribing articles for regular people (without eye problems) to listen to.
More natural speech patterns would be useful in those venues.
What a lot of people don't realize is that Festival is intended for creating new TTS voices based on your own voice. The fact that it generates TTS is an artifact of its main function. I've never messed with that functionality myself, but I always wonder if someone could train a synthetic voice to sound better with a larger sample set. The Nitech voices are definitely better, so it's certainly possible to encourage Festival to do a better job.
Warning: HTS_fopen: Cannot open hts/htsvoice.
aplay: main:666: bad speed value 0
aplay: main:666: bad speed value 0
;; Debian-specific: Use aplay to play audio
(Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
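The "bad speed value 0" lines above suggest $SR is expanding to 0 (a guess: the HTS voice failing to load means Festival never reports a sample rate). As a sanity check, the same aplay invocation is happy when the rate is spelled out explicitly:

    # play a raw 16 kHz mono file directly, bypassing the $SR substitution
    aplay -q -c 1 -t raw -f S16_LE -r 16000 /tmp/test.raw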
I did, however, go through the motions of installing the Nitech voices before reading that they only work with older versions of Festival. Doh!