
A highly efficient, real-time text-to-speech system deployed on CPUs - moneil971
https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/
======
thelazydogsback
Personally, I find I dislike any "emotion" added to TTS -- I find Alexa's emo
markup, a la:

[https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2019/11/new-alexa-emotions-and-speaking-styles](https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2019/11/new-alexa-emotions-and-speaking-styles)

to be disturbing and without much added value (such as when it's used with
games like Jeopardy).

If used, the application of these tags needs to be meticulous about context,
somewhat non-deterministic, and paired with randomized prosody. Repeated use
of the same overstated emotive content is annoying and unnatural (worse than
a "flat" presentation) and only serves to underscore the inflexibility of the
underlying conversational content.
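
A sketch of the kind of application I mean, using Alexa's documented
amazon:emotion and prosody SSML tags (the probability and jitter values here
are made up for illustration, not tuned):

    import random

    # Apply emotion markup sparingly and non-deterministically, and
    # jitter the prosody so repeated responses don't sound identically
    # "emotive". Probabilities and ranges are illustrative only.
    def mark_up(text, emotion="excited", apply_probability=0.3):
        # Most of the time, leave the line flat.
        if random.random() > apply_probability:
            return text
        # When we do emote, randomize intensity and rate slightly.
        intensity = random.choice(["low", "medium"])  # avoid constant "high"
        rate = random.choice(["95%", "100%", "105%"])
        return (
            f'<prosody rate="{rate}">'
            f'<amazon:emotion name="{emotion}" intensity="{intensity}">'
            f"{text}</amazon:emotion></prosody>"
        )

    print(mark_up("That is the correct answer!"))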

~~~
teilo
I agree. All I care about is that the pronunciation is contextually correct,
and smooth. Accents are fine, and necessary. But I don't want a non-human
simulating human emotions. Now if they aren't simulated, that's another
story...

~~~
mwcampbell
+1. And I know I'm in the minority here, but I prefer it if computers don't
use actual human voices at all. I like my computer to sound like a computer.
(I wouldn't go so far as to add distracting retro affectations though.)

I'm visually impaired, and I often use a screen reader when browsing the web.
Here's my favorite voice to use with a screen reader:

[https://mwcampbell.us/tmp/hn-comment-20200515-1.mp3](https://mwcampbell.us/tmp/hn-comment-20200515-1.mp3)

~~~
specialist
That's amazing. Thank you for sharing.

I could not comprehend that. I tried a few times. IIRC, people can learn to
"speed hear".

Makes me think of the Star Trek: TNG Bynars meets the FedEx speed talker.

[https://en.wikipedia.org/wiki/11001001](https://en.wikipedia.org/wiki/11001001)

[https://www.youtube.com/watch?v=v1o2wg5wqko](https://www.youtube.com/watch?v=v1o2wg5wqko)

~~~
mwcampbell
I only run my TTS moderately fast. I have blind friends who run it faster. I
have some vision, so I only use a screen reader part of the time, and not for
programming.

Also, I think speech synthesizers can do a more consistent job of enunciating
at high speed than humans can. I didn't quite understand everything the FedEx
speed talker was saying.

------
ekelsen
Exciting to see our research making broad impact across the industry!
[https://arxiv.org/abs/1802.08435](https://arxiv.org/abs/1802.08435)

~~~
ajtulloch
Absolutely, it's super impressive work (as is your later work with Marat :) ).

------
jandrese
Speech synthesis has always baffled me. You could run a reasonable (albeit
strangely accented) version on 16 MHz Macs without major CPU impact. The
code, including sound data, was less than a megabyte.

In order to achieve modest improvements in dictation we're throwing entire GPU
arrays at the problem. What happened in the middle? Was there really no room
for improvement until we went full AI?

~~~
Someone
IIRC, it was _with_ major CPU impact. An 8 MHz machine couldn't do much else
while talking.

Also, the original MacinTalk sounded a lot better if you fed it phonemes
instead of text. It didn't know how to pronounce that many different words,
and wasn't really good at making the right choice when the pronunciation of a
word depended on its meaning.

For example, if you gave it the text “Read me”, it always pronounced “Read” in
the past tense. That always seemed the wrong bet to me, and I would think the
developers had heard that, too, but apparently, fixing it was not that simple.
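
A toy sketch of why that's hard (this is not MacinTalk's actual algorithm,
just an illustration of how the right phonemes for a heteronym depend on part
of speech, which in turn depends on context):

    # Toy heteronym disambiguation. The crude heuristic below guesses the
    # imperative, present-tense "read" when the word opens the sentence
    # and is followed by an object pronoun, as in "Read me".
    HETERONYMS = {
        # word -> (present-tense phonemes, past-tense phonemes), ARPAbet-ish
        "read": ("r iy d", "r eh d"),
    }

    OBJECT_PRONOUNS = {"me", "it", "them", "us", "this", "that"}

    def phonemes_for_read(words, i):
        present, past = HETERONYMS["read"]
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # Sentence-initial "read" plus an object pronoun is almost always
        # an imperative: "Read me", "Read this".
        if i == 0 and nxt in OBJECT_PRONOUNS:
            return present
        # "have/has/had read" is the past participle.
        if i > 0 and words[i - 1] in {"have", "has", "had"}:
            return past
        return past  # MacinTalk's reported default: past tense

    for sentence in ["read me", "i have read it", "she read the book"]:
        words = sentence.split()
        print(sentence, "->", phonemes_for_read(words, words.index("read")))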

I also think it didn't know whether to pronounce "Dr" as "Drive" or "Doctor",
depending on context, or "St" as "Saint" or "Street", to mention a few
examples, and it was probably abysmal when you asked it to speak a phone
book, with its zillions of rare names (back in the eighties, that's an area
where AT&T's speech synthesizers excelled, I've been told).
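
Handling those expansions takes context rules, something like this crude
illustration (nothing close to a production text normalizer):

    import re

    # "Dr"/"St" before a capitalized word reads as a title or saint's
    # name; otherwise (typically trailing, after a street name) it is
    # part of an address.
    def expand(text):
        text = re.sub(r"\bDr\.?\s+(?=[A-Z])", "Doctor ", text)
        text = re.sub(r"\bSt\.?\s+(?=[A-Z])", "Saint ", text)
        text = re.sub(r"\bDr\.?\b", "Drive", text)
        text = re.sub(r"\bSt\.?\b", "Street", text)
        return text

    print(expand("Dr Smith lives at 10 Mulholland Dr"))
    # -> Doctor Smith lives at 10 Mulholland Drive
    print(expand("St James St"))
    # -> Saint James Street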

And that's just the text-to-phoneme conversion. The art of picking the right
intonation and speaking rate is a whole different ballpark; it requires a
kind of sentiment analysis.

~~~
wkearney99
hee hee, the best mispronunciation from MacinTalk had to be chihuahua.

chee-hoo-ah-hoo-ah

~~~
thelazydogsback
The TI-99/4A TTS was the bomb! :)

------
blickentwapft
It’s a pity that all the best text to speech and speech to text systems are
cloud based with heavy vendor lock in.

------
Avi-D-coder
Any chance of an open-source implementation of this?

I could really use a better TTS for Linux.

~~~
brutt
No, it cannot be open sourced. It literally has no source to open.

~~~
jandrese
Huh? It appears to be written in PyTorch according to the article?

The training data could also be considered source.

And I agree that this is of limited use if I have to access it by uploading
and downloading everything from Facebook servers. Not only are there privacy
implications, but it also requires a solid, fast, low-latency internet
connection that I can't guarantee.

~~~
brutt
AI is not written; it is trained, using a dataset, PyTorch, and a lot of
compute time (and manpower).

The dataset is not a big problem (if you can speak, you can create your own).
PyTorch is already open.

~~~
qchris
Depending on the architecture, though, it's possible to export the trained
model into a stand-alone file that can be imported by somebody else's
program, decoupling the network's training data from the model it produces.

This is done pretty frequently in areas like computer vision and speech
recognition, with the pre-trained weights for YOLO and Mozilla DeepSpeech[0]
being available for download. I'm not sure the word "open-source" totally
applies here since, as you pointed out, apart from the dataset there isn't
much "source" to download, but OP's question might be answered by making the
resulting models publicly available along with the source code of the
networks used to train and deploy them.

[0] [https://github.com/mozilla/DeepSpeech/releases/v0.6.0](https://github.com/mozilla/DeepSpeech/releases/v0.6.0)
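
For example, in PyTorch (which the article says this system is built on),
shipping a model without its training data can look like the sketch below;
TinyTTS here is a made-up stand-in, not Facebook's actual model:

    import torch
    import torch.nn as nn

    class TinyTTS(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(80, 256)  # placeholder layer

        def forward(self, x):
            return self.net(x)

    model = TinyTTS()

    # Option 1: ship only the learned weights; the recipient needs the
    # model class, but never sees the training data.
    torch.save(model.state_dict(), "tiny_tts_weights.pt")
    restored = TinyTTS()
    restored.load_state_dict(torch.load("tiny_tts_weights.pt"))

    # Option 2: TorchScript bundles architecture + weights into one
    # file that runs without the original Python class definition.
    scripted = torch.jit.script(model)
    scripted.save("tiny_tts_scripted.pt")
    reloaded = torch.jit.load("tiny_tts_scripted.pt")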

------
ge96
Impressive, but it still sounds "robotic", like AWS Polly. I wonder if
they'll fuse it with the tech where you can sample someone's voice from a
single paragraph and build a model from it. Then you could hire a voice actor
(or actress) and maybe license their voice? I don't know how that would work.

~~~
shakna
Personally, I quite like Polly's voices. But Polly already offers custom
voices, such as ones trained on a particular person's voice. [0]

[0] [https://aws.amazon.com/polly/features/#Brand_Voice](https://aws.amazon.com/polly/features/#Brand_Voice)
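
For the stock voices, usage through boto3 is just a few lines (a sketch
assuming configured AWS credentials; Brand Voice custom voices are
provisioned through AWS rather than through this API, but are used the same
way once you have a VoiceId):

    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text="Hello from Polly.",
        OutputFormat="mp3",
        VoiceId="Joanna",  # one of the stock voices
        Engine="neural",   # the newer, better-sounding engine
    )
    with open("hello.mp3", "wb") as f:
        f.write(response["AudioStream"].read())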

~~~
ge96
Which one do you use? I built something three years ago and have been using
Kendra; I'm not sure if Joanna is new, but that one sounds much better.

------
birdyrooster
How long until computers can brainstorm all sorts of exciting new voices for
characters, removing the need for pesky contracts and royalty payments?

------
godelski
That video at the end really is deep in the uncanny valley.

~~~
microtherion
Meh. The synthesis quality is not terrible, but calling it "state of the art",
quality-wise, is a bit of a stretch.

------
Causality1
The weaknesses of TTS grate on different people in different ways. For example,
Microsoft Zira and the older Google TTS voice rank near the top for me, while
I find every single one of the modern Google voices so horrible as to provoke
instant anger when I hear them.

------
bergstromm466
Yeah, awesome! This proprietary transcription algorithm must make it a hell
of a lot easier for NSA databases. If FB deploys this so they can send
finished, full transcripts of calls and other voice traffic [1] instead of
the original audio to be transcribed later, it will all be so much more
efficient! // sarcasm

[1] [https://theintercept.com/2015/05/05/nsa-speech-recognition-snowden-searchable-text/](https://theintercept.com/2015/05/05/nsa-speech-recognition-snowden-searchable-text/)

