
Creating a Computer Voice That People Like - dnetesn
http://www.nytimes.com/2016/02/15/technology/creating-a-computer-voice-that-people-like.html
======
Mithaldu
Amusingly, the biggest problem here is English itself: it is amazingly
irregular in its pronunciation, and a few letters can change the sound of an
entire word (constable, unstable).

The article briefly mentions the technology that uses voice sample databases,
without mentioning its primary example: Vocaloids.

With Vocaloid technology, digital speech synthesis works amazingly well and is
used to create entire songs ( here's a video of a voice bank donor singing
together with her Vocaloid:
[https://youtu.be/JfW0glLj2pE?t=74](https://youtu.be/JfW0glLj2pE?t=74) ;
here's an example of casual TTS for chat that sounds silly, but is incredibly
clear to understand:
[https://youtu.be/R8OoadkjO8Q?t=20](https://youtu.be/R8OoadkjO8Q?t=20) ).
However, this only works well for Japanese (voice bank size: ~100 samples),
since it is astonishingly regular (i.e. spelling maps almost 1:1 to
pronunciation), while English has almost no regularity whatsoever, so to
replicate the same results, voice banks need 1,000 samples or more.
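
A back-of-the-envelope sketch of why the Japanese inventory stays so small:
the basic syllabary is roughly a consonant-by-vowel grid of morae, so a bank
near 100 units covers most of it (the grid below is simplified; a few cells
like "yi" or "wu" don't actually exist):

```python
# Rough sketch: the core Japanese sound inventory is (mostly) a small
# consonant x vowel grid of morae, which is why a ~100-sample voice
# bank goes a long way. Simplified: real banks also need voiced
# variants (ga, za, ...) and combinations (kya, sho, ...).
consonants = ["", "k", "s", "t", "n", "h", "m", "y", "r", "w"]
vowels = ["a", "i", "u", "e", "o"]

morae = [c + v for c in consonants for v in vowels]
print(len(morae))  # 50 basic morae; with variants, on the order of 100
```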

~~~
delecti
There is still only a limited number of sounds in English. Including a
dictionary with phonetic pronunciations seems like it should be sufficient to
avoid the constable/unstable problem.
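
Something like this toy lookup (the ARPAbet-style transcriptions are
hand-copied for illustration, not a real lexicon):

```python
# Toy phonetic dictionary: look the word up instead of guessing from
# spelling, so the shared "-stable" letters no longer mislead.
PRONUNCIATIONS = {
    "constable": ["K", "AA1", "N", "S", "T", "AH0", "B", "AH0", "L"],
    "unstable":  ["AH0", "N", "S", "T", "EY1", "B", "AH0", "L"],
}

def to_phonemes(word):
    return PRONUNCIATIONS[word.lower()]

print(to_phonemes("constable"))  # schwa in "-stable"
print(to_phonemes("unstable"))   # full "EY" vowel in "-stable"
```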

~~~
Mithaldu
It's not just the pronunciation of specific words, though; it's also
combinations of words, and contractions like "would've". English basically
consists entirely of special cases, while in Japanese you can straight-up just
match the written characters to the sounds and get it right in most cases. In
fact, I can list the exceptions right here in this post: sometimes the "u"
sound is swallowed when the word is more comfortable to say without it; "wo"
is almost always pronounced as "o"; and "ha" is very often pronounced as "wa"
due to a specific grammatical rule. Now try listing all the exceptions to
English pronunciation. :)
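
Those exceptions are small enough to sketch in a few lines (the kana table and
the particle flag below are simplified assumptions, not a full romanizer):

```python
# Minimal sketch of kana-to-sound rules plus the exceptions listed
# above: particle は -> "wa", particle を -> "o", and devoiced final
# "u" as in です ("desu" -> "des").
KANA = {"わ": "wa", "た": "ta", "し": "shi", "は": "ha",
        "を": "wo", "で": "de", "す": "su"}

def pronounce(tokens):
    """tokens: list of (kana, is_particle) pairs."""
    sounds = []
    for kana, is_particle in tokens:
        sound = KANA[kana]
        if is_particle and kana == "は":
            sound = "wa"      # topic particle は is read "wa"
        elif is_particle and kana == "を":
            sound = "o"       # object particle を is read "o"
        sounds.append(sound)
    # Crude final-"u" devoicing, as in です -> "des"
    if sounds and sounds[-1] == "su":
        sounds[-1] = "s"
    return sounds

# わたしは -> "wa ta shi wa": the particle rule fires on the last kana
print(pronounce([("わ", False), ("た", False), ("し", False), ("は", True)]))
```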

~~~
schoen
You can deal with the irregularity of spelling by using a pronouncing
dictionary, which has the correct pronunciation of every word in a lexicon.
For example, CMU has the CMU Pronouncing Dictionary:

[http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)

Spelling irregularities are fully taken into account. But one remaining
problem is heteronyms, where the same spelling is pronounced in two different
ways depending on the meaning, like English initial-stress-derived nouns:

[https://en.wikipedia.org/wiki/Initial-stress-derived_noun](https://en.wikipedia.org/wiki/Initial-stress-derived_noun)

A pronouncing dictionary will record each of the options but might not provide
a way to determine which one is correct in a particular context.
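
For instance, NLTK's copy of CMUdict (assuming nltk is installed) lists every
variant of "record" but leaves the disambiguation entirely to you:

```python
# Heteronyms in the CMU Pronouncing Dictionary: "record" has entries
# for both the noun (stress on the first syllable) and the verb
# (stress on the second), but nothing says which one a sentence needs.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
for pron in cmudict.dict()["record"]:
    print(" ".join(pron))  # e.g. "R EH1 K ER0 D" vs. "R IH0 K AO1 R D"
```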

Combinations of words, as you mentioned, do pose problems for speech synthesis
because of suprasegmentals, where for natural speech you might need to adjust
features of one word because of the presence of others, for a variety of
linguistic reasons.

[https://en.wikipedia.org/wiki/Prosody_%28linguistics%29](https://en.wikipedia.org/wiki/Prosody_%28linguistics%29)

Japanese has suprasegmental issues too, so you can't expect to make realistic
speech synthesis easy just by synthesizing a language with a regular
orthography and a simple sound structure. Also, imagine trying to create text-
to-speech for Japanese texts written in kanji rather than kana; you'd have
irregularity problems in some ways even more challenging than English
orthography presents!

[https://en.wiktionary.org/wiki/%E7%94%9F#Kanji](https://en.wiktionary.org/wiki/%E7%94%9F#Kanji)

~~~
majewsky
A pronouncing dictionary does not help you when people invent new words or
import new words from a different language. In those cases, readers familiar
with the subject at hand might identify the pronunciation from their
familiarity with the source language or subject, but a dictionary has no
entry to fall back on.

Unless, of course, you have an AI attached to your TTS that can understand the
context in a similar way.

~~~
delecti
Depending on the context, you could potentially just include the
pronunciations of all the new words; in a video game set in a fictional
setting, for example.

------
Razengan
I've always found it odd that the video game industry hasn't put as much focus
on improving speech synthesis as it has on 3D graphics. Game developers would
be the ones to benefit the most from more natural text-to-speech.

It seems to me that back when games were text-based, you got more elaborate
storylines because the developers and writers themselves could just add to and
improve the dialogue at any time. See Planescape: Torment [1], regarded as
having one of the most complex stories in a game, with enough dialogue to
rival a novel, but mostly in text. The same goes for Star Control 2 [2], which
was the inspiration behind Mass Effect.

Now, when everyone wants to be Hollywood and voice acting is the norm, and
_expected_ by players for a game to be considered high-quality, stories are
locked in to whatever was recorded in the studio at the time. This is also
probably the reason for the discrepancy between the dialogue choices you see
on screen and what your character actually says (such as in Mass Effect,
Fallout 4, and many others, where you just get to choose between something
like "Joke" or "Threaten" and it sometimes spawns a very unexpected chain of
conversation).

Better speech synthesis would not only empower solo developers to write
engrossing stories and complex dialogue from the comfort of their bedrooms,
like they used to; it would also open up better player customization in RPGs
and online games, where whatever you type would be spoken to other players in
your preferred voice.

[1]:
[https://en.wikipedia.org/wiki/Planescape:_Torment](https://en.wikipedia.org/wiki/Planescape:_Torment)

[2]:
[https://en.wikipedia.org/wiki/Star_Control_II](https://en.wikipedia.org/wiki/Star_Control_II)

~~~
sliverstorm
It would be convenient, certainly, the same way a robot kitchen would be
convenient. But it turns out human chefs aren't all that expensive compared to
an equally competent robot chef, at least for the foreseeable future.

The video game industry already has lifelike speech available to it, for a
cost. Lifelike 3D graphics on the other hand don't exist at _any_ cost.

~~~
Razengan
But with human voice actors, you can't modify the dialogue in your game, or
change/improve the part of the story that's tied to that dialogue, without
scheduling new recording sessions.

That can get expensive and actors aren't always going to be available (which
is another problem when you're trying to make sure you get the same voices for
DLC & sequels etc.)

~~~
sliverstorm
Yes, I understand why it would be _nice_ to have arbitrary lifelike
synthesizable voice. But hiring a voice actor currently gets you 90% of the
way "there" (wherever "there" is) at a tiny, tiny, minuscule fraction of the
cost of personally developing the entire future of speech synthesis.

------
bane
Having grown up with various levels of automated speech, I find today's speech
synthesis _really_ good. At times it takes a few seconds of listening to find
a "tell" that lets you know it's not a person or a pre-recording of a person.

Here are some examples of what I grew up with, for comparison:

- [https://www.youtube.com/watch?v=0ccKPSVQcFk](https://www.youtube.com/watch?v=0ccKPSVQcFk)

- [https://www.youtube.com/watch?v=rR0Ofu0M53g](https://www.youtube.com/watch?v=rR0Ofu0M53g)

- [https://youtu.be/4S3G3veop2w?t=102](https://youtu.be/4S3G3veop2w?t=102)

- [https://www.youtube.com/watch?v=ginShVeGpGY](https://www.youtube.com/watch?v=ginShVeGpGY)

and then suddenly in the last few years you get Siri.

With a few more rules, I wouldn't even mind modern voices reading entire books
to me. They need to pause more and take breaths, for example, or they can be
exhausting to listen to... but I have the sense that we're pretty close.

~~~
teddyh
Of the examples you gave, the first one is a real human voice run through a
vocoder, and the last one is a real human voice sampled at a really crappy
sample rate and played back using hardware never intended for digital sampled
audio.

~~~
bane
Speech synthesis is a spectrum of technologies. In the early days there really
wasn't much difference between recording somebody and playing it back, and
chopping up recordings of phonemes or words and reordering them to produce
speech.

It's kind of like those robot voices used by telephone companies and answering
machines for years: they're just prerecordings of somebody saying words and
numbers, and the machine splices them back together in the appropriate
sequence.

At a finer granularity, you simply record somebody saying phonemes and
reconstruct them into a voice that can say pretty much anything. Alter the
pitch and you can have it ask questions or sing. This technique has been
around in some form for decades.
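
A minimal sketch of that concatenative idea (the unit file names and
inventory are hypothetical, and real systems use smarter pitch techniques
like PSOLA rather than plain resampling):

```python
# Concatenative synthesis in miniature: glue prerecorded phoneme
# waveforms together, then shift pitch crudely by resampling.
import numpy as np
from scipy.io import wavfile

RATE = 16000  # assumed sample rate of the unit recordings

def load_unit(name):
    # Hypothetical per-phoneme recordings: units/hh.wav, units/ah.wav, ...
    _, data = wavfile.read(f"units/{name}.wav")
    return data.astype(np.float32)

def synthesize(phonemes, pitch_factor=1.0):
    wave = np.concatenate([load_unit(p) for p in phonemes])
    # Resampling raises pitch (and speeds the clip up); pitch_factor > 1
    # shifts upward, which is how you'd make the voice "ask a question".
    positions = np.arange(0, len(wave) - 1, pitch_factor)
    return np.interp(positions, np.arange(len(wave)), wave)

out = synthesize(["hh", "ah", "l", "ow"], pitch_factor=1.2)  # "hello"
wavfile.write("hello.wav", RATE, out.astype(np.int16))
```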

Here's an excellent interview with the guy who did much of the foundational
work in the field:
[http://ataripodcast.libsyn.com/antic-interview-101-forrest-m...](http://ataripodcast.libsyn.com/antic-interview-101-forrest-mozer-pioneer-in-digitized-speech)

------
melloclello
I always liked the original, Klatt-based synthetic voices that actually
sounded like vocoders. I would think we could avoid the uncanny valley by
creating synthetic voices that _deliberately_ sound like synthetic voices.

------
bitwize
I still find Watson's voice a little unsettling. It's almost like the speaker
is smiling at me all the time, like the black-suited man who recruits Deadpool
in _Deadpool_.

~~~
Zikes
As an aside, if anyone is wondering why that guy looked so familiar it might
be because he played the always-smiling alien Teb in Galaxy Quest.

Edit: Funny enough, the "uncanny valley" cadence of the Thermians' English
reminded me quite a bit of speech synthesis. I now wonder if that's where they
got their inspiration.

------
peter303
I "heard" Stephen Hawking remains with his stilted American-accented computer
voice instead of using a modern computer voice because people recognize it as
him. That old voice is hard for me to clearly understand, as I was listening
to his recent salutary message on gravitation waves.

------
optimuspaul
"Andy Aaron, an IBM Research researcher, said mispronunciation was the
“biggest problem” in preparing Watson for “Jeopardy!”"

Ironically, I'd say mispronunciation is probably a trait that would make it
seem more human.

------
peter303
The irony is that photorealistic imagery, which takes three orders of
magnitude more bandwidth than voice, has essentially been solved already,
while natural synthesized voice has not. If voice were solved, studios would
have replaced expensive voice actors in CGI movies.

~~~
Shorel
One of the reasons photorealistic imagery with badly synchronized lips works
is that the voice makes it feel authentic.

Remove the voice and we fall into the uncanny valley.

------
transfire
[https://en.wikipedia.org/wiki/Majel_Barrett](https://en.wikipedia.org/wiki/Majel_Barrett)

Done

~~~
scardine
In order to spare some of you a click: this lady was the talent behind the
voice interfaces in Star Trek.

Majel Barrett-Roddenberry (first name pronounced /ˈmeɪdʒəl/; born Majel Leigh
Hudec;[1] February 23, 1932 – December 18, 2008) was an American actress and
producer. She is best known for her role as Nurse Christine Chapel in the
original Star Trek series, Lwaxana Troi on Star Trek: The Next Generation and
Star Trek: Deep Space Nine, and for being the voice of most onboard computer
interfaces throughout the series. She was also the wife of Star Trek creator
Gene Roddenberry.

------
leklund
The Talking Moose, circa 1987, using the Macintalk voice of "Fred" -- what
more could we need?

