
Deep Learning for Siri’s Voice - subset
https://machinelearning.apple.com/2017/08/06/siri-voices.html
======
StavrosK
The iOS 11 Siri sounds like it's a real person talking, it's amazing. Does
anyone know if there's an open-source TTS library available with such quality
(or if anyone is working on one, from this paper)?

I would love to have my home speakers announce things in this voice.

~~~
knolan
She sounds younger to me, but very natural sounding.

Will be interesting to see how Siri on the HomePod works out.

~~~
ghaff
Yes. And the way it answers some questions, it also seems to be going for a
more casual, enthusiastic sort of vibe. It wasn't clear to me to what degree
the changes apply outside of the female American voice.

------
beisner
A research paper published by Apple? About Siri?! Unheard of! Last time I was
at an NLP conference with Apple employees they wouldn't say anything about how
Siri speech worked, despite being very inquisitive about everyone else's
publications. Good to see some change.

~~~
edwhitesell
It's probably safe to assume a lot of that was due to some/most of Siri being
licensed from Nuance initially. I mean, who wants to talk about a new product,
which most people think is brand new and entirely innovative, just to say "Oh
yeah, we paid someone else to work with us to create it"?

Not that there's anything wrong with that and it certainly seems like Apple
has been investing in-house pretty heavily in recent years for Siri
improvement.

~~~
blackkettle
I think it has more to do with the fact that they are finally starting to
allow their researchers to publish. They were platinum sponsors at INTERSPEECH
2017 this week, and actually published a paper there. I'm pretty sure that was
the first time _ever_ despite their recruiters showing up every year.

~~~
gok
3 papers at Interspeech 2017, actually :)

------
jchw
My favorite part is that the runtime runs on device. I moved back to Android,
but one thing Apple consistently does that I like is that they don't move
things to the internet as often as Google does. On Android, you get degraded
TTS when the internet is shoddy.

~~~
zionic
It's two different philosophies. With Apple it's about providing sufficient
value such that the consumer will pay a premium for the product. With Google
it's about providing the minimum viable value such that the user will provide
as much of their data as possible.

~~~
eridius
Apple also cares strongly about privacy, so there's a lot of stuff they refuse
to do in the cloud.

Google also cares about privacy, but only in reverse. They don't want you to
have any ;)

------
quiteawhile
I haven't been able to read the paper yet, and I know very little about this,
but listening to the audio samples it seems that one of the most notable
changes is the intonation at phrase transitions. Did anyone else catch
something like that? I'm not sure I'm explaining it well. If you listen to all
the iOS 11 samples it'll stand out.

Anyway, it's the only way I can still identify this as a fake voice. The
intonation always follows the same cadence (not sure if that's the word?). We
really shouldn't have overused the word awesome before this kind of thing came
along.

There's also a kind of dread, tbh; this kind of seamless TTS has the potential
to change a lot of things. Criminals are going to love this, YouTube
pranksters too. Eventually it will shake up the voice acting industry,
possibly in an unhealthy way for voice actors, while at the same time letting
projects with a smaller budget have incredible voice work (also dubbing).

What I think is really important, tho, is that as we move away from the
uncanny valley we change our relationship with those voices; our brains don't
have the capacity to listen to a voice this real and not imagine it as a
person, even as adults.

Ironically, at this moment I'm wearing an old Threadless sweatshirt that says
"this was supposed to be the future", but nowadays I can honestly say we're
getting there.

~~~
lawkwok
Regarding voice acting, I think there is something to be said for human
expression/ad-lib. Sure, you could generate a natural-sounding computer voice,
but in the context of the arts we're still a ways from a computer that can go
off script and add just the perfect amount of intonation on a certain word to
turn a phrase into an iconic quote.

Similarly, we don’t see CGI motion capture replacing Andy Serkis any time
soon.

~~~
Eridrus
I think this is less likely to hit major films or TV shows, but it will hit
the audiobook and video game markets pretty hard.

I'm pretty excited about the video game side.

------
coldcode
The difference between the Siri voices from iOS 9-11 is startling. I can still
hear some issues, especially at the ends of phrases, but it's extremely good.

~~~
pault
iOS 11 sounds almost as good as the WaveNet demo. Considering it runs in real
time, that's very impressive.

------
default-kramer
This just made me realize that every time you see a strong AI in fiction, it
still has a computer-sounding voice. If we ever develop strong AI, we will
probably already have perfectly natural speech synthesis. And if not, the AI
could develop it for us.

But I suppose an AI might choose to use a computer-sounding voice to remind us
that it is a computer. Kind of like those inaccurate sound effects in movies -
they have become so common that it seems more wrong to omit them. (TV Tropes
calls this "The Coconut Effect".)

~~~
banderman
I recommend watching the sci-fi film "Her"; it has a different take on this.

~~~
MBCook
That was a great movie.

There is always the chance that as we get better at this stuff we'll start to
find it creepy that it's so realistic (either due to the uncanny valley or
because we _crossed_ the valley) and we'll start to prefer devices that act
robotic even though we know we could make them indistinguishable.

I'm trying to think of another example. I know I've heard a good one with
Roombas but I can't remember it.

Basically we may try to avoid a Blade Runner situation where we're not sure
whether we're talking to a real person, and prefer the 'computery' voices.

~~~
digi_owl
Well there is always Data, and how he was made to be less human after the
researchers found Lore to be unsettling.

~~~
MBCook
Excellent example. I'd forgotten about Lore.

------
sib
The prosody and continuity of the speech are dramatically improved. This is
hard to do and very impressive (especially given that it is being done on-
device).

Personally, I'm less pleased with the actual new voice itself, although that
is more a subjective judgment. After listening to many hundreds of voice
talent auditions for Alexa, it's hard to step back from that level of
pickiness.

~~~
ghaff
As I indicated in another comment, the visual that the voice (together with
other tweaks in some of Siri's responses) suggests to me is a perky twenty-
something.

I actually tend to prefer some of the female British accents in several
current TTS systems. (Amy is probably my favorite Polly voice.) Perhaps
because I'm an American, their robotic-ness doesn't seem quite as obvious or
grating.

~~~
dkonofalski
I also prefer the female British accents, but that's exactly what excites me
and is so awesome about this. These aren't just samples being stitched
together anymore. The "learning" being done here can later be applied to any
of the voices in Apple's catalog. Once the synthesis pipeline is in place,
they'll more than likely update all the languages and intonations to match. I
would imagine the biggest hurdle is that different languages and accents have
different nuances. As with most things, they're just starting with English and
will then move everything over to all the other options, including the British
accents. I don't think we're too far off from a future where you'll be able to
pick the age, gender, and voice of your assistant the same way character
selection is done in most modern video games.

------
ucaetano
Kinda sad to see that the names of the authors are omitted, although you can
infer some of them from the quote:

> For more details on the new Siri text-to-speech system, see our published
> paper “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech
> System”

 _[9] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N.
Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R.
Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. Siri On-
Device Deep Learning-Guided Unit Selection Text-to-Speech System, Interspeech,
2017._

Why not just add the names by default?

~~~
justinjlynn
Because then it wouldn't be an Apple™ iNovation™.

~~~
justinjlynn
Let's be honest, these people's names aren't displayed prominently for
precisely the same reason early Atari game developers' names weren't.

~~~
DonaldPShimoda
I'm not familiar with this example. Could you elaborate on the Atari thing,
please?

~~~
MBCook
Atari wanted people to want Atari games, not Frank Jones games. They
considered their programmers replaceable cogs and refused to give them credit.

That's why the Easter egg in Adventure with the programmer's name exists. It
was the only way to get his name out there.

What happened was the developers didn't like this and left to start their own
company, Activision, which made some of the best remembered games on the 2600.

Apple already 'compromised' by letting their researchers publish _at all_.
Maybe names will be allowed in the future but it's kind of surprising we're
even getting this.

~~~
DonaldPShimoda
Oh I see! I didn't know about that. Thanks for the explanation!

I totally agree that this was an unprecedented move by Apple, considering
their past stance on such things. I'm hopeful for the future, though! They
seem to have realized (at least a little bit) that community cooperation is
valuable.

~~~
MBCook
The reports from a few months ago were that they basically had to do this
because no one was willing to work for them if they couldn't publish, since
their careers would basically stall.

------
pault
It might seem silly, but I'm looking forward to the first AI talk therapist.
Most of the benefit of therapy is the talking, so it's not as crazy as it
sounds.

~~~
CharlesW
> _It might seem silly, but I'm looking forward to the first AI talk
> therapist. Most of the benefit of therapy is the talking, so it's not as
> crazy as it sounds._

Not crazy at all. At least some therapies provide benefits even with simple
non-AI processes: "A meta-analysis of 15 studies, published in this month's
volume of Administration and Policy in Mental Health and Mental Health
Services Research, found no significant difference in the treatment outcomes
for patients who saw a therapist and those who followed a self-help book or
online program."[0]

[0] [https://qz.com/1057345/researchers-say-you-might-as-well-be-...](https://qz.com/1057345/researchers-say-you-might-as-well-be-your-own-therapist/)

~~~
briandear
The one grey area, though, is whether the patient actually does the self-help
program. Adherence is a big problem, meaning a "self-helper" needs to be more
disciplined because they don't have the same pressure/accountability they
might have with an actual therapist.

------
andreyk
Good blog post and audio samples notwithstanding, it's annoying that they
don't put the paper on arXiv. As they themselves point out in the blog post,
the learning architecture was introduced in 2014's "Deep mixture density
networks for acoustic modeling in statistical parametric speech synthesis", so
it's not clear how much of this is good engineering vs. novel research.

~~~
dkonofalski
The paper was more than likely embargoed until the talk they gave about it was
over. They're introducing some new things that they probably didn't want to
release details on before they publicly made a statement.

------
speakingmachine
The obvious question would be a head-to-head qualitative comparison vs.
WaveNet. It seems they have advanced Siri relative to the prior Siri, but does
this work advance the field?

~~~
dharma1
In terms of being feasible to actually use in production? Yes. It runs in real
time locally on a mobile device at 48 kHz, 16-bit. WaveNet doesn't run in real
time even on a desktop GPU at 16 kHz, 8-bit.

The WaveNet method of predicting the output sample by sample yields great
results, but at a very high computational cost.
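To make the cost gap concrete, here's a back-of-the-envelope sketch in Python. The 16 kHz rate is from the comment above; the 5 ms frame hop is an assumed typical value for frame-based synthesis, not a figure from Apple's paper:

```python
# Rough cost comparison: autoregressive (sample-by-sample) synthesis
# vs. frame-based synthesis, counted in model forward passes needed
# to produce one second of audio.

def forward_passes_per_second(rate_hz: int, samples_per_pass: int = 1) -> int:
    """Forward passes needed to generate one second of audio."""
    return rate_hz // samples_per_pass

# WaveNet-style: one forward pass per output sample at 16 kHz.
wavenet_passes = forward_passes_per_second(16_000)

# Frame-based model: one pass per 5 ms frame, i.e. 80 samples per pass
# at 16 kHz (assumed hop size for illustration).
frame_passes = forward_passes_per_second(16_000, samples_per_pass=80)

print(wavenet_passes)                 # 16000
print(frame_passes)                   # 200
print(wavenet_passes // frame_passes) # 80
```

Under these assumptions the autoregressive model needs roughly 80x more network evaluations per second of audio, which is why sample-by-sample generation struggles to hit real time even on desktop GPUs.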

------
chiph
There's no question the diction of iOS 11 is much improved. But I liked the
voice & timbre of the old speaker better - it sounded more authoritative.

~~~
TazeTSchnitzel
Yes, it's a shame they didn't hire her to do the iOS 11 voice.

------
BadassFractal
Now if only it didn't feel like when I'm asking Siri to do a task it has a
very small pool of pre-set options I get to choose from. It still feels rather
restricted, but I'm excited they're really investing into it.

------
remir
The new voice sounds a lot like Google's current TTS voice.

------
sangd
I don't like the higher pitch/sharper tone in iOS 11. I liked the warmer,
deeper tone in iOS 10. It feels like having a more mature/experienced
assistant.

------
EGreg
It's also interesting how they made the pitch higher for the new voice, like
Google has had all along.

------
satyajeet23
This is amazing, and it's so beautifully written and presented!

------
seldomrandom
Siri's voice update and no longer letting apps require "always" location
access were two of my favorite changes in iOS 11!

