
Google Cloud Text-To-Speech Powered by DeepMind WaveNet Technology - pseudobry
https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html
======
coryfklein
Google also announced today their Tacotron engine which features new prosody
modeling speech generation. It allows them to generate speech that mimics
personal intonation, accents, and rhythm, effectively mimicking an individual's
"expression" in their speech.

HN discussion here:
[https://news.ycombinator.com/item?id=16691197](https://news.ycombinator.com/item?id=16691197)

~~~
vosper
I find it amusing that here we have all the corporatey buzzwords - "DeepMind
WaveNet Technology" - but the other thing is called Tacotron.

~~~
gcb0
if they named themselves DeepMind WaveNet Blockchain, they could have acquired
Googlebet* by now.

* that's how i call google/alphabet when it doesn't matter which side of the tax-avoiding entity i am referring to.

~~~
VVyattPrentice
goophabet?

~~~
rmorey
Alphoogle™

~~~
alttab
Bless you

------
slaymaker1907
For any Google devs lurking out there, it doesn't seem to work at all in
Firefox on Windows. It looks like it has something to do with custom web
components with the following message:

ReferenceError: customElements is not defined

Also apparently some assertion errors with webcomponents (minified so line
numbers not useful).

~~~
matthewmacleod
Doesn't work in Safari. Doesn't work in Firefox. Doesn't work in Edge.

I don't want to be unreasonable, but Google used to at least generally support
the idea of the open web. There are a bunch of different UAs out there; while
I accept it's more challenging to support some than others, it doesn't seem
unreasonable to expect a product launch from a large-scale web company should
at the very least give us an error message.

The web is deteriorating right in front of us, and a big contributor to that
is Google's continued failure to realise that the web isn't all-Chrome, all
the time. The attitude displayed here—a minor thing when compared to the
overall problem—has strengthened my resolve to avoid Chrome at all costs.

~~~
halfteatree
> I don't want to be unreasonable, but Google used to at least generally
> support the idea of the open web.

This is a _Machine Learning_ product, I don't think anyone at Google, at least
as part of this team, is trying to "get you" or destroy the open web or
something. This isn't even a case of Google using something non-standard --
WebComponents is part of the standard, you can even see it in Mozilla's MDN
[0]. Firefox, Safari, Edge, et al simply haven't implemented it yet (or landed
in stable). Is that somehow also Chrome's fault?

Filing a bug report is good, but ranting on HN about how this is a sign of
Google trying to steal the open Internet is at best unnecessary, and
absolutely unreasonable.

Coincidently, I'm working on an application that uses complicated SVG with CSS
animations, and I've spent a ton of time optimizing it. I've never tested it
outside of Chrome before today. To my surprise, while everything works fast in
Chrome, in Safari it's bearable, but in FF it's simply too laggy to use. Now,
I probably won't ever get to fix the performance issues in FF and Safari,
simply because I don't have the time. Am I also out there trying to destroy
the open web? Maybe I'm just bad, not evil.

[0]: [https://developer.mozilla.org/en-US/docs/Web/Web_Components](https://developer.mozilla.org/en-US/docs/Web/Web_Components)

~~~
matthewmacleod
No, I don’t agree with you.

This is a basic audio playback widget on the web. The web is nominally an open
platform. Whichever team built this decided to build it in a way that it would
only support Google’s own web browser. It’s not unreasonable to expect a
massive web company to build cross-browser support for their user-facing
demos, _especially_ in cases where there is _obviously no reason_ that it
needs to be incompatible.

Like I said, I appreciate that building cross-browser is not always possible.
The difference between you and Google is that you aren’t one of the largest
companies in the world, and you don’t publish your own browser.

------
ollin
Does anyone have a GitHub project for epub -> mp3 using this service yet (for
automatic audiobook generation)? May make it myself if I have time but curious
if anyone already has set it up.

__EDIT__: this is almost exactly their sample application
([https://github.com/GoogleCloudPlatform/python-docs-samples/t...](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/texttospeech/cloud-client)). Was able to get it working
with epubs using pypandoc within the hour. Now just need to make it upload to
Overcast...

__EDIT 2__: Can now convert epubs directly to mp3s on Overcast. Yay!
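
For anyone attempting the same epub -> mp3 pipeline: a book is far longer than a single synthesis request is likely to accept, so the text probably has to be split first. Below is a minimal sketch of that step only, not the code from the sample application. The `chunk_text` helper and the `max_chars=5000` default are hypothetical; check the service's quota documentation for the real per-request limit.

```python
import re


def chunk_text(text, max_chars=5000):
    """Split text into pieces of at most max_chars, preferring sentence ends.

    max_chars is a hypothetical per-request limit, not a documented value.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as its own request and the resulting audio segments concatenated in order.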

~~~
technics256
Is your code on github?

~~~
ollin
the code was hacked together during a lecture so it's not very clean or
robust, but here's the gist if you're trying to build something similar:
[https://gist.github.com/madebyollin/508930c86fa12e1a70e32d91...](https://gist.github.com/madebyollin/508930c86fa12e1a70e32d91411485a8)

(overcast uploading not shown–that's a separate script using mechanize)

------
qeternity
The average English word is 4.5 characters and the average English speaker
speaks 110-150 words per minute. This means that at $16/1m characters, we can
generate speech at a cost between $28.57-39/hr. Per Google's post, WaveNet now
costs 50ms of TPU time per 1s of speech generated, meaning, at 100%
utilization, a TPU can generate somewhere between $571.40-780/hr. Google's
TPUs can be deployed (by third parties) at $6.50/hr. That's some sweet sweet
margin.

~~~
teraflop
I think your math is wrong by a factor of 60. Under your assumptions, one hour
of speech is equivalent to 30-40 thousand characters, costing between
$0.48-$0.65. That translates to revenue of $9.50-$12.96/hour per TPU.

~~~
bufferoverflow
I double checked, you are correct:

    
    
    16 * (4.5 * 110 * 60) / 1M = $0.475/hr
    16 * (4.5 * 150 * 60) / 1M = $0.648/hr
    

If you multiply by the number of 50ms in one second (20), you do get $9.5 -
$12.96
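
The same arithmetic as a short script, in case anyone wants to plug in other numbers. Every constant is an assumption taken from this thread ($16 per 1M characters, 4.5 characters per word, 110-150 words per minute, 50ms of TPU time per 1s of speech), not an official figure, and the function names are made up for illustration.

```python
# Assumptions from the thread, not official figures.
PRICE_PER_MILLION_CHARS = 16.0         # dollars, WaveNet voices
CHARS_PER_WORD = 4.5                   # average English word length
TPU_SECONDS_PER_SPEECH_SECOND = 0.050  # 50 ms of TPU time per 1 s of speech


def price_per_speech_hour(words_per_minute):
    """Dollars billed for one hour of generated speech."""
    chars_per_hour = CHARS_PER_WORD * words_per_minute * 60
    return PRICE_PER_MILLION_CHARS * chars_per_hour / 1_000_000


def revenue_per_tpu_hour(words_per_minute):
    """Dollars billed per hour of a fully utilized TPU."""
    # One TPU hour yields 1 / 0.050 = 20 hours of speech.
    speech_hours_per_tpu_hour = 1 / TPU_SECONDS_PER_SPEECH_SECOND
    return price_per_speech_hour(words_per_minute) * speech_hours_per_tpu_hour


for wpm in (110, 150):
    print(f"{wpm} wpm: ${price_per_speech_hour(wpm):.3f}/hr of speech, "
          f"${revenue_per_tpu_hour(wpm):.2f}/hr per TPU")
```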

~~~
qeternity
This is correct. Apologies!

------
StavrosK
Here's a simple Python script that will fetch some sample audio using the
request on the demo page and save it in a file:

[https://www.pastery.net/nujfhw/](https://www.pastery.net/nujfhw/)

I have no idea what the rate limits are, so please don't abuse it, I wrote it
because the demo didn't work in Firefox and I wanted to play around with it
more extensively.

~~~
mxuribe
Thanks, pretty neat script; works nicely!

------
Jakob
Having an English text but setting the language to another one like German or
French is hilarious.

You get e.g. ze Dscherman aczent or de frensch onehe.

~~~
jfno67
I decided to do the opposite so French text with the wavenet English voice,
pretty funny too.

------
tambourine_man
The US English synthesized version is truly remarkable. Borderline scarily
good.

The fact that the preview only seems to work on Chrome (and silently breaks
everywhere else) is not cool, though.

------
PostOnce
Am I wrong in thinking that the cost of generating (realistic-sounding,
learned-model) speech on commodity hardware will be near-zero soon, largely
negating the value of a SaaS?

I've been waiting a long time for decent sounding open source TTS software for
narrating books to me, and now with deep learning it's either here or very
near here, and the hardware is going to keep getting more performant at the
same price. I guess that will be very appealing to businesses relying on TTS
(e.g. call centers and phone robots and mobile apps with TTS, etc)

~~~
ariwilson
What open source TTS is as good as Google Cloud?

~~~
PostOnce
check the "kate" samples:
[https://github.com/Kyubyong/speaker_adapted_tts](https://github.com/Kyubyong/speaker_adapted_tts)

This is with 1 minute of audio and 10 minutes of training, which is crazy to
me. Maybe it's not "as good", but it's very good, and free, and it will get
better, faster, and cheaper quickly?

~~~
ghaff
Partially off-topic, but one of the things I find as a native English-speaking
American is that British female accents (probably more specifically accents
that are close to a "BBC accent") sound better to me. That's definitely true
with Polly. I don't know if it's because flaws aren't quite as obvious to me
or just that I like the accent better in general so I'm more willing to
overlook them.

------
neom
As someone who struggles greatly with the written word, I'm so thankful to
see this. For the last year or so I've poked around every few months to see if
they'd opened this up more generally. I'd be more than happy to pay $30-60/mth
(more if it had Spritz) for the ability to have high quality, high speech
speed, text to speech for my emails, documents and news articles I'd like to
consume.

------
benjismith
Interesting! I'd love to see a thorough comparison with the Amazon Polly
service...

[https://aws.amazon.com/polly/](https://aws.amazon.com/polly/)

Polly is priced at $4 per million characters and the Google WaveNet voices are
$16 (compared with the Google non-WaveNet voices, which are also $4).

After listening to a few samples from each service, the voice quality and
prosody modeling seem roughly on par between Polly and WaveNet, or at least
the differences I heard didn't seem to justify a 4x price multiplier.

But I'd love to hear an informed opinion from someone with more expertise...

~~~
jakozaur
A lot of voice generation is a cost center (call centers that are outsourced to
the cheapest location) with short sentences. I doubt the industry would pay a 4x
price multiplier for that use case.

So in fact WaveNet competes more with voiceover and new use-cases such as
voice assistants. Still I don't hear that much difference there today, but
maybe WaveNet will improve in the future to human level sooner than the other
models.

------
joelthelion
I for one welcome our new wavenet telemarketing overlords...

~~~
dakna
I set the text to

"I'm sorry Dave. I'm afraid I can't do that."

to be prepared for what's coming ...

------
WheelsAtLarge
It's very good. The voices remind me of speech from real-life people with
accents. It's good enough for voice overs where previously real-life voices
would be too expensive. I would say that it's better than Amazon's Polly when
it's used to read long passages of text.

~~~
ghaff
I don't know. They're good but they still sound robotic. For me, they work for
applications where I sort of expect/accept that I'll get computer-generated
speech anyway. But I wouldn't use them as a general substitute for a human
speaking, even someone like me who doesn't exactly have a radio voice.

~~~
remir
It's getting more and more human sounding. Take a look at this research (also
from Google): [https://research.googleblog.com/2018/03/expressive-speech-sy...](https://research.googleblog.com/2018/03/expressive-speech-synthesis-with.html)

~~~
ghaff
It absolutely is. But I'm looking at it from a perspective of whether I could
put a daily or weekly podcast out there using one of these TTS services and I
come out with a resounding no (today).

------
aviv
I have not seen any mention of licensing and whether you can cache and replay
voice responses. Amazon Polly specifically allows caching.

------
ryeguy_24
Based on the pricing of $16 per 1 million characters (roughly equal to a
400-500 page book), doesn't this severely threaten the voiceover market place?
I just priced the cost of a human voiceover on VoiceBunny.com for a 400-page
book and I got an average turnaround time of 90 days / $15K cost vs WaveNet's
$16 cost and only 30 mins of computational time. That sounds like an
interesting disruptor to me.

~~~
brad0
It does if people are willing to listen to the voice for 10 hours+.

I could listen to this voice for a while, but the voice needs more emotion in
it before it could be actually useful for long text.

~~~
dhon_
Tacotron (also by Google) looks promising in this area
[https://google.github.io/tacotron/publications/global_style_...](https://google.github.io/tacotron/publications/global_style_tokens/index.html)

------
bufferoverflow
I wish they had some beautiful voices, not some of the most generic-sounding
men and women.

~~~
jonknee
They're working on it! Check out the samples here:

[https://research.googleblog.com/2018/03/expressive-speech-sy...](https://research.googleblog.com/2018/03/expressive-speech-synthesis-with.html)

The last set is specifically interesting for your wish.

~~~
gene1974
Meryl Streep should be worried right about now! (I heard that last set of
renders... whoa!)

------
remir
Imagine teaching these voices to sing. Something like DeepMind WaveNet Song
Generator.

You upload your music to the cloud, set some parameters (genre, tempo,
emotion, etc) and a bunch of lyrics and the thing will spit out awesome vocals
for you.

~~~
severine
That's a billion dollar idea, I wonder who'll do it, maybe you!

------
ImJasonH
Quick, someone remake Translation Party using Speech-to-Text-to-Speech-to-
Text-to-Speech-ad-infinitum

[https://cloud.google.com/text-to-speech/docs/quickstart](https://cloud.google.com/text-to-speech/docs/quickstart)
[https://cloud.google.com/speech/docs/sync-recognize](https://cloud.google.com/speech/docs/sync-recognize)

~~~
ImJasonH
Never say I never gave you anything, HN:
[https://gist.github.com/ImJasonH/78c22b36944b8ec189456e67e63...](https://gist.github.com/ImJasonH/78c22b36944b8ec189456e67e63bfaa4)

------
tmalsburg2
This is great, but there remain very difficult problems to be solved. The
prosody generated by this is fairly generic and not informed by a true
understanding of the text. Consider this sentence:

I have plans to leave.

If you stress the word "plans", the sentence means that the speaker is not
necessarily intending to actually leave. However, when the stress is on
"leave", the speaker definitely intends to leave. A human reader can easily
infer the correct meaning from context but text-to-speech systems can't
because they don't have any systematic understanding of the things being
talked about and the social pragmatics of the discourse. As long as these
issues aren't solved, text-to-speech systems will make mistakes. These
mistakes will be easy to spot in some cases but can also have catastrophic
consequences in other cases: "I have plans to bomb North Korea."
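
Until a system can infer this on its own, the stress can at least be marked by hand. SSML, the W3C markup for speech synthesis, defines an `<emphasis>` element for that purpose. The sketch below only builds the two markup variants of the example sentence; the `emphasize` helper is hypothetical, and whether a given voice actually renders the emphasis as intended is an assumption, not something verified here.

```python
from xml.sax.saxutils import escape


def emphasize(sentence, stressed_word):
    """Wrap the first occurrence of stressed_word in an SSML <emphasis> tag."""
    before, found, after = sentence.partition(stressed_word)
    if not found:
        raise ValueError(f"{stressed_word!r} not in sentence")
    return (
        "<speak>"
        f"{escape(before)}"
        f'<emphasis level="strong">{escape(found)}</emphasis>'
        f"{escape(after)}"
        "</speak>"
    )


# Two readings of the same sentence, differing only in marked stress.
print(emphasize("I have plans to leave.", "plans"))
print(emphasize("I have plans to leave.", "leave"))
```

This moves the problem to whoever writes the markup, so it does not address the underlying point about understanding context.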

~~~
londons_explore
Google has solved that here:
[https://news.ycombinator.com/item?id=16692559](https://news.ycombinator.com/item?id=16692559)

~~~
tmalsburg2
This is really cool, thanks for the link, but it solves a different problem.

------
kokimame
I've been using Amazon Polly for a few months to make videos for language
learners. The English voices powered by WaveNet are slightly better than
Amazon's, but the default Japanese voice sounds far worse. Anyway, their
pricing and platform are almost the same as Amazon's, so I definitely need to
add another interface for this TTS to my app. You can listen to Amazon Polly
voices in the video I made:
[https://www.youtube.com/watch?v=ysMp0k4oR5c](https://www.youtube.com/watch?v=ysMp0k4oR5c)

------
lysp
I picked 3 random paragraphs from a random article on a local online news
site.

The voices did sound quite natural and "news-readery", however the one issue I
did find is adding a pause between words.

With the example phrase: "He bought himself a boat and then took it to his
house". You often expect a small pause after the word "boat".

I was able to manually fix it by adding some commas and full stops, however
the AI was not able to pick up those pauses naturally.

It sounded like someone was rushing through the speech instead of stopping
occasionally to "take a breath".
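
An alternative to sprinkling in extra punctuation is an explicit pause via the SSML `<break>` element, which is part of the W3C SSML standard. A minimal sketch, assuming the service accepts SSML input; the `with_pause` helper and the 300 ms default are made up for illustration, and the right duration would need tuning by ear.

```python
from xml.sax.saxutils import escape


def with_pause(before, after, pause_ms=300):
    """Join two phrases with an SSML <break> of pause_ms milliseconds."""
    return (
        f"<speak>{escape(before)}"
        f'<break time="{int(pause_ms)}ms"/>'
        f" {escape(after)}</speak>"
    )


# The example phrase, with a short pause after "boat".
print(with_pause("He bought himself a boat", "and then took it to his house."))
```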

------
coryfklein
The demo is available at [https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-to-speech/)

Requires Chrome.

------
verelo
Is it just me, or would a demo really make this posting much more interesting?

Edit: There is one, on the actual Google Cloud Text-To-Speech page, so a few
clicks in and you'll get one.

~~~
mintplant
Doesn't show up on Firefox or Edge. Only a blank space where I assume the demo
should be. Console suggests some sort of Polymer/WebComponents error.

------
StavrosK
Is there any API for generating speech that sounds like Google Now's
assistant? The quality of that is much, much better than this new service.

~~~
panarky
Yes, I use this:

[https://github.com/pndurette/gTTS](https://github.com/pndurette/gTTS)

Very simple and has the Google Assistant voice.

~~~
StavrosK
Hmm, that seems much worse than Google Assistant as well. I think my mistake
was that I had selected "Basic" instead of "WaveNet" for the voices (because
it's only available for US English). WaveNet is much better.

------
tristanj
Are there any voice samples?

~~~
joefourier
You can try it yourself here, just make sure to select English (United States)
and Voicetype: Wavenet, as the other languages are not yet using the Wavenet
system: [https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-to-speech/)

~~~
alonmower
It is fun to mismatch voices/languages to hear some hilariously stereotypical
accents

------
daoudc
I had an idea this morning for a personalised "podcast" that could read out
e.g. the weather in your area, any new and important emails, the headlines and
first paragraph of top stories from your favourite sources and notifications
from social media.

I think this is the missing thing that was needed to make this viable.

~~~
dragonwriter
> I had an idea this morning for a personalised "podcast" that could read out
> e.g. the weather in your area, any new and important emails, the headlines
> and first paragraph of top stories from your favourite sources and
> notifications from social media.

Google Assistant already has all the pieces of that (maybe not all the social
media connections one might want, I haven't looked much at that), and the
ability to string them together.

