
Amazon Polly – Lifelike Text-To-Speech - stevewilhelm
https://aws.amazon.com/polly/
======
probably_wrong
Previous discussion:
[https://news.ycombinator.com/item?id=13072944](https://news.ycombinator.com/item?id=13072944).

For a service claiming to be "lifelike", Joey has totally phrased that
question as a statement.

~~~
jicks
I agree, it doesn't sound "lifelike" at all to me. Compared to WaveNet[1] it's
day and night.

[1] [https://deepmind.com/blog/wavenet-generative-model-raw-
audio...](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

~~~
j2kun
The "babbling" samples here are really fun. Would make for a great "Sims"
language.

~~~
justusthane
I actually found that pretty unsettling to listen to, and verging on creepy. I
felt like I was listening to a computer trying to pretend to be human.

------
TazeTSchnitzel
Could we drop “Lifelike” from the title? It's marketing puffery (and untrue,
it sounded very robotic to me to the point I was convinced I'd heard the wrong
audio file). The actual story is that it's Amazon's text-to-speech-as-a-
service offering, not that it's particularly innovative in terms of sounding
good.

~~~
splike
I agree that the voice sounds pretty bad, but I think the 'lifelike' part
comes from the intelligence part, i.e. WA -> Washington, and 75F -> 75
Fahrenheit

~~~
mwcampbell
Text-to-speech systems have been expanding abbreviations for decades, and it
always backfires in some cases. For example, DECtalk would expand "Sun" to
"Sunday" regardless of context, with the humorous result that blind people
reading tech news in the 90s would often hear about Sunday Microsystems.

Edit: Yes, that example was unfair, because it's from the 90s (and actually,
DECtalk was largely unchanged since the 80s, until it was ruined in the late
90s). Here's a somewhat more recent one: With the ETI-Eloquence engine, which
was last updated in 2002, the string "2 Marketing" (which one might find, say,
in a Wikipedia table of contents) is expanded to "March twond keting".
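
A minimal sketch of why context-free expansion backfires, in Python. The abbreviation table and both function names are hypothetical and are not DECtalk's actual rules. The second function adds one possible guard (leave the token alone when the next word is capitalized), which is only a heuristic and will misfire on its own set of inputs, such as "Sun Dec 4".

```python
import re

# Hypothetical abbreviation table; real engines ship far larger ones.
ABBREVIATIONS = {"Sun": "Sunday", "Mon": "Monday", "WA": "Washington"}

# Match any table entry as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_naive(text: str) -> str:
    """Expand every abbreviation regardless of context."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def expand_guarded(text: str) -> str:
    """Skip expansion when the next word is capitalized (likely a proper name)."""
    def replace(match):
        rest = text[match.end():]
        if re.match(r"\s+[A-Z]", rest):
            return match.group(1)
        return ABBREVIATIONS[match.group(1)]
    return _PATTERN.sub(replace, text)


print(expand_naive("Sun Microsystems ships on Sun"))
print(expand_guarded("Sun Microsystems ships on Sun"))
```

The naive version turns the company name into "Sunday Microsystems"; the guarded version keeps it but still expands the trailing "Sun".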

~~~
splike
Hence the need for intelligence

~~~
revelation
Which it fails miserably at. They probably picked the one example that worked.

Someone linked another example here:

[https://soundcloud.com/zack-bloom/amazons-new-text-to-
speech...](https://soundcloud.com/zack-bloom/amazons-new-text-to-speech-api)

And it pronounces 1970s (the decade) as one-thousand-nine-hundred-seven-ths.
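
The decade case could in principle be caught by a small pre-processing rule before the text reaches the engine. A rough sketch in Python: `speak_decades` and the word tables are made up for illustration and say nothing about how Polly normalizes text internally. It only covers decades from the 1000s through the 1990s.

```python
import re

# Spoken form of the first two digits, 10 through 19.
CENTURIES = ["ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]

# Plural decade words, indexed by the tens digit of the year.
DECADES = ["hundreds", "tens", "twenties", "thirties", "forties",
           "fifties", "sixties", "seventies", "eighties", "nineties"]


def speak_decades(text: str) -> str:
    """Rewrite decades such as '1970s' as 'nineteen seventies'."""
    def replace(match):
        century = int(match.group(1))
        decade = int(match.group(2))
        return f"{CENTURIES[century - 10]} {DECADES[decade]}"
    # Two leading digits starting with 1, a tens digit, then a literal '0s'.
    return re.sub(r"\b(1\d)(\d)0s\b", replace, text)


print(speak_decades("during the 1970s"))
```

Plain years such as "1970" are left untouched, since the pattern requires the trailing "s".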

~~~
mwcampbell
One blind guy I know opined in the late 90s that it's better to let the brain
do the interpreting. But I understand that expansion of abbreviations makes
TTS more attractive for mainstream applications.

------
derefr
One thing I always still notice about these "lifelike" speech models is that
they still have random pitch variation that wouldn't be present from a real
speaker—sort of a "warbling", similar to audio heavily compressed by some
cellular realtime audio codecs.

I assume this is due to random variation in the pitches the speakers in the
training data used to say given phonemes. Would this, then, imply that speech
models for tonal languages like Mandarin don't share this problem, since
people are speaking with prescribed pitches more often?

~~~
hacker_9
Additionally when we speak we don't just 'replay' words; we add emotions and
so on. Reminds me of this video [1] analysing how the actor Anthony Hopkins
converts his lines to speech for his scenes.

[1]
[https://www.youtube.com/watch?v=4kSGkGKwp9U](https://www.youtube.com/watch?v=4kSGkGKwp9U)

~~~
angstrom
The spoken word has so many nuances. Cadence, inflexion, tone, and then the
facial performances are never captured by pure text to speech. You'd almost
have to be able to set boundaries for seriousness, playfulness, innuendo,
indifference, sarcasm etc because none of that is conveyed in the literal text
without additional descriptions or contextual analysis of the conversation.

~~~
derefr
It'd be interesting to take the Joint Many-Tasks approach here, training some
of an RNN's layers on text-to-speech, and then another set of layers on
sentiment analysis of speech-audio, where the error of the sentiment layer can
backpropagate into the TTS layer.

------
benjaminjosephw
They say they don't charge royalties for using generated audio. This
could be huge for accessibility and alternative forms of media consumption for
articles and online media. Generate once, distribute to all.

------
drchiu
The speech that's generated still sounds relatively artificial.

This is in comparison with Adobe's tech where it samples a real human voice
and allows speech to be inserted ([http://arstechnica.com/information-
technology/2016/11/adobe-...](http://arstechnica.com/information-
technology/2016/11/adobe-voco-photoshop-for-audio-speech-editing/)).

------
mgamache
There's a lot of negative reaction to Polly. Besides WaveNet (which is not
available as an API-- and really slow), what are better alternatives? Or is
the reaction just to Amazon's marketing speak using "lifelike" and "deep
learning"? I find most of the voices to be okay, but Joanna is pretty good and
better than anything I've heard besides WaveNet.

Also, with Amazon's FPGA investment how long before it implements WaveNet
using FPGAs to reduce the speech generation time to something that is
realtime?

~~~
mtgx
Why would Google use FPGAs over its own (faster) TPUs? Chances are Google
either doesn't have enough TPUs (it may be waiting on the next generation to
go massive scale with them), or Wavenet is still too computationally intensive
even for TPUs.

~~~
mgamache
I am sure Google is looking at WaveNet & TPUs, I was just thinking _Amazon_
would look at using its new massive FPGAs to implement WaveNet. I have no idea
if the FPGAs would be fast enough.

------
hacker_9
English speakers sound brain dead to be honest. This sounds no different from
pre-NN attempts, only difference this time is it's behind an Amazon paywall.

------
zackbloom
Here's a sound clip of one of the voices I pulled, if you'd like to hear it:
[https://soundcloud.com/zack-bloom/amazons-new-text-to-
speech...](https://soundcloud.com/zack-bloom/amazons-new-text-to-speech-api)

~~~
martythemaniak
The demo files in the article sound pretty decent, but this sample is pretty
bad. It sounds pretty robotic by itself (lots of artifacts, odd pitches, no
pauses), but saying "during the one thousand nine hundred and seventies" is
the real kicker.

------
tucaz
I'm a native Portuguese speaker and I can say that this is far from how people
speak Portuguese. It sounds pretty much like other TTS solutions.

However, Spanish and English are awesome and really sound like a person. This
is great!

Edit: After playing a little bit with it, it does sound better than anything I
have seen before even in Portuguese. I believe that the weirdness is just for
the standard/demo phrase. When I tried some custom phrases (curse words,
mainly) it does sound very natural.

~~~
skzo
Also the sample text in Portuguese has a bad construction

------
twoquestions
Given that this supports Speech Synthesis Markup Language [0], I wonder what
the impact will be on positions that talk for a living. While this is unlikely
to replace voice actors for games/cartoons, I wonder if public service
announcers and such will be replaced by this robot.

[0] [https://www.w3.org/TR/speech-synthesis/](https://www.w3.org/TR/speech-synthesis/)
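
To give a feel for what SSML markup looks like in an announcer-style use case, here is a small Python sketch that assembles one with the standard library. The `break`, `emphasis`, and `prosody` elements come from the W3C spec linked above; the function name and the announcement wording are invented for illustration, and whether a given engine honors each tag is an assumption to check against its own docs.

```python
import xml.etree.ElementTree as ET


def build_announcement(station: str, minutes: int) -> str:
    """Return an SSML string for a simple departure announcement."""
    speak = ET.Element("speak")
    speak.text = "Attention please. "

    # Half-second pause before the main message.
    pause = ET.SubElement(speak, "break", {"time": "500ms"})
    pause.tail = "The train to "

    # Stress the station name.
    name = ET.SubElement(speak, "emphasis", {"level": "strong"})
    name.text = station
    name.tail = " departs in "

    # Slow down for the number the listener needs to catch.
    slow = ET.SubElement(speak, "prosody", {"rate": "slow"})
    slow.text = f"{minutes} minutes"
    slow.tail = "."

    # ElementTree escapes characters such as '&' in the text for us.
    return ET.tostring(speak, encoding="unicode")


ssml = build_announcement("Gdańsk", 5)
print(ssml)
```

Building the markup through an XML library rather than string concatenation keeps user-supplied text from breaking the document.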

------
jeena
I really like text2speech because I really like the computer to read articles
to me while I'm doing something else, but I won't use any internet service for
that just to get a little bit more quality in comparison to Apple's "Alex" or
MaryTTS [http://mary.dfki.de/](http://mary.dfki.de/)

~~~
visarga
Me too. I usually play Reddit and HN on Alex voice. It's the best instantly
accessible voice on Mac, even though it is quite old.

------
sagivo
#offtopic a bit but I wonder if anyone knows a good API for the other-way
around -> speech to text

~~~
ckluis
Shoot - I'd love an API you pass in txt in one language and get multiple
languages back in txt and speech.

~~~
shiftpgdn
Amazon Lex and Google's GCE offer this.

------
chillingeffect
The thing these voices always miss is context. We need about 700 variations of
every textual sentence that all depend on the mood or goal of the conversation
and modulate pitch, volume, speed, word choice, etc. appropriately.

This experiment does cover the "generic computer voice" role quite well, but
customers don't want that on e.g. their website. I would want a "I'm a
professional designer" voice. So there will always be markets for more
appropriate voices [*]. But this one does sound very nice at what it does.

The same will happen with robots, btw. Once we get a generic humanoid robot
that can do everything, we will still seek aesthetic variations and employ
remote body actors.

------
aantix
If you want to play around with it..

[https://console.aws.amazon.com/polly/home/SynthesizeSpeech](https://console.aws.amazon.com/polly/home/SynthesizeSpeech)

------
kalleboo
Those samples are absolutely terrible.

The idea of applying machine learning to add natural emphasis to speech is a
good one, but seeing how language machine learning systems like Google
Translate fail on very basic contextual cues, I don't have much hope...
Especially when their demo samples are on par with systems from a decade ago.

I'll stick to Bruce.

------
polskibus
Congrats to the team behind the release in Gdańsk, Poland!

~~~
pawelwentpawel
That's great to hear! I'm from Gdańsk too :) Out of curiosity - any more info
about the team behind Amazon Polly?

~~~
jeffehobbs
Probably the Ivona team, no? [https://www.ivona.com/](https://www.ivona.com/)

~~~
jakozaur
Yeah IVONA. Sounds similar and after acquisition Amazon was looking for
engineers to build cloud offering :-).

~~~
wichert
Hah, that confirms my guess after noticing the voice names for Ivona and Polly
are exactly the same :)

------
pkfrank
How is directly linking a .mp3 the best way to preview their technology? You'd
think there would be a much more elegant solution to showcase Polly on their
site.

------
pantulis
A little bit on the underwhelming side, especially given the "lifelike"
moniker. But it seems pretty useful otherwise. What are the competing
services?

------
soheil
In case anyone would like to hear what it sounds like:
[http://vocaroo.com/i/s1mM3ExvMuwC](http://vocaroo.com/i/s1mM3ExvMuwC)

To play around with it:
[https://console.aws.amazon.com/polly/home/SynthesizeSpeech](https://console.aws.amazon.com/polly/home/SynthesizeSpeech)

------
avita1
Interesting to hear Danish and Icelandic. Does the language affect how
difficult it is to get realistic sounds?

------
ausjke
I assume now you can use LEX + Polly to make your own Echo.

Google, IBM and Microsoft all have invested way more in AI already and are
actually way ahead of Amazon technically. However Amazon now becomes the first
one that makes these technologies really approachable for ordinary developers.
Way to go Amazon!

------
winter_blue
Hasn't Google had a Text-to-Speech service (for e.g. the speaker button in the
Google Translate page) that's been around for aeons?

I remember it sounding pretty life-like. Don't know if they expose an API for
it, and sell it though. (It would be great if they did.)

------
bckmn
For anyone interested in using something like this to listen to articles/text,
you should check out [https://narro.co](https://narro.co)

------
xbryanx
I'm curious what some of the industrial or business use cases are for a
service like this. I'm not saying there aren't any, I just wonder who this
scratches an itch for.

------
shams93
I'm more excited about the HTML5 TTS API than these services: when it runs in
the browser, the cost is $0. The costs of these 3rd party APIs add up, and
they add latency to your app.

------
aerialcombat
I'm sorry, but lifelike? Not yet. More like better robot.

------
lottin
The female Spanish speaker has a strong Scandinavian accent.

------
dgudkov
In our team nobody speaks English as a first language. Now we have to hire
actors for English voice narration in our demo videos. If Amazon Polly gets
better, we might be able to switch to synthesized voice narration and not
deal with actors anymore.

PS. That would be one more job lost to AI.

------
xwowsersx
Does anyone know why there is a 1000-character limit?

~~~
niklasrde
It says

> The size of the input text can be up to 1500 billed characters (3000 total
> characters). SSML tags are not counted as billed characters. [1]

Not sure where you got 1000 from. As to the why - no. It doesn't seem like too
much. 2 minutes of speech maybe in there?

[1][http://docs.aws.amazon.com/polly/latest/dg/limits.html](http://docs.aws.amazon.com/polly/latest/dg/limits.html)
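
If the limit gets in the way, longer text can be split client-side and sent as several requests. A minimal sketch in Python, assuming plain text without SSML (so every character counts as billed) and taking the 1500 figure from the quote above. `split_for_tts` is a hypothetical helper, not part of any AWS SDK.

```python
import re
from typing import List

MAX_CHARS = 1500  # billed-character limit quoted above; check the current docs


def split_for_tts(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


article = "Hello world. " * 300
parts = split_for_tts(article)
print(len(parts), max(len(p) for p in parts))
```

Each chunk can then be synthesized separately and the audio concatenated; splitting at sentence ends avoids cutting a word or phrase in the middle.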

------
godmodus
Danish is good, the rest sound not so very lifelike.

~~~
kalleboo
The Danish doesn't sound very good to me, compared to Apple's voices

------
antirez
Old style TTS quality + Marketing blablaing. It's more costly since you pay
the blablaing.

