
Amazon Polly – Text to Speech in 47 Voices and 24 Languages - munchor
https://aws.amazon.com/blogs/aws/polly-text-to-speech-in-47-voices-and-24-languages/
======
jawns
> "I can't embed sound clips in this post"

You can read this as: We're pleased to announce our really cool TTS feature
that took a lot of engineering know-how and effort ... but you'll have to
click through, because we can't seem to get around the limitations of our CMS
to embed audio content in a blog post.

------
qsun
I was trying to figure out an affordable way to deliver my "Read Later"
articles as audio to a mobile device, either as a podcast or in some other
format, so I can keep up while driving to and from the office.

I realized this tool might not be cheap, since it may take a voice
actor/actress 2 hours per day to produce my content (my commute is 2 hours of
driving per day). For a familiar local accent, it costs ~$36 in Australia, and
maybe slightly less for a US accent. The value it brings me can hardly
justify the cost.

Now, with Polly, things have changed: it produces a reasonable voice, and 2
hours of content would cost only ~$0.30. I decided to launch my service as
soon as Instapaper approves my API request.

In the meantime, you can leave your email here:
[http://readlater.launchrock.co/](http://readlater.launchrock.co/)
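
A rough way to sanity-check the ~$0.30 figure above is to multiply listening
time by a speaking rate and a per-character price. The sketch below is only
illustrative: the function name is made up, and both the speaking rate (625
characters per minute) and the price ($4 per 1 million characters) are
assumptions, not numbers taken from the comment.

```python
def estimate_tts_cost(minutes, chars_per_minute, usd_per_million_chars):
    """Return the estimated cost in USD for `minutes` of synthesized speech."""
    total_chars = minutes * chars_per_minute
    return total_chars / 1_000_000 * usd_per_million_chars


# Assumed inputs (not from the comment): 625 chars/min, $4 per 1M characters.
cost = estimate_tts_cost(minutes=120, chars_per_minute=625,
                         usd_per_million_chars=4.0)
print(f"~${cost:.2f} for a 2-hour commute")
```

The estimate scales linearly, so a faster speaking rate or a higher price
raises the figure proportionally.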

~~~
bckmn
I've built this here: [https://narro.co](https://narro.co) (I haven't built
in the Polly voices yet; it currently uses Ivona, Polly's precursor.)

~~~
bckmn
We just released options for all the available Polly voices this morning!

------
cypher543
One thing I wish more services like this offered is non-speech sounds.
CereVoice, for example, lets you insert laughs, coughs, sighs, etc and it can
really enhance the output in some cases. Google's WaveNet also manages to
simulate the catching of one's breath during particularly long utterances,
although I realize it uses a completely different technique (neural net vs.
concatenative synthesis).

My biggest problem with CereVoice, though, has been its terrible web API. It
doesn't support streaming output, so it renders the audio to an Amazon S3
bucket and then returns a URL, which is pretty inconvenient (and slow). You
have to do the same for transcripts, too. So, if you want everything, you have
to make 3 separate HTTP requests and parse 2 XML documents for one round of
synthesis.

IBM Watson's TTS API gets it right, imo. Its streaming mode returns audio
frames and transcripts over a WebSocket connection.

------
olegkikin
After that WaveNet speech synthesis demo, none of these sound even remotely
good.

[https://deepmind.com/blog/wavenet-generative-model-raw-audio/](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

~~~
Cerium
Different optimizations. WaveNet (depending on hardware and configuration)
takes hours to tens of hours per second of audio, but sounds really good.
Polly can create audio faster than you can listen to it.

------
primitivesuave
Wish we had seen this a few days ago before dropping funds on a human to
record for us. I played with several different voices and ran it on a text
corpus that we gave to the human, and in some cases I would say this even
sounds better.

Computer generated voices feel most robotic when their intonation of a word is
abnormal or their pauses between words make the sentence feel choppy. The
intonation and natural pauses between words are very good for all of the main
voices.

The Japanese voice Mizuki was the most comical addition, since I can't think
of a real situation where she would ever actually be used. Mizuki speaks
Engarish (the Japanese version of English) beautifully, but any Japanese
person who can understand Engarish will also understand English. Also, Mizuki
doesn't add the correct vowel ending to all words, e.g. she correctly says
"cheezu" for "cheese", but says "steku" instead of "steki" for "steak".

~~~
aninhumer
>any Japanese person who can understand Engarish will also understand English.

My impression is that it's actually quite common for Japanese people to
understand Japanese accented English more easily than a native English
speaker.

>she correctly says "cheezu" for "cheese", but says "steku" instead of "steki"
>for "steak".

"Suteeku" sounds like what a Japanese person who knows English well but has a
strong accent might say. "Suteeki" is more of a corrupted loanword.

~~~
primitivesuave
> My impression is that it's actually quite common for Japanese people to
> understand Japanese accented English more easily than a native English
> speaker.

This is actually very true. It made me think back to many situations in Japan
where adding a strong Japanese accent to my English words made them
comprehensible to the listener (just as adding a Japanese accent to the
generated speech makes it nearly incomprehensible to a non-Japanese speaker).

------
tmountain
Audio examples are available here:
[https://aws.amazon.com/polly/](https://aws.amazon.com/polly/)

~~~
Mizza
They don't sound great to me. I was hoping to build a service around this, but
hearing the quality there.. meh. Cheaper to use `say`.

It'd be tolerable to hear a voice-interface with this, but it'd be maddening
to try to listen to a book this way.

~~~
beagle3
Check the license on 'say' before you use it commercially. There are
restrictions.

~~~
alblue
Do you have a reference for the restrictions on what is produced by "say"? Just
because a tool may be GPL does not mean that its product need be (cf. gcc)

~~~
beagle3
From Page 2 of
[http://images.apple.com/legal/sla/docs/macOS1012.pdf](http://images.apple.com/legal/sla/docs/macOS1012.pdf)
:

> F. Voices. Subject to the terms and conditions of this License, you may use
> the system voices included in the Apple Software (“System Voices”) (i) while
> running the Apple Software and (ii) to create your own original content and
> projects for your personal, non-commercial use. No other use of the System
> Voices is permitted by this License, including but not limited to the use,
> reproduction, display, performance, recording, publishing or redistribution
> of any of the System Voices in a profit, non-profit, public sharing or
> commercial context.

Which truly sucks, because they don't even give you the option to pay for such
use.

------
caio1982
I tried a few random sentences and some article paragraphs with both Vitória
and Ricardo (Brazilian Portuguese voices) and Ricardo did pretty well. I was
impressed, really. Vitória on the other hand was not much better (as in
"fluent", with rhythm and right intonation) than other available female voices
out there for pt_BR.

EDIT: oh, I had no idea they have used Ivona

------
devoply
Did they use Ivona for this? Probably. Ivona Amy is awesome!

~~~
jakozaur
Almost sure that they used Ivona. Even the name suggests the location of
their office: Polly -> Poland.

~~~
typicalrunt
That'd be quite witty of them. Polly would normally map to being a parrot,
repeating whatever was told to it.

------
archagon
This sounds better than OSX text-to-speech for audiobook purposes, but the
1500 character limit per API call is annoying. Instead of sending the ebook
text in full, I have to split by paragraph and (occasionally) sentence and
then stitch everything back together with manually inserted pauses, making the
audio a bit uneven.
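
The splitting step described above can be sketched roughly as follows. This is
only an illustration of the paragraph-then-sentence approach, not the
commenter's actual code: `split_for_tts` is a hypothetical helper, the sentence
detection is a naive regex, and the 1500-character limit is simply the figure
quoted in the comment. It does not call any Polly API.

```python
import re

MAX_CHARS = 1500  # per-request limit quoted in the comment above


def split_for_tts(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Paragraphs are kept whole when they fit; longer paragraphs are packed
    sentence by sentence, and oversized sentences are cut at the limit.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= limit:
            chunks.append(paragraph)
            continue
        current = ""
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            # Hard-split any single sentence that is itself over the limit.
            for i in range(0, len(sentence), limit):
                piece = sentence[i:i + limit]
                candidate = piece if not current else current + " " + piece
                if len(candidate) <= limit:
                    current = candidate
                else:
                    chunks.append(current)
                    current = piece
        if current:
            chunks.append(current)
    return chunks
```

Each returned chunk could then be sent as one synthesis request, with the
resulting audio segments concatenated in order.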

------
anotheryou
I wonder why they pulled the Ivona text-to-speech Android app. I'm still
happily using it. It's quite comfortable to listen to articles in Pocket while
on the train (and to keep listening while changing trains).

edit: ah, it was just the German version that is no longer available; the
English one still seems to be in the store:
[https://play.google.com/store/apps/details?id=com.ivona.tts](https://play.google.com/store/apps/details?id=com.ivona.tts)

~~~
nitrogen
In the US I get "Sorry! This content is not available in your country yet."
Interesting.

------
webwanderings
Does anyone know if NPR recently used this (or something similar) on a story?
I recall listening to a story in my car (hardly paying attention) and the
person narrating it sounded unusual. I thought it was probably computer
generated but I couldn't tell for sure. I guess the general public had better
get ready to tell the difference?

------
cyberferret
Just me, or do the foreign voices sound much more realistic than the
English-speaking ones do? It could be that I am not a native speaker of
Icelandic, French etc., so to a native speaker they may still seem robotic
and sterile, but to me the inflections and cadence sound much more natural in
the non-English synthetic speech.

~~~
ghaff
The French sounds a bit flat to me but I could easily mistake it for a
somewhat bored human reading in mostly a monotone. But I certainly don't know
French well enough to be sensitive to unnatural pauses or odd inflections.

It really is quite good, even if I wouldn't want to listen to an entire book
read this way and wouldn't mistake it for a human. It definitely gets me thinking
about ways to use this service.

------
Hansi
Surprised that Icelandic is being offered. As a native Icelandic speaker, to
me it sounds about as good as can be expected with such a service. Isochrony
(had to search the dictionary for that one) is a bit off but expected based on
the context of phrases / words used to create the samples.

------
jpdlla
I wonder if they support speech marks like Ivona, for allowing synchronization
of text with audio, useful for text highlighting.

------
cypher543
The link to the Polly service console is broken. Polly doesn't even show up in
the list of services on my AWS account.

~~~
deanclatworthy
Same. I'm sure this will roll out to us all in the coming hours.

------
mistermann
Hope someone can post an audio clip for those of us lacking an account,
curious to see how this sounds!

~~~
nieksand
Scroll down a bit here and you'll find ten different example voices:
[https://aws.amazon.com/polly/](https://aws.amazon.com/polly/)

English male sounds surprisingly bad. English female is better.

~~~
cypher543
That's pretty common among TTS services and engines. Most people want a female
voice for their personal assistant/ebook reader/whatever, so the male voices
don't seem to get as much tuning. As someone who happens to prefer a male
voice on my apps, it's a little frustrating.

~~~
ghaff
I don't know how much is tuning and how much is personal preference but I
plugged in some different texts and I'd probably default to Amy (English,
British).

------
canthonytucci
I need the inverse of this. Is anyone selling that?

~~~
asamarin
I don't know about Amazon, but Google does have Cloud Speech API; seems to be
what you're looking for:

[https://cloud.google.com/speech/](https://cloud.google.com/speech/)

------
mrfusion
Does AWS offer a speech recognition service too?

~~~
lljk_kennedy
[https://aws.amazon.com/lex/](https://aws.amazon.com/lex/)

------
aecing
Seems like a great opportunity for a tool to aid in learning pronunciation of
foreign languages.

------
rjbwork
There's already a well known .NET library called Polly that's under the
purview of the .NET Foundation. See
[http://www.thepollyproject.org/](http://www.thepollyproject.org/)

This could get a bit confusing for some folks.

