
Ask HN: Are any startups working on text-to-speech? - bossx
It seems like TTS technology hasn't evolved much over the years and I was wondering if any startups are working on making it sound more realistic?
======
ivan_ah
Have you tried the TTS in Mac OS X? You can run it on the command line using:

    say 'this is a test'

I for one think it's very good quality (at least for the Alex voice)
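
If you want to call it from a script, here is a minimal sketch, assuming macOS with the `say` binary on the PATH. The `-v` flag selects a voice; the helper names `build_say_command` and `speak` are made up for illustration.

```python
import shutil
import subprocess


def build_say_command(text, voice=None):
    """Build the argument list for the macOS `say` command."""
    cmd = ["say"]
    if voice:
        # -v picks a voice by name, e.g. "Alex"
        cmd += ["-v", voice]
    cmd.append(text)
    return cmd


def speak(text, voice=None):
    """Speak `text` if `say` is available; return False otherwise."""
    if shutil.which("say") is None:
        # Not on macOS (or `say` is missing), so do nothing.
        return False
    subprocess.run(build_say_command(text, voice), check=True)
    return True
```

Calling `speak("this is a test", voice="Alex")` should read the sentence aloud on a Mac and simply return `False` elsewhere.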

~~~
flippant
I have the crontab set up to `say 'ay ay. ay ay. smoke weed. every day.'` every
day at 4:20.

I'm not even a stoner and folks seem to get a kick out of it whenever they're
pair programming with me.

------
compumike
What do you see as the business use case for more realistic text-to-speech?

We use TTS extensively within the Pantelligent iOS and Android apps, and it's
something our users requested and get a lot of value out of. It seems like the
existing solutions are already good enough / dramatically above the threshold
of usefulness for an interactive real-time-guidance application like ours, and
just keep getting better from the OS side.

~~~
jrcii
It seems that IVR phone systems could benefit. Maybe the state-of-the-art just
hasn't had the chance to propagate, but it always strikes me as inelegant when
I hear, "Your account has a balance of... ONE, HUN-DRED, DOLLARS... AND...
NINETY, FIVE, CENTS. To pay this bill,"
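
One way around the spliced-prompt sound, sketched under assumptions: normalize the amount into a single sentence and hand the whole string to a TTS engine, rather than concatenating separately recorded number prompts. The function names below are hypothetical and the range is capped at 999,999 to keep it short.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def words_below_1000(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head + (" " + words_below_1000(rest) if rest else "")


def number_to_words(n):
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("supported range is 0..999999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return words_below_1000(rest)
    head = words_below_1000(thousands) + " thousand"
    return head + (" " + words_below_1000(rest) if rest else "")


def balance_to_speech(total_cents):
    """Turn an amount in cents into one sentence fragment for a TTS engine."""
    dollars, cents = divmod(total_cents, 100)
    dollar_unit = "dollar" if dollars == 1 else "dollars"
    cent_unit = "cent" if cents == 1 else "cents"
    return (f"{number_to_words(dollars)} {dollar_unit} and "
            f"{number_to_words(cents)} {cent_unit}")


print(balance_to_speech(10095))  # one hundred dollars and ninety-five cents
```

The point of the design is that the engine sees the full phrase at once, so it can pick intonation for the sentence as a whole.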

~~~
joegreen
I've just tried playing "Your account has a balance of 100 dollars and 95
cents" on [https://www.ivona.com/us/](https://www.ivona.com/us/) and it seems
quite natural to me (but I'm not a native English speaker so I may not be
sensitive to it).

~~~
ctomaybe
Just spent more time than I would like to admit making those voices say all
manner of inappropriate things and giggling to myself.

~~~
praccu
Did you try the chipmunk voice?

------
lutusp
> It seems like TTS technology hasn't evolved much over the years and I was
> wondering if any startups are working on making it sound more realistic?

It's not true that TTS hasn't improved (see below). Many people are working on
this, both in academia and in private enterprise. It's an obvious and
potentially valuable part of the human-computer interface.

This is not to suggest that it's easy -- the mathematics and vocal tract
modeling problems are formidable. The only reason there are reasonable TTS
resources now is because of the rapid increase in computer power -- power
that's needed to support this feature.

Here's a site chosen at random that offers a high-quality TTS example:

[https://www.ivona.com/](https://www.ivona.com/)

It's pretty good based on prevailing standards, and it's the outcome of a lot
of work.

To find the companies working on this, just Google for "high-quality tts".

------
insoluble
I've thought about getting into this area myself, but I was too afraid there
was not enough market for it. This was several years ago. As far as CPU power,
today's average PC is easily 5x what's necessary for perfect speech. The real
question is the algorithms being used. Perhaps the fear is that the algorithm
would be pirated. I mean just look at the history of digital audio (or video)
and encoding, with things like Xvid and Ogg. Basically every time a good
algorithm even starts to gain traction, an open "alternative" is made
available practically overnight. This is not to say I don't like open
algorithms. In fact, I believe that any standard algorithm should be open.
This fact, however, is enough to deter research in this field. Perhaps a Web
service that converted text to speech would be one option, but it would have
limited applicability.

Edit: Perhaps a Kickstarter or similar would be a good idea since this type of
feature would be useful to so many people. Nearly everyone has functioning
ears. (no offence to those who don't)

------
praccu
Ivona, acquired by Amazon, is a hugely notable example. They're great folks,
and do great work. Alexa's voice was done by them, but an early customer of
theirs was the Polish public transit system.

A lot of really great work is happening in academia; I'm not going to name
names because I'd forget someone deserving.

(Shameless plug: we [0] do speech and language consulting including custom
TTS.)

[0] cobaltspeech.com

~~~
bossx
Please name names in academia. I would really like to explore the research and
advancements being made in TTS.

~~~
praccu
aah. Okay. Disclaimer, [b] this isn't my area [/b] and I'm going to miss some
folks. I apologize to all of the important and interesting work that I'm
missing.

Also, I'm really linking to research groups, so take these names as starting
points and look at their students and other professors working with them.

First off, the Blizzard challenge is a major hub of activity. [4] Festival is
an important piece of software [5]. Interspeech is an important conference [6]
(take a look at the speech synthesis track and the organizers for that).

Alan Black [0] @ CMU is kinda a giant.

Keiichi Tokuda [3] @ Nagoya also a giant.

Simon King [1] @ Edinburgh, just had a paper linked to on HN a few days ago
and does important work.

Mark Gales [2] @ Cambridge does work here, too.

[0] [https://www.cs.cmu.edu/~awb/](https://www.cs.cmu.edu/~awb/)

[1] [http://www.cstr.ed.ac.uk/ssi/people/simonk.html](http://www.cstr.ed.ac.uk/ssi/people/simonk.html)

[2] [http://mi.eng.cam.ac.uk/~mjfg/](http://mi.eng.cam.ac.uk/~mjfg/)

[3] [http://www.sp.nitech.ac.jp/~tokuda/](http://www.sp.nitech.ac.jp/~tokuda/)

[4] [http://festvox.org/blizzard/](http://festvox.org/blizzard/)

[5] [http://www.cstr.ed.ac.uk/projects/festival/](http://www.cstr.ed.ac.uk/projects/festival/)

[6] [http://interspeech2015.org/wp-content/uploads/direct/INTERSP...](http://interspeech2015.org/wp-content/uploads/direct/INTERSPEECH_2015_AbstractBook.pdf)

------
lorenzorhoades
An ML-based app to read articles to me in the morning on my way to work may
have some commercial success. The current ones have a hard time reading
through an entire article without sounding like a robot, or reading a
headline as if it were part of the previous sentence.

------
tkjef
I made this app called Ultimate Alerts that allows for all email and text
messages to be read over TTS when your car goes over 10mph for over 1 minute.

It switches back to normal settings when you go below 10mph for over 1 minute.
Helpful for automatically switching everything to TTS when you're driving.
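
For anyone curious how that kind of switching might look, here is a rough sketch of the rule as described (over 10 mph for over a minute turns TTS on, under it for over a minute turns it off). This is not the app's actual code; the class name, fields, and the idea of feeding timestamps in by hand are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrivingModeSwitch:
    threshold_mph: float = 10.0
    hold_seconds: float = 60.0
    tts_on: bool = False
    # Time at which the speed first moved to the "other side" of the threshold.
    _pending_since: Optional[float] = None

    def update(self, speed_mph: float, now: float) -> bool:
        """Feed one speed sample (with its timestamp in seconds); return TTS state."""
        wants_on = speed_mph > self.threshold_mph
        if wants_on == self.tts_on:
            # Speed agrees with the current mode, so cancel any pending switch.
            self._pending_since = None
        elif self._pending_since is None:
            # First sample that disagrees: start the hold timer.
            self._pending_since = now
        elif now - self._pending_since > self.hold_seconds:
            # Disagreement has lasted longer than the hold time: switch modes.
            self.tts_on = wants_on
            self._pending_since = None
        return self.tts_on
```

The hold timer is what keeps the mode from flapping at stop lights: a brief dip under the threshold resets as soon as the speed recovers.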

Lots of other functionality as well. Check it out:
[https://play.google.com/store/apps/details?id=com.org.imsono...](https://play.google.com/store/apps/details?id=com.org.imsono.emailnew&hl=en)

------
jacquesm
[https://news.ycombinator.com/item?id=10925826](https://news.ycombinator.com/item?id=10925826)

May be of interest to you.

------
tomasien
FWIW I think TTS is a bad interface. That said, IBM Watson is getting pretttty
good. Check it out, they're willing to work with startups too.

~~~
joefarish
Do you mean the current execution is bad or that it is just a bad concept? I
can think of plenty of times where TTS is the perfect interface for the
problem at hand. For example, a Sat Nav giving turn by turn directions.

~~~
tomasien
That is true - all voice related interfaces are best exemplified with travel.
STT for example is best when you're alone and can't use your hands aka
driving. TTS is best for the same set of situations, so listening to articles
on the train or getting directions read back.

In general, TTS is a better interface than STT, both are (in my very, very
humble opinion) bad interfaces.

------
pshapiro99
I've also noticed this stagnation. The quality of the spoken voice TTS sound
depends upon two things, I've heard -- processor speed and memory (RAM).
Processor speed has increased dramatically in the past few years. I wish
someone would design TTS that only works on the fastest processors. There
seems to be too much lowest-common-denominator going on in this field.

~~~
jerf
I was under the impression the problem the field currently has is that
improving the diction rapidly becomes AI-complete. Voice synthesis is either a
solved problem, or one that could be solved with just a bit more effort, but
the real problem is what do you _feed_ that perfect voice model?

Read your comment or my comment aloud to yourself. Now feed it to your choice
of TTS engine. The problem isn't the words, the problem is the lack of
comprehension.

------
yrezgui
Have a look at the Watson Text to Speech API:
[http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/text-to-speech.html)

------
AndrewMBliss
This is not really "text to speech" but a voice search engine. It is called
Mobvoi and Google has invested in it.
[http://chumenwenwen.com/global.html](http://chumenwenwen.com/global.html)

------
andrewbarba
I've messed around with Api.ai
([https://docs.api.ai/docs/tts](https://docs.api.ai/docs/tts)) and it is quite
impressive. Full-featured, amazing pricing (free), and goes way beyond just
TTS.

------
lazyjones
OSX's "say" is amazing (judging from their English and German voices). Do you
believe there's enough room for improvement to build a business case on it,
even though some companies have worked on the problem for decades?

------
infocollector
Does anyone know of a good one that will work without an internet connection on
Ubuntu 14.04?

------
Animats
Try Vocaloid.[1] It's overkill for plain text to speech, but quite good.
Better for Japanese than for English.

[1] [http://www.vocaloid.com/en/](http://www.vocaloid.com/en/)

------
bckmn
I'm building a web-to-speech(-to-podcast) app that's gaining traction.
Adding new voices and integrations all the time.
[https://narro.co](https://narro.co)

------
elchudi2
[http://www.mivoq.it/en/](http://www.mivoq.it/en/) is trying to advance TTS
technology

------
jerelunruh
I recently ran into this for the browser:
[http://responsivevoice.org/](http://responsivevoice.org/)

------
hobonumber1
Try Houndify from SoundHound! (www.houndify.com).

~~~
KevinBongart
This website could use some more demos!

------
justincormack
Apple bought Cambridge UK startup VocalIQ that does this (in part, they also
did recognition work) a year or so ago.

------
npalli
Here are three vendors that provide good TTS APIs. Have you evaluated their
performance? What did you find lacking?

1\. Nuance

2\. AT&T

3\. IBM Watson

~~~
frik
Basically there is Nuance, which is the leader. All others either use older
open source projects underneath, legacy software or just license from Nuance.
If you can name another one, please do.

~~~
calmhead
Ivona, [https://www.ivona.com/](https://www.ivona.com/) the Amazon company
responsible for Echo's Alexa voice also has an excellent TTS system.

------
voiceclonr
Shameless plug. Give www.voiceclonr.com a try (something I built a while back)

------
mdasen
TTS has definitely evolved over the years. If you compare Google Maps voice to
Stephen Hawking, it's night and day.

However, I can definitely understand how TTS technology looks stagnant. Part
of this is that going from nothing to something reasonable happened
exceedingly quickly. Early TTS research was supported by the US government
which saw that early systems were comprehensible (if not wonderful sounding)
and declared victory. Funding went to other problems in computational
linguistics (like speech recognition, information extraction, etc.) and so did
a lot of the workforce.

Modern systems usually involve many hours of speech from a single person and
use variable length units to form more natural speech. Many systems still
sound pasted together because that's how enterprise technology goes. How many
banks have online banking that seems like it's from the 90s? You can't compare
what systems can do to what some call center has installed as its technology.
Someone has linked to Ivona. Google and Nuance have good systems as well, but
there's a balance between resources and perfect speech.

In terms of some of the issues. . . When you're going through a finite amount
of recorded speech, you have to choose something that fits. It isn't going to
be perfect in many cases. You have to deal with things like F0 declination.
You have to deal with how long phonemes are going to be. You have to deal with
breaks in utterances.
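
To make the "choose something that fits" part concrete, here is a toy sketch of unit selection, assuming each recorded unit is described only by its pitch in Hz. Real systems score many more features; the cost functions, names, and numbers here are invented for illustration. It picks one candidate per slot so that the sum of target cost (distance from the desired pitch) and join cost (pitch jump between neighbours) is minimal, using dynamic programming.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one unit per slot, minimizing target cost plus join cost.

    targets:    desired pitch in Hz for each slot
    candidates: per slot, a list of (unit_id, pitch_hz) recorded units
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")

    # table[i][j] = (best total cost ending in candidate j of slot i, backpointer)
    table = [[(abs(pitch - targets[0]), None) for _, pitch in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for _, pitch in candidates[i]:
            target_cost = abs(pitch - targets[i])
            # Cheapest way to arrive here from any candidate in the previous slot.
            best_cost, best_k = min(
                (table[i - 1][k][0]
                 + join_weight * abs(pitch - candidates[i - 1][k][1]), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((best_cost + target_cost, best_k))
        table.append(row)

    # Backtrack from the cheapest candidate in the final slot.
    j = min(range(len(table[-1])), key=lambda k: table[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = table[i][j][1]
    path.reverse()
    return path


# Targets fall over the utterance, a crude stand-in for F0 declination.
targets = [120, 110, 100]
candidates = [
    [("a1", 121), ("a2", 150)],
    [("b1", 140), ("b2", 112)],
    [("c1", 101), ("c2", 90)],
]
print(select_units(targets, candidates))  # ['a1', 'b2', 'c1']
```

With a finite inventory, the best available path can still be a poor fit, which is one reason the result can sound pasted together.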

And the fact is that we can understand Hawking's 1980s TTS.

If you want to start thinking about the problem more, try inputting these
sentences into Ivona: "Do you really want to see all of it? Do you want to see
all of it? I want to see all of it." Notice how it tries to rise around
"really" in the first sentence. It's trying to match how we would speak -
rising for "really" in the first sentence and rising for the question-ending
in both questions. But it kinda misses in both cases. Still, in some ways it's
amazing that it recognizes "really" as something that should go up. It
recognizes that questions go up at the end. It recognizes how the non-question
goes down as the sentence progresses. And it finds things within its data set
to fit to how it thinks the sentence is going to go. But it doesn't have
perfect language understanding so it doesn't know exactly how things would be
said - a lot of sounding natural isn't making the phonemes more accurately,
but the intonation and attitude of the speech. It also has to find something
that fits. Lots of smart things are done, but it's pulling from a limited
amount of recorded speech - speech that has been sliced in many useful ways,
but still limited.

TTS has definitely evolved and I think that Google, Nuance, and others are
definitely pushing it forward. You're going to interact with a lot of legacy
systems that feel like they're still in the Hawking era. But most ATMs I use
don't even have touch screens (opting for buttons on the side of the screen)
and even fancy ATMs like Wells Fargo don't feel like an iPad. You don't want
to compare to systems that are so far away from modern, commercially available
systems.

There is definitely work being done on it and it's definitely become much
better. But to an extent, it isn't something that a lot of companies are going
to work on. How big is the market for TTS? Before you say, "it's useful in
loads of things," think about the market for maps. A lot of it is Google or
Apple Maps. Loads of apps integrate mapping, but don't want to map the world
or run their own infrastructure for serving it. Some use OpenStreetMap, but
they're really just serving generated tiles rather than re-mapping the world.
If you were to create a TTS startup, what would your business model be? Pay us
money to TTS your text rather than getting it for free from Google's Android
TTS? The issue is that TTS is more a feature than a product. Companies like
Amazon might bring a company like Ivona in-house. Nuance sells their stuff to
companies like Apple. Google is large enough to build it. But you'd be
pitching something that doesn't directly solve something for customers and
trying to hire very smart (expensive) people who you hope will be able to come
up with something new that isn't an obvious "with more resources, better"
solution. Remember, TTS needs to be done in real-time, possibly on low-powered
mobile devices (don't eat their battery or storage) or over the network (don't
make our AWS spend go through the roof). If you're going to sell to an app
maker as a no-network TTS, how much bloat are you adding to the app?

It's just a hard market to be in given that the 1980s solution works, even if
it doesn't sound realistic, and modern, already-available systems are quite
reasonable.

~~~
mchahn
> If you compare Google Maps voice to Stephen Hawking,

Has anyone else wondered why Stephen hasn't upgraded his voice? Maybe it is
his signature of sorts.

~~~
razakel
> Has anyone else wondered why Stephen hasn't upgraded his voice? Maybe it is
> his signature of sorts.

That's exactly what he's stated publicly - it's so widely recognised it's part
of his identity.

