
Ask HN: Need for a human-powered text-to-speech API? - leahcim
We are working with a network of US-based people who are good on the phone.

Is anyone interested in an API that would accept text as input and return an
MP3 of someone reading the text within a couple of hours? We have a couple of
US-based people who could do the job really well in a couple of minutes.

Command: POST /tts { text: "Hello John. Thanks for joining us today.", voice: "female", webhook: "../webhook/response" }

Webhook response (a few minutes later): POST /webhook/response { file: "voice.mp3", cost: 0.07 }

Cost would be something like $1 per 100 words.
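
To make the proposed request and webhook shapes concrete, here is a minimal Python sketch of the two payloads. It only builds the JSON bodies and makes no network calls. The field names come from the post, except that the webhook key is written as one word, which is an assumption. The helper names and the pricing rule (whitespace-separated word count times $0.01) are also assumptions, not a confirmed part of the service.

```python
import json

# "$1 per 100 words" from the post, expressed per word.
PRICE_PER_WORD = 1.00 / 100


def build_tts_request(text, voice, webhook):
    """Build the JSON body for the proposed POST /tts call.

    Hypothetical helper: the key "webhook" is assumed to be a single word.
    """
    return json.dumps({"text": text, "voice": voice, "webhook": webhook})


def estimate_cost(text):
    """Assumed pricing rule: number of whitespace-separated words times price."""
    return round(len(text.split()) * PRICE_PER_WORD, 2)


def build_webhook_response(text, file_name):
    """Build the JSON body the service would POST back to the webhook."""
    return json.dumps({"file": file_name, "cost": estimate_cost(text)})


text = "Hello John. Thanks for joining us today."
print(build_tts_request(text, "female", "../webhook/response"))
print(build_webhook_response(text, "voice.mp3"))
```

Under that assumed rule, the seven-word example sentence prices out at 0.07, which matches the cost shown in the webhook response above.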
======
dragonwriter
> Cost would be something like $1 per 100 words.

A quick googling suggests that voice acting rates (pay to the voice actor
alone) tend to be in the range of $1/second for short, small-market bits
(short bits with larger markets tend to have higher use fees on top). So it
sounds like, to leave any room for profit, this service relies on finding
people willing to work on demand for about 1/100 of market rates _and_ with a
much faster turnaround than is typical.

Sure, if you’ve got quality voice talent there's a huge demand for that. OTOH,
if you don't have quality voice talent, why would people pay for this instead
of today's commercially available machine TTS, which is _much_ lower latency
and _much_ cheaper (e.g., Google with their premium WaveNet voices at
$16/million characters, or something on the order of $1/8000 words.)
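
As a rough check on these figures, the three prices can be put on a common per-word basis. This is only a sketch: the speaking rate of 150 words per minute and the average of 6 characters per word (including the trailing space) are assumptions, and the results shift noticeably if those are changed. The function names are illustrative.

```python
# Assumptions, not from the thread:
WORDS_PER_MINUTE = 150   # assumed typical read-aloud pace
CHARS_PER_WORD = 6       # assumed average word length, counting one space


def proposed_cost(words):
    """Proposed service: $1 per 100 words."""
    return words * 1.00 / 100


def voice_actor_cost(words):
    """Voice actor pay at roughly $1 per second of finished audio."""
    seconds = words / (WORDS_PER_MINUTE / 60)
    return seconds * 1.00


def wavenet_cost(words):
    """Machine TTS at $16 per million characters."""
    return words * CHARS_PER_WORD * 16.00 / 1_000_000


for label, fn in [
    ("proposed service", proposed_cost),
    ("voice actor", voice_actor_cost),
    ("WaveNet TTS", wavenet_cost),
]:
    print(f"{label}: ${fn(100):.4f} per 100 words")
```

With these assumptions the human market rate comes out tens of times above the proposed price and the machine TTS price thousands of times below it, the same orders of magnitude as the estimates above.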

------
eindiran
I'd wager that a latency of a couple hours is unacceptable for almost all TTS
use cases.

Moreover, the current generation of TTS is pretty good, and a lot of research
is going into improving it. You'd have a very limited window to build your
service and get users before the big players' TTS catches up, without the
enormous latency or the cost of paying human wages.

------
WestCoastJustin
Both Google and AWS have these APIs for pennies per minute. This market is
going to be absolutely commoditized in no time. You thinking of using one of
these APIs and slapping a front end on it?

~~~
leahcim
Absolutely agree, it's a super crowded space. The question is: are you happy
with the existing TTS APIs? They sound so robotic.

~~~
WestCoastJustin
Honestly, unless you have some crazy tech there is no way you can compete
with Google and AWS in this space. The Google API does this in real time too
(think of what is backing Google Home, etc). The new DeepMind WaveNet tech is
getting way better at sounding natural [1]. I think your only option would be
to use these APIs, slap a front end on them, and try to undercut everyone in
the market (and quickly). But it is a race to the bottom, and you likely have
a brief window to make some real money. Plus, this is typically a one-time
purchase for most folks and not a subscription business. So you'll constantly
be chasing customers.

I explored this idea, and also the speech-to-text option, and when you run the
numbers you'll need thousands of hours per day just to keep the lights on.
Probably not worth it given you'll constantly be tracking new customers down.
One option might be to target news companies and try to make automated
newscasts or something, and try to get consulting fees plus fees for using
your custom tech. But I suspect it would need to be the tech plus some other
offering to differentiate you from everyone else who will be doing this.

Not trying to dissuade you. Just telling you what I think about it after
looking at this and building out a few prototypes.

[1] [https://cloud.google.com/text-to-speech/docs/wavenet](https://cloud.google.com/text-to-speech/docs/wavenet)

~~~
leahcim
I wonder if some tech companies need more human audio samples to train their
ML?

