
Emotionally Expressive Text to Speech - interweb
https://www.sonantic.io/
======
crazygringo
This is fascinating.

But I'm very curious what the emotional "parameters" are. There are literally
at least a thousand different ways of saying "I love you" (serious-romantic,
throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full
of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum).
Anyone who's worked as an actor and done script analysis knows there are hundreds
of variables that go into a line reading. Just three words, by themselves, can
communicate roughly an entire paragraph's worth of meaning solely by the exact
way they're said -- which is one of the things that makes acting, and
directing actors, such a rewarding challenge.

Obviously it's far too complex to infer from text alone. So I'm curious how the
team has simplified it. What are the emotional dimensions you can specify?
And how did they choose those dimensions over others? Are they geared towards
the kind of "everyday" expression in a normal conversation between friends, or
towards the more "dramatic" or "high comedy" of intense situations that much
of film and TV lean towards?

~~~
stubish
I've always imagined that this tech would need a markup language. Instead of a
script that an actor needs to interpret, the script writer (or an editor, or a
translator) would mark up the text.

~~~
jchb
There is the Speech Synthesis Markup Language (SSML). Amazon Polly and Google
text-to-speech support it, although the best neural-model-based voices only
support a small subset.
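For a concrete picture, here is roughly what such markup looks like, built and sanity-checked in Python. The attribute values are illustrative only; each engine and voice supports its own subset of tags, so check the vendor docs before relying on any of them:

```python
import xml.etree.ElementTree as ET

# A minimal SSML fragment of the kind Polly and Google TTS accept.
# rate/pitch/volume control the "technical" attributes discussed below;
# the specific values here are illustrative, not from either product's docs.
ssml = (
    '<speak>'
    '<prosody rate="slow" pitch="low" volume="soft">I love you.</prosody>'
    '<break time="400ms"/>'
    '</speak>'
)

# Sanity-check that the markup is well-formed XML before sending it anywhere.
root = ET.fromstring(ssml)
print(root.tag)             # speak
print(root[0].get("rate"))  # slow
```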

~~~
crazygringo
Ah thank you, that's very interesting.

So that's not markup along "emotional" lines, but rather along "technical"
attributes such as speed, pitch, volume, pause between words, and so on.

Obviously coding those things in XML manually would be a nightmare. Now I find
myself wondering if 1) these technical parameters can be used to synthesize
speech that does sound like a reasonable approximation of emotion (or if
they're insufficient because changes in resonance and timbre are crucial too),
and 2) if there are tools that can translate, say, 100 different basic
emotional descriptions ("excitedly curious", "depressed but making effort to
show interest", etc.) into the appropriate technical parameters so it would be
usable.

Anyways, just a fascinating area of study.
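A toy sketch of idea (2), assuming SSML-style prosody attributes as the target: the emotion labels and parameter values below are invented for illustration, and a real mapping would need far more dimensions (and, as noted, markup like this can't express changes in resonance or timbre at all):

```python
# Hypothetical lookup table translating coarse emotion labels into SSML
# prosody settings. Labels and values are made up for this sketch.
EMOTION_PRESETS = {
    "excitedly curious":     {"rate": "fast", "pitch": "high", "volume": "loud"},
    "depressed but engaged": {"rate": "slow", "pitch": "low",  "volume": "soft"},
}

def to_ssml(text: str, emotion: str) -> str:
    """Wrap text in a <prosody> tag configured for the named emotion."""
    attrs = " ".join(f'{k}="{v}"' for k, v in EMOTION_PRESETS[emotion].items())
    return f"<speak><prosody {attrs}>{text}</prosody></speak>"

print(to_ssml("I love you.", "excitedly curious"))
```

In practice one would hand the resulting string to a TTS engine's SSML input rather than printing it.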

------
sonantic
Hey HN - Zeena Qureshi (Co-Founder and CEO at Sonantic) here.

Thanks for your thoughts and feedback thus far! I'd be happy to answer
questions (within reason) about our latest cry demo / emotional TTS! Feel free
to fire away on this thread.

~~~
nmstoker
Saw your YouTube videos a few days ago and was very impressed.

Clearly you can't give away too much on your "secret sauce" but is there any
insight you could share on two questions:

1. Do the individual voice talents need to express the emotion types you use
or can you layer it on after? (ie do they have to have recorded say "happy" to
get happy outputs or can that be added to neutral recordings retrospectively)

2. What are the ballpark audio amounts you need per voice? 10 hrs, 20 hrs or
more?

~~~
sonantic
Hey! Thanks so much. Yea, can't go into too much detail here, but I will say
that more def isn't always better when it comes to the size of datasets. :) We
aim for quality over quantity in order to achieve natural expressiveness from
our actor recording sessions.

~~~
AndrewUnmuted
> more def isn't always better when it comes to the size of datasets

This sentiment definitely gives you lots of credibility; only those who have
seriously endeavored in this space are able to acknowledge just how true this
is.

It's quite antithetical to how some ML folks like to think.

------
spaceprison
My daughter is dyslexic and would love to play things like stardew valley,
pokemon or even animal crossing but being text only makes them such a slog for
her.

The same goes for subtitles; she'd be perfectly fine with a robot voice for
the actors if it sounded real enough, like this.

Game changer.

~~~
sonantic
Thank you for your comment. One of the reasons we founded Sonantic was to
improve accessibility so we are right there with you! We plan to do this by
reducing the barriers (both financial and logistical) of voiced content for
everyone from indie developers to big AAA studios. We've already begun to see
progress on this through partnership with initial customers during our beta.

~~~
mrec
Is your TTS process necessarily ahead of time, or can it be done at runtime
with all the flexibility (templating, generative text etc) that brings?

~~~
sonantic
That's the holy grail right there, isn't it? :) We're definitely working
towards runtime but still some work to be done there to account for additional
complexities and balance trade-offs re: speed, quality, accuracy etc. of the
rendered output.

------
ArneVogel
This site has to have one of the worst cookie consent popups:
[https://imgur.com/a/YLsGadP](https://imgur.com/a/YLsGadP)

~~~
aantix
GDPR has reverted the web to Geocities. It’s popup hell.

~~~
martimarkov
It’s not GDPR. It’s all the tracking website owners put there.

The fault is with the owners of the sites.

------
vessenes
Hi Zeena, I love this! I just filled out your form.

I was just mucking around with Nvidia's latest, called Flowtron, and I know
from that experience there's a significant amount of work between getting a
tech demo out and launching a usable product, whether API-based, or with some
visual workflow like your video shows.

One thing I think worth considering on the commercialization front is whether
or not the core offering is the workflow niceties around your engine, the
engine-as-API, or both. I'm just a random person on the internet, so take
these thoughts with a large grain of salt, but thinking about it, it seems to
me that prioritizing integration with, say, Unity, Unreal Engine, video
compositing tools, and blog posting tools are all interesting and viable market
paths. The underlying networks are going to keep improving for some time, so
you're really trying to buy some long term customers.

Some stuff that's obvious, but I can't resist:

I could off the top of my head imagine using this for massively reducing the
cost to develop games, for script writers pulling comps together, for myself
to create audio versions of my own writing, for better IOT applications inside
the home... I'd really love to be able to play with this.

There still isn't a truly non-annoying virtual assistant voice; when the first
Tacotron paper came out, I was hopeful I would see more prosody embedded in
assistants by now, but the longer we live with Siri and Google, the more
sensitive I think we are to their shortcomings. I have a preference for
passive / ambient communication and updates, so I would place a really high
value on something that could politely interrupt or say hello with
information.

At any rate, congratulations, this is cool. :)

------
diminish
Impressive next step for text-to-speech. I wish there were some simple real
demos. I also work on the same thing using DL, and hope to open source the
"emotional part" of it.

Soon we'll be able to create emotionally expressive YouTube videos with
synthetic actors.

~~~
sonantic
Thanks for your comments and nice to hear you're also working on TTS! We have
a few more samples (without background music) further down on our homepage and
plan to add a full dedicated subpage in time!

------
yc-kraln
I have a comment and a question:

The comment: I noticed that your demo video also had "emotional" video layered
on top of the dialogue. This could be considered manipulative; perhaps
consider sharing a naked version so we could attempt to interpret the emotion
based solely on the text to speech engine.

The question: You mention you met at EF. I was wondering if, beyond bringing
you together, you found EF to be worth the cost of admission?

~~~
jasonlingx
> The comment: I noticed that your demo video also had "emotional" video
> layered on top of the dialogue. This could be considered manipulative;
> perhaps consider sharing a naked version so we could attempt to interpret
> the emotion based solely on the text to speech engine.

Close your eyes

~~~
fossuser
The music is still there and has an obvious effect.

I thought the demo was impressive, but these things do seem like an effort to
distract from (or more accurately bolster the effect of) the core technology.

Though maybe the right call since this is less a strict technical demo and
more a way to drive interest/marketing.

The 'high levels of expressivity' comment was more of a flag to me, it's a
meaningless phrase alone but it's suggested as an obvious answer. It feels
like a mysterious answer [0].

I recognize though this is a marketing video, the core tech demo is cool, and
I'm probably being unfairly critical. Flags like that make me more skeptical
than I would otherwise be by default.

[0]:
[https://www.lesswrong.com/s/5uZQHpecjn7955faL/p/6i3zToomS86o...](https://www.lesswrong.com/s/5uZQHpecjn7955faL/p/6i3zToomS86oj9bS6)

------
microtherion
The prosody sounds nice. But two of the longer samples have a lot of vocal
fry, and the third sounds like the voice has a stuffy nose and/or a slight
lisp. I wonder whether those mannerisms were chosen to camouflage artifacts
inherent in their current implementation.

~~~
rowanG077
Or the mannerisms were chosen to show that the system can produce real
voices and not just perfect ones.

~~~
sonantic
Yep. Each of our TTS models is based on a real actor's voice with its own
nuanced characteristics. Some voices are naturally rougher or croakier while
others are smoother. As in real life, our differences are what make us unique.
Some voices will work better for certain character profiles / scenes - it's up
to the user to decide.

------
schoolornot
Between this and Lyrebird, there seems to be a high number of cutting-edge TTS
solutions being worked on in the private sector. Does anyone know why there
hasn't been much advancement in the FOSS libraries?

~~~
vianneychevalie
I’m convinced that the barrier to entry in this field, in terms of
technological and financial investment, is too high for FOSS projects to
compete with the commercial solutions.

We don’t see FOSS pharmaceutical research, for instance, I believe for the
same reason. The amount of coordination needed and the impossibility of
separating TTS projects into sub-parts could also be factors.

~~~
sgk284
[https://voice.mozilla.org/en](https://voice.mozilla.org/en)

“Common Voice is Mozilla's initiative to help teach machines how real people
speak.”

~~~
nmfisher
The problem with Common Voice is the same resource problem plaguing open
source efforts in general - for TTS data, it's not just the size that matters,
it's also the quality.

For something like Sonantic, you need clean recordings from professional
actors in proper recording environments (not to mention the in-house expertise
to then filter these down to curate the training/test datasets). That costs
money. A million people with laptop microphones will just never get there.

------
jariel
I'd recommend editing the video down to 43-60 seconds.

It would be nice to try it with actual text inputs right on the page; that
this doesn't exist is a tiny flag.

Working with voice actors is a great choice: there isn't any 'pure' TTS that's
good enough in the most general sense, so having the actual voice actor as a
working basis will help.

Perhaps for small game houses, they can just use something off the shelf, big
houses can use a customized voice, and then not worry if they have to make
tweaks or changes, they don't have to do a whole production.

~~~
sonantic
Thanks for your feedback! We felt that this storyline / length was best in
order to showcase the two different actors' artificial voices and build up to
the actual cry.

As you've mentioned, we do work with real actors to create our TTS and take
misuse of their (artificial) voices very seriously. Because they sound so
lifelike, we've made the decision not to allow public access/personal use at
this time.

Lastly, your assessment is spot on regarding standard vs custom voices. Lots
of interest for both!

~~~
zoomablemind
TTS=text-to-speech, so it's quite reasonable to showcase that chain instead of
an edited video.

Not diminishing the quality of your product, just pointing out an obvious
expectation of the audience that it's presented to. Perhaps, there could be a
way to test-drive it directly, with limited choices or combinations of the
input text.

~~~
sonantic
Fair point. We will definitely consider adding something like this to the
site. Thanks for the suggestion!

------
microcolonel
Very cool demo, but the quality of the vocoding is not state of the art, and
it's audibly artificial, which is probably why you covered it up with the
obnoxiously loud music.

Next time be honest about what you have when presenting it; every human with
functioning ears is attuned to the sound of speech. This sort of technology
would be amazing for narrative video games even with the less than perfect
vocoding.

------
amelius
Sounds nice but difficult to judge with the background music.

~~~
hobofan
I get the criticism (and "you need to find out for yourself" at ~0:44 sounds
somewhat robotic), but given that it's aiming at the entertainment industry,
where you will have background music most of the time, it also seems like a
fair choice of representing real-world usage (where background music might
always hide the imperfections a bit).

------
voiper1
Is there any pay-to-use or open source voice for Hebrew?

Amazon Polly's English voice Matthew is pretty nice, but they don't have
Hebrew. Google doesn't have Hebrew either. Bing has some attribution
requirement that I haven't fully investigated.

~~~
sarabande
Here is one, עלמה רידר (Alma Reader):
[https://www.almareader.com/](https://www.almareader.com/) (not nearly as
emotionally expressive as this one, though).

------
DenisM
This is very impressive.

I wonder if attaching this to a modern-day ELIZA would improve its Turing test
scores? Emotional load can reduce the requirement for semantic coherence.

~~~
samcodes
“Emotional load can reduce the requirement for semantic coherence.” - great
insight

------
tomByrer
@sonantic Seems you don't do real-time yet?

If so, have plans for a Web Speech API plugin? I'm about to release a reader
demo based around it.
[https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API)

------
aasasd
As a non-native-speaker, I understood exactly four words from the monologue in
the vid. Which _might_ be on par for some movies, often having actors whisper
and breathy-voice through the whole thing (ahem House of Cards cough).
However, for actual TTS like webpages and audiobooks, the ‘Dina’ voice works
much better.

------
wishinghand
Hey Zeena- will there be options to make the voices more unreal? The use case
I imagine is for a character with a damaged vocoder or a broken speaker. Other
glitchy affectations could be useful too.

------
diskmuncher
History has shown us that technological advancement of this kind will be
adopted first by ...

Obvious application: H-anime. Reduced parameters for the "emotion" as well.

------
hyperpallium
the video
[https://youtube.com/watch?v=zwYiDraKtSA](https://youtube.com/watch?v=zwYiDraKtSA)

------
sarabande
If this could generate well-done audiobooks instantly from a text, that would
be fantastic. All e-books could have an audiobook version overnight.

~~~
woah
It's kind of odd that they are not pitching this as the primary use case. It
seems much more plug-and-play than game voicing.

~~~
mthoms
It needs a human to annotate the text with the desired emotion.

Ideally, it would be able to infer the emotion from the text itself, but I
think that level of sophistication is a long way off.

Edit: Actually, this might be a perfect candidate for some sort of
crowdsourcing. Imagine Wikipedia pages containing hidden annotations for the
proper text-to-speech "tone/cadence/whatever" of each sentence or paragraph.
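As a sketch of the crowdsourcing idea (the schema and labels below are hypothetical, not any real Wikipedia or TTS format), per-sentence annotations could be collected from readers and resolved by a simple majority vote:

```python
from collections import Counter

# Hypothetical crowdsourced annotations: each reader tags a sentence
# with the emotion they think it should be read with.
submissions = [
    {"sentence_id": 12, "emotion": "wistful"},
    {"sentence_id": 12, "emotion": "wistful"},
    {"sentence_id": 12, "emotion": "neutral"},
]

def consensus(subs):
    """Pick the most common emotion label among the submissions."""
    votes = Counter(s["emotion"] for s in subs)
    return votes.most_common(1)[0][0]

print(consensus(submissions))  # wistful
```

A real system would need per-sentence grouping, spam filtering, and a richer label set, but the voting core would look something like this.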

~~~
bnj
This is a really cool idea— amazing to think about something like goodreads
taken to the next level where people are sharing their emotion markups for
texts. Imagine how you could try different mappings to see which ones you
liked...

------
Animats
Can't hear the voices over the music.

~~~
catblast
I hear them if I concentrate to isolate the voices out of the music. And you
can pick up on quite noticeable flaws, especially in cadence and intonation.
What somebody described as vocal fry sounds more like synthesis artifacts. The
bare samples further on the page also highlight issues in cadence.

This is obviously an early demo, but it isn't yet at the level where you could
narrate an audiobook - those little problems would quickly become noticeable.

------
moron4hire
Any plans to support languages other than English? This would be huge in the
foreign language instruction field.

~~~
sonantic
Yes! Supporting additional languages (and dialects) is definitely in our
roadmap.

------
blattimwind
I could see this being used for RPG games to fix the choice deficiency that
has been caused by going for fully voiced dialogue. Also, making Hitler read
copypastas even more convincingly.

~~~
ghaff
Even not-top-shelf voice actor talent is a really high bar. I keep my eye on
this space because there are a number of things I do where having even just a
decent "radio voice" TTS would be useful (and better than I can do myself).
But nothing is really there today: in some respects it's better than I can do
myself, but certainly not consistently.

~~~
blattimwind
The bar really isn't "has to be good out of the box"; if it requires some
tweaking on a line-by-line basis, that would probably be OK and still much,
much cheaper and quicker to iterate on than voice actors for these high
volumes of speech. In a lot of these games the existing voice acting is often
consistently poor (literally everything Bethesda ever released comes to mind),
certainly quite a few notches below average AAA voice acting (which is

~~~
ghaff
Fair enough. I'm not really much of a gamer.

One of the other challenges with using outside voice talent is that it can be
inconvenient/expensive when you need to add/change something. I've been
involved with podcasts using an external host and one of the negatives with
that process is that if you discover a minor mistake/glitch in the narration
late in the process you can't easily fix it.

------
dequalant
This is amazing! I was waiting for something like this to come along for a
long time. Finally someone did it!

------
terrycody
Applied via the form. Really cool.

I want to know the price and when we can use it in production.

------
cemregr
Is there an actual demo?

------
dejongh
Both cool and creepy!

~~~
sonantic
O_o

haha thanks, we'll take that as a compliment!

------
maxdo
Wow sounds very real

