NaturalSpeech: End-to-end text to speech synthesis with human-level quality (speechresearch.github.io)
377 points by phsilva on May 18, 2022 | 149 comments



Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating samples as a correlated group, distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.

And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. Nevertheless:

The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.

The recorded human samples vary more between samples (by which I mean the sample as a whole may be emphasized with a bit of extra stress or a small raising or lowering of tone compared to the other samples), and within the sample there is a bit more emphasis on a word or two or slight variance in the length of pauses, mostly appropriate in context (as in, it is similar to what I, a non-professional[0], would have emphasized if I were being asked to record these).

In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.

Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (eg. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.

[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.


> distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.

The synthesis still doesn't know where to place emphasis. You may be unable to distinguish between POOR human voice work and TTS, but not GOOD human reading.

Not a big improvement over some existing TTS. Try e.g. Michelle at: https://cloudpolly.berkine.space/


You can hear that the human readers place emphasis based upon an understanding of the meaning of the text that they're reading, and also based upon an understanding of the humans at the receiving end. It seeps through that they're human. The AI generated samples are good, but they're bland in comparison. The human emphasized words are typically not emphasized in the AI generated samples.


I hear more stress by the human reader, but it isn't always in the appropriate spot, IMO.


That might be true. But I think a good reader, such as a news anchor or a voice actor, will know where to put the emphasis and the pauses, in order to help the listener along. It's value-adding. I think most people who do it professionally will have this capacity.


It'd be nice to be able to tag specific words for emphasis in a sentence, where the tagging would be done via semantic NLU tasks and the voice alteration by the TTS model.


That'd be interesting because it'd split the problem into "parse and highlight what should be emphasised" and "do the TTS".
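To make the shape of that split concrete, here's a purely illustrative sketch: the "NLU" is just a dummy word list standing in for a real model, and the markup and names are placeholders, not any real library.

    # Purely illustrative: the two-stage split described above.
    # The "NLU" here is a dummy word set standing in for a real model.
    EMPHASIS_WORDS = {"never", "stole"}  # pretend an NLU model produced this

    def tag_emphasis(text: str) -> str:
        """Stage 1: decide which words carry the focus and wrap them in markup."""
        tagged = [
            f"<emphasis>{tok}</emphasis>" if tok.strip(".,").lower() in EMPHASIS_WORDS else tok
            for tok in text.split()
        ]
        return "<speak>" + " ".join(tagged) + "</speak>"

    # Stage 2 would hand this markup to the TTS engine, which only has to
    # realise the emphasis rather than understand the sentence.
    print(tag_emphasis("I never said she stole my money."))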


I think there's already research on "TTS after NLG" that does this, since an NLG system can export meta-info about emphasis in addition to the text (at least in the case of non-end-to-end NLG systems).

Whether that makes a big difference in practice, I don't know.


> The synthesis still doesn't know where to place emphasis.

True. And yet, these samples aren't in a monotone! That's an enormous improvement.

> You may be unable to distinguish between POOR human voice work and TTS, but not GOOD human reading.

I think these are indistinguishable from AVERAGE human voice work. Keep in mind that the POOR voice work you may have in mind is probably still being done by someone who is, at least nominally, a paid professional.

> Try e.g. Michelle at: https://cloudpolly.berkine.space/

I'm not able to select a voice other than Oscar (which is definitely worse than this).


Yeah, my first impression of the samples is "wow, they've found a human narrator who knows how to speak like a robot!"


It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.


This reminds me of voice dictation history. I remember using Dragon and other upstarts on a 486DX notebook. I could discuss with the laptop, ask for instructions, receive answers, and control verbally all options, actions, and programs. At the same time, I could do dictation in different fields of study - training was required for geology words or engineering words, for example - which was remarkably accurate. So you could turn on your notebook; start running programs, verbally type (dictate) a report, save, format, print, and email it - then turn on a music player or watch a movie. All without touching a mouse or keyboard. This was in a busy room, with a party, by the way. Everyone said, just wait until computer speeds are faster! But all the software was bought out by Apple and Microsoft. It didn't get improved on for over 20-25 years, and still isn't really functioning (except Siri, Google, etc.).


In case it's of interest, when I last explored this topic in terms of the Free/Open Source ecosystem I was very impressed with how well VOSK-API performed: https://github.com/alphacep/vosk-api

Here's another project that builds on top of VOSK to provide a tighter integration with Linux: https://github.com/ideasman42/nerd-dictation


I agree so much that I've started learning ML to make a decent open-source, multi-language TTS that works on smartphones.

But really, the situation is pretty good, with a lot of code and datasets available as open source. Notably, if you're not constrained to smartphones and the like, you can run quite a number of modern models on your computer; see for instance https://github.com/coqui-ai/TTS/ (which itself contains many different models).

The work that needs to be done is """just""" to turn those models into something suitable for smartphones (which will most likely include re-training), and to plug them back into Android's TTS API.
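For a taste of the desktop route, here's roughly what running one of the Coqui models from Python looks like. The model name below is just one of their released English models, and the API surface has shifted between releases, so treat this as a sketch rather than gospel and check their README for the current form:

    # Rough sketch using Coqui TTS (pip install TTS); everything runs locally.
    # Exact class/method and model names may differ between releases -- see the
    # project's README and `tts --list_models` for what's currently available.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(
        text="All of this runs on your own machine, no cloud service involved.",
        file_path="output.wav",
    )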


If you've not already encountered them I'd definitely encourage you to check out these Free/Open Source projects too:

* Larynx: https://github.com/rhasspy/larynx/

* OpenTTS: https://github.com/synesthesiam/opentts

* Likely Mimic3 in the near future: https://mycroft.ai/blog/mimic-3-preview/

Larynx in particular has a focus on "faster than real-time" synthesis, while OpenTTS is an attempt to package & provide a common REST API for all Free/Open Source Text To Speech systems, so the FLOSS ecosystem can build on previous work supported by short-lived business interests rather than start from scratch every time.

AIUI the developer of the first two projects now works for Mycroft AI & is involved in the development of Mimic3 which seems very promising given how much of an impact on quality his solo work has had in just the past couple of years or so.


On the Kindle thing, I wonder if the rise of audiobooks as a product killed its built-in TTS off to remove competition.


It was a rights issue. The Authors' Guild argued that TTS required audio rights. Here's an article from 2009: https://www.theguardian.com/technology/blog/2009/mar/01/auth...


Just one more example of copyright stifling progress, innovation, and accessibility. The Authors Guild doesn't screw over the public by exploiting our insane copyright laws as often as the MPA or RIAA, but they've had their moments.

https://www.eff.org/deeplinks/2005/09/authors-guild-sues-goo...

https://www.cnn.com/2011/09/13/tech/web/authors-guild-book-d...

https://popula.com/2022/01/22/what-kind-of-writer-accuses-li...


Or the publisher could license the audio rights from the content creators, so they get compensated for the increase in profits that the tech/publisher is making?


If it's an audiobook, then yes, it's sort of a performance and you could argue they need a different licence. But with TTS you basically create the audio with your own means on your device. And you still bought the book, so there is your profit. If I read a book to my child, would you want me to buy an audio licence?


I know it’s a rhetorical example, but given how stingy the copyright regime is: yes, yes they would. You could maybe manage a 20% off credit for the child’s listening rights.


What's the difference between an audiobook and real-time text-to-audio translation? Isn't the outcome/experience the same?

The difference between you reading it and the software reading it is that one is 'mechanical'; whether you want to constrain someone doing that in the rights you grant is debatable.

Copyright enables you (gives the creator the 'freedom' to choose) to make such choices; it's what people choose to limit that is the problem, as they tend to be very stingy.


> What's the difference between an audiobook and real-time text-to-audio translation? Isn't the outcome/experience the same?

The "outcome" (text is read aloud) is the same if you read it aloud to yourself. Really the difference is that when a publisher releases an audiobook they hire someone (sometimes multiple someones), often an actor or the original author, to sit in a recording studio and recite. They pay for things like studio time, sound engineering, editing, the narrator's time, sometimes music or foley, etc. It's very much a different product.

If you have a book and you recite it, or if you pay someone to come into your home and read it to you, or you get a bunch of software and have your computer read it for you, that's your right and at your own expense in terms of time, money, and effort. Some kind of text to speech software is expected on pretty much every device. Including such features in devices or using those features (especially the accessibility features) of your own devices isn't copyright infringement, shouldn't open you up to demands for payments from publishers, and is in no way comparable to a professionally produced audiobook. Maybe one day the tech will advance to where a program can gather the context needed to speak with and convey the correct emotion for each line and will be capable of delivering a solid performance, but right now we're lucky if more than 2/3 of the words are even pronounced correctly and the inflection isn't bizarre enough to make you question what was being said or distract you from the material.


As a thought experiment, how about if some new fancy feature/AI could take your book/text and generate a movie from it? The technology/publishing platform only licensed the right to publish the text/book, but they now have a way for you to watch a movie (in the past they had to get actors, a director, sometimes several, pay people, studios, etc.)...

As a follow-up, it would be cool if TTS used different voices for the narrator and characters in a book ... if someone patents that, you're welcome!


I'd guess that even a movie, made by AI using nothing but a book you paid for, would still be okay for personal use. Again, we already have the right to hire an entire theater troupe to come to our homes and perform whatever we want. As long as you weren't making your AI movie commercially available you'd probably be alright; that kind of thing might be transformative enough to be covered under fair use, although I'm guessing somebody would still object to you posting it online for free.

Honestly, if AI ever gets good enough at crafting films from literary source material that AI movies have any chance of competing with a Hollywood production, the entertainment industry is screwed. I'm positive that by then whatever crazy stuff that AI is putting out will be everywhere, and playing with it will be way more fun than a movie theater ticket.

Different voices would be cool, but who is speaking which line can sometimes be ambiguous even for human readers. I'd be happy with just one voice that didn't sound like a robot or like a human voice spliced together from multiple sources.


> convey the correct emotion for each line and will be capable of delivering a solid performance

Yeah ... so you might be getting into performance rights then ;P


One of the problems is if TTS becomes comparable to an actor's performance. It's interesting that we are also now moving in the direction where you can copyright your voice.


yeah, actors are going to have to secure all kinds of "likeness rights" or something to prevent all kinds of media being made with them, but without them. There have already been cases like this: https://www.theverge.com/2013/6/24/4458368/ellen-page-says-t...

I still expect it'll mean a lot of very amusing outbound voice mail messages.


Oh noooo. There is a huge difference in quality. Voice actors are severely underrated. They can make or break an audiobook.

We recently listened to How To Train Your Dragon instead of reading it to our kids ourselves specifically because it was narrated by David Tennant.


Next you will need audio rights if you want to read a bedtime story to your kids...


Oh absolutely, the same way Amazon removed physical buttons so they could use them as leverage for their overpriced Oasis reader. My twelve year old Kindle 3 has TTS, page turn buttons, and a headphone jack, and it cost a hundred bucks.


Fun fact: If you try to use Siri to read the text of a website to you ("speak screen"), it refuses to do so if you are connected to a car's bluetooth.

One of the vast litany of non-AI related Siri flaws that make you wonder how a $2t company can achieve such tragic levels of incompetence.


That sounds intentional, like trying to prevent "public performances".


I wonder if Google has the same limitation. I use that functionality quite often but I do it over the aux input.


Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.


I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.


Dr. Sbaitso ran on a modest 386. Mobile device processors generally eclipse that and could definitely generate better quality TTS.


Not only is state of the art TTS much more demanding (and much much higher quality) than Dr. Sbaitso[0], but so are the not-quite-so-good TTS engines in both Android and iOS.

That said, having only skimmed the paper I didn’t notice a discussion of the compute requirements for usage (just training), but it did say it was a 28.7 million parameter model (roughly 115 MB of weights at fp32, or about half that at fp16), so I reckon this could be used in real-time on a phone.

[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.


Ok, I get it, state-of-the-art TTS uses AI techniques and so eats processing power, buuuuut seeing that much older efforts which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle, etc. used far less CPU for speech that you could (mostly) understand without problems, it may be worth exploring whether it's possible to write a better "dumb" (i.e. non-AI) speech synthesizer?


Better than the ones those systems already have? I assume they’ve already got some AI, because without AI, “minute” and “minute” get pronounced the same way because there’s no contextual clue to which instance is the unit of time and which is a fancy way of describing something as very small.


I'm still hoping that a human being can tell which of the four possible ways to pronounce the name of the English post-punk band, "The The".

https://en.wikipedia.org/wiki/The_The

https://www.youtube.com/watch?v=orIy18qIaCU


I have a soft spot for the Yorkshire pronunciation: https://www.youtube.com/watch?v=lzymb0YJp7E&t=160s


The parent didn't mention real-time as a requirement. Offline rendering would well suffice.


28.7 million parameters is nothing for inference.


Often you can prune parameters as well. You might be able to cut that down by a factor of 10 without any noticeable loss in accuracy.
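For example, with PyTorch's built-in pruning utilities you can zero out the smallest-magnitude weights and then measure how much quality you actually lose. A generic sketch, not specific to this paper's model:

    # Generic magnitude-pruning sketch with PyTorch's built-in utilities;
    # the toy model below just stands in for whatever network you're shrinking.
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))

    # Zero out the 90% smallest-magnitude weights in every Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.9)
            prune.remove(module, "weight")  # bake the pruning mask into the weights

    # In practice you'd re-evaluate quality (e.g. MOS or a proxy metric) after
    # pruning, and usually fine-tune to recover whatever was lost.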


Well, that's no surprise, because governments/corporations (in fact, The Govporation these days) strive to keep the status quo of being way beyond the capabilities of the people, hence the monopoly on weapon-grade tools - and I seriously think that AI-produced speech that is indistinguishable from the human one is a weapon-grade tool.


Your answer is in deep conspiracy territory. The reality is in my opinion more banal: money.

Microsoft and particularly Apple used to make cutting edge (at the time) TTS available as a marketing gimmick, but then did not develop this further because TTS was only relevant for screen readers and people with impaired vision weren't really their primary customers. Then TTS made advances and companies started selling high quality voices at high price points. Now companies want to make as much money with it as they can. Improving "free" TTS for ordinary customers is not really a top priority of Microsoft and Apple. Moreover, since network speeds have increased, any improved end-consumer TTS will send the text to a server and the audio back, so that a company can collect and analyze all the texts and make money with this spy data. That's how Google's free TTS server works.


> Now companies want to make as much money with it as they can.

That's exactly what I meant when I wrote "...strive to keep the status quo of being way beyond the capabilities of the people".

Monopolizing the means of maximizing the profits to maximize the power - now that's some deep conspiracy territory, indeed.


> Your answer is in deep conspiracy territory. The reality is in my opinion more banal: money.

Honestly, are those really different? How is it not in their interest for money?


Can't expect any honesty here, my friend. Vested interest and latent Stockholm syndrome prevail.


[flagged]


“Govporation” is exactly the sort of agenda driven torturous linguistic manipulation that the phrase “call a spade a spade” mocks.


Parse it as "seemingly independent governments of the world and international big business blended and interlocked together to the point of being indistinguishable" to ease the torture.


We have a word for that, although its meaning is a bit broader. Society.


No, that excludes labor, i.e. the majority of people. Land, labor, capital and state as per classical economics (Ricardo, Smith, etc.)


We're democracies, we vote.

Oh also it's land, labour, capital and entrepreneurship in classical economics.


> We're democracies, we vote.

You vote what you're told to vote by the media that the state and capital dominate entirely. (Think about it: how would enough people even hear about an alternative idea or candidate that those interests don't want heard?) If you're American you get to choose between two millionaires (or sometimes billionaires) with virtually identical economic politics, chosen in advance for you, who can safely ignore you once they get elected. "Democracy."

> Oh also it's land, labour, capital and entrepreneurship in classical economics.

First, no, you're wrong: there are only three factors of production in classical economics, which you would know if you had read any of those authors or at least bothered to check Wikipedia[1]. Second, the state is not a factor of production, so you got the wrong list anyway. "Entrepreneurship", which isn't an institution, was made up by neoclassical economics to justify profit in the 30's, over 250 years later. It wasn't even popular until much later.

1. https://en.wikipedia.org/wiki/Factors_of_production#A_fourth...?


We as in Zamyatin's "We"?


You are free to assign whatever sequence of phonemes you deem appropriate to the phenomenon, but the latter remains what it is regardless.


See also the recently published Tortoise TTS, which IMO sounds even better: https://github.com/neonbjb/tortoise-tts


> If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well.

Huh.


WOW! I'm flabbergasted! Check out the `Compared to Tacotron2 (with the LJSpeech voice)` or `Prompt Engineering` sections!


just had a quick play with this ! great !!


It sounds better and yet sounds like an overcompressed mp3


Nice pitch envelopes. But it's a bit uncanny that natural human pitch envelopes encode and express what you understand and intend to convey about the meaning of the words you're saying, and what you want to emphasize about each individual word, emotionally. Like how you'll say a word you don't really mean sarcastically. It can figure out it's a question because the sentence ends in a question mark, and it raises the pitch at the end, but it can't figure out what the meaning or point of the question is, and which words to emphasize and stress to convey that meaning. (Not a criticism of this excellent work, just pointing out how hard a problem it is!)

For example, compare "rebuke and abash": in the NaturalSpeech, one goes down like she's sure and the other goes up like she's questioning, where in the recording, they are both more balanced and emphasized as equally important words in the sentence. And the pause after insolent in "insolent and daring" sounds uneven compared to the recording, which emphasizes the pair of words more equally and tightly.

Jiminy Glick interviews (and does an impression of) Jerry Seinfeld:

https://www.youtube.com/watch?v=AE2utktZ92Y


I had always thought a necessary step along the way to natural speech synthesis would be adding markup to the text. But I guess the use case being chased is reading factual information, where you don't use sarcasm or other emotional color. Trying to convert text to emotive speech requires markup IMO. Even the best actors will read their lines the wrong way and need correction by the director.


> Even the best actors will read their lines the wrong way and need correction by the director.

And sometimes even the director doesn't notice. Even when they also wrote the screenplay! My favourite example is from The Matrix, @1:01:40, where Cypher says:

> The image translators work FOR the construct program, but there's way too much information to decode the Matrix

Stress on the "for", which makes no sense; he sounds like he's revealing an employee/employer relationship. The stress should be on the start of "CONstruct", since he's saying the tech works for that but not for the other. Subtle, but it changes the whole sense of the line.


Wow, I've seen that movie dozens of times and I've always thought that was the stupidest line of meaningless technobabble because of how it was delivered. It made me question whether I knew what was meant by the phrase "construct program". It's clearly the name for the training/utility simulation used by the crew, but this line always had me questioning whether it also referred to the "front end" of the Matrix itself or something. Now it makes sense!

Goddamn you Cypher, indeed.


Dumbledore said calmly


That's a good point. There are various XML markup formats for synthesizing text to speech that let you tag words for emphasis and pitch, but they're not granular or expressive enough to mark up individual syllables of words, and they're not useful for singing, for example.

It would get really messy if you had to put tags around individual letters, and letters don't even map directly to syllables, so you'd need to mark up a phonetic transcription. At that point you might as well use a binary file format, not XML.

But for that kind of stuff (singing), there are great tools like Vocaloid (which is a HUGE thing in Japan):

https://www.vocaloid.com/en/

https://en.wikipedia.org/wiki/Vocaloid

VOCALOID5 - Walkthrough

https://www.youtube.com/watch?v=UAtVGHl1AFM

(Check out the "Cool / Cute" slider at 8:22!)

Here's a much simpler and cruder tool I made years ago (when xml was all the rage) for editing and "rotoscoping" speech pitch envelopes, called the "Phoneloper" -- not quite as polished and refined as Vocaloid, but it was sure fun to make and play with:

Phoneloper Demo

The Phoneloper is a toy/tool for creating and editing expressive speech "Phonelopes" that Don Hopkins developed for Will Wright's Stupid Fun Club in 2003, using Python + Tkinter + CMU's "Flite" open source speech synthesizer. It modified Flite so it could export and import the diphone/pitch/timing as xml "Phonelopes", so you could synthesize a sentence to get an initial stream of diphones and a pitch envelope. Then you could edit them by selecting and dragging them around, add and delete control points from the pitch and amplitude tracks to inflect the speech, stretch the diphones to change their duration, etc. It was not "fully automatic", but you could load an audio file and draw its spectrogram in the background of the pitch track, so you could stretch the diphones and "rotoscope" the pitch track to match it.

https://www.youtube.com/watch?v=qy5cqV8ypIs


DECTalk (famously, the TTS used by Stephen Hawking) lets you mark up individual syllables, and it even had a singing feature circa 1985.[1] It is indeed messy though.

This is the example code to make it sing Happy Birthday:

>DECtalk Software Sample Singing Program

[:phoneme on]

[hxae<300,10>piy<300,10> brr<600,12>th<100>dey<600,10> tuw<600,15> yu<1200,14>_<120>] [hxae<300,10>piy<300,10> brr<600,12>th<100>dey<600,10> tuw<600,17> yu<1200,15>_<120>] [hxae<300,10>piy<300,10> brr<600,22>th<100>dey<600,19>dih<600,15>rdeh<600,14>ktao<600,12>k_<120>_<120>] [hxae<300,20>piy<300,20> brr<600,19>th<100>dey<600,15> tuw<600,17> yu<1200,15>]

---

[1] https://vt100.net/manx/details/1,1230 (92mb pdf so I won't link directly to it, but info on singing feature appears on p. 116 for the curious)


Here's a historic DECTalk Duet song from Peter Langston (which is actually quite lovely):

Eedie & Eddie (And The Reggaebots) - Some Velvet Morning (Peter Langston)

https://www.youtube.com/watch?v=1l0Ko1GUiSo

Peter S. Langston - "Some Velvet Morning" (By Lee Hazelwood) - Performed By Eedie & Eddie And The Reggaebots

http://www.wfmu.org/365/2003/169.shtml

Eedie & Eddie On The Wire

http://www.langston.com/SVM.html

Peter Langston's Home Page:

http://www.langston.com/

His 1986 Usenix "2332" paper:

http://www.langston.com/Papers/2332.pdf

How to use Eddie and Eedie to make free third party long distance phone calls (it's OK, Bellcore had as much free long distance phone service as they wanted to give away for free):

https://news.ycombinator.com/item?id=22308781

>My mom refused to get touch-tone service, in the hopes of preventing me from becoming a phone phreak. But I had my touch-tone-enabled friends touch-tone me MCI codes and phone numbers I wanted to call over the phone, and recorded them on a cassette tape recorder, which I could then play back, with the cassette player's mic and speaker cable wired directly into the phone speaker and mic.

>Finally there was one long distance service that used speech recognition to dial numbers! It would repeat groups of 3 or 4 digits you spoke, and ask you to verify they were correct with yes or no. If you said no, it would speak each digit back and ask you to verify it: Was the first number 7? ...

>The most satisfying way I ever made a free phone call was at the expense of Bell Communications Research (who were up to their ears swimming in as much free phone service as they possibly could give away, so it didn't hurt anyone -- and it was actually with their explicitly spoken consent), and was due to in-band signaling of billing authorization:

>When you called (201) 644-2332, it would answer, say "Hello," pause long enough to let the operator ask "Will you accept a collect call from Richard Nixon?", then it would say "Yes operator, I will accept the charges." And that worked just fine for third party calls too!

>Peter Langston (working at Bellcore) created and wrote a classic 1985 Usenix paper about "Eedie & Eddie", whose phone number still rings a bell (in my head at least, since I called it so often): [...]

>(201) 644-2332 or Eedie & Eddie on the Wire: An Experiment in Music Generation. Peter S Langston. Bell communications Research, Morristown, New Jersey.

>ABSTRACT: At Bell Communications Research a set of programs running on loosely coupled Unix systems equipped with unusual peripherals forms a setting in which ideas about music may be "aired". This paper describes the hardware and software components of a short automated music concert that is available through the public switched telephone network. Three methods of algorithmic music generation are described.


> and it's not useful for singing, for example.

There's this: http://zeehio.github.io/festival/doc/Singing-Synthesis.html It won't sound as good as Vocaloid, but it uses XML with singing synthesis.


FWIW the "Speech Synthesis Markup Language" (SSML) standard does have sections for "3.2 Prosody and Style" https://www.w3.org/TR/speech-synthesis11/#S3.2 & "3.1.10 phoneme Element" https://www.w3.org/TR/speech-synthesis11/#edef_phoneme in it which can enable quite granular control.

Of course, whether a particular speech synthesis system supports such features is another thing.

It also has a `pitch_contour` attribute: https://www.w3.org/TR/speech-synthesis11/#pitch_contour
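To make that concrete, here's a small SSML sketch exercising those elements. It's wrapped in a Python string purely for illustration; synthesize() is a placeholder rather than any real API, and how much of this a given engine actually honours varies a lot:

    # Illustrative only: SSML using the emphasis/prosody/phoneme features above.
    # synthesize() is a placeholder, not a real API; engine support varies widely.
    ssml = """
    <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      This word gets <emphasis level="strong">strong</emphasis> emphasis.
      <prosody rate="90%" contour="(10%,+5%) (50%,+20%) (90%,-10%)">
        This phrase follows an explicitly drawn pitch contour.
      </prosody>
      And <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme> gets a per-phoneme transcription.
    </speak>
    """
    # audio = synthesize(ssml)  # placeholder call for whatever engine you use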

Coincidentally enough I pretty much only know any of this because a few years ago I created a GUI for a client which enabled an assistive technology researcher to "draw" in the pitch contour required for a word/phrase (not unlike the project demonstrated in your video :) ) from which the SSML was then generated.


Do you know any great Vocaloids for songs? The best I know is the soul-crushing https://youtu.be/W2TE0DjdNqI


Given that a human is going to be listening to each sample, I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.

I tend to view most of these things through the perspective of what would help mod-makers for video games: VA is the one thing you basically can't do yourself, but you also tend to have a decent data set to pull from (and I suspect various open source voice sample sets would become pretty popular).


> I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.

There's definitely research/proprietary software that can enable a person speaking in desired manner to have their voice control the expression of the generated speech.

Here's a related issue on a Open Source text to speech project which I only learned of today: https://github.com/neonbjb/tortoise-tts/issues/34#issue-1229...

> I tend to view most of these things through the perspective of what would help mod-maker's for video games

Yeah, I think there's some really cool potential for indie creatives to have access to (even lower quality) voice simulation--for use in everything from the initial writing process (I find it quite interesting how engaging it is to hear one's words if that's going to be the final form--and even synthesis artifacts can prompt an emotion or thought to develop); to placeholder audio; and, even final audio in some cases.

> (and I suspect various open source voice sample sets would become pretty popular).

That's definitely a powerful enabler for Free/Open Source speech systems. There's a list of current data sets for speech at the "Open Speech and Language Resources" site: https://openslr.org/resources.php

Encouraging people to provide their voice for Public Domain/Open Source use does come with some ethical aspects that I think people need to be made aware of so they can make informed decisions about it.

Given your interest in this topic you might be interested in this (rough) tool I finally released last week: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...


Yes, this stuff will be pretty neat for game mods, particularly ones like the Elder Scrolls IMO. Can't wait!


The W3C does have a standard for speech synthesis markup called "Speech Synthesis Markup Language" (SSML): https://www.w3.org/TR/speech-synthesis11/

Whether it's supported by a particular speech synthesis system and which set of features are supported varies.

The standard does include a section on "3.2 Prosody and Style" which covers this aspect to some degree: https://www.w3.org/TR/speech-synthesis11/#S3.2


https://15.ai/ uses emojis for emotional context and it works pretty good.


There's no way I'll find it but somewhere along the way there was a collection of samples in which one of these contemporary model-based speech synthesizers (possibly wavenet or tacotron) was forced to output data with no useful text (can't remember if it was just noise or literally zero input). The synthesizer just started creating weird breathy pops and purrs and gibberish utterances. Some of them sounded like panic breathing and it was one of the more jarring things I've heard in quite some time.

This isn't exactly it but it's very close - https://www.deepmind.com/blog/wavenet-a-generative-model-for... CTRL+F 'babbling'


Sounds a lot like Simlish!

Katy Perry - Last Friday Night in Simlish:

https://www.youtube.com/watch?v=sxyW6AJ-yIk

How the Language From the Sims Was Created:

https://www.youtube.com/watch?v=FGsbeTV76YI

Simlish Voice Video (Gerri Lawlor and Stephen Kearin, inventors of Simlish):

https://www.youtube.com/watch?v=Y_E6026i9tA

Steve and Gerri improvising together in English while playing The Sims:

https://donhopkins.com/home/catalog/sounds/Steve_And_Gerri.w...

Gerri Lawlor:

https://en.wikipedia.org/wiki/Gerri_Lawlor

At the University of Maryland VAX Lab in the 80's, we had a DECTalk attached to the VAX over a serial line that we'd play around with, but I think the protocol must have used two-byte tokens, because sometimes it would get one byte out of sync and start going "BLLEEGH YAAUGH RAWGH BRAGHK SPROP BLOP BLOP GUKGUK BWAUGHK GYAAUGHT BLOBBLE SPLOP BLAP BLAP BEAUGH GUWK SPLAPPLE PLAP SPLORPLE BLAPPLE"! (*)

Just like it was channeling the Don Martin Sound Effects from random Mad Magazines.

https://www.madcoversite.com/dmd-alphabetical.html

(*) Bulemia Meeting Attendees Vomiting, MAD #266 1987, Page 45, On Thursday Evening on West 12th Street.


Most of those sound hilariously close to Danish, probably because of the glottal stop :P


For german users, I can recommend to take a look at

https://www.thorsten-voice.de/

https://github.com/thorstenMueller/Thorsten-Voice

where someone contributed a huge set of his voice samples and a tutorial / script collection to build a pretty decent TTS model LOCALLY.

Quality-wise it is not as good as the samples in the article, but it's free and pretty easy to follow for a tech enthusiast.


Thanks for the pointer. Good non-English TTS is hard to come by, and this one sounds truly amazing!


What's a good TTS cloud service that has anything even close to these voices? I looked at the Google and Amazon ones and was pretty disappointed.


Did you ... read ... what was written in the linked page?

> Authors

> ...

> Microsoft Research Asia & Microsoft Azure Speech

https://azure.microsoft.com/en-us/blog/announcing-new-voices...


Just because they release a research paper doesn't mean it's out in production or if this model will even be economical to run in production for them.


Thank you! Thought it was just a research paper and they weren't in production yet. Samples on that blog post sound really good


I played with demos of various cloud providers; a lot of it comes down to differences of opinion, it seems, as I've had people on here exclaim that something was fantastic when it sounded like a robot to me.

My opinion is Microsoft's Azure TTS is the best and IBM's Watson TTS was a pretty close second. I remember finding Google's disappointing as well.

https://azure.microsoft.com/en-us/services/cognitive-service...


Industry research lab claims human parity on end-to-end text-to-speech and releases a web page with five samples as proof? Microsoft, you're a little late to the party - Google has been using this playbook for 5 years!


It's clear that their dataset contains a lot of newscasts. I wouldn't call this "natural" speech. But it certainly has an application for replacing newscasters/announcers.


Reminds me of how good the choir instruments are these days. https://youtu.be/ulK3_o7OyEk?t=392


As far as I can tell the choir instrument in that video is just playing back samples, which are admittedly very high quality samples, but still nothing particularly groundbreaking.

I’d love to see something that could actually do choir synthesis using a method like the one in the article.


Oh, I thought it was the one where you type in words. Can't remember where to find the word-typing choir plug if that's not the one.

But here's I guess a more pure-synth version (depending on what that means, like is any real voice data OK at any point in the algo?), not too bad.

https://youtu.be/9vEA4iSjajg?t=517

Edit: Ah, here's a type & speak version...who knows what cheats there are, could be somebody under his desk for all I know:

https://www.youtube.com/watch?v=NNyQ7FWV2E8

Less drama, though different software:

https://www.youtube.com/watch?v=UAtVGHl1AFM&t=177s


Reminds me of this CD from the mid '90s >> https://youtu.be/YNM-BJ9JDKU


That is just playing audio samples. For true singing TTS check this: https://news.ycombinator.com/item?id=31420072


thx for sharing


Every sample they provide is suspiciously similar to the human version (indicating overtraining, either on the samples or on a single voice), where I would have expected a different, if still human-quality, voice from a fully functional system. Still, this tech is coming, and soon. And when it does, voice acting will no longer prevent videogames from having complex stories, and we will find out if the industry is still capable of making them. Looking forward to it :)


One of the samples even has a breath intake at the same point as the recording. Not sure how they do that (didn’t read the paper) but I first thought it was the recording and compared the two to find the breathing wasn’t quite as natural as an actual person with lungs.


I'm looking forward to it for making audiobooks of books which have no audio, or plainly reading me selections on webpages at the desired speed.


I don't suppose anyone could recommend a good text-to-speech for Linux?

Command line is fine but it would be much better if it could trivially take clipboard content for input. The last time I looked I found stuff that wasn't that great and was pretty inconvenient.


I can recommend taking a look at:

* Larynx: https://github.com/rhasspy/larynx/

* OpenTTS: https://github.com/synesthesiam/opentts

* Likely Mimic3 in the near future: https://mycroft.ai/blog/mimic-3-preview/


Thanks!



Very subtle differences, can be heard, but I have my headphones on. For example, in the last example, "borne" and "commission" seem to have some kind of artificial noise inside the "b" and "c" sounds. The "th" in "clothing" sounds artificial too. Still, it's extremely amazing, and probably in 90% of settings, people won't be able to find a difference at all. It even does breaths: "scientific certainty <breath> that".


This is pretty impressive work, except for this one: "who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars"

Both the NaturalSpeech and the human said pretty much every word in that sentence completely incorrectly for the context of the words. It is the difference between "the car Seat" and "the car seat". "It's pronounced Ore-garh-no" to paraphrase the insufferable Hermione Granger.


also had he been in the 10th Hussars he'd have been "lef-tenant" not "loo-tenant", and it would have been the "hoo-zaaz"


One thing I've noticed is that I can hear a human inhale before they continue speaking. Got curious whether TTS of the future should have this feature too.


Google had a presentation a while back with an AI voice adding "uhm" and breathing pauses in their responses.


That's a really valid point, like making robots blink.


Good quality overall, though it's difficult to tell from a small, hand picked set of examples (which appear to come from the training data, too — have the corresponding recordings been included in the voice build or held out?).

There is a rather obvious problem with the stress on "warehouses", and a more subtle problem with "warrants on them", where it's difficult to get the stress pattern just right.


I think they’re held out but all data comes from the same speaker and since they have like 10k samples times say 10 words per sample, practically every word will have been in the training data.


The Text-To-Speech service by https://vtts.xyz is the perfect choice for anyone who needs an instant human-sounding voiceover for their commercial or non-commercial projects. Got a product to sell online? Why not transform your boring text into a natural-sounding voiceover and impress your customers. What about adding a voiceover to your animation or instructional video? It will make it sound more professional and engaging! Our human-sounding voices add inflections that make them sound natural, and our custom text editor makes it easy to get exactly what you want from both male & female voices, with over 30 different tones included, such as serious, joyful, and normal.


That is crazy. Any way I can start using this soon? I have a backlog of articles I’d love to listen to.


Do you convert articles and listen to them?


I do, but I don’t like many text to speech voices at the moment because they don’t always enunciate in a way that makes sense. So I’ve been looking forward to the day when I can use custom human voices to read me news articles.


I'm in the same boat. I have tried GCP and AWS, and they sound too robotic. Do you convert articles you wrote or random articles? The reason I'm asking is that I'm thinking about a new project that works like Descript and that is much easier to use.


TortoiseTTS might be the closest https://github.com/neonbjb/tortoise-tts It's a few shot multi speaker model so you need just 3-4 little clips to train new voices.


Thank you very much. This is what I was looking for.


Microsoft/Nuance has been doing great in this area. I am very impressed with TTS on Windows. It makes proofing documents that much easier. I do think there is a need for some type of markup (akin to sheet music) for supervised learning.


You could tell the difference in that the AI pronounced "Hussars" correctly whereas the human reader did not. Without adding in our human error, the AI-trained version will be the more educated one for certain going forward.


It'd be nice if we could input our own text because otherwise these things are subject to a lot of training corpus and other biases.

Sounds really good though.


This kind of stuff is going to be amazing for indie gamedevs. I want a model trained for "powerful narrator voice" and villain speeches.


I imagine that our concept of what a villain sounds like tends to be extremely personally biased but here's a couple of options [Advisory: Contains threatening language.]:

* http://www.sndup.net/p33q

* http://www.sndup.net/sppn

I created these samples in a relatively short time using the Free/Open Source (which I think is an important factor for indies) text-to-speech project Larynx & a narrative editor I finally released the other weekend:

* https://github.com/rhasspy/larynx/

* https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

Now, I would really like to link you directly to audio of the next two but considering it's currently in beta behind an (automated response) email address, I think that may not be appropriate, so, instead...

* Visit & get access to the beta here: https://mycroft.ai/blog/mimic-3-preview/

* Copy & paste this SSML into the form: https://pastebin.com/Bwd7LCbj

It's definitely a noticeable step up again in quality.

There's an alternate pair of voices if you move the "_" from one "name" attribute to the other in each "voice" element.

I intentionally didn't edit the text to remove some of the artifacts both to give a realistic impression of the current state & because sometimes they add interesting texture. :)

Note the beta voices are "low" quality.


> We train our proposed system on 8 NVIDIA V100 GPUs with 32G memory

Sounds like openly reproducing this result is within independent researchers’ reach.


Totally tangential comment:

You can click play on any/all of the samples simultaneously, resulting in a neat sonic effect vaguely reminiscent of Steve Reich's famous "Come out." [1]

[1] https://www.youtube.com/watch?v=g0WVh1D0N50 (skip to like 7 minutes in to get the idea)


I wish for the "naturalspeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise the fact that we may not be able to tell them apart by ear (in a blindfold test) doesn't tell us much about how good it is as a speech synth engine with that evidence alone.


As a daily TTS user, sometimes I'm even fine with espeak quality for system messages. But one thing concerns me more than the beauty of the voice - the ability to process mixed-language text and abbreviations. And I don't see these problems addressed in this project. (


Do you want to check how it works? You can test the operation of standard voices and advanced neural voices at this URL: https://vtts.xyz/home/tryme


I'm not sure what to make of this. The TTS output seems identical to the recording.

Why don't they use this tech to recreate some dead actor's speech, for example?


Unbelievable. This has traversed the uncanny valley and come out the other side.


This is definitely human-level quality. In fact, the synthesized versions pronounce some words better than the human. Kudos to MSFT! I think they've been in the game the longest too...

edit: is the Nuance acquisition compounding yet?


I actually think the TTS voice is better sounding than the human's voice.


To me the human voice samples have a bit of NPR in them, compared to the TTS.


Cadence still seems way off for the AI. Maybe it’s going word by word?


Very cool, but..

What's the end game here? Because I cannot use it, I cannot buy it, and this seems like more than just a scientific paper.

So what's the objective here?


I think we have reached the stage of AI development where I am no longer surprised/excited by these results by any means.


Is there any TTS engine which isn't based on English?

I'd love to be able to use an assistant device in Croatian in my lifetime.


While these Free/Open Source licensed TTS engines have a "relatively large" number of non-English voice options, neither seems to include Croatian.

* https://rhasspy.github.io/larynx/

* https://mycroft.ai/blog/mimic-3-preview/ (forthcoming, currently in beta, actively looking for feedback on non-English language quality)

From a quick look here, there didn't seem to be any Croatian open source data options either:

* https://openslr.org/resources.php

I have a vague recollection that there was at least one (I think) Eastern European country that managed to get government funding in order to support creation of local language assistive device text to speech.

So, it doesn't seem like an impossible task but certainly a non-zero amount of work to collect & process appropriate audio data.

Hope you get your dream at some point in future. (As everyone deserves assistive devices in their own language.)


Yeah.

Most of Ex-Yugoslavian languages (The southwest Slavic group) are basically the same. Much along the lines of US vs UK English. Slovenian and Macedonian are also quite understandable.

Hopefully advances in AI will allow for a more general approach which will accelerate the development of these technologies.


I have the same issue with other languages. One thing that I find fascinating is that while all the western TTS and language understanding frameworks allow only one language at a time, the Chinese ones happily do Chinese and English at the same time.

I recently tried out the open source android TTS engines and they don't seem all that great even though there's been years of development on them.

Can anyone that knows comment on what the complexity here is?


As a native Japanese and sort-of English speaker, I need to pause and flush my audio pipeline to switch between languages. Konnichiwa spoken in the middle of an English sentence and こんにちは in native Japanese are separate expressions and have to be pronounced differently. From my experience there likely isn't a single unified language model inside a human head, and if so it's not a surprise to me that one cannot be effortlessly made for computers.


Wow, this is the first speech synthesis I've seen on here where I thought I was listening to a human at first.


Is this available in other languages yet?


[Take my money meme] I want this for articles and books now, please


Sounds nice. I'm interested in building a business based on TTS.


What the fuck is end-to-end text? My Bullshit-O-Meter is off the charts. End of what? I only know end-to-end encryption.


Nothing BS about it. In ML, end-to-end means feeding raw data (e.g. text) to the model and getting raw data (e.g. waveform audio) out.

This is in contrast to approaches that involve pre- and postprocessing (e.g. sending pronunciation tokens to the model, or models returning FFT packets or TTS parameters instead of raw waveforms).

It's a common and well understood technical term in this context.
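A toy illustration of the shape of the two approaches; every "model" here is just a stub that labels its output, none of this is a real library:

    # Toy illustration of the distinction; each "model" is a stub that merely
    # labels its output, purely to show the shape of the two approaches.
    def classical_tts(text: str) -> str:
        phonemes = f"phonemes({text})"        # hand-designed text front-end
        mel = f"mel_spectrogram({phonemes})"  # acoustic model output
        return f"waveform({mel})"             # separate vocoder stage

    def end_to_end_tts(text: str) -> str:
        return f"waveform(model({text}))"     # one model: raw text in, raw audio out

    print(classical_tts("hello"))
    print(end_to_end_tts("hello"))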


Can someone please upload the results? On https://paperswithcode.com/sota/text-to-speech-synthesis-on-...



