Lyrebird – An API to copy the voice of anyone (lyrebird.ai)
1401 points by adbrebs 34 days ago | 298 comments

Combined with Face2Face[1] live video impersonation, it is truly time to be very careful verifying videos or even live streams.


Without a doubt, our concept of personal identity will be completely unreliable within a few generations. Forget about privacy--we will soon have literally no way to verify who we're talking to.

Crypto would still work, and this tech isn't going to work face-to-face.

Neither will insulate you from a deception which you wish to perpetuate upon yourself, and identifying the latter is a trick that con artists specialize in.

Pelevin's novel 'Generation П' is a very interesting read on this kind of theme.

[0] https://en.wikipedia.org/wiki/Generation_%22%D0%9F%22

No different than now, right? Technically there is no way (practically, anyway) to identify someone you're talking to over the phone for instance.

If you have privacy, faces or sounds might not matter as much as content does - if you have common secrets, you have a way to identify a person.
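
For what it's worth, the "common secrets" idea can be sketched as a challenge-response check: one side sends a random nonce, and the other proves knowledge of the shared secret by returning an HMAC of that nonce, so the secret itself is never spoken aloud. This is only a minimal sketch using Python's standard `hmac` and `secrets` modules; the function names and the example secret are made up for illustration, and it assumes the secret was agreed over a channel you already trust.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Verifier side: generate a fresh random nonce for each conversation."""
    return secrets.token_bytes(16)


def respond(shared_secret: bytes, challenge: bytes) -> str:
    """Prover side: prove knowledge of the secret without revealing it."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).hexdigest()


def verify(shared_secret: bytes, challenge: bytes, response: str) -> bool:
    """Verifier side: recompute the expected answer, compare in constant time."""
    expected = respond(shared_secret, challenge)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    secret = b"agreed in person beforehand"  # hypothetical shared secret
    challenge = make_challenge()
    # A caller who knows the secret passes the check.
    print(verify(secret, challenge, respond(secret, challenge)))
    # An impersonator guessing the secret does not.
    print(verify(secret, challenge, respond(b"a guess", challenge)))
```

Because the nonce changes every time, replaying a recorded answer (or a synthesized voice reading an old answer) would not help an impersonator.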

To my knowledge, both of these particular projects are still a ways away from being used in any practical sense, let alone succeed at deceiving anyone.

You are right that we'll have to worry about this soon though. Likewise for verifying the identity of people we think we're talking to over video calls, for example.

I would use this (the audio tech from the OP) for some edge cases in film production right now. It would also be easy to combine this with Twilio and a chatbot to scam people over the phone.

Woah, reminds me of Total Recall for some reason... looks like a special effect from the 80s when actual speaking occurs, but it's very close!

Okay, so on an ever so slightly related note, I've always wondered this ever since I saw that movie as a kid.

...Is it normal to feel bad for the Johnnycab "driver" when Arnie destroys it?

Woah that's kinda scary. What could we do to determine if a video is legitimate or not?

Mainly, practice critical thinking. Don't take anything at face value until it has been reconfirmed from many sources. At least that's what I do.

I love critical thinking as much as the next person, but I always find statements like this to be smug and self-congratulatory cliches. Of course you take things at face value, we all do. Every waking hour we're getting new information and having to make sense of it, while still living our lives. It's not practical for anyone to pretend that every interaction can be rigorously confirmed and independently verified, which means proliferation of convenient, effective mechanisms for lying and deception should be of real concern to all of us. No one is such a great critical thinker that they're immune, and it's particularly dangerous when our few reliable avenues of verifying identity and provenance are about to be cut off.

The problem isn't about the ones who already critically think.

This is going to be increasingly key. And even then, it will be very difficult.

Books like "Trust Me, I'm Lying" reveal the lengths to which deception can go. Though this book discusses deception that starts at the textual level (e.g. blogs), it is inevitable that these tactics will be translated to the video level once the technology catches up.

Also, "at face value" - Ha! ;)

just wait till _the daily show_ and _last week tonight_ get a hold of this!

..then they'll finally be able to play audio of republicans contradicting themselves! :p

No need for that, they're already doing fine.

That's the joke

Awesome... Not sure if the voice thing can be done in realtime yet, but you're right... the combination of these two would be awesome

Holy shit this is crazy!


You were likely down-voted more because your first comment doesn't add anything substantive to the discussion rather than for the language you used. As the guidelines ask, please don't comment on being downvoted, as it makes for boring reading. And doing so in the manner you did is definitely uncalled for.


If you are referring to the word "shit," it is not forbidden here and is not likely the reason you were downvoted. I have a terrible potty mouth. I try to keep it PG-13ish online, but if I am tired or something, the way I actually talk tends to come out. My tendency to use the F word like other people use "very" does not appear to be in any way problematic per se.

I suggest you rethink your assessment of what is happening here.

Last week on BBC Radio 4 I heard of a woman who was losing her voice through disease (MND maybe?), a similar system was being anticipated and she was saving voice samples to seed it with.

She had been a singer and strongly identified herself with her voice; she wanted to be able to use a speech synthesis system that had her own voice pattern.

Apologies if this was already mentioned, but it seems to be a use others here hadn't considered.

Was this maybe CereProc? That's who helped Roger Ebert post thyroid cancer. He was a great candidate for the service as there were of course hours and hours of high quality audio recordings of his voice to use as a source.


I was just thinking that Stephen Hawking would perhaps be interested in using this to replace his current voice synthesizer (feeding in old interviews of him when he could talk). He has said that he has adopted the current voice since he has associated it with his own, but I wonder if he would prefer his old actual voice.

I think Hawking is now so firmly tied to that voice that he would probably never switch for public speaking engagements and the like.

I could see him doing such a switch for personal interactions.

Indeed, he doesn't like other voices and has always fallen back to using the same voice:

> "The voice I use is a very old hardware speech synthesizer made in 1986," he said. "I keep it because I have not heard a voice I like better and because I have identified with it."


I recall his biggest qualm with his current synthesized voice is that it did not come with a British accent. :-)

I'm not sure how sentimental he is but he does seem quite tied to that voice since there have been lots of advancements in voice synthesis since he originally got this and yet he's chosen to keep this one.

I wonder if he'd accept the same voice, but with a different accent?

Should be just as possible in the present or near future.

Actually fantastic to see there are legitimate uses for this tech beyond the obvious

This is actually quite inspiring!

If it seemed to you to be a use others hadn't considered, why would you apologize for mentioning it?

I hate it when threads are full of the same comment, I didn't diligently search to check it hadn't been mentioned already; ergo preemptive apology.

While all of these vec2speech type models are impressive, I get the feeling that most of the comments didn't listen to any of the samples. It's still distinctly robotic sounding, probably has quite a bit of garbage output that needs to be filtered manually (as many of these nets often have) and is a far cry from fooling a human.

I've been editing audio professionally for ~20 years now. This is short of perfect, but it's already good enough to be used in some kinds of editing emergencies - coincidentally exactly the ones where you're likely to have audio problems in the first place.

For example, in a movie with a busy action scene involving gunfire/helicopters/storms the production audio is usually useless because there were fans and other loud machinery on the film set to create visual illusions of powerful winds and so on. In this situation the audio is either not recorded or used only as a guide track to match timing on a high-quality track recorded later in the studio. Actors hate re-recording dialog by lip-syncing in front of a screen and producers hate paying for it. This solution is already good enough to use for incidental characters in scenes that are going to be noisy anyway. I give it about 5 years before it's good enough to use in ordinary dialog scenes - not for the intimate conversation between the two Famous Actors flirting over a romantic dinner, but fine for replacing the waiter or other background characters.

In the first clip, I'd say 80% of the soundbites were obviously robot-like, but one or two of the "Obama" quotes were startlingly clear - "The good news is, that they will offer the technology to anyone" - I can't hear anything wrong with that in the first clip at all. If they were all that quality I'd say we'd be easily fooled. As a proof of concept this is pretty big.

I can definitely hear issues with that phrase. It has quite robotic drop-offs.

Though coming soon: Neural networks to determine whether speech is NN-generated? :P

> Neural networks to determine whether speech is NN-generated?

I guess this would be an ideal use case for a generative adversarial network based approach.

This is likely part of how the speech-generating NNs are trained (ie. there's a generated-speech-detector and the network is trained to fool it, while it is also trained): https://arxiv.org/abs/1406.2661

And then the generators train their own generation NN against that :P

And you have invented Generative Adversarial Networks. They are the basis of all new ML findings, like pix2pix.
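
For reference, the adversarial setup described in this sub-thread is usually written as a two-player minimax game. The objective below is the one from the paper linked above (arXiv:1406.2661), where D is the detector (discriminator), G the generator, x a real sample, and z the generator's noise input. Whether Lyrebird actually trains this way is not stated anywhere in the thread.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to push the value up (label real speech as real and generated speech as fake), while G is trained to push it down by producing samples that D scores as real.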

In a sort of Turing Test where I don't know who's a robot, or where I'm not even expecting a robot, it would probably be a bit harder.

The "Obama" material sounded quite good, but reverb can cover up a multitude of sins...


That means that recordings of speeches/performances from concert halls are potentially suspicious.

Know of any other instances of reverb covering things up? This is interesting!

> "potentially suspicious"

History is now doomed. Crackly recordings are obviously fakeable. Children will listen to JFK's "We shall not go to the moon" speech, proof that the moon landings are a liberal conspiracy and all that grainy footage is just CGI with a noise filter.

This idea hit me harder than I expected -- I am reminded of the scene in Interstellar, where the main character's daughter's teacher asserts that we never went to the moon. I don't recall whether she genuinely didn't believe it, or whether she felt it was better to lie to the kids to keep them motivated in the present, but apparently we are getting much closer to having gatekeepers of knowledge be able to actively subvert the artifacts of history in ways which are indistinguishable to the audience. Scary stuff.

> I don't recall whether she genuinely didn't believe it, or whether she felt it was better to lie to the kids to keep them motivated in the present

She stated it as a simple fact and seemed to believe it. There wasn't any indication that she might think it was a fact of what one might call "political convenience". It made the scene just that much more chilling to me: https://www.youtube.com/watch?v=MpKUBHz6MB4

This is nothing new.. "History is written by the winners..." comes to mind...

Wasn't it the main character's teacher?

There have always been Gate Keepers to knowledge. Technology has in many ways blasted down these gates for many.

Yeah, that's a real issue. You can already fake audio very very easily. People don't realize how constructed movie soundtracks are, and I don't mean in terms of special effects but just people talking in every day situations.

I haven't worked extensively in voice adaptation, but I learned working in text-to-speech that adding a bit of reverb is quite effective at covering up artifacts.
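
A rough way to see why reverb smears fine detail: even the simplest reverb building block, a feedback comb filter, turns every sample into a train of decaying echoes, so a short glitch gets spread out over time. The sketch below is a toy in pure Python, not what any TTS pipeline mentioned here is known to use; the function name and the delay and decay values are illustrative assumptions.

```python
def comb_reverb(samples, delay, decay):
    """Feedback comb filter: y[n] = x[n] + decay * y[n - delay]."""
    if delay < 1 or not 0.0 <= decay < 1.0:
        raise ValueError("need delay >= 1 and 0 <= decay < 1")
    out = []
    for n, x in enumerate(samples):
        # Feed back an attenuated copy of the output from `delay` samples ago.
        echo = decay * out[n - delay] if n >= delay else 0.0
        out.append(x + echo)
    return out


if __name__ == "__main__":
    # A single click followed by silence stands in for a synthesis artifact.
    click = [1.0] + [0.0] * 9
    print(comb_reverb(click, delay=3, decay=0.5))
```

With these values the single click comes back as echoes of 0.5, 0.25, and 0.125 at three-sample intervals. Real reverbs combine many such filters (or convolve with a measured room response), but the smearing effect is the same in spirit.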

Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better. Not sure what's going on — whether the reverb jams the critical listening facilities in one's brain or something like that.

> Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better.

a.k.a. why your singing sounds way better in the shower.

Wow. TIL x 2!

Now I understand why I liked adding a bit of reverb when listening to old MOD/IT/S3M audio files - it covered up the "digitalness" of the song structure a bit.

Thanks for the live vocals tidbit too, that's definitely something to file away.

I wonder how far you could push that in a presentational context (ie, when giving speeches), or whether "who left the speakers in 'dramatic cathedral' mode" would happen before "I dunno what they did to the audio but it sounds great". Maybe if the presentation area was fairly open/large it could work; the question is whether it would have a constructive effect.

In my (admittedly limited) experience mixing for both recording and live settings, I can say that the "sounds great" comes a long way before the "dramatic cathedral mode". If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong. What you want is a bit of fullness, slightly softer edges at the end of words/sentences.

It's similar to the difference between 24fps cinema and 60fps home video. The video/clean signal retains more of the original information and is "more correct", but 24fps/touch of reverb adds a nuance that keeps things from getting too clinical. As to why we interpret clean signal == clinical == bad.... I can't really speculate.

Brains are incredible differential engines not evolved to handle current technology.

A standard quality video is just a projection; a high quality video stream on a 4K set running at a full 122Hz is a weird window from which we don't get stereo depth cues. The brain constantly has to rethink that it's not real as we shift the head and the POV doesn't adjust.

I tried an Oculus a while ago and found it to be quite unrealistic, digital and "fake." (And I only had it on for about 30 seconds but my eyes felt a bit sore afterwards!)

Once LCD density allows for VR with 4K (or, if needed, 8K) per eye... yeah :) we'll firmly be in the virtual reality revolution.

Obviously we'll also need tracking and rendering that can keep up but display density is one of the trickier problems right now.


> If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong.

I was thinking precisely that; I figured it'd need to be subtle and just-above-subliminal to have the most effect.

Completely agree about the 24fps-vs-60fps thing. I think this is a combination of both the fact that the lower framerate is less visual stimulation, and that I'm used to both the decreased visual stress and the overall more jittery aesthetic of 24fps.

Regarding >24fps, I think how it's used is critical.

I remember noticing https://imgur.com/gallery/2j98Y4e/comment/994755017/1 (yes, a random imgur gif - discovering that imgur doesn't have an FPS limit was nice though). I think this particular example pushes the aesthetics ever so slightly, but still looks pretty good.

I don't know where I found it but I remember watching a 48fps example clip of The Hobbit some time back. That looked really nice; I completely agree 48fps is a great target that still retains the almost-imperceptible jitter associated with 24fps playback.

To me the "nope"/sad end of the spectrum is motion smoothing. I happened to notice a TV running some or other animated movie with motion smoothing on while in an electronics store a few months ago... eughhh. It made an already artificial-enough video (I think it was Monsters Inc University) look eye-numbingly fake (particularly because the algorithm couldn't make its mind up about how much to smooth the video as it played, so some of it was jittery and some of it was butter-smooth). I honestly hope the idea doesn't catch on; it'll ruin kids and doom us to having to put up with utterly unrealistic games.

But I can see that's the direction we're headed in: 144Hz LCD panels are already a thing, and VR has a ton of backing behind it, so it makes a lot of sense that VR will go >200Hz over (if not within) the next 5 or so years.

The utterly annoying thing is that raising the framerates this high almost completely removes the render latency margins devs can currently play with. Rock-steady 60fps (with few drops below ~40fps) is hard enough but manageable on reasonable settings in most games nowadays (I think?), but when everyone seriously starts pining for 144fps+ at 4K, it's going to get a lot harder to keep the framerate consistent - now that we've hit ~4GHz, Moore's law won't allow breathing room for architectural overhead as has been the case for the past ~decade, and with current system designs (looking holistically at CPU, memory, GPU, system bus, game engine) we're already pushing everything pretty hard to get what we have.
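
To put numbers on the shrinking margin: the time available to render one frame is just 1000 ms divided by the target framerate. The helper below is a trivial sketch (the function name is made up) that prints the per-frame budget at a few of the rates mentioned in this sub-thread.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for fps in (24, 40, 60, 144, 200):
        print(f"{fps:>3} fps -> {frame_budget_ms(fps):6.2f} ms per frame")
```

At 60 fps that is about 16.7 ms per frame; at 144 fps it is about 6.9 ms, so everything (simulation, draw calls, GPU work) has to fit in well under half the time.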

So that problem will need to be solved before 144fps+ becomes a reality. A friend who has a 144Hz LCD says that going back to 60Hz just for desktop usage is really hard because the mouse is more responsive and everything just "feels" faster and more fluid. I'm not quite sure whether the games he plays keep up with 144fps though :P

On a separate note, I've never been able to make the current crop of 3D games "work" for my brain - everyone's pushing for more realism, more fluidity, etc etc, and it just drives things further and further into the uncanny valley for me, because realtime-rendered graphics still look terribly fake. Give me something glitchy and unrealistic in some way any day.

I think what it does is mask/smear the fine detail - the texture of the sound - but in a way that we are used to, so still sounds natural.

Adding a bit of autotune has made whole careers...

Also - lowering the bit-rate can cover up other defects (e.g. phone call).

The cadence was a bit off/unnatural, but I'm sure that is not too hard to fix. Phone-in TV/Radio/web shows are about to get very interesting.

Most of them are extremely mechanical, to the point where it's almost impossible to understand, but others are actually quite convincing.

I think it primarily needs to learn to respect punctuation, and to translate them to a breathing pause that matches the target voice ("President having speech"-style long pauses vs. "Politician having their ass handed to them by journalist on TV"-no-air-needed pauses).

Absolutely - listening through the multiple samples with different intonation from both Obama and Trump, some of the samples are much more realistic, while others come off as robotic.

Maybe it would be possible to train the system to prefer certain intonations in certain cases by rating the realism of the speech in context. It would be interesting to analyze pauses around words grouped by word2vec! Or choosing a "style" of intonations based on punctuation, parameters like words/minute, etc.
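
As a toy illustration of picking a pause "style" from punctuation, here is a sketch that splits text at punctuation marks and attaches a pause length to each chunk. Everything in it is an assumption for illustration: the style names, the millisecond values, and the function itself are invented, and it says nothing about how Lyrebird handles prosody.

```python
import re

# Hypothetical pause lengths in milliseconds, per speaking style.
PAUSE_STYLES = {
    "formal_speech": {",": 400, ";": 500, ":": 500, ".": 900, "?": 900, "!": 900},
    "heated_interview": {",": 80, ";": 100, ":": 100, ".": 200, "?": 200, "!": 200},
}


def plan_pauses(text: str, style: str) -> list[tuple[str, int]]:
    """Split text into chunks at punctuation and attach a pause to each chunk."""
    pauses = PAUSE_STYLES[style]
    plan = []
    # Each match is a run of non-punctuation text plus an optional trailing mark.
    for chunk, mark in re.findall(r"([^,;:.?!]+)([,;:.?!]?)", text):
        chunk = chunk.strip()
        if chunk:
            # Chunks with no trailing punctuation get no pause.
            plan.append((chunk, pauses.get(mark, 0)))
    return plan


if __name__ == "__main__":
    sentence = "My fellow citizens, thank you. Any questions?"
    for style in PAUSE_STYLES:
        print(style, plan_pauses(sentence, style))
```

A learned system would presumably infer these durations from data rather than a lookup table, but the table makes the "same text, different breathing" contrast easy to hear if fed to a synthesizer that accepts explicit pauses.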

I personally like this sample the most [0]. Note that these samples are not cherry-picked: having worked with closely related algorithms [1], I can say that once it is trained well it pretty much "just works".

There is a lot of room for DSP/hacks/tricks to improve audio quality - just the same as in concatenative systems, but the point of this demo is to show what is possible with raw data + deep learning. Also note that this is (as far as I am aware) learned directly on real data such as youtube, or recordings + transcripts. That is quite a bit different than approaches which require commercial grade TTS databases, which are generally professional speakers with more than 10 hours of speech each, and cost a lot of money.

[0] https://soundcloud.com/user-535691776/special-guest-at-iclr

[1] http://josesotelo.com/speechsynthesis/

Very impressive I felt. Though there were robotic artifacts here and there (well, every several seconds), much of it was fairly natural and, at lo-fi quality, reasonably convincing. There was an HN post a few weeks ago from Google about their most recent TTS algorithm, which was the best I've heard I think, quite human (and its techniques were published and could be integrated into Lyrebird).

you got a link anywhere?

I missed that one, and it's the best I heard so far. Bit of noise in the background, but the voice itself is very believable. I can imagine in a year or two the radio and TV will start using this tech instead of actual speakers. Also, translation and synchronization of any content will be so much cheaper.

Try listening to the NOAA weather radio. They've been using very good TTS for years.

Wow, they are. Very surprised.

This is so good I'm wondering whether this is actually a massive (massive) sound bank. I think it might be a sound bank.

A random radio receiver site I found that doesn't require Flash, tuned to NOAA for Akron OH: http://tunein.com/radio/NOAA-Weather-Radio-1624-s88289/

The list I got the above link from (^F "noaa"): http://tunein.com/radio/Weather-c100001531/

I suspect the warbling I'm hearing is not due to TTS imperfections but 64kbps artifacting.

The one I listen to will occasionally mispronounce something in a way that a human never would, or say the name of a punctuation mark.

Huh. I see. Do you happen to know what station you use? I'd kind of like to hear this for myself (for the sole reason that I'd like to get an idea of what it sounds like, since that does definitely sound like a TTS).

KZZ40, 162.45 MHz, Deerfield NH. Note that the stations have several different voices they use for different reports. Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.

BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)

> KZZ40,162.45 MHz, Deerfield NH.

I see. I can't seem to find an online receiver for that frequency, although I did find that WZ2500 uses or seems to have used that frequency (for Wytheville VA).

I had a look at SDR.hu (a site I may or may not have just dug out of Google for the first time), but unfortunately the RTL-SDR receivers I can find seem to focus entirely on 0-30MHz. There are a couple ~400MHz receivers but nothing for ~160MHz.

(I may have fired up the receiver I found in NH and fiddled with it, puzzled, for 10 minutes before realizing the scale is in kHz, not MHz... yay)

> Note that the stations have several different voices they use for different reports.


> Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.

That's entirely possible. (But hopefully not. I kind of want to hear. :P)

> BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)

I just learned about this service, I have to admit (I'm in Australia). It sounds really nice to be able to have a computer continuously read out the weather conditions to you as they change. And I can completely relate to the idea of preferring the voice that sounds unimpressed when the weather's bad :D

It really is a useful service. Many people have battery powered radios that include AM, FM and weather radio.

I'm not surprised that they re-use the frequencies. These are local weather stations, only intended to serve a radius of a hundred miles or so (at least here on the east coast). In addition to my local station I can receive the one in Boston, about 50 miles south of me.

Do we know what they're using? I found a few references to AT&T Natural Voices.

According to Wikipedia, "In 2002, the National Weather Service contracted with Siemens Information and Communication and SpeechWorks to introduce improved, more natural voices. The Voice Improvement Plan (VIP) was implemented, involving a separate computer processor linked into CRS that fed digitized sound files to the broadcast suite.... Additional upgrades in 2003 produced an improved male voice nicknamed "Tom", which could change intonation based on the urgency of a product"

Also, here are some audio clips http://www.nws.noaa.gov/nwr/info/newvoice.html

It doesn't sound too different from a voice coming over a walkie talkie or some kind of intercom.

The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.

The main problem is that the algorithms don't yet know what to stress in a sentence. The problem is semantic, and not so much about the sound of the voice itself.

You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.

Then again, that's probably easier to work around by having humans annotate the sentences to be read.

> Then again, that's probably easier to work around by having humans annotate the sentences to be read.

Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)

Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:


pragmatic*, placing stress is less a problem of word meaning than it is of speaker adaptation for listener comprehension, emphasis, and prosodic tendencies.

Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they are trained on very few samples ("less than a minute", they claim), which they also advertise because it makes their results more impressive in some sense. Triphones are usually what speech systems are trained on. The number of triphones (3-phoneme-grams) needed to cover a language's phonemic inventory is huge (50 phonemes = up to 50^3 = 125,000 triphones, which could mean a few hours of audio, although many will not occur in the language given its phonotactics).
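
A quick sanity check on the triphone count: with an inventory of P phonemes, the number of ordered three-phoneme sequences (repeats allowed) is P^3. That is an upper bound, since phonotactics rules many of them out. The sketch below just does that arithmetic; the 50-phoneme figure is the round number used in the comment, not a measured inventory, and the tiny inventory is made up.

```python
from itertools import product


def max_triphones(num_phonemes: int) -> int:
    """Upper bound on distinct triphones: ordered triples with repetition."""
    return num_phonemes ** 3


if __name__ == "__main__":
    print(max_triphones(50))  # 125000

    # Cross-check by brute-force enumeration on a tiny made-up inventory.
    tiny_inventory = ["a", "k", "s", "t"]
    triples = list(product(tiny_inventory, repeat=3))
    print(len(triples) == max_triphones(len(tiny_inventory)))  # True
```

Even as an upper bound, 125,000 contexts is far more than a minute of audio can cover, which fits the point about sparse training data.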

Sure, it's not clean and crisp but if you ignore the distortions and the somewhat arbitrary intonation it's pretty close to the real thing. And I would argue that it's already good enough to fool grandma for a phone scam (but then again scammers currently get away with practically no resemblance at all).

Scammers have much more success phoning Granny and claiming that they are Windows Support or "The Internet" and simply asking for what they want. There are far easier ways to con than this, and for very sensitive scams where you need spear phishing of this order, then this will fall flat on its face. "Good enough" doesn't quite cut it when it comes to this.

And fwiw, I think the intonations are actually impressively learned and not random. Trump's odd yet distinct intonations are captured quite well in their samples.

I think the editing manually is something people with resources and motives will do anyways.

I can see something like this, if not already used, for propaganda.

Tin Foil hat is on

I imagine that even once it has been improved to the point where a human could not tell the difference, it would still be possible to train a simple AI to tell which ones were non-human, never mind the original speaker.

It might just be about the quality of the source. It seems they used public speeches. Obama sounds like he stands in a stadium. Trump is closer like a TV debate. Hillary sounds off to me, though.

Hillary sounded very off, but what is new? Joking aside, Obama didn't sound too bad, but the Hillary voice was bad enough I wouldn't have included it in a demo.

Depends if you want to fool a person or a computer.

I appreciate the ethics link up there in the menu. Not sure if I noticed it on any other AI startup (or for that matter, any startup). Given how complex the world is becoming due to ever increasing co-dependence with tech, I can see how such pages could become as important as 'pricing' or 'sign up' pages. (The privacy issues with Unroll.me, Uber and a thousand other such services will only accelerate this trend).

Good job, team Lyrebird. My feedback is that while the inclusion of ethics page is great, it could do with more content on your vision and what you will not let your tech be used for. I know others can develop similar tech, but it will be good to read about YOUR ethics.

[Edited for clarity]

I agree, it is reassuring to see that the team is thinking about ethical implications.

Judging by the samples from the homepage there are audible artifacts in the recordings resulting from synthesis. I doubt these would pass scrutiny if presented as evidence in court. In some ways forging a voice is like forging a signature, truth can be exposed with enough effort.

> but it will be good to read about YOUR ethics

Not just that, but ethical expectations on the users, backed up by legal policy, would seem important for this.

I love this. The business model is too good to be true.

1. Open source voice-copying software

2. At worst, create entire market of voice-fraudsters, at best, very few voice-fraudsters but very high and very real perception of fear of such

3. Become leading security experts in voice fraud detection

4. Sell software / time / services to intelligence agencies, governments, law enforcement, news networks

Ethically I'm a bit concerned with (2), but realistically the team is right --- this technology exists, it will certainly be used for good and for bad, and they're positioning themselves as the leading experts.

I'm interested to see which VCs and acquirers line up here. Applying a voice to any phrase seems useful for voice assistants (Amazon Alexa, Google Home) but I don't think that's the $B model.

You could charge 99 cents to have Siri talk in your favorite actor's voice.

Funny thing is, this is approximately where the CIA was with similar technology closer to 2000. They did some demos for politicians showing how, given anyone's voice, they could fake messages from them. That stuff is golden for propaganda purposes, and for confusing things like military chains of command. Today the CIA has probably worked out all the robotic artifacts already, and their output is really indistinguishable.

> Funny thing is, this is approximately where CIA was with similar technology in closer to 2000


Not OP, but here is one related source: http://www.washingtonpost.com/wp-srv/national/dotmil/arkin02...

I do not think the technology involved artificially generated voice though, but simply morphing someone's voice to sound like the target voice.

NNets just recently got really good. You are correct though, politicians would love this.

I believe technology would make a Judge's life really hard.

Clearly the solution is to employ a GAN, so that we simultaneously get artificial voices that are indistinguishable to the human ear, as well as judges that are able to reliably distinguish them.


This is pretty cool (although, I have no idea what other technologies exist for this kind of thing), but it's definitely not convincing enough to a human listener. This sounds like it might be convincing enough for some programs like "Hey, Siri" but it's not gonna convince your mom. You can listen to the samples on the page linked here and you can immediately tell that Obama and Trump don't sound quite human.

Well, the question is, do they just need to throw more computational power / training at this algorithm or is that the peak of their implementation?

This is something Google has been working a lot on [1] and Baidu also recently posted about their results too [2]. We're definitely pretty close to passing the human detectable level.

[1] https://deepmind.com/blog/wavenet-generative-model-raw-audio...

[2] http://research.baidu.com/deep-voice-production-quality-text...

Google and Baidu have only demonstrated single speaker TTS. Lyrebird's the first to demonstrate being able to generate arbitrary voices. Since this came out of a research lab, I would guess that the quality would only improve if they are given more compute and data.

The Google one is able to generate arbitrary voices based on the recordings used for training. So much so that they made it generate piano music.

Maybe. However, most CGI (in big Hollywood productions with almost no budget limit) is still very detectable, so much so that the biggest productions try to do as much as possible with real props and real actors instead of CGI (cf. interviews about the last Fast and Furious movie).

The human mind seems to be better at this than most creators credit it for.

I believe what makes the voices robotic is how little audio they need to generate a "usable" voice from the system.

Speech models usually use triphones, which turn out to require a huge amount of audio. This is particularly impressive because of how little data they need.

Google used their own datasets, which are most likely massive.

It might become more convincing if audio engineers would edit the results to hide artifacts and make it sound more natural.

Devil's advocate: the noise masks the distortion, which is the giveaway.

Text to speech is still pretty distinguishable as not-human, and that seems like an easier problem (only has to work for one specific voice, not an arbitrary voice). So just on that basis alone I wonder if this isn't still a ways out.

Interesting thought: is it easier or more difficult to make a synthetic voice indistinguishable from a human one, compared to producing speech copying a real voice?

To me, at least, the voices I heard in Lyrebird's demo actually sounded more 'real' than Microsoft Sam for example.

Of course, the voices produced by Lyrebird sound more "real" than Microsoft Sam, since I would define realness as sounding humanlike. However, I would strongly prefer Microsoft Sam over something generated by this algorithm for general use because this algorithm produces voices that are still in an uncanny valley, because it is almost human but not human enough, whereas Microsoft Sam is obviously not human.

I would argue that it depends on the sample rate. Over the telephone there are several TTS voices that are very convincing because the audio quality is lower.


This is pretty basic at the moment and it's terrifying. Yeah, it has an MS Sam feel to it, but as the tech improves and we know it will, you could use a service like this to put words in someone's mouth. Think about how you could trip up a CEO or a politician by playing some random clip of something they never said. When that gets into the Zeitgeist, judgments will be made in the court of public opinion devoid of facts or real evidence. You could destroy democracy or people's lives with technology like this.

I actually have somewhat of an opposite opinion on this. As HN readers and being "in" the cutting edge front of tech, we know that things like this are possible (I first learned of this seeing Adobe demo it a while ago), but this is not mainstream knowledge yet.

The sooner we can get to a point where everybody knows stuff like this (voice impersonation) is possible, the sooner we can avoid real damages (of courts mis-judging with an impersonated voice recording as accepted evidence).

Yes, we lose an entire area of evidence that can be used in court (all voice recordings, possibly), but the tech was going to get here sooner or later and it was going to be a problem we'd have to deal with. I'd rather be at a place where everyone knows voice recordings are unreliable, than actually having harm done because of impersonated voices because people didn't think it was possible.

Avoiding court misjudgment is reasonably possible.

How we're going to fight against people believing whatever sound bites from fake news they want to believe is a harder question...

I don't think the tech will improve fast. I've been watching speech synthesis since the 80s, and progress hasn't accelerated over that time.

Speech synthesis is one of those 90% problems - when you're 90% done, you find you only have 90% left to do.

This level of synthesis is relatively easy. Getting to the "Can reliably pass for the real thing" level is going to take a huge amount of extra work.

It's not even about computational power - it's about the sophistication of the models, and their ability to parse words into phonemes correctly with some knowledge of social and linguistic context.

"Good enough for some applications" - like phone switchboard systems - is a simpler problem. Virtual impersonation is very much harder.

I was pretty impressed by fake Obama's voice. Obviously it doesn't stand up to close scrutiny, but I think if I heard it playing in the background, I could be fooled. And the biggest giveaway was occasional weird intonation rather than the timbre of his voice. All they have to do is make it to where you say a sentence, and it matches your intonation with the other person's voice.

I think you overestimate the complexity and required work to get to virtual impersonation. This will be a problem sooner than you think.

There are human impersonators already. I suppose it's not that easy to fake a visible, high-ranking person for long.

Individual impersonators are not the threat. It's the glut of impersonators that will present the real challenge. It would be very helpful to see a study done with these platforms as they mature to determine what percentage of the population is more easily fooled by these.

For example, as an individual with hearing problems, I may not be so easily able to distinguish a synthesized recording from an actual recording - for a short period of time. With longer recordings it may become more obvious.

Yes, but imagine a human impersonator who has infinite time to take requests from anyone and generate free recordings of any person with a substantial online audiovisual presence.

Bad joke of the day: Even Trump can't do it, and he really is President of the U.S.!

This is exciting! If you look at historic speeches (e.g. from American Rhetoric, http://www.americanrhetoric.com/top100speechesall.html), there are large variations in average characteristics between various styles/contexts (on average, pitch/volume/speed are different for inspirational vs somber speeches, for example). But there are also really large differences in the variation - an inspirational speech may be marked by large swings from quiet, reflective pieces to booming, rousing calls-to-action while a somber speech has fewer swings in delivery.

For the examples given for various intonations from Obama/Trump, some intonations are much more natural than others. It would be interesting to decide how to parametrize a sentence for the intended intonation (based on word2vec analysis of the words in the sentence, punctuation cues in the sentence, and perhaps a specified category of "emotional delivery").

It would be interesting at the sentence-level, but also at the macro speech-level to include the right "mix" of intonations for a specific context. On a related note, it would be interesting to study the patterns of intonations in successful vs unsuccessful outbound sales calls, for example, to learn how to best simulate a good human sales voice.
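To make the "average vs. variation" distinction concrete, here is a minimal Python sketch. The pitch contours are invented numbers, not measurements from any real speech, and `prosody_stats` is a hypothetical helper name; a real system would first need a pitch tracker to produce the per-frame values.

```python
from statistics import mean, pstdev

def prosody_stats(pitch_hz):
    """Return (average pitch, pitch spread) for per-frame pitch values in Hz."""
    return mean(pitch_hz), pstdev(pitch_hz)

# Invented per-frame pitch contours (Hz), purely illustrative.
inspirational = [110, 180, 120, 210, 115, 230]  # large swings in delivery
somber = [118, 122, 120, 119, 121, 120]         # nearly flat delivery

for name, contour in [("inspirational", inspirational), ("somber", somber)]:
    avg, spread = prosody_stats(contour)
    print(f"{name}: mean={avg:.1f} Hz, spread={spread:.1f} Hz")
```

The same two numbers could be computed for volume or speaking rate; the spread is what separates the "large swings" style from the flat one, even when the averages are close.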

Are there any copyright protections for a person's voice? If not, David Attenborough and Morgan Freeman will be lead voice actors in my next game project.

The only voice I can think of that I'm certain could have this kind of protection right now is actually Majel Barrett-Roddenberry's. She apparently did a lot of voice recording specifically to let this kind of thing happen. What I can't fathom is what would happen if something like this was used to mimic the voice of someone who hadn't agreed to such a thing.


The NPR Planet Money podcast did one of their best episodes about this, called Frank Sinatra's Mug:


There's a transcript here:


When you mentioned Frank Sinatra it made me wonder what would this AI make of uploading singing instead of speaking?

There was an advert on TV in the UK and Ireland for an insurance company called MoreThan. They had a Morgan Freeman impersonator who ended the ad by saying "I'm Morethan Freeman". I seem to recall a legal kerfuffle but I can't find anything online about it.

One of the ads: https://www.youtube.com/watch?v=kpPzcAseU8E

If nothing else the source material you use would likely be recordings that others made, so you will not own the copyright on the source material and with how copyright works the synthesized recordings might be considered derivative works of the source material. IANAL, TINLA.

So if you are famous enough to have your voice seen as distinct, you are already screwed... interesting.

I don't think you can claim that your game was voiced by either of them, but I don't see how using this would be any more infringing than using a tuned synthesizer.

Their lawyers > your lawyers though

He probably won't be able to claim the game is "voiced by", but maybe he can get away with saying it features "voices of"?

Actually I wasn't planning on saying either, just having the calm voice of Morgan Freeman tell me I need to shoot the zombies or whatever the game will be, and then David explaining with fascination how such a poor shot could have survived this far whenever you miss.

It's not copyright, but a 'right of publicity.' It's an interesting area of law, although I don't feel like doing a write-up right now. Basically you can't use a celebrity's likeness to imply endorsement; otherwise you'd see a lot more advertising with cartoon versions of famous people. This technology won't affect courts very much as it's just another kind of likeness.

AFAIK not copyright per se in the traditional sense, but there is a "likeness right".

This. I would assume copying a voice pattern is treated no different from copying an appearance. If a 3D model is a very close approximation to a particular person's face and you use it in a game intentionally to benefit from that person's likeness, you can run into legal issues. And of course you really open yourself up for a lawsuit if you then attach the real person's name to it.

EDIT: See the NPR Planet Money podcast linked elsewhere in the comments. Apparently there are fairly specific laws to protect someone's likeness -- including their voice -- thanks to the entertainment industry and Frank Sinatra.

Look at Crispin Glover's lawsuit for Back to the Future II as well.

Tom Waits sued for this reason multiple times [1].

I guess he had a strong case because in all three cases he was (or so he claimed) approached to do a commercial and the advertisers went with a Waits-like surrogate after he declined.


I wouldn't try it without heavy modification of the voice. Since for example a Morgan Freeman earns money with his voice, you've effectively robbed him of income (even though there's virtually no chance that he'd have participated).

Tribute bands seem to get away with it.

I was thinking of creating a large corpus of my voice, as well as a few of my friends' voices, and licensing it under Creative Commons or the public domain. Perhaps we can get an early start on making this technology familiar and acceptable before lawyers start to shut us down.

It's personality rights. You cannot copyright a naturally existing thing like a face or a voice.

In the UK there's a common law thing called "passing off" that's used to protect unregistered IP from impersonation. It's already used to protect voice actors' IP from unauthorised use.

I'm pretty sure "passing off" requires the seller to be fraudulently claiming the goods are the goods of someone else, so that if you say up front "the voices used are generated by computer algorithm and do not represent any real person", a claim of passing off would be rendered moot.

Trademark/Copyright can't be disclaimed in this way but Passing Off requires active deception AIUI?

Well no.

You're right that being clear to avoid confusion is a good preventative measure, but you're wrong that intent to defraud is required. It's enough that the public is (or is likely to be) confused.

See Reckitt v Borden (1990) judgement referring to the three-part test for claims of passing off (my emphasis):

"Second, he must demonstrate a misrepresentation by the defendant to the public (whether or not intentional) leading or likely to lead the public to believe that goods or services offered by him are the goods or services of the plaintiff"

How is this not vulnerable to the "Mickey Mouse animated feature film but drawn by 6th graders"?

I'm not quite following your question?

It might be worth noting Mickey Mouse is a registered trademark and so doesn't need to use the weaker Passing Off law?

Copyright was designed to protect artistic expression. A song can be protected. A performance can be protected. A playing style cannot.


But also enabling the next gen of "Mom, I'm in Mexican jail. Quickly wire me $2,000 so I can get out." scams.

And Black mirror's "would you like to speak with your dead husband again?"

Forget dead husbands, with this tech, it will be hard to trust anything a politician said. Basically, once they master adding this to video, ANYTHING could be construed against anyone.

Want a video of a politician saying "Hitler was right" to cheering masses? Want a video about a president saying it's time to start Nuclear War One?

You can make that.

We already couldn't trust anything a politician said! But indeed, this is the moral equivalent of how Photoshop has nearly undermined photographic evidence. That leaves video, and it's starting to succumb. Synthetic video AI politician, well, Robert Heinlein wrote _The Moon is a Harsh Mistress_ in 1966.

> Photoshop has nearly undermined photographic evidence

No it hasn't, not at all.

> No it hasn't, not at all.

What do you mean? It most certainly has.

It is much harder to look at a photograph now and say for sure it's real. Before photoshop, there were ways of manipulating photos, but they were much harder and did not yield nearly as good results. You can photoshop people into photos they were not originally in, doing things they have never done. Even Trump is having his hands enlarged in photos.

Maybe you're speaking specifically about "evidence" in the legal sense. I can't speak to that. But there are countless examples where you can't (and shouldn't) believe what you see, because it's not real.

Certainly it raises the bar for skepticism, but I can think of relatively few cases where a photoshopped image has driven a news cycle/key event because someone famous/important believed it to be real.

Most doctored images/videos get snuffed out by the media very quickly if they grow viral.

So because you can think of relatively few cases where key figures were duped, it therefore must not be a problem? That hypothesis fails to account for the rest of humanity.

I have a friend who believes in aliens. They showed me a video that someone had put together of "footage". They were seriously showing it to me as evidence, to convince me that they are real, based on this video. I was shocked they were so serious - and having a hard time containing my laughter.

I later found a number of other videos debunking the videos and clips that video was based off of - but my friend still believes in aliens, based on those videos. And this is someone I regularly have "intelligent" conversations with.

This change, the ability to put words in someone's mouth, is the next photoshop. And it WILL have consequences, good and bad. And we are not prepared for that, as a society - we haven't gotten over photoshop yet.

From the "Ethics" section of the Lyrebird site:

"Voice recordings are currently considered as strong pieces of evidence in our societies and in particular in jurisdictions of many countries. Our technology questions the validity of such evidence as it allows to easily manipulate audio recordings. This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.

By releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks. We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible. More generally, we want to raise attention about the lack of evidence that audio recordings may represent in the near future."

I'm glad the authors addressed this issue pretty forthrightly, but part of me wishes they'd written a bit more about exactly your point. Whether or not recorded speech will continue to be legally binding evidence, I think it's just as important to point out that many people are normally quite happy to take what they hear as solid evidence, especially when it aligns with their prejudices.


This is your first and only post, you have no submissions or favorites, and your account is 193 days old.

I'm very curious (and perplexed) as to why you have linked a video from elsewhere in this thread with no supporting context regarding its relevance other than "hf".

I can't wait to hear the new cassetteboy songs.

In the future: PGP signed speeches.

You joke, but I strongly suspect Authenticated Personal Trust is absolutely going to have become a thing.
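For what it's worth, the sign-then-verify flow behind "signed speeches" is short. This is only a sketch: PGP would use a public-key signature, whereas the code below swaps in an HMAC with a shared secret, because that is what the Python standard library offers. `sign_recording` and `verify_recording` are hypothetical names, and the byte string stands in for a real audio file.

```python
import hashlib
import hmac

def sign_recording(audio_bytes: bytes, key: bytes) -> str:
    """Return a hex tag binding the recording to whoever holds the key."""
    return hmac.new(key, audio_bytes, hashlib.sha256).hexdigest()

def verify_recording(audio_bytes: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_recording(audio_bytes, key)
    return hmac.compare_digest(expected, tag)

# Stand-in data: a real use would read the bytes of the audio file.
key = b"shared-secret-assumed-for-this-sketch"
speech = b"raw audio bytes of the speech"
tag = sign_recording(speech, key)

print(verify_recording(speech, key, tag))                # genuine recording
print(verify_recording(speech + b" edited", key, tag))   # tampered recording
```

Note the limit of the shared-secret stand-in: anyone who can verify can also forge, which is exactly why a public scheme for speeches would need real public-key signatures instead.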

And any genuine recording can be plausibly denied.

"Forget dead husbands, with this tech, it will be hard to trust anything a politician said."

Like we can trust anything a politician says/has said today anyway!

I meant it as you can't trust recording of what someone said. Basically any audio could be faked.

Which would be great for people like Trump. "Trust me, that audio is fake news. Fake!"

I was thinking the same when I saw an Adobe demo of a similar product. But then I realized that I have listened to voice imitators mocking politicians very accurately. Maybe having this technology at hand will make everyone realise that we can't trust an audio recording.

The advantage of this method is that you can train on random speech datasets, and then using only a minute or two, find the "voice embedding" of a person, and generate anything with his/her voice. This voice embedding is much like the word2vec word vectors, and has similar arithmetic properties.
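As a rough illustration of what "arithmetic properties" means for embeddings, here is a toy Python sketch in the word2vec analogy style. Everything in it is an assumption for illustration: the 3-dimensional vectors and speaker names are invented, and nothing here reflects how Lyrebird actually represents voices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# Invented embeddings; the last axis loosely plays the role of "intensity".
voices = {
    "speaker_a_calm": [0.2, 0.5, 0.3],
    "speaker_a_angry": [0.2, 0.5, 0.9],
    "speaker_b_calm": [0.8, 0.1, 0.3],
    "speaker_b_angry": [0.8, 0.1, 0.9],
}

# Analogy: (A angry - A calm) + B calm should land near B angry.
query = add(sub(voices["speaker_a_angry"], voices["speaker_a_calm"]),
            voices["speaker_b_calm"])
best = max(voices, key=lambda name: cosine(voices[name], query))
print(best)
```

With real learned embeddings the match is only approximate, which is why the lookup uses nearest-neighbour similarity rather than exact equality.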

Responding to multiple sibling comments.

Wouldn't you need tons and tons of training data? Obama, Clinton and Trump are public personalities, so it's easy to have many hours of recording of their voices.

A random relative; not so much.

Do you know the voices of your second or third degree relatives? Telephone scams are quite a well organized crime by now in Germany. If they can extend the target audience from elderly people to the general public by this means, they don't really care about some negative contacts.

http://www.aarp.org/money/scams-fraud/info-2016/how-to-beat-... https://de.wikipedia.org/wiki/Enkeltrick

If they really need a voice sample from a relative they could still pose, with a faked voice or not, as an insurance agent, HR or police and acquire it by simple means if they have any idea about the social graph of their victim.

They claim to only need 1 minute of recording. In this age of all the kids sharing everything all the time that shouldn't be too hard to acquire.

Even better, targeted attacks against a person to collect their voice could involve contacting them for an opinion survey regarding a product, survey, or political opinion they value. Gleaning something like that from social media profiles is fairly easy.

"My voice is my passport. Verify me."

It claims that it only needs a minute of audio.

Or 1) planting evidence (no need for extended wiretaps), 2) voice recognition software for access control, and 3) nearly any other fraud that will exploit the acceptance/authorization of verbal consent.

"But you just went out the door 10 minutes ago, and we live in France! How did it happen?! I'll wire it right away!"

First thing that came to my mind. We're assembling the pieces one at a time.

If this is anything like Adobe's system then you need a fairly extensive library of the target speaking naturally (~20 hours), so to pull off a scam like that you'll want to try and get into Alexa, Siri, Cortana or Google's library and find a heavy user.

Interesting that companies are trying to get us used to talking to our computers.

It states that it only needs about a minute of someone's voice.

I wonder how long until we have an audio file of Paul Ryan saying something horrendous.

<lowhangingfruit>Just switch on CNN</lowhangingfruit>

Is this enough to beat voice recognition software?

If you thought fake news was bad before wait until these 'secret' recordings start getting released and reported on.

That was one of my first thoughts too [1]. I doubt it currently will do so as it does sound fairly robotic. It does sound closer however than other demos I've heard so I think we're probably on the way.

[1] https://youtu.be/-zVgWpVXb64?t=41

Throw some hard compression on it to get additional distortion, and I feel you could get away with claiming it was a recording done by a low-power bug.

Imagine when entire videos can be synthesized.

I think it's good if tech like this is publicly available. It will be used by comedians and satire outlets and over time raise awareness about possible fakes - pics/video or it did not happen... Well, no problem anymore ;-)

Can you imagine this for the next generation of cyber bullying? It could get super messy in high schools.

Alice broke up with Bob. Bob grabs the YouTube videos from Alice and makes video and voice profiles. Bob then posts a video of Alice saying how breaking up was her biggest mistake and how she misses <list of every sexual thing you can think of> because Bob does it all best.

That could end badly really easily.

Wow that's amazing. Combine the two and soon Hollywood stars will be redundant. Faceless session actors could just manipulate models of real people who've signed release forms with the vocal performed by similarly faceless vocal artists, or maybe even AI generated voices. Actors would lose their uniqueness and so end up being paid a pittance instead of being able to command the vast sums they can today. Another set of jobs soon to be made redundant by the rise of technology.

Have you seen the movie S1m0ne (2002)? I felt that it only scratched the surface of the topic (and the tech wasn't exactly at today's level), but it's otherwise pretty good.

I wouldn't be comfortable watching a movie scene if I knew I was looking at computer-generated faces and voices.

Are you comfortable with Auto-Tune in music, not the t-pain / etc exaggerated style... the nearly universal application of Auto-Tune to recording and live performance to ensure a "consistent product", and "save on expensive studio time"?

Because the market appears to have spoken on that one and it said "meh, I don't care" with a solid shrug of indifference.

By the same logic, one can see artificially produced vocal performance combined with artificial overlaying of photorealistic 3D reproduction as a way to cost-effectively make the most of the performer and crew expenses, and ensure the consistency of a performance. The results may even be better than what they could have done with the real performer in the case of some attractive actors who are not very good at the acting part of being an actor/modern celebrity.

That said, I'll definitely miss the days people had to actually be able to act, but then again I also miss the days people used to actually be able to play an instrument well and/or sing well if they wanted to be a famous musician.

Japan, as usual, is ahead of the game here! [0]

[0] https://www.youtube.com/watch?v=pEaBqiLeCu0

Did you feel that way when seeing Grand Moff Tarkin on "Rogue One"?


Yes? I missed literally all his dialogue because he was so poorly animated I couldn't take my mind off it. So out of place and jarring.

Yep, fell right into the uncanny valley.

I'm interested to see if it can fool Android's trusted voice.

Very impressive, but like someone else said there are definite robotic/synthetic moments. I wonder how easy it would be to combine all of the other projects linked in these comments, is there a common interface that could easily combine all of them? Probably not because of commercial concerns...

Someone needs to make Trump sing. Someone else should prepare the playlist.

This model is quite cool, but also quite a bit different than what lyrebird.ai is doing. NPSS has a lot of extra information in the control inputs about pronunciation and timing (the part-of-phoneme timer feature) - this means that most of the "hard parts" (in my opinion) for naturalness are control inputs to NPSS/WaveNet style models, rather than variables the model must generate globally and consistently as in lyrebird. At generation time NPSS appears to generate each component autoregressively as well, but I am not clear on whether the demo samples do this or if they use "true" values for f0 at least - what forces the model to sing the exact same melody, if many melodies are possible given the underlying audio information?

Also note that NPSS has some amount of post-processing, at least reverb and perhaps other common musical mixing - we don't really know how these samples are generated, and I have a hard time deciphering exactly what inputs are required, and what are generated from the paper alone. However, I really, really, really like NPSS - I just don't think the comparison you are making is valid here.

These features (f0, duration, pronunciation) are some of the most difficult things to learn to model from datasets of speech and text directly, and I am not sure how they got the subset used (I think only f0 and pronunciation/phoneme) for this NPSS model. Giving creators fine-grained control of the performance (as in NPSS) is quite cool, and if these systems can get fast enough I think the possibilities are really exciting. The same things could likely be done with lyrebird as well - there is no real "tech reason" you couldn't add more conditional inputs, with finer grained information/control.

The key part in my mind is deciding what amount of complexity to show to a user, and what amount to try and capture inside the model - some people may want to control (for example) duration and f0 directly for a performance, while others may want to just upload clips to an API and get reasonable results back, with less ability to control each sample (they can still curate themselves for the "best" samples). Lyrebird.ai is handling the latter case, while the former case would require quite a bit more intervention from the average user, almost becoming like an instrument ala the original voder [0]. However, you could potentially have both approaches as a kind of beginner/advanced mode, but advanced mode needs a user interface, and probably near-realtime feedback.

I used to really strongly believe that the audio model was going to be the hard part of "neural" TTS (blame my background in DSP perhaps), but post-WaveNet the game has really changed a lot - conditional audio models are something we are starting to know how to do pretty well.

The text pipeline of most TTS systems is still the craziest part in my mind, check out a "normal" feature extraction of 416 hand-specified features [1]! These extractions can be upwards of 1k features per timestep/frame, and generally require a lot of linguistic knowledge to specify for new languages. It seems (given Alex Graves' demo [2], char2wav [3], tacotron[4]) that we are making progress on learning this information directly from text, which in my mind is a key breakthrough for TTS in languages besides English, where lots of work on English pronunciation has been done already and is generally available.

[0] https://www.youtube.com/watch?v=TsdOej_nC1M

[1] https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/qu...

[2] https://www.youtube.com/watch?v=-yX1SYeDHbg&t=38m00s

[3] http://josesotelo.com/speechsynthesis/

[4] https://google.github.io/tacotron/

Hi Kyle, I was wondering if the lyrebird implementation will be open sourced on GitHub, as I am hoping to work on improving the current implementation by incorporating prosody into speech synthesis. Thanks!

Canned voices? Robotic intonation? No thanks.

Did you even listen to OP voices and compare them with for example the article Spanish generated voice?

Finally, I can have Morgan Freeman narrate my major life events.

Update: Reading changelogs before deployment never sounded better!

As a skilled vocal impersonator I read all my comments aloud in the voice of Morgan Freeman before posting them. Ordinary sentiments such as 'these pickles are quite tasty' are suddenly transformed into profound insights on the human condition.

I was wondering when CG Sir David Attenborough would get here and start narrating my day to day.

I imagine Sir David Attenborough would basically just recite the intro of the movie "The Gods Must Be Crazy", casually explaining how human technological progress is futile with regard to happiness. That text seems to hold up to the present day.

Actually, when the sad day comes and he passes away I'd be very comforted to hear his voice on new nature documentaries. I just can't watch them unless he's narrating.

I agree. This leads to an interesting question: can the estate of a deceased person sell or license their voice rights for new future performances? I suspect the law has some catching up to do.

This is not qualitatively different from existing situations, and will be only a minor legal wrinkle. Estates have been licensing the likeness of dead people for commercial purposes for a good while, now those likenesses are simply more sophisticated.

I'll tell you what will get complicated, copyright holders complaining that their product was used as input for the training algorithm and demanding a slice of any profits because they made the famous individual more famous by casting them.

Charles Schwab uses a voice phrase to authenticate you for access to your account, which is already pretty brittle, but I hope this makes them reconsider more urgently.

1. Is this company new?

2. Is this better than what Google or Baidu are doing?

3. I remember reading Adobe has something similar.

4. Why (what happened) do we all of a sudden have 4 companies making voice breakthrough tech like this?

5. What happens to voice acting, in places like Japan where they highly value voice actors? Is a voice even patentable?

1) Yes, it is spun off from research at MILA, University of Montreal.

2) Possibly. Google and Baidu have compute resources far beyond a university. This method might be better, and do well with more resources.

3) Adobe's method required far more input data. This apparently requires only 1 minute of your audio to start sounding like you.

4) Deep learning has revolutionized vision, and language processing for a while. It was just a matter of time before people started applying those methods on speech data, with similar surprising results.

5) It will be hard to capture human elements like "emotion" and tone via generative models. Maybe in the future the work will become sophisticated enough to be indistinguishable from human speech, but right now there are some telltale signs that it is artificially generated.

Specifically on point 4, Google published their WaveNet paper a couple of months ago. I wouldn't be surprised if some, if not all, of these current breakthroughs are built on that foundation. This sort of application was the first thing that came to my mind after reading the paper.


There is an older paper [0] and demo [1] from Alex Graves that inspired a ton of work around handwriting, and then speech. Previous work from Jose Sotelo et al. (including me) called char2wav [2] is a close neighbor to Graves' approach, though he (Graves) never published the approach for speech so we don't really know. Google's recent Tacotron paper [3] is also a relative of these approaches.

WaveNet certainly changed the game in many ways, but approaches to TTS using RNNs have different roots. WaveNet and friends (incl. DeepVoice and NPSS linked elsewhere in this thread) are largely focused on audio modeling, and generally use something closely related to the "classic" TTS pipeline for text in the frontend. The audio modeling results are stellar, and really blew me away personally - basically changing my perspective on what is possible in audio modeling overnight.

RNN models try to tackle the whole problem (text + audio modeling) at once, though currently (all?) RNN and attention style models need intermediate / high level hints or pretraining from things like vocoder representations or spectrograms, versus WaveNet's approach using the waveform directly. So they are complementary in many ways, and I am sure we will see people trying to combine them soon. char2wav has this flavor by using SampleRNN, our lab's take on raw waveform generation; though we are still working on fully end-to-end training from scratch, the inference path is truly end-to-end. There are still many details to work out as far as output quality, but it seems possible that this will be a productive approach (though I am quite biased).

We see similar directions in neural machine translation (NMT) moving from word level representations to word parts or characters directly - one of the big reasons deep learning has come so far, so fast is that a lot of techniques from other subfields can be utilized for new domains, and I think there is a lot more fertile ground for crossover in both directions.

Heiga Zen has a great overview talk about how speech synthesis, as a field, overlaps between different approaches and factorizations [4]. His work on parametric synthesis and TTS generally has laid the foundation for a lot of recent advances, and he was also a co-author on WaveNet!

[0] https://arxiv.org/abs/1308.0850

[1] https://www.youtube.com/watch?v=-yX1SYeDHbg&t=38m0s

[2] http://josesotelo.com/speechsynthesis/

[3] https://google.github.io/tacotron/

[4] https://www.youtube.com/watch?v=nsrSrYtKkT8

> 5. What happens to voice acting, in places like Japan where they highly value voice actors? Is a voice even patentable?

I don't think this tech knows how to act. However, it could be used to increase the range of a voice actor.

I see a lot of people claiming that certain things will now be untrustworthy.

As if human voice imitators have not existed and could not be paid for prior to this. For $5 you can get Stewie Griffin [0] or Barack Obama [1] to say whatever you want them to say. Any audio-only messages of well known figures should already be considered "compromised" and untrustworthy. Even without the technology to impersonate them.

This should be more concerning for "normal people". It isn't that you can no longer trust an audio-only recording of Obama, but that you may no longer be certain an audio recording is from your best friend. (E: Once the technology improves a bit more of course.)

[0] https://www.fiverr.com/joe_stevens/talk-like-stewie-griffin-...

[1] https://www.fiverr.com/celebimpression/do-a-custom-barack-ob...

This is awesome. As someone exploring the fictional storytelling space, this seems like it'd have a lot of fun applications in that space as well.

How difficult is it to create/tune voices from parameters rather than training from an audio clip? I build software where people create fictional characters for writing, and having an author "create" voices for each character would be an amazing way to autogenerate audiobooks with their voices, or interact with those characters by voice, or just hear things written from their point of view in their voice for that extra immersion. Having an author upload voice clips of themselves mimicking what they think that character should sound like could work, but it would probably keep traces of their original voice (and feel "fake" to them because they can recognize their own voice), no?

Can't wait to see how this pans out. Signed up for the beta and will definitely be pushing it to its limits when it's ready. :)

I wonder how dependent this is on language: can we make Trump speak Chinese using a one minute audio track of him speaking English?

I'd imagine you might get close, but it'd probably work best if you can get the person to use all the phonemes that you want to reproduce. That said, depending on how good it is, it might slur other phonemes together to approximate the missing ones, which would probably give it the accent that the speaker would likely have.

It all sounded sort of slurry and muffled. Maybe if you're imitating a naturally slurred speaker it would be more effective.

It sounds like they're training a parametric speech synthesis platform on samples in order to learn the parameters. I wonder if there are approaches to generating n-phones for concatenative models, or using a hybrid approach.

I built a toy concatenative Donald Trump speech system [1], but I don't have an ML background. I've been taking Andrew Ng's online course in addition to Udacity's deep learning program in an attempt to learn the basics. I'm hoping I can use my dataset to build something backed by ML that sounds better.

Is anyone in the Atlanta area interested in ML? I'd love to chat over coffee or join local ML interest groups.

[1] http://jungle.horse

I tried similar approaches long ago (~2 years now?) with something related to RNN-RBM and it showed some slight glimmer of promise, and I still think there might be some clever ways to combine concatenative methods and deep learning to avoid a lot of the noise issues present in parametric models. Then again, maybe it just needs to train longer - it's always hard to tell. I liked jungle.horse, awesome stuff!

This is very exciting to me because it lets RPGs provide spoken dialog for everything (I'm waiting to see if they can do emotions at all convincingly). Even big budget games suffer from "you can call your character anything as long as it's 'Shepard'" simply because you can't mention the character's name or any other user content safely.

Through the tinny speaker of my mobile phone the Obama in the first sample is almost spot on. Some speed issues with Trump but really impressive.

I wonder how accurately this would reproduce dead musicians' voices. I've had this idea for about 8 years called the Notorious BIG project. I have about 20 acapellas that I was originally going to manually chop into a song. Neural nets can pretty much solve this now.

Can we get these speeches in audio form now?


As noted in other comments, all the samples still sound very robotic, so this is probably "just" a method to tune the parameters of an existing voice synthesizer to mimic a real person's voice as much as it allows.

That's exactly what it sounds like. The same old mediocre TTS with voices modified to mimic specific well-known voices.

It's impressive for what it is, but a lot of people here seem way too excited. This isn't any kind of breakthrough, and only the shortest hand-picked snippet would fool anyone.

The samples all sound a little like Rich Little and Stephen Hawking's love child doing impressions: they won't fool very many people.

But, you can certainly see where this is going and that's the worrisome part.

I'm sure it will improve dramatically over the years. This seems to be a problem with all digital voice software: it's never entirely human sounding. Pretty good starting point though.

Oh yea. The Troll embedded deep in my soul giggles in glee.

However, the day some shill tries to sell me travel insurance in my departed nana's voice would be the day I start signing my voice convos with a PGP key.
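For anyone wondering what "signing a voice convo" might look like, here is a minimal sketch. It is not PGP: to stay inside the Python standard library it swaps in an HMAC over the audio bytes with a secret both parties already share, whereas PGP would use a public/private key pair. The names `sign_clip` and `verify_clip` and the sample byte strings are made up for illustration.

```python
import hashlib
import hmac


def sign_clip(shared_secret: bytes, audio: bytes) -> str:
    """Return a hex tag that binds the audio bytes to the shared secret."""
    return hmac.new(shared_secret, audio, hashlib.sha256).hexdigest()


def verify_clip(shared_secret: bytes, audio: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_clip(shared_secret, audio)
    return hmac.compare_digest(expected, tag)


# Hypothetical data: stand-ins for a pre-agreed secret and a recorded clip.
secret = b"something only nana and I know"
clip = b"\x00\x01\x02 raw audio bytes would go here"

tag = sign_clip(secret, clip)
genuine = verify_clip(secret, clip, tag)
forged = verify_clip(secret, b"synthesized audio in nana's voice", tag)
```

The point is that the check depends on the secret, not on how convincing the voice sounds, so a perfect imitation without the key still fails verification.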

This site has a "demo" section featuring only SoundCloud clips. The copy uses the present tense too much ("In a world first, Montreal-based startup Lyrebird today unveiled" and "Record 1 minute [...] and Lyrebird can [..]Use this key to generate anything") but there is no actual product or beta version. Adobe had a much more impressive sneak peek of a similar product called VoCo: https://www.youtube.com/watch?v=I3l4XLZ59iw

Lyrebird has real tech capable of doing what they claim, whereas Adobe's demo was completely fake.


"Wife" sounds exactly the same in both places. All they did was copy the exact waveform from one point to another, like an automated cut and paste in Audacity. Nothing is being synthesized.


The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. That's incredibly dishonest. This is a scripted performance and Jordan is feigning surprise.


Again, the phrase "three times" here was prerecorded.

This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. Adobe falsely took credit and received endless free publicity for a breakthrough they had no hand in by staging this fake demo right on the heels of the genuine interest generated by Google WaveNet. They were hoping they'd have a real product ready before anyone else.

If Adobe had a real product then they'd have proven it with a demo as alarming, undeniable, and straightforward as Lyrebird's. Instead they relied on aesthetics and flashy, polished, deceptive performance art with famous actors and de facto applause tracks to cover up the fact that they have nothing.
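To make the "automated cut and paste" claim above concrete, here is a toy sketch of what copying a waveform segment from one point to another amounts to. It only illustrates the operation the commenter describes, not anything known about Adobe's software; `splice` is a hypothetical helper and the integers stand in for audio samples.

```python
def splice(samples, src_start, src_end, dst):
    """Copy samples[src_start:src_end] and insert the copy at index dst."""
    segment = samples[src_start:src_end]
    return samples[:dst] + segment + samples[dst:]


# Toy "recording": each integer stands in for one audio sample.
recording = list(range(10))

# Copy the "word" occupying samples 2..4 and paste it at position 8.
edited = splice(recording, 2, 5, 8)

# The pasted region is sample-for-sample identical to the source region.
pasted_is_identical = edited[8:11] == recording[2:5]
```

Because nothing is generated, the pasted word matches the original exactly, which is the telltale sign the comment points to when it says "Wife" sounds the same in both places.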

To be fair, that demo could have been staged whereas we can be pretty darn sure those aren't Trump's actual words.

I agree. My main issue with them is that those clips could have been painstakingly produced using some very far from shipping software with lots of manual tinkering while the copy on the site mostly reads like there's a product out. The part about being far from shipping could also be true about Adobe's software, but I think their presented result (assuming it is real) sounded better, and they were more honest about the stage the product is in.

Relevant discussion from 17 hours ago: https://news.ycombinator.com/item?id=14177589

We need a new markup language for intonation and emotion.

Voice Actors out of business! :D

Excellent work. This will find widespread application in the film/tv/music industry and beyond (and we're not that far away from being able to do the same thing for video). Unfortunately it will also be widely abused, but given the near-inevitability of such technological development I'm already reconciled to that :-/

Curious choice to name a company & product with a name that sounds like "Liar Bird" when spoken. To me, that looks like they're fully embracing the concept that this can be used for nefarious purposes. If one of their goals is to bring attention that this technology exists and can be misused, the name reinforces that.

I guess they've named it that because the lyrebird is an amazing impersonator. The end of this BBC clip blew my mind the first time I saw it. https://youtu.be/VjE0Kdfos4Y

But you may have a point, and the ethics section makes it clear that they are indeed very aware that this may be misused.

I don't see why audio manipulation would be any more nefarious than photo manipulation or video manipulation.

Plus, it doesn't even matter. You can write an article with fake quotes and people will believe it without even caring if there is an accompanying sound bite or not.

