https://vo.codes
It has celebrities like Sir David Attenborough and Arnold Schwarzenegger, a bunch of the presidents, and also some engineers: PG, Sam Altman, Peter Thiel, and Mark Zuckerberg.
I'm not far away from a working "real time" [1] voice conversion (VC) system. This turns a source voice into a target voice. The most difficult part is getting it to generalize to new, unheard speakers. I haven't recorded my progress recently, but here are some old rudimentary results that make my voice sound slightly like Trump [2]. If you know what my voice sounds like and you kind of squint at it a little, the results are pretty neat. I'll try to publish newer stuff soon; it all sounds much better.
I was just about to submit all of this to HN (on "new").
Edit: well, my post [3] didn't make it (it fell to the second page of new). But I'll be happy to answer questions here.
[1] It has about 1500ms of lag, but I think it can be improved.
[2] https://drive.google.com/file/d/1vgnq09YjX6pYwf4ubFYHukDafxP...
[3] I'm only linking this because it failed to reach popularity. https://news.ycombinator.com/item?id=23965787
We'll re-up that thread (see https://news.ycombinator.com/item?id=11662380 for how this works generally). I'm going to move this comment there as well because it includes more background info than you posted there.
I don't think that there are any. Simply writing a program that outputs sounds that resemble someone's voice can't be illegal, right? You can use it in illegal ways, sure, but I don't think there are any laws concerning software like this.
There have been a couple of cases in the US where singers have sued over their voices being imitated in commercials. As far as I know, none of their copyright claims succeeded, but some still won on other grounds, such as the common-law notion of appropriation of identity, California's publicity-rights laws, or federal laws on false endorsement.
As I said, they lost on copyright claims, but those cases involved their natural voices. If the voice were a made-up voice, such as an animated character's, I wouldn't immediately dismiss the idea that it might be copyrightable.
At least in the US, you can’t copyright your voice. Publicity rights is a can of worms depending on the jurisdiction.
Tangentially related, if you have your voice print as a security mechanism at a financial institution (Vanguard), you should ask them to turn that off.
It was extremely hard to get women's voices to sound right in TTS, at least back when I was working on it. It was a striking difference, and I'm not sure why.
I've been wondering about the possibility of using this sort of tech (or the API offerings from Azure or GCP) to provide voice overs in video games.
By that I mean smaller-budget indie development: it would certainly be interesting to be able to generate voice audio from transcripts in order to add voices to background NPCs and so on (or even to do it at run time to produce much more dynamic worlds).
I guess the biggest blocker is the difficulty of conveying emotion with what is currently available, as well as the difficulty of getting pronunciation correct (especially with proper nouns).
There are half a dozen startups in this space that provide the tech. They use embedded style tokens or sliders to change the emotion, pitch, timbre, etc. I don't have links off hand, but they're not too difficult to find.
These companies tend to focus on off-the-shelf turnkey solutions, so they'll have a suite of a few voice actors to choose from for different character archetypes.
Yep, that's what we're doing at https://replicastudios.com - though we have strict ethical guidelines to not clone a person's voice without their permission.
Can this service be used with preexisting recordings? I have family that are losing the ability to speak and I’d love to be able to give them that back somehow.
I'm not a lawyer, but I think we're entering into a legal gray area. There are the existing frameworks of copyright, parody, free speech, slander, libel, etc. that are all somewhat tangential to this.
I believe (I'm not certain) that celebrity voice impersonation is legal as long as it is not used to sell or endorse a product.
Most models are trained on the original speaker's voice, though perhaps only on a small amount of it. Models might also incorporate learning from many speakers. We might even be able to boil down a speaker representation to a small vector encoding in the future. It'll be interesting if we can capture the representation of a person with just a few numbers.
I don't think the legislature should be overly protective against machine learning. It seems obvious to me that neural networks will play a huge role in creating entirely virtual musicians and influencers. We're already seeing this start to happen. r9y9 on github has published some models that rival Vocaloid in lyrical ability.
At the same time, we don't want these techniques used to commit fraud, slander, or have them be used to falsely accuse someone of committing some act. These are things we might need new legal protections for.
But I don't know what I'm talking about. I'm not a lawyer.
I asked not because I expected an answer, but because I figured you'd have an insightful opinion.
It's essentially the performance of a composition vs the composition question again: at what point am I mimicking someone to the extent they have a valid claim on a portion of my work?
I expect it'll enter the courts a few milliseconds after someone clones a dead actor (without their estate's permission) for a new performance.
There's always been an inherent tension in the US distinction between a law of nature and a creative work though. It seems a bit silly for me to claim patent / trademark on a vector that encodes my likeness.
I suspect there's some plausible deniability built-in that might allow for such matters to be legal.
For example, lots of people sound like Arnold Schwarzenegger. So if you trained a model on a tall, deep-voiced Austrian man, you could probably get something that people will immediately associate with Arnold without it actually being his voice, or someone emulating him. Because much of what Americans associate with his voice is really a regional accent which is relatively uncommon in the US.
There may be a little more difficulty getting away with someone like Gilbert Gottfried, whose voice is much more distinctive. But I do think you could get away with creating a voice that people think sounds just like him, but that doesn't hold up in a side-by-side comparison.
What I think will happen is celebrities like Morgan Freeman will use their voice to train models like this, then gift these to their estates for use in the future.
> So if you trained a model on a tall, deep-voiced Austrian man, you could probably get something that people will immediately associate with Arnold without it actually being his voice, or someone emulating him. //
I think "passing off", an unregistered element of trademark laws, may be pertinent here. If the public think that there's an association and you're knowingly trading on that, even if the public are wrong, then you can be 'passing off' your output as someone else's goods/services/[vocal renditions].
It's likely you'd have to be very careful about use of copyrighted material for training the voice (e.g. extracting metrics that describe the voice). Fair use might apply in the USA though (even commercially).
>"Most models are trained on the original speaker's voice, but maybe only a little bit."
Really cool that you got this to work. I used to work on TTS (a few years ago, now), and we trained on celebrity voices, but used full audiobooks. https://github.com/Kyubyong/tacotron
It's not settled case law, so anyone basing a business on this should expect to spend a lot of money defending it in court.
I once saw a company that offered to be the sole purveyor of a celebrity's synthesized voice. I haven't been able to find them again, but that seems like a much safer way to monetize this.
I'm not sure that making it easier to profit off of the likeness of others is a positive side. If it's legal for indie studios to do, it's legal for 20th Century Fox, Universal, and so forth.
It reduces the cost of producing a specific good. I would think it would be measured similarly to other technology advancements that do the same: good for society, bad for the craftsmen who were made obsolete, an overall net positive.
This purposely doesn't count the effect of being able to fake people and the damage that does to society, but I think that was implied by the previous poster specifically looking for the more positive side of the technology.
It’s not so obvious that it’s a net positive. Music for instance is way easier to get ahold of now, but we also don’t get to enjoy sophisticated long-form music as much, owing to how there’s no money in making it.
Check out Modulate.ai! We make real-time, emotive voice skins aimed at gaming voice chat. Audio watermarking is also built-in to prevent fraud. Currently in a closed alpha stage but if you're part of a game studio and have interest please reach out!
Oh jeeze. I had to. Switch it to Bill Gates and pop this in:
> I'm going to steal your soul. One injection at a time. Slowly, over the course of the next decade, the entire essence of your being will be demolished until your body is nothing but a vessel for my command.
I am worried about the potential abuse of this service. Are there any existing services that can help identify audio deepfakes, the way this one helps to make them?
Such a system will always suffer from false positives and false negatives.
On a more positive note, when deepfakes become a problem, we will see the emergence of a culture where unsigned authoritative content is not paid any attention.
> we will see the emergence of a culture where unsigned authoritative content is not paid any attention
If current events are any indication, that culture will only emerge 30 years after the tech becomes widely usable, and in the interim will lead to absolute chaos in the form of weaponized disinformation.
Lots of bad things happen, and they are only surfaced because the person in question didn't notice the surreptitious recording. When deepfakes become a problem, it will give these people plausible deniability and they can just reject it as "fake news."
I'm reminded of a comment by someone who studied the old Eugenics movement. They mentioned finding that in one of the major books on the subject, the author had carefully edited photos of 'degenerate people' to look even more crude and unintelligent.
Also, a photographer friend of mine said: a great photographer doesn't need to Photoshop anything to lie to you.
> Also, a photographer friend of mine said: a great photographer doesn't need to Photoshop anything to lie to you.
I couldn't agree more, if they used the word "lie" in the more general sense, as a synonym for deception. The availability of fakes may not become a problem, because the most effective deception doesn't involve telling untruths.
To give an example, suppose Russia Today and Fox News report on the same event. There's a set of facts. RT picks a subset and reports it from their point of view. Fox picks another subset and presents their view. The resulting articles may give readers vastly different interpretations of the event, and no untruths had to be involved.
I wonder if at some point we need to create legal requirements that all deepfakes carry some kind of human-invisible (or visible?) fingerprint for identification, restrictions on frequency range, etc. We have crypto export restrictions; why not put handcuffs on this as well, which probably has the potential for larger-scale harm?
Actually, there might be a precedent with Carrie Fisher from Star Wars. I believe they did use some form of virtualization after she passed... I don't know what the outcome was legally, but it definitely is in this realm.
She would have signed contracts that permitted this, or her estate could have let it happen after she died. It is likely the former; I don't think this situation is applicable at all. Even The Crow was finished with face replacements in the 90s.
That’s a big start. If you look at Reddit, you’ll notice all these deepfake porn videos are made with scripts/apps that people have packaged together. Those and any commercial variants can abide by such restrictions, which will help document the fakeness once this goes mainstream. Only the super technical would be able to get around it, and if tutorials etc. come out, then you have legal grounds to go after them to minimize the harm.
Don’t let perfect be the enemy of good. This has potential to literally cause spilled blood, fraud, etc. Better to have it for some than for zero.
I don't agree with the solution unless we as a society stop putting trust in any digital media, but at that point it's not necessary. Many governments would love to use this tech, so they have an incentive to stop others from using it while still letting people believe in digital evidence by putting up a half-assed solution like the one you proposed.
The cat is out of the bag. Digital media should not be trusted blindly.
I don’t appreciate the condescension. It’s half-assed in that I put a rough direction out there for conversation about a huge problem that will inevitably cause social harm. Your reply isn’t in line with the HN guidelines and certainly doesn’t make me want to participate in conversation with you about important topics.
I am really sorry. I could have worded it better. I didn't mean to be condescending to you; my reaction was more to the government.
Governments have a history of putting up solutions that work in their favor through selective enforcement and securing power for themselves. The encryption debate is one such example.
Thank you for saying that. If you have other solutions that you think could work it would be good to get them out there. Disinformation is a giant problem already just with social media alone. Now add in the ability to create false yet convincing AV evidence and the potential for large social harm explodes. It hasn’t become widespread just yet, so now is the time.
After some experiments with playing text to people around me, we decided that a huge factor in the perceived quality of the voice comes from knowing who you are listening to before you get to listen, with the best perceived quality when the listener actually gets to see a picture of the voice's owner. Was it a deliberate choice to add those photographs for this reason?
Extremely fun little tool! Is there anywhere I can read about the techniques involved?
An interesting quirk: some words seem to get dropped entirely, for example the word "cleverer" and any word with a hyphen.
Really fun to start with a quote from one person and switch between voices to hear others recite the same line. Alan Rickman doing lines from Aladdin as Iago is pretty funny
You’ve clearly spent a good deal of your time creating this. Bravo. What steps can I take to find more time to dig into projects of my own? Assuming this is not what you make a living from.
Hey! What do you think about running this on a mobile device? What are the current limitations? Since WaveNet we've been able to get really good TTS, but it's always online. Why aren't there any good offline TTS engines yet?
E.g. for Pocket or similar services
Guess the server is overloaded now. All I'm seeing are errors.
Tip: it's a cool idea to put some ready-made samples under the photos. A lot of people like myself only want to hear some demos, and pre-saved mp3 samples are more than sufficient for that sort of thing. It will also help reduce your server load.
I was wondering, wouldn't it be possible to classify the voice of every celebrity based on moods so one could make the voices less monotonic? So one could then add text metadata for the text-to-speech conversion, e.g. "[Angry] I have a dream, [Calm] but it has a patent so you can't copy it! (laughter) [Calm-fade-to-angry] In reality insomnia took it from me!"
I'm not aware of any, and I haven't had much time to look as I'm not to the point of doing style tokens yet. I'm certain this would be useful for annotating data and for all sorts of other applications. Sentiment analysis, etc.
I can make a blog post later, but at a high level:
A Rust TTS server hosts two models: a mel inference model and a mel inversion model. The ones I'm using are glow-tts and melgan. They fit together back to back in a pipeline.
I chose these models not for their fidelity, but for their performance: they're 10x faster at inference than Tacotron 2. If you want something that sounds amazing, you're better off with a denser set of networks, like Tacotron 2 + WaveGlow; use those for superior offline results for multimedia purposes.
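Roughly, the two stages chain like this (an illustrative Python/PyTorch sketch, not the actual server code; the checkpoint format and call signatures are placeholders):

```python
# Illustrative two-stage TTS pipeline: phoneme ids -> mel spectrogram -> waveform.
# Checkpoint paths, export format, and call signatures are placeholders.
import torch

def load_pipeline(glow_tts_ckpt, melgan_ckpt, device="cpu"):
    # Assumes both models were exported as TorchScript; otherwise instantiate
    # the model classes and load their state_dicts instead.
    mel_inference = torch.jit.load(glow_tts_ckpt).to(device).eval()  # phonemes -> mel
    mel_inversion = torch.jit.load(melgan_ckpt).to(device).eval()    # mel -> audio
    return mel_inference, mel_inversion

@torch.no_grad()
def synthesize(phoneme_ids, mel_inference, mel_inversion):
    """phoneme_ids: LongTensor of shape (1, T) with ARPABET token ids."""
    mel = mel_inference(phoneme_ids)   # (1, n_mels, frames)
    audio = mel_inversion(mel)         # (1, samples), float32 roughly in [-1, 1]
    return audio.squeeze(0).cpu().numpy()
```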
Instead of using graphemes, I'm using ARPABET phonemes, and I get these from a lookup table called "CMUdict" from Carnegie Mellon. In the future I'll supplement this with a model that predicts phonemes for missing entries.
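For the curious, the lookup itself is simple; something along these lines works against the raw cmudict-0.7b file (a sketch, with a placeholder where a grapheme-to-phoneme fallback would go):

```python
# Sketch of an ARPABET lookup backed by the raw CMUdict file (cmudict-0.7b).
# The <UNK> fallback is a placeholder for a future grapheme-to-phoneme model.
def load_cmudict(path):
    table = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):    # skip blanks and comments
                continue
            word, *phones = line.split()
            table.setdefault(word.split("(")[0], phones)      # keep the first pronunciation
    return table

def to_phonemes(text, table):
    phones = []
    for word in text.upper().split():
        word = word.strip(".,!?;:\"'")
        phones.extend(table.get(word, ["<UNK>"]))             # unknown word -> needs g2p fallback
    return phones
```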
Each TTS server only hosts one or two voices due to memory constraints. These models are huge. This fleet is scaled horizontally. A proxy server sits in front and decodes the request and directs it to the appropriate backend based on a ConfigMap that associates a service with the underlying model. Kubernetes is used to wire all of this up.
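To make the routing concrete, here's a toy Python version of what the proxy does (the real proxy is Rust; the route file, payload fields, and backend path here are hypothetical, standing in for the ConfigMap-driven mapping):

```python
# Toy sketch of the proxy's routing step: the requested speaker is mapped to a
# backend Service via a table mounted from a ConfigMap. Names are hypothetical.
import json
import urllib.request

def load_routes(path="/etc/voices/routes.json"):
    with open(path) as f:
        return json.load(f)   # e.g. {"attenborough": "http://tts-attenborough:8080"}

def forward(payload, routes):
    backend = routes[payload["speaker"]]           # pick the pod group hosting that voice
    req = urllib.request.Request(
        f"{backend}/speak",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()                         # raw audio bytes from the TTS backend
```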
This is incredibly cool. Do you mind sharing how big the models are, and what kind of instances you're deploying them on?
I ask because I help maintain an open source ML infra project ( https://github.com/cortexlabs/cortex ) and we've recently done a lot of work around autoscaling multi-model endpoints. Always curious to see how others are approaching this.
(All the voices use the same melgan, or derivations of it.)
I'll edit my post later with my deployment and cluster architecture. In short, it's sharded and proxied from a thin microservice at the top of the stack. I'll probably introduce a job queue soon.
I tried "Watch as the cat sniffs the flower, eats it, and then vomits. This is classic feline behavior" with Attenborough. He seems to slip into a bit of a German accent on the second sentence. What's the cause of that?
Thanks for sharing, though. Very interesting project!
I can come back and post a write up. Please refresh this post later today.
I scaled for today, but it's pretty cheap to run day to day.
I also have some architectural optimizations to make that will greatly reduce the costs. Right now, nodes are responsible for two speakers apiece. This is an under-utilization since most speakers don't get used.
Semi-serious follow-on question- would your model be able to produce voices like GladOS, which are highly processed, but in a consistent manner? Or are there too many assumptions baked in regarding normal human speech?
Can you comment a bit on the tech behind this? I tried something similar with songs: I wanted artist X to sing a song from artist Y. I cleaned the voices and the audio, but the transfer just didn't work. I didn't do any annotations on the text (it shouldn't be that hard since all the lyrics are available), but if you could recommend a path or maybe an open source project, I'd be grateful. Thanks, and great work by the way!
The best results I've seen are from researcher Ryuichi Yamamoto (r9y9 on GitHub). He continually publishes astonishing results and novel architectures.
"My name is Bill, the lord of computers. I love computers, and they love me too. I'll give you a computer, maybe one, maybe two. If you are lucky it might not even crash on you. Love your computer, like your daughter or your wife, treat it with kindness, and it will reward you for life! I am bill the god of computers. Bow to me now or I will be sod you."
I definitely need this. Looks like I have to wait until you are off the front page of HN though.
I am a writer and found that the best editing comes when I am reviewing audio files of my books from voice talent. Of course, by then it is way too late to change anything. With a tool like this I can revise as much as I want!
This is great. I've been thinking about doing something similar with cartoon characters to build a Disney-style companion for my son as he gets older. I'm imagining something like an Alexa assistant but with Mickey Mouse's voice.
I know case law isn't settled at all on any of this, but I'd absolutely avoid posting anything on the web mentioning D' and the black-and-white mouse again, unless you are interested in finding out firsthand how the law gets settled here ;o).
The hardest part of this is in dataset creation. It's hard to clean and annotate the data and can be quite manual. That's why companies with lots of data will win.
There are automated techniques to help with segmentation, bandpass filtering, transcriptions, etc., but they're far from perfect.
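As one concrete example of that automated cleanup, band-limiting clips to the speech range is a one-liner with scipy (the cutoffs below are illustrative, and assume the sample rate is comfortably above twice the upper cutoff):

```python
# Example of one automated cleanup step: band-limit raw clips to the speech band
# before segmentation and annotation. Cutoff frequencies are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_speech(audio: np.ndarray, sample_rate: int,
                    low_hz: float = 80.0, high_hz: float = 7600.0) -> np.ndarray:
    sos = butter(6, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)  # zero-phase filtering, so no added delay
```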
Can you give some more info on how you generated the models? I'm also interested in the tech stack you're using to implement this webapp... Would love some details!
> Can you give some more info on how you generated the models?
glow-tts and melgan, which are somewhat unpopular choices given the proliferation of Tacotron2/Waveglow. I chose these due to their sparsity and speed.
> I'm also interested in the tech stack you're using to implement this webapp... Would love some details!
It's a Rust microservice architecture. There's a proxy layer that decodes the request and sends it to the appropriate backend, and then there's the tts service that is horizontally scaled and is responsible for loading the model pipeline and turning requests into audio.
> ..What's next?
For me? Voice conversion in the near term. This takes microphone input and turns it into the target speaker's voice.
I'm also spending a lot of time on photogrammetry. I have a 3d volumetric webcam system right now that I have much bigger plans for.
A text-to-video webapp that renders text into synchronized video + voice of famous people.
Who wouldn't like to laugh 5X more when social scrolling?
The first platform that gives creators the ability to produce deepfakes of celebs from text, which they can broadcast as HQ video content to their audience, will kill both YouTube and Instagram.
Ranking based on likes so the best jokes of the day are trending on top of the feed.
A recommendation engine with a multi-armed bandit ML algo from the start, so you can leverage all that incoming data.
I assume that if somebody had physically lost their ability to speak, it would now be possible to generate a pretty reasonable synthetic voice. Should we all be archiving a high-quality voice sample as insurance?
The implications for security are huge. If your friend calls you up for a very quick chat from an unknown number and asks you to remind them of your address, are you going to ask for authentication to prove it's really them and not a convincing synthetic voice?
This could make video games take up so much less space and have much more robust speech, especially from NPCs.
Subreddit Simulator has some pretty convincing conversations; put that together with high-quality voices? Mannnn, so many good applications.
Speaking of which, why don't people just talk about the good applications? You'll get ostracized for speculating about more bad things from COVID, but talk about how doomed we potentially are with deepfakes? Give that blogger a Pulitzer Prize!
> This could make video games take up so much less space and have much more robust speech, especially from NPCs.
Maybe, maybe not. You'll see some of the model sizes I posted in comments above. They're quite large, and adding models for multiple speakers makes things even larger. These have to live in memory and probably can't be paged in selectively.
Once we achieve high fidelity multi-speaker embedding models (where multiple speakers are encoded in a singular model), then we'll have something compelling. I imagine the models will become less dense over time as well.
Furthermore, if the models are deterministic, then the designers will know what each line will sound like exactly before it's produced.
I keep getting hit with the rate limiter so I wasn't able to try it :(
> There was an error and I still haven't implemented retry to make it invisible. You can absolutely submit your request again a few times; this is a self-healing Kubernetes cluster. Some models (voices) get more load than others and/or are scaled to fewer or more pods. There's also a rate limiter, but there aren't error messages yet.
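In the meantime a client can retry on its own; a rough sketch (the endpoint is the one the site's frontend hits, but the payload fields here are guesses, not a documented API):

```python
# Client-side retry sketch, since the server doesn't retry internally yet.
# The payload fields are guesses, not a documented API.
import time
import requests

def speak_with_retry(text, speaker, attempts=4):
    for i in range(attempts):
        resp = requests.post(
            "https://mumble.stream/speak_spectrogram",
            json={"text": text, "speaker": speaker},
        )
        if resp.ok:
            return resp.content            # audio (or spectrogram) bytes
        time.sleep(2 ** i)                 # back off while a pod comes back up
    resp.raise_for_status()
```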
I know there's legal (and perhaps ethical?) issues to work out, but I really wish tech like this, if fine-tuned, could be used to resurrect stuff like Jim Henson's original Kermit voice; the Muppets' new voices all sound horrible. I'd love for fictional character voices to become immortal.
This is good fun. I went with "Meanwhile, the young males form groups and compete in trials of strength and courage in an attempt to catch the attention of herding females" and surprisingly Richard Nixon sounds just as good saying this as David Attenborough
It's currently sourcing phonemes from a lookup table called CMUdict, which is constructed by Carnegie Mellon [1]. That database has 140,000 entries, but even so, you'd be surprised how many common words are omitted. And of course it is missing terms for things like "pokemon" and "fortnite", which I had to add myself.
I don't have generic grapheme -> phoneme/polyphone prediction, but that's something I'm looking to add soon. In my literature review I didn't see anything in this space, so I was thinking I might have to come up with something novel.
Espeak-ng has pretty decent English word-to-phoneme translation. You run it in the mode where it just outputs the IPA. The vocabulary can be extended too (the coverage is good but far from perfect).
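For what it's worth, calling it from Python is straightforward, assuming the espeak-ng binary is installed; the output is IPA, so it would still need a mapping to ARPABET to slot into the CMUdict pipeline:

```python
# Rough sketch of espeak-ng as a fallback grapheme-to-phoneme step.
# Assumes the espeak-ng binary is installed and on PATH.
import subprocess

def ipa_fallback(word, voice="en-us"):
    # -q suppresses audio playback, --ipa prints the phonemes in IPA.
    out = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, word],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```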
If I supply you with isolated Robert Plant vocals and transcripts, would you consider training a model? That could be some 'interesting' output results with the dynamics and range of his singing.
I too expected more discussion of this. People play around with these things because they're interesting, then mostly hand wave away concerns about the implications with "well, people will just have to learn to be skeptical of recordings". But what we're really doing is muddying a previously reliable avenue of gaining quality evidence about the world. I expect this opinion is unpopular on HN but I think people shouldn't be developing these things, companies shouldn't be working on them, and they should be banned before they get to the point of causing real harm. I also believe that can be prevented by drying up funding and research, because bad actors have to rely on the body of existing work to make their bad actions practical.
As NN models get more advanced, synthetic speech will get progressively more convincing and less expensive to produce, even if the models aren't built for speech synthesis specifically. The same can be said for image generation/transformation. If we are to continue developing AI, then this is likely inevitable. There are benefits to these models for mute people, for example. Adversarial models can be built to detect fake audio samples. Regulation (e.g. adding tells/signatures in commercial products) would also help. The government would have to ban most AI research, or they would only be prolonging the inevitable.
I've tried Gilbert Gottfried and NDT. I do get a console error about CORS:
> Access to fetch at 'https://mumble.stream/speak_spectrogram' from origin 'https://vo.codes' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
I'm also on Chrome (84) macOS, using the Craig Ferguson model.
I switched to Safari and Disabled CORS, but a 500 error is coming back now. So maybe the 500 response is the root cause, and the error handler is not returning CORS headers, masking the issue on Chrome.
Edit: by putting in a shorter input (sentence rather than paragraph) I was able to get a response.
I need better error messages, but I believe it should respond with something stating the length is too long.
What might've happened is that the instance your request was farmed out to was OOM killed. I've provided lots of memory, but these models are pretty massive and each inference run has to spin up a lot of matrices in memory.
This is all CPU inference, not GPU.
When the pods get OOM killed, they spin up again. The clusters for each speaker are about 5-10 pods apiece (with some double tenancy).
Cate Blanchett, Sarah Silverman, Katey Sagal, Jennifer Tilly, Laura Prepon, Viola Davis, Judi Dench, Whoopi Goldberg, Julie Andrews, Lake Bell, Jane Lynch, Joan Rivers, Martha Stewart, Katharine Hepburn, Sarah Vowell, Shohreh Aghdashloo...
I trained a base model on the Linda Johnson speech (LJS) data set for several days.
I then transfer learned for each of these speakers. Some speakers have as little as 40 minutes of data, others have up to five hours. The resulting quality isn't strictly a function of the amount of training data, though more typically helps. It's also important to have high fidelity text transcriptions free of errors.
The transfer learning runs vary between six and thirty-six hours.
I'm using 8xV100 instances to train glow-tts and 2x1080Ti to train melgan. I'm continuously training melgan in the background and simply adding more training data. The same model works for all speakers.
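In pseudo-PyTorch, the per-speaker step is just ordinary fine-tuning from the LJS base checkpoint; the model class, dataset, checkpoint layout, and hyperparameters below are placeholders, not the actual training recipe:

```python
# Sketch of the per-speaker transfer-learning step: resume from the LJS base
# checkpoint and keep training on the new speaker's clips. Everything here
# (checkpoint layout, loss interface, hyperparameters) is a placeholder.
import torch
from torch.utils.data import DataLoader

def finetune(model, base_ckpt, speaker_dataset, steps=50_000, lr=1e-4):
    state = torch.load(base_ckpt, map_location="cpu")
    model.load_state_dict(state["model"])              # start from the base voice
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # smaller LR than the base run
    loader = DataLoader(speaker_dataset, batch_size=16, shuffle=True, drop_last=True)
    step = 0
    while step < steps:
        for phonemes, mels in loader:
            loss = model(phonemes, mels)               # assume the model returns its training loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model
```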
Have you had any success with using speaker embeddings to generate voices with fewer samples of speech? I did some cursory experiments but I couldn't get too far beyond getting pitch similar to the target speaker.
My reasoning for this approach: IMO, if the model learns a "universal human voice", it shouldn't need too much additional information to get a target voice.
I did! I tried creating a multi-speaker embedding model for practical concerns: saving on memory costs. I'm going to have to add additional layers, because it didn't fit individual speakers very well. I wish I'd saved audio results to share. I might be able to publish my findings if I look around for the model files.
I think you're right in that if we can get such a model to work, training new embeddings won't require much data.
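For readers who haven't seen the technique: the usual multi-speaker trick is a small learned vector per speaker that conditions the decoder, so adding a new voice mostly means fitting a new vector rather than a whole model. A toy illustration, not the model discussed above:

```python
# Toy illustration of multi-speaker conditioning: one small learned vector per
# speaker, broadcast onto the text-encoder output before decoding.
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    def __init__(self, num_speakers, hidden_dim, spk_dim=64):
        super().__init__()
        self.spk_table = nn.Embedding(num_speakers, spk_dim)
        self.project = nn.Linear(spk_dim, hidden_dim)

    def forward(self, encoder_out, speaker_id):
        # encoder_out: (batch, time, hidden_dim); speaker_id: (batch,)
        spk = self.project(self.spk_table(speaker_id))  # (batch, hidden_dim)
        return encoder_out + spk.unsqueeze(1)           # broadcast over time frames
```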
Hate to be that guy but I can't participate in this discussion due to javascript being required for the landing page.
As an outlier not running JavaScript, I'm reaping what I sow, but it would be nice for me and others in the same boat if projects made their landing pages viewable without JavaScript.
Without JavaScript you see only "This page requires Javascript", when I would hope that, even if the thing requires JavaScript to operate, I could at least find out whether it's worth switching to another machine with X11 and firing up Firefox.