
Show HN: Neural text to speech with dozens of celebrity voices - echelon
http://vocodes.com
======
echelon
I've built a lot of celebrity text to speech models and host them online:

[https://vo.codes](https://vo.codes)

It has celebrities like Sir David Attenborough and Arnold Schwarzenegger, a
bunch of the presidents, and also some engineers: PG, Sam Altman, Peter Thiel,
Mark Zuckerberg

I'm not far away from a working "real time" [1] voice conversion (VC) system.
This turns a source voice into a target voice. The most difficult part is
getting it to generalize to new, unheard speakers. I haven't recorded my
progress recently, but here are some old rudimentary results that make my
voice sound slightly like Trump [2]. If you know what my voice sounds like and
you kind of squint at it a little, the results are pretty neat. I'll try to
publish newer results soon; they sound much better.

I was just about to submit all of this to HN (on "new").

Edit: well, my post [3] didn't make it (it fell to the second page of new).
But I'll be happy to answer questions here.

[1] It has roughly 1,500 ms of lag, but I think it can be improved.

[2]
[https://drive.google.com/file/d/1vgnq09YjX6pYwf4ubFYHukDafxP...](https://drive.google.com/file/d/1vgnq09YjX6pYwf4ubFYHukDafxPNeEpH/view?usp=sharing)

[3] I'm only linking this because it failed to reach popularity.
[https://news.ycombinator.com/item?id=23965787](https://news.ycombinator.com/item?id=23965787)

~~~
bravoetch
What is the reason for them being almost all men?

~~~
sillysaurusx
It's extremely hard to get women's voices to sound right in TTS, at least back
when I was working on it. It was a striking difference, and I'm not sure why.

e.g. I tried to do an Alyx version of
[https://www.youtube.com/watch?v=koU3L7WBz_s](https://www.youtube.com/watch?v=koU3L7WBz_s)
but it came out sounding nothing like her.

------
boarnoah
On a more positive note about this technology:

I've been wondering about the possibility of using this sort of tech (or the
API offerings from Azure or GCP) to provide voice overs in video games.

By that I mean smaller-budget indie development: it would certainly be
interesting to generate voice audio from transcripts in order to give
background NPCs voices and so on (or even to do it at run time to produce much
more dynamic worlds).

I guess the biggest blocker is the difficulty in conveying emotion with what
is currently available as well as the difficulty in getting pronunciation
correct (especially with nouns).

~~~
echelon
There are half a dozen startups in this space that provide the tech. They use
embedded style tokens or sliders to change the emotion, pitch, timbre, etc. I
don't have links off hand, but they're not too difficult to find.

These companies tend to focus on off-the-shelf turnkey solutions, so they'll
have a suite of a few voice actors to choose from for different character
archetypes.

~~~
ethbro
Out of curiosity, are there legal concerns?

E.g. training off Schwarzenegger and offering an Arnold transform

~~~
echelon
I'm not a lawyer, but I think we're entering into a legal gray area. There are
the existing frameworks of copyright, parody, free speech, slander, libel,
etc. that are all somewhat tangential to this.

I believe (I'm not certain) that celebrity voice _impersonation_ is legal as
long as it is not used to sell or endorse a product.

Most models are trained on the original speaker's voice, but maybe only a
little bit. Models might incorporate learning from many speakers. We might
even be able to boil down a speaker representation to a small vector encoding
in the future. It'll be interesting if we can capture the representation of a
person with just a few numbers.

I don't think the legislature should be overly protective against machine
learning. It seems obvious to me that neural networks will play a huge role in
creating entirely virtual musicians and influencers. We're already seeing this
start to happen. r9y9 on github has published some models that rival Vocaloid
in lyrical ability.

At the same time, we don't want these techniques used to commit fraud,
slander, or have them be used to falsely accuse someone of committing some
act. These are things we might need new legal protections for.

But I don't know what I'm talking about. I'm not a lawyer.

~~~
ethbro
I asked not because I expected an answer, but because I figured you'd have an
insightful opinion.

It's essentially the performance of a composition vs the composition question
again: at what point am I mimicking someone to the extent they have a valid
claim on a portion of my work?

I expect it'll enter the courts a few milliseconds after someone clones a dead
actor (without their estate's permission) for a new performance.

There's always been an inherent tension in the US distinction between a law of
nature and a creative work though. It seems a bit silly for me to claim patent
/ trademark on a vector that encodes my likeness.

~~~
mywittyname
I suspect there's some plausible deniability built-in that might allow for
such matters to be legal.

For example, lots of people sound like Arnold Schwarzenegger. So if you
trained a model on a tall, deep-voiced Austrian man, you could probably get
something that people will immediately associate with Arnold without actually
being his voice, or someone emulating him. Because much of what Americans
associate with his voice is really a regional accent which is relatively
uncommon in the US.

It may be a bit harder to get away with it for someone like Gilbert
Gottfried, whose voice is much more distinctive. But I do think you could
get away with creating a voice that people _think_ sounds just like him, but
doesn't hold up in a side-by-side comparison.

What I think will happen is celebrities like Morgan Freeman will use their
voice to train models like this, then gift these to their estates for use in
the future.

~~~
pbhjpbhj
> So if you trained a model with tall, deep-voiced Austrian man, you could
> probably get something that people will immediately associate with Arnold
> without actually being his voice, or someone emulating him. //

I think "passing off", an unregistered element of trademark laws, may be
pertinent here. If the public think that there's an association and you're
knowingly trading on that, even if the public are wrong, then you can be
'passing off' your output as someone else's goods/services/[vocal renditions].

It's likely you'd have to be very careful about use of copyright material for
training the voice (eg extracting metrics that describe the voice). Fair Use
might apply in USA though (even commercially).

 _IANAL, this is not legal advice._

------
rglover
Oh jeeze. I had to. Switch it to Bill Gates and pop this in:

> I'm going to steal your soul. One injection at a time. Slowly, over the
> course of the next decade, the entire essence of your being will be
> demolished until your body is nothing but a vessel for my command.

~~~
GordonS
It's pretty childish, but it was quite satisfying to hear George Bush confess
to being a war criminal too!

~~~
rglover
It'd be irresponsible not to be childish with this thing.

------
searchableguy
That's super cool.

I am worried about the potential abuse of this service. Are there any existing
services that help identify audio deepfakes the way this one helps make them?

Found Resemblyzer: [https://github.com/resemble-ai/Resemblyzer](https://github.com/resemble-ai/Resemblyzer)

~~~
GrantZvolsky
Such a system will always suffer from false positives and false negatives.

On a more positive note, when deepfakes become a problem, we will see the
emergence of a culture where unsigned authoritative content is not paid any
attention.

~~~
kibwen
_> we will see the emergence of a culture where unsigned authoritative content
is not paid any attention_

If current events are any indication, that culture will only emerge 30 years
after the tech becomes widely usable, and in the interim will lead to absolute
chaos in the form of weaponized disinformation.

~~~
totetsu
It could be built into Youtube or twitter to pop up a little "this could be a
fake voice" banner

------
echelon
This is my pandemic side project, and I'll be happy to answer any questions
about it.

~~~
zevv
After some experiments with playing text to people around me, we decided that
a huge factor in the perceived quality of the voice comes from knowing who you
are listening to before you listen; perceived quality was best when the
listener actually got to see a picture of the voice's owner. Was it a
deliberate choice to add those photographs for this reason?

------
superasn
Guess the server is overloaded now. All I'm seeing are errors.

Tip: it would be a cool idea to put some ready-made samples under the photos.
A lot of people like myself only want to hear some demos, and pre-saved mp3
samples are more than sufficient for that sort of thing. It would also help
reduce your server load.

------
Dramatize
I think you're going to have legal risks with a project like this. At some
point a celebrity is going to sue and we'll find out what the law decides.

------
mattigames
I was wondering: wouldn't it be possible to classify each celebrity's voice by
mood, so the output is less monotonic? One could then add metadata to the text
for the text-to-speech conversion, e.g. "[Angry] I have a dream, [Calm] but it
has a patent so you can't copy it! (laughter) [Calm-fade-to-angry] In reality
insomnia took it from me!"
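A prefix convention like that is straightforward to parse. A minimal sketch (the `parse_moods` helper is hypothetical, and `(laughter)`-style inline cues aren't handled):

```python
import re

# Split "[Angry] some text [Calm] more text" into (mood, text) segments,
# as suggested above. Tag names are free-form; only the "[Mood] ..."
# prefix convention is handled here, not "(laughter)" style cues.
TAG = re.compile(r"\[([^\]]+)\]\s*")

def parse_moods(text):
    """Return a list of (mood, text) pairs from mood-tagged markup."""
    parts = TAG.split(text)  # ["", mood1, text1, mood2, text2, ...]
    return [(mood, chunk.strip())
            for mood, chunk in zip(parts[1::2], parts[2::2])]
```

The segments could then be fed one at a time to a style-token-conditioned TTS model.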

~~~
echelon
Absolutely. These are called "style tokens" and they're an active area of TTS
research.

The problem is that currently your training data has to be annotated with
these tokens, and that adds a lot to the difficulty of creating data sets.

I imagine that over time this will get much easier to do.

~~~
mywittyname
Are there good emotion detectors for speech-to-text? Much like they have for
facial recognition?

~~~
echelon
I'm not aware of any, and I haven't had much time to look as I'm not to the
point of doing style tokens yet. I'm certain this would be useful for
annotating data and for all sorts of other applications. Sentiment analysis,
etc.

------
Firerouge
I skimmed your about, where you mention it as a hobby demo of your deep work.

Do you have a GitHub or technical documentation about how you build this sort
of thing to work at scale?

~~~
echelon
I can make a blog post later, but at a high level:

A Rust TTS server hosts two models: a mel inference model and a mel inversion
model. The ones I'm using are glow-tts and melgan. They fit together back to
back in a pipeline.

I chose these models not for their fidelity, but for their performance.
They're 10x faster at inference than Tacotron 2. If you want something that
sounds amazing, you're better off with a denser set of networks, like Tacotron
2 + WaveGlow; those are better suited to offline rendering for multimedia
work.

Instead of using graphemes, I'm using ARPABET phonemes, and I get these from a
lookup table called "CMUdict" from Carnegie Mellon. In the future I'll
supplement this with a model that predicts phonemes for missing entries.
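A minimal sketch of that lookup step, with a three-entry dict standing in for the full ~140,000-entry CMUdict (the entries and helper name are illustrative):

```python
# Tiny stand-in for the CMUdict grapheme -> ARPABET lookup described
# above. The real dictionary has roughly 140,000 entries; these three
# (including one hand-added term) are just for illustration.
CMUDICT = {
    "HELLO": ["HH", "AH0", "L", "OW1"],
    "WORLD": ["W", "ER1", "L", "D"],
    "FORTNITE": ["F", "AO1", "R", "T", "N", "AY2", "T"],  # hand-added entry
}

def to_phonemes(text):
    """Map each word to its ARPABET phonemes; unknown words are flagged
    for manual entry (or, eventually, a grapheme-to-phoneme model)."""
    phonemes = []
    for word in text.upper().split():
        if word not in CMUDICT:
            raise KeyError("no CMUdict entry for " + word)
        phonemes.extend(CMUDICT[word])
    return phonemes
```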

Each TTS server only hosts one or two voices due to memory constraints. These
models are huge. This fleet is scaled horizontally. A proxy server sits in
front and decodes the request and directs it to the appropriate backend based
on a ConfigMap that associates a service with the underlying model. Kubernetes
is used to wire all of this up.
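As a rough sketch (voice and service names invented), the proxy's routing step reduces to a voice-to-backend lookup mirroring that ConfigMap:

```python
# Illustrative voice -> backend-service table, standing in for the
# ConfigMap described above. Real service names will differ.
ROUTES = {
    "arnold_schwarzenegger": "tts-group-1",
    "barack_obama": "tts-group-2",
}

def route(voice):
    """Return the backend service hosting this voice's model pair."""
    if voice not in ROUTES:
        raise ValueError("no backend registered for voice: " + voice)
    return ROUTES[voice]
```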

~~~
calebkaiser
This is incredibly cool. Do you mind sharing how big the models are, and what
kind of instances you're deploying them on?

I ask because I help maintain an open source ML infra project (
[https://github.com/cortexlabs/cortex](https://github.com/cortexlabs/cortex) )
and we've recently done a lot of work around autoscaling multi-model
endpoints. Always curious to see how others are approaching this.

~~~
echelon
glow-tts:

    
    
        total 4.2G
        -rw-r--r-- 1 bt bt 110M glow-tts_alan-rickman_ljstx_2020.07.22_expr-1_chkpt-4765.torchjit
        -rw-r--r-- 1 bt bt 110M glow-tts_anderson_cooper_ljstx_2020.07.21_expr-1_chkpt-6622.torchjit
        -rw-r--r-- 1 bt bt 110M glow-tts_arnold_schwarzenegger_ljstx_2020.07.16_expr-2_chkpt-9045.torchjit
        -rw-r--r-- 1 bt bt 110M glow-tts_barack_obama_ljstx_2020.06.28_expr-1_chkpt-1729.torchjit
        -rw-r--r-- 1 bt bt 110M glow-tts_ben-stein_ljstx_2020.07.21_expr-1_chkpt-7516.torchjit
        -rw-r--r-- 1 bt bt 110M glow-tts_betty_white_ljstx_2020.06.28_expr-1_chkpt-1666.torchjit
        ...
    

melgan:

    
    
        -rw-r--r-- 1 bt bt 17M melgan_manyvoice5.0_2020-07-23_12d5838_10760.torchjit
     

(All the voices use the same melgan, or derivations of it.)

I'll edit my post later with my deployment and cluster architecture. In short,
it's sharded and proxied from a thin microservice at the top of the stack.
I'll probably introduce a job queue soon.

------
mmastrac
I'd love to have an option for Majel Barrett

~~~
dsteinman
I was going to mention the same. It would be a childhood dream come true to
talk to my computer and have it talk back to me in the TNG computer voice.

~~~
echelon
That's a fantastic suggestion! I'll get to it!

~~~
Baeocystin
Semi-serious follow-on question- would your model be able to produce voices
like GLaDOS, which are highly processed, but in a consistent manner? Or are
there too many assumptions baked in regarding normal human speech?

------
101008
Can you comment a bit on the tech behind this? I tried something similar with
songs: I wanted artist X to sing a song from artist Y. I cleaned the voices
and the audio, but the transfer just didn't work. I didn't do any annotations
on the text (it shouldn't be that hard since all lyrics are available), but if
you could recommend a path or maybe an open source project I'd be grateful.
Thanks, and great work by the way!

~~~
echelon
Thanks!

There are a lot of neat research threads ongoing in terms of generating
vocals.

Nvidia published Mellotron (code + paper + models), and the results are
promising:

[https://github.com/NVIDIA/mellotron](https://github.com/NVIDIA/mellotron)

[https://nv-adlr.github.io/Mellotron](https://nv-adlr.github.io/Mellotron)

The best results I've seen are from researcher Ryuichi Yamamoto (r9y9 on
Github). He continually publishes astonishing results and novel architectures:

[https://github.com/r9y9](https://github.com/r9y9)

[https://github.com/r9y9/nnsvs](https://github.com/r9y9/nnsvs)

[https://soundcloud.com/r9y9/sets/dnn-based-singing-voice](https://soundcloud.com/r9y9/sets/dnn-based-singing-voice)

These results lead me to believe he's going to have a replacement for Vocaloid
soon.

There's lots more stuff out there, and I can come back and edit my post later.

Some folks are getting good results by simply combining Tacotron with
autotune:

- [https://www.youtube.com/watch?v=3qR8I5zlMHs](https://www.youtube.com/watch?v=3qR8I5zlMHs)
Mister Rogers sings Beautiful World (amazing, super charming, and shows the
promise of this tech)

- [https://www.youtube.com/watch?v=K1jrDgbRs9Q](https://www.youtube.com/watch?v=K1jrDgbRs9Q)
(Tupac, possibly NSFW lyrics)

- [https://www.youtube.com/watch?v=QW16_W0K3qU](https://www.youtube.com/watch?v=QW16_W0K3qU)
(Tupac with various results, possibly NSFW)

There's a lot that gets posted to /r/VocalSynthesis and occasionally
/r/MediaSynthesis

~~~
101008
Thank you very much, I will look at them!

------
mrthrowmantic
My bill gates input

"My name is Bill, the lord of computers. I love computers, and they love me
too. I'll give you a computer, maybe one, maybe two. If you are lucky it might
not even crash on you. Love your computer, like your daughter or your wife,
treat it with kindness, and it will reward you for life! I am bill the god of
computers. Bow to me now or I will be sod you."

------
netman21
I definitely need this. Looks like I have to wait until you are off the front
page of HN though.

I am a writer and found that the best editing comes when I am reviewing audio
files of my books from voice talent. Of course, by then it is way too late to
change anything. With a tool like this I can revise as much as I want!

------
ReedJessen
You haven't done very many female voices. Is this a limitation of the modeling
process?

------
vedran
This is great. I've been thinking about doing something similar with cartoon
characters to build a Disney-style companion for my son as he gets older. I'm
imagining something like an Alexa assistant but with Mickey Mouse's voice.

~~~
pbhjpbhj
I know caselaw isn't settled at all on this, but I'd absolutely avoid posting
anything on the web mentioning D' and the black-and-white mouse again, unless
you're interested in finding out firsthand how the law gets settled here ;o).

Not legal advice, of course.

------
sunsetMurk
Very cool, and easy to use!

Can you give some more info on how you generated the models? I'm also
interested in the tech stack you're using to implement this webapp... Would
love some details!

..What's next?

~~~
lowdose
> What's next?

Text To Video webapp that renders text to video + voice synchronised of famous
people.

Who wouldn't like to laugh 5X more when social scrolling?

The first platform that gives creators the ability to produce deep fakes of
celebs from text, which they can broadcast as HQ video content to their
audience, will kill both Youtube & Instagram.

Ranking based on likes so the best jokes of the day are trending on top of the
feed.

Recommendation engine with a multi-armed bandit ML algo from the start so you
can leverage all that incoming data.

~~~
sunsetMurk
Awesome, here come the 'deepMemes'!

------
iamthemonster
I assume that if somebody had physically lost their ability to speak, it would
now be possible to generate a pretty reasonable synthetic voice. Should we all
be archiving a high-quality voice sample as insurance?

The implications for security are huge. If your friend calls you up for a very
quick chat from an unknown number and asks you to remind them of your address,
are you going to ask for authentication to prove it's really them and not a
convincing synthetic voice?

------
vmception
This could make video games take up so much less space and have much more
robust speech, especially from NPCs.

Subreddit Simulator already produces pretty convincing conversations; put
those together with high quality voices? Mannnn, so many good applications.

Speaking of which, why don't people just talk about the good applications?
You'll get ostracized for speculating more bad things about COVID, but talk
about how doomed we potentially are with deep fakes? Give that blogger a
Pulitzer Prize!

~~~
echelon
> This could make video games take up so much less space and have much more
> robust speech, especially from NPCs.

Maybe, maybe not. You'll see some of the model sizes I posted in comments
above. These are quite large, and adding models for more speakers compounds
the problem. They have to live in memory and probably can't be paged in
selectively.

Once we achieve high fidelity multi-speaker embedding models (where multiple
speakers are encoded in a singular model), then we'll have something
compelling. I imagine the models will become less dense over time as well.

Furthermore, if the models are deterministic, designers will know exactly what
each line will sound like before it's produced.
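Using the file sizes posted upthread (~110 MB per glow-tts voice plus one shared 17 MB melgan), the scaling problem is easy to put rough numbers on:

```python
# Back-of-the-envelope RAM cost of per-speaker models, using the sizes
# posted upthread: ~110 MB per glow-tts voice, one shared 17 MB melgan.
GLOW_TTS_MB = 110
MELGAN_MB = 17

def resident_mb(num_voices):
    """Approximate memory needed to keep every voice loaded at once."""
    return num_voices * GLOW_TTS_MB + MELGAN_MB

# A modest 50-voice NPC cast already needs ~5.5 GB resident, which is
# why a single multi-speaker embedding model would be the compelling path.
```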

------
blisseyGo
I keep getting hit with the rate limiter so I wasn't able to try it :(

> There was an error and I still haven't implemented retry to make it
> invisible. You can absolutely submit your request again a few times; this is
> a self-healing Kubernetes cluster. Some models (voices) get more load than
> others and/or are scaled to fewer or more pods. There's also a rate limiter,
> but there aren't error messages yet.

------
shannifin
Nice work, I love this sort of stuff!

I know there are legal (and perhaps ethical?) issues to work out, but I really
wish tech like this, if fine-tuned, could be used to resurrect stuff like Jim
Henson's original Kermit voice; the Muppets' new voices all sound horrible.
I'd love for fictional character voices to become immortal.

------
nojvek
I think the site's being hugged to death by HN. I can't get any of the voices
to work, they all error out.

------
martinesko36
Having met some of the people on there, this is uncannily accurate, especially
in the High Quality voices. Well done!

------
billforsternz
This is good fun. I went with "Meanwhile, the young males form groups and
compete in trials of strength and courage in an attempt to catch the attention
of herding females" and surprisingly Richard Nixon sounds just as good saying
this as David Attenborough.

------
gitgud
Gilbert Gottfried sounds like "Bonzi Buddy" from the old desktop widget days...
Those were some crazy times

[https://en.m.wikipedia.org/wiki/BonziBuddy](https://en.m.wikipedia.org/wiki/BonziBuddy)

------
maxerickson
It doesn't like spelling mistakes.

Ask it to say " The Aristoocrats!" with Gilbert Goddfried.

~~~
echelon
It's currently sourcing phonemes from a lookup table called CMUdict, which is
constructed by Carnegie Mellon [1]. That database has 140,000 entries, but
even so, you'd be surprised how many common words are omitted. And of course
it is missing terms for things like "pokemon" and "fortnite", which I had to
add myself.

I don't have generic grapheme -> phoneme/polyphone prediction, but that's
something I look to add soon. In my literature review I didn't see anything in
this space, so I was thinking I might have to come up with something novel.

[1] [http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)

~~~
totetsu
Espeak-ng also has extensive language rules to turn things like letter and
abbreviations into pronounceable text [https://github.com/espeak-ng/espeak-
ng/blob/master/dictsourc...](https://github.com/espeak-ng/espeak-
ng/blob/master/dictsource/en_list) you can get it's ipa output with "speak-ng
-q --ipa=3"

------
newtoday
If I supply you with isolated Robert Plant vocals and transcripts, would you
consider training a model? That could be some 'interesting' output results
with the dynamics and range of his singing.

------
j-james
For something with a similar concept, but entirely different execution, see
the classic website "Talk Obama To Me".

[http://talkobamato.me](http://talkobamato.me)

------
Quequau
I just want to hear Douglas Rain's voice coming out of my computer.

------
kumarm
What's a good open source solution for text to speech?

Current cloud based solutions from AWS/Google Cloud/Azure are pretty
expensive.

------
valaymerick
Is there a reason why no female voice is available in this great project?

Do we have more data on male speech than female speech?

~~~
kabacha
What iconic female voices would you add?

~~~
webmaven
> What iconic female voices would you add?

Cate Blanchett, Sarah Silverman, Katey Sagal, Jennifer Tilly, Laura Prepon,
Viola Davis, Judi Dench, Whoopi Goldberg, Julie Andrews, Lake Bell, Jane
Lynch, Joan Rivers, Martha Stewart, Katharine Hepburn, Sarah Vowell, Shohreh
Aghdashloo...

~~~
dencodev
Whoopi and Joan Rivers are the only two on that list whose voices I actually
know.

------
afarrell
Is this ethical?

~~~
tdeck
I too expected more discussion of this. People play around with these things
because they're interesting, then mostly hand wave away concerns about the
implications with "well, people will just have to learn to be skeptical of
recordings". But what we're really doing is muddying a previously reliable
avenue of gaining quality evidence about the world. I expect this opinion is
unpopular on HN but I think people shouldn't be developing these things,
companies shouldn't be working on them, and they should be banned before they
get to the point of causing real harm. I also believe that _can_ be prevented
by drying up funding and research, because bad actors have to rely on the body
of existing work to make their bad actions practical.

~~~
slickQ
As NN models get more advanced, generated speech will get progressively more
convincing and less expensive to implement, even if the models aren't built
for speech synthesis specifically. The same can be said for image
generation/transformation. If we are to continue developing AI, then this is
likely inevitable. There are benefits to these models, for example for mute
people. Adversarial models can be built to detect fake audio samples.
Regulation (e.g. requiring tells/signatures in commercial products) would also
help. The government would have to ban most AI research, or it would only be
prolonging the inevitable.

------
programmarchy
Looks awesome, but haven't been able to get a result back yet. I think you may
be getting hugged to death :)

~~~
echelon
That's odd. I'm testing it right now and it's working. Which voices are you
trying, and which device and browser are you using?

~~~
imjared
I've tried Gilbert Gottfried and NDT. I do get a console error about CORS:

> Access to fetch at 'https://mumble.stream/speak_spectrogram' from origin
> 'https://vo.codes' has been blocked by CORS policy: No
> 'Access-Control-Allow-Origin' header is present on the requested resource.
> If an opaque response serves your needs, set the request's mode to 'no-cors'
> to fetch the resource with CORS disabled.

Using Chrome stable

~~~
echelon
Oh man, I thought I had this CORS stuff sorted.

Thanks for the help and info!

I'm using version 84.0.4147.89 (Official Build) (64-bit) and getting back
responses.

I got the following response headers:

    
    
      access-control-allow-origin: https://vo.codes
      content-length: 151689
      content-type: application/json
      date: Mon, 27 Jul 2020 15:55:37 GMT
      vary: Origin
      x-backend-hostname: tts-group-1-965d444f5-7kvkm
    

I'll try to dump the cache and reproduce.

edit: I must have an old browser. It works everywhere I'm testing it. CORS is
hard. :(

~~~
programmarchy
I'm also on Chrome (84) macOS, using the Craig Ferguson model.

I switched to Safari and disabled CORS, but a 500 error is coming back now. So
maybe the 500 response is the root cause, and the error handler isn't
returning CORS headers, masking the issue on Chrome.

Edit: by putting in a shorter input (sentence rather than paragraph) I was
able to get a response.
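That diagnosis, an error path that skips the CORS headers so the browser reports a CORS failure instead of the 500, can be sketched like this (the origin check and length limit are illustrative, not the actual vo.codes code):

```python
# Sketch of the suspected bug: if only the happy path attaches the CORS
# header, a server-side 500 reaches the browser without
# Access-Control-Allow-Origin and surfaces as a CORS error instead.
ALLOWED_ORIGIN = "https://vo.codes"

def handle(origin, text):
    """Return (status, headers); headers carry CORS on every path."""
    headers = {"Vary": "Origin"}
    if origin == ALLOWED_ORIGIN:
        headers["Access-Control-Allow-Origin"] = origin
    if len(text) > 400:  # hypothetical input-length limit
        return 500, headers  # fix: error responses reuse the same headers
    return 200, headers
```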

~~~
echelon
I need better error messages, but I believe it should respond with something
stating the length is too long.

What may have happened is that the instance your request was farmed out to got
OOM-killed. I've provisioned lots of memory, but these models are pretty
massive and each inference run has to spin up a lot of matrices in memory.

This is all CPU inference, not GPU.

When the pods get OOM killed, they spin up again. The clusters for each
speaker are about 5-10 pods apiece (with some double tenancy).

------
almstimplmntd
Played with the gilbert gottfried speaker option, gave me serious Twin Peaks
"Red Room" vibes.

------
jsilence
Wondering whether there is an androgynous TTS voice that sounds neither male
nor female.

------
baxtr
Please add Steve Jobs. I miss him

------
cptnapalm
Apparently alone in the world, I really want a Dalek voice.

------
galuggus
How long does it take to generate a good quality voice?

~~~
echelon
I trained a base model on the Linda Johnson speech (LJS) data set for several
days.

I then transfer learned for each of these speakers. Some speakers have as
little as 40 minutes of data, others have up to five hours. The resulting
quality isn't strictly a function of the amount of training data, though more
typically helps. It's also important to have high fidelity text transcriptions
free of errors.

The transfer learning runs vary between six and thirty-six hours.

I'm using 8xV100 instances to train glow-tts and 2x1080Ti to train melgan. I'm
continuously training melgan in the background and simply adding more training
data. The same model works for all speakers.

~~~
blueblisters
Have you had any success with using speaker embeddings to generate voices with
fewer samples of speech? I did some cursory experiments but I couldn't get too
far beyond getting pitch similar to the target speaker.

My reasoning for this approach: IMO, if the model learns a "universal human
voice", it shouldn't need too much additional information to get a target
voice.

~~~
echelon
I did! I tried creating a multi-speaker embedding model for practical
concerns: saving on memory costs. I'm going to have to add additional layers,
because it didn't fit individual speakers very well. I wish I'd saved audio
results to share. I might be able to publish my findings if I look around for
the model files.

I think you're right in that if we can get such a model to work, training new
embeddings won't require much data.

~~~
webmaven
Hmm. Would a multi-speaker model be able to interpolate between voices (eg.
halfway between Morgan Freeman and James Earl Jones)?

------
ChadTheNomad
Well, there goes the neighborhood...

------
modzu
who owns the likeness of a voice? anyone? are these legally safe to use in a
product?

------
phantom_rehan
Did you use machine learning?

~~~
echelon
Yeah, forks of two open source pytorch models.

[https://github.com/jaywalnut310/glow-tts](https://github.com/jaywalnut310/glow-tts)

[https://github.com/seungwonpark/melgan](https://github.com/seungwonpark/melgan)

------
aww_dang
Needs more Christopher Walken

~~~
echelon
I both love your HN username (I hope you don't troll dang), and think that's
an awesome suggestion. I don't know why it didn't occur to me.

------
modzu
I can't be the only one wondering where Morgan Freeman is :')

------
AnnoyingSwede
WOW!!! :)

------
holoduke
really impressed by this.

~~~
echelon
Thanks! I kind of want pg to see :)

If he thinks this is egregious, I'll take it down.

------
riazl
well done.

------
souravraj95
hiii

------
minerjoe
Hate to be that guy, but I can't participate in this discussion because the
landing page requires JavaScript.

As an outlier not running JavaScript, I'm reaping what I sow, but it would be
nice for me and others in the same boat if projects made their landing pages
viewable without JavaScript.

~~~
thiagocsf
There’s a text area and a button to say what you typed.

Surely you can enable or use a browser with JavaScript when you choose to?

~~~
minerjoe
If you don't have JavaScript you see only "This page requires Javascript". I
would hope that, even if the thing requires JavaScript to operate, I could at
least find out whether it's worth switching to another machine with X11 and
firing up Firefox.

