
Voice Synthesis for in-the-Wild Speakers via a Phonological Loop - itamarb
https://ytaigman.github.io/loop/
======
bluetwo
I still think emphasis on a word or syllable is important here, as far more
information is conveyed through inflection than you realize.

Consider:

_I_ am going to eat the ham sandwich = Me, no one else

I _am_ going to eat the ham sandwich = Nothing can stop me

I am _going_ to eat the ham sandwich = On my way; got distracted

I am going _to_ eat the ham sandwich = In case you doubt my intent

I am going to _eat_ the ham sandwich = I will not be juggling it

I am going to eat _the_ ham sandwich = The ultimate ham sandwich will be mine

I am going to eat the _ham_ sandwich = Not turkey, not roast beef

I am going to eat the ham _sandwich_ = Between two slices of bread is what I
do

~~~
vosper
This made me chuckle, and it's a great illustration of how much meaning is
changed by different emphasis. However, I would read the _to_ example like
this:

I am going _to_ eat the ham sandwich = The sandwich is the reason I am going
(to the party, or wherever...)

------
olegkikin
Similar in quality to Lyrebird

[https://soundcloud.com/user-535691776/dialog](https://soundcloud.com/user-535691776/dialog)

Google WaveNet sounds almost perfect in comparison:

[https://deepmind.com/blog/wavenet-generative-model-raw-audio...](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

~~~
ihsw2
Some of the generated speech clips are unsettlingly robotic while WaveNet
sounds passable, but it's the piano compositions that I found unnerving. I
can't explain why, but randomly generated music sounds so hollow and cold.

~~~
tubian
Note that WaveNet was not trained "in the wild" (like, on celebs) but rather
on a speech style dataset

------
abhishek0318
Mix this with AI creating video from audio
([http://spectrum.ieee.org/tech-talk/robotics/artificial-intel...](http://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence/ai-creates-fake-obama))
and you can make anyone say anything.

------
Animats
Coming soon, audio ads with your friends' voices.

~~~
booleandilemma
Or your own :(

~~~
aknoob
Isn't this scary?

~~~
euyyn
With how much people hate the sound of their own voices, I think that would
backfire on the advertiser.

~~~
pjc50
Subvocalisation. Your own voice speaking quietly in the background to
something else. Ideal for consumer indoctrination.

~~~
digi_owl
The voice we hear in our head and the one everyone else hears are starkly
different.

~~~
NTripleOne
The voice I 'hear' in my head and the one I actually hear when talking are
starkly different.

------
azinman2
To me this is very exciting. I'm already working on my own home digital
assistant modeled as NeNe Leaks from the Real Housewives to add personality to
otherwise boring conversations with a robot. I've been looking at various
style transfer techniques, and having something a bit more plug & play will
help me focus on the more unique parts. I predict that we'll see more
celebrity voices as conversational interfaces become more common.

Part of the complexity is going from 'context-free phonemes' to actually
modeling personality. The voice needs some way to know how to embed emotion,
ideally inferred contextually from the sentences themselves. NeNe is an
interesting example as she adds so many non-verbal sounds to her dialog
(bleeps and bloops and eye rolls that she translates into affected speech).
That's part of what makes her NeNe, and a big part of the entertainment value.
Pursuing that is what will bring style transfer to the next level: total
personality emulation. I fantasize about basic animatronics that can move her
head side to side, twirl, and literally give eye rolls.

If anyone wants to work on this with me, give me a ping @azinman on Twitter.
I've currently been thinking about this as an open source project, but I'm
still keeping my options open as I continue development. I've got a ton more
ideas for integrating her with my bleeding-edge smart home, far more than just
personality emulation (including what I believe to be a breakthrough in
passive context-sensing, the real key to making the smart home actually
smart).

~~~
stephengillie
What language are you working in? I've been working in PowerShell out of
convenience, but am looking to port my speech bot to Node.

~~~
azinman2
I'm working in Go right now, but probably will end up with a mix of various
things. What's nice about Go is that it's very portable, faster than Python,
nicer RAM/storage usage than Node (not needing a JIT and all), and I can
cross-compile binaries and distribute them to Raspberry Pis or whatever.

------
johannkaupen
There are too many ways to commit fraud with this to list here.

One example: not too long ago I still handled my more important banking
with a quick phone call (it couldn't be done entirely online).

~~~
tyingq
Would certainly make the phishing scheme where _"fake CEO sends real CFO an
email request to send a bank wire"_ more successful. Send the email, follow up
with a voice mail.

------
digi_owl
For some reason this page gives Firefox a fit, and that is with
multiprocessing enabled...

------
placeybordeaux
Anyone else having trouble with the audio samples?

------
m00dy
I'm waiting for the code samples :)

Thanks

