
Adobe demos “photoshop for audio,” lets you edit speech as easily as text - nerdy
http://arstechnica.com/information-technology/2016/11/adobe-voco-photoshop-for-audio-speech-editing/
======
eduren
I've actually been tossing around the idea of creating a program like this,
although for a specific use case.

In Bethesda games (Oblivion, Skyrim, Fallout) there are large modding
communities adding new quests, areas and plot lines. But one technical and
financial challenge for them has always been voice acting. Not only do they
have to worry about voice acting new potential characters, but they have no
means of writing new dialogue for existing characters.

In Fallout 4, for example, the protagonist is fully voice acted. That means a
distinct difference between the way the main game feels and any modding
efforts made by the community (barring actually re-hiring the original voice
actor for new lines).

I'm envisioning having this tool train on the voice lines already provided in
the game (depending on the character in question, that's quite a bit), and
then letting mod authors input new dialogue lines to be spit out in something
close to the actor's voice.

Lots of problems with the approach of course, not to mention the fact that
these are actors and not just voices (there would probably be a significant
amount of emotion lost). But it would give the modding community such a
powerful tool to add new plots for existing characters.

~~~
Frenchgeek
[https://www.youtube.com/watch?v=YPwFuCL33I8](https://www.youtube.com/watch?v=YPwFuCL33I8)
but automated, then.

~~~
southclaw
I used to love watching mans1ay3r's videos, amazing level of creativity and
hard work required to assemble all the voice clips! I can only imagine what
hilarity would come from a fully fledged voice-splicing application!

------
Jugurtha
I wonder how much better the rendering would be if the audio track were much
longer and the software had more to learn from. I don't mean more _words_
for a 1 to 1 match since it's clearly beyond that (pronouncing words that it
didn't see), but voice features that weren't in that short track.

Hypothetical question: say it had access to all the episodes from Key & Peele,
would the rendering be better to the point you could basically generate an
audio track from a script with intonation and all?

It would be interesting if they offered "voice packages" either online or
offline so you could just pass text through it and the output would be a
Morgan Freeman narration. You'd have a shop for "Cords" the same as iTunes for
songs & apps. Maybe game developers will find that interesting, too. Having
access to way more voices than they'd have in real life, and on a budget.

Someone could also save their voices for posterity. Many people listen to
recordings of loved ones who passed away to remember them. Saving the voice
for new content would be something to think about.

~~~
loudmax
> It would be interesting if they offered "voice packages" either online or
> offline so you could just pass text through it and the output would be a
> Morgan Freeman narration.

On that note, should Morgan Freeman be allowed to copyright the sound of his
own voice?

My immediate inclination is No, you shouldn't be allowed to copyright a voice
timbre because it opens up so many weird possibilities for making even more of
a mess of copyright law. For example, what happens if two people happen to
have the same voice, or at least voices close enough that Adobe's software
can't distinguish them?

But there is a reasonable case to be made that Morgan Freeman's voice belongs
to him. If some company profits from having a reproduction of Morgan Freeman's
voice in one of their products, maybe they should pay him.

~~~
thefalcon
Morgan Freeman's likeness - including his voice - is already protected
legally, no?

~~~
bnj
If you're using "audio with substantial similarity" to the voice of Morgan
Freeman, but using it to create entirely new works... doesn't that create new
and interesting collisions between rights of publicity, fair use,
transformative works...?

------
nerdy
I thought it sounded badly cut when he moved "wife" in the sentence, but
adding new text was pretty amazing.

~~~
kyriakos
It did sound like they picked the right things to say. I wouldn't be surprised
if the extra words appear in the 20 minutes of training speech. Either way,
even if it's not as magical as they make it out to be, it's still extremely
useful for the voiceover industry.

~~~
lozf
But not so much for the voice over talent.

~~~
kyriakos
Well, another job that will become obsolete, I guess.

------
pavel_lishin
How long before video and audio evidence will no longer be admissible in
court? 2030?

~~~
trendia
Photos are still admissible, despite Photoshop existing for several decades.

~~~
webwielder2
People were convincingly doctoring photos looong before Photoshop.

~~~
cptskippy
Communist countries are notorious for making people disappear.

[http://www.businessinsider.com/people-who-were-erased-from-h...](http://www.businessinsider.com/people-who-were-erased-from-history-2013-12)

------
noonespecial
1) Mentioned near the end of the video that it actually required around 20
minutes of audio to start synthesis. Not quite as magic as it first seemed.
Still cool.

2) The intonation always matched the initial sample. Give us some filters like
"vocal fry", "perplexed", "angry", "wonder" etc and then we'll really have
something here.

~~~
Florin_Andrei
> _Give us some filters like "vocal fry", "perplexed", "angry", "wonder" etc
> and then we'll really have something here_

Neural networks can already do style transfer for images. Seems like it should
be doable for voice, too.

~~~
augustt
I wonder if you could get interesting results by applying neural style
transfer to spectrograms.
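
For anyone wanting to try this, the first step is just turning audio into a
spectrogram that an image-style model could consume. A minimal sketch,
assuming nothing about how Adobe does it: `magnitude_spectrogram`,
`frame_size` and `hop` are made-up names for illustration, and the plain DFT
here is slow but dependency-free. The style transfer itself is not shown.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=16, hop=8):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..N/2)."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Hann window to reduce leakage between frequency bins.
        windowed = [
            x * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT for bin k; an FFT library would be used in practice.
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# A pure tone that completes two cycles per 16-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
spec = magnitude_spectrogram(tone, frame_size=16, hop=8)
```

Each row of `spec` is one time slice, so the result can be treated as a
grayscale image with time on one axis and frequency on the other.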

------
erikschoster
Sounds a lot like this:
[https://youtu.be/xzL-pxcpo-E?t=933](https://youtu.be/xzL-pxcpo-E?t=933)

IRCAM has been doing some really cool stuff in this area for a long time.
Check out their pages on corpus-based synthesis for example:
[http://imtr.ircam.fr/imtr/Corpus_Based_Synthesis](http://imtr.ircam.fr/imtr/Corpus_Based_Synthesis)

------
echelon
This sounds waaay better than the Donald Trump text to speech system I've been
working on: [http://jungle.horse](http://jungle.horse)

I wish I could chat with their engineering team. I'd love to learn the
mathematics and tech. (A lot of it might be patented?)

Is there an equivalent of SIGGRAPH for audio?

~~~
pgbovine
UIST is one of the premier human-computer interaction tools conferences, so
it's the closest that accepts "SIGGRAPH-like" technique papers for audio.
Maneesh Agrawala from Stanford has several great papers in this space of audio
mixing/editing:
[http://graphics.stanford.edu/~maneesh/](http://graphics.stanford.edu/~maneesh/)

~~~
staticfloat
Top poster may also be interested in ICASSP, which concerns itself more with
the algorithms of speech/audio processing than the applications of said
processing.

------
jbverschoor
If photoshop would give these results, a lot of industries would go belly up.

Nice marketing line, but it's speech recognition that sets the begin/end
frames in the sample.

I was expecting either "painting" away defects or actually reconstructing a
real TTS by using a small sample.

~~~
SamBam
The comparison to photoshop seems a silly stretch, but

> actually reconstructing a real TTS by using a small sample

it actually did this in the demo, if I understand what you're asking. It
created brand-new words, using the speaker's own voice, from a tiny sample
size. (If we believe the demo.)

~~~
phreack
At the end of the demo the speaker clarifies that it requires about 20 minutes
of speech, and this was a controlled demo so it's quite possible that brand
new words were actually not created.

~~~
zodPod
Think about what one could do with this tool and a dataset of the speeches of
a president or something. I mean, with the number of speeches our presidents
give from candidacy to finishing their term as president, we'd probably have
weeks' worth of words to pick from. If it's an intelligent system you could use
this for so much even if it _is_ simple cutting and pasting!
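
To make the "simple cutting and pasting" case concrete, here is a rough
sketch. It assumes you already have a word-level alignment of the speeches
(which a speech recognizer could supply); `splice_words`, `word_index` and
`gap` are hypothetical names, and this is naive concatenation, not whatever
Adobe's demo actually does.

```python
from typing import Dict, List, Tuple

# (start, end) sample offsets into the corpus; end is exclusive.
Clip = Tuple[int, int]


def splice_words(
    samples: List[int],
    word_index: Dict[str, Clip],
    text: str,
    gap: int = 0,
) -> List[int]:
    """Build new audio by concatenating recorded clips of each word in text."""
    out: List[int] = []
    for i, word in enumerate(text.lower().split()):
        if word not in word_index:
            # Pure cut-and-paste cannot say a word the speaker never said.
            raise KeyError(f"no recording of {word!r} in the corpus")
        start, end = word_index[word]
        if i > 0:
            out.extend([0] * gap)  # short silence between words
        out.extend(samples[start:end])
    return out


# Toy corpus: the "audio" is just the numbers 0..99.
corpus = list(range(100))
index = {"hello": (0, 3), "world": (10, 12)}
spliced = splice_words(corpus, index, "world hello", gap=1)
```

The `KeyError` branch is the limit of this approach: with weeks of speeches
the index gets large, but anything outside it still needs real synthesis.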

------
schoen
Copied from my comment on an earlier submission on this:

I don't see how the watermarking they talk about is going to succeed in
preventing forgeries.

If they're planning to watermark unedited recordings, you have a huge false
positive problem because there are billions of hours of legitimate but
unwatermarked audio recordings, and will probably continue to be. You can also
get false negatives by tampering with a watermark-capable device to get it to
watermark something that wasn't recorded from analog. Or you can rerecord
edited audio from an analog source and simply claim that your "genuine"
recording is slightly noisy.

If they're planning to watermark edited recordings, someone else can implement
the same kind of technology but without the watermarking.

~~~
dictum
I doubt their watermarking can survive the analog hole and multiple reencodes.

~~~
rasz_pl
As AnimalMuppet said below, it will survive the analog hole quite well, but
degrade quality. Look up Cinavia.

It's so bad you can actually hear the distortions.

------
hammock
You know this is a good idea when half the commenters already have a half-
baked version of this created themselves!

~~~
godDLL
Aha, like it was with all the pre- and post-Slack chambered forums with cloud
history sync.

Did many of them become a success? Now Microsoft is in on it too, so you gotta
wrestle that gorilla.

------
Something1234
Just imagine what this will do for dubbing anime or any other tv show. It's
still scary how this can be abused.

~~~
Delameko
What about feeding in an audiobook narrated by a well-known personality (like
Stephen Fry), and then using the voice to narrate your own self-published
eBook?

~~~
Something1234
Maybe this technique will lead to people "copyrighting" their voice. That
would be a weird world, where intonation can be copyrighted.

~~~
schoen
Some people have tried to analyze this as an aspect of right of publicity
(e.g.
[http://ir.lawnet.fordham.edu/cgi/viewcontent.cgi?article=281...](http://ir.lawnet.fordham.edu/cgi/viewcontent.cgi?article=2813&context=flr)
but I think there are many other articles on this topic). The right of
publicity analysis could be different from the copyright analysis.

------
jwebb99
"Photoshop for audio," seems so obvious, I'm surprised we haven't seen this
before. (After all, the underlying technology has been around for a while
now.)

------
gallerdude
Joaquin Phoenix is now going to narrate all of my audio books.

~~~
vkou
I doubt this software can figure out which parts of your audiobook should be
whispered, which parts emphasized, which kind of accent each character should
have, when the passage should be read as a playful conversation, as opposed to
a clinical narrative...

A lot more goes into an audiobook than some guy reading 300 pages of text into
a microphone.

~~~
tdb7893
I wonder how long it will be before someone gets it doing intonation
correctly. I doubt it's an intractable problem and it probably wouldn't be too
hard to turn existing audiobooks into a pretty decent dataset.

~~~
vkou
If you want it to not sound like a comedy of errors, you'll have to wait until
we figure out Strong AI that can understand sarcasm in the works of
Shakespeare... And be able to carry on a coherent, deep philosophical debate.

The difficulty is not in modulating the voice to create intonation, but in
annotating prose with intonation.

------
glasz
if adobe has this working in a demo, rest assured "security service" developed
such thing 10 years ago. then you can go back and ask yourselves why osama has
been reported dead as early as like 2001, the cia released videos in which he
always looked different and why his body was quickly drowned at an unknown
location.

go back to sleep, now. everything's alright. great new tech. will help
catching terrorists from beneath your bed.

~~~
junkgvdghD
Wait. America made up Osama. Killed him in 2001. But quietly so no one would
know. Then digitally created new videos of him, all to kill him in 2011. Uh
huh

