
AI produces realistic sounds that fool humans - yunque
http://news.mit.edu/2016/artificial-intelligence-produces-realistic-sounds-0613
======
slr555
AI that produces sound through analysis of a source video is impressive.
Fooling humans is not. Since most of us have grown up on a steady diet of film
and television, many of the sounds we carry in our memories are the work of
foley artists who add sound effects to sequences in post. The sound of horse
hooves on cobblestones is likely created with a percussive technique that
involves no equine participation. The sound of people being punched may be a
large piece of meat being struck with a club. Similarly, crunching snow is
likely not the sound of a person walking through actual snow.

Our perceptions of sound within a video/film source are already deeply skewed,
and therefore the notion that this AI amounts to a Turing test of sorts is a
weak analogy.

~~~
anigbrowl
You're right, but as someone who's done a lot of sound editing/foley work I
can't help having mixed feelings about seeing yet another job skill automated
away. Good part: in a few years this will be good enough for commercial use,
which will save sound editors all sorts of tedious, dull work and free them up
to do more exciting creative stuff. Bad part: the tedious, dull work was also
what paid the bills. The easier it is to do that stuff automatically, the less
people are willing to pay for good quality work.

Rather than now being able to make a living doing the sort of fun, really
creative stuff like inventing new sounds for teleportation devices or dramatic
natural phenomena, editors are more likely to be asked to work for free on the
theory
that they'll gain great exposure for their creativity. That's generally a very
bad bargain. If past trends in the electronic dance music market are anything
to go by, increasing automation will not reward true creative talent but
rather just lead to an arms race to have the latest sound libraries,
synthesizers etc. and just be the first to market with big splashy new sounds
that offer superficial novelty.

The ability to provide high-value equipment below normal rental cost
frequently trumps considerations of talent in the film industry. There are
plenty of crappy directors of photography out there who get hired regularly
because they own a pile of nice lenses and related camera equipment, and
hiring them plus their camera package looks economically attractive on paper
because it's hard to quantify photographic talent.

~~~
slr555
I have so much respect for foley artists. Artist being the operative word.
People don't appreciate the hard work and creativity that goes into making the
perfect sound.

------
ThomPete
I think people here are underestimating how big a deal this actually is, and I
think the headline is partly to blame for that. This is much less about
fooling humans than about what it actually means.

Humans spend a huge part of our early lives learning to listen and to connect
the dots between what we see and what we hear.

The fact that deep learning algorithms can now simulate audio based on what
they see: that's the big thing here. Not the production of sound that fools
humans. You can almost sense that imagination and inspiration are within the
reach of machine learning (yes, there is some way to go yet).

We are now seeing not only individual senses being simulated but also the
relationships between them. And as a bonus, what one machine learns in one
place can be instantly added to the knowledge of another.

That's IMO the big deal here.

~~~
YeGoblynQueenne
>> And as a bonus what one machine learns one place can be instantly added to
the knowledge of the other.

That's actually one big problem with machine learning algorithms: it's not at
all clear how to integrate their knowledge with that of other algorithms (and
that includes different instances of the same algorithm). Such algorithms
build a single model of one domain at a time, and we're talking about very
strict domains.

What we're seeing lately is many teams announcing that they trained an
algorithm to do this or that pretty damn amazing thing, but watch closely: how
many of those announcements describe a system that can integrate its learning
into a wider cognitive architecture? There are teams that trained models to
recognise images, to combine images, or to map images to strings, but all
these things are simple tasks that are only useful in a very limited range of
circumstances. Machine learning algorithms, unfortunately, are one-trick
ponies. They do one thing well, and that's it.

>> You can almost sense how imagination and inspiration is inside the reach of
machine learning (yes there are some way to go yet).

That's an understatement- the bit about having some way to go. We're not even
close, really. To train a machine learning algorithm the first thing you need
is a lot of examples of the thing you want it to learn. It's really hard to
see how one would compile a set of examples of imagination, not least because
it's inside people's heads. Not to mention that we don't even know what human
imagination is in the first place.

~~~
visarga
> Machine learning algorithms unfortunately are one trick ponies.

Most humans only know a few of the skills that other people know. We are
specialized, too.

~~~
YeGoblynQueenne
You're talking about specialisation in a restricted field, like maths or a
scientific discipline. That's part of education.

I'm talking about how all (healthy) humans learn about the world they inhabit,
by building a broad context of the entities and concepts in it. We all learn
to speak a language, for instance; in fact, I believe most people actually
learn a couple. We learn to interpret facial expressions, who is our friend
and who
is not, how to find sustenance and so on. We learn a whole bunch of things
outside of formal education and specialised technical knowledge.

We specialise even in the kind of knowledge I describe, sure, but we can also
change specialisation without too much hassle. I myself have been making my
living as a programmer for the past several years coming from a completely
non-technical background for most of my life. It was hard going to learn a new
thing from scratch, but I was perfectly able to do so. We lose this
flexibility as we grow older but for most of our lives we have nothing like
the limitations of machine learning.

~~~
visarga
> we have nothing like the limitations of machine learning

Well, that works both ways. Humans are quite limited; we haven't doubled our
IQ in the last 1000 years. But machines have even more potential to grow than
we do; in the next 50-100 years they will surely have matched our ability to
adapt.

Today, a company releases translation software for a couple of languages;
next year it releases translation between 100 languages. A human can't keep
up with that. Yes, these systems still need tuning and architecture design,
but that might change soon, maybe in a few years.

Also, the education of humans needs to be human-supervised, but machine
learning can be unsupervised (like AlphaGo playing millions of self-play games
to fine-tune its value function), and thus cheaper. I am sure Lee Sedol needed
much more energy to train up to that level of play. He's 33 years old, and in
order to get to his level he required resources, teachers, food, etc. AlphaGo
played a few million self-play games and only consumed a bunch of cheap
electricity, while doing it a hundred times faster and surpassing the man.

~~~
YeGoblynQueenne
>> AlphaGo played a few million self play games and only consumed a bunch of
cheap electricity,

Well, if Google has access to cheap electricity then I understand why they're
so successful. Unfortunately, I think they find it as expensive as everyone
else, except they have a larger budget than most and they can afford to burn
it for as long as they like (well, ish).

>> I am sure Lee Sedol needed much more energy to train to get up to that
level of play. He's 33 years old, and in order to get to his level he required
resources, teachers, food, etc.

Sure, but in the same vein AlphaGo required the energy and combined effort of
probably a few hundred thousand humans to create the infrastructure on which
it runs, the factory that created its hardware, the people who invented its
programming language, its algorithms and so on. If you're going to think about
historical costs, then think about historical costs.

But I'll refer you to my reply to ThomPete (same level): no, running AlphaGo
is not "cheap" in any way. There are huge costs involved, as there are for
pretty much all state-of-the-art machine learning algorithms, with deep
learning topping the curve. For instance, try training an instance of AlexNet
on the full ImageNet data on your hardware, with your home electricity budget.

>> next year they release translation between 100 languages.

That's not that hard to do. What's hard to do is to get good translation
between those 100 languages. In practice, for all companies who do machine
translation right now, translation works well between a few pairs (like three
or four pairs) and the rest is only useful as entertainment for native
speakers. I speak a few languages, French, English and Greek, and I can attest
to the fact that going from or to Greek from either French or English is just
hilarious, in any machine translation service I've tried, with Google
Translate first, of course.

I think you're just overestimating the quality of machine translation. I'm
afraid it's nowhere near as good as you think it to be.

>> Humans are quite limited, we haven't doubled our IQ in the last 1000 years.

That doesn't mean much. Even humans with a low IQ can learn to read and write,
and perform all sorts of reasoning tasks that are out of the reach of all AI
systems, even if those same systems can outperform every human in specific and
very restricted tasks.

Again- you're overestimating AI, I'm afraid.

------
macawfish
They trained the algorithm to watch the stick and play sounds from the
database where the stick moved similarly.

But the title makes it seem like the algorithm is synthesizing the sounds from
scratch!

~~~
542458
They can do pure parametric synthesis as well, but it's not nearly as
convincing so most of the video is devoted to the more convincing match
method. FWIW, constructing realistic sounds from first principles is much more
difficult than you'd think.

> where the stick moved similarly

where the stick moved similarly and was hitting similar things, which is a
non-trivial task.
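The match method described in this subthread (predict features from the video,
then retrieve the closest clip from the database) can be sketched like this.
This is a toy illustration, not the paper's actual pipeline: the feature
vectors and clip names are made up, and the real system learns its features
rather than hand-coding them.

```python
import math

# Toy "database": precomputed (feature_vector, sound_clip_name) pairs.
# In the real system the features would come from a learned model;
# these numbers are invented purely for illustration.
SOUND_DB = [
    ([0.9, 0.1, 0.0], "wood_tap.wav"),
    ([0.1, 0.8, 0.1], "leaf_rustle.wav"),
    ([0.0, 0.2, 0.9], "dirt_thud.wav"),
]

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_sound(predicted_features):
    """Return the database clip whose features best match the prediction."""
    return min(SOUND_DB,
               key=lambda entry: euclidean(entry[0], predicted_features))[1]

# A prediction that looks mostly like "stick hitting wood":
print(closest_sound([0.8, 0.2, 0.05]))  # -> wood_tap.wav
```

Matching on both motion and material is what makes the retrieval non-trivial:
the predicted features have to encode what is being hit, not just how the
stick moves.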

~~~
tuewocnc
yes, it would have to learn to simulate the physics of the system to match the
video, which would be cool

------
andreyk
"The first step to training a sound-producing algorithm is to give it sounds
to study. Over several months, the researchers recorded roughly 1,000 videos
of an estimated 46,000 sounds that represent various objects being hit,
scraped, and prodded with a drumstick. (They used a drumstick because it
provided a consistent way to produce a sound.)"

Whilst the production of natural seeming sound is cool, that quote right there
perfectly shows just how limited AI/ML still is. Sure, Deep Learning systems
can be taught to do perception tasks (such as understanding or creating sounds
and images) very very well, but those perception tasks are incredibly specific
and narrow. Not only that, but they are trained on large datasets hand-labeled
through an intensely laborious process, and indeed this laborious process is
necessary because we are still using simplistic supervised learning. At this
point good recognition or generation with deep learning is entirely old news,
and I think zero shot or unsupervised/semi-supervised learning is where the
real challenges still are.

~~~
YeGoblynQueenne
>> They used a drumstick because it provided a consistent way to produce a
sound.

More to the point, that bit.

The article makes a big to-do about how humans use sound to learn about their
environment and so on, but imagine if we needed to get a drumstick to make
sounds consistent enough to learn to recognise them.

Supervised and unsupervised learning is not the real challenge. The real
challenge is to get to the point where algorithms can build a model without
the benefit of an insanely expensive data pre-processing pipeline. Deep
learning's big promise is exactly that, but it's not always delivered: for
instance, there's a paper by Hinton (and I forget who else) reporting that
training LSTM RNNs on raw characters does not give the best performance, so
we're stuck with tokenisation and the implicit assumptions it imposes on your
data (my corollary).

Also, there are ways to avoid the expense of hand-labelling, for instance
co-training:
[https://en.wikipedia.org/wiki/Co-training](https://en.wikipedia.org/wiki/Co-training)
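The co-training idea linked above can be sketched roughly like this. This is a
toy illustration only: the 1-D data, the nearest-centroid classifier, and the
margin-based confidence score are all made up for the example; the original
Blum & Mitchell formulation uses richer models over two genuinely independent
feature views.

```python
def centroid_predict(train, x):
    """Predict by nearest class centroid; also return a margin as a
    crude confidence score (gap between the two nearest centroids)."""
    cents = {}
    for label in {lab for _, lab in train}:
        vals = [v for v, lab in train if lab == label]
        cents[label] = sum(vals) / len(vals)
    ranked = sorted(cents, key=lambda lab: abs(x - cents[lab]))
    best = ranked[0]
    margin = (abs(x - cents[ranked[1]]) - abs(x - cents[best])
              if len(ranked) > 1 else 1.0)
    return best, margin

def cotrain(view_a, view_b, seed_labels, rounds=10):
    """Each round, each view pseudo-labels the unlabeled example it is
    most confident about; both views learn from the shared growing pool.
    seed_labels maps index -> label for the few initially labeled points."""
    pseudo = dict(seed_labels)
    unlabeled = set(range(len(view_a))) - set(pseudo)
    for _ in range(rounds):
        if not unlabeled:
            break
        for view in (view_a, view_b):
            if not unlabeled:
                break
            train = [(view[i], pseudo[i]) for i in pseudo]
            # This view labels the point it is most confident about.
            i = max(unlabeled, key=lambda j: centroid_predict(train, view[j])[1])
            pseudo[i], _ = centroid_predict(train, view[i])
            unlabeled.discard(i)
    return pseudo

# Two toy "views" of six examples; only examples 0 and 5 start labeled.
view_a = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
view_b = [0.1, 0.0, 0.1, 0.9, 1.0, 0.9]
labels = cotrain(view_a, view_b, {0: "low", 5: "high"})
```

The point is that the hand-labelling cost stays at two examples; the remaining
four get labels from the classifiers' own confident predictions.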

~~~
andreyk
True, that bit is especially bad.

"The real challenge is to get to the point where algorithms can build a model
without the benefit of an insanely expensive data pre-processing pipeline."

Arguably, that is equivalent to saying the real problem is still
unsupervised/semi-supervised learning. I.e., being able to just throw a bunch
of raw data and maybe a bit of hand configuration at an algorithm and have it
do complicated things for you. The success of deep learning is in scaling to
tons of data and building really complicated models, but as it is used today
that data is still hand-labeled for supervised learning in an insanely
expensive data pre-processing step. Good unsupervised or semi-supervised
learning could
hopefully let us get out of this, but I don't think anyone really knows how to
get there yet. Co-training is an older example of semi-supervised learning,
and more recently there were Ladder Networks, but I don't think any algorithm
has been shown to work really well and become the norm in the way LSTM RNNs or
CNNs have.

~~~
YeGoblynQueenne
>> Arguably, that is equivalent to saying the real problem is still
unsupervised/semi-supervised learning.

I don't agree because data pre-processing and labeling are two distinct parts
of the pipeline and you can totally have one without the other.

And there's more to it than that. Currently we have to provide the context for
an algorithm to learn. We do this by selecting training examples. Whether
these examples are labelled or not, they are only a small part of the world we
wish the algorithm to learn about.

You don't even have to go as far as the wider physical world to see this in
action. In any training context, if your training set is missing a category of
entities, Y, then your algorithm will never model Y. It doesn't make any
difference if your model is trained in a supervised manner or not. What
matters is that there is a part of the world that it hasn't seen.

I guess you can say that humans can't learn this way either, but human
learning has a big advantage: we need very little data and very little
training to incorporate new knowledge, and our context of the world is very
broad to begin with. It's at once broad, specialised, robust and flexible.
We're a
bit scary if you think about it.

Which leads me to believe that the limitation of our machine learning
algorithms is not in the labeling, or even in the data pre-processing but in
some fundamental aspect of building a context from examples only. There's
something missing and it's not something we know about (hah!). The missing
part means that you can learn from examples until the heat death of the
universe and there will still be an infinity of things you don't know anything
about- and that are potentially part of your immediate environment.

Obviously, removing the need for pre-processing will make things much cheaper
and there will be progress, ditto removing the need for supervision. But it
won't get us anywhere nearer human learning, despite people's best wishes,
because we're missing a part of the puzzle that's a whole other ball game.

(and which I obviously don't claim to have any idea about)

------
mpitt
It fools humans _more often than a baseline algorithm._

What that means in numbers, they carefully avoid saying.

------
vernie
Sample size: 3

~~~
whatever_dude
Important little side note.

------
anotheryou
Fooling humans is much easier when about 50% of the sound is always accurate
(in every case it's a wooden drumstick hitting something). The human mind is
very forgiving, especially when vision accompanies sound (the McGurk effect
[1]).

Furthermore, this fixed variable makes the pool of samples to choose from
much, much smaller.

Still impressive, of course ;) just far from being able to dub any video out
there.

[1] [https://youtu.be/G-lN8vWm3m0?t=32](https://youtu.be/G-lN8vWm3m0?t=32)

------
jderick
Paper is here:

[http://arxiv.org/abs/1512.08512](http://arxiv.org/abs/1512.08512)

------
_mhr_
It would be cool to create the drums for a song by taking a video of the
performance and then using this software to create the recording to be put
into the song.

------
grondilu
Funny coincidence that I just stumbled upon a nice video about sound crafting
for cinema:

[http://sploid.gizmodo.com/wonderful-short-film-reveals-the-painstaking-magic-of-f-1782096219](http://sploid.gizmodo.com/wonderful-short-film-reveals-the-painstaking-magic-of-f-1782096219)

------
Animats
The demo seems to recognize three basic categories of things hit - shrubbery,
dirt, and other solid objects. It doesn't distinguish much between hitting
metal, wood, and asphalt.

It's another step toward common sense. Predicting what will happen if a robot
does something in the real world is essential to making robots less stupid.

------
EGreg
We've been fooling humans for years because the humans in question were
conditioned by tropes on TV:
[http://tvtropes.org/pmwiki/pmwiki.php/Main/TheCoconutEffect](http://tvtropes.org/pmwiki/pmwiki.php/Main/TheCoconutEffect)

------
logicallee
Good research, but some parts of the videos are like badly dubbed sound
effects - actually hilarious:
[https://www.youtube.com/watch?v=0FW99AQmMc8&t=1m1s](https://www.youtube.com/watch?v=0FW99AQmMc8&t=1m1s)
(the drumstick noise in the middle).

still impressive.

------
Lxr
Next we need to do it in reverse - given a sound, generate a video to match. I
wonder how well that would work?

------
dingle_thunk
Much of this is beyond me, but is it producing (synthesizing) sounds? Or just
sampling sounds?

~~~
systoll
Most of the video has them using an algorithm to predict the sound of the
hits, then finding the closest match in the database & using it as a sample.

At 1:50, they switch to using the prediction as a synthesised replacement. It
is not as good. They switch back to the first method for the tests.

------
bekimdisha
Why is AI so focused on fooling/imitating/trapping humans these days?

~~~
tmalsburg2
I don't think these systems' purpose is to fool humans. Tasks that test
whether a system can fool people are simply a good way to evaluate the
performance of a system. If a speech synthesizer fools people into thinking a
real person is speaking, that means the synthesis is really good. You might
say it's not important that synthesized speech sounds perfectly human, but our
speech perception evolved to be optimal for human speech, so it's likely that
any deviation from that makes the signal harder to process.

------
kelvin0
This has the potential to do the same for Videogames as did Mo-Cap.

------
ourcat
Percussionists beware. Your obsolescence clock is ticking.

------
danielmorozoff
This work looks very cool from the demo video. Have yet to read the paper, but
the parametric inversion for generating sounds from features seems very
intriguing.

------
tednoob
Since the discussion seems to have died down a bit, I just have to say it,
sorry. Quit beating around the bush.

------
tmalsburg2
Just a couple of years ago no one would have called this AI. Interesting how
this old term has become so fashionable again. Perhaps it's also being
overused and AI winter is coming.

