AI produces realistic sounds that fool humans (mit.edu)
296 points by yunque on June 13, 2016 | 93 comments

AI that produces sound through analysis of a source video is impressive. Fooling humans is not. Since most of us have grown up on a steady diet of film and television many of the sounds we have in our memories are the work of foley artists that add sound effects to sequences in post. The sound of horse hoofs on cobblestones is likely created from a percussive technique that has no equine participation. The sounds of people being punched may be the sound of a large piece of meat being struck with a club. Similarly crunching snow likely is not the sound of a person walking through actual snow.

Our perceptions of sound within a video/film source are already deeply skewed, and therefore the notion that this AI is a Turing test of sorts is a weak analogy.

You're right, but as someone who's done a lot of sound editing/foley work I can't help having mixed feelings at seeing yet another job skill automated away. Good part - in a few years this will be good enough for commercial use which will save sound editors all sorts of tedious dull work and free them up to do more exciting creative stuff. Bad part: the tedious dull work was also what paid the bills. The easier it is to do that stuff automatically, the less people are willing to pay for good quality work.

Rather than now being able to make a living doing the sort of fun, really creative stuff like inventing new sounds for teleportation devices or dramatic natural phenomena, editors are more likely to be asked to work for free on the theory that they'll gain great exposure for their creativity. That's generally a very bad bargain. If past trends in the electronic dance music market are anything to go by, increasing automation will not reward true creative talent but rather just lead to an arms race to have the latest sound libraries, synthesizers etc. and just be the first to market with big splashy new sounds that offer superficial novelty.

Ability to provide high-value equipment below normal rental cost frequently trumps considerations of talent in the film industry. Similarly there are plenty of crappy directors of photography out there who get hired regularly because they own a pile of nice lenses and related camera equipment, and hiring them plus their camera package looks economically attractive on paper because it's hard to quantify photographic talent.

I have so much respect for foley artists. Artist being the operative word. People don't appreciate the hard work and creativity that goes into making the perfect sound.

As someone who's worked in the film industry (visual fx & CG) I'm subject to the same problem, but all tedious jobs from the industrial revolution on have been automated away one by one. I can understand the lament for something you worked hard on, and this isn't to take that away from you, but most job skills do actually have less value in the market over time, right? The other way to look at it is that what counts as good quality work changes and improves continually over time, higher and higher quality becomes available for the same price. Jobs are continually being reinvented, and people always get to work on the interesting parts that can't be automated. Something that took many people to do one decade only takes one person the next decade. This has been true for hundreds of years from farmers to accountants to cooks to car makers ... This "problem" is here to stay, our economy hasn't crashed yet, and there are as many creative people as ever.

Great Line:

the tedious dull work was also what paid the bills. The easier it is to do that stuff automatically, the less people are willing to pay for good quality work.

Just an aside, but I really like the videos that Hiss and a Roar makes to accompany their sound libraries:

For example, "Vegetable Violence": http://hissandaroar.com/sd001-vegetable-violence/

"Vegetable Violence is an organic sound effects library for creating your own orchestrated sonic mayhem. Vegetable rips, tears, squelches, hits, punches, stabs all recorded & mastered at 96kHz for stomach churning realism, this component library of gore sound effects is available for immediate download."

Aside to the aside -- if you like this and also like Italian horror films, you should check out the movie "Berberian Sound Studio".

What about the reverse? I saw BSS but feel like I don't really know the world it refers to.

Watch Dario Argento's films - Profondo Rosso, Suspiria, Inferno, Tenebrae. Those should all be quite easy to get hold of. Then start in on this list... https://en.wikipedia.org/wiki/Giallo

Suspiria in particular sticks in the mind. Great music, saturated colours, properly horrific horror. It's a bit more 'on-screen' than BSS, btw.

AI that produces sound through analysis of a source video is impressive. Fooling humans is not. Since most of us have grown up on a steady diet of film and television many of the sounds we have in our memories are the work of foley artists that add sound effects to sequences in post.

Right on! The fact that most audiences seem to expect a shotgun racking sound in a scene with a shotgun that doesn't even have that mechanism, or that drawing a katana is so often accompanied by a metallic "shing" and rattling sound -- these indicate the degree to which large swathes of people are drastically disconnected from an immediate and physically connected sense of how sound relates to the world around them.

I think this is also related to the degree to which I find many people are unaware of the kinesthetic feeling of how beats are emphasized, and how this changes the feel of music. The most vibrant intelligence involves a connection to the world in realtime. You can hear how machine parts interrelate just as much as you can see them. (You can even smell how they interrelate!) This disconnection even seems to be directly correlated with a loss of self awareness and flexibility in problem solving. It's like we're raising generations of brains over-trained on the simplistic and highly abstracted world of media tropes and vastly under-prepared for the messy complexity of the natural physical world.

>or that drawing a katana is so often accompanied by a metallic "shing"

Relevant demonstration and explanation of why the "shing" noise isn't a thing outside of movies. https://www.youtube.com/watch?v=yzbfuI0PMdA

Condensed for people who can't view the video: when you want to stab someone, it is better to not alert them beforehand with a characteristic sound.

I'd suspect that there is a wide range of what we're willing to accept in a given situation, reflecting our incomplete model of how sound works. However, this isn't the same as us accepting any substitute sound; the TV tropes continue because their absence feels awkward. An AI that correctly mimics TV-acceptable sounds is just about as impressive (however, this isn't on our list of 'hardest problems', for sure).

I'd suspect that there is a wide range of what we're willing to accept in a given situation, reflecting our incomplete model of how sound works.

Given what I was talking about, it's largely a matter of people accepting symbols or tokens of things in lieu of perceiving the actual thing. It's a form of ignorance that masquerades as culture or "sophistication." (It is the former, but it's not the latter.)

The vocabulary of sound. There is also an equivalent vocabulary of vision.

That's one reason films from the 40s seem so different to today's. I suspect a cinema goer from then would have some trouble keeping up with the narrative of a 21st-century movie.

If you see very high definition projections/videos of 40's movies, you'll find that there was sometimes an incredible humanity that came across from the actors then. The cinematography could be incredible at this. I bet a lot of modern audiences would see such a thing and be like, "Aw, man, where are the explosions!?"

I don't see why that matters. Even if the human baseline comes from what is essentially a virtual reality instead of actual reality, it's still quite a challenge to generalize sounds from associated images. Just because the AI is an artificial foley artist rather than a model of the real world doesn't make it any less impressive to me.

More importantly, what are all of the foley artists of the world thinking right now? "Oh dear god no" would be my first guess. Suddenly the prospect of needing to beat a computer at your job is rearing its head.

You mean like how the "tweening" industry completely destroyed artists?


Nah, if this "AI" becomes useful, then sound artists will just be expected to use this new tool. AIs have taken over "tweening" (which is why 3D Animators exist. To specify model movements in such a way that the Computer can automatically create physics simulations or whatever to automatically fill in the annoying-to-do crap).

But the new tools have only made 3D animation more popular, leading to even more artists and more 3D animated content. And bigger productions (e.g. "Big Hero 6" used a lot of AI for the city. "The Lion King" used flocking AI to animate all the wildebeest during the stampede scene.)

AIs don't always destroy jobs; they sometimes create them. They replace jobs that no one wants to do (who wants to animate 500+ wildebeest running down a cliff? Nah, let's have the AI do that), letting the artists focus on more meaningful tasks and leading to higher production quality.

What will people do with this tool exactly, once it's a mature tool? It doesn't sound like it will require much in the way of human guidance at some point down the line.

Oh come on. Amplitude, envelopes, equalizing, balancing the audio.

If this were a professional production that needed to be matched up with two voice tracks and background music, the sound designer would use the AI to create the sound for the background events... but would still need to balance the various audio tracks so that the audience knows what to focus on.


The abstract subway sound in the background may be chosen by an AI rather than a human. But the human will still need to determine the amplitude of the various voice tracks and the background music. It's not like these films make themselves.

Even IF somehow an AI became good enough to make all those decisions (and most of those decisions are more "style" and "art" rather than hard-and-fast rules)... the video editor needs to choose the cuts, the order of the scenes and more.

No jobs will be at risk by this tool. If successful, it'd only become one more tool in the MASSIVE toolbox that video editors / sound designers are expected to master.


Anyway, "Tweening" AI completely eradicated one form of work for cartoonists. Humans aren't doing "tweening" work anymore. Big studios are making 3d productions where AIs can "tween" everything for you. Even 2d anime are using 3d animation techniques to cut down on the work and to leverage the AI.

It takes no work to command the AI to "tween" frames. But picking the right algorithm, deciding when to use "smear" animation style (stylized tweening) or changes to algorithms to switch things up for the audience?

Yeah, those things will always just be straight up work.
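
To make the "picking the right algorithm" point concrete: tweening is interpolation between keyframes, and the easing curve is one of the choices left to the animator. A minimal sketch, assuming 2D positions stored as tuples; `lerp`, `ease_in_out` and `tween` are hypothetical names for illustration, not any animation package's API.

```python
def lerp(start, end, t):
    """Linear interpolation between two numbers for t in [0, 1]."""
    return start + (end - start) * t


def ease_in_out(t):
    """Smoothstep easing: slow start, faster middle, slow end."""
    return t * t * (3 - 2 * t)


def tween(start, end, steps, easing=None):
    """Return `steps` in-between positions strictly between two keyframes.

    `start` and `end` are equal-length tuples of coordinates.
    `easing` optionally remaps the time parameter before interpolating.
    """
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        if easing is not None:
            t = easing(t)
        frames.append(tuple(lerp(s, e, t) for s, e in zip(start, end)))
    return frames


# Same two keyframes, two different "feels".
linear_frames = tween((0.0, 0.0), (4.0, 8.0), steps=3)
eased_frames = tween((0.0, 0.0), (4.0, 8.0), steps=3, easing=ease_in_out)
print(linear_frames)
print(eased_frames)
```

Generating the in-betweens is the mechanical part; deciding which curve suits a given shot is the judgment call.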

You are looking at this from such a simplistic perspective.

We are not at the limits of what machine learning can do. We are barely at the beginning.

Tweening or "make sound that can fool humans" is not what's important here; it's the underlying "mechanics" that allow them to do these things, which can be applied to so many other areas.

What makes humans extraordinary is that we can combine our various mechanical and intellectual abilities to adapt to our surroundings. What we are witnessing is another "species" that can do this and is only at the beginning.

I spent two years of my life writing an automated logic system for a professor. Trust me, I know what AI can and can't do.

In my experience, when AI reaches the critical mass of usefulness, it becomes a tool within the industry. Automatic solving of logical puzzles? Yeah, electrical engineers use PSpice to optimally lay out logic gates in CPU Chips.

Automated logic can be used to verify extremely complicated mathematical proofs, or even search for new mathematical truths! So what happens? Well... some company creates a product with the AI, then sells it as a tool.

We exist in an age where AIs are responsible for searching and coalescing information. (Erm, how often do you use Google's database?) It wasn't very long ago when search was considered an AI problem... but as soon as computers successfully do it better than humans, it's a "tool" and "not AI" anymore.

The last 50 years of AI history have taught me one thing: when AIs are successful, humans change and stop thinking that the task was "intelligent". Chess as a measure of intelligence? No longer, once Chess AIs got good.

Database search? No longer, now that Google is faster than humans.

Automated driving? Was an AI task, now it isn't one. People are already discussing how it's a tool for Truckers or Uber to use to make more money.


"Intelligent tasks" become "tasks for tools" because that's how stuff sells on the market. You wouldn't believe what people thought was "intelligence" in the 80s. Chess, Database Search / Natural Language Processing, automated logic, symbolic mathematical solvers, chip layouts, compiler optimizations... and everything that we just take for granted today.

Similarly, the tasks we consider "intelligence" today will simply turn into tools for the next generation once the AIs are written that solve that problem.

You are IMO making the same mistake Searle made in his "Chinese Room" argument.

Your digestive system is a tool for you to get rid of garbage your system does not need, your neurons are tools for allowing you to ultimately think, your legs are tools for allowing you to move around.

It's the entire system that's relevant here. Not any of the individual subparts.

And we are not talking about what it can or can't do but its potential.

You brush this off as "humans will always find a way"... which is what I am objecting to.

Whether you spend 2 or 20 years writing automated logic systems for a professor is unimportant.

As a sound editor with 2 decades of first doing it for fun and then for a career, I don't think that balancing the tracks is uniquely human and immune from automation.

This isn't going to lead to some new golden age of well-produced soundtracks, it's just going to make big bombastic soundtracks cheaper and more common. For a few years everything is going to sound like a disaster movie. Some would argue that we've already got that problem and I can't entirely disagree.

In short, quality won't go up, prices will go down and oversupply will result in excess.

You clearly have a significantly deeper knowledge of this topic than I do, and I'm inclined to accept that and thank you for the lesson.

Good! I can't abide foley. Nature documentaries are ruined by it. A tiny ant eating a leaf, accompanied by horrible sounds of plastic-wrap being twisted. Why not record the actual sounds of an ant eating? It's supposed to be a documentary. And if the ant doesn't make any sound then just leave some silence.

This wouldn't remove foley, it would just replace the humans doing it with computers.

But what sound is a DNN going to associate with an insect eating?

The answer is definitely: How did you train it? If you trained it by showing it tons of movies and television, then you'd get a modern foley artist in a box, right? If you want them to do something else, you'd need to get a bunch of stock footage with stock audio. Both are doable, but which was done here?

A lot of foley artists are already delegating most of their work to premade audio libraries rather than creating unique sounds on their own.

I didn't know that, but I suppose even then they can claim that their artistry exists in the choices they make. If a machine can make equally viable choices from the audiences' and critics' perspectives...

Everyone knows that if you fall while dying you make a Wilhelm scream. It's a clearly documented scientific phenomenon with earliest records dating back to long, long ago.

Yes, and often sounds from huge sound libraries are used. I always smile when I hear a sound from Unreal Tournament 4.36 in a movie or on TV.

Or Doom.. I remember very clearly the intro to Modern Marvels on the History Channel uses the Doom door opening noise.

I guess they used the same SFX library or something?

The library used by Doom is REALLY common. The sound DSBOSPIT (sound of boss demon spitting a telecube in Doom 2) in particular is so overused it's not even funny. You hear it everywhere: in budget movies when a house or plane explodes, sometimes in other video games.

I hear the Protoss building creation sound used a lot as well!

The Turing test in this case might consist of feeding the algorithm with a mute video of a scene from Monty Python and the Holy Grail, where coconut halves are used to simulate the sound of galloping horses.

All of your examples sound a lot like what these things sound like in real life. Yes, there's hack foley work but the reality is that these aren't arbitrary sounds. You don't need an actual horse to get horse sounds.

I think you're shooting for some idealized authenticity argument here that just doesn't work. I work in a tourist area where horses walk on cobblestone. Yeah it sounds exactly like what the foley guys do with coconuts. It's uncanny. Also in the digital age, a lot of foley work is samples of real sounds. We don't have guys in sound booths making new sounds anymore with gadgets and old shoes and such, outside of edge cases.

I know everyone likes to feel clever when they identify a popular sound, but most of the time those are intentional homages, and you have to consider the millions of sounds you don't recognize. It's not all from some static library of 1930s foley artists punching meat and knocking together coconuts anymore.

From what I've observed the sound of horse hooves is made by taking two halves of a coconut and banging them together.

I heard that was done in the Monty Python movie because of needing to save money by not getting a real horse. So that kinda makes it even funnier.

Fooling humans is fairly impressive - they didn't just ask "is this real sound", they played the real sound and the synthesised sound and asked "which one is the real sound".

Although it looks like they only used a sample of 3 people which is pretty small, I imagine the parametrically synthesised sound would have fooled no-one.

I would love to try this test. I guess I would be tricked too, but that's definitely something I want to try.

Somehow this factoid is one of the most American things I can imagine. It's almost allegorical.

Also, in the clips in the video it fooled all _three_ of the participants tested. Couldn't find anything in the paper about sample sizes... but hopefully it was more than three...

The sound of the hamburger rain in "Cloudy with a Chance of Meatballs" was wet brown paper towels (you know, from school?) being flopped against a floor.

And not just our perceptions of sounds are skewed from a lifetime of manipulation, but so too our perceptions of the images themselves.

I think people here are underestimating how big a thing this actually is and I think the headline is kind of to blame for that. This is much less about fooling humans than about what this actually means.

Humans spend a huge part of our early lives learning to listen and to connect the dots between what we see and what we hear.

The fact that Deep Learning algos can now simulate audio based on what they see, that's the big thing here. Not the production of the sound that fools humans. You can almost sense how imagination and inspiration is inside the reach of machine learning (yes there are some way to go yet)

We are now not only seeing individual senses being simulated but also the relationship between them. And as a bonus what one machine learns one place can be instantly added to the knowledge of the other.

That's IMO the big deal here.

>> And as a bonus what one machine learns one place can be instantly added to the knowledge of the other.

That's actually one big problem with machine learning algorithms: it's not at all clear how to integrate their knowledge with that of other algorithms (and that includes different instances of the same algorithm). Such algorithms build a single model of one domain at a time, and we're talking about very strict domains.

What we're seeing lately is many teams announcing that they trained an algorithm to do this or that pretty damn amazing thing, but watch closely: how many of those announcements describe a system that can integrate its learning into a wider cognitive architecture? There are teams that trained models to recognise images, to combine images, or to map images to strings, but all these things are simple tasks that are only useful in a very limited range of circumstances. Machine learning algorithms unfortunately are one trick ponies. They do one thing well- and that's it.

>> You can almost sense how imagination and inspiration is inside the reach of machine learning (yes there are some way to go yet).

That's an understatement- the bit about having some way to go. We're not even close, really. To train a machine learning algorithm the first thing you need is a lot of examples of the thing you want it to learn. It's really hard to see how one would compile a set of examples of imagination, not least because it's inside people's heads. Not to mention we don't even know what human imagination is in the first place.

> Machine learning algorithms unfortunately are one trick ponies.

Most humans only know a few of the skills that other people know. We are specialized, too.

You're talking about specialisation in a restricted field, like maths or a scientific discipline. That's part of education.

I'm talking about how all (healthy) humans learn about the world they inhabit, by building a broad context of the entities and concepts in it. We all learn to speak a language for instance, in fact I believe most people actually learn a couple. We learn to interpret facial expressions, who is our friend and who is not, how to find sustenance and so on. We learn a whole bunch of things outside of formal education and specialised technical knowledge.

We specialise even in the kind of knowledge I describe, sure, but we can also change specialisation without too much hassle. I myself have been making my living as a programmer for the past several years coming from a completely non-technical background for most of my life. It was hard going to learn a new thing from scratch, but I was perfectly able to do so. We lose this flexibility as we grow older but for most of our lives we have nothing like the limitations of machine learning.

> we have nothing like the limitations of machine learning

Well, that works both ways. Humans are quite limited, we haven't doubled our IQ in the last 1000 years. But machines have even more potential to grow than we do, in the next 50-100 years they will surely have matched our ability to adapt.

Today, a company releases a translation software for a couple of languages, next year they release translation between 100 languages. A human can't keep up with that. Yes, they still need tuning and architecture design, but that might change soon, maybe in a few years.

Also, the education of humans needs to be human supervised, but machine learning can be unsupervised (like, AlphaGo playing millions of self games to fine tune its value function), thus, cheaper. I am sure Lee Sedol needed much more energy to train to get up to that level of play. He's 33 years old, and in order to get to his level he required resources, teachers, food, etc. AlphaGo played a few million self play games and only consumed a bunch of cheap electricity, while doing it a hundred times faster and surpassing the man.

>> AlphaGo played a few million self play games and only consumed a bunch of cheap electricity,

Well, if Google has access to cheap electricity then I understand why they're so successful. Unfortunately, I think they find it as expensive as everyone else, except they have a larger budget than most and they can afford to burn it for as long as they like (well, ish).

>> I am sure Lee Sedol needed much more energy to train to get up to that level of play. He's 33 years old, and in order to get to his level he required resources, teachers, food, etc.

Sure, but in the same vein AlphaGo required the energy and combined effort of probably a few hundred thousand humans to create the infrastructure on which it runs, the factory that created its hardware, the people who invented its programming language, its algorithms and so on. If you're going to think about historical costs, then think about historical costs.

But, I'll refer you to my reply to TomPete (same level): no, running AlphaGo is not "cheap" in any way. There are huge costs involved, as there are for pretty much all state of the art machine learning algorithms, with deep learning topping the curve. For instance, try training an instance of AlexNet on the full ImageNet data on your hardware, with your home electricity budget.

>> next year they release translation between 100 languages.

That's not that hard to do. What's hard to do is to get good translation between those 100 languages. In practice, for all companies who do machine translation right now, translation works well between a few pairs (like three or four pairs) and the rest is only useful as entertainment for native speakers. I speak a few languages, French, English and Greek, and I can attest to the fact that going from or to Greek from either French or English is just hilarious, in any machine translation service I've tried, with Google Translate first, of course.

I think you're just overestimating the quality of machine translation. I'm afraid it's nowhere near as good as you think it to be.

>> Humans are quite limited, we haven't doubled our IQ in the last 1000 years.

That doesn't mean much. Even humans with a low IQ can learn to read and write, and perform all sorts of reasoning tasks that are out of the reach of all AI systems, even if those same systems can outperform every human in specific and very restricted tasks.

Again- you're overestimating AI, I'm afraid.

Why shouldn't "Machines" be able to do that? What limit to technology exactly is it you believe there is that this can't be done by machines?

Humans are wrong about a million different things, they make mistakes all the time, we are limited in how much we can take in and so on.

Humans have all sorts of limitations yet somehow we manage to do ok.

>> What limit to technology exactly is it you believe there is that this can't be done by machines?

I used the word "limitation" not "limit" and I'm talking about machine learning algorithms, not machines in general and outside of the context of machine learning.

Machine learning does have limitations compared to humans. Specifically, machine learning algorithms need large, no, vast amounts of data and processing power. You can refer to Hinton, who is on record saying deep learning took off recently thanks to more data and more processing power, which wasn't available earlier.

Humans need nothing like that. We can even learn on no examples at all. If I tell a kid who has never seen a giraffe what one looks like they'll know one when they see it. I won't even have to show the kid a giraffe and say "that's a giraffe".

That's because we have a context of the world that we can incorporate new knowledge into in the blink of an eye. That is beyond the capabilities of our most advanced algorithms right now- their context is always limited to a very restricted domain, for instance- images of specific things or a sub-language etc.

Will machines one day be able to build a context as rich as that of a human? Who knows- maybe. But we're nowhere near achieving that in this generation, or the next, or the one after that.

I think it's very counterfactual to look at this from a linear perspective. Make it exponential and everything changes around it.

They trained the algorithm to watch the stick and play sounds from the database where the stick moved similarly.

But the title makes it seem like the algorithm is synthesizing the sounds from scratch!

Exactly, most of the work here is video processing - analyzing a portion of one video, then finding another video clip that has similar characteristics. Then they copy the sound from the second clip into the first clip. But they could be copying any metadata, not just sound. This isn't really about sound _at all_.
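
The matching step described above (find the stored clip whose visual features are closest to the query, then copy its sound) can be sketched as a nearest-neighbour lookup. This is only an illustration under stated assumptions: the two-number feature vectors, the file names, and the `transfer_sound` helper are all hypothetical, and none of it is taken from the paper's actual features or code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_sound(query_features, database):
    """Return the sound attached to the database clip nearest to the query.

    `database` is a list of (features, sound) pairs. Both are toy
    stand-ins (assumptions) for real video features and stored audio.
    """
    _, best_sound = min(
        database, key=lambda entry: euclidean(query_features, entry[0])
    )
    return best_sound


# Hypothetical database: (stick speed, surface hardness) -> recorded sound.
DATABASE = [
    ((0.9, 0.8), "sharp_clack.wav"),
    ((0.2, 0.1), "soft_rustle.wav"),
    ((0.5, 0.9), "dull_thud.wav"),
]

# A silent query clip: fast stick, hard surface.
print(transfer_sound((0.85, 0.75), DATABASE))
```

Note that nothing in `transfer_sound` is specific to audio: the second element of each pair could be any metadata, which is the point being made above.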

> But the title makes it seem like the algorithm is synthesizing the sounds from scratch!

It does. If you read the paper, they say that first they went with matching sounds from a database, but later moved on to full synthesis.

Reference: look for "parametric synthesis" in the paper https://arxiv.org/pdf/1512.08512v2.pdf

They do both; there's a parametric synthesis module later on in the video. It doesn't work all that well for water.

They can do pure parametric synthesis as well, but it's not nearly as convincing so most of the video is devoted to the more convincing match method. FWIW, constructing realistic sounds from first principles is much more difficult than you'd think.

> where the stick moved similarly

where the stick moved similarly and was hitting similar things, which is a non-trivial task.

yes, it would have to learn to simulate the physics of the system to match the video, which would be cool

And for what it is, it's really unimpressive. It's a cool idea and all, just with disappointing results.

They also train for the material the stick is hitting. But yeah, it's sound transfer, not synthesis from scratch.

But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.

> But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.

I don't think it's that simple. Speech synthesis by concatenation does produce more natural-sounding results, at least until you notice its quirks, so casual users tend to prefer it. But I know some heavy speech synthesis users, specifically blind programmers and power-users, and they tend to prefer parametric synthesis, because it's more intelligible at high speeds.

Yes the title may give that impression, but the video explains clearly the source of the audio

"The first step to training a sound-producing algorithm is to give it sounds to study. Over several months, the researchers recorded roughly 1,000 videos of an estimated 46,000 sounds that represent various objects being hit, scraped, and prodded with a drumstick. (They used a drumstick because it provided a consistent way to produce a sound.)"

Whilst the production of natural seeming sound is cool, that quote right there perfectly shows just how limited AI/ML still is. Sure, Deep Learning systems can be taught to do perception tasks (such as understanding or creating sounds and images) very very well, but those perception tasks are incredibly specific and narrow. Not only that, but they are trained on large datasets hand-labelled through an intensely laborious process, and indeed this laborious process is necessary because we are still using simplistic supervised learning. At this point good recognition or generation with deep learning is entirely old news, and I think zero-shot or unsupervised/semi-supervised learning is where the real challenges still are.

>> They used a drumstick because it provided a consistent way to produce a sound.

More to the point, that bit.

The article makes a big to-do about how humans use sound to learn about their environment and so on, but imagine if we needed to get a drumstick to make sounds consistent enough to learn to recognise them.

Supervised and unsupervised learning is not the real challenge. The real challenge is to get to the point where algorithms can build a model without the benefit of an insanely expensive data pre-processing pipeline. Deep learning's big promise is exactly that, but it has not always delivered: for instance, there's a paper by Hinton and I forget who else, where they report that training LSTM RNNs on raw characters does not give the best performance. So we're stuck with tokenisation and the implicit assumptions it imposes on your data (that last part is my corollary).

Also, there are ways to avoid the expenses of hand-labelling, for instance co-training: https://en.wikipedia.org/wiki/Co-training
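For anyone unfamiliar with co-training, here is a rough sketch of the idea in Python. It is a simplification, not the original algorithm as published: each example is assumed to have two one-dimensional "views", each view gets a toy nearest-centroid classifier, and on every round each classifier labels the single unlabeled example it is most confident about and adds it to the shared labeled pool. The function names (`centroids`, `predict`, `co_train`) and the data are made up for illustration.

```python
def centroids(examples, view):
    """Mean value of one view's feature for each label seen so far."""
    sums, counts = {}, {}
    for views, label in examples:
        sums[label] = sums.get(label, 0.0) + views[view]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(cents, x):
    """Return (label, margin); margin is how much closer the best centroid is than the runner-up."""
    ranked = sorted(cents, key=lambda label: abs(x - cents[label]))
    best = ranked[0]
    if len(ranked) < 2:
        return best, 0.0
    return best, abs(x - cents[ranked[1]]) - abs(x - cents[best])


def co_train(labeled, unlabeled, rounds=10):
    """Grow the labeled set by letting two view-specific classifiers teach each other."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled
            cents = centroids(labeled, view)
            # The classifier for this view picks the example it is most sure about.
            pick = max(pool, key=lambda views: predict(cents, views[view])[1])
            label, _ = predict(cents, pick[view])
            pool.remove(pick)
            labeled.append((pick, label))
    return labeled


# Two hand-labeled seeds, four unlabeled examples (hypothetical data).
seeds = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 2.0), (9.0, 8.0), (4.0, 1.0), (6.0, 9.0)]
result = dict(co_train(seeds, unlabeled))
```

Note that `(4.0, 1.0)` and `(6.0, 9.0)` are fairly ambiguous in the first view but clear in the second, which is the situation co-training is meant to exploit: only two examples were labeled by hand.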

True, that bit is especially bad.

"The real challenge is to get to the point where algorithms can build a model without the benefit of an insanely expensive data pre-processing pipeline."

Arguably, that is equivalent to saying the real problem is still unsupervised/semi-supervised learning, i.e., being able to just throw a bunch of raw data and maybe just a bit of hand configuration at an algorithm and have it do complicated things for you. The success of deep learning is to scale to tons of data and build really complicated models, but as it is used today that data is still hand-labeled for supervised learning in an insanely expensive data pre-processing step. Good unsupervised or semi-supervised learning could hopefully let us get out of this, but I don't think anyone really knows how to get there yet. Co-training is an older example of semi-supervised learning, and more recently there were Ladder Networks, but I don't think any algorithm has been shown to work really well and become the norm in the way LSTM RNNs or CNNs have.

>> Arguably, that is equivalent to saying the real problem is still unsupervised/semi-supervised learning.

I don't agree because data pre-processing and labeling are two distinct parts of the pipeline and you can totally have one without the other.

And there's more to it than that. Currently we have to provide the context for an algorithm to learn. We do this by selecting training examples. Whether these examples are labelled or not, they are only a small part of the world we wish the algorithm to learn about.

You don't even have to go as far as the wider physical world to see this in action. In any training context, if your training set is missing a category of entities, Y, then your algorithm will never model Y. It doesn't make any difference if your model is trained in a supervised manner or not. What matters is that there is a part of the world that it hasn't seen.

I guess you can say that humans don't have a way to learn this way either, but human learning has a big advantage: we need very little data and very little training to incorporate new knowledge and our context of a world is very broad to begin with. It's at once broad, specialised, robust and flexible. We're a bit scary if you think about it.

Which leads me to believe that the limitation of our machine learning algorithms is not in the labeling, or even in the data pre-processing but in some fundamental aspect of building a context from examples only. There's something missing and it's not something we know about (hah!). The missing part means that you can learn from examples until the heat death of the universe and there will still be an infinity of things you don't know anything about- and that are potentially part of your immediate environment.

Obviously, removing the need for pre-processing will make things much cheaper and there will be progress, ditto removing the need for supervision. But it won't get us anywhere nearer human learning, despite people's best wishes, because we're missing a part of the puzzle that's a whole other ball game.

(and which I obviously don't claim to have any idea about)
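The point above about a training set that is missing a category Y can be made concrete with a toy example. This is only a sketch under assumptions: a nearest-centroid classifier stands in for "your algorithm", the feature values are invented, and the labels "wood", "metal" and "glass" are hypothetical. The classifier never saw glass, so whatever you feed it, it can only answer with a category it was trained on.

```python
def train(examples):
    """Compute one centroid (mean feature vector) per label in the training set."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, features):
    """Return the label whose centroid is nearest to the given features."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, model[label]))
    return min(model, key=sq_dist)


# Training data covers only two categories (hypothetical values).
training = [
    ([1.0, 1.0], "wood"),
    ([1.2, 0.8], "wood"),
    ([5.0, 5.0], "metal"),
    ([4.8, 5.2], "metal"),
]
model = train(training)

# Something from an unseen category, far from both clusters.
glass_like = [9.0, 0.5]
prediction = classify(model, glass_like)
print(prediction)  # one of the known labels, never "glass"
```

Swapping in an unsupervised clustering step would not change the outcome: a cluster for glass cannot appear if no glass-like example was ever in the data.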

We are making progress with embeddings. Today we can embed just about everything (everything2vec) and then apply RNNs on top to do reasoning. Another field of interest is building large ontologies which represent factual data and relationships between concepts at encyclopedic scale.

I'd say the hottest area now is not unsupervised, but reinforcement learning, which is in the middle ground between supervised and unsupervised.

It fools humans more often than a baseline algorithm.

What that means in numbers, they carefully avoid saying.

Sample size: 3

Important little side note.

Fooling humans is much easier when about 50% of the sound is always accurate (in any case it's a wooden drumstick hitting something). The human mind is very forgiving, especially when vision accompanies sound (McGurk effect [1])

Furthermore this fixed variable makes the pool of samples to choose from much much smaller.

Still impressive of course ;) just far from dubbing any video there is.

[1] https://youtu.be/G-lN8vWm3m0?t=32

It would be cool to create the drums for a song by taking a video of the performance and then using this software to create the recording to be put into the song.

Funny coincidence that I just stumbled upon a nice video about sound crafting for cinema:


The demo seems to recognize three basic categories of things hit - shrubbery, dirt, and other solid objects. It doesn't distinguish much between hitting metal, wood, and asphalt.

It's another step toward common sense. Predicting what will happen if a robot does something in the real world is essential to making robots less stupid.

We've been fooling humans for years because the humans in question were conditioned by tropes on TV: http://tvtropes.org/pmwiki/pmwiki.php/Main/TheCoconutEffect

Good research, but some parts of the videos are like badly dubbed sound effects - actually hilarious: https://www.youtube.com/watch?v=0FW99AQmMc8&t=1m1s (the drumstick noise in the middle).

still impressive.

Next we need to do it in reverse - given a sound, generate a video to match. I wonder how well that would work?

Much of this is beyond me, but is it producing (synthesizing) sounds? Or just sampling sounds?

Most of the video has them using an algorithm to predict the sound of the hits, then finding the closest match in the database & using it as a sample.

At 1:50, they switch to using the prediction as a synthesised replacement. It is not as good. They switch back to the first method for the tests.
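The "find the closest match in the database" step can be sketched roughly like this. It assumes the predicted sound and every recorded sound are represented as plain feature vectors and compared by Euclidean distance; the actual features and distance used in the paper may well differ, and the sample names and numbers here are invented.

```python
import math


def closest_sample(predicted, database):
    """Return the name of the recorded sound whose features are nearest to the prediction."""
    return min(database, key=lambda name: math.dist(predicted, database[name]))


# Hypothetical database: sample name -> feature vector of the recording.
database = {
    "wood_hit": [0.9, 0.1, 0.3],
    "metal_hit": [0.2, 0.8, 0.7],
    "leaves_rustle": [0.1, 0.2, 0.9],
}

# Features the model predicted from the silent video (made-up values).
predicted = [0.25, 0.7, 0.6]
print(closest_sample(predicted, database))  # metal_hit
```

The recording that gets played back is then the real sample, not the prediction itself, which is why this method sounds more natural than the synthesised version.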

Why is AI focused so much on fooling, imitating, and trapping humans these days?

I don't think these systems' purpose is to fool humans. Tasks that test whether a system can fool people are simply a good way to evaluate the performance of a system. If a speech synthesizer fools people into thinking a real person is speaking, that means the speech synthesis is really good. You might say it's not important that synthesized speech sounds perfectly human, but our speech perception evolved to be optimal for human speech, so it's likely that any deviation from that makes the signal harder to process.

Because with deep learning we quite recently got a new tool (someone figured out how to use GPUs for training) that lets us do a lot of those imitating things we couldn't do before.

A lot of fruit that suddenly became low hanging.

This has the potential to do for video games what mo-cap did.

Percussionists beware. Your obsolescence clock is ticking.

This work looks very cool from the demo video. Have yet to read the paper, but the parametric inversion for generating sounds from features seems very intriguing.

Since the discussion seems to have died down a bit, I just have to say it, sorry. Quit beating around the bush.

Just a couple of years ago no one would have called this AI. Interesting how this old term has become so fashionable again. Perhaps it's also being overused and AI winter is coming.
