Our perception of sound within a video/film source is already deeply skewed, so the notion that this AI is a Turing test of sorts is a weak analogy.
Rather than now being able to make a living doing the sort of fun, really creative stuff like inventing new sounds for teleportation devices or dramatic natural phenomena, editors are more likely to be asked to work for free on the theory that they'll gain great exposure for their creativity. That's generally a very bad bargain. If past trends in the electronic dance music market are anything to go by, increasing automation will not reward true creative talent but will instead lead to an arms race to have the latest sound libraries, synthesizers, etc. and to be first to market with big splashy new sounds that offer superficial novelty.
The ability to provide high-value equipment below normal rental cost frequently trumps considerations of talent in the film industry. Similarly, there are plenty of crappy directors of photography out there who get hired regularly because they own a pile of nice lenses and related camera equipment, and hiring them plus their camera package looks economically attractive on paper because it's hard to quantify photographic talent.
The tedious, dull work was also what paid the bills. The easier it is to do that stuff automatically, the less people are willing to pay for good quality work.
For example, "Vegetable Violence": http://hissandaroar.com/sd001-vegetable-violence/
"Vegetable Violence is an organic sound effects library for creating your own orchestrated sonic mayhem. Vegetable rips, tears, squelches, hits, punches, stabs all recorded & mastered at 96kHz for stomach churning realism, this component library of gore sound effects is available for immediate download."
Suspiria in particular sticks in the mind. Great music, saturated colours, properly horrific horror. It's a bit more 'on-screen' than BSS, btw.
Right on! The fact that most audiences seem to expect a shotgun racking sound in a scene with a shotgun that doesn't even have that mechanism, or that drawing a katana is so often accompanied by a metallic "shing" and rattling sound -- these indicate the degree to which large swathes of people are drastically disconnected from an immediate and physically connected sense of how sound relates to the world around them.
I think this is also related to the degree to which I find many people are unaware of the kinesthetic feeling of how beats are emphasized, and how this changes the feel of music. The most vibrant intelligence involves a connection to the world in realtime. You can hear how machine parts interrelate just as much as you can see them. (You can even smell how they interrelate!) This disconnection even seems to be directly correlated with a loss of self awareness and flexibility in problem solving. It's like we're raising generations of brains over-trained on the simplistic and highly abstracted world of media tropes and vastly under-prepared for the messy complexity of the natural physical world.
Relevant demonstration and explanation of why the "shing" noise isn't a thing outside of movies. https://www.youtube.com/watch?v=yzbfuI0PMdA
Given what I was talking about, it's largely a matter of people accepting symbols or tokens of things in lieu of perceiving the actual thing. It's a form of ignorance that masquerades as culture or "sophistication." (It is the former, but it's not the latter.)
That's one reason films from the 40s seem so different to today's. I suspect a cinema-goer from then would have some trouble keeping up with the narrative of a 21st-century movie.
Nah, if this "AI" becomes useful, then sound artists will just be expected to use this new tool. AIs have already taken over "tweening", which is partly why 3D animators exist: they specify model movements in such a way that the computer can automatically run physics simulations or otherwise fill in the annoying-to-do in-between frames.
But the new tools have only made 3D animation more popular, leading to even more artists and more 3D animated content. And bigger productions (e.g. Big Hero 6 used a lot of AI for the city, and "The Lion King" used flocking AI to animate all the wildebeest during the stampede scene).
AIs don't always destroy jobs; they sometimes create them. They replace jobs that no one wants to do (who wants to animate 500+ wildebeest running down a cliff? Nah, let's have the AI do that), letting the artists focus on more meaningful tasks, leading to higher production quality.
If this were a professional production that needed to be matched up with two voice tracks and background music, the sound designer would use the AI to create the sound for the background events... but would still need to balance the various audio tracks so that the audience knows what to focus on.
The abstract subway sound in the background may be chosen by an AI rather than a human. But the human will still need to determine the relative levels of the various voice tracks and the background music. It's not like these films make themselves.
Even IF somehow an AI became good enough to make all those decisions (and most of those decisions are more "style" and "art" rather than hard-and-fast rules)... the video editor needs to choose the cuts, the order of the scenes and more.
No jobs will be at risk by this tool. If successful, it'd only become one more tool in the MASSIVE toolbox that video editors / sound designers are expected to master.
Anyway, "Tweening" AI completely eradicated one form of work for cartoonists. Humans aren't doing "tweening" work anymore. Big studios are making 3d productions where AIs can "tween" everything for you. Even 2d anime are using 3d animation techniques to cut down on the work and to leverage the AI.
It takes no work to command the AI to "tween" frames. But picking the right algorithm, deciding when to use the "smear" animation style (stylized tweening), or when to change algorithms to switch things up for the audience?
Yeah, those things will always just be straight-up work.
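To make "tweening" concrete: at its core it is just interpolating between keyframe poses. A minimal sketch, assuming simple linear interpolation over a toy pose representation (real systems use splines, easing curves, and physics; `tween` and the dict-of-angles pose here are illustrative stand-ins, not any real animation API):

```python
def lerp(a, b, t):
    """Linearly interpolate between a and b at parameter t in [0, 1]."""
    return a + (b - a) * t

def tween(pose_a, pose_b, n_inbetweens):
    """Generate the in-between poses an animator would once have drawn by hand.

    A 'pose' is just a dict mapping joint names to angles, a stand-in for
    the full transforms a real rig would carry.
    """
    frames = []
    for i in range(1, n_inbetweens + 1):
        t = i / (n_inbetweens + 1)  # exclude the keyframes themselves
        frames.append({joint: lerp(pose_a[joint], pose_b[joint], t)
                       for joint in pose_a})
    return frames

# Two keyframes: an arm swinging from 0 to 90 degrees, with 2 in-betweens
# evenly spaced at one third and two thirds of the way.
inbetweens = tween({"shoulder": 0.0}, {"shoulder": 90.0}, 2)
```

The "work" the comment describes lives in everything around this loop: choosing the interpolation curve, when to break it for stylized smears, and so on.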
We are not at the limits of what machine learning can do; we are barely at the beginning.
Tweening, or "making sound that can fool humans", is not what's important here; it's the underlying "mechanics" that allow them to do these things, which can be applied to so many other areas.
What makes humans extraordinary is that we can combine our various mechanical and intellectual abilities to adapt to our surroundings. What we are witnessing is another "species" that can do this, and it is only at the beginning.
In my experience, when AI reaches the critical mass of usefulness, it becomes a tool within the industry. Automatic solving of logical puzzles? Yeah, electrical engineers use logic synthesis tools to optimally lay out logic gates in CPU chips.
Automated logic can be used to verify extremely complicated mathematical proofs, or even search for new mathematical truths! So what happens? Well... some company creates a product with the AI, then sells it as a tool.
We exist in an age where AIs are responsible for searching and coalescing information. (Erm, how often do you use Google's database?) It wasn't very long ago that search was considered an AI problem... but as soon as computers successfully do it better than humans, it's a "tool" and "not AI" anymore.
The last 50 years of AI history has taught me one thing: when AIs are successful, humans change and stop thinking that the task was "intelligent". Chess as a measure of intelligence? No longer, once Chess AIs got good.
Database search? No longer, now that Google is faster than humans.
Automated driving? Was an AI task, now it isn't one. People are already discussing how it's a tool for truckers or Uber to use to make more money.
"Intelligent tasks" become "tasks for tools" because that's how stuff sells on the market. You wouldn't believe what people thought was "intelligence" in the 80s. Chess, database search, natural language processing, automated logic, symbolic mathematical solvers, chip layouts, compiler optimizations... and everything else that we just take for granted today.
Similarly, the tasks we consider "intelligence" today will simply turn into tools for the next generation once the AIs are written that solve that problem.
Your digestive system is a tool for getting rid of garbage your body does not need; your neurons are tools that ultimately allow you to think; your legs are tools that allow you to move around.
It's the entire system that's relevant here, not any of the individual subparts.
And we are not talking about what it can or can't do, but its potential.
You brush this off as "humans will always find a way"... which is what I am objecting to.
Whether you spend 2 or 20 years writing automated logic systems for a professor is unimportant.
This isn't going to lead to some new golden age of well-produced soundtracks, it's just going to make big bombastic soundtracks cheaper and more common. For a few years everything is going to sound like a disaster movie. Some would argue that we've already got that problem and I can't entirely disagree.
In short, quality won't go up, prices will go down, and the result will be oversupply.
I guess they used the same SFX library or something?
I think you're shooting for some idealized authenticity argument here that just doesn't work. I work in a tourist area where horses walk on cobblestone. Yeah, it sounds exactly like what the foley guys do with coconuts. It's uncanny. Also, in the digital age, a lot of foley work is samples of real sounds. We don't have guys in sound booths making new sounds with gadgets and old shoes and such anymore, outside of edge cases.
I know everyone likes to feel clever when they identify a popular sound, but most of the time those are intentional homages, and you have to consider the millions of sounds you don't recognize. It's not all from some static library of 1930s foley artists punching meat and knocking coconuts together anymore.
Although it looks like they only used a sample of 3 people, which is pretty small, I imagine the parametrically synthesised sound would have fooled no one.
Humans spend a huge part of our early lives learning to listen and to connect the dots between what we see and what we hear.
The fact that deep learning algos can now simulate audio based on what they see: that's the big thing here, not the production of a sound that fools humans. You can almost sense how imagination and inspiration are within the reach of machine learning (yes, there is some way to go yet).
We are now not only seeing individual senses being simulated but also the relationships between them. And as a bonus, what one machine learns in one place can be instantly added to the knowledge of another.
That's IMO the big deal here.
That's actually one big problem with machine learning algorithms: it's not at all clear how to integrate their knowledge with that of other algorithms (and that includes different instances of the same algorithm). Such algorithms build a single model of one domain at a time, and we're talking about very strict domains.
What we're seeing lately is many teams announcing that they trained an algorithm to do this or that pretty damn amazing thing, but watch closely: how many of those announcements describe a system that can integrate its learning into a wider cognitive architecture? There are teams that trained models to recognise images, to combine images, or to map images to strings, but all these things are simple tasks that are only useful in a very limited range of circumstances. Machine learning algorithms, unfortunately, are one-trick ponies. They do one thing well, and that's it.
>> You can almost sense how imagination and inspiration are within the reach of machine learning (yes, there is some way to go yet).
That's an understatement, the bit about having some way to go. We're not even close, really. To train a machine learning algorithm, the first thing you need is a lot of examples of the thing you want it to learn. It's really hard to see how one would compile a set of examples of imagination, not least because it's inside people's heads. Not to mention we don't even know what human imagination is in the first place.
Most humans only know a few of the skills that other people know. We are specialized, too.
I'm talking about how all (healthy) humans learn about the world they inhabit, by building a broad context of the entities and concepts in it. We all learn to speak a language for instance, in fact I believe most people actually learn a couple. We learn to interpret facial expressions, who is our friend and who is not, how to find sustenance and so on. We learn a whole bunch of things outside of formal education and specialised technical knowledge.
We specialise even in the kind of knowledge I describe, sure, but we can also change specialisation without too much hassle. I myself have been making my living as a programmer for the past several years coming from a completely non-technical background for most of my life. It was hard going to learn a new thing from scratch, but I was perfectly able to do so. We lose this flexibility as we grow older but for most of our lives we have nothing like the limitations of machine learning.
Well, that works both ways. Humans are quite limited; we haven't doubled our IQ in the last 1000 years. But machines have even more potential to grow than we do; in the next 50-100 years they will surely have matched our ability to adapt.
Today, a company releases a translation software for a couple of languages, next year they release translation between 100 languages. A human can't keep up with that. Yes, they still need tuning and architecture design, but that might change soon, maybe in a few years.
Also, the education of humans needs to be human-supervised, but machine learning can be unsupervised (like AlphaGo playing millions of self-play games to fine-tune its value function), and thus cheaper. I am sure Lee Sedol needed much more energy to train to get up to that level of play. He's 33 years old, and in order to get to his level he required resources, teachers, food, etc. AlphaGo played a few million self-play games and only consumed a bunch of cheap electricity, while doing it a hundred times faster and surpassing the man.
Well, if Google has access to cheap electricity then I understand why they're so successful. Unfortunately, I think they find it as expensive as everyone else, except they have a larger budget than most and they can afford to burn it for as long as they like (well, ish).
>> I am sure Lee Sedol needed much more energy to train to get up to that level of play. He's 33 years old, and in order to get to his level he required resources, teachers, food, etc.
Sure, but in the same vein AlphaGo required the energy and combined effort of probably a few hundred thousand humans to create the infrastructure on which it runs, the factory that created its hardware, the people who invented its programming language, its algorithms and so on. If you're going to think about historical costs, then think about historical costs.
But I'll refer you to my reply to TomPete (same level): no, running AlphaGo is not "cheap" in any way. There are huge costs involved, as there are for pretty much all state-of-the-art machine learning algorithms, with deep learning topping the curve. For instance, try training an instance of AlexNet on the full ImageNet data on your hardware, with your home electricity budget.
>> next year they release translation between 100 languages.
That's not that hard to do. What's hard is to get good translation between those 100 languages. In practice, for all companies who do machine translation right now, translation works well between a few pairs (like three or four pairs) and the rest is only useful as entertainment for native speakers. I speak a few languages, French, English and Greek, and I can attest to the fact that going from or to Greek from either French or English is just hilarious, in any machine translation service I've tried, with Google Translate first, of course.
I think you're just overestimating the quality of machine translation. I'm afraid it's nowhere near as good as you think it is.
>> Humans are quite limited, we haven't doubled our IQ in the last 1000 years.
That doesn't mean much. Even humans with a low IQ can learn to read and write, and perform all sorts of reasoning tasks that are out of the reach of all AI systems, even if those same systems can outperform every human in specific and very restricted tasks.
Again, you're overestimating AI, I'm afraid.
Humans are wrong about a million different things, they make mistakes all the time, we are limited in how much we can take in, and so on.
Humans have all sorts of limitations, yet somehow we manage to do OK.
I used the word "limitation", not "limit", and I'm talking about machine learning algorithms, not machines in general and outside of the context of machine learning.
Machine learning does have limitations compared to humans. Specifically, machine learning algorithms need large, no, vast amounts of data and processing power.
You can refer to Hinton, who is on record saying deep learning took off recently thanks to more data and more processing power, which wasn't available earlier. Humans need nothing like that. We can even learn from no examples at all. If I tell a kid who has never seen a giraffe what one looks like, they'll know one when they see it. I won't even have to show the kid a giraffe and say "that's a giraffe".
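The giraffe point is essentially zero-shot learning: recognising a class with zero training examples, purely from a description. A toy sketch of the attribute-matching idea (the attribute names and the two classes are made up for illustration; real zero-shot systems learn embeddings rather than hand-written attribute dicts):

```python
# Toy zero-shot classification: recognise a class never seen in training,
# purely from a verbal description mapped to binary attributes.
descriptions = {
    "giraffe": {"long_neck": 1, "spots": 1, "stripes": 0},
    "zebra":   {"long_neck": 0, "spots": 0, "stripes": 1},
}

def classify(observed):
    """Pick the described class whose attributes best match the observation."""
    def score(attrs):
        return sum(observed.get(k) == v for k, v in attrs.items())
    return max(descriptions, key=lambda name: score(descriptions[name]))

print(classify({"long_neck": 1, "spots": 1, "stripes": 0}))  # giraffe
```

The gap between this toy and the kid, of course, is that the kid already has the broad world context in which "neck" and "spots" mean something.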
That's because we have a context of the world that we can incorporate new knowledge into in the blink of an eye. That is beyond the capabilities of our most advanced algorithms right now; their context is always limited to a very restricted domain, for instance images of specific things or a sub-language.
Will machines one day be able to build a context as rich as that of a human? Who knows, maybe. But we're nowhere near achieving that in this generation, or the next, or the one after that.
But the title makes it seem like the algorithm is synthesizing the sounds from scratch!
It does. If you read the paper, they say that they first went with matching sounds from a database, but later moved on to full synthesis.
Reference: look for "parametric synthesis" in the paper https://arxiv.org/pdf/1512.08512v2.pdf
> where the stick moved similarly
where the stick moved similarly and was hitting similar things, which is a non-trivial task.
But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.
I don't think it's that simple. Speech synthesis by concatenation does produce more natural-sounding results, at least until you notice its quirks, so casual users tend to prefer it. But I know some heavy speech synthesis users, specifically blind programmers and power-users, and they tend to prefer parametric synthesis, because it's more intelligible at high speeds.
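A toy contrast of the two approaches, with sine bursts standing in for real recordings (`fake_recording` and the one-unit-per-phoneme "database" are made-up stand-ins; real concatenative synthesizers select diphone units via cost functions, and real parametric ones drive a vocoder from learned parameters):

```python
import numpy as np

SR = 16_000  # sample rate in Hz

def fake_recording(freq, dur=0.1):
    """Stand-in for a recorded unit: a short sine burst at a given pitch."""
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

# --- Concatenative: splice pre-recorded units from a database. ---
unit_db = {"a": fake_recording(220.0), "o": fake_recording(300.0)}

def synth_concatenative(phonemes):
    """Look up each target unit and join the recordings end to end."""
    return np.concatenate([unit_db[p] for p in phonemes])

# --- Parametric: render the waveform from a compact parameter stream. ---
def synth_parametric(params, dur=0.1):
    """Generate each (frequency, amplitude) frame from scratch."""
    t = np.arange(int(SR * dur)) / SR
    return np.concatenate([amp * np.sin(2 * np.pi * freq * t)
                           for freq, amp in params])
```

The trade-off the parent describes falls out of this shape: concatenation inherits the naturalness (and the quirks) of the recordings, while the parametric stream can be sped up or reshaped freely, which is why it stays intelligible at high speeds.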
Whilst the production of natural-seeming sound is cool, that quote right there perfectly shows just how limited AI/ML still is. Sure, deep learning systems can be taught to do perception tasks (such as understanding or creating sounds and images) very, very well, but those perception tasks are incredibly specific and narrow. Not only that, but they are trained on large datasets hand-labeled through an intensely laborious process, and indeed this laborious process is necessary because we are still using simplistic supervised learning. At this point good recognition or generation with deep learning is entirely old news, and I think zero-shot or unsupervised/semi-supervised learning is where the real challenges still are.
More to the point, that bit.
The article makes a big todo about how humans use sound to learn about their environment and so on, but imagine if we needed to get a drumstick to make sounds consistent enough to learn to recognise them.
Supervised vs. unsupervised learning is not the real challenge. The real challenge is to get to the point where algorithms can build a model without the benefit of an insanely expensive data pre-processing pipeline. Deep learning's big promise is exactly that, but it's not always delivered (for instance, there's a paper by Hinton and I forget who else, where they report that training LSTM RNNs on raw characters does not give the best performance), so we're stuck with tokenisation and the implicit assumptions it imposes on your data (my corollary).
Also, there are ways to avoid the expenses of hand-labelling, for instance co-training: https://en.wikipedia.org/wiki/Co-training
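For the curious, a minimal sketch of the co-training loop on synthetic two-view data (the data, the logistic-regression choice, and the single shared label pool are all simplifications; the classic Blum & Mitchell formulation has each classifier teach the other from its own pool):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data with two conditionally independent 'views' of each example,
# as co-training assumes (think: the text of a web page vs. its link text).
n = 600
y = np.array([0, 1] * (n // 2))
view1 = y[:, None] + rng.normal(0.0, 1.0, (n, 5))  # view 1 features
view2 = y[:, None] + rng.normal(0.0, 1.0, (n, 5))  # view 2 features

lab_idx = list(range(20))          # only 20 labeled examples to start
lab_y = [int(v) for v in y[:20]]
unlabeled = list(range(20, n))

clf1, clf2 = LogisticRegression(), LogisticRegression()

for _ in range(10):  # a few co-training rounds
    clf1.fit(view1[lab_idx], lab_y)
    clf2.fit(view2[lab_idx], lab_y)
    # Each classifier pseudo-labels the unlabeled example it is most
    # confident about; here both feed one shared pool for simplicity.
    for clf, view in ((clf1, view1), (clf2, view2)):
        if not unlabeled:
            break
        conf = clf.predict_proba(view[unlabeled]).max(axis=1)
        pick = unlabeled.pop(int(np.argmax(conf)))
        lab_idx.append(pick)
        lab_y.append(int(clf.predict(view[pick:pick + 1])[0]))

accuracy = (clf1.predict(view1) == y).mean()
```

The point of the trick is that confident pseudo-labels from one view act as free supervision for the other, stretching 20 hand labels much further.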
"The real challenge is to get to the point where algorithms can build a model without the benefit of an insanely expensive data pre-processing pipeline."
Arguably, that is equivalent to saying the real problem is still unsupervised/semi-supervised learning, i.e. being able to just throw a bunch of raw data and maybe a bit of hand-configuration at an algorithm and have it do complicated things for you. The success of deep learning is its ability to scale to tons of data and build really complicated models, but as it is used today that data is still hand-labeled for supervised learning in an insanely expensive pre-processing step. Good unsupervised or semi-supervised learning could hopefully get us out of this, but I don't think anyone really knows how to get there yet. Co-training is an older example of semi-supervised learning, and more recently there were Ladder Networks, but I don't think any such algorithm has been shown to work really well and become the norm in the way LSTM RNNs or CNNs have.
I don't agree because data pre-processing and labeling are two distinct parts of the pipeline and you can totally have one without the other.
And there's more to it than that. Currently we have to provide the context for an algorithm to learn. We do this by selecting training examples. Whether these examples are labelled or not, they are only a small part of the world we wish the algorithm to learn about.
You don't even have to go as far as the wider physical world to see this in action. In any training context, if your training set is missing a category of entities, Y, then your algorithm will never model Y. It doesn't make any difference if your model is trained in a supervised manner or not. What matters is that there is a part of the world that it hasn't seen.
I guess you could say that humans have no way to learn this either, but human learning has a big advantage: we need very little data and very little training to incorporate new knowledge, and our context of the world is very broad to begin with. It's at once broad, specialised, robust and flexible. We're a bit scary if you think about it.
Which leads me to believe that the limitation of our machine learning algorithms is not in the labeling, or even in the data pre-processing but in some fundamental aspect of building a context from examples only. There's something missing and it's not something we know about (hah!). The missing part means that you can learn from examples until the heat death of the universe and there will still be an infinity of things you don't know anything about- and that are potentially part of your immediate environment.
Obviously, removing the need for pre-processing will make things much cheaper and there will be progress, ditto removing the need for supervision. But it won't get us anywhere nearer human learning, despite people's best wishes, because we're missing a part of the puzzle that's a whole other ball game.
(and which I obviously don't claim to have any idea about)
I'd say the hottest area now is not unsupervised, but reinforcement learning, which is in the middle ground between supervised and unsupervised.
What that means in numbers, they carefully avoid saying.
Furthermore, this fixed variable makes the pool of samples to choose from much, much smaller.
Still impressive of course ;) just far from dubbing any video there is.
It's another step toward common sense. Predicting what will happen if a robot does something in the real world is essential to making robots less stupid.
At 1:50, they switch to using the prediction as a synthesised replacement. It is not as good. They switch back to the first method for the tests.
A lot of fruit that suddenly became low hanging.