A simplified analogy that I believe applies is flocking behavior in birds. Current ML implementations would demand a large dataset of groups of birds in flight, both flocking and non-flocking, and then the curation of brute-forced models, judged by p-value, that allege to predict whether behavior is flocking or not (and, in the case of an individual bird, whether its behavior is appropriate flocking given the individual behaviors and circumstances of the birds in the dataset). But flocking behavior is easily modeled right now, and has been for decades, using simple rules for individuals and depending on emergent phenomena within the group.
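To make the contrast concrete, here is a minimal sketch of the decades-old approach the comment alludes to: Reynolds-style "boids" rules (cohesion, separation, alignment), each bird reacting only to nearby neighbors. All parameter values (radius, weights, speed cap) are illustrative choices, not from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50
pos = rng.uniform(0, 100, size=(N, 2))   # positions in a 100x100 region
vel = rng.uniform(-1, 1, size=(N, 2))    # initial random headings

def step(pos, vel, radius=15.0, w_coh=0.01, w_sep=0.05, w_ali=0.05):
    """One update of boids-style rules; all weights are illustrative."""
    new_vel = vel.copy()
    for i in range(N):
        d = pos - pos[i]
        dist = np.linalg.norm(d, axis=1)
        nbrs = (dist < radius) & (dist > 0)
        if nbrs.any():
            # Cohesion: steer toward the local center of mass.
            new_vel[i] += w_coh * d[nbrs].mean(axis=0)
            # Separation: steer away from very close neighbors.
            close = nbrs & (dist < radius / 3)
            if close.any():
                new_vel[i] -= w_sep * d[close].mean(axis=0)
            # Alignment: match the average heading of neighbors.
            new_vel[i] += w_ali * (vel[nbrs].mean(axis=0) - vel[i])
    # Cap speed so the simulation stays stable.
    speed = np.linalg.norm(new_vel, axis=1, keepdims=True)
    new_vel *= np.minimum(1.0, 2.0 / np.maximum(speed, 1e-9))
    return pos + new_vel, new_vel

for _ in range(200):
    pos, vel = step(pos, vel)
```

Three local rules, no dataset, no training: the group-level flocking is entirely emergent.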
I'm concerned that most ML efforts now are merely attaining the low-hanging fruit of brute force, and will run into a wall that halts progress at the level of relatively "easy" things solved by worms and comparable biological systems, since so many domains have a level of complexity that would exceed any realistically imaginable amount of simple mathematical computing power.
I've largely been of the camp that ML/DL will be the herald of the next AI winter. While GPT-3 is impressive, it operates under no constraints we haven't already seen, and breaks none that needed to be broken to change the field (NLP notwithstanding). Soon enough the limitations of ML/DL will be apparent, and while there will be a breadth of practical applications to explore and analyze, we will see clearly the boundary of how far ML/DL can take us, and will have to be content with what we have until the next shift occurs.
"artificial intelligence is the second best way to solve any problem"
(KJ Astrom, UC Santa Barbara)
I don't think that's too different to how nature does it in worms, as you say. However, our brains also have a different method of operation. We can see something just once, and learn from it. The small child scalds herself on the room heater, and never goes near it again.
So the child has learnt from a single example. We have no idea how to do that. At some level it's probably a complex mechanism that involves memorising the event and replaying the memory over and over until, as you put it, the brain hits a satisfactory p-level. As far as I know no one has built such a mechanism yet. But even if you did build one, the fact remains that the brain has learnt near-perfectly from a single example, and we have no idea how to do that. Let alone do it as the child did, in the space of seconds or at most minutes.
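The "memorise it and replay it until it sticks" mechanism imagined above can at least be caricatured in code. This is purely a toy, not a claim about how brains work: a linear classifier stores a single "painful episode" (the feature values are hypothetical) and replays it as repeated gradient steps until its predicted danger crosses a confidence threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One remembered episode: hypothetical features of the heater (hot, glowing, near).
episode = np.array([1.0, 0.9, 0.8])
w = np.zeros(3)          # naive prior: nothing is dangerous yet

replays = 0
while sigmoid(w @ episode) < 0.99:          # replay until a "satisfactory p-level"
    # One internal replay = one gradient step on the log-likelihood of "danger".
    grad = (1.0 - sigmoid(w @ episode)) * episode
    w += 0.5 * grad
    replays += 1

print(replays)   # number of internal replays needed for one external event
```

Even this trivial sketch shows the asymmetry the comment points at: a single external event, many internal repetitions.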
I agree. However, my response, if I intentionally don't over-think it, is: "well, she learns from a single example because that experience was painful and she wants to minimize pain."
Granted that's true as much as it is facile, but I do wonder if it hints at what the next frontier will consist of. A thermometer will happily keep reading out the temperature of a room heater until its physical form is damaged. A human child has a concept of pain to dissuade behaviors that lead to harm (in theory) (this is obviously simplified).
In other words, Deep Learning's growing need for computing power seems to have reached a point at which it is now motivating fundamental research to find greener, cheaper, more energy-efficient hardware.
The economic incentives are very powerful: Whichever companies (or organizations, or countries) find ways to harness the most computing power at the lowest marginal cost will win the race in this market.
PS. The same could be said for Bitcoin mining: it is also motivating fundamental research to develop greener, cheaper, more energy-efficient, more powerful hardware. Whoever finds ways to harness the most computing power at the lowest marginal cost will make the most money processing transactions on the network.
Alex Krizhevsky's revolution was to find a way to train large neural networks using existing hardware that was optimized for linear algebra. Linear algebra is literally the first applied problem studied in computer science, with papers written by Turing and von Neumann. It's the most mature field in computing and has long been out of steam. Progress since AlexNet came from scaling $$$, not scaling tflops. There will not be exponential growth in compute performance.
We are not looking for just faster ML at any cost, we are looking for cost-efficient ML so we can do more of it.
2. Wafer-scale integration and other advanced packaging approaches are a qualitatively different type of computing advance than CMOS scaling, which delivers better, faster, smaller, and cheaper all at once and, thanks to Dennard scaling, reduced power as well.
The paper’s point is that eventually we will reach computing power limits and then we will have to improve the deep learning algorithm’s efficiency to continue to improve. From the abstract:
> Continued progress in these applications will require dramatically more computationally-efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.
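One concrete member of the "more computationally-efficient methods" family the abstract gestures at is model compression, e.g. magnitude pruning. This is a minimal sketch of the mechanics only (synthetic weights, no train-prune-finetune pipeline): zero out all but the largest-magnitude weights of a layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained dense layer (synthetic values, for illustration).
weights = rng.normal(0, 1, size=(256, 256))

def prune(w, keep_fraction=0.1):
    """Keep only the keep_fraction largest-magnitude weights; zero the rest."""
    k = int(w.size * keep_fraction)
    threshold = np.partition(np.abs(w).ravel(), -k)[-k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

pruned, mask = prune(weights, keep_fraction=0.1)
sparsity = 1.0 - mask.mean()
print(f"{sparsity:.0%} of weights removed")   # ~90% of weights zeroed out
```

In practice the win comes from sparse kernels or structured pruning that the hardware can actually exploit, which is exactly where the algorithm-efficiency and hardware-efficiency threads of this discussion meet.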
I'm having trouble finding the papers I've skimmed on this (my original search phrase was something about succinct neural network encodings, entropy, yada yada yada), but that's already being done here and there, at least at a high level, by looking at graph symmetries. The results were okayish: the theoretical space bounds aren't substantial improvements, and the compressed representations didn't perform especially well. I haven't seen anything interesting yet that explicitly deals with the fact that neural networks represent the real world in some fashion and so have a lot of biases imposed on the values the weights can take.
> improvements to it, alternatives to backprop
I love seeing these come out. There's the no free lunch theorem and all that jazz, but in practice for real networks this can be a huge win.
> parallelism as well
Not really, at least if I'm understanding the chain of ancestor comments correctly. The arguments are more about the total cost of a given network than the total time to train it. Those are loosely entangled since with low levels of parallelism we're inclined to operate at higher clock speeds or take other energy-inefficient actions, but generally we would expect a parallel algorithm to be no more energy efficient (with respect to total conceptual work performed -- e.g., training a fixed neural network) than an equivalent serial implementation.
I think it depends on how you conceptualize the topology in both cases. A serial implementation requires threading all the data through a single point, whereas a parallel implementation can leave data where it is going to be used. Moving data around requires energy, so implementations that maximize locality of data should be more energy efficient. Such implementations would naturally synchronize as little as possible, so they would be highly parallel.
Basically, serial implementations of neural networks require a clock and a form of RAM, including the energy overhead of dispatch, synchronization and data transport, whereas parallel implementations don't: each neuron could just contain whatever little data it needs and nothing more.
Such a scheme, essentially inspired by spiking neurons, could be implemented in neuromorphic hardware.
I don't doubt we will find hardware that is more frugal energy-wise, but calling it greener is somewhat dishonest.
> The economic incentives are very powerful: […]
Precisely, a technology is not green in itself, it depends on its use within society, and as you pointed out, we are talking about a race here. So the solution is not merely technological, there is some policy and/or societal change to make this greener. Otherwise we are just going to red-queen ourselves into Jevons paradox.
Also bitcoin mining chips are actually a lot like deep learning chips in that it's a lot of simple operations scaled out. And indeed, Bitmain now produces deep learning chips too.
I would say there's a second piece here, too: to the extent that it's driving any sort of research into more efficient practices, it's only in response to constraints of Bitcoin's own making, and there's every intention to immediately use up new capacity with additional Bitcoin mining. What could have been an external net benefit to humanity will just be absorbed into the same things that absorbed prior capacity.
I don't want to run all the way with that thought, though, because it can be the case that there's a net benefit to humanity nevertheless. But it makes me think of how Keynes promised that future efficiencies would bring a utopia in the form of dramatically scaling back the need to toil to fulfill basic needs and comforts, how that wasn't necessarily wrong, and how we ended up not taking that path just the same.
Developing an ASIC design is very expensive. There are usually many thousands (sometimes hundreds of thousands) of dollars of NRE (Non-Recurring Engineering expenses). Whereas you could put your design into an FPGA for the cost of an FPGA board (typically a few hundred dollars to a few thousand, and you'll be able to use the same board for many different designs, since the FPGA is reprogrammable) plus the FPGA design software (which is often given away for free or deeply discounted by the FPGA vendors). So ASICs are for organizations with lots of money, while FPGAs are accessible to hobbyists. FPGAs are also often used to test out a design that is eventually implemented in an ASIC.
The crypto folks have mostly moved to ASICs for the higher performance.
The tradeoff is that ASICs are (almost always) much faster at performing their specific task than an FPGA could be. I'm not sure of all the reasons, but you could imagine that there are many, many optimizations you can do when designing pure hardware that don't transfer over to software-defined hardware, because the FPGA intermediary is in the way.
From what I understand, FPGAs provide a software method of essentially reprogramming the actual behavior of the physical circuit to perform different tasks. This means FPGAs are (to some extent) general purpose, it's just they are programmed to provide only one purpose at a time (compared to traditional CPUs or GPUs for example).
ASICs on the other hand are circuits designed specifically for your application only, they cannot be reprogrammed and must be created specifically for your algorithm.
An FPGA is reconfigurable hardware, you tell it what to become instead of what to execute. An ASIC is able to be only the one thing that the mask used to create it dictates.
I think it is definitely true that historically the gamers drove the adoption of GPUs and almost accidentally enabled the deep learning revolution, but has the balance shifted already and is deep learning now the driving force? And are GPUs going to be completely superseded by custom deep learning hardware or rather not?
Instead, we need to start moving toward more efficient and specialized implementations of statistical models. SciML is a great example of how we can move past the age of amorphous black boxes and into more efficient representations of the same problems.
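The SciML idea can be shown in miniature (this toy is mine, not from the SciML docs): instead of fitting an amorphous black box, keep the known structure of the problem, here exponential decay dy/dt = -k·y, and learn only the unknown parameter k from data. The data and model are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic observations of a decaying quantity, true rate k = 0.7 plus noise.
t = np.linspace(0, 3, 50)
k_true = 0.7
y = np.exp(-k_true * t) + rng.normal(0, 0.005, size=t.size)

# Because the structure dy/dt = -k*y is known, log y is linear in t,
# and the whole "learning" problem collapses to a one-parameter fit.
k_hat = -np.polyfit(t, np.log(np.clip(y, 1e-6, None)), 1)[0]
print(round(k_hat, 2))   # close to 0.7
```

A generic network would need far more data and compute to rediscover what one line of prior knowledge provides for free, which is exactly the efficiency argument.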
Maybe in the near future everyone will need xxx petaflops of compute, purchased just as you would any other utility. At least anyone who wants their personal AI secretaries to "learn" and "think" effectively.
Historically these utility/infrastructure projects need to be built by the government, since it's a chicken-and-egg problem. "Internet" companies could only exist after the internet was built. "AI" companies can only exist after GPT-3 levels of compute are made available to the everyday person.
(There might be a physics argument for why the economics of scale for compute can't improve that much even with nation-level investment. Usually such arguments end with needing to build a Dyson swarm. I really hope they are wrong, because I don't think I will live to see it.)
But I still think that system design is much more important than raw computing power, since the latter is still quite cheaply available on the market.
Deep learning models are effectively pattern matching machines that cannot separate causality from correlation. As we throw more compute and $$$ at deep learning models we will experience diminishing returns in performance because of this.
For us to achieve AGI, the holy grail of AI, we will need to develop algorithms that can recognize causality somehow. Judea Pearl’s “The Book of Why” does a great job articulating why this is important. Deep learning is a big leap forward and we’re only beginning to see its impacts, but it’s not sufficient to achieve AI’s most ambitious goals.
There's no reason to suppose that it's even possible to recognize causality and separate it from correlation purely from static observations - among others, Judea Pearl has written a bunch about it, this year's ACL best theme paper (Bender&Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data) was on a very similar topic, IIRC there's neuroscience research that suggests that mammals can't "learn to see" functionally if they only get visual input that's not caused by their own acts, etc. It's highly likely that no algorithm or mind can learn what you propose solely from the data (and the type of data) that was available to train GPT.
It seems that recognizing causality requires intervention, not just observation; it requires active agents, not just passive spectators. It requires "play" and experimentation. Learning language meaning requires at least some grounding and alignment with reality. The problems of causality illustrate the limitations and the necessity of integrating models with the environment; they do not indicate any fundamental limitation of model structure itself. Perhaps deep learning isn't sufficient for learning these structures, and perhaps it is; as far as I know we don't have any good evidence to justify assuming one way or the other.
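The observation-versus-intervention point can be made concrete with a tiny simulation (entirely my own toy numbers). A hidden confounder Z drives both X and Y; X has no causal effect on Y, yet the two correlate almost perfectly in observational data. An intervention, setting X by fiat and so cutting its link to Z, is what reveals the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)                  # hidden common cause
x_obs = z + 0.1 * rng.normal(size=n)    # X listens to Z
y_obs = z + 0.1 * rng.normal(size=n)    # Y also listens to Z, ignores X entirely

r_obs = np.corrcoef(x_obs, y_obs)[0, 1]     # strong spurious correlation

# do(X): assign X at random, independent of Z, and watch what Y does.
x_do = rng.normal(size=n)
y_do = z + 0.1 * rng.normal(size=n)         # Y still depends only on Z
r_do = np.corrcoef(x_do, y_do)[0, 1]        # correlation vanishes

print(round(r_obs, 2), round(r_do, 2))      # ≈ 0.99 vs ≈ 0.00
```

No amount of extra observational samples changes r_obs; only the act of intervening distinguishes the two causal structures, which is the point about passive spectators.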
On the other hand, if/when some artificial agent has learned the core concepts of causality and language grounding in some small environment - using whatever algorithms are required for that - then it seems plausible that a GPT-like system could be sufficient to provide the mapping to extend these concepts to the whole range of things that we talk about; once the agent understands (according to whatever reasonable nontrivial definition of "understands") the concept of causality for some actions, it can learn the causality of all the other actions which are described if you read all the written text in the world.
Neither can most humans without rigorous experimentation. Deep learning today seems to emulate only system 1. It just turns out that much of what expert humans do can be attributed to finely tuned system-1 intuition. To achieve AGI we'd need a system-2. (nomenclature for system-1 and system-2 taken from Kahneman and Tversky's work on human cognition)
This video is from a talk at JuliaCon 2020 and is about the SciML ecosystem: https://www.youtube.com/watch?v=QwVO0Xh2Hbg
For more info, check out https://sciml.ai/
I am not affiliated with them, but really admire their work and think it will become an invaluable tool in the coming years as the limits of neural networks become apparent and we stop running out of little tricks to squeeze one more SOTA paper out of them.
The story is narrated by Bixby, the cofounder of CPLEX and Gurobi, the two best available commercial mathematical optimization solvers. The magic happened in CPLEX v6.5 back in 1999, with almost a 10x year-over-year performance improvement.
I don't agree with the conclusion of the paper. Computing architectures have been improving dramatically over the last few years, and almost any task that was achievable 5 years ago with deep learning is now orders of magnitude cheaper to train.
The energy consumed by deep learning is increasing because of the huge ROI for companies, but it will probably slow down as compute costs approach the cost of software engineers (or the profit of a company), because at that point researching improvements to the models becomes relatively cheaper again.
Perhaps it can. (Though the number of companies that can afford to train it is rather small.) But can $5 trillion be extracted from GPT-6? Even if it can, who can afford to train it?
If you want to build a supersonic fighter, it's a lot easier if you can look at how someone else has done it. This focuses your search to certain parts of the possibility space, saving tons of time. It gives you proof that certain techniques can in fact work. This is why the proof-of-concept, or technology demonstrator, is such an important milestone.
It's why countries do so much technology espionage.
Maybe it takes us a lot of computation to get the first NN that achieves a given performance level on a task we care about.
But then we have an artefact we can reverse engineer. At a minimum, we can compress it, sure. But ultimately we can do science on it to find out how it's doing what it's doing, to look for the learned architectures within.
It's going to be a lot easier to do this on a deep learning system, than a biological one.
Maybe we wouldn't do this while it's cheap to just scale compute, but eventually we will. And the new understanding we get will probably enable us to build new generations that are even more efficient.
This is really just a gut take, but it's an argument for optimism here.
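The "at a minimum, we can compress it" step above has simple instantiations today. One is truncated-SVD compression of a layer's weights; in this sketch the "learned" matrix is constructed to be near low-rank (an assumption, baked in by hand), so two thin factors reproduce it from a small fraction of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic "trained" 512x512 layer that is near rank-16 by construction.
W = rng.normal(size=(512, 16)) @ rng.normal(size=(16, 512))
W += 0.01 * rng.normal(size=W.shape)        # small noise on top

# Compress: keep only the top-r singular triplets.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 16
W_small = (U[:, :r] * s[:r]) @ Vt[:r]       # store two thin factors instead of W

rel_err = np.linalg.norm(W - W_small) / np.linalg.norm(W)
params_ratio = (U[:, :r].size + r + Vt[:r].size) / W.size
print(f"rel err {rel_err:.3f}, params kept {params_ratio:.1%}")
```

Whether real trained networks are this compressible is exactly the empirical question the comment calls "doing science on the artefact".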