The Computational Limits of Deep Learning (arxiv.org)
168 points by ozdave 42 days ago | 73 comments

My interpretation of current ML solutions is that they're mostly brute-force statistical models hoping to hit a satisfactory p-value (some implementations being more elegant in the brute force than others). The available computing power defines the limits of the finite domain you can apply the solution to.

A simplified analogy that I believe is applicable is the application of flocking behavior in birds. Current ML implementations would demand a large dataset of groups of birds in flight, both flocking and non-flocking, and curating the correct p-value of brute-forced models that allege to predict whether behavior is flocking or not (and in the case of an individual bird whether it is appropriate flocking behavior given the individual behaviors of birds in the dataset and their circumstances). But flocking behavior is easily modeled right now, and has been for decades, using simple rules for individuals and depending upon emergent phenomena within a group.
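
To illustrate the "simple rules for individuals" point, here is a minimal sketch of a boids-style update, assuming the usual three rules (cohesion, alignment, separation). The function name, weights and radii are all made up for illustration, not taken from any particular implementation.

```python
import math

def step(boids, radius=10.0, sep_radius=2.0,
         w_coh=0.01, w_align=0.05, w_sep=0.1, max_speed=2.0):
    """Advance all boids one tick. Each boid is a tuple (x, y, vx, vy)."""
    updated = []
    for i, (x, y, vx, vy) in enumerate(boids):
        # Each bird only looks at neighbors within a fixed radius.
        near = [b for j, b in enumerate(boids)
                if j != i and math.hypot(b[0] - x, b[1] - y) < radius]
        if near:
            n = len(near)
            # Cohesion: steer toward the neighbors' center of mass.
            coh_x = sum(b[0] for b in near) / n - x
            coh_y = sum(b[1] for b in near) / n - y
            # Alignment: steer toward the neighbors' average velocity.
            ali_x = sum(b[2] for b in near) / n - vx
            ali_y = sum(b[3] for b in near) / n - vy
            # Separation: steer away from neighbors that are too close.
            crowd = [b for b in near
                     if math.hypot(b[0] - x, b[1] - y) < sep_radius]
            sep_x = sum(x - b[0] for b in crowd)
            sep_y = sum(y - b[1] for b in crowd)
            vx += w_coh * coh_x + w_align * ali_x + w_sep * sep_x
            vy += w_coh * coh_y + w_align * ali_y + w_sep * sep_y
        # Cap the speed so the rules cannot accelerate a bird without bound.
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            vx, vy = vx * max_speed / speed, vy * max_speed / speed
        updated.append((x + vx, y + vy, vx, vy))
    return updated
```

Calling `step` repeatedly on a list of random boids is all there is to it; no training data is involved, and whatever flocking appears comes from the local rules.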

I'm concerned that most of the ML efforts now are merely attaining the low-hanging fruits of brute force but will run into a wall that halts progress at the level of relatively "easy" things solved by worms and other comparable biological solutions since so many domains have a level of complexity that would exceed any realistically imaginable level of simple mathematical computing power.

Agreed. It's been stated time and time again that the next frontier of AI will come from a paradigm shift. Deep learning is great, but it is, at the end of the day, statistical model fitting. Can it get us to the next frontier? Sure, as much as you can simulate a Turing machine on a calculator and then simulate an R/B tree on that. Using ML to reach the next stage (of making algorithms capable of generic learning without being finely tuned) might be the hard way, and we may be approaching the mirror at mach speed. When, how, what, or why this paradigm shift will happen - no one knows, but we all know the alternative amounts to luck of the draw.

I've largely been of the camp that ML/DL will be the herald of the next AI winter, and while GPT-3 is impressive, it has no new constraints that we haven't already seen before, and doesn't break any that needed to be broken to change the field (NLP notwithstanding). Soon enough, the limitations of ML/DL will be apparent, and while there will be a breadth of practical applications to explore and analyze, we will be able to see clearly the boundary of how far ML/DL can take us, and will have to be content with what we have until the shift occurs.

This is one of my favorite quotes, which I think summarizes what you wrote in one sentence:

"artificial intelligence is the second best way to solve any problem"

(KJ Astrom, UC Santa Barbara)

eh, I get the point, but not really, at least for modern AI. Data-driven approaches just work better for large classes of hard problems. There's not going to be some simple algorithm to identify a kitten in a photo.

There appears to be some hope in the (currently largely Julia based) SciML - scientific machine learning - ecosystem. The goal there is to combine scientific models and learn corrections from data. Pretty impressive set of libraries in this space.

It seems to me we have only solved one part of the puzzle. Solved is possibly too strong a word, but the basic method seems to be to take a guess at what sort of network works best (flat / RNN / convolution / LSTM), throw a lot of data at it and with luck you get a good result. Sometimes, as happened with AlphaZero, a better result than the planet has seen from man or machine to date. I suspect we have a fair bit to learn on the "best network design for a particular task" front, but it looks like that will happen over time.

I don't think that's too different to how nature does it in worms, as you say. However, our brains also have a different method of operation. We can see something just once, and learn from it. The small child scalds herself on the room heater, and never goes near it again.

So the child has learnt from a single example. We have no idea how to do that. At some level it's probably a complex mechanism that involves memorising it, replaying the memory over and over again till, as you put it, the brain hits a satisfactory p-value. As far as I know no one has built such a mechanism yet. However, that's probably because even if you did build it, the fact remains the brain has learnt from a single example and we have no idea how to learn near perfectly from a single example. Let alone do as the child did and learn the lesson in the space of seconds or at most minutes.

> We have no idea how to do that.

I agree. However, my response if I intentionally don't over-think it, is "well she learns from a single example because that experience was painful and she wants to minimize pain."

Granted that's true as much as it is facile, but I do wonder if it hints at what the next frontier will consist of. A thermometer will happily keep reading out the temperature of a room heater until its physical form is damaged. A human child has a concept of pain to dissuade behaviors that lead to harm (in theory) (this is obviously simplified).

What do you mean by p-hacking? In statistics that refers to running a large number of experiments and taking the one with the best results, but ML trains for hundreds of thousands of steps on a single model.

ML runs hundreds of thousands of experiments, each on a slightly different model, and takes the one with the best results.

That's not necessarily p-hacking. It becomes p-hacking when the model is hyperoptimized to the test set and thus fails when applied to new data from outside the test set.
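
To make the test-set point concrete, here is a minimal sketch, assuming labels that are pure coin flips and "models" that are just random guessers identified by a seed (all names are hypothetical). Picking the best of many such models on a small test set looks impressive, and the advantage disappears on fresh data.

```python
import random

def predict(model_seed, n):
    """A 'model' with no real signal: a fixed stream of random 0/1 guesses."""
    r = random.Random(model_seed)
    return [r.getrandbits(1) for _ in range(n)]

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def run_experiment(n_models=1000, n_test=50, n_fresh=2000, seed=12345):
    rng = random.Random(seed)
    test_labels = [rng.getrandbits(1) for _ in range(n_test)]
    fresh_labels = [rng.getrandbits(1) for _ in range(n_fresh)]

    # Selection step: keep whichever model scores best on the test set.
    best_seed = max(range(n_models),
                    key=lambda s: accuracy(predict(s, n_test), test_labels))
    test_acc = accuracy(predict(best_seed, n_test), test_labels)

    # The same model, scored on examples it was never selected against.
    fresh_preds = predict(best_seed, n_test + n_fresh)[n_test:]
    fresh_acc = accuracy(fresh_preds, fresh_labels)
    return test_acc, fresh_acc

if __name__ == "__main__":
    test_acc, fresh_acc = run_experiment()
    print(f"accuracy on the test set used for selection: {test_acc:.2f}")
    print(f"accuracy on fresh data:                      {fresh_acc:.2f}")
```

The selected model's test accuracy is well above chance purely because of the selection, while its accuracy on fresh data stays near 50%.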

We can look at Deep Learning's growing demands for computation and despair, or view those growing demands as an economic incentive to develop more powerful hardware that uses energy more efficiently at a lower marginal cost and in a more sustainable manner.

In other words, Deep Learning's growing need for computing power seems to have reached a point at which it is now motivating fundamental research to find greener, cheaper, more energy-efficient hardware.

The economic incentives are very powerful: Whichever companies (or organizations, or countries) find ways to harness the most computing power at the lowest marginal cost will win the race in this market.


PS. The same could be said for Bitcoin mining: it is also motivating fundamental research to develop greener, cheaper, more energy-efficient, more powerful hardware. Whoever finds ways to harness the most computing power at the lowest marginal cost will make the most money processing transactions on the network.

Saying that increased demand for compute will spur exponential growth is like saying that increased cross-Atlantic travel will spur the development of supersonic passenger airplanes, which we know is the opposite of what happened.

Alex Krizhevsky's revolution was to find a way to train large neural networks using existing hardware that was optimized for linear algebra. Linear algebra is literally the first applied problem that was studied in computer science, with papers that were written by Turing and von Neumann. It's the most mature field in computing and is long out of steam. Progress since AlexNet came from scaling $$$ not scaling tflops. There will not be exponential growth in compute performance.

I really don't understand that comment. We had the Concorde, didn't we? We found out that the marginal costs are not worth the 2h saved, fair enough, but there was a race towards faster. And I believe there are companies developing supersonic planes today again, maybe at lower marginal costs. OP did not say the efforts will be successful, OP said there will be a race to see if it is possible to drive costs down spurred by increased demand.

Fast isn't the salient metric. Efficiency is.

The example is a fallacy, because the Concorde was relatively inefficient. The passenger plane of today is highly optimized for actual market demands.

We are not looking for just faster ML at any cost, we are looking for cost-efficient ML so we can do more of it.

TPUs and waferscale hardware beg to differ.

1. TPUs haven't really beaten newer GPUs significantly in performance.

2. Wafer scale integration and other advanced packaging approaches are qualitatively different types of computing advances than CMOS scaling, which provides better, faster, smaller, cheaper and at the same time, due to Dennard scaling, reduces power.

Wafer scale has existed since the 1980s. It's literally Concorde-era technology.

Aren’t companies like google[0] and nvidia[1] already doing this?

The paper’s point is that eventually we will reach computing power limits and then we will have to improve the deep learning algorithm’s efficiency to continue to improve. From the abstract:

> Continued progress in these applications will require dramatically more computationally-efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.

[0] https://cloud.google.com/tpu

[1] https://www.nvidia.com/en-us/data-center/v100/

Sure. I mean part of it will be finding and exploiting symmetries, improvements or alternatives to backprop, parallelism as well.

> finding and exploiting symmetries

I'm having trouble finding the papers I've skimmed on this (my original search phrase was something about succinct neural network encodings, entropy, yada yada yada), but that's already being done here and there, at least at a high level looking at graph symmetries (results were okayish -- theoretical space bounds aren't substantial improvements, and the compressed representations didn't perform super well). I haven't seen anything interesting yet explicitly dealing with the fact that neural networks represent the real world in some fashion and have a lot of biases imposed on the values the weights can take.

> improvements or alternatives to backprop

I love seeing these come out. There's the no free lunch theorem and all that jazz, but in practice for real networks this can be a huge win.

> parallelism as well

Not really, at least if I'm understanding the chain of ancestor comments correctly. The arguments are more about the total cost of a given network than the total time to train it. Those are loosely entangled since with low levels of parallelism we're inclined to operate at higher clock speeds or take other energy-inefficient actions, but generally we would expect a parallel algorithm to be no more energy efficient (with respect to total conceptual work performed -- e.g., training a fixed neural network) than an equivalent serial implementation.

> generally we would expect a parallel algorithm to be no more energy efficient (with respect to total conceptual work performed -- e.g., training a fixed neural network) than an equivalent serial implementation

I think it depends on how you conceptualize the topology in both cases. A serial implementation requires threading all the data through a single point, whereas a parallel implementation can leave data where it is going to be used. Moving data around requires energy, so implementations that maximize locality of data should be more energy efficient. Such implementations would naturally synchronize as little as possible, so they would be highly parallel.

Basically, serial implementations of neural networks require a clock and a form of RAM, including the energy overhead of dispatch, synchronization and data transport, whereas parallel implementations don't: each neuron could just contain whatever little data it needs and nothing more.

There are alternatives to backprop like direct feedback alignment (1). They specifically focus on making the computation more parallel and thus more scalable for the reasons you mention.

(1) http://papers.nips.cc/paper/6441-direct-feedback-alignment-p...

An interesting backprop alternative here:


Essentially inspired by spiking neurons, it can be implemented in neuromorphic hardware.

> In other words, Deep Learning's growing need for computing power seems to have reached a point at which it is now motivating fundamental research to find greener, cheaper, more energy-efficient hardware

I don't doubt we will find more frugal hardware energy wise, but calling it greener is somewhat dishonest.

> The economic incentives are very powerful: […]

Precisely, a technology is not green in itself, it depends on its use within society, and as you pointed out, we are talking about a race here. So the solution is not merely technological, there is some policy and/or societal change to make this greener. Otherwise we are just going to red-queen ourselves into Jevons paradox.

I think the case here is a lot easier than the case for bitcoin mining. Bitcoin miners are so stupidly single purpose that development there doesn't help much. Maybe in general it helps create an industry for designing and manufacturing ASICs. I suppose that might go into making ASICs for deep learning at some point.

Bitmain is one of TSMC's largest customers, and that absolutely has been reinvested by TSMC in developing more advanced fabrication techniques.

Also bitcoin mining chips are actually a lot like deep learning chips in that it's a lot of simple operations scaled out. And indeed, Bitmain now produces deep learning chips too.

>Bitcoin miners are so stupidly single purpose that development there doesn't help much.

I would say there's a second piece here, too, which is that to the extent that it's driving any sort of research into more efficient practices, it's only in response to constraints of bitcoin's own making, and there's every intention to immediately use up new capacity with additional bitcoin mining. What could have been an external net benefit to humanity will just be absorbed into the same things that absorbed prior capacity.

I don't want to run all the way with that thought, though, because it can be the case that there's a net benefit to humanity nevertheless. But it makes me think of how Keynes promised that future efficiencies would bring a utopia in the form of dramatically scaling back the need to toil to fulfill basic needs and comforts, how that wasn't necessarily wrong, and how we ended up not taking that path just the same.

Can bitcoin ASICs (or ASICs in general) be reprogrammed to do another task? I have always heard that they are only designed to do cryptographic hashing, but are they just fundamentally that way? Would programming them to do something else be impossible or very infeasible?

ASICs are not reprogrammable. FPGAs are.

I'm not well versed in these things so excuse the stupid question, but is the 'code' that runs the chips somehow hard-wired into the chip essentially? I only know vaguely what FPGAs are from investing a little bit in Xilinx late last year, mainly because from what I read they seemed very important to AI implementation and development. What are the benefits of making ASICs vs FPGAs? Is it a cost thing (i.e. the ASICs are cheaper to make?)

The advantage of ASICs is that a design implemented in an ASIC is likely going to run faster than the same design implemented in an FPGA - usually significantly faster. FPGAs are flexible, but as such you trade performance for that flexibility.

Developing an ASIC design is very expensive. There are usually many thousands (sometimes hundreds of thousands) of dollars of NRE (Non-Recurring Engineering expenses). Whereas you could put your design into an FPGA for the cost of an FPGA board (typically a few hundred dollars to a few thousand - but you'll be able to use the same FPGA board for many different designs as the FPGA is reprogrammable) and the FPGA design software (which is often given away for free or deeply discounted by the FPGA vendors). So ASICs are for organizations with lots of money. FPGAs are accessible for hobbyists. FPGAs are also often used to test out a design that is eventually implemented in an ASIC.

The crypto folks have mostly moved to ASICs for the higher performance.

To answer your first question - yes, ASICs (application specific integrated circuits) are generally hardwired - once the hardware is burned into the silicon, it can't be changed. On the other hand, FPGAs (field programmable gate arrays) are, as the name suggests, reprogrammable. In order to allow this (in a useful way) FPGAs are built of a large variety of logic gates which can be connected together by the programmer to, essentially, simulate a pure hardware circuit.

The tradeoff is that ASICs are (almost always) much faster at performing their specific task than an FPGA could be - I'm not sure of all the reasons, but you could imagine that there are many, many optimizations that you can do when designing pure hardware that don't transfer over to software-defined-hardware, because you have the FPGA intermediary there.

I also won't pretend I know anything here, so take what I say with a grain of salt.

From what I understand, FPGAs provide a software method of essentially reprogramming the actual behavior of the physical circuit to perform different tasks. This means FPGAs are (to some extent) general purpose, it's just they are programmed to provide only one purpose at a time (compared to traditional CPUs or GPUs for example).

ASICs on the other hand are circuits designed specifically for your application only, they cannot be reprogrammed and must be created specifically for your algorithm.

The chip is the code.

An FPGA is reconfigurable hardware, you tell it what to become instead of what to execute. An ASIC is able to be only the one thing that the mask used to create it dictates.
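
A toy analogy in code, for what it's worth. Real FPGAs do build their logic out of small lookup tables, but everything below (the class, the names, the two-input size) is a simplified illustration and not how any real toolchain works.

```python
class LookupTable:
    """Toy stand-in for one FPGA logic cell: a 2-input truth table
    that can be rewritten after the 'chip' has been made."""

    def __init__(self, truth_table):
        self.configure(truth_table)

    def configure(self, truth_table):
        # Reconfiguring changes what the cell *is*, not what it executes.
        if len(truth_table) != 4:
            raise ValueError("a 2-input table needs exactly 4 entries")
        self.table = tuple(truth_table)

    def __call__(self, a, b):
        # Inputs (a, b) select one of the four stored output bits.
        return self.table[(a << 1) | b]


def hardwired_xor(a, b):
    """Toy stand-in for an ASIC: the function is fixed when it is 'fabricated'."""
    return a ^ b


# Truth tables indexed by (a, b) = (0,0), (0,1), (1,0), (1,1).
AND = (0, 0, 0, 1)
XOR = (0, 1, 1, 0)

if __name__ == "__main__":
    cell = LookupTable(AND)
    print(cell(1, 1))      # behaves as AND
    cell.configure(XOR)    # same cell, now a different circuit
    print(cell(1, 1), hardwired_xor(1, 1))
```

The lookup and the configuration storage are what the FPGA pays for in speed and power; the hardwired version has nothing to look up, and nothing to change.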

ASICs are more expensive to make, more performant per watt, and (a lot) less programmable.

There isn't really code in an ASIC (there can be, but the more there is, the less of an ASIC it is), it's a physical configuration of transistors to accomplish a specific task. Can you change the code in a drill to make it a hammer? Not really, you can maybe screw a metal bit to the back so you can use it to bash in some nails, but it's probably cheaper to buy a new tool/ASIC.

What is the relative proportion of the number of GPUs now sold for deep learning purposes vs gaming?

I think it is definitely true that historically the gamers drove the adoption of GPUs and almost accidentally enabled the deep learning revolution, but has the balance shifted already and is deep learning now the driving force? And are GPUs going to be completely superseded by custom deep learning hardware or rather not?

Nvidia revenue is still slightly more gaming than data center (which covers AI), but they are about to reach parity https://nvidianews.nvidia.com/news/nvidia-announces-financia...

Deep learning doesn't need to consume more energy. We are doing enough of that as it is.

Instead, we need to start moving toward more efficient and specialized implementations of statistical models. SciML is a great example of how we can move past the age of amorphous black boxes and into more efficient representations of the same problems.

If you think about it, it's insane how much infrastructure we take for granted (xxx kilowatt of electricity, xxx mbps of internet, all for dirt cheap).

Maybe in a near future everyone would need xxx petaflops of compute just as you would any other utility. At least anyone who wants their personal AI secretaries to "learn" and "think" effectively.

Historically these utility/infrastructure projects need to be built by the government since it's a chicken and egg problem. "Internet" companies can only exist after the internet is built. "AI" companies can only exist after GPT-3 levels of compute are made available to the everyday person.

(There might be a physics argument to why the economy of scale of compute can't go down that much even with nation-level of investments. Usually it ends up with needing to build a dyson swarm. I really hope they are wrong because I don't think I will live to see it)

Before that it was computer gaming to a large degree. Not yet sure if that isn't still the more reasonable and constructive use of computing power to be honest, although some results of machine learning begin to seem slightly impressive.

But I still think that system design is much more important than raw computing power, since the latter is still quite cheaply available on the market.

BC is a counter example & does not help here at all.

Reading this paper reminds me how important it is for AI to continue to evolve its core algorithms.

Deep learning models are effectively pattern matching machines that cannot separate causality from correlation. As we throw more compute and $$$ at deep learning models we will experience diminishing returns in performance because of this.

For us to achieve AGI[0], the holy grail of AI, we will need to develop algorithms that can recognize causality somehow. Judea Pearl’s “The Book of Why”[1] does a great job articulating why this is important. Deep learning is a big leap forward and we’re only beginning to see its impacts, but it’s not sufficient to achieve AI’s most ambitious goals.

[0] https://en.m.wikipedia.org/wiki/Artificial_general_intellige...

[1] https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/0465...

The limitation here is not necessarily in the algorithms but in the data.

There's no reason to suppose that it's even possible to recognize causality and separate it from correlation purely from static observations - among others, Judea Pearl has written a bunch about it, this year's ACL best theme paper (Bender&Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data) was on a very similar topic, IIRC there's neuroscience research that suggests that mammals can't "learn to see" functionally if they only get visual input that's not caused by their own acts, etc. It's highly likely that no algorithm or mind can learn what you propose solely from the data (and the type of data) that was available to train GPT.

It seems that recognizing causality requires intervention, not just observation, it requires active agents, not just passive spectators. It requires "play" and experimentation. Learning language meaning requires at least some grounding and alignment with reality. The problems of causality illustrate the limitations and necessity of integrating models with the environment, they do not indicate any fundamental limitations with model structure itself; perhaps deep learning isn't sufficient for learning these structures, and perhaps it is, as far as I know we don't have any good evidence to justify assuming one way or the other.
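
A minimal simulation of the observation-versus-intervention point, assuming a made-up world with a hidden common cause `z` that drives both `x` and `y`, while `x` has no effect on `y` at all. The probabilities and function names are arbitrary choices for illustration.

```python
import random

def sample(rng, do_x=None):
    """Draw one (x, y) pair. If do_x is given, x is set by intervention."""
    z = rng.random() < 0.5                       # hidden common cause
    if do_x is None:
        x = rng.random() < (0.9 if z else 0.1)   # x follows z when left alone
    else:
        x = do_x                                 # the agent sets x itself
    y = rng.random() < (0.9 if z else 0.1)       # y depends on z only, never on x
    return x, y

def effect_estimate(pairs):
    """Estimate P(y | x=1) - P(y | x=0) from a list of (x, y) pairs."""
    y_when_x = [y for x, y in pairs if x]
    y_when_not_x = [y for x, y in pairs if not x]
    return sum(y_when_x) / len(y_when_x) - sum(y_when_not_x) / len(y_when_not_x)

def run(n=20000, seed=0):
    rng = random.Random(seed)
    observed = [sample(rng) for _ in range(n)]
    # Intervention: alternate x between True and False regardless of z.
    intervened = [sample(rng, do_x=(i % 2 == 0)) for i in range(n)]
    return effect_estimate(observed), effect_estimate(intervened)

if __name__ == "__main__":
    obs, intv = run()
    print(f"apparent effect of x on y from passive observation: {obs:.2f}")
    print(f"effect of x on y when the agent sets x itself:      {intv:.2f}")
```

From the passive data alone, `x` looks like a strong predictor of `y`; only when the agent sets `x` itself does the estimated effect drop to roughly zero.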

On the other hand, if/when some artificial agent has learned the core concepts of causality and language grounding in some small environment - using whatever algorithms are required for that - then it seems plausible that a GPT-like system could be sufficient to provide the mapping to extend these concepts to the whole range of things that we talk about; once the agent understands (according to whatever reasonable nontrivial definition of "understands") the concept of causality for some actions, it can learn the causality of all the other actions which are described if you read all the written text in the world.

So goes the old teachers’ proverb: “Tell me and I forget, show me and I may remember, involve me and I understand.”

This is a good argument for why AGI is more likely to come from AI gym environments where it learns to control an agent in an environment. It isn’t presented with a static dataset but interacts.

This is a great point!

> cannot separate causality from correlation

Neither can most humans without rigorous experimentation. Deep learning today seems to emulate only system 1. It just turns out that much of what expert humans do can be attributed to finely tuned system-1 intuition. To achieve AGI we'd need a system-2. (nomenclature for system-1 and system-2 taken from Kahneman and Tversky's work on human cognition)

The current state of the field is reminiscent of discrete mathematical optimization in the late nineties. Computational power was increasing rapidly back then, but our ability to solve problems with more than a hundred binary variables had hardly improved at all. It was only after new theoretical results in integer programming found their way into the solution algorithms that we saw a step change in performance.

I think autograd and its suffusion into the modern tools for machine learning development is initiating a paradigm shift. If you haven't already, check out the SciML ecosystem in Julia. One can leverage it to build machine learning models that can slot right into any existing simulator, dynamical system, or any other method to represent systems to fit to data, and the gradients will be handled appropriately through the entire language natively.
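
For anyone who hasn't seen the idea, here is a rough sketch of "gradients flowing through an existing simulator". It is plain Python with a hand-rolled dual number (forward-mode autodiff), not the SciML or Julia API, and the decay model, names and step sizes are all assumptions made for the example.

```python
class Dual:
    """A value together with its derivative w.r.t. one chosen parameter."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule.
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.der)


def simulate_decay(k, x0=1.0, dt=0.01, steps=100):
    """Euler integration of dx/dt = -k * x. Written with no knowledge of Dual."""
    x = x0
    for _ in range(steps):
        x = x + dt * (-k * x)
    return x


def loss_and_grad(k, target):
    """Squared error of the simulation output and its derivative w.r.t. k."""
    err = simulate_decay(Dual(k, 1.0)) - target
    sq = err * err
    return sq.val, sq.der


def fit(target, k=1.0, lr=5.0, iters=300):
    """Plain gradient descent on k, using gradients from the simulator."""
    for _ in range(iters):
        _, grad = loss_and_grad(k, target)
        k -= lr * grad
    return k


if __name__ == "__main__":
    target = simulate_decay(2.0)      # pretend this is measured data
    print(fit(target))                # should land close to 2.0
```

The simulator is ordinary code; the derivative comes along for free because the arithmetic operators are overloaded, which is the property the tooling above generalizes.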

This video is from a talk at JuliaCon 2020 and is about the SciML ecosystem: https://www.youtube.com/watch?v=QwVO0Xh2Hbg

For more info, check out https://sciml.ai/

I am not affiliated with them, but really admire their work and think it will become an invaluable tool in the coming years as the limits of neural networks become apparent and we run out of little tricks to squeeze one more SOTA paper out of them.

Interesting. Could you point me to some history article about that?


The story is narrated by Bixby, the cofounder of CPLEX and Gurobi, the two best available commercial mathematical optimization solvers. The magic happened in CPLEX V6.5 back in 1999, with an almost 10x year-over-year performance improvement.

"This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power."

I don't agree with the conclusion of the paper. The computing architectures have been improving dramatically over the last few years, and almost any task that was achievable 5 years ago with deep learning is orders of magnitude cheaper to train.

The energy resources taken by deep learning are increasing because of the huge ROI for companies, but it will probably slow down as the compute cost gets close to the cost of software engineers (or the profit of a company), because at that point researching improvements to the models gets relatively cheaper again.

I wonder how long we can continue overfitting these benchmark datasets as a community of researchers? How much of ImageNet is labeled incorrectly/suboptimally?

As long as there are PhD students in need of a dissertation.

What's disheartening is that all the charts are in log scale for "Computation (Hardware Burden)". These days it feels like any non-trivial achievement in the field of ML requires the resources of state actors. Maybe that's the sign of ML becoming a mature field like Physics, where collective efforts like CERN and ITER are required, and all low-hanging fruit has been picked. But it also contrasts with how Mathematics and Statistics research has been done before. In these fields a paper usually has only one to three authors, and one man can make a difference. In ML, it's probably not the case anymore.

Will it? Or, like capital-intensive industries of the past, will deep learning funnel its profits into bigger and bigger computers, as has been done in the past and will be done again?

These are order-of-magnitude increases. If it costs $5m to train GPT-3, which is 100x more compute than GPT-2, then it may cost $500m to train GPT-4, and $50b to train GPT-5. This is what is meant by economically (not to mention environmentally) unsustainable.
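
The arithmetic behind those figures, as a quick sketch. It assumes, as the comment does, that each generation needs 100x the compute and that cost per unit of compute stays flat; the $5m starting figure is the comment's hypothetical, not a measured number.

```python
def projected_costs(base_cost, growth_per_generation, generations):
    """Training cost per generation under constant multiplicative growth."""
    return [base_cost * growth_per_generation ** g for g in range(generations)]

if __name__ == "__main__":
    names = ["GPT-3", "GPT-4", "GPT-5"]
    for name, cost in zip(names, projected_costs(5_000_000, 100, 3)):
        print(f"{name}: ${cost:,}")
```

Any hardware or algorithmic efficiency gain between generations would shrink the 100x multiplier, which is the variable the rest of this thread is arguing about.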

Environment aside (I don't even think the CURRENT rate of training is environmentally safe) -- I see no inherent problem with a 100x increase in cost per step holding anybody back. Once it's trained, you can run it much cheaper. Who's to say $50 billion of value can't be extracted from GPT-5?

Using electricity is not inherently damaging to the environment. Very low cost and high power zero-carbon generation sources exist (hydro and nuclear). For scale also keep in mind that all the datacenters in the world still use much less power than is used for smelting aluminium.

Also, processors turn electricity into heat - if you can find a useful way of harnessing that heat, the computation is practically free. Deep learning is fairly latency-insensitive and is almost never mission-critical, so it's not inconceivable to imagine that the boiler rooms of apartment buildings and office blocks could become mini-DCs.

Also, once there are billions of value to be extracted, specialized chips will appear just to train those models (e.g. wafer scale analog processors, could easily be 3 orders of magnitude more efficient than GPUs/TPUs).

oh wait, decentralized, publicly controlled infra. What was the cloud about exactly?

> Who's to say $50 billion of value can't be extracted from GPT-5?

Perhaps it can. (Though the number of companies that can afford to train it is rather small.) But can $5 trillion be extracted from GPT-6? Even if it can, who can afford to train it?

If GPT-5 is capable of creating a cost-efficient implementation of GPT-6, then maybe we don't need to worry about the extrapolated $5 trillion price tag.

More computation = more data. That's the real problem. Processors will get faster but for some applications we'll never have more data. So yes, new approaches are needed.

There's a big difficulty difference between Engineering, and Reverse Engineering.

If you want to build a supersonic fighter, it's a lot easier if you can look at how someone else has done it. This focuses your search to certain parts of the possibility space, saving tons of time. It gives you proof that certain techniques can in fact work. This is why the proof-of-concept, or technology demonstrator, is such an important milestone. It's why countries do so much technology espionage.

Maybe it takes us a lot of computation to get the first NN that achieves a given performance level on a task we care about.

But then we have an artefact we can reverse engineer. At a minimum, we can compress it, sure. But ultimately we can do science on it to find out how it's doing what it's doing, to look for the learned architectures within.

It's going to be a lot easier to do this on a deep learning system, than a biological one.

Maybe we wouldn't do this while it's cheap to just scale compute, but eventually we will. And the new understanding we get will probably enable us to build new generations that are even more efficient.

This is really just a gut take, but it's an argument for optimism here.

a complementary reading piece should be sutton's "the bitter lesson", which more or less argues that ai methods/techniques that "leverage computation are ultimately the most effective, and by a large margin."

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

That also means only few monopolistic behemoths can do state of the art AI. Good luck for an independent researcher to pull GPT-3.

What does OpenAI have a monopoly in, exactly?

Currently, a monopoly on text generation. And they are not "open" anymore, being a commercial venture closely associated with Microsoft.

Time to ask Deep Learning to accelerate Moore’s Law
