
The Computational Limits of Deep Learning - ozdave
https://arxiv.org/abs/2007.05558
======
fallous
My interpretation of current ML solutions is that they're mostly brute-force
statistical models hoping to hit a satisfactory p-value (some implementations
being more elegant in their brute force than others). The available computing
power defines the limits of the finite domain you can apply the solution to.

A simplified analogy that I believe applies is the modeling of flocking
behavior in birds. Current ML implementations would demand a large dataset of
groups of birds in flight, both flocking and non-flocking, and the curation of
brute-forced models with a satisfactory p-value that allege to predict whether
behavior is flocking or not (and, in the case of an individual bird, whether
it is appropriate flocking behavior given the individual behaviors of the
birds in the dataset and their circumstances). But flocking behavior is easily
modeled right now, and has been for decades, using simple rules for
individuals and relying on emergent phenomena within the group.
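
For concreteness, a minimal sketch of those simple rules, in the spirit of
Reynolds' boids. The Python, the 2-D NumPy arrays, and the weights/radius are
all just illustrative choices, not anyone's reference implementation:

    import numpy as np

    def flocking_step(positions, velocities, radius=1.0, dt=0.1):
        """One update of the three classic boid rules: cohesion, alignment, separation.
        positions/velocities are (N, 2) arrays; weights and radius are arbitrary."""
        new_velocities = velocities.copy()
        for i in range(len(positions)):
            dists = np.linalg.norm(positions - positions[i], axis=1)
            near = (dists > 0) & (dists < radius)
            if not near.any():
                continue
            cohesion = positions[near].mean(axis=0) - positions[i]      # steer toward local centre
            alignment = velocities[near].mean(axis=0) - velocities[i]   # match neighbours' heading
            separation = (positions[i] - positions[near]).mean(axis=0)  # avoid crowding
            new_velocities[i] += 0.01 * cohesion + 0.05 * alignment + 0.05 * separation
        return positions + new_velocities * dt, new_velocities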

I'm concerned that most ML efforts now are merely picking the low-hanging
fruit of brute force, and will run into a wall that halts progress at the
level of relatively "easy" things solved by worms and other comparably simple
biological systems, since so many domains have a level of complexity that
would exceed any realistically imaginable amount of raw computing power.

~~~
panpanna
This is one of my favorite quotes, which I think summarizes what you wrote in
one sentence:

"artificial intelligence is the second best way to solve any problem"

(KJ Astrom, UC Santa Barbara)

~~~
Upvoter33
eh, I get the point, but not really, at least for modern AI. Data-driven
approaches just work better for large classes of hard problems. There's not
going to be some simple algorithm to identify a kitten in a photo.

------
cs702
We can look at Deep Learning's growing demands for computation and despair, or
we can view those growing demands as an economic incentive to develop more
powerful hardware that uses energy more efficiently, at a lower marginal cost,
and in a more sustainable manner.

In other words, Deep Learning's growing need for computing power seems to have
reached a point at which it is now motivating fundamental research to find
greener, cheaper, more energy-efficient hardware.

The economic incentives are _very_ powerful: Whichever companies (or
organizations, or countries) find ways to harness the most computing power at
the lowest marginal cost will win the race in this market.

--

PS. The same could be said for Bitcoin mining: it is also motivating
fundamental research to develop greener, cheaper, more energy-efficient, more
powerful hardware. Whoever finds ways to harness the most computing power at
the lowest marginal cost will make the most money processing transactions on
the network.

~~~
KKKKkkkk1
Saying that increased demand for compute will spur exponential growth is like
saying that increased cross-Atlantic travel will spur the development of
supersonic passenger airplanes, which we know is the opposite of what
happened.

Alex Krizhevsky's revolution was to find a way to train large neural networks
using existing hardware that was optimized for linear algebra. Linear algebra
is literally the first applied problem that was studied in computer science,
with papers that were written by Turing and von Neumann. It's the most mature
field in computing and is long out of steam. Progress since AlexNet came from
scaling $$$ not scaling tflops. There will not be exponential growth in
compute performance.

~~~
MacsHeadroom
TPUs and waferscale hardware beg to differ.

~~~
deepnotderp
1. TPUs haven't really beaten newer GPUs significantly in performance.

2. Wafer-scale integration and other advanced packaging approaches are
qualitatively different from CMOS scaling, which delivered better, faster,
smaller, and cheaper transistors while, thanks to Dennard scaling, also
reducing power.

------
peterthehacker
Reading this paper reminds me how important it is for AI to continue to evolve
its core algorithms.

Deep learning models are effectively pattern matching machines that cannot
separate causality from correlation. As we throw more compute and $$$ at deep
learning models we will experience diminishing returns in performance because
of this.

For us to achieve AGI[0], the holy grail of AI, we will need to develop
algorithms that can recognize causality somehow. Judea Pearl’s “The Book of
Why”[1] does a great job articulating why this is important. Deep learning is
a big leap forward and we’re only beginning to see its impacts, but it’s not
sufficient to achieve AI’s most ambitious goals.

[0] https://en.m.wikipedia.org/wiki/Artificial_general_intelligence

[1] https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X
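
A toy illustration of the correlation-vs-causation point (made-up data; any
curve fitter, deep or shallow, behaves the same way on it):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    z = rng.normal(size=n)             # hidden confounder
    x = z + 0.1 * rng.normal(size=n)   # x is driven by z and has no effect on y
    y = z + 0.1 * rng.normal(size=n)   # y is also driven by z

    # A purely correlational fit happily "explains" y with x anyway.
    slope = np.polyfit(x, y, 1)[0]
    print(f"fitted slope of y on x: {slope:.2f}")   # ~0.99, even though x does not cause y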

~~~
PeterisP
The limitation here is not necessarily in the algorithms but in the data.

There's no reason to suppose that it's even possible to recognize causality
and separate it from correlation purely from static observations. Among
others, Judea Pearl has written a great deal about this; this year's ACL best
theme paper (Bender & Koller, "Climbing towards NLU: On Meaning, Form, and
Understanding in the Age of Data") was on a very similar topic; and IIRC
there's neuroscience research suggesting that mammals can't functionally
"learn to see" if they only receive visual input that is not caused by their
own actions. It's highly likely that _no_ algorithm or mind can learn what you
propose solely from the data (and the type of data) that was available to
train GPT.

It seems that recognizing causality requires intervention, not just
observation; it requires active agents, not just passive spectators. It
requires "play" and experimentation. Learning language meaning requires at
least some grounding and alignment with reality. The problems of causality
illustrate the limits of models that are not integrated with an environment,
and the necessity of such integration; they do not indicate any fundamental
limitation of the model structure itself. Perhaps deep learning isn't
sufficient for learning these structures, and perhaps it is; as far as I know
we don't have any good evidence to justify assuming one way or the other.
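
A toy sketch of why intervention matters (hypothetical data; the point is only
that the observational sample alone cannot distinguish the two cases):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    z = rng.normal(size=n)                       # hidden confounder

    # Passive observation: x and y are both driven by z, so they co-vary strongly.
    x_obs = z + 0.1 * rng.normal(size=n)
    y_obs = z + 0.1 * rng.normal(size=n)
    print(round(np.polyfit(x_obs, y_obs, 1)[0], 2))   # ~0.99

    # Intervention: an agent sets x itself (do(x)), cutting the link to z.
    x_do = rng.normal(size=n)                    # externally imposed values
    y_do = z + 0.1 * rng.normal(size=n)          # y does not respond, because x never caused it
    print(round(np.polyfit(x_do, y_do, 1)[0], 2))     # ~0.0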

On the other hand, if/when some artificial agent has learned the core concepts
of causality and language grounding in some small environment - using whatever
algorithms are required for that - then it seems plausible that a GPT-like
system could be sufficient to provide the mapping to extend these concepts to
the whole range of things that we talk about; once the agent _understands_
(according to whatever reasonable nontrivial definition of "understands") the
concept of causality for _some_ actions, it can learn the causality of all the
other actions described across all the written text in the world.

~~~
libraryofbabel
So goes the old teachers’ proverb: “Tell me and I forget, show me and I may
remember, involve me and I understand.”

------
whatever1
The current state of the field is reminiscent of discrete mathematical
optimization in the late nineties. Computational power was increasing rapidly
back then, but our ability to solve problems with more than a hundred binary
variables had barely improved at all. It was only after new theoretical
results in integer programming found their way into the solution algorithms
that we saw a stepwise increase in performance.
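
A back-of-the-envelope illustration of why raw cycles alone didn't help (toy
numbers): the naive search space doubles with every added binary variable.

    from itertools import product

    # Brute force on a tiny 0-1 knapsack; feasible only because n is tiny.
    values, weights, capacity = [10, 13, 7, 8, 9], [3, 4, 2, 3, 4], 8
    best = max(
        sum(v * b for v, b in zip(values, bits))
        for bits in product((0, 1), repeat=len(values))
        if sum(w * b for w, b in zip(weights, bits)) <= capacity
    )
    print(best)              # 25

    # The same enumeration over 100 binaries has 2**100 candidates:
    print(f"{2**100:.2e}")   # ~1.27e+30 -- no hardware generation brute-forces that;
                             # branch-and-bound plus cutting planes is what made it tractable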

~~~
xenonite
Interesting. Could you point me to some history article about that?

~~~
whatever1
https://www.math.uni-bielefeld.de/documenta/vol-ismp/25_bixby-robert.pdf

The story is narrated by Bixby, the co-founder of CPLEX and Gurobi, the two
best available commercial mathematical optimization solvers. The magic
happened in CPLEX V6.5 back in 1999, with an almost 10x year-over-year
performance improvement.

------
xiphias2
"This article reports on the computational demands of Deep Learning
applications in five prominent application areas and shows that progress in
all five is strongly reliant on increases in computing power."

I don't agree with the conclusion of the paper. Computing architectures have
been improving dramatically over the last few years, and almost any task that
was achievable with deep learning 5 years ago is orders of magnitude cheaper
to train now.

The energy consumed by deep learning is increasing because of the huge ROI for
companies, but that growth will probably slow down as compute costs approach
the cost of software engineers (or a company's profit), because at that point
researching improvements to the models becomes relatively cheaper again.

------
256lie
I wonder how long we can continue overfitting these benchmark datasets as a
community of researchers. How much of ImageNet is labeled
incorrectly/suboptimally?

~~~
oldgradstudent
As long as there are PhD students in need of a dissertation.

------
srg0
What's disheartening is that all the charts use a log scale for "Computation
(Hardware Burden)". These days it feels like any non-trivial achievement in
the field of ML requires the resources of state actors. Maybe that's a sign of
ML becoming a mature field like physics, where collective efforts like CERN
and ITER are required and all the low-hanging fruit has been picked. But it
also contrasts with how mathematics and statistics research has traditionally
been done. In those fields a paper usually has only one to three authors, and
one person can make a difference. In ML, that's probably not the case anymore.

------
freeone3000
Will it? Or, like capital-intensive industries of the past, will deep learning
funnel its profits into bigger and bigger computers, as has been done in the
past and will be done again?

~~~
rabidrat
These are order-of-magnitude increases. If it costs $5m to train GPT-3, which
is 100x more compute than GPT-2, then it may cost $500m to train GPT-4, and
$50b to train GPT-5. This is what is meant by economically (not to mention
environmentally) unsustainable.
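
Spelled out (taking the rough $5m figure and the ~100x-per-generation compute
growth above at face value):

    cost = 5e6                             # rough GPT-3 training cost, per the figure above
    for model in ("GPT-4", "GPT-5"):
        cost *= 100                        # assume compute (and cost) grows ~100x per generation
        print(f"{model}: ~${cost:,.0f}")   # GPT-4: ~$500,000,000   GPT-5: ~$50,000,000,000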

~~~
freeone3000
Environment aside (I don't even think the CURRENT rate of training is
environmentally safe) -- I don't see a 100x increase in cost per step
inherently holding anybody back. Once it's trained, you can run it much more
cheaply. Who's to say $50 billion of value can't be extracted from GPT-5?

~~~
saddlerustle
Using electricity is not inherently damaging to the environment. Very low cost
and high power zero-carbon generation sources exist (hydro and nuclear). For
scale also keep in mind that all the datacenters in the world still use much
less power than is used for smelting aluminium.

~~~
jdietrich
Also, processors turn electricity into heat - if you can find a useful way of
harnessing that heat, the computation is practically free. Deep learning is
fairly latency-insensitive and almost never mission-critical, so it's not
inconceivable that the boiler rooms of apartment buildings and office blocks
could become mini-DCs.

~~~
p1esk
Also, once there are billions in value to be extracted, specialized chips will
appear just to train those models (e.g. wafer-scale analog processors, which
could easily be 3 orders of magnitude more efficient than GPUs/TPUs).

------
mathraki
More computation requires more data. That's the real problem. Processors will
get faster, but for some applications we'll never have more data. So yes, new
approaches are needed.

------
feral
There's a big difficulty difference between Engineering, and Reverse
Engineering.

If you want to build a supersonic fighter, it's a lot easier if you can look
at how someone else has done it. This focuses your search on certain parts of
the possibility space, saving tons of time. It gives you proof that certain
techniques can in fact work. This is why the proof-of-concept, or technology
demonstrator, is such an important milestone. It's why countries do so much
technology espionage.

Maybe it takes us a huge amount of computation to get the first NN that achieves a given
performance level on a task we care about.

But then we have an artefact we can reverse engineer. At a minimum, we can
compress it, sure. But ultimately we can do science on it to find out how it's
doing what it's doing, to look for the learned architectures within.
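
On the "compress it" step, a minimal sketch of one common approach, knowledge
distillation, in PyTorch-style Python; the temperature is arbitrary and this
is just one of several ways to shrink a trained network:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=4.0):
        """Train a small 'student' to mimic a large, already-trained 'teacher' network."""
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)     # softened teacher distribution
        log_student = F.log_softmax(student_logits / t, dim=-1)
        # KL divergence between the two, scaled by t^2 as in Hinton et al.'s distillation paper
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)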

It's going to be a lot easier to do this on a deep learning system, than a
biological one.

Maybe we wouldn't do this while it's cheap to just scale compute, but
eventually we will. And the new understanding we get will probably enable us
to build new generations that are even more efficient.

This is really just a gut take, but it's an argument for optimism here.

------
one_electron
A complementary reading piece is Sutton's "The Bitter Lesson", which more or
less argues that AI methods/techniques that "leverage computation are
ultimately the most effective, and by a large margin."

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

------
BMSmnqXAE4yfe1
That also means only few monopolistic behemoths can do state of the art AI.
Good luck for an independent researcher to pull GPT-3.

~~~
logicchains
What does OpenAI have a monopoly in, exactly?

~~~
BMSmnqXAE4yfe1
Currently, a monopoly on text generation. And they are not "open" anymore,
being a commercial venture closely associated with Microsoft.

------
dwighttk
Time to ask Deep Learning to accelerate Moore’s Law

