
The post-exponential era of AI and Moore’s Law - ChuckMcM
https://techcrunch.com/2019/11/10/the-post-exponential-era-of-ai-and-moores-law/
======
m3at
The "Bitter Lesson" post from Rich Sutton from earlier this year [1] seems a
very good complement to this article: he explain how all of the big
improvements in the field came from new methods that leveraged the much larger
compute available from Moore's law, instead of progressive buildup over
existing methods.

A great quote from McCarthy, also regularly referenced by Sutton, is
"Intelligence is the computational part of the ability to achieve goals",
which (IMO) helps picture the tight link between compute growth and AI.

It's only a few minutes' read; I highly recommend it:

[1]
[http://www.incompleteideas.net/IncIdeas/BitterLesson.html](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

~~~
_bxg1
It could be that the slowing of Moore's Law results in a more diverse array of
specialized computing hardware. After all, the resurgence of ML didn't happen
because x86 got fast enough, but because video games - of all things - funded
the maturation of a whole new category of massively-parallel chips.

~~~
codesushi42
_the resurgence of ML ... because video games - of all things - funded the
maturation of a whole new category of massively-parallel chips_

Fake news. It has far more to do with the rise of distributed computing than
the existence of GPUs.

~~~
jchw
I don’t really see this; all of the currently popular machine learning
frameworks either support GPU or are oriented around GPU-based execution. For
most developers working on AI it is still the de facto standard.
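
A minimal PyTorch sketch (shapes and names are just illustrative) of the
idiom this describes - code is written device-agnostically, but the GPU is
the assumed default when one is present:

    import torch

    # Pick the GPU when present, fall back to CPU otherwise.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(784, 10).to(device)  # move parameters to the device
    x = torch.randn(32, 784, device=device)      # allocate inputs on the device
    logits = model(x)                            # runs on the GPU if one was found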

~~~
sgt101
I think... both. Distributed computing (Hadoop) was why a lot of data got
collected and made available(ish).

GPUs are the engines that made CNNs (in particular) tractable, opened up a
bunch of applications for many companies, and opened up a reasonable route
to results for a generation of researchers.

~~~
unkulunkulu
Can anybody elaborate on why this is downvoted? This would be my guess as
well: the SIMD parallelism of GPUs solves only part of the challenge; you
still need a general-purpose data-crunching machine to prepare and handle
the training data.

~~~
fspeech
For one, GPU speedup over CPU isn’t that dramatic for the small to medium
sized problems, e.g. MNIST or CIFAR, that one would try algorithm ideas on.
So I think it’s a stretch to see GPUs as essential to the new algorithms. On
the other hand, for large problems like the original AlphaGo you need to
figure out the distributed computing to really scale.

This isn’t to say that GPUs aren’t nice. They do save time or for the same
amount of time let you produce more polished results, which means in a
competitive environment everyone would use them.
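
A rough benchmark sketch (assuming PyTorch and a CUDA GPU are available;
sizes are illustrative) of the point above - on MNIST-scale tensors the GPU
advantage is modest because launch and transfer overheads dominate tiny
workloads:

    import time
    import torch

    def time_matmul(device, n=128, iters=1000):
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            c = a @ b
        if device.type == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start

    cpu_t = time_matmul(torch.device("cpu"))
    if torch.cuda.is_available():
        gpu_t = time_matmul(torch.device("cuda"))
        print(f"CPU {cpu_t:.4f}s vs GPU {gpu_t:.4f}s for 128x128 matmuls")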

~~~
codesushi42
Exactly. GPUs are necessary _now_, but did not originally herald the deep
learning revolution.

------
ChuckMcM
The article discusses one of the impacts of compute performance growth rates
slowing while compute demand (training AI models) is growing exponentially.

The 'end of Moore's law' (as a measure of performance, not density) is
probably the most significant "Tech" story of the next decade. Why? Because it
is going to demand engineers who can write fast code over engineers who write
code fast.

A lot of frameworks and abstractions will be ripped out and thrown away as
'taking too much time for not enough benefit.'

~~~
cft
Rust will fare well. Ruby won't.

~~~
fouc
Ruby isn't in the critical path, so that's a bad example.

In general, the real issue is the layers of abstraction between the hardware
and the end user: the overcomplicated architectures, the constant
re-invention of databases/operating systems/virtual machines at each layer.

~~~
clarry
> The constant re-invention of databases/operating systems/virtual machines

I wish that were the case, but operating systems and systems software in
general seem to be the most stagnant field. Everyone just buys the same FLOSS
stack for $0 and compatibility is king, so there's very little research going
on in this field, and much less of it ends up in any product you're ever
going to see in use.

I would _love_ to see a vibrant scene of competing operating systems and
databases with fresh new ideas and research.

~~~
fouc
I meant that it is re-implemented at every layer. A browser, for example, is
practically an OS on its own.

------
Symmetry
For a long time we had a situation where transistors got smaller and cheaper
and faster and more power efficient all at the same time, through the wonders
of Dennard scaling, and we gestured broadly at the whole thing and called it
all "Moore's Law" without needing to distinguish which exponential
improvement the term referred to. But in the mid-2000s Dennard scaling broke
down, and now it looks like transistors are still getting smaller and cheaper
and more power efficient, but they aren't getting exponentially faster any
more. So we've mostly settled on the "smaller" bit as being the true "Moore's
Law", and that's more or less kept going, but it might be running out of
steam. It's still very nice from the perspective of highly parallel tasks,
but as consumers we don't see as much benefit from it in our computers except
in graphics. Sooner or later Moore's Law will run out too: transistors can
only get so small as long as they're made of atoms.

Luckily the laws of physics set precise limits on just how efficient
computation can be (Landauer's principle [1]), limits that are just as
binding as the laws governing how efficient a heat engine can be. Progress in
engine efficiency did indeed form a sigmoid converging on what's allowed by
Carnot. But we're still quite a distance away from Landauer's limit, and if
transistors can't get us there, we have every reason to believe that other
computational substrates are possible and can.
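
A back-of-envelope calculation of that headroom (the femtojoule figure for
present-day CMOS is an assumed ballpark, only there for scale):

    import math

    k_B = 1.380649e-23  # Boltzmann constant, J/K
    T = 300.0           # room temperature, K

    landauer = k_B * T * math.log(2)  # minimum energy to erase one bit
    cmos_estimate = 1e-15             # assumed ~1 fJ per bit operation today

    print(f"Landauer limit: {landauer:.3e} J/bit")        # ~2.87e-21 J
    print(f"Headroom: ~{cmos_estimate / landauer:.0e}x")  # several orders of magnitude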

In the meantime we might be looking at an interregnum. But while AI is
gobbling up computational cycles because they're available, there's no reason
to think that efficiency gains aren't possible - just look at the orders of
magnitude improvement in training resources from AlphaGo to AlphaZero. I don't
see that this should stop progress, though it might slow it.

[1][https://en.wikipedia.org/wiki/Landauer%27s_principle](https://en.wikipedia.org/wiki/Landauer%27s_principle)

~~~
rrss
Transistors haven't really been getting smaller for the last few nodes;
they've mostly just been getting denser.

AFAIK nobody currently knows how to make FinFETs much smaller without making
them awful.

------
sprash
I predict there will be much more assembly programming required in the future
to squeeze out as much performance as possible, because the end of Moore's
Law is already very apparent for several applications that cannot be easily
parallelized. It is a complete myth that you "can't beat the C compiler", as
is claimed so often. The compiler can't know many things you know about the
problem at hand. So far I have been able to beat gcc on a regular basis while
relying only on baseline x86_64 (which means using no instructions/registers
beyond SSE2).

Right now I'm looking for a C/Shader-like language for x86_64/Linux which
allows you to be much closer to the metal without requiring you to go down to
the cumbersome level of ASM syntax, while at the same time shedding libc and
using syscalls instead (e.g. if you have specific knowledge about the data
structures you want to map, you can use the heap much more efficiently than
with plain old malloc). So far I have found nothing.

~~~
Shorel
"Right now I'm looking for a C/Shader-like language for x86_64/Linux"

You seem like the ideal candidate for writing such a language. And the book
about it.

------
ilaksh
General purpose AI is not waiting for more compute power. It's waiting for
algorithms that can do it.

I'm not 100% sure deep learning can get all the way there. We might need a new
approach. But the deep learning leaders are taking it very seriously and
making some progress.

For example, Yann LeCun is talking about self-supervised learning of models
of the world. MILA (Bengio's group) is talking about "state representation
learning, or the ability to capture latent generative factors of an
environment". Hinton now has capsule networks. In my opinion these types of
approaches are very promising for general purpose AI.
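
A toy sketch of the self-supervised idea (not LeCun's actual models; shapes
and data are placeholders): the labels come from the data itself, here by
training a network to reconstruct the masked-out features of each input:

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 64)          # stand-in for unlabeled data
        mask = torch.rand(32, 64) > 0.5  # hide roughly half the features
        pred = net(x * mask)             # network sees only the visible part
        loss = loss_fn(pred[~mask], x[~mask])  # score only the hidden part
        opt.zero_grad()
        loss.backward()
        opt.step()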

~~~
jononor
Yeah, many ML applications are bottlenecked on the availability of labels,
not by compute - especially once you move outside very well-defined and
established tasks. I think self-supervised learning has large potential, at
least in expanding ML applications towards "more general".

------
skohan
I predict that the slowing of Moore's law is going to make HPC and
optimization in general a much more valuable skillset in the next few decades.
In the past 15 years or so, we've been more or less happy to treat CPU cycles
as a limitless resource, and as a result modern software stacks have a lot of
fat in them. At the end of the era of free speed increases, trimming the fat
is going to be a lot more important.

~~~
streetcat1
CPU cycles are limitless. Most of the time the CPU is waiting for the memory
system.

~~~
dragontamer
We're talking about 16-bit floats, or even INT8 or INT4, as the basis of
neural nets these days, pushing the bottleneck back to CPU cycles.

CNNs in particular recycle the same set of weights over-and-over again,
fitting inside of the tiny caches (or shared-memory in GPUs), allowing for the
compute-portion of the hardware to really work the data.

> CPU cycles are limitless. Most of the time the CPU is waiting for the memory
> system.

CPUs are out-of-order, deeply pipelined, and speculative, so that they have
work to do even while waiting for the memory system.

The typical CPU has 200+ instructions in flight in parallel these days
(200+-entry reorder buffers and "shadow registers" to support this hidden
parallelism), and that's split between two threads for better efficiency
("Hyperthreading").

GPUs can have 8x warps / wavefronts per SM (NVidia) or CU (AMD) waiting for
memory. If one warp/wavefront (a group of 32 or 64 threads) is waiting for
memory, the GPU will switch to another "ready to run" warp/wavefront.

It takes some programmer effort to understand this process and write high-
performance code. But it's doable with some practice.
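
A minimal sketch of the INT8 weight quantization mentioned above (NumPy, with
an illustrative symmetric scheme): weights are stored as 8-bit integers plus
one float scale, a 4x shrink versus float32 that helps them fit in cache:

    import numpy as np

    w = np.random.randn(256, 256).astype(np.float32)  # float32 weights

    scale = np.abs(w).max() / 127.0  # map the weight range onto [-127, 127]
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

    w_restored = w_q.astype(np.float32) * scale  # dequantize before use
    print("max abs error:", np.abs(w - w_restored).max())
    print("bytes:", w.nbytes, "->", w_q.nbytes)  # 262144 -> 65536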

~~~
streetcat1
So I assume that we are talking about Moore's law for CPU/GPU and not for
memory.

The bottleneck for GPU today is the amount of memory on board and not compute.
Especially with large size models.

In addition, the problem with deep learning in general is the sequential
nature of the algorithm: it is parallel within a layer, but not between
layers. And for multi-GPU setups, again, it is the communication link between
the GPUs.

So I think the nature of the current state-of-the-art optimization
algorithms is what matters.

~~~
dragontamer
> The bottleneck for GPU today is the amount of memory on board and not
> compute. Especially with large size models.

Well... everything "depends on the model". Some models will be GPU-compute
limited, others will be bandwidth-limited, and others will be capacity
limited.

> In addition, the problem with deep learning in general is the sequential
> nature of the algorithm: it is parallel within a layer, but not between
> layers. And for multi-GPU setups, again, it is the communication link
> between the GPUs.

You should increase the size of the model to increase parallelism. Any
problem has sequential bits and parallel bits, so we can't eliminate the
sequential nature of problems.

But the idea is to think NOT in terms of Amdahl's law, but instead in terms
of Gustafson's law. When you have a computer that's twice as parallel, you
double the work done.

You can't reasonably expect things to get faster-and-faster (Amdahl's law).
Instead, you load up more-and-more work as the machines get wider-and-wider.

In the case of neural nets, instead of doing 128x128x5 sized kernels, you
upgrade to 256x256x5 sized kernels as GPUs get wider.
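
A small sketch contrasting the two laws (p is the parallel fraction of the
work, n the number of processors):

    def amdahl(p, n):
        # Fixed problem size: speedup saturates at 1 / (1 - p).
        return 1.0 / ((1.0 - p) + p / n)

    def gustafson(p, n):
        # Problem grows with the machine: scaled speedup keeps climbing.
        return (1.0 - p) + p * n

    for n in (2, 16, 1024):
        print(n, round(amdahl(0.95, n), 1), round(gustafson(0.95, n), 1))
    # Amdahl caps below 20x even at n=1024; Gustafson reaches ~973x.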

Moore's law was never about the "speed of computers", but about "the number
of transistors". As such, it's Gustafson's law that best scales with Moore's
law. We've got another 5 to 10 years left in Moore's law, by the way: denser
memory, denser GPU compute, 5nm-class chips (probably networked as a bunch
of chiplets).

We will have more CPU / GPU power, as well as more RAM density, in the future.
The question is how to organize our code so that we'll be ready 5 years from
now for the next-gen of hardware.

~~~
streetcat1
Thank you for your insights. Good info.

------
fyp
Since AI is heavily parallelizable, all that matters is that cost (as in
dollars) keeps exponentially decreasing.

It doesn't matter if you can't double the transistor density of a single CPU
if you can just double the number of machines. At the end of the day you
still managed to double performance for the same price.

See
[https://en.wikipedia.org/wiki/FLOPS#Hardware_costs](https://en.wikipedia.org/wiki/FLOPS#Hardware_costs)
(note: in another thread someone noted that this wiki is outdated/inaccurate.
If anyone has the relevant expertise, they should help edit it)

~~~
4NDR10D
Hardware acceleration/parallelization is the next frontier. We've already
seen the benefit of some pretty simple ASICs (the TPU was built to be simple)
as well as more general-purpose accelerators. Hardware architects used to
have a hard time, because often the best option was simply to wait for CPUs
to get faster. Now that CPU performance has begun to stall, it makes economic
sense to invest not only in more parallel software but also in more
application-specific accelerators.

CPUs/GPUs are beasts of hardware architecture, being complex mostly due to
their flexibility. We can achieve higher performance with dedicated hardware
(or FPGAs), and it looks like the economic reasons to do so are slowly
becoming more certain.

------
jacobcammack
I don’t fundamentally agree with this article. The details are correct, but
it misses the forest for the trees. There are too many elements at work,
evolutionarily speaking, to ignore the emergence of God knows what. Plus our
AI (not to mention our most basic computing axioms) is absolutely juvenile,
as we are brand new as a species to developing anything at all that has to
do with computing. That doubly goes for “AI”... whatever that means.

------
lachlan-sneff
The topic is quite controversial, but there is a path forward. There are
several theoretical computing technologies that could get very close to the
theoretical maximum allowed by physics (as well as being reversible), but we
can't build them yet because nanofactories/molecular
assemblers/whatever-you-call-them don't exist yet.

------
narrator
It'd be really weird if we could keep extending Moore's law and surpass the
brain on a performance-per-watt basis.

~~~
hyko
Why?

------
shusson
> The takeaway is that, even if we assume great efficiency breakthroughs and
> performance improvements to reduce the rate of doubling, AI progress seems
> to be increasingly compute-limited at a time when our collective growth in
> computing power is beginning to falter

A similar conclusion can be drawn for genomics too.

------
e_carra
Modern hardware has a lot of room to improve: memories and transmission
lines have been the main bottlenecks for more than a decade now, and bigger
caches helped but are not enough. Solutions are being developed, like
on-circuit optical-fiber transmission lines, but it takes time.

------
vagab0nd
> A couple of years ago I was talking to the CEO of an AI company who argued
> that AI progress was basically an S-curve, and we had already reached its
> top for sound processing, were nearing it for image and video, but were only
> halfway up the curve for text. No prize for guessing which one his company
> specialized in — but he seems to have been entirely correct.

This feels a little off. To me, it feels like image has made the most
progress, then text, then sound, then video.

Does anyone know which company he was referring to?

------
CRUDite
Moore's law may be dead, but it is still mind-boggling to project it
forwards. Seth Lloyd, in his paper [1] on the limits of computation, mentions
that in 250 years computational density would equal that of a
(kilogram-sized) black hole.

[1]
[https://cds.cern.ch/record/396654/files/9908043.pdf](https://cds.cern.ch/record/396654/files/9908043.pdf)

------
hyperpallium
I am optimistic that Moore's Law will eventually recover, with a new
technology, perhaps silicon-based, perhaps not. Information processing is not
intrinsically limited by silicon - for example, mammalian brains are more
powerful.

But that's long-term, big-picture. Technologies can remain stagnant longer
than you can remain alive.

~~~
giacaglia
It's funny that the article notes that processors have not increased their
speed, and concludes that Moore's law is therefore dead. In fact, as many
people know, Intel has been struggling with its 10nm process, and now, with
a new CEO, this might change. All the other processor manufacturers are
catching up, and some are moving ahead of Intel. That's the whole reason
Apple is trying to move away from x86 with their Macs.

------
_bxg1
Honestly, this is a bit of a relief. All of the AI nightmare scenarios (be
they the Terminator kind, or the more realistic hyper-empowerment-of-a-few-
elites kind) rely on that exponential growth continuing unimpeded. If there's
no exponential growth, there's no runaway AI that swiftly outpaces human
understanding.

~~~
xyproto
Yet AI has beaten humans at every strategic game out there, from Chess and
Go to StarCraft II. We don't have general AI, but in some fields human
understanding has already been outpaced.

~~~
tsimionescu
Specialized fields like games aren't necessarily a problem - computers have
been beating us at arithmetic for decades, and it isn't exactly world
shattering.

Also, a minor nitpick: AI has not succeeded at beating the best players at
SC2, though it did beat the vast majority of the ladder.

~~~
ben_w
> Specialized fields like games aren't necessarily a problem - computers have
> been beating us at arithmetic for decades, and it isn't exactly world
> shattering.

I would argue that it is, because that’s the driver for industrial automation.

------
orasis
One factor: the computational cost of training any extremely expensive model
is amortized across all uses of that model.

------
sdoken123
Could it be that with Quantum Computing computational power will continue to
increase exponentially?

~~~
npo9
Is Quantum Computing computational power increasing exponentially now?

~~~
buzzkillington
Yes, we have gone from being able to factor the number 1 to being able to
factor the number 2 in the last 20 years.

------
hyperpallium
Why can't we just have bigger dies? Same density, more transistors. If yields
are too low, connect smaller dies somehow.

~~~
clarry
If you stop scaling node size down and start scaling die size up, the result
will be higher prices and power consumption. If you don't mind, go buy a
Threadripper or EPYC. Might consider getting multiple sockets while at it.
Caveat: won't fit in your pocket.

I'm sure there'll be additional interconnect issues to worry about as the
physical cluster of dies gets larger and larger.
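
A back-of-envelope Poisson yield model (the defect density and die sizes are
assumed, just for scale) of why bigger dies get expensive fast:

    import math

    def poisson_yield(defects_per_cm2, area_cm2):
        # Probability a die has zero defects: yield ~ exp(-D * A).
        return math.exp(-defects_per_cm2 * area_cm2)

    D = 0.1  # assumed defects per cm^2
    for area_cm2 in (1.0, 4.0, 8.0):
        print(f"{area_cm2 * 100:.0f} mm^2 die: yield {poisson_yield(D, area_cm2):.0%}")
    # 100 mm^2 -> ~90%, 800 mm^2 -> ~45%: the economics behind chiplets.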

------
boyadjian
Maybe it is a good thing, as the singularity would then be avoided. There
are so many wrong usages of AI.

~~~
mikorym
Mathematician here: I don't think the singularity as popularised by people
like Ray Kurzweil will happen.

In particular, the metaphor of an "infinity" point seems to me like an
incorrect application of mathematical ideas to social contexts.

I think the question of artificial general intelligence is a different
question than a singularity, much like how the Chinese room addresses a
different question than the Turing test.

~~~
TomMarius
Why is it a different question? AFAIK the assumption is that if the computer
can improve itself (which it can if it has IQ 100), there is nothing holding
the singularity back.

~~~
TheOtherHobbes
Which is dead wrong, because not only is there no usable definition of
"improve itself", but there isn't even any understanding of the kinds of
skills required to create a usable definition.

It's the difference between a computer that is taught how to compose okay-ish
music, and a computer that learns spontaneously how to compose really really
great music _and_ do all of the social, cultural, and financial things
required to create a career for itself as a notable composer _and then_ does
something entirely new and surprising given that starting point.

They're completely different problem classes, operating on completely
different levels of sophistication and insight.

A lot of "real" AI problems are cultural, social, psychological, and semantic,
and are going to need entirely new forms of meta-computation.

You're not going to get there with any current form of ML, no matter how fast
it runs, because no current form of ML can represent the problems that need to
be solved to operate in those domains - never mind spontaneously generate
effective solutions for those problems.

~~~
TomMarius
> Which is dead wrong, because not only is there no usable definition of
> "improve itself", but there isn't even any understanding of the kinds of
> skills required to create a usable definition.

I disagree. A program improves itself when it reacts to a problem and
implements a solution. Obviously that is very general, but enough. A human of
IQ 100 certainly can develop software; a program of IQ 100 should be able to
do the same, and then you scale horizontally.

~~~
thfuran
Have you taken an IQ test? They only test a few classes of problem.
Performance on these _for a human_ is deemed a workable proxy for
intelligence, but for something that approaches the problems very
differently, it may not be at all indicative of general intelligence. I
think we have probably already reached, or are near, the point where we
could train systems to achieve human-level performance on each of those
basic tasks. We are, however, seemingly not near a humanlike AGI.

------
jeremydeanlakey
I wonder: does compute power per dollar have to level off so soon?

------
jobseeker990
We could make AI more efficient by switching to analog. Otherwise we're
encoding the real world into digital and then back into analog again.

