
The End of Moore’s Law and Faster General Purpose Computing, and a Road Forward [pdf] - banjo_milkman
https://p4.org/assets/P4WS_2019/Speaker_Slides/9_2.05pm_John_Hennessey.pdf
======
AtlasBarfed
We've built up layer upon layer of inefficiency in the entire OS and software
stack since the gigahertz wars took us from 66 MHz to multiple GHz in the
'90s.

The software industry is awful at conserving code and approaches: every five
years (or more often, in JavaScript's case) there's a total redo of
programming languages and frameworks.

That churn also means optimization from the hardware up through program
execution doesn't happen. Instead we plow through layer upon layer of both
conceptual abstraction and actual software execution barriers.

Also, why the hell aren't standard libraries more ... standardized? I get that
lots of languages differ in mechanics and syntax, but a common library set
could be optimized repeatedly behind the interface, optimized at the
hardware/software boundary, etc.

Why haven't Ruby, Python, JavaScript, C#, Java, Rust, C++, etc. evolved toward
an efficient common underpinning and design? Linux, Windows, Android, and iOS
need to converge on this too. It would mean less wasted space in memory, less
wasted OS complexity, less wasted app complexity and size. I guess
ARM/Intel/AMD would also need to get in the game to optimize down to the chip
level.

Maybe that's what he means by "DSLs", but to me DSLs are an order of magnitude
more complex in infrastructure and coordination if we're talking about
dedicated hardware for dedicated processing tasks while still retaining
general-purpose capability. DSLs just seem to constrain too much freedom.

~~~
wayoutthere
Correct me if I'm wrong, but isn't this exactly the problem LLVM was designed
to tackle?

~~~
jcranmer
If you're targeting non-CPU designs--such as GPUs, FPGAs, systolic arrays,
TPUs, etc.--it is very much the case that you have to write your original
source code differently to be able to get good speedups on those accelerators.
It has long been known in the HPC community that "performance portability" is
an unattainable goal, not that that stops marketing departments from claiming
they've achieved it.

LLVM/Clang makes it much easier to bootstrap support for a new architecture,
and to add the necessary architecture-specific intrinsics for your new
architecture, but it doesn't really make it possible to make architecture-
agnostic code work well on weirder architectures.

~~~
sifar
True. If you want performance, you have to rewrite the code for the new
architecture; otherwise it is pointless to develop the new core.

The problem with developing a good processor architecture is that you have to
_always_ maintain legacy compatibility without sacrificing performance -
because, you know, software.

This leaves extra layers of hardware lying around for legacy code with each
passing generation of the processor.

------
omarhaneef
For those who have not looked yet: a John Hennessy presentation. It argues --
with a lot of detail -- that Moore's law has run its course, that energy
efficiency is the next key metric, and that specialized hardware (like the
TPU) might be the future.

When I buy a machine, I am now perfectly happy buying an old CPU, and I think
this shows you why. You can buy something from as far back as 2012, and you're
okay.

However, I do look for fast memory: SSDs at least, and I wish he had added a
slide about the trend in memory speed. Am I at the inflection point there?

Perhaps the future is: you buy an old laptop with specs like today and then
you buy one additional piece of hardware (TPU, ASIC, Graphics for gaming etc).

~~~
gambler
_> energy efficiency is the next key metric, and that specialized hardware
(like TPU) might be the future._

This is nonsense pushed forward by large corporations who want to own all your
data and computational capacity.

~~~
johnfactorial
Admittedly, knowing little about how multi-core CPUs work, I've always thought
the next breakthrough in CPU tech would be a hardware-based scheduler: a chip
that effectively load-balances threads to ensure all cores are used equally
and simplifies the writing of software. The dev writes thread-safe code and
the hardware does the rest. I wonder how feasible that really is.
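
For what it's worth, OS schedulers and runtimes already do this balancing in
software. A minimal Python sketch of that software version (names and numbers
are mine), which is roughly what the hardware scheduler above would absorb:

    # A pool of workers drains a shared queue of independent tasks,
    # keeping all cores busy -- software load balancing of the kind
    # the parent comment imagines a hardware scheduler doing.
    from concurrent.futures import ProcessPoolExecutor
    import os

    def task(n: int) -> int:
        # Stand-in for an independent, thread-safe unit of work.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            results = list(pool.map(task, range(1000)))
        print(f"completed {len(results)} tasks")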

~~~
Mvhsz
That sounds perfectly reasonable to me, but John Hennessy literally wrote the
book on computer architecture, and towards the end of the deck he has a slide
saying we shouldn't expect large gains from improved architecture (on general-
purpose chips) in the future. I'm inclined to believe him, although I would be
interested in hearing a deeper argument for or against the architecture you
proposed.

------
banjo_milkman
This ties in nicely with chiplets: [https://semiengineering.com/the-chiplet-race-begins/](https://semiengineering.com/the-chiplet-race-begins/) - a way
to integrate dies in a package, where the dies can use specialized processes
for different functions - e.g. analog or digital or memory or accelerators or
CPUs or networking etc. This would make it easier to iterate
memory/CPU/GPU/FPGA/accelerator designs at different rates, and reduce
development costs (don't need to support/have IP for every function, just an
accelerated set of operations on an optimized process within each chiplet).
But it will need progress on inter-chiplet PHY/interface standardization.

------
deepnotderp
So yes, if you compare a matrix multiply in pure Python against SIMD
instructions, you will find a big improvement. It's much harder to do that for
more general-purpose workloads.
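
For reference, a minimal sketch of that comparison (assumes numpy is
installed; exact numbers vary by machine):

    # Naive pure-Python triple-loop matmul vs. numpy's BLAS-backed
    # (SIMD-vectorized) multiply. Expect a speedup of several orders
    # of magnitude on typical hardware.
    import time
    import numpy as np

    n = 128
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]

    t0 = time.perf_counter()
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    t_naive = time.perf_counter() - t0

    A = np.ones((n, n))
    B = np.ones((n, n))
    t0 = time.perf_counter()
    C = A @ B
    t_blas = time.perf_counter() - t0

    print(f"naive: {t_naive:.3f}s  numpy: {t_blas:.5f}s  "
          f"speedup: {t_naive / t_blas:.0f}x")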

And it doesn't scale:
[https://spectrum.ieee.org/nanoclast/semiconductors/processor...](https://spectrum.ieee.org/nanoclast/semiconductors/processors/the-accelerator-wall-a-new-problem-for-a-post-moores-law-world)

And in many cases, if you normalize all the metrics (precision, process node,
etc.), you'll find that the advantage of ASICs is greatly exaggerated and is
often within ~2-4x of a more general-purpose processor. E.g. the small GEMM
cores in the Volta GPU actually beat the TPUv2 on a per-chip basis, and Anton
2, normalized for process, is within ~5x of manycore MIMD processors in energy
efficiency.

In other cases, e.g. the marquee example of bitcoin ASICs, that only works
because of extremely low memory and memory bandwidth requirements.

------
prvc
A possibly stupid question from a neophyte: what was the driving force behind
Moore's law when it was in operation? Did it become a self-fulfilling prophecy
by becoming a performance goal after becoming enshrined in folklore, or is
there an underlying physical reason?

~~~
aiCeivi9
[https://en.wikipedia.org/wiki/Transistor_count](https://en.wikipedia.org/wiki/Transistor_count)

The transistor can only get so small before it stops working. There are many
issues with the required extreme ultraviolet light sources (lasers) and with
the allowed amount of impurities in the silicon wafer. And the R&D cost for
each iteration of lithography keeps rising while bringing smaller benefits.

~~~
prvc
Yes, the existence of an upper bound on transistor count follows easily from
the atomic nature of matter. The Wikipedia article on Moore's law lists
multiple disparate "enabling factors" which do not seem to have much to do
with one another. Their conjunction comprises an explanation of sorts, but I'm
wondering whether there's a simple observation or fact that ties them all
together, apart from my sociological theory.

------
sifar
Slide 36 compares the TPU with a CPU/GPU. This is an apples-to-oranges
comparison. One uses an 8-bit integer multiply while the others use a 32-bit
floating-point multiply, which inherently uses at least 4x more energy [1]. If
you scale the TPU's numbers by 4, it is no longer an order of magnitude
better. The proper comparison would be between the TPU and an equivalent DSP
doing 8-bit computations. That would show whether eliminating the energy
consumed by register file accesses is significant. I suspect most of the
energy saving comes from having a huge on-chip memory.

[1] From slide 21:

    Function                Energy (pJ)
    8-bit add               0.03
    32-bit add              0.1
    16-bit FP multiply      1.1
    32-bit FP multiply      3.7
    Register file access    6
    L1 cache access         10
    L2 cache access         20
    L3 cache access         100
    Off-chip DRAM access    1,300-2,600
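
A quick back-of-envelope on those figures (my arithmetic, not from the
slides):

    # Ratios from the slide-21 energy table (pJ per operation).
    fp32_mul = 3.7     # 32-bit FP multiply
    fp16_mul = 1.1     # 16-bit FP multiply
    dram = 1300.0      # off-chip DRAM access (low end)

    # FP32 vs FP16 multiply alone is ~3.4x; an 8-bit integer multiply
    # (not listed) is cheaper still, hence the ">4x" estimate above.
    print(f"FP32/FP16 multiply: {fp32_mul / fp16_mul:.1f}x")

    # Data movement dwarfs arithmetic: one DRAM access costs as much
    # as ~350 FP32 multiplies, which is why a huge on-chip memory
    # matters so much.
    print(f"DRAM access / FP32 multiply: {dram / fp32_mul:.0f}x")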

------
SemiTom
Big chipmakers are turning to architectural improvements such as chiplets,
faster throughput both on-chip and off-chip, and concentrating more work per
operation or cycle in order to ramp up processing speed and efficiency:
[https://semiengineering.com/chiplets-faster-interconnects-an...](https://semiengineering.com/chiplets-faster-interconnects-and-more-efficiency/)

Scaling certainly isn't dead. There will still be chips developed at 5nm and
3nm, primarily because you need to put more and different types of
processors/accelerators and memories on a die. But this isn't just about
scaling of logic and memory for power, performance, and area reasons, as
defined by Moore's Law. The big problem now is that some of the new AI/ML
chips are larger than reticle size, which means you have to stitch multiple
dies together; shrinking allows you to put all of this on a single die. These
are basically massively parallel architectures on a chip.

Scaling provides the means to make this happen, but by itself it is a small
part of the total power/performance improvement. At 3nm, you'd be lucky to get
20% P/P improvement, and even that will require new materials like cobalt and
a new transistor structure like gate-all-around FETs. A lot of these new chips
promise orders-of-magnitude improvements, 100 to 1,000x, and you can't achieve
that with scaling alone. That requires other chips, like HBM memory, with a
high-speed interconnect like an interposer or a bridge, as well as more
efficient/sparser algorithms. So scaling is still important, but not for the
same reasons it used to be.

------
DSingularity
It is not that I disagree with Hennessy, but I think it is premature to
conclude that general-purpose processors have reached the end of the road.
There is a healthy middle ground between specialized and general-purpose
design, and exploiting that middle is what I think will deliver the next
generation of growth. That is exactly what naturally occurred with SoC and
mobile design.

The raw computational capabilities of the TPU don't really prove anything. Of
course co-design wins. Whether it is vision or NLP, NN training has dominant
characteristics: the arithmetic is known (GEMM) and the control is known
(SGD). Tailoring the control and memory hierarchy to this is a no-brainer, the
economic incentives at Google push them in this direction, and the expertise
available at Google powered this success. For other applications it is not so
clear.

Finding similar dominance in other applications is trickier. To accelerate an
application with a specialized architecture, you need dominating
characteristics in the app's memory-access, computational, and control
profiles.

------
yogthos
It's odd that the presentation doesn't discuss alternatives to using silicon.
Ultimately, this is akin to saying that there are limits on how small a vacuum
tube we can make. We already know of a number of other potential computing
platforms such as graphene, photonics, memristors, and so on. These things
have already been discovered, and they have been shown to work in the lab.
It's really just a matter of putting the effort into producing these
technologies at scale.

Another interesting aspect of moving to a more efficient substrate is that the
power requirements of devices would also drop, as per Koomey's law:
[https://en.wikipedia.org/wiki/Koomey%27s_law](https://en.wikipedia.org/wiki/Koomey%27s_law)

~~~
brennanpeterson
Well... no. What it says is that there are limits on how small a wire we can
make, and on how thin a layer can be while the material remains functional
(about 5 nm).

Wires can't get smaller without compromising RC (and thus speed). Quite
horrifically, this is much more of an issue than the transistor.

Graphene and photonics don't help this at all. It isn't a matter of how small
a tube. You physically need 5 nm to insulate and 5 nm for a functional
material. So a 5 nm device with a 5 nm spacer and a 5 nm space to the next
device is about it: the smallest pitch of any physical device is around 20 nm.
The critical pitches on a wafer today are about 30 nm and 40 nm, so in an
ideal world we can gain about 3x, ever. It doesn't matter which material you
choose.

And yeah, you can stack up, but not quite in the way you dream; thermal and
processing issues make this hard in most domains. When I build, I deposit at
high temperatures, which affects the underlying layers, so stacking doesn't
quite work as you might expect. Again, real materials in a real flow behave
differently, and not in a trivially reducible "just make it work" fashion.

Memristors may not really exist, and are mainly useful in the context of
high-speed memory, which has real physical challenges of its own. People have
spent billions over decades on this problem.

Anyway, this is missing some background, but the presentation is great.

~~~
yogthos
We already know you can use individual atoms as transistors [1]. So clearly we
can go a lot smaller than 5 nm here. Obviously there are challenges in scaling
these new substrates up to create useful chips, and in creating the
infrastructure to put them into mass production. My point is that we know this
is possible, and an inflection point will come where investing in these new
substrates becomes more lucrative than trying to squeeze more out of silicon.

[1]
[https://www.sciencedaily.com/releases/2018/08/180816101939.h...](https://www.sciencedaily.com/releases/2018/08/180816101939.htm)

------
dragontamer
"WASTED WORK ON THE INTEL CORE I7", slide#12 (page 13 in pdf) is fascinating
to me. But I want to know how the data was collected, and what the % wasted
work actually means.

40% wasted work, does that mean that they checked the branch-predictor and
found that 40% of the time was spent on (wrongfully) speculated branches?

It also suggests that for all of the power-efficiency faults of branch
predictors (aka: running power-consuming computations when it was
"unnecessary"), the best you could do is maybe a 40% reduction in power
consumption (no task seems to be 40% inefficient).
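
To make the question concrete, here is one toy way such a number could be
defined (all parameters invented for illustration, not from the slides):

    # Toy model: fraction of fetched uops thrown away on wrong-path
    # execution after branch mispredictions.
    mispredict_rate = 0.05   # mispredictions per branch (invented)
    branches_per_uop = 0.2   # ~1 branch per 5 uops (invented)
    wrong_path_uops = 20     # uops fetched before the flush (invented)

    wasted = mispredict_rate * branches_per_uop * wrong_path_uops
    fraction = wasted / (1.0 + wasted)
    print(f"wasted work: {fraction:.0%}")   # ~17% with these numbers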

~~~
vardump
> ... INTEL CORE I7

When someone says Intel i5 or i7, I immediately wonder whether they're talking
about a 2008 i7 or a 2019 model.

Intel would be smart to retire the whole i3/i5/i7/i9 branding. People seem to
think every i5 or i7 is the same.

~~~
dragontamer
> People seem to think every i5 or i7 is the same.

Unfortunately, this is a feature, not a bug. Intel wants their branding to
have this effect... the layperson isn't supposed to understand Sandy Bridge
(i7-2700K) vs Skylake (i7-6700K).

~~~
vardump
So Intel wants laypersons not to realize there's something faster available to
upgrade their x86-based systems to?

Before that era people didn't know much about the details either, but they did
understand 800 MHz was faster than 533 MHz.

------
roenxi
It's still too early to call the end of the march of microprocessors, though.

[https://www.scienceabc.com/humans/the-human-brain-vs-superco...](https://www.scienceabc.com/humans/the-human-brain-vs-supercomputers-which-one-wins.html)

The limits they are running up against are indeed crises, but chipmakers will
probably find that they can copy whatever it is that biology is doing and
squeeze out quite a bit more. The tradeoffs will get a lot weirder, though.

~~~
rrss
Humans are not good at general-purpose computation. Your linked article states
that the brain achieves 1 exaflops, and cites
[http://people.uwplatt.edu/~yangq/csse411/csse411-materials/s...](http://people.uwplatt.edu/~yangq/csse411/csse411-materials/s13/cs_1/brostm_AI-in-commercet.doc)
for this number. That document states the value with no citation or rationale.

I can do far less than 0.0001 single precision floating point operations per
second, so whatever the context for "1 exaflops" is, it isn't general purpose
computation.

EDIT: this seems sort of like saying that throwing a brick through a window
achieves many exaflops because simulating the physics in real time would
require that performance. I'd like to read more about this value and how
someone came up with it, but googling just gives me that same scienceabc
article and stuff referencing it.

~~~
lacker
Nahh, you could do more than 0.0001 floating point operations per second. To
beat that you only need to do a single floating point operation every 10,000
seconds (under three hours), which is quite achievable with paper and pencil
;-)

0.01 floating point operations per second (one every 100 seconds) seems
harder, but is perhaps humanly doable.

~~~
rrss
I'm easily distracted.

------
justicezyx
Amin's keynote is relevant here:
[https://onfconnect2019.sched.com/event/RzZl](https://onfconnect2019.sched.com/event/RzZl)

The basic form of computing is becoming distributed. More is coming.

------
mikewarot
I'm amazed that it takes less than a picojoule to do an 8-bit add.

~~~
scottlocklin
The Landauer limit is around seven orders of magnitude below this, so there's
plenty of room for power savings before we hit any fundamental physical
limits.
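
Back-of-envelope (my arithmetic, using the 0.03 pJ 8-bit add from the slides):

    # Landauer limit at room temperature vs. the slide's 8-bit add.
    import math

    k_B = 1.380649e-23                 # Boltzmann constant, J/K
    T = 300.0                          # room temperature, K
    landauer = k_B * T * math.log(2)   # ~2.9e-21 J per erased bit

    add8 = 0.03e-12                    # 8-bit add from slide 21, J
    print(f"headroom: {add8 / landauer:.1e}x")   # ~1e7 per erased bit
    # An add erases more than one bit, so per-operation headroom is
    # somewhat smaller, but still many orders of magnitude.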

------
singularity2001
So what's the name of the metric flops/sec/USD? Because that one keeps growing
exponentially thanks to GPUs/TPUs, a paradigm shift predicted by Ray Kurzweil.

------
yalogin
Is there a video of this talk available somewhere?

Also, can someone tell me what P4 is? It looks like almost every company and a
bunch of universities are "contributors" there.

~~~
musicale
P4 is a domain-specific language for specifying packet-forwarding pipelines,
i.e. the hardware that takes packets in one port, decodes their headers (e.g.
destination MAC or IP address), munges them somehow (e.g. updating the TTL,
destination MAC, and checksum), and sends them out another port. This enables
you to build all sorts of network devices, from Ethernet switches to IP
routers to RDMA fabrics. You can compile P4 onto a CPU, a smart NIC, an NPU, a
programmable ASIC, an FPGA, etc.

Basically P4 allows you to (re)program your network data plane to do whatever
you want, and you can create new network protocols or change the way existing
ones work without having to change your hardware and without losing line rate
performance.

It's also somewhat like eBPF, but it compiles to hardware as well as software
(including into a pipeline in the Linux kernel).
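
As a rough illustration (a Python sketch of mine, not actual P4, which is a
declarative match-action language), the kind of per-packet transformation a P4
pipeline expresses looks like:

    # Illustrative router-style packet munging: rewrite the destination
    # MAC, decrement the IPv4 TTL, and recompute the header checksum.
    # Offsets assume an untagged Ethernet frame carrying IPv4.

    def ip_checksum(header: bytes) -> bytes:
        # Standard one's-complement sum over 16-bit words.
        s = sum(int.from_bytes(header[i:i + 2], "big")
                for i in range(0, len(header), 2))
        while s > 0xFFFF:
            s = (s & 0xFFFF) + (s >> 16)
        return ((~s) & 0xFFFF).to_bytes(2, "big")

    def forward(packet: bytearray, next_hop_mac: bytes) -> bytearray:
        packet[0:6] = next_hop_mac             # rewrite destination MAC
        ip = 14                                # IPv4 header offset
        if packet[ip + 8] <= 1:
            raise ValueError("drop: TTL expired")
        packet[ip + 8] -= 1                    # decrement TTL
        packet[ip + 10:ip + 12] = b"\x00\x00"  # zero checksum field
        packet[ip + 10:ip + 12] = ip_checksum(bytes(packet[ip:ip + 20]))
        return packet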

------
almost_usual
One of the more interesting things I've read on HN in a while. It seems like
this will result in a large paradigm shift for the computing industry.

~~~
SkyPuncher
I think we've already seen the shift with cellphones.

I think consumer-facing performance processors will fade.

Data centers will continue to push for more performance: it can mean less rack
space, less power consumption, and less to manage.

Cellphone/tablet-focused processors will become powerful enough to handle the
majority of daily tasks while enjoying extended battery life.

------
Accujack
There's an internet meme about "Imminent death of Moore's law predicted."

All Moore's law talks about is the density of transistors on a chip, and it
has never been a linear progression of numbers. Recently I've seen news
articles about research into 5nm processes and other methods for increasing
the density of components on silicon, so it seems Moore's law (really Moore's
rule of thumb, or Moore's casual observation) isn't done yet.

