
45 year CPU evolution – one law and two equations - godelmachine
https://arxiv.org/abs/1803.00254
======
synctext
Paper highlights:

- 22nm or 7nm are mostly marketing labels. "The nodes were first defined according to
the transistor channel length. The last nodes are more defined according to
marketing criteria."

- the memory wall... "The huge difference between CPU and DRAM growth rates
led to the increased complexity of microprocessor memory hierarchies.
Different levels of caches are needed to balance the differences in the
bandwidth and latency needs of the CPU with those of the DRAM main memory."

- Conclusion: "When the end of the predicted end of the exponential evolution
will be real or when non-Von Neumann architectures will prove to be more
efficient for programmable applications, the situation will be totally
different. Until that point, the two equations that have been discussed in
this paper will be there to explain the evolution."

~~~
dragontamer
> - the memory wall... "The huge difference between CPU and DRAM growth rates
> led to the increased complexity of microprocessor memory hierarchies.
> Different levels of caches are needed to balance the differences in the
> bandwidth and latency needs of the CPU with those of the DRAM main memory."

The way to "solve" this is well known. You build buffers between the CPU and
the memory, and then execute as much as possible out-of-order. That's the
purpose of the L1, L2, and L3 caches: to give the out-of-order core enough
data to work with while the CPU is waiting on slow, slow main memory.
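
For illustration, here's a minimal C++ sketch (not a rigorous benchmark; the
array size, RNG seed, and use of Sattolo's algorithm are arbitrary choices)
contrasting a dependent load chain, which serializes at memory latency, with
independent loads that an out-of-order core can overlap:

    // g++ -O2 pointer_chase.cpp
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t N = 1 << 24;  // 16M entries (128 MB), far larger than any L3
        std::vector<size_t> next(N);
        std::iota(next.begin(), next.end(), 0);

        // Sattolo's algorithm: a random single-cycle permutation, so the
        // chase below visits every element once before returning to 0.
        std::mt19937_64 rng{42};
        for (size_t i = N - 1; i > 0; --i)
            std::swap(next[i], next[rng() % i]);

        // Dependent chain: the next address comes out of the previous load,
        // so the core cannot issue the next cache miss until this one returns.
        auto t0 = std::chrono::steady_clock::now();
        size_t p = 0;
        for (size_t i = 0; i < N; ++i) p = next[p];
        auto t1 = std::chrono::steady_clock::now();

        // Independent loads: addresses are known up front, so the out-of-order
        // core (and the prefetchers) can keep many misses in flight at once.
        size_t sum = 0;
        for (size_t i = 0; i < N; ++i) sum += next[i];
        auto t2 = std::chrono::steady_clock::now();

        auto ms = [](auto a, auto b) {
            return (long long)std::chrono::duration_cast<
                std::chrono::milliseconds>(b - a).count();
        };
        std::printf("dependent:   %lld ms (p=%zu)\n", ms(t0, t1), p);
        std::printf("independent: %lld ms (sum=%zu)\n", ms(t1, t2), sum);
    }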

However, contemporary RAM is not designed to coordinate with the CPU very well
on this front, at least compared to designs like HMC (a type of "stacked RAM",
a competitor to a GPU's HBM).

HMC is interesting because it's a packet-based system. You tell the RAM a
memory address to access, but the RAM may return that memory out of order
relative to other requests!
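
A minimal sketch of what such a tagged, packet-style protocol looks like from
the host's side (the struct names and shuffled completion order are invented
for illustration; this is not the actual HMC packet format):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct ReadRequest  { uint32_t tag; uint64_t addr; };
    struct ReadResponse { uint32_t tag; uint64_t data; };

    int main() {
        std::vector<ReadRequest> reqs;
        for (uint32_t t = 0; t < 8; ++t)
            reqs.push_back({t, 0x1000 + 64ull * t});

        // "Memory side": complete the requests in an arbitrary order, as if
        // some hit open rows and others had to wait on precharge/activate.
        std::vector<ReadResponse> resps;
        for (auto& r : reqs) resps.push_back({r.tag, r.addr * 2});  // fake payload
        std::shuffle(resps.begin(), resps.end(), std::mt19937{7});

        // "Host side": the tag is what lets the host reassociate responses
        // with requests, so arrival order no longer matters.
        for (auto& resp : resps)
            std::printf("response tag=%u data=0x%llx\n",
                        resp.tag, (unsigned long long)resp.data);
    }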

After all, RAM has physical latency issues of its own. If a request targets a
different row than the one currently open in a bank, the open row must be
precharged and the new row activated before the column access, paying tRP and
tRCD on top of tCL (with tRAS and tRC constraining how fast rows can cycle).
A "hit" to the already-open row incurs only the tCL (CAS latency) delay. If
RAM could execute "out of order", then the memory controller could allow more
efficient orderings of memory requests.
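
Back-of-the-envelope arithmetic for those timings (the DDR4-3200 CL22-22-22
numbers below are a typical retail part, chosen only as an example):

    #include <cstdio>

    int main() {
        const double ns_per_cycle = 1.0 / 1.6;   // DDR4-3200: 1600 MHz command clock
        const int tCL = 22, tRCD = 22, tRP = 22; // in clock cycles

        double hit_ns  = tCL * ns_per_cycle;                // row already open
        double miss_ns = (tRP + tRCD + tCL) * ns_per_cycle; // precharge + activate + read

        std::printf("row hit:  %.1f ns\n", hit_ns);   // ~13.8 ns
        std::printf("row miss: %.1f ns\n", miss_ns);  // ~41.3 ns
    }

A scheduler that steers requests toward already-open rows saves roughly 3x in
latency per access, which is exactly the kind of reordering at stake here.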

Juggling which banks are open and closed has traditionally been the job of the
CPU's memory controller. But communicating this information over a many-cm-long
PCB trace incurs latency, so it makes more sense to put this logic as close to
the RAM as possible (speed of light, capacitance, inductance, etc. all slow
down the signal).
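
To put a number on that (assuming ~15 cm/ns signal propagation on FR-4 and a
10 cm trace; both figures are ballpark assumptions):

    #include <cstdio>

    int main() {
        const double cm_per_ns = 15.0;  // assumed propagation speed on FR-4
        const double trace_cm  = 10.0;  // assumed CPU-to-DIMM trace length
        double one_way = trace_cm / cm_per_ns;
        std::printf("one way: %.2f ns, round trip: %.2f ns\n",
                    one_way, 2 * one_way);
        // ~0.67 ns each way: about one full DDR4-3200 command-clock cycle
        // spent on the wire before any controller logic even runs.
    }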

Furthermore: since CPUs are incredible out-of-order machines already, it
wouldn't be a major hassle to make the memory out-of-order too. I mean, yeah,
it's complicated, but it's no different from what CPUs already do with L1, L2,
and L3 caches (and the programming constructs built on top: atomic accesses
and whatnot to control the effects of memory reordering).
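
As a concrete example of those constructs, release/acquire atomics in C++ pin
down the orderings the programmer actually needs while leaving the hardware
free to reorder everything else (a minimal, runnable sketch):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                     // plain store
        ready.store(true, std::memory_order_release);  // writes above can't sink below
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // reads below can't hoist above
        std::printf("%d\n", data);                     // guaranteed to print 42
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }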

I haven't heard of any major computer using HMC yet, however; only $10,000+
FPGAs on occasion. Still, there exists RAM today which can execute requests
out of order. If this kind of RAM ever becomes mainstream, I'd expect systems
to get much faster.

~~~
slededit
FWIW, DDR3 includes 8 banks, each of which can operate independently of the
others, so there is still parallelism there. DDR4 doubles this to 16.
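
That bank-level parallelism is usually exposed by interleaving: a toy sketch
(the bit positions are controller-specific; these are made-up choices for
illustration) of how consecutive cache lines land in different banks:

    #include <cstdint>
    #include <cstdio>

    int main() {
        for (uint64_t addr = 0; addr < 8 * 64; addr += 64) {  // 8 consecutive cache lines
            // Assume the 3 bank bits sit just above the 64-byte line offset.
            unsigned bank = (addr >> 6) & 0x7;
            std::printf("addr 0x%04llx -> bank %u\n",
                        (unsigned long long)addr, bank);
        }
    }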

~~~
dragontamer
> FWIW, DDR3 includes 8 banks, each of which can operate independently of the
> others, so there is still parallelism there. DDR4 doubles this to 16.

Indeed. So there's some parallelism that can be captured, but it's managed by
the CPU's memory controller rather than by the memory itself.

Because HMC abstracts the protocol into a packet-based, innately out-of-order
one, and because the memory-controller logic has been moved physically as
close as possible to the RAM itself, it can achieve much higher degrees of
parallelism.

Case in point: a single stack of HMC has 128 banks of parallelism across 4
vaults (I would argue that a "vault" is roughly equivalent to a DDR4 channel):
[https://www.micron.com/parts/hybrid-memory-cube/hmc-sr/mt43a...](https://www.micron.com/parts/hybrid-memory-cube/hmc-sr/mt43a4g40200nfa-s15?pc={8AD36F73-07F4-4ECD-A168-B5E899F1E650})

~~~
slededit
The bottleneck for DDR memory isn't really the memory controller but rather
the memory bus itself. Moving the controller won't change that.

It's very rare to get anywhere close to 100% bus utilization. HMC's real
advance is switching to independent serial links, which we know how to scale
better than parallel buses. It's this improved bandwidth that allows you to
pack in more parallelism. We _could_ have more banks in traditional DDR
memory, but there isn't much point because the data bus couldn't feed them.
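
Rough numbers behind that (the peak figures follow from the interface math;
the 60-80% sustained-utilization range is a ballpark assumption, and HMC lane
counts and rates vary by generation):

    #include <cstdio>

    int main() {
        double ddr4_peak = 3200e6 * 8;  // 3200 MT/s x 64-bit bus = 25.6 GB/s/channel
        std::printf("DDR4-3200 channel peak: %.1f GB/s\n", ddr4_peak / 1e9);
        for (double util : {0.6, 0.8})
            std::printf("  at %2.0f%% utilization: %.1f GB/s\n",
                        util * 100, ddr4_peak * util / 1e9);

        // One HMC link as a bundle of serial lanes (assumed 16 lanes x 15 Gb/s
        // per direction): 30 GB/s each way, and a cube has multiple links.
        double hmc_link = 16 * 15e9 / 8;
        std::printf("assumed HMC link: %.1f GB/s each way\n", hmc_link / 1e9);
    }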

