
Memory Is Holding Up the Moore's Law Progression of Processing Power (2014) - apsec112
http://motherboard.vice.com/read/memory-is-holding-up-the-moores-law-progression-of-processing-power
======
nkurz
I try to refrain from purely negative commentary on articles, but yuck! How
can one hope to say anything useful about "memory performance" without once
using the terms "latency" or "bandwidth"? Memory performance is improving in
the same way that processor performance is improving:

[http://www.corsair.com/en-us/blog/2015/september/ddr3_vs_ddr4_generational](http://www.corsair.com/en-us/blog/2015/september/ddr3_vs_ddr4_generational)

Latency is to frequency as bandwidth is to parallelism. Single core CPU
frequencies are relatively stable, but parallelism offers the opportunity to
get more done each cycle. Latency for random access is holding quite steady,
but caches are getting bigger and faster, and bandwidth is going up.

The key is figuring out how to write software that takes advantage of spatial
locality and available bandwidth rather than getting choked by the latency.
This is hard in the same way that taking advantage of multiple cores is hard:
it requires a different approach, but is not a fundamental limitation.
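
To make the latency-vs-bandwidth distinction concrete, here is a small sketch (purely illustrative; in pure Python the interpreter overhead mutes the effect, and in compiled code the gap is far larger): a sequential scan that the hardware prefetcher can stream, versus a pointer chase where every load depends on the previous one and nothing can be overlapped.

```python
import random
import time

N = 1 << 18  # number of elements

# Bandwidth-friendly: a sequential scan the hardware prefetcher can
# stream, fetching cache lines well ahead of the loop.
data = list(range(N))
t0 = time.perf_counter()
total = sum(data)
seq_time = time.perf_counter() - t0

# Latency-bound: chase pointers through a random permutation; every
# step depends on the previous load, so accesses cannot be overlapped.
perm = list(range(N))
random.shuffle(perm)
next_idx = [0] * N
for a, b in zip(perm, perm[1:] + perm[:1]):
    next_idx[a] = b  # one big cycle through all N slots

t0 = time.perf_counter()
i, visited = 0, 0
while True:
    i = next_idx[i]
    visited += 1
    if i == 0:  # back at the start: every slot has been visited once
        break
chase_time = time.perf_counter() - t0

print("sequential: %.4fs   chase: %.4fs" % (seq_time, chase_time))
```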

Lots of memory access isn't the problem. Long latency isn't the problem. The
problem is designing your program so that it generates lots of long latency
memory accesses and then grinds to a halt in the presence of this latency. I
think the summary from this recent paper is spot on:

    
    
      Our conclusion is contrary to our expectations and to   
      previous findings and goes against conventional wisdom, 
      which states that accesses to RAM are slow, and should be 
      minimized. A more accurate statement, that accounts for our 
      findings, is that accesses to RAM have high latency and 
      this latency needs to be mitigated.
    

[http://arxiv.org/pdf/1509.05053v1.pdf](http://arxiv.org/pdf/1509.05053v1.pdf)

~~~
vardump
> I try to refrain from purely negative commentary on articles, but yuck! How
> can one hope to say anything useful about "memory performance" without once
> using the terms "latency" or "bandwidth"?

Yeah, it is pretty clear the author didn't really understand much of the
topic. Even if you have lightning-fast memory, there's still the issue of
connecting it. Long traces and multiple levels of multiplexing are going to
add latency no matter how amazing the memory tech being employed. Before long,
to increase bandwidth, memory will need to be integrated in the same package
as the CPU, if not on the same die, or at least as large eDRAM-style caches.
You can't economically run 1024 traces just for memory on the mainboard PCB!

> The key is figuring out how to write software that takes advantage of
> spatial locality and available bandwidth rather than getting choked by the
> latency. This is hard in the same way that taking advantage of multiple
> cores is hard: it requires a different approach, but is not a fundamental
> limitation.

Indeed, though of course this requires skilled labor for now, until compilers
catch up one day. And unfortunately, code that truly requires random access is
just not going to perform well on modern hardware. Up until the early
nineties, memory was faster than processors, and random access was just fine.
Not so anymore.

It's also easy to get bandwidth-limited with SSE and AVX, although the line
fill buffers and similar per-core structures often become the bottleneck
first. For scalar code, being bandwidth-limited is not going to happen.

The issue nowadays is that machines are such unique snowflakes, performance-
and configuration-wise. It's not hard to max out a single configuration, but
it is hard to write something that performs decently across different system
configurations. There are something like 30 instruction set extensions for
x86, plus variations in reorder buffer depth and in cache latency, size, and
associativity. And of course memory bus configurations, the number and
interleaving of memory channels, NUMA, DRAM memory page size (1, 2, 4 kB),
etc.

For example, unlike the 8-way associative L2 cache on Sandy Bridge, Haswell,
etc., the L2 on Skylake is now 4-way. Code that is optimised for an 8-way L2
cache might be pathologically evicting L2 cache lines on Skylake.
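
A toy model of why that matters (the geometry numbers are illustrative, assuming a 256 KB L2 with 64-byte lines, which matches both configurations): addresses strided so that they all share one set in the 8-way cache land in only two sets of the 4-way cache, exceeding the 4 ways available per set and forcing conflict evictions.

```python
# Hypothetical but representative cache geometry: 256 KB L2, 64 B lines.
LINE = 64
L2_SIZE = 256 * 1024

def set_index(addr, ways):
    """Which cache set a byte address maps to, for a given associativity."""
    sets = L2_SIZE // (LINE * ways)
    return (addr // LINE) % sets

# Nine addresses spaced one "8-way stride" (32 KB) apart all collide in
# a single set of the 8-way cache -- which can hold 8 of them.  In the
# 4-way cache they split across two sets, ~5 per set, exceeding 4 ways.
stride = L2_SIZE // 8
addrs = [i * stride for i in range(9)]

same_set_8 = {set_index(a, 8) for a in addrs}
same_set_4 = {set_index(a, 4) for a in addrs}
print("8-way sets used:", same_set_8, " 4-way sets used:", same_set_4)
```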

Those aspects matter, because high performance is often a balancing act
between available features, bandwidth and CPU power.

~~~
Filligree
And while you're at it, it's hard to figure out what's going on in your
system. I've spent some time looking, and have not yet found anything
resembling a bandwidth monitor for main memory.

~~~
nkurz
It's possible, but depending on your OS and processor it might be quite
difficult. There are "Uncore" performance counters for this, but the means of
access has been changing from generation to generation and the software
sometimes lags behind.

For Linux and Haswell EP, I've had luck with likwid: [https://github.com/RRZE-HPC/likwid/wiki/Haswell-EP#memory-controller-fixed-purpose-counters](https://github.com/RRZE-HPC/likwid/wiki/Haswell-EP#memory-controller-fixed-purpose-counters)

You also might have luck with Andi Kleen's pmu-tools: [https://github.com/andikleen/pmu-tools/blob/master/ucevent/README.md](https://github.com/andikleen/pmu-tools/blob/master/ucevent/README.md)

I don't know for sure if Intel's VTune supports these counters, but I'd
presume it would: [https://software.intel.com/en-us/intel-vtune-amplifier-xe](https://software.intel.com/en-us/intel-vtune-amplifier-xe)

~~~
Filligree
Doesn't seem like any of them work with my E3 Xeon.

------
e5f34f89
A much bigger problem "holding up Moore's Law Progression" is the failure of
Dennard scaling and the fact that voltage scaling is hitting the threshold
voltage limit (where sub-threshold leakage current increases significantly) as
we move to smaller technology nodes. This means we can build bigger chips but
we don't necessarily have the power budget to power up all parts of it at the
same time (these could be cores, pipeline structures, etc). The architecture
community has written a lot on this "Dark Silicon" problem if anyone wants to
read further.

------
oldmanjay
I'm not certain who this article is targeting, but it reminds me of an
acquaintance who would ask me detailed technical questions, misunderstand all
my answers, and then two days later misinform me of the things I had told him,
as if he were teaching me.

------
advancedprivacy
Moore's Law is about transistor count. Not about frequency, bandwidth or
"speed"... Is everybody ignoring this fact?

~~~
Symmetry
Moore's 1965 paper was about transistor count doubling every 12 months rather
than frequency but "Moore's Law" didn't come into use as a term until 1975, by
which time Moore was giving shrinking feature sizes equal billing[1]. And
Moore himself wrote a memo endorsing a more general use of the term "Moore's
Law" for any chip performance metric that doubles regularly. I can't find a
copy online but I have it in my old Computer Architecture lecture notes. And
performance and scaling were identical as long as Dennard scaling[2] lasted.

[1][http://www.eng.auburn.edu/~agrawvd/COURSE/E7770_Spr07/READ/G...](http://www.eng.auburn.edu/~agrawvd/COURSE/E7770_Spr07/READ/Gordon_Moore_1975_Speech.pdf)

[2][https://en.wikipedia.org/wiki/Dennard_scaling](https://en.wikipedia.org/wiki/Dennard_scaling)

------
Symmetry
When you're talking about memory performance you always have to include the
latency, the bandwidth, and the size of the memory pool. 64 KB memory pools
have scaled in latency and bandwidth at the same rate as processing power;
now they're sitting deep in the heart of the chip as the L1 cache.

------
anonmeow
Server CPUs have >2x the memory channels of consumer CPUs. IBM Power CPUs
show that it's possible to get even more memory bandwidth than in mainstream
Xeons. It looks like the low RAM bandwidth in consumer CPUs is a mostly
artificial differentiator to discourage use of these parts in servers.

On the other hand, there are HMC and HBM technologies that offer an order of
magnitude more bandwidth and several times less latency. They are already
used in AMD GPUs, as well as in prototypes of Nvidia's Pascal GPUs and
Intel's Knights Landing Xeon Phi CPU: [http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/](http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/)

I hope HBM comes to consumer CPUs too, but with current lack of competition in
the market it can take a long time.

------
ck2
No, it's not.

Since Skylake comes in both DDR4 and DDR3 motherboard versions, various
benchmarks have come out to test the difference for the state-of-the-art 14nm
CPU with the "improved" memory vs the "old" memory.

And the difference is often only 1-2%.

Unless maybe the goal should be to put 32 GB of memory right on the CPU die.

~~~
vardump
Different software needs different resources.

Unsurprisingly, if the software in question is not bandwidth limited,
providing more bandwidth is not going to speed it up. Most software is like
that.

It also truly depends on how it was optimized. The software being tested was
likely optimized for previous gen configurations. It might very well favor a
bit more computation over higher memory bandwidth usage.

Wait until developers optimize against DDR4 Skylake systems, you might start
to see 5-10% difference at that point. Truly bandwidth limited code can run up
to about 40% faster on a DDR4 system, assuming typical 1600 MHz DDR3 and 2400
MHz DDR4.
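
The rough peak-bandwidth arithmetic behind that figure (a sketch; "1600 MHz" is really 1600 MT/s, and real code rarely reaches theoretical peak): dual-channel DDR4-2400 has 50% more peak bandwidth than dual-channel DDR3-1600, so ~40% measured on truly bandwidth-bound code is plausible.

```python
# Peak theoretical bandwidth: transfers/s * 8 bytes per 64-bit channel.
def peak_gbs(mt_per_s, channels=2):
    """Decimal GB/s for a given transfer rate and channel count."""
    return mt_per_s * 8 * channels / 1000

ddr3 = peak_gbs(1600)  # dual-channel DDR3-1600
ddr4 = peak_gbs(2400)  # dual-channel DDR4-2400
print("DDR3-1600: %.1f GB/s  DDR4-2400: %.1f GB/s  (+%.0f%% peak)"
      % (ddr3, ddr4, 100 * (ddr4 / ddr3 - 1)))
```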

------
graycat
Okay, do the usual: add microcode to the processor cores to support more
capable instructions, so that a single instruction can trigger _streams_ of
all the data with nearly no time spent on addressing.

E.g., implement heap sift, heap sort, heap priority queue, substring search,
and of course inner product accumulation, plus whatever else looks promising,
e.g., standard multi-dimensional array addressing and chasing down the chains
of pointers common in OO programming.

That is, have the machine instructions do larger chunks of work.
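
For reference, here is the kind of work a single "heap sort instruction" would subsume -- a plain sketch, where the children-at-2*i+1 index arithmetic is exactly the regular addressing pattern being proposed for hardware:

```python
def sift_down(heap, i, n):
    """Max-heap sift-down.  The child indices 2*i+1 and 2*i+2 are the
    simple, regular address arithmetic that could live in hardware."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < n and heap[left] > heap[largest]:
            largest = left
        if right < n and heap[right] > heap[largest]:
            largest = right
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def heap_sort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):  # build the max-heap
        sift_down(a, i, n)
    for end in range(n - 1, 0, -1):      # repeatedly extract the max
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
    return a

print(heap_sort([5, 1, 9, 3, 7, 2]))  # [1, 2, 3, 5, 7, 9]
```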

~~~
nkurz
I'm guessing this is downvoted because it's no longer a viable solution.
Microcode generates multiple µops with a single instruction, but the decoded
µop cache is large enough (and efficient enough) that the decoding is almost
never the bottleneck.

Worse, for anything in a loop it often actually slows things down by
preventing the usual caching mechanisms from working. The instructions/µops
are already where they need to be, but data dependencies prevent them from
being executed in a timely manner.

What's needed instead are changes to the algorithms that allow for more
instruction level parallelism. We need to overcome latency by creating
assembly lines within the core rather than having each core do piece work.

~~~
graycat
I thought that what I wrote was clear, simple, and drew heavily from some
quite solid, old ideas that, however, don't seem to be popular now but do
address the same old problem of main memory being too slow. I don't know where
I was unclear.

> I'm guessing this is downvoted because it's no longer a viable solution.
> Microcode generates multiple µops with a single instruction, but the decoded
> µop cache is large enough (and efficient enough) that the decoding is almost
> never the bottleneck.

Sure. But the OP was talking about memory speed, not internal processor speed,
from microcode or anything else.

By saying microcode, I was just trying to make the needed logic obviously
doable. Now that transistors are so cheap, one could do it in hardware.

The main point I was trying to get at was just the one in the OP -- memory too
slow.

Well, memory can be darned fast, if we're talking about just the memory. The
way I see it, it's not that the memory itself is, or currently has to be, too
slow electronically; instead, it's that the darned addressing is too slow or
there's too much of it.

E.g., to access a Fortran array with three subscripts, you have to do the
darned array index calculation -- what is it, two multiplies and two adds,
starting from five numbers -- for each element of the array. You can spend
more time calculating the address of an array component than on the data once
you get it. Yes, a decent Fortran compiler will not redo that arithmetic from
scratch for each component of the array, especially in a loop. Since C can't
do such arrays without the programmer writing a macro, I have to wonder if C
compilers are smart enough to save on the array addressing arithmetic the way
a Fortran compiler does.
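
A sketch of that addressing arithmetic (hypothetical extents, column-major order as in Fortran), next to the strength-reduced form a decent compiler emits, where the inner loop walks the array with a single increment:

```python
NI, NJ, NK = 4, 5, 6  # hypothetical array extents

def addr(i, j, k):
    """Linear offset of a(i,j,k), column-major: two multiplies, two adds."""
    return (k * NJ + j) * NI + i

# Strength reduction: hoist the multiplies out of the inner loop and
# advance the offset by 1 per iteration instead.
naive, reduced = [], []
for k in range(NK):
    for j in range(NJ):
        base = (k * NJ + j) * NI  # computed once per inner loop
        for i in range(NI):
            naive.append(addr(i, j, k))   # full recomputation
            reduced.append(base + i)      # one add per element

print(naive == reduced)
```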

Still, commonly, more time is spent calculating addresses than doing the
work. And to the processor core, the address calculations look like just more
instructions that might be part of something really complicated, instead of
something with simple patterns that can be exploited -- a hardware designer
would see the patterns and exploit them in the hardware. So, the poor
processor has to be absurdly _myopic_ and just do what the heck it is told to
do.

Instead, for cases like, say, heap sort, the fast Fourier transform, and
more, just have one instruction for heap sort and then have all that
addressing logic in hardware, fast enough to keep memory fully busy. If
electronically the memory is still too slow, then have interleaved memory --
since the addressing is so simple and regular, the hardware implementation
will know how to look ahead, much as in speculative execution now, except
there will be less or no speculating.

Some of this is now very old stuff, and for just the reasons I suggested:
_super computing_ has long had an inner product instruction -- one
instruction and the whole inner product calculation gets done. That is, an
_inner product_ is the sum over i of x(i)y(i) and is just ubiquitous in
scientific-engineering computing.
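
That is, in code (a trivial sketch): one multiply-accumulate per element over two perfectly predictable streams, which is why it was worth a dedicated instruction.

```python
def inner_product(x, y):
    """sum over i of x(i)*y(i): one multiply-add per element, with a
    streaming access pattern the hardware can trivially look ahead on."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc

print(inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```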

That is, generally the idea is to move some relatively simple ordinary
instruction streams into hardware. Again, the idea is old, e.g., was used for
the instruction extensions for handling images -- one instruction and, slam,
bam, thank you ma'am, got some image processing code, that was maybe before
100 instructions, in a loop, done. So, get to save on fetching and decoding
all those instructions and much of the addressing arithmetic they would do,
and the addressing is so regular that the hardware gets to look ahead, e.g.,
which would exploit interleaved memory. And, in addition, might design the
sending of read commands to main memory not just one at a time but as a list,
boom, and with no more attention, waiting, synchronizing, hand shaking, the
memory delivers all the data at all the addresses in the list. E.g., something
like DMA for I/O. E.g., to find a sum of the numbers in an array, have a
single instruction and have memory just send the data ASAP, much like in DMA
for fast I/O -- on a machine with interleaved memory, say, 16 ways, that would
just fly and scream at the same time.

For instruction level parallelism, some old work showed that with a 24-way
very long instruction word (VLIW) machine and just some compiler tweaks on
ordinary code, you could get a 9:1 speedup. IIRC, Itanium was supposed to be
a VLIW machine.

~~~
nkurz

      > I don't know where I was unclear.
    

I don't think you were unclear. As respectfully as possible (and I really do
like your perspective and many of your other posts), I think the problem is
that you were clear and wrong[1].

    
    
      > it's that the darned addressing is too slow or 
      > there's too much of it
    

Generally, no. Current processors have two dedicated address calculation ports
that each calculate (ptr + index*size + const) in the same cycle that the
request is issued. Separately, there are almost always unused arithmetic ports
such that one could easily double the amount of other arithmetic without
adding any additional latency. Address calculation is not a significant
performance factor.

    
    
      > So, get to save on fetching and decoding all those 
      > instructions and much of the addressing arithmetic 
      > they would do
    

My argument is that these are almost never a bottleneck, and that removing
them altogether will not produce a significant speed up.

    
    
      > If electronically memory is still too slow, then have 
      > interleaved memory -- since the addressing is so simple
      > and regular, the hardware implementation will know how
      > to look ahead, much as in speculative execution now 
      > except there will be less or no speculating.
    

Recent generations have been 2-, 3-, or 4-way interleaved. 6-way is promised
for (I think) 2018. Hardware prefetchers are excellent at getting out ahead
of just about any regular pattern. The issue is that most software is
designed in ways that require unpredictable access patterns, and thus is
latency sensitive.

    
    
      > And, in addition, might design the sending of read
      > commands to main memory not just one at a time but
      > as a list, boom, and with no more attention, waiting, 
      > synchronizing, hand shaking, the memory delivers all
      > the data at all the addresses in the list.
    

This is essentially the 'gather' instruction that has been supported in the
last 3 generations of Intel processors. It's a single instruction that sends
out parallel requests for 8 addresses and returns 8 32-bit values with about
the same latency as a single request would take. To a first approximation,
it's never used[2] to advantage, because it's no faster than issuing 8
requests serially while being less flexible.
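
In software terms, a gather is just indexed loads batched into one operation -- semantically like this sketch (the table and indices here are made up for illustration):

```python
# A software picture of 'gather': one operation that fetches the values
# at eight arbitrary indices, the way AVX2's vgatherdps does in a
# single instruction.
table = [v * v for v in range(100)]  # arbitrary lookup table
idx = [3, 17, 0, 42, 8, 91, 55, 7]   # 8 unpredictable indices

gathered = [table[i] for i in idx]   # the "gather": 8 indexed loads
print(gathered)
```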

Mostly, I'm suggesting that hardware currently supports far more performance
than we are getting, because modern software is still designed as if it were
running on in-order processors with a flat memory hierarchy. I agree with you
that C (and other) compilers could do a better job, but mostly I think the
issue is with the mindset of the programmers writing the software.

[1] I find that clear and wrong, however, is almost always preferable to
unclear and wrong. And I reiterate: I don't mean this as an attack.

[2] I love it in concept, and hope to show that it actually can speed things
up on the latest Skylake generation, but until now it's been mostly a bust.

~~~
graycat
Clearly most of the processor design features I outlined and proposed were
speculative.

E.g., at IBM's Watson lab, in the room next to mine a guy was looking at
traces from deep in the processor hardware to estimate the speedups possible
from various proposed and speculative features. He did this work because it is
not easy to estimate, just from intuition, what features will give what
speedups.

The VLIW data I reported was from some careful work by a guy across the hall
from me.

Net, what I was proposing was not just wild guessing.

So, you are explaining that some of what I proposed has already been
implemented.

Okay.

