
Memory bandwidth - deafcalculus
https://fgiesen.wordpress.com/2017/04/11/memory-bandwidth/
======
nortiero
The article is very optimistic about memory availability per cycle; reality is
way worse.

As an example, on my MacBook Air (2011) with ~10 GB/s of maximum RAM bandwidth,
a random access to memory can take 100 times longer than a sequential one.

This is in C, with full optimizations and a very low-overhead read loop.

Using the same metric as the author:

best case: ~ 3 bytes per cycle

(around 6 gigabytes per second of available bandwidth)

worst case: ~ 0.024 bytes per cycle (memory scheduler, prefetcher, and
already-open rows all mostly defeated)

Note that the worst case takes 10 seconds (!) to read and sum, in random
order, every cell of an array of 100,000,000 4-byte integers, exactly once.
The main loop is light enough not to influence the test.

That's about 40 megabytes per second out of the 6,000 available.

What can I say.. CPU designers are truly wizards!
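
A minimal sketch, in C, of the kind of test described above (the index
shuffle, seed, and timing calls are illustrative, not the original code):

    /* Sum 100,000,000 4-byte integers exactly once, in random order. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100000000  /* ~400 MB of 4-byte integers */

    int main(void) {
        int *data = malloc((size_t)N * sizeof *data);
        int *idx  = malloc((size_t)N * sizeof *idx);
        if (!data || !idx) return 1;

        for (int i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }

        /* Crude Fisher-Yates shuffle: each cell is visited exactly once. */
        srand(1);
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        clock_t t0 = clock();
        long long sum = 0;
        for (int i = 0; i < N; i++)
            sum += data[idx[i]];           /* one dependent load per step */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("sum=%lld  %.2f s  %.1f MB/s\n",
               sum, secs, (double)N * sizeof(int) / secs / 1e6);
        free(data); free(idx);
        return 0;
    }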

~~~
logicallee
>worst case: ~ 0.024 bytes per cycle (memory scheduler, prefetcher, and
already-open rows all mostly defeated)

>That's about 40 megabytes per second

Wow, that is an explosive conclusion. It's very hard for me to come to terms
with. 40 MB per second is the sustained read speed of a spinning-platter hard
drive

[http://hdd.userbenchmark.com/](http://hdd.userbenchmark.com/) (click any
line)†

(Did I say 40? I meant 160 MB/sec...) So more like 1/4 of the sequential read
speed off of a spinning platter of rust. Ten seconds is insane.

I don't care how many times you're bouncing back and forth and invalidating
caches and pipelines and prefetches and schedulers, you simply shouldn't be
able to ruin things that badly. It is off from what I would expect by (easily)
an order of magnitude.

I know you say that the main loop is very light - but aren't there other
aspects of your build system and operating system that might be affecting this
test? To state the obvious, couldn't the operating system scheduler be failing
to give your process the cycles it needs? There is a lot more that I could say
in this direction, but let's just do something simpler:

-> Could you try your experiment without an operating system?

For example here are some people who booted a raspberry pi without an
operating system -

[https://www.google.com/search?q=chess+without+an+operating+s...](https://www.google.com/search?q=chess+without+an+operating+system+raspberry+pi)

Perhaps before going that far you could boot into a minimal Linux image
compiled with essentially no hardware support. (After all, you don't need to
do anything except return to the shell.) Or just see what happens if you boot
into Linux and try it.

If you get an instantly different result simply by booting Linux on the same
hardware, then you instantly have an explosive blog post: "summing 100 million
4-byte integers in random order takes 10 seconds on Mac OS X but only 1 second
under Linux".

I realize there is a HUGE difference (HUGE) between sequential and random. But
I just wanted to get across how insanely slow 40 MB/second straight to RAM is.
That should not be possible, no matter how much you defy caches and scheduling
and so forth, unless you get the Mac to swap pages out of RAM onto an SSD or
something! So not using an operating system would really help here.

Could you try it? I'm not saying I don't believe you but - wow, that is
insane.

† I just noticed you wrote "Macbook air 2011". If you want to look at 2011
hard drive speeds, a quick glance still sees some quoting 140 MB/sec so it
still seems correct to me, but I just quoted 2017 figures.

~~~
qb45
> Wow, that is an explosive conclusion. It's very hard for me to come to terms
> with. 40 MB per second is the sustained read speed of a spinning-platter
> hard drive

Parent is talking about random access. So compare with random access to
spinning rust :)

40 MB/s for random RAM access is totally reasonable. Dynamic RAM (DRAM), the
kind of RAM used in computers nowadays, is organized and accessed in "rows" of
a few kB. If you read random addresses, chances are good that almost every
read will miss all the CPU caches and hit a DRAM row other than any currently
open row (maybe a few dozen rows out of millions are open at any time,
depending on the number and internal organization of the RAM modules). Closing
the current row and opening a new one takes tRP+tRAS, which is 13+35 ns on
some random DDR3 RAM I have lying here. That works out to roughly 20M
individual accesses per second.

[https://en.wikipedia.org/wiki/Dynamic_RAM](https://en.wikipedia.org/wiki/Dynamic_RAM)
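
Back of the envelope with those timings (assuming, as in the grandparent's
test, that each random access costs a full row cycle and yields one 4-byte
read):

    #include <stdio.h>

    int main(void) {
        double tRP_ns  = 13.0;                  /* row precharge time */
        double tRAS_ns = 35.0;                  /* minimum row active time */
        double row_cycle_ns = tRP_ns + tRAS_ns; /* ~48 ns to close + open */

        double accesses_per_s = 1e9 / row_cycle_ns;   /* ~20.8 million */
        printf("%.1f M row activations/s\n", accesses_per_s / 1e6);
        printf("%.0f MB/s at one 4-byte read per activation\n",
               accesses_per_s * 4 / 1e6);
        return 0;
    }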

~~~
logicallee
What do you mean by "tRP + tRAS"?

I now understand how it's reasonable, as in, correct. But I don't understand
the fundamental reason for this. Okay, so every time a row is read, if it's
not in cache it'll get cached. But why does it have to be that way?

Couldn't there be a mode, "hey, don't fully open these rows, I just want one
random byte as fast as possible!"?

I compared it with spinning disks just to show how unreasonable the total is.
I realize that the whole design isn't built around this idea of picking off a
byte at a time.

But don't you think there could be applications that have PRECISELY, exactly
this usage pattern?

For example, what percent of your neurons are firing at the moment? Very, very
low.

For some future applications, getting a 10x speedup on random single-byte
memory reads might speed the application up by a lot. Even if desktops aren't
built this way today, I'm super-surprised that when the whole system isn't
doing anything else, there is no way to get that kind of raw access without
asking for whole rows at a time.

~~~
qb45
> Couldn't there be a mode, "hey, don't fully open these rows, I just want one
> random byte as fast as possible!"?

As fast as possible is exactly tRP+tRAS. Since the whole row is read in
parallel into the RAM's internal SRAM buffer, opening only part of it would
make no difference.

> What do you mean by "tRP + tRAS"?

Ever heard of RAM timings? I'm afraid at some point you will have to read how
DRAM works to understand more. There was a link in my last post.

------
adrianmonk
It's good to see some attention paid to the subject, but it's not exactly a
_new_ revelation.

Probably 15-20 years ago, my computer architecture professor commented that,
while CPUs of the past had relatively anemic number-crunching powers, all the
pipelining and high clock speeds and other advancements in more recent CPUs
had changed that, but corresponding advancements in memory bandwidth had not
been made.

Which in turn meant that the way you go about optimizing code needed to
change. In the past it had been mostly about finding ways to eliminate
instructions or simplify expressions, because the things that were holding you
back were the ALU and the ability to plow through instructions. With all those
things sped up massively but less improvement in RAM, it became very important
to start thinking about memory access patterns and caches.

We even had a homework assignment to optimize a matrix multiply, and the
lesson learned was that the dominating factor wasn't what the code in the
innermost loop looked like; it was which direction you proceed through the
matrices (row by row vs. column by column), because that determines the memory
access patterns.
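
For the curious, the classic illustration of that lesson (matrix size and
element type here are just placeholders): both functions compute the same
product, but the second orders its loops so that b and c are walked row by
row, which in C's row-major layout means sequential memory access.

    #include <stddef.h>

    #define N 1024

    /* Inner loop strides down a column of b: one cache-unfriendly access
       per step. */
    void matmul_ijk(const double a[N][N], const double b[N][N], double c[N][N]) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];  /* b accessed column-wise */
                c[i][j] = sum;
            }
    }

    /* Reordered loops: the inner loop walks rows of b and c sequentially. */
    void matmul_ikj(const double a[N][N], const double b[N][N], double c[N][N]) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                c[i][j] = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t k = 0; k < N; k++) {
                double aik = a[i][k];
                for (size_t j = 0; j < N; j++)
                    c[i][j] += aik * b[k][j];  /* b and c accessed row-wise */
            }
    }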

------
luckydude
What a refreshing article. This guy gets it and puts things into perspective
in a way that you (or at least I) don't see very often.

Worth a read if you are just scanning the comments.

------
socmag
Really great article and comments :)

Just as a point of reference, I currently see around 1.25-1.5 ops/cycle on
carefully crafted, highly parallel, lock-free, stall-free code, say running on
8 threads. Code that has 0.01% branch misprediction.

Unfortunately, in my case, as others mention, access to the I/O ports and
memory latency is the real limiting factor. The CPU is just... waiting.

Getting to the Holy Grail that Ryg talks about of 3 instructions per cycle is
_really_ hard with non-vectorizable workloads - like screwing around with hash
tables that have no chance of fitting in L1/L3, and not being able to really
make much use of SIMD, even if you are paying attention to cache-lines.

Most apps barely scrape by at 0.5 instructions/cycle or worse and spend most
of the time bouncing on the kernel for stupid stuff. Not good.

Absolutely <3 performance freaks!

------
white-flame
A commonly recurring subject on the 6502 forums is what a "modern" 6502 would
look like. It always boils down to not really being able to replicate the
1-cycle access to any byte in RAM. Changing the memory access model requires
changing everything, to the point that the result is unrecognizable as a 6502
derivative in programming style.

Of course, you could put 64KB of SRAM on the CPU die, but the size and power
of the RAM would dwarf the processor, and you'd get an old-school 6502,
arguably not a modern take on the concept. If you want more memory, you simply
can't replicate the 1980s access model at anything approaching today's speeds.

~~~
PaulHoule
I was wondering the other day how it would work if you made something like a
6502 out of indium phosphide with 64k of SRAM and clocked it at 50 GHz.

~~~
qb45
This would have to be a tiny core with tiny RAM indeed. At 50 GHz your clock
signal is 180° out of phase after traveling just 2 mm (the clock period is
20 ps, and even at a good fraction of the speed of light a signal covers only
a couple of millimeters in half a period).

I wonder if there are any practical applications that could benefit from such
a thing.

~~~
white-flame
Fully asynchronous designs that send a local activity pulse along with their
data, perhaps.

------
spullara
I remember when DRAM "wait states" were 0, 1 or 2. They aren't advertised
anymore because they are 1-2 orders of magnitude worse than that now.

[https://en.wikipedia.org/wiki/Wait_state](https://en.wikipedia.org/wiki/Wait_state)

------
pslam
> Note that we’re 40 years of Moore’s law scaling later and the available
> memory bandwidth per instruction has gone down substantially.

This is unfair, and (this is me being unfair now) the article is missing the
forest for the trees.

There were engineering pressures which resulted in the current ratios. I think
it is fairer to say that the current situation — where there's about (hand-
waving) 1 byte per instruction of bandwidth per core — reflects the kinds of
tasks we expect our machines to be doing. It is very rare to find a task which
is memory speed bound. There's almost always substantial processing to be done
with data.

It's not even that hard to increase memory bandwidth. You "just" double up
memory channels. This is of course expensive, which in turn is a back-pressure
that results in architectures designed around the current sweet spot.

I'm also puzzled that the author thinks the situation is "worse". Pretty much
every desktop-class machine I used from about 1990-2005 was extremely starved
of memory bandwidth, and cores did a far worse job of hiding latency
(out-of-order execution, register renaming, etc.). What we have today feels
fairly comfortable, to me at least, with some outlier tasks where you might
want more (and then obtain specialist hardware).

This is a long-winded way of saying: the current core vs memory speed ratio is
a sweet-spot of cost vs efficiency, and works well given the tasks and
algorithms we execute on these machines. What we had in the olden days was
just a case of unoptimized architecture, which hadn't converged yet.

~~~
nialo
I don't think it's worse. The (possibly missing) context of this post is a
Compute Shader running on a GPU spending more time writing out the results of
a computation than actually doing the computation.

As I read this post, the moral of the story is just to try to write your code
in such a way as to always have that substantial work to do with the data.
Prefer one big pass over everything to several smaller passes, each of which
must write out results, especially on a GPU. Try to actually have 11
instructions per byte of memory access.
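
A toy CPU-side sketch of that moral (the arrays and arithmetic are made up for
illustration): the two-pass version writes an intermediate array out to memory
and reads it back, while the fused version keeps the intermediate value in a
register.

    #include <stddef.h>

    /* Two passes: the intermediate 'tmp' array is written out and read back. */
    void two_passes(const float *in, float *tmp, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            tmp[i] = in[i] * 2.0f;    /* pass 1: writes n floats to memory */
        for (size_t i = 0; i < n; i++)
            out[i] = tmp[i] + 1.0f;   /* pass 2: reads them all back */
    }

    /* One fused pass: the intermediate value never leaves a register. */
    void fused_pass(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * 2.0f + 1.0f;
    }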

I don't think the intent is to make any argument in particular about the state
of CPU or GPU design or anything of the sort.

------
PaulHoule
Memory bandwidth holds back a number of "revolutionary" advances in computing
- FPGAs, ASICs, and new processor types - not to mention indium phosphide
parts that clock at 50 GHz and could go to 200 or more.

------
ant6n
"Code that runs OK-ish on that CPU averages around 1 instruction per cycle,
well-optimized code around 3 instructions per cycle."

Is unoptimized code really that bad?

I thought that between modern compilers and out-of-order execution supporting
4 ops/cycle, a new Kaby Lake CPU would average more than 1 instruction per
cycle overall. Or are branch delays just killing the performance? (Code is
small relative to data; most of it should be in cache most of the time.)

~~~
fulafel
Put another way, he's saying if the dataset doesn't fit in cache, it's
bottlenecked on memory rather than the CPU's computational resources. It might
still be good code.

Transforming an arbitrary computational task into a perfectly balanced "n
operations of computation per n bytes of data accessed" suited to the hardware
du jour is not a generally solved problem, as much as programmers like to talk
about memory/speed tradeoffs.

~~~
ant6n
I understand that the overall point of the article is memory blocking the CPU;
but the author uses instruction throughput vs. memory bytes per cycle as a way
to show that the memory throughput isn't enough just to keep the executing
code fed. So in order to make that argument, you'd have to look at the
instruction throughput without the assumption that memory accesses slow you
down.

Otherwise the point of comparing CPU speed vs. memory speed is moot, if you
start with the assumption that CPU speed is bound by memory speed to begin
with.

------
pqr
See also Roofline Model
[https://en.wikipedia.org/wiki/Roofline_model](https://en.wikipedia.org/wiki/Roofline_model)

and Memory Wall
[https://www.google.co.in/search?q=memory+wall+computer+archi...](https://www.google.co.in/search?q=memory+wall+computer+architecture)

~~~
socmag
Oh nice articles

------
KaiserPro
One interesting quirk we found with Nvidia cards was that the CPU-GPU transfer
speed was affected by how many physical CPUs you had.

I assumed it was because, with newer Intel CPUs, the PCIe bridge moved onto
the die, away from the motherboard, so there was CPU affinity for the GPU.

The difference in bandwidth was around 15%

~~~
zbjornson
Which way was the association, more CPUs -> faster CPU-GPU transfer, or
slower? By "physical CPUs" I assume you mean cores in a single socket, but in
pre-Haswell multi-socket systems, the uncore clock speed was tied to the core
clock speed, so idle CPUs that enter a low power state can actually cause
horrible performance because cache coherence snoop signals have to wait for
the low-power/slow CPUs to respond. See [https://software.intel.com/en-
us/forums/intel-moderncode-for...](https://software.intel.com/en-
us/forums/intel-moderncode-for-parallel-architectures/topic/379378)

~~~
KaiserPro
sorry, yes, that was the most important part!

A single CPU was faster than dual.

This was a Xeon X56__ vs. an E5-26__.

On the Z620 the second processor was on a removable daughter board, so it was
a simple test.

On the E5-26__ it was most pronounced; the X56__ was less obvious.

------
socmag
One thing I'd love to see is to have enough L1 data cache per core for a
modest amount of stack space.

Not so critical on GPUs, but it would make a huge difference for CPUs and
languages that can take advantage of it.

Give me a meg or two to play with. It would make a huge difference for
data-heavy workloads.

You could even go as far as having a separate cache just for stack.

I mean, by its very definition it is isolated. It's the "register file" of
CISC machines

~~~
nhaehnle
I don't think that idea makes sense. A lot of stack space is unused at any
particular time. There's not a one-size-fits-all partition that makes sense,
and it's not something that programs could easily tune either.

Some obvious stack-related ideas that _are_ worth exploring IMHO are (1) using
the stack pointer to guide the prefetcher, to run ahead of cache misses when
you return from subroutines and go up the stack; and (2) having an instruction
that helps avoid unneeded cache misses when going _down_ the stack; for
example, a dedicated instruction for decrementing the stack pointer which
zeros entire cache lines directly in L1 without loading them. Or perhaps, for
more generality, an instruction for marking any arbitrary address range in
this way.

Who knows, at least the first one might already be in use in some
microarchitectures!

~~~
socmag
Yes, I agree that a lot of stack space goes unused right now, that's for sure,
but the nice thing about the stack is that it is definitely thread/core-local,
and if you did have a fast-stack capability we would definitely use it.

What I was thinking about was a separate bus to a per thread stack space with
no competitors. That seems interesting, and I don't think it would be that
hard to add.

I really like your ideas there, especially number 1, that's very smart.

Number 2 is clever as well.

> Who knows etc..

ORLY? Good to know.

------
xchaotic
Should we design the processing units differently then?

There's been talk about hundreds of cores, all with local memory, but this
will only speed up a certain subset of computing problems...

