
Memory and Native Code Performance - DanielRibeiro
http://www.infoq.com/news/2013/06/Native-Performance
======
nkurz
_Reading a single 32-bit integer from the L1 cache takes 4 cycles. The L2 cache
takes 12 cycles, the same amount of time it would take to calculate the square
root of a number. The L3 cache takes more than twice as long, 26 bytes.
Reading from memory takes far, far longer._

I guess this is true, but it's sort of misleading. Adding some details as I
understand them:

1) On Sandy Bridge, a read from cache takes the same amount of time whether
you are reading a single byte or a 32-byte AVX vector. That said, it is often
faster to read integers individually than to read them as a vector and split
them.

2) While the L2 latency is comparable to taking a single square root, if using
an XMM register you can also take 4 square roots in the same number of cycles
(see the sketch below). Viewed this way, if you can use SIMD, you could
consider each square root to be comparable to an L1 hit.

3) For DDR3-1600 10-10-10 (current fast standard) a read from main memory will
take 12.5 ns to start returning data. At 4GHz, this is 50 CPU cycles.
DDR3-1333 10-10-10 is 15 ns --- about 60 cycles. This is about twice as long
as a read from L3: longer, but not "far, far longer".

[Note to author: s/26 bytes/26 cycles/]
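
To make point 2 concrete, here is a minimal sketch (mine, not from the talk)
using the SSE intrinsics: a single sqrtps instruction computes four
single-precision square roots, so if the data is already packed, the per-root
cost is in the same ballpark as an L1 hit.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        float in[4]  = {1.0f, 4.0f, 9.0f, 16.0f};
        float out[4];

        __m128 v = _mm_loadu_ps(in);  /* load 4 floats into an XMM register */
        v = _mm_sqrt_ps(v);           /* 4 square roots in one instruction */
        _mm_storeu_ps(out, v);

        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }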

------
long
I'm a decent coder but don't know much about anything near the metal.

Is there a good introductory resource for learning about low-level details
like the ones discussed in the post?

~~~
Radim
Try Drepper's (of GNU C Library fame) _What Every Programmer Should Know About
Memory_.

[http://www.akkadia.org/drepper/cpumemory.pdf](http://www.akkadia.org/drepper/cpumemory.pdf)

Slightly dated, but still an excellent introduction.

------
octo_t
And this is why -Wpadded is always a useful clang/gcc flag to enable when doing
optimisations.
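
For example, here is a minimal sketch (not from the article) of the kind of
layout -Wpadded complains about: the compiler inserts padding to align the
32-bit field, and simply reordering the members recovers most of the wasted
bytes.

    #include <stdint.h>

    struct padded {      /* -Wpadded warns: 3 pad bytes after 'a', 3 after 'b' -> 12 bytes */
        uint8_t  a;
        uint32_t x;
        uint8_t  b;
    };

    struct reordered {   /* same fields, 8 bytes: only 2 trailing pad bytes remain */
        uint32_t x;
        uint8_t  a;
        uint8_t  b;
    };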

------
carterschonwald
summary of this and every other article: make sure you're doing sequential
scans in your tight inner loop, and ideally also exploit memory alignment and
cache sizes while you're at it.
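
For instance, here is a small sketch (mine, not from the article) of what the
"sequential scans" advice means in practice: both loops sum the same array, but
the row-order walk uses every byte of each cache line it pulls in and keeps the
prefetcher busy, while the column-order walk touches a new line on every step.

    #include <stddef.h>

    #define N 4096

    static float grid[N][N];

    float sum_row_order(void) {      /* sequential: walks memory in address order */
        float s = 0.0f;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += grid[i][j];
        return s;
    }

    float sum_column_order(void) {   /* strided: jumps N*4 bytes between accesses */
        float s = 0.0f;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += grid[i][j];
        return s;
    }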

------
bcoates
I don't understand the "Multi-core effects" one. Surely the linear memory
scans in c[i] = a[i] + b[i] will result in cache prefetch and thus be limited
by memory bus bandwidth and not cache size?
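
For reference, the loop in question as I read it (my sketch, not the article's
code); with three sequential streams the prefetcher should keep data flowing,
so I'd expect the limit to be memory bandwidth once the arrays outgrow the
caches:

    #include <stddef.h>

    /* c[i] = a[i] + b[i]: two streams read, one written, all sequential */
    void vector_add(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }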

------
betterunix
I suspect that a better matrix multiplication algorithm would have a much more
dramatic effect on performance than improving the memory access pattern:

[https://en.wikipedia.org/wiki/Strassen%27s_algorithm](https://en.wikipedia.org/wiki/Strassen%27s_algorithm)

Constant factor improvements are great, but asymptotic improvements are
better.

~~~
dottrap
You should benchmark, but you are probably wrong. The whole talk was about how
hardware (i.e. memory accesses) matters. Even the Wikipedia article you linked
to suggests this.

"Earlier authors had estimated that Strassen's algorithm is faster for
matrices with widths from 32 to 128 for optimized implementations. However, it
has been observed that this crossover point has been increasing in recent
years, and a 2010 study found that even a single step of Strassen's algorithm
is often not beneficial on current architectures, compared to a highly
optimized traditional multiplication, until matrix sizes exceed 1000 or more,
and even for matrix sizes of several thousand the benefit is typically
marginal at best (around 10% or less)."
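
For a sense of what "highly optimized traditional multiplication" leans on,
here is a minimal sketch (mine, not from the talk or the study) of cache
blocking, one of the memory-access tricks that keeps the plain O(n^3)
algorithm ahead of Strassen at moderate sizes. BLOCK is a tuning assumption,
not a measured value.

    #include <stddef.h>

    #define BLOCK 64   /* tile edge; chosen so a few tiles fit in cache (tuning assumption) */

    /* C += A * B for n x n row-major matrices; assumes n is a multiple of BLOCK. */
    void matmul_blocked(const double *A, const double *B, double *C, size_t n) {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    for (size_t i = ii; i < ii + BLOCK; i++)
                        for (size_t k = kk; k < kk + BLOCK; k++) {
                            double aik = A[i * n + k];
                            for (size_t j = jj; j < jj + BLOCK; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }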

~~~
betterunix
Sure, but that benefit will become increasingly large as the problem size is
increased. Benchmarking should be done, of course, but only as a way to
determine _where_ Strassen's algorithm begins to dominate. Note also that that
quote supports what I said, at least in the limit: Strassen's algorithm _does_
improve performance more dramatically for large problem sizes (though how much
more dramatically is certainly debatable).

I am not going to deny that hardware matters, but it is important to know
_where_ it matters. Hardware matters in terms of constant factors. For
sufficiently large problem sizes, hardware might give you a 10x improvement if
you use it correctly -- but that is 10x no matter how large the problem size
becomes. An algorithmic improvement will become increasingly beneficial as the
problem size increases; that is why we speak of things like "crossover points,"
where an asymptotically faster algorithm begins to dominate. Even very poor
use of hardware features (terrible locality of reference, etc.) will
eventually be irrelevant for an asymptotically better algorithm; a large
number of really good uses of the hardware is still slower than a small number
of really bad uses.

~~~
comefrom30
Sometimes the problem size doesn't increase. Sometimes maybe you just need to
do lots and lots and lots of small matrix multiplications.

