
Effects of CPU Caches - chewxy
https://medium.com/@minimarcel/effect-of-cpu-caches-57db81490a7f
======
navjack27
There are a couple of issues here and there but I'm just glad another software
guy is looking into hardware and not just assuming the compiler can figure
everything out automatically.

~~~
Cthulhu_
I had a Go course (Ultimate Go from William Kennedy) where he basically says
he's not going to teach you the language, since it's simple enough if you've
got a bit of programming experience. Instead, the first day was mostly about
the pretty much 1:1 relationship between code and hardware, using show instead
of tell, e.g. iterating over a 2D array row-wise vs. column-wise and showing
the difference (memory is laid out row by row), and also a lot about CPU
caches, although there he gratuitously showed us a lot of footage from a great
presentation by an Intel guy.

~~~
Rafuino
Any chance the Intel guy presentation is public and you can share? I'd like to
take a look myself.

------
throwaway84742
Few people realize this, but completely random RAM access can have lower
throughput than e.g. large linear reads from SSD. That’s one of the reasons
why I don’t trust languages which give me little to no control over data
locality.

~~~
banachtarski
Pages from disk are resident in memory once retrieved, so linear reads from
ANY disk (not just an SSD) may be faster than random RAM access at some point
(the curves intersect after you pay the initial seek cost). If you're smart
about queuing additional page fetches while traversing existing pages, you can
accelerate this even more.

~~~
throwaway84742
Not even when pages are resident in memory. I mean after it gets going,
ignoring the initial latency of access, which is indeed staggering by CPU
standards. To dramatically slow down memory access, it's enough to just miss
the cache every time. Here's another thing most people don't know: a cache
miss results in at least 100-200 cycles of delay to fetch from main RAM.
People don't quite get how long that is. A good demonstration is to count
aloud to 200, and then realize that if it were an L1 hit, you'd be doing
several operations simultaneously per count. As CPU frequencies increase, this
number trends upward. All of a sudden your ~26GB/s of memory bandwidth turns
into a pumpkin.

------
m0skit0
Related: [https://www.youtube.com/watch?v=YQs6IC-vgmo](https://www.youtube.com/watch?v=YQs6IC-vgmo)

~~~
tzahola
Nice. I was thinking about this lecture too.

------
sokoloff
> Because the point of this experimentation was to iterate over a list by
> accessing the memory and so to measure the time for each iteration, I had to
> be sure every executed operation at each iteration was the most stripped-down
> set of instructions. Hard to be shorter than one cpu instruction for the
> given loop:

I’ll wager that the author didn’t look at the generated assembly: if that loop
had only one CPU instruction, that instruction would be an increment, making
the analysis of cache effects moot.

~~~
goldenkey
In a hand-rolled assembly loop, the array pointer can be the iterator itself.
In fact, this would be the ideal baseline to compare the linked-list version
against, as the linked list would have no incremented iterator other than the
pointer itself.

~~~
mini_marcel
Yes, you're right! Since I was using the benchmarking tools of go test, I had
to use this increment. The main goal was to compare exactly the same
operations between benchmarks.

However, in the last test I ran, where I walked indefinitely over the linked
list, I guess I was closer to the behavior you're pointing out.

------
puzzle
One of the effects first noticed in CS papers maybe 20 years ago is that
binary heaps should NOT start at the beginning of a cache line, which is what
most programmers would assume is the wise thing to do. We're always told to
align to cache boundaries. But if you don't move the root node to the end of a
cache line (or at least toward the middle, depending on line size and node
size), traversing the heap will touch more lines, since the sibling nodes you
need to compare will often/always be in separate cache lines.

~~~
mini_marcel
Hi! I may be misunderstanding your comment, so correct me if I'm wrong.

For instance, if each item has a size of 64B (for N=7), which is the size of a
cache line, then whatever the position of the first element within the first
cache line, the next item will necessarily be located in the next cache line.
Did I miss something?

~~~
puzzle
In a binary heap, there's a single root, but after that, there are pairs of
nodes. In your case, each item is the same size as a cache line, so things are
not too bad: during traversal, you'll always read two whole lines for each
pair of nodes. You could still worry about the CPU cache's n-way
associativity, but that either requires a lot of control over memory
allocation layout or provides diminishing returns.

Anyway, the interesting case is when items are smaller than a line. Look at an
array-based implementation of a heap:

[https://en.m.wikipedia.org/wiki/Binary_heap#Heap_implementation](https://en.m.wikipedia.org/wiki/Binary_heap#Heap_implementation)

Root is item 0. Assume it's aligned to a multiple of 64 bytes and you have two
32-byte items per line. When comparing the sibling pairs at array indices 1&2,
3&4, 5&6, etc., notice how the two items are always in different cache lines:
item 1 is in the same line as item 0, not item 2, so you're always crossing a
line boundary. If the root were at index 1 in the same array (or, equivalently,
at index 0 if the array started at a multiple of 64 plus an offset of 32),
then each pair would always be in the same cache line.
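The index arithmetic can be checked mechanically; here's a small Go sketch
assuming the 64-byte lines and 32-byte items from the example above:

```go
package main

import "fmt"

const (
	lineSize = 64 // assumed cache-line size
	itemSize = 32 // two heap items per line
)

// line returns which cache line the item at heap index i occupies,
// given the byte offset of index 0 from a line boundary.
func line(i, baseOffset int) int {
	return (baseOffset + i*itemSize) / lineSize
}

// pairsShareLine reports, for the sibling pairs (1,2), (3,4), (5,6),
// whether both siblings land in the same cache line.
func pairsShareLine(baseOffset int) []bool {
	var out []bool
	for i := 1; i < 7; i += 2 {
		out = append(out, line(i, baseOffset) == line(i+1, baseOffset))
	}
	return out
}

func main() {
	// Root on the line boundary: every sibling comparison straddles lines.
	fmt.Println("offset 0: ", pairsShareLine(0)) // prints [false false false]
	// Root shifted by one item: each sibling pair shares a line.
	fmt.Println("offset 32:", pairsShareLine(itemSize)) // prints [true true true]
}
```

So with the shifted layout, each comparison during traversal touches one line
instead of two.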

In case it wasn't clear, I was pointing out a similar phenomenon with other
data structures; it wasn't about linked lists.

~~~
mini_marcel
Ok yes!! This is perfectly clear now, thanks. I didn't know about it; very
interesting, and a smart optimisation of cache line use!

I've also read that processors can request the words of a cache line in a
different order, fetching the hot word first. That may help a little when this
kind of structure is not aligned in an optimized fashion. It doesn't work for
hardware prefetch, though.

------
adrianN
This dissertation goes into great detail on this topic: [http://people.mpi-inf.mpg.de/~tojot/papers/TowardBetterComputationModelsForModernMachines.pdf](http://people.mpi-inf.mpg.de/~tojot/papers/TowardBetterComputationModelsForModernMachines.pdf)

------
tralarpa
> The hardware prefetch is not really efficient when N>7, because each list
> element is spread over several cache lines, plus the prefetching is not able
> to cross page boundaries.

Could you explain the first part of the sentence (about the list elements
spreading over several cache lines)? It's true that the elements do not fit
into a cache line for N>7. But is that important for the prefetcher?
Shouldn't the regular access pattern to the "en.n" be detected by the streamer
prefetcher (without shuffling, of course)?

~~~
mini_marcel
Hi, thanks for your question.

You're right, and this is exactly why, a few paragraphs later, I cast doubt on
that statement, saying that the prefetcher is certainly not reading the entire
array and may behave exactly as in the N=7 case. I think this is more about
the fact that the prefetcher can't cross page boundaries.

That's still right for N<7, though. I still have to dig into this part :)

Thanks again!

~~~
tralarpa
This page [http://hexus.net/tech/reviews/cpu/37989-intel-core-i7-3770k-22nm-ivy-bridge/?page=2](http://hexus.net/tech/reviews/cpu/37989-intel-core-i7-3770k-22nm-ivy-bridge/?page=2)
claims that i7 CPUs have a "next-page prefetcher". I don't know how different
i5 and i7 are. Do you think your i5 CPU doesn't have it?

~~~
mini_marcel
Sorry for the delay, I took some time to do additional research.

What I found is that my i5 belongs to the family listed as "Products formerly
Haswell". I don't know for certain whether it has the next-page prefetcher,
but I would say yes.

Looking at the document that a reader on Medium linked for me
[https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf)

it looks like the NPP was introduced with the Ivy Bridge family (paragraph
2.4.7), and paragraph 2.3 says: "The Haswell microarchitecture builds on the
successes of the Sandy Bridge and Ivy Bridge microarchitectures."

That said, how the NPP works is not really clear. I'd say it's maybe more
about prefetching the next page to avoid TLB misses; in this document I didn't
find any statement about data prefetches crossing a page boundary.

