
Myths Programmers Believe about CPU Caches (2018) - noego
https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/
======
strstr
If cache coherence is relevant to you, I strongly recommend the book “A Primer
on Memory Consistency and Cache Coherence”. It’s much easier to understand the
details of coherency from a broader perspective, than an incremental read-a-
bunch-of-blogs perspective.

I found that book very readable, and it cleared up most misconceptions I had.
It also teaches a universal vocabulary for discussing coherency/consistency,
which is useful for conveying the nuances of the topic.

Cache coherence is not super relevant to most programmers though. Every
language provides an abstraction on top of caches, and nearly every language
uses the “data race free -> sequentially consistent” model. Having an understanding
of data races and sequential consistency is much more important than
understanding caching: the compiler/runtime has more freedom to mess with your
code than your CPU (unless you are on something like the DEC Alpha, which you
probably aren't).
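
To make "data race free -> sequentially consistent" concrete, here is a minimal C11 sketch (my own example, not from the book): as long as every access to shared data is synchronized, the language guarantees a sequentially consistent outcome, no matter what the caches or the CPU do underneath.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Data-race-free: every access to 'counter' goes through an atomic, so
     * the DRF-SC guarantee promises a sequentially consistent result.
     * With a plain int instead, this would be a data race (undefined
     * behaviour) and neither the compiler nor the CPU owes you anything. */
    atomic_int counter;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(&counter, 1);   /* seq_cst by default */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%d\n", atomic_load(&counter));  /* always 200000 */
        return 0;
    }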

If you are writing an OS/Hypervisor/Compiler (or any other situation where you
touch asm), cache coherence is a subject you probably need a solid grasp of.

~~~
jblow
Disagree on that last part. If more programmers understood cache coherency,
maybe their programs would not run like a giant turd.

~~~
strstr
Most engineers don't write code with hard performance constraints. Game devs
probably do need to fight for every frame.

For the bulk of the engineers I work with, the concept of StoreLoad reordering
on x86 would be an academic distraction.

~~~
AnIdiotOnTheNet
> Most engineers don't write code with hard performance constraints.

Only because most software "engineers" don't give two shits about the actual
user experience of their glacially slow over-engineered garbage.

~~~
hvidgaard
99.999% of performance issues in development are related to the abstract
model, and not the underlying implementation details. Things such as searching
in a big unordered list, repeating the same work, stupid SQL queries, etc.

~~~
elweston2
The way I usually follow it is:
1) Is this OK? For example, is it only called once in the code, in an error path?
2) Did I do something silly? For example, I left in some extra debug code.
3) Is the code doing something silly? For example, extra work, or dragging in extra data that is not needed?
4) Is the code written in a way that causes extra work? Re-init on an inner loop, loop hoisting, etc.
5) Is there a better algorithm for this? Binary search vs linear?
6) Is the code doing too many things at once? De-serialize it, re-serialize it.
7) Is there a better abstraction for this? Monolith code vs microservice?
8) Are there any compiler switches that are 'easy' wins? Packing, -O2, etc.? Usually not.
9) What sort of memory architecture does this machine have? It varies between machines even within the x86 world. For example, if I rearrange my memory structures, do I use less cache and emit less code? The cache line on this box is x bytes and on that box is y bytes. Some languages do not lend themselves well to this one due to the way their VM/ISA is written.
10) Is there some asm that would help? Usually not.

Usually 1-7 are all you need. If you get all the way to 10 you are in the deep
end for most things.

Big O is good for many things. But in reality big O is O(N)+C where the C can
get you. That is where the later steps help. But usually you can totally
ignore it. Most of the big wins I get are just from swapping out a bad search
for an O(log(n)) search, or removing 'extra' code.

~~~
hvidgaard
On the big-O notation, what you have to remember is that it's an upper bound on the
runtime. There is no guarantee other than the growth of the function.

~~~
elweston2
Oh I agree. It is just that +C bit. What happens when your O(n log(n))
algorithm basically just thrashes the cache because of your data structures?
Yet maybe something 'worse' would run better because of the way the cache
works? That +C bit can sometimes turn into its own big O. It is a nasty little
gotcha when it does happen.

It is a good framework to get you into the ballpark of the correct thing;
usually, 99% of the time, it is right. But sometimes the architecture bites back due to
your data.

------
dragontamer
A good, introductory, high-level overview of what is going on with cache
coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than x86 does.
Memory barriers (which also function as "compiler barriers" for the memory /
register issue discussed in the article) are handled for you as long as you
properly use locks (or other synchronization primitives like semaphores or mutexes).

It's good to know how things work "under the covers", for performance reasons at
least. Especially if you ever write a lock-free data structure (where you are
not allowed to use... well... mutexes or locks), and so need to place the
barriers in the appropriate spots yourself.

\------

I think the acquire/release model of consistency will become more important in
the coming years. PCIe 4.0 is showing signs of supporting acquire/release...
ARM and POWER have added acquire/release model instructions, and even CUDA is
building in acquire/release semantics.

As high-performance code demands faster and faster systems, our systems will
become more and more relaxed. Acquire/release is quickly becoming the
standard model.
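
For the unfamiliar, a minimal sketch of the acquire/release pattern in C11 (my own example): the producer's release store and the consumer's acquire load form a matched pair of one-sided barriers, with no full fence needed.

    #include <stdatomic.h>

    int payload;               /* ordinary, non-atomic data */
    atomic_int ready;          /* flag used to publish it */

    void producer(void)
    {
        payload = 42;
        /* Release: everything written before this store becomes visible
         * to any thread whose acquire load sees ready == 1. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                  /* spin until published */
        return payload;        /* guaranteed to read 42 */
    }

On ARMv8 a compiler can map the acquire load and release store directly onto the LDAR and STLR instructions.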

~~~
jabl
> ARM and POWER have added acquire/release model instructions

They have implemented the acquire-release consistency model since day one (or,
the day they started supporting multi-processors). Yes, there are some
subtleties there that have in some cases been tightened later on, e.g. multi-
copy atomicity.

~~~
dragontamer
IIRC, there was a big stink because ARM and POWER historically implemented
consume/release semantics, which is very slightly more relaxed than the now
de-facto standard acquire/release semantics.

ARM and POWERPC CPU devs worked very hard to get consume/release into C++11,
but no compiler writer actually implemented that part of the standard. As
such, consume/release can safely be consigned to the annals of computer
history (much like DEC Alpha's fully relaxed semantics).

Then in ARMv8, ARM simply added LDAR (load-acquire) and STLR (store-release)
instructions.
[https://developer.arm.com/docs/100941/0100/barriers](https://developer.arm.com/docs/100941/0100/barriers)
. So ARM CPUs now fully support the acquire/release model. Apparently
IBM's POWER instruction set was similarly strengthened to acquire/release
(either POWER8 or POWER9).

ARM / POWER "normal" loads and stores are still consume/release semantics. But
compilers can simply emit LDAR (load-acquire) for the stronger guarantee.

\----------

I remember at least one talk that showed that consume/release is ideal for
things like Linux's RCU or something like that (that acquire/release is
actually "too strong" for RCU, and therefore suboptimal). But because
compiler-writers found consume/release too hard to reason about in practice,
we're stuck with acquire/release.
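
Roughly, the pattern consume was meant for looks like this in C11 (my sketch): the reader only needs ordering through the pointer dependency, which ARM and POWER give for free, but compilers today just promote consume to acquire.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node { int value; };

    _Atomic(struct node *) shared;   /* starts out NULL */

    void publish(int v)
    {
        struct node *n = malloc(sizeof *n);
        n->value = v;
        /* Release: initializing *n happens-before the pointer being seen. */
        atomic_store_explicit(&shared, n, memory_order_release);
    }

    int reader(void)
    {
        /* Consume: only reads that data-depend on 'n' (like n->value) need
         * to be ordered, which is what RCU-style readers want.  In practice
         * compilers implement this as an acquire load. */
        struct node *n = atomic_load_explicit(&shared, memory_order_consume);
        return n ? n->value : -1;
    }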

It seems like the C++ standard continues to evolve to push for
memory_order_consume ([http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2018/p075...](http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html)), but all the details
are still up for discussion.

~~~
jabl
> ARM and POWERPC CPU devs worked very hard to get consume/release into C++11

AFAIR Paul McKenney was the primus motor, and the motivation was largely RCU.
Then again, McKenney also worked for IBM at the time and certainly had an
interest in pushing a model that mapped well to POWER.

But it turned out to be both somewhat mis-specified and hard to implement
cleanly, so most compilers just implemented it as an acquire.

As you mention, there is ongoing work to fix it.

As for ARM, it seems the big thing they've done since the initial release of
ARMv8 is to banish non-multicopy atomicity. See
[https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-
draft.pd...](https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf)

------
sherincall
One thing not mentioned here (nor in previous discussions of the article, it
seems) is that DMA is typically not coherent with the CPU caches. This is
kinda visible from the little diagram at the top, with the disk sitting on the
other side of the memory, but it should be explicitly spelled out. If you're
using a DMA device (memory<->device or memory<->memory copies), you might end
up in a state where the DMA and the CPU see different values. This usually
means data transfer to/from a Disk or GPU, though other peripherals might use
it too.

Your options here are either to manually invalidate your caches and
synchronize with the DMA (e.g. via interrupts), or to request from the OS that
the given memory section be entirely uncached; or in some cases, you can get
away with a write-through cache policy, if the DMA is only ever reading the
memory.
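
For the "synchronize with the DMA" option, the shape of the code in a Linux driver looks roughly like the sketch below. The dma_* calls are the real streaming-DMA API; start_device_dma() and wait_for_device_irq() are hypothetical placeholders for the device-specific parts.

    #include <linux/dma-mapping.h>

    /* Sketch of a device-to-memory transfer in a hypothetical driver. */
    static int receive_into(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        start_device_dma(dev, handle, len);     /* hypothetical */
        wait_for_device_irq(dev);               /* hypothetical */

        /* Give the buffer back to the CPU: on non-coherent hardware this
         * invalidates any stale cache lines covering it. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
        /* ... read buf here ... */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
    }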

~~~
AllanHoustonSt
I think DPDK does some user-level trickery to achieve per-core caching through
DMA; do you happen to know how they go about it?

------
lettergram
For those interested (in 2014!) I did a rather simple analysis of CPU caches
and for loops to point out some pitfalls:

[https://austingwalters.com/the-cache-and-
multithreading/](https://austingwalters.com/the-cache-and-multithreading/)

Hope it helps someone. I tend to link it to my co-workers when they ask me why
I PR'd a re-ordering of loops & functions, or when they ask how I get speedups
without changing functionality.

------
nemothekid
I've never heard of the MESI protocol before so that was really interesting to
read, and I liked the comparison to distributed systems.

I'm wondering if the MESI protocol could be used in a networked database
setting? I feel like you would need a master node to coordinate everything
though (like the L2 does in the example).

~~~
cperciva
The essential point of MESI is that it _doesn't_ need a single master. In a
sense, anyone who has exclusive ownership of a cache line is the "master" for
that one cache line.

The downsides of MESI are that (a) it requires broadcasts, which don't scale
very well; and (b) it doesn't tolerate partitions -- which also imposes an
effective scaling limit, since large systems are always partitioned (usually
with a partition of N-k and k partitions of 1, due to k nodes having failed).

~~~
jabl
> The downsides of MESI are that (a) it requires broadcasts

No, it can be implemented with a directory instead, e.g.

[https://en.wikipedia.org/wiki/Directory-
based_cache_coherenc...](https://en.wikipedia.org/wiki/Directory-
based_cache_coherence)

Or various combinations of snooping and directories ("snoop filters", or
directories that act as "bridges" between broadcast domains, etc.).

In current Xeon processors (and presumably AMD EPYC as well, though I don't
yet have first-hand experience with those), you have a couple of directories
per CPU with snoop filtering, as with tens of cores broadcasting becomes a
scalability bottleneck. In the BIOS you can change the mode in which it
operates, with slightly different names and semantics depending on the CPU generation.

------
dakom
Something I don't understand is how to deal with cache coherency when you need
the same data in a bunch of different configurations.

Take a typical game loop and assume we have a list of Transforms (e.g. world
matrix, translation/rotation/scale, whatever - each Transform is a collection
of floats in contiguous memory)

Different systems that run in that loop need those transforms in different
orders. Rendering may want to organize it by material (to avoid shader
switching), AI may want to organize it by type of state machine, Collision by
geometric proximity, Culling for physics and lighting might be different, and
the list goes on.

Naive answer is "just duplicate the transforms when they are updated" but that
introduces its own complexity and comes at its own cost of fetching data from
the cache.

I guess what I'm getting at is:

1) I would love to learn more about how this problem is tackled by robust game
engines (I guess games too - but games have more specific knowledge than
engines and can have unique code to handle it)

2) Does it all just come out in the wash at the end of the day? Not saying
just throw it all on the heap and don't care... but, maybe say optimizing for
one path, like "updating world transforms for rendering", is worth it and then
whatever the cost is to jump around elsewhere doesn't really matter?

Sorry if my question is a bit vague... any insight is appreciated

~~~
yvdriess
Assuming that, once determined at the start of the frame (e.g. after camera
position changes from user input handling), the transform matrices are not
written to, they can then be freely shared across multiple cores without
causing problems with coherency. The cache lines associated with the transforms will be set to
'Shared' across all cores. Cache coherency will start to bite you in the ass
in this situation if you start mutating the transforms while other threads are
reading them, causing cache invalidations and pipeline flushes across all caches
owning those lines.

In short, write a transform once and treat it as immutable. Do not reuse the
Transform allocation for a good while for subsequent frames to ensure that its
cache lines are no longer in cache. If you do need to reuse right away, you
can force-invalidate cache lines by address, so that the single writer in
the next step is the single (O)wner and no other caches need to invalidate
anything.
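
In C terms, a simplified version of that advice is to keep the write-once transforms on their own cache lines, away from anything that keeps getting written, so their lines can sit in the Shared state in every core's cache (my own sketch; 64-byte lines assumed).

    #include <stdalign.h>

    #define CACHE_LINE 64   /* assumption: 64-byte lines, common on x86/ARM */

    struct transform {
        float m[16];        /* written once at the start of the frame,
                               then treated as read-only */
    };

    struct frame_state {
        /* Read-mostly data on its own cache lines: every core can hold
           these lines in the Shared state without invalidating anyone. */
        alignas(CACHE_LINE) struct transform transforms[1024];

        /* Frequently-written bookkeeping kept off those lines, so writes
           to it don't bounce the transforms out of other cores' caches. */
        alignas(CACHE_LINE) unsigned long frame_counter;
    };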

~~~
dakom
Thanks - I'll have to do a bit more learning to really understand this, e.g.
how mutability relates to cache lines and what "Shared" means in that context,
but this gives me some good practical direction and insight to take it further
:)

------
dang
Discussed last year:
[https://news.ycombinator.com/item?id=17670095](https://news.ycombinator.com/item?id=17670095)

------
vagab0nd
This might be a naive question: how did we decide as an industry that cache
should be controlled by the hardware, but registers and main memory by the
compiler?

~~~
yvdriess
What do you mean by 'controlled by hardware'? The registers the compiler
chooses for you are an abstraction themselves; one of the first things a CPU
does is register renaming. The same goes for main memory: you are mostly
presented with an abstraction of main memory, with a ton of layers in between
(load/store buffers, write-combine buffers, coherent caches, etc.)

Turning the argument the other way, you as a programmer can control a lot
about caches: you can prefetch cache lines, invalidate them and use streaming
instructions to bypass them.
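
For example, on x86 (a sketch using real SSE/SSE2 intrinsics; whether any of this actually helps is very workload-dependent):

    #include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
    #include <emmintrin.h>   /* _mm_stream_si32, _mm_clflush */

    void copy_with_hints(const int *src, int *dst, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Hint: pull a line we'll need soon into the cache. */
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_T0);

            /* Non-temporal store: write around the cache entirely. */
            _mm_stream_si32(&dst[i], src[i]);
        }
        _mm_sfence();           /* make the streaming stores globally visible */
        _mm_clflush(&dst[0]);   /* explicitly evict a line we're done with */
    }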

~~~
vagab0nd
What you said makes sense. The question is more from the perspective of the
compiler/assembler. At these layers you explicitly say, move this number to
this register, or read a number from this address. But you very rarely are
able to say, copy this chunk of data into this cache line. Sure there are
exceptions (like the GPU), and there are cases where you can hack it to do
what you want. But in general you don't get to control the cache in a very
specific way.

------
nostrademons
Anyone else start thinking of Rust's mutable/immutable borrow system when
reading the MESI algorithm? It's not quite the same - with Rust, mutable
borrows are never shared, and you can never mutate a shared read-only borrow -
but the principle seems like a simplification of the full MESI protocol.

It seems like this would be generally applicable for a wide variety of
distributed & concurrent applications.

~~~
xakahnx
Keeping the directory coherent is the difficult part when translating
directory-based cache coherence protocols to other distributed systems
problems. The directory is like an oracle that sees every transaction in
order. This is hard in most networked distributed-systems problems, where you
have to worry about availability, network partitioning, or the durability of
that node.

~~~
yvdriess
Indeed, the evolution will probably go in the other direction, with the on-chip
network adopting algorithms from wider networks to deal with scaling problems.
DRAM interfaces used to be pretty simple; now they are being trained almost
like a DSL line.

------
tyingq
AMD's Rome processors are an interesting case, with 8MB of L3 cache per core.
So the 64 core processor has 512MB of L3 cache. It wasn't that long ago that
512MB was a respectable amount of DRAM in a big server. An early 90's fridge
sized Sun 690MP maxed out at 1GB of DRAM and had 1MB of L2 cache, no L3.

~~~
zamadatix
Half that - 4 MB per core so the 64 core CPU has 256 MB (dual socket is where
the 512 number comes from but that's 128 cores and NUMA).

It's also not fully accessible: each core can only directly access the 16 MB
in its group of 4. Everything else is the same as a cross cache read.

~~~
tyingq
Ah, yeah. Mixed up their CCD and CCX terms. The 690MP was dual socket though,
so still a somewhat valid comparison.

------
musicale
> “different cores can have different/stale values in their individual
> caches”.

Different processes can certainly have different versions of the same state,
different values for the same variable, and different values at the same
virtual address.

And what about virtual caches? Non-coherent cache lines?

Moreover, even in the face of cache coherency you can still have race
conditions.

~~~
gpderetta
> Different processes can certainly have different versions of the same state,
> different values for the same variable, and different values at the same
> virtual address.

What do you mean? Either two caches agree on the content of a cacheline, or one
of the cachelines is marked invalid (and the stale content is irrelevant).
There are components of a core that might not respect coherency, like load and
store buffers and arguably registers, but not caches (on cache-coherent
systems of course).

Virtually addressed caches are an issue and that's why they have fallen out of
favor.

------
praptak
The one-line summary seems to be that one should never worry about caches
themselves introducing concurrency bugs.

I mean after we account for memory operations reordering on each core, the
memory address storing a single value that is visible to all cores is a
correct model from the concurrency-correctness point of view, right?

~~~
Tuna-Fish
On x86, the correct model is that on any core, all reads are in order, all
writes are in order, and no write will ever be moved earlier than a read on
that core.

Or in other words, the only kind of visible reordering that is allowed to
occur is that writes can be delayed past reads.

An example of a situation where this is significant:

    
    
        thread 1       thread 2
        mov [X], 0     mov [Y], 0
        mov [X], 1     mov [Y], 1
        mov r1, [Y]    mov r2, [X]
        

After this sequence of code, r1 == r2 == 0 is legal. (As is any other
combination of 1 and 0.)

(edit:) And just to add, all this reordering is of course impossible to detect
on just one core, as when a read request hits a recent write on the same core,
it reads it out of the store queue. This can sometimes be really bad for
performance, though, as if you read a value that is partially in the store
queue (such as writing a 16-bit value to x, then immediately reading 32 bits from x),
some cpus will stall that read, and all that follow it, until the entire store
queue is flushed. Since the store queue can easily take tens if not hundreds
of cycles to clear, this can be very expensive.
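
The same store-buffering litmus test written against the C11 memory model (my sketch): only seq_cst operations rule out the 0/0 outcome; with acquire/release or relaxed ordering it remains legal, just as it is for the raw x86 code above.

    #include <stdatomic.h>

    atomic_int X, Y;
    int r1, r2;

    void thread1(void)
    {
        atomic_store_explicit(&X, 1, memory_order_seq_cst);
        r1 = atomic_load_explicit(&Y, memory_order_seq_cst);
    }

    void thread2(void)
    {
        atomic_store_explicit(&Y, 1, memory_order_seq_cst);
        r2 = atomic_load_explicit(&X, memory_order_seq_cst);
    }

    /* With seq_cst, r1 == r2 == 0 is forbidden (on x86 the compiler pays
     * for this with an mfence or a locked instruction after each store).
     * Weaken either side to release/acquire or relaxed and the 0/0
     * result becomes a legal outcome again. */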

------
blattimwind
It's worth pointing out that the L1 cache and its associated logic is the
only* way a core talks to the outside world, including all I/O ever. With that
in mind it is easy to understand why it is so crucial to performance.

* there might be some minor exceptions

------
wildmanx
The biggest myth is that any of this matters to anybody but a tiny fraction of
niche programmers.

The reason some "app" is slow is not because of cache coherence traffic. It's
because somebody chose the wrong data structure, created some stupid
architecture, wrote incomprehensible code that the next guy extended in a
horribly inefficient way cause they didn't understand the original. My web
browsing experience is slow because people include heaps of bloated JS crap
and trackers and ad networks so I have to load 15 megabytes of nonsense and
wait for JS to render stuff I don't want to see. None of this would be any
better if anybody involved understood CPU caches better.

Even in the kernel or HPC applications, most code is not in the hot path.
Programmers should rather focus on clean architectures and understandable
code. How does it help if you hand-optimize some stuff that nobody
understands, nobody can fix a bug in, and nobody can extend? That's horrible code,
even if it's 5% faster than what somebody else wrote.

TL;DR: This is interesting, but likely totally irrelevant to your day job. In
the list of things to do to improve your code, it comes so far down that
you'll never get there.

------
gchokov
Half a decade - woow! Sounds like the author has spent half a century there..

------
johnthescott
the entire design of unix is realized in a moment when a motherboard is seen
as a network.

------
1e1f
Should include a tl;dr: your concurrency fears are real, but for registers,
not caches.

~~~
simpsond
If you have two threads reading then writing values in memory, you still need
synchronization/atomic changes at the software level.

------
mrich
Note that this is quite specific to x86; on other architectures like Power
there are much weaker guarantees, which will lead to problems if you assume
the same model.

~~~
gpderetta
which part is x86 specific?

~~~
mrich
To quote from [https://fgiesen.wordpress.com/2014/07/07/cache-
coherency/](https://fgiesen.wordpress.com/2014/07/07/cache-coherency/)

"Memory models

Different architectures provide different memory models. As of this writing,
ARM and POWER architecture machines have comparatively “weak” memory models:
the CPU core has considerable leeway in reordering load and store operations
in ways that might change the semantics of programs in a multi-core context,
along with “memory barrier” instructions that can be used by the program to
specify constraints: “do not reorder memory operations across this line”. By
contrast, x86 comes with a quite strong memory model."

~~~
gpderetta
Yes, I know the difference between the memory model of x86 and, say, ARM. I'm
asking what's x86 specific on this article about cache coherency.

~~~
mrich
The article explicitly mentions two times things that are only true for x86
(grep for it). In addition, the statement at the end is definitely not true
for POWER: "As soon as the data is read/written to the L1 cache, the hardware-
coherency protocol takes over and provides guaranteed coherency across all
global threads. Thus ensuring that if multiple threads are reading/writing to
the same variable, they are all kept in sync with one another."

~~~
gpderetta
Those two things are not x86 specific (the author only gives x86 as an example).
And the statement you quote is certainly true for POWER or any other cache
coherent architecture.

~~~
mrich
Sorry you are right, I confused this with reordering.

