
CPU Cache Flushing Fallacy - dekayed
http://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
======
rayiner
You learn something new every day: "The second major issue with ordering we
have to be aware of is a thread could write a variable and then, if it reads
it shortly after, could see the value in its store buffer which may be older
than the latest value in the cache sub-system."

I was skeptical about this because x86 has such a strongly ordered
memory model, but lo and behold: "HOWEVER. If you do the sub-word write using
a regular store, you are now invoking the _one_ non-coherent part of the x86
memory pipeline: the store buffer. Normal stores can (and will) be forwarded
to subsequent loads from the store buffer, and they are not strongly ordered
wrt cache coherency while they are buffered." (Linus,
<http://yarchive.net/comp/linux/store_buffer.html>).
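
Here's a rough litmus-test sketch (mine, not from the article; plain C11
atomics) of the observable consequence on x86: the only reordering TSO allows
is a store followed by a load of a different location, and that is exactly the
store buffer at work. Run it in a tight loop and both threads can occasionally
read 0:

    /* Classic "store buffering" litmus test. Compile with -pthread.
       On x86 each thread's store can sit in its store buffer while the
       following load goes ahead, so r1 == 0 && r2 == 0 is possible. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x = 0, y = 0;
    int r1, r2;

    void *t1(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);  /* buffered    */
        r1 = atomic_load_explicit(&y, memory_order_relaxed); /* may pass it */
        return NULL;
    }

    void *t2(void *arg) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            puts("both loads saw 0: store->load reordering observed");
        return 0;
    }

Put atomic_thread_fence(memory_order_seq_cst) between the store and the load
in each thread (or make all four accesses seq_cst) and the 0/0 outcome goes
away.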

~~~
haberman
This is blowing my mind. I had no idea this was possible on x86.

~~~
bnegreve
Note that locking primitives include memory fences; that's why you don't run
into re-ordering problems unless you're doing lock-free multicore
programming.

In my experience, the fence is much costlier than the actual locking (at
least on Linux, thanks to futexes [1]). So unless you're willing to get your
hands dirty and remove the fence itself, it is pointless and error-prone to
remove the locking just because you believe it is not necessary.

[1] <http://en.wikipedia.org/wiki/Futex>

------
gcp
Seems like a nice article, though I have 2 nits:

1) I'd have liked it if you had dived into those cases where you do
actually flush the CPU cache. I've run into this maybe once or twice in my
entire career, and that was while doing MIPS kernel drivers. I'm guessing it
would be cool for the audience to understand what shenanigans are needed to
actually require it, particularly as more people will be transitioning from
x86 to ARM.

2) You are ascribing meaning to volatile which it absolutely does not have (in
C/C++). You really should go deeper into load/store memory barriers.
Using volatile in the hope that it gives you some kind of synchronization is
misguided.
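
To make the distinction concrete, here's a rough sketch (C11 atomics, names
made up by me) of the flag-and-payload pattern people usually reach for
volatile to get:

    /* volatile would NOT make this safe; release/acquire ordering does. */
    #include <stdatomic.h>

    int payload;              /* plain data                           */
    atomic_int ready = 0;     /* flag that carries ordering semantics */

    void producer(void) {
        payload = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                 /* spin until the flag is published */
        return payload;       /* guaranteed to observe 42         */
    }

With a volatile int flag you'd stop the compiler caching it in a register,
but you'd get no language-level ordering guarantee between the payload write
and the flag write.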

~~~
pm215
The classic case where you need to do a cache flush is when you're a JIT
writing out native code. Once you've written the instructions into memory you
need to clean the data cache [i.e. ensure that your changes are made visible
to anything 'below' you in the memory hierarchy] and invalidate the icache
[i.e. throw away any stale info it holds for those addresses], so that when
the CPU starts to execute instructions from the memory you've just written it
doesn't get the stale old versions that might otherwise be in the icache. In
fact you only need to clean data out to the point in the memory hierarchy
where the icache and the dcache for all your cores come together, which is
probably not the same as "write it all the way back to system RAM" but is
basically indistinguishable from it as far as the programmer is concerned.

NB that x86 maintains coherency between icache and dcache (unlike ARM, say),
so you don't need to do this on that CPU architecture.
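
If you're on GCC or Clang you don't have to hand-roll the cache ops:
__builtin___clear_cache() does the clean+invalidate for you (and is
essentially a no-op on x86). A rough sketch of the JIT case, with made-up
buffer handling:

    /* Emit code into an executable buffer, then make it safe to run.
       Assumes POSIX mmap and a platform that allows RWX mappings. */
    #include <sys/mman.h>
    #include <string.h>

    typedef int (*jit_fn)(void);

    jit_fn emit(const unsigned char *code, size_t len) {
        unsigned char *buf = mmap(NULL, len,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, code, len);   /* the writes land in the d-cache */
        /* Clean the d-cache to the point of unification and invalidate
           the i-cache for this range; required on ARM, a no-op on x86. */
        __builtin___clear_cache((char *)buf, (char *)buf + len);
        return (jit_fn)buf;
    }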

------
haberman
"Even from highly experienced technologists I often hear talk about how
certain operations cause a CPU cache to "flush"."

Ok.

"This style of memory management is known as write-back whereby data in the
cache is only written back to main-memory when the cache-line is evicted
because a new line is taking its place."

That sounds like a flush to me. Modified data is written back (or flushed) to
main memory.

"I think we can safely say that we never "flush" the CPU cache within our
programs."

Maybe not explicitly, but a write-back is triggered by "certain operations"
(see the first quotation above).

So it sounds like the real "fallacy" the article is discussing is the idea
that a cache flush is something that a program explicitly does. That would
indeed be a fallacy, but I have never heard anyone claim this.

On the upside, the article does give a lot of really nice details about the
memory hierarchy of modern architectures (this stuff falls out of date
quickly). I had no idea the Memory Order Buffers could hold 64 concurrent
loads and 36 concurrent stores.

~~~
rayiner
"Cache flush" usually refers to the idea that all of the lines in the cache
are written back to memory or otherwise invalidated. Writing back a single
line is usually called an eviction or writeback.

------
MichaelGG
"When hyperthreading is enabled these registers are shared between the co-
located hyperthreads."

How does that work? I thought each HT had its own registers - otherwise,
wouldn't that add a lot of complication and overhead? And does that mean if I
disable HT, a program can double the available registers? Wouldn't that need
different machine code?

~~~
brigade
Renamed registers for OoOE, not architectural registers. These number over 100
(200-300 counting FP/SSE) for modern x86 CPUs.

Hyperthreading splits resources (including registers, branch prediction, TLB,
etc.) between the threads. This split is static in Sandy Bridge and older and
dynamic in Ivy Bridge and newer.

~~~
tbrownaw
_Within each core are separate register files containing 160 entries for
integers and 144 floating point numbers. These registers are accessible within
a single cycle and constitute the fastest memory available to our execution
cores. Compilers will allocate our local variables and function arguments to
these registers. When hyperthreading is enabled these registers are shared
between the co-located hyperthreads._

The article is oversimplified here to the point of being incorrect. The
registers the compiler sees don't change (and there aren't 160 of them; I
think there are a dozen or fewer), only the behind-the-scenes ones that the
parent comment describes.

~~~
jules
Actually, while the compiler will not explicitly emit instructions that
access these 160 registers, compilers can and do take advantage of the fact
that under the hood there are 160 registers instead of a dozen or so. They do
this with "instruction scheduling", which orders the instructions so that,
once the hardware has renamed them, they can execute in parallel and thus
need more than the dozen or so architectural registers.

So in a way _"Compilers will allocate our local variables and function
arguments to these registers"_ is correct.

~~~
brigade
Opposite actually - instruction scheduling is significantly more useful for
in-order processors without renaming. Compilers take advantage of OoOE and
register renaming by not caring too much about careful scheduling in favor of
reducing register pressure.

For example, the compiler could order a sequence of load -> store copies as 4x
load -> 4x store, or 4x (load -> store). They're equally fast on an OoOE CPU,
so compilers might choose the latter so as to spill fewer registers elsewhere
in the function.
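
In C the two schedules look something like this (a toy example of my own;
any real compiler will of course make its own choice):

    /* (a) hoist all loads, then do all stores: four values live at once */
    void copy4_hoisted(int *dst, const int *src) {
        int r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
        dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
    }

    /* (b) interleave load/store pairs: only one value live at a time */
    void copy4_interleaved(int *dst, const int *src) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
    }

On an in-order core (a) hides the load latency; on an OoOE core with renaming
both run about the same, so the compiler can prefer (b) and keep three
registers free for the rest of the function.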

~~~
jules
It's true that the need for instruction scheduling is reduced in an OoO
processor, but now you're comparing it to a processor that does not do OoO and
hence needs to expose its entire register set as architectural registers. A
compiler for an OoO processor still needs to be aware that under the hood
there are 160 registers that can be used instead of 14 (at least if it wants
to generate the most efficient code).

------
nkurz
_I think we can safely say that we never "flush" the CPU cache within our
programs._

Perhaps true, but not for lack of trying!

For benchmarking compression algorithms from a cold cache, I've been trying to
intentionally flush the CPU caches using WBINVD (Write Back and Invalidate
Cache) and CLFLUSH (Cache Line Flush). I'm finding this difficult to do, at
least under Linux for Intel Core i7.

1) WBINVD needs to be called from Ring 0, which is the kernel. The only way
I've found to call this instruction from user space is with a custom kernel
module and an ioctl(). This works, but feels overly complicated. Is there
some built-in way to do this?

2) CLFLUSH is straightforward to call, but I'm not sure it's working for me. I
stride through the area I want uncached at 64-byte intervals calling
_mm_clflush(), but I'm not getting consistent results. Is there more that I
need to do? Do I need MFENCE both before and after, or in the loop?
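
For reference, this is roughly what I'm doing now (flush_buf is just my own
helper; the conservative reading of the docs I've seen is MFENCE on both
sides of the loop, which is what I've got):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stddef.h>

    static void flush_buf(const void *p, size_t len) {
        const char *cp = (const char *)p;
        _mm_mfence();                        /* drain earlier stores first  */
        for (size_t i = 0; i < len; i += 64) /* assumes 64-byte cache lines */
            _mm_clflush(cp + i);
        _mm_mfence();                        /* make the flushes complete
                                                before the timed run starts */
    }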

~~~
trotsky
You might be interested in system management mode (SMM) - SMM often gets
entered by various ACPI and other platform functions, some of which can be
triggered pretty easily from user space. To protect SMM the caches get trashed
on each transition.

------
jamieb
<http://linux.die.net/man/2/cacheflush>

There are other CPU architectures than the Nehalem.

The Nehalem's I/O hub connects to the caches, not to memory, via QPI. I/O
sees the data as the caches see it. Most architectures, historically, have
had the I/O system talking to main memory or to a memory controller
independent of the CPU. Not waiting for memory to be consistent before firing
off a DMA was a great way to get "interesting" visual effects.

We called that process "flushing the cache".

------
vemv
"If our caches are always coherent then why do we worry about visibility when
writing concurrent programs? This is because within our cores, in their quest
for ever greater performance, data modifications can appear out-of-order to
other threads"

I've read the opposite in various sources such as JCIP: a single
unsynchronized, non-volatile write might never be noticed by other threads
(i.e. other processors). I don't think that case falls into the "instruction
reordering" category, does it?

~~~
mjpt777
x86 has a total store order (TSO) memory model, so any MOV instruction that
writes to a memory address will eventually be seen once the store buffer
drains. In code we must ensure the write is not register-allocated away. This
is achieved by the use of lazySet() as described in the article.

------
drudru11
I think people do detect cache issues. Considering that an access to DRAM is
something like ~150 or more cycles slower, even normally written programs may
end up starved waiting on RAM.

Also, people sometimes confuse TLB flushing with overall cache flushing, as
mentioned at the end of your article. And tagged TLB flushing is still not
commonly used (to my knowledge).

The reality is that there is a huge performance hit whenever these systems
need to be used by a new process or context. Maybe people are equating that
experience with 'flushing'.

------
jessaustin
As if the Gbps vs GBps vs GiB/s obfuscation weren't obnoxious enough, now I
learn about GT/s.

