You learn something new every day: "The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system."
I was skeptical about this because x86 has such a strongly ordered memory model, but lo and behold: "HOWEVER. If you do the sub-word write using a regular store, you are now invoking the _one_ non-coherent part of the x86 memory pipeline: the store buffer. Normal stores can (and will) be forwarded to subsequent loads from the store buffer, and they are not strongly ordered wrt cache coherency while they are buffered." (Linus, http://yarchive.net/comp/linux/store_buffer.html).
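To make that concrete, here's a minimal C11 sketch of the classic store-buffer litmus test (all names are mine, purely illustrative). On x86 both threads can end up with r1 == r2 == 0, because each store sits in its core's store buffer while the subsequent load executes:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Store-buffer litmus test: with relaxed ordering, x86 allows
       r1 == r2 == 0 because each store can be buffered while the
       following load executes.  A seq_cst fence (mfence) between
       the store and the load forbids that outcome. */
    atomic_int x, y;
    int r1, r2;

    void *t1(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* A single run will rarely show it; loop this many times
           to observe the r1 == 0 && r2 == 0 case. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }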
Note that locking primitives include memory fences, which is why you are not confronted with reordering problems unless you do lock-free multicore programming.
In my experience, the fence is much costlier than the locking itself (at least on Linux, thanks to futexes [1]). So unless you're going to get your hands dirty and remove the fence, it is useless and error-prone to remove the locking just because you believe it is unnecessary.
1) I'd have liked it if you had dived into those cases where you actually do flush the CPU cache. I've run into this maybe once or twice in my entire career, while doing MIPS kernel drivers. I'm guessing it would be cool for the audience to understand what shenanigans are needed to actually require it, particularly as more people transition from x86 to ARM.
2) You are ascribing meaning to volatile which it absolutely does not have (in C/C++). You really should be going deeper into read/store memory barriers. Using volatile in the hope that this allows you to do some kind of synchronization is misguided.
The classic case where you need to do a cache flush is when you're a JIT writing out native code. Once you've written the instructions into memory you need to clean the data cache [ie ensure that your changes are made visible to anything 'below' you in the memory hierarchy] and invalidate the icache [ie throw away any info you have], so that when the CPU starts to execute instructions from the memory you've just written it doesn't get the stale old versions that might otherwise be in the icache. In fact you only need to clean data out to the point in the memory hierarchy where the icache and the dcache for all your cores come together, which is probably not the same as "write it all really back to system RAM" but is basically indistinguishable from such by the programmer.
NB that x86 maintains coherency between icache and dcache (unlike ARM, say), so you don't need to do this on that CPU architecture.
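A rough C sketch of that JIT pattern, using the GCC/Clang __builtin___clear_cache builtin (the emit function is my own illustration, not from the article):

    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
    #include <string.h>
    #include <sys/mman.h>

    typedef int (*fn_t)(void);

    /* Error handling omitted for brevity; casting a data pointer to
       a function pointer is a POSIX-ism rather than strict ISO C. */
    fn_t emit(const unsigned char *code, size_t len) {
        unsigned char *buf = mmap(NULL, len,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, code, len);      /* instructions written via the dcache */
        /* Clean the dcache and invalidate the icache for this range;
           required on ARM, effectively a no-op on x86. */
        __builtin___clear_cache((char *)buf, (char *)buf + len);
        return (fn_t)buf;
    }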
You are absolutely right that volatile is inadequate for ordering C/C++ concurrent algorithms, and memory barriers/fences are additionally required. I tried to focus on the hardware in this article.
"Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush"."
Ok.
"This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."
That sounds like a flush to me. Modified data is written back (or flushed) to main memory.
"I think we can safely say that we never "flush" the CPU cache within our programs."
Maybe not explicitly, but a write-back is triggered by "certain operations" (see the first quotation above).
So it sounds like the real "fallacy" the article is discussing is the idea that a cache flush is something that a program explicitly does. That would indeed be a fallacy, but I have never heard anyone claim this.
On the upside, the article does give a lot of really nice details about the memory hierarchy of modern architectures (this stuff falls out of date quickly). I had no idea the Memory Order Buffers could hold 64 concurrent loads and 36 concurrent stores.
"Cache flush" usually refers to the idea that all of the lines in the cache are written back to memory or otherwise invalidated. Writing back a single line is usually called an eviction or writeback.
"That sounds like a flush to me. Modified data is written back (or flushed) to main memory."
That's one possible meaning of "flush"; however, it's common to use it to mean not just "push dirty stuff out" but also "all your data is stale so drop it on the floor and refetch from the level below if anybody asks for it later". On x86 the "CLFLUSH" instruction does both "write modified data out to the level below" and "forget about the possibly stale cached copy you have". ARM generally avoids the term "flush" altogether and talks about "Invalidate" and "Clean" operations (where "Clean and Invalidate" does roughly the same as the x86 "CLFLUSH").
I guess this sounds a bit like nitpicking, except that in my experience if you're down at the level where you need to really care about what the cache is doing you probably also need to be precise in discussion of exactly what (possibly architecture specific) operations you're doing to your cache.
"When hyperthreading is enabled these registers are shared between the co-located hyperthreads."
How does that work? I thought each HT had its own registers - otherwise, wouldn't that add a lot of complication and overhead? And does that mean if I disable HT, a program can double the available registers? Wouldn't that need different machine code?
Renamed registers for OoOE, not architectural registers. These number over 100 (200-300 counting FP/SSE) for modern x86 CPUs.
Hyperthreading splits resources (including registers, branch prediction, TLB, etc.) between the threads. This split is static in Sandy Bridge and older and dynamic in Ivy Bridge and newer.
Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores. Compilers will allocate our local variables and function arguments to these registers. When hyperthreading is enabled these registers are shared between the co-located hyperthreads.
The article is oversimplified here to the point of being incorrect. The registers the compiler sees don't change (and there aren't 160 of them; I think there are a dozen or fewer), only the behind-the-scenes ones, as the parent comment says.
Actually, while the compiler will not explicitly emit instructions that access these 160 registers, compilers can and do take advantage of the fact that under the hood there are 160 registers instead of a dozen or so. They do this with "instruction scheduling", carefully ordering the instructions so that the hardware can execute them in parallel, which requires more than the dozen or so architectural registers.
So in a way "Compilers will allocate our local variables and function arguments to these registers" is correct.
Opposite actually - instruction scheduling is significantly more useful for in-order processors without renaming. Compilers take advantage of OoOE and register renaming by not caring too much about careful scheduling in favor of reducing register pressure.
For example, the compiler could order a sequence of load -> store copies as 4x load -> 4x store, or 4x (load -> store). They're equally fast on an OoOE CPU, so compilers might choose the latter so as to spill fewer registers elsewhere in the function.
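A toy C illustration of that trade-off (function names are mine; actual codegen depends on the compiler):

    /* Two schedules for a 4-word copy.  On an OoO core they run at
       the same speed, but the interleaved version keeps only one
       temporary live at a time, easing register pressure. */
    void copy4_wide(long *dst, const long *src) {
        /* 4x load then 4x store: four values live simultaneously */
        long t0 = src[0], t1 = src[1], t2 = src[2], t3 = src[3];
        dst[0] = t0; dst[1] = t1; dst[2] = t2; dst[3] = t3;
    }

    void copy4_narrow(long *dst, const long *src) {
        /* 4x (load then store): one value live at a time */
        for (int i = 0; i < 4; i++)
            dst[i] = src[i];
    }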
It's true that the need for instruction scheduling is reduced in an OoO processor, but now you're comparing it to a processor that does not do OoO and hence needs to expose its entire register set as architectural registers. A compiler for an OoO processor still needs to be aware that under the hood there are 160 registers that can be used instead of 14 (at least if it wants to generate the most efficient code).
OK, that makes sense. So would this add to the case for disabling HT when you have a compute-intensive task? HT seems better suited as another level of latency hiding: while one thread is stalled, the other can work?
It's pretty rare for a thread to completely stall in modern CPUs as you might imagine an in-order CPU doing, but it is common to achieve less than 1 instruction/cycle throughput out of a theoretical ~3-4 instructions/cycle in the frontend and 6 µop/cycle in the backend (Sandy Bridge). So another thread helps feed the execution units, even if the first thread has no classical stalls.
I think we can safely say that we never "flush" the CPU cache within our programs.
Perhaps true, but not for lack of trying!
For benchmarking compression algorithms from a cold cache, I've been trying to intentionally flush the CPU caches using WBINVD (Write Back and Invalidate Cache) and CLFLUSH (Cache Line Flush). I'm finding this difficult to do, at least under Linux for Intel Core i7.
1) WBINVD needs to be called from Ring 0, which is the kernel. The only way I've found to call this instruction from user space is with a custom kernel module and an ioctl(). This works, but feels overly complicated. Is there some built-in way to do this?
2) CLFLUSH is straightforward to call, but I'm not sure it's working for me. I stride through the area I want uncached at 64-byte intervals calling _mm_clflush(), but I'm not getting consistent results. Is there more that I need to do? Do I need MFENCE both before and after, or in the loop?
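For reference, here's the shape of what I'm doing, with the fence placement I've been experimenting with (the function name is mine, and the 64-byte line size is an assumption; real code should query it via CPUID):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stddef.h>
    #include <stdint.h>

    void flush_range(const void *buf, size_t len) {
        /* Align down to the start of the first cache line. */
        const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
        const char *end = (const char *)buf + len;
        _mm_mfence();                 /* drain prior stores first */
        for (; p < end; p += 64)
            _mm_clflush(p);
        _mm_mfence();                 /* CLFLUSH is only ordered by MFENCE */
    }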
You might be interested in system management mode (SMM) - SMM often gets entered by various ACPI and other platform functions, some of which can be triggered pretty easily from user space. To protect SMM the caches get trashed on each transition.
There are other CPU architectures than the Nehalem.
The Nehalem's IO hub connects to the caches, not to memory, via the QPI. I/O sees the data as the caches see it. Most architectures, historically, have the IO systems talking to main memory or a memory controller independent of the CPU. Not waiting for memory to be consistent before firing off a DMA was a great way to get "interesting" visual effects.
"If our caches are always coherent then why do we worry about visibility when writing concurrent programs? This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads"
I've read the opposite from various sources such as JCIP - a single unsynchronized, non-volatile write might never be noticed by other threads (i.e. processors). I don't think that case falls into the "instruction reordering" category, does it?
x86 has a total store order (TSO) memory model, so any ASM MOV instruction that writes to a memory address will eventually be seen once the store buffer drains. In code we must ensure the write is not register-allocated. This is achieved by the use of lazySet() as described in the article.
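lazySet() is Java; a rough C11 analogue of the same idea is a release store, which (unlike a plain write) the compiler is obliged to emit as an actual store instruction (a sketch, not a drop-in equivalent):

    #include <stdatomic.h>

    int plain_flag;        /* a plain write here may be kept in a
                              register and never reach memory */
    atomic_int shared_flag;

    void publish(void) {
        /* The compiler must emit a real store; on x86 it compiles
           to an ordinary MOV that drains from the store buffer in
           order, per TSO. */
        atomic_store_explicit(&shared_flag, 1, memory_order_release);
    }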
I think people do detect cache issues. Considering that an access to DRAM is something like 150 or more cycles slower, even normally written programs may starve for RAM.
Also, people sometimes confuse TLB flushing with overall cache flushing, as mentioned at the end of your article. And tagged TLB flushing is still not commonly used (to my knowledge).
The reality is that there is a huge performance hit whenever these systems need to be used by a new process or context. Maybe people are equating this experience with 'flushing'.