
Memory Disambiguation on Skylake - nkurz
https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake
======
nkurz
This is a high-effort deep-dive into the details of how modern Intel systems
deal with "memory disambiguation". The exploration here goes well beyond
what's in Intel's detailed optimization guides
([https://software.intel.com/en-us/articles/intel-sdm](https://software.intel.com/en-us/articles/intel-sdm)), and deeper than
what's in Agner Fog's excellent manuals
([http://www.agner.org/optimize/](http://www.agner.org/optimize/)). The
information here is new, and likely doesn't exist anywhere outside of an Intel
NDA.

While most programmers can ignore this level of operation, research like this
is the only thing that can predict otherwise inexplicable observed behavior
such as
[https://news.ycombinator.com/item?id=15935283](https://news.ycombinator.com/item?id=15935283),
and hopefully can allow future low-level optimizations that would not
otherwise be possible. If you are writing an optimizing compiler that targets
these processors, or optimizing an inner loop where every cycle counts, this
is gold.

Since it's a pretty dense piece, here's a higher-level intro to what he's
talking about. Modern desktop and server processors execute instructions
"out-of-order". Future instructions along the "speculative" execution path are
thrown into a reorder buffer capable of holding a couple hundred instructions,
and then executed as soon as possible, often several instructions per cycle
(i.e., they are "superscalar"). Since hundreds of instructions can be executed in the
time it takes to access main memory, one of the main opportunities to make
things faster is when loads from memory can be executed sooner.

But in the presence of both loads and stores, it can be difficult to determine
whether it's safe to "hoist" a load above a store. Sometimes this is obviously
unsafe, as when the load and the store both take their address from the same
unchanged register, but what if two different registers happen to hold
overlapping addresses? Then it can
be hard to tell if a store happened to write to the same memory that a hoisted
load was reading from, causing the load to retrieve the wrong data. This
article is about how Skylake (a particular recent generation of Intel
processors) actually handles these cases.

