
Discerning the Out-Of-Order Advantage: Is It Speculation or Dynamism? (2013) [pdf] - matt_d
http://zilles.cs.illinois.edu/papers/mcfarlin_asplos_2013.pdf
======
crawshaw
It would be wonderful to simplify CPUs and ship this work off to the compiler.
The two difficulties I see are:

Firstly, every microarchitecture will have a different optimal schedule. This
means a significant shift in how binaries are deployed. Either the compiler is
built into a distribution system aware of microarchitectures (for example,
bitcode and the Apple App Store), or these tasks are brought online with a JIT
compiler. For the sake of software freedom I hope for the latter.

Secondly, this paper shows 88% of the speedup is static. That means there will
always be a little extra speed available to a chip maker that ships an OOO
processor. For the same dollars, will you take the CPU that is 10% slower?
Today, in the middle of Intel's security disaster, sure. But ten years from
now, when CPU security is stable again, that free 10% performance will be
compelling.

~~~
obl
It's not free, though. If you can actually remove most of the power- and
area-hungry OOO scheduling logic for only 10% less performance on real-world
workloads, you can pack a lot more cores or a bigger GPU onto the same chip
without blowing up TDP or die size.

Sure, some use cases will always require the highest single-thread performance
at any cost, but that's less and less true these days.

~~~
thesz
The speculation must be done with content-addressable memory, because you must
commit or roll back the register mapping after speculative execution has
succeeded or failed.

Dynamism, on the other hand, is a way to issue commands (possibly out of
order) when they are ready to be executed: their inputs are ready and no write
conflicts are detected. For this you need a scoreboard (which is also needed
for speculative execution). Scoreboards track registers (which ones are ready,
etc.) and commands (which ones are eligible for execution). As you might
guess, these scoreboards are plain bit vectors, not some interesting structure
that maps 80 internal registers to 16 externally visible register indices.
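A minimal sketch of that bit-vector idea in Python (register count, the
instruction encoding, and the dispatch/complete protocol are all illustrative,
not any specific machine's design): one ready bit per register, and an
instruction becomes eligible once all its source registers' bits are set.

```python
# Scoreboard sketch: register-ready state is a plain bit vector, not a
# renaming map. Each instruction is (dst, srcs); at dispatch its destination
# is marked not-ready, and the bit flips back when the instruction completes.

NUM_REGS = 16

instrs = [(2, [0, 1]),   # i0: r2 <- f(r0, r1)
          (3, [2]),      # i1: r3 <- f(r2)   (RAW dependence on i0)
          (4, [0])]      # i2: r4 <- f(r0)   (independent of i0)

reg_ready = [True] * NUM_REGS
for dst, _ in instrs:          # dispatch: a write to each dst is in flight
    reg_ready[dst] = False

issued = set()

def eligible():
    """Un-issued instructions whose source registers are all ready."""
    return [i for i, (dst, srcs) in enumerate(instrs)
            if i not in issued and all(reg_ready[s] for s in srcs)]

print(eligible())        # [0, 2]: i1 must wait for r2
issued.update(eligible())

reg_ready[2] = True      # i0 completes, its result register becomes ready
print(eligible())        # [1]
```

Note that i2 issues ahead of i1 purely from per-register ready bits; no
mapping from architectural to physical registers is needed.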

Scoreboarding in CPUs has a long history. It was used in Burroughs' CPUs long
ago, and in early Alpha AXP designs (the first one or two, I guess).
Scoreboards are very easy to implement: yours truly is guilty of this, having
done it in a month, and he is not even a hardware designer.

------
petermcneeley
In-order machines still tend to speculate on branches, which leads to Spectre
attacks. Using a static compiler to hoist loads will also lead to
Spectre-style attacks.

------
yazr
What is the typical window size for an Intel OOO core?

(Measured in number of instructions, or in code bytes in memory?)

~~~
BeeOnRope
There are (at least) two types of windows: the ROB (reorder buffer), usually
on the order of 200 instructions on modern chips, and the scheduler, on the
order of 60-100 instructions. My impression is that people usually refer to
the ROB size as the "out-of-order window", but maybe just because it's bigger
and sounds better.

The former limits the distance between the oldest unfinished instruction and
the newest instruction that can be executed.

The latter (scheduler limit) puts a smaller limit on the possible lookahead in
the case of dependent instructions. It is the number of not-executed
instructions that can be buffered while looking for instructions to execute.

For example, if you have a scheduler size of 60 and take a cache miss (this
will become the oldest incomplete instruction), and the following instructions
mostly aren't data-dependent on the miss, you could execute up to ROB-size
(~200) additional instructions while waiting for the cache miss.

If the next 60 instructions are all data-dependent on the cache miss, however,
the scheduler will fill up and you won't be able to execute anything, even if
there are instructions beyond the 60 that don't depend on the cache miss.

Obviously in real code it won't usually be so "binary": you'll often have an
intermediate case where both limits matter in a soft way.
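A toy model of those two limits in Python (the numbers are the rough ones from
above, ROB ~200 and scheduler ~60; the single-miss, issue-in-order framing is
a simplification, not how any real core is built):

```python
# Toy model: instruction 0 is a cache miss holding the head of the ROB.
# Each later instruction either depends on the miss or not. Count how many
# can execute before the miss returns, given both window limits.

ROB, SCHED = 200, 60

def executable_during_miss(depends_on_miss):
    """depends_on_miss[i]: does the i-th instruction after the miss need
    the miss result? Returns how many instructions inside the ROB window
    can execute while the miss is outstanding."""
    done = 0
    waiting = 0                 # stalled instructions occupying the scheduler
    for dep in depends_on_miss[:ROB - 1]:   # the miss holds one ROB slot
        if dep:
            waiting += 1
            if waiting == SCHED:
                break           # scheduler full of stalled work: issue stops
        else:
            done += 1           # independent: executes and frees its slot
    return done

# Nothing depends on the miss: nearly the whole ROB window can execute.
print(executable_during_miss([False] * 300))                  # 199

# The next 60 instructions all depend on it: the scheduler clogs first.
print(executable_during_miss([True] * 60 + [False] * 300))    # 0
```

The two printed cases are exactly the "binary" extremes described above;
mixing True and False in the list gives the intermediate cases where both
limits bite softly.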

There are also other types of buffers such as load/store buffers, branch order
buffers, and other things which might fill up before the two I mentioned and
limit OoO execution. There might be various other limits to out-of-order
execution as well: a scheduler may not be "unified" so not all scheduler slots
may be available to all types of instructions, making the apparent scheduler
size smaller, etc.

