Hacker News new | past | comments | ask | show | jobs | submit login
Discerning the Out-Of-Order Advantage: Is It Speculation or Dynamism? (2013) [pdf] (illinois.edu)
32 points by matt_d 6 months ago | hide | past | web | favorite | 8 comments

It would be wonderful to simplify CPUs and ship this work off to the compiler. The two difficulties I see are:

Firstly, every microarchitecture will have a different optimal schedule. This means a significant shift in how binaries are deployed. Either the compiler is built into a distribution system aware of microarchitectures (for example, bitcode and the Apple App Store), or these tasks are brought online with a JIT compiler. For the sake of software freedom I hope for the latter.

Secondly, this paper shows 88% of the speedup is static. That means there will always be a little extra speed available to a chip maker that ships an OOO processor. For the same dollars, will you take the CPU that is 10% slower. Today in the middle of Intel's security disaster, sure. But ten years from now when CPU security is stable again, that free 10% performance will be compelling.

It's not free though. If you can actually remove most of the power/area hungry ooo scheduling logic for only 10% less performance on real world workload, you can pack a lot more cores or a bigger GPU on the same chip without blowing up TDP or die size.

Sure, some use case will always require the highest single thread perf at all cost but that's less and less true these days.

The speculation must be done with content-addressable memory, because you must commit or rollback register mapping after speculative execution have succeed or failed.

The dynamism, on the other hand, is a way to issue (possibly) out of order commands when they are ready to be executed (their inputs are ready and no write conflicts detected). For this you need a scoreboard (which is also needed for speculative execution). Scoreboards are for registers (which one is ready, etc) and for commands (which one is eligible for execution). As you can easily assume, these scoreboards are bit vectors, not some interesting thing that maps 80 internal registers to 16 externally accessible register indices.

The scoreboarding in CPUs have long history. They were used in Burroughs' CPUs long ago. They were used in early Alpha AXP designs (first one or two, I guess). They are very easy to implement - yours truly is guilty in this, having done that in month and he is not even a hardware designer.

From the paper: "we make no attempt to describe the mechanisms that would be necessary to produce these schedules nor the architectural fea- tures that would be necessary to encode these schedules. We do, however, think that a useful mental model for the system under study is a machine like the Transmeta Crusoe [9], which consisted of a VLIW processor coupled with the CMS dynamic binary trans- lator. Our study can be viewed as investigating the potential for a futuristic dynamic optimizer provided it was not constrained in the static schedules it could generate and it could effectively profile the execution so as to generate relevant schedules."

"The Itanium approach ... was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." - Donald Knuth

I should have qualified my post with the assumption that such a compiler could be written. I don't know if it could. I suspect, if a compiler was focused on one specific micro-architecture rather than a general ISA, that it could.

In-order Machines still tend to speculate branches which leads to Spectre attack. Using a static compiler to hoist loads will also lead to Spectre style attacks.

What is the typical window size for an Intel OOO decoder ?

(measured in the number of instructions or code bytes in the memory?!)

There are (at least) two types of windows - the ROB (reorder buffer) which usually usually on the order of 200 instructions on modern chips, and the scheduler, on the order of 60-100 instructions. My impression is that usually people refer to the ROB size as the "out of order window", but maybe just because it's bigger and sounds better.

The former limits the distance between the oldest unfinished instruction and the newest instruction that can be executed.

The latter (scheduler limit) puts a smaller limit on the possible lookahead in the case of dependent instructions. It is the number of not-executed instructions that can be buffered while looking for instructions to execute.

For example, if you have a scheduler size of 60 and take a cache miss (this will become the oldest incomplete instruction), and the following instructions mostly aren't data-dependent on the miss, you could execute up to ROB-size (~200) additional instructions while waiting for the cache miss.

If the next 60 instructions are data-dependent on the cache miss, however, the scheduler will fill up and then you don't be able to execute anything, even if there exist instructions beyond the 60 that don't depend on on the cache miss.

Obviously in real code it won't usually be so "binary" - you'll often have an intermediate case where both limits matter in a soft way.

There are also other types of buffers such as load/store buffers, branch order buffers, and other things which might fill up before the two I mentioned and limit OoO execution. There might be various other limits to out-of-order execution as well: a scheduler may not be "unified" so not all scheduler slots may be available to all types of instructions, making the apparent scheduler size smaller, etc.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact