Firstly, every microarchitecture will have a different optimal schedule. That implies a significant shift in how binaries are deployed: either the compiler is built into a microarchitecture-aware distribution system (for example, bitcode and the Apple App Store), or the scheduling is done at run time by a JIT compiler. For the sake of software freedom I hope for the latter.
Secondly, this paper shows 88% of the speedup is static. That means there will always be a little extra speed available to a chip maker that ships an OOO processor. For the same dollars, will you take the CPU that is 10% slower? Today, in the middle of Intel's security disaster, sure. But ten years from now, when CPU security is stable again, that free 10% of performance will be compelling.
Sure, some use cases will always require the highest single-thread performance at all costs, but that's less and less true these days.
The dynamism, on the other hand, is a way to issue instructions (possibly out of order) when they are ready to be executed, i.e. when their inputs are ready and no write conflicts are detected. For this you need a scoreboard (which is also needed for speculative execution). Scoreboards exist for registers (which ones are ready, etc.) and for instructions (which ones are eligible for execution). As you can easily guess, these scoreboards are bit vectors, not some interesting structure that maps 80 internal registers to 16 externally accessible register indices.
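A minimal sketch of the bit-vector idea, in C with made-up names (and, of course, real hardware evaluates all of this in parallel combinational logic, not in sequential code):

    #include <stdint.h>
    #include <stdbool.h>

    /* One bit per architectural register: 1 = value is ready. */
    static uint32_t reg_ready = 0xFFFFFFFF;   /* all ready at start */

    typedef struct {
        uint32_t src_mask;   /* bits of the registers this instruction reads */
        uint32_t dst_mask;   /* bit of the register it writes                */
    } insn;

    /* Eligible when every source is ready and the destination is not
       already pending a write from an older in-flight instruction.   */
    static bool eligible(insn i, uint32_t pending_writes) {
        return ((i.src_mask & reg_ready) == i.src_mask)
            && ((i.dst_mask & pending_writes) == 0);
    }

    static void issue(insn i, uint32_t *pending_writes) {
        reg_ready       &= ~i.dst_mask;   /* result not ready yet     */
        *pending_writes |=  i.dst_mask;   /* note the write in flight */
    }

    static void complete(insn i, uint32_t *pending_writes) {
        reg_ready       |=  i.dst_mask;   /* result now available */
        *pending_writes &= ~i.dst_mask;
    }

Issue clears the destination's ready bit, completion sets it again; that's the whole scoreboard in a nutshell.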
Scoreboarding in CPUs has a long history. It was used in Burroughs' CPUs long ago, and in the early Alpha AXP designs (the first one or two, I guess). Scoreboards are very easy to implement - yours truly is guilty of this, having done it in a month, and he is not even a hardware designer.
"The Itanium approach ... was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." - Donald Knuth
(measured in the number of instructions, or in code bytes in memory?!)
The former (the reorder buffer, or ROB) limits the distance between the oldest unfinished instruction and the newest instruction that can be executed.
The latter (the scheduler limit) puts a smaller bound on the possible lookahead in the case of dependent instructions. It is the number of not-yet-executed instructions that can be buffered while looking for instructions to execute.
For example, if you have a scheduler size of 60 and take a cache miss (this will become the oldest incomplete instruction), and the following instructions mostly aren't data-dependent on the miss, you could execute up to ROB-size (~200) additional instructions while waiting for the cache miss.
If the next 60 instructions are all data-dependent on the cache miss, however, the scheduler will fill up and then you won't be able to execute anything, even if there exist instructions beyond the 60 that don't depend on the cache miss.
Obviously in real code it won't usually be so "binary" - you'll often have an intermediate case where both limits matter in a soft way.
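To make the two limits concrete, here is a toy simulation in C. The sizes (ROB_SIZE, SCHED_SIZE) and the single-outstanding-miss model are assumptions for illustration, not a model of any real core:

    #include <stdio.h>
    #include <stdbool.h>

    #define ROB_SIZE   200   /* in-flight instruction window    */
    #define SCHED_SIZE  60   /* waiting, not-yet-executed slots */

    /* Count how many instructions after a cache miss can execute while
       the miss is outstanding, given whether each one depends on it.   */
    static int executable_under_miss(const bool *depends, int n) {
        int in_flight = 1;   /* the miss itself occupies a ROB slot   */
        int waiting   = 0;   /* scheduler entries blocked on the miss */
        int executed  = 0;
        for (int i = 0; i < n; i++) {
            if (in_flight >= ROB_SIZE) break;       /* ROB full: stop fetching */
            in_flight++;
            if (depends[i]) {
                if (++waiting >= SCHED_SIZE) break; /* scheduler full: stall   */
            } else {
                executed++;   /* independent: executes under the miss */
            }
        }
        return executed;
    }

    int main(void) {
        bool all_indep[300] = { false };
        bool all_dep[300];
        for (int i = 0; i < 300; i++) all_dep[i] = true;
        printf("independent stream: %d\n", executable_under_miss(all_indep, 300));
        printf("dependent stream:   %d\n", executable_under_miss(all_dep, 300));
        return 0;
    }

With the fully independent stream it executes roughly ROB-size instructions under the miss; with the fully dependent stream it executes nothing once the 60 scheduler slots fill, matching the two extremes above.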
There are also other buffers, such as load/store buffers and branch order buffers, which might fill up before the two I mentioned and limit OoO execution. There might be various other limits to out-of-order execution as well: a scheduler may not be "unified", so not all scheduler slots may be available to all types of instructions, making the apparent scheduler size smaller, etc.