What you do is extract up to N instructions from a block of memory (in this case, 16 bytes). If there are fewer than N instructions, well, you can only issue as many as you have; if there are more than N instructions, well, you can only issue N instructions that cycle. Note that most instructions are only a few bytes long; I don't have hard data on the average length of an x86-64 instruction in practice, but my guess is it's around 3-4 bytes.
The hard part about a variable-length ISA, as I understand it (hardware is not my expertise), is that you have to build essentially a mux that can steer from 16 source places to 5 outputs.
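Here's a rough Python sketch of that behaviour, purely as a toy: insn_length() below is a made-up stand-in (real x86 length decoding has to look at prefixes, opcode, ModRM/SIB, displacement and immediate), and the 1-4 byte lengths are invented just so the example runs.

    # Toy model of the behaviour described above, nothing like a real front
    # end: take a 16-byte fetch block and pull out at most N instructions.
    # insn_length() is a made-up stand-in for real x86 length decoding.

    WINDOW = 16   # bytes fetched per cycle
    WIDTH = 5     # max instructions handed to decode per cycle

    def insn_length(block, offset):
        # Pretend instructions are 1..4 bytes long, keyed off the first byte.
        return 1 + (block[offset] % 4)

    def extract_window(block):
        out, pos = [], 0
        while pos < len(block) and len(out) < WIDTH:
            n = insn_length(block, pos)
            if pos + n > len(block):      # instruction straddles the block
                break                     # leftover bytes wait for the next fetch
            out.append((pos, n))          # (offset, length) of one instruction
            pos += n
        return out

    print(extract_window(bytes(range(WINDOW))))   # [(0, 1), (1, 2), (3, 4), (7, 4), (11, 4)]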
> The hard part about a variable-length ISA, as I understand it (hardware is not my expertise), is that you have to build essentially a mux that can steer from 16 source places to 5 outputs.
It's not just the mux, it's the control signals for the mux. If your first instruction in your 16-byte window is three bytes long, you can't realistically mux bytes 3-15 to the second decoder input, because you don't know that it needs to start decoding at byte 3 until /after/ the first instruction has decoded; so your gate delays would be the sum of the five decoders, plus muxing etc., and we're trying to do this in a single cycle.
The alternative is to have 16 decoders, each decoding what an instruction /would/ be if it started at a given byte. These all complete in parallel, but many of them might produce garbage. So we need to find the first five of these that have non-garbage outputs, and /then/ mux those sixteen possible outputs down to five channels. This is doable (and done), but it's a lot more hardware in both decoders and muxing than you might think at first.
Basically: To decode five instructions per cycle from a fixed-instruction-width stream takes five decoders. To decode up to five instructions per cycle from a variable-instruction-width stream takes number-of-possible-starts decoders and a mess of muxing. There really is a cost here.
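A rough software analogue of that scheme, again with made-up 1-4 byte instruction lengths rather than real x86 decoding:

    # One speculative "decoder" per byte offset, then a selection step that
    # keeps the first five outputs sitting on real instruction boundaries;
    # everything else is garbage and gets dropped.

    WINDOW, WIDTH = 16, 5

    def fake_length(byte):
        return 1 + (byte % 4)                 # pretend 1..4-byte instructions

    def speculative_decode(block):
        # All of these happen "in parallel" in hardware, one per byte.
        return [(off, fake_length(block[off])) for off in range(len(block))]

    def valid_start_mask(results):
        # Which offsets are real instruction starts? Follow the length chain.
        mask, pos = [False] * len(results), 0
        while pos < len(results):
            mask[pos] = True
            pos += results[pos][1]
        return mask

    def steer(results, mask, width=WIDTH):
        # The big mux: route the first `width` valid results to the decoders.
        return [r for r, ok in zip(results, mask) if ok][:width]

    block = bytes(range(WINDOW))
    results = speculative_decode(block)
    print(steer(results, valid_start_mask(results)))   # [(0, 1), (1, 2), (3, 4), (7, 4), (11, 4)]

Note that valid_start_mask() still walks the length chain serially; that chain is exactly the dependency the hardware has to break somehow.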
The Thumb ISA is left as an exercise for the reader.
Very nice explanation! It gets very complex very quickly... And then there's of course also the fact that you never know whether the last bytes of the Ifetch block you get from the cache form a complete instruction or not. You'll need to keep those around until the next line arrives with the rest of that instruction.
And then add to that the complexity of speculative and out-of-order execution: you want to decide very early, i.e. (pre)decode, where the branch instructions are, to make sure your branch prediction unit has the right inputs in time to decide where to fetch the next instructions from.
And after all that, on an earlier branch mispredict (or a pipeline flush for another reason), you might need to throw almost all of those expensively decoded instructions away and start fetching from elsewhere...
It all depends on how much you want that last bit of performance and what you are willing to pay for it in terms of power, die area, all that.
And of course every part of it has to be balanced. It makes no sense to decode instructions faster than the system downstream can handle it, particularly if you are wasting resources to do so.
The reason why Intel has so many different implementations of x86 is to meet the needs of different market segments.
Yep. And it's worth saying that, contrary to my previous over-simplification, the trade-offs change once you give up on single-cycle instruction decoding. With a longer pipeline, you can split the work into 16 simple, small pre-decoders that basically just extract the length of each instruction, /then/ mux the inputs to just five full decoders. It adds a pipeline stage, but it can be lower area for the same sustained performance.
Trade-offs are everywhere. But certainly fixed width instructions, or variable-width-but-trivial-to-determine-length (Thumb before Thumb2, RISC-V compressed) have real advantages.
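In toy Python terms (same invented 1-4 byte lengths as above, and "full decode" is just a tuple here), the split looks something like this:

    # Sketch of the two-stage split: cheap length-only pre-decode in one
    # pipeline stage, then steer at most WIDTH confirmed starts into the
    # expensive full decoders in the next.

    WINDOW, WIDTH = 16, 5

    def fake_length(byte):                    # one tiny pre-decoder: length only
        return 1 + (byte % 4)

    def full_decode(block, offset, length):   # stand-in for an expensive full decoder
        return ("insn", offset, block[offset:offset + length].hex())

    def stage1_predecode(block):
        # 16 cheap length extractors, one per byte, all in parallel.
        return [fake_length(b) for b in block]

    def stage2_decode(block, lengths):
        # Next pipeline stage: only WIDTH full decoders are ever needed.
        insns, pos = [], 0
        while pos < len(block) and len(insns) < WIDTH:
            n = lengths[pos]
            if pos + n > len(block):
                break                          # straddles the fetch block
            insns.append(full_decode(block, pos, n))
            pos += n
        return insns

    block = bytes(range(WINDOW))
    print(stage2_decode(block, stage1_predecode(block)))   # five ("insn", offset, bytes) tuples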
> The alternative is to have 16 decoders, each decoding what an instruction /would/ be if it started at a given byte. These all complete in parallel, but many of them might produce garbage. So we need to find the first five of these that have non-garbage outputs, and /then/ mux those sixteen possible outputs down to five channels. This is doable (and done), but it's a lot more hardware in both decoders and muxing than you might think at first.
Length decoding is a finite state machine, and composition of state-transition functions is associative, which means you can actually compile it into a reduction tree and do it in parallel.
I don't think the x86 decoder is actually a particularly complex state machine (I haven't attempted to actually build it), so it's probably overall far less complex than most people assume it is.
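A minimal sketch of the idea on a toy ISA with 1-4 byte instructions (so not real x86, just the shape of the trick): each byte's length-decode step becomes a small state-transition table, and because composing those tables is associative, a prefix scan (a log-depth reduction tree in hardware) can find every instruction start without walking the bytes serially. The scan below is written as a plain loop for clarity:

    MAXLEN = 4                       # toy maximum instruction length

    def fake_length(byte):
        return 1 + (byte % MAXLEN)   # pretend 1..4-byte instructions

    def byte_transition(byte):
        # Transition table for one byte: index = incoming state, value = new
        # state. State 0 means "this byte starts an instruction"; state k
        # means "k more bytes belong to the current instruction".
        return tuple(fake_length(byte) - 1 if s == 0 else s - 1
                     for s in range(MAXLEN))

    def compose(f, g):
        # Apply f, then g. Composition is associative, so any bracketing
        # (any tree shape of pairwise combines) gives the same result.
        return tuple(g[f[s]] for s in range(MAXLEN))

    def instruction_starts(block):
        # Exclusive prefix "scan" over the per-byte tables, written serially
        # here; in hardware it would be a log-depth combine tree.
        identity = tuple(range(MAXLEN))
        prefix, acc = [], identity
        for b in block:
            prefix.append(acc)       # combined effect of all bytes before this one
            acc = compose(acc, byte_transition(b))
        return [off for off, f in enumerate(prefix) if f[0] == 0]

    print(instruction_starts(bytes(range(16))))   # [0, 1, 3, 7, 11, 15]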
> The alternative is to have 16 decoders, each decoding what an instruction /would/ be if it started at a given byte. These all complete in parallel, but many of them might produce garbage. So we need to find the first five of these that have non-garbage outputs, and /then/ mux those sixteen possible outputs down to five channels.
I believe I read somewhere that that's pretty much how recent Intel processors do it.
No you don't. Look at a block diagram of (say) Skylake: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...