Definitely, it's as important as ever.
If you have matched call/ret pairs but cause an address mismatch by, say, modifying the return address on the stack, the subsequent `ret` will mispredict.
OTOH, if you cause a mismatch by having unbalanced call/ret pairs, e.g., a `call` to a location that does a `jmp` back instead of a `ret`, the situation is much worse: the return prediction stack is misaligned, so you can mispredict on each of the next N `ret`s, where N is 16 or 32 or whatever the size of the stack is (it doesn't mean that will happen: if the following code makes many more calls before the corresponding rets, the bad predictions may fall off the stack and never be used).
The good news is that this one is really easy to avoid. Why would you end up with mismatched call/ret in the first place? It doesn't happen in normal code, and even in hand-rolled assembly I can't see many reasons you'd end up like that.
There is one pattern that is sometimes used with mismatched calls: `call; pop eax` to get the current IP in eax (for example) - useful in 32-bit position-independent code (in 64-bit code you can just use rip-relative addressing). This one is special-cased though, so it does not cause a misprediction.
Way more details available here (this page is also linked in my post):
> - mixing int/float vector instructions with same register
These are the so-called domain bypass penalties. They are much reduced and I haven't seen them in practice on modern chips, but let's check Agner...
Yeah, it's best just to read Agner. See 10.9 Data Bypass Delays, for example. Here's an excerpt:
> However, there are fewer such delays on Haswell and Broadwell than on previous
> processors. I found no such delays in the following cases:
> - when a floating point Boolean instruction, such as ORPS is used with integer data
> - when a wrong type of move instruction is used, e.g. MOVPS or MOVDQA
> - when a wrong type of shuffle instruction is used, e.g. SHUFPS or PSHUFD
The situation is similar for Skylake. Recommend checking out microarchitecture.pdf for the full scoop.
> - false dependencies between partial registers (i.e., AH and AL)
This one has the best answer of all, very complete:
I refer to it whenever I forget the details. The part I always remember: anything involving al always extracts/merges from/into the existing physical register: there are no uop or latency penalties, but there is always a dependency on the full register (rax in this case). So a write to al can't proceed until the rest of rax is available, and the same for a read.
ah is very different: it renames to a separate register, so it acts decoupled from rax. The penalties are reversed from above: there is no false dependency, but there may be merging uops and other costs.
The details on old chips are different.
IMO this one has little impact, but there are some cases where it matters, like when you're really trying to optimize scalar byte extraction.
It looks like the main remaining delay cases are integer multiply instructions, which get a bypass delay no matter what their source, and a few cases like FMAs fed by non-shuffle integer ops (not that common), or non-shuffle integer ops fed by an FMA or integer multiply (also not that common).
The key part is that shuffles have zero delays in any configuration, as producer or consumer, except when a shuffle feeds an integer mul. That's good because shuffles are very common as inputs to both integer and FP ops.
I tried a simple JIT once that did code threading by jmp/ret - seems like that would not be any better today :D
tl;dr: prefer trampolines over weird stack tricks.
call/ret has no particular advantage over e.g. storing the return address in a register (say r15) and then `jmp r15` - other than the prediction. So if you can't use the prediction, a jmp/indirect-jmp pair should work fine.