The 3-cycle latency casts massive suspicion on the bypass network. But I don't see how the bypass network could be bugged without causing the incorrect result. So maybe the scheduler just doesn't know how to bypass this "shift with small immediate" micro op.
Or maybe the bypass network is bugged, and what we are seeing is a chicken bit set by the microcode that disables the bypass network for this one shift micro op.
> But I don't see how the bypass network could be bugged without causing the incorrect result.
Maybe, if they really rely on this kind of forwarding in many cases, it's not unreasonable to expect that the extra latency comes from having to recover from an "incorrect PRF read" (just as I imagine there's also a recovery path for "incorrect forwarding").
Yeah, "incorrect PRF read" is something that might exist.
I know modern CPUs will sometimes schedule uops that consume the result of a load instruction on the assumption that the load will hit L1 cache. If the load actually missed L1, the scheduler won't find out until that uop tries to read the value coming in from L1 over the bypass network. So that uop needs to be aborted and rescheduled later. And I assume this mechanism is generic enough to catch any "incorrect forwarding", because there are other variable-latency instructions (like division) that would benefit from this optimistic scheduling.
But my gut feeling is that you'd only have these checks on the bypass network, and only ever schedule PRF reads once you know the correct value has been stored.
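To make the optimistic load scheduling described above concrete, here is a toy sketch in Python. It is only an illustration under stated assumptions: the latencies are made-up numbers, the names (`Attempt`, `schedule_consumer`) are hypothetical, and a real replay mechanism is far more involved and not publicly documented at this level of detail.

```python
from dataclasses import dataclass
from typing import List

# Assumed latencies, in cycles. These are illustrative, not measured values.
L1_HIT_LATENCY = 4
L1_MISS_LATENCY = 12


@dataclass
class Attempt:
    issue_cycle: int  # cycle at which the consumer uop was issued
    aborted: bool     # True if the attempt found no valid data and was cancelled


def schedule_consumer(load_issue_cycle: int, load_hits_l1: bool) -> List[Attempt]:
    """Return every issue attempt of a uop that consumes a load's result."""
    # Optimistic guess: the load hits L1, so issue the consumer to catch the
    # data on the bypass network in the cycle it would arrive.
    speculative_issue = load_issue_cycle + L1_HIT_LATENCY
    if load_hits_l1:
        return [Attempt(speculative_issue, aborted=False)]

    # The load missed: nothing valid was on the bypass network, so the first
    # attempt is aborted and the uop is issued again once the data arrives.
    return [
        Attempt(speculative_issue, aborted=True),
        Attempt(load_issue_cycle + L1_MISS_LATENCY, aborted=False),
    ]


if __name__ == "__main__":
    print("hit: ", schedule_consumer(0, load_hits_l1=True))
    print("miss:", schedule_consumer(0, load_hits_l1=False))
```

In this model the cost of a wrong guess is one wasted issue slot plus the reschedule. The open question in the thread is whether a similar abort-and-retry path could also exist for PRF reads, which this sketch does not model.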
Maybe the bypass network doesn't include these "constant registers"? A bit like Zen 5, where some 1-cycle SIMD ops take 2 cycles to execute, probably because of shortcomings in the same kind of network.