Also, most modern processors will easily forward the store to the subsequent rea...

kevingadd · 2024-04-19T05:15:41

Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.

(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.
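
To put rough numbers on that, here is a back-of-the-envelope sketch in C. The helper name `stores_until_full` is made up for illustration, and the 256-bit entry width is my assumption, chosen only because it is consistent with a 512-bit store taking two entries; it is not a figure from the article.

```c
#include <stddef.h>

/* Hypothetical helper: how many stores of width `store_bits` can be in
   flight before a store queue with `queue_entries` slots is full.
   Assumes each queue entry covers at most `entry_bits` of data, so a
   wider store occupies several entries. */
size_t stores_until_full(size_t queue_entries, size_t store_bits,
                         size_t entry_bits)
{
    /* Round up: a store that spills past one entry needs another one. */
    size_t entries_per_store = (store_bits + entry_bits - 1) / entry_bits;
    return queue_entries / entries_per_store;
}
```

Under those assumptions, `stores_until_full(64, 512, 256)` comes out to 32 in-flight 512-bit stores, versus 64 for stores that fit in a single entry.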

rayiner · 2024-04-19T12:37:57

It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.
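
A minimal C sketch of the two cases being compared, with a hypothetical `Point` type and function names invented for illustration. What the compiler actually emits depends on the ABI, optimization level, and inlining, so treat the comments as the typical outcome on x86-64 System V rather than a guarantee.

```c
#include <stdint.h>

/* Hypothetical 16-byte aggregate used only for this illustration. */
typedef struct { int64_t x, y; } Point;

/* Passed by pointer: the caller has to have the value in memory
   (often a freshly written stack slot), and the callee loads it back.
   This is the store-then-load pattern that relies on forwarding. */
int64_t sum_by_pointer(const Point *p)
{
    return p->x + p->y;
}

/* Passed by value: under the x86-64 System V calling convention a
   struct of two 64-bit integers is typically passed in two integer
   registers, so no store/load pair is needed for the arguments. */
int64_t sum_by_value(Point p)
{
    return p.x + p.y;
}
```

Both functions compute the same result; the difference is only in whether the arguments travel through memory or through registers.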

FullyFunctional · 2024-04-19T16:29:37

Due to speculative out-of-order execution, it's not just "a few cycles". The LSU has a hard, small limit on the number of outstanding loads and stores (usually separate limits, on the order of 8-32), and once you fill that, you have to stop issuing until commit has drained them.

This discussion is yet another instance of the fallacy of "Intel has optimized for the current code so let's not improve it!". Other examples include branch prediction (a correctly predicted branch has a small but nonzero cost) and indirect jump prediction. And this doesn't even begin to address implementations that might be less aggressive about making up for bad code (like most RISCs and RISC-likes).

dwattttt · 2024-04-19T02:13:01

More broadly: processor design has been optimised around C-style antics for a long time. Trying to optimise the generated code away from that could well inhibit processor tricks, so that the result is _slower_ than if you had stuck with the "looks terrible but is expected & optimised" status quo.

eru · 2024-04-19T04:03:20

Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.
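
For reference, the naive form in question looks roughly like this, written in C rather than Fortran. The function name and the row-major flat-array layout are my choices for the sketch, not anything a particular compiler is known to pattern-match.

```c
#include <stddef.h>

/* Naive three-nested-loops multiplication of two n x n matrices stored
   row-major in flat arrays: c = a * b. No blocking, no vectorisation
   hints, just the textbook definition. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The inner loop walks `b` with a stride of `n`, which is the cache-unfriendly part an optimising compiler or a hand-tuned library would restructure.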

pcwalton · 2024-04-19T16:26:08

Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.

Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.