Hacker News

The criticisms there are at the same time 1) true and 2) irrelevant.

Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or, if it's not broken down, then it's on the critical path and forces a lower clock speed, just as, for example, the single-cycle multiply option on Cortex-M0 does).

Plus, an isolated indexing into an array is rare and never speed-critical. The important ones are in loops, where the compiler uses "strength reduction" and "code motion out of loops" so that you're not computing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.

"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.

A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of what else you give up to pay for it. For example, both the conditional-branch range and the JAL/JALR range are criticised as limited. That limit is a consequence of spending one or more 5-bit register specifiers in the instruction: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.

RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need more, you can always branch on the opposite condition over an unconditional ±1 MB jump. If your loop is that large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.

What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.






"don't run any faster than a sequence of simpler instructions"

This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's cores from the Athlon through the 10h family have 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.

For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need a shift and a dependent add. That's two extra cycles of latency.


Note that Zba's sh1add/sh2add/sh3add take care of the problem of a separate shift+add.

But yeah, modern x86-64 doesn't have any latency difference between indexed and plain loads[0], nor does Apple M1[1] (nor even Cortex-A53, via some local running of dougallj's tests; there is an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).

Of course, one has to wonder whether supporting that has ended up costing the plain loads something. It kinda saddens me to see unrolled loops on x86 turn into a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each one, when typically the index could be moved out of the loop (though at the cost of bumping multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. And of course none of this applies to loads outside of tight loops.

I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.

[0]: https://uops.info/table.html?search=%22mov%20(r64%2C%20m64)%...

[1]: https://dougallj.github.io/applecpu/measurements/firestorm/L... vs https://dougallj.github.io/applecpu/measurements/firestorm/L...



