Hacker News

The criticisms there are at the same time 1) true and 2) irrelevant.

Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or, if it's not broken down, then it's on the critical path and forces a lower clock speed, just as, for example, the single-cycle multiply option on Cortex-M0 does).

Plus, an isolated indexing into an array is rare and never speed-critical. The important ones are in loops, where the compiler uses "strength reduction" and "code motion out of loops" so that you're not computing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.

"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.

A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of what else you give up to pay for it. For example, both the conditional-branch range and the JAL/JALR range are criticised as limited. That limit is a consequence of spending one or more 5-bit register specifiers in the instruction: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.

RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need more, you can always branch on the opposite condition over an unconditional ±1 MB jump. If your loop is that large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.

What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.






"don't run any faster than a sequence of simpler instructions"

This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's cores from the Athlon through the 10h family have 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.

For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need a shift and a dependent add. That's two extra cycles of latency.


Note that Zba's sh1add/sh2add/sh3add take care of the problem of a separate shift+add.

But yeah, modern x86-64 doesn't have any latency difference between indexed and plain loads[0], nor does Apple M1[1] (nor even Cortex-A53, via some local running of dougallj's tests; there is an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).

Of course, one has to wonder whether supporting that has ended up costing the plain loads something. It kinda saddens me to see unrolled loops on x86 turn into a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each one, when typically the index could be moved out of the loop (though at the cost of bumping multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. And of course none of this applies to loads outside of tight loops.

I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.

[0]: https://uops.info/table.html?search=%22mov%20(r64%2C%20m64)%...

[1]: https://dougallj.github.io/applecpu/measurements/firestorm/L... vs https://dougallj.github.io/applecpu/measurements/firestorm/L...



