I’m not a compiler expert, an assembly expert or an ARM expert, so this may be w...

adwn · 2025-12-03T15:57:53 1764777473

> this looks optimized to me.

It's not. Why would lsl+csel or add+csel or cmp+csel ever be faster than a simple add? Or have higher throughput? Or require less energy? An integer addition is just about the lowest-latency operation you can do on mainstream CPUs, apart from register-renaming operations that never leave the front-end.

DannyBee · 2025-12-03T23:14:32 1764803672

In the end, the simple answer is that scalar code is just not worth optimizing harder these days. It's rarer and rarer for compilers to be compiling code where spending more time optimizing purely scalar arithmetic/etc is worth the payback.

This is even true for mid to high end embedded.

DullPointer · 2025-12-03T16:09:06 1764778146

ARM is a big target, there could be cpus where lsl is 1 cycle and add is 2+.

Without knowing about specific compiler targets/settings this looks reasonable.

Dumb in the majority case? Absolutely, but smart on the lowest common denominator.

adwn · 2025-12-03T18:12:02 1764785522

> Without knowing about specific compiler targets/settings this looks reasonable.

But we do, armv8-a clang 21.1.0 with O3, and it doesn't.

> […] but smart on the lowest common denominator.

No, that would be the single add instruction.