Some quick searching gives that FSRM is used for at least 128 bytes or so (ERMS ...

the8472 · on Nov 30, 2023

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...

Suggests that it should be usable for even shorter copies. And that's really my point. We should have One True memcpy instruction sequence that we use everywhere and stop worrying. And yet...

dzaima · on Nov 30, 2023

Unfortunately, we can't change the past, and seemingly in the past it wasn't worth it to have a fast One True memcpy (and perhaps to a decent extent still isn't). I'm still typing this on a Haswell CPU, which don't have FSRM (rep movsb of 16 bytes in a loop takes ~10ns=36 cycles per iteration avg).

But, yeah it does seem that my 128 bytes of a quick search was wrong. (though, gcc & clang for '-march=alderlake' both never generate 'rep movsb' on '-O3'; on `-Os` gcc starts giving a rep movsb for ≥65B, clang still never does)