
Some quick searching suggests that FSRM is used for copies of at least 128 bytes or so (ERMS for ≥~2048 bytes, for reference); in base x86 (i.e. SSE2) a 128-byte copy is 8 loads & 8 stores, ~62 bytes of code. At that point calling into a library function isn't too unreasonable (at the very least it could utilize AVX and cut that in half to 4 loads + 4 stores, though at the cost of function call overhead & some (likely correctly-predicted) branches).
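To make the "8 loads & 8 stores" count concrete, here is a minimal sketch of a fixed 128-byte copy using SSE2 intrinsics. The name `copy128_sse2` is made up for illustration, and the loop is only assumed to be unrolled by an optimizing compiler; this is not what any particular libc or compiler actually emits.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy exactly 128 bytes as 8 unaligned 16-byte
 * loads and 8 unaligned 16-byte stores (8 * 16 = 128). */
static void copy128_sse2(void *dst, const void *src)
{
#if defined(__SSE2__)
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (int i = 0; i < 8; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_storeu_si128((__m128i *)(d + 16 * i), v);
    }
#else
    /* Fallback for targets without SSE2 so the sketch still builds. */
    memcpy(dst, src, 128);
#endif
}
```

With AVX the same idea would use 32-byte registers, which is where the 4 loads + 4 stores figure comes from.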


https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...

Suggests that it should be usable for even shorter copies. And that's really my point. We should have One True memcpy instruction sequence that we use everywhere and stop worrying. And yet...


Unfortunately, we can't change the past, and seemingly in the past it wasn't worth it to have a fast One True memcpy (and perhaps to a decent extent it still isn't). I'm still typing this on a Haswell CPU, which doesn't have FSRM (rep movsb of 16 bytes in a loop takes ~10ns = ~36 cycles per iteration on average).
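For anyone who wants to try this on their own CPU, a rough sketch of a bare 'rep movsb' copy via GCC/clang inline asm follows. The name `copy_rep_movsb` is hypothetical, and it assumes the direction flag is clear on entry, as the usual x86 ABIs require; timing it in a loop is left to the reader, and the numbers will depend heavily on the microarchitecture.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copy n bytes with a single 'rep movsb'.
 * Assumes the direction flag is clear (DF = 0), as the ABI guarantees
 * at function entry. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* RDI = destination, RSI = source, RCX = byte count; all three are
     * modified by the instruction, hence the read-write constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    /* Non-x86 fallback so the sketch still compiles elsewhere. */
    return memcpy(dst, src, n);
#endif
}
```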

But yeah, it does seem that the 128 bytes from my quick search was wrong. (Though gcc & clang for '-march=alderlake' both never generate 'rep movsb' at '-O3'; at '-Os' gcc starts emitting a 'rep movsb' for ≥65B, while clang still never does.)



