Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel can save and restore FPU state if the copy is long enough that using SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on whether src and dest are 64-byte aligned relative to each other: if they are, "rep movsb" is faster than SSE.
The list is publicly archived, but glibc's maintainer (Ulrich Drepper) actively discourages public interaction for the project. The project's policy is that bug reports should almost always go through a Linux distribution, and to say it nicely, Drepper can be difficult to persuade.
Debian was in the process of switching to eglibc in order to avoid glibc (and Drepper) and to fix issues they saw with the library.
A couple of years ago, before SSE existed, I wrote a highly optimized memory copy routine. It was more than just using movntq (non-temporal stores are important to avoid cache pollution) and the like: for large data I copied each chunk into a local buffer smaller than one page, then copied it from there to the destination. Sounds crazy? It actually was much faster because of page locality.
For small chunks, however, nothing was faster than rep movsb, which moves one byte at a time.
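To make the staging-buffer idea concrete, here is a minimal portable sketch in C. It is not the original routine: the name `staged_copy` and the 2048-byte `STAGE_BYTES` size are assumptions for illustration, and plain `memcpy` stands in for the non-temporal stores (movntq and friends) that the write phase would use in a tuned version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging size, deliberately smaller than a 4 KiB page. */
#define STAGE_BYTES 2048

/* Copy n bytes from src to dst through a small on-stack staging buffer.
 * The regions must not overlap (same contract as memcpy). */
void *staged_copy(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (d != NULL && s != NULL));

    while (n > 0) {
        size_t chunk = n < STAGE_BYTES ? n : STAGE_BYTES;

        /* Read phase: source -> staging buffer. */
        memcpy(stage, s, chunk);
        /* Write phase: staging buffer -> destination. A tuned version
         * would use non-temporal stores here instead of memcpy. */
        memcpy(d, stage, chunk);

        s += chunk;
        d += chunk;
        n -= chunk;
    }
    return dst;
}
```

Each phase works on one region at a time, which is the page-locality effect the comment describes. Whether it still wins on current hardware would need measuring.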
Someone tell me if I am mistaken - but it looks like the main difference between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc using `movdqa`, the latter having a shorter decode time and possibly shorter execution time?
No, the problem is with x86-64, which apparently doesn't use `rep movsl`; as far as I can tell, GCC's x86-64 backend assumes that SSE will be available, and so only has an SSE inline memcpy. However, in the kernel SSE is not available (as SSE registers aren't saved normally, to save time), so this is disabled. With no non-SSE fallback (such as `rep movsl` on x86), gcc falls back to a function call, with the performance impact this implies.
rep movsl moves data 32 bits at a time, while movdqu/movdqa move data 128 bits at a time. The advantage is not only in decoding -- the data paths in modern Intel processors really are 128 bits wide, so movdqu/movdqa get 4 times the throughput out of the system. (Until you run out of L1 cache, after which you really slow down.)
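For anyone who wants to see what "128 bits at a time" looks like in source, here is a rough C sketch using the SSE2 intrinsics `_mm_loadu_si128` and `_mm_storeu_si128`, which correspond to unaligned 128-bit moves (movdqu). The function name `copy128` is made up for this example, and the whole loop is skipped in favor of plain `memcpy` when the compiler does not define `__SSE2__`. It says nothing about what GCC or ICC actually emit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Copy n bytes in 16-byte (128-bit) steps where SSE2 is available.
 * The regions must not overlap. */
void *copy128(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (d != NULL && s != NULL));

#if defined(__SSE2__)
    while (n >= 16) {
        /* Unaligned 128-bit load and store, 16 bytes per iteration. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Tail bytes, or the whole copy when SSE2 is not available. */
    memcpy(d, s, n);
    return dst;
}
```

If both pointers are known to be 16-byte aligned, `_mm_load_si128` / `_mm_store_si128` (movdqa) could be used instead.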
I'm sad that computers in this modern age still require me to be in their business. Doesn't it seem like the cpu's own business to move bytes efficiently? Why is the compiler, much less the programmer, involved? The tests being made in the compiler/lib are of factors better-known at runtime (overlap, size, alignment) and better handled by microcode.
A really robust memmove library routine should handle about eleven different factors, one of which is alignment. I don't know of ANY library that handles that right, probably because it's so hard. E.g. an unaligned source and an unaligned dest with different alignments is very hard. Usually they settle on aligning the destination (unaligned cache writes are more expensive). The true solution is to load the partial source, then loop loading whole aligned source words, shifting values in multiple registers to create aligned destination words to store.
That all requires about 16 different unrolled code loops to cover all the cases. Nobody bothers. So nobody ever got the best performance in a general memmove anywhere. Sigh.
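The shift-and-merge approach described above can be sketched in portable C. This is a simplified illustration under several assumptions, not a production memmove: it copies forward only (no overlap handling), it uses one loop with a variable shift count rather than a separate unrolled loop per alignment case, it takes the shifting path only on little-endian machines, and the names `shift_merge_copy` and `WORD_BYTES` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES sizeof(uint64_t)

static int is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Forward copy of n bytes; the regions must not overlap. */
void *shift_merge_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (d != NULL && s != NULL));

    /* Head: byte copies until the destination is word-aligned. */
    while (n > 0 && ((uintptr_t)d % WORD_BYTES) != 0) {
        *d++ = *s++;
        n--;
    }

    /* How far the source is from a word boundary now. */
    size_t off = (size_t)((uintptr_t)s % WORD_BYTES);

    if (off == 0) {
        /* Same relative alignment: plain aligned word copies. */
        while (n >= WORD_BYTES) {
            uint64_t w;
            memcpy(&w, s, WORD_BYTES);
            memcpy(d, &w, WORD_BYTES);
            s += WORD_BYTES; d += WORD_BYTES; n -= WORD_BYTES;
        }
    } else if (is_little_endian() && n >= 2 * WORD_BYTES) {
        /* a points at the next aligned source word after s. */
        const unsigned char *a = s + (WORD_BYTES - off);
        uint64_t prev = 0;

        /* Build the partial first word from in-bounds bytes only. */
        for (size_t i = off; i < WORD_BYTES; i++)
            prev |= (uint64_t)s[i - off] << (8 * i);

        /* Each pass does one aligned load and one aligned store. */
        while (n >= 2 * WORD_BYTES) {
            uint64_t next, out;
            memcpy(&next, a, WORD_BYTES);              /* aligned load */
            out = (prev >> (8 * off)) |
                  (next << (8 * (WORD_BYTES - off)));  /* merge halves */
            memcpy(d, &out, WORD_BYTES);               /* aligned store */
            prev = next;
            a += WORD_BYTES;
            s += WORD_BYTES; d += WORD_BYTES; n -= WORD_BYTES;
        }
    }

    /* Tail (or everything, if no word path applied). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The variable shift keeps the code short. The "16 unrolled loops" version would instead specialize one loop per source offset so that each shift count is a constant.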