Nowadays glibc has modern SSE code, and the kernel uses "rep movsb". The kernel can save and restore FPU state if the copy is long enough that SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on whether src and dest are 64-byte aligned relative to each other: when they are, "rep movsb" is faster than SSE.
The thread: https://lkml.org/lkml/2011/9/1/229
Seems public to me.
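For reference, invoking that instruction takes almost no code. A minimal sketch (mine, not the kernel's actual implementation) of a memcpy-style copy via "rep movsb" in GCC inline assembly, x86-64 only, non-overlapping buffers assumed:

    #include <stddef.h>

    /* Copy n bytes with "rep movsb"; the CPU picks the actual transfer
     * width internally. Assumes dst/src do not overlap and that the
     * direction flag is clear (the SysV ABI guarantees this). */
    static void *copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(src), "+c"(n) /* rdi, rsi, rcx */
                         :
                         : "memory");
        return dst;
    }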
Debian was in the process of switching from glibc to eglibc, partly to get away from Drepper and partly to fix issues they saw with the library.
For small chunks, however, nothing was faster than "rep movsb", which moves one byte at a time.
To be fair, things are improving. E.g., the latest Intel CPUs no longer slow down on unaligned memory accesses.
A really robust memmove library routine should handle about eleven different factors, one of which is alignment. I don't know of ANY library that handles all of them right, probably because it's so hard. E.g., an unaligned source and an unaligned destination with different alignments is very hard. Usually implementations settle for aligning the destination (unaligned cache writes are more expensive). The true solution is to load a partial first source word, then loop loading whole aligned source words, shifting values across registers to assemble aligned destination words to store.
That all requires about 16 different unrolled code loops to cover all the cases. Nobody bothers, so nobody ever got the best performance out of a general memmove anywhere. Sigh.
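To make the shifting approach concrete, here is a hedged sketch in plain C of just one of those cases: forward copy, destination already 8-byte aligned, source misaligned, little-endian (x86). The function name is mine. Note that real implementations also have to justify the aligned reads that start a few bytes before src and may run a few bytes past the end: safe on x86 because an aligned load never crosses a page boundary, but formally out of bounds in strict C.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One case of the aligned-load/shift technique: dst is 8-byte
     * aligned, src is not, buffers don't overlap, little-endian. */
    static void copy_shifted(uint8_t *dst, const uint8_t *src, size_t n)
    {
        unsigned off = (uintptr_t)src & 7;       /* src misalignment, 0..7 */
        unsigned shift = 8u * off;

        if (n >= 8) {
            /* Aligned load covering src's first bytes; reads up to `off`
             * bytes before src (same page and cache line on x86). */
            const uint64_t *s = (const uint64_t *)(src - off);
            uint64_t lo = *s++;
            do {
                uint64_t hi = *s++;              /* next aligned source word */
                /* splice bytes off..7 of lo with bytes 0..off-1 of hi */
                uint64_t w = shift ? (lo >> shift) | (hi << (64 - shift))
                                   : lo;
                memcpy(dst, &w, 8);              /* aligned 8-byte store */
                lo = hi;
                src += 8; dst += 8; n -= 8;
            } while (n >= 8);
        }
        while (n--)                              /* tail, byte by byte */
            *dst++ = *src++;
    }

A full memmove would need a variant like this for each source/destination alignment pair, plus mirrored backward-copy versions for overlapping buffers, which is exactly the combinatorial explosion described above.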