

Intel's take on GCC's memcpy implementation - mtdev
http://software.intel.com/en-us/articles/memcpy-performance/

======
wolf550e
This article is old: March 9, 2009 1:00 AM PDT

Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel
can save and restore FPU state if the copy is long enough that doing SSE/AVX
is worth it. Someone on the Linux kernel mailing list measured that performance
depends on src and dest being 64-byte aligned relative to each other: when they
are, "rep movsb" is faster than SSE.
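For what it's worth, "aligned relative to each other" is about the offset
within a cache line, not absolute alignment; a dispatcher could test it with
an XOR trick like this (the function name is made up, and the 64-byte line
size is an assumption about the CPU):

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero when dst and src sit at the same
 * offset within a 64-byte cache line, i.e. they are aligned relative to
 * each other in the sense of the LKML measurement quoted above. */
static int same_cacheline_offset(const void *dst, const void *src)
{
    /* XOR cancels any common offset; the low 6 bits are the relative
     * misalignment within a 64-byte line. */
    return (((uintptr_t)dst ^ (uintptr_t)src) & 63) == 0;
}
```

A memcpy dispatcher could pick the "rep movsb" path when this returns true and
an SSE path otherwise; whether that split actually wins on a given CPU is
exactly what the LKML thread measured.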

The thread: <https://lkml.org/lkml/2011/9/1/229>

[http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git...](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/x86/lib/memcpy_64.S;hb=HEAD)

[http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...](http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-ssse3.S;hb=HEAD)

------
abrahamsen
> the developer communications don't appear on a public list. There is no
> visible public help forum or mail list

<http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc>

Seems public to me.

~~~
ominous_prime
The list is publicly archived, but glibc's maintainer (Ulrich Drepper)
actively discourages public interaction for the project. The project's policy
is that bug reports should almost always go through a Linux distribution, and
to say it nicely, Drepper can be difficult to persuade.

Debian was in the process of switching to eglibc in order to avoid glibc (and
Drepper) and to fix issues they saw with the library.

------
shin_lao
A couple of years ago, before SSE existed, I wrote a highly optimized memory
copy routine. It was more than just using movntq (non-temporal stores matter
to avoid cache pollution) and the like: for large data I copied chunks into a
local buffer smaller than one page, then copied that to the destination.
Sounds crazy? It was actually much faster because of page locality.

For small chunks however, nothing was faster than rep movsb, which moves one
byte at a time.
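The large-copy trick described above can be sketched in portable C. This is
just an illustration of the staging idea; the buffer size and names are
invented, and whether it still wins on current hardware would need measuring:

```c
#include <stddef.h>
#include <string.h>

#define STAGE_SIZE 2048  /* comfortably under one 4 KiB page */

/* Sketch of the staged copy described above: pull each chunk of a large
 * copy through a small local buffer so reads and writes each touch only
 * a couple of hot pages at a time. */
static void staged_copy(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_SIZE];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < STAGE_SIZE ? n : STAGE_SIZE;
        memcpy(stage, s, chunk);  /* stage the source chunk locally */
        memcpy(d, stage, chunk);  /* then stream it to the destination */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
}
```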

------
memset
Someone tell me if I am mistaken - but it looks like the main difference
between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc
using `movdqa`, the latter having a shorter decode time and possibly shorter
execution time?

~~~
bdonlan
No, the problem is with x86-64, which apparently doesn't use `rep movsl`; as
far as I can tell, GCC's x86-64 backend assumes that SSE will be available,
and so only has an SSE inline memcpy. However, in the kernel SSE is not
available (SSE registers aren't normally saved, to save time), so this is
disabled. With no non-SSE fallback (such as `rep movsl` on x86), gcc falls
back to a function call, with the performance impact this implies.

~~~
sliverstorm
From the sound of it, the function call was not the issue, so much as the
function that gets called is old and non-optimal with modern tools.

------
JoeAltmaier
I'm sad that computers in this modern age still require me to be in their
business. Doesn't it seem like the CPU's own business to move bytes
efficiently? Why is the compiler, much less the programmer, involved? The
tests being made in the compiler/lib concern factors better known at runtime
(overlap, size, alignment) and better handled by microcode.

~~~
Andys
Hardware improvements necessarily move slower than software, especially when
carrying the complex historical baggage of x86's out-of-order execution.

To be fair, things are improving, e.g. the latest Intel CPUs no longer slow
down on unaligned memory accesses.

~~~
JoeAltmaier
Really? That's huge!

A really robust memmove library routine should handle about eleven different
factors, one of which is alignment. I don't know of ANY library that handled
that right, probably because it's so hard. E.g. unaligned source plus
unaligned dest with a different alignment is very hard. Usually they settle
for aligning the destination (unaligned cache writes are more expensive). The
true solution is to load the partial source, then loop loading whole aligned
source words, shifting values across multiple registers to create aligned
destination words to store.

That all requires about 16 different unrolled code loops to cover all the
cases. Nobody bothers. So nobody ever got the best performance in a general
memmove anywhere. Sigh.
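The load-aligned-then-shift scheme described above can be sketched in C for
one case (64-bit words, little-endian, forward copy, destination aligned
first). The names are invented and the edge handling is simplified, so treat
it as an illustration rather than a real memmove:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a forward copy that aligns the destination, then builds each
 * aligned destination word from two aligned source loads plus shifts.
 * Assumes little-endian; the initial rounded-down load may read a few
 * bytes before src (within the same aligned word), as real
 * implementations do. */
static void shift_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* Byte-copy until the destination is 8-byte aligned (unaligned
     * stores are the expensive side). */
    while (n && ((uintptr_t)dst & 7)) { *dst++ = *src++; n--; }

    size_t off = (uintptr_t)src & 7;      /* remaining source misalignment */
    if (off == 0) {
        /* Same relative alignment: plain aligned word copy. */
        while (n >= 8) { memcpy(dst, src, 8); dst += 8; src += 8; n -= 8; }
    } else if (n >= 8) {
        const unsigned char *s = src - off;     /* round down to alignment */
        unsigned sh = 8 * (unsigned)off;
        uint64_t lo, hi, w;
        memcpy(&lo, s, 8); s += 8;              /* aligned carry word */
        while (n >= 8) {
            memcpy(&hi, s, 8); s += 8;          /* next aligned source word */
            w = (lo >> sh) | (hi << (64 - sh)); /* splice (little-endian) */
            memcpy(dst, &w, 8);                 /* one aligned store */
            lo = hi;
            dst += 8; src += 8; n -= 8;
        }
    }
    while (n) { *dst++ = *src++; n--; }         /* tail bytes */
}
```

A real version would also need the backward direction for overlapping moves
and, as the comment above says, an unrolled variant per relative alignment;
this covers just one of those cases.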

~~~
Andys
PCs will never be perfect. Huge compromises have had to be made all over the
hardware and software, to give us the cheap, ubiquitous computing power which
drives the Internet.

------
vz0
Agner Fog found this issue a year earlier, in 2008:

<http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html>

