
Memcpy vs. memmove - luu
http://www.tedunangst.com/flak/post/memcpy-vs-memmove
======
twoodfin
Linus had some strong opinions on the distinction between memcpy and memmove
when it became a significant userspace issue on Linux a few years ago:

[https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132](https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132)

I think he's probably right that there's no reason not to alias memcpy to
memmove and make the latter smart enough to do the fastest possible thing
given its input. Every implementation of memcpy and memmove that I've seen has
been sufficiently complicated in order to optimize for big copies that an
added comparison or two for bounds checking would seem a drop in the bucket.

But I'm open to practical examples where the performance difference would
become significant.
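
A minimal sketch of what that aliasing could look like (scalar only, and `safe_copy` is a made-up name; a real libc layers word-at-a-time and SIMD fast paths on top of the overlap check):

```c
#include <stddef.h>

/* Sketch: one always-safe copy routine, per Linus' suggestion.
 * The overlap check picks a copy direction; real implementations
 * add word-sized and SIMD fast paths on top of this. */
void *safe_copy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d == s || n == 0)
        return dst;
    if (d < s || d >= s + n) {
        while (n--) *d++ = *s++;   /* forward: dst is below src or past the end of src */
    } else {
        d += n; s += n;
        while (n--) *--d = *--s;   /* backward: dst overlaps the tail of src */
    }
    return dst;
}
```

The two extra pointer comparisons are the "drop in the bucket" next to the size-dispatch logic both functions already carry.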

~~~
masklinn
One of the commits asserts memcpy triggers extra optimisations by hinting the
areas are non-overlapping: [http://marc.info/?l=openbsd-cvs&m=137026274514948&w=2](http://marc.info/?l=openbsd-cvs&m=137026274514948&w=2)

> Replace "hot" bcopy() calls in ether_output() with memcpy(). This tells the
> compiler that source and destination are not overlapping, allowing for more
> aggressive optimization, leading to a significant performance improvement on
> busy firewalls.
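
The hint comes from memcpy's prototype, whose pointer arguments are `restrict`-qualified. A small illustration of what that promise buys (`scale_copy` is an invented example, not from the commit):

```c
#include <stddef.h>

/* memcpy is declared roughly as
 *     void *memcpy(void * restrict dst, const void * restrict src, size_t n);
 * The restrict qualifiers promise the regions don't overlap. The same
 * promise lets the compiler vectorize a loop like this without emitting
 * a runtime overlap check: */
void scale_copy(float * restrict dst, const float * restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;   /* loads and stores may be freely reordered */
}
```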

~~~
makomk
The issue that caused Linus to argue memcpy should just act like memmove
involved old binaries breaking with new kernels, so the compiler's behaviour
wasn't an issue there.

~~~
cbsmith
It was a bit more than that. He implied that the performance differences were
trivial, so it wasn't worth the trouble.

I think, unfortunately, Linus' argument amounts to demanding that APIs
conform to how they are used rather than how they are specified, which leads
to O_PONIES-type problems.

------
robinhouston
I was pretty surprised to learn that “the x86 architectures have a direction
flag that can be set to cause the processor to run backwards”.

It turns out not to be quite as dramatic as that description makes it sound:
[https://en.wikipedia.org/wiki/Direction_flag](https://en.wikipedia.org/wiki/Direction_flag)
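
Concretely, the flag just makes the string instructions decrement their pointers instead of incrementing them. An x86-only sketch using GCC/Clang inline asm (`copy_backward` is a hypothetical helper taking pointers to the *last* byte of each buffer):

```c
#include <stddef.h>

/* x86 only: with the direction flag set (std), rep movsb walks RSI/RDI
 * downward -- this is the "running backwards" in question. */
static void copy_backward(char *dst_last, const char *src_last, size_t n) {
    __asm__ volatile("std\n\t"
                     "rep movsb\n\t"
                     "cld"            /* the ABI requires DF cleared on return */
                     : "+D"(dst_last), "+S"(src_last), "+c"(n)
                     : : "memory");
}
```

This is exactly the trick a backward memmove can use when the destination overlaps the tail of the source.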

------
tgb
I looked up nop sleds but don't see why they're used here. Explanation?

~~~
danielweber
My best guess is to properly align instructions along a certain boundary, but
I'm pulling that out of my hat.

You can put bcopy's entry point right before memmove's, and then you don't
need one function to call the other (which would cause a stack push). Maybe
instructions need to be at addresses = 0 mod 16, so that's the "closest" you
can get it. And spinning over ~12 NOPs might be faster than incrementing the
PC by ~9.

~~~
cjubb39
Yeah, looks like it's for function alignment padding. It's pretty common to
pad the end of one function so that the next function starts on a specific
boundary (even if execution never falls through from the first function into
the next).

I haven't tested, but I'd bet good money that 12 NOPs would be faster than a
jmp.

~~~
pbsd
You can do an unconditional jump every 1 or 2 cycles, depending on the chip,
whereas no chip I know of can execute more than 4 nops per cycle. Therefore I
would say the jump is probably marginally faster than 12 nops.

Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a
9-byte one and a 3-byte one.

~~~
0x0
What does a 9byte NOP look like?

~~~
pkhuong
[https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64...](https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/insts.lisp#L2978)
has

0x66 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00

That's a size override prefix, followed by the dedicated NOP instruction (0x0f
0x1f), and finally 6 bytes to encode an effective address with offset.

~~~
makomk
Multi-byte nops have compatibility issues on some of the more obscure 32-bit
x86 CPUs, unfortunately:
[https://sourceware.org/bugzilla/show_bug.cgi?id=13675](https://sourceware.org/bugzilla/show_bug.cgi?id=13675)

~~~
pkhuong
Right… you have to check cpuid for the long nop feature. I believe 0x66 0x90
is compatible (but slow, I would expect) with older CPUs.

------
corysama
I believe that on several modern compilers, memcpy and memset are practically
-if not literally- treated as intrinsics. As in, the compiler has been granted
semantic understanding of those specific functions and can generate very well
optimized assembly based on that knowledge. Haven't heard about the same being
done for memmove.

~~~
lgeek
What compilers are you thinking about? I've never seen GCC inlining and
specializing a call to memcpy, it just calls the generic (but optimized)
implementation in the standard library.

~~~
nkurz
I take it that you are a long-time C programmer who hasn't realized that the
language is no longer the one you grew up loving. That casting a float to an
int to do some bit twiddling is now undefined behavior of the sort that
technically entitles the compiler to have its way with your root filesystem.
Although usually it will just let you off with a warning and a gaping hole
where your null checking code used to be.

Believe it or not, memcpy() is now the only portable and properly defined way
to cast one type to another in C. You are expected to use it with the explicit
intent that no bytes actually be copied, simply as a (completely voluntary)
offering to the type gods. If they accept your offering, the call will just
disappear from your code, and never be made. If you fail to make this
offering, they feel entitled to excise an equal quantity of other code to
punish you.

Sure, you the programmer happen to know that bits are bits, and that on your
system they are already in the register you want to use, but the compiler
stopped playing that game years ago. If you wanted to work close to the metal,
you should have chosen a language more appropriate for the task. There's a
great discussion of the issues in this epic comp.arch thread:
[http://compgroups.net/comp.arch/if-it-were-easy/2993157](http://compgroups.net/comp.arch/if-it-were-easy/2993157)

Search for the first occurrence of 'memcpy', where you'll find a polite but
beleaguered Terje Mathisen asking for the best way to portably cast a float to
an integer in C. Then keep searching forward for further occurrences of memcpy
as the situation becomes surreal, with GCC maintainer Mike Stump eventually
clearing things up:

    
    
      >>> So what is the blessed method?
      >>
      >> Just memcpy it, simple sweet, fast, standard.
      >
      >So what you are saying is that memcpy() isn't just magic but high magic:
    
      No.  What I think I'm saying is that it is standard.  See the quoted
      text above.  The word magic [ checking ] doesn't not occur in c99-tc3.
      What is defined in that standard is memcpy, it is as standard as if,
      which is also defined.
    
      >It looks like a function call but can be whatever the compiler wants it 
      >to be, as long as the results behave as if the data was actually copied.
      >:-)
    
      No, you fail to grasp the totality of the standard.   The implementation 
      is free to do _anything_ it wants.  The only constraint is that the user 
      can't figure that it deviated from the required semantics by using a standards
      defined mechanism for figuring it out.  We call this the as-if rule, and its
      power is awesome; we could destroy the universe, and repiece it back together
      one subatomic particle at a time over a billion years, and still be compliant,
      if we wanted.
    

So there you have it: if you want to reuse the exact contents of a register as
a different type in a way that you know will work, the correct approach is to
use memcpy() to copy the data from one variable to another, and then hope
without confirmation that GCC has optimized out the call to make it equivalent
to a simple cast.
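
The canonical offering, for the record (the constant in the test comment assumes IEEE-754 single precision):

```c
#include <stdint.h>
#include <string.h>

/* The blessed way to reuse a float's bits as an integer: memcpy between
 * two locals. GCC and Clang compile this to a single register move at
 * any optimization level above -O0; no call is emitted, no bytes move. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```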

If you want a better understanding of the GCC mindset, the rest of Mike's
posts are excellent: cutting, extremely technically accurate, and (in my
opinion) completely missing the point of why some people are unhappy with the
direction that C is evolving.

~~~
stephencanon
C11 explicitly blessed unions for type-punning as well as memcpy (actually,
C99 TC2 did). That said, people should just use memcpy.
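
The blessed union form, as a sketch:

```c
#include <stdint.h>

/* C99 TC3 (6.5.2.3, footnote 82) and C11 make this well-defined: reading
 * a union member other than the one last stored reinterprets the bytes. */
static uint32_t float_bits_union(float f) {
    union { float f; uint32_t u; } pun = { .f = f };
    return pun.u;
}
```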

~~~
nkurz
Yes, it's now in the standard (someone in the thread I linked pointed to the
TC3 draft, §6.5.2.3, footnote 82), but apparently there are issues with the way
GCC implements it in its default dialect that make it unsafe. More Mike Stump:

    
    
      >>> So what about using a union?
      >>
      >> Most people screw it up, the rules for it working are slightly odd.
      >> Those rules are not standard[1], rather they are gcc specific, so in very
      >
      >Ouch, so what you are saying is that this is one of those (according to 
      >Nick M) barely defined areas of the language?
    
      What I am saying is that it is defined to not work in the language
      standard gcc implements by default.  There is no barely, there is no
      defined.  As an extension to the language standard, gcc implements
      (defines) a few things so that users can make some non-standard things
      always work.  I say slightly odd, as the rules are just a tad harder
      than trivial.
    

I prefer the union approach to memcpy() because I can more easily reason about
its behaviour, but veiled warnings from GCC maintainers scare me away from it.
But perhaps I misunderstand the warning.

For that matter, I prefer simple casts to unions, and while I agree the spec
makes it undefined, I don't yet see why all common implementations couldn't
simply make it work in all reasonable cases. I'm currently trying to get used
to using memcpy() for type annotation, but it still feels unnatural to write a
function I don't want executed (yes, I have trouble with setters/getters too).

What I'd like is to have a way to specify to the compiler that I want it to
compile the code I give it as written as best as it can, rather than
optimizing it out as undefined. As it is, I often resort to inline assembly if
I actually want an operation to happen. It seems like there should be an
intermediate approach, a hypothetical '#pragma "dwim"' that could avoid this.

------
pmalynin
repne movsb

~~~
nhaehnle
I guess somebody downvoted you for the shortness of your comment. You probably
know this, but repne movsb is actually a fairly slow approach to copying data
compared to the various SSE assembly implementations that libc and friends
have. The only thing repne movsb has going for it is how short its encoding
is.

It _is_ a bit bizarre that modern CPUs don't have an instruction for something
as basic as "copy N bytes as fast as you can", and instead we're in a
situation where library and compiler writers have to tune their assembly for
different micro-architectures. (I'm not saying that it's a wrong decision,
there are probably good reasons for doing things this way, but it's not
something you'd expect.)

~~~
stephencanon
"rep movsb" (note: not "repne") actually is one of the fastest ways to copy
memory on Ivybridge and Haswell (about as fast as vectorized copy
implementations when the data is resident in L1/L2, and significantly faster
than vector copy loops when the buffers are too large to be cache resident).
On micro-architectures preceding Ivybridge, rep movsb is indeed slow.

It's still not quite "as fast as you can"; it's possible to beat rep movsb
when buffers are small due to edging effects, but it's about as close as it's
possible to come to that today.
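
For concreteness, the sequence under discussion wrapped in GCC/Clang inline asm (x86-64 only; `copy_rep_movsb` is a hypothetical name):

```c
#include <stddef.h>

/* "rep movsb": copy RCX bytes from [RSI] to [RDI]. On ERMSB parts
 * (Ivy Bridge and later) this runs close to peak copy bandwidth. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : : "memory");
    return ret;
}
```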

~~~
userbinator
_On micro-architectures preceding Ivybridge, rep movsb is indeed slow._

Actually, rep movsb/movsw/movsd has been at the top at least since Nehalem
(confirmed with benchmarks), and "fast string mode" which does cacheline-sized
copies has been around since the P6; you may be able to squeeze out a few %
more with SSE (or MMX), but the much larger code of the SSE-based copy
functions (especially for alignment) is often not a win overall. Intel only
really started advertising that they made it _even faster_ with Ivybridge.

It was the fastest way to do memory copies on the 8086/8088, and might've been
the fastest until around the time of the 486 and Pentium when it could be
beaten by other techniques; but now it seems that it's coming back in favour.

There's an interesting discussion about this instruction on the Intel forums
here: [https://software.intel.com/en-us/forums/topic/275765](https://software.intel.com/en-us/forums/topic/275765)

~~~
stephencanon
It was still nowhere near as fast as a good software sequence on Nehalem or
Sandybridge. The startup cost to the microcode engine was simply too high, and
its handling of misalignment was very inefficient. For "small" (< 1K)
unaligned buffers (which are actually a large portion of memcpy usage on a
live system), good software sequences were 3-4x faster than REP MOVS.

It _did_ perform competitively for large all-aligned copies, which are what
most people tend to look at when they post "memcpy benchmarks", but those turn
out to be a relatively small portion of actual usage in most workloads.

------
anon4
I keep dreaming of a world where memcpy is an instruction to the memory
controller _only_ and doesn't hold up the CPU at all. It feels kind of silly
to have this complicated piece of silicon with all its amazing processing
capabilities just sit there reading RAM into registers and then writing them
back doing precisely nothing.

~~~
nsrango
This is done, but mostly for peripheral IO (like DACs/ADCs, disk, video or
audio codecs, and even between non-shared memory for processors); it can be
(and has been) used in the main memory hierarchy too.
[http://en.wikipedia.org/wiki/Direct_memory_access](http://en.wikipedia.org/wiki/Direct_memory_access)

~~~
Torgo
Some game systems did this too. For instance the Gameboy Advance had DMA, and
it was used all the time because if you were eating up your cycles on copying
memory, you would run out of time to process your game logic before your
vertical blank ended and the system started drawing onscreen.

I also saw it used to save memory. Onscreen sprites had to be placed into a
small memory-mapped address range, so instead of putting every character
animation frame into this space and using it all up, I saw just one block
designated for the character, and it was animated by DMAing the frames
sequentially into that space.

------
angersock
Note that, if memory serves correctly, many vendors just alias memcpy to
memmove anyways, because the performance is pretty similar and the bugs
prevented are super annoying.

~~~
jandrese
It's fun to benchmark memmove and memcpy on a box to see if memcpy has more
optimizations or not. On Linux x86_64 gcc memcpy is usually twice as fast when
you're not bound by cache misses, while both are roughly the same on FreeBSD
x86_64 gcc.

Linux (2.4Ghz Xeon X3430):

./memtest 10000 1000000

memcpy took 0.575571 seconds

memmove took 1.082038 seconds

FreeBSD (2.0Ghz AMD Athlon64 3000):

./memtest 10000 1000000

memcpy took 1.487334 seconds

memmove took 1.442741 seconds
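
A micro-benchmark in this spirit is easy to sketch (this is not the poster's actual program; `bench` and its `clock()`-based timing are assumptions):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` copies of `bytes` bytes through fn (memcpy or memmove).
 * Calling through a function pointer keeps the compiler from inlining
 * or eliding the copy. */
static double bench(void *(*fn)(void *, const void *, size_t),
                    size_t bytes, size_t iters) {
    char *src = malloc(bytes), *dst = malloc(bytes);
    memset(src, 'x', bytes);
    clock_t start = clock();
    for (size_t i = 0; i < iters; i++)
        fn(dst, src, bytes);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return secs;
}
```

e.g. `printf("memcpy took %f seconds\n", bench(memcpy, 10000, 1000000));`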

~~~
lbenes
What memtest are you using? This one:

[https://github.com/UK-MAC/WMTools/blob/master/tests/memtest....](https://github.com/UK-MAC/WMTools/blob/master/tests/memtest.c)

~~~
jandrese
It's one I wrote myself to test this very thing after reading the memmove
manpage. I was wondering why you would ever use memcpy unless it had a big
performance advantage, so I wrote a micro-benchmark to see. If you are really
interested, I can put my benchmark program up somewhere.

Additional findings: Unsurprisingly using clang instead of gcc does not affect
performance on FreeBSD or Linux.

~~~
lbenes
Yes please. If you want to share it, I'm curious how different compilers and
architectures handle memmove vs memcpy.

