
Apex memmove – fast memcpy/memmove on x86/x64 - Tatyanazaxarova
http://www.codeproject.com/Articles/1110153/Apex-memmove-the-fastest-memcpy-memmove-on-x-x-EVE
======
userbinator
I am unconvinced. REP MOVS is tiny (it is literally 2 bytes, a single
instruction), _extremely_ fast on modern x86, and in my experience what little
gain (if any) you get from the giant blob of code will be offset by all the
instruction cache bloat it causes. If you search for "enhanced REP MOVSB" you
will find some further reading.
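For reference, the whole copy really can be that one instruction. A minimal sketch, assuming x86-64 with GCC/Clang extended inline assembly (the name `repmovsb_copy` is mine):

```c
#include <stddef.h>

/* The entire copy loop is the single 2-byte REP MOVSB instruction;
 * RDI, RSI, and RCX hold the destination, source, and byte count.
 * x86-64 GCC/Clang only; the SysV ABI guarantees DF is clear. */
void *repmovsb_copy(void *dst, const void *src, size_t n)
{
    void *d = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
}
```

On CPUs with the "enhanced REP MOVSB" (ERMS) CPUID feature flag, this path is the one the microarchitecture specifically optimizes.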

~~~
stephencanon
REP MOVS is the fastest option for huge copies on Ivybridge and later by a
large margin, as it uses special cache semantics that aren't available via any
other instructions.

For small copies with buffers that aren't aligned or cacheline-multiple length
and are resident in L1 cache, it's possible to be significantly faster than
REP MOVS using a software sequence of AVX instructions. This is because the
branches in these sequences are usually perfectly predictable in the real
world, but the "microcode sequence"[1] used by REP MOVS does not benefit from
the branch predictor. This imposes a static startup cost (a few tens of
cycles) that exceeds function-call overhead, which keeps software
implementations of memcpy in business.

[1] Not exactly microcode as that term is classically understood, but there
isn't really a better term for it that's widely used.
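To illustrate the kind of software sequence meant here, a sketch of the overlapping-vector trick such small-copy paths rely on, shown with baseline SSE2 intrinsics (real implementations would use 32-byte AVX registers; `copy16to32` is a made-up name):

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */

/* For a 16..32-byte copy: one unaligned load/store from each end of the
 * buffer covers the whole range with no loop; the two stores simply
 * overlap in the middle when n < 32. The only branch a real memcpy needs
 * here is the initial size-class check, which predicts well in practice. */
void copy16to32(void *dst, const void *src, size_t n)
{
    __m128i head = _mm_loadu_si128((const __m128i *)src);
    __m128i tail = _mm_loadu_si128(
        (const __m128i *)((const char *)src + n - 16));
    _mm_storeu_si128((__m128i *)dst, head);
    _mm_storeu_si128((__m128i *)((char *)dst + n - 16), tail);
}
```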

~~~
nkurz
_[1] Not exactly microcode as that term is classically understood, but there
isn't really a better term for it that's widely used._

What are technically correct but less widely used terms? And how does the
current Intel approach differ from classical microcode? Most of my knowledge
is from the Intel manuals, so I don't have a context for how it is different
than other approaches.

~~~
stephencanon
I don't think there is one, really. "Fast microcode" or something; maybe Intel
has an internal name for it.

------
bboreham
I couldn't believe it was possible to write a new article about faster copies.
Reminds me of reading "Inner Loops" by Rick Booth [0] in 1996, which first
opened my mind to the darker corners of x86 optimizing.

BTW, the article is quite strangely structured. The initial piece sounds like
he has been up for four days straight and he just wants to get his thoughts
down. But he forgot a few points, so they are added in five updates. Arguably
that's the best place to start reading - around the middle of the page.

Then, he goes back in time two years and adds an earlier draft of the same
piece, noting "don't even look at the original article! It's very confusing!".

[0]
[https://books.google.co.uk/books/about/Inner_Loops.html?id=v...](https://books.google.co.uk/books/about/Inner_Loops.html?id=v2pnQgAACAAJ&redir_esc=y&hl=en)

------
wscott
I don't see how you can write an article on x86 memcpy without having
benchmarks for 'rep movs' as a baseline. That is the old x86 assembly
instruction that does the entire copy in one instruction. With modern
processors, that instruction drops into a microcode loop to move the data
using the internal instructions.

Microcode has a bit of a delay getting in there and so 'rep movs' is not so
great for short moves, but normally does pretty well for large transfers.

That is the big problem with memcpy: the optimal code depends on how it is
used. Block copies of large aligned structures want different code than moves
of short unaligned strings. Ideally, the compiler has more context and can
select different versions for different call sites.
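A minimal sketch of that idea as a run-time dispatch (the 128-byte cutoff is an assumed value, and a compiler would instead resolve the branch statically per call site; the library memcpy stands in here for the `rep movs` bulk path):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size dispatch: below the (assumed) 128-byte cutoff a
 * simple byte loop avoids the microcode startup delay; above it we
 * defer to memcpy, standing in for a rep movs bulk path. */
void *dispatch_copy(void *dst, const void *src, size_t n)
{
    if (n < 128) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;        /* short-copy path */
        return dst;
    }
    return memcpy(dst, src, n); /* large-copy path */
}
```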

~~~
tedunangst
Among other problems with benchmarking SSE versions, your operating system may
implement lazy FPU save/restore. An SSE memcpy suddenly turns every process
into an FPU-using process, increasing context-switch overhead. This is
something you won't see in single-process microbenchmarks.

~~~
pcwalton
Are there many processes out there that _don't_ use SSE these days? LLVM, for
example, will aggressively use SSE for memcpy/memmove of medium-sized values
(e.g. copies of 128-byte structs), which almost every nontrivial app will
end up using at some point. Likewise, if the autovectorizer succeeds even
once, then you'll also end up using SSE.

~~~
tedunangst
True. I'm still living in the Bronze Age, but we may already be past the
tipping point. I wonder if any consideration was given to this cost.
Personally I don't know what it is, but I know I've never seen anybody try to
measure it.

------
gens
I don't want to register to download the code, so take my comment as such.

I wrote an (actually) fast memcpy() in assembly. There are more problems than
just different sizes. The hardest problem is actually alignment, particularly
1-byte-unaligned buffers.

A couple of things (if HN doesn't autoformat):

1. SSE2 non-temporal MOV (the one that doesn't go through the cache) is
basically required when going over ~cache_size.

2. Buffers one byte unaligned (or any odd number) from CPU alignment need the
first n bytes copied first to get to that alignment (16 bytes for SSE; also a
couple of bytes at the end).

3. Buffers not aligned to each other are the biggest mess. SSSE3 actually
helps with that, as it can shuffle bytes around, but I found it easier (and
probably faster) to shift-or the things.
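Point 2 above can be sketched with SSE2 intrinsics roughly like this, assuming it's the destination's 16-byte alignment that matters (a simplification; aligning the loads instead is also common, and `aligned_copy` is a made-up name):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Peel off the first few bytes so the destination reaches 16-byte
 * alignment, stream the middle with aligned 16-byte stores (unaligned
 * loads cover mutually misaligned src/dst), then mop up the tail. */
void aligned_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* head: 0..15 bytes until d is 16-byte aligned */
    while (n && ((uintptr_t)d & 15)) { *d++ = *s++; n--; }

    /* body: aligned stores, unaligned loads */
    while (n >= 16) {
        _mm_store_si128((__m128i *)d,
                        _mm_loadu_si128((const __m128i *)s));
        d += 16; s += 16; n -= 16;
    }

    /* tail: remaining 0..15 bytes */
    while (n--) *d++ = *s++;
}
```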

There's probably more that I forgot. I lost the original code, but I have a
half-arsed version somewhere around here if someone wants to see it. It only
handles buffers 8- or 16-byte aligned (to each other), but the grand structure
is there.

PS To those saying "REP MOV* is faster": In theory, yes, but no, not really.

~~~
JoeAltmaier
Such a fascinating problem! For efficiency you want large aligned loads and
stores. The source and destination alignment are independent issues; so the
CPU uses registers for a sort of shift-and-realign engine.

And you want to unroll each case, because speed is the goal.

The longer the alignment size (4-byte? 8-byte? 16-byte?) the more permutations
of start and end conditions there are. E.g. moving from a 15-byte aligned
source to a 3-byte aligned destination. Maybe 256 combinations? Only 1 of
which is zero-aligned source and destination (full-register move) which is all
most folks think of writing to begin with.

Add in cache-line size considerations, and overlap possibilities (are we
moving a string in-place left or right by 7 characters?) Can the bus do
unaligned loads? Stores? How do they compare in performance to multiple
single-byte transfers? So many things can affect performance, each case
becomes a heuristic. After all the theorizing, testing is the only real
metric.

I've dived down this rabbit hole before. Wrote a test app to move 0-128 bytes
from source aligned 0-128 to destination aligned 0-128. The memcpy I was
testing on a RISC architecture hit 11 bugs (processor faults, incorrect
result) before I was through.

It's refreshing to find somebody else with an appreciation for the depth of
this problem. Thanks for posting, gens!

------
grenoire
His writing is difficult to read, and makes him sound like a snake oil
salesman. Good work nevertheless.

~~~
imglorp
Important safety tip! Exclaiming half your sentences makes you sound
breathless!

------
Kristine1975
_> Although these are C/C++ functions, they were `designed by disassembly` ...
meaning I was paying close attention to the compiler output. This is NOT
normally advisable (we don't want the compiler to dictate how we design our
functions), however, when you are designing things for maximum performance,
it's important to pay attention to how the code will compile! The strange
layout of the functions is a testament to my close observation of the compiler
output._

Why not write in assembly then? And this sounds like a recipe for slower code
when using a different compiler or a different version of the same compiler:

 _> The way instructions are ordered also prevents the compiler from making
some `assumptions`!_

------
graycat
Gee, for a fast memory move of blocks of data that contained some whole
virtual memory pages the fast way was just to have an API to the OS that
tweaked the virtual memory page-segment tables! I'm not sure that simple idea
has ever been used -- maybe there are some problems with it.

~~~
mikeash
I recall that macOS's memcpy does this for some cases. There's overhead for
the syscalls and TLB invalidation, so it's not efficient for smaller copies,
but it's a win for bigger ones. I think it considers the breakeven point to be
40kB. And of course the source and destination both must be page-aligned.

------
ndesaulniers
Does anyone have any reading material that goes into the issues with unaligned
load/store? I'm trying to do some research for a blog post.

~~~
nkurz
You may have already found it, but I posted some experimental results here
last year, and others contributed additional useful information:
[http://www.agner.org/optimize/blog/read.php?i=415#423](http://www.agner.org/optimize/blog/read.php?i=415#423)

------
_nalply
As for size = -size I think he was inverting the bits then adding 1 in one
step (or something similar). Just a guess.

~~~
nkurz
I didn't look at the code, but I often use something similar in assembly when
hyper-optimizing loops. The fastest memcpy-like loop structure I've found on
x64 involves using a negative offset from the end of the region, which is then
incremented until it hits zero.

The advantage of this construct is that you can use the flags set by the
increment to test for loop termination, rather than needing an additional
comparison. The problem in C is that the compiler tends to "unoptimize" it for
you after seemingly innocuous changes.

Anyway, the loop ends up looking something like this:

    
    
      char *end = start + size;
      size_t neg = -size;
      do {
        reg = *(end + neg);  /* load into a register */
        process(reg);
        neg += advance;
      } while (neg);
        

For certain loops this construct shaves off 1 cycle per iteration, at the cost
of ~1 cycle of initial setup cost. I'm wagering that he was doing something
similar, but just kept the name "size" after the negation, rather than
switching to "neg" as I did here.
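For the curious, a runnable instance of this loop shape in plain C (summing bytes rather than copying, and assuming size > 0 since it's a do/while; `sum_bytes` is a made-up name):

```c
#include <stddef.h>

/* Negative-offset idiom: walk the buffer in ascending order while the
 * loop condition rides on the flags set by the index increment, with no
 * separate compare against `end`. Requires size > 0. */
int sum_bytes(const unsigned char *start, size_t size)
{
    const unsigned char *end = start + size;
    ptrdiff_t neg = -(ptrdiff_t)size;
    int total = 0;
    do {
        total += end[neg];  /* end + neg walks start .. end-1 */
        neg += 1;           /* increment hits zero exactly at end */
    } while (neg);
    return total;
}
```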

~~~
DblPlusUngood
> The advantage of this construct is that you can use the flags set by the
> increment to test for loop termination, rather than needing an additional
> comparison.

I don't understand your optimization. ZF is also set for _dec_ , so why not
just use _dec_ and not negate the size?

~~~
nkurz
Sticking with a positive value and using _dec_ or _sub_ works if you just have
a loop counter or if it's acceptable to process the data in your buffer
backwards, but switching to negative lets you go through the buffer in
standard ascending order. Sometimes data dependencies require this, and
sometimes ascending works better for prefetching.

------
PaulHoule
Reminds me of replacing the memcpy function from Turbo Pascal with one that
ran 2x as fast on a 286 processor.

