It's amazing to me that every CPU doesn't have specialized memory copying instructions given how common an operation this is and how it can be a bottleneck. We have hundreds of vector instructions to optimize all kinds of special cases but little for this most general thing?



Intel has `rep movs`, which is recommended for general use; you can beat it for sizes < 128B with a specialized loop. A very good general-purpose memcpy would just branch on the size, dispatching to either a specialized small copy or `rep movs` for larger sizes.
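A minimal sketch of that dispatch, assuming x86-64 and GCC/Clang inline asm; the 128-byte cutoff and the function name are just taken from this comment, not from any measured tuning:

    #include <stddef.h>

    static void *my_memcpy(void *dst, const void *src, size_t n)
    {
        if (n < 128) {
            /* Small copies: a plain byte loop here; a real implementation
               would use overlapping word/vector loads instead. */
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--)
                *d++ = *s++;
            return dst;
        }
        /* Large copies: let the microcoded string move do the work.
           RDI = destination, RSI = source, RCX = byte count. */
        void *ret = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }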

The glibc version is just bananas. https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysde...

Compare the newer ERMS implementation: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...


What's weird is that if you git blame a bit, you find that a lot of the elaborate stuff was contributed by Intel.


Intel is at least as guilty of chasing misleading benchmarks as anyone else in the history of our field.


No AVX2-optimized version? I'm disappointed…

Edit: it's rolled into the ERMS version.


x86 does; it's called REP MOVS, and it copies entire cache lines at a time (if the size of the copy allows it). AFAIK it has done this since the P6 family (Pentium II), and it has only gotten better over time. As a bonus, the instruction is only 2 bytes long. I believe the Linux kernel uses it as its default memcpy()/memmove().

Here's some very interesting discussion about it: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...
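For the curious, wrapping the instruction itself takes almost nothing. A minimal sketch (x86-64, GCC/Clang inline asm; the function name is made up, and this is not the kernel's actual implementation):

    #include <stddef.h>

    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        /* `rep movsb` is the 2-byte F3 A4 encoding; RDI = destination,
           RSI = source, RCX = count. The CPU may move whole cache lines
           internally even though the operand size is one byte. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }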


Funnily enough, the REP family of instructions has existed since the original 8086/8088. It's very much a relic of the CISC heritage of the x86 ISA, and it was probably meant to make writing assembly code easier and to save space rather than to improve performance.

I remember it was actually quite common in early PC programs, but I don't remember well enough whether it was also generated by compilers, or just existed in hand-optimized assembly (which was of course extremely common back then).


Why isn't it the default everywhere?


As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic: they don't suffer from mispredicted branches, and they don't experience icache pressure. Basically, the microarchitectural reality is absent from a small benchmark.

There's also the fact that the `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.

This is all discussed in Intel's optimization manual, by the way.


Neat. I was unaware of this manual. For the lazy but curious:

https://software.intel.com/content/www/us/en/develop/downloa...

And if you want ALL THE THINGS:

https://software.intel.com/content/www/us/en/develop/article...


Yes, some compilers emit AVX move instructions for memcpy or for compiler-generated moves (e.g. a struct passed by value), in addition to the `rep movs` mentioned in the sibling posts.
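Illustrative sketch only (the struct and function names are made up): with AVX enabled, e.g. `gcc -O2 -mavx2 -S`, compilers will often lower a plain struct assignment like this to wide vector moves or a memcpy call; the exact instructions chosen are compiler- and target-dependent:

    struct blob {
        char bytes[256];
    };

    void copy_blob(struct blob *dst, const struct blob *src)
    {
        /* Compiler-generated move: no explicit memcpy in the source. */
        *dst = *src;
    }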



