
I am unconvinced. REP MOVS is tiny (REP MOVSB is literally two bytes, a single instruction), extremely fast on modern x86, and in my experience what little gain (if any) you get from a giant blob of code will be offset by the instruction-cache bloat it causes. If you search for "enhanced REP MOVSB" you will find some further reading.

According to Intel themselves[1], REP MOVS is optimal only for large copies, above about 2 KB. Below that, the large startup cost of the microcode sequence can be beaten by hand-written code tuned for the target architecture, which explains why glibc takes that approach. The Linux kernel, on the other hand, just goes with:

  shrq $3, %rcx      # qword count = len / 8
  andl $7, %edx      # remainder = len % 8
  rep movsq          # bulk copy, 8 bytes at a time
  movl %edx, %ecx
  rep movsb          # copy the 0-7 leftover bytes
Torvalds justified it as a way to force Intel and AMD to do the right thing and optimize the rep movs* instructions. He was annoyed at the code complexity of "more optimal" approaches and, like you, concerned about icache effects.

A hand-written memmove is faster in microbenchmarks, but icache effects may make the overall performance difference smaller (or even negative). That's harder to measure.

AFAIK glibc does get better results than the kernel approach, but it has also introduced bugs[3] that way, and the code is very complex by comparison. Also worth noting is that the kernel usually runs with a cold(er) icache, which is why it is compiled for size rather than speed (last I checked anyway, which was long ago) and why its memmove may make more sense in the context of the kernel. I also suspect copies in the kernel are more likely to be large (e.g. packets, pages, blocks), as opposed to the small objects common in userspace.

[1] http://www.intel.com/content/www/us/en/architecture-and-tech...

[2] https://github.com/torvalds/linux/blob/master/arch/x86/lib/m...

[3] https://bugzilla.redhat.com/show_bug.cgi?id=638477#c99

I believe this[1] is what you are referencing. Additional reading for the truly knowledge-hungry: [2] (section 3.7.6).

For the lazy:

On aligned data, REP MOVS will automatically select the widest available load/store width. So the instruction itself decides between SSE2, AVX, and AVX-512 (or rather, that choice is burned into the silicon). Furthermore, if your copy is large enough, it will bypass the normal caching mechanisms for an additional speedup.

For the VERY lazy:

REP MOVS will attempt to saturate DRAM bandwidth. The best hand-coded asm or C can hope to do is match it.

[1] https://stackoverflow.com/questions/26246040/whats-missing-s...

[2] https://www-ssl.intel.com/content/www/us/en/architecture-and...

REP MOVS is the fastest option for huge copies on Ivybridge and later by a large margin, as it uses special cache semantics that aren't available via any other instructions.

For small copies with buffers that aren't aligned or cacheline-multiple length and are resident in L1 cache, it's possible to be significantly faster than REP MOVS using a software sequence of AVX instructions. This is because the branches in these sequences are usually perfectly predictable in the real world, but the "microcode sequence"[1] used by REP MOVS does not benefit from the branch predictor. This imposes a static startup cost (a few tens of cycles) that exceeds function-call overhead, which keeps software implementations of memcpy in business.

[1] Not exactly microcode as that term is classically understood, but there isn't really a better term for it that's widely used.

What are technically correct but less widely used terms? And how does the current Intel approach differ from classical microcode? Most of my knowledge is from the Intel manuals, so I don't have a context for how it is different than other approaches.

I don't think there is one, really. "Fast microcode" or something; maybe Intel has an internal name.

REP MOVS is the fastest for large moves, but it seems to have a higher fixed overhead. Intel's optimization manual goes into depth with comparisons between the various methods of blitting.

I am skeptical as well, and I agree with you in principle. However, I'd be interested to see whether a third party can reproduce his benchmarks. He credits various prefetch hints for the performance gains, which sounds somewhat plausible.

The code may bloat, but only because of the permutations: on each call, you execute only the small case that applies. Maybe it's not that bad; it probably all fits in the nearest instruction cache?

Latency for short copies is very important. I think "rep movs" is always going to win in this case.
