shrq $3, %rcx
andl $7, %edx
movl %edx, %ecx
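For anyone who doesn't read AT&T syntax: those three instructions split a byte count into a number of 8-byte words (count >> 3) and a leftover byte count (count & 7). The surrounding instructions aren't shown here, so the C below is only a sketch under the assumption that the words get copied first and the leftover bytes after. `copy_words_then_bytes` is a made-up name, not a kernel or glibc function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes as (n >> 3) 8-byte words,
 * then (n & 7) trailing bytes. Assumes non-overlapping buffers. */
void *copy_words_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = n >> 3;  /* corresponds to: shrq $3, %rcx */
    size_t tail  = n & 7;   /* corresponds to: andl $7, %edx */

    while (words--) {
        uint64_t w;
        memcpy(&w, s, sizeof w);  /* 8-byte load, safe for unaligned s */
        memcpy(d, &w, sizeof w);  /* 8-byte store */
        s += 8;
        d += 8;
    }
    while (tail--)                /* leftover bytes, one at a time */
        *d++ = *s++;
    return dst;
}
```

For example, a 27-byte copy becomes 3 words plus 3 trailing bytes.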
A hand-written memmove is faster in microbenchmarks, but icache effects may make the overall performance difference smaller (or even negative). That's harder to measure.
AFAIK glibc does get better results than the kernel approach, but its maintainers have also introduced bugs that way, and the code is very complex by comparison. Also worth noting: the kernel usually runs with a cold(er) icache, which is why it is compiled for size rather than speed (last I checked anyway, which was long ago) and why the kernel's memmove may make more sense in that context. I also suspect copies in the kernel are more likely to be large (e.g. packets, pages, blocks), as opposed to the small objects typical of userspace.
For the lazy:
On aligned data, REP MOVS will automatically select the largest available register/load/store width to use. So invoking this instruction also decides between SSE2, AVX, and AVX512 (or rather, that choice is burned into the silicon). Furthermore, if your allocation is large enough, it will bypass caching mechanisms for an additional speedup.
For the VERY lazy:
REP MOVS will attempt to saturate DRAM bandwidth. The best that hand-coded ASM or C can hope to do is match it.
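For reference, here is roughly what invoking REP MOVS from C looks like. This is a minimal sketch assuming GCC or Clang inline assembly on x86; `copy_rep_movsb` is a hypothetical wrapper, not something from the kernel or glibc, and on other targets it just falls back to `memcpy`. It makes no claim about how fast the instruction is on any particular CPU.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper around REP MOVSB; assumes non-overlapping buffers. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* The instruction takes the destination in (R/E)DI, the source in
     * (R/E)SI and the byte count in (R/E)CX, and updates all three.
     * The calling convention leaves the direction flag clear, so the
     * copy runs forward. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    /* Assumption: on non-x86 targets, defer to the library routine. */
    return memcpy(dst, src, n);
#endif
}
```

All of the width selection discussed above happens inside the CPU; the software side is just this one instruction plus register setup.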
For small copies with buffers that aren't aligned or cacheline-multiple length and are resident in L1 cache, it's possible to be significantly faster than REP MOVS using a software sequence of AVX instructions. This is because the branches in these sequences are usually perfectly predictable in the real world, but the "microcode sequence" used by REP MOVS does not benefit from the branch predictor. This imposes a static startup cost (a few tens of cycles) that exceeds function-call overhead, which keeps software implementations of memcpy in business.
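To make the "software sequence" idea concrete, here is one common shape it takes for a small size class: two fixed-size moves that may overlap, with no loop and no length-dependent branch inside the class. This is only a sketch. It uses 8-byte scalar moves for portability rather than the AVX registers mentioned above (the same trick applies with wider registers), and `copy_8_to_16` is a hypothetical name, not a function from any real libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for 8 <= n <= 16 bytes, non-overlapping buffers.
 * Loads the first 8 and the last 8 bytes, then stores both. For n < 16
 * the two stores overlap in the middle, which is harmless because they
 * write the same source bytes there. */
void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    assert(n >= 8 && n <= 16);
    /* Fixed-size memcpy calls like these are typically compiled to
     * single load/store instructions rather than library calls. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

A full memcpy would dispatch on size to a handful of routines like this one, and that dispatch is where the well-predicted branches come in.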
That's not exactly microcode as the term is classically understood, but there isn't really a better term for it that's widely used.
What are the technically correct but less widely used terms? And how does the current Intel approach differ from classical microcode? Most of my knowledge is from the Intel manuals, so I don't have the context for how it differs from other approaches.