As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.
There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.
This is all discussed in Intel's optimization manual, by the way.