I believe so. A large memcpy touches far more memory than fits in the cache, so many memcpy implementations switch to non-temporal stores for large copies to indicate that the destination is not going to be used again soon. Those stores bypass the cache and go directly to main memory, so valuable cache space is not used up by cache lines that are written once and never touched again. However, this program can operate entirely within L2, so a copy that stays in the cache can go faster.
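To make the idea concrete, here is a rough sketch of what a size-based switch to non-temporal stores could look like on x86 with SSE2. This is not how any particular libc does it: the name `copy_sketch` and the 1 MiB `NT_THRESHOLD` are made up for illustration, and real implementations pick the cutoff from the actual cache sizes. On targets without SSE2 the sketch just falls back to plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff (assumption): copies at least this large are
 * treated as "will not fit in cache anyway". */
#define NT_THRESHOLD ((size_t)1 << 20)

/* Illustrative only: copy n bytes, using non-temporal stores for large,
 * 16-byte-aligned destinations and ordinary memcpy otherwise. */
static void *copy_sketch(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD && ((uintptr_t)dst % 16) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t blocks = n / 16;

        for (size_t i = 0; i < blocks; i++) {
            /* Unaligned load from the source, then a streaming store
             * that does not allocate the destination line in cache. */
            _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
        }
        /* Order the streaming stores before any later stores. */
        _mm_sfence();

        /* Copy the tail (fewer than 16 bytes) the ordinary way. */
        size_t done = blocks * 16;
        memcpy((char *)dst + done, (const char *)src + done, n - done);
        return dst;
    }
#endif
    /* Small or unaligned copies stay on the cached path. */
    return memcpy(dst, src, n);
}
```

With a working set that fits in L2, every copy in the benchmark would be better off on the cached path, which is consistent with the simple loop beating a memcpy that takes the streaming branch.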