for (int i = 0; i < n; i += 8*16) {
    /* one prefetch per 128-byte block, 512 elements ahead of the current one */
    __builtin_prefetch(&a[i + 512]);
    /* 8 loads of 16 bytes each = 128 bytes */
    for (int j = 0; j < 8; j++) {
        __m128i v = _mm_load_si128((__m128i *)&a[i + j*16]);
        __m128i vl = _mm_unpacklo_epi8(v, vk0);
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }
}
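The goal is to issue one prefetch for each 128B block of data you read: the inner loop does 8 loads of 16 bytes each, and the prefetch targets the address 512 elements past the start of the current block. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, but I haven't really looked at the generated assembly.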
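In case it helps to have something you can compile on its own, here is a minimal sketch of that loop wrapped in a function. The function name is made up for illustration, and I'm assuming 'vk0' is all zeros (to widen bytes to 16 bits) and 'vk1' is all 16-bit ones (so the madd adds adjacent pairs), since those definitions aren't shown above. It uses unaligned loads, guards the prefetch so the address stays inside the array, and handles the tail with a scalar loop. It needs GCC or Clang on x86 for '__builtin_prefetch' and the SSE2 intrinsics.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n bytes. The 32-bit lanes wrap for very large n. */
uint32_t sum_bytes_prefetch(const uint8_t *a, size_t n)
{
    const __m128i vk0 = _mm_setzero_si128();  /* assumed: zeros, widens bytes */
    const __m128i vk1 = _mm_set1_epi16(1);    /* assumed: ones, madd adds pairs */
    __m128i vsum = _mm_setzero_si128();
    size_t i = 0;

    /* Process whole 128-byte blocks, one prefetch per block. */
    for (; i + 128 <= n; i += 128) {
        if (i + 512 < n)                      /* keep the address in bounds */
            __builtin_prefetch(&a[i + 512]);
        for (int j = 0; j < 8; j++) {
            __m128i v = _mm_loadu_si128((const __m128i *)&a[i + j * 16]);
            __m128i vl = _mm_unpacklo_epi8(v, vk0);
            __m128i vh = _mm_unpackhi_epi8(v, vk0);
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
        }
    }

    /* Horizontal sum of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, vsum);
    uint32_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last n % 128 bytes. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

The bounds check on the prefetch costs a branch per block. I'd expect it to be well predicted, but I haven't measured it.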
Also, if it is indeed the case that TLB misses are the major factor (and I think it is), I don't think you will have much success by adding prefetch alone. Trying it right now, I get a slight slowdown with just the prefetch. It may be that you only get a positive effect in combination with hugepages.
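For the hugepages side, one option on Linux is to ask for transparent huge pages on an anonymous buffer. This is only a sketch under assumptions: the helper name is hypothetical, I'm assuming a 2 MiB huge page size (typical on x86-64), and I'm assuming the data lives in a buffer you allocate yourself. If you mmap the file directly, this won't apply as-is. The 'madvise' call is advisory, so the kernel is free to ignore it.

```c
#define _GNU_SOURCE      /* exposes MADV_HUGEPAGE and posix_memalign under strict -std= modes */
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2u * 1024 * 1024)  /* assumed: 2 MiB huge pages */

/* Hypothetical helper: returns a huge-page-aligned buffer of at least
 * `bytes` bytes, hinted for transparent huge pages. Release with free(). */
void *alloc_with_thp_hint(size_t bytes)
{
    void *p = NULL;

    /* Round the size up to a whole number of huge pages. */
    size_t rounded = (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
    if (rounded == 0)
        rounded = HUGE_PAGE_SIZE;

    if (posix_memalign(&p, HUGE_PAGE_SIZE, rounded) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Advisory only: failure is ignored and the buffer is still usable. */
    (void)madvise(p, rounded, MADV_HUGEPAGE);
#endif
    return p;
}
```

Whether the kernel actually backs the buffer with huge pages depends on the system's THP settings, so it's worth confirming with the TLB counters rather than assuming it worked.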
I think 'perf' is probably lying to you. Although maybe it's not a lie, as your Gist does contain the line '<not supported> dTLB-misses'. Perf tries very hard to be non-CPU-specific, and thus doesn't do a great job of handling the CPU-specific stuff.
What processor are you running this on? If it's Intel, you might have luck with some of the more Intel-specific wrappers here: https://github.com/andikleen/pmu-tools
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.
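Profiling just a section with 'likwid' goes through its Marker API macros. Here is a minimal sketch; the function name and the region tag "sum" are made up for illustration, and the scalar loop is only a stand-in for the code you actually want to measure. Depending on the LIKWID version, the macros come from likwid.h or likwid-marker.h, so check your install. Without -DLIKWID_PERFMON the macros compile to nothing, so the code builds either way.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>   /* assumption: newer releases may need <likwid-marker.h> */
#else
/* No-op fallbacks so the code builds without LIKWID installed. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical example: only the summing loop is inside the measured region. */
uint64_t run_profiled_sum(const uint8_t *a, size_t n)
{
    uint64_t total = 0;

    LIKWID_MARKER_INIT;            /* once per program, before any region */

    LIKWID_MARKER_START("sum");    /* counters are attributed to this region */
    for (size_t i = 0; i < n; i++)
        total += a[i];
    LIKWID_MARKER_STOP("sum");

    LIKWID_MARKER_CLOSE;           /* once per program, writes the results */
    return total;
}
```

You'd then build with -DLIKWID_PERFMON, link against the likwid library, and run the program under 'likwid-perfctr' with its marker mode (-m) and whichever event group your CPU offers for TLB events.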
ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
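A minimal sketch of the check I mean, with a hypothetical helper name, assuming the program takes exactly one argument (the filename):

```c
#include <stddef.h>

/* Hypothetical helper: returns the filename if exactly one argument was
 * given, or NULL otherwise. argv[0] is the program name, so one argument
 * means argc == 2. */
const char *filename_from_args(int argc, char **argv)
{
    if (argc != 2)
        return NULL;
    return argv[1];
}
```

In 'main' you would print a usage message and exit when this returns NULL.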