Can you share the prefetching code you wrote? I tried to use __builtin_prefetch,...

nkurz · on May 13, 2014

My approach was somewhat ugly. I went with this:

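  // vk0, vk1 and vsum are defined outside this snippet
  for (int i = 0; i < n; i += 8*16) {
      // one prefetch per 128-byte block, 512 bytes ahead of the current read
      __builtin_prefetch(&a[i + 512]);
      // 8 aligned 16-byte loads cover the 128-byte block
      for (int j = 0; j < 8; j++) {
          __m128i v = _mm_load_si128((__m128i *)&a[i + j*16]);
          __m128i vl = _mm_unpacklo_epi8(v, vk0);
          __m128i vh = _mm_unpackhi_epi8(v, vk0);
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
      }
  }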

The goal is to issue one prefetch for each 128B block of data you read. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, and haven't really looked at the generated assembly.
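For anyone who wants to try the "one prefetch per 128B block" idea without the SSE intrinsics, here is a minimal scalar sketch. The function name `bytesum_prefetch` and the constants are made up for illustration, and it assumes GCC or Clang for `__builtin_prefetch`. It is not the code that was benchmarked in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants: 128-byte blocks, prefetch 512 bytes ahead. */
enum { BLOCK = 128, PREFETCH_AHEAD = 512 };

/* Hypothetical helper: sums the bytes of `a`, issuing one prefetch
 * hint per 128-byte block. */
uint64_t bytesum_prefetch(const uint8_t *a, size_t n)
{
    uint64_t sum = 0;
    size_t i = 0;

    /* Full blocks: one prefetch, then BLOCK bytes of work. */
    for (; i + BLOCK <= n; i += BLOCK) {
        /* Only form addresses that lie inside the buffer. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD]);
        for (size_t j = 0; j < BLOCK; j++)
            sum += a[i + j];
    }

    /* Tail: fewer than BLOCK bytes remain, so no prefetch is issued. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

The bounds check on the prefetch address costs a branch per block. Whether the hint helps at all depends on the hardware prefetcher and on TLB behavior, so measure before keeping it.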

Also, if it is indeed the case that TLB misses are the major factor (and I think it is), I don't think you will have much success by adding prefetch alone. Trying right now, I get a slight slowdown with just the prefetch. It may be only in combination with hugepages that you get a positive effect.
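The thread doesn't say how the hugepage variant was set up, so the following is only one possible way to ask for huge pages on Linux: map an anonymous buffer and hint the kernel with `madvise(MADV_HUGEPAGE)`. The helper name `alloc_hugepage_hint` is hypothetical, and the sketch assumes the data is copied into this buffer rather than read from a file-backed mapping.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS and MADV_HUGEPAGE on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: maps `len` bytes of anonymous memory and asks the
 * kernel to back it with transparent huge pages. Returns NULL if the
 * mapping fails. *hinted is set to 1 if the kernel accepted the hint,
 * 0 otherwise; accepting the hint does not guarantee huge pages. */
void *alloc_hugepage_hint(size_t len, int *hinted)
{
    *hinted = 0;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Advisory only: the buffer stays usable even if this call fails. */
    if (madvise(p, len, MADV_HUGEPAGE) == 0)
        *hinted = 1;
#endif

    return p;
}
```

Release the buffer with `munmap(p, len)`. Comparing the DTLB miss counters with and without the hint is the way to tell whether it had any effect.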

jvns · on May 13, 2014

I profiled my program looking for TLB misses, and got 0 dTLB misses (https://gist.github.com/jvns/a42ff6f48c659cfc4600).

nkurz · on May 13, 2014

I think 'perf' is probably lying to you. Although maybe it's not a lie, as your Gist does contain the line '<not supported> dTLB-misses'. Perf tries very hard to be non-CPU-specific, and thus doesn't do a great job of handling the CPU-specific stuff.

What processor are you running this on? If Intel, you might have luck with some of the more Intel-specific wrappers here: https://github.com/andikleen/pmu-tools

You also might have better luck with 'likwid': https://code.google.com/p/likwid/

Here are the arguments I was giving it to check:

  sudo likwid -C 1 -g \
       INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
       ./bytesum 1gb_file

  |        INSTR_RETIRED_ANY        | 7.38826e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 5.42765e+08 |
  |      CPU_CLK_UNHALTED_REF       | 5.42753e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED | 1.04509e+06 |

  sudo likwid -C 1 -g \
       INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
       ./hugepage_prefetch 1gb_file

  |        INSTR_RETIRED_ANY        | 5.79098e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 2.63809e+08 |
  |      CPU_CLK_UNHALTED_REF       | 2.63809e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED |    11970    |

The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.

ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
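A minimal sketch of that check, assuming the program takes exactly one filename argument. The helper name `filename_arg` is made up for illustration and is not taken from the bytesum source.

```c
#include <stdio.h>

/* Hypothetical helper: returns the filename argument, or NULL (after
 * printing a usage message) unless exactly one argument was given.
 * argv[0] is the program name, so one user argument means argc == 2. */
const char *filename_arg(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n",
                argc > 0 ? argv[0] : "bytesum");
        return NULL;
    }
    return argv[1];
}
```

In `main` this would be called as `filename_arg(argc, argv)`, exiting with a non-zero status when it returns NULL.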

jvns · on May 13, 2014

yeah, I tried prefetching like this: https://github.com/jvns/howcomputer/blob/master/bytesum_pref... (using MAP_POPULATE instead of hugepages) and got a slight slowdown