Does anybody have an example of good performance of unaligned memory access on m...

nkurz · on Oct 15, 2016

In other words, how about that the people who claim that operations that do extra work are as fast as the ones that don't prove it? Instead of the burden of proof falling on people that don't have such an opinion/experience ? Then i will bow my head and say "You are right. Thank you for pointing that out". But alas google-ing for 10min and have found no such benchmark anywhere. And writing such a test isn't hard, not in the slightest.

I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.

One good reference for the "alignment doesn't matter" view is an earlier post on the same blog that is being discussed here: http://lemire.me/blog/2012/05/31/data-alignment-for-speed-my...

I linked elsewhere in the thread to my more detailed experiments regarding unaligned vector access on Haswell and Skylake: http://www.agner.org/optimize/blog/read.php?i=415#423. This is the source of my conclusion that alignment is not a significant factor when reading from L3 or memory, but does matter when attempting multiple reads per cycle from L1.

Both of these link to code that can be run for further tests. If you find an example of an unaligned access that is significantly slower than an aligned on a recent processor (and they certainly may exist) I'll nudge Daniel into writing an update to his blog post.

gens · on Oct 16, 2016

>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.

For me it depends on the context. Here aligned access makes more sense so unaligned should be defended.

I hacked together a test, feel free to point out mistakes.

c part: http://pastebin.com/zMha8Fre

asm part(SSE and SSE2 for nt): http://pastebin.com/mxEFC8Cw

results:

aligned: 0 sec, 69049070 nsec

unaligned on aligned data: 0 sec, 69210069 nsec

unaligned on one byte unaligned data: 0 sec, 70278354 nsec

unaligned on three bytes unaligned data: 0 sec, 70315162 nsec

aligned nontemporal: 0 sec, 42549571 nsec

naive: 0 sec, 67741031 nsec

Repeating the test only shows non-temporal to be of benefit. The difference of, on average, 1-2% is not much, that i yield. But it is measurable.

But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.

aligned: 0 sec, 160536 nsec

unaligned on aligned data: 0 sec, 179999 nsec

unaligned on one byte unaligned data: 0 sec, 375108 nsec

aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower then one byte unaligned

And, out of interest, i made all the copy-s skip every second 16 bytes, (relative) results are the same as the original test except non-temporal being over 3x slower then anything else.

And this is on a amd fx8320 that has the misalignsse flag. On my former cpu (can't remember if it was the celeron or the amd 3800+) the results were very much in favor of aligned access.

So yea, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (for gcc, idk anything else).

PS It may seem like the naive way is good, but memcpy is a bit more complicated then that.

qb45 · on Oct 17, 2016

See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving, I don't know the internals enough to say with confidence what's going on exactly.

BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.

Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

  typedef float sse_a __attribute__ ((vector_size(16), aligned(16)));
  typedef float sse_u __attribute__ ((vector_size(16), aligned(1)));
  
  void c_is_faster_than_asm_a(sse_a *dst, sse_a *src, int count) {
          for (int i = 0; i < count/sizeof(sse_a); i += 8) {
                  dst[i] = src[i+0];
                  dst[i] = src[i+1];
                  dst[i] = src[i+2];
                  dst[i] = src[i+3];
                  dst[i] = src[i+4];
                  dst[i] = src[i+5];
                  dst[i] = src[i+6];
                  dst[i] = src[i+7];
          }
  }
  void c_is_faster_than_asm_u(sse_u *dst, sse_u *src, int count) {
          // ditto

gens · on Oct 18, 2016

>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.

Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so i tested with 512kB with and without your +128. Results were (relatively) similar to the original 1MB test.

>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]

Adobe flash would, for starters (idk if doubles but it calls unaligned memcpy all the time). The code from the person above also does because compilers sometimes do (aligned mov sometimes segfaults if you don't tell the compiler to aligned an array, especially if it's in a struct).

>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:

Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.

>c_is_faster_than_asm_a()

First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to similar what I would write if i had unrolled the loop. Actually worse, here's what it compiled to http://pastebin.com/yL31spR2 . Note that this is still a lot slower then movnpts when going over cache size.

edit: I didn't notice at first. Your code copies 8 16byte... chunks to the first. You forgot to add +n to dst.

qb45 · on Oct 18, 2016

Crap, that was bad. Fixed. And removed the insane unrolling, now 2x is sufficient.

You are right, 128 is not enough on Piledriver. Still,

  ./test $(( 512*1024+1024*0 ))
  aligned: 0 sec, 134539 nsec
  unaligned on aligned data: 0 sec, 101471 nsec
  unaligned on one byte unaligned data: 0 sec, 190368 nsec
  unaligned on three bytes unaligned data: 0 sec, 181823 nsec
  aligned nontemporal: 0 sec, 359920 nsec
  naive: 0 sec, 214007 nsec
  c_is_faster_than_asm_a:   0 sec, 92437 nsec
  c_is_faster_than_asm_u:   0 sec, 92643 nsec
  c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
  c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
  c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
  c_is_faster_than_asm_u+8: 0 sec, 155784 nsec

  ./test $(( 512*1024+1024*1 ))
  aligned: 0 sec, 107036 nsec
  unaligned on aligned data: 0 sec, 94861 nsec
  unaligned on one byte unaligned data: 0 sec, 114444 nsec
  unaligned on three bytes unaligned data: 0 sec, 115915 nsec
  aligned nontemporal: 0 sec, 407951 nsec
  naive: 0 sec, 219215 nsec
  c_is_faster_than_asm_a:   0 sec, 82474 nsec
  c_is_faster_than_asm_u:   0 sec, 82554 nsec
  c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
  c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
  c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118952 nsec

4k is the stride of L1, your code slows down 1.5x:

  ./test $(( 512*1024+1024*4 ))
  aligned: 0 sec, 107576 nsec
  unaligned on aligned data: 0 sec, 94010 nsec
  unaligned on one byte unaligned data: 0 sec, 140534 nsec
  unaligned on three bytes unaligned data: 0 sec, 140517 nsec
  aligned nontemporal: 0 sec, 467981 nsec
  naive: 0 sec, 206891 nsec
  c_is_faster_than_asm_a:   0 sec, 85294 nsec
  c_is_faster_than_asm_u:   0 sec, 85174 nsec
  c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
  c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
  c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118638 nsec

128k is the stride of L2, both codes slow down further:

  ./test $(( 512*1024+1024*128 ))
  aligned: 0 sec, 167906 nsec
  unaligned on aligned data: 0 sec, 140650 nsec
  unaligned on one byte unaligned data: 0 sec, 239271 nsec
  unaligned on three bytes unaligned data: 0 sec, 251342 nsec
  aligned nontemporal: 0 sec, 458850 nsec
  naive: 0 sec, 364731 nsec
  c_is_faster_than_asm_a:   0 sec, 125240 nsec
  c_is_faster_than_asm_u:   0 sec, 118917 nsec
  c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
  c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
  c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
  c_is_faster_than_asm_u+8: 0 sec, 197842 nsec