Does anybody have an example of good performance of unaligned memory access on modern CPUs? And note that it isn't a matter of whether the CPU supports AVX, but of whether it has a flag that says it can do fast unaligned memory access (I don't remember, is it misalignsse?).
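For reference, here's roughly how I check for it. This is just a minimal sketch, assuming the flag in question really is AMD's MisAlignSse bit, which as far as I know is reported in CPUID leaf 0x80000001, ECX bit 7 (it also shows up as "misalignsse" in /proc/cpuinfo on Linux):

    /* Sketch: check for AMD's "misalignsse" CPUID flag from C.
     * Assumption: the flag lives in CPUID leaf 0x80000001, ECX bit 7. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 7)))
            printf("misalignsse: supported\n");
        else
            printf("misalignsse: not reported\n");
        return 0;
    }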
Common sense says that unaligned access can't be faster than aligned. And if you have data that fits into ymm registers, then you might as well use aligned access (a neural network is usually an example of such data).
I did test it a while ago. Problem is that I don't remember if it was on this, modern, CPU or the older one. I could test if I cared enough for other people's opinions, but alas I don't (the only usage of unaligned AVX access I found was from newcomers to SIMD). An example of the kind you request would be to look at glibc's memcpy, which uses SSSE3 [0] so that it can always do aligned accesses (SSSE3 has per-byte operations).
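For illustration, here's a minimal sketch of that trick (my own sketch, not the glibc code itself): do only aligned 16-byte loads and stitch the misaligned source back together with SSSE3's palignr. The offset has to be a compile-time constant, which is why a real memcpy dispatches to one variant per possible misalignment. Compile with -mssse3.

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_store_si128 */
    #include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 */
    #include <stddef.h>
    #include <stdint.h>

    #define OFF 4  /* assumed misalignment of src within its 16-byte block, 1..15 */

    /* Copy nchunks*16 bytes from src (misaligned by OFF) to a 16-byte aligned dst,
     * using only aligned loads. Note this touches the whole aligned blocks that
     * contain the first and last source bytes, i.e. it reads slightly outside
     * [src, src + 16*nchunks), the way real memcpy implementations do. */
    static void copy_realigned(uint8_t *dst, const uint8_t *src, size_t nchunks)
    {
        const __m128i *s = (const __m128i *)(src - OFF);   /* aligned base just below src */
        __m128i prev = _mm_load_si128(s);
        for (size_t i = 0; i < nchunks; i++) {
            __m128i next = _mm_load_si128(s + i + 1);
            /* top (16-OFF) bytes of prev followed by the low OFF bytes of next
             * == the 16 source bytes starting at src + 16*i */
            _mm_store_si128((__m128i *)dst + i, _mm_alignr_epi8(next, prev, OFF));
            prev = next;
        }
    }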
In other words, how about the people who claim that operations that do extra work are as fast as the ones that don't prove it? Instead of the burden of proof falling on people who don't share that opinion/experience? Then I will bow my head and say "You are right. Thank you for pointing that out." But alas, after googling for 10 minutes I have found no such benchmark anywhere. And writing such a test isn't hard, not in the slightest.
>In other words, how about the people who claim that operations that do extra work are as fast as the ones that don't prove it? Instead of the burden of proof falling on people who don't share that opinion/experience? Then I will bow my head and say "You are right. Thank you for pointing that out." But alas, after googling for 10 minutes I have found no such benchmark anywhere. And writing such a test isn't hard, not in the slightest.
I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
I linked elsewhere in the thread to my more detailed experiments regarding unaligned vector access on Haswell and Skylake: http://www.agner.org/optimize/blog/read.php?i=415#423. This is the source of my conclusion that alignment is not a significant factor when reading from L3 or memory, but does matter when attempting multiple reads per cycle from L1.
Both of these link to code that can be run for further tests. If you find an example of an unaligned access that is significantly slower than an aligned one on a recent processor (and they certainly may exist), I'll nudge Daniel into writing an update to his blog post.
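If you want a quick starting point, here's a minimal sketch of such a test (a toy harness of my own, not the code behind those links): sum a buffer with aligned vs. deliberately misaligned 16-byte loads and time both. The buffer size decides whether you're hitting L1, L3 or memory.

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF (1 << 24)   /* 16 MB: large enough to miss the caches */

    static double run(const uint8_t *p, size_t n, int use_unaligned)
    {
        struct timespec t0, t1;
        __m128i acc = _mm_setzero_si128();

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i v = use_unaligned ? _mm_loadu_si128((const __m128i *)(p + i))
                                      : _mm_load_si128((const __m128i *)(p + i));
            acc = _mm_add_epi8(acc, v);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile int sink = _mm_cvtsi128_si32(acc);  /* keep the loop from being optimized away */
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        uint8_t *buf = aligned_alloc(16, BUF + 16);
        if (!buf)
            return 1;
        printf("aligned loads:        %.0f ns\n", run(buf, BUF, 0));
        printf("unaligned loads (+1): %.0f ns\n", run(buf + 1, BUF, 1));
        free(buf);
        return 0;
    }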
>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
For me it depends on the context. Here aligned access makes more sense, so it's unaligned access that needs defending.
I hacked together a test; feel free to point out mistakes.
unaligned on one byte unaligned data: 0 sec, 70278354 nsec
unaligned on three bytes unaligned data: 0 sec, 70315162 nsec
aligned nontemporal: 0 sec, 42549571 nsec
naive: 0 sec, 67741031 nsec
Repeating the test only shows non-temporal to be of benefit. The difference of, on average, 1-2% is not much, that I'll grant. But it is measurable.
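Roughly, the variants being compared look like this (a simplified intrinsics sketch, not my exact code, which is hand-written asm): the "aligned" copy uses movdqa loads/stores, "unaligned" uses movdqu, and the non-temporal one uses movntdq stores that bypass the cache, which is why it wins once the copy is larger than the cache.

    #include <emmintrin.h>
    #include <stddef.h>

    /* dst and src both 16-byte aligned, n a multiple of 16 */
    static void copy_aligned(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16)
            _mm_store_si128((__m128i *)(dst + i),
                            _mm_load_si128((const __m128i *)(src + i)));     /* movdqa */
    }

    /* no alignment assumptions */
    static void copy_unaligned(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16)
            _mm_storeu_si128((__m128i *)(dst + i),
                             _mm_loadu_si128((const __m128i *)(src + i)));   /* movdqu */
    }

    /* aligned, but with non-temporal stores that bypass the cache */
    static void copy_nontemporal(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16)
            _mm_stream_si128((__m128i *)(dst + i),
                             _mm_load_si128((const __m128i *)(src + i)));    /* movntdq */
        _mm_sfence();  /* make the streaming stores globally visible */
    }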
But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.
aligned: 0 sec, 160536 nsec
unaligned on aligned data: 0 sec, 179999 nsec
unaligned on one byte unaligned data: 0 sec, 375108 nsec
aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte unaligned
And, out of interest, I made all the copies skip every second 16 bytes; (relative) results are the same as in the original test, except non-temporal being over 3x slower than anything else.
And this is on an AMD FX-8320 that has the misalignsse flag. On my former CPU (can't remember if it was the Celeron or the AMD 3800+) the results were very much in favor of aligned access.
So yea, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (that's for GCC; I don't know the syntax for other compilers).
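E.g. something along these lines (GCC/clang syntax; the names are made up), which is enough to make movdqa/movaps on the buffers safe:

    /* 16-byte aligned global buffer */
    static float weights[1024] __attribute__ ((aligned (16)));

    struct frame {
        int len;
        /* without the attribute this member would only be 4-byte aligned,
         * and an aligned SSE load on it could fault */
        unsigned char payload[256] __attribute__ ((aligned (16)));
    };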
PS It may seem like the naive way is good, but memcpy is a bit more complicated than that.
See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving; I don't know the internals well enough to say with confidence what's going on exactly.
BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.
Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
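(The snippet itself didn't survive in this copy of the thread. Judging from the replies below, it was a function named c_is_faster_than_asm_a() using GCC's vector_size extension, copying eight 16-byte chunks per iteration, so it presumably looked roughly like this hypothetical reconstruction, minus the dst-indexing bug discussed further down.)

    /* Hypothetical reconstruction, not the original post's code: an 8x-unrolled
     * copy of 16-byte chunks using GCC's vector_size extension instead of a
     * LOOP-based asm loop. n is the number of 16-byte chunks, assumed to be a
     * multiple of 8. */
    typedef char v16 __attribute__ ((vector_size (16)));

    void c_is_faster_than_asm_a(v16 *dst, const v16 *src, unsigned long n)
    {
        for (unsigned long i = 0; i < n; i += 8) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
            dst[i + 4] = src[i + 4];
            dst[i + 5] = src[i + 5];
            dst[i + 6] = src[i + 6];
            dst[i + 7] = src[i + 7];
        }
    }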
>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.
Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so I tested with 512kB, with and without your +128. Results were (relatively) similar to the original 1MB test.
>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]
Adobe Flash would, for starters (I don't know about doubles, but it calls memcpy on unaligned data all the time). The code from the person above also does, because compilers sometimes do (an aligned mov sometimes segfaults if you don't tell the compiler to align an array, especially if it's in a struct).
>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
Of course you did: you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.
>c_is_faster_than_asm_a()
First of all, that is not in the C specification. It is a gcc/clang (and maybe others) extension to C. It compiles to something similar to what I would write if I had unrolled the loop. Actually worse; here's what it compiled to: http://pastebin.com/yL31spR2 . Note that this is still a lot slower than movntps when going over the cache size.
edit: I didn't notice at first. Your code copies all 8 16-byte chunks to the first one. You forgot to add +n to dst.
[0] https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/...