
Data alignment for speed: myth or reality? (2012) - bluetomcat
http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/
======
nkurz
It's worth noting that Daniel was testing scalar loads and stores up to
64-bits (8B). Unaligned loads for 16B and 32B vectors can have slightly more
of a slowdown when they cross 64B cachelines. Unaligned stores across
cachelines are somewhat worse, but usually not enough to really hurt you.

But _nota bene_, unaligned vector stores that cross 4K page boundaries
are so slow that they can destroy performance despite their rarity. Here's
what the Intel Architecture Optimization Guide has to say:

    
    
      11.6.3 Prefer Aligned Stores Over Aligned Loads
      There are cases where it is possible to align only a subset 
      of the processed data buffers. In these cases, 
      aligning data buffers used for store operations usually 
      yields better performance than aligning data 
      buffers used for load operations.
    
      Unaligned stores are likely to cause greater performance 
      degradation than unaligned loads, since there 
      is a very high penalty on stores to a split cache-line that 
      crosses pages. This penalty is estimated at 150 
      cycles. Loads that cross a page boundary are executed at 
      retirement. In Example 11-12, unaligned store 
      address can affect SAXPY performance for 3 unaligned 
      addresses to about one quarter of the aligned 
      case.
    

[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)

------
mjevans
At a glance this appears to run linear reads of memory at different
alignments while performing the same work. Thus, for any algorithm whose
working set fits within cache (even L1 in this case), the only difference
between the two would be a single additional memory read.

~~~
rasz_pl
exactly, his "unaligned" test does exactly ONE unaligned read per whole array
.....

------
kbart
It's not mentioned how that test program was compiled, nor is the assembler
code given. It might be some kind of optimization made by the compiler. At
least on ARM, unaligned access really does affect performance negatively;
even the company itself[1] states that: _" [Unaligned] Accesses typically
take a greater number of cycles to complete compared to a naturally aligned
transfer."_

1\.
[http://infocenter.arm.com/help/topic/com.arm.doc.ddi0333h/Cd...](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0333h/Cdffhdje.html)

~~~
dalke
The first line of the code shows how it was compiled.

The end of the essay says "Recent ARM processors do support unaligned memory
accesses though it is unclear what the performance penalty is." and points to
basically the same documentation you did.

~~~
nkurz
_The first line of the code shows how it was compiled._

'kbart' does have a point: it's always risky to talk about hardware behaviour
as if C++ code actually specifies what the processor is doing. And while the
first line shows the flags, one could argue that without the version number
for g++ you still don't know what the compiler has done. But in this case,
both gcc-4.7 and gcc-4.8 do seem to produce more or less the code you'd
expect when compiling with -O2.

But to show that compiler behaviour is not just an idle worry, icpc 14.0.3 -O3
produces much faster times by realizing it can combine the loops and
completely omit the reads. And g++ -O3? Oh, it segfaults, because it tries to
optimize to a 128-bit 'movdqa' (Move Double-Quadword Aligned) that requires
16B alignment.

While unhelpful, I think it's technically legal for it to do this, as
(incredibly?) reading an unaligned int directly from memory is actually
undefined behaviour. But I guess I'd prefer if it just used the equally
performant 'movdqu' (Move Double-Quadword Unaligned) instead, so it worked
rather than crashing.

Separately, here's a more recent post coming to the same general conclusion as
Daniel: [http://cdglove98.blogspot.com/2014/05/the-performance-of-una...](http://cdglove98.blogspot.com/2014/05/the-performance-of-unaligned-simd-loads.html)

------
eximius
An offset counted in single-digit bytes doesn't seem to be particularly
'unaligned' to me.

~~~
ygra
It's the offset from a memory location at a multiple of 4/8 bytes. So you only
get 3/7 different states of unalignedness. Unaligned memory accesses below the
byte granularity are quite rare, I guess.

