
Comparing Arrays of Structures and Structures of Arrays (2013) [pdf] - Tomte
https://software.intel.com/sites/default/files/article/392271/aos-to-soa-optimizations-using-iterative-closest-point-mini-app.pdf
======
zeotroph
Apart from Jai [1], are there any languages which support transparent
conversion from one format for the programmer (who will prefer AoS) to the
other for the compiler/cpu/cache?

This feature - plus the inversion of the `inline` keyword (specify where the
function is used, not where it is defined) - was what I remember most from the
writeup.

1:
[https://sites.google.com/site/jailanguageprimer/](https://sites.google.com/site/jailanguageprimer/)

~~~
DannyBee
GCC, ICC, etc., will actually do it with profile-guided feedback in whole
program optimization mode, when they determine it is valuable.

(Actually, GCC removed it a few years back, whoops)

~~~
zurn
Last I heard it was a fragile benchmark stunt, and compilers would enable the
optimization in only very limited circumstances. Has there been progress
since?

~~~
DannyBee
Yes, in the sense that they can do a lot more. It was implemented initially as
a fragile stunt.

The real problem with it in C based languages is that it requires the real,
true, whole program, because it is impossible to not break the ABI of the
structure :)

In general, memory locality/etc optimizations are more focused on reorganizing
loops to walk the data optimally, instead of reorganizing data so normal loops
walk them optimally.

(See, e.g., polyhedral loop optimizations, which XLC and others ship with. A
good paper to read is PLUTO and PLUTO+)

------
mrich
The conclusions are not much of a surprise here, but nice that someone took
the time to verify it.

This is comparable to the difference between row stores and column stores in
the database world. When only looking at one column a column store is faster
due to the memory locality, but when fetching a whole row the row store will
beat it for the same reason.

~~~
btrask
Well said. This is the classic latency versus throughput tradeoff.

Game developers are right to prefer SoA. Web developers are right to prefer
AoS.

------
lsiebert
This is cool, but trying to reason about compilers may not help an individual
developer as much as simply profiling changes to the sections of code the
program spends the most time in.

Don't guess, profile. Don't talk about something being faster until you load
it with production (or production like) data and run it a few times.

------
francoisdevlin
tl;dr?

~~~
benjaminjackman
Conclusions:

What can we conclude from all this? The software world is complex enough that
anything you can say will almost undoubtedly have counter-examples. While we
haven’t proved it here, one of your take-aways should be the following:

It is almost always better to vectorize than not to vectorize on Intel SIMD
capable hardware.

If you are vectorizing (“LOOP WAS VECTORIZED”), then it is almost always
better (faster) to vectorize with SoA “stride-1” data layout than the AoS data
layout simply because your chances of the compiler doing something wonderful
for you are much better. This is because there are fewer instructions needed
(less work) in the SoA case. It is possible that SoA may not be dramatically
faster than AoS for a given code depending on how a program exercises the
memory subsystem, but it is unlikely that you will see many examples where AoS
code would execute significantly faster than SoA code. So when in doubt or
unless you can prove otherwise, prefer SoA to AoS, especially on Intel Xeon
Phi. This tends to go against how many of us were trained to arrange data
structures, but for quadruple-speed single precision execution and double-
speed double precision execution over the alternative, you will want to
consider it.

tl;dr:

So when in doubt or unless you can prove otherwise, prefer SoA to AoS

~~~
redcalx
> quadruple-speed single precision execution and double-speed double precision

FYI Xeon Phi supports AVX512, i.e. 512 bit SIMD registers (8 doubles or 16
floats). In the consumer CPU world there is AVX2, which has 256 bit registers
(4 doubles or 8 floats). AVX2 also adds 'fused multiply-add' (FMA)
instructions, which are useful for matrix multiplication amongst other things.

