Hacker News new | past | comments | ask | show | jobs | submit login
Comparing Arrays of Structures and Structures of Arrays (2013) [pdf] (intel.com)
91 points by Tomte on May 24, 2016 | hide | past | favorite | 21 comments



Apart from Jai [1], are there any languages which support transparent conversion from one format for the programmer (who will prefer AoS) to the other for the compiler/cpu/cache?

This feature - plus the inversion of the `inline` keyword (specify where the function is used, not where it is defined) - was what I remember most from the writeup.

1: https://sites.google.com/site/jailanguageprimer/


Julia's StructsOfArrays.jl package[1] does this quite simply and easily without needing to modify any downstream code.

1. https://github.com/simonster/StructsOfArrays.jl



GCC, ICC, etc, will actually do it with profile guided feedback in whole program optimiation mode, when they determine it is valuable.

(Actually, GCC removed it a few years back, whoops)


Last I heard it was a fragile benchmark stunt, and compilers would enable the optimization in only very limited circumstances. Has there been progress since?


Yes, in the sense that they can do a lot more. It was implemented initially as a fragile stunt.

The real problem with it in C based languages is that it requires the real, true, whole program, because it is impossible to not break the ABI of the structure :)

In general, memory locality/etc optimizations are more focused on reorganizing loops to walk the data optimally, instead of reorganizing data so normal loops walk them optimally.

(See, e.g., polyhedral loop optimizations, which XLC and others ship with. A good paper to read is PLUTO and PLUTO+)


I'm guessing the GCC feature and removal is this here?

https://gcc.gnu.org/gcc-4.8/changes.html "The struct reorg and matrix reorg optimizations (command-line options -fipa-struct-reorg and -fipa-matrix-reorg) have been removed. They did not always work correctly, nor did they work with link-time optimization (LTO), hence were only applicable to programs consisting of a single translation unit. "

Though my web searches don't return hits about SoA in connection to GCC reorg optimizations, and a paper[1] about the struct reorg optimization doesn't mention it either...

[1] https://www.research.ibm.com/haifa/dept/svt/papers/golovanev...


Depending on your language, you may not need explicit language support. For example these guys have a runtime codegen library that they use (on Scala): https://ppl.stanford.edu/papers/popl13_rompf.pdf

There's also a tradition of older metaprogramming approaches in data layout optimizations, eg. ATLAS BLAS.


I remember seeing a language that compiled to 6502 do this. I can't remember the name of it, but it was fairly new. On 6502 the advantage is clear, since there is no cache, since index registers are only 8 bits and since they can only be modified by increment/decrement/load/tranfer operations.


Haskell's `vector` library does this for vectors of tuples.


kdb will do this https://kx.com/software.php . A list of dictionary is converted to a table which is the same as a dictionary to lists.


Haskell's standard vector package supports this: https://hackage.haskell.org/package/vector-0.11.0.0/docs/Dat...


The conclusions are not much of a surprise here, but nice that someone took the time to verify it.

This is comparable to the difference between row stores and column stores in the database world. When only looking at one column a column store is faster due to the memory locality, but when fetching a whole row the row store will beat it for the same reason.


Exactly, and very good point about the duality with row/column stores!

Sometimes people (usually coming from general-purpose software engineering background) have a preconception that SoA data structures are somehow cleaner and more object-oriented -- it's nice to have an analogy they could relate to more easily.


Well said. This is the classic latency versus throughput tradeoff.

Game developers are right to prefer SoA. Web developers are right to prefer AoS.


This is cool, but trying to reason about compilers may not an individual developer as far as simply profiling changes to the sections of code the program will spend the most time in.

Don't guess, profile. Don't talk about something being faster until you load it with production (or production like) data and run it a few times.


tl;dr?


Conclusions:

What can we conclude from all this? The software world is complex enough that anything you can say will almost undoubtedly have counter-examples. While we haven’t proved it here, one of your take-aways should be the following:

It is almost always better to vectorize than not to vectorize on Intel SIMD capable hardware.

If you are vectorizing (“LOOP WAS VECTORIZED”), then it is almost always better (faster) to vectorize with SoA “stride-1” data layout than the AoS data layout simply because your chances of the compiler doing something wonderful for you are much better. This is because there are fewer instructions needed (less work) in the SoA case. It is possible that SoA may not be dramatically faster than AoS for a given code depending on how a program exercises the memory subsystem, but it is unlikely that you will see many examples where AoS code would execute significantly faster than SoA code. So when in doubt or unless you can prove otherwise, prefer SoA to AoS, specially on Intel Xeon Phi. This tends to go against how many of us were trained to arrange data structures, but for quadruple-speed single precision execution and double-speed double precision execution over the alternative, you will want to consider it.

tl;dr:

So when in doubt or unless you can prove otherwise, prefer SoA to AoS


> quadruple-speed single precision execution and double-speed double precision

FYI Xeon Phi supports AVX512, i.e. 512 bit SIMD registers (8 doubles or 16 floats). In the consumer CPU world there is AVX2 which has 256 bit registers (4 doubles or 8 floats). AVX also has a 'fused multiply with add'(FMA) instruction which is useful for matrix multiplication amongst other things.


"So when in doubt or unless you can prove otherwise, prefer SoA to AoS"


Which is the crux of the situation -- it really depends on the problem domain, and SoA is only the right answer in some contexts. I naively tried to switch some code to SoA and found that I was missing on a lot more of my reads.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: