I don't understand the example given for _mm256_add_pd.
In the domino example, it shows some parts where 1+2 gives 4? What am I not understanding?
Edit: OK, I understand now. Not deleting the comment in case someone else is also wondering: there are 4 values, each one the total number of points on a domino. So you must consider the total number of dots on each of the 4 dominoes. The choice here is a bit unfortunate; it would have been clearer to show ymm0 (2,3,2,5) + ymm1 (3,2,5,2) -> ymm2 (5,5,8,8) in a table.
> The lesson here just reinforces the adage about letting the compiler find and transform your source code into vectorized code with the programmer giving hints or small rewrites as needed.
This has been my M.O. for a number of years, but sadly on modern platforms there seems to be little logic to which code is fast and which isn't. Even if the compiler emits vector instructions, that is no guarantee that the code will be particularly fast IME. Things like cache misses obviously play a role, but sometimes it's just a plain mystery why fast-looking code is a bit slow. For that purpose I keep a benchmark of various ways of doing common array operations that I can run on each platform; it's quite handy.
Note that simple reductions, like level 1 BLAS, are memory-bound, and optimizing is generally only worthwhile for short arrays. (Contrast level 3, i.e. matrix-matrix operations.) Also it's generally arithmetically incorrect to vectorize them: floating-point addition is not associative, and a vectorized sum adds the elements in a different order than the source specifies. That said, the BLAS test suites I've tried do pass when vectorized with gcc -funsafe-math-optimizations or the incorrect-by-default Intel compiler.
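To make the "arithmetically incorrect" point concrete, here is a rough sketch in plain C. Both function names are made up, and the second one is only a scalar model of what a 4-lane vectorized sum does, not actual compiler output:

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum, i.e. the order the source code spells out. */
static double sum_sequential(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Scalar model of a 4-lane vectorized sum: lane k accumulates
   x[k], x[k+4], x[k+8], ... and the partial sums are combined at the end. */
static double sum_four_lanes(const double *x, size_t n)
{
    assert(n % 4 == 0); /* keep the sketch simple: no remainder loop */
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i += 4)
        for (size_t k = 0; k < 4; k++)
            lane[k] += x[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With x = {1e20, 1, 1, 1, -1e20, 1, 1, 1} and IEEE 754 doubles, the sequential version returns 3 (each 1 added to 1e20 is rounded away) while the four-lane version returns 6, which happens to be the exact answer. Neither order is "right" in general; the point is only that they differ, so a compiler has to be told that reordering is acceptable.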
The general technique of restructuring arrays of data structures is a standard optimization. However, if you just write the computation described straight down in Fortran, (at least recent) gfortran -Ofast unrolls and vectorizes it. Hooray for Fortran types.
This article seems to be written very unclearly, but by "reduction" he seems to mean a vectorized function that takes a double[4], sums the 4 values, and fills another double[4] with 4 copies of the result.
I don't really see any explanation of why you'd want to do this. Not only is there an instruction for doing exactly this in most vector instruction sets, but you typically don't need to do it inside your hot loop anyway.
> Not only is there an instruction for doing exactly this in most vector instruction sets
Is there a single one in AVX for adding all the components of a vector register? Is it fast? How does it associate the computation? If there is such an instruction, GCC and Clang seem reluctant to emit it: https://gcc.godbolt.org/z/vkaeYi though I might be missing some magic flags.
Yes, this is a horizontal add, but HADDPD only does a very specific part of a full horizontal add. You can use it to compute x[0] + x[1] in one part of a register and x[2] + x[3] in another part, but afterwards you would still need to shuffle and add.
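The hadd, shuffle, add sequence can be sketched with AVX intrinsics roughly as follows. This is one possible way to write it, not necessarily what the article or any compiler does; `hsum4_broadcast` is a made-up name, and the scalar branch is a fallback for builds without AVX that uses the same association:

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: every element of out receives x[0]+x[1]+x[2]+x[3]. */
static void hsum4_broadcast(const double x[4], double out[4])
{
#if defined(__AVX__)
    __m256d v = _mm256_loadu_pd(x);
    /* Pairwise sums within each 128-bit half:
       (x0+x1, x0+x1, x2+x3, x2+x3) */
    __m256d pairs = _mm256_hadd_pd(v, v);
    /* Swap the two 128-bit halves:
       (x2+x3, x2+x3, x0+x1, x0+x1) */
    __m256d swapped = _mm256_permute2f128_pd(pairs, pairs, 0x01);
    /* Each lane now holds the sum of (x0+x1) and (x2+x3). */
    _mm256_storeu_pd(out, _mm256_add_pd(pairs, swapped));
#else
    /* Scalar fallback with the same association. */
    double lo = x[0] + x[1];
    double hi = x[2] + x[3];
    for (int i = 0; i < 4; i++)
        out[i] = lo + hi;
#endif
}
```

So the full reduction takes three instructions after the load rather than one, and the sum is associated as (x0+x1) + (x2+x3) rather than left to right.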
Do you have an input program and an exact command line to experiment with? Using my function from above with -Ofast -ffast-math -march=haswell does not produce this with GCC 8.1 on Compiler Explorer.
Thanks. There probably wasn't a regression, it's just that this is not the code I was asking about. Your code works for the special case of horizontally adding two elements, but not for the more general case (4 elements) that this whole thread was about.