Hacker News
Can You Write a Vectorized Reduction Operation? (2016) (intel.com)
50 points by flipchart on May 9, 2020 | 19 comments



I don't understand the example given for _mm256_add_pd. In the domino example, it shows some parts where 1+2 gives 4? What am I missing?

Edit: ok, I understand now. Not deleting the comment in case someone else is also wondering: there are 4 values, each being the total number of points on the domino being considered. So you must consider the total number of dots in each of the 4 dominoes. The choice here is a bit unfortunate; it would have been clearer to say ymm0(2,3,2,5) + ymm1(3,2,5,2) -> ymm2(5,5,8,8) in a table.


5,5,7,7 you mean


Damn, you're right. I'm definitely not thinking right today.


> The lesson here just reinforces the adage about letting the compiler find and transform your source code into vectorized code with the programmer giving hints or small rewrites as needed.

This has been my M.O. for a number of years, but sadly on modern platforms there seems to be little logic to which code is fast and which isn't. Even if the compiler emits vector instructions, that is no guarantee that the code will be particularly fast IME. Things like cache misses play a role obviously, but sometimes it's just a plain mystery why fast-looking is a bit slow. For that purpose I keep a benchmark of various ways of doing common array operations that I can run on each platform, it's quite handy.


Hinting alignment is also fairly difficult, and most compilers fall down the moment you put any conditions in your loop.


On x86 you can use something like MAQAO or one of the top-down analysis tools for guidance.


Here's a good presentation from Nvidia on the topic (Cuda specific): https://developer.download.nvidia.com/assets/cuda/files/redu...


That presentation is pretty old - must be over a decade since the example uses G80. Nvidia has shipped 5 or 6 new architectures since then.

Kepler introduced a 'shuffle' intrinsic to accelerate the kind of 'horizontal' reductions discussed in the article, see http://on-demand.gputechconf.com/gtc/2013/presentations/S317....


Note that simple reductions, like level 1 BLAS, are memory-bound, and optimizing is generally only worthwhile for short arrays. (Contrast level 3, i.e. matrix-matrix operations.) Also it's generally arithmetically incorrect to vectorize them. That said, the BLAS test suites I've tried do pass when vectorized with gcc -funsafe-math-optimizations or the incorrect-by-default Intel compiler.

The general technique of restructuring arrays of data structures is a standard optimization. However, if you just write the computation described straight down in Fortran, (at least recent) gfortran -Ofast unrolls and vectorizes it. Hooray for Fortran types.


This article seems to be written very unclearly, but by "reduction" it seems to mean a vectorized function that takes a double[4], sums the 4 values, and fills another double[4] with 4 copies of the result.

I don't really see any explanation of why you'd want to do this. Not only is there an instruction for doing exactly this in most vector instruction sets, but you typically don't need to do it inside your hot loop anyway.


> by "reduction" it seems to mean a vectorized function that takes a double[4] and sums the 4 values

"Reduce" or "fold" is a standard term for applying a binary operation (like +) to all elements of a vector/list/more general data structure: https://en.wikipedia.org/wiki/Fold_(higher-order_function)

> Not only is there a instruction for doing exactly this in most vector instruction sets

Is there a single one in AVX for adding all the components of a vector register? Is it fast? How does it associate the computation? If there is such an instruction, GCC and Clang seem reluctant to emit it: https://gcc.godbolt.org/z/vkaeYi though I might be missing some magic flags.


Unless I completely misunderstood, isn't that a horizontal add, i.e. HADDPD[1]?

edit:

apparently GCC doesn't specifically optimize reductions [2] and does a generic vectorization instead.

[1] https://www.felixcloutier.com/x86/haddpd

[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54400


Yes this is a horizontal add, but HADDPD only does a very specific part of a full horizontal add. You can use it to compute x[0] + x[1] in part of a register and x[2] + x[3] in another part, but afterwards you would still need to shuffle and add.


It's also slow (5-6 cycle latency and 2 cycle throughput on Intel processors, compared to 4/0.5 for multiplication).


GCC 8 -Ofast, targeting Haswell, produces

vhaddpd %xmm0, %xmm0, %xmm0


Do you have an input program and an exact command line to experiment with? Using my function from above with -Ofast -ffast-math -march=haswell does not produce this on the GCC 8.1 on Compiler Explorer.


Just per #54400 above.

    $ gcc --version | head -1
    gcc (Debian 8.3.0-6) 8.3.0
    $ cat x.c
    #include <x86intrin.h>
    double f(__m128d v){return v[1]+v[0];}
    $ gcc  -Ofast -march=haswell -S x.c
    $ grep -C1 hadd x.s
            .cfi_startproc
            vhaddpd %xmm0, %xmm0, %xmm0
            ret
GCC 4.8.5 with adjusted options does the same, and 7.5, so perhaps there was some regression in early 8.

(-ffast-math is redundant with -Ofast.)


Thanks. There probably wasn't a regression, it's just that this is not the code I was asking about. Your code works for the special case of horizontally adding two elements, but not for the more general case (4 elements) that this whole thread was about.


Oh, interesting. I explicitly tried -ffast-math and a few optimization targets, but not haswell specifically.





