
Can You Write a Vectorized Reduction Operation? (2016) - flipchart
https://software.intel.com/content/www/us/en/develop/blogs/can-you-write-a-vectorized-reduction-operation.html
======
makapuf
I don't understand the example given for _mm256_add_pd. In the domino example,
it shows some parts where 1+2 gives 4? What am I missing?

Edit: OK, I understand now. Not deleting the comment just in case someone else
is also wondering: there are 4 values, each being the total points on the
domino being considered. So you must consider the total number of dots on each
of the 4 dominoes. The choice here is a bit unfortunate; it would have been
clearer to say ymm0(2,3,2,5) + ymm1(3,2,5,2) -> ymm2(5,5,8,8) in a table.

~~~
rurban
5,5,7,7 you mean

~~~
makapuf
Damn, you're right. I'm definitely not thinking straight today.

------
svantana
> The lesson here just reinforces the adage about letting the compiler find
> and transform your source code into vectorized code with the programmer
> giving hints or small rewrites as needed.

This has been my M.O. for a number of years, but sadly on modern platforms
there seems to be little logic to which code is fast and which isn't. Even if
the compiler emits vector instructions, in my experience that is no guarantee
that the code will be particularly fast. Things like cache misses obviously
play a role, but sometimes it's just a plain mystery why fast-looking code is
slow. For that reason I keep a benchmark of various ways of doing common array
operations that I can run on each platform; it's quite handy.

~~~
saagarjha
Hinting alignment is also fairly difficult, and most compilers fall down the
moment you put any conditions in your loop.

------
andoma
Here's a good presentation from Nvidia on the topic (Cuda specific):
[https://developer.download.nvidia.com/assets/cuda/files/redu...](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf)

~~~
rrss
That presentation is pretty old: it must be over a decade, since the example
uses G80. Nvidia has shipped 5 or 6 new architectures since then.

Kepler introduced a 'shuffle' intrinsic to accelerate the kind of 'horizontal'
reductions discussed in the article; see
[http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf](http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf).

------
gnufx
Note that simple reductions, like level 1 BLAS, are memory-bound, and
optimizing is generally only worthwhile for short arrays. (Contrast level 3,
i.e. matrix-matrix operations.) Also it's generally arithmetically incorrect
to vectorize them. That said, the BLAS test suites I've tried do pass when
vectorized with gcc -funsafe-math-optimizations or the incorrect-by-default
Intel compiler.

The general technique of restructuring arrays of data structures is a standard
optimization. However, if you just write the computation described straight
down in Fortran, (at least recent) gfortran -Ofast unrolls and vectorizes it.
Hooray for Fortran types.

------
BorisTheBrave
This article seems to be written very unclearly, but by "reduction" it seems
to mean a vectorized function that takes a double[4], sums the 4 values, and
fills another double[4] with 4 copies of the result.

I don't really see any explanation of why you'd want to do this. Not only is
there an instruction for doing exactly this in most vector instruction sets,
but you typically don't need to do it inside your hot loop anyway.

~~~
tom_mellior
> by "reduction" it seems to mean a vectorized function that takes a
> double[4] and sums the 4 values

"Reduce" or "fold" is a standard term for applying a binary operation (like +)
to all elements of a vector/list/more general data structure:
[https://en.wikipedia.org/wiki/Fold_(higher-order_function)](https://en.wikipedia.org/wiki/Fold_\(higher-order_function\))

> Not only is there an instruction for doing exactly this in most vector
> instruction sets

Is there a single one in AVX for adding all the components of a vector
register? Is it fast? How does it associate the computation? If there is such
an instruction, GCC and Clang seem reluctant to emit it:
[https://gcc.godbolt.org/z/vkaeYi](https://gcc.godbolt.org/z/vkaeYi) though I
might be missing some magic flags.

~~~
gpderetta
Unless I completely misunderstood, isn't that a horizontal add, i.e.
HADDPD [1]?

edit:

apparently GCC doesn't specifically optimize reductions [2] and does a generic
vectorization instead.

[1]
[https://www.felixcloutier.com/x86/haddpd](https://www.felixcloutier.com/x86/haddpd)

[2]
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54400](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54400)

~~~
gnufx
GCC 8 -Ofast, targeting Haswell, produces

vhaddpd %xmm0, %xmm0, %xmm0

~~~
tom_mellior
Do you have an input program and an exact command line to experiment with?
Using my function from above with -Ofast -ffast-math -march=haswell does not
produce this on the GCC 8.1 on Compiler Explorer.

~~~
gnufx
Just per #54400 above.

    $ gcc --version | head -1
    gcc (Debian 8.3.0-6) 8.3.0
    $ cat x.c
    #include <x86intrin.h>
    double f(__m128d v){return v[1]+v[0];}
    $ gcc -Ofast -march=haswell -S x.c
    $ grep -C1 hadd x.s
            .cfi_startproc
            vhaddpd %xmm0, %xmm0, %xmm0
            ret

GCC 4.8.5 with adjusted options does the same, as does 7.5, so perhaps there
was some regression in early 8.

(-ffast-math is redundant with -Ofast.)

~~~
tom_mellior
Thanks. There probably wasn't a regression; it's just that this is not the
code I was asking about. Your code works for the special case of horizontally
adding _two_ elements, but not for the more general case (4 elements) that
this whole thread was about.

