
Improving performance with SIMD intrinsics in three use cases - jbredeche
https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/
======
qppo
It would be useful to add a recursively defined algorithm like trapezoidal
integration, which is something that basically every autovectorization tool
falls over on:

    
    
        y[n] = (x[n] + x[n-1]) / 2 + y[n-1] 
    

This will quickly show you that you can't just replace the numeric operations
with their intrinsic counterparts and get a speedup (and if you did, you might
wind up with incorrect code!).

It also shows you that sometimes more is less. To vectorize that code you wind
up doing _more_ multiplication than the scalar version requires, and depending
on the conditions under which you execute it, it might not improve on the
scalar implementation.

There are more formal methods for converting these kinds of operations
(linear, time-invariant recursive systems/filters) to vector equivalents using
state space matrix forms, but then you start having to benchmark whether or
not you get any benefits.
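
For illustration, here is a rough scalar sketch (mine, not from the article) of
the block form of that recurrence with the state y[n-1] pulled out: the t[]
terms are independent and vectorize trivially, while the running sum that folds
the state back in is both the serial part and the source of the extra work.

        #include <stddef.h>
        
        // y[n] = (x[n] + x[n-1]) / 2 + y[n-1], processed in blocks of 4;
        // xPrev/yPrev carry the initial state x[-1], y[-1]
        void block4(const float *x, float *y, size_t n, float xPrev, float yPrev)
        {
            for (size_t i = 0; i + 4 <= n; i += 4)
            {
                float t[4];
                t[0] = 0.5f * (x[i]     + xPrev);
                t[1] = 0.5f * (x[i + 1] + x[i]);
                t[2] = 0.5f * (x[i + 2] + x[i + 1]);
                t[3] = 0.5f * (x[i + 3] + x[i + 2]);   // independent: easy to vectorize
                y[i]     = yPrev    + t[0];
                y[i + 1] = y[i]     + t[1];
                y[i + 2] = y[i + 1] + t[2];
                y[i + 3] = y[i + 2] + t[3];            // prefix sum of the block: the serial part
                xPrev = x[i + 3];
                yPrev = y[i + 3];
            }
        }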

~~~
dragontamer
I got nerd-sniped. Let's see...

    
    
        y[n] = (x[n] + x[n-1]) / 2
        y = prefixSum(y);
    

I'm pretty sure a prefix sum over y would solve the problem (see
[https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...](https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel)
for the vectorized solution to prefix sum).

Without the addition of y[n-1], the first line is now trivially parallelizable
by most auto-vectorizers. The prefix-sum probably needs to be written with
intrinsics by hand, but once written it generalizes to many operations.
(Prefix-max, prefix-min, prefix-XOR).
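
For a single 128-bit vector of floats, the hand-written piece is only a couple
of shifted adds; roughly like this sketch of the usual in-register scan (my
code, not from the article):

        #include <emmintrin.h>   // SSE2
        
        // inclusive prefix sum of the 4 lanes: [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d]
        static inline __m128 prefix_sum4( __m128 v )
        {
            // shift left by one lane and add, then by two lanes and add
            v = _mm_add_ps( v, _mm_castsi128_ps( _mm_slli_si128( _mm_castps_si128( v ), 4 ) ) );
            v = _mm_add_ps( v, _mm_castsi128_ps( _mm_slli_si128( _mm_castps_si128( v ), 8 ) ) );
            return v;
        }

To carry the sum across blocks, broadcast the highest lane of the previous
block's result and add it to the next block; swap the add for min/max/xor and
you get the other prefix operations.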

"Prefix-sum" is sometimes called "scan" in the literature
([https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda))

\-----------

EDIT: A surprising number of "recursive" definitions can become a prefix
operation (or scan). Associativity is the main requirement. (Note that
floating-point addition is not associative: rounding error builds up
non-associatively.) So you'd need to redo your error analysis if these were
floats.

~~~
Const-me
> the first line is now trivially parallelizable by most auto-vectorizers

It’s not that trivial to vectorize; it would be something like this for SSE:

    
    
        const __m128 x = _mm_loadu_ps( px );
        // Rotate the lanes
        __m128 xPrev = _mm_shuffle_ps( x, x, _MM_SHUFFLE( 2, 1, 0, 3 ) );
        // Insert the highest lane of previousStepX into lowest lane of xPrev
        // The IMM number means extract lane 3, insert into lane 0, don't zero out any lanes
        xPrev = _mm_insert_ps( xPrev, previousStepX, 0b11000000 );
        const __m128 halfSum = _mm_mul_ps( _mm_add_ps( x, xPrev ), _mm_set1_ps( 0.5f ) );
        previousStepX = x;
    

Which compiler do you think would do that automatically?

~~~
dragontamer
Why would you need to rotate the lanes?

    
    
        const __m128 xPrev = _mm_loadu_ps( px );      // x[n-1 .. n+2]
        const __m128 x     = _mm_loadu_ps( px + 1 );  // x[n   .. n+3]
    

_mm_loadu_ps: load unaligned. There's no alignment requirement on the addresses
you load from. There are probably penalties for loading across a cache line
boundary, but it's perfectly valid to do this.

\-----------

> Which compiler do you think would do that automatically?

GCC simply does the unaligned load:
[https://godbolt.org/z/2HwYMi](https://godbolt.org/z/2HwYMi)

Clang does the shuffle successfully:
[https://godbolt.org/z/Y3hbQn](https://godbolt.org/z/Y3hbQn)

The shuffle is probably optimal... I'm surprised Clang is good enough to
recognize the opportunity. But simply doing 2x loads from memory is probably
fine, especially because the 2nd load is guaranteed to come from L1 cache.
Besides, assuming 128-bit vectors, two 128-bit vectorized loads every 4
elements (averaging half a load per element) is still superior to scalar code
(one load per element).

Honestly, Clang surprised me by generating the superior shuffle code. I was
expecting both GCC and Clang to generate the suboptimal (but simpler)
unaligned loads.

~~~
Const-me
> Why would you need to rotate the lanes?

To halve the count of RAM access instructions. Even an L1D cache hit is still
slower than these two instructions; the shuffle and the insert are both quite
fast and deliver their result on the next cycle.

> GCC simply does the unaligned load

Right, that’s twice as many loads as necessary.

> Clang does the shuffle successfully

I wouldn’t say that code does it “successfully”. Clang uses shufps instructions
to shuffle integer vectors. On many CPUs, especially ones more than a couple of
years old, there’s a non-trivial latency of a few cycles to pass data between
the integer and floating-point execution units. The approach with these
shuffles is correct, but not the implementation. For integers it should have
used at least _mm_shuffle_epi32 = pshufd or _mm_slli_si128 = pslldq, but
ideally _mm_alignr_epi8 = palignr from SSSE3.
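
For the integer case, the palignr idiom looks roughly like this (my sketch, not
what clang actually emits); in a real loop `prev` would just be the previous
iteration's `cur` rather than a second load:

        #include <tmmintrin.h>   // SSSE3
        
        // build the "one element behind" vector from the previous and current blocks of 4 ints
        __m128i prev_elements( const int *p )   // p points at x[n]
        {
            const __m128i prev = _mm_loadu_si128( (const __m128i *)( p - 4 ) );  // x[n-4 .. n-1]
            const __m128i cur  = _mm_loadu_si128( (const __m128i *)( p ) );      // x[n   .. n+3]
            return _mm_alignr_epi8( cur, prev, 12 );                             // x[n-1 .. n+2]
        }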

~~~
dragontamer
I guess we have different measures of "success". I don't expect any compiler
to beat me in intrinsics or hand-crafted assembly. After all, I can use the
compiler. But the compiler can't use my brain. So in all circumstances, I will
always at least tie the compiler.

Its an "auto-vectorization success", because the pseudo-code I've laid out
auto-vectorizes to SSE 128-bit, AVX 256-bit, and AVX512 512-bit code.
Furthermore, CLang supports many architectures, and probably can convert my
simple for-loop into SIMD on other architectures without any intervention (ie:
ARM Neon)

The code doesn't need to be optimal, it just needs to be faster than the
scalar form and portable.

~~~
Const-me
Yes and no.

“Yes” because what clang did is a very good job; I'm pretty sure it's
substantially faster than scalar code, especially on newish CPUs like Zen 2.

“No” because, judging from what they currently generate for that particular
test case, they only need a few improvements to make the code as fast as a
manually vectorized version. Apparently the compiler is already aware of the
cost of RAM loads; they just need to account for the bypass delays between the
integer and FP execution units in their code generator.

~~~
nkurz
Frequently, when one is using SIMD, one is more concerned about throughput (how
many operations you can do per second) and less about latency (how many
nanoseconds until you get your first result back). I'm not sure whether you are
equating the two in your comments, or have an actual preference for lower
latency, but realize that one can often gain throughput by giving up some
latency.

I mention this because I don't share your certainty that avoiding duplicate
loads is necessarily "faster". I've occasionally been able to beat compilers at
getting higher throughput by making the duplicate load, despite the fact that
the latency increases. This is because vector shuffles (and I think aligns and
inserts) are limited to a single execution port (at least on recent Intel
processors), while loads can be executed twice per cycle. Depending on the mix
of the rest of the instructions, this means that sometimes the redundant load
ends up being the better choice.

Which is to say, rather than being sure, measure!

~~~
Const-me
I agree in general, and yes, I always use a profiler before writing these
intrinsics.

But still, that particular clang output is suboptimal. They have already
optimized loads with these shuffles. I think they did 80% of the work already.
They are just using the wrong shuffles.

------
mayoff
If you have a big need for SIMD on the CPU, you might want to look into ispc,
the Intel Implicit SPMD Program Compiler. (It supports ARM NEON too.)

[https://ispc.github.io/](https://ispc.github.io/)

------
SloopJon
It's great to see tables comparing the different approaches, but it's also a
bit discouraging to think that you might have to solve the problem a dozen
different ways. Fortunately, it does seem that naive vectorization, for lack
of a better term, is pretty effective. For grayscale, "vector floats SSE2" and
"vector floats AVX2" aren't far behind the winner. Similarly, for dot product,
I might call it a day after AvxDpPs dropped my time from 193 to 34.

------
gnufx
Note that the code GCC generates for reductions (dot product, anyhow) will not
be scalar with an option like -Ofast. That enables loop vectorization, but also
the use of non-standard-conforming SIMD for the non-associative operations. It
likely does as well as a simple use of intrinsics. Level 1 operations are only
worth worrying about much with small arrays, as they're ultimately memory-
bound.
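
To make that concrete: a plain dot-product loop like the sketch below (my
example, not the article's code) gets vectorized by GCC once -Ofast /
-ffast-math allows the reduction to be reassociated:

        #include <stddef.h>
        
        float dot( const float *a, const float *b, size_t n )
        {
            float acc = 0.0f;
            for ( size_t i = 0; i < n; ++i )
                acc += a[i] * b[i];   // a reduction: vectorizing it reorders the additions
            return acc;
        }

Compiling with e.g. gcc -Ofast -march=haswell should turn that loop into packed
multiply/FMA instructions over full vectors; with plain -O2/-O3 (and no
-ffast-math) it stays scalar because the compiler isn't allowed to reorder the
floating-point sum.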

~~~
jeffbee
It's not even scalar at -O1.

[https://godbolt.org/z/TRS4W6](https://godbolt.org/z/TRS4W6)

It generates vector instructions for all k8-like architectures, with wider
instructions starting from -march=sandybridge.

~~~
Const-me
> It generates vector instructions

vmovss, vmulss and vaddss instructions are not vector. They only load/store,
multiply or add the lowest lane of their operands. The equivalent vector
instructions are vmovaps, vmulps and vaddps respectively; these load, multiply
or add complete vectors, not just their lowest lanes.
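
In intrinsic terms the difference is just the _ss vs. _ps suffix; a quick
sketch (mine, not from the linked compiler output):

        #include <xmmintrin.h>   // SSE
        
        void scalar_vs_packed( const float *p, const float *q, float *out )
        {
            const __m128 a = _mm_loadu_ps( p );
            const __m128 b = _mm_loadu_ps( q );
            const __m128 s = _mm_mul_ss( a, b );   // vmulss: multiplies lane 0 only, upper lanes copied from a
            const __m128 v = _mm_mul_ps( a, b );   // vmulps: multiplies all four lanes
            _mm_storeu_ps( out,     s );
            _mm_storeu_ps( out + 4, v );
        }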

~~~
jeffbee
Right you are. I totally stopped reading at `vxorps` and didn't scan all the
other `_ss` properly.

~~~
gnufx
There would be something amiss if it was vectorizing at O1, and if it was
disregarding arithmetic rules (as the Intel compiler would by default).

------
dialecticDolt
Does anyone know any more sources that would be useful for learning about SIMD
intrinsics and how to use them? I've tried a few times (with moderate success)
to read through source code that uses them, but I've never found a good
introduction.

~~~
dragontamer
Honestly, most SIMD intrinsics are pretty simple and straightforward to use.
You can probably get 90% of the way there by simply browsing Intel's intrinsic
reference manual.

[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

Intel's intrinsics guide has approximate clock counts (latency and throughput)
for all instructions. There are a few obscure details to memorize, but if you
already understand the superscalar, pipelined, and out-of-order operation of
modern CPUs, it's all straightforward.

\---------

The last 10% of using SIMD successfully is obscure and eclectic. It's not
"difficult", it's just a whole bunch of tricks that you'll probably never
figure out on your own. I'm talking about pshufb magic, prefix sum / scan
patterns, gather/scatter patterns, and the like.

As complicated as pshufb is, the Intel intrinsic documentation does a great
job precisely describing the operation:
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=pshufb&expand=5194,5153,5153)

_mm_shuffle_epi8 gives you a register-to-register (no RAM touched, faster than
even L1 cache) 16-byte lookup table. An alternative viewpoint: it gives you a
register-to-register arbitrary gather command. I've seen so many tricks from
this one instruction alone.
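
For example, the 16-byte table lookup use of it looks something like this (a
sketch of the general idiom, not any particular trick):

        #include <tmmintrin.h>   // SSSE3
        
        // out byte i = table[ idx[i] & 0x0F ], or 0 where the index byte has its high bit set
        __m128i lookup16( __m128i table, __m128i idx )
        {
            return _mm_shuffle_epi8( table, idx );
        }

This is the building block behind tricks like nibble-at-a-time popcount and
in-register hex encoding.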

\---------

Another good resource is Intel's optimization manual. It's straightforward, but
a lot of material to read:
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)

Chapters 4, 5, and 6 deal with SIMD coding.

------
alleycat5000
Easy SIMD in the context of image processing is one of the cool things about
Halide:

[https://halide-lang.org](https://halide-lang.org)

It was personally eye-opening to see how vectorization can be the gift that
keeps on giving when you can sprinkle it throughout a computational pipeline,
vs. just spreading work across cores.

------
grecy
I always find code samples of SIMD instructions to be so esoteric that they're
harder to follow than necessary.

It's 2020, who uses variable names like 'acc', 'a' and 'b'?

There's also no real explanation of what 'a' and 'b' (or 'p1' and 'p2')
actually are. It's just magic.

I get that it's just an example, but the example feels so useless it doesn't
demonstrate why these SIMD calls are helpful at all.

~~~
chrisseaton
> It's 2020, who uses variable names like 'acc', 'a' and 'b'?

Mathematicians, scientists, engineers, artists. Have you read any numerical
work? They often use one-letter variables, or otherwise very abbreviated
variables. Sometimes they differentiate variables with the same letter by using
different _fonts_. Seems strange to us, but clearly it works for them!

And then they keep doing the same thing when they translate to code. When I
ask them they say they think our long variable names are unreadable. Maybe
they're right and we're wrong? Who knows.

They generally don't use tests, linting, CI, anything like that either. But
again, clearly it works for them!

~~~
superjan
If you write code that does basic arithmetic, a and b will do just fine in my
experience. If I’m implementing an existing algorithm, I like to stick as close
to the algorithm’s symbol usage as possible. I miss that I can’t use Greek
symbols like ϕ or σ as variables, but that’s probably just me.

~~~
minerjoe
Depending on your programming language, you can definitely use those Greek
symbols as variables.

~~~
sgillen
Yup, was just about to say this. That being said, I usually avoid doing this
even though I do a lot of numerical work. They're usually more of a pain to
type out, and I don't trust that they'll show up properly in every text editor
the code might wind up in.

