
Flavors of SIMD - ingve
https://zeuxcg.org/2019/02/17/flavors-of-simd/
======
svantana
I find that modern C++ compilers can do a pretty good job of auto-
vectorization, given that the code is "vector-friendly" to begin with. But
writing code in that style is a bit of black magic with lots of trial and
error. I wish compiler vendors would publish "style guides" for this purpose,
or at least a collection of best practices.

~~~
abainbridge
Hmmm, I feel like I've been waiting for a compiler to get clever enough to
auto-vectorize my code since MMX was new. It never seems to happen. I guess I
need that style guide.

For example, here's a hello-world type SIMD problem: auto-vectorize a simple
alpha blending pixel painting routine.

    
    
      struct Pixel { unsigned char r, g, b, a; };
    
      // Draw a horizontal row of alpha blended pixels into a scan line.
      // Assume clipping is already done.
      void hline(Pixel *scanline, int startIdx, int len, Pixel c) {
        len &= 0xfff0; // Let the compiler assume that the length is always a multiple of 16.
        int invAlpha = 255 - c.a;
        for (int i = 0; i < len; i+=4) {
          Pixel &dest = scanline[startIdx + i];
          dest.r = (dest.r * invAlpha + c.r * c.a) >> 8;
          dest.g = (dest.g * invAlpha + c.g * c.a) >> 8;
          dest.b = (dest.b * invAlpha + c.b * c.a) >> 8;
        }
      }
    

And here is GCC 8 _not_ auto-vectorizing it:
[https://godbolt.org/z/zPZL-J](https://godbolt.org/z/zPZL-J)

Is this the kind of thing you think should be auto-vectorizable by the
compiler? If so, any idea how to persuade the compiler to do it?

~~~
jcranmer
There are a few issues impeding vectorization that I see:

1. The memory loads are not stride 1. They're stride 12, effectively.

2. You have i8 -> i32 -> i8 operations going on. Mixed-size conversion is
generally problematic for vectorization.

3. You're storing 3 bytes of every 12 bytes. The compiler cannot generate
extra stores, so it has to introduce masked stores. That's expensive.

The optimal vectorized code for this sort of algorithm is probably an
unroll-and-jam: load 4 pixels, shuffle the vectors into separate r/g/b/a
vectors for computation, do the computation, and then shuffle them back into
interleaved pixel order for storage. This requires some code rewriting --
inserting a dummy dest.a = dest.a * 1.0; might convince the compiler to store
all four channels, although it could well eliminate the store before
vectorization.

This sort of thing is why you end up needing a more specialized language for
vector codes, such as ISPC.

~~~
abainbridge
> The memory loads are not stride 1. They're stride 12, effectively.

Stride 16, no? Anyhow, that's a bug. I shouldn't have put the +=4 in the loop
iteration increment. Once I fixed that and added a dummy set of dest.a, then I
get auto-vectorization. Woo-hoo. See the link I posted in my reply to
tomsmeding.
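
For reference, a minimal sketch of the fixed routine -- i++ instead of i+=4,
plus a dummy alpha write so the compiler can emit full 4-byte stores (the
exact dummy expression here is illustrative, not necessarily the one in the
godbolt link):

    
    
      struct Pixel { unsigned char r, g, b, a; };
    
      void hline(Pixel *scanline, int startIdx, int len, Pixel c) {
        len &= 0xfff0;
        int invAlpha = 255 - c.a;
        for (int i = 0; i < len; i++) { // one pixel per iteration
          Pixel &dest = scanline[startIdx + i];
          dest.r = (dest.r * invAlpha + c.r * c.a) >> 8;
          dest.g = (dest.g * invAlpha + c.g * c.a) >> 8;
          dest.b = (dest.b * invAlpha + c.b * c.a) >> 8;
          dest.a = (dest.a * invAlpha + c.a * c.a) >> 8; // dummy write
        }
      }
    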

------
Const-me
> It’s not immediately clear how to make this use SIMD… unless you use
> gathers.

Gathers are slow.

With SSE2, you only need 2 instructions to check for duplicate indices. First
_mm_shuffle_epi32 to “rotate” the triangle [a,b,c] into e.g. [b,c,a], second
_mm_cmpeq_epi32 to do per-lane comparison between the original and rotated.

Similarly, with AVX2, the same approach can be used to check 2 triangles at
once.

For best results, unroll the loop into batches of 4 triangles (8 for AVX2),
and use 128/256-bit loads.
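
A rough sketch of that check for a single triangle (the movemask at the end
is just to pull the per-lane result into a scalar; lane 3 is padding and gets
masked off):

    
    
      #include <emmintrin.h> // SSE2
    
      // Nonzero if triangle [a,b,c] contains a duplicate vertex index.
      int hasDuplicateIndex(int a, int b, int c) {
        __m128i tri = _mm_setr_epi32(a, b, c, -1);
        // "Rotate" [a,b,c,pad] into [b,c,a,pad].
        __m128i rot = _mm_shuffle_epi32(tri, _MM_SHUFFLE(3, 0, 2, 1));
        // Per-lane compare covers all three pairs: a==b, b==c, c==a.
        __m128i eq = _mm_cmpeq_epi32(tri, rot);
        // Lane 3 compares padding against itself, so mask it out.
        return _mm_movemask_epi8(eq) & 0x0FFF;
      }
    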

------
BooneJS
> unfortunately, the use of SIMD can make the code less portable and less
> maintainable

Unfortunately, general-purpose processors aren't getting any faster, but they
are getting more specialized. At some point in the future, package managers
for OSS may have to do LLVM compilation on the machine just to specifically
target the bits and baubles of the architecture it's running on.

~~~
dragontamer
Good GPU (SIMD) code is written in a very, very different manner than good CPU
code.

Case in point: NVidia has 32 x 32-bit shaders per L1 cache / shared memory (a
Streaming Multiprocessor, or SM). Each SM can run 32 warps at a time, for an
effective 1024 "shader threads" per L1 cache.

In contrast: CPUs have 1 core, with 2-way SMT, per L1 cache. That's 2 threads
per L1 cache. IBM has a Power9 CPU with 8-way SMT, for 8 threads on one core,
but that's as far as you get with traditional CPUs.

Effectively: GPUs are memory-constrained. Not memory-bandwidth constrained,
btw (GPUs have a TON of bandwidth), but literally memory-capacity
constrained. Each shader only has ~500 bytes, probably less, that it can
access efficiently (more if you have "uniform" data that can be shared
between shaders), and maybe 2MB total per shader before you run out of RAM
entirely.

---------

GPU code is innately parallel, so any memory allocation you do is multiplied
a thousandfold or more. Vega64 needs at least 16384 work-items before it has
occupancy 1 per vALU (64 compute units, 4 work queues per compute unit, 64
vALUs per queue)... and needs 10x that many work-items for max occupancy (a
total of 163840 work-items in flight).

If you have max occupancy 10 and allocate 16 bytes per work-item, you've just
allocated 2.6MB of data on the GPU. I'm not kidding. You run out of space
very quickly on GPUs.

While GPUs struggle with this multiplicative problem per work-item, CPUs have
a ton of memory: typically 32kB of L1 cache available per thread (64kB per
core, but typically two threads per core thanks to Hyperthreading).

That's the core of GPU vs CPU programming IMO. Dealing with the grossly
constrained memory conditions of the GPU environment.

~~~
jcranmer
> In contrast: CPUs have 1-core, with 2-way SMT. That's 2-threads per L1
> cache.

Comparing GPUs and CPUs is tricky. What you want to compare is not the number
of logical threads that can be independently scheduled [1], but the number of
simultaneous FMA lanes you can access. If you have an AVX2 processor, that's
2 FMA units each processing 8 lanes of 32-bit values, or a total of 16 lanes
per L1 cache. AVX-512 gives 32 lanes due to the wider vectors.

[1] The independent scheduling of a GPU thread is also a bit of a lie. GPU
threads are closer to maskable lanes of a wide vector, rather like the mask
vectors of AVX-512.
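
For concreteness, one AVX2 FMA looks like this -- a minimal sketch (compile
with FMA enabled, e.g. -mfma; a Haswell-class core can issue two of these per
cycle, hence the 16 lanes per L1 cache above):

    
    
      #include <immintrin.h>
    
      // One FMA instruction: 8 lanes of 32-bit floats, a*b + c per lane.
      __m256 fma8(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);
      }
    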

~~~
dragontamer
> Comparing GPUs and CPUs are tricky

I agree it is tricky, but my point in the above post is not to compare
processing girth, but to compare memory allocation. If you are aiming for
occupancy 10 on Vega64, every 32-bit DWORD you allocate consumes a total of
655,360 bytes.

Yes, this code, a simple 4-byte allocation in OpenCL:

    
    
        __private uint32_t foobar;
    

This code just allocated 655,360 bytes on Vega64 (assuming worst-case
occupancy 10 per queue), while the following OpenCL code:

    
    
        __private uint64_t wow;
    

just allocated 1.25 MB. By itself, from a single line of code. The total
amount of memory used adds up surprisingly quickly on GPUs. This is IMO the
biggest issue with GPU programming: the absurdly small amount of memory
you're expected to work with per shader.
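
Spelling out the arithmetic behind those numbers (using the Vega64 figures
above):

    
    
      64 CUs x 4 queues x 64 vALUs x 10 occupancy = 163,840 work-items
      163,840 work-items x 4 bytes (uint32_t)     =   655,360 bytes
      163,840 work-items x 8 bytes (uint64_t)     = 1,310,720 bytes (1.25 MB)
    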

------
leeter
A few thoughts:

  * This might be a good candidate for a compute shader; the swizzle abilities of shaders make quite a few of the things they are doing much easier, as do the faster memory connections GPUs (usually) have.

  * The committee is actually looking at potential ways to handle this; [http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1101r0.html](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1101r0.html) has been proposed. There is also a lot of discussion about supporting GPUs in a standard way, but I haven't seen any proposals myself.

~~~
gameswithgo
IIRC, the domain this guy is working in is one where the GPU is going to be
busy with rendering, so he may prefer to keep this on the CPU. But you are
right that any time you turn to SIMD, you might be able to turn to the GPU
and get a much bigger win.

~~~
leeter
So what I'm somewhat curious about is whether this isn't an argument for
heterogeneous GPU usage: schedule this on an iGPU while the main GPU is busy.
The main downside, as I alluded to above, would be memory bandwidth, unless
there is a good on-die cache for the iGPU or it has dedicated memory. That
said, with a multi-GPU setup you might also have another card you could use.

The other thought I have is that it's very difficult to keep the GPU 100%
saturated, in my experience. If the vertex count is high enough to make the
PCI-e transfer worth it (and it would seem to be), then you can potentially
cycle-steal a bit on the main GPU. This will potentially slow down the main
render but speed things up overall. This would be particularly true if you
can rebind the resource back for use in the main render pipe.

~~~
gameswithgo
All ideas that can work in certain situations. As GPUs get more like CPUs and
CPUs get more like GPUs, things will just get fuzzier, I imagine.

------
microtherion
One option which I have not seen mentioned here yet is to use a
platform-independent API with a platform-optimized implementation. I.e., if
your operations can be expressed in terms of BLAS primitives, then ATLAS can
give you a good part of the benefit of hand-optimized code without much of
the hassle: [http://math-atlas.sourceforge.net](http://math-atlas.sourceforge.net)

~~~
gnufx
Yes, that's important for portable performance, but not with ATLAS. Unless
it's improved since I last tried it, it's generally slower than OpenBLAS (or
BLIS, by extension), and it doesn't do dynamic dispatch on the
micro-architecture like OpenBLAS and BLIS do -- currently on x86_64, and also
aarch64 with OpenBLAS. (To deal with small matrices, on x86_64 only, you want
libxsmm backed by OpenBLAS or BLIS.) From previous experience I'd guess BLIS'
new generic C micro-kernel for GEMM will beat ATLAS, with recent GCC
vectorizing it. Note that in the case of GEMM, the most important part of
BLAS, vectorizing is only really relevant after you've restructured the
loops.

------
dragontamer
Some thoughts on my experiments with SIMD Programming:

1. AVX2 is good, but tedious to use manually. The difficulty with AVX2 is
that it is SIMD of SIMD: it's 2-way SIMD of 128-bit lanes. Going "across" the
128-bit lanes can only be done with rare instructions... or through L1 cache
(write to memory, then read back into another register).

2\. "#pragma OpenMP SIMD" seems to be the most portable way to attempt to
"force" autovectorization. Its compatible across GCC, CLang, and ICC, and
other compilers. Visual Studio unfortunately does NOT support this feature,
but VSC++ has well documented auto-vectorization features.

3. If you are sticking to Visual C++, its auto-vectorization capabilities
are pretty good. Enable compiler warnings so that you know which loops fail
to auto-vectorize, and be sure to read those warnings carefully.
[https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization?view=vs-2017](https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization?view=vs-2017)

4. If you keep reaching for the SIMD button, the GPU programming model seems
superior. If you must use a CPU, try ISPC, Intel's SPMD Program Compiler
([https://ispc.github.io/](https://ispc.github.io/)), so that your
"programming model" is at least correct.

5. If a huge portion of your code is SIMD-style, a dedicated GPU is better.
GPUs have more flops and memory bandwidth, faster local memory (aka "shared
memory" on NVidia, or "LDS" on AMD), and faster thread-group communications
than a CPU. Know how amazing vpshufb is? Well, GPUs will knock your socks off
with ballot(), CUDA __shfl(), AMD cross-lane operations, and more
([https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/](https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/)).

6. To elaborate on point #5: GPUs simply have better gather/scatter
capabilities. The GPU's "shared" or "LDS" memory (just 64kB or so) is very
small, but provides arbitrary gather/scatter capabilities across "lanes" of
GPU SIMD units. They even support relatively efficient atomic operations.
Yes, even vpshufb seems relatively limited compared to what is available on
GPUs.

7. Raw AVX2 assembly seems "easy" if what you need is a 128-bit or 256-bit
register. For example, if you are writing complex doubles (real + imaginary
parts), then it is very straightforward to write 128-bit SIMD code to handle
your arithmetic. But if you are writing "true SIMD" code, such as the style
in the seminal 1986 paper "Data Parallel Algorithms"
([https://dl.acm.org/citation.cfm?id=7903](https://dl.acm.org/citation.cfm?id=7903)),
then stick with ISPC or GPU-style coding instead.

8. Be sure to read that paper, "Data Parallel Algorithms", to get insight
into true SIMD programming. GPU programmers already know what is in there,
but it's still cool to read one of the first papers on the subject (from
1986, no less!).
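
As promised after point #2, a minimal sketch of the pragma in use -- saxpy
here is just an illustrative kernel, not from the thread (GCC and Clang honor
the pragma with -fopenmp-simd, without linking the OpenMP runtime):

    
    
      #include <stddef.h>
    
      // The pragma asserts the loop iterations are independent, so the
      // compiler may vectorize even when it cannot prove safety itself.
      void saxpy(float *restrict y, const float *restrict x,
                 float a, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
      }
    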

~~~
gnufx
I can't comment on the GPU points, but you may be better off leaving the
vectorization to GCC than using the simd pragma. On Haswell, the pragma gets
you AVX2 but not FMA, so you lose a factor of two on GEMM, for instance. The
GCC manual also gives an example for the ivdep pragma.

~~~
pjmlp
Wouldn't that just be a question of missing improvements?

If I recall correctly, OpenJDK can use FMA thanks to Intel contributions.

~~~
gnufx
I don't see how OpenJDK is related to the OpenMP pragma. GCC has no problem
using FMA if you just let it, avoiding the pragma, which simply says "simd".

~~~
pjmlp
I understood that GCC auto-vectorization wouldn't do it currently, and hence
gave an example where auto-vectorization does make use of it, assuming I
remember Intel's session at Oracle Code One correctly.

