Flavors of SIMD (zeuxcg.org)
104 points by ingve 64 days ago | 58 comments

I find that modern C++ compilers can do a pretty good job of auto-vectorization, given that the code is "vector-friendly" to begin with. But writing code in that style is a bit of black magic with lots of trial and error. I wish compiler vendors would publish "style guides" for this purpose, or at least a collection of best practices.

Leveraging auto vectorization can be nice, as you can compile for different targets with one set of code, and not have to write out intrinsics for SSE2, AVX2, AVX512 etc.

But yeah, it is tricky: you might get it all sorted and a compiler upgrade ruins it, or it won't work on another compiler, or someone updates your function later, not realizing you carefully tuned it to get vectorized, and ruins it.

Also there are some clever hacks you can do with intrinsics that compilers won't figure out sometimes.

I made a library in Rust that will let you write a single function using abstracted intrinsics, that can produce SSE2/SSE41/AVX2 versions at compile time:


Seems to me like it would be a good idea to have a tool (superoptimiser?) that can take your simple code and rewrite it with intrinsics, which you then check into source control. So you have a doFooGeneric and have the tool generate doFooSSE, doFooSSE4, doFooAVX2, doFooNEON and select one at runtime (like Intel's compiler does) or at compile time. Since you'll only run the tool infrequently, it can spend a lot of time trying to find the best vectorisation.

Another Rust library, "Faster", attempts to do that. It's cool, but the downside is that simple code can't always translate optimally to intrinsics. Take a "floor" operation, for instance: SSE2 has no floor instruction. So you can either do a floor one element at a time, use some clever combination of other SSE2 intrinsics to implement floor, or, if you can accept certain constraints, use a faster option with fewer intrinsics. So your simple code needs to specify which you want, which ends up being kind of like intrinsics.

Also your simple code has to line everything up in arrays anyway so intrinsics don't add that much complexity. I do overload operators in SIMDeez so some of the time your code can look pretty simple.

Side note: GCC supports feature based dispatch too[0]

[0]: https://lwn.net/Articles/691932/

Yes, but target_clones, at least, isn't a panacea. If you know how to make it work in a shared library, I'm rather interested.

I have a similar project in C++ supporting SSE2, AVX2, and AVX512 here: https://github.com/dnbaker/vec.

I've used it for numerical code and sketch data structures, for example. (See https://github.com/dnbaker/frp or https://github.com/dnbaker/sketch, respectively).

Most of the work I do, the compiler just isn't able to vectorize, so I just write intrinsics. I similarly wrapped FFTW and the Sleef trigonometric function library so that I can generate optimal code for a variety of types and architectures.

FWIW clang has a feature called opt-viewer [1] [2] that isn't quite a style guide but instead gives explicit indication when and why it couldn't optimize particular code.

[1] https://llvm.org/devmtg/2016-11/Slides/Nemet-Compiler-assist...

[2] https://github.com/androm3da/optviewer-demo

Useful, thanks! God knows why they chose YAML though...

Hmmm, I feel like I've been waiting for a compiler to get clever enough to auto-vectorize my code since MMX was new. It never seems to happen. I guess I need that style guide.

For example, here's a hello-world type SIMD problem: auto-vectorize a simple alpha blending pixel painting routine.

  struct Pixel { unsigned char r, g, b, a; };

  // Draw a horizontal row of alpha blended pixels into a scan line.
  // Assume clipping is already done.
  void hline(Pixel *scanline, int startIdx, int len, Pixel c) {
    len &= 0xfff0; // Let the compiler assume that the length is always a multiple of 4.
    int invAlpha = 255 - c.a;
    for (int i = 0; i < len; i+=4) {
      Pixel &dest = scanline[startIdx + i];
      dest.r = (dest.r * invAlpha + c.r * c.a) >> 8;
      dest.g = (dest.g * invAlpha + c.g * c.a) >> 8;
      dest.b = (dest.b * invAlpha + c.b * c.a) >> 8;
    }
  }
And here is GCC 8 _not_ auto-vectorizing it: https://godbolt.org/z/zPZL-J

Is this the kind of thing you think should be auto-vectorizable by the compiler? If so, any idea how to persuade the compiler to do it?

There are a few issues impeding vectorization that I see:

1. The memory loads are not stride 1. They're stride 12, effectively.

2. You have i8 -> i32 -> i8 operations going on. Mixed size conversion is generally problematic for vectorization.

3. You're storing 3 bytes of every 12 bytes. The compiler cannot generate extra stores, so it has to introduce masked stores. That's expensive.

The optimal vectorized code for this sort of algorithm is probably an unroll-and-jam: load 4 pixels, shuffle the vectors into separate r/g/b/a vectors for computation, do the computation, and then shuffle them back into interleaved pixel order for storage. This requires some code rewriting--inserting a dummy dest.a = dest.a * 1.0; might convince the compiler to store, although it could well eliminate the store before vectorization.

This sort of thing is why you end up needing a more specialized language for vector codes, such as ISPC.

> The memory loads are not stride 1. They're stride 12, effectively.

Stride 16, no? Anyhow, that's a bug. I shouldn't have put the +=4 in the loop iteration increment. Once I fixed that and added a dummy set of dest.a, then I get auto-vectorization. Woo-hoo. See the link I posted in my reply to tomsmeding.

Here's GCC 8 providing diagnostics (see also -fopt-info-vec-all)

    $ g++ -march=haswell -Ofast -fopt-info-vec-missed x.C -c
    x.C:8:23: note: interleaved store with gaps
    x.C:8:23: note: not vectorized: complicated access pattern.
    x.C:8:23: note: bad data access.
    x.C:8:23: note: not vectorized: not enough data-refs in basic block.
    x.C:12:41: note: not vectorized: no grouped stores in basic block.
    x.C:8:23: note: Build SLP failed: unrolling required in basic block SLP
    x.C:8:23: note: Build SLP failed: unrolling required in basic block SLP
    x.C:14:3: note: not vectorized: not enough data-refs in basic block.
and here's GCC nearly-9:

    $ g++ -march=haswell -fopt-info-vec-all -Ofast x.C -c
    x.C:8:23: optimized: loop vectorized using 32 byte vectors
    x.C:5:8: note: vectorized 1 loops in function.
    x.C:12:14: note: ***** Re-trying analysis with vector size 16
    x.C:8:23: note: ***** Re-trying analysis with vector size 16
In addition to what others have said, you typically need "restrict" on array args, and -ffast-math, or at least -funsafe-math-optimizations (in -Ofast), for things like reductions. (You need -funsafe-math-optimizations for NEON generally, per GCC docs.) A BLAS-like library still passes its tests with -ffast-math.

I've recently had GCC 8 get about 2/3 the performance on haswell for a generic C optimized DGEMM micro-kernel as the hand-optimized one of similar structure which uses intrinsics/assembler. Without checking, the expert expects the performance difference to be due to prefetch and unrolling, which the generic kernel doesn't leave to the compiler.

Wow, magic flags I've never seen before - look useful.

As I understand it, -ffast-math is NOT useful here. It only affects floats and doubles.

Yes, that was meant to be a general remark. This probably most often comes up for floating point, at least in my circles, and the similar option to fast-math that the Intel compiler defaults to contributes somewhat to the myth of its greatly superior code. Likewise restrict isn't relevant here -- you'd see notes about aliasing from opt-info otherwise.

To properly auto-vectorise this, you need to transform your ArrayOfStructures (AoS) representation to a StructureOfArrays (SoA) form, i.e. put all the r's next to each other, all the g's, etc. Then your operations are nicely in groups of four adjacent elements and the compiler will almost surely vectorise it.

I'm no black magician, but I think the problem is that the operations you want to do in parallel (the r's of four adjacent pixels) do not access adjacent memory cells, requiring a gather instruction or somesuch, which I hear is either slow or unsupported. Alternatively, it could vectorise the pixels themselves, but that's three elements, not four. Does it auto-vectorise if you also do the operation on the a's?

> Does it auto-vectorise it if you also do the operation on the a's?

Good point. Combined with fixing my loop increment (that jcranmer made me realize was wrong), I now get some vectory goodness: https://godbolt.org/z/yTJzKH

No idea what it all means. It'll be interesting to benchmark.

The code is not optimal, because there are implicit conversions to int (32-bit values) which are quite expensive. There should be explicit casts to uint16_t, like this: dest.r = uint16_t(uint16_t(dest.r * invAlpha) + uint16_t(c.r * c.a)) >> 8;

Adding a single uint16_t cast like this:

  dest.r = uint16_t(dest.r * invAlpha + c.r * c.a) >> 8;
Seems to get all of the benefit. The result is 34 instructions in the loop. (Was 38).

Amazingly, switching from gcc 8.2 to gcc trunk, reduces the instructions in the loop to 18. See https://godbolt.org/z/sEvz2w

There is no floating point here, so you don't need -ffast-math; the problem might be that you need to have r/g/b each in their own array, rather than an array of {r,g,b} structs.

> But writing code in that style is a bit of black magic with lots of trial and error. I wish compiler vendors would publish "style guides" for this purpose, or at least a collection of best practices.

I think that would be more difficult than teaching developers to look at the generated code sometimes. At least some JavaScript developers do this, so I don't see why C++ developers can't.

I think C++ developers do this a lot. I often see links to godbolt.org on Stack Overflow and Bjarne himself on YouTube videos.

The tricky part is that there are a number of different valid -march flags for x86 and ARM that support SIMD.

> The tricky part is that there are a number of different valid -march flags for x86 and ARM that support SIMD.

Sure, but if you're writing software which works on AMD64, you can see pretty much every autovectorization opportunity that exists on other platforms. Most of the autovectorizer functionality in gcc and clang is platform-agnostic, AFAIK.

Yes you can see the broad opportunity but various things can easily break vectorization between extensions of x86 and ARM. The first that comes to mind is use of doubles instead of floats. Another is too wide of an inner loop, especially on algorithms that must be tuned for cache line sizes. You can also run into accuracy & stability issues between different x86 instruction extensions - it's really a nightmare to debug those.

Isn't the best flag `-march=native`? It gives you all the vectorization that your processor supports, or at least I think that's the case? At least if you don't have to provide builds for other people's machines.

It may be black magic now, but people are working on making the black magic something lesser mortals don't have to worry about (eg: https://theory.stanford.edu/~aiken/publications/papers/asplo...)

Unfortunately, no one is actually paying attention to all this research being done in ivory tower academics. Sad!

Superoptimization is slow, and it's generally considered feasible only for shortish bits of code. A block of about 30 instructions takes around 10 minutes to superoptimize even on state-of-the-art stochastic superoptimizers.

You also need correct semantics for instructions. Go read the PCMPISTRI instruction and try to write out the SMT formulas for it so you can prove equivalence. Floating point is in an even worse state, since most interesting floating point transformations are only legal if you're willing to slacken requirements for bit-equivalence.

Superoptimization is not mature enough to become part of the regular compiler flow. What is happening is you're seeing people couple superoptimizers to find missing optimizations in the compiler--John Regehr is basically doing this for LLVM's InstCombine pass with Souper.

Lol, my comment was restricted to SIMD, which definitely does not take 10 minutes.

Secondly, no mainstream compiler actually compiles code to the PCMPISTRI instruction. Presumably it was meant to be used directly as assembly. I'm not sure why you are bringing in this obscure instruction into the discussion of superoptimizers.

I personally have a paranoid fantasy where NSA/GCHQ introduced this instruction to speed up password cracking. :D

You're citing the classic superoptimization paper, although it's actually quite old at this point. The goal of superoptimization has always been to try to compile to the no-real-C-equivalent instructions in an ISA, which includes the weird vector instructions such as PCMPISTRI. (I believe Strata attempted to match some of these instructions, and I think they managed to get formulas for about half of the weird immediate forms--these are the instructions that caused them to match "half" an instruction).

In any practical vectorized tight inner-loop, the block you're trying to optimize is inherently going to be large. Superoptimization is exponential in the size of the block being optimized, which limits its utility. That was my entire point: it becomes unacceptably expensive way too quickly to get used in compilers. (Some of the code I'm looking at right now has 100s of instructions in a single basic block, definitely not atypical for a compiler).

Don't be ridiculous, dude. No compiler is actually compiling to PCMPISTRI. It's meant to be used directly as assembly.

Of course I could be wrong. If you do know of any open-source compiler which compiles code to PCMPISTRI, let me know.

Superoptimization is not exactly exponential, it is NP-hard.

Of course, in the paranoid conspiracy I was referring to in my previous comment I predict the NSA/GCHQ is also using this directly via assembly. xD

Recently I checked how different compilers vectorize various C++ standard algorithms: http://0x80.pl/notesen/2019-02-02-autovectorization-gcc-clan...

You'll also find there a hint on how to rewrite the std::replace algorithm to make it vectorizable.

If a failure of auto-vectorization in a hot code path decreases performance enough for the user to perceive it as a bug, then it seems like explicitly maintaining SIMD/NEON code-paths would be preferable to dealing with auto-vectorization black magic.

But I've never done either.

Vectorization failures (where vectorization would be correct) should be reported as a bug. Note that GCC has some parameters to tune vectorization, apart from forcing it with pragmas.

> It’s not immediately clear how to make this use SIMD… unless you use gathers.

Gathers are slow.

With SSE2, you only need 2 instructions to check for duplicate indices. First _mm_shuffle_epi32 to “rotate” the triangle [a,b,c] into e.g. [b,c,a], second _mm_cmpeq_epi32 to do per-lane comparison between the original and rotated.

Similarly, with AVX2, same approach can be used to check 2 triangles at once.

For best result, unroll the loop into batches of 4 triangles (8 for AVX), and use 128/256 bit loads.

A few thoughts:

* This might be a good candidate for a compute shader, the swizzle abilities of shaders make quite a few of the things they are doing much easier as does the faster memory connections GPUs (usually) have.

* The committee is actually looking at potential ways to handle this http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p110... has been proposed. There is also a lot of discussion about support for GPUs in a standard way, but I haven't seen any proposals myself.

IIRC the domain this guy is working in is one where the GPU is going to be busy with rendering, so he may prefer to keep this on the CPU. But you are right that any time you turn to SIMD, you might be able to turn to the GPU and get a much bigger win.

So what I'm somewhat curious about is if this isn't an argument for heterogeneous GPU usage. Schedule this on an iGPU while the main GPU is busy. The main downside, as I alluded to above, would be memory bandwidth unless there is a good on die cache for the iGPU or it has dedicated memory. That said with a multi-gpu setup you might also have another card you could use.

The other thought I have is that it's very difficult to keep the GPU 100% saturated in my experience. If the vertex count is high enough to make the PCI-e transition worth it (and it would seem to be), then you can potentially cycle steal a bit on the main GPU. This will potentially slow down the main render but speed up the overall. This would be particularly true if you can rebind the resource back for use in the main render pipe.

All ideas that can work in certain situations. As GPUs get more like CPUs and CPUs get more like GPUs, things will just get fuzzier, I imagine.

An $850 CPU (Threadripper 2950X) will give you hundreds of GFLOPS with ~100 GB/s of bandwidth to main memory.

A $700 GPU (AMD Radeon VII) will give you 13.8 TFLOPS with 1 TB/s of bandwidth to video memory.

If you have a workload that lends itself to GPU programming, then you should buy a 2nd GPU (or 3rd, or 4th) and stick it into your system. Not all workloads work on a GPU, but a LOT of workloads just make more sense on a GPU.

> But you are right that any time you turn to SIMD, you might be able to turn to the GPU and get a much bigger win.

As long as the problem is big enough to mitigate the PCIe latency, kernel startup time, and PCIe bandwidth. Transferring data is relatively slow (PCIe x16 == 15GB/s).

There are many algorithms which are faster to execute on the CPU, because executing out of L1 cache is faster than waiting for the PCIe bus. In these cases, you should prefer CPU-side SIMD, to keep the data "hot" in L1 cache.

I bring this up every time a comment like this comes up...

You can't just blindly compare flops numbers like this. Hitting 13tflops on a GPU implies data has long been loaded into GPU global memory and tons of computations happen on it.

Not to mention you're limited to 16GB at a time on a Radeon VII, and you'd need to use either OpenCL or ROCm, which is time consuming.

Fact is, GPUs are great for a minority of workloads (definitely not most) that are very computationally intense but not memory heavy. As soon as you have lots of reductions, etc., where threads will wait, a CPU will also be a much better choice.

> As soon as you have lots of reductions, etc., where threads will wait, a CPU will also be a much better choice.

Two things:

1. Reductions have been implemented efficiently on GPUs for a while now. See this implementation of Parallel Prefix Scan https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39....

That right there is your standard "reduce" algorithm on a GPU. It's work-efficient, and utilization of the GPU remains very high.

2. GPUs have a ridiculously efficient "barrier" instruction: barriers are implemented at the assembly level on GPUs, so thread synchronization is very, very cheap. In fact, if your workgroup matches the wavefront / warp size of the underlying GPU, then a barrier is simply a NOP: as cheap as it gets.

The main issue with GPUs is that they are a bit mysterious right now. Not a lot of people program them. But the more I work with them, the more I realize how flexible and fast their synchronization tools are.

> You can't just blindly compare flops numbers like this. Hitting 13tflops on a GPU implies data has long been loaded into GPU global memory and tons of computations happen on it.

True, saturating a GPU is actually really really hard.

> Not to mention you're limited to 16GB at a time on a Radeon VII and you'd need to either use OpenCL or RoCM which is time consuming.

On a consumer card, yes... on professional cards 32GB and up is common. That said, most workloads aren't THAT big, and even if they are, you can usually stage data in so that the GPU isn't ever really idle.

Vulkan and DX11+ support compute shaders now. MSVC has a custom extension too. There are plenty of compute libraries that exist that will prebuild and do all of this for you. It took me half an hour max to get set up to do this, there are reasonably good tutorials out there.

> Fact is, GPUs are great for a minority of workloads (definitely not most) that are very computationally intense but not memory heavy. As soon as you have lots of reductions, etc. where threads will wait a CPU will also be a much better choice.

Going to disagree here, memory bandwidth on a GPU is insane. HBM2 absolutely blows GDDR5 out of the water on total bandwidth and that blows regular old DDR4 out of the water as well. This is because GPUs would be dead on arrival if they can't actually process data that fast, in fact it's arguable that GPUs are more memory bandwidth constrained than they are anything else. This explains why AMD spent so much of the budget on the Radeon VII on HBM2.

As for reductions, what do you think Anti-aliasing is? There are very well known tile based approaches for this that again, blow a CPU out of the water.

For a GPU the biggest holdback is actually dataset size. Too small and the cost to go over the PCI-E bus in terms of latency will override the savings. If you can do the entire calculation as a series of shaders then write the result back to main memory you can easily out perform any CPU on the market (there are a lot of caveats here however in regards to datatypes, but let it suffice that a professional level card will).

> True, saturating a GPU is actually really really hard.

That's true, but GPUs have over 100x the FLOPs of a CPU.

This means that if you write a GPU-program with only 5% utilization (that's 95% idle), you still are going to be 5x faster than a perfectly utilized CPU.

When you have 100x the raw FLOPs and 10x the memory bandwidth... you can write a program that only utilizes 15% of a GPU and still end up with results way faster than any CPU would give you.


CPUs are way easier to program though. CPUs actively look for work to do on your behalf: it's like there's an optimization engine built into the chip itself (out-of-order scheduler, branch prediction, etc.). So achieving high utilization on CPUs is an easier job (but still hard).

> unfortunately, the use of SIMD can make the code less portable and less maintainable

Unfortunately general purpose processors aren't getting any faster, but they are getting more specialized. At some point in the future, package managers for OSS may have to do LLVM compilation on the machine just to specifically target the bits and baubles of the specific architecture it's running on.

Good GPU (SIMD) code is written in a very, very different manner than good CPU code.

Case in point: NVidia has 32 x 32-bit shaders per L1 cache / Shared Memory (a Streaming Multiprocessor, or SM). Each SM can run 32 warps at a time, for an effective 1024 "shader threads" per L1 cache.

In contrast: CPUs have 1-core, with 2-way SMT. That's 2-threads per L1 cache. IBM has a Power9 CPU with 8-way SMT for 8-threads on one core, but that's as far as you get with traditional CPUs.

Effectively: GPUs are memory-constrained. Not memory-bandwidth constrained btw (GPUs have a TON of bandwidth), but literally memory constrained. Each shader only has ~500 bytes, probably less, that it can access efficiently (more if you have "uniform" data that can be shared between shaders). And maybe 2MB total per shader before you run out of RAM entirely.


GPU code is innately parallel, so any memory allocation you do is multiplied a thousandfold or more. Vega64 needs at least 16384 work-items before it has occupancy 1 per vALU (64 compute units, 4 work queues per compute unit, 64 vALU lanes per queue)... and needs 10x that many work-items for max occupancy (a total of 163840 work-items in flight).

If you have max occupancy 10 and allocate 16 bytes per work-item, you just allocated 2.6MB of data on the GPU. I'm not kidding. You run out of space very quickly on GPUs.

While GPUs struggle with this multiplicative problem per work-item, CPUs have a ton of memory. That's 32kB of L1 cache typically available per thread (64kB per core, but typically you have two threads per core thanks to Hyperthreading).

That's the core of GPU vs CPU programming IMO. Dealing with the grossly constrained memory conditions of the GPU environment.

> In contrast: CPUs have 1-core, with 2-way SMT. That's 2-threads per L1 cache.

Comparing GPUs and CPUs are tricky. What you want to compare is not the number of logical threads that can be independently scheduled [1], but the number of simultaneous FMA units you can access. If you have an AVX2 processor, that's 2 FMA units each processing 8 lanes of 32-bit values, or a total of 16 lanes per L1 cache. AVX-512 has 32 lanes due to the wider vectors.

[1] The independent scheduling of a GPU thread is also a bit of a lie. GPU threads are closer to maskable lanes of a wide vector, rather like the mask vectors of AVX-512.

> Comparing GPUs and CPUs are tricky

I agree it is tricky, but my point in the above post is not to compare processing girth, but instead to compare memory-allocation. If you are aiming for occupancy 10 on Vega64, every 32-bit DWORD you allocate will allocate a total of 655,360 bytes.

Yes, this code, a simple 4-byte allocation in OpenCL:

    __private uint32_t foobar;
This code just allocated 655,360 bytes on Vega64 (assuming worst-case Occupancy 10 per queue). While this following OpenCL code:

    __private uint64_t wow;
Just allocated 1.25 MB. By itself. A single line of code. The total amount of memory used adds up surprisingly quickly on GPUs. This is IMO the biggest issue with GPU programming: the absurdly small amount of memory you're expected to work with per shader.

> At some point in the future package managers for OSS May have to do LLVM compilation on the machine just to specifically target the bits and baubles of the specific architecture it’s running on.

Future? This sounds basically like what gentoo has been doing for 20 years.

As neighboring comment states, a common feature in Android, and in all managed languages with JIT/AOT toolchains.

AOT compilation on install! The best of the AOT and JIT worlds! Android does this already.

Your information is a bit outdated regarding Android.

Android did AOT on install between versions 5 and 7.

With Android 7, they introduced a mixed model: a hand-written interpreter in assembly, a JIT compiler with PGO, followed by AOT compilation using the PGO data when the device is idle.

As extra information, AOT on install for mobile phones was introduced on Windows Phone 8, and has been available on mainframes and some UCSD Pascal systems since the early 80s.

One option which I have not seen mentioned here yet is to use a platform independent API with a platform optimized implementation. i.e. if your operations can be expressed in terms of BLAS primitives, then ATLAS can give you a good part of the benefit of hand optimized code, without much of the hassle: http://math-atlas.sourceforge.net

Yes, that's important for portable performance, but not with ATLAS. Unless it's improved since I last tried, it's generally slower than OpenBLAS (or BLIS, by extension) and doesn't do dynamic dispatch on the micro-architecture like OpenBLAS and BLIS -- currently on x86_64 and also aarch64 with OpenBLAS. (To deal with small matrices (on x86_64 only) you want libxsmm backed by OpenBLAS or BLIS.) From previous experience I'd guess BLIS' new generic C micro-kernel for GEMM will beat ATLAS with recent GCC vectorizing it. Note in the case of GEMM, the most important part of BLAS, vectorizing is only really relevant after you've restructured it.

Some thoughts on my experiments with SIMD Programming:

1. AVX2 is good, but tedious to use manually. The difficulty with AVX2 is that it is SIMD of SIMD: 2-way SIMD over 128-bit lanes. Going "across" the 128-bit halves can only be done with rare instructions... or through L1 cache (write to memory, then read back into another register).

2. "#pragma omp simd" seems to be the most portable way to attempt to "force" auto-vectorization. It's compatible across GCC, Clang, ICC, and other compilers. Visual Studio unfortunately does NOT support this feature, but Visual C++ has well-documented auto-vectorization features.

3. If you are sticking to Visual C++, its auto-vectorization capabilities are pretty good. Enable compiler warnings so that you know which loops fail to auto-vectorize. Be sure to read those warnings carefully. https://docs.microsoft.com/en-us/cpp/parallel/auto-paralleli...

4. If you keep reaching for the SIMD button, the GPU Programming model seems superior. If you must use a CPU, try the ISPC: Intel's SPMD Program Compiler (https://ispc.github.io/) so that your "programming model" is at least correct.

5. If a huge portion of your code is SIMD-style, a dedicated GPU is better. GPUs have more flops and memory bandwidth. GPUs have faster local memory (aka "shared memory" on NVidia, or "LDS" on AMD) and faster thread-group communications than a CPU. Know how amazing vpshufb is? Well, GPUs will knock your socks off with ballot(), CUDA __shfl(), AMD Cross-lane operations and more (https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/).

6. To elaborate on point #5: GPUs simply have better gather/scatter capabilities. The GPU's "shared" or "LDS" memory (just 64kB or so) is very small, but provides arbitrary gather/scatter capabilities across "lanes" of GPU SIMD units. They even support relatively efficient atomic operations. Yes, even vpshufb seems relatively limited compared to what is available on GPUs.

7. Raw AVX2 assembly seems "easy" if what you need is a 128-bit or 256-bit register. For example, if you are writing Complex-Doubles (real+imaginary number), then it is very straightforward to write 128-bit SIMD code to handle your arithmetic. But if you are writing "true SIMD" code, such as the style in the 1986 Seminal Paper "Data Parallel Algorithms" (https://dl.acm.org/citation.cfm?id=7903), then stick with ISPC or GPU-style coding instead.

8. Be sure to read that paper, "Data Parallel Algorithms", to get insight into true SIMD programming. GPU programmers already know what is in there, but it's still cool to read one of the first papers on the subject (from 1986, no less!).

I can't comment on the GPU points, but you may be better off leaving the vectorization to GCC than using the simd pragma. On Haswell, it uses AVX2 but not FMA, so you lose a factor of two on GEMM, for instance. The GCC manual also gives an example for the ivdep pragma.

Wouldn't that just be a matter of missing improvements?

If I recall correctly, OpenJDK can use FMA thanks to Intel contributions.

I don't see how openjdk is related to the openmp pragma. GCC has no problem using FMA if you just let it, avoiding the pragma which simply says "simd".

I understood that GCC auto-vectorization wouldn't currently do it, and hence gave an example where auto-vectorization does make use of it, assuming I remember Intel's session at CodeOne correctly.
