SIMD for C++ Developers [pdf] (const.me)
129 points by Const-me 19 days ago | 41 comments



https://software.intel.com/sites/landingpage/IntrinsicsGuide...

Anyone who is using SSE / AVX / AVX512 intrinsics probably should know about Intel's excellent Intrinsics Guide. The Intel guide is a reference; this PDF is a tutorial. Both resources will be helpful to anyone seriously doing SIMD on the CPU.


I also use this site for its nice visualizations while learning:

https://www.officedaytime.com/simd512e/


Nice. For anyone interested in ARM, they also have a guide with diagrams [1] and a searchable reference [2].

1: https://developer.arm.com/documentation/102159/0400/Permutat... 2: https://developer.arm.com/architectures/instruction-sets/sim...


Updated my article from 2019.

It’s not limited to C++; it’s equally good for C.

Over time, support has been slowly arriving in other languages too, like C#: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.i... https://docs.microsoft.com/en-us/dotnet/api/system.runtime.i...


Rust as well, and the intrinsics are identically named, so your tutorial is good for Rust too.


SIMD support is also currently in development for Java: https://openjdk.java.net/jeps/338


This is really nice. I teach intrinsics to my MSc students, so I will add this to their reading list. Here is a recent set of notes I gave them showing how to go from simple CPU-only code to threaded SIMD + OpenGL: https://nccastaff.bournemouth.ac.uk/jmacey/post/GridVis/grid...


And D as well, although the support varies a bit across all three backends.


How about other languages that use GCC/LLVM, like Zig, Rust, Ada, or for that matter FORTRAN?


Fortran for sure, given its use in HPC and GPGPU programming for as long as those fields have existed (we no longer use all caps for its name :) ).

Only GNAT Ada uses GCC/LLVM; the other surviving Ada vendors have their own compilers, and I have no idea how much they expose the underlying SIMD intrinsics.

Rust has some initial support. I don't follow it closely; last time I checked, it was only available on nightly, with a very basic subset on stable.

Zig, no idea; I don't follow it that much, as yet another C, just with bounds checking, isn't something I care about.


It's a really well written article, thank you for the work!


Why not tools like https://github.com/ispc?

This seems really close to the metal, so close that it would either carry a non-negligible maintenance cost or not be able to fully exploit the hardware in use.


I've used ISPC before, as well as Enoki (sort of like ISPC-in-C++), and found that they have a lot of sharp performance edges.

My experience with both was that the further I moved away from the super classic SIMD cases, the more I ran into crazy compiler cliffs where tiny tweaks would blow up the codegen. In each case I gave up, reimplemented what I wanted directly in C++ (the second time using Agner Fog's wonderful vector class library), and easily got the results I wanted without a ton of finagling with the compiler and libraries.


It doesn't always emit optimal SIMD code. Plus, when you get the hang of it, writing your own SIMD library is fairly simple, so you don't need a tool for it. C++ templates and operator overloading really shine here. For example, you can write sqrt(x*y+z) and have the template system select the best SIMD intrinsics depending on whether x, y, and z are int, float, int16, float8, double4, etc.
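
A minimal sketch of that idea for a single type, assuming AVX is available (the float8 name and everything below are illustrative, not taken from any particular library):

    #include <immintrin.h>

    struct float8 {                     // 8 floats in one AVX register
        __m256 v;
        float8(__m256 x) : v(x) {}
        explicit float8(float x) : v(_mm256_set1_ps(x)) {}
    };

    inline float8 operator*(float8 a, float8 b) { return _mm256_mul_ps(a.v, b.v); }
    inline float8 operator+(float8 a, float8 b) { return _mm256_add_ps(a.v, b.v); }
    inline float8 sqrt(float8 a) { return _mm256_sqrt_ps(a.v); }

    // Overload resolution picks the AVX versions automatically:
    float8 example(float8 x, float8 y, float8 z) {
        return sqrt(x * y + z);         // vmulps + vaddps + vsqrtps
    }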


+1 to intrinsics or wrappers giving us more control over performance.

> Plus, when you get the hang of it, writing your own SIMD library is fairly simple

Hmm, it's indeed easy to start, but maintaining https://github.com/google/highway (supports clang/gcc/MSVC, x86/ARM/RISC-V) is quite time-consuming, especially working around compiler bugs.


Switching compilers is often too high-risk, but there are header-only libraries that get you most of the same benefits with normal C++ and wrappers around the intrinsics: https://github.com/richgel999/CppSPMD_Fast


Harder to use. That’s another language, one which requires a special compiler from Intel. The intrinsics are already supported in all modern C and C++ compilers, with little to no project setup.

For many practical problems, the ISPC’s abstraction is not a good fit. It’s good for linear algebra with long vectors and large matrices, but SIMD is useful for many other things besides that. A toy problem: count the spaces in a 4 GB buffer in memory. I’m pretty sure manually written SSE2 or AVX2 code (inner loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one doing _mm_sad_epu8 and _mm_add_epi64) is going to be faster than the ISPC-made version.
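
Here’s a rough sketch of what I mean, assuming the buffer length is a multiple of 16 bytes (a real version needs a scalar tail):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    uint64_t countSpaces(const uint8_t* p, size_t len) {
        const __m128i space = _mm_set1_epi8(' ');
        const __m128i zero = _mm_setzero_si128();
        __m128i total = zero;   // two 64-bit counters
        size_t i = 0;
        while (i < len) {
            // Inner loop: 16 byte-sized counters; capped at 255
            // iterations so they can't overflow.
            size_t chunks = (len - i) / 16;
            if (chunks > 255) chunks = 255;
            const size_t innerEnd = i + chunks * 16;
            __m128i acc = zero;
            for (; i < innerEnd; i += 16) {
                const __m128i chunk = _mm_loadu_si128((const __m128i*)(p + i));
                const __m128i eq = _mm_cmpeq_epi8(chunk, space); // 0xFF where ' '
                acc = _mm_sub_epi8(acc, eq); // counter -= -1, i.e. counter++
            }
            // Outer loop: horizontally add 16 bytes into two 64-bit lanes
            total = _mm_add_epi64(total, _mm_sad_epu8(acc, zero));
        }
        uint64_t out[2];
        _mm_storeu_si128((__m128i*)out, total);
        return out[0] + out[1];
    }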


> It’s good for linear algebra with long vectors and large matrices, but SIMD is useful for many other things besides that

The main goal in ispc's design was to support SPMD (single program multiple data) programming, which is more general than pure SIMD. Handling the relatively easy cases of (dense) linear algebra that map naturally to SIMD wasn't a focus, as that's straightforward to do in other ways.

Rather, ispc is focused on making it easy to write code with divergent control flow over the vector lanes. This is especially painful to do in intrinsics, especially in the presence of nested divergent control flow. If you don't have that, you might as well use explicit SIMD, though perhaps via something like Eigen in order to avoid all of the ugliness of manual use of intrinsics.

> I’m pretty sure manually written SSE2 or AVX2 code (inner loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one doing _mm_sad_epu8 and _mm_add_epi64)

ispc is focused on 32-bit datatypes, so I'm sure that is true. I suspect it would be a more pleasant experience than intrinsics for a reduction operation of that sort over 32-bit datatypes, however.


> This is especially painful to do in intrinsics

Depends on the use case, but yes, it can be complicated due to lack of support in hardware. I’ve heard AVX512 fixed that to an extent, but I don’t have experience with that tech.

> perhaps via something like Eigen

I do, but sometimes I can outperform it substantially. It’s optimized for large vectors; in some cases intrinsics can be faster, and in my line of work I encounter a lot of these cases. Very small matrices like 3x3 and 4x4 fit completely in registers. Larger square matrices of size 8 or 24, and tall matrices with a small fixed count of columns, don’t fit there, but a complete row does, which saves a lot of RAM latency when dealing with them.

> to avoid all of the ugliness of manual use of intrinsics

I don’t believe they are ugly; I think they just have a steep learning curve.

> I suspect it would be a more pleasant experience than intrinsics for a reduction operation of that sort over 32-bit datatypes

Here’s an example of how to compute an FP32 dot product with intrinsics: https://stackoverflow.com/a/59495197/126995 I doubt ISPC’s reduction is going to result in similar code. Even clang’s automatic vectorizer (which I have a high opinion of) doesn’t do that kind of thing with multiple independent accumulators.
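
The core of that approach looks roughly like this (a simplified sketch of the linked answer, AVX + FMA, assuming the length is a multiple of 32 floats):

    #include <immintrin.h>
    #include <stddef.h>

    float dotProduct(const float* a, const float* b, size_t len) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();
        for (size_t i = 0; i < len; i += 32) {
            // 4 independent FMA chains: the CPU runs them in parallel
            // instead of waiting 4-5 cycles for each FMA to finish.
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
        }
        // Combine the accumulators, then reduce 8 lanes to 1.
        const __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
        lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
        lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));
        return _mm_cvtss_f32(lo);
    }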


> Here’s an example of how to compute an FP32 dot product with intrinsics: https://stackoverflow.com/a/59495197/126995 I doubt ISPC’s reduction is going to result in similar code. Even clang’s automatic vectorizer (which I have a high opinion of) doesn’t do that kind of thing with multiple independent accumulators.

ISPC lets you request that the gang size be larger than the vector size [1] to get 2 accumulators out of the box. If having more accumulators is crucial, you can have them at the cost of not writing idiomatic ispc, but I'd argue the resulting code is still more readable.

I'm no expert, so there might be flaws I don't see, but the generated code looks good to me. The main difference I see is that ISPC does more unrolling (which may be better?).

Here is the reference implementation: https://godbolt.org/z/MxT1Kedf1

Here is the ISPC implementation: https://godbolt.org/z/qcez47GT5

[1] https://ispc.github.io/perfguide.html#choosing-a-target-vect...


> Here is the ISPC implementation

Line 36 computes ymm6 = (ymm6 * mem) + ymm4; the next instruction, on line 37, computes ymm6 = (ymm8 * mem) + ymm6.

These two instructions form a dependency chain: the CPU can’t start the instruction on line 37 before the one on line 36 has produced a result. That’s going to take 5-6 CPU cycles depending on the CPU model. The same happens for the ymm5 vector between the instructions on lines 38 and 41, and in a few other places.

In the reference code, all 4 FMA instructions in the body of the loop are independent from each other, so a CPU will run all 4 of them in parallel. The data dependencies are only across loop iterations, which limits the complete loop to 4-5 cycles/iteration. That’s OK because the throughput limit (probably not the FMA throughput though; I think load port throughput is saturated before FMA, especially for unaligned inputs) is smaller than that.


Oh right, I didn't think of looking for that. I guess you're right, and doing things by hand is still better.


It’s not terribly bad, because CPUs are out-of-order. As far as I can tell, there’s no single dependency chain over all instructions in the loop body, so some of these FMAs are going to run in parallel in your ISPC version. Still, I would expect the manually vectorized code to be slightly faster.


> Even clang’s automatic vectorizer (which I have a high opinion of) doesn’t do that kind of thing with multiple independent accumulators.

I think it does? I see Clang unroll reductions into multiple accumulators quite often.


Probably an HN article on its own, but also related [1]. It's about timestamp parsing using SIMD instructions (among other optimizations). I noticed, back when I had a toy HFT project, that this type of thinking is needed.

[1] https://kholdstare.github.io/technical/2020/05/26/faster-int...


If you go to the parent directory [0], there is a similar guide for ARM NEON [1].

[0] http://const.me/articles/simd [1] http://const.me/articles/simd/NEON.pdf


If you liked this post you may also like:

SIMD in Java: https://news.ycombinator.com/item?id=14636802 (archived version https://archive.is/C5iZA)

SIMD in Rust: https://news.ycombinator.com/item?id=10111729

SIMD in Python: https://news.ycombinator.com/item?id=10470428

SIMD in Javascript: https://news.ycombinator.com/item?id=8533350

Using SIMD to aggregate billions of values per second: https://news.ycombinator.com/item?id=22803504

Towards fearless SIMD: https://news.ycombinator.com/item?id=18293209

First Impressions of ARM SIMD Programming: https://news.ycombinator.com/item?id=19490542


Perfect timing! I was just about to start looking into SIMD again, as my toy game engine is almost at the point where I want to see if I can vectorize some of my processing (much of it is already stored in SoA format, so hopefully it won't be too much trouble). I'm thinking especially about tasks like frustum culling, but also other things. We'll see, after I read this :) I've used SIMD intrinsics before, but I could definitely do with a refresher!


For videogame applications, look here before writing these intrinsics yourself: https://github.com/microsoft/DirectXMath/ That library already implements a lot of complicated things, relatively well.

Here’s one for frustum culling: https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di... It’s relatively inefficient when you have many boxes to test against the same frustum, but (a) the compiler may inline and optimize it, and (b) failing that, it’s easy to copy-paste and optimize manually: compute the 6 planes once and call the BoundingBox::ContainedBy method yourself.


Thanks, I’ll take a look. Although it’s a for-fun engine, so I may try it myself anyway just to learn. We’ll see. Either way, thanks for the link, very interesting!

As for frustum culling, that code seems to do one bounding box at a time? Or am I misunderstanding? I was planning to try to do 4 (or however many) checks at a time. I’m ok with checking against bounding spheres too if that makes it easier to vectorize.


> that code seems to do one bounding box at a time?

Yep, most parts of that library were designed for doing one thing at a time.

Generally speaking, the HPC-style SoA approach can be faster, especially if you have AVX. But there’s a price for that: most importantly code complexity, but some performance-related things as well, like the RAM access pattern and uploading to VRAM for rendering.

> I was planning to try to do 4 (or however many) checks at a time

I would start with whatever code is in that library, and only optimize if profiler says so.

They have a sphere-versus-frustum test too, a similar one, i.e. it also tests against the 6 planes; it might be slightly more efficient than boxes.
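
If you do go the SoA route, the inner loop could look something like this (a hypothetical sketch, AVX + FMA, 8 spheres per iteration, count assumed to be a multiple of 8, plane normals pointing inside the frustum):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    struct Plane { float nx, ny, nz, d; };

    // Sphere centers and radii stored SoA; one visibility byte per sphere.
    void cullSpheres(const float* cx, const float* cy, const float* cz,
                     const float* radius, size_t count,
                     const Plane planes[6], uint8_t* visible) {
        for (size_t i = 0; i < count; i += 8) {
            __m256 inside = _mm256_castsi256_ps(_mm256_set1_epi32(-1)); // all true
            const __m256 x = _mm256_loadu_ps(cx + i);
            const __m256 y = _mm256_loadu_ps(cy + i);
            const __m256 z = _mm256_loadu_ps(cz + i);
            const __m256 r = _mm256_loadu_ps(radius + i);
            for (int p = 0; p < 6; p++) {
                // dist = nx*x + ny*y + nz*z + d, for 8 spheres at once
                __m256 dist = _mm256_set1_ps(planes[p].d);
                dist = _mm256_fmadd_ps(_mm256_set1_ps(planes[p].nx), x, dist);
                dist = _mm256_fmadd_ps(_mm256_set1_ps(planes[p].ny), y, dist);
                dist = _mm256_fmadd_ps(_mm256_set1_ps(planes[p].nz), z, dist);
                // Sphere is outside this plane when dist < -radius
                const __m256 negR = _mm256_sub_ps(_mm256_setzero_ps(), r);
                inside = _mm256_and_ps(inside, _mm256_cmp_ps(dist, negR, _CMP_GE_OQ));
            }
            const int mask = _mm256_movemask_ps(inside);
            for (int k = 0; k < 8; k++)
                visible[i + k] = (mask >> k) & 1;
        }
    }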


Ok, thank you! Much appreciated.


I would at least appreciate a disclaimer that the vast majority of these optimizations could be accomplished by encouraging the compiler to vectorize the assembly itself. You said in a footnote that compilers will only do these optimizations when they're extremely simple and rarely on integers, but I have not found that to be the case. -O3 and -mavx do an amazing job for most use cases. And more to the point, there are other tricks that I think it's better to turn to (like using the __restrict keyword) before you take the fairly steep step of coding the SIMD instructions yourself.
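
For example, something like this (a hypothetical function; exact codegen varies by compiler and flags) usually autovectorizes once the pointers are marked non-aliasing:

    #include <stddef.h>

    // Without __restrict, the compiler must assume dst may overlap a or b,
    // and may emit runtime overlap checks or scalar code.
    void addArrays(float* __restrict dst, const float* __restrict a,
                   const float* __restrict b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }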

It's cool to learn these things. And it's downright important to learn them once you're experienced enough, because you have to use them at some point if you're in the game of optimization. But I would also feel pretty bad if some kid out there wasted a week on a project at work (and got reprimanded for it) that could have been accomplished with a couple of compiler flags, you know?


They’re getting better over time, especially for floats/doubles, but I still find them limited even for simple use cases.

Here’s an example of the auto-vectorizer in clang 12, which I believe represents the state of the art at the moment: https://godbolt.org/z/6Pe33187W It automatically vectorized the loop and even unrolled it, however I think the code bottlenecks on shuffles, not on memory loads. There are just too many instructions in the loop, and that vpmovzxbq instruction can only run on port 5 on Skylake.

Compare that assembly with the manually vectorized version from an answer on Stack Overflow: https://godbolt.org/z/do5e3-


Autovectorization is annoying and hard to work with because it's unpredictable. It also fights you if you manually vectorize: you can write a SIMD loop and a scalar trailer loop, and then it'll autovectorize the trailer.

If instead you wrote in an infinite-length vector language and the compiler scalarized it for you, I think that could work better.
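
For reference, the pattern being described: a manually vectorized main loop plus a scalar trailer loop (a minimal sketch assuming AVX, with an illustrative function name):

    #include <immintrin.h>
    #include <stddef.h>

    void scaleArray(float* p, size_t n, float s) {
        const __m256 vs = _mm256_set1_ps(s);
        size_t i = 0;
        for (; i + 8 <= n; i += 8)      // SIMD main loop
            _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vs));
        for (; i < n; i++)              // scalar trailer loop
            p[i] *= s;
    }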


+1 for restrict, that certainly helps. Out of curiosity, what's your use case where autovectorization works well?

Personally, I have often been disappointed. Not much progress in 2 years: http://www.0x80.pl/notesen/2021-01-18-autovectorization-gcc-...


I'll admit that there are times when I am stunned at how well the compiler optimizes, and times when I'm stunned at how poorly it does. It would seem you can never make an assumption about what will happen; hence my addiction to Godbolt. I don't have my exact code at hand, but I work heavily with math operations, so it may be that I mostly encounter especially easy types of loops to vectorize.


OK, that's fair. Godbolt is a great tool and autovectorization gets initial results quickly, but the price is eternal vigilance. Any compiler update/code maintenance risks falling on the wrong side of some compiler heuristic.


Nice writeup with helpful diagrams, thanks for sharing!

Readers might also find this short intro [1] helpful, including tips on porting. (Disclosure: author)

1: https://github.com/google/highway/blob/master/g3doc/highway_...

> many available instructions are missing from the wrappers

Highway can interop with platform-specific intrinsics (on x86/ARM, hwy_vec.raw is the native intrinsic type).

> vectorized integer math often treats vectors as having different lanes count on every line of code

Fair point, that's a cost of type safety. We usually write `auto` to avoid spelling it out.


What timing! This is probably better for StackOverflow, but is there a way to AND two AVX operands and also get the ZF flag set if the result is zero?

It seems like there's one intrinsic to do the AND, but it doesn't set ZF [0].

But there's another intrinsic that will set ZF but doesn't actually store the result of the AND operation [1].

[0] vpand ymm, ymm, ymm

[1] vtestpd ymm, ymm

I'm guessing that either a) I'm missing an instruction, or b) having to modify EFLAGS from AVX instructions incurs a large penalty and so it's not advisable?


I don’t think you’re missing an instruction. A few comments, still.

Bitwise instructions are very cheap, 1 cycle of latency. Skylake can run 3 of them every clock, Zen 2 can run 4 of them per clock. I wouldn’t worry about that extra vpand instruction too much.

About vptest: the latency is not great, 6-7 cycles. If you’re going to branch on the outcome, and the branch is not predictable (your code takes a random branch every time the instruction at a specific address runs), you’re going to waste time. Sometimes that’s unavoidable, like when your goal is something similar to the memchr() function (however, I’d recommend _mm256_movemask_epi8 instead for that). But other times it’s possible to rework it into something better: mask lanes with _mm256_blendv_[something], zero out stuff with bitwise AND, that kind of thing.
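
For the original question, one way is to keep the vpand result and let vptest set ZF by testing that result against itself; a minimal sketch (AVX2 for the integer AND, with an illustrative function name):

    #include <immintrin.h>

    // Returns the AND through *result, plus whether it was all zeros.
    // _mm256_testz_si256(r, r) compiles to vptest and is 1 iff r == 0.
    static inline bool andIsZero(__m256i a, __m256i b, __m256i* result) {
        const __m256i r = _mm256_and_si256(a, b); // vpand: keeps the value
        *result = r;
        return _mm256_testz_si256(r, r) != 0;     // vptest: sets ZF
    }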




