I also use Rust, but it's perfectly fine for learning about intrinsics in C/C++ or .NET as well. I cover some of the fundamental strategies for using them well: how to lay out data in memory, how to deal with branches, etc.
Pretty weak speedup; maybe a straight-up n-body implementation would see closer to the 8x theoretical speedup.
That might be part of the reason. Even with experience it's really hard to optimize code without detailed profiling.
Either with a profiler that shows clock ticks per instruction, or by making very small changes to your code and keeping a log of the total running time after each change.
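A minimal sketch of that log-the-runtime approach in Rust (the harness, workload, and run count here are just placeholders for illustration):

```rust
use std::time::Instant;

// Time a workload over several runs and print the average, so you can
// log the number after each small code change.
fn time_it<F: FnMut()>(label: &str, runs: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..runs {
        f();
    }
    let secs = start.elapsed().as_secs_f64() / runs as f64;
    println!("{label}: {secs:.6} s/run");
    secs
}

fn main() {
    let data: Vec<f64> = (0..100_000).map(|i| i as f64).collect();
    let avg = time_it("sum", 10, || {
        let s: f64 = data.iter().sum();
        // black_box keeps the optimizer from deleting the workload.
        std::hint::black_box(s);
    });
    assert!(avg >= 0.0);
}
```

It's crude compared to a real profiler, but it's enough to tell whether a change helped or hurt.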
> That might be part of the reason.
Yes, a cosine calculation should dominate all the rest of the computation. Grepping through https://www.agner.org/optimize/instruction_tables.pdf, the latency of FCOS is listed as at least 10x the latency of a floating-point add or multiply across pretty much all microarchitectures.
I'm also unsure about re-packing the results of the cosine just to allow a single multiply, the results of which are then unpacked again. It might be faster to just do that multiply in scalar code, though that's exactly the thing that would need to be measured.
I don't know anything about Rust, but a nicer word is probably "intrinsics". They usually compile to a single instruction.
This is a low-level lib; they don't want to hide anything. If you see _mm* you know you are using AVX and which version, which tells you which CPUs are supported.
High-level libs do use more natural names.
I'm not sure renaming the primitive operations provided by Intel/AMD to something "nicer" would help much here. Using plain SIMD will always be ugly, and at least you can Google the names and get back the Intel documentation without first translating from a different name.
I think OpenMP / Intel Autovectorizers / etc. etc. are all taking the wrong approach. The "graphics guys" have figured out a better model for thinking in SIMD.
With that being said, normal code has major issues before it can be converted into "Graphics-SIMD" form. Most importantly: data layout is straight-up "wrong", with most data in AOS (array of structs) instead of SOA (struct of arrays) form.
Writing code that interfaces between AOS and SOA is tedious, and I'm unconvinced that any general solution can be done automatically. (Remember: the key is to convert between the forms "efficiently", because the only reason we're putting up with SIMD at all is performance.)
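For concreteness, here's a minimal Rust sketch of the two layouts and a hand-written AOS-to-SOA conversion (the particle type and field names are made up for illustration):

```rust
// AOS: each particle's fields are interleaved in memory, which is
// convenient for scalar code but awkward for SIMD loads.
struct ParticleAos {
    x: f32,
    y: f32,
    vx: f32,
    vy: f32,
}

// SOA: each field lives in its own contiguous array, so a SIMD load
// grabs e.g. eight consecutive `x` values at once.
struct ParticlesSoa {
    x: Vec<f32>,
    y: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
}

fn aos_to_soa(aos: &[ParticleAos]) -> ParticlesSoa {
    ParticlesSoa {
        x: aos.iter().map(|p| p.x).collect(),
        y: aos.iter().map(|p| p.y).collect(),
        vx: aos.iter().map(|p| p.vx).collect(),
        vy: aos.iter().map(|p| p.vy).collect(),
    }
}

fn main() {
    let aos = vec![
        ParticleAos { x: 1.0, y: 2.0, vx: 0.1, vy: 0.2 },
        ParticleAos { x: 3.0, y: 4.0, vx: 0.3, vy: 0.4 },
    ];
    let soa = aos_to_soa(&aos);
    assert_eq!(soa.x, vec![1.0, 3.0]);
    assert_eq!(soa.vy, vec![0.2, 0.4]);
}
```

The naive per-field copy above is exactly the "interface" cost being discussed: done this way it strides through memory four times, which is why the efficient conversions end up using shuffles.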
Because it is low level, it won't be fancy, but you can find several libraries that wrap those low-level ops in fancier APIs.
Comparing it with CUDA is hardly a good comparison. Even if the GPU is basically a bunch of SIMD units, GPGPU programming is still very different from adding SIMD capability to an x86 program.
With that being said, there's good reason to use raw intrinsics in modern code. But the ISPC / CUDA model is superior for most uses in my experience. It's just easier to think about.
The main issue with ISPC is that you're innately SOA, and the data layout is just different from how people normally organize their data. Data-layout issues (AOS vs SOA) are probably among the most tedious to deal with when using SIMD.
For the "interface", where you're converting AOS to SOA, manual intrinsics can help.
Of course, which is why the language already allows you to do that if you want to, and it often performs better than ISPC, while being memory and thread safe.
However, because Rust is a low-level language, it also allows you to write low-level code that uses assembly-like intrinsics for specific instructions manually, which is what this blog post shows.
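A minimal sketch of what those assembly-like intrinsics look like in Rust: raw SSE2 intrinsics from `std::arch`, guarded by runtime feature detection, with a scalar fallback for other targets (the `add4` function is hypothetical, just a four-lane add):

```rust
#[cfg(target_arch = "x86_64")]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    // SSE2 is always present on x86_64, but the check shows the pattern
    // you'd use for AVX/AVX2/AVX-512.
    if is_x86_feature_detected!("sse2") {
        unsafe {
            let va = _mm_loadu_ps(a.as_ptr()); // unaligned 4-lane load
            let vb = _mm_loadu_ps(b.as_ptr());
            let mut out = [0.0f32; 4];
            _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
            return out;
        }
    }
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

#[cfg(not(target_arch = "x86_64"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

fn main() {
    let r = add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(r, [11.0, 22.0, 33.0, 44.0]);
}
```

Each `_mm_*` call typically compiles to a single instruction, which is the "assembly-like" character being described.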
I personally think that if your goal is to teach a new programming paradigm, like data-parallel programming (SIMD/SIMT/...), using assembly is a pretty inefficient way to do it. If you already know a data-parallel programming language like ISPC, then there is a lot of value in learning which assembly instructions your code should lower to on each piece of hardware and building an intuition for that.
Apparently the results have a spread, depending on the CPU used to perform the benchmark - they appear to be testing from Haswell to Skylake, so that covers a wide range of x86 hardware.
Matching ISPC perf on the low end, and being ~1.5x faster on the high end, while being able to target all hardware that Rust can target (ARM, wasm, PPC, RISC-V...) sounds better than ISPC to me, which works very well for x86, but not so well for ARM, and not really at all for anything else.
Total latency, for one: there's no point offloading something you can finish processing on the CPU faster than you can transfer it to the GPU and back.
Some systems just don't have GPUs, and there's nothing you can do about it.
Sometimes CPUs are simply much faster due to a branchy serial algorithm. However, you might still be able to use SIMD to get some speedup.
Sometimes I end up going single-threaded SIMD if the whole system is memory-bandwidth limited anyway. Work-stealing queues can also be great: a thread per CPU core pulling work from a common pool. You might be able to do some rough data-locality-based scheduling to reuse cache-hierarchy contents.
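A toy version of that thread-per-core, common-pool pattern in safe Rust (a shared `Mutex`-guarded queue rather than a true work-stealing deque; the squaring is a stand-in workload):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Spawn `workers` threads that each repeatedly pull one item from a
// shared pool, do some work on it, and accumulate the result.
fn process_all(items: Vec<u64>, workers: usize) -> u64 {
    let queue = Arc::new(Mutex::new(items));
    let total = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..workers {
        let queue = Arc::clone(&queue);
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || loop {
            // Lock only long enough to grab one work item.
            let item = queue.lock().unwrap().pop();
            match item {
                Some(x) => *total.lock().unwrap() += x * x, // stand-in for real work
                None => break, // pool drained, worker exits
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let t = *total.lock().unwrap();
    t
}

fn main() {
    // 1^2 + 2^2 + ... + 10^2 = 385, regardless of worker count.
    let sum = process_all((1..=10).collect(), 4);
    assert_eq!(sum, 385);
}
```

In a real system you'd use per-thread deques with stealing (e.g. crossbeam-style) to avoid contention on the single lock, but the shape is the same.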
Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.
Not trying to say anything in particular, except that the solution space has options and there are no silver bullets. The solutions you suggested can also be great.
> Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.
I think memory layout is the #1 issue these days. CPUs / GPUs have so much compute available that it's almost impossible to actually achieve high utilization. In most cases, you're just sitting around waiting for memory...
CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter" to __local or __shared__ memory for years (ex: GPUs have 32 banks and 32 load/store units per GPU compute unit or NVidia SM: that's either 1/2 or 1 load/store unit per GPU shader. AVX512 Skylake, however, has 3 load/store units across 16 SIMD lanes...)
Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement. It seems like a lot of data-movement in the AVX world is still best handled by AVX -> L1 cache -> back into AVX register (which is limited by the very few load/store units in modern CPUs).
Yeah, you can cheat a lot of cases through pshufb, but that instruction doesn't always work. There's something to be said for the brute-force option of 32 load/store units on a GPU compute unit, available for all the threads to leverage.
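For reference, here's a small Rust sketch of the kind of register-to-register movement pshufb gives you: `_mm_shuffle_epi8` reversing 16 bytes entirely in-register, with a scalar fallback for other targets (the byte-reversal task is just a made-up example):

```rust
#[cfg(target_arch = "x86_64")]
fn reverse_bytes(input: [u8; 16]) -> [u8; 16] {
    use std::arch::x86_64::*;
    if is_x86_feature_detected!("ssse3") {
        unsafe {
            let v = _mm_loadu_si128(input.as_ptr() as *const __m128i);
            // pshufb index vector: output byte i takes source byte 15 - i.
            let idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
            let mut out = [0u8; 16];
            _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_shuffle_epi8(v, idx));
            return out;
        }
    }
    let mut out = input;
    out.reverse();
    out
}

#[cfg(not(target_arch = "x86_64"))]
fn reverse_bytes(input: [u8; 16]) -> [u8; 16] {
    let mut out = input;
    out.reverse();
    out
}

fn main() {
    let input: [u8; 16] = core::array::from_fn(|i| i as u8);
    let out = reverse_bytes(input);
    assert_eq!(out[0], 15);
    assert_eq!(out[15], 0);
}
```

The limitation being discussed is visible here too: pshufb only permutes within one 128-bit register, which is why trickier data movement often falls back to bouncing through L1.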
Absolutely agree. Cache lines should be packed with data that is useful together. Memory streaming access patterns should be favored.
> CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter"...
Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.
> Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement.
Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.
Oh, I mean gather/scatter to shared / local memory. General purpose gather/scatter is very high latency as you say (I think read/writes were like 500 nanoseconds to L1 cache, and far slower to L2 and VRAM), but gather/scatter to shared/local memory is basically limited by bank-conflicts (~32 cycles worst case, ~2 cycles best case).
I'm pretty sure AVX512 gather/scatter to L1 cache is still dozens of cycles for just 16 SIMD-lanes.
> Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.
Fair point. I presume you mean that you can shuffle data to L1 cache while waiting for L3 or DDR4 RAM instead.
What I really want is "shared memory" to be implemented on CPUs, and for AVX-lanes to be able to shuffle data to and from there independently of the L1 / L2 / L3 / DDR4 memory system.
Yeah, last I checked, it performed like scalar loads and stores. I presume Intel intends to eventually optimize for buffered/L1 hit cases. I mean, why would those instructions even exist otherwise?
This is also a nice way of exposing a limited subset of other assembly instructions for systems programming while they figure out how to support inline assembly without coupling the language to its implementation.
I write a lot of Rust code at work, and I admit that it can sometimes be pretty noisy. There are several major contributors to this:
1. Rust offers fine-grained control over pass-by-value, pass-by-reference, and pass-by-mutable reference. This is great for performance. But it also adds a lot of "&" and "&mut" and "x.to_owned()" clutter everywhere.
2. Rust provides support for generics (aka parameterized types). Once again, this is great for performance, and it also allows better compile-time error detection. But again, you wind up adding a lot of "<T>" and "where T:" clutter everywhere.
3. Usually, Rust can automatically infer lifetimes. But every once in a while, you want to do something messy, and you end up needing to write out the lifetimes manually. This is when you end up seeing weird things like "'a". But in my experience, this is pretty rare unless I'm doing something hairy. And if I'm doing something hairy, I'm just as happy to have more explicit documentation in the source code.
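A contrived example showing all three sources of clutter at once (the function itself is hypothetical): explicit borrows at the call site, a generic bound, and an explicit lifetime in one signature:

```rust
// The lifetime 'a could actually be elided here; it's written out to
// show what the explicit form from point 3 looks like.
fn longest_name<'a, T>(items: &'a [T]) -> Option<&'a str>
where
    T: AsRef<str>, // point 2: generic bound clutter
{
    items
        .iter()
        .map(|t| t.as_ref())
        .max_by_key(|s| s.len())
}

fn main() {
    // Point 1: explicit ownership at the call site ("&", ".to_owned()").
    let names = vec!["ada".to_owned(), "grace".to_owned()];
    assert_eq!(longest_name(&names), Some("grace"));
}
```

Noisy, but every annotation is doing real work: the compiler knows exactly who owns what, who borrows what, and for how long.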
Really, the underlying problem here is that (a) Rust fills the same high-control, high-performance niche as C++, but (b) Rust prefers explicit control where C++ sometimes offers magic, invisible conversions. (Yes, I declare all my C++ constructors "explicit" and avoid conversion operators.)
Syntax is a hard problem, and I've struggled to get syntax right for even tiny languages. But syntax for languages with low-level control is an even harder problem. At some point, you just need to make a decision and get used to it.
In practice, I really enjoy writing Rust. It's definitely not as simple as Ruby, Python or Go. But it fills a very different ecological niche, with finer-grained control over memory representations, and support for generics.