Learning SIMD with Rust by Finding Planets (2018) (medium.com)
112 points by btashton 45 days ago | 29 comments



You may enjoy my video tutorial on SIMD Intrinsics as well:

https://www.youtube.com/watch?v=4Gs_CA_vm3o

I also use Rust, but it's perfectly fine for learning about intrinsics in C/C++ or .NET as well. I cover some of the fundamental strategies for using them well: how to lay out data in memory, how to deal with branches, etc.


> After running benchmarks with all the variants and planets, the improvement is about 9% to 12%.

Pretty weak speedup; maybe a straight-up n-body implementation would see closer to the 8x theoretical speedup.


> but Rust does not provide the Intel _mm256_cos_pd() instruction yet.

That might be part of the reason. Even with experience it's really hard to optimize code without detailed profiling, either with a profiler that shows clock ticks per instruction or by making very small changes to your code and keeping a log of the total running time after each change.


> > but Rust does not provide the Intel _mm256_cos_pd() instruction yet.

> That might be part of the reason.

Yes, a cosine calculation should dominate all the rest of the computation. Grepping through https://www.agner.org/optimize/instruction_tables.pdf, the latency of FCOS is listed as at least 10x the latency of a floating-point add or multiply across pretty much all microarchitectures.

I'm also unsure about re-packing the results of the cosine just to allow a single multiply, the results of which are then unpacked again. It might be faster to just do that multiply in scalar code, though that's exactly the thing that would need to be measured.
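
For illustration, here's roughly what that unpack/repack dance looks like with Rust's std::arch intrinsics. This is a hedged sketch, not the article's code, and whether the packed multiply actually beats a plain scalar loop is exactly the thing that would need measuring:

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    unsafe fn cos_then_scale(xs: &[f64; 4], scale: f64) -> [f64; 4] {
        use std::arch::x86_64::*;

        // Unpack: no _mm256_cos_pd in Rust, so take cos() per lane in scalar code.
        let c = [xs[0].cos(), xs[1].cos(), xs[2].cos(), xs[3].cos()];

        // Repack into a 256-bit register just for one multiply...
        let v = _mm256_loadu_pd(c.as_ptr());
        let r = _mm256_mul_pd(v, _mm256_set1_pd(scale));

        // ...and unpack again.
        let mut out = [0.0f64; 4];
        _mm256_storeu_pd(out.as_mut_ptr(), r);
        out
    }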


> AVX functions start with _mm256_

I don't know anything about Rust, but a nicer word is probably "intrinsics". They usually compile to a single instruction.


It's because they just use the names of the actual AVX ops (https://software.intel.com/sites/landingpage/IntrinsicsGuide...).

This is a low-level lib. They don't want to hide anything. If you see _mm* you know you are using AVX and which version (which matters for knowing which CPUs are supported).

Higher-level libs do use more natural names.


The commenter was pointing out that intrinsic functions shouldn't just be called functions (I have no strong opinion on that comment). He wasn't commenting about the names of the functions themselves.


Ah ok, I didn't understand.


This looks kinda gross to me. Do the Rust developers not want to emulate what ISPC and CUDA do? Writing intrinsics by hand is not what I expect from a 2019 language.


There are libraries abstracting the SIMD calls if that's what you want, e.g. faster (https://github.com/AdamNiederer/faster) or simdeez (https://github.com/jackmott/simdeez), but these have to use basic operations at the end of the day too and this post shows how to do that.

I'm not sure renaming the primitive operations provided by Intel/AMD to something "nicer" would help much here. Using plain SIMD will always be ugly, and at least you can Google the names and get back the Intel documentation without first translating from a different name.


I disagree with the parent poster's phrasing, but they have a point: CUDA (for GPUs) and ISPC (for CPUs) make SIMD code far easier to write.

I think OpenMP / Intel Autovectorizers / etc. etc. are all taking the wrong approach. The "graphics guys" have figured out a better model for thinking in SIMD.

With that being said, normal code has major issues before it can be converted into "Graphics-SIMD" form. Most importantly: data layout is straight up "wrong", with most data in AOS (array of structs) form instead of SOA (struct of arrays) form.

Writing code that interfaces between AOS and SOA is tedious, and I'm unconvinced that any general solution can be done automatically. (Remember: the key is to convert between the forms efficiently, because the only reason we're putting up with SIMD at all is performance.)
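
For anyone who hasn't run into the terms, a minimal Rust sketch of the two layouts (the Body type here is made up purely for illustration):

    // Array of Structs (AOS): how most code naturally models the data.
    struct Body {
        x: f64,
        y: f64,
        z: f64,
        mass: f64,
    }
    type BodiesAos = Vec<Body>; // x, y, z, mass interleaved in memory

    // Struct of Arrays (SOA): what SIMD wants, so that e.g. all the x
    // coordinates sit contiguously and can be loaded four at a time.
    struct BodiesSoa {
        x: Vec<f64>,
        y: Vec<f64>,
        z: Vec<f64>,
        mass: Vec<f64>,
    }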


This is an example of doing manual SIMD. In some cases, it's still needed: manual SIMD tends to be faster than what even the most modern compilers can generate.

Because it is low level, it won't be fancy, but you can find several libraries that wrap those low-level ops in fancier APIs.

CUDA is hardly a good comparison here. Even if the GPU is basically a bunch of SIMD units, GPGPU programming is still very different from adding SIMD capability to an x86 program.


You really should check out ISPC, which is a CPU-SIMD compiler similar in spirit to CUDA.

With that being said, there's good reason to use raw intrinsics in modern code. But the ISPC/CUDA model is superior for most uses in my experience. It's just easier to think about.

The main issue with ISPC is that you're innately SOA, and the data layout is just different compared to how people normally organize their data. Data-layout issues (AOS vs SOA) are probably among the most tedious issues to deal with when using SIMD.

For the "interface", where you're converting AOS to SOA, manual intrinsics can help.


> Do the Rust developers not want to emulate what ISPC and CUDA do?

Of course, which is why the language already allows you to do that if you want to, and it often performs better than ISPC, while being memory and thread safe.

However, because Rust is a low-level language, it also allows you to write low-level code that uses assembly-like intrinsics for specific instructions manually, which is what this blog post shows.

I personally think that if your goal is to teach a new programming paradigm, like data-parallel programming (SIMD/SIMT/...), using assembly is a pretty inefficient way to do that. If you already know a data-parallel programming language like ISPC, then there is a lot of value in learning which assembly instructions your code should lower to on each piece of hardware and getting an intuition for that.


Performing better than ISPC is a pretty bold claim; you are definitely going to need to provide a source for that. Rust developers have had people telling them about ISPC for years and have waved off any need to look at it and understand why it works so well.


See https://github.com/rust-lang-nursery/packed_simd#performance

Apparently the results have a spread, depending on the CPU used to perform the benchmark; they appear to be testing from Haswell to Skylake, so that covers a wide range of x86 hardware.

Matching ISPC perf on the low end and being ~1.5x faster on the high end, while being able to target all hardware that Rust can target (ARM, wasm, PPC, RISC-V, ...), sounds better than ISPC to me, which works very well for x86, but not so well for ARM, and not really at all for anything else.
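
For context, the portable API being benchmarked there looks roughly like this (a sketch assuming the packed_simd crate from that repo; the sum_of_squares function is just an illustration):

    use packed_simd::f64x4;

    // Sum of squares over a slice, four lanes at a time; the tail is handled
    // in scalar code. packed_simd lowers the vector ops to AVX, NEON, etc.
    // depending on the target.
    fn sum_of_squares(xs: &[f64]) -> f64 {
        let chunks = xs.chunks_exact(4);
        let tail = chunks.remainder();
        let mut acc = f64x4::splat(0.0);
        for c in chunks {
            let v = f64x4::from_slice_unaligned(c);
            acc += v * v;
        }
        acc.sum() + tail.iter().map(|x| x * x).sum::<f64>()
    }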


Intrinsics are still what modern high-performance C++ code uses. Auto-vectorizers are pretty fragile and require too much babysitting to be worth it.


Semi-automatic vectorizers are the best compromise, cf. OpenMP, and they also let you multithread or offload your code to the GPU.


Sometimes offloading to the GPU is not possible or desirable, for one reason or another.

Like total latency: there's no point offloading something you can finish processing on the CPU faster than you can transfer it to the GPU and back.

Some systems just don't have GPUs, and there's nothing you can do about it.

Sometimes CPUs are simply much faster due to a branchy serial algorithm. However, you might still be able to use SIMD to get some speedup.

Sometimes I end up going single-threaded SIMD if the whole system is memory-bandwidth limited anyway. Work-stealing queues can also be great: a thread per CPU core pulling work from a common pool. You might be able to do some rough data-locality-based scheduling to reuse cache hierarchy contents.

Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.

Not trying to say anything in particular, except that the solution space has some options and there are no silver bullets. The solutions you suggested can also be great.
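
As a heavily simplified sketch of the "thread per core pulling from a common pool" idea, using only std (real work-stealing schedulers like rayon's keep a deque per thread and only steal when a local deque runs dry, which is much kinder to caches):

    use std::collections::VecDeque;
    use std::sync::{Arc, Mutex};
    use std::thread;

    // One worker thread per core, all pulling boxed jobs from a shared queue.
    fn run_jobs(jobs: Vec<Box<dyn FnOnce() + Send>>) {
        let pool = Arc::new(Mutex::new(VecDeque::from(jobs)));
        let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

        let handles: Vec<_> = (0..workers)
            .map(|_| {
                let pool = Arc::clone(&pool);
                thread::spawn(move || loop {
                    // The lock guard is dropped before the job runs.
                    let job = pool.lock().unwrap().pop_front();
                    match job {
                        Some(job) => job(),
                        None => break,
                    }
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
    }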


ISPC gives you SIMD code on the CPU, but you program as if the CPU-SIMD units were a GPU. It's an excellent project.

---------

> Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.

I think memory layout is the #1 issue these days. CPUs / GPUs have so much compute available that it's almost impossible to actually achieve high utilization. In most cases, you're sitting around just waiting for memory...

CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter" to __local or __shared__ memory for years (ex: GPUs have 32 banks and 32-load/store units per GPU-compute unit or NVidia SM: that's either 1/2 or 1 load/store unit per GPU shader. AVX512 Skylake however has 3-load/store units across 16 SIMD-threads...)

Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement. It seems like a lot of data-movement in the AVX world is still best handled by AVX -> L1 cache -> back into AVX register (which is limited by the very few load/store units in modern CPUs).

Yeah, you can cheat a lot of cases through pshufb, but that instruction doesn't always work. There's something to be said for the brute-force option of sticking 32 load/store units on a GPU compute unit for all the threads to leverage.
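
For reference, pshufb is exposed to Rust as _mm_shuffle_epi8; here's a small hedged example of the kind of in-register data movement it handles, just reversing the bytes of a 128-bit value:

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "ssse3")]
    unsafe fn reverse_bytes(v: std::arch::x86_64::__m128i) -> std::arch::x86_64::__m128i {
        use std::arch::x86_64::*;
        // Each mask byte selects which source byte lands in that slot, so a
        // single pshufb performs an arbitrary byte permutation within a register.
        let mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
        _mm_shuffle_epi8(v, mask)
    }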


> I think memory-layout is the #1 issue these days.

Absolutely agree. Cache lines should be packed with data that is useful together. Memory streaming access patterns should be favored.

> CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter"...

Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.

> Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement.

Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.


> Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.

Oh, I mean gather/scatter to shared / local memory. General purpose gather/scatter is very high latency as you say (I think read/writes were like 500 nanoseconds to L1 cache, and far slower to L2 and VRAM), but gather/scatter to shared/local memory is basically limited by bank-conflicts (~32 cycles worst case, ~2 cycles best case).

I'm pretty sure AVX512 gather/scatter to L1 cache is still dozens of cycles for just 16 SIMD-lanes.

> Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.

Fair point. I presume you mean that you can shuffle data to L1 cache while waiting for L3 or DDR4 RAM instead.

What I really want is "shared memory" to be implemented on CPUs, and for AVX-lanes to be able to shuffle data to and from there independently of the L1 / L2 / L3 / DDR4 memory system.


> I'm pretty sure AVX512 gather/scatter to L1 cache is still dozens of cycles for just 16 SIMD-lanes.

Yeah, last I checked, it performed like scalar loads and stores. I presume Intel intends to eventually optimize for buffered/L1 hit cases. I mean, why would those instructions even exist otherwise?


Some further background: vendor intrinsics are well documented and unchanging. They are a lot easier for a language to standardize on. As others have pointed out, there are higher-level libraries being built on top of them. This gives the Rust community the freedom to experiment with how the intrinsics should be implemented, with fewer concerns over compatibility.

This is also a nice way of handling a limited subset of other assembly instructions for systems programming while they figure out how to support inline assembly without coupling the language to its implementation.


Rust is lagging behind in the performance world, especially because it lacks OpenMP/OpenACC support.


The rayon library provides a lot of the same features as those.
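
A hedged taste of what that looks like: rayon's parallel iterators give you OpenMP-style parallel loops and reductions, with a work-stealing scheduler underneath.

    use rayon::prelude::*;

    // Roughly the moral equivalent of `#pragma omp parallel for reduction(+:sum)`.
    fn parallel_sum_of_squares(xs: &[i64]) -> i64 {
        xs.par_iter().map(|x| x * x).sum()
    }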


To be honest, everything in Rust looks a bit ugly to me. I really tried to like the language but the syntax, everything being overly annotated, and the number of features that you need to understand to do simple tasks - all of these make it not really worth it to pick Rust for new projects. There are other problems like the ecosystem of crates and the lack of learning resources but at least they aren't intrinsic to the language itself.


> To be honest, everything in Rust looks a bit ugly to me.

I write a lot of Rust code at work, and I admit that it can sometimes be pretty noisy. There are several major contributors to this (a small sketch follows the list):

1. Rust offers fine-grained control over pass-by-value, pass-by-reference, and pass-by-mutable reference. This is great for performance. But it also adds a lot of "&" and "&mut" and "x.to_owned()" clutter everywhere.

2. Rust provides support for generics (aka parameterized types). Once again, this is great for performance, and it also allows better compile-time error detection. But again, you wind up adding a lot of "<T>" and "where T:" clutter everywhere.

3. Usually, Rust can automatically infer lifetimes. But every once in a while, you want to do something messy, and you end up needing to write out the lifetimes manually. This is when you end up seeing weird things like "'a". But in my experience, this is pretty rare unless I'm doing something hairy. And if I'm doing something hairy, I'm just as happy to have more explicit documentation in the source code.
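
A small contrived sketch showing all three sources of noise in one function (purely illustrative):

    // 1. Explicit & / &mut borrows, 2. a generic with a where-clause,
    // 3. an explicit lifetime tying the returned &str to the input slice.
    fn longest_name<'a, T>(items: &'a [T], log: &mut String) -> Option<&'a str>
    where
        T: AsRef<str>,
    {
        log.push_str("searched names\n");
        items.iter().map(|t| t.as_ref()).max_by_key(|s| s.len())
    }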

Really, the underlying problem here is that (a) Rust fills the same high-control, high-performance niche as C++, but (b) Rust prefers explicit control where C++ sometimes offers magic, invisible conversions. (Yes, I declare all my C++ constructors "explicit" and avoid conversion operators.)

Syntax is a hard problem, and I've struggled to get syntax right for even tiny languages. But syntax for languages with low-level control is an even harder problem. At some point, you just need to make a decision and get used to it.

In practice, I really enjoy writing Rust. It's definitely not as simple as Ruby, Python or Go. But it fills a very different ecological niche, with finer-grained control over memory representations, and support for generics.


This is exactly how SIMD intrinsics look in C; it's not a Rust thing, it's an intrinsics thing.



