

SIMD in Rust - dbaupp
http://huonw.github.io/blog/2015/08/simd-in-rust/

======
CUViper
On the last paragraph, for runtime selection of SIMD implementations, would
that use something like STT_GNU_IFUNC? Perhaps exactly that on Linux, since
it's compiling to ELF, after all.

I would have liked to see benchmarks compared to non-simd rust too. Thankfully
this is in the source code. Here's what I get from cargo bench, fwiw:

    
    
             Running target/release/mandelbrot-aefed80dbc3f2841
    
        running 2 tests
        test mandel_naive ... bench:     802,072 ns/iter (+/- 25,106)
        test mandel_simd4 ... bench:     235,853 ns/iter (+/- 6,374)
    
        test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
    
             Running target/release/matrix-58de8ccd4bd58dcd
    
        running 6 tests
        test inverse_naive   ... bench:       4,967 ns/iter (+/- 251)
        test inverse_simd4   ... bench:       1,984 ns/iter (+/- 94)
        test multiply_naive  ... bench:       2,226 ns/iter (+/- 26)
        test multiply_simd4  ... bench:         897 ns/iter (+/- 32)
        test transpose_naive ... bench:         627 ns/iter (+/- 16)
        test transpose_simd4 ... bench:         361 ns/iter (+/- 7)
    
        test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

~~~
dbaupp
Yeah, the runtime selection could indeed leverage ifunc when available (the
actual dispatch isn't so interesting, since a simple branch on CPUID is
perfectly workable in many cases: usually the dispatch is done when calling an
expensive, long-running function, so a little bit of O(1) setup is in the
noise). The really hard bit is getting dependencies compiled in multiple modes
using different features, which will be especially important if/when Rust gets
an ecosystem of libraries with SIMD utilities/functions. I hope we do get a
wide variety of such things (rather than everyone just using the intrinsics
directly, as in C/C++).
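
The branch-on-CPUID dispatch described above can be sketched with std's runtime feature detection (stabilized in Rust well after this post; `sum_avx2`/`sum_scalar` are made-up names for illustration):

```rust
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Compiled with AVX2 enabled, so LLVM's autovectoriser may use
// 256-bit instructions in the body. Calling it is `unsafe` because
// it is only valid on CPUs that actually support AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        // One well-predicted branch on a cached CPUID result --
        // O(1) setup that is in the noise for a long-running call.
        if is_x86_feature_detected!("avx2") {
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```

The dispatch cost is a single branch per call, which is why it only matters that the dispatched-to function does enough work to amortise it.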

BTW, the benchmarks are comparing to non-SIMD rust: the graphs are of how many
times faster the SIMD Rust code is than scalar Rust code (i.e. if they were
plotted, the scalar bars would all be at 1.0).

~~~
CUViper
Aha, the y-axis is "times faster than scalar", I totally missed that.

------
pavanky
To anyone interested, we have a rust port of our general purpose CUDA and
OpenCL library: [https://github.com/arrayfire/arrayfire-rust](https://github.com/arrayfire/arrayfire-rust)

This is in extreme alpha stages and is written by people who are still coming
to grips with rust. If anyone wants to chat about it or has some feedback,
please drop by over here: [https://gitter.im/arrayfire/arrayfire-rust](https://gitter.im/arrayfire/arrayfire-rust)

~~~
dbaupp
Nice! I dream of the days when Rust will be the best language for data
parallelism, with an amazing ecosystem of libraries for threading, SIMD and
GPU programming.

~~~
rancur
how much longer until I get to use it in something on my desktop that runs 3x
faster?

~~~
robin_reala
Grab and build Servo? To quote
[https://lwn.net/Articles/647969/](https://lwn.net/Articles/647969/) :

 _But the results are already impressive; the team said that Servo provides a
2x speed increase when rendering the CNN homepage and a 3x speed-up on
Reddit._

------
jklontz
This is _not_ a post about auto-vectorization in Rust (which would have been a
lot more interesting!). The provided Mandelbrot Set example is algorithmically
similar to loop unrolling with a 4x unroll factor. The strategy works well in
this case because neighboring locations in the Mandelbrot Set tend to require
similar numbers of iterations to compute.
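
The lockstep idea can be sketched in plain scalar Rust, with a 4-element array standing in for a vector register (the post itself uses the unstable `simd` crate; this function is purely illustrative):

```rust
// Iterate 4 mandelbrot points in lockstep: every lane advances each
// round until all 4 have escaped (|z|^2 > 4) or max_iter is reached.
// This pays off when neighbouring points escape after similar
// iteration counts, as the parent comment notes.
fn mandel4(cr: [f64; 4], ci: [f64; 4], max_iter: u32) -> [u32; 4] {
    let mut zr = [0.0f64; 4];
    let mut zi = [0.0f64; 4];
    let mut count = [0u32; 4];
    for _ in 0..max_iter {
        let mut any_active = false;
        for l in 0..4 {
            if zr[l] * zr[l] + zi[l] * zi[l] <= 4.0 {
                // Lane hasn't escaped yet: one more z = z^2 + c step.
                let new_zr = zr[l] * zr[l] - zi[l] * zi[l] + cr[l];
                zi[l] = 2.0 * zr[l] * zi[l] + ci[l];
                zr[l] = new_zr;
                count[l] += 1;
                any_active = true;
            }
        }
        if !any_active {
            break;
        }
    }
    count
}
```

With real SIMD types the inner `for l in 0..4` loop becomes a handful of vector instructions plus a mask, which is where the ~3.4x speedup in the benchmarks comes from.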

~~~
dbaupp
I'm sorry you didn't find the blog post as interesting as you hoped. :) As
others have pointed out, I talk a bit about how rustc gets autovectorisation
by leaning on LLVM in the "Explicit SIMD in the Compiler" section.

In any case, I agree the Mandelbrot example isn't so interesting: I included
it because it is relatively simple, well-known and gives a pretty picture
(i.e. good for a blog post where a single example isn't meant to be the focus).
In fact, manual unrolling catering to autovectorisation is how Rust is
currently at the top of the mandelbrot benchmark game[1], and it explains the equal
performance of the explicit-SIMD and scalar versions of spectral-norm on
AArch64 (although the fact they aren't equal on x86 hints at the lack of
guarantees around autovectorisation).

I find examples like matrix inversion, nbody and fannkuch-redux more
compelling because the vectorised version is far less similar to the scalar
one ("strange" shuffles, approximation of floating point ops and dynamic byte
shuffles with precomputed values, respectively).

[1]:
[http://benchmarksgame.alioth.debian.org/u64/performance.php?test=mandelbrot](http://benchmarksgame.alioth.debian.org/u64/performance.php?test=mandelbrot)

~~~
exDM69
This article could use some disassembly (and LLVM IR) from compiled code to
see what a piece of Rust SIMD code looks like when compiled for different
architectures. No doubt you've done this when debugging, but it would also be
useful for the rest of us.

How well does it work in general? When you write SIMD code, can the compiler
keep the values in vector registers or is there spilling going on?

~~~
dbaupp
As you can see from the benchmarks, it works basically as well as industrial
C/C++ compilers like Clang (if not slightly better) and GCC (although GCC's
older and more optimised backend leaps ahead of the LLVM-based compilers in
some cases).

I'm planning follow up posts which may involve more assembly/IR, but this is
designed to be an introduction/high-level post, and the graphs are meant to
serve as a summary/replacement for digging through reams of assembly.

------
ridiculous_fish
This is cool! A lot of SIMD use involves working on large arrays. How does the
bounds checking not make this prohibitively expensive? Also, how are the
alignment requirements managed?

~~~
dbaupp
There is the `load` method on SIMD types, e.g. [1]. This does an unaligned
load from any place within an array, with bounds checks, i.e. the safest
always-works method.

The bounds checking isn't _that_ bad in and of itself, e.g. for f32x4 it is
one bounds check for 4 elements and that bounds check is just a comparison and
an extremely well-predicted branch. However, it definitely can be noticeable
and can inhibit other optimisations.

There's various routes this can be improved for sure, e.g.

\- `unsafe` versions of the `load` that don't bounds-check and/or do aligned
loads,

\- functions that convert a `&[f32]` into a `&[f32x4]` (possibly with prefix
and suffix &[f32]'s for unaligned left-overs),

\- tweaking the set of optimisation passes the compiler runs to handle the
patterns that occur in Rust better (I believe rustc just runs the default set
of LLVM passes, which are likely more tuned to C/C++ than Rust, e.g. the IRCE
pass[2] should help eliminate more bounds checks but isn't enabled by default
because presumably C/C++ don't use bounds checks enough for it to be worth it)

\- higher-level combinators/"algorithms" that avoid the need to do the
loading/memory-management manually

[1]:
[http://huonw.github.io/simd/simd/struct.f32x4.html#method.load](http://huonw.github.io/simd/simd/struct.f32x4.html#method.load)

[2]: [https://github.com/rust-lang/rust/issues/22987](https://github.com/rust-lang/rust/issues/22987)
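
The "one bounds check per 4 elements" amortisation and the chunk-plus-leftovers split can be sketched with plain slices ( `[f32; 4]` standing in for `f32x4`; the later-stabilized `chunks_exact` is used here, not an API from the `simd` crate):

```rust
// Sum a slice 4 elements at a time, with a scalar tail for the
// left-overs that don't fill a whole chunk.
fn sum(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        // `c` is guaranteed to be exactly 4 long, so this inner loop
        // costs one bounds check's worth of slicing per 4 elements;
        // with a real f32x4 it would be a single vector add.
        for l in 0..4 {
            acc[l] += c[l];
        }
    }
    // Scalar suffix for the unaligned/short left-overs.
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}
```

This is essentially the second bullet above: carve the slice into exact 4-wide chunks (candidates for `&[f32x4]`) plus a scalar remainder.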

------
kibwen
Are there any preliminary attempts to integrate this into Servo yet? I'm
curious to know the results, if so.

~~~
dbaupp
I've done some experimental work, yes. However, I haven't found many places
where it will obviously help a lot: it's very easy to use it for 3D transforms
(which does a pile of 4D matrix ops, exactly what the matrix benchmarks in the
post measure), but other things are less nice.

------
frozenport
Weird question, where is the restrict keyword in Rust?

~~~
shepmaster
If I understand the keyword correctly, it's not really needed in Rust. The
Rust compiler keeps very close track of who has a reference (many immutable or
one mutable) to a value. There isn't really "accidental" aliasing.

~~~
Too
That's awesome. It feels like this alone could, at least theoretically, make
rust faster than C on average. Most C programmers are not even aware that the
restrict keyword exists and don't use it in places where they could greatly
benefit from it.

------
stagger87
In my opinion, ensuring zero overhead and zero boilerplate for calling
external libraries will solve the SIMD problem. Anything worth vectorizing
will be written in C/assembly.

~~~
JoshTriplett
Rust has inline assembly these days. Between that and approaches like this, I
certainly hope using FFI for performance won't always be necessary.

~~~
stagger87
I don't think inline assembly solves the problem. If I'm optimizing any
worthwhile problem, it would be smart to do it in a way that is accessible
from every modern language.

~~~
steveklabnik
Rust, being able to expose a C ABI and having roughly the same (minimal)
runtime as C, is accessible from every modern language.
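
Exposing a Rust function over the C ABI is a one-attribute affair; a minimal sketch (the crate would be built as a `cdylib`/`staticlib`, and `dot4` is a made-up example):

```rust
// Exported with an unmangled symbol and the C calling convention,
// so it is callable from anything with a C FFI.
#[no_mangle]
pub extern "C" fn dot4(a: *const f32, b: *const f32) -> f32 {
    // Caller contract (as in C): both pointers refer to at least
    // 4 valid f32 values.
    let (a, b) = unsafe {
        (std::slice::from_raw_parts(a, 4),
         std::slice::from_raw_parts(b, 4))
    };
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

From C this is just `float dot4(const float *a, const float *b);` plus linking against the compiled Rust library.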

