
Faster: Fast numerical calculations in Rust - blacksmythe
https://github.com/AdamNiederer/faster
======
gatmne
When I read "absurdly fast", I expected to see some benchmarks against an
equivalent SIMD implementation in C or at least a similar library in another
language.

Please consider adding such benchmarks so that we can assess how fast this
library really is compared to other, more established libraries.

~~~
vvimpcrvsh
I'll add benchmarks soon; I'm trying to make it usable first ;).

Right now, it's within 5% of ugly explicit-SIMD code in the absolute worst
case (the example in the README is more like 1%). The hot bit (the body of
the map) compiles down to 100% SIMD instructions, in whatever order LLVM
thinks will be fastest, and the load/store bit has an extra branch per
load/store. This is because faster is mainly a bunch of one-line wrappers
around SIMD intrinsics, plus a bunch of polyfills which only get used on
machines that don't support those intrinsics.
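
Roughly, the shape of it (a hand-wavy sketch, not faster's actual source;
assumes an x86_64 target): a thin wrapper around an intrinsic, plus a
cfg-gated scalar polyfill.

    // Illustrative sketch of the wrapper-plus-polyfill idea (not faster's
    // actual source), assuming an x86_64 target.
    #[cfg(all(target_arch = "x86_64", target_feature = "sse"))]
    fn sqrt4(x: [f32; 4]) -> [f32; 4] {
        use core::arch::x86_64::{_mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps};
        let mut out = [0.0f32; 4];
        unsafe {
            // One sqrtps, plus the unaligned load/store around it.
            _mm_storeu_ps(out.as_mut_ptr(), _mm_sqrt_ps(_mm_loadu_ps(x.as_ptr())));
        }
        out
    }

    #[cfg(not(all(target_arch = "x86_64", target_feature = "sse")))]
    fn sqrt4(x: [f32; 4]) -> [f32; 4] {
        // Scalar polyfill: same semantics, one lane at a time.
        [x[0].sqrt(), x[1].sqrt(), x[2].sqrt(), x[3].sqrt()]
    }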

The main reasons you'd see a perf hit compared to explicit SIMD when using
this library are: runtime feature detection (minimal, and not implemented
yet; it would add around two branches per simd_iter call), getting lazy with
gathers/scatters rather than ugly bit-level hacks, and accidentally using
operations which aren't vectorizable on your machine (scatters on
non-AVX-512, reductive operations on non-SSE4.2, etc.) without realizing it
(explicit SIMD will throw a SIGILL; faster will fall back to a scalar
polyfill).
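
For example, a scalar polyfill for a 4-lane gather is conceptually just
per-lane loads (illustrative only, not faster's code):

    // Conceptual scalar polyfill for a 4-lane gather: without a gather
    // instruction, each lane is loaded one at a time, which stays correct
    // but gives up the vector win.
    fn gather4(data: &[f32], idx: [usize; 4]) -> [f32; 4] {
        [data[idx[0]], data[idx[1]], data[idx[2]], data[idx[3]]]
    }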

~~~
ibotty
Can't you test for the capabilities of the processor once and use the right
SIMD code from then on? Maybe it makes sense to introduce a block around the
SIMD code which gets compiled into different blocks of SIMD instructions;
then you only pay for choosing the right block once. That's what I would do
in Haskell, at least.
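
In Rust, a minimal sketch of that "detect once, dispatch ever after" idea
could cache a function pointer, assuming an x86_64 target and std's runtime
feature detection (sum/sum_avx2/sum_scalar are hypothetical names):

    use std::sync::OnceLock;

    fn sum_scalar(xs: &[f32]) -> f32 {
        xs.iter().sum()
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[f32]) -> f32 {
        // Same loop; compiling it with AVX2 enabled lets LLVM vectorize wider.
        xs.iter().sum()
    }

    fn sum(xs: &[f32]) -> f32 {
        // The feature test runs once; every later call goes straight
        // through the cached function pointer.
        static IMPL: OnceLock<fn(&[f32]) -> f32> = OnceLock::new();
        let f = IMPL.get_or_init(|| {
            #[cfg(target_arch = "x86_64")]
            {
                if std::is_x86_feature_detected!("avx2") {
                    return |xs: &[f32]| unsafe { sum_avx2(xs) };
                }
            }
            sum_scalar
        });
        f(xs)
    }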

~~~
exDM69
This has to be done at a granularity coarse enough that the function call
overhead doesn't kill the performance. It has to be entire loops, for example,
not just some primitive operation like a dot product (which has at least 3
different implementations on x86).

In my C code, I have a build script that rebuilds a shared library 3-4 times
for different CPUs (e.g. with and without -mavx2) and then uses dlopen() to
choose the right one at runtime. The actual SIMD code has a bunch of #ifdefs
to choose the appropriate implementation of primitive operations.

GCC has function multiversioning [0][1] (__attribute__((target("sse4.2")))) to
enable choosing a function implementation based on CPU capabilities at
runtime, but that is not suitable for the use case I describe. I want my "high
level" code to be the same, but using the right implementation for "primitive"
ops like dot product. Using multiversioning would completely kill the
optimizer and cause excessive register spilling and other unwanted artifacts.

[0] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[1] https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Function-Multiversioning.html

~~~
ibotty
> This has to be done at a granularity coarse enough that the function call
> overhead doesn't kill the performance. It has to be entire loops, for
> example, not just some primitive operation like a dot product.

Yes, you are right! I had understood it as them wanting to add the branch to
_every_ SIMD instruction. I concur with branching outside of loops.

Re Rust: I expect that one could use a macro around the function (or annotate
it, if the macro system is good enough) that generates these different
functions and branches to the right one.
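
A hand-rolled sketch of what such a macro could expand to, assuming an
x86_64 target (sum_sq and friends are hypothetical names): the kernel body
is written once and compiled once per feature set via #[target_feature],
and a dispatcher branches to the right version outside the loop.

    #[cfg(target_arch = "x86_64")]
    mod dispatch {
        // The macro stands in for whatever code-generating mechanism one
        // would actually use; the body is written once, compiled twice.
        macro_rules! kernel_body {
            ($xs:expr) => {{
                let mut acc = 0.0f32;
                for x in $xs {
                    acc += x * x;
                }
                acc
            }};
        }

        #[target_feature(enable = "avx2")]
        unsafe fn sum_sq_avx2(xs: &[f32]) -> f32 {
            kernel_body!(xs)
        }

        fn sum_sq_baseline(xs: &[f32]) -> f32 {
            kernel_body!(xs)
        }

        pub fn sum_sq(xs: &[f32]) -> f32 {
            // One branch, outside the loop; hoist or cache it if even
            // that is too much.
            if std::is_x86_feature_detected!("avx2") {
                unsafe { sum_sq_avx2(xs) }
            } else {
                sum_sq_baseline(xs)
            }
        }
    }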

------
mncharity
I wonder if one could do a Rust version of https://github.com/ekmett/rts ?
It's a SPMD-on-SIMD exploratory hack, with SOA and masked branching, built
on C++ templates.

Core is https://github.com/ekmett/rts/blob/master/src/rts/varying.hpp and
https://github.com/ekmett/rts/blob/master/src/rts/vec.hpp .

~~~
Ar-Curunir
So something like rayon + faster?

~~~
mncharity
Yes, plus integrated branched control flow using SIMD masks. [1]

[1] e.g., a random Google hit: https://gain-performance.com/2017/05/14/umesimd-tutorial-8-conditional-execution-using-masks/

https://github.com/AdamNiederer/faster
https://github.com/rayon-rs/rayon
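
For a flavor of the mask technique in Rust terms, a minimal sketch assuming
an x86_64 target with SSE4.1: both sides of the conditional are evaluated
for every lane, and a blend selects the right result per lane.

    // Per-lane conditional: out[i] = if x[i] < 0 { -x[i] } else { x[i] },
    // computed without a branch, the way SPMD-on-SIMD frameworks lower `if`.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse4.1")]
    unsafe fn abs4(x: core::arch::x86_64::__m128) -> core::arch::x86_64::__m128 {
        use core::arch::x86_64::*;
        let zero = _mm_setzero_ps();
        let mask = _mm_cmplt_ps(x, zero); // all-ones in lanes where x < 0
        let neg = _mm_sub_ps(zero, x);    // "then" branch, computed for every lane
        _mm_blendv_ps(x, neg, mask)       // per-lane select: neg where mask is set
    }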

------
FridgeSeal
For better or worse, I have a very big soft spot for things in programming
that go fast, even more so for things that go _really fast_.

I look forward to learning more Rust so I can get an excuse to use this.

~~~
dboreham
Back in the day we had to add extra STOP statements at the end of our Fortran
programs if the machine was really fast because it would crash through the
first one or two.

~~~
FridgeSeal
Really?

That's actually awesome! CPUs and GPUs are capable of breakneck speeds, so I
love to see things that really make the most of the capacity available.
~~~
dboreham
Nope. This is what used to pass for a programmer joke.

    
    
          STOP
    C     Add a couple more STOP statements in case someone runs this on a
    C     Cray and it is going so fast it breaks through the first one:
          STOP
          STOP

~~~
Someone
Well, there _is_ this:
http://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/process.c#L360

Yes, Linux does the equivalent of

    
    
        while(true)
        {
          halt
        }
    

I don’t think that code will loop, but why take the risk?

There is/was a detailed description of what it takes to stop an OS; it
included multiple ways to stop the CPUs, executed in order in the
hope/expectation that one of them would work, and in the end it hit a
similar loop. I can't find it anymore, though.

------
vvimpcrvsh
Author of Faster here. Thanks so much for posting this; I was wondering where
the flood of stars was coming from!

------
nerdponx
How do these numbers stack up to equivalent algorithms in C and Fortran?

Are we going to see a RustBLAS soon? Something open-source that could go up
against MKL for performance would be really amazing.

~~~
adrianN
What's the point of reimplementing BLAS in Rust? Those libraries have been
polished for literally decades. They're not even security critical.

~~~
nerdponx
There's no point in doing it in Rust specifically, as opposed to any other
language.

But the point would be no longer having to trade software freedom (or at
least open source) for performance -- try benchmarking NumPy built on MKL
vs. OpenBLAS and you'll see what I mean.

If someone gets inspired by a new language to try and write a free or open-
source MKL competitor, because that language has such-and-such features that
make it more feasible than in Fortran or C, then that'd be fantastic. All I
was suggesting is that it'd be cool if Rust was that language.

------
partycoder
Superlatives are problematic because, for instance, "fast" is relative to
something, and that something might change over time.

Superlatives are usually challenged, and you will be asked for proof. The
proof itself can also be challenged. Suddenly you are spending a lot of time.

I would rather say "SIMD math library" than "fast library".

~~~
smnc
"Faster" is technically a comparative, not a superlative. Otherwise, sound
advice.

~~~
partycoder
Right, "superlative" is not the right term. "Fastest" would be the
superlative, and it is not in use.

I am just generally against referring to software in absolute terms like
simple, fast, friendly, or lightweight, since they can be misleading.

I consider it good practice to separate facts from opinions to achieve clear
communication. Passing off opinions as facts is dogmatism.

------
SloopJon
Is there a simpler way than building to find out whether a crate requires
nightly, as this one does?

~~~
eridius
You could try clicking on the "build|passing" badge and see what it's being
built against.

------
pzone
Utterly superficial thought, but how about the name "fastr" for this crate?

~~~
bpicolo
Making it harder to look up / talk to people about seems worse ;)

~~~
drb91
Are you referring to “fastr” or “faster”, though? I’d imagine the former is
much, much, much easier to google for and remember than “faster rust library”,
which could return basically any piece of rust code ever.

~~~
bpicolo
fastr is a loaded term already anyway, plenty of google hits for it (graalvm’s
r implementation). Crates.io search works just fine though

------
mkj
Won't the compiler be autovectorising that kind of code already?

~~~
vvimpcrvsh
Sadly, even something like [1.0f32].iter().map(|s|
s.sqrt().abs().recip()).collect::<Vec<_>>() doesn't get vectorized 99% of
the time; you can check the disassembly and you'll see sqrtss et al.

~~~
galangalalgol
There are some LLVM flags you can pass to see what got autovectorized in the
rustc output, and some that make it autovectorize more. It is really
sensitive to container size. Just throwing a few flags when compiling n-body
speeds it up quite a bit over the one up on the Benchmarks Game. I got it to
beat the C implementation, which uses explicit SIMD, with portable Rust
code. It still didn't beat Fortran with ifort, but it was faster than
gfortran. Also, when Rust brings in an LLVM with NewGVN, the
autovectorization should get better.

------
fnord77
I thought modern languages optimized code into SIMD instructions
automatically where they could...

I know Java does this.

~~~
burntsushi
They do. It's called auto-vectorization and the Rust compiler does it (mostly
by virtue of LLVM). But there are _thousands_ [1] of SIMD instructions just
for x86_64 alone and auto-vectorization doesn't come close to covering all of
them. Even if auto-vectorization does cover the instructions you want to use,
you might be justifiably uncomfortable with relying on it for performance in
every case (or perhaps you simply can't rely on it), and would instead prefer
to explicitly write out the transformations.

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/
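
For a concrete taste of that, a movemask-style byte search (the kind of
thing memchr does) leans on instructions an autovectorizer essentially never
produces from the equivalent scalar loop; a sketch assuming an x86_64
target, where SSE2 is baseline:

    // The vector loop compares 16 bytes at a time and turns the comparison
    // into a bitmask; no scalar loop reliably autovectorizes into this.
    #[cfg(target_arch = "x86_64")]
    fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
        use core::arch::x86_64::*;
        let mut i = 0;
        unsafe {
            let n = _mm_set1_epi8(needle as i8);
            while i + 16 <= haystack.len() {
                let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
                let eq = _mm_cmpeq_epi8(chunk, n); // 0xFF in matching lanes
                let mask = _mm_movemask_epi8(eq);  // one bit per lane
                if mask != 0 {
                    return Some(i + mask.trailing_zeros() as usize);
                }
                i += 16;
            }
        }
        // Scalar tail for the last < 16 bytes.
        haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
    }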

------
IncRnd
I don't see anything absurdly fast in this post. Are there any benchmarks
that would support the claim of "absurdly fast numerical calculations"?

