
Rust and C++ on Floating-Point Intensive Code - payasr
https://www.reidatcheson.com/hpc/architecture/performance/rust/c++/2019/10/19/measure-cache.html
======
diamondlovesyou
I have some experience with this, i.e. ensuring LLVM optimizes and codegens the
"best"! I have been working to generate _target independent_ "kernels" for the
Rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM
to generate machine code similar in quality to hand-written ASM. Granted, this
is on integers and not floats, but the same principles should apply.

What I've found is that you need to ignore most of Rust: use/load raw
pointers, don't use slices, unroll manually, vectorize manually, and check
preconditions manually. You'll still get the amazing type system, but the code
will have to be more C-like than Rust-like.

* raw pointers: LLVM is pretty good at optimizing C code. Rust specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)

* no slices: index checking is expensive, not to the CPU (the CPU rarely mispredicts the check branches) but to the optimizer. I've found these checks are mostly left unelided, even after inlining.

* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).

* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling without it. Maybe if all the other items here are also done...

* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.

* And to get safety back: check, check, and check before calling the fast kernel! I.e. wrap the kernel function with one that does all the checks elided in the kernel.

Source: Wrote
[https://github.com/xiph/rav1e/pull/1716](https://github.com/xiph/rav1e/pull/1716),
which speeds up the non-asm encodes by over 2x!
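
A minimal sketch of that last point (the names and the toy kernel here are hypothetical; the shape, a safe checked wrapper around an unchecked raw-pointer kernel, is the pattern described above):

```rust
/// Unchecked inner kernel: raw-pointer iteration, no bounds checks.
/// Caller must guarantee `a`, `b`, and `out` each point to `len` valid f32s.
unsafe fn add_kernel(a: *const f32, b: *const f32, out: *mut f32, len: usize) {
    for i in 0..len {
        *out.add(i) = *a.add(i) + *b.add(i);
    }
}

/// Safe wrapper: validate every precondition once, up front,
/// then hand off to the fast kernel.
pub fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    unsafe { add_kernel(a.as_ptr(), b.as_ptr(), out.as_mut_ptr(), a.len()) }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    let mut out = [0.0_f32; 3];
    add(&a, &b, &mut out);
    println!("{:?}", out);
}
```

The kernel stays branch-free inside the loop, while the wrapper keeps the public API safe to call.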

~~~
dochtman
Not sure I understand the part about raw pointers. As far as I understand,
Rust references will surely turn into pointers at the LLVM IR level?

~~~
devit
Rust references should in general optimize better because they give stronger
aliasing guarantees.

Even for slices, using get_unchecked(1..) to get a smaller subslice without
bounds checking might be better than pointer arithmetic as long as the slice
lengths get optimized away (i.e. they are never used and never passed to non-
inlined functions).
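
A sketch of that subslice idiom (a hypothetical example; the point is only that `get_unchecked(n..)` sidesteps the bounds check while keeping slice types rather than dropping to raw pointers):

```rust
// Walk a slice two elements at a time via unchecked subslicing
// instead of raw pointer arithmetic.
fn dot_pairs(data: &[f64]) -> f64 {
    let mut rest = data;
    let mut acc = 0.0;
    while rest.len() >= 2 {
        acc += rest[0] * rest[1];
        // SAFETY: the loop condition guarantees rest.len() >= 2.
        rest = unsafe { rest.get_unchecked(2..) };
    }
    acc
}

fn main() {
    // 1*2 + 3*4 = 14
    println!("{}", dot_pairs(&[1.0, 2.0, 3.0, 4.0]));
}
```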

~~~
littlestymaar
> Rust references should in general optimize better because they give stronger
> aliasing guarantees.

AFAIK this does not work atm due to a codegen bug in LLVM (which can also
affect code using _restrict_ in C in some cases). This bug will get fixed one
day, but most likely another bug will be revealed at that point… This part of
LLVM was never exercised as much by C as it is by Rust, so they keep finding
bugs in it. Hopefully it will be fine in the long run, but I'm not holding my
breath.

------
lovasoa
It looks like what the author was looking for is [1]

    pub fn mul_add(self, a: f64, b: f64) -> f64

Adding it to the code indeed allows LLVM to generate the "vfma"
instruction. But it didn't significantly improve performance, on my machine at
least.

    $ ./iterators 1000
    Normalized Average time = 0.0000000011943495282455513
    sumb=89259.51980374461

    $ ./mul_add 1000
    Normalized Average time = 0.0000000011861410852805122
    sumb=89259.52037960211

Maybe the performance gap is not due to what the author thought...

[1] [https://doc.rust-
lang.org/std/primitive.f64.html#method.mul_...](https://doc.rust-
lang.org/std/primitive.f64.html#method.mul_add)
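
For readers unfamiliar with it, `mul_add` computes `self * a + b` with a single rounding, so its result can differ from the separately rounded expression. A small illustration (the constants are just chosen to make the difference visible):

```rust
fn main() {
    let (a, b, c) = (0.1_f64, 10.0, -1.0);
    // 0.1_f64 * 10.0 rounds to exactly 1.0, so this is 0.0.
    let plain = a * b + c;
    // The fused form keeps the low bits of the product, so the
    // tiny excess of 0.1_f64 over 1/10 survives.
    let fused = a.mul_add(b, c);
    println!("plain = {plain:e}, fused = {fused:e}");
}
```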

~~~
gpderetta
Hum, did the program get vectorized?

~~~
lovasoa
As I said, the compiler did generate FMA instructions. These are SIMD
instructions, so yes, the program was vectorized.

~~~
gameswithgo
It isn't always that simple. FMA instructions are tricky to use in a way that
actually improves performance; LLVM may be doing it right where doing it
manually that way may not.

Also, sometimes a SIMD instruction is used but only on one lane at a time. This
is actually common with floating-point code.

~~~
jcl
Something I found surprising: Some AVX2 and AVX-512 instructions consume so
much power that Intel chose to have their chips dynamically slow their clock
frequency when the instructions are executed. So naively switching to SIMD
instructions can not only fail to improve performance, but it can also hurt
the performance of unaltered code executed after it -- even unrelated code
running on other cores.

[https://blog.cloudflare.com/on-the-dangers-of-intels-
frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-
scaling/)

------
leni536
So the difference basically boils down to -ffast-math, right? Is there an
equivalent in Rust?

Edit: After some search I found these:

[https://github.com/rust-lang/rust/issues/21690](https://github.com/rust-
lang/rust/issues/21690)

[https://doc.rust-
lang.org/core/intrinsics/fn.fadd_fast.html](https://doc.rust-
lang.org/core/intrinsics/fn.fadd_fast.html)

Writing a wrapper around f64 that uses these intrinsics shouldn't be too hard.
I don't program in Rust though.

~~~
pedrocr
It doesn't exist yet, and it's not clear it should be replicated as is. The
fast-math flag does a bunch of related things that should probably be exposed
separately, so they don't become a footgun in several situations. I'm also
partial to exposing it per-function, so the control is actually in the hands of
the people writing the code, who know the context, and not subject to someone
fiddling with compiler flags and getting incorrect code.

For this example you'd probably want -fassociative-math and not the other
stuff that may result in incorrect code. -ffast-math was not actually used in
the clang compilation and it's possible that the -fvectorize that was used
picks a sensible mix of options.

Here's a preliminary discussion:

[https://internals.rust-lang.org/t/pre-rfc-whats-the-best-
way...](https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way-to-
implement-ffast-math/5740)

Trying to do an RFC process for this could be useful. The rust development
process seems to be pretty good at thinking deeply about these things.

~~~
cannam
> I'm also partial to exposing it per-function so the control is actually in
> the hands of the people writing the code that know the context

As a C++ programmer who routinely uses fast-math "until something breaks" with
DSP code, I would find that capability very attractive.

~~~
fluffything
That's kind of at odds with Rust's guarantee that your code never breaks.

~~~
_bxg1
Technically Rust only guarantees _memory safety_ (and only outside of
unsafe {}). It has many features that aid in other kinds of safety: strongly
encouraging you to unwrap Option<> and Result<>, requiring all match cases to
be covered, allowing for lots of strategic immutability, etc. But it doesn't
_guarantee_ that kind of correctness.

~~~
fluffything
That's not correct. Safe Rust is advertised as sound, and Rust defines that as
"safe Rust programs do not exhibit undefined behavior". Undefined behavior is
a much larger term than just memory safety, and includes things like thread
safety, const safety, unwind safety, etc.

------
dhruvdh
It's very easy to do FMAs using .mul_add() on floats in Rust, which the
author didn't seem to know about.

~~~
magicalhippo
Ideally the compiler should be able to do this by itself though, at least with
the appropriate flag to enable it.

~~~
paulddraper
FMA isn't a safe optimization as it can give different results.

C++ compilers have flags to enable it globally. gcc and clang include the
optimization in -Ofast.

Rust allows you to choose at a code level (but usually people don't know about
it). Perhaps it should also have a global fast-math flag that would
automatically optimize it. Pros and cons to that.

~~~
nwallin
FMA is "safe" in that if it breaks your code, it was already broken. It can
only make the results slightly more accurate, unlike, for instance, the rsqrt
instruction, which is less accurate (and as such is not a safe optimization).

GCC emits FMA instructions at -O2 without -ffast-math.
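
The accuracy point can be seen directly: because fma rounds only once, `fma(a, a, -(a*a))` recovers the exact rounding error of the product `a*a` (the error-free-transformation trick used in compensated arithmetic). A minimal sketch:

```rust
fn main() {
    // a*a is not exactly representable: with a = 1 + 2^-52,
    // the exact square 1 + 2^-51 + 2^-104 rounds down to 1 + 2^-51.
    let a = 1.0_f64 + f64::EPSILON;
    let rounded = a * a;
    // Single-rounding FMA recovers the discarded low part exactly.
    let err = a.mul_add(a, -rounded);
    println!("rounding error of a*a = {err:e}");
}
```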

~~~
lilyball
Well, it could "break" your code in that it might make your code produce
different results than a separate equivalent implementation that didn't use
FMA.

Edit: Ok actually it sounds like it could literally break some algorithms, see
[https://news.ycombinator.com/item?id=21342974](https://news.ycombinator.com/item?id=21342974)

------
brandmeyer
This is interesting to see. But if I'm going to compare numerical C++ against
numerical Rust, then I would be using a higher-level library for the
comparison. What is Rust's Other Leading Brand (TM) for the Eigen C++
library?

~~~
steveklabnik
That comparison (I’m not familiar enough with Eigen to truly say) is going to
change over time too; once const generics lands (which is proceeding, finally)
the APIs for numerics libraries are going to be significantly different in
Rust.

------
toolslive
Before you look at speed, did you verify that you get the exact same math
results in Rust and C++ (and for each compiler and platform)? For C++ code, I
have seen the results of calculations vary across compilers (and flags).

------
nestorD
The author ends by noting that FMA would probably have improved the
performance of the Rust code.

It is interesting to note that, whereas most ffast-math optimizations will
trade precision for reduced computing time, adding an FMA can only improve the
precision of the output (and thus it is a safe optimization).

~~~
gameswithgo
pedantry: ffast-math does not always trade precision. It simply gives up the
guarantee that the results are the _same_ as if the code were not vectorized. A
vectorized sum of floats, for instance, is more accurate, not less.
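
This claim is easy to check: splitting a long sum into independent "lanes" (which is effectively what vectorization does) reassociates the additions and typically shrinks the accumulated rounding error. A rough sketch, with no guarantee about the exact error values:

```rust
fn main() {
    let n = 10_000_000;

    // Sequential left-to-right sum of 0.1, n times.
    let mut naive = 0.0_f64;
    for _ in 0..n {
        naive += 0.1;
    }

    // The same sum split across 8 independent accumulators,
    // mimicking an 8-lane vectorized reduction.
    let mut lanes = [0.0_f64; 8];
    for i in 0..n {
        lanes[i % 8] += 0.1;
    }
    let laned: f64 = lanes.iter().sum();

    let reference = 1.0e6; // mathematically n * 0.1
    println!("naive error = {:e}", (naive - reference).abs());
    println!("laned error = {:e}", (laned - reference).abs());
}
```

On typical inputs the laned version lands closer to the true value, because each accumulator stays smaller relative to the terms being added.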

------
leshow
What's "almost" algebraic about enum? It can definitely be used to construct
sum types, and you can make product types with struct or inline in an enum.

~~~
uryga
my best guess is that you can't do recursive enums without _explicit_ boxing
[edit: or other forms of indirection, like &T]¹. so you can't do this:

    
    
      enum List<T> {
        Nil,
        Cons(T, List<T>)
      }
    

instead, you have to box/reference-ify the recursive occurrence:

    
    
      enum List<T> {
        Nil,
        Cons(T, Box<List<T>>)
      }
    

so in certain circumstances it doesn't let you "coproduct" two types together;
you might need to modify one a bit, which makes it a technically-not-exactly-
a-coproduct (i think). a bit of a stretch, but it sort of makes sense next to
by-reference-only ML langs where you can (co)product anything as you please

(btw, it's the same for recursive products)

\---

1 - [https://users.rust-lang.org/t/recursive-enum-
types/2938/2](https://users.rust-lang.org/t/recursive-enum-types/2938/2)

~~~
steveklabnik
You don't have to box, but you do need some sort of type to make things sized.
This is usually a pointer of some kind, but any kind of pointer works. Take
references, for example:

    
    
      enum List<'a, T> {
        Nil,
        Cons(T, &'a List<'a, T>)
      }
      
      fn main() {
          let list = List::Cons("hello", &List::Nil);
      }
    

Box is usually chosen because it's a good default choice.

~~~
uryga
you're right of course! i should've used a more generic term like
"indirection" or "reference", didn't mean to put emphasis on Box

~~~
steveklabnik
It’s all good, most people say just Box, because it is the majority case.

------
vkaku
Okay, can someone give an explanation for why Rust does not mimic the -Ofast
behavior? Is this something they plan to add?

~~~
steveklabnik
It leads to undefined behavior in safe code in the general case.

We may add a wrapping type, similar to what we do for integer behavior. But in
general, adding flags to change major behavior is not something we do.

------
Myrmornis
This is an extremely clear article and apparently well-executed experiment.

------
rathinmadhu
Excellent

------
rathinmadhu
Super

