Rust and C++ on Floating-Point Intensive Code (reidatcheson.com)
216 points by payasr on Oct 24, 2019 | hide | past | favorite | 94 comments

I have some experience with this, i.e. ensuring LLVM optimizes and codegens the "best" machine code! I have been working to generate target-independent "kernels" for the Rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM to generate machine code similar in quality to hand-written ASM. Granted, this is on integers and not floats, but the same principles should apply.

What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like.

* raw pointers: LLVM is pretty good at optimizing C code. Rust specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)

* no slices: index checking is expensive — not to the CPU (the CPU rarely mispredicts the check branches), but to the optimizer. I've found these checks are mostly left un-elided, even after inlining.

* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).

* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling without it. Maybe if all the other items here are also done...

* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.

* And to get safety back: check, check, and check before calling the fast kernel! Ie wrap the kernel function with one that does all the checks elided in the kernel.
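A minimal sketch of that last pattern (hypothetical kernel, not the rav1e code): the safe wrapper performs every check exactly once, and the inner kernel works on raw pointers with no checks at all.

```rust
// Hypothetical example: inner kernel uses raw pointers, no checks.
//
// Safety: caller must guarantee a, b, and out are valid for n elements.
unsafe fn add_kernel(a: *const f64, b: *const f64, out: *mut f64, n: usize) {
    for i in 0..n {
        *out.add(i) = *a.add(i) + *b.add(i);
    }
}

// Safe wrapper: all preconditions are checked here, once.
fn add(a: &[f64], b: &[f64], out: &mut [f64]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    unsafe { add_kernel(a.as_ptr(), b.as_ptr(), out.as_mut_ptr(), a.len()) }
}
```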

Source: Wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x!

the first-order reason Rust and C++ differ in the article is that Rust will not pass the -ffast-math flag to llvm, not because of any of this stuff.

Be that as it may, I wasn't talking about C++. Also, the integer operation optimizations require no such flags on either side, and yet the resulting machine code was still very poor for Rust.

Not sure I understand the part about raw pointers. As far as I understand, Rust references will surely turn into pointers at the LLVM IR level?

They do. The issue may be that references don't support pointer arithmetic – that is, given `x: &u32` you cannot get `x[1]`. The normal approach is to use a reference to a slice, `x: &[u32]`, and the default bounds-checked indexing. Rust does let you do unchecked indexing on slices (with `unsafe`), but it may be more efficient to avoid indices entirely in favor of pointer arithmetic. LLVM can often optimize index calculations into pointer arithmetic, but not always.
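As a hedged illustration (mine, not from the post), iterating by pointer rather than by index looks like this; LLVM sees no bounds or overflow checks at all:

```rust
// Walk a slice with raw pointer arithmetic instead of indexing.
fn sum_ptr(x: &[f64]) -> f64 {
    let mut p = x.as_ptr();
    // Safety: one-past-the-end pointers are allowed.
    let end = unsafe { p.add(x.len()) };
    let mut s = 0.0;
    while p != end {
        // Safety: p always points inside the slice here.
        unsafe {
            s += *p;
            p = p.add(1);
        }
    }
    s
}
```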

Edit: Although, on rereading the above post, I see diamondlovesyou did mention indices, so... not really sure what's going on.

True. I should have explained that better. If you're accessing an array, you'll have to do the pointer offset-ing yourself (otherwise you're using a slice and all the checking that entails). Thus, you can't use a reference type, because reference types don't have `<*const T>::add` like pointers do (also, `&T as usize` is invalid; you have to go through a pointer type first). I suppose I assumed you'd be accessing an array/slice; references otherwise are fine.

Rust references should in general optimize better because they give stronger aliasing guarantees.

Even for slices, using get_unchecked(1..) to get a smaller subslice without bounds checking might be better than pointer arithmetic as long as the slice lengths get optimized away (i.e. they are never used and never passed to non-inlined functions).
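A small sketch of that idea (names illustrative): check once up front, then take the unchecked subslice.

```rust
// Drop the first element without a per-call bounds check.
// The single assert up front makes the unchecked slicing sound.
fn tail(xs: &[u32]) -> &[u32] {
    assert!(!xs.is_empty());
    unsafe { xs.get_unchecked(1..) }
}
```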

> Rust references should in general optimize better because they give stronger aliasing guarantees.

AFAIK this does not work atm due to a codegen bug in llvm (which can also affect code using restrict in C in some cases). This bug will get fixed one day, but most likely another bug will be revealed at this point… This part of LLVM was never really used in C as much as in Rust, so they keep finding bugs in it. Hopefully it will get fine in the long run, but I'm not holding my breath.

You can just pass a reference/mut ref to the first element of the slice. This is actually how the generated kernels in my Rav1e PR do it, just for the aliasing reason you mentioned.

You can use iterators to avoid unnecessary bounds checking on every element in a slice and you can still get an index to the value. Something like this:

    for (i, val) in my_slice.iter().enumerate() {
        let x = *val + 9999;
    }

Not really in this case; Rav1e runs over blocks of image planes (say, a 16x16 block of a specific 720p color channel), so there is no contiguous slice to iterate over.

Maybe you can use chunks_exact() and chunks_exact_mut(). The _exact versions allow lifting the bounds checks out of the loop and gave me some great performance boosts in image processing code.
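For contiguous data the pattern looks roughly like this (hypothetical example): the remainder-free chunks let the compiler prove every inner index is in range.

```rust
// chunks_exact(8) yields only full 8-element chunks, so `i < 8`
// implies the index is in bounds and the checks can be elided.
fn add_assign(dst: &mut [f64], src: &[f64]) {
    for (d, s) in dst.chunks_exact_mut(8).zip(src.chunks_exact(8)) {
        for i in 0..8 {
            d[i] += s[i];
        }
    }
}
```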

These blocks are non-contiguous. It operates over part of a row, where the start of the next row is the start of the current row plus some stride.
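For what it's worth, strided blocks can still be walked row-by-row with safe iterators; a hypothetical sketch (layout assumptions are mine, assuming x + w <= stride):

```rust
// Iterate the rows of a w-wide, h-tall block at (x, y) in a plane
// stored row-major with the given row stride.
fn block_rows(
    plane: &[u8],
    stride: usize,
    x: usize,
    y: usize,
    w: usize,
    h: usize,
) -> impl Iterator<Item = &[u8]> {
    plane[y * stride + x..]
        .chunks(stride)       // each chunk starts at column x of a row
        .take(h)
        .map(move |row| &row[..w])
}
```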

I mean maybe? But I probably wouldn't anyway for this case as a slice reference is actually a "fat" pointer (ie twice the size of a normal pointer) and the length of the slice won't be used (the block size is known per kernel); LLVM might delete the length part anyway.

These are automatically generated kernels, so readability isn't the primary concern here.

Once const-generics are released, I feel like you should be able to create your own "fixed-size slice" type, which uses the constant parameter for "bounds checking". I imagine that would be a lot more optimiser-friendly...
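Even without a custom type, the shape of that idea can be sketched with plain fixed-size array references: one runtime check at the conversion, then constant indices need no further checks.

```rust
use std::convert::TryInto;

// One runtime check at the conversion; after that, every constant
// index is provably in range and needs no bounds check.
fn sum4(xs: &[f64]) -> f64 {
    let a: &[f64; 4] = xs[..4].try_into().expect("need at least 4 elements");
    a[0] + a[1] + a[2] + a[3]
}
```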

To sum up: LLVM is heavily optimized for C code, not so much for Rust. So Rust code has to imitate certain C mannerisms for the optimizer to kick in.

You have to pay the price of compatibility with the existing toolchain even if it's not your explicit goal.

I'm not sure this is the optimizer's fault. If I'm doing a bounds-checked array access that might panic, that panic is an observable side effect that cannot be reordered with respect to other observable side effects. That puts constraints on what the optimizer can do, not because it's not smart enough, but because it has to respect the meaning of the program.
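That constraint is also why the standard trick of re-slicing up front works: move the one observable check before the loop, and the per-iteration checks become provably dead. A hedged sketch:

```rust
// The explicit slices panic up front if n is too large; inside the
// loop, `i < n == a.len()` is provable, so no per-element checks.
fn dot(a: &[f64], b: &[f64], n: usize) -> f64 {
    let (a, b) = (&a[..n], &b[..n]);
    let mut s = 0.0;
    for i in 0..n {
        s += a[i] * b[i];
    }
    s
}
```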

there are aspects of Rust that are easier for llvm to optimize than C, and there are things Rust isn't yet doing that it will do to let llvm optimize it better as well.

the primary reason THIS code didn't optimize as well is that you can't pass ffast-math to llvm from rust in any useful way (yet)

Can anyone elaborate a bit on why this PR has such terrible compile times? As someone with only a passing familiarity of Rust (I do most of my work in C/C++ and generally expect sub-10s compile times), I would naively expect that writing more C-like code which uses fewer of Rust's language features would compile faster or at least not considerably slower. What's going on there?

The biggest Rust language feature is the borrow checker IMO, and unless you're running in unsafe, that's still doing its job. I always assumed the slowness was more based on the borrow checker analysis than on the error wrapping (option) stuff, slices, or functional bits.

I don't have anything solid backing this up, but I've often heard Rust proponents claim that the borrow checker is actually fairly cheap in terms of compile time overhead. Looking at the PR, it's obvious that it generates quite a bit of code (separate kernels for different block sizes, etc.) and does things like generics over tuples, but I don't know enough about Rust to pick out patterns that might be particularly slow to compile. It does have somewhat fewer unsafe blocks than I expected.

https://wiki.alopex.li/WhereRustcSpendsItsTime is a recent exploration of this topic that's very interesting.

That was interesting indeed, thanks. For an outsider, the most trivial takeaway from that is something along the lines of "Rust is slow to compile and it's probably not getting 10x faster any time soon". Beyond the obvious culprits like heavy metaprogramming, it looks like death by a thousand cuts (not to mention that ggez spends 15 seconds on something that's not even accounted for in the profile, that alone is huge) and there obviously isn't a common root cause for huge compile times in general.

Sounds like it's fair to say most of this boils down to bounds checks and SIMD?

> index checking is expensive

Seems a little gung ho to disable guaranteed index checking in a video codec no? I know you still do the checks, but it sounds like it's not in a statically guaranteed way.

Surely tight loops in performance-sensitive code are precisely where you'd consider disabling runtime checks, no?

You can still have them enabled for debugging, at least. (Something not generally possible in C/C++, sadly.)

Depends how much you care about safety, and how much safety is needed. If you're doing an FEA analysis or whatever, fine, disable all checking. But a video codec is guaranteed to be used in dangerous environments.

Also depends how much it affects performance. If it doubles it, OK, I'll accept the risk. 10% faster? No thanks, I'd rather play it safe.

It looks like what the author was looking for is [1]

    f64::mul_add(self, a: f64, b: f64) -> f64

Adding it to the code indeed allows LLVM to generate the "vfma" instruction. But it didn't significantly improve performance, on my machine at least.

    $ ./iterators 1000
    Normalized Average time = 0.0000000011943495282455513

    $ ./mul_add 1000
    Normalized Average time = 0.0000000011861410852805122

Maybe the performance gap is not due to what the author thought...

[1] https://doc.rust-lang.org/std/primitive.f64.html#method.mul_...
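For reference, a sketch of how the loop body changes (the exact expression is my reconstruction of the article's benchmark from the unrolled version quoted elsewhere in this thread):

```rust
// d = c - a*b computed with one rounding, then r += d*a*d,
// both via mul_add so LLVM can emit vfma without -ffast-math.
fn residual(a: &[f64], b: &[f64], c: &[f64]) -> f64 {
    a.iter().zip(b).zip(c).fold(0.0, |r, ((&a, &b), &c)| {
        let d = a.mul_add(-b, c); // c - a*b
        d.mul_add(d * a, r)       // r + d*a*d
    })
}
```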

Hum, did the program get vectorized?

As I said, the compiler did generate FMA instructions. These are SIMD instructions, so yes, the program was vectorized.

> yes, the program was vectorized

"The program" contains two hot loops. Judging from the assembly code you linked in a sibling comment, only the second of these loops was vectorized, the first one wasn't. This slower non-vectorized loop will still dominate execution time.

And for whatever it's worth, dropping the original article's

            for i in 0..n{
in place of your loop using iterators and FMA, you still get nicely vectorized code (though without FMA) for this loop:

        vmovupd zmm2, zmmword ptr [r12 + 8*rcx]
        vmovupd zmm3, zmmword ptr [r12 + 8*rcx + 64]
        vmovupd zmm4, zmmword ptr [r12 + 8*rcx + 128]
        vmovupd zmm5, zmmword ptr [r12 + 8*rcx + 192]
        vmovupd zmm6, zmmword ptr [r14 + 8*rcx]
        vmovupd zmm7, zmmword ptr [r14 + 8*rcx + 64]
        vmovupd zmm8, zmmword ptr [r14 + 8*rcx + 128]
        vmovupd zmm9, zmmword ptr [r14 + 8*rcx + 192]
        vmulpd  zmm10, zmm2, zmmword ptr [rbx + 8*rcx]
        vsubpd  zmm6, zmm6, zmm10
        vmulpd  zmm10, zmm3, zmmword ptr [rbx + 8*rcx + 64]
        vsubpd  zmm7, zmm7, zmm10
Neither FMA nor iterators make a difference for whether this loop is vectorized, so any speedups are necessarily limited.

Hi I wrote the blog post linked - and I feel a little silly that I didn't check that _both_ loops vectorized. So I fixed the Rust implementation to keep a running vector of partial sums which I finish up at the end - this one did vectorize. The result was a 2X performance bump, which I'm about to include in the blog post as an update.

If it's OK I'll link to this comment as the inspiration.

On the iterators versus loop: for some reason when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
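The partial-sums fix can be sketched like this (my reconstruction, not the post's exact code): keep an array of independent accumulators so the reduction no longer depends on a single serial chain.

```rust
// Eight independent partial sums; the compiler may map these to
// vector lanes, and only the final horizontal sum is serial.
fn sum_chunked(xs: &[f64]) -> f64 {
    const LANES: usize = 8;
    let mut acc = [0.0f64; LANES];
    let mut iter = xs.chunks_exact(LANES);
    for chunk in &mut iter {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Serial horizontal sum at the end, plus any leftover elements.
    acc.iter().sum::<f64>() + iter.remainder().iter().sum::<f64>()
}
```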

I liked your trick of iterating on chunks to force the compiler to vectorize the code! Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster.

See the generated assembler here: https://rust.godbolt.org/z/G5A2u0

Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this because Rust is more conservative.

I've tried using mul_add, but at the moment performance isn't much better. But I also noticed someone else on my machine running a big parallel build, so I'll wait a little later and run the full sweep over the problem sizes with mul_add.

So really, it seems the existence of FMA didn't have a performance implication, except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" also allows vectorization of reductions.

Great to hear that you managed another 2x speedup! Sure, feel free to link my comment if you like.

it isn't always that simple. FMA instructions are tricky to use in a way that actually improves performance; llvm may be doing it right on its own, while doing it manually that way may not.

also, sometimes a SIMD instruction is used but only on 1 lane at a time. this is actually common with floating point code.

Something I found surprising: Some AVX2 and AVX-512 instructions consume so much power that Intel chose to have their chips dynamically slow their clock frequency when the instructions are executed. So naively switching to SIMD instructions can not only fail to improve performance, but it can also hurt the performance of unaltered code executed after it -- even unrelated code running on other cores.


What do you mean "manually"? `mul_add` is a Rust function that operates on a single f64; it's still up to LLVM to choose which instructions to use and to do the vectorization.

well technically FMA doesn't necessarily imply vectorization; it depends on whether the P{S,D} vs the S{S,D} suffixed instructions were being used, but if you saw the (P)arallel variants, then yes, it was vectorized.

You can see the full compiler output here:


Thanks. It seems that at least parts of the inner loop have been vectorized. Edit: If I'm reading the asm correctly, the second zip has been vectorized, but the first fold was not.

So the difference basically boils down to -ffast-math, right? Is there an equivalent in Rust?

Edit: After some search I found these:



Writing a wrapper around f64 that uses these intrinsics shouldn't be too hard. I don't program in Rust though.

It doesn't exist yet and it's not clear it should be replicated as is. The fast-math flag does a bunch of related things that should probably be exposed separately so it's not a footgun in several situations. I'm also partial to exposing it per-function so the control is actually in the hands of the people writing the code that know the context and not subject to someone fiddling with compiler flags and getting incorrect code.

For this example you'd probably want -fassociative-math and not the other stuff that may result in incorrect code. -ffast-math was not actually used in the clang compilation and it's possible that the -fvectorize that was used picks a sensible mix of options.

Here's a preliminary discussion:


Trying to do an RFC process for this could be useful. The rust development process seems to be pretty good at thinking deeply about these things.

> I'm also partial to exposing it per-function so the control is actually in the hands of the people writing the code that know the context

As a C++ programmer who routinely uses fast-math "until something breaks" with DSP code, I would find that capability very attractive.

This should probably be exposed as separate floating-point types, with relatively cheap conversions (mostly just error checks).

There's some conversation about exposing fast math on this internals thread: https://internals.rust-lang.org/t/pre-pre-rfc-floating-point...

Floating point code is really difficult to do correctly. LLVM doesn't actually model IEEE 754 correctly yet, although hopefully the remaining issues with the constrained intrinsics will be fixed by the end of the year (even then, sNaN support is likely to still be broken for some time).

What makes floating point more difficult than integers is two things. First, there is an implicit dependency on the status register that greatly inhibits optimization, yet very few users actually care about that dependency. The second issue is that there are many more properties that matter for floating point. For an integer, you essentially only care about three modes: wrapping (w/ optional carry flag), saturating, and no-overflow. Exposing these as separate types exhaustively is easy. For floating point, you have orthogonal concerns of rounding mode (including dynamic), no-NaN, no-infinity, denormals (including flush-to-zero support), contraction to form FMA, reciprocal approximation, associativity, acceptable precision loss (can I use an approximate inverse sqrt?), may-trap, and exception sticky bits. Since they're orthogonal, that means that instead of a dozen types, you need a few thousand types to combine them, although many combinations are probably not going to be too interesting.

That's kind of at odds with Rust's guarantee that your code never breaks.

Technically Rust only guarantees memory safety (and only outside of `unsafe {}`). It has many features that aid in other kinds of safety - strongly encouraging you to unwrap Option<> and Result<>, requiring all match cases to be covered, allowing for lots of strategic immutability, etc. But it doesn't guarantee that kind of correctness.

That's not correct. Safe Rust is advertised as sound, and Rust defines that as "safe Rust programs do not exhibit undefined behavior". Undefined behavior is a much larger term than just memory safety, and includes things like thread safety, const safety, unwind safety, etc.

rust doesn't guarantee anything if you opt out of the guarantees. two examples that come to mind: unsafe and maybe bounds checks in release mode.

fastmath is probably different anyway, as the "breaking" is on a floating point logic level, as in: results become imprecise, but not exactly "wrong" - as in undefined behaviour. but i don't know fastmath, so i might be wrong.

(bounds checks are not removed in release mode, you have to use unsafe to not have the bounds check)

On the /r/rust thread, folks provided examples of why fastmath would produce UB in safe Rust.

ah, i think i meant overflow checks. and thanks for the pointer, i'll have a look.

ah yes, overflow checks are not enabled in release mode today. They may be in the future. And overflow isn't UB, it's two's complement wrapping.

Clang -Ofast implies -ffast-math.

yes, this is all purely down to Rust not doing ffast-math (on purpose)

writing a wrapper is not trivial, at all.

what Rust could do, though, is add a wrapped float type that the compiler will forward to llvm saying "you can ffast-math these". That is the approach Rust tends to take for these kinds of things, though no plans are in the works to do this to make floats vectorizable yet. Maybe we should start such plans?

> on purpose

It's worth explaining why. So far as I understand the situation, dependencies may assume and depend on the absence of ffast-math. Enabling it globally for a compilation is generally considered to be a footgun for this reason.

Yes, and that doesn't surprise me.

In my experience -Ofast/-ffast-math yields very impressive results for FP code.

If you can tolerate platform-specific variation in the trailing parts of floating point numbers, it's wonderful.

It's very easy to do FMA's using .mul_add() on floats in Rust, which the author didn't seem to know about.

Ideally the compiler should be able to do this by itself though, at least with the appropriate flag to enable it.

FMA isn't a safe optimization as it can give different results.

C++ compilers have flags to enable it globally. gcc and clang include the optimization in -Ofast.

Rust allows you to choose at a code level (but usually people don't know about it). Perhaps it should also have a global fast-math flag that would automatically optimize it. Pros and cons to that.

FMA is "safe" in that if it breaks your code, it was already broken. It can only make the results slightly more accurate, unlike for instance the rsqrt instruction which is less accurate. (and as such is not a safe optimization)

GCC emits FMA instructions at -O2 without -ffast-math.

Well, it could "break" your code in that it might make your code produce different results than a separate equivalent implementation that didn't use FMA.

Edit: Ok actually it sounds like it could literally break some algorithms, see https://news.ycombinator.com/item?id=21342974
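The difference is easy to exhibit: with a single rounding, the fused form recovers the rounding error of the product, which the unfused form rounds away. A small demonstration (values chosen by me to make the effect visible):

```rust
// a*a is not exactly representable, so fl(a*a) loses a tiny term.
// The fused a.mul_add(a, -(a*a)) computes a*a - fl(a*a) with one
// rounding and recovers exactly that lost term.
fn fma_difference() -> (f64, f64) {
    let a = 1.0_f64 + f64::EPSILON;
    let unfused = a * a - a * a;        // both products rounded: exactly 0
    let fused = a.mul_add(a, -(a * a)); // single rounding: the lost term
    (unfused, fused)
}
```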

I wasn't trying to imply it should be on by default. Often one does not care about the lower bits of the floats, but do want the speed. For some tasks it's very much the opposite. Being able to specify a global option with local override is a great combo.

FMA is not always faster; it has a high latency (5-6 cycles depending on the CPU), while Add and Mul have very low latency.

This means that to fully utilize FMA you need to unroll a loop more. Sometimes you just can't, and other times you use more instructions and more instruction cache.

In short, it's not always better.

Also, as others said, FMA has better accuracy than a separate Add + Mul.

If you’re doing fiddly numerical work, this must definitely be optional, as swapping separate multiplication and addition for FMA (or vice versa) can compromise correctness. In some cases you need two different algorithms if FMA is present or absent.

Do you have concrete examples of such algorithms?

Some algorithms guarantee that some arithmetic operation(s) applied to 2+ floating point inputs will result in a list of floating point outputs which when summed have exactly the correct result. This gets all screwed up if you mess with the order of operations or the rounding of intermediate results.

e.g. https://www.cs.cmu.edu/~quake/robust.html

Some keywords to look for: “compensated arithmetic”, “error-free transformations”.

FMAs generally speed up these tools, but you need to be careful and deliberate about how they are used.

(Disclaimer: I am not an expert on this, just some guy on the internet.)

Fair, I think it would be very helpful for Rust if some expert actually knows of a specific example for which this is the case. I think there is an RFC about enabling floating-point contraction by default, that would "silently" be able to do some of these transformations depending on the optimization level.

The one very important thing that often gets destroyed by compilers assuming associativity is the TwoSum error-free transform, which is a vital component of several algorithms that deal with numerical error (most notably Kahan summation).

The problem is mentioned in the Wikipedia page on Kahan summation (and I have been able to reproduce it with gcc): https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Poss...
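For the record, the TwoSum transform in question is tiny, and its whole point is a term that a compiler assuming associativity could "simplify" to zero. A sketch (Knuth's version):

```rust
// TwoSum: returns (s, e) with s = fl(a + b) and a + b = s + e exactly.
// Under real-number associativity every parenthesized term cancels and
// e is "provably" zero -- which is exactly what -ffast-math may decide.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let bp = s - a;                    // the part of s that came from b
    let e = (a - (s - bp)) + (b - bp); // the rounding error of a + b
    (s, e)
}
```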

This is actually my area of research, I could contribute if you point me to an RFC.

I think the key issue with the optimizations that ICC is performing for C++ but Rust is not doing in this case is just FP-contraction, which is related to, but not the same as, assuming associativity.

The RFC about that is https://github.com/rust-lang/rfcs/pull/2686 , where you see users kind of split into the "I want faster binaries" and "I want more deterministic execution" camps. Neither are wrong TBH.

Some people have tried to show there that enabling FP-contraction by default isn't always better / more precise, but I'm not sure if they succeeded.

> I think the key issue with the optimizations that ICC is performing for C++ but Rust is not doing in this case is just FP-contraction, which is related to, but not the same as, assuming associativity.

I think associativity is necessary to vectorize reduction operations like:

    r += (c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);

I haven't looked at the code generated by ICC, but I would expect it to vectorize this by computing tuples of "partial sums", roughly as follows:

    r0 += (c[i+0]-a[i+0]*b[i+0])*a[i+0]*(c[i+0]-a[i+0]*b[i+0]);
    r1 += (c[i+1]-a[i+1]*b[i+1])*a[i+1]*(c[i+1]-a[i+1]*b[i+1]);
    r2 += (c[i+2]-a[i+2]*b[i+2])*a[i+2]*(c[i+2]-a[i+2]*b[i+2]);
and then doing a horizontal sum r = r0 + r1 + r2 + ... in the end. But this requires associativity. (And commutativity, but that's a given.)

Yes, you are right, for those reduce operations you need associativity.

For the error free transform, the important part is the associativity and not the fusion (which would not degrade the operation).

As illustrated in the Wikipedia page, if you move terms according to associative rules you can show that one of the terms is always zero, conclude that it is useless, and drop it from the computation.

However, in practice, floating-point operations are not associative, and that term contains the numerical error of the previous operation.

It isn't a matter of being a rust expert, fused multiply add is an instruction on CPUs.

Care to rewrite the program with `.mul_add()`?

I did rewrite the code with mul_add, and didn't see any significant performance improvement. See my comment above.

very appreciated! Thank you.

This is interesting to see. But if I'm going to compare numerical C++ against numerical Rust, then I would be using a higher-level library for the comparison. What is Rust's Other Leading Brand (TM) for the Eigen C++ library?

That comparison (I’m not familiar enough with Eigen to truly say) is going to change over time too; once const generics lands (which is proceeding, finally) the APIs for numerics libraries are going to be significantly different in Rust.

There's a bunch of Eigen analogues in Rust, which is slightly frustrating. ndarray is pretty great though

before you look at speed, did you verify you get the exact same math results in Rust and C++ (and for each compiler and platform)? For C++ code, I have seen the results of calculations vary across compilers (and flags)

The author ends by noting that FMA would probably have improved the performance of the Rust code.

It is interesting to note that, whereas most ffast-math optimizations will trade precision for reduced computing time, adding an FMA can only improve the precision of the output (and is thus a safe optimization).

pedantry: ffast-math does not always trade away precision. It simply gives up the guarantee that results are the same as if the code were not vectorized. A vectorized sum of floats, for instance, is more accurate, not less.

What's "almost" algebraic about enum? It can definitely be used to construct sum types, and you can make product types with struct or inline in an enum

my best guess is that you can't do recursive enums without explicit boxing [edit: or other forms of indirection, like &T]¹. so you can't do this:

  enum List<T> {
    Cons(T, List<T>),
    Nil,
  }
instead, you have to box/reference-ify the recursive occurrence:

  enum List<T> {
    Cons(T, Box<List<T>>),
    Nil,
  }
so in certain circumstances it doesn't let you "coproduct" two types together, you might need to modify one a bit, which makes it a technically-not-exactly-a-coproduct (i think). a bit of a stretch but it sort of makes sense next to a by-reference-only ML langs where you can (co)product anything as you please

(btw, it's the same for recursive products)


1 - https://users.rust-lang.org/t/recursive-enum-types/2938/2

You don't have to box, but you do need some sort of type to make things sized. This is usually a pointer of some kind, but any kind of pointer works. Take references, for example:

  enum List<'a, T> {
    Cons(T, &'a List<'a, T>),
    Nil,
  }

  fn main() {
      let list = List::Cons("hello", &List::Nil);
  }
Box is usually chosen because it's a good default choice.

you're right of course! i should've used a more generic term like "indirection" or "reference", didn't mean to put emphasis on Box

It’s all good, most people say just Box, because it is the majority case.

I think that's because Rust types must be Sized, but I could be wrong. The first example would have infinite size, while the second has a constant size.

Thanks for commenting, that's probably it. I was aware of the requirement for explicit box-ing but it didn't immediately come to mind.

Okay, can someone give an explanation for why Rust does not mimic the -Ofast behavior? Is this something they plan to add?

It leads to undefined behavior in safe code in the general case.

We may add a wrapping type, similar to what we do for integer behavior. But in general, adding flags to change major behavior is not something we do.

This is an extremely clear article and apparently well-executed experiment.


