Thanks! The chunks trick was a fairly straightforward translation of what I woul...

Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this because Rust is more conservative.

I've tried using mul_add, but at the moment performance isn't much better. But I also noticed someone else on my machine running a big parallel build, so I'll wait a little later and run the full sweep over the problem sizes with mul_add.

So really the existence of FMA didn't have a performance implication it seems except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" will also allow vectorization of reductions.