
Comparing Parallel Rust and C++ - mmastrac
https://parallel-rust-cpp.github.io/introduction.html
======
keldaris
This is a wonderfully written comparison benchmark and it deserves attention
even for that reason alone. It knows its target audience, explains what's
going on succinctly but completely, and avoids most of the usual benchmark
pitfalls that result in comparing apples to oranges. Great job.

The one glaring issue that is ignored is the floating point model used. I
understand Rust still doesn't have a usable equivalent to -ffast-math, so I
assume it wasn't used for that reason. Some discussion of whether it's
permissible in this algorithm (I believe so?) and how much advantage that
might give to C++ seems crucial when performance is a priority.

Ironically, reading this has further convinced me that, for all of its
disadvantages, I much prefer C++ to Rust for my needs. I'm sure others will
draw the opposite conclusion and that's great. Rust is a language that clearly
knows what it wants and if your priorities are aligned with that, the
performance gap is shrinking and implementation-related reasons to avoid it
are rapidly decreasing in number.

~~~
dgellow
> reading this has further convinced me that, for all of its disadvantages, I
> much prefer C++ to Rust for my needs.

What are your needs, if I may ask? I'm not asking this to start a
"Rust vs. C++" thread; I'm just curious about the situations in which someone
decides that one language matches their needs better than another. It's always
interesting to see what boundaries people consider for those decisions.

~~~
keldaris
I mostly use C++ for numerical simulations in physics and associated code
(analysis, some visualization, etc.). That means my primary consideration is
the ease and convenience of writing high performance code for a narrow set of
hardware. I care about the quality of tooling, especially for performance
analysis (including things like likwid [1]) and GPGPU computation. I do not
care about safety (memory or otherwise) - my code doesn't take arbitrary
input, run on shared hardware, do much of anything over networks or have
memory safety crashes.

From this rather narrow point of view, Rust does very little to help and quite
a lot to hinder me. Rust is very much about memory safety - an issue extremely
far down my list of concerns - and to me the borrow checker is an anti-feature
I'd love to turn off. None of this is in any way an indictment of Rust - it
looks like a very well designed language that knows what it wants to
accomplish. It just happens to want the exact opposite things from what I
want, and that's fine. I know I'm in a weird minority (most of the people who
think like I do seem to be game engine developers).

[1] [https://github.com/RRZE-HPC/likwid](https://github.com/RRZE-HPC/likwid)

~~~
wyldfire
> I do not care about safety (memory or otherwise) - my code doesn't take
> arbitrary input, run on shared hardware, do much of anything over networks
> or have memory safety crashes

How do you know it's the case that you don't have memory safety defects in
your implementation?

It's almost certainly not the case that you don't care about safety. Out of
bounds accesses and writes would make the simulation defective. Defective
simulations are useless. However it might make sense if you said that you
don't find yourself fixing defects like these often.

~~~
bluGill
Memory safety defects tend to manifest themselves in obvious ways. A few unit
tests and some work with memory sanitizers will find them.

~~~
pcwalton
This is empirically not true. A look through
[https://twitter.com/lazyfishbarrel](https://twitter.com/lazyfishbarrel)
confirms this.

~~~
bluGill
The exceptions tend to escape notice for years. They get a lot of attention
(and are often really hard to fix unlike the early ones), but I stand by my
statement: most are easily found and fixed - but they are also fixed early in
development so you don't hear about them.

------
kbumsik
Maybe it's because I'm not familiar with Rust, but I always have the
impression that Rust code is very hard to read. It's just impossible for
beginners to figure out what the code is doing.

[C++ code]: [https://github.com/parallel-rust-cpp/shortcut-comparison/blo...](https://github.com/parallel-rust-cpp/shortcut-comparison/blob/8cdab059d22eb8f30e1408c2fbf0ae666fa231d9/src/cpp/v7_cache_reuse/step.cpp)

[Rust code]: [https://github.com/parallel-rust-cpp/shortcut-comparison/blo...](https://github.com/parallel-rust-cpp/shortcut-comparison/blob/8cdab059d22eb8f30e1408c2fbf0ae666fa231d9/src/rust/v7_cache_reuse/src/lib.rs)

~~~
edflsafoiewq
The Rust is practically line noise. This is just awful:

        let pack_simd_row = |(i, (vd_stripe, vt_stripe)): (usize, (&mut [f32x8], &mut [f32x8]))| {
            for (jv, (vx, vy)) in vd_stripe.iter_mut().zip(vt_stripe.iter_mut()).enumerate() {
                let mut vx_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
                let mut vy_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
                for (b, (x, y)) in vx_tmp.iter_mut().zip(vy_tmp.iter_mut()).enumerate() {

And I tend to write stuff like this in Rust too. The loops especially. It's
just so easy to throw together all those iterator combinators and get some
ugly blob that's going to make people's eyes glaze over.

~~~
brutt
The code is easy to follow. I see no problem there.

`pack_simd_row` is a lambda: `|arguments: types| { body }`. It takes a single
tuple argument that is destructured into:

* `i` of type `usize` (unsigned size_t),
* a tuple (anonymous struct) with two fields, `vd_stripe` and `vt_stripe`,
which are modified inside the lambda. They are mutable slices of `f32x8` SIMD
vectors, each holding 8 floats.

Inside the function we have a loop over the result of an iterator, which
produces a tuple with two fields: `jv`, and a tuple with the two fields `vx`
and `vy`. `vx` and `vy` are elements from `vd_stripe` and `vt_stripe`; `jv` is
their index.

Inside the loop we create two temporary mutable variables, `vx_tmp` and
`vy_tmp`, which are fixed-size arrays of 8 floats initialized with infinity.

Then we have the next loop, which modifies these temporary arrays in place.

And so on.
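
To see the destructuring pattern in isolation, here is a minimal
self-contained sketch (names and values made up, plain integers instead of
SIMD vectors):

    // A closure taking one tuple argument, pattern-matched apart in the
    // parameter list, just like `pack_simd_row` above.
    fn main() {
        let mut a = [0i32; 4];
        let mut b = [0i32; 4];
        let fill = |(i, (xs, ys)): (usize, (&mut [i32], &mut [i32]))| {
            // zip the two slices and enumerate, as the original loop does
            for (j, (x, y)) in xs.iter_mut().zip(ys.iter_mut()).enumerate() {
                *x = (i + j) as i32;
                *y = (i * j) as i32;
            }
        };
        fill((2, (&mut a, &mut b)));
        println!("{:?} {:?}", a, b); // [2, 3, 4, 5] [0, 2, 4, 6]
    }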

~~~
jml7c5
While the Rust version does flow logically, succinctness helps a lot more than
people seem to appreciate. It's one of the reasons math notation is often so
inconsistent: the brevity allows for easier manipulation and scanning. (Though
I would not stretch the analogy too far, as inconsistent and overly brief
notation can limit understanding while reading proofs.)

Being able to clearly fit part of a program into 5 lines with fewer characters
makes it easier to ensure correctness than spreading the algorithm over 10
lines full of extra "noise". It's why there's so much syntactic sugar in so
many languages.

------
gpderetta
very nice.

Most language comparison benchmarks are completely useless other than for
bragging points, but those that, like this one, go into the details of why one
specific implementation is faster or slower than another are much more
interesting and give an idea of what makes a language slower or faster.

Also, it's interesting that the hand-optimized program is about 100 times
faster than the unoptimized one, showing that even today there is room for
manual optimization: you cannot trust the compiler blindly, but have to work
with it iteratively to get to an optimal solution. I can't figure out from
either this article or the original one whether -ffast-math was being used. It
would be nice to know whether that would help the compiler vectorize and
unroll the loop with multiple accumulators.

~~~
mratsim
If you are interested in the exact same topic (parallelizing matrix
multiplication), here are other step-by-step tutorials I used:

- UlmBLAS [1], from an HPC course, in 14 steps

- BlisLab [2]; make sure to check out the tutorial.pdf. It gives you exercises
to solve in C, each one building on the previous one.

In Rust, the matrixmultiply crate [3] implements those techniques to reach 80%
of OpenBLAS speed; see the blog post "GEMM: a rabbit hole" [4].

This is a generic approach that can be followed by any low-level language.

I reach 100% of OpenBLAS and MKL-DNN speed in Nim on large matrices as well
[5], without any assembly, with generic code that can also generate integer
matrix multiplies [6].

Regarding fast-math, that's what you do manually: you interleave the fused
multiply-adds, as they have a latency of about 6 cycles (Broadwell; I don't
remember for Skylake).
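
For illustration, here is a minimal sketch of that manual interleaving on a
plain dot product (my own simplification, made-up names, remainder handling
elided):

    // Four independent accumulators hide FMA latency: each acc[k] depends
    // only on its own previous value, so the FMA units stay busy.
    fn dot_interleaved(a: &[f32], b: &[f32]) -> f32 {
        let mut acc = [0.0f32; 4];
        for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
            for k in 0..4 {
                acc[k] = ca[k].mul_add(cb[k], acc[k]); // fused multiply-add
            }
        }
        acc.iter().sum() // combine the partial sums once at the end
    }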

[1]: [http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/](http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/)

[2]: [https://github.com/flame/blislab](https://github.com/flame/blislab)

[3]: [https://github.com/bluss/matrixmultiply](https://github.com/bluss/matrixmultiply)

[4]: [https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole...](https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)

[5]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/benchmarks/gemm/gemm_bench_float32.nim#L393-L439)

[6]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/matrix_multiplication/gemm_ukernel_avx512.nim)

~~~
gpderetta
Thanks for the pointers, I'll take a look.

Re fast-math, I was interested in how much additional parallelism the compiler
can extract from the naive code. Fast-math should, at least in principle,
allow the compiler to unroll the loop and add the additional accumulators
itself, although that violates strict IEEE semantics.
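
To make the constraint concrete, the naive code in question looks roughly like
this (a sketch, not the post's exact kernel):

    // Naive dot product: one serial dependency chain through `sum`.
    // Strict IEEE semantics forbid reassociating the additions, so the
    // compiler cannot split `sum` into several accumulators on its own.
    fn dot_naive(a: &[f32], b: &[f32]) -> f32 {
        let mut sum = 0.0f32;
        for (x, y) in a.iter().zip(b.iter()) {
            sum += x * y;
        }
        sum
    }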

~~~
mratsim
Yes exactly.

Usually you can extract 2x to 4x instruction-level parallelism on simple add
instructions; compare [1] vs [2].

For fused multiply-add, even though the latency is higher, the instruction is
also slower, so it's the same. BLIS, OpenBLAS and my code extract 2x
parallelism (2 accumulators) at the lowest level, because we are restricted by
the number of registers and the fact that x86 can only issue 2 FMAs per cycle
[3].

Details: we divide the work until we have a small micro-matrix C of size MR x
NR (input matrices of size MxK and KxN, so an output of MxN), and here are the
constraints you have to deal with.

Register constraints and micro-kernel tuning:

      - To issue 2xFMAs in parallel we need to use 2x SIMD registers
      - We want to hold C of size MR * NR completely in SIMD registers as well,
        since each value is reused k times during accumulation C[i, j] += A[i, k] * B[k, j]
      - We should have enough SIMD registers left to hold
        the corresponding sections of A and B (at least 4: 2xA and 2xB for FMAs)

On x86-64, with X SIMD registers that can issue 2xFMAs per cycle:

       - NbVecs is 2 minimum
       - RegsPerVec = 2 * NbVecs => 4 minimum (for A and for B)
       - NR = NbVecs * NbScalarsPerSIMD
       - C is MR*NR and uses MR*NbVecs SIMD registers
       - MR*NbVecs + RegsPerVec <= X
          -> MR*NbVecs + 2 * NbVecs <= X
          -> (MR+2) * NbVecs <= X

Some solutions:

       - AVX with 16 registers:
             - MR = 6, NbVecs = 2
               FP32: 8xFP32 per SIMD --> NR = 2x8
                     ukernel = 6x16
               FP64, ukernel = 6x8
             - MR = 2, NbVecs = 4
               FP32: 8xFP32 per SIMD --> NR = 4x8
                     ukernel = 2x32
               FP64, ukernel = 2x16
       - AVX512 with 32 registers
             - MR = 6, NbVecs = 4
               FP32 ukernel = 6x64
               FP64 ukernel = 6x32
             - MR = 2, NbVecs = 8
               FP32 ukernel = 2x128
               FP64 ukernel = 2x64
             - MR = 14, NbVecs = 2
               FP32 ukernel = 14x32
               FP64 ukernel = 14x16

An in-depth overview of the lowest-level details is available in the paper
"Automating the Last Mile for High Performance Dense Linear Algebra" [5].
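
As a quick sanity check (my own sketch, not from the paper), note that all the
configurations above exactly saturate the budget (MR + 2) * NbVecs <= X:

    // Verify the register-budget inequality for the listed micro-kernels.
    fn fits(mr: usize, nb_vecs: usize, x: usize) -> bool {
        (mr + 2) * nb_vecs <= x
    }

    fn main() {
        assert!(fits(6, 2, 16));  // AVX:     ukernel 6x16 FP32 / 6x8 FP64
        assert!(fits(2, 4, 16));  // AVX:     ukernel 2x32 FP32 / 2x16 FP64
        assert!(fits(6, 4, 32));  // AVX-512: ukernel 6x64 FP32 / 6x32 FP64
        assert!(fits(2, 8, 32));  // AVX-512: ukernel 2x128 FP32 / 2x64 FP64
        assert!(fits(14, 2, 32)); // AVX-512: ukernel 14x32 FP32 / 14x16 FP64
    }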

In short, the compiler is completely unable to deal with this, and a
high-performance computing compiler should give an escape hatch to allow hand
optimization, as Halide [6] or Tiramisu [7] do.

[1]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/benchmarks/fp_reduction_latency/reduction_bench.nim#L304-L341)

[2]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/benchmarks/fp_reduction_latency/reduction_bench.nim#L493-L505)

[3]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/matrix_multiplication/gemm_tiling.nim#L111-L146)

[4]: [https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1...](https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L149-L160)

[5]: [https://arxiv.org/pdf/1611.08035.pdf](https://arxiv.org/pdf/1611.08035.pdf)

[6]: [https://halide-lang.org/](https://halide-lang.org/)

[7]: [http://tiramisu-compiler.org/](http://tiramisu-compiler.org/)

------
fyp
In case people missed the link, the reference implementation is an amazing
post on its own:
[http://ppc.cs.aalto.fi/ch2/v7/](http://ppc.cs.aalto.fi/ch2/v7/)

~~~
xyst
So this is where my parallel computing professor got his material from.
Definitely a good read for parallel computing, but that matrix multiplication
problem was discussed ad nauseam in the first few weeks.

------
gatherhunterer
For some reason many people are choosing to focus on code readability. This is
not how a real-world program would be written. Who actually writes a single
train-of-thought implementation without isolating concerns or using any code
separation features? No realistic coder would expect their team to be
satisfied working with code like this. It’s just a benchmark program; it
doesn’t actually fit a use case or solve a problem.

~~~
jcelerier
> This is not how a real-world program would be written.

I have seen many more programs that look exactly like this (at least on the
C++ side) than programs with a clean separation of concerns in my life.

> No realistic coder would expect their team to be satisfied working with code
> like this.

That's assuming you've got a team (and that they are trained, not a bunch of
interns or barely-graduated hires with 100% year-over-year turnover), and
assuming you are a professional developer and not someone who codes in the
context of another profession (researcher, artist... heck, I once saw a music
teacher making small Python apps for the lobby of their music school, etc.).

------
larusso
Very cool. I love the visual explanations for certain iterations; they make it
much easier to see how the preprocessing step prepares the data. What I always
miss in such posts is a full explanation of the tooling, e.g. how to retrieve
the resulting assembly code. Don't get me wrong, I know how to search the
internet. But the post goes to quite some length to explain the basic Rust
setup. Maybe it is aimed at veteran C++ programmers. I certainly would
appreciate a link or an example line showing how to generate the assembly
listings :)
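
For example, something along these lines would already help (standard
compiler flags as far as I know, though the post's exact setup may differ, and
the artifact name below is made up):

    # C++: emit assembly directly from the compiler
    g++ -O3 -march=native -S step.cpp -o step.s
    # Rust: emit assembly for the crate (output lands under target/release/deps/)
    cargo rustc --release -- --emit asm
    # or disassemble the built artifact
    objdump -d --demangle target/release/libstep.so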

------
ChrisSD
Out of interest, why was lto (link time optimisation) set to false? I doubt
this would affect the results much but it's useful for cross-crate inlining.
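
For reference, this is the Cargo profile knob in question (a sketch; the
benchmark repo's exact profile may differ):

    [profile.release]
    # `true` would enable cross-crate link-time optimisation;
    # the benchmark sets this to false
    lto = false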

~~~
OskarS
Presumably because the thing they're benchmarking is in a single translation
unit, so LTO wouldn't matter. And it might screw up the benchmark if the C++
code was optimized between the "test harness" translation unit and the "code
to benchmark" translation unit, but Rust wasn't. It's sensible for this kind
of benchmark not to use LTO.

~~~
fluffy87
Not that it matters, but you can use LTO with C++ and Rust together; you just
need to compile and link both with LTO enabled.

------
Hitton
Awesome. I haven't started to learn Rust yet, but I still learned a lot.

------
The_rationalist
Would have been nice to compare the code ugliness and performance of OpenMP 5
vs. rayon + SIMD.

------
sdan
TLDR: Most of the time C++ with GCC was best.

~~~
galangalalgol
Were you even looking at the same article? Rust was always better at
instruction-level parallelism, and GCC was usually best at cache use,
depending greatly on which i5 or Xeon was used. Clang was always the worst. My
main takeaway was that all three are roughly identical and that small
processor differences dominated the results.

~~~
fluffy87
Yeah, there were differences, but these were peanuts. Using two different
versions of GCC will give you similar differences.

~~~
gpderetta
Yes, you can see from the assembly listings in the final stages that most of
the differences were due to very minor code changes, down to details of the
optimization passes and not really to language differences.

------
wscott
I am surprised he didn't benchmark the C++ program with Clang to give a closer
comparison to Rust. In my experience, despite all its other advantages, Clang
still lags GCC a bit in raw performance.

Still a really useful set of comparisons. I am impressed Rust is able to
compete with all the magic OpenMP is doing in the background.

~~~
steveklabnik
I haven’t read the whole thing yet, but clang seems to be benchmarked here:
[https://parallel-rust-cpp.github.io/results.html](https://parallel-rust-cpp.github.io/results.html)

