
How Rust 1.26 more than tripled the speed of my code - nonsince
http://troubles.md/posts/the-power-of-compilers/
======
pbsd
The code in the blog post does not match what is actually benchmarked. The
reason `u256_full_mul` is so much faster than the inline assembly version is
that it omits the upper part of the result, due to how the benchmark is done
(cf.
[https://github.com/paritytech/bigint/blob/44a3133030306d9fcc...](https://github.com/paritytech/bigint/blob/44a3133030306d9fcc9865b2e8e8368f6992cc98/benches/bigint.rs#L114-L115)).
There should be 16 full 64x64->128 multiplications (unless you're doing
Karatsuba, which is not the case here) in a full 256-bit multiplication, but I
only count 6 in the benchmark's inner loop (plus 4 64x64->64 multiplications).
No wonder it's faster.
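
To make the count concrete, here is a minimal schoolbook sketch (my own illustration, not the crate's actual code) in which all 16 full 64x64->128 multiplications are visible, one per inner-loop iteration:

```rust
// Schoolbook 256x256->512 multiply over little-endian u64 limbs.
// A correct full multiply needs 4*4 = 16 widening 64x64->128 products.
fn full_mul(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..4 {
        let mut carry = 0u64;
        for j in 0..4 {
            // one of the 16 full 64x64->128 multiplications;
            // the sum cannot overflow u128
            let t = (a[i] as u128) * (b[j] as u128)
                + out[i + j] as u128
                + carry as u128;
            out[i + j] = t as u64; // low 64 bits stay in place
            carry = (t >> 64) as u64; // high 64 bits carry upward
        }
        out[i + 4] = carry; // final carry of row i
    }
    out
}

fn main() {
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1
    let r = full_mul([u64::MAX, 0, 0, 0], [u64::MAX, 0, 0, 0]);
    assert_eq!(r[0], 1);
    assert_eq!(r[1], u64::MAX - 1);
    println!("{:?}", r);
}
```

A benchmark whose inner loop contains materially fewer widening multiplications than this is computing something smaller than a full 256-bit product.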

The disassembly for the Rust output of `u256_full_mul` at the end is also
terrible, and it looks like the output of a debug build. There's no reason an
optimizing compiler should be outputting `pushfd` and `popfd` in performant
code. The benchmarked loop does not have those instructions.

~~~
ajross
PUSH/POPF* instructions push the flags register to save comparison results
when you need to reuse the registers containing the original comparison
arguments. They're well-pipelined, integrated with the stack engine and very
fast on modern CPUs, and I've definitely seen the compiler emit them (more on
i686 than x86_64 to be fair).

I won't speak to the quality of the generated code otherwise, but this bit of
evidence by itself doesn't seem persuasive.

( _edit: pbsd is right: per Agner Fog, these guys are slow. It's likely that
the LAHF/SAHF trick is the one I was remembering._)

~~~
pbsd
Where are you getting that from? `pushfd` is OK-ish, but `popfd` is
microcoded, requires 9 uops, and can only be issued once every ~20 cycles.
That's not what I would call well-pipelined.

You could probably achieve the same result (assuming you want to avoid `adc`
instructions, for some reason) by using `setc r8b` plus `shr r8, 1` to get the
carry back.
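
For illustration, the carry dance described above might look like this (a hedged sketch; the register choices are arbitrary, and `setc` writes only the low byte, hence `r8b`):

```asm
add  rax, rcx   ; some step of the carry chain sets CF
setc r8b        ; park CF in a scratch register (0 or 1)
mul  rsi        ; mul clobbers the flags
shr  r8, 1      ; shifting bit 0 out puts it back into CF
adc  rdx, rbx   ; the carry chain resumes
```

The point is that `shr`'s carry-out recovers the saved flag without the microcoded cost of `popf`.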

~~~
nonsince
If you use `-C target-cpu=native` (Rust's equivalent of `-march=native`) you
get code that uses `mulxq` in order to avoid `pushf`/`popf`.
[https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad...](https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad4e3)

~~~
pbsd
It's still doing it, but now it uses `lahf` + `sahf` instead to save and
restore the flags. These are better than `pushf` + `popf` for sure, but they
cannot be used in general code because some early x86_64 chips forgot to
implement them.

~~~
deaddodo
If your general code is only intended to be used on newer versions of Windows,
you can. Windows has required it since 8.1.

------
comex
The behavior with "m" is quite strange. In theory, it should produce a memory
operand pointing directly to where the function argument was passed on the
stack, rather than copying it to registers and then back onto the stack. I can
reproduce this with Rust nightly:

[https://play.rust-lang.org/?version=nightly&mode=release](https://play.rust-lang.org/?version=nightly&mode=release)

However, Clang appears to do the right thing with equivalent C code:

[https://godbolt.org/g/B1sjfS](https://godbolt.org/g/B1sjfS)

This is strange because Rust and Clang both use LLVM for their backend.

Looking at the LLVM IR produced, Rust's IR straightforwardly corresponds to
the source code: a load instruction (to evaluate `self_t[0]`) whose result is
passed to an asm instruction:

    %0 = getelementptr inbounds %U256, %U256* %self, i64 0, i32 0, i64 0
    %1 = load i64, i64* %0, align 8
    tail call void asm sideeffect "mulq $0", "m,~{dirflag},~{fpsr},~{flags}"(i64 %1) #1, !srcloc !0

But Clang's omits the load, and transforms the "m" constraint to "*m":

    %2 = getelementptr inbounds %struct.U256, %struct.U256* %0, i64 0, i32 0, i64 0, !dbg !24
    call void asm sideeffect "mulq $0", "*m,~{dirflag},~{fpsr},~{flags}"(i64* nonnull %2) #2, !dbg !26, !srcloc !27

Hmm… seems to be related to this Rust bug:
[https://github.com/rust-lang/rust/issues/16383](https://github.com/rust-lang/rust/issues/16383)

------
tomsmeding
> especially since the original code used highly-optimised assembly
> incantations from our resident cycle wizard.

He/she might be cool, but I think someone familiar with assembly optimisation
would think to look at the generated assembly as well, after noticing the
speedup is not what you expect. Don't go hand-writing assembly routines
thinking your code is fast without actually checking whether your code is
fast. :)

~~~
nonsince
They're certainly familiar with assembly itself, but the `asm!` macro and the
related constructs in Clang/GCC are their own beast, with a lot of footguns
(like GCC allocating registers as overlapping by default). I don't begrudge
them writing suboptimal assembly, since it already beat our Rust
implementation by a significant margin.

------
jsd1982
Are you sure those are missed optimization opportunities, where LLVM copied
values to registers as temporary storage instead of doing a direct copy to the
destination register? I saw uses of the temporary registers below the lines in
question. Perhaps it was just a cheap copy of the value to a register to be
used later, because the original destination register was clobbered.

Also I would hazard a guess that pushing flag state into a register is a lot
faster than the stack.

Finally, it'd be interesting to see the original hand-rolled inline assembly
used with "r" instead of "m", to see how LLVM stacks up against the intended
version without the redundant memory loads/stores.

~~~
nonsince
Great catch, you're absolutely right. LLVM is smarter than me once again.

------
the_new_guy_29
"decreased the execution time by a factor of 3"

So the article is about a flaw in previous versions of Rust: not supporting
128-bit numbers on x86_64. But the title makes it sound like "switching to
Rust was the breakthrough". Or am I being paranoid about it?

~~~
viraptor
The original title is the more precise "How a Rust upgrade more than tripled
the speed of my code". Also, I think mentioning the version makes it clear
that it's about a specific version, not about Rust in general.

~~~
magduf
Maybe they should change it to: "How an upgrade to Rust 1.26 more than tripled
the speed of my code"

------
kibwen
Happy to see alternative language features reducing the need for inline
assembly, not just with the 128-bit integer support in 1.26 but also with the
stable SIMD support coming in 1.27 (so crates like
[https://users.rust-lang.org/t/jetscii-now-works-with-future-...](https://users.rust-lang.org/t/jetscii-now-works-with-future-stable-rust-1-27-0/17426)
can at last work on stable). Sadly the current implementation of inline assembly is so
inextricably tied to LLVM that there's resistance to stabilizing it in its
current form, though in this case I think the perfect is being allowed to be
the enemy of the good^W good-enough^W janky-but-acceptable.

------
the8472
There's a proposal to introduce a double-wide multiply, which should make this
much more ergonomic to solve.

[https://github.com/rust-lang/rfcs/pull/2417](https://github.com/rust-lang/rfcs/pull/2417)
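
In the meantime, the same operation is already expressible via `u128` (a sketch; `widening_mul` is my own name for it, not the RFC's):

```rust
// A widening 64x64->128 multiply, split back into (low, high) halves.
// A dedicated wide-mul intrinsic would make this pattern more ergonomic
// and guarantee it lowers to a single mul instruction.
fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128); // full 128-bit product
    (wide as u64, (wide >> 64) as u64)    // (low half, high half)
}

fn main() {
    // (2^64 - 1) * 2 = 2^65 - 2: low half is 2^64 - 2, high half is 1
    let (lo, hi) = widening_mul(u64::MAX, 2);
    assert_eq!(lo, u64::MAX - 1);
    assert_eq!(hi, 1);
    println!("lo={lo} hi={hi}");
}
```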

------
hoare
What other factors played a role in your decision to use Rust over other
languages, when 128-bit arithmetic was important enough that you wrote inline
assembly for it?

~~~
lmm
What other languages are sensible options for when you need to work with
inline assembly?

~~~
e12e
From just reading the discussion so far, I'd like to see how a naive
implementation in Julia would do (Julia does type inference, and some quick
duck-duck-ing indicates that unsigned 128-bit int is the biggest regular
number type before you have to go to bigint).

[ed: ok, from a skim I see this is actually about 256-bit multiplication,
which makes me curious how just using bigints and the * (mul) operator would
work in Julia.

Also, I don't get this:

> u256_mul multiplies two 256-bit numbers to get a 256-bit result (in Rust, we
> just create a 512-bit result and then throw away the top half but in
> assembly we have a separate implementation)

How do they know the result will fit in 256 bits? Sounds like they know more
about the arguments than just that both are 256 bits long?

Or do they want multiply-and-shift?]

~~~
wallnuss
Julia has `widemul`, which takes two `Int64` and produces an `Int128`. It is
also not too hard to add your own primitive type for `Int256` (with the
caveat that it needs support from LLVM, which I haven't checked).

------
Const-me
Their assembly implementation can be improved, at least when targeting modern
hardware.

Here’s an article explaining how to use new instructions to implement large
integers multiplication:
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf)

~~~
nonsince
If you use Rust's equivalent of `-march=native` you get faster code that
targets modern hardware. We would have to get another factor-of-two speedup
for the maintenance burden of using inline assembly to be worthwhile.

[https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad...](https://gist.github.com/Vurich/5cb83c773e90fc7a463ccb58e1dad4e3)

------
FPGAhacker
>”When Rust does a * b when a and b are both u64 the CPU actually multiplies
them to create a 128-bit result and then Rust just throws away the upper 64
bits.”

Is that right? Wouldn't that answer look like nonsense? I would have guessed
it tossed the lower 64 bits, so it at least looked like rounding by
truncation.
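
It is not nonsense: discarding the upper half is just arithmetic modulo 2^64, so the low 64 bits of the full product are exactly the wrapping product. A quick way to check (a sketch; `low_half` is my own name):

```rust
// Keeping only the low 64 bits of the full 128-bit product gives the
// same result as Rust's wrapping (modular) multiplication.
fn low_half(a: u64, b: u64) -> u64 {
    ((a as u128) * (b as u128)) as u64 // truncate the full product
}

fn main() {
    let (a, b) = (0x1234_5678_9abc_def0_u64, 0x0fed_cba9_8765_4321_u64);
    assert_eq!(low_half(a, b), a.wrapping_mul(b));
    println!("low halves agree");
}
```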

~~~
comex
By the way, it's only on x86-64 that the CPU's multiplication instruction
_always_ returns both the upper and lower 64 bits (in two different
registers).

On, say, ARM64, there are two separate instructions: MUL is the normal one
that takes two 64-bit inputs and produces one 64-bit output (the lower half of
the product), while UMULH ("Unsigned Multiply High") performs the same
multiplication but gives you only the upper half. This way, in the common case
where you don't care about the upper half, the CPU doesn't have to waste time
calculating it.

~~~
monocasa
I wouldn't say it's only x86. MIPS does the same thing, but doesn't even
stick the result in the GPRs: you issue a multiply, and the result ends up in
the HI and LO registers, which you have to `mfhi` and `mflo` to move back
into GPRs.

~~~
comex
Oh, good point, I've seen that before but forgot about it. Though to be fair,
that's 32x32->64, not 64x64->128 :)

~~~
monocasa
The combined HI and LO are 64x64->128 on MIPS64; you just have to use `dmult`
instead of `mult`.

------
andrepd
Could this be compared with C/gcc and C++/g++? Because this is what Rust is
aiming to displace.

~~~
steveklabnik
Sure; this is exposing LLVM's support, so it should be the same as clang.

------
tzahola
* compared to Rust 1.25

~~~
alimbada
An unnecessary distinction, as it can be quite easily inferred from the
current title. Despite this, there seem to be a lot of people nitpicking the
semantics of the title in this thread. It's unnecessary noise and detracts
(and distracts) from the real conversation.

~~~
paulmd
It's not just this thread. HN is absolutely obsessed with nitpicking titles
and yes, it's distracting and detracts from actual discussion of the topics.

The moderation team here tacitly encourages it by catering to it. I've seen
some threads go through 2 or 3 title changes as people find something new to
complain about.

Really it's almost a form of editorializing in itself - the title of the
article is what it is, people here just think they know better than the
author.

