
ARM and Intel have different performance characteristics: a case study - zdw
https://lemire.me/blog/2019/03/20/arm-and-intel-have-different-performance-characteristics-a-case-study-in-random-number-generation/
======
userbinator
I suspect if you were to compare code that uses both the remainder and
quotient of a division, you would find a similar trend: the x86 division
instructions produce both, but ARM has only a quotient-generating division
instruction, and you have to do another multiplication(!) and subtraction to
get the remainder.
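A minimal C sketch of that pattern (the `rem_via_msub` helper is my naming,
just to illustrate what an ARM64 compiler typically emits):

```c
#include <assert.h>
#include <stdint.h>

/* Computing both quotient and remainder. On x86-64 a single DIV
 * instruction produces both at once; on ARM64 compilers emit UDIV for
 * the quotient and then an MSUB (multiply-subtract) to recover the
 * remainder, as rem_via_msub sketches. */
typedef struct { uint64_t quot, rem; } divmod64;

static divmod64 divmod(uint64_t n, uint64_t d) {
    divmod64 r = { n / d, n % d };  /* one DIV on x86-64 */
    return r;
}

/* What an ARM64 compiler effectively does for the remainder: */
static uint64_t rem_via_msub(uint64_t n, uint64_t d) {
    uint64_t q = n / d;      /* UDIV */
    return n - q * d;        /* MSUB: remainder = n - q*d */
}
```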

That, along with the fact that earlier ARMs _didn't even have a divide
instruction_, is one of the reasons why I'm not at all a fan of the "RISC
approach" --- division may not have been easy to implement, but microcoding
would've been at least as fast as doing the algorithm in software while taking
up far less space, and with hardware advances that move operations from
microcode into direct execution, existing software that already has those
instructions would automatically become faster on newer hardware with no other
action needed.

Division hardware almost always will generate the remainder along with the
quotient, but ARM would either need to add another instruction (no benefit to
existing software, only new software that knows how to use it) or attempt to
detect patterns of simpler instructions that are doing the same thing in order
to replace them with one "fused uop" (much more difficult to do.)

~~~
pcwalton
There is a good reason why nobody implements multiply and divide like x86 did:
it's terrible for register allocation. Especially on 32-bit, where register
pressure is very high, having division input _and_ output registers hardwired
is obnoxious, _and_ it kills a register for the upper half of the product or
remainder or quotient that you almost never need. ARM has the correct design.
Even Intel admits this: look at the MULX instruction that retrofits the right
design onto x86.

Earlier versions of ARM didn't have a divide instruction because of die space
and the fact that it's rarely needed. Integer division is usually by a
constant, so a shift and/or a multiplication suffices. There's no need to
microcode a divide instruction either, because the kernel can just trap the
#UD and perform the division itself, just like soft float works. (Yeah, it's
slow, but software division is always slow.) Modern designs like RISC-V do
this properly: expose division as an optional extension so as to scale down to
microcontrollers.
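For example, the standard magic-reciprocal pair for unsigned 32-bit division
by 10 (multiplier 0xCCCCCCCD, shift 35) lets a compiler avoid the divide
instruction entirely:

```c
#include <assert.h>
#include <stdint.h>

/* Division by a constant needs no divide instruction: compilers
 * replace it with a multiply by a precomputed "magic" reciprocal plus
 * a shift. This is the standard pair for unsigned 32-bit division by
 * 10, exact for all uint32_t inputs. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```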

~~~
stevekemp
I wrote a simple compiler for mathematical operations recently,
[https://github.com/skx/math-compiler/](https://github.com/skx/math-compiler/)
and I have to say I really appreciated the ability to get the two halves of
the result in one operation.

I've just provisioned an ARM server so I can experiment with generating ARM
assembly as well as Intel. No doubt I'll have fun learning the different
instructions available - until now I've only ever written assembly for Intel
and Z80 processors.

~~~
pcwalton
The problem doesn't appear until you start implementing register allocation.

~~~
stevekemp
Yeah, I can appreciate register starvation; it's something I've come across
in previous coding.

------
palango
Can someone explain this in a bit more detail? Why does "the computation of
the most significant bits of a 64-bit product on an ARM processor requires a
separate and expensive instruction"?

~~~
ridiculous_fish
ARM64 has separate instructions for computing the low and high halves of a
product, in keeping with its single destination register approach. x86-64 has
a single instruction that computes both halves simultaneously, writing to two
registers.
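In C (with the GCC/Clang `__uint128_t` extension) both halves fall out of a
single full-width product; on ARM64 the high half becomes the separate UMULH:

```c
#include <assert.h>
#include <stdint.h>

/* One full-width product, two halves. x86-64's one-operand MUL writes
 * the low half to RAX and the high half to RDX at once; ARM64 needs
 * MUL for the low half plus a separate UMULH for the high half. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    __uint128_t p = (__uint128_t)a * b;
    *lo = (uint64_t)p;         /* MUL on both ISAs */
    *hi = (uint64_t)(p >> 64); /* the separate UMULH on ARM64 */
}
```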

~~~
sligor
On a modern out-of-order uarch that kind of "complex" instruction would be
split into two micro-operations, one for each result register. And Agner's
tables [1] confirm that the 64*64 => 128 mul is split into two micro-ops on
Skylake. So it doesn't give any strong advantage.

[1]
[https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf)

~~~
BeeOnRope
Yes, but the second uop is not expensive like the first in this case. That is,
it seems like the full multiplication is done by the latency-3 op on p1
and the other uop is just needed to move the high half of the result to the
destination (indeed, instructions with 2 outputs always need 2 uops due to the
way the renamer works). The whole 64x64->128 multiplication still has a
latency of only 3, and a throughput of 1 per cycle.

So the 64x64->128 multiplication is still quite efficient compared to ARM
where two "full strength" multiplications are needed. It is odd, though, that
there is nearly a 20x difference in relative speeds; I wouldn't expect
multiply upper to be _that_ slow.

~~~
dragontamer
Note: The test seems to have been done on Skylark (aka: Ampere), which is a
non-standard ARM core. I can't find any documentation on Skylark's latency /
throughput specifications.

------
monocasa
ARMv8 NEON has a 64x64->128 multiply.

~~~
profquail
Aye, it’s PMULL, and it’s available as part of the ARM Crypto Extensions.

This patch shows that using it can result in a large performance improvement:
[https://github.com/randombit/botan/issues/842](https://github.com/randombit/botan/issues/842)

~~~
floatboth
Hm, why aren't compilers generating that instruction?

upd: apparently reasons like:

> So I guess for most of the case loading or storing i128, the data will be
> used by some library functions running on cores instead of NEON, so storing
> i128 to two GPR64 is more general.

[https://reviews.llvm.org/D2344](https://reviews.llvm.org/D2344)

~~~
dragontamer
> Hm, why aren't compilers generating that instruction?

That's polynomial multiply. It's (almost) a multiplication in GF(2) for
elliptic curves. That's not a "normal" multiply.

"PMULL" is basically a bitshift and XOR. Your traditional "MUL" is bitshift
and ADD. It's called "polynomial multiply" because bitshift-and-xor has very
similar properties to bitshift-and-add (it distributes over XOR, is
associative, commutative, etc.).

Bitshift-and-xor has a few features that are better for cryptography. But it's
NOT the multiplication you are taught in grade school.

--------

EDIT: With that being said... those "better features" for cryptography would
probably make PMULL a better function for random-number generation. PMULL will
return a different result than the real multiplication, but you'll have an
easier time making a field (aka: reversible 1-to-1 bijections) out of PMULL
than MUL...
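A plain-C sketch of the difference: the two loops below are identical except
for the accumulation operator (XOR versus ADD), which is exactly the
carry-less vs. carrying distinction.

```c
#include <assert.h>
#include <stdint.h>

/* "Polynomial multiply": the grade-school shift-and-accumulate loop,
 * but accumulating with XOR so no carries propagate. This is a software
 * sketch of what PMULL computes for the low 64 bits. */
static uint64_t clmul_lo(uint64_t a, uint64_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            acc ^= a << i;   /* XOR instead of + */
    return acc;
}

/* Ordinary multiply for comparison: same loop, ADD accumulation. */
static uint64_t mul_lo(uint64_t a, uint64_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            acc += a << i;
    return acc;
}
```

E.g. 3*3: carry-less gives 3^6 = 5 (that is, (x+1)^2 = x^2+1 over GF(2)),
while ordinary multiplication gives 3+6 = 9.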

------
m0zg
Moreover, different ARM chips have different performance characteristics as
well. Apple's implementation, for instance, seems to be easily twice as fast
on most problems at the same clock speed as its leading competitors.

------
klingonopera
I'm guessing the 128-bit multiplication implementation on the ARM architecture
isn't as well done as it is on the Intel platform?

You might be able to reclaim the performance if you manually implement the
multiplication using 64-bit variables instead.

~~~
klodolph
No, the compiler is generating good code. If you use a smaller word size you
just end up doing more multiplications (e.g. cut your word size in half, do 4x
as many multiplications).
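A sketch of why: the schoolbook decomposition of a 64x64->128 product into
32-bit limbs needs four 32x32->64 multiplications plus carry handling.

```c
#include <assert.h>
#include <stdint.h>

/* Halving the word size quadruples the multiply count: a 64x64->128
 * product built from 32-bit limbs needs four partial products plus
 * carry propagation (classic schoolbook decomposition). */
static void mul64_from_32(uint64_t a, uint64_t b,
                          uint64_t *lo, uint64_t *hi) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;            /* four partial products */
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    uint64_t mid  = p01 + (p00 >> 32); /* accumulate middle column */
    uint64_t mid2 = p10 + (uint32_t)mid;

    *lo = (mid2 << 32) | (uint32_t)p00;
    *hi = p11 + (mid >> 32) + (mid2 >> 32);
}
```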

------
rsp1984
Should we be surprised? The __int128_t/__uint128_t types are a nonstandard
compiler extension, hence there can be no expectation that operations on them
are implemented well or at all, let alone implemented in silicon.

~~~
ajross
I don't see what specification has to do with this. I mean, a 32-bit 2's
complement integer is _also_ a technically optional part of the C standard,
and indeed there is hardware that doesn't support multiplication on it with a
single instruction.

What's happening here isn't related to word size, really. It's that
multiplication, as an operation, is lossy. It produces 2 words worth of
output, not one. Traditionally, most RISC architectures have just skipped the
complexity and returned the product modulo the word space. But x86 didn't, it
put the product into two registers (specific registers: DX and AX, and their
subsequent extensions).

Most of the time this is just a quirk of the instruction set and an annoyance to
compiler writers. But sometimes (and this trick has been exploited on x86 for
decades for applications like hashing) it turns out to be really, really
useful.

~~~
rsp1984
Integer multiplication always carries the risk of integer overflow. Integer
overflow is undefined behavior in C, so it's the programmer's responsibility
to make sure it doesn't happen.

To that end in the example a __uint128_t was used, which is nonstandard, and
apparently not implemented all that well with the given combination of
compiler and ARM CPU. Given that we're talking about a 64-bit CPU, my argument
is that this is not very surprising.

~~~
klingonopera
> Integer Overflow is undefined behavior in C

 _Signed_ overflow is undefined behavior; unsigned overflow is defined in
both C and C++.

Apart from that, I agree with you. It has to do with the fact that OP is using
128-bit variables on a 64-bit architecture.

Come to think of it, it's actually more mesmerizing that x86 is _not_ slowed
down by a 128-bit variable. The ARM architecture is behaving as expected;
Intel is actually the odd one out.

Someone mentioned cryptography; I can imagine that because of it, Intel has a
few instructions to optimize arithmetic on wider integers, and that is
probably the reason for the anomaly, which is actually Intel and not ARM.

~~~
ajross
As mentioned upthread, the mesmerizing instruction in question is "MUL", which
debuted in 1978 on the 8086 and, except for register width, behaves
identically today.

~~~
klingonopera
I'm no expert, but shouldn't x86 then produce two 128-bit register values if
it multiplies two 128-bit integers, so totaling four register entries on a
64-bit architecture? If this were the case, Intel would slow down just as
much as ARM on a double-architecture-width multiplication, but it doesn't.
That's what I find mesmerizing. I'm guessing that Intel simply discards the
earlier double-register logic once it goes beyond architecture width, which
would explain the speed-up.

I.e. 64b * 64b = 2x64b register entries; according to MUL it should be 128b *
128b = 2x64b * 2x64b = 4x64b, but Intel discards this in favor of 128b * 128b
= 2x64b * 2x64b = 2x64b.

~~~
ajross
x86 can't multiply two 128 bit numbers at a time. But it can multiply two 64
bit numbers without losing the high 64 bits of the 128 bit product, which
makes the 128 bit multiplication much faster to implement.
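A sketch of that construction (the `u128` struct and `mul128_lo` names are
mine): because the hardware already provides the full 128-bit product of two
64-bit operands, the low 128 bits of a 128x128 product need only three
multiplies.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Truncated 128x128 multiply from 64-bit limbs: one full 64x64->128
 * product for the low limbs, plus the low halves of the two cross
 * terms. (The a.hi*b.hi term only affects bits above 128.) */
static u128 mul128_lo(u128 a, u128 b) {
    __uint128_t p = (__uint128_t)a.lo * b.lo;  /* full low product */
    uint64_t hi = (uint64_t)(p >> 64)
                + a.lo * b.hi                  /* cross terms, */
                + a.hi * b.lo;                 /* low halves only */
    u128 r = { (uint64_t)p, hi };
    return r;
}
```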

~~~
klingonopera
> x86 can't multiply two 128 bit numbers at a time.

What's happening here then? Are these not two 128-bit integers? One's a
64-bit value recast to 128-bit, the other a 128-bit constant. The code would
be doing faulty math if it just decided to drop any bits. Coincidence, maybe,
that the upper half of the recast value is in this case 0x0, but the code must
work for 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF as well, and probably does too.

    
    
      __uint128_t tmp;
      tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
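For context, the full generator from the linked benchmark looks roughly like
this, reconstructed from the constants quoted in this thread (the additive
constant 0x60bee2bee120fc15 is the one upstream wyhash uses; treat it as an
assumption). Note each call performs two 64x64->128 multiplications, the
operation whose cost differs between the ISAs:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the wyhash64 generator from the post, with the state made
 * an explicit parameter instead of a global. */
static uint64_t wyhash64_step(uint64_t *state) {
    *state += 0x60bee2bee120fc15ULL;           /* counter update */
    __uint128_t tmp = (__uint128_t)*state * 0xa3b195354a39b70dULL;
    uint64_t m1 = (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;  /* fold */
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ULL;
    return (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
}
```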

------
glangdale
CSA: Don't fixate on the code generated by Godbolt in isolation; it's not
going to reflect what happens in the benchmark loop.

(see my other reply to dmitrygr for more details)

------
dmitrygr
The article is very light on details: it contains zero citations and only a
single result of a single benchmark, with no details of how it was run. It
then states the author's theory as to why this happens as fact (again with no
citations). The author does not even offer us a clue as to which ARM core is
used. The claim is:

    
    
      > The difference is that the computation
      > of the most significant bits of a
      > 64-bit product on an ARM processor
      > requires a separate and expensive
      > instruction.
    

I see no proof of this anywhere in the ARMv8 spec. You get the lower 64 bits
of the result using _MUL_ and the higher 64 bits using _UMULH_. Neither of
those is that expensive.

Looking at [1] we can see that MUL has a throughput of 1 and a latency of 3;
UMULH has 1/4 and 6, but as long as you do not issue another multiply right
after your UMULH, this 1/4 throughput is easily hidden, since only the
multiplier is busy and the rest of the CPU can go on. So unless your entire
loop is under 6 cycles, or you simply have no non-multiply instructions to
schedule within 3 cycles of the UMULH, it shouldn't matter. Given those large
constants that need to be loaded, each needing 4 instructions
(mov+movk+movk+movk), there are plenty of instructions to schedule after
UMULH. Either the OP's compiler messed up, or something entirely different is
going on.

If the author was using a weaker in-order core, say a Cortex-A55, still more
performance would be expected than appears demonstrated. There [2] the low
part is calculated in 2 or 3 cycles, the high part in 4. But comparing an ARM
in-order little core to a modern OoO x86 is just not fair.

EDIT: Indeed, looking [3] at what gcc produces for this code is sad. For
example, why it bothers to synthesize 0x1b03738712fad5c9 before issuing the
first UMULH is unclear, but it _IS_ stupid.

EDIT2: on Skylake [4] MUL has a latency of 3, so faster than on ARM but not by
that much. I'd guess the constant loading on ARM, at 4 instructions per
constant, hurts more than UMULH does.

EDIT3: in comments on original site, author said the ARM chip being used is a
"Skylark" by "Ampere Computing" [5]. Given that I cannot find any info on that
microarchitecture, I cannot say more about why it might be slow.

[1] Cortex®-A72 Software Optimization Guide:
[https://static.docs.arm.com/uan0016/a/cortex_a72_software_op...](https://static.docs.arm.com/uan0016/a/cortex_a72_software_optimization_guide_external.pdf)

[2] Cortex®-A55 Software Optimization Guide:
[https://static.docs.arm.com/epm128372/20/arm_cortex_a55_soft...](https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf)

[3] Godbolt for this code:
[https://godbolt.org/z/UeOo6C](https://godbolt.org/z/UeOo6C)

[4] Lists of instruction latencies, throughputs and micro operation breakdowns
for Intel, AMD and VIA CPUs:
[https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf)

[5] Skylark - Microarchitectures - AppliedMicro:
[https://en.wikichip.org/wiki/apm/microarchitectures/skylark](https://en.wikichip.org/wiki/apm/microarchitectures/skylark)

~~~
glangdale
I think Daniel's use of the word "separate" in "separate and expensive" is
ill-advised, as it implies a critique of ARM's ISA design in a way that isn't
relevant _for this case_. One might be concerned if you needed all 128 bits in
some other use, but not here.

As for loading large constants, if you read the post and follow the link at
"reuse my benchmark" ([https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/03/20/fastestrng.cpp](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/03/20/fastestrng.cpp))
you will see that these functions as _measured_ are inside hot loops. As such,
constant loading is very likely to be hoisted out of those loops on both
architectures.

This will make the _considerably_ slower UMULH stick out like a sore thumb.
Also note that the measurement loop allows most of the work of each iteration
to be done in parallel - the work of the rng is a long dependency chain within
the calculation but the update of the seed is quick and independent of that.

I would guess that the Ampere box has a wretchedly slow multiply. In a comment
on the post, Daniel finds an ugly performance corner on A57 (possibly related,
possibly not): "On a Cortex A57 processor, to compute the most significant 64
bits of a 64-bit product, you must use the multiply-high instructions (umulh
and smulh), but they require six cycles of latency and they prevent the
execution of other multi-cycle instructions for an additional three cycles."

~~~
brandmeyer
There could be an instruction scheduler impact here as well. Intel processors
are known for having an uncommonly deep execution window.

It turns out that the Nth wyhash64_x doesn't depend on any of the multiplies
in the (N-1)th iteration. It only depends on the addition of the zeroth-order
constant.

So, with a sufficiently deep pipeline, the instruction scheduler can
effectively be in the middle of several of those wyhash iterations all at the
same time, thus hiding nearly all of the hash's latency by using the other
iterations to do it.

Such are the perils of micro-benchmarking.

~~~
glangdale
Indeed. Of course, the idea that this is invalid implies that "real"
application code (whatever that is) would be designed to have a sequential
dependency on a single wyhash64 result and to sit on its thumbs waiting.
Maybe, and maybe not. One can make up any argument one likes.

------
deepsun
Uhm, isn't the article title obvious? It would be rather surprising if they
had the same characteristics.

~~~
microcolonel
Sure, but it may not be intuitive to people that one function could be exactly
as fast on two very different processors, but another very similar function
would be orders of magnitude different.

~~~
tyingq
This was pretty common in the RISC heyday. The competition meant lots of
jockeying to lead benchmarks from Sparc, Alpha, PA-RISC, Power, etc.

This seems to be on the rise again with AMD and ARM being closer to Intel in
servers than they were in the recent past.

------
TheMagicHorsey
Novice here... but how do these results give random results? Is the
uninitialized memory considered random? Or is there some other source of
randomness? It seems like a deterministic function to me if the variables are
initialized to zero.

~~~
aidenn0
They are deterministic.

~~~
TheMagicHorsey
Why wouldn't one use some sort of pseudorandom seed instead of just
uninitialized memory? Couldn't one sample a clock, image sensor, thermometer
or some other sensor that would have a random value to use as a seed? Seems
like a part of memory allocated by the compiler might always be zero.

~~~
wahern
There's no uninitialized memory. File-scoped variables are initialized to 0 in
C.
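A minimal illustration (the `seed_from_clock` helper is hypothetical, just to
show how one could vary runs; a benchmark wants the deterministic default):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Objects with static storage duration start at zero in C (C11
 * 6.7.9p10), so the benchmark's generator state is deterministic by
 * default -- no uninitialized memory is involved. */
static uint64_t wyhash64_x;   /* file scope: guaranteed 0 at startup */

/* Hypothetical seeding for non-reproducible runs: */
static void seed_from_clock(void) {
    wyhash64_x = (uint64_t)time(NULL);
}
```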

------
olliej
I am very confused: isn't mul a 64-bit multiply on arm64? Or is this a
comparison of a 64-bit processor to a 32-bit one?

~~~
monocasa
He's trying to get a 128bit result from a 64x64 multiply.

~~~
olliej
ahhhh ok

------
amelius
Are there differences in speculative execution?

