That, along with the fact that earlier ARMs didn't even have a divide instruction, is one of the reasons I'm not at all a fan of the "RISC approach". Division may not have been easy to implement, but microcoding it would've been at least as fast as doing the algorithm in software while taking up far less space. And with hardware advances that move operations from microcode into direct execution, existing software that already uses those instructions automatically gets faster on newer hardware, with no other action needed.
Division hardware will almost always generate the remainder along with the quotient, but ARM would either need to add another instruction (no benefit to existing software, only to new software that knows how to use it) or try to detect patterns of simpler instructions doing the same thing and replace them with one "fused uop" (much more difficult to do).
Earlier versions of ARM didn't have a divide instruction because of die space and the fact that it's rarely needed. Integer division is usually by a constant, so a shift and/or a multiplication suffices. There's no need to microcode a divide instruction either, because the kernel can just trap the undefined-instruction exception (x86's #UD) and perform the division itself, just like soft float works. (Yeah, it's slow, but software division is always slow.) Modern designs like RISC-V do this properly: expose division as an optional extension so it scales down to microcontrollers.
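To illustrate the "shift and/or a multiplication" point: compilers strength-reduce division by a constant into a multiply-high plus a shift. A minimal sketch for dividing by 10 (the function name is mine; the magic constant is the standard one, ceil(2^67 / 10)):

    #include <stdint.h>

    /* Unsigned division by the constant 10, done as a multiply-high
     * plus shift, the way compilers lower it automatically:
     * (x * 0xCCCCCCCCCCCCCCCD) >> 67 == x / 10 for every 64-bit x.
     * No divide instruction needed. */
    uint64_t div10(uint64_t x) {
        return (uint64_t)(((__uint128_t)x * 0xCCCCCCCCCCCCCCCDull) >> 67);
    }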
I've just provisioned an ARM server so I can experiment with generating ARM assembly as well as Intel. No doubt I'll have fun learning the different instructions available; until now I've only ever written assembly for Intel and Z80 processors.
Java might be a counter-example but I’m not sure you can separate that from all of the other dubious decisions which happened over the years there.
Hah. That's interesting. I think that the cheaper variants of the AMD 29000 had the same feature.
A voice in my head tells me that divides for realtime video transforms used to really be a trip.
So the 64x64->128 multiplication is still quite efficient compared to ARM, where two "full strength" multiplications are needed. It is odd, though, that there is nearly a 20x difference in relative speeds; I wouldn't expect multiply-upper to be that slow.
like this one: https://github.com/llvm-mirror/compiler-rt/blob/master/lib/b...
Wyhash is built around the 128-bit result (sketched below). This is fast on Intel, but slower on ARM.
This patch shows that using it can result in a large performance improvement:
upd: apparently for reasons like:
> So I guess for most of the case loading or storing i128, the data will be used by some library functions running on cores instead of NEON, so storing i128 to two GPR64 is more general.
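For reference, the wyhash64 being discussed is essentially two such 64x64->128 products with the halves XOR-folded. A sketch (the multiplier constants are the ones quoted elsewhere in this thread; the increment constant is from memory and may differ from the benchmark's):

    #include <stdint.h>

    uint64_t wyhash64_x; /* state; can be seeded with any value */

    /* Sketch of wyhash64: one add, then two 64x64->128 multiplies,
     * each folded down to 64 bits with an XOR of its halves. */
    uint64_t wyhash64(void) {
        wyhash64_x += 0x60bee2bee120fc15ull;
        __uint128_t t = (__uint128_t)wyhash64_x * 0xa3b195354a39b70dull;
        uint64_t m1 = (uint64_t)(t >> 64) ^ (uint64_t)t;
        t = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
        return (uint64_t)(t >> 64) ^ (uint64_t)t;
    }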
That's polynomial multiply. It's (almost) multiplication in GF(2), as used for elliptic curves. It's not a "normal" multiply.
"PMULL" is basically a bitshift and XOR. Your traditional "MUL" is bitshift and ADD. Its called "polynomial multiply" because bitshift-and-xor has very similar properties to bitshift-and-add (it distributes over XOR, associative, communative, etc. etc).
Bitshift-and-xor has a few features that are better for cryptography. But its NOT the multiplication you are taught in grade school.
EDIT: With that being said... those "better features" for cryptography would probably make PMULL a better function for random-number generation. PMULL will return a different result than the real multiplication, but you'll have an easier time making a field (aka: reversible 1-to-1 bijections) out of PMULL than MUL...
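To make the shift-and-XOR point concrete, here's carry-less (polynomial) multiplication in portable C; a sketch of what the low half of PMULL computes in one instruction:

    #include <stdint.h>

    /* Carry-less multiply: grade-school long multiplication, except the
     * partial products are combined with XOR instead of ADD, so no
     * carries propagate. Returns the low 64 bits of the result. */
    uint64_t clmul64_low(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            if ((b >> i) & 1)
                r ^= a << i;  /* shift-and-XOR; MUL would do r += a << i */
        return r;
    }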
I see "pmull" (vmull_p64), but that's polynomial multiply, which is used for Galois Field multiplication. Not "normal" multiplication.
You can search for "uint64" to find all NEON intrinsics that take a 64-bit integer (ex: uint64x2_t). I personally didn't see any 64x64->128-bit standard multiply in NEON.
You might be able to reclaim the performance if you manually implement the multiplication using 64-bit variables instead.
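That manual route would look something like this; a sketch that builds the full 64x64->128 product from four 32x32->64 multiplies, using only 64-bit variables (whether it beats the native __uint128_t path depends on the core):

    #include <stdint.h>

    /* Full 64x64 -> 128-bit product without __uint128_t. */
    void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
        uint64_t p0 = a_lo * b_lo;  /* bits   0..63  */
        uint64_t p1 = a_lo * b_hi;  /* bits  32..95  */
        uint64_t p2 = a_hi * b_lo;  /* bits  32..95  */
        uint64_t p3 = a_hi * b_hi;  /* bits  64..127 */
        /* mid cannot overflow: it is at most ~2^34 */
        uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
        *lo = (mid << 32) | (uint32_t)p0;
        *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    }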
What's happening here isn't really related to word size. It's that multiplication, as an operation, doesn't fit its result in one word: it produces two words' worth of output, not one. Traditionally, most RISC architectures have just skipped the complexity and returned the product modulo the word size. But x86 didn't: it put the product into two registers (specific registers: DX and AX, and their subsequent extensions).
Most of the time this is just a quirk of the instruction set and an annoyance to compiler writers. But sometimes (and this trick has been exploited on x86 for decades for applications like hashing) it turns out to be really, really useful.
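One long-standing instance of that trick: reducing a 64-bit hash into [0, n) with a multiply instead of a modulo, where the answer is exactly the high word the hardware produces anyway. A sketch (the function name is mine):

    #include <stdint.h>

    /* Map x into [0, n) without a division: x * n < 2^64 * n, so the
     * high 64 bits of the product are always < n. On x86-64 this is a
     * single widening MUL; the result is the high register. */
    uint64_t fast_range(uint64_t x, uint64_t n) {
        return (uint64_t)(((__uint128_t)x * n) >> 64);
    }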
Interestingly, MIPS does a similar thing to x86 with a dedicated register for its multiplier/divider: http://chortle.ccsu.edu/assemblytutorial/Chapter-14/ass14_9....
To that end, the example used a __uint128_t, which is nonstandard and apparently not implemented all that well with the given combination of compiler and ARM CPU. Given that we're talking about a 64-bit CPU, my argument is that this is not very surprising.
I mean, I can't sit here and tell you what to be surprised about, but to me, as someone interested in how machines behave, it's absolutely interesting and "surprising" that one machine with an otherwise very similar architecture can be 8x slower than another on straightforward code. And trying to wave that away with "but the spec never promised!" seems like it's only going to get you into this kind of trouble (though "unsurprising trouble", I guess) more, not less.
Not here, precisely, since I already understand what's going on, but in general.
Signed overflow is undefined behavior; unsigned overflow is defined. That's true in both C and C++.
Apart from that, I agree with you. It has to do with the fact that OP is using 128-bit variables on a 64-bit architecture.
Come to think of it, it's actually more striking that x86 is not slowed down by a 128-bit variable. The ARM architecture is behaving as expected; Intel is actually the odd one out.
Someone mentioned cryptography; I can imagine that, because of it, Intel has a few instructions to optimize integer arithmetic on wider integers, and that is probably the reason for the anomaly (which is Intel, not ARM).
I.e., MUL gives 64b * 64b -> two 64b registers; by that logic, 128b * 128b = 2x64b * 2x64b should yield 4x64b, but Intel truncates it: 128b * 128b = 2x64b * 2x64b -> 2x64b.
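Concretely, the low 128 bits of such a product decompose into three 64-bit multiplies; a sketch of the lowering (variable names are mine), which also shows why the wyhash line needs only a single widening multiply when the high halves are known to be zero:

    #include <stdint.h>

    /* Low 128 bits of a 128x128-bit product. a_hi*b_hi only affects
     * bits 128 and up, so it is dropped; the cross terms are needed
     * only modulo 2^64. With a_hi == b_hi == 0 (a zero-extended
     * uint64_t times a 64-bit constant), only a_lo * b_lo remains. */
    __uint128_t mul128_lo(__uint128_t a, __uint128_t b) {
        uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
        uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
        __uint128_t p = (__uint128_t)a_lo * b_lo;
        p += (__uint128_t)(a_lo * b_hi + a_hi * b_lo) << 64;
        return p;
    }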
What's happening here then? Are these not two 128-bit integers? One is a 64-bit value recast to 128-bit, the other a 128-bit constant. The code would be doing faulty math if it just decided to drop any bits. It may be a coincidence that the upper half of the recast value is 0x0 in this case, but the code must work for 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF as well, and probably does too.
tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
(see my other reply to dimitrgy for more details)
> The difference is that the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and expensive instruction.
Looking at the Cortex-A72 Software Optimization Guide [1], we can see that MUL has a throughput of 1 and a latency of 3, while UMULH has 1/4 and 6. But as long as you do not issue another multiply just after your UMULH, this 1/4 throughput is easily hidden: only the multiplier is busy, and the rest of the CPU can go on. So unless your entire loop is under 6 cycles, or you simply have no multiply-free instructions to schedule in the 3 cycles after the UMULH, it shouldn't matter. Given those large constants that need to be loaded, each needing 4 instructions (mov+movk+movk+movk), there are plenty of instructions to schedule after UMULH. Either OP's compiler messed up, or something entirely different is going on.
If the author was using a weaker in-order core, say a Cortex-A55, we'd still expect more performance than appears demonstrated. There [2], the low part is calculated in 2 or 3 cycles, the high in 4. But comparing an ARM in-order little core to a modern OoO x86 is just not fair.
EDIT: Indeed, looking [3] at what gcc produces for this code is sad. For example, why it bothers synthesizing 0x1b03738712fad5c9 before issuing the first UMULH is unclear, but it IS stupid.
EDIT2: On Skylake [4], MUL has a latency of 3, so faster than on ARM but not by that much. I'd guess the constant loading on ARM, at 4 instructions per constant, hurts more than UMULH does.
EDIT3: In the comments on the original site, the author said the ARM chip being used is a "Skylark" by "Ampere Computing" [5]. Given that I cannot find any info on that microarchitecture, I cannot say more about why it might be slow.
[1] Cortex®-A72 Software Optimization Guide: https://static.docs.arm.com/uan0016/a/cortex_a72_software_op...
[2] Cortex®-A55 Software Optimization Guide: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_soft...
[3] Godbolt for this code: https://godbolt.org/z/UeOo6C
[4] Lists of instruction latencies, throughputs and micro operation breakdowns for Intel, AMD and VIA CPUs: https://www.agner.org/optimize/instruction_tables.pdf
[5] Skylark - Microarchitectures - AppliedMicro: https://en.wikichip.org/wiki/apm/microarchitectures/skylark
As for loading large constants: if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions, as measured, are inside hot loops. As such, constant loading is very likely to be hoisted out of those loops on both architectures.
This makes the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel: the work of the RNG is a long dependency chain within each calculation, but the update of the seed is quick and independent of it.
I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."
It turns out that the Nth wyhash64_x doesn't depend on any of the multiplies in the (N-1)th iteration. It only depends on the addition of the zeroth-order constant (sketched below).
So, with a sufficiently deep pipeline, the instruction scheduler can effectively be in the middle of several of those wyhash iterations all at the same time, thus hiding nearly all of the hash's latency by using the other iterations to do it.
Such are the perils of micro-benchmarking.
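To see the independence, unroll two iterations (constants follow the wyhash64 sketch upthread; a sketch, not the benchmark's actual code):

    #include <stdint.h>

    /* The only value carried from iteration 1 into iteration 2 is x,
     * via a single add, so x2 is ready long before out1 and an OoO
     * core can run both multiply chains concurrently. */
    uint64_t two_iterations(uint64_t x0, uint64_t *out1) {
        uint64_t x1 = x0 + 0x60bee2bee120fc15ull;   /* carried dep: one add */
        __uint128_t t = (__uint128_t)x1 * 0xa3b195354a39b70dull;
        uint64_t m = (uint64_t)(t >> 64) ^ (uint64_t)t;
        t = (__uint128_t)m * 0x1b03738712fad5c9ull;
        *out1 = (uint64_t)(t >> 64) ^ (uint64_t)t;  /* end of chain 1 */

        uint64_t x2 = x1 + 0x60bee2bee120fc15ull;   /* independent of chain 1 */
        t = (__uint128_t)x2 * 0xa3b195354a39b70dull;
        m = (uint64_t)(t >> 64) ^ (uint64_t)t;
        t = (__uint128_t)m * 0x1b03738712fad5c9ull;
        return (uint64_t)(t >> 64) ^ (uint64_t)t;   /* end of chain 2 */
    }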
In my experience he's quite willing to answer any questions you leave as a comment, also.
I think the author writes most of his posts as food for thought and a starting point for further discussion rather than the final word on anything.
I'm not sure it would help much, though. The algorithm is over-serialized. There's a bare minimum latency of 3x ALU ops + 2x MUL. It just so happens that the 2x high-result multiplies are particularly high-latency on this core, and he must also pay the pipeline piper of having only one multiplier pipe available.
EDIT: reply to comment below: you are right, I misread
If you look at the whole section, all of the multiplication results that take more than 3 cycles stall the multiplier pipe for N-3 cycles.
Amendment: Clang's decision to schedule the mul right after the umulh would also appear to be terrible. But in fact, I think that if the umulh enters on cycle 0, that the mul enters on cycle 3, the umulh's result appears on cycle 5 and the mul's result appears on cycle 6. So, it has the same total latency through the pipe as mul followed by umulh: 7 cycles.
Here is the clang aarch64 output: https://godbolt.org/z/nH0y7F
> I see no proof of this anywhere.
Huh. I see proof in your comment.
> UMULH has 1/4 and 6
expensive (relative to MUL)? Check.
Since the wyhash function needs two 64x64->128 multiplications, you'll need two high and two low muls on ARM, so this is pretty much a dense multiplication benchmark. No amount of scheduling can save you.
Still, by my calculation that should only put the ARM chip at about a 5x disadvantage in multiplication throughput: per call, the two MULs plus two UMULHs occupy the multiplier for roughly 2*1 + 2*4 = 10 cycles, versus about 2 cycles for two MULs on x86. Yet it was nearly 18x slower. The frequency difference probably explains some of that, but not all.
This seems to be on the rise again with AMD and ARM being closer to Intel in servers than they were in the recent past.