
ARM64 has separate instructions for computing the low and high halves of a product, in keeping with its single destination register approach. x86-64 has a single instruction that computes both halves simultaneously, writing to two registers.

It would have been simpler for the author to show us the difference in multiplication performance directly. Here the benchmarks compare two completely different hash functions, and we have to take the author's word that it is this instruction that is causing the performance discrepancy.

On a modern out-of-order uarch, that kind of "complex" instruction would be split into two micro-operations, one for each result register. Agner's tables [1] confirm that the 64x64 => 128 mul is split into two micro-ops on Skylake. So it doesn't give any strong advantage.

[1] https://www.agner.org/optimize/instruction_tables.pdf

Yes, but the second uop is not expensive like the first in this case. That is, it seems like the full multiplication is done by the latency-3 op on p1, and the other uop is just needed to move the high half of the result to the destination (indeed, instructions with 2 outputs always need 2 uops due to the way the renamer works). The whole 64x64->128 multiplication still has a latency of only 3, and a throughput of 1 per cycle.

So the 64x64->128 multiplication is still quite efficient compared to ARM, where two "full strength" multiplications are needed. It is odd, though, that there is nearly a 20x difference in relative speeds; I wouldn't expect multiply-upper to be that slow.

Note: the test seems to have been done on Skylark (aka Ampere), which is a non-standard ARM core. I can't find any documentation on Skylark's latency / throughput specifications.
