
As mentioned upthread, the mesmerizing instruction in question is "MUL", which debuted in 1978 on the 8086 and, except for register width, behaves identically today.

I'm no expert, but shouldn't x86 then produce two 128-bit results' worth of registers if it multiplies two 128-bit integers, so four registers in total on a 64-bit architecture? If that were the case, Intel would slow down just as much as ARM when multiplying integers twice the architecture width, but it doesn't. That's what I find mesmerizing. I'm guessing that Intel simply discards the earlier double-register logic once it goes beyond architecture width, which would explain the speedup.

I.e., 64b * 64b = 2x64b registers; following MUL's pattern, 128b * 128b = 2x64b * 2x64b should give 4x64b, but Intel discards this in favor of 128b * 128b = 2x64b * 2x64b = 2x64b.

x86 can't multiply two 128 bit numbers at a time. But it can multiply two 64 bit numbers without losing the high 64 bits of the 128 bit product, which makes the 128 bit multiplication much faster to implement.
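To make that concrete, here is a rough sketch of how a 128-bit multiply can be composed from 64-bit ones. The `u128` struct and the names `mul64_wide` and `mul128_lo` are made up for illustration, and `__uint128_t` is a GCC/Clang extension rather than standard C. I'd expect the widening multiply to compile to a single MUL (or MULX) on x86-64, but check the assembly for your compiler before relying on that.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value held as two 64-bit halves,
   the way it would occupy two general-purpose registers. */
typedef struct { uint64_t lo, hi; } u128;

/* Widening multiply: 64 x 64 -> 128 bits, keeping the high half.
   Assumes a compiler that provides __uint128_t (GCC, Clang). */
static u128 mul64_wide(uint64_t a, uint64_t b) {
    __uint128_t p = (__uint128_t)a * b;
    u128 r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* 128 x 128 -> low 128 bits, built from 64-bit multiplies:
     a * b mod 2^128 = a.lo*b.lo + ((a.lo*b.hi + a.hi*b.lo) << 64)
   The a.hi*b.hi term only affects bits 128 and above, so it is dropped. */
static u128 mul128_lo(u128 a, u128 b) {
    u128 r = mul64_wide(a.lo, b.lo);     /* the one multiply that needs its high half */
    r.hi += a.lo * b.hi + a.hi * b.lo;   /* cross terms, wrapping mod 2^64 on purpose */
    return r;
}
```

Only the first product needs its upper 64 bits. Without a widening multiply, that one product would itself have to be split into four 32-bit pieces with manual carry handling, which is where the extra cost comes from.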

> x86 can't multiply two 128 bit numbers at a time.

What's happening here then? Are these not two 128-bit integers? One is a 64-bit value cast to 128-bit, the other a 128-bit constant. The code would be doing faulty math if it just decided to drop bits. It may be a coincidence that the upper half of the cast value is 0x0 in this case, but the code must work for 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF as well, and probably does.

  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
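For what it's worth, here is a small sketch of why no bits get dropped in that snippet. The constant 0xa3b195354a39b70d has 16 hex digits, so it fits in 64 bits, and the cast zero-extends the 64-bit state, so both operands have an upper half of zero. The function names below are hypothetical, `x` stands in for `wyhash64_x`, and `__uint128_t` is assumed to be available (GCC or Clang).

```c
#include <stdint.h>

/* The multiply from the snippet, with the state passed as a parameter. */
static __uint128_t mul_as_written(uint64_t x) {
    return (__uint128_t)x * 0xa3b195354a39b70d;
}

/* A general 128 x 128 -> low 128 bits multiply over explicit halves.
   The a_hi*b_hi term is omitted because it only affects bits >= 128. */
static __uint128_t mul_full(uint64_t a_lo, uint64_t a_hi,
                            uint64_t b_lo, uint64_t b_hi) {
    __uint128_t low = (__uint128_t)a_lo * b_lo;   /* 64 x 64 -> 128 */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;   /* wraps mod 2^64 on purpose */
    return low + ((__uint128_t)cross << 64);
}
```

With `a_hi` and `b_hi` both zero the cross term vanishes, so `mul_full(x, 0, 0xa3b195354a39b70d, 0)` equals `mul_as_written(x)` and a single 64 x 64 -> 128 multiply is exact. If the upper halves were nonzero, as in the all-ones case, the compiler would have to emit the extra multiplies and adds, and the result would be the product modulo 2^128.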
