
I don't see what the specification has to do with this. I mean, a 32-bit 2's complement integer is also a technically optional part of the C standard, and indeed there is hardware that can't multiply them with a single instruction.

What's happening here isn't really about word size. It's that multiplication, as an operation, produces two words' worth of output, not one, so keeping only a single word is lossy. Traditionally, most RISC architectures have just skipped the complexity and returned the product modulo the word size. But x86 didn't: it put the full product into two registers (specific registers: DX and AX, and their subsequent extensions).

Most of the time this is just a quirk of the instruction set and an annoyance to compiler writers. But sometimes (and this trick has been exploited on x86 for decades for applications like hashing) it turns out to be really, really useful.
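Roughly, a hash mix step of the kind being discussed looks like this (illustrative names and constant, not the article's code); the 64x64 multiply yields a full 128-bit product, which on x86-64 typically compiles to a single MUL (or MULX):

  #include <stdint.h>

  /* Illustrative mix step: keep both halves of the 128-bit product. */
  static uint64_t mix64(uint64_t x, uint64_t k)
  {
      __uint128_t p = (__uint128_t)x * k;        /* full 128-bit product */
      return (uint64_t)p ^ (uint64_t)(p >> 64);  /* fold high half into low */
  }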




The other thing x86 has which ARM doesn't is a double-width divide instruction that also outputs the remainder.

Interestingly, MIPS does a similar thing to x86 with a dedicated register for its multiplier/divider: http://chortle.ccsu.edu/assemblytutorial/Chapter-14/ass14_9....
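As a rough illustration of the divide point (a sketch, not code from the article): computing / and % on the same operands is something a compiler can typically service with one x86 DIV, which produces quotient and remainder together.

  #include <stdint.h>

  /* Hypothetical helper: quotient and remainder of the same division.
     On x86-64 a compiler can typically compute both with a single DIV,
     leaving the quotient in RAX and the remainder in RDX. */
  static void divmod64(uint64_t a, uint64_t b, uint64_t *q, uint64_t *r)
  {
      *q = a / b;
      *r = a % b;
  }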


Integer multiplication always carries the risk of integer overflow. Integer overflow is undefined behavior in C, so it's the programmer's responsibility to make sure it doesn't happen.

To that end, the example uses a __uint128_t, which is nonstandard, and apparently not implemented all that well by the given combination of compiler and ARM CPU. Given that we're talking about a 64-bit CPU, my argument is that this is not very surprising.


Again, I think you're looking at this backwards. C's undefined behavior rules exist because targeted hardware didn't have common behavior, whereas you seem to be arguing the reverse.

I mean, I can't sit here and tell you what to be surprised about, but to me, as someone interested in how machines behave, it's absolutely interesting and "surprising"[1] that one machine with an otherwise very similar architecture can be 8x slower than another on straightforward code. And trying to wave that away with "but the spec never promised!" seems like it's only going to get you in this kind of trouble (but "unsurprising trouble", I guess) more, and not less.

[1] Not here, precisely, since I already understand what's going on, but in general.


Undefined behavior occurs when you cannot reasonably optimize without invoking it. What you are thinking of is implementation-defined behavior.


> Integer Overflow is undefined behavior in C

Signed overflow is undefined behavior; unsigned overflow is defined (it wraps modulo 2^N) in both C and C++.
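A quick illustration of the distinction (not from the article):

  #include <limits.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t u = UINT64_MAX;
      /* Unsigned overflow is defined: wraps modulo 2^64, so this prints 0. */
      printf("%llu\n", (unsigned long long)(u + 1));

      int s = INT_MAX;
      /* Signed overflow is undefined behavior: the compiler may assume
         s + 1 never overflows, so don't rely on it wrapping to INT_MIN. */
      (void)s;
      return 0;
  }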

Apart from that, I agree with you. It has to do with the fact that OP is using 128-bit variables on a 64-bit architecture.

Come to think of it, it's actually more mesmerizing that x86 is not slowed down by a 128-bit variable. The ARM architecture is behaving as expected; Intel is actually the odd one out.

Someone mentioned cryptography; I can imagine that because of it, Intel has a few instructions to speed up integer arithmetic on wider integers, and that is probably the reason for the anomaly, which is really Intel's and not ARM's.


As mentioned upthread, the mesmerizing instruction in question is "MUL", which debuted in 1978 on the 8086 and, except for register width, behaves identically today.


I'm no expert, but shouldn't x86 then produce two 128-bit register entries if it multiplies two 128-bit integers, totaling four register entries on a 64-bit architecture? If that were the case, Intel would slow down just as much as ARM on a double-the-architecture-width multiplication, but it doesn't. That's what I find mesmerizing. I'm guessing that Intel simply discards the earlier double-register logic once it goes beyond the architecture width, which would explain the speed-up.

I.e. 64b * 64b = 2x64b register entries; by the MUL logic, 128b * 128b = 2x64b * 2x64b should be 4x64b, but Intel discards this in favor of 128b * 128b = 2x64b * 2x64b = 2x64b.


x86 can't multiply two 128-bit numbers at a time. But it can multiply two 64-bit numbers without losing the high 64 bits of the 128-bit product, which makes the 128-bit multiplication much faster to implement.
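To make that concrete, here's a rough sketch (illustrative names, not the article's code) of how the low 128 bits of a 128x128 product can be assembled from 64-bit pieces, the way a compiler might lower it: only one widening 64x64->128 multiply is needed, plus two ordinary truncated multiplies and adds.

  #include <stdint.h>

  typedef struct { uint64_t lo, hi; } u128;

  /* Low 128 bits of a * b, built from 64-bit limbs. */
  static u128 mul128(u128 a, u128 b)
  {
      __uint128_t low = (__uint128_t)a.lo * b.lo;   /* one widening MUL */
      uint64_t hi = (uint64_t)(low >> 64)
                  + a.lo * b.hi                     /* truncated multiplies */
                  + a.hi * b.lo;
      return (u128){ (uint64_t)low, hi };
  }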


> x86 can't multiply two 128 bit numbers at a time.

What's happening here, then? Are these not two 128-bit integers? One is a 64-bit value recast to 128 bits, the other a 128-bit constant. The code would be doing faulty math if it just decided to drop any bits. Maybe it's a coincidence that the upper half of the recast value is 0x0 in this case, but the code must work for 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF as well, and probably does too.

  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;


32-bit systems do have long long in the standard to do 64-bit arithmetic, yet they have exactly the same issue on ARM.



