This is a bit misleading. Quotient and modulo with constant right-hand side is r...

Someone · 2024-07-16T22:18:56 1721168336

> Quotient and modulo with a variable RHS is expensive

For dividing ‘big’ bignums by ‘small’ divisors, I would use/tweak libdivide (https://libdivide.com/), and spend the time to optimize the division.

For dividing ‘big’ bignums by ‘big’ divisors, that may not be worth it.

(Determining the correct values of the various ‘big’ and ‘small’ values left as an exercise for the reader. Those will depend on the hardware at hand)

anonymoushn · 2024-07-17T00:08:46 1721174926

The oldish version of gcc used by highload.fun seems to stop performing this optimization after the 5th or so constant divisor within a small function body. Always good to double check :(

worstspotgain · 2024-07-16T22:19:39 1721168379

The reason a division by 1000000 can be efficiently turned into a multiplication is because you're targeting a system that has a 32x32->64 multiplication instruction, which has plenty of room to spare. But if you have that, you would be better off using e.g. 2^32 as your modulo.

colonwqbang · 2024-07-17T21:06:23 1721250383

I agree that 2^32 is better, but the reason is not that division is too expensive because OPs algorithm after standard optimisation does not need any division.

Clearly choosing a base type that is fast on the machine, is key. But I feel even a 64/32 divide with constant operand will be optimised into something reasonably fast on most systems. If you have a counterexample it would be interesting to hear.

A system without fast divide likely doesn't have any floats either. So all arithmetic would be fixed point and 32x32->64 multiplication would be a pretty common operation.

worstspotgain · 2024-07-17T22:39:15 1721255955

I just meant that the compiler in your example is computing x / 1000000 as follows:

   *q = (x * 1125899907) >> 50;

This works out OK because there's a 32x32->64 instruction, arm64's umull in this case. If it didn't have one, and it had a 32/32->32 division instruction, it might have emitted that instead of hand-rolling the same scheme using 32x32->32 multiplications.

Note that it's technically an arithmetic approximation that just happens to work for all possible values of x, since obviously 1<<50 is only divisible by 1<<(0 to 50).

colonwqbang · 2024-07-18T17:32:03 1721323923

I see. However, 32x32->64 multiply has been common on 32-bit processors since the mid 80's. It would be interesting to know if somebody is still building new products on (32-bit) processors that lack it.