
How do compilers optimize divisions? - moyix
https://zneak.github.io/fcd/2017/02/19/divisions.html
======
Someone
On this subject, these three blog posts are hard to beat:

[http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html](http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html)

[http://ridiculousfish.com/blog/posts/labor-of-division-episode-ii.html](http://ridiculousfish.com/blog/posts/labor-of-division-episode-ii.html)

[http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html](http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html)

The same author wrote a library for doing this (and more; it also uses bit
shifts) at runtime: [http://libdivide.com](http://libdivide.com)

Of course, that only makes sense if you know you will be doing lots of
divisions by the same number.
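
A minimal sketch of libdivide's C API (from memory; check libdivide.h for the
exact names and signatures):

    #include <stddef.h>
    #include <stdint.h>
    #include "libdivide.h"
    
    /* Precompute the magic constants once, then reuse them for many divisions. */
    void divide_all(uint32_t *data, size_t n, uint32_t d) {
        struct libdivide_u32_t div = libdivide_u32_gen(d);
        for (size_t i = 0; i < n; i++)
            data[i] = libdivide_u32_do(data[i], &div);
    }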

------
glangdale
Hacker's Delight - an excellent book in general - has an extensive chapter on
this, for those who are interested in this topic.

[http://www.hackersdelight.org/](http://www.hackersdelight.org/)

------
WalterBright
Here's how the dmd D compiler does it:

[https://github.com/dlang/dmd/blob/master/src/backend/divcoeff.c](https://github.com/dlang/dmd/blob/master/src/backend/divcoeff.c)

~~~
metaphor
I noticed the Granlund and Montgomery paper [1] cited in a comment. Apparently,
Warren also cites it in Hacker's Delight.

[1]
[https://doi.org/10.1145/773473.178249](https://doi.org/10.1145/773473.178249)

~~~
Cyph0n
He is also known for Montgomery modular multiplication, which is a technique
widely used to efficiently implement RSA in hardware.

------
raphlinus
Here's a trick I came across recently which I found quite neat: you can test
for divisibility by 3 using (x * 0xaaaaaaab) < 0x55555556. Same concept works
for 5 and 15, but sadly not other factors.

LLVM does not currently generate that code, but you could make a case that it
should.
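
In C, the test looks like this (a sketch; wrapping unsigned multiplication):

    #include <stdint.h>
    
    /* x % 3 == 0 iff x * (3^-1 mod 2^32) <= floor((2^32 - 1) / 3). */
    static int divisible_by_3(uint32_t x) {
        return x * 0xaaaaaaabu < 0x55555556u;
    }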

~~~
conistonwater
That's pretty cool, although it ought to be mentioned that x must be a uint32.
Here's a proof:

    λ> import Data.SBV
    λ> prove $ do x <- Data.SBV.sWord32 "x"; return $ (x * 0xaaaaaaab .< 0x55555556) .== (x `sMod` 3 .== 0)
    Q.E.D.

Are you sure it doesn't work for other factors? It seems to work for 7, 9,
and 11:

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 7 .== 0))
    Satisfiable. Model:
      a = 28087 :: Word16
      b =  9363 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 9 .== 0))
    Satisfiable. Model:
      a = 36409 :: Word16
      b =  7282 :: Word16

    λ> sat $ do [a,b] <- sWord16s ["a", "b"]; forAll_ (\x -> (x * a .< b) <=> (x `sMod` 11 .== 0))
    Satisfiable. Model:
      a = 35747 :: Word16
      b =  5958 :: Word16

So a is equal to 7^{-1} mod 2^16, and b to ceil((2^16-1)/7), etc.

~~~
raphlinus
Duh, you're right on both counts. It should work for any odd divisor, I was
just computing the inverse in a dumb way when I tried out other divisors. And
I should have stated u32 wrapping arithmetic.
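
For example, for d = 7 (a sketch; I computed the constants by the same recipe,
this time mod 2^32):

    #include <stdint.h>
    
    /* For odd d: x % d == 0 iff x * (d^-1 mod 2^32) <= floor((2^32 - 1) / d).
       For d == 7, the inverse is 0xb6db6db7 and the threshold is 0x24924924. */
    static int divisible_by_7(uint32_t x) {
        return x * 0xb6db6db7u <= 0x24924924u;
    }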

~~~
im3w1l
And for even divisors you can separate the divisor into a power of two, and an
odd part. The odd part is checked as before, and the power of two is checked
with an _and mask_.
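
A quick sketch for 12 = 4 * 3, reusing the constants for 3 from above:

    #include <stdint.h>
    
    /* Power-of-two part (4) via mask, odd part (3) via the inverse test. */
    static int divisible_by_12(uint32_t x) {
        return (x & 3) == 0 && x * 0xaaaaaaabu < 0x55555556u;
    }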

~~~
raphlinus
An even trickier way to do this is to rotate right by the power of two factor
between doing the multiply and the compare.

~~~
im3w1l
That wouldn't work, because a number that is bigger than the threshold could
become smaller than it after the rotation.

------
morecoffee
From what I understand, inlining is the key optimization that compilers make
to speed up code: once a function is inlined, the excess computations of its
body can be removed or shortened based on the calling context. With that in
mind, how can the decompiler identify such "magic multiplication" division,
given that the compiler might mutate the division code at the call site?

~~~
WalterBright
Register allocation is the key to victory.

~~~
qznc
My rule of thumb is: If an optimization reduces memory accesses, it is
important today. Register allocation fits that pattern, because the
alternative is to spill values to memory.

Disclaimer: That rule is restricted to fast processors, not low-power embedded
stuff.

Inlining is not that important for speed by itself. Rather, it is an
optimization booster, which makes many _other_ optimizations much more
effective. If your other optimizations are crap, then inlining will not help
much, because the call overhead itself is not that big.
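
A toy illustration (hypothetical code, not from the article): inlining is what
exposes the constant divisor to the division-strength-reduction pass in the
first place.

    #include <stdint.h>
    
    static inline uint32_t div_by(uint32_t x, uint32_t d) { return x / d; }
    
    uint32_t f(uint32_t x) {
        /* After inlining, d is the constant 19, so the compiler can replace
           the div instruction with the multiply-and-shift sequence. */
        return div_by(x, 19);
    }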

~~~
barrkel
Registers are also the nodes in the DAG of evaluation. Superscalar execution
relies on having multiple edges active over time. Good register allocation
therefore avoids bottlenecks, and avoids making the evaluation DAG degenerate
into a series of linked lists.

There's a nice overview for people with a passing interest here:
[http://www.lighterra.com/papers/basicinstructionscheduling/](http://www.lighterra.com/papers/basicinstructionscheduling/)

~~~
techdragon
How did I go so long without hearing the path of execution described as a DAG?
Damn, that's a useful concept.

I might start prototyping some things in DAG notation just to avoid ever
forgetting this conceptual abstraction.

~~~
barrkel
Expressions are DAGs after common subexpression elimination. DAGs are what you
get when you symbolically evaluate expressions on a stack machine without
backward edges in control flow. Leave out the backward edges and the phi nodes
they create, and you get DAGs from SSA form as well (though you lose the
sequencing information).
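
A toy C example of the sharing that makes the expression a DAG rather than a
tree:

    /* After common subexpression elimination, the two uses of (a + b)
       share a single node with two outgoing edges. */
    int f(int a, int b, int c) {
        int t = a + b;
        return (t * c) ^ (t >> 1);
    }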

------
dmichulke
Probably I'm missing something here:

Why can't one just multiply by the inverse of 19 (which can be calculated at
compile time)?

~~~
desdiv
On most modern platforms, the cost of each operation, from smallest to
largest, is roughly as follows[0]:

1. Bit shifting

2. Integer multiplication

3. Integer division

4. Floating point multiplication

The trick in the article works because the cost of 1 + 2 is still smaller than
3.

Multiplying by the inverse of 19, a floating point number, wouldn't work
because 4 is more costly than 3.

[0]
[http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/](http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/)

~~~
kr7
That page has a warning at the top:

> IMPORTANT: Useful feedback revealed that some of these measures are
> seriously flawed. A major update is on the way.

Looking over the results, some of the numbers are off.

On Intel CPUs, FP multiplication is faster than integer division. That might
not be true on ARM CPUs, which generally have slower FPUs.

On Skylake, for example, 32-bit unsigned integer division has a 26 cycle
latency with a throughput of 1 instruction / 6 cycles, while 32/64-bit
floating point multiplication has a 4 cycle latency with a throughput of 2
instructions / cycle.

Source:
[http://agner.org/optimize/instruction_tables.pdf](http://agner.org/optimize/instruction_tables.pdf)

~~~
yuriks
The crucial point, however, is that while FP multiplication is faster than
integer division, converting between floating point and integer is very slow:
cvtsi2ss and cvtss2si have a latency of 6 cycles each, which adds 12 cycles of
latency around each of these multiplications.

For divisions by a constant value that don't easily decompose into shifts, you
can fall back to multiplication by a magic constant, which is the integer
reciprocal. (This is also something compilers do, and it is what's being
explained in the article.)

~~~
caf
Multiplying by the integer reciprocal only works if the dividend is an integer
multiple of the divisor.

What's being explained in the article is multiplying by a fraction whose value
is close to the rational reciprocal of the divisor, and whose denominator is
an integer power of two (so dividing by the denominator can be done with a
shift).

The fraction in this case is (2938661835 + 2^32) / 2^37.
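
In C, that multiply-and-shift looks something like this (a sketch using 64-bit
intermediates; the constants are the ones from the article):

    #include <stdint.h>
    
    /* x / 19 computed as (x * 2938661835 + (x << 32)) >> 37: one 64-bit
       multiply plus an add and shifts, instead of a div instruction. */
    static uint32_t div19(uint32_t x) {
        uint64_t t = ((uint64_t)x * 2938661835u) >> 32;  /* high half of x * M */
        return (uint32_t)((t + x) >> 5);  /* the remaining 5 of the 37 shift bits */
    }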

------
caf
_The result of 7233629131 / 137438953472 is 18.999999997649866._

This is wrong; the divisor and dividend are the wrong way around.

...and in case anyone else is wondering how those numbers relate to the
constant in the code, it's

    2^37 / (2938661835 + 2^32)

~~~
jwilk
It's been fixed.

------
xelxebar
Is there a reason compilers don't just use modular inverses and let overflow
take care of things?

19 * 678,152,731 ≡ 1 (mod 2^32)

19 * 9,708,812,670,373,448,219 ≡ 1 (mod 2^64)

As long as we know the size of an int, Euclid's algorithm should be able to
find these inverses whenever they exist (which, modulo a power of two, is
exactly when the divisor is odd).
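
A sketch of that computation with the extended Euclidean algorithm (64-bit
intermediates so the modulus 2^32 fits):

    #include <stdint.h>
    
    /* Inverse of an odd d modulo 2^32, via extended Euclid. */
    uint32_t inverse_mod_2_32(uint32_t d) {
        int64_t t = 0, new_t = 1;
        int64_t r = INT64_C(1) << 32, new_r = d;
        while (new_r != 0) {
            int64_t q = r / new_r, tmp;
            tmp = t - q * new_t; t = new_t; new_t = tmp;
            tmp = r - q * new_r; r = new_r; new_r = tmp;
        }
        return (uint32_t)t;  /* the cast wraps t into [0, 2^32) */
    }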

~~~
SamReidHughes
That only works when the value is known to be a multiple of whatever you're
dividing by, and they do use that when it is the case, such as for pointer
subtraction.
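
For example, for a hypothetical struct of size 12, pointer subtraction can be
lowered to a shift plus a multiply by the inverse of the odd factor, precisely
because the byte difference is guaranteed to be an exact multiple of 12:

    #include <stdint.h>
    
    /* Exact division by 12: divide by 4 with a shift, then multiply by
       3^-1 mod 2^32 (0xaaaaaaab). Only valid for exact multiples of 12. */
    static uint32_t elem_count(uint32_t byte_diff) {
        return (byte_diff >> 2) * 0xaaaaaaabu;
    }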

------
amelius
It could also have been a neat way to work around the Intel Pentium's FDIV
bug [1].

[1]
[https://en.wikipedia.org/wiki/Pentium_FDIV_bug](https://en.wikipedia.org/wiki/Pentium_FDIV_bug)

~~~
lfowles
I'm not sure I follow your line of thought, given that the post is talking
about integer division and FDIV is floating-point division.

~~~
amelius
Well, you can build a floating-point DIV out of an integer DIV, I suppose.
Roughly speaking, you divide the (integer) mantissas, subtract the exponents,
and normalize.

