
Faster Unsigned Division by Constants (2011) - simonster
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
======
nkurz
I just tested GCC, ICC, and Clang to see how well they perform these
optimizations on their own. Code for my test is here:
[https://gist.github.com/nkurz/6844adbc58a03eb3420f](https://gist.github.com/nkurz/6844adbc58a03eb3420f)

The short answer is that they all do it well. For a loop like this compiled
with -O3:

    
    
        for (int i = 0; i < COUNT; i++) {
            compilerSum += randomInt[i]/CONST;
        }
    

ICC 14.0.3 takes 4.1 cycles per iteration, Clang 3.4.1 takes 3.8 cycles, and
GCC 4.8.1 manages to spend only 1.3 cycles for each division!

Apparently GCC realized it was able to not only use a shortcut for the
division, but also managed to vectorize the loop. Here's the inner loop it
produced:

    
    
      4019f0:       66 0f 6f 00             movdqa (%rax),%xmm0
      4019f4:       48 83 c0 10             add    $0x10,%rax
      4019f8:       48 39 c5                cmp    %rax,%rbp
      4019fb:       66 0f 6f c8             movdqa %xmm0,%xmm1
      4019ff:       66 0f 6f e0             movdqa %xmm0,%xmm4
      401a03:       66 0f 62 c8             punpckldq %xmm0,%xmm1
      401a07:       66 0f 6a e0             punpckhdq %xmm0,%xmm4
      401a0b:       66 0f f4 cb             pmuludq %xmm3,%xmm1
      401a0f:       66 0f f4 e3             pmuludq %xmm3,%xmm4
      401a13:       0f c6 cc dd             shufps $0xdd,%xmm4,%xmm1
      401a17:       66 0f fa c1             psubd  %xmm1,%xmm0
      401a1b:       66 0f 72 d0 01          psrld  $0x1,%xmm0
      401a20:       66 0f fe c1             paddd  %xmm1,%xmm0
      401a24:       66 0f 72 d0 02          psrld  $0x2,%xmm0
      401a29:       66 0f fe d0             paddd  %xmm0,%xmm2
      401a2d:       75 c1                   jne    4019f0 <main+0xb0>
    

I put the other two here:
[https://gist.github.com/nkurz/b1c3e01498dd11fda42f](https://gist.github.com/nkurz/b1c3e01498dd11fda42f)
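For reference, here's a scalar C sketch of what each lane of GCC's loop above appears to compute, assuming CONST = 7 (the magic constant 0x24924925 and the shift counts are specific to 7 and would differ for other divisors):

```c
#include <stdint.h>

/* Scalar sketch of GCC's vectorized sequence, assuming CONST = 7.
   0x24924925 is the "round-up" multiplier for d = 7; the full magic
   number is 33 bits, so a subtract/shift/add fixup is needed. */
static uint32_t div7(uint32_t n) {
    uint32_t q = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32); /* pmuludq + shufps: high halves */
    uint32_t t = (n - q) >> 1;                                  /* psubd, psrld $1 */
    return (t + q) >> 2;                                        /* paddd, psrld $2 */
}
```

The multiply keeps only the high 32 bits of each 64-bit product (what the pmuludq/shufps pair extracts per lane), and the fixup recovers the effect of the 33rd magic bit.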

I didn't look closely enough to identify which methods they were using. Can
someone else tell?

~~~
pbsd
GCC's vectorized loop is using the usual "round-up" method, but we can easily
implement the distributive method there by mixing 32- and 64-bit arithmetic:

    
    
        ; xmm7 = 0 magic 0 magic
        ; xmm6 = 0
        ; xmm5 = 0 (accumulator)
        division_loop:
          vmovdqu  xmm4, [rax]         ; xmm4 = w3 w2 w1 w0
          
          vpunpckldq xmm0, xmm4, xmm6  ; xmm0 = 0 w1 | 0 w0
          vpunpckhdq xmm2, xmm4, xmm6  ; xmm2 = 0 w3 | 0 w2
          vpmuludq xmm1, xmm0, xmm7    ; xmm1 = w1 * magic | w0 * magic
          vpmuludq xmm3, xmm2, xmm7    ; xmm3 = w3 * magic | w2 * magic
          vpaddq   xmm1, xmm1, xmm0    ; xmm1 = w1 * magic + w1 | w0 * magic + w0
          vpaddq   xmm3, xmm3, xmm2    ; xmm3 = w3 * magic + w3 | w2 * magic + w2
          vpsrlq   xmm1, xmm1, XX      ; XX = shift constant
          vpsrlq   xmm3, xmm3, XX      ; XX = shift constant
          
          vshufps    xmm1, xmm1, xmm3, 10001000b ; back to w3 w2 w1 w0
          vpaddd     xmm5, xmm5, xmm1  ; add to accumulator 
          
          add rax, 16
          cmp rax, rbp
        jnz division_loop
    

The shifts can be further tweaked to allow to use a blend instead of a
shuffle, in case there's a penalty for changing from integer to floating-point
domains. Saturating increment also works well, despite there not being a
PADDUSD instruction; it can be done in two cycles:

    
    
          ; xmm6 = -1 -1 -1 -1
          vpcmpeqd xmm3, xmm4, xmm6 ; xmm3[i] = (xmm4[i] == -1 ? -1 : 0) 
          vpsubd   xmm4, xmm4, xmm6 ; xmm4[i] += 1  
          vpaddd   xmm4, xmm4, xmm3 ; xmm4[i] += xmm3[i]
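
In scalar form, with concrete constants for d = 7 picked as an example (M = floor(2^34 / 7) = 0x92492492, shift 34), the round-down method looks like the sketch below. The 64-bit intermediate means the incremented dividend can't wrap, which is exactly what the saturating increment has to handle in 32-bit lanes:

```c
#include <stdint.h>

/* Round-down ("distributive") division by 7: multiply the incremented
   dividend by M = floor(2^34 / 7) and shift right by 34. With 64-bit
   arithmetic, n + 1 cannot overflow, even for n = 0xFFFFFFFF. */
static uint32_t div7_round_down(uint32_t n) {
    const uint64_t M = 0x92492492u; /* floor(2^34 / 7), fits in 32 bits */
    return (uint32_t)((M * ((uint64_t)n + 1)) >> 34);
}
```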

~~~
nkurz
Nice work! It looks like it extends fine to AVX2, doing 8 divides at a time in
less than 1 cycle per division, if for some reason you need to divide a whole
array by the same constant. That's a rare need, but impressive considering how
expensive division has traditionally been.

I don't know whether you will hit a domain penalty for the vshufps, but I
suspect the answer is yes, and that the blend would be better. The blend can
also run on p015, versus only p5 for the shufps, so it might pipeline better.
I hadn't realized until now that vshufps has different semantics from vpshufd
and can combine from two different registers. If there is no penalty, this
could be quite a useful instruction.

Any reason for using 'w' for your 32-bit elements rather than 'd' for double-
word as used in the instruction names?

~~~
pbsd
According to Agner Fog, data domain penalties with SHUFPS are nonexistent, or
at least pretty hard to trigger, on Haswell. Previous chips might complain,
though. The problem with blends is that they require SSE4.1+, unlike the
current version, which is fully SSE2 (modulo AVX encoding niceties).

I chose 'w' for no real reason other than thinking of 'word' at the time.

~~~
nkurz
I just tested, and I can now confirm that there is no domain switching penalty
for using vshufps versus vpshufd on integer data for XMM and YMM on Haswell,
nor for XMM on Sandy Bridge (no YMM comparison possible there because vpshufd
requires AVX2).

------
herf
Nice analysis! There is a similar problem in base-255 pixel math, with a
similar solution. You want 255 * 255 = 255 when you multiply pixels, so you
are doing divide-by-255 (not-256), which in 8 bits is somewhat
"uncooperative".

I wrote up a few solutions many years ago, one of which is Sree Kotay's idea:
adding one to the numerator (what this article says but without the awesome
proof):
[http://stereopsis.com/doubleblend.html](http://stereopsis.com/doubleblend.html)

It would be great if compilers could do all this for you, with a proof like
this to go with it.

~~~
colanderman
I would think the better (symmetric and slightly more accurate) solution would
be to add the MSB of the result to the LSB and then round to the nearest MSB
(by adding 0x80 before shifting). E.g.:

    
    
        ((0xFF*0xFF = 0xFE01) + 0xFE + 0x80 = 0xFF7F) >> 8 = 0xFF
        ((0xC0*0xC0 = 0x9000) + 0x90 + 0x80 = 0x9110) >> 8 = 0x91
        ((0xB4*0xB5 = 0x7F44) + 0x7F + 0x80 = 0x8043) >> 8 = 0x80
        ((0xB4*0xB4 = 0x7E90) + 0x7E + 0x80 = 0x7F8E) >> 8 = 0x7F
        ((0x80*0x80 = 0x4000) + 0x40 + 0x80 = 0x40C0) >> 8 = 0x40
        ((0x00*0x00 = 0x0000) + 0x00 + 0x80 = 0x0080) >> 8 = 0x00
    

(To be precise, you should also add the MSB to the right of the LSB, etc.,
since 1/.FF = 1.01010101..., but now you're getting lost in the noise.)

EDIT: screwed up some math; thanks gjm11.
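
In C the scheme above looks like this (mul255 is just a hypothetical name for it):

```c
#include <stdint.h>

/* Multiply two base-255 pixel values: t = a * b, then divide by 255
   by adding the high byte of t back in, plus 0x80 for rounding. */
static uint8_t mul255(uint8_t a, uint8_t b) {
    uint32_t t = (uint32_t)a * b;
    return (uint8_t)((t + (t >> 8) + 0x80) >> 8);
}
```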

~~~
gjm11
But 1/.FF = 1.01010101..., not 1.02030405... .

(Multiplying the RHS by FF gives FF.FFFFFFFFFF etc. = 100. 1.020304... is
1/(.FF)^2.)

~~~
colanderman
Whoops, good catch! Edited.

------
pkaye
I know many compilers use this optimization. However, I recently applied it in
a case where the divisor was a variable restricted to a few possible values. I
expanded it into a switch statement in which each possible divisor appears as
a constant, allowing for this optimization. The processor in question has no
hardware divide, so this made for significant performance gains.
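
Sketching the idea (the divisor set here is made up for illustration):

```c
#include <stdint.h>

/* Dispatch a runtime divisor drawn from a small known set to constant
   divisions, so the compiler can emit the multiply-and-shift form for
   each case instead of a (software or hardware) divide. */
static uint32_t div_known(uint32_t n, uint32_t d) {
    switch (d) {
        case 3:  return n / 3;
        case 5:  return n / 5;
        case 7:  return n / 7;
        case 10: return n / 10;
        default: return n / d; /* falls back to a runtime divide */
    }
}
```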

------
WalterBright
I think all modern compilers do this now. I know the D compiler does:

[https://github.com/D-Programming-
Language/dmd/blob/master/sr...](https://github.com/D-Programming-
Language/dmd/blob/master/src/backend/divcoeff.c)

~~~
simonster
Modern compilers do some variant of fast integer division by constants, but I
think most use the "round-up" algorithm described in an earlier post in this
series ([http://ridiculousfish.com/blog/posts/labor-of-division-
episo...](http://ridiculousfish.com/blog/posts/labor-of-division-
episode-i.html)) and a variety of other places and not the "round-down"
algorithm described in this post when the multiplier is a bit too big. You'd
know better than I, but it looks like this is the round-up algorithm:
[https://github.com/D-Programming-
Language/dmd/blob/master/sr...](https://github.com/D-Programming-
Language/dmd/blob/master/src/backend/divcoeff.c#L206-L208)

------
PhantomGremlin
My first question was: how slow is hardware divide? The article didn't answer
that, but there are many resources on the net that do (more or less, it's
tough to precisely benchmark these very sophisticated CPUs). One paper is at
[1].

The general conclusion is that simple operations like add can be done in 1 CPU
cycle. They can be easily pipelined for a throughput of 3 different add
operations in 1 cycle. Modern CPUs also pipeline integer multiply so, while
the latency is perhaps 4 cycles, the throughput is 1 multiply in 1 CPU cycle.

But hardware division can be slow, really really slow. A Pentium 4 might have
a latency of 80 cycles for a 32 bit operation, 160 cycles for a 64 bit
operation. And there's no pipelining, multiple divides don't appear to be done
in parallel. Even newer CPUs like the Sandy Bridge architecture have a latency
of 26/92 cycles for 32/64 bit operations.

So it's not surprising that compilers try to avoid doing hardware divide.

[1]
[https://gmplib.org/~tege/x86-timing.pdf](https://gmplib.org/~tege/x86-timing.pdf)

~~~
TheLoneWolfling
I hadn't realized hardware division was that slow, yow.

I suppose part of this is that we now have 64-bit operations everywhere; the
length of the critical path for a 64-bit divide is absurd.

------
colanderman
Note that while most compilers already replace fixed-divisor divisions with
multiplications, this algorithm presents a more optimized solution for certain
"problem" divisors, notably 7.

------
readerrrr
A compiler I use, based on lcc, did this automatically without even enabling
optimizations.

The division is transformed into multiplication and a right shift.

I don't think the typical programmer needs to care about this with modern
compilers, but it is nice to know if you are implementing one or writing asm.

~~~
clarry
Or if you're dealing with data that isn't compile-time constant, but can be
expected to remain unchanged for the duration of something that needs to be
optimized all the way.

------
wbhart
Compilers have done this sort of thing for a very long time. Here is a well-
known paper from 1994 that already tries to do it for more arbitrary
constants.

[https://gmplib.org/~tege/divcnst-
pldi94.pdf](https://gmplib.org/~tege/divcnst-pldi94.pdf)

~~~
colanderman
No, they haven't. This extends the algorithm you reference to cover the
special case of certain "magic numbers" which don't fit nicely in a machine
register. From the article:

 _It does not seem to appear in the literature, nor is it implemented in gcc,
llvm, or icc, so fish is optimistic that it is original._

~~~
wbhart
Actually, looking more carefully at the Fish webpage I think they cite 7 as an
example because 7 does not divide 2^32+/-1 (or indeed 2^64+/-1). If it did,
the quotient would be a "magic number".

However, 7 does divide 2^33 - 1. This "magic number" _does_ fit in a 32 bit
word. It's just that division by 2^33 - 1 isn't as trivial as division by
2^32 +/- 1 (it's not much worse though).

Multiplication by/storage of a 33 bit number also isn't a problem anyway,
because you can use the "IEEE trick" of storing the leading 1 bit implicitly.
Multiplication by such a number can be done with one multiply and one
addition. Not that you actually need to do this for the divisor 7.

The original Montgomery-Granlund work (apparently the basis of GCC division by
invariant integers for quite a while) was to always find a quasi-"magic
number" which fits in a word. So that was definitely not the innovation of
Fish.

The followup paper of Moller-Granlund does even better by using a mullow
instead of a full mul or mulhigh, which is usually available sooner on modern
processors.

Fish seems to have found some highly optimised sequences. I'm not entirely
clear on how new the ideas themselves are. It's clearly not the same as
Montgomery-Granlund, but Montgomery-Granlund use (B^2 - 1)/d (where admittedly
d has to be first "normalised"), Fish uses (2^p*B)/d where they fiddle the
dividend and do some post-shifting.

I'm not even sure if Fish is better than Moller-Granlund, though the latter
was 2009, not 1994.

I don't want to take away from the contribution of Fish. But it just needs to
be viewed in the context of rather a lot of previous work.

~~~
colanderman
Thanks, that's a very detailed analysis.

------
phkahler
From 2011, has this made it into gcc? Is it part of LLVM now since they tested
with that?

~~~
simonster
It looks like LLVM trunk still uses a subtraction and a shift for
"uncooperative" divisors: [https://github.com/llvm-
mirror/llvm/blob/1bdc6c4519d9ad254b2...](https://github.com/llvm-
mirror/llvm/blob/1bdc6c4519d9ad254b20534a1664f9d2809a95f1/lib/CodeGen/SelectionDAG/TargetLowering.cpp#L2751-L2766)

~~~
jevinskie
I found two other implementations of software division in LLVM:

[https://github.com/llvm-
mirror/llvm/blob/master/lib/Transfor...](https://github.com/llvm-
mirror/llvm/blob/master/lib/Transforms/Utils/IntegerDivision.cpp)

[https://github.com/llvm-
mirror/llvm/blob/master/lib/Transfor...](https://github.com/llvm-
mirror/llvm/blob/master/lib/Transforms/Utils/BypassSlowDivision.cpp)

