
Faster remainders when the divisor is a constant: beating compilers & libdivide - matt_d
https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/
======
nathan_f77
For this equation:

    
    
        n/d = n * (2^N/d) / (2^N)
    

What is "N"? They didn't seem to explain that, although they say "Yet for N
large enough", and "drop the least significant N bits". Is it just a very
large integer? Or maybe the maximum integer for a certain number of bytes?

EDIT: I'm reading this Wikipedia article:
[https://en.wikipedia.org/wiki/Division_algorithm#Division_by...](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant)

For a 32-bit unsigned integer, you would use N = 33 when dividing by 3? And N
= 35 when dividing by 10? I'm struggling to see how this works.

EDIT 2: I think this post helped me understand it:
[https://forums.parallax.com/discussion/114807/fast-faster-
fa...](https://forums.parallax.com/discussion/114807/fast-faster-fastest-code-
integer-division)

"The basic idea is to approximate the ratio (1/constant) by another rational
number (numerator/denominator) with a power of two as the denominator."

Their example:

    
    
        (n/10) = (n*205) >> 11
    

Maps to the original example:

    
    
        n/10 = n * (2^11 / 10) / (2^11)
    

2^11 / 10 = 2048 / 10 = 204.8, which rounds to 205.

So N would be 11 in this example, but I guess it could be any value.

~~~
mark-r
I recently discovered that x/10 == (x*103)>>10 for the numbers 0-99. Helped
speed up something I was working on. The compiler generated a similar
multiply/shift for x/10, but mine was faster since it didn't need to work for
all possible values of x.

~~~
saagarjha
Does a multiply/shift pair take a different amount of time based on the
operands? Otherwise, I don't see how this would make a difference.

~~~
bdonlan
The compiler has to assume that your operands might be large enough that they
would overflow after a straight multiply.

~~~
saagarjha
As acqq/I have mentioned below, there are ways of telling the compiler that
the multiply can't overflow, so it can freely apply these optimizations.

------
remcob
This is very similar to Montgomery reduction [1]. Both do a multiply taking
low bits followed by a multiply taking the high bits to reduce a number.

Montgomery reduction requires an augmented number system with conversions to
and from. The article's approach works on numbers directly, but if I
understand it correctly the multiplications are twice the width.

Are these two methods algebraically related? Would it be possible to get
fastmod working without the double width?

[1]:
[https://en.wikipedia.org/wiki/Montgomery_modular_multiplicat...](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication#The_REDC_algorithm)

------
mjcohen
"Hacker's Delight" ([https://www.amazon.com/Hackers-Delight-Henry-S-Warren-
ebook/...](https://www.amazon.com/Hackers-Delight-Henry-S-Warren-
ebook/dp/B009GMUMTM/ref=sr_1_1?crid=M9M472NU7BWX&keywords=hackers+delight&qid=1549663765&s=Kindle+Store&sprefix=hackers+%2Caps%2C192&sr=1-1-catcorr))
has a whole chapter on this. So, yes, it is well known.

~~~
kbenson
As noted by the article in question:

 _The idea is not novel and goes back to at least 1973 (Jacobsohn). However,
engineering matters because computer registers have finite number of bits, and
multiplications can overflow. I believe that, historically, this was first
introduced into a major compiler (the GNU GCC compiler) by Granlund and
Montgomery (1994). While GNU GCC and the Go compiler still rely on the
approach developed by Granlund and Montgomery, other compilers like LLVM’s
clang use a slightly improved version described by Warren in his book Hacker’s
Delight._

~~~
bonzini
The second edition of Hacker's Delight does have a section on "Testing for
zero remainders after division by a constant", but it uses the multiplicative
inverse.

------
StefanKarpinski
It's astonishing that this method isn't previously widely known and used. It's
so simple once you see it.

~~~
vanderZwan
I wouldn't be surprised if some people had already thought of it while coming
up with a fast quotient method, but didn't bother to do anything with it
because they did not need a remainder.

------
IshKebab
Clever. Is this possible with 64-bit numbers somehow?

~~~
bloomer
Yes, the paper is general and covers arbitrary bit sizes. 64-bit would
require a 128-bit approximate inverse in general, and the multiplication
while computing the mod would extend that out another 64 bits. So it would
require three registers and two more multiplications on x64, I think, and I
don't think it gains you anything in the general case. But for some divisors
it might beat the traditional Granlund-Montgomery-Warren method, when you
could use a smaller approximate inverse.

------
hoseja
> GNU GCC Compiler

------
rurban
Genius

