Faster remainders when the divisor is a constant: beating compilers & libdivide (lemire.me)
197 points by matt_d on Feb 8, 2019 | 32 comments



For this equation:

    n/d = n * (2^N/d) / (2^N)
What is "N"? They didn't seem to explain that, although they say "Yet for N large enough", and "drop the least significant N bits". Is it just a very large integer? Or maybe the maximum integer for a certain number of bytes?

EDIT: I'm reading this Wikipedia article: https://en.wikipedia.org/wiki/Division_algorithm#Division_by...

For a 32-bit unsigned integer, you would use N = 33 when dividing by 3? And N = 35 when dividing by 10? I'm struggling to see how this works.

EDIT 2: I think this post helped me understand it: https://forums.parallax.com/discussion/114807/fast-faster-fa...

"The·basic idea is to approximate the ratio (1/constant) by another rational number (numerator/denominator) with a power of two as the denominator."

Their example:

    (n/10) = (n*205) >> 11
Maps to the original example:

    n/10 = n * (2^11 / 10) / (2^11)
2^11 / 10 = 2048 / 10 = 204.8, which rounds to 205.

So N would be 11 in this example, but I guess it could be any value.
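
Out of curiosity I checked where the approximation breaks down (my own standalone snippet, not from that post):

    #include <stdio.h>

    int main(void) {
        /* 205 = round(2^11 / 10), so N = 11 here.  Find the first n
           where (n * 205) >> 11 stops matching n / 10. */
        for (unsigned n = 0; ; n++) {
            if ((n * 205) >> 11 != n / 10) {
                printf("first mismatch at n = %u\n", n); /* n = 1029 */
                break;
            }
        }
        return 0;
    }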


The Wikipedia article seems to be using slightly different definitions. For the blog post, the underlying paper gives better definitions: https://arxiv.org/pdf/1902.01961.pdf.

I'm one of the co-authors, and I believe that in our case, N is always equal to the maximum bitwidth of the numerator. So for a 32-bit integer, N=32. I'll see if Daniel can add some clarification to the blog post.


(I am the author of the blog post and I am answering here at the request of nkurz.)

So nkurz is correct that, in the paper, N is always equal to the maximum bitwidth of the numerator. I was not careful enough to make sure that the notation of the paper matches the notation of the blog post. So the N from the blog post is the F from the paper. I am sorry about this.

I added the following footnote.

"What is N? If both the numerator n and the divisor d are 32-bit unsigned integers, then you can pick N=64. This is not the smallest possible value. The smallest possible value is given by Algorithm 2 in our paper and it involves a bit of mathematics (note: the notation in my blog post differs from the paper, N becomes F)."

To make it super simple, if you have 32-bit values, then pick N (from the blog post) to be 64 and you are all set. You can do better, but this requires knowing what the divisor is.
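
In code, the N=64 recipe looks something like this (my own sketch using GCC/Clang's __uint128_t, not code from the post; d must be greater than 1, since UINT64_MAX / 1 + 1 wraps to zero):

    #include <stdint.h>

    /* Divide a 32-bit n by a 32-bit d > 1 with one 64x64->128 multiply:
       c approximates 2^64 / d, and the quotient is the high 64 bits
       of c * n. */
    static uint32_t fastdiv(uint32_t n, uint32_t d) {
        uint64_t c = UINT64_MAX / d + 1; /* c = ceil(2^64 / d) */
        return (uint32_t)(((__uint128_t)c * n) >> 64);
    }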


I have not used it in many years, but I have a foggy memory of IBM's POWER CPU compiler doing a similar trick for remainders.


I recently discovered that x/10 == (x*103)>>10 for the numbers 0-99. Helped speed up something I was working on. The compiler generated a similar multiply/shift for x/10, but mine was faster since it didn't need to work for all possible values of x.
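
For anyone who wants to verify the claim (a standalone check of the identity, not the original code):

    #include <assert.h>

    int main(void) {
        /* 103 = round(2^10 / 10); the identity holds for all x in 0..99
           (it first fails at x = 179). */
        for (unsigned x = 0; x < 100; x++)
            assert(x / 10 == (x * 103) >> 10);
        return 0;
    }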


This is called reciprocal multiplication. You can trade off accuracy for number of bits required, etc. I used this to do meaningful processing of Kinect data on a CPU without an FPU at near full frame rate.

Just curious, did you try making x an 8-bit integer to see if the compiler chose a different multiplier?


No, I didn't try casting to char although I thought about it.


For a range that small you could even build a lookup table. Or is this a case where cheap computation beats memory access?


On modern hardware, computation beats RAM access almost always.

Here's a repo with 2 tests, counting bits, and calculating sine/cosine: https://github.com/Const-me/LookupTables

Neither computation is particularly cheap, but both are still faster than RAM.


For modern processors you can assume that a couple of instructions will be much faster than a memory lookup. Even if the lookup is cached it will most likely push something else important out of cache.


Does a multiply/shift pair take a different amount of time based on the operands? Otherwise, I don't see how this would make a difference.


The compiler has to assume that your operands might be large enough that they would overflow after a straight multiply.


As acqq and I have mentioned below, there are ways of telling the compiler that the multiply can't overflow, so it can freely optimize these away.


My code was two instructions:

    imul    eax, ecx, 67h
    sar     eax, 0Ah

The compiler-generated code was:

    mov     eax, 66666667h
    imul    ecx
    sar     edx, 2
    mov     eax, edx
    shr     eax, 1Fh
    add     eax, edx


When the input is an unsigned int with no x < 99 check, and the output is an unsigned int:

clang 7

        mov     ecx, edi
        mov     eax, 3435973837
        imul    rax, rcx
        shr     rax, 35
gcc 8.2

        mov     eax, edi
        mov     edx, -858993459
        mul     edx
        mov     eax, edx
        shr     eax, 3
msvc 19

        mov     eax, -858993459  ; cccccccdH
        mul     edx
        shr     edx, 3
        mov     eax, edx
icc 19

        mov       eax, 1717986919                               
        imul      edx                                           
        sar       edx, 2                          
        mov       eax, edx
The same, but when input unsigned char:

clang 7

        movzx   eax, dil
        imul    eax, eax, 205
        shr     eax, 11
gcc 8.2

        movzx   edi, dil
        lea     eax, [rdi+rdi*4]
        lea     eax, [rdi+rax*8]
        lea     eax, [rax+rax*4]
        shr     ax, 11
        movzx   eax, al
msvc 19

        movzx   ecx, al
        mov     eax, -858993459  ; cccccccdH
        mul     ecx
        shr     edx, 3
        mov     eax, edx
icc 19

        mov       eax, 1717986919
        imul      edi            
        sar       edx, 2        
        mov       eax, edx

clang and gcc with -O3, msvc -O2
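
For reference, the functions being compiled are presumably plain divisions without the x < 99 guard (my reconstruction, not the original poster's source):

    unsigned div10_u32(unsigned x)     { return x / 10; }
    unsigned div10_u8(unsigned char x) { return x / 10; }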


This was exactly what I was about to mention; I can reproduce your results in Clang but it seems GCC doesn't want to optimize this? Here's what Godbolt gives me: https://godbolt.org/z/aFTdur


Just write a function “mydiv10” which returns x/10 if x < 99 and 0 otherwise, and try again.


Like this:

https://godbolt.org/z/bS2yzB

But also note that when the input is unsigned short we get:

https://godbolt.org/z/sbDbMU

        movzx   eax, di
        imul    eax, eax, 52429
        shr     eax, 19


It's conceivable that multiplication time might be proportional to the number of 1 bits in the multiplier.


This is very similar to Montgomery reduction [1]. Both do a multiply taking low bits followed by a multiply taking the high bits to reduce a number.

Montgomery reduction requires an augmented number system with conversions to and from. The article's approach works on numbers directly, but if I understand it correctly the multiplications are twice the width.

Are these two methods algebraically related? Would it be possible to get fastmod working without the double width?

[1]: https://en.wikipedia.org/wiki/Montgomery_modular_multiplicat...
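
For comparison, a minimal REDC sketch (my own, under these assumptions: m is odd and below 2^31 so the 64-bit sum cannot overflow, and minv = -m^-1 mod 2^32 is precomputed):

    #include <stdint.h>

    /* Montgomery reduction: given T < m * 2^32, return T * 2^-32 mod m. */
    static uint32_t redc(uint64_t T, uint32_t m, uint32_t minv) {
        uint32_t q = (uint32_t)T * minv;           /* multiply, keep low bits  */
        uint64_t t = (T + (uint64_t)q * m) >> 32;  /* multiply, keep high bits */
        return t >= m ? (uint32_t)(t - m) : (uint32_t)t;
    }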


"Hacker's Delight" (https://www.amazon.com/Hackers-Delight-Henry-S-Warren-ebook/...) has a whole chapter on this. So, yes, it is well known.


The article mentions the book and that the method presented therein is used by clang for faster division. The novel part is to adapt the method for faster remainders without calculating the quotient first.
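
Concretely, my reading of the article's remainder trick for 32-bit operands (a sketch using GCC/Clang's __uint128_t; for a constant divisor, c would be precomputed):

    #include <stdint.h>

    /* Remainder without computing the quotient: c * n holds the
       fractional part of n / d scaled by 2^64, and multiplying that by
       d and taking the high 64 bits recovers n % d. */
    static uint32_t fastmod(uint32_t n, uint32_t d) {
        uint64_t c = UINT64_MAX / d + 1; /* c = ceil(2^64 / d) */
        uint64_t lowbits = c * n;        /* wraps mod 2^64 on purpose */
        return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
    }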


As noted by the article in question:

The idea is not novel and goes back to at least 1973 (Jacobsohn). However, engineering matters because computer registers have finite number of bits, and multiplications can overflow. I believe that, historically, this was first introduced into a major compiler (the GNU GCC compiler) by Granlund and Montgomery (1994). While GNU GCC and the Go compiler still rely on the approach developed by Granlund and Montgomery, other compilers like LLVM’s clang use a slightly improved version described by Warren in his book Hacker’s Delight.


The second edition of Hacker's Delight does have a section on "Testing for zero remainders after division by a constant", but it uses the multiplicative inverse.
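
That test, as I understand it (a sketch for odd divisors only; the inverse comes from Newton's iteration, and the function names are mine):

    #include <stdint.h>

    /* Multiplicative inverse of an odd d modulo 2^32: each Newton step
       doubles the number of correct low bits (3 -> 6 -> 12 -> 24 -> 48). */
    static uint32_t mul_inverse(uint32_t d) {
        uint32_t x = d;   /* correct mod 2^3 for odd d */
        x *= 2 - d * x;
        x *= 2 - d * x;
        x *= 2 - d * x;
        x *= 2 - d * x;
        return x;
    }

    /* n is a multiple of odd d iff
       n * d^-1 (mod 2^32) <= floor(UINT32_MAX / d). */
    static int divisible(uint32_t n, uint32_t d) {
        return n * mul_inverse(d) <= UINT32_MAX / d;
    }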



Nice! If I'm not wrong, under many conditions this derivation gives an even faster divisibility test than the one presented in the article we're commenting on.


It's astonishing that this method isn't previously widely known and used. It's so simple once you see it.


I wouldn't be surprised if some people had already thought of it while coming up with a fast quotient method, but didn't bother to do anything with it because they did not need a remainder.


Clever. Is this possible with 64-bit numbers somehow?


Yes, the paper is general and covers arbitrary bit sizes. 64-bit would in general require a 128-bit approximate inverse, and the multiplication while computing the mod would extend that out another 64 bits. So it would require three registers and two more multiplications on x64, I think, and I don't think it gains you anything in the general case. But for some divisors it might beat the traditional Granlund-Montgomery-Warren method, when you could use a smaller approximate inverse.


> GNU GCC Compiler


Genius



