What is "N"? They didn't seem to explain that, although they say "Yet for N large enough", and "drop the least significant N bits". Is it just a very large integer? Or maybe the maximum integer for a certain number of bytes?
The Wikipedia article seems to be using slightly different definitions. For the blog post, the actual paper gives better definitions: https://arxiv.org/pdf/1902.01961.pdf.
I'm one of the co-authors, and I believe that in our case, N is always equal to the maximum bitwidth of the numerator. So for a 32-bit integer, N=32. I'll see if Daniel can add some clarification to the blog post.
(I am the author of the blog post and I am answering here at the request of nkurz.)
So nkurz is correct that, in the paper, N is always equal to the maximum bitwidth of the numerator. I was not careful enough to make sure that the notation of the paper matches the notation of the blog post. So the N from the blog post is the F from the paper. I am sorry about this.
I added the following footnote.
"What is N? If both the numerator n and the divisor d are 32-bit unsigned integers, then you can pick N=64. This is not the smallest possible value. The smallest possible value is given by Algorithm 2 in our paper and it involves a bit of mathematics (note: the notation in my blog post differs from the paper, N becomes F)."
To make it super simple, if you have 32-bit values, then pick N (from the blog post) to be 64 and you are all set. You can do better, but this requires knowing what the divisor is.
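Concretely, with N=64 the remainder for 32-bit operands can be computed along these lines (a rough sketch, not the exact code from the post; it assumes a compiler with __uint128_t such as GCC or clang):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the N = 64 remainder trick for 32-bit n and d.
       Needs a compiler providing __uint128_t (GCC, clang).      */
    static inline uint32_t fastmod_u32(uint32_t n, uint64_t c, uint32_t d) {
        uint64_t lowbits = c * n;  /* fractional part of n/d, scaled by 2^64 */
        return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
    }

    int main(void) {
        uint32_t d = 7;                   /* divisor, known ahead of time */
        uint64_t c = UINT64_MAX / d + 1;  /* ceil(2^64 / d), precomputed  */
        for (uint32_t n = 0; n < 100000; n++)
            if (fastmod_u32(n, c, d) != n % d) { printf("bug at %u\n", n); return 1; }
        printf("ok\n");
        return 0;
    }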
I recently discovered that x/10 == (x*103)>>10 for the numbers 0-99. Helped speed up something I was working on. The compiler generated a similar multiply/shift for x/10, but mine was faster since it didn't need to work for all possible values of x.
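A tiny standalone version for anyone who wants to play with it (just a quick sketch, not the code from my project): it confirms the identity for 0-99 and reports where it first stops matching.

    #include <stdio.h>

    /* check x/10 == (x*103)>>10: holds for 0-99,
       and this finds where it first breaks down   */
    int main(void) {
        for (unsigned x = 0; x < 1000; x++)
            if (x / 10 != (x * 103) >> 10) {
                printf("first mismatch at x = %u\n", x);
                return 0;
            }
        printf("holds for all x < 1000\n");
        return 0;
    }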
This is called reciprocal multiplication. You can trade off accuracy for number of bits required, etc. I used this to do meaningful processing of Kinect data on a CPU without an FPU at near full frame rate.
Just curious, did you try making x an 8-bit integer to see if the compiler chose a different multiplier?
For modern processors you can assume that a couple of instructions will be much faster than a memory lookup. Even if the lookup is cached it will most likely push something else important out of cache.
My code was two instructions:

    imul eax, ecx, 67h
    sar  eax, 0Ah

The compiler-generated code was:

    mov  eax, 66666667h
    imul ecx
    sar  edx, 2
    mov  eax, edx
    shr  eax, 1Fh
    add  eax, edx
This was exactly what I was about to mention; I can reproduce your results in Clang but it seems GCC doesn't want to optimize this? Here's what Godbolt gives me: https://godbolt.org/z/aFTdur
This is very similar to Montgomery reduction [1]. Both do a multiply taking low bits followed by a multiply taking the high bits to reduce a number.
Montgomery reduction requires an augmented number system with conversions to and from. The article's approach works on numbers directly, but if I understand it correctly the multiplications are twice the width.
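To make the "augmented number system" point concrete, here is a minimal Montgomery reduction sketch (mine, not from the reference; it assumes a GCC/clang-style __uint128_t): values live as a*R mod m with R = 2^32, redc() divides by R while reducing mod m, and the to_mont/from_mont conversions are exactly the overhead that the article's approach avoids.

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal Montgomery reduction, R = 2^32, odd 32-bit modulus m.
       redc(t) returns t * R^-1 mod m for any t < m * R.             */
    static uint32_t m;     /* the odd modulus             */
    static uint32_t minv;  /* -m^-1 mod 2^32, precomputed */

    static void mont_setup(uint32_t modulus) {
        m = modulus;
        uint32_t inv = m;            /* Newton's iteration: each pass     */
        for (int i = 0; i < 5; i++)  /* doubles the bits that are correct */
            inv *= 2 - m * inv;      /* now m * inv == 1 (mod 2^32)       */
        minv = 0u - inv;
    }

    static uint32_t redc(uint64_t t) {
        uint32_t q = (uint32_t)t * minv;  /* multiply taking the low bits  */
        uint64_t r = (uint64_t)((t + (__uint128_t)q * m) >> 32);
                                          /* multiply taking the high bits */
        return r >= m ? (uint32_t)(r - m) : (uint32_t)r;
    }

    /* the conversions into and out of the Montgomery domain */
    static uint32_t to_mont(uint32_t a)   { return (uint32_t)(((uint64_t)a << 32) % m); }
    static uint32_t from_mont(uint32_t a) { return redc(a); }

    int main(void) {
        mont_setup(10007);                                /* any odd modulus */
        uint32_t a = to_mont(123456), b = to_mont(654321);
        printf("%u\n", from_mont(redc((uint64_t)a * b))); /* 123456*654321 mod 10007 */
        return 0;
    }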
Are these two methods algebraically related? Would it be possible to get fastmod working without the double width?
The article mentions the book and that the method presented therein is used by clang for faster division. The novel part is to adapt the method for faster remainders without calculating the quotient first.
The idea is not novel and goes back to at least 1973 (Jacobsohn). However, engineering matters because computer registers have a finite number of bits, and multiplications can overflow. I believe that, historically, this was first introduced into a major compiler (the GNU GCC compiler) by Granlund and Montgomery (1994). While GNU GCC and the Go compiler still rely on the approach developed by Granlund and Montgomery, other compilers like LLVM’s clang use a slightly improved version described by Warren in his book Hacker’s Delight.
The second edition of Hacker's Delight does have a section on "Testing for zero remainders after division by a constant", but it uses the multiplicative inverse.
I wouldn't be surprised if some people had already thought of it while coming up with a fast quotient method, but didn't bother to do anything with it because they did not need a remainder.
Yes, the paper is general and covers the case for arbitrary bit size. 64-bit would require a 128-bit approximate inverse in general, and the multiplication during computing the mod would extend that out by another 64 bits. So it would require three registers and two more multiplications on x64, I think. So I don't think it gains you anything in the general case. But for some divisors it might beat the traditional Granlund-Montgomery-Warren method, when you could use a smaller approximate inverse.
EDIT: I'm reading this Wikipedia article: https://en.wikipedia.org/wiki/Division_algorithm#Division_by...
For a 32-bit unsigned integer, you would use N = 33 when dividing by 3? And N = 35 when dividing by 10? I'm struggling to see how this works.
EDIT 2: I think this post helped me understand it: https://forums.parallax.com/discussion/114807/fast-faster-fa...
"The·basic idea is to approximate the ratio (1/constant) by another rational number (numerator/denominator) with a power of two as the denominator."
Their example:

    x / 10  ≈  (x * 205) >> 11

Maps to the original example: 2^11 / 10 = 2048 / 10 = 204.8, which rounds to 205. So N would be 11 in this example, but I guess it could be any value.
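As a sanity check I wrote a quick brute-force test (my own snippet, not from the article): it computes c = ceil(2^N / d) and verifies (x * c) >> N == x / d for every 32-bit x. With d = 3, N = 33 and d = 10, N = 35 it passes (each run takes a few seconds), and for these parameters the product x * c conveniently stays below 2^64.

    #include <stdint.h>
    #include <stdio.h>

    /* Brute-force check: does x / d == (x * c) >> N for all 32-bit x,
       with c = ceil(2^N / d)?  For d = 3, N = 33 and d = 10, N = 35 the
       product x * c stays below 2^64, so 64-bit arithmetic is enough.   */
    static void check(uint32_t d, unsigned N) {
        uint64_t c = ((1ULL << N) + d - 1) / d;  /* rounded-up reciprocal */
        for (uint64_t x = 0; x <= UINT32_MAX; x++)
            if ((x * c) >> N != x / d) {
                printf("d=%u N=%u: first mismatch at x=%llu\n",
                       d, N, (unsigned long long)x);
                return;
            }
        printf("d=%u N=%u: exact for all 32-bit x (c=%llu)\n",
               d, N, (unsigned long long)c);
    }

    int main(void) {
        check(3, 33);   /* takes a few seconds per call */
        check(10, 35);
        return 0;
    }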