Faster remainders when divisor is a constant: beating compilers and libdivide (lemire.me)
101 points by marcodiego 51 days ago | hide | past | favorite | 16 comments



Oh, a memory...

The only time I was able to beat compiler optimization was in an algebraic task where remainders of M mod P were needed, but the number M was fairly often between P and 2P, sometimes 3P, but rarely higher. (Due to the intrinsic character of the algorithm used.)

Of course, the compiler could not have known that in advance, but I did, so I tried subtracting P from M once or twice and checking whether it was already under P.

That was like 40 per cent faster than a single division of 32-bit numbers. But it was 2005, so YMMV.


What made it not possible for the compiler to notice that M was small?

I have been thinking about "Sufficiently Smart Compilers" recently and it seems like in most cases a SSC would be able to guess the distribution of numbers being divided.


Does your SSC concept include things like profile-guided optimization (ie, consuming runtime statistics for some representative sample)?

If your program is doing calculations on data read from a CSV, for example, and you happen to know facts about that data that allows you to make a better algorithm (for that data), then I don't see how a compiler would be able to compete.

That aside, I am reminded of: https://ridiculousfish.com/blog/posts/old-age-and-treachery....

"grep sells out its worst case (lots of partial matches) to make the best case (few partial matches) go faster"


The ideal SSC as I see it would need to be provided a set of "representative" inputs to optimize for. I don't think you should be able to expect a compiler to guess what the most common inputs would be. I completely agree that a SSC cannot beat a human who understands the typical input when the SSC does not.

My original comment was based on the idea that a SSC would be able to deduce M's distribution from the "intrinsic character of the algorithm used". In that case the optimization would be based on the code not the data.


Providing a set of "representative" inputs to optimize for gives the compiler at least as much as profile-guided optimization does (the compiler can always generate a profile from that input set).

The "intrinsic character of the algorithm used", I think, would mean the programmer has to tell the compiler how to do this faster - which the OP did when writing something along the lines of

  unsigned remainder(unsigned n, unsigned divisor) {
      if (n < divisor)
          return n;
      if (n < 2 * divisor)
          return n - divisor;
      return n % divisor;
  }


Faster remainders when divisor is a constant... Is this going to be a FizzBuzz speedup...? Checks article - yep there it is. ;-)


(2019)


as with everything written about division there are a host of mistakes. :)

i'm still wondering what the "out by 2 case" is for algorithm D...

most notably, for 'word-size' remainders the compiler most certainly does not evaluate the division and then work out the remainder... it comes all in one on hardware for the simple and most common case, and it's not far off for the multi-word stuff like GMP uses (with a single-word divisor) - even "algorithm D" keeps track of a remainder as it goes (as any long division algorithm must).

the find is interesting though. it's pleasant to see these kinds of problems continue to see improvements to the solutions available. i was completely expecting yet another "find some inverse" trick.


What is "out by 2 case" quoting?

> most notably for 'word-size' remainders the compiler most certainly does not evaluate the division then work out the remainder

I guess you're referring to this part of the article:

> How do compilers compute the remainder? They first compute the quotient n/d and then they multiply it by the divisor, and subtract all of that from the original value (using the identity n % d = n - (n/d) * d).

but the article is correct and is talking about constant divisors and the multiplication-by-inverse optimization here. For example, num % 5 on x86-64 gcc 11.2 -Ofast:

  foo(int):
        movsx   rax, edi
        mov     edx, edi
        imul    rax, rax, 1717986919
        sar     edx, 31
        sar     rax, 33
        sub     eax, edx
        lea     edx, [rax+rax*4]
        mov     eax, edi
        sub     eax, edx
        ret


> What if d is a constant, but not known to the compiler? Then you can use a library like libdivide.

Or you can use a JIT compiler ...


Allocating executable pages and running a compiler is orders of magnitude slower than just precomputing the multiplication factor.


Not the OP you're responding to, but if this bit of code is going to run many billions of times, it could very well be worth that cost.


In a lot of cases you're working with code where you're storing divisors as a part of a data structure, so a JIT isn't really going to help with that.


There are plenty of times when you’re processing data where you load what is a runtime constant value but not a compile time constant.

I mean, the extreme case is obviously a scripting language, where lowering to multiplication may well be possible.

That said, you are correct that a JIT is vastly more heavyweight than most cases would need - but mostly due to the cost of making an indirect call rather than the cost of the JIT itself.


There are lots of JIT compilers out there; I don't know of any that would detect that the divisor is a constant at runtime and specialize code for it. I think it's almost never done. In real life, JIT compilation can't take long, so the JIT compiler leaves a lot of possible optimizations on the table, covering just the low-hanging fruit.


how does it know the value remains constant?

this is annoyingly non-trivial, and exactly the sort of thing that gets culled for JIT performance.

the problem isn't generating code properly targeted at the hardware, or unrolling cycle-wasting late binding.



