
Optimization story: Switching from GMP to gcc's __int128 reduced run time by 95% - nanis
https://www.nu42.com/2016/01/excellent-optimization-story.html
======
sebcat
FWIW, __int128 and the optimizations mentioned work in clang too; my guess is
that the author is just running an old version.

    
    
        #include <stdio.h>
        
        int main() {
        	FILE *fp;
        	__int128 v=1234;
        
        	fp = fopen("foo12", "w");
        	fwrite(&v, sizeof(v), 1, fp);
        	fclose(fp);
        
        	return 0;
        }
    

With -O2 -march=native -ffinite-math-only -fno-math-errno, clang 3.5.0 on
linux/amd64 gives me:

    
    
        4005d0:	53                   	push   %rbx
        4005d1:	48 83 ec 10          	sub    $0x10,%rsp
        4005d5:	48 c7 44 24 08 00 00 	movq   $0x0,0x8(%rsp)
        4005dc:	00 00 
        4005de:	48 c7 04 24 d2 04 00 	movq   $0x4d2,(%rsp)
        4005e5:	00 
        4005e6:	bf b0 06 40 00       	mov    $0x4006b0,%edi
        4005eb:	be b6 06 40 00       	mov    $0x4006b6,%esi
        4005f0:	e8 bb fe ff ff       	callq  4004b0 <fopen@plt>
        4005f5:	48 89 c3             	mov    %rax,%rbx
        4005f8:	48 8d 3c 24          	lea    (%rsp),%rdi
        4005fc:	be 10 00 00 00       	mov    $0x10,%esi
        400601:	ba 01 00 00 00       	mov    $0x1,%edx
        400606:	48 89 d9             	mov    %rbx,%rcx
        400609:	e8 b2 fe ff ff       	callq  4004c0 <fwrite@plt>
        40060e:	48 89 df             	mov    %rbx,%rdi
        400611:	e8 6a fe ff ff       	callq  400480 <fclose@plt>
        400616:	31 c0                	xor    %eax,%eax
        400618:	48 83 c4 10          	add    $0x10,%rsp
        40061c:	5b                   	pop    %rbx
        40061d:	c3                   	retq   
        40061e:	66 90                	xchg   %ax,%ax

~~~
cnvogel
...and both compile a 64×64→128-bit multiplication into a single imul.

    
    
        __int128
        mult128(int64_t a, int64_t b)
        {
        	return (__int128)a * (__int128)b;
          a0:	48 89 f0             	mov    %rsi,%rax
          a3:	48 f7 ef             	imul   %rdi
          a6:	c3                   	retq

~~~
nanis
I have:

    
    
        Apple LLVM version 7.0.0 (clang-700.1.76)
        Target: x86_64-apple-darwin15.0.0
        Thread model: posix
    

I did mention that the _same_ source code, when compiled with gcc and run in a
Linux VM, runs faster than when it is compiled with clang and run on OS X.
That suggests some of those optimization options either do nothing or do not
work as well when using clang.

------
soldergenie
I ran into the need for this recently, except in a program written in Java...
I really wish the JVM had a 128-bit data type (or at least a muldiv function,
in which muldiv(a, b, c) = a*b/c, but where the intermediate a*b result can
exceed 64 bits).

~~~
tacos
write one? it's simple algebra.

~~~
acchow
Your homegrown object wouldn't have any special meaning in the JVM and
wouldn't be able to be optimized as a 128-bit integer.

~~~
jovdg
Parent probably means writing the muldiv function, which would return a
"basic" 64-bit int.
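
The "simple algebra" in question, sketched in Python for concision (on the JVM the same arithmetic would be done with Java longs; the function name and masking are illustrative, not from any library): split each 64-bit operand into 32-bit halves, form the four partial products, and propagate carries to get the full 128-bit result as a (hi, lo) pair.

```python
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    """Full 128-bit product of two unsigned 64-bit ints, using only
    intermediate values that fit in 64 bits (as one would on the JVM)."""
    a_lo, a_hi = a & 0xFFFFFFFF, a >> 32
    b_lo, b_hi = b & 0xFFFFFFFF, b >> 32
    # four 32x32 -> 64-bit partial products
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # carry propagation: collect everything that lands in bits 32..95
    mid = (ll >> 32) + (lh & 0xFFFFFFFF) + (hl & 0xFFFFFFFF)
    lo = ((mid << 32) | (ll & 0xFFFFFFFF)) & MASK64
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```

A full muldiv would additionally need a 128-by-64-bit division, which is more work; BigInteger sidesteps all of this at the cost of allocation.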

------
chenzhekl
I am just curious why gcc's __int128 is that much faster than GMP.

~~~
tacos
GMP is arbitrary precision. __int128 on a 64 bit processor is optimized for
one size.

~~~
jmgao
It's actually even better than that. They're only using __int128 to hold the
result of a 64 by 64 bit multiplication, and x86_64's mul instruction will
give you this for free. gcc is smart enough to generate code that is literally
just a single mul.

------
Someone
Given N, introducing q=10^N, they are looking for a and b in [0,q) such that

    
    
      (q+2a)^2-(2b-1)^2 = q^2-1
    

For example, for N=1, q=10, given 18^2 - 15^2 = 99, we find a solution with
a=4, b=8: 48
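
A brute-force check of that claim for small q (a sketch; `excellent` is a made-up name, and the lower bound a >= q/10 just ensures the concatenated number has the full 2N digits):

```python
def excellent(q):
    # all (a, b) with a*q + b == b*b - a*a, i.e. b^2 - a^2 written in
    # base q is the concatenation of a and b
    sols = []
    for a in range(q // 10, q):
        for b in range(q):
            if a * q + b == b * b - a * a:
                # the equivalent difference-of-squares form from above
                assert (q + 2*a)**2 - (2*b - 1)**2 == q*q - 1
                sols.append((a, b))
    return sols

print(excellent(10))   # [(4, 8)] -> the number 48
```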

That suggests using
[https://en.m.wikipedia.org/wiki/Midpoint_circle_algorithm#Va...](https://en.m.wikipedia.org/wiki/Midpoint_circle_algorithm#Variant_with_Integer-Based_Arithmetic)
to get rid of that pesky square root (circle drawing uses sums of squares, but
I think that is trivially adjusted for)

I also think it is worthwhile to extend this to other bases, as it more easily
produces more examples to look at for discovering structure, and gets rid of
the arbitrarily chosen number 10.

Edit: for those needing some explanation of how to get at that claim:

    
    
      let q > 0
    

(they pick q=10, q=100, q=1000, but this generalises that)

Assume

    
    
      aq+b = b^2 - a^2 for some a,b in [0,q)
    

in other words: b^2-a^2, written in base q, is ab

Then:

    
    
      b^2 - a^2 = qa+b                          <=> (bring all b’s to the left, all else to the right)
      b^2 - b = qa + a^2                        <=> (factor)
      b(b-1)  = a(a+q)                          <=> (using (p+q)(p-q) = p^2-q^2)
      (b-1/2)^2 - (1/2)^2 = (a+q/2)^2 - (q/2)^2 <=> (multiply by 4 on both sides)
      (2b-1)^2 - 1 = (2a+q)^2 - q^2             <=> (bring all a’s and all b’s to the right)
      q^2 - 1 = (2a+q)^2 - (2b-1)^2             <=> (switch sides; slight rewrite)
      (q+2a)^2 - (2b-1)^2 = q^2-1
    

Edit 2: for large q, it probably is faster to do the (p+q)(p-q) = p^2-q^2
trick again (apologies for naming the free variable q the same as the q
introduced earlier), and get:

    
    
      (q+2a+2b-1)(q+2a-2b+1) = q^2-1
    

Next, factor q^2-1, write all ways to write it as product of integers, and
check each case for integer solutions for a and b in the right range. For
example, for q=10, we have:

    
    
      99 = 1x99 = 3x33 = 9x11 = 11x9 = 33x3 = 99x1
    
      so we solve 6 systems of two linear equations in two variables (I guess
      one can easily discard half of them by arguing the p+q part must be
      larger than the p-q part, but that's peanuts):
    
      / q+2a+2b-1 =  1
      \ q+2a-2b+1 = 99
    
      / q+2a+2b-1 =  3
      \ q+2a-2b+1 = 33
    
      …
    
      / q+2a+2b-1 = 99
      \ q+2a-2b+1 =  1
    

This also suggests that the number of solutions with d base-q digits for a and
b is correlated with the number of divisors of q^2-1. That would link it to
[http://oeis.org/A000005](http://oeis.org/A000005).
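
For q = 10 the whole procedure fits in a few lines (a sketch; it walks every ordered factorisation of 99 and keeps the systems with in-range integer solutions):

```python
q = 10
n = q * q - 1          # factor q^2 - 1 = 99
solutions = []
for r in range(1, n + 1):
    if n % r:
        continue
    s = n // r         # one ordered factorisation 99 = r * s
    # solve  q + 2a + 2b - 1 = r,  q + 2a - 2b + 1 = s
    # =>     q + 2a = (r + s) / 2,  2b - 1 = (r - s) / 2
    two_a = (r + s) // 2 - q    # r and s are odd, so the halvings are exact
    two_b = (r - s) // 2 + 1
    if two_a % 2 == 0 and two_b % 2 == 0:
        a, b = two_a // 2, two_b // 2
        if q // 10 <= a < q and 0 <= b < q:
            solutions.append((a, b))

print(solutions)   # [(4, 8)] -> the excellent number 48
```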

Edit 3: and for the 10^N case, one might consider
[http://oeis.org/A095370](http://oeis.org/A095370). For example, the 62-digit
number 1111…1111 has only 5 prime factors. You have to add 2 factors of 3 to
get 9999…9999, so 10^62-1 can be written as product of two integers in 2^7
ways => only 2^6 = 64 equations to solve.

~~~
nanis
Interesting idea. Thank you. I'll try that out.

~~~
Someone
In Python:

    
    
      q = 10000000000
      n = 11111111111111111111
      factors = [3, 3, 11, 41, 101, 271, 3541, 9091, 27961]
      
      q10 = q // 10
      
      bits = len(factors)
      for i in range(2**bits):
        r = 1
        s = 1
        for j in range(bits):
          if i & (1 << j):
            r *= factors[j]
          else:
            s *= factors[j]
        # To solve:
        # q+2a+2b-1 = r
        # q+2a-2b+1 = s
        # q+2a = (r+s)/2
        # 2b-1 = (r-s)/2
        #
        # => a = ((r+s)/2 - q) / 2
        # => b = (1+(r-s)/2) / 2
        #
        # r and s are odd, so (r+s)/2 and (r-s)/2 are exact; a and b
        # are integers only when the remaining halvings come out even
        two_a = (r + s) // 2 - q
        if two_a % 2 == 0:
          a = two_a // 2
          two_b = (r - s) // 2 + 1
          if two_b % 2 == 0:
            b = two_b // 2
            if (q10 <= a < q) and (q10 <= b < q):
              print(a, b, b*b - a*a)
    

On my system:

    
    
      time (./ExcellentNumbers.py | sort -u)
      real	0m0.031s
      user	0m0.016s
      sys	0m0.011s
    

Beats the 1m40 seconds the optimisation got in C, but this is cheating, as it
hard-codes the factorisation. Doing that the proper way should not add
minutes, though.

If you want to tackle factoring large repunits, do make use of the fact that a
repunit with p×q digits can be trivially factored as a p-digit repunit times a
number of the form 1000…0001000…0001, or as a q-digit repunit times such a
number. For example, we have:

    
    
      111111111111 = 11 x 10101010101
                   = 111 x 1001001001
                   = 1111 x 100010001
                   = 111111 x 1000001
    

With those factors in hand, find a few others by taking GCDs of other factors,
or by ‘randomly’ dividing numbers. For example, 1111 does not share a factor
with 111, so 100010001 must be divisible by 111.
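
The splitting trick, checked in Python (`repunit(n)` is just shorthand for the n-digit repunit):

```python
def repunit(n):
    # the n-digit repunit 111...1
    return (10**n - 1) // 9

# a repunit whose digit count is composite splits along every divisor;
# the cofactor is the 1000...0001000...0001 pattern from the comment above
for d in (2, 3, 4, 6):
    assert repunit(12) % repunit(d) == 0

print(repunit(12) // repunit(2))   # 10101010101
```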

Edit: tried the 42-digit one. It has:

    
    
      q = 1000000000000000000000
      n = 111111111111111111111111111111111111111111
      factors = [3, 7, 7, 11, 13, 37, 43, 127, 239, 1933, 2689, 4649, 459691, 909091, 10838689]
    

Times on my system to find 335 solutions:

    
    
      real	0m0.219s
      user	0m0.185s
      sys	0m0.022s
    

I think that is a bit faster than what you had.

And you may want to loosen the range of values for b, replacing

    
    
      (q10 <= b < q)
    

by

    
    
      (0 <= b < q)

~~~
te
This is seriously awesome. Got to love it when you make a blog post about a
20x speedup, and the internet returns the favor with an ∞x speedup.

~~~
nanis
You need to keep in mind, though, that the factors of
9999999999999999999999999999999999999999 need to be found somehow. Sure, one
can "cheat" and use the factors of 1111111111111111111111111111111111111111
plus a couple of extra 3s, but then you have avoided the factoring problem.
That is fine normally, just not an apples-to-apples comparison.

~~~
Someone
Firstly, lots of work has been done here. See for example
[http://www.worldofnumbers.com/repunits.htm](http://www.worldofnumbers.com/repunits.htm)
(factorizations of repunits up to 8580 digits).

I also think repunits are, in general, easier to factor because ones that do
not have a prime number of digits have obvious divisors.

Certainly, repunits of up to 50 digits are factored easily (Google a bignum
factorization program online and try for yourself).

Even disregarding that, the original code computed a square root about sqrt(n)
times (that's why its running time went up by a factor of 10 for every two
digits added). Even without using advanced factoring algorithms, you can
factor that number using, at the very most, sqrt(n) trial divisions.
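
For scale: even naive trial division, sketched below, needs at most about sqrt(n) divisions, the same order as the square-root count of the original program:

```python
def trial_factor(n):
    # factor n using at most ~sqrt(n) trial divisions
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)   # whatever is left over is prime
    return factors

print(trial_factor(99))   # [3, 3, 11]
```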

------
alexchamberlain
Could you approximate the square root using 2 integers, rather than using
floating point arithmetic?
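
One way (a sketch): Newton's method carried out entirely in integers converges to the exact floor square root, no floating point needed:

```python
def isqrt(n):
    # exact floor(sqrt(n)) for any non-negative integer, via Newton's method
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)   # guaranteed overestimate of the root
    while True:
        y = (x + n // x) // 2
        if y >= x:                         # stopped decreasing: x is the answer
            return x
        x = y

print(isqrt(10**40))   # 10**20, exact even where floats lose precision
```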

------
marvy
Small correction: K should equal 10^d, not d/2. Unless I'm missing something.

~~~
nanis
d is the number of digits in 10a + b. Therefore, a and b both have d/2 digits.

~~~
marvy
That works too, but the author said they have d digits each.

