Hacker News new | comments | show | ask | jobs | submit login
Optimization story: Switching from GMP to gcc's __int128 reduced run time by 95% (nu42.com)
129 points by nanis on Jan 16, 2016 | hide | past | web | favorite | 31 comments



FWIW, __int128 and the optimizations mentioned works in clang too, my guess is that the author just runs an old version.

    #include <stdio.h>
    
    int main() {
    	FILE *fp;
    	__int128 v=1234;
    
    	fp = fopen("foo12", "w");
    	fwrite(&v, sizeof(v), 1, fp);
    	fclose(fp);
    
    	return 0;
    }
with -O2 -march=native -ffinite-math-only -fno-math-errno, for clang 3.5.0, linux/amd64, gives me:

    4005d0:	53                   	push   %rbx
    4005d1:	48 83 ec 10          	sub    $0x10,%rsp
    4005d5:	48 c7 44 24 08 00 00 	movq   $0x0,0x8(%rsp)
    4005dc:	00 00 
    4005de:	48 c7 04 24 d2 04 00 	movq   $0x4d2,(%rsp)
    4005e5:	00 
    4005e6:	bf b0 06 40 00       	mov    $0x4006b0,%edi
    4005eb:	be b6 06 40 00       	mov    $0x4006b6,%esi
    4005f0:	e8 bb fe ff ff       	callq  4004b0 <fopen@plt>
    4005f5:	48 89 c3             	mov    %rax,%rbx
    4005f8:	48 8d 3c 24          	lea    (%rsp),%rdi
    4005fc:	be 10 00 00 00       	mov    $0x10,%esi
    400601:	ba 01 00 00 00       	mov    $0x1,%edx
    400606:	48 89 d9             	mov    %rbx,%rcx
    400609:	e8 b2 fe ff ff       	callq  4004c0 <fwrite@plt>
    40060e:	48 89 df             	mov    %rbx,%rdi
    400611:	e8 6a fe ff ff       	callq  400480 <fclose@plt>
    400616:	31 c0                	xor    %eax,%eax
    400618:	48 83 c4 10          	add    $0x10,%rsp
    40061c:	5b                   	pop    %rbx
    40061d:	c3                   	retq   
    40061e:	66 90                	xchg   %ax,%ax


...and both compile a 128=64×64 multiplication into a single imul.

    __int128
    mult128(int64_t a, int64_t b)
    {
    	return (__int128)a * (__int128)b;
      a0:	48 89 f0             	mov    %rsi,%rax
      a3:	48 f7 ef             	imul   %rdi
      a6:	c3                   	retq


I have:

    Apple LLVM version 7.0.0 (clang-700.1.76)
    Target: x86_64-apple-darwin15.0.0
    Thread model: posix
I did mention that the same source code when compiled with gcc runs faster in a Linux VM than when it is compiled with clang and run in OSX. That means, some of those optimization options either do not mean anything or do not work as well when using clang.


I ran into the need for this recently, except for a program written in Java... I really wish the JVM had a 128 bit data type (or at least a muldiv function, in which muldiv(a, b, c) = ab/c, but the intermediate ab result can exceed 64 bits.


You could split A into multiple components (1st to 4th byte, 5th to 8th byte), and multiply/divide those components separately. That would still fit into 64bit.

Partial example is here: https://codereview.stackexchange.com/questions/72286/multipl...


Yeah, I did try that, but performance was lacking. Java, if you stick with primitives and use a C-like programming style (miniminal objects) was pretty close to C performance when I tried it, except for this one gap.


As I read the article, I was wondering if the GCC compiler was converting some of the code to intrinsics?

If this is the case, and I'm not a Java programmer, I was wondering if there is a way in Java to tell the JVM to do something similar and, if so, how does the JVM cope with different CPUs. Does the JVM, on x86-64 for example, know when it can use SSE instructions?


Speaking of hotspot: No intrinsics for >64 bits. Comparing Strings/copying arrays does utilize SSE. There is a minimal support for auto-vectorizing some simple loops. [0] (4y old thread)

[0]: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/...


In those cases you could of course opt to write the high-performance bit in C and have Java call that. No reason to stick to one language if performance is critical.


The overhead of calling a JNI function is considerable so this only works for functions that do a relatively large amount of work. There are also a ton of other drawbacks to this related to the fact that you've now escaped the JVM, miss out on the management of memory for those functions (so better make them stateless) and a host of other potential pitfalls depending on what platform you're doing this.

Definitely not an 'of course', and for critical performance if the function is small enough it might actually be slower to write it in C and call it from the JVM.


Indeed, you need a relatively large direct ByteBuffers and optimally paralyze the JNI call to take benefit of multi-core. It's a non-trivial task to get stable performance boosts unless the data volume is sufficient.


write one? it's simple algebra.


The problem is performance when working with an entire object rather than a primitive type.


Your homegrown object wouldn't have any special meaning in the JVM and wouldn't be able to be optimized as a 128-bit integer.


Parent probably means writing the muldiv function, which would return a "basic" 64bit int.


I am just curiosu why gcc's __int128 is that faster than GMP.


Because GMP is designed for arbitrary size integers, and for safety. Given two mpz_t operands to multiply, it needs, in some order, to:

* Dereference pointers

* Extract the signs of both input operands

* Extract the sizes (number of limbs) of both input operands

* Check whether the output variable has enough space for the result, and otherwise, reallocate it

* Check whether the output operand is the same object in memory as either input, and in that case allocate temporary scratch space and make a copy of the data (since the multiplication algorithm doesn't allow aliasing)

* Decide which multiplication algorithm to use (different algorithms are optimal for different sizes)

* Call the chosen multiplication algorithm

* In the multiplication algorithm, loop over the variable numbers of limbs

* Actually perform the arithmetic operations in the multiplication

* Normalize the resulting size by checking whether the top limb of the output is nonzero

* Write the sign and size of the result to the output variable

In addition, mpz_t variables need to malloc and free every time they are created and destroyed. This is not so much a concern in practice, because optimizing programmers will reuse variables as much as possible. (However, some language wrappers don't reuse GMP variables, and are unnecessarily slow as a result.)

You can use GMP's internal functions if you want faster arithmetic and either know that overflow cannot occur or want to do overflow checking manually. For example, if z, x and y are pairs of 64-bit integers, not aliased with each other, then mpn_mullo_basecase(&z, &x, &y, 2) does a 128-bit multiplication (mod 2^128). Depending on the platform, using mpn_mul_basecase to compute the full product with some scratch space might be faster.

This kind of code should generally be faster than calling the general and safe function mpz_mul, though it will still have a few cycles of overhead for the function call and looping.

For integers up to a few hundred bits, inlined fixed-size code is always going to be faster.


GMP is arbitrary precision. __int128 on a 64 bit processor is optimized for one size.


It's actually even better than that. They're only using __int128 to hold the result of a 64 by 64 bit multiplication, and x86_64's mul instruction will give you this for free. gcc is smart enough to generate code that is literally just a single mul.


A few inline instructions vs a library call.


Given N, introducing q=10^N, they are looking for a and b in [0,q) such that

  (q+2a)^2-(2b-1)^2 = q^2-1
For example, for N=1, q=10, given 18^2 - 15^2 = 99, we find a solution with a=4, b=8: 48

That suggests using https://en.m.wikipedia.org/wiki/Midpoint_circle_algorithm#Va... to get rid of that pesky square root (circle drawing uses sums of squares, but I think that is trivially adjusted for)

I also think it is worthwhile to extend this to other bases, as it more easily produces more examples to look at for discovering structure, and gets rid of the arbitrarily chosen number 10.

Edit: for those needing some explanation how to get at that claim:

  let q > 0
(they pick q=10, q=100, q=1000, but this generalises this)

Assume

  aq+b = b^2 - a^2 for some a,b in [0,q]
in other words: b^2-a^2, written in base q, is ab

Then:

  b^2 - a^2 = qa+b                          <=> (bring all b’s to the left, all else to the right)
  b^2 - b = qa + a^2                        <=> (factor)
  b(b-1)  = a(a+q)                          <=> (using (p+q)(p-q) = p^2-q^2)
  (b-1/2)^2 - (1/2)^2 = (a+q/2)^2 - (q/2)^2 <=> (multiply by 4 on both sides)
  (2b-1)^2 - 1 = (2a+q)^2 - q^2             <=> (bring all a’s and all b’s to the right)
  q^2 - 1 = (2a+q)^2 - (2b-1)^2             <=> (switch sides; slight rewrite)
  (q+2a)^2 - (2b-1)^2 = q^2-1
Edit 2: for large q, it probably is faster to do the (p+q)(p-q) = p^2-q^2 trick again (apologies for naming the free variable q the same as the q introduced earlier), and get:

  (q+2a+2b-1)(q+2a-2b+1) = q^2-1
Next, factor q^2-1, write all ways to write it as product of integers, and check each case for integer solutions for a and b in the right range. For example, for q=10, we have:

  99 = 1x99 = 3x33 = 9x11 = 11x9 = 33x3 = 99x1, so we solve 6 systems (I guess one easily discard half of them on arguments of the p+q part being larger than the p-q part, but that’s peanuts) of two linear equations in two variables:

  / q+2a+2b-1 =  1
  \ q+2a-2b+1 = 99

  / q+2a+2b-1 =  3
  \ q+2a-2b+1 = 33

  …

  / q+2a+2b-1 = 99
  \ q+2a-2b+1 =  1
This also suggests that the number of solutions with d base q digits for a and b is correlated with the number of integers that divide q^2-1. That would link it to http://oeis.org/A000010.

Edit 3: and for the 10^N case, one might consider http://oeis.org/A095370. For example, the 62-digit number 1111…1111 has only 5 prime factors. You have to add 2 factors of 3 to get 9999…9999, so 10^62-1 can be written as product of two integers in 2^7 ways => only 2^6 = 64 equations to solve.


Interesting idea. Thank you. I'll try that out.


In Python:

  q = 10000000000
  n = 11111111111111111111
  factors = [3, 3, 11, 41, 101, 271, 3541, 9091, 27961]
  
  q10 = q/10
  
  bits = len(factors)
  #print b
  #print 11*41*101*271*3541*9091*27961
  for i in range(2**bits):
    r = 1
    s = 1
    for j in range(bits):
      if i&(1<<j):
        r *= factors[j]
      else:
        s *= factors[j]
    # To solve:
    # q+2a+2b-1 = r
    # q+2a-2b+1 = s
    # q+2a = (r+s)/2
    # 2b-1 = (r-s)/2
    # 
    # => a = ((r+s)/2 - q) / 2
    # => b = (1+(r-s)/2) / 2
    #
    # r and s are odd, so we do not have to worry much about fractions
    two_a = (r+s)/2 - q
    if two_a % 1 == 0:
      a = two_a / 2
      two_b = (r-s)/2 + 1
      if two_b %1  == 0:
        b = two_b / 2
        if (q10 <= a < q) and (q10 <= b < q):
          print a, b, b*b-a*a
On my system:

  time (./ExcellentNumbers.py | sort -u)
  real	0m0.031s
  user	0m0.016s
  sys	0m0.011s
Beats the 1m40 seconds the optimisation got in C, but this is cheating, as it hard-codes the factorisation. Doing that the proper way should not add minutes, though.

If you want to tackle factoring large repunits, do make use of the fact that a repunit containing pq digits can be trivially factored as either a p-digit reunite and a 1000…0001000…0001… one or a q-digit repunit. For example, we have:

  111111111111 = 11 x 10101010101
               = 111 x 1001001001
               = 1111 x 100010001
               = 111111 x 1000001
With those factors in hand, find a few others by taking GCDs of other factors, or by ‘randomly’ dividing numbers. For example, 1111 does not share a factor with 111, so 100010001 must be divisible by 111.

Edit: tried the 42-digit one. It has:

  q = 1000000000000000000000
  n = 111111111111111111111111111111111111111111
  factors  [3,7,7,11,13,37,43,127,239,1933,2689,4649,459691,909091,10838689]
Times on my system to find 335 solutions:

  real	0m0.219s
  user	0m0.185s
  sys	0m0.022s
I think that is a bit faster than what you had.

And you may want to loosen the range of values for b, replacing

  (q10 <= b < q)
by

  (0 <= b < q)


This is seriously awesome. Got to love when you make a blog post about a 20x speedup, and the internet returns the favor with an Infx speedup.


You need to keep in mind, though, that the factors of 9999999999999999999999999999999999999999 need to be found somehow. Sure, one can "cheat" and use the factors of 1111111111111111111111111111111111111111 and a couple of extra 3s, but, then you have avoided the problem. That is fine normally, but just not an apples-to-apples comparison.


Firstly, lots of work has been done here. See for example http://www.worldofnumbers.com/repunits.htm (factorizations of repunits up to 8580 digits).

I also think repunits are, in general, easier to factor because ones that do not have a prime number of digits have obvious divisors.

Certainly, rep units of up to 50 digits are factored easily (Google a bignum factorization program online, and try for yourself)

Even disregarding that, the original code computed a square root about sqrt(n) times (that's why its running time went up by a factor of 10 for every two digits added). Even without using advanced factoring algorithms, you can factor that number using, at the very most, sqrt(n) trial divisions.


I tried out the arithmetic, and it works. So, this transforms the problem drastically. Nice insight.


Could you approximate the square root using 2 integers, rather than using floating point arithmetic?


Small correction: K should equal 10^d, not d/2. Unless I'm missing something.


d is the number of digits in 10a + b. Therefore, a and b both have d/2 digits.


That works too, but the author said they have d digits each.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: