Apple LLVM version 7.0.0 (clang-700.1.76)
Target: x86_64-apple-darwin15.0.0
Thread model: posix
I did mention that the same source code, compiled with gcc, runs faster in a Linux VM than it does when compiled with clang and run on OS X. That suggests some of those optimization options either mean nothing to clang or do not work as well there.
I ran into the need for this recently, except for a program written in Java... I really wish the JVM had a 128-bit data type (or at least a muldiv function, where muldiv(a, b, c) = a*b/c but the intermediate product a*b may exceed 64 bits).
You could split A into multiple components (1st to 4th byte, 5th to 8th byte) and multiply/divide those components separately. The partial results would still fit into 64 bits.
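The splitting trick can be sketched in Python (names and the 32-bit split are my own choices; the real thing in Java is messier because of the lack of unsigned 64-bit types, but the arithmetic is the same):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    """Multiply two unsigned 64-bit values via their 32-bit halves.

    Returns (hi, lo), the high and low 64-bit words of the 128-bit
    product -- the intermediate result that won't fit in one long.
    """
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    # each partial product fits comfortably in 64 bits
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # accumulate the middle 32-bit columns, tracking carries
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = ((mid << 32) | (ll & MASK32)) & MASK64
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```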
Yeah, I did try that, but performance was lacking. When I tried it, Java with primitives and a C-like programming style (minimal objects) was pretty close to C performance, except for this one gap.
As I read the article, I was wondering if the GCC compiler was converting some of the code to intrinsics?
If this is the case, and I'm not a Java programmer, I was wondering if there is a way in Java to tell the JVM to do something similar and, if so, how does the JVM cope with different CPUs. Does the JVM, on x86-64 for example, know when it can use SSE instructions?
Speaking of HotSpot:
No intrinsics for >64-bit arithmetic.
Comparing Strings/copying arrays does utilize SSE.
There is minimal support for auto-vectorizing some simple loops. [0] (4-year-old thread)
In those cases you could of course opt to write the high-performance bit in C and have Java call that. No reason to stick to one language if performance is critical.
The overhead of calling a JNI function is considerable, so this only works for functions that do a relatively large amount of work. There are also a ton of other drawbacks: you have now escaped the JVM, you miss out on its memory management for those functions (so better make them stateless), and there is a host of other potential pitfalls depending on what platform you're doing this on.
Definitely not an 'of course', and if the function is small enough, writing it in C and calling it from the JVM might actually be slower.
Indeed, you need relatively large direct ByteBuffers, and ideally you parallelise the JNI calls to benefit from multiple cores.
It's a non-trivial task to get stable performance boosts unless the data volume is sufficient.
Because GMP is designed for arbitrary size integers, and for safety. Given two mpz_t operands to multiply, it needs, in some order, to:
* Dereference pointers
* Extract the signs of both input operands
* Extract the sizes (number of limbs) of both input operands
* Check whether the output variable has enough space for the result, and otherwise, reallocate it
* Check whether the output operand is the same object in memory as either input, and in that case allocate temporary scratch space and make a copy of the data (since the multiplication algorithm doesn't allow aliasing)
* Decide which multiplication algorithm to use (different algorithms are optimal for different sizes)
* Call the chosen multiplication algorithm
* In the multiplication algorithm, loop over the variable numbers of limbs
* Actually perform the arithmetic operations in the multiplication
* Normalize the resulting size by checking whether the top limb of the output is nonzero
* Write the sign and size of the result to the output variable
In addition, mpz_t variables need to malloc and free every time they are created and destroyed. This is not so much a concern in practice, because optimizing programmers will reuse variables as much as possible. (However, some language wrappers don't reuse GMP variables, and are unnecessarily slow as a result.)
You can use GMP's internal functions if you want faster arithmetic and either know that overflow cannot occur or want to do overflow checking manually. For example, if z, x and y are pairs of 64-bit integers, not aliased with each other, then mpn_mullo_basecase(&z, &x, &y, 2) does a 128-bit multiplication (mod 2^128). Depending on the platform, using mpn_mul_basecase to compute the full product with some scratch space might be faster.
This kind of code should generally be faster than calling the general and safe function mpz_mul, though it will still have a few cycles of overhead for the function call and looping.
For integers up to a few hundred bits, inlined fixed-size code is always going to be faster.
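To make the mpn-level idea concrete, here is a pure-Python model of what a 2-limb "mullo" computes (this models the limb bookkeeping only, not GMP's actual implementation; real code would call the C API directly):

```python
LIMB = 64
MASK = (1 << LIMB) - 1

def mullo_2_limbs(x_limbs, y_limbs):
    """Low 2 limbs (128 bits) of the product of two 2-limb numbers.

    A pure-Python sketch of what a call like
    mpn_mullo_basecase(z, x, y, 2) computes; least significant limb
    first, following GMP's mpn convention.
    """
    x0, x1 = x_limbs
    y0, y1 = y_limbs
    p = x0 * y0                  # full product of the low limbs
    z0 = p & MASK
    # the cross terms x0*y1 and x1*y0 only affect the high limb
    # (mod 2^128); x1*y1 vanishes entirely
    z1 = ((p >> LIMB) + x0 * y1 + x1 * y0) & MASK
    return (z0, z1)
```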
It's actually even better than that. They're only using __int128 to hold the result of a 64 by 64 bit multiplication, and x86_64's mul instruction will give you this for free. gcc is smart enough to generate code that is literally just a single mul.
I also think it is worthwhile to extend this to other bases: it produces more examples to look at for discovering structure, and it gets rid of the arbitrarily chosen number 10.
Edit: for those wanting some explanation of how to arrive at that claim:
let q > 0
(they pick q=10, q=100, q=1000, but this generalises it)
Assume
aq+b = b^2 - a^2 for some a,b in [0,q]
in other words: b^2-a^2, written in base q, is ab
Then:
b^2 - a^2 = qa+b <=> (bring all b’s to the left, all else to the right)
b^2 - b = qa + a^2 <=> (factor)
b(b-1) = a(a+q) <=> (using (p+q)(p-q) = p^2-q^2)
(b-1/2)^2 - (1/2)^2 = (a+q/2)^2 - (q/2)^2 <=> (multiply by 4 on both sides)
(2b-1)^2 - 1 = (2a+q)^2 - q^2 <=> (bring all a’s and all b’s to the right)
q^2 - 1 = (2a+q)^2 - (2b-1)^2 <=> (switch sides; slight rewrite)
(q+2a)^2 - (2b-1)^2 = q^2-1
Edit 2: for large q, it probably is faster to do the (p+q)(p-q) = p^2-q^2 trick again (apologies for naming the free variable q the same as the q introduced earlier), and get:
(q+2a+2b-1)(q+2a-2b+1) = q^2-1
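The derivation is easy to sanity-check numerically, e.g. with the excellent number 48 (q=10, a=4, b=8) and with 3468 (q=100, a=34, b=68):

```python
q, a, b = 10, 4, 8   # 48 is excellent: 8**2 - 4**2 == 48
assert b * b - a * a == q * a + b                        # defining equation
assert (q + 2 * a) ** 2 - (2 * b - 1) ** 2 == q * q - 1  # difference of squares
assert (q + 2 * a + 2 * b - 1) * (q + 2 * a - 2 * b + 1) == q * q - 1  # factored

q, a, b = 100, 34, 68   # 3468 works the same way in base 100
assert b * b - a * a == q * a + b
assert (q + 2 * a + 2 * b - 1) * (q + 2 * a - 2 * b + 1) == q * q - 1
```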
Next, factor q^2-1, list all ways to write it as a product of two integers, and check each case for integer solutions for a and b in the right range. For example, for q=10, we have:
99 = 1x99 = 3x33 = 9x11 = 11x9 = 33x3 = 99x1, so we solve 6 systems of two linear equations in two variables (one can easily discard half of them by noting the p+q factor must be the larger one, but that's peanuts):
/ q+2a+2b-1 = 1
\ q+2a-2b+1 = 99
/ q+2a+2b-1 = 3
\ q+2a-2b+1 = 33
…
/ q+2a+2b-1 = 99
\ q+2a-2b+1 = 1
This also suggests that the number of solutions with d base-q digits for a and b is correlated with the number of divisors of q^2-1. That would link it to http://oeis.org/A000005.
Edit 3: and for the 10^N case, one might consider http://oeis.org/A095370. For example, the 62-digit number 1111…1111 has only 5 prime factors. You have to add 2 factors of 3 to get 9999…9999, so 10^62-1 can be written as product of two integers in 2^7 ways => only 2^6 = 64 equations to solve.
q = 10**10
n = 11111111111111111111  # the 20-digit repunit
# prime factors of q*q - 1 = 9*n (hard-coded; "cheating", as noted below)
factors = [3, 3, 11, 41, 101, 271, 3541, 9091, 27961]
q10 = q // 10
bits = len(factors)
for i in range(2**bits):
    # split the prime factors into two products r and s, with r*s = q*q - 1
    r = 1
    s = 1
    for j in range(bits):
        if i & (1 << j):
            r *= factors[j]
        else:
            s *= factors[j]
    # To solve:
    # q+2a+2b-1 = r
    # q+2a-2b+1 = s
    # q+2a = (r+s)/2
    # 2b-1 = (r-s)/2
    #
    # => a = ((r+s)/2 - q) / 2
    # => b = (1+(r-s)/2) / 2
    #
    # r and s are odd, so (r+s)/2 and (r-s)/2 are exact
    two_a = (r + s) // 2 - q
    if two_a % 2 == 0:
        a = two_a // 2
        two_b = (r - s) // 2 + 1
        if two_b % 2 == 0:
            b = two_b // 2
            if q10 <= a < q and q10 <= b < q:
                print(a, b, b * b - a * a)
On my system:
time (./ExcellentNumbers.py | sort -u)
real 0m0.031s
user 0m0.016s
sys 0m0.011s
Beats the 1m40s the optimisation got in C, but this is cheating, as it hard-codes the factorisation. Doing that the proper way should not add minutes, though.
If you want to tackle factoring large repunits, do make use of the fact that a repunit with pq digits can be trivially factored as a p-digit repunit times a number of the form 1000…0001000…0001…, or as a q-digit repunit times a similar cofactor. For example, we have:
111111111111 = 11 x 10101010101
= 111 x 1001001001
= 1111 x 100010001
= 111111 x 1000001
With those factors in hand, find a few others by taking GCDs of other factors, or by ‘randomly’ dividing numbers. For example, 1111 does not share a factor with 111, so 100010001 must be divisible by 111.
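These relations are easy to check (a small Python sketch; repunit() is my own helper):

```python
from math import gcd

def repunit(n):
    """The n-digit repunit 111...1."""
    return (10**n - 1) // 9

# repunit(p) divides repunit(p*q); the cofactor is 1000...0001...-shaped
assert repunit(12) % repunit(3) == 0
assert repunit(12) // repunit(4) == 100010001

# 1111 shares no factor with 111, so the cofactor 100010001
# must be divisible by 111
assert gcd(repunit(4), repunit(3)) == 1
assert (repunit(12) // repunit(4)) % repunit(3) == 0
```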
Edit: tried the 42-digit one. It has:
q = 1000000000000000000000
n = 111111111111111111111111111111111111111111
factors = [3, 7, 7, 11, 13, 37, 43, 127, 239, 1933, 2689, 4649, 459691, 909091, 10838689]
Times on my system to find 335 solutions:
real 0m0.219s
user 0m0.185s
sys 0m0.022s
I think that is a bit faster than what you had.
And you may want to loosen the range of values for b, replacing
You need to keep in mind, though, that the factors of 9999999999999999999999999999999999999999 need to be found somehow. Sure, one can "cheat" and use the factors of 1111111111111111111111111111111111111111 plus a couple of extra 3s, but then you have just sidestepped the problem. That is fine normally, just not an apples-to-apples comparison.
I also think repunits are, in general, easier to factor because ones that do not have a prime number of digits have obvious divisors.
Certainly, repunits of up to 50 digits are factored easily (Google a bignum factorization program online and try for yourself).
Even disregarding that, the original code computed a square root about sqrt(n) times (that's why its running time went up by a factor of 10 for every two digits added). Even without using advanced factoring algorithms, you can factor that number using, at the very most, sqrt(n) trial divisions.
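A minimal sketch of that bound (function name is my own): plain trial division never tries a candidate divisor larger than sqrt of the remaining cofactor, so it spends at most about sqrt(n) divisions in total.

```python
def trial_factor(n):
    """Factor n > 1 by trial division, returning prime factors in order."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:      # peel off each prime as often as it divides
            factors.append(d)
            n //= d
        d += 1
    if n > 1:                  # whatever remains is prime
        factors.append(n)
    return factors
```

For example, trial_factor(9999999999) recovers the factorisation of the 10-nines number used above without any hard-coding.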