
Comparing the CPU of the iPad Pro 9.7 to My MacBook, Surface and 6S - greenspot
http://stephencoyle.net/projects/single-core-studies/
======
madisp
Sadly the data is useless without the exact compiler flags.

I get 18.0658 with default "g++ PrimeChecker.c -o PrimeChecker" and 11.7689
with "g++ PrimeChecker.c -O2 -o PrimeChecker"

edit: the hardware is an i7-4870HQ, quite close to the i7-4850HQ that's in the
blogpost (+200MHz clock)

~~~
maccard
I'm running an i7-6700k on Windows 10 with the MSVC 2015 compiler. My results
were 9592 primes between 0 and 100000, Time taken = 12.808 without
optimisations, and 9592 primes between 0 and 100000, Time taken = 10.853 with
full optimisations. I would have thought it would have been much better.

~~~
bithush
It is wasting a lot of time with the pointless bool assignments. See my post
below.

~~~
maccard
But then I'm not running the same code as either the parent of my comment or
the article, and the benchmark doesn't hold. It's how fast my CPU can run the
same code vs. theirs. I could make it faster again by precomputing a table and
doing a lookup, but that defeats the point!

~~~
bithush
Very good point. I was just surprised that such a simple bool assignment
slowed things down so massively.

~~~
maccard
I started at 10.853, and with very little effort brought it down to 0.003

The slowdown isn't in assigning the bool; if you pass 10000000 into that
function, isPrime gets set on the first iteration, but if you early out, you
save all of those remaining iterations. Again, this isn't really the point of
the exercise; it's supposed to be comparing the performance of the same piece
of code run on different hardware.

[https://gist.github.com/anonymous/c6d5bae04334cbc7ef9583ebb7...](https://gist.github.com/anonymous/c6d5bae04334cbc7ef9583ebb7faf587)

------
bithush
I ran the C++ version on my i7-2640M laptop (ThinkPad T420s) and got 20.394s

Interestingly, it is almost identical to his MacBook Pro from two generations
later.

Edit: and if I fix isPrime to not waste time on the pointless prime bool I get
1.862s

Edit 2: and with my horrible Java copy&paste version[0] I get 2.050s

Edit 3: OK, final edit, but looping only up to squareroot(n)+1 speeds it up to
0.012s for C++ and 0.019s for Java. Fast enough for me, good night :)

[0]
[https://gist.github.com/anonymous/0709e1807f57b686683f3e5f7b...](https://gist.github.com/anonymous/0709e1807f57b686683f3e5f7b010558)

------
krick
This is not comparing CPUs. It is comparing execution times, on different
hardware, of several fairly non-descript programs written in different
languages, without even properly describing the compile step, which is
basically meaningless.

------
ekr
I'm not at all familiar with the Apple development ecosystem, but since Swift
has been open-sourced, why not run the Swift version on all platforms? There
are vast differences between the implementations of C++ and Swift, even if
both are compiled to native machine code.

That aside, given the hype behind Apple devices, I'm sure there have been
loads of benchmarks published already. What I would personally be more
interested in, would be to see how the A9X fares against Intel's Core M's,
which have a similar power envelope.

~~~
msbarnett
> For some reason the Swift version took much longer to run on the Macs.
> Something on the order of 50-65 seconds. I presume I overlooked some
> optimisation or compiler setting.

A version which shows the iPad performance crushing the Macs, not because of
CPU performance but because of some kind of Swift performance regression on
the desktop, would not have been interesting.

------
largote
It'd be more interesting to see a test with more branching as that is the one
big area where x86 CPUs tend to shine compared to "simpler" architectures (or
things like GPUs).

Also, I don't think this exercises the floating-point units of these CPUs (I
haven't looked at the code, though).

~~~
twotwotwo
Yeah. Here's another benchmark that found that an earlier Apple CPU got
impressive instructions-per-clock on a couple of cryptographic functions (ones
that aren't special-cased like AES):
[https://zerobyte.io/blog/2014/04/29/benchmarking-symmetric-c...](https://zerobyte.io/blog/2014/04/29/benchmarking-symmetric-crypto-on-the-apple-a7/)

But those functions also don't really stress-test branch prediction, etc. They
use instruction-level parallelism but the control flow and memory accesses are
predictable. (That's almost necessary for software crypto; if execution isn't
~constant-time, you risk timing attacks.)

Looking at transistor counts: Apple said the three-core A8X had three billion
transistors total. Given Apple's focus on GPU and bringing other components
onto the SoC, many of those are not spent on the cores; still, in raw count,
it's right up there with Intel's Core-branded CPU+GPU dies, going by a table
in Wikipedia.

AnandTech put up a lot of numbers about the A9, including specs like cache
sizes and memory bandwidth and benchmarks like Geekbench and SPEC:
[http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-i...](http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4)

Seems like, at a minimum, if your intuitions about mobile performance come
from early in-order chips, they may not apply to Apple CPUs or even the many
Cortex-A57-based SoCs out there. But it also doesn't seem like you can say
those chips are up there with Intel ones in general. I guess the evergreen-
but-not-wrong conclusion is that to really know your code's performance, you
want to test it on hardware as similar as possible to what it'll really run on.

------
mayoff
Since both Visual Studio and Xcode support C++, why not just run the C++
version on all platforms?

------
caleblloyd
I recently delved into prime-number computation algorithms, and found that the
fastest algorithm was outlined by the primesieve.org project:
[http://primesieve.org/segmented_sieve.html](http://primesieve.org/segmented_sieve.html)

Their algorithm optimizes around the L1 cache size. I'm not sure of the L1
cache size on Apple's SoCs, but to my knowledge Intel has been using 32KB for
a while. It'd be cool to see the primesieve.org algorithm on these devices,
optimized for their respective L1 cache sizes.

I took a stab at converting their algorithm to a Go program, but
primesieve.org's C version still runs orders of magnitude faster.

My Go Program and times:
[https://github.com/caleblloyd/primesieve](https://github.com/caleblloyd/primesieve)
Primesieve.org published times:
[http://primesieve.org/](http://primesieve.org/)

EDIT: If you have any performance suggestions for my Go version of primesieve,
I've opened an Ask HN thread:
[https://news.ycombinator.com/item?id=11413827](https://news.ycombinator.com/item?id=11413827)

------
piinbinary
I wonder if the fact that it scans 2..<number instead of 2..<sqrt(number), and
doesn't break upon discovering a factor, has any impact on the relative speeds
due to differences in branch prediction.

------
microtheo
For what it's worth, my i5 Surface Pro 4 executed this little program in
20.891s (compiled with mingw64-gcc), i.e. 0.9x.

Pleased to see that it is twice as powerful as the i3 Surface Pro 3 :)

~~~
microtheo
And 0.124s with sqrt(n) and -Os, and 0.02s by breaking out of the for loop...
Let's go further than 100000 :)

664579 primes between 0 and 10000000 Time taken = 11.319

------
Retric
The i7 2600 is from January 2011, making it over 5 years old, and it's still
faster. Just not ridiculously so.

~~~
jkot
I have an i7 2600K @ 5GHz with water cooling. Very hard to find a replacement.
The test runs in 16.05s.

~~~
Retric
Don't get me wrong, I also have a 2600K and have been somewhat disappointed
with Intel's progress. But for most workloads the current generation really
is significantly faster.

PS: That's an impressive OC; I capped out at 4.2GHz on air cooling, then
pulled it back down to 4.0 for extra safety margin.

~~~
jkot
How much is significantly faster? 30%? I need fast Scala compilation and I am
not impressed with the new processors. I am waiting for Skylake-EP; it has
good OC potential and could handle 256GB of DDR4 and a bunch of NVMes.

For a stable OC you need water; air cooling will overheat too soon. Sometimes
I run 24/7, and the computer is very loud but stable. Don't ask about the
voltage, it's way in the red. This is what I use:
[http://www.antec.com/product.php?id=704370&pid=17&lan=us](http://www.antec.com/product.php?id=704370&pid=17&lan=us)

------
jkot
I ported the test to Java. The C++ version runs in 16.05s on my computer;
Java is a bit faster at 14.58s.

[https://gist.github.com/jankotek/8fcb8205dbe1b8d131cdac2cc23...](https://gist.github.com/jankotek/8fcb8205dbe1b8d131cdac2cc2342f36)

~~~
vmorgulis
Interesting. C++ and Java are very close on my machine:

    
    
        java Prime
        9592 primes between 0 and 100000
        Time taken = 18.484
        9592 primes between 0 and 100000
        Time taken = 18.25
        9592 primes between 0 and 100000
        Time taken = 18.087
        ...
    

C++:
[https://news.ycombinator.com/item?id=11412820](https://news.ycombinator.com/item?id=11412820)

~~~
jkot
Do you have Java 8? It has many improvements.

~~~
vmorgulis
I'm on Debian and not familiar with Java. I did this:

    
    
        aptitude install openjdk-8-jdk

~~~
jkot
That's Java 8.

------
trevyn
Can we get a left-pad.io benchmark?

------
vmorgulis
A previous topic about nbench:
[https://news.ycombinator.com/item?id=11252358](https://news.ycombinator.com/item?id=11252358)

------
B1FF_PSUVM
> quite a few suggestions for how the code could be optimised.

Pointless optimization is the meth of the slightly obsessive.

------
gkfasdfasdf
Single-core performance tends to depend heavily on clock speed. It would be
interesting to see what the clock speeds were.

------
kyberias
One can kinda tell by looking at the code that the author hasn't written code
professionally. At least not for very long.

~~~
briHass
It doesn't matter much for his test, because the compiler won't optimize it
away, but one can shrink the loop to only go from 2 to n/2+1, e.g. change the
line to:

for(int a=2; a<n/2+1; a++)

That would be the obvious optimization: since 2 is your smallest possible
factor, no factor can be larger than half your number. In actuality, a little
math tells you that if n has a factor greater than the square root of n, it
must also have one smaller than the square root of n. So:

for(int a=2; a<sqrt(n)+1; a++)

~~~
occamrazor
And it would be better to move the sqrt(n) calculation outside of the loop
(though I guess any decent compiler would make this simple optimisation
anyway).

~~~
briHass
Yeah, a decent compiler should only evaluate it once, since the value of n
doesn't change inside the loop.

