
Yet Another Language Speed Test – Counting Primes - MrBra
https://bjpelc.wordpress.com/2015/01/10/yet-another-language-speed-test-counting-primes-c-c-java-javascript-php-python-and-ruby-2/
======
userbinator
_C is the fastest, with C++ approximately the same_

There are no C++ features really being used here besides true/false (which
basically get treated as 1 and 0 by most compilers[1]), so there shouldn't be
any difference between the two; even the generated code should be identical.
The C++ code even uses printf instead of cout, which is one area where I would
expect a difference. Thus I'm surprised by the large difference between C and
C++ at the low end.

[1] On x86 and other ISAs with flags, booleans can be returned in the flags
register, an optimisation that I haven't seen compilers do (yet) on x86. It's
common on embedded ISAs like the 8051, though.

~~~
plorkyeran
Merely including <iostream> results in std::cout/std::cerr/std::cin being
constructed on startup even if they're never used (as in this case). This
doesn't take very long, but when the actual work being tested takes under a
millisecond...

------
RobAtticus
Here's an update that includes Go and Haskell:

[https://bjpelc.wordpress.com/2015/01/13/yet-another-language-speed-test-go-golang/](https://bjpelc.wordpress.com/2015/01/13/yet-another-language-speed-test-go-golang/)

------
whoopdedo

        Notably, it should be mentioned that writing idiomatic
        Python and Ruby results in much slower code than that 
        used here. Ranges bad. While loops good. 
    

I get 33% shorter times when I change the while loops to for-in-range and a
list comprehension.
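The thread doesn't reproduce the article's code, but a sketch of the two styles in question (function and variable names are mine) might look like this in Python:

```python
import math

def is_prime_while(n):
    # The article's benchmark style: an explicit while loop over odd divisors.
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def is_prime_range(n):
    # The for-in-range / comprehension style described above.
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    return all(n % d != 0 for d in range(3, int(math.sqrt(n)) + 1, 2))
```

In CPython the range/generator version pushes most of the looping into C-level iteration machinery, which is one plausible reason for the 33% improvement; as always, measure both before trusting either claim.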

But unless you're doing computationally expensive operations, I don't see the
benefit of these speed tests except to provide ammunition for flame wars. What
good is it for your code to run 50% faster if you spend 300% more time
creating, debugging, and maintaining it?

~~~
alialkhatib
I'm glad you brought this up, because I was thinking about it but felt like my
relative unfamiliarity with C(++) and a handful of other languages in this
"shoot-out" made my understanding here flawed (or at least partially naive).
I'm also going to piggyback a bit because another thought I had seems related.

There's also the issue that in languages like Python, standard libraries
(like multiprocessing[0]) make parallelizing code _trivial_ (even with/despite
the dreaded global interpreter lock). I rewrote some of the code the author
uses, and on a smaller number (6710886 instead of 67108864) I'm seeing ~1/3 the
time to run compared to his baseline, with 4 worker processes (27.469517s vs
1:25.373176). I haven't looked into what might be keeping it from being a full
4x faster, but my hunch is that it's a combination of other things going on on
my laptop and the undeniable difference in execution speed in CPython
(specifically, CPython 3.4).

One might make the argument that this is a wildly unfair comparison, but my
bigger-picture argument is that a language that abstracts things like
parallelization to the point that one can invoke such features in a line or
two of added & changed code is probably worth factoring into an analysis, if
we're even interested in analyzing languages in this way.

If we're going to write C-like code and run it as Python, should we be
surprised when it doesn't live up to those expectations?

[0]:
[https://docs.python.org/3.4/library/multiprocessing.html](https://docs.python.org/3.4/library/multiprocessing.html)
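Not the commenter's actual code, but a sketch of how such a multiprocessing split might look (all names and the chunking scheme here are mine):

```python
from multiprocessing import Pool

def is_prime(n):
    # Naive trial division, matching the spirit of the benchmark.
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def count_primes_chunk(bounds):
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

def count_primes_parallel(limit, workers=4):
    # One contiguous chunk per worker over [0, limit].
    step = limit // workers + 1
    chunks = [(i, min(i + step, limit + 1)) for i in range(0, limit + 1, step)]
    with Pool(workers) as pool:
        return sum(pool.map(count_primes_chunk, chunks))

if __name__ == "__main__":
    # The commenter's run used 67108864; a smaller limit keeps the demo quick.
    print(count_primes_parallel(100000))
```

Note that equal-size contiguous chunks are imbalanced here: trial division gets more expensive as the numbers grow, so the worker holding the top chunk finishes last. That alone could explain part of the gap between 3x and the ideal 4x speedup.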

~~~
stelund
Maybe too low chunksize value for map_async call?

------
headius
I assume you only measured JRuby at low n, where the startup and warmup time
would overshadow overall performance. JRuby is usually significantly faster
than C Ruby. In this case, it's nearly twice as fast:

    
    
      [] ~/projects/jruby $ time jruby primes.rb
      Number of primes in the interval [0,67108864]: 3957809
      
      real	6m26.536s
      user	6m43.472s
      sys	0m4.967s
      
      [] ~/projects/jruby $ time rvm ruby-2.2 do ruby primes.rb
      Number of primes in the interval [0,67108864]: 3957809
      
      real	11m17.610s
      user	11m16.564s
      sys	0m0.738s

------
jkrems
Worth pointing out that there's no dynamic allocation or dispatch in any of
the examples (afaict), so all languages that use a GC get around having to do
any reference tracking or collection. Which explains how Go/Java/C++ can track
C performance so closely. I wouldn't be surprised if they'd all run almost the
same machine code in the end.

------
pron
The benchmark (perhaps intentionally) doesn't test the most important thing
when we want to compare "language speed": the efficiency of abstractions. The
benchmark has no abstractions at all other than a simple, single-target
function call that most compilers (whether AOT or JIT) would inline.
Interesting language comparisons test the languages' core abstractions, such
as virtual dispatch or pattern matching.

~~~
caf
It tests one other thing - a recursive function call that is amenable to tail-
call optimisation.
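The recursive version itself isn't shown in the thread; a hypothetical Python sketch of the shape being described (names mine) would be:

```python
def is_prime_recursive(n, d=3):
    # Trial division where "try the next odd divisor" is a tail call.
    # A C compiler can rewrite that tail call into a jump (tail-call
    # optimisation); CPython never does, so each divisor costs a stack
    # frame here (fine for small n, since depth is only ~sqrt(n)/2).
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    if d * d > n:
        return True
    if n % d == 0:
        return False
    return is_prime_recursive(n, d + 2)  # the tail call
```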

------
vortico
I know the point of this blog post is to benchmark languages with a naive use
case, but for those interested, there are many faster algorithms designed for
deterministically or probabilistically checking the primality of a number, and
even faster ones for computing the number of primes up to n. Recently a
modified Lagarias-Odlyzko method was used to compute the number of primes up
to 10^25, although I believe it requires the Riemann Hypothesis.

Approximations to this computation are even more fun: asymptotically, this
function tends to x/ln(x), or more accurately to li(x), the integral of
1/ln(t) dt from t=2 to x, whose asymptotic expansion is x/ln(x) +
1!x/(ln(x))^2 + 2!x/(ln(x))^3 + ...
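For the curious, the truncated expansion is easy to evaluate numerically; a quick sketch (function name mine) against the article's n:

```python
import math

def prime_count_asymptotic(x, terms=4):
    # Truncated asymptotic series:
    #   pi(x) ~ (x / ln x) * sum_{k=0}^{terms-1} k! / (ln x)^k
    lx = math.log(x)
    return x / lx * sum(math.factorial(k) / lx**k for k in range(terms))

# For the article's n = 67108864 (actual pi(n) = 3957809), the four-term
# estimate lands within a small fraction of a percent.
```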

------
x0054
This test is rather misleading. For instance, I decided to rewrite this same
procedure in Swift. I was stunned to find that at first it was almost 4 times
slower: the C program took 34 seconds to run to 67108864, while Swift took
2 minutes 11 seconds.

However, after swapping out the Float-only square root from Swift's Foundation
for the C sqrt function, leaving the rest of the code identical, the execution
time of the Swift code went down to 35 seconds, basically identical.

I have to wonder if Python, Ruby, and PHP similarly have a Float-only square
root function, or handle numbers with much higher precision than the test
requires.

~~~
caf
The C sqrt function is

    
    
      double sqrt(double)

~~~
x0054
Hmm, I can see that too now, though I am passing into it a CInt, and getting
back a CInt. Is it just converting the Int to Double internally and shooting
back a Double which is converted to CInt? If so, why do you think it's so much
faster?

------
RandomBK
It's important to note that some languages (particularly Java) don't ramp up
to their full speed until they are properly warmed up. It also sounds like the
author included startup time in the measurements, which will skew results for
the lower values of n.

Overall, be _very_ careful of micro-benchmarks like these. The conclusions
that you can draw from them are tenuous at best, and usually have little
bearing on the performance of an actual program.

------
caf
The note about the idiomatic Ruby version of this (each on a range) being
slower than the tested version (while loop) is very interesting.

This looks like an opportunity for the Ruby interpreter to optimise: if the
range is excessively large, it could decide itself to loop over the bounds
rather than instantiating the entire range.

~~~
cremno
It doesn't. A range just knows if it's exclusive, and its beginning and end
value. That line is also not really idiomatic. It's just trying to be.

------
desireco42
What is interesting here is that JavaScript is the fastest of all the
non-compiled languages, even beating PyPy. It shows me that a simple language
with well-thought-out features can do wonders, even when it is interpreted.

I have no illusion that this is a definitive test of language speed.

~~~
jmgao
JavaScript is not simple, its features are not well thought out, and it is
definitely not being interpreted. The only reason JavaScript performs well
here is that multiple megacorporations have devoted tons of resources to
fast JavaScript implementations.

------
allendoerfer
I would like to see the numbers of the Python code compiled without
modifications with Cython.

~~~
dr_zoidberg
Without modifications it yields a 20 to 30% increase in performance.
However, Cython's strength is in its type annotations. With those it should
get close to C speed.

~~~
allendoerfer
I would love the Cython team to work on the performance without modifications.
Maybe they could make use of the standardized type hints of Python 3.5.

I could see myself _compile_ with Cython here and there if the performance
gains would be meaningful but I do not want to _write_ Cython.

~~~
dr_zoidberg
Considering that Python 3 remains slower than Python 2 on many benchmarks,
what you're suggesting would be an interesting feature for Cython/Python 3:
"Compile your Python 3 code with Cython and get a 100%* speedup (*on
average)".

However, as I understand it, Py3's type hints are derived from a previously
available module (which I think also worked with Py2). So why would they add
compatibility for that now? Why not just write your own parser that reads the
type annotations and outputs the equivalent Cython variable declarations?

There are also some Cython-specific constructs you won't find in Python's
hints; for example, functions can be declared either cdef or cpdef, depending
on whether you plan to use them from C only, or from both C and Python.

I still see value in learning Cython as a complement to Python, for the cases
where you need to get "closer to the metal". It's not so much more, and it
helps a lot in performance.

------
ryanmk
It would be nice to see the numbers, in addition to the graph.

I did a straight port of C to lua, and ran it with luajit.

The C code ran in 49.901s.

The luajit code ran in 14m16.547s.

C code was compiled with -O2 -std=c11 -lm (version 4.8.1, MinGW).

luajit.exe was 2.0.3, static build, using VS2012 x64

~~~
letitgo12345
On OSX 10.1 with luajit 2.1.0 alpha, a straight translation of the C code gets
me 59 seconds on my machine, which is close to the C speed on it (38 seconds,
compiled with -O3).

Edit: fixed luajit version

~~~
ryanmk
I've placed my port here:
[https://gist.github.com/anonymous/ce176d7ab4f6b7b1ba91](https://gist.github.com/anonymous/ce176d7ab4f6b7b1ba91)

If you see anything wrong with it, or odd, feel free to share.

I'm still investigating what is happening to make the run so slow, so if you
can find something wrong in my code, that would help.

~~~
whoopdedo
Lua is global by default. Declare all the variables as local and you'll see
significant improvement. Also, there is a boolean type so you can use true and
false directly instead of comparing numbers.

~~~
ryanmk
I tried using locals, and there was no change to the time. Using a boolean
return value for isPrime shaved off two seconds.

------
aruggirello
As Go and Haskell have already been added, it would be nice to benchmark Rust,
Hack (vs. PHP), and perhaps Lua and Clojure, too.

------
vortico

        $ math
        
        In[1]:= AbsoluteTiming[PrimePi[67108864]]
        Out[1]= {0.009661, 3957809}
    

:^)

------
throwaway232
Why even bother with prime numbers? Just have the program count from 1 to 2^n
and the results will have about as much meaning as these do.

------
halayli
Why do so many programmers generate performance numbers and extrapolate
conclusions without understanding the bottlenecks?

On the other hand: combining the first two ifs into if (n <= 2) { if (n < 2)
return 0; else return 1; } would eliminate an unnecessary check that will be
false for most values.

> If n < 2: return false.

> If n is equal to 2: return true.

> If n (> 2) is even: return false.

~~~
copsarebastards
Why do many Hacker News commenters criticize posts without understanding what
the writer is trying to do? The purpose is to benchmark the language, not to
come up with the most efficient algorithm.

Further, your suggested optimization is terrible. Even if it turns out that's
slightly faster (which I'm not convinced it is: did you actually compile that
code and look at the generated assembly or profile it?) there are much bigger
wins to be had by using a better overall algorithm, such as the sieve of
Eratosthenes, _which was mentioned in the article if you had read it_.

I know I'm being a grumpy old man here, but if you decide to comment on
articles you either didn't read or didn't understand, you've only yourself to
blame.
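For comparison, a minimal Python sketch of the sieve being pointed to (not the article's code):

```python
def count_primes_sieve(limit):
    # Sieve of Eratosthenes: cross off multiples of each prime once,
    # instead of trial-dividing every candidate. O(n log log n) overall.
    if limit < 2:
        return 0
    composite = bytearray(limit + 1)  # 0 = possibly prime, 1 = composite
    for p in range(2, int(limit ** 0.5) + 1):
        if not composite[p]:
            # Start at p*p; smaller multiples were crossed off earlier.
            composite[p * p::p] = b"\x01" * len(range(p * p, limit + 1, p))
    # Numbers 2..limit that were never crossed off are prime.
    return (limit - 1) - sum(composite[2:])
```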

~~~
halayli
I did read the article, and finding my suggestion terrible indicates that you
don't know how CPU instruction pipelines work. My suggestion will give you a
big win because of branch prediction optimization. Without it, the CPU
pipeline will be processing fewer instructions in parallel because the
condition will be false in most cases.

When you benchmark, it's important to know what you are doing, or you'll
arrive at the wrong conclusions because of unknowns you aren't aware of.

Writing some code to generate performance numbers and jumping to conclusions
without understanding what influenced the numbers is unproductive to say the
least. These conclusions influence a lot of code that will be written in the
future.

If you cannot tell me where the bottleneck is (what's saturated), then you are
better off not doing anything.

~~~
copsarebastards
> _I did read the article, and finding my suggestion terrible indicates that
> you don't know how CPU instruction pipelines work. My suggestion will give
> you a big win because of branch prediction optimization. Without it, the CPU
> pipeline will be processing fewer instructions in parallel because the
> condition will be false in most cases._

Thank you, Captain Obvious, for that very simplified explanation of junior
undergrad systems architecture. That would all be very important if branch
prediction misses were your biggest problem. Maybe look up the Sieve of
Eratosthenes people keep mentioning because hey, people keep mentioning it for
a reason!

Arguing about branch misses in this algorithm is like arguing about branch
misses in a bogosort implementation. The correct optimization isn't to remove
branch misses, it's to _stop using bogosort and use a reasonable sorting
algorithm_.

I'll charitably believe that you read the article, but if you're still
complaining that the code is unoptimized when the author _clearly said it was
unoptimized_, then you didn't understand the article. The article isn't about
code optimization; it's about gathering performance metrics for naively
written code. It's profiling the _language_, not the code.

> If you cannot tell me where the bottleneck is (what's saturated), then you
> are better off not doing anything.

I'll tell you one thing for sure: the bottleneck isn't branch prediction
misses.

EDIT: Just to tack on more reasons you're wrong here: 1) You don't actually
know that what you did removed any possible branch prediction misses without
looking at the generated assembly, and 2) you could probably eliminate more
branches using a switch-case statement (although again you'd have to look at
the generated assembly to see if that's the case).

------
TheLoneWolfling
This is completely and totally meaningless.

It's excluding compilation time, and it's only outputting a single number. As
such, there's nothing preventing any of the optimizing compilers from
optimizing the entire program down to essentially print("pi<num> = <num>").
Ditto, time to load interpreters is included, whereas time to load compilers
is excluded.

Also, note that the output is different in format for the different languages.

On a different note: he mentions PyPy in passing but doesn't post benchmark
numbers. It'd be interesting to see how PyPy fares, especially with the
slower-in-Python idiomatic version.

~~~
userbinator
_As such, there's nothing preventing any of the optimizing compilers from
optimizing the entire program to essentially print("pi<num> = <num>")._

That has clearly not happened here. If it did, you would see a flat line
regardless of n. Nevertheless, it is a good idea to make it take n as a
command-line parameter, even if only to make testing different values easier.

_Ditto, time to load interpreters is included, whereas time to load compilers
is excluded._

You don't have to load the compiler every single time you run the program,
whereas that isn't the case with an interpreter.

_Also, note that the output is different in format for the different
languages._

It's performing a single output operation at the very end of the computation.
Any difference in that would be completely lost in the noise as n increases
and the computation time dominates.

_On a different note: he mentions PyPy in passing but doesn't post benchmark
numbers._

PyPy is there between JavaScript and Ruby:

[https://bjpelc.files.wordpress.com/2015/01/graph3.png](https://bjpelc.files.wordpress.com/2015/01/graph3.png)

