
Go vs. C vs. pypy/Python loop performance comparison - karlheinz_py
http://karlheinzniebuhr.github.io/en/2015/09/28/C-vs-Go-vs-pypy-vs-Python/
======
kristianp
It looks like the Go code uses an int, but the C code declares a long.

~~~
Someone
It is also buggy: it declares a signed long, but prints it as an unsigned one.

My money is on the time being spent printing the result, by the way, not in
the loop. The C program parses a format string and produces more output (the
characters 'sum: '). Whether the output is buffered may also play a role.

~~~
dalke
I'll take your money.

First, a change because the optimizer will pre-compute the sum:

    
    
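      #include <stdio.h>
      #include <stdlib.h>

      int main (int argc, char *argv[])
      {
        long a, bound=10000000;
        long sum = 0;
        /* Read the bound at run time so the compiler cannot fold the loop
           into a constant. */
        if (argc > 1) {
          bound = atol(argv[1]);  /* atol, since bound is a long */
        }
        /* for loop execution */
        for( a = 0; a < bound; a++ ) {
          sum += a;
        }
        printf("sum: %ld\n", sum);  /* %ld matches the signed long */
        return 0;
      }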
    

I'll time the performance across a range of values.

    
    
      % cc tmp.c && time ./a.out 1000000
      sum: 499999500000
      0.005u 0.000s 0:00.00 0.0%	0+0k 0+0io 1pf+0w
      % cc tmp.c && time ./a.out 10000000
      sum: 49999995000000
      0.030u 0.000s 0:00.03 100.0%	0+0k 0+0io 1pf+0w
      % cc tmp.c && time ./a.out 100000000
      sum: 4999999950000000
      0.303u 0.000s 0:00.33 90.9%	0+0k 0+0io 1pf+0w
    

You'll notice that my timing for 10000000, at 0.03 seconds, is comparable to
the timings in the essay.

If the printf overhead were the dominant cost then we would expect the timings
to stay roughly constant as the bound grows. Instead, the time increases
linearly with the bound, which is exactly what we expect if the loop is the
primary cost.

So no, most of the time is not spent in printing the result.
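
A more direct check would be to time the loop separately from the printf call. Here is a minimal sketch, assuming the standard `clock()` function is precise enough for the bound being tested. The names `sum_below` and `time_loop` are hypothetical and not from the essay:

```c
#include <time.h>

/* Hypothetical helper: sums 0 .. bound-1 with the same loop as the program. */
long sum_below(long bound)
{
    long sum = 0;
    for (long a = 0; a < bound; a++) {
        sum += a;
    }
    return sum;
}

/* Hypothetical helper: returns the CPU seconds spent in the loop alone,
   so the cost of printf is excluded. The sum is stored through result. */
double time_loop(long bound, long *result)
{
    clock_t start = clock();
    *result = sum_below(bound);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_loop` from `main` with increasing bounds, and printing only afterwards, should show whether the loop time scales with the bound. Build without optimization for this; with `-O3` the compiler may simplify the loop and the measured time can drop to near zero.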

I'll also test with optimizations enabled:

    
    
      % cc -O3 tmp.c && time ./a.out 1000000
      sum: 499999500000
      0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 1pf+0w
      % cc -O3 tmp.c && time ./a.out 10000000
      sum: 49999995000000
      0.000u 0.001s 0:00.00 0.0%	0+0k 0+0io 1pf+0w
      % cc -O3 tmp.c && time ./a.out 100000000
      sum: 4999999950000000
      0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 1pf+0w
      % cc -O3 tmp.c && time ./a.out 1000000000
      sum: 499999999500000000
      0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 1pf+0w
    

The optimizer does pretty well on this code.
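
Timings that stay flat as the bound grows tenfold are consistent with the optimizer replacing the loop by a closed-form expression, though I have not confirmed that from the generated assembly for this build. A sketch of what such a replacement computes, assuming a 64-bit `long`; the function names are hypothetical:

```c
/* The loop as written in the program above. */
long sum_loop(long bound)
{
    long sum = 0;
    for (long a = 0; a < bound; a++) {
        sum += a;
    }
    return sum;
}

/* Closed form: 0 + 1 + ... + (bound-1) = bound * (bound-1) / 2.
   This takes constant time regardless of bound. */
long sum_closed_form(long bound)
{
    if (bound <= 0) {
        return 0;  /* the loop body never runs */
    }
    /* Exactly one of bound and bound-1 is even; halve that one first
       so the intermediate product stays as small as possible. */
    if (bound % 2 == 0) {
        return (bound / 2) * (bound - 1);
    }
    return bound * ((bound - 1) / 2);
}
```

For a bound of 10000000 the closed form gives 5000000 * 9999999 = 49999995000000, the same value the loop prints.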

~~~
Someone
_" First, a change because the optimizer will pre-compute the sum:"_

You are changing the rules; you will not get my money :-)

And thanks for the educational reply.

~~~
dalke
To be fair, you didn't say how much or how I would get it. I imagine if I were
to show up at your doorstep with cap in hand, I might get a penny off of you.

If the optimizer is enabled, and the upper bound hard-coded, then the compiler
pre-computes the loop. This is the entire code from llvmgcc:

    
    
      _main:
      0000000100000f10	pushq	%rax
      0000000100000f11	leaq	72(%rip), %rdi ## literal pool for: sum: %lu
      
      0000000100000f18	movabsq	$49999995000000, %rsi
      0000000100000f22	xorb	%al, %al
      0000000100000f24	callq	0x100000f34 ## symbol stub for: _printf
      0000000100000f29	xorl	%eax, %eax
      0000000100000f2b	popq	%rdx
      0000000100000f2c	ret
    

Which means you are right - it's hard to beat a pre-computed constant.

I think I owe you a penny.

~~~
karlheinz_py
Thanks for this insight. I've countered the problem by reading the loop count
from the command line. Now the C compiler can't know how many iterations will
run, so the test shows the same speed as C without optimisation. Go, however,
keeps a consistent speed and is faster than C. I'm curious what kind of
optimisation the Go compiler applies. See the update for further details.

~~~
karlheinz_py
Edit: I thought C optimisation wasn't working, but I repeated the tests and C
now keeps the same speed with optimisation even when the number is passed as a
command line argument. This also suggests that the value is not calculated at
compile time.
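
If the goal is to measure the loop itself with optimisation switched on, one common trick is to declare the accumulator `volatile`, which obliges the compiler to perform every read and write of it. A minimal sketch; `sum_loop_volatile` is a hypothetical name, and this is not what the essay's code does:

```c
/* Hypothetical variant of the benchmark loop. Because sum is volatile,
   each "sum += a" must be carried out as a real load and store, so the
   compiler cannot collapse the loop into a single expression. */
long sum_loop_volatile(long bound)
{
    volatile long sum = 0;
    for (long a = 0; a < bound; a++) {
        sum += a;
    }
    return sum;
}
```

This measures a loop with forced memory traffic, so the numbers are not directly comparable to a plain loop in another language. It only ensures the iterations actually happen.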

