
Call/Return Microbenchmarking and Call Frame Allocation - luu
http://cs.coloradocollege.edu/~bylvisaker/CallReturn/
======
userbinator
_Everyone knows that Python's performance is generally pretty bad, but that
seems almost comical to me. How can it take more than 400 cycles just to do a
procedure call and return?_

There are all the instructions in the interpreter's main loop (which probably
contains a bunch of call/returns of its own...), plus the cycles needed to
read the instructions it's interpreting (including the manual management of
the VM's call stack); then throw in a cache miss or two, and 400 cycles starts
looking pretty _good_ for an interpreted language.

I'm of the opinion that if call/return time is taking up a significant portion
of the total running time of a program, then the problem is with the program's
structure and nothing is going to solve it better than fixing the code itself
(e.g. don't break it up into so many little pieces that most of the functions
don't do much more than call others; but don't inline too much, since that'll
increase cache misses.) All the clever trickery with stack/heap allocation to
bring down the overhead is interesting, but it can't ever beat the _zero_
cycles of not doing a call at all. As the saying goes, "the fastest way to do
something is to not do it at all."

~~~
lambdanator
Why would inlining too much increase cache misses? I've never heard that.

~~~
hawkice
Anything that increases executable size might/will cause more instruction
fetches from memory, if only to pull in all those bonus instructions.

I'd say this is a pretty minor effect in every scenario I've dealt with.
Haskell, in particular, lets you inline very easily (and many people do when
performance is important), and I've only heard of one case where that had a
negative consequence, and even then the fix was a missing optimization rather
than removing the inlining (it had to do with polymorphic strings causing code
to be inlined at every single string literal, which is extremely silly).

~~~
TheLoneWolfling
Yep.

Note, however, that, as a special case, inlining of a function only called in
one place "should" always be faster.

------
nkurz
Although the numbers given seem close to correct on Haswell and Sandy Bridge
(about 4 cycles overhead per function call) it's almost impossible to know
whether this was by luck or good design. It's always dangerous to presume that
the code you write in C has any bearing on the actual assembly executed by the
processor.

Lest one think this is an idle warning, when I tried to verify his results on
my machine I found that even when I prevented the inlining of the functions,
"gcc-4.9 -O3" persistently optimized away the second function call by
generating code like this:

    
    
      f2:
        callq  400700 <tree1>
        mov    %rax,%rdi
        jmpq   400700 <tree1>
    
    

It called the function as expected the first time, reusing the first argument
in %rdi that it was called with. It moved the return value from that first
call from %rax to %rdi as expected. But then rather than issuing a second call
and creating a new stack frame, the compiler realized that since it was just
returning the value without further processing, it could "tail call optimize"
by jumping directly into the called function and letting it return directly to
the top level.

So unless you carefully verify the generated assembly, you might end up
measuring only half the calls you would expect. In this case it might not
matter much (jmp and call are about equal in cost), but it's no longer actually
measuring call/return overhead as the title says. Using '-O1' gave me code
closer to what one would expect.

~~~
nkurz
I modified his approach to get something that would compile to assembly
closer to what one would expect regardless of the optimization level, and put
my code up here:
[https://gist.github.com/nkurz/d64f5b4ded4e19e17aae](https://gist.github.com/nkurz/d64f5b4ded4e19e17aae)

From source code like this:

    
    
      COMPILER_NO_INLINE uint64_t f3( uint64_t x ) {
        return f2( f2( x ) ) + 1;
      }
    

'gcc -O3' took 3.67 cycles per call from assembly like this on both Haswell
and Sandy Bridge:

    
    
      0000000000400720 <f2>:
      400720:       e8 db ff ff ff          callq  400700 <f1>
      400725:       48 89 c7                mov    %rax,%rdi
      400728:       e8 d3 ff ff ff          callq  400700 <f1>
      40072d:       48 83 c0 01             add    $0x1,%rax
      400731:       c3                      retq
    

'gcc -Os' was significantly slower, at 4.25 cycles per call on Haswell, 4.71
on Sandy Bridge:

    
    
      00000000004006f3 <f2>:
      4006f3:       e8 ea ff ff ff          callq  4006e2 <f1>
      4006f8:       48 89 c7                mov    %rax,%rdi
      4006fb:       e8 e2 ff ff ff          callq  4006e2 <f1>
      400700:       48 ff c0                inc    %rax
      400703:       c3                      retq
    

Since 'inc' shouldn't be any slower than 'add', I think this slowdown is
either due to alignment or to the greater density of calls in the code. Maybe
this is a case where the denser code needs to use the legacy decoder instead
of the µop cache?

'icc -O3' produced strange code that ran slightly slower, at 4.5 cycles per
call on Haswell, and 4.01 cycles on Sandy Bridge:

    
    
      0000000000400c90 <f2>:
      400c90:       56                      push   %rsi
      400c91:       e8 1a 00 00 00          callq  400cb0 <f1>
      400c96:       48 89 c7                mov    %rax,%rdi
      400c99:       e8 12 00 00 00          callq  400cb0 <f1>
      400c9e:       48 ff c0                inc    %rax
      400ca1:       59                      pop    %rcx
      400ca2:       c3                      retq
    

Can anyone tell me what it's doing with those extra pushes and pops? I think
they are just noise, but maybe they help with debugging or stack alignment?

Clang also produces odd code with extra pushes and pops, but somehow achieves
the same speed as GCC on Haswell despite this (3.67 cycles). Speed on Sandy
Bridge was slow at 4.5 cycles:

    
    
      00000000004005c0 <f2>:
      4005c0:       50                      push   %rax
      4005c1:       e8 da ff ff ff          callq  4005a0 <f1>
      4005c6:       48 89 c7                mov    %rax,%rdi
      4005c9:       e8 d2 ff ff ff          callq  4005a0 <f1>
      4005ce:       48 ff c0                inc    %rax
      4005d1:       5a                      pop    %rdx
      4005d2:       c3                      retq

------
pjungwir
Sort of OT, but I've been writing some malloc-less C lately, where most/all
memory is on the stack, and I've been wishing for a way to size a stack
buffer with a function argument, for example:
    
    
        void foo(int x) {
          char c[x]; 
          // . . .
        }
    

I'm pretty sure there is no way to do that in C, but I'm wondering if a bit of
(Intel) assembly could give me something re-useable from a C context. Any
ideas?

~~~
ajtulloch
Doesn't `alloca` do exactly what you want? [http://man7.org/linux/man-
pages/man3/alloca.3.html](http://man7.org/linux/man-pages/man3/alloca.3.html)

~~~
srean
Yes, and C99 now supports variable length arrays anyway. The storage gets
allocated (deallocated) as control enters (exits) the scope. Jumping into the
scope, I think, is not allowed; I don't know whether that gets checked at
compile time or results in a run-time error. alloca is officially not
portable, but it's present on the platforms I care about. It's living
dangerously, sure, but quite effective nonetheless.

I would have loved a portable way to query how much stack space the process
has left. The problem is that some architectures need not even have a stack.

@dbaupp Hey thanks! and nice to know

~~~
dbaupp
It's a compile time error, e.g.

    
    
      void foo(int x) {
        goto label;
        {
            char y[x];
      label:
            y[0] = 'a';
        }
      }
    
      alloca-goto.c: In function ‘foo’:
      alloca-goto.c:2:5: error: jump into scope of identifier with variably modified type
           goto label;
           ^
      alloca-goto.c:5:1: note: label ‘label’ defined here
       label:
       ^
      alloca-goto.c:4:14: note: ‘y’ declared here
               char y[x];
                    ^

------
stefs
out of curiosity i tried this with javascript. seems like chrome is a lot
faster than firefox: ~6 times with f25, but f25 is a bit too coarse. jsperf
(which isn't really the right tool for this job anyway) identified me as a
spammer while i was adjusting the test cases, though. with lower f's the
difference would be even higher.

i tried it out in the browser; the sum of fN(N) for N in [0..1000[ yields a
~2-3 times difference.

~~~
stefs
correction: it was fM(N) where N=[0..1000[ and constant M=20 (i think; 20
didn't take too long for 1000 iterations). results were added up and printed
at the end, outside the benchmarking loop.

