
The cost of dynamic vs. static dispatch in C++ - mekishizufu
http://eli.thegreenplace.net/2013/12/05/the-cost-of-dynamic-virtual-calls-vs-static-crtp-dispatch-in-c/
======
pslam
There's a big extra cost of virtual functions at the CPU level that isn't
mentioned in the article: they effectively make the branch target depend on a
pointer chase. Put another way:

1) The virtual function address lookup requires a load from an address which
is itself loaded. If neither location is cached, this incurs the unavoidable
latency of two uncached memory accesses. Even in the best case, it costs two
L1 cache hits, around 8-16 cycles on modern architectures.

2) The function call itself depends on the final address loaded above; none
of it can proceed until the branch address is known. If everything is cached,
all is well and the core correctly predicts its way through a large number of
instructions. Best case, the core may still stall speculative execution
shortly afterwards when it runs out of non-dependent instructions, until it
knows for sure the address it should have branched to. Worst case, the branch
can't proceed until the two memory accesses complete.
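
To make the dependency chain concrete, here is roughly what a virtual call
expands to, written out as deliberately non-portable, illustrative C++
(assuming the common vtable-at-offset-zero object layout):

    
    
        struct Base { virtual void tick(unsigned n); };
    
        void call_virtual(Base* obj, unsigned n) {
            // Load 1: fetch the vtable pointer stored at the start of the object.
            void** vtable = *reinterpret_cast<void***>(obj);
            // Load 2: fetch the method slot; can't start until load 1 completes.
            void* slot = vtable[0];
            // Indirect call: the branch target is unknown until load 2 completes.
            reinterpret_cast<void (*)(Base*, unsigned)>(slot)(obj, n);
        }
    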

In any case, nearly all of this is dwarfed by the cost to the compiled code
itself: in most cases you can't inline, so simple transformations which could
eliminate the function call altogether can't happen.

~~~
MichaelGG
Can profile-guided optimization realise that a certain virtual function
almost always resolves to a specific implementation, and insert a conditional
check so that the common case can be inlined or otherwise optimized?
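
Something like this sketch is what I mean (CommonImpl stands in for whatever
the profile says the call nearly always hits; I gather real compilers compare
vtable pointers rather than use typeid):

    
    
        #include <typeinfo>
    
        struct Base       { virtual ~Base() = default; virtual void tick(); };
        struct CommonImpl : Base { void tick() override; };
    
        void guarded_call(Base* obj) {
            if (typeid(*obj) == typeid(CommonImpl)) {
                // Guard hit (the common case per the profile): a direct call
                // that the compiler is free to inline.
                static_cast<CommonImpl*>(obj)->tick();
            } else {
                // Guard miss: fall back to ordinary virtual dispatch.
                obj->tick();
            }
        }
    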

I'm not overly experienced with complicated OO systems, but sometimes it
seems the OO is just an abstraction for convenience, and at runtime execution
will always take a particular path.

~~~
adamtj
My understanding is that good virtual machines basically do this sort of
profiling and optimization at runtime and JIT compile specializations as
necessary.

Does anybody know why JIT isn't done in classically AOT-compiled languages?
Is the JIT overhead generally higher than the savings from the optimizations?

~~~
mjn
> Does anybody know why JIT isn't done in classically AOT-compiled languages?

One (admittedly incomplete) answer is that AOT compilers try to replicate
many of the wins that JIT compilers get from runtime specialization by
including a profile-guided optimization (PGO) pass instead, which specializes
ahead of time, using data logged from what you hope is a representative run.
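
With GCC, for example, that workflow looks roughly like this (file and
workload names are placeholders):

    
    
        g++ -O2 -fprofile-generate app.cpp -o app   # build an instrumented binary
        ./app < representative-input                # run it to log profile data
        g++ -O2 -fprofile-use app.cpp -o app        # rebuild, specialized using the profile
    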

Good JIT compilers can do things like optimizing fast paths, discovering
latent static classes in highly dynamic languages, etc. These kinds of
optimizations can also be done AOT, if you have good profile data and suitable
analysis & optimization passes.

The pros/cons of each approach are not entirely resolved, and you will find
varying opinions. Part of the problem with making a direct comparison is that
there are large infrastructural inconveniences with switching from one
approach to the other. A good JIT is a quite pervasive beast, not something
you can just tack on as a nice-to-have. PGO is somewhat infrastructurally
easier to add to an existing AOT compiler. Therefore, if you can do most of
what JIT does via PGO, you would prefer to do that, were you the maintainer of
an existing AOT compiler. Whether you really can is afaik a bit of an open
question.

~~~
emn13
I think something that's often overlooked in this discussion is the
difference in language semantics. We're not just comparing AOT with JIT
(otherwise, why not JIT an AOT-compiled app?); we're almost always also
comparing C++ to the JVM/CLR worlds.

And then the point is that most optimizations a JIT can do that an AOT
compiler cannot are particularly important where the language semantics are
"too" flexible. If your code has lots and lots of virtual calls, or lots of
exceptions with unpredictable control flow - well, sure, it's really
important to elide that flexibility where it's not actually used. That's kind
of like how JS VMs nowadays speculatively type their untyped objects - it's a
huge win, and not possible statically.

But the point is that these optimizations are critical because those
languages don't allow (or encourage) code to disable the dynamic features. In
C++ this _can_ be helpful; but how often does dynamic devirtualization really
matter? I mean, you can statically devirtualize certain instances (e.g.
whole-program optimization reveals only two implementations and replaces a
virtual call with an if), but the real "code could be any subtype but
actually isn't" scenario just doesn't come up that often.

The consequence is that C++ gets most of the benefits of a JIT simply because
the JIT is solving problems C++ compilers don't need to solve. The cost is
that the compiler wastes inordinate amounts of time compiling your entire
program as optimally as it can, even though it only has a few hotspots.

------
Taniwha
I worked on a serious x86 clone once - we took a lot of real-world traces and
ran them through our various microarchitectures to see how they would fly.
Dynamic C++ dispatch was interesting: normally you expect something like

    
    
       mov r1, n(bp) ; get vtable pointer
       mov r2, n(r1) ; get method pointer 
       call (r2)     ; call
    

that's a really bad pipe break: a double-indirect load and a call - but
branch prediction may be your friend ...

However, some of the code we saw looked like this (I think it came from a
Borland compiler):

    
    
       mov r1, n(bp) ; get vtable pointer
       push n(r1)    ; push method pointer 
       ret           ; call
    

an extra memory write/read, but one that's always caught in L1, and on the
register-poor x86 it saves a register, right? ... but on most CPUs of the
time you were screwed for branch prediction. CPUs had a return cache, a cheap
way to predict the branch target of a return - by doing a return without a
call you've popped the return cache, leaving it in a bad state. EVERY return
in an enclosing method is going to mispredict as well - the code will run,
but slowly.

~~~
mappu
I use the push/ret idiom all the time to stdcall off the stack... I didn't
realise there was a return cache, that's very interesting.

~~~
Taniwha
Depends on the CPU - but it's a relatively trivial thing to build (especially
because, unlike other caches, it's a stack). On x86 a return is nominally
ALWAYS a bad pipe bubble: a pop followed by an indirect jump - the pop gets
resolved at the end of its micro-op, while the jump wants to be resolved
early so the CPU can start decoding the next instruction.

In the end it can't hurt to generate a jump prediction off the return cache,
even a bad one - it's no worse than being idle. The problem with messing with
the cache, though, is that it can be left in a state where it always fails,
so you get no advantage from it.

------
alextingle

        for (unsigned i = 0; i < N; ++i) {
          for (unsigned j = 0; j < i; ++j) {
            obj->tick(j);
          }
        }
    

I wouldn't go quite so far as to say that benchmarks with tight inner loops
like this are _completely_ useless, but they are nearly so.

The author is clearly aware that the real world of performance is much bigger
& more complex than his simple Petri dish. Credit to him for mentioning that.
It's also really refreshing to see him analysing the optimised assembly.

The trouble with this approach is that it's tempting to draw simple
conclusions. In this case, you might be tempted to conclude "CRTP is always
faster than virtual dispatch", when the truth is likely to be much more
situation-dependent.

I have seen a biggish project go through a lot of effort to switch to CRTP,
only to see a negligible performance impact.

~~~
eliben
And I have seen projects whose performance was crippled by layers upon layers
of endless virtual calls. YMMV ;-)

~~~
army
Agreed, for almost all code it doesn't matter, but for the remaining small
fraction it's worth thinking about these things. It sounds pretty insane to go
with a blanket approach of removing virtual calls throughout an entire
codebase without understanding which ones are the problematic ones. Especially
since some ways of solving the problem could potentially lead to other
problems like increased compiled code size.

I've seen plenty of software (especially systems software) that does spend
much of its time in tight inner loops. Pulling out all the optimization stops
there can give measurable gains. I've personally seen measurable gains on
real applications from tricks like reordering branches so that the more
predictable branches go first.
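
As a contrived sketch of that last trick (made-up names, not from any real
application):

    
    
        #include <vector>
    
        struct Packet {
            bool is_ordinary() const;   // true ~99% of the time in the profile
            bool checksum_odd() const;  // essentially random: mispredicts often
        };
        void handle_common(const Packet& p);
        void handle_rare(const Packet& p);
    
        void process(const std::vector<Packet>& packets) {
            for (const Packet& p : packets) {
                // The highly predictable test goes first, so the unpredictable
                // branch below is only evaluated on the rare path.
                if (p.is_ordinary())
                    handle_common(p);
                else if (p.checksum_odd())
                    handle_rare(p);
            }
        }
    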

~~~
emn13
Sure, it's a waste to optimize code that doesn't significantly contribute to
execution time. And there are lots of cases that are I/O bound, memory bound,
cache bound, cross-thread communication bound etc. But if you're doing actual
calculations - so _not_ lots of communication like I/O or threading, and not
delegating the crunching to a library - and your calculations are _not_
trivial bit-pushing (e.g. not just streaming with minor changes), then it's a
good bet that _any_ virtual function calls in that kind of code will be
problematic; getting rid of the inner-loop dynamic dispatches will almost
certainly help.

So it's situational, but IME it's pretty predictable where you'll see this
kind of optimization help. By all means profile, use whatever tools are at
hand to help you along, and don't apply the optimization blindly - but
despite the "black art" label optimization sometimes gets, this kind of thing
really is pretty straightforward.

------
kbutler
"If anything doesn’t feel right, or just to make (3) more careful, use low-
level counters to make sure that the amount of instructions executed and other
such details makes sense given (2)."

This is explicit support for confirmation bias.

See Feynman's discussion of measuring the charge of the electron in Cargo Cult
Science:

"Why didn't they discover the new number was higher right away? It's a thing
that scientists are ashamed of—this history—because it's apparent that people
did things like this: When they got a number that was too high above
Millikan's, they thought something must be wrong—and they would look for and
find a reason why something might be wrong. When they got a number close to
Millikan's value they didn't look so hard. And so they eliminated the numbers
that were too far off, and did other things like that..."

[http://neurotheory.columbia.edu/~ken/cargo_cult.html](http://neurotheory.columbia.edu/~ken/cargo_cult.html)

~~~
nkurz
And as an alternative, would you suggest laboriously using low level counters
to verify that every measurement you think is correct is indeed correct? Given
finite resources, what's a better approach than concentrating on the apparent
anomalous measurements? I'm not sure I see the parallel.

------
nly
When you think you can use CRTP instead of virtual dispatch in your program,
you didn't need virtual dispatch to begin with... you needed a generic
algorithm to operate over your object classes. That's exactly what
run_crtp() is; the CRTPInterface class is completely redundant, except that
it provides some degree of compile-time concept checking (which we'll
hopefully get in C++17).
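
That is, all you really need is an ordinary function template; a sketch along
the lines of the article's benchmark loop:

    
    
        // Works for any type with a tick(unsigned) member; the call is
        // resolved at compile time and freely inlinable - no CRTP base needed.
        template <typename Implementation>
        void run(Implementation* obj, unsigned N) {
            for (unsigned i = 0; i < N; ++i)
                for (unsigned j = 0; j < i; ++j)
                    obj->tick(j);
        }
    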

Virtual dispatch is useful for type erasure, when using abstract types from
plugins, DLLs or generally "somebody else's code". IMHO, the valid use cases
within a standalone program are actually fairly few.

~~~
jamesaguilar
Unit testing is my #1 use for virtual functions. "Somebody else's code",
a.k.a. Standard ML modules, is a distant second.

------
berkut
I've done benchmarks on this fairly recently, and with the functions actually
doing a lot of work (ray intersection for a raytracer), I saw practically no
difference between CRTP and virtual functions:

[http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-
function-...](http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-
overhead.html)

And this was with billions of calls to the functions...

~~~
blt
Yes, the penalty is most glaring for calls that do a tiny amount of work.
Imagine if

    
    
      String.charAt(int index)

was a virtual call inside of strlen().
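
Spelled out as a C++ sketch (String being a hypothetical class):

    
    
        #include <cstddef>
    
        struct String {
            virtual char charAt(std::size_t index) const;  // virtual: no inlining
        };
    
        // One dynamic dispatch per character: the dispatch overhead dwarfs
        // the one-byte comparison that is the actual work.
        std::size_t strlen_virtual(const String& s) {
            std::size_t n = 0;
            while (s.charAt(n) != '\0')
                ++n;
            return n;
        }
    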

------
gjm11
So he found that dynamic dispatch was a lot more expensive. Fair enough and
not very surprising. But let's quantify it a bit in absolute terms. The
dynamic version of the code took 1.25s to run, during which time it performed
approximately 8 x 10^8 virtual function calls. That translates to a cost per
call of _1.5 nanoseconds_.

From which my takeaway would be: In inner-loopy code for which an extra
nanosecond or so per call is critical, you should avoid virtual function
calls. For anything else, don't worry about it.

~~~
MichaelGG
1.5 nanoseconds per call _in the best case_. In some huge monstrosity where
you've got to go chase down object headers that aren't in the cache, things
may be quite different.

------
tomp
Instead of devirtualization, a simpler optimization, which would also help in
the dynamic case, is simply hoisting the method pointer fetch out of the
loop. Instead of doing

    
    
        while (...) {
          (obj->vtable[0])(...);
        }
    

we could have

    
    
        void (*fn)(...) = obj->vtable[0];
        while (...) {
          fn(...);
        }
    

which would avoid two indirections per inner-loop iteration! Actually, I'm
almost sure that's what LuaJIT does, and many other high-level language
implementations could perform this optimization as well. However, maybe C is
too low-level to be able to do that, and I don't know about C++.

~~~
eliben
That would save the indirection, but I hope the article shows that by far the
biggest cost comes from the lack of inlining. The latter would not be solved
by your function pointer.

------
vinkelhake
This is a nice article and props for including and dissecting generated
assembler!

A key thing here is that inlining is what enables zero-cost abstractions in
C++. A virtual call is slower than a regular call, but the main problem is
that it builds a barrier that effectively stops inlining.

It'll be interesting to see how devirtualization in GCC will do for real world
programs.

------
namuol
Observation: the intricacies of our technologies have grown so complex that
analysing the things we once had a direct hand in designing now plays out
much like the analysis of some mysterious natural phenomenon.

------
jheriko
It's interesting to see a breakdown of this - especially using modern
compilers on the Intel platform.

Did you try the Intel compiler? For raw low-level optimisation it sometimes
massively outperforms the MS, gcc or clang versions...

I'd imagine these problems are worse on ARM chips, and dynamic dispatch is
even less effective there - certainly on PPC architectures I've seen much
worse performance than on similarly powered Intels in precisely this
situation. The caches are smaller and slower...

I'm not 100% sure, but I think I've seen virtual calls 'devirtualised' by the
MS compiler a couple of years ago... I might be thinking of something else
though, it was a while back now. I was unpicking some CRTP mess in something
that /was not performance critical in any way/...

~~~
pmjordan
You may be thinking of this: IIRC the standard recommends that compilers omit
dynamic dispatch when the dynamic type is known at compile time - this
essentially boils down to the case where a virtual method call follows
creation of the object with 'new' or as an automatic variable. In my
experience, this is commonly implemented correctly in compilers.

The other case where the dynamic type is known is, of course, inside the
constructor itself.
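
The first case, in sketch form (Derived and tick() are stand-in names):

    
    
        struct Base    { virtual ~Base() = default; virtual void tick(); };
        struct Derived : Base { void tick() override; };
    
        void known_types() {
            Derived d;
            d.tick();          // automatic variable: the type is provably
                               // Derived, so no vtable lookup is needed
    
            Base* p = new Derived;
            p->tick();         // right after 'new' the dynamic type is still
                               // known, so this call can be devirtualized too
            delete p;
        }
    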

------
cma
I'd like to see a comparison of a dynamically linked function call vs a
non-dynamically linked virtual call.

Dynamic linking has more indirection than you might expect because the
function addresses can't always just be put at the call site during the
library load (the places where you would want to write the address can be in
code that is read-only mmapped to aid in sharing memory between processes and
to avoid loading unused stuff from disk).
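
A rough C++ model of that extra indirection (the usual PLT/GOT arrangement;
names are made up, the real thing is emitted by the linker and patched by the
loader):

    
    
        // Writable slot the dynamic loader patches; this models the GOT.
        void (*got_entry)() = nullptr;
    
        // What a call through the PLT amounts to: an extra load plus an
        // indirect jump, instead of a direct call.
        inline void plt_stub() { got_entry(); }
    
        // The call site itself never holds the library's address, so the
        // page containing it can stay read-only and shared across processes.
        void caller() { plt_stub(); }
    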

~~~
zwieback
In an ideal world the OS could still replace the call sites with direct calls
into the loaded library, circumventing the jump table altogether. I don't
remember what this is called, maybe something like a thunk, but I've seen it
happen in the debugger: the first call causes a fault which rewrites the call
site with the target address, and subsequent calls go straight to the lib.
This can work even if the chunk of code containing the call sites is shared
and read-only, as long as the OS can override that.

------
vicaya
This _could_ be another case of premature optimization, as gcc 4.9+ can
automagically devirtualize non-overridden virtual functions. icc has been
able to do that for years.

~~~
nkurz
That's not the way the phrase 'premature optimization' is usually used.
Usually, it means spending time optimizing something that is not a limiting
factor, or that otherwise will not make a difference in the final result.
Keeping your code simple in the hope that eventually it will become fast is
something else, probably falling closer to 'Sufficiently Smart Compiler'
[http://c2.com/cgi/wiki?SufficientlySmartCompiler](http://c2.com/cgi/wiki?SufficientlySmartCompiler).

------
simfoo
I really like the "Mandatory precaution about benchmarks" section; it's spot
on.
------
rottyguy
Anything similar for higher-level languages (C# or the like)?

~~~
andor
The Java HotSpot VM can still optimize this case if the virtual call leads to
only a few classes most of the time. Several virtual methods can be inlined,
but of course there's still an extra step compared to static dispatch: the
class of the current object has to be compared against those of the inlined
methods. If no matching method is inlined, control needs to be passed back to
the VM.
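
Rendered as a rough C++ sketch of what the JIT'd code does at such a call
site (details vary by VM; the names are made up):

    
    
        struct Object { const void* klass; };  // per-object class word
    
        extern const void* kArrayListKlass;    // classes seen in the profile
        extern const void* kLinkedListKlass;
        void deoptimize(Object* obj);          // hand control back to the VM
    
        void dispatch(Object* obj) {
            const void* k = obj->klass;        // one load, no vtable walk
            if (k == kArrayListKlass) {
                /* inlined body of the ArrayList method */
            } else if (k == kLinkedListKlass) {
                /* inlined body of the LinkedList method */
            } else {
                deoptimize(obj);               // no inlined match
            }
        }
    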

~~~
pmjordan
A fascinating article on this type of optimisation:

[http://www.azulsystems.com/blog/cliff/2010-04-08-inline-
cach...](http://www.azulsystems.com/blog/cliff/2010-04-08-inline-caches-and-
call-site-optimization)

~~~
MrBuddyCasino
Interesting read, I didn't know that making fields final in Java does nothing
for performance. Any idea what those "Generic Popular Frameworks" are? I put
my money on Hibernate.

