
Clang vs GCC benchmark on UnrealEngine3 - Aissen
https://plus.google.com/+RyanGordon/posts/4N6vPqaGjrV
======
tenfingers
gcc at -O0 -g has always been decent enough for me, and I never saw any
significant difference with clang (either in the past when gcc -O0 used to be
/slightly/ slower, or now).

Things change with -Ox, but then again, who really cares? Generating optimized
binaries is not something I do routinely, except when tuning for performance.

When I'm generating a binary for release, I couldn't care less if the project
actually took twice as long to compile. In fact, to use feedback-directed
optimization (which on some projects makes a marked difference, usually
larger than the gcc/clang gap in generated code at -O3), you actually have
to build it twice, and _very_ few people actually bother.
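
For reference, the two-pass flow with GCC looks roughly like this (a sketch
with a made-up source file; clang has an analogous -fprofile-generate /
-fprofile-use pair):

    g++ -O2 -fprofile-generate game.cpp -o game   # pass 1: instrumented build
    ./game                                        # run a representative workload (writes .gcda profile data)
    g++ -O2 -fprofile-use game.cpp -o game        # pass 2: rebuild using the collected profile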

clang used to be much faster at building, but then again the resulting binary
was never on par performance-wise. As clang gets better at optimizing, the
compilation times are getting more and more similar.

This is again, all hardly surprising for me.

I _have_ worked on projects using C++ where build times were incredibly slow.
They all shared the common attribute of #include
<megaheader_with_all_class_definitions> and "using namespace library".
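
As a hypothetical illustration of the difference (all names invented), compare
a header that drags in the whole library against one that forward-declares
only what it actually needs:

    // slow_widget.h -- every file including this re-parses the entire library
    #include <megaheader_with_all_class_definitions>
    using namespace library;
    struct Widget { Renderer* renderer; Texture* texture; };

    // fast_widget.h -- forward declarations keep the header cheap to include;
    // only widget.cpp pulls in the heavy library headers it really needs
    namespace library { class Renderer; class Texture; }
    struct Widget { library::Renderer* renderer; library::Texture* texture; };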

For other projects, as long as rebuilding an object takes 2 seconds or 10,
does it really make a difference? In those cases, linking actually takes
disproportionately longer, to the point that compilation times are
insignificant.

It's a genuine question.

~~~
Narishma
Your projects aren't video games, are they? It's not very feasible to debug
games at -O0, unless they're very small. And I don't know if profile-guided
optimization is feasible either.

~~~
endianswap
I've worked on several AAA games and have never had a problem
running/testing/debugging with optimizations off. There's a framerate hit, of
course, but I've never had a problem debugging because of it. Trying to debug
with optimizations on is usually a bigger time sink than just debugging at
20fps. When I'm dealing with gameplay code rather than engine code, I'll
normally build with all optimizations on except for the gameplay modules at
-O0, so I get full framerate and perfect debugging of the code I'm interested
in. But usually even that isn't necessary (though it does give me faster
load/restart times).
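
(This is normally handled with per-module compile flags in the build system,
but for what it's worth both compilers also expose function-level control.
A minimal sketch, with a hypothetical gameplay type and function:)

    struct Player { float x, y; };            // hypothetical gameplay type

    #if defined(__clang__)
    #  define DEBUGGABLE __attribute__((optnone))         // Clang spelling
    #else
    #  define DEBUGGABLE __attribute__((optimize("O0")))  // GCC spelling
    #endif

    // Stays unoptimized (and pleasant to step through) even in an -O2 build.
    DEBUGGABLE void UpdatePlayer(Player& p) { p.x += 1.0f; }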

------
bananas
Either way, the raw compile time is not important to me, even on projects of
this size. On dev builds most of the time is probably going to be linking the
.o files, which is considerably quicker than a full compile cycle.

For CI builds, ~10 mins is nothing (we're currently hitting 3 hours and
managing fine with a rat's nest of csc and cl compilers being invoked all over
the place).

What really brings the shit to the table with clang is simply this:

Compiler error messages are beyond awesome, as is the static checker.

That is all. Productivity is way up for me. I do my own clang builds of stuff
because it's a lot more helpful as a compiler.

------
davidgerard
So how did the resulting binaries compare for performance? Compilation happens
once, execution many times.

~~~
shoo
except during iterative development, say, where maybe you execute once for
test between each edit

~~~
thesz
You rebuild Unreal Engine from scratch after every edit?

I think not.

The difference will be something like 1s vs 2s, given the overall difference
in speed. I suspect that your time spent thinking about the problem will be
two orders of magnitude more.

Mine is - minutes between builds.

------
rockdoe
So GCC takes longer to compile but produces a better optimized binary? That
makes the comparison pretty useless as you can probably just reduce the GCC
optimization level.

~~~
matthewmacleod
It produces a _smaller_ binary, but that doesn't really give us any
information on whether it's a better one.

AFAIK Clang and GCC are now roughly neck-and-neck in terms of performance
though (at least as of December) -
[http://www.phoronix.com/scan.php?page=article&item=llvm34_gc...](http://www.phoronix.com/scan.php?page=article&item=llvm34_gcc49_compilers)

~~~
cottonseed
GCC produces a slightly smaller binary when debug information is stripped, but
LLVM produces a massively smaller binary (half as big) when debug information
is included. If you're building debug, why would you care how big the stripped
binary is?

~~~
userbinator
This invites the question of whether GCC is outputting more _detailed_
debugging information than LLVM (which is a good thing), or if it's just being
inefficient at it.
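
One rough way to answer that would be to compare the DWARF sections of the two
binaries directly, e.g. with GNU binutils (binary name hypothetical):

    size -A -d UE3-gcc | grep debug           # per-section sizes (.debug_info, .debug_str, ...)
    readelf --debug-dump=info UE3-gcc | less  # dump the actual DIEs to see what's in them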

~~~
cottonseed
The g++/gdb experience for C++ has always been pretty bad. Anyone know how
lldb compares?

~~~
dmm
In what ways is it bad? It works pretty damn well here.

~~~
cottonseed
Generally, failing to parse expressions at the level of abstraction of the
source code being debugged: I find it fails on user defined operators, can't
resolve overloaded functions, etc.

Even being able to reliably evaluate C++ expressions is far from what I want:
I want to be able to execute arbitrary C++ statements (including loops), use
templated objects that weren't instantiated at compile time, etc. I feel like
this might be close to possible using, say, the LLVM JIT, although I haven't
seen anything that implies anyone is trying to build it.

edit: for example (stock gcc/gdb on Ubuntu 12.04):

    
    
        #include <iostream>
        int main() { std::cout << "foo\n"; }
    

in gdb, break at main:

    
    
        (gdb) p std::cout << "foo\n"
        Cannot resolve function operator<< to any overloaded instance
    

edit edit: lldb doesn't fare much better (OSX 10.9):

    
    
        (lldb) p std::cout << "foo\n"
        error: no member named 'cout' in namespace 'std'
        error: 1 errors parsing expression

~~~
emaste
LLDB does JIT your expression via Clang, and then either interprets it locally
or executes it in the target, and you can execute arbitrary commands - for
example:

    
    
      (lldb) expr for (int i=1; i<5; i++) { printf("%d\n", i * i); }
      1
      4
      9
      16
    

It does still fail if the appropriate constructors, templated object
instantiations, etc. aren't in the target, as you point out.
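
For what it's worth, a common workaround is to force the instantiations you
expect to poke at into the target yourself (hypothetical type):

    #include <vector>

    struct Enemy { int hp; float x, y; };

    // Explicit instantiation definition: emits the full std::vector<Enemy>
    // code (and its debug info) into this translation unit, so the debugger
    // has something to call.
    template class std::vector<Enemy>;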

~~~
cottonseed
Oh, very nice. That's clearly the right architecture.

It still fails on a lot of things for me (including your above example and the
std::cout example, although variants work), which I'm guessing (hoping?) is
just a lack of maturity. Looking forward to seeing how it progresses.

------
JupiterMoon
According to this:
[http://gcc.gnu.org/ml/gcc/2014-04/msg00195.html](http://gcc.gnu.org/ml/gcc/2014-04/msg00195.html)
these benchmarks are already out of date...

------
cpeterso
Building Firefox for OS X using clang on my MacBook Pro takes about 8 minutes.
Cross-compiling Firefox for Android using gcc 4.7 on the same MacBook Pro
takes about 45 minutes. :(

~~~
azakai
But that's not apples-to-apples. How about building for OS X using gcc?

~~~
cpeterso
Would you expect building for OS X using gcc to be much faster than building
for Android? The KLOCs compiled for OS X and Android should be about the same,
though the Android build is emitting ARM code and compiling some Java files
(but that only takes 1–2 minutes).

~~~
azakai
It's quite possible the ARM and x86/x86-64 backends differ significantly in
compilation speed, for example.

------
Steltek
If gcc is producing a final binary that's twice the size of clang's, wouldn't
that point to IO being a big factor in compile time?

What is this test really trying to determine? How the performance of the core
of each compiler impacts productivity? Shouldn't the test parameters be
adjusted to real world conditions, as seen by the test creator, then?

Like most benchmarks, you could fabricate scenarios that favor one or the
other all day long. Some closer to reality, some completely synthetic.

~~~
masklinn
> If gcc is producing a final binary that's twice the size of clang's,
> wouldn't that point to IO being a big factor in compile time?

250MB of extra output adding 5 minutes of compile time would mean roughly
800KiB/s (250MB over ~300s). That's about what a 5400RPM HDD gets on random
writes. Possible, but it would mean GCC has an utterly terrible I/O
architecture and flushes to disk after more or less every write call.

> What is this test really trying to determine?

Nothing? It's just trying to confirm or refute the initial off-hand remark:
[https://twitter.com/icculus/status/448281959538888704](https://twitter.com/icculus/status/448281959538888704)

> Like most benchmarks, you could fabricate scenarios that favor one or the
> other all day long. Some closer to reality, some completely synthetic.

I'm not sure how you can get closer to reality than "I spend my days compiling
Goat Simulator with Clang and GCC, let's compile Goat Simulator with Clang and
GCC".

------
mschuster91
Seriously, a 70MB binary?

Is this all program code or does this include resources?

~~~
Cthulhu_
Well, judging by the title, it's the Unreal 3 engine; I'd say that would end
up being a pretty big binary.

~~~
mschuster91
The problem is performance: a small binary might end up having lots of its
code in L1 (16KByte) / L2 (512KByte) cache (Dec 2013 values for i7, taken from
[https://www.scss.tcd.ie/Jeremy.Jones/CS3021/5%20caches.pdf](https://www.scss.tcd.ie/Jeremy.Jones/CS3021/5%20caches.pdf))...
but a 70MB binary should produce a lot of cache misses.

I wonder what performance penalties arise from that.

~~~
w0utert
I assume the 70MB binary statically links in all kinds of assets that blow up
the file size, static arrays of pre-calculated data etc. There's no way it
compiles to 70MB of assembly.

The code size would still be massively bigger than the CPU's instruction
cache, but the working set that actually matters for performance doesn't have
to bear any relation to the total code size, as long as the time-critical
sections are small and localized.

~~~
sharpneli
Considering the massive size increase due to debug symbols, I'd guess that a
sizable part of the 70MB indeed is compiled code: at least 20MB, and probably
even more.

C++ templates, while being absolutely lovely to write and use, tend to do
that. Every different type you instantiate a template with produces a separate
compiled copy of the whole template. Write a sorting function and use it for 5
types -> you get the same sort compiled 5 times, with differences that relate
to the types used, along with 5 massive symbol names (C++ template symbols
tend to be really long).
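
A minimal illustration (hypothetical names): one template in the source, one
compiled copy per type it's instantiated with.

    #include <algorithm>
    #include <vector>

    template <typename T>
    void sort_values(std::vector<T>& v) {   // written once...
        std::sort(v.begin(), v.end());
    }

    // ...emitted once per instantiation, each under its own mangled symbol:
    template void sort_values<int>(std::vector<int>&);
    template void sort_values<float>(std::vector<float>&);
    template void sort_values<double>(std::vector<double>&);

Compile that to an object file and nm -C will show three separate sort_values
symbols, each dragging along its own expansion of std::sort.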

~~~
exDM69
> Write a sorting function and use it for 5 types -> you get the same sort in
> assembler 5 times with differences that relate to the types they use,
> alongside with 5 massive symbol names (C++ template symbols tend to be
> really large).

To be fair, this is also why C++ std::sort beats C's qsort hands down on
performance. Calling the comparator through a function pointer is expensive
compared to doing an inlined integer comparison. Additionally, the boundary
between translation units (i.e. object files) inhibits a lot of optimization
that could otherwise take place (and does take place in the C++ case, when the
comparator calls are inlined).
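
A minimal illustration of the point: with qsort the comparator is an opaque
function pointer called once per comparison, while with std::sort the
comparator is part of the instantiated type and can be inlined.

    #include <algorithm>
    #include <cstdlib>

    // C style: an indirect call through a function pointer for every comparison.
    int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }
    void sort_c(int* data, std::size_t n) {
        std::qsort(data, n, sizeof(int), cmp_int);
    }

    // C++ style: the lambda is baked into std::sort's instantiation, so the
    // comparison is typically inlined down to a single integer compare.
    void sort_cpp(int* data, std::size_t n) {
        std::sort(data, data + n, [](int a, int b) { return a < b; });
    }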

There are also techniques to avoid C++ template bloat, such as type erasure,
although that reduces the template code to a thin shim that provides type
safety.
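
A minimal sketch of the idea: the sorting logic is compiled exactly once
against untyped memory, and the template shrinks to a type-safe shim (at the
cost of reintroducing the indirect comparator call that std::sort avoided).

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Type-erased core: one compiled copy, working on raw bytes (a simple
    // insertion sort, kept deliberately small for the sketch).
    inline void sort_impl(void* base, std::size_t count, std::size_t size,
                          bool (*less)(const void*, const void*)) {
        std::vector<unsigned char> tmp(size);
        auto* bytes = static_cast<unsigned char*>(base);
        for (std::size_t i = 1; i < count; ++i) {
            std::memcpy(tmp.data(), bytes + i * size, size);
            std::size_t j = i;
            while (j > 0 && less(tmp.data(), bytes + (j - 1) * size)) {
                std::memcpy(bytes + j * size, bytes + (j - 1) * size, size);
                --j;
            }
            std::memcpy(bytes + j * size, tmp.data(), size);
        }
    }

    // Thin, type-safe shim: each instantiation contributes little more than
    // a comparator and a forwarding call.
    template <typename T>
    void sort_erased(T* data, std::size_t count) {
        sort_impl(data, count, sizeof(T), [](const void* a, const void* b) {
            return *static_cast<const T*>(a) < *static_cast<const T*>(b);
        });
    }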

