
C++ Headers are Expensive - kbwt
http://virtuallyrandom.com/c-headers-are-expensive/
======
AndyKelley
In the Zig stage1 compiler (written in C++), I tried to limit all the C++
headers to as few files as possible. Not counting vendored dependencies, the
compiler builds in 24 seconds using a single core on my laptop. It's because
of tricks like this:

    
    
        /*
         * The point of this file is to contain all the LLVM C++ API interaction so that:
         * 1. The compile time of other files is kept under control.
         * 2. Provide a C interface to the LLVM functions we need for self-hosting purposes.
         * 3. Prevent C++ from infecting the rest of the project.
         */
    
    
        // copied from include/llvm/ADT/Triple.h
    
        enum ZigLLVM_ArchType {
            ZigLLVM_UnknownArch,
        
            ZigLLVM_arm,            // ARM (little endian): arm, armv.*, xscale
            ZigLLVM_armeb,          // ARM (big endian): armeb
            ZigLLVM_aarch64,        // AArch64 (little endian): aarch64
        ...
    

and then in the .cpp file:

    
    
        static_assert((Triple::ArchType)ZigLLVM_UnknownArch == Triple::UnknownArch, "");
        static_assert((Triple::ArchType)ZigLLVM_arm == Triple::arm, "");
        static_assert((Triple::ArchType)ZigLLVM_armeb == Triple::armeb, "");
        static_assert((Triple::ArchType)ZigLLVM_aarch64 == Triple::aarch64, "");
        static_assert((Triple::ArchType)ZigLLVM_aarch64_be == Triple::aarch64_be, "");
        static_assert((Triple::ArchType)ZigLLVM_arc == Triple::arc, "");
        ...
    

I found it more convenient to redefine the enum and then static_assert that all
the values are the same (which has to be updated with every LLVM upgrade) than
to use the actual enum, which would pull in a bunch of other C++ headers.

The file that has to use C++ headers takes about 3x as long to compile as
Zig's ir.cpp file, which is nearing 30,000 lines of code but only depends on
C-style header files.

~~~
pjmlp
Any plans to actually bootstrap the compiler?

~~~
AndyKelley
[https://github.com/ziglang/zig/issues/853](https://github.com/ziglang/zig/issues/853)

------
beached_whale
You can see where your time is going, at least with clang, by adding -ftime-
report to your compiler command line. Often the reason the headers take a long
time is that the compiler can do a better job at optimizing and inlining when
everything is visible. Just timing your compiles is like trying to find things
in the dark: you know the wall is there, but what are you stepping on? :) It's
good to know what is taking a long time, but it may not be the header itself
so much as the extra work the compiler can now do to give a (potentially)
better output.

~~~
fouronnes3
I've been working with -ftime-report, but unfortunately it reports times per
cpp file. I'm looking for a way to get a summary across an entire CMake build.
Right now reading 100+ -ftime-report outputs is not really useful, although
deep down I know it's all template instantiation anyway.

~~~
beached_whale
When I look most of the time has gone to inline and optimization. But I only
look sometimes and sample size is Me

------
nanolith
I recommend three things for wrangling compile times in C++: precompiled
headers, using forward headers when possible (e.g. iosfwd and friends), and
implementing an aggressive compiler firewall strategy when not.

The compiler firewall strategy works fairly well in C++11 and even better in
C++14. Create a public interface with minimal dependencies, and encapsulate
the details for this interface in a pImpl (pointer to implementation). The
latter can be defined in implementation source files, and it can use
unique_ptr for simple resource management. C++14 added the missing
make_unique, which eases the pImpl pattern.
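
A minimal sketch of that firewall, using a hypothetical Widget class (all
names below are made up for illustration):

        // widget.h -- public interface, no heavy includes
        #include <memory>

        class Widget {
        public:
            Widget();
            ~Widget();                    // defined out of line, where Impl is complete
            int frobnicate(int x) const;
        private:
            struct Impl;                  // forward declaration only
            std::unique_ptr<Impl> impl_;
        };

        // widget.cpp -- the expensive headers stay here
        #include "widget.h"
        #include <vector>                 // stand-in for a heavy dependency

        struct Widget::Impl {
            std::vector<int> data{1, 2, 3};
        };

        Widget::Widget() : impl_(std::make_unique<Impl>()) {}
        Widget::~Widget() = default;      // Impl is complete at this point

        int Widget::frobnicate(int x) const {
            return static_cast<int>(impl_->data.size()) + x;
        }

Any translation unit that includes widget.h never sees <vector> or the rest
of Impl's dependencies.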

That being said, compile times in C++ are going to typically be terrible if
you are used to compiling in C, Go, and other languages known for fast
compilation times. A build system with accurate dependency tracking and on-
demand compilation (e.g. a directory watcher or, if you prefer IDEs,
continuous compilation in the background) will eliminate a lot of this pain.

~~~
shereadsthenews
The pImpl pattern is great for those who don’t care about performance, but it’s
inappropriate for most header libraries. You wouldn’t want a library that
hides the implementation of std::vector, for example. With a visible
implementation the compiler can compile e.g. operator[] down to one x86
instruction. With a pImpl pattern it will in all likelihood be an indirect
function call, which will be hundreds of times slower. It can make sense for
libraries where every function is really expensive anyway, but it’s ruinous
for the STL and the like.

~~~
kccqzy
Using the pimpl pattern doesn't mean an indirect function call. The function
to be called is always known. It's just an extra indirection in the data
member. It's cheap. Think of it as Java-style memory layout: everything
non-primitive stored in an object is a reference and therefore sits behind one
level of indirection. The performance of Java is acceptable in the vast
majority of use cases. Using pimpl will be the same.

~~~
Someone
_“It's just an extra indirection in the data member. It's cheap”_

That extra indirection often means a cache miss. That isn’t cheap. Accessing
each item traversed through a pointer can easily halve program speed.

Java tries hard to avoid the indirections (local objects may live on the
stack, their memory layout need not follow what the source code says, objects
may even exist only in CPU registers).

~~~
repsilat
Hmm... if you were a horrible person you could declare a `char[n]` member
instead of a pointer. Then you could placement-new the impl in the
constructor, and static-assert that `sizeof(impl) <= n`... No more cache
misses :-).

:-(
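
Roughly, the idea (a sketch with hypothetical names, deliberately ignoring
alignment):

        // widget.h -- raw storage instead of a pointer; no heap, no extra hop
        #include <cstddef>
        #include <new>

        class Widget {
        public:
            Widget();
            ~Widget();
        private:
            static constexpr std::size_t kBufSize = 64;  // guessed upper bound on sizeof(Impl)
            char buf_[kBufSize];                         // the impl lives in here
        };

        // widget.cpp
        struct Impl { int a, b; };                       // the real implementation

        Widget::Widget() {
            static_assert(sizeof(Impl) <= kBufSize, "enlarge kBufSize");
            new (buf_) Impl{};                           // placement-new into the member buffer
        }

        Widget::~Widget() { reinterpret_cast<Impl*>(buf_)->~Impl(); }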

~~~
yuushi
This doesn't take into account the alignment of the type though (you'd want to
use std::aligned_storage<sizeof(T), alignof(T)>), but that requires knowing
enough about T to be able to use sizeof() and alignof(), which means no
incomplete types, bringing us back to where we started.

------
AdieuToLogic
If C++ compile time is a concern and/or impediment to productivity, I
recommend the seminal work regarding this topic by Lakos:

Large-Scale C++ Software Design[0]

The techniques set forth therein are founded in real-world experience and can
significantly reduce large-scale system build times. Granted, the book is
dated and likely not entirely applicable to modern C++, yet it remains the
best resource on insulating modules/subsystems and optimizing compilation
times, IMHO.

0 - [https://www.pearson.com/us/higher-education/program/Lakos-
La...](https://www.pearson.com/us/higher-education/program/Lakos-Large-Scale-
C-Software-Design/PGM136492.html)

~~~
de_watcher
If it's the book I'm thinking of, it already seemed very dated to me 10 years
ago. Too many limitations, and there are some weird rules about boundaries
between elements of the architecture.

------
kazinator
Speaking of GNU C++ (and C), the headers are getting cheaper all the time
compared to the brutally slow compilation itself.

Recently, after ten years of not using _ccache_, I was playing with it again.

The speed-up you get from _ccache_ today is quite a bit more than a decade
ago; I was amazed.

 _ccache_ does not cache the result of preprocessing. Each time you build an
object, _ccache_ passes it through the preprocessor to obtain the token-level
translation unit, which is then hashed to see if there is a hit (a ready-made
.o file can be retrieved) or a miss (the preprocessed translation unit has to
be compiled).

There is now more than a 10-fold difference between preprocessing, hashing and
retrieving a .o file from the cache, versus doing the compile job. I just did
a timing on one program: 750 milliseconds to rebuild with ccache (so
everything is preprocessed and ready-made .o files are pulled out and linked).
Without ccache: 18.2 seconds. A 24X difference! So, approximately speaking,
preprocessing is less than 1/24th of the cost.

Ancient wisdom about C used to be that more than 50% of the compilation time
is spent on preprocessing. That's the environment from which came the
motivations for devices like precompiled headers, #pragma once and having
compilers recognize the #ifndef HEADER_H trick to avoid reading files.
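
For reference, that trick is just the usual guard wrapping the whole file (a
hypothetical foo.h), which compilers learned to recognize so they can skip
re-reading the file on later includes:

        /* foo.h */
        #ifndef FOO_H
        #define FOO_H

        int foo(int x);

        #endif /* FOO_H */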

Nowadays, those things hardly matter.

Nowadays, when you're building code, the rate at which .o's "pop out" of the
build subjectively appears no faster than two decades ago, even though memory
sizes, L1 and L2 cache sizes, CPU clock speeds, and disk space are vastly
greater. Since not a lot of development has gone into preprocessing, it has
more or less sped up with the hardware, but overall compilation hasn't.

Some of that compilation laggardness is probably due to the fact that some of
the algorithms have tough asymptotic complexity; just extending their scope to
do a slightly better job causes the time to rise dramatically. However, even
compiling with -O0 (optimization off), though faster, is still shockingly
slow, given the hardware. If I build that 18.2-second program with -O0, it
still takes 6 seconds: an 8X difference compared to preprocessing and linking
cached .o files in 750 ms. A far cry from the ancient wisdom that character-
and token-level processing of the source dominates the compile time.

~~~
int_19h
> Ancient wisdom about C used to be that more than 50% of the compilation time
> is spent on preprocessing.

Ancient wisdom was that more than 50% of the time is spent _compiling_ the
headers, once they become part of your translation unit after preprocessing.
I don't see why preprocessing itself would ever be singled out, given that
it's comparatively much simpler than actual compilation.

~~~
josefx
Opening and reading all the included files could be costly. Also, it is
"ancient" wisdom, so it might predate compilers that could detect the include
guard pattern and had to repeatedly preprocess the same files. There is an old
"Notes on Programming in C" article by Rob Pike that comes up every now and
then with a paragraph against include guards, for that now-outdated reason.

------
RcouF1uZ4gsC
> The test was done with the source code and includes on a regular hard drive,
> not an SSD.

In my opinion, this makes any conclusion dubious. If you really care about
compile times in C++, step 0 is to make sure you have an adequate machine (at
least a quad-core CPU, plenty of RAM, and an SSD). If the choice is between
spending programmer time trying to optimize compile times and spending a
couple hundred dollars on an SSD, 99% of the time spending the money on the
SSD is the correct solution.

~~~
cjensen
Just to address part of your concern: Traditionally disk speed makes very
little difference to compile times for real world C/C++ projects. This is
because real world projects have many files, and each one can be compiled in
parallel. Once you spawn sufficient compilers in parallel, the CPU becomes the
bottleneck, not the disk. (I.e. when a compilation asks for I/O, it then
yields the CPU to other compilers which have CPU work to do)

Note that Visual Studio, for example, does a poor job of this because it only
spawns one compilation per CPU thread. This results in individual threads
being idle more than they ought to be.

~~~
ahaferburg
Absolutely not true. The problem is not the compilation of one single file,
but that every one of these single files pulls in large amounts of headers,
distributed over various libraries (e.g. Qt/Boost/STL), all of which won't
fit into the disk cache.

If it doesn't make a difference, all that means is that your project is small,
or doesn't have too many dependencies. Good for you. But that's not the
reality for all projects.

~~~
cjensen
My projects take 10 minutes to build on a modern system, which is plenty
complicated enough. Don't appreciate the "good for you" flippancy.

------
lbrandy
All of msvc, gcc, clang, and the isocpp committee have active work ongoing for
C++ modules.

We'll have them Soon™.

~~~
Valmar
Who knows whether they'll see much use, due to C++ needing to keep backwards
compatibility for older projects that demand older versions of C++.

It probably partially depends on whether old-style headers can be used
simultaneously with new-style modules.

------
fpoling
Opera contributed the jumbo build feature to Chromium. The idea is to feed the
compiler not the individual sources, but a file that includes many sources.
This way common headers are compiled only once. The compilation time saving
can be up to a factor of 2 or more on a laptop.

The drawback is that sources from the jumbo cannot be compiled in parallel.
So if one has access to an extremely parallel compilation farm, like
developers at Google have, it will slow things down.
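
A minimal sketch of what such a jumbo translation unit looks like (file names
here are hypothetical):

        // jumbo_unit_0.cc -- handed to the compiler in place of the individual
        // sources; the headers pulled in by a.cc, b.cc and c.cc are parsed only once.
        #include "a.cc"
        #include "b.cc"
        #include "c.cc"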

~~~
maccard
> The drawback is that sources from the jumbo cannot be compiled in parallel.
> So if one has access to an extremely parallel compilation farm, like
> developers at Google have, it will slow things down.

Generally the way this works is that rather than combining everything into one
jumbo file, you combine the sources into several jumbo files, which can then
be compiled in parallel. UE4 supports it (disclosure: I work for them), and it
works by including 20 files at a time and compiling the larger files normally.

There is also a productivity slowdown where a change to any of the included
files causes all the other files in the same jumbo to be recompiled, so you
can pull the files you are actively working on out of the jumbo and compile
them individually.

> The compilation time saving can be up to a factor of 2 or more on a laptop.

The compilation time savings are orders of magnitude in my experience, even on
a high-end desktop. That's for a full build. For an incremental build, there
is a penalty (see above for workarounds).

------
mcv
This reminds me of my very first job after university. We used Visual C++,
with some homebrew framework with one gigantic header file that tied
everything together. That header file contained thousands or possibly tens of
thousands of const uints, defining all sorts of labels, identifiers and
whatever. And that header file was included absolutely everywhere, so every
object file got those tens of thousands of const uints taking up space.

Compilation at the time took over 2 hours.

At some point I wrote a macro that replaced all those automatically generated
const uints with #defines, and that cut compilation time to half an hour. It
was quickly declared the biggest productivity boost by the project lead.
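
The kind of change involved looks roughly like this (identifiers are
hypothetical):

        // Before: with that compiler, every translation unit including the header
        // carried its own copy of each constant in its object file.
        const unsigned int ID_FILE_OPEN = 0x0001;
        const unsigned int ID_FILE_SAVE = 0x0002;

        // After: plain textual substitution; nothing is emitted per object file.
        #define ID_FILE_OPEN 0x0001
        #define ID_FILE_SAVE 0x0002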

------
fizwhiz
Isn't this the reason precompiled headers are a thing?

~~~
jchw
As far as I understand it's also one of the reasons modules are a thing... or
at least people want them to be.

Precompiled headers are a pretty ugly solution and the way they've been
implemented in the past could be really nasty. (IIRC in old GCC versions it
would copy some internal state to disk, then later load it from disk and
manually adjust pointers!)

~~~
jepler
There must still be some dark pointer magic going on, because I noticed that
unless I disabled ASLR on Debian Stretch, each build of a precompiled header
came out different, screwing up ccache. I can only conclude that the specific
memory layout during an individual run influences the specific precompiled
header (".gch") output. We now run our build process under 'setarch x86_64
--addr-no-randomize.

~~~
jepler

        $ for i in `seq 3`; do gcc-6 -x c-header /dev/null -o x.h.gch; sha256sum x.h.gch; done
        98d8093503565836ba6f35b7adf90330d63d9d1c76dfb8e3ad1aeb2d933d1a45  x.h.gch
        17e5de099860d94aaa468c5ad103b3f0dd5e663f6cdbd01b4f12cf210023e71c  x.h.gch
        3cc2f1c0a517b5fedbbd49bb3a34084d9aa1428f33f3c30278a8c61f9ed9ba88  x.h.gch

------
timvisee
I would love to see the times of this on a Linux system (preferably on the
same hardware).

