The C++ Build Process Explained (github.com)
358 points by green7ea 67 days ago | 126 comments



> You could put a function's definition in every source file that needs it but that's a terrible idea since the definition has to be the same everywhere if you want anything to work. Instead of having the same definition everywhere, we put the definition in a common file and include it where it is necessary. This common file is what we known as a header.

In C and C++ parlance "definition" should be "declaration" and "implementation" should be "definition".[1] The terminology is important if you don't want to get confused when learning more about C and C++. This is compounded by the fact that some languages describe these roles in the author's original terms. (Perhaps the author's terminology reflects his own confusion in this regard?)

[1] This is indisputable given the surrounding context, but I didn't want to paste 3-4 whole paragraphs.


Thank you for the correction, you are entirely correct and it is a big oversight on my part. I tried to use as few technical terms as possible to make the article more approachable but ended up doing something worse: I misused a technical term which is misleading.

I will correct this as soon as I have some time to do so.


Thanks for the write-up and the effort to fix mistakes. Great stuff.


Yes, indeed.

In extremely old code, I sometimes see people preferring to manually write declarations of functions (even libc functions) in every source file instead of including a header.

To add to the confusion, in certain cases, it is possible to put a function's definition in a header file (for example if it's a function template, or in an anonymous namespace, or the static keyword is used to indicate internal linkage). So it is possible to write this function definition manually in every translation unit.

Otherwise, the ODR rule requires functions to be defined exactly once.

> One and only one definition of every non-inline function or variable that is odr-used is required to appear in the entire program (including any standard and user-defined libraries). The compiler is not required to diagnose this violation, but the behavior of the program that violates it is undefined.
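As a quick sketch of the cases mentioned above (names here are made up for illustration), all of the following may legally be defined in a header and therefore end up in every translation unit that includes it:

    // sketch.h
    template <typename T>
    T twice(T v) { return v + v; }        // function template: instantiated per TU, deduplicated at link time

    inline int three() { return 3; }      // inline: may appear in many TUs if every definition is identical

    static int local_id() { return 42; }  // internal linkage: each TU gets its own private copy

    namespace {                           // unnamed namespace: also internal linkage, one copy per TU
        int hidden() { return 7; }
    }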


Interestingly, a form of the "one definition rule" even applies to some of those functions that can appear in multiple translation units, specifically inline functions and template functions (not static functions or those in an unnamed namespace). In those cases it says that they must be identical in all the translation units that they're defined in, so it's more like a "unique definition rule" for them.

This sounds like it would be easy – just put the definition in a header file. But even if the text of a function is identical in different translation units, it can still be ODR-different between them if the symbols that they look up are different due to other header files included before them declaring different things or doing things like "using namespace std". Argument-dependent lookup is especially dangerous here. As with other ODR violations, this causes undefined behaviour and compilers/linkers aren't required to issue a diagnostic (and they usually don't!). I believe C++20 modules will solve this problem.
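A minimal sketch of that failure mode (all file and function names invented): the inline function's text is identical in both translation units, but unqualified lookup finds a different scale() in each, so the program violates the ODR.

    // norm.h
    inline int norm(int x) { return scale(x); }  // which scale()? whatever was declared before inclusion

    // a.cpp
    static int scale(int x) { return x * 2; }
    #include "norm.h"                             // norm() here calls a.cpp's scale

    // b.cpp
    static int scale(int x) { return x * 3; }
    #include "norm.h"                             // same text, different meaning: ODR violation, UB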


The functions can also be different if different compilation units were compiled using different compiler settings (or even different compilers).

In Titus Winters' recent Pacific++ talk[1], he pointed out that even something as simple as including an 'assert' statement will violate the ODR if some compilation units are compiled with 'debug' settings and some aren't. This can easily happen with build systems that cache compiled object files if changes in compilations flags don't automatically invalidate the cache.

[1] https://youtu.be/IY8tHh2LSX4?t=900


> even if the text of a function is identical in different translation units, it can still be ODR-different between them if the symbols that they look up are different due to other header files included before them declaring different things

Exactly. This is why you should never put anonymous namespaces and definitions of objects with static linkage in a header file.


> this causes undefined behaviour and compilers/linkers aren't required to issue a diagnostic (and they usually don't!)

-fsanitize=undefined with LTO has shown ODR violations for a long time


As a non-C++ programmer, what I don't understand is why people need to write header files by hand. Surely they should be auto-generated from the source code?


Remember you are writing your comment in 2018; what was feasible in 1975 was very different. Programmers back then were smart enough to pull the job off, but the computers they had were so limited that it wasn't worth it. First, because the computers were limited, programs had to be smaller, so the benefit wasn't as great. Second, computers were slower, so it was worth spending extra human effort once to save a lot of computer effort.

C++ has been working on modules to fix this for years now. It turns out to be harder than you would think. Everyone agrees with the basic problem statement, but there is disagreement on the details. All sides have good points in favor of their approach, and there are some places where the approaches are incompatible. Progress is being made (and a lot of the incompatibilities turned out to be solvable, but it took years of thinking to figure out how).


Header-generator tools are out there, but I'm not sure how usable they are. It's rarely done though. It's a bit of a pain having to 'write everything twice', but there's more to header files than re-declaring what you've done in your source files.

As berti said: enums, constants, macros, simple struct type declarations, typedefs, and templates, are examples of hand-written code that might belong entirely in the header file.

It would be possible to keep these constructs in a hand-written header file while auto-generating the rest, but C++ developers aren't convinced this is worth doing.


Headers are used to control visibility of declarations (note though that visibility of the symbols themselves is controlled by other means), and typically contain more than just function declarations (structs, enums, etc.). There are basically two common patterns to split public and private declarations:

- public header for consumers of the interface, private header for internal use only

- public header for consumers, private declarations directly in the .cpp/.c file


A C++ header file is much more than just an interface to an implementation. Sure, it can be used that way, but it doesn't have to be.

Have you checked out the standard library's header files? If you do, you'll see they are full of templates, and constexpr functions. In some sense, the header file is where most of the code is.

Have you also checked out the concept of a single-header dependency? Due to the myriad different ways to manage dependencies, those dependencies that consist of just a header file are extremely convenient to have. This is the logical conclusion when you can put increasingly complicated things in header files.


If anybody is curious about how templates are implemented, there are many ways, but the two historically most popular are:

Prelinker:

1. Compile each file, noting which templates are needed in a section in the object file

2. Have a special program called the "prelinker" run before the linker that reads each .o file, and then somehow instantiates each template (which usually, but not always requires reparsing the C++ file)

Weak Symbols:

1. When compiling, instantiate every needed template, but mark it somehow in the object file as being weak, so that the linker only pulls in one instantiation for each definition.

The prelinker used to be more popular: if you e.g. instantiate the same template in every single file, your compiler does tons more work with the weak-symbols approach. But now weak symbols are popular, both because they are much simpler to implement and because compilation is usually parallel while linking typically is not, which means wall-clock times may even be faster.
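A rough illustration of the weak-symbols approach (file names invented): every object file that uses square<int> carries its own copy of the instantiation, marked weak, and the linker keeps just one.

    // square.h
    template <typename T>
    T square(T v) { return v * v; }

    // a.cpp
    #include "square.h"
    int fa() { return square(2); }  // emits int square<int>(int) into a.o

    // b.cpp
    #include "square.h"
    int fb() { return square(3); }  // emits int square<int>(int) into b.o as well

On ELF toolchains, nm a.o and nm b.o will typically both list square<int> with a 'W' (weak) marker, and the final binary ends up with a single copy.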


Who actually used the prelinking implementation, aside from those very few compilers that valiantly tried to support "export template"?


Nobody, AFAIK. Even with an explicit "export template", it's basically impossible because of the interaction of all the features of the C, C++, and CPP parts of the language. (Precompiled headers are/were a thing, but they're very brittle. Of course, one assumes you already know this; I'm trying to provide additional exposition.)

People started to use templates for metaprogramming, and from that point on the scope for "reuse" of templates isn't really there. (Reusing parsing might be plausible, but it's really difficult since parsing is extremely context-sensitive due to SFINAE, #defines, etc.)

Some might comment that "modules" is "export template" all over again, but this time there are actually 2-3 implementations of 2-3 of the proposals and everyone is confident that the remaining minor problems can be resolved satisfactorily... and they're all exchanging experiences to help each other!


"Precompiled headers are/were are thing, but they're very brittle."

In compilers other than Visual Studio, yes (or so I'm mostly told - I don't have all that much experience with them), but msvc has had them since at least VS6 (late 1990's) when I first started using them, and they work very well and have saved me many, many hours since then. Maybe once or twice I had to delete the pch file in all that time, and that was most likely more of an issue of the GUI mangling the saved internal state than the actual compiler.

I've heard pushback against precompiled headers from Unix land for 2 decades, and I'm not really sure where it comes from. I have the impression it's mostly cognitive dissonance - 'msvc has it and gcc doesn't, therefore it must be bad because gcc is 'better' than msvc'. It's similar to #pragma once - in use, it's objectively better in every possible way than include guards are, and gcc fanboys still dismissed it back when gcc didn't have it.


FWIW, gcc has had pragma once for ages. On the other hand, recently we had issues with MSVC not recognizing that a header and its symlink were the same. GCC and clang had no problems.

There is a reason pragma once is not standardized. Defining when two include lines refer to the same file is extremely hard.


GCC has had PCH for years though. Every semi-large project I do with GCC / Clang gets the PCH treatment, in particular because it's so simple with CMake.


Prelinkers are dying, but 20 years ago they were the normal way of doing things. 30 years ago, Cfront had to use them because it was relying on existing unix linkers that did not have weak symbol support.


I am still not sold on templates. They look like they might help the Googles and Microsofts, but for most code bases they seem to be forcing a dual system of dependencies without much benefit.


I think you meant "modules"?


Yeah I did, oops. Templates are great


I meant to type modules


EDG's frontend used to be prelinker based. I would guess that it supports the weak symbols method now, since they were involved in developing the Itanium ABI:

https://itanium-cxx-abi.github.io/cxx-abi/abi.html#linkage

[Edit]

EDG has a customer list here, but not all customers are compiler makers; frontends are useful for analysis too.

https://www.edg.com/customers


Yep, and EDG was the only frontend to implement "export template" (and Comeau was the only backend to implement it, so far as I know). I think that's the only reason why they implemented it this way - outside of template export, there's no particular reason to do templates like that. The other technique, with folding duplicate sections in object files, is necessary for inline functions anyway, so might as well use it for templates...


There is also explicit instantiation: it's a language feature!

Split the template into declarations and definitions, similar to .h versus .cc.

Then #include the definition into some module where you write template instantiations for all the needed types.
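A small sketch of that layout (file names are illustrative):

    // stack.h -- declarations only; users of Stack<int> / Stack<double> need just this
    template <typename T>
    class Stack {
    public:
        void push(const T& v);
    };

    // stack.cpp -- the one TU that sees the definitions and instantiates them
    #include "stack.h"

    template <typename T>
    void Stack<T>::push(const T& v) { /* ... */ }

    template class Stack<int>;     // explicit instantiations: the only types
    template class Stack<double>;  // the library ships object code for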


Painful, but if you know that a template will only be instantiated for a finite set of types, it can be worth it.


gcc comments on this here [0]. Am I correct in understanding that the Cfront model is weak symbols, while Borland is the prelinker?

[0] https://gcc.gnu.org/onlinedocs/gcc-5.5.0/gcc/Template-Instan...


That's backwards if my memory is correct. Borland compiled a copy of the template for each file[1] you compiled, but cfront tried to compile each template only once.

Cfront also basically never worked well.

1: the technical term is "compilation unit" since #include means you're never compiling just one file


Very interesting! Which do gcc and clang use?


The prelinker approach is now largely considered a dead end. I've never seen one that wasn't really buggy.


Weak symbols



Aside, C++ would have been so much nicer without the need for header files ...


Headers allow you to ship a binary without the full source code. If you want to build a linux kernel module you don't need all of the linux sources, just the headers.


In principle you're right, but the notion that a header file exposes just the "interface" is completely false. Class definitions, private variables and functions, etc. are all exposed in the header file.

Header files are not a way to only expose the interface. You give up a lot more in C++.

I've never had to deal with pesky header files until I started developing C, and it immediately struck me as a royal pain in the ass. Even after a couple of years of developing C/C++, I find the whole concept of header files archaic. The #include preprocessor directive literally copy-pastes stuff with no intelligence whatsoever. The user is now burdened with ensuring #includes are guarded by what I call a patchy, half-baked, hacked-up solution - #IFDEF/#DEFINE/#ENDIF and #pragma in C++.

It should be handled automatically by the compiler/preprocessor or IDE and I believe it is now being addressed in the C++17/20 spec with the advent of "modules". This thing should have been written up way back in 1989.


> In principle you're right, but the notion that a header file exposes just the "interface" is completely false.

Sorry, but it is not "completely false". Doing it properly requires a carefully designed interface to hide internal data structures, and splitting out the end user headers from the internal headers, but it works.

> I've never had to deal with pesky header files until I started developing C, and it immediately struck me as a royal pain in the ass.

It's like saying, "I've never had to deal with pesky .py files until I started developing Python."


> Doing it properly requires a carefully designed interface to hide internal data structures, and splitting out the end user headers from the internal headers, but it works.

Out of curiosity, how? (If pointers or other mechanisms for memory indirection are allowed then it's pretty easy, so let's agree to ban those.)

Not a rhetorical question, despite the parenthetical.


For global or namespace functions you can hide the implementation in the .cpp file.

Doing the same with objects requires pointers to hide the details of any private members. A common pattern is called Pointer to IMPLementation (PIMPL).

Not sure why we need to exclude pointers?

In C++14 this is very much simplified with the use of unique pointers.

std::unique_ptr<Foo> make_foo(...);
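A minimal pimpl sketch along those lines (names are illustrative; assumes C++14 for std::make_unique):

    // foo.h -- nothing about Foo's data members leaks into the header
    #include <memory>

    class Foo {
    public:
        Foo();
        ~Foo();                    // defined in foo.cpp, where Impl is complete
        int value() const;
    private:
        struct Impl;               // incomplete type here
        std::unique_ptr<Impl> impl_;
    };

    // foo.cpp
    #include "foo.h"

    struct Foo::Impl { int secret = 42; };

    Foo::Foo() : impl_(std::make_unique<Impl>()) {}
    Foo::~Foo() = default;         // Impl is complete here, so unique_ptr can delete it
    int Foo::value() const { return impl_->secret; }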


Because I want full encapsulation without giving up by-value semantics. I want my clients to be able to take objects of some opaque record type that I've defined, and put them entirely on the stack, or in some contiguous block of memory entirely of their making, without ever getting to know the constituent fields of the record. I want it all, I want to have my cake and eat it too.

When the parent (er, great grandparent?) says their parent is wrong to deny that headers only reveal the interface, I think it's a little disingenuous to base that on an unrevealed assumption that PImpl is in play. PImpl is basically the idiom that begot Java. One reason I might opt for C++ over Java is a desire for finer control over the location of memory -- but it's important for me to know that to actually get that, I'll probably have to sacrifice information hiding. It's a trade-off. Yes, on some level it's better to have the option to make that trade-off, but the product here is "encapsulation or value semantics", not "encapsulation and value semantics".

Mind, I'm not a C++ developer. Maybe these days link-time heroic optimization makes the "right" decisions and collapses these kinds of indirections in all the sorts of situations you'd want it to. I write a lot more Java, and my understanding is that HotSpot gets up to a lot of heroics pertaining to this stuff these days -- I've noticed HotSpot will churn through a workload involving processing a collection of records far more quickly if you can arrange for it to stream through an array, even if you'd expect the records to be scattered randomly throughout memory.


Well, there is a way to have them on the stack even with PImpl: http://www.gotw.ca/gotw/028.htm

However I'd say it's an ugly hack which should only ever be used if you _really_ need the performance.
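For the curious, one way to keep a pimpl'd object entirely on the stack is an in-object buffer along these lines (a sketch only, not necessarily exactly what the linked GotW does; the buffer size is an assumption that has to be kept in sync with the real Impl):

    // widget.h
    #include <cstddef>
    #include <new>

    class Widget {
    public:
        Widget();
        ~Widget();
    private:
        struct Impl;                                           // defined only in widget.cpp
        alignas(std::max_align_t) unsigned char storage_[64];  // assumed upper bound on sizeof(Impl)
        Impl* impl_;                                           // points into storage_
    };

    // widget.cpp
    struct Widget::Impl { int x = 0; double y = 0.0; };

    Widget::Widget() {
        static_assert(sizeof(Impl) <= sizeof(storage_), "grow storage_");
        impl_ = ::new (static_cast<void*>(storage_)) Impl{};   // placement new into the in-object buffer
    }

    Widget::~Widget() { impl_->~Impl(); }                      // manual destruction, no delete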


> I want it all, I want to have my cake and eat it too.

Conceptually this should be possible with some preprocessor/header tinkering; something like:

private.h:

  struct Bar {
    double x;
    double y;
  };
  #define BAR_DEFINED
public.h:

  #ifndef BAR_DEFINED
  struct Bar {
    char opaque[16];
  };
  #endif

  struct Foo {
    Bar bar;
  };
consumer.cpp:

  #include "public.h"
private_implementation.cpp:

  #include "private.h"
  #include "public.h"


This code can crash on platforms where unaligned access is not allowed; you need an alignas on the public Bar. And of course there are the catastrophic bugs if the size of opaque is not kept in sync properly. The point is: the module system should be doing something like this for you behind the scenes.
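The alignment fix suggested above would look roughly like this (the size still has to be kept in sync by hand):

    struct Bar {
        alignas(double) char opaque[16];  // match the alignment of the private Bar's members
    };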


> I want my clients to be able to take objects of some opaque record type that I've defined, and put them entirely on the stack

If you want to put the object on the stack, the compiler has to know the size of the object to reserve enough space on the stack. How can it know the size of the object if it does not have its full definition somewhere?


A module symbol table, for example, where only the compiler can actually see the complete information about a type, while the consumer code can only access what is exposed as public.


This is fine if you can recompile a program against a new version of the library, but if you just want to relink it doesn't work well. This is actually quite common when a .so is replaced with a newer version in a system without rebuilding the world.

Some languages, like Ada I think, have first-class support for runtime-sized, stack-allocated types, so it might work there.


> Not sure why we need to exclude pointers?

Sometimes you need to. You cannot entirely stack allocate an object that uses PIMPL. Also you cannot allocate an array of such objects compactly in memory.

On the other hand, if you want to be able to evolve the class member variables but still maintain a stable ABI, you need to hide the memory layout, for example with PIMPL. But this is a C++ limitation. For example Objective-C* (and also soon Swift) allows modifying the class layout, adding properties etc, without changing the ABI.

https://en.wikipedia.org/wiki/Objective-C#Non-fragile_instan...


> Sometimes you need to. You cannot entirely stack allocate an object that uses PIMPL.

You actually can, if I'm not misunderstanding: http://www.gotw.ca/gotw/028.htm


They must have some level of indirection in the resulting code to accomplish that, like virtual inheritance. There is a price, and this can be done in C++ too, but you have to opt into the cost.


> Under the modern runtime, an extra layer of indirection is added to instance variable access,

so it's just syntax sugar for PIMPL.


Actually not, it works a bit differently. In Objective-C 1, there was no indirection and one had to explicitly use the equivalent of PIMPL to hide private members from the header or avoid the fragile base class problem.

In Objective-C 2 the object meta-data contains a table of instance variable offsets. The dynamic linker can modify this table at load time so you can freely add both instance variables and methods to new revisions of a class.

So what is the deal? Well, when the holder object itself is heap allocated, pimpl is inefficient because every access will require dereferencing two pointers. Also you cannot put protected or virtual members in the internal pimpl class (then there would be no point to have those in the first place).

That being said, it is not like Objective-C is some pinnacle of performance - you cannot allocate objects on the stack, and the compiler doesn't perform any devirtualization. So for performance critical code you have to drop down to C or...C++ :)


You can split "secret" data structures into another header, and them simply not give that secret header to the customer. If you give the binary and a header with some structures missing, the user can still use whatever you did give them.


> Out of curiosity, how? (If pointers or other mechanisms for memory indirection are allowed then it's pretty easy, so let's agree to ban those.)

Here's one popular way to do it in C++: https://en.cppreference.com/w/cpp/language/pimpl

I guess my tongue in cheek response to your parenthetical would be to ban the use of hidden internal data structures ;-)


How do you compare .py files with .h files? I am confused as to how you are drawing this analogy.


I mean header files are C and C++ source files, so saying you didn't have to deal with them until using C and C++ is a tautology. Likewise, you won't have to deal with .py files until you're using Python.


> In principle you're right, but the notion that a header file exposes just the "interface" is completely false. Class definitions, private variables and functions, etc. are all exposed in the header file.

If you don't want to expose class internals, just use the PIMPL idiom. It's an extra indirection to protect your own abstraction, so naturally C++ decides to opt for performance by default.


That’s an annoying workaround at best. It’s not necessary in C. It wouldn’t be necessary in C++ if you could declare incomplete class types but still declare their member functions. This would have zero performance impact and improve encapsulation.


> It’s not necessary in C

Sorry, what? Most C libraries I know hide their implementations behind opaque types such as `typedef void* my_handle_t`.


Right, I'm saying that's not necessary, you can just do this:

    struct my_type;
    struct my_type *my_type_new(void);
This is more common than `typedef void*` in my experience, because it actually provides some minimal amount of type safety.

You can also do this in C++ of course, but you can't use member functions this way. My real complaint here is that the Pimpl idiom in C++ is more cumbersome than a simple forward declaration and free functions, which is available in C.


> The #include preprocessor directive literally copy-pastes stuff with no intelligence whatsoever.

Well, it's easy to reason about them. The thinking goes in terms of translation units.

The real problem is that the actual interface is not enforceable: the symbols in the binaries are just names.


> Headers allow you to ship a binary without the full source code.

100% agree. That said, that doesn't mean that they're the only way to do it, or even the best way.


In principle I agree. An IDE should be able to automatically generate something like a header file regardless of language.

But still, no IDE or tool I've seen does this better. A header file gives you a very nice overview, and it really isn't any hassle to speak of to keep them up to date.

It takes getting used to (as with everything when starting out with a new language), but before you know it you might even start to miss them in other languages.


Header files absolutely are a hassle to keep up to date. Besides, they're a terrible experience. You have to pay attention to include order! You have to write forward declarations! You have to write include guards, in 2018!

They're also bad as an overview. They may have been good in the 1980s, but nowadays a proper documentation generator gives you better formatting, search features, and cross-references. Inline documentation is particularly obnoxious in header files: I like seeing function-level documentation alongside interface and implementation, but if I put the documentation in both the .cpp and the .h file it's duplicated in two places and easily gets out of date.


The order of headers mattering is very rare though, and not indicative of normal C++ code. It's rare, but not unheard of unfortunately, as most C code is also available to C++.


Does #pragma once count as an include guard in your book?


#pragma once is not in any C++ standard, although basically all modern compilers support it. The reason it is not in the standard is that its behaviour in the presence of symlinks may differ depending on the filesystem and compiler. Is it the unique combination of filename + content that should be included once or is it the file path? etc.


> The reason it is not in the standard is that its behaviour in the presence of symlinks may differ depending on the filesystem and compiler

I have the feeling that the difference matters to almost nobody and should be avoidable by the few people it actually affects. I had more issues with colliding include guards than I had with symlinks and I still end up replacing CLASSNAME_H with PROJECT_CLASSNAME_H in our own headers every now and then since the autogenerated guards are too naive.
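For comparison, the two spellings being argued about, with a project-prefixed guard of the kind mentioned above (names made up):

    // widget.h -- classic include guard; the project prefix avoids collisions
    #ifndef MYPROJECT_WIDGET_H
    #define MYPROJECT_WIDGET_H

    class Widget { /* ... */ };

    #endif // MYPROJECT_WIDGET_H

    // widget.h -- the non-standard but widely supported alternative
    #pragma once

    class Widget { /* ... */ };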


> You have to pay attention to include order! You have to write forward declarations! You have to write include guards, in 2018!

Include order is a valid complaint; the others truly are non-issues.

I mean, we have other more pressing issues. Such as the fact that garbage collection is still a thing in 2018.

> ...but nowadays a proper documentation generator gives you better...

Examples of that?

> I like seeing function-level documentation alongside interface and implementation, but if I put the documentation in both the .cpp and the .h file it's duplicated in two places and easily gets out of date.

This is the only thing that I dislike about header files. And it is a constant reminder how bad IDEs are at handling this.


Doxygen can generate a list of all class members, including those from all ancestors. If you want to do that by checking header files, you'll have to keep jumping between files, going from parent class to parent class.


... doxygen, no. I'd rather take header files than doxygen.

I hope he meant something else since he said "nowadays". Doxygen is ok if you haven't set up your build environment. But when you actually are in the code (and somewhat familiar with it), I very much prefer just reading the header files to switching to a browser.


> Header files absolutely are a hassle to keep up to date.

This happens because there is some redundancy between the header file and the source. You can often arrange your code so that there is no redundancy (or almost zero).


> Header files absolutely are a hassle to keep up to date.

How so? Every IDE I've used automatically changes the header if you change the cpp, and conversely.


>it really isn't any hassle to speak of to keep them up to date

Dear lord, all those broken builds I've seen over the years would love to talk to you about that.


> But still, no IDE or tool I've seen does this better.

Why aren't Delphi, Oberon, and Modula-3 immensely better? Each file is an independent module except the one with main. Then the compiler quickly figures out in one pass what all the function signatures are. The minor drawback with this is that the one-pass method requires you to pay attention to declaration order, though a two-pass method renders that approach irrelevant. The IDE has all the symbolic goodness you need, and no goddamn header files. Also it's still easier than dealing with managing the dependency graph yourself as is necessary with C file include order.

I say this as a guy who enjoys C and has done way more work in C than Delphi, but give credit where it's due.


Because there is no separation between API and implementation? Header files are great, you don't have to wade through implementation details or documentation to figure out what is API and what isn't, even more so with good use of pimpl. You hardly need separate documentation tools, because the header (when structured well) is just the signatures and documentation. Templates muddy this a bit, but you put those in .inl files and it all works conceptually very pure again.

Headers (or at least 'separate specification of API') are/is great, and I miss it dearly in other languages.


That is what module navigation tools are for.

Already in Turbo Pascal for MS-DOS there was a TPU dump tool, which would generate a "header file" from a TP module.


Sure, there are a bunch of tools for pretty much every language, but I don't want to deal with that, nor switch between browser and editor for it. I know people complain all the time about headers, I was just saying I love the separation, even if it's an historical accident.


I started with Apple ][ Basic, but then moved to Turbo Pascal 3.0, 4.0 and 5.0 and enjoyed the "Unit" system quite a lot. But then came the realization that it was not possible to do cyclic references - e.g. "Unit A" refers to symbols from "B", and vice versa... Not that it's a good idea, but "C" linkage is okay with this, and you can split things much more easily, thus probably ruining things in the long term... but sometimes it can help...

"Uses BGI;" and I'm done :)


Ah, someone did not read the Turbo Pascal manual! :)

Thankfully there is Bitsavers.

"Circular Unit References" chapter on the Turbo Pascal 5.0 programmer's user guide.

https://archive.org/details/bitsavers_borlandturVersion5.0Us...


Ha, I must've forgotten about it completely! Thank you! And yes, I've never read the manual (unlike the tons of "C" books that followed). Pascal was simply taught orally, and on your own (back then), and that was the case for quite a lot of apps....


I meant that an IDE could provide you with the overview that a header file does.

But they don't.

Also, a header file will hopefully be logically structured by a human. Something generated by the IDE will not.


The ones for languages with native support for modules do provide such overviews, usually named as something browser.


And all I've seen are quite inferior to the header file. They are a nice addition and can be used for quick navigation. But it does not diminish the usefulness of the header file.


"Header files" can be generated as well.

You could even make them part of the build, if you want a continuous up to date version.


As I said, a header file will hopefully be logically structured by a human. Something generated by the IDE will not.

That makes a huge difference.


Separation of module interface and module implementation is useful, but headers are a horrible way to do that. For an example of how it can be done right, look at Borland Pascal dialects (with "interface" and "implementation" section in each unit, where "interface" can be extracted if desired), or Ada, or ML.


> no IDE or tool I've seen does this better.

Modula-2 does it better. Each module is split into two files - definition and implementation. Unlike C++ header files, def files are compiled rather than inlined, so they don't have the potential to produce a different result for each source file that references them. The result is a 100% reliable dependency graph across all source files, no need for a makefile, and blazingly fast compilation (each def file is only parsed and compiled once).


OCaml also does it better. Its split between .ml (implementation) and .mli (definition) files is taken directly from Modula-2 [1].

[1] https://discuss.ocaml.org/t/what-is-the-reason-of-separation...


That has not been my experience in 35 years of dealing with headers in C and C++ in large-scale software projects. Ruby, Swift, and even Python (or Pascal if we want to go way back) are significantly easier to deal with for anything sophisticated.


> An IDE should be able to automatically generate something like a header file regardless of language.

https://www.hwaci.com/sw/mkhdr/


I quite like how Ada has "specification" files, which describes the methods (procedures, entries, functions), types and contracts for everything.



The draft for the merged version expected to be in C++20 is here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p110...


I use https://github.com/mjspncr/lzz3 - it generates header files and source files. It feels quite weird and nice. :)


If Java is any guide, developers would go out of their way to add interfaces everywhere anyway; often they're essentially header files for the OO gods.


I haven't done any C++ in a while, can you swap out the implementation backing a header?

I suppose at the linker layer maybe, but even that is just once globally for the program. Right? The OOP gods demand polymorphism!

If memory serves me, the C++ version is a class with all empty virtual functions.


Yeah, it can only be swapped out at link time (without crazy magic). That satisfies two very common cases though, when there is only one implementation (and the interface is a sacrifice to the OOP gods) and when the second implementation is for unit testing only.

Those 2 plus other crazy rules (like only one interface defined per file) mean that many projects will end up with just as many interface files as a C++ project would have header files.


The section describing how a function call is made appears to be slightly incorrect.

The return value of the `add` function, in most ABI definitions, would be stored in a register. After that, the `main` function may then copy that value to its own space it has reserved on the stack.

This is at odds with the description in the article, which seems to describe `add` passing its return value to `main` via the stack.

(This is assuming no optimizations - all this would most likely be inlined anyway, with no function call.)
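A sketch of that point, assuming the x86-64 System V calling convention and no optimization:

    int add(int a, int b) {
        return a + b;         // result is placed in the eax register before ret
    }

    int main() {
        int sum = add(1, 2);  // the caller then copies eax into sum's stack slot
        return sum;
    }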


I've been looking for something like this to send to some C++ newbies. This is almost what I need but not exactly. Is there something similar that explains how libraries work?


Libraries are just a bundle of .o files (e.g. run "ar t /usr/lib/x86_64-linux-gnu/libpython2.7.a" on an ubuntu with devtools installed and you'll see all the .o files that are in libpython2.7).

The special thing about them though is that the linker will not include any .o files for which no symbols are referenced.

You can think of the typical linker algorithm as follows:

If I am missing a symbol X, scan through all not-yet-used objects in each library until you find it, then include that .o file. Now check if any symbols are missing again and repeat until done.


> Now check if any symbols are missing again and repeat until done.

That properly describes linking using --start-group/--end-group. Without those flags, the process is closer to "look in the first .a file for definitions of all currently undefined symbols; then look in the next .a file for definitions of remaining undefined symbols; etc.". The difference becomes apparent if you link chains of libraries in the wrong order, or if you have cyclic dependencies between libraries; normally they will not be resolved unless you use the grouping flags. (But really, you should avoid making such cycles in the first place!)


Thanks for the correction. I was posting that from my phone without any references in front of me.


This is the first explanation I have understood of this perplexing phenomenon.


"Libraries are just a bundle of .o files"

Static libraries are more or less that, yes (at least, that mental model is sufficient for pretty much all day-to-day use of them), but dynamic libraries aren't, at all.

As an aside, I found it very weird that the OP claimed that 'nobody uses static linking'. Wut? Static linking is everywhere (so is dynamic linking, of course, but implying that one is merely a quaint remainder from the past is just so odd).


There has been a heavy bias towards shared objects in glibc and the GNU userspace utilities for a long time. Today that mostly means dynamic libraries (there are also statically linked shared objects, but those are mostly a quaint reminder of the past).

I don't hate dynamic linking as much as I used to, but it is still a peeve of mine. Linux is one of the most backwards compatible kernels, but the heavy usage of dynamic linking means that a linux program from 20 years ago either won't work at all, or will be buggy.

A statically linked program from 20 years ago will work (but possibly won't have sound; though it's possible to get either alsa or pulse audio to emulate OSS and then even sound will work).


> Libraries are just a bundle of .o files

That's a very simplistic and error-prone description of what a library is supposed to be. Even in high-level descriptions, it's very important to be aware of the fundamental differences between static and dynamic libraries, and how they are integrated in the build process, which has a fundamental role in basic tasks such as deploying applications.


There is a really good paper by Ulrich Drepper on dynamic linking with ELF and glibc (+ a bunch of other important stuff if you're writing libraries) [1], but it's way too low level for what GP wants. I'm sure there must be something simpler around but I haven't seen it.

[1] https://akkadia.org/drepper/dsohowto.pdf


Presumably what you say makes sense to people that already know what you're saying, but to me, reading

> a bundle of .o files [note ".o"]

> e.g. [...] libpython2.7.a [note ".a"]

adds, rather than resolves, confusion.


I was hoping that, in the context of discussing this article, it would be clear what a .o file is. Discussing libraries without that knowledge is hard, so I just have to assume the person asking the question knows it.

.a is an extension for one type of library, sorry if that confused you, but it can be ignored for the purposes of this explanation.


And as we know this has a side effect on C++ with globally constructed (but unreferenced) objects. These essentially get "dropped" when linking if they come from a library.


Not only that, but add any undefined symbols from that .o file to the list of symbols I need to find.


That's a good point. I think I could add a section that quickly describes how linking works and gives a few examples using ldd.

I'm busy for the next few days but this is near the top of my TODO.

Thanks for the feedback :-)


> Is there something similar that explains how libraries work?

Check out this HN discussion on an ebook on linking and loading. The book itself is a treat, and the discussion around the book is very informative as well.

https://news.ycombinator.com/item?id=18424233


"Basically, the compiler has a state which can be modified by these directives. Since every .c file is treated independently, every .c file that is being compiled has its own state. The headers that are included modify that file's state. The pre-processor works at a string level and replaces the tags in the source file by the result of basic functions based on the state of the compiler."

I almost sort of get what the author means here, but then I don't really. I mean, there is no 'state' for the compiler that is modified by preprocessor directives, so this is probably an analogy or simplification he's making here, but I don't really understand how he gets to the mental image of 'compiler state'. Why not just say it like it is: the preprocessor generates long-assed .i (or whatever) 'files' in memory before the actual compiler compiles them, the content of which can be different between compilation units, because preprocessor preconditions might vary between compilation units?


It's neither an analogy nor a simplification. When he speaks about "compiler state" he means "preprocessor symbol table state". When the preprocessor processes a file, its state is mutated -- symbols get defined, redefined or undefined.

What you propose as a replacement (an in-memory file) does not provide any insight into why the same file preprocessed twice may end up looking different or why order of included files matter.


Well in that case, I guess it's a definition thing. When I teach C++, I find it much more useful to make a clear separation between 'preprocessor' and 'compiler', and not make the preprocessor part of the compiler and then make the... uh... 'actual compiler' also part of the compiler.

When you take the preprocessed state of a compilation unit, by having the preprocessor write it out to disk, and show someone what the effects are of passing one or the other -D flag, or changing the order of includes - that directly and concretely shows what is going on. And then this preprocessed file is passed on to the actual compiler. There is a clear separation between stages, easy to understand, and useful to boot when the time comes that you have to debug an issue related to it and want to look at the preprocessed file to see what's going on.


> I find it much more useful to make a clear separation between 'preprocessor' and 'compiler'

Yes, I absolutely agree. And to be honest I cannot imagine explaining how the preprocessor works without describing it as a separate entity.


Bit of a side-question, but somewhat related. Is anyone working on "whole program compilation"? I don't mean whole program optimisation, I mean an attempt to read all files for a given target into memory at the same time and then generate all translation units in one go (all in memory? and maybe linking them in memory too?). Clearly, there would be caveats (strange header inclusion techniques relying on macros to modify the text of include files would break, gigantic use of memory, and so forth), but for those willing to take the risk, presumably this should result in faster builds, right?

In fact, ideally you'd even generate all binaries for a project in one go but that may be taking it a step too far :-)

At any rate, I searched google scholar for any experience reports on this and found nothing. It must have a more technical name I guess...


The terms you're looking for are "single compilation unit" or "unity build".

It's used sometimes, I think mostly to help the compiler optimise better.

Build times for a full rebuild may be faster, but may not, since traditional builds can use many CPU cores. However, it stops incremental builds from working - if you modify one source file, you have to recompile everything.


Or a compile server / incremental compilation. Tom Tromey worked on supporting this in GCC years ago, and blogged about the roadblocks that he met on the way. I don't remember the details, but eventually the project was abandoned.

It might still be interesting to read through this stuff--throw "tromey gcc compile server" at a search engine and see what comes up.


We do "kind of that": Our source is split across various cpp files to keep it well organized. Our build scripts then generates a single cpp file with ~30 #includes.

Compilation of the 11 MB object file takes about 30s and requires at most 1 GB of memory (the lib is not that huge, 20kloc), but I think when we started on it, the build took a few minutes for the lib alone. So we save some dev time, but building all unit test executables still takes an additional 1m30s, so that's only a minor improvement. But I think the real gain is much better optimization (the architecture of the lib is great to maintain, and bugs are at least critical or might even have a lethal impact; there is a lot of potential for inlining/LTO).


I don’t think that article’s accurate. At least not anymore. Modern C++ compilers do less while compiling, and much more while linking. This allows them to inline more stuff and apply some other optimizations.

VC++ calls that thing “Link-time Code Generation”: https://docs.microsoft.com/en-us/cpp/build/reference/ltcg-li...

LLVM calls it “Link Time Optimization”, pretty similar: http://llvm.org/docs/LinkTimeOptimization.html


LTO is still an opt-in thing though. I suspect that most projects still don't use it.


I just watched Matt Godbolt's recent talk about the linking process[1]. It's a pretty good talk.

[1] https://www.youtube.com/watch?v=dOfucXtyEsU


I hope someday the build and linking process could be standardized, but I don't believe it will happen, because many members of the committee come from Microsoft, Google and other tech giants who want to sell their compilers (or give them away for free, but still).

There are too many interests, and standardization would kill many of them.



