Are Headers in C++ the Problem? (buckaroo.pm)





C++'s refusal to maintain compatibility with C while scrupulously maintaining compatibility with some C practices is the problem.

They modified the type system so that good, idiomatic C code won't compile, but they refuse to make writing unsafe code any safer in any meaningful way.

They added new syntax and reserved words, again breaking C compatibility, but they refuse to clean up C syntax even to the extent of removing the bizarre octal notation which makes 0100 do something deeply, deeply unexpected to anyone who doesn't know some obscure bits of lore.

Finally, they moved a bunch of real complexity to the compile phase, something C largely doesn't do, but they refuse to replace the file-based system they inherited from C with anything which would make C++ compiles even a little bit faster.


Your stance reminds me of this post:

C++ Modules - a chance to clean up the language? https://www.reddit.com/r/cpp/comments/agcw7d/c_modules_a_cha...

In short, C++ is a company-driven language instead of a community-driven one like Python and Rust. Some C practices are maintained because voters from the committee agree that these practices are necessary (for their code) and it's impossible to make a completely backward-incompatible change when they have >1M LOC code base.


How many 40-year-old projects have we maintained that have broken _some_ backwards compatibility? C++ is a triumph in engineering rivalled by something like a megacity transportation system - it's massive, some of it is dark and scary, but it works really, really well.

The C++ team is up against huge corporations that actively lobby for and against any of their changes - and to top it all, they do it for free. Some obscure C feature is bound to get trampled on.


All of them, really. Both C and C++ have made technically breaking changes. This isn’t to denigrate them, of course! The reality is that “no breakage” is literally impossible. It’s a spectrum, and it’s about the experience.

"No breakage" is possible in a Lisp dialect.

With each rev of the dialect, put all new functions and variables into a new namespace like "std2019:...".

Old code that doesn't know about new features doesn't run into them at all.

Keep the old namespaces untouched: old code finds everything it expects.

There is no bullshit like reserved keywords: all symbols are namespaced in packages. New syntax is just macros named by symbols.

The only kind of breakage left is system requirements (new tools don't target old hardware well due to larger footprints; new dev environments don't run on old dev hardware; implementations drop support for unpopular platforms).
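For comparison, C++ has a limited analogue of this idea in inline namespaces, though it only covers library-level names, not keywords or syntax. A minimal sketch (all names here are hypothetical):

  namespace lib {
      namespace v2018 {
          int frobnicate(int x) { return x + 1; }   // old behaviour, kept intact
      }
      inline namespace v2019 {
          int frobnicate(int x) { return x * 2; }   // new behaviour, the default
      }
  }

  int main() {
      int a = lib::frobnicate(3);          // resolves to v2019: 6
      int b = lib::v2018::frobnicate(3);   // old code can pin the old version: 4
      return a + b;
  }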


That's only if you make superficial changes, and is already possible in non-Lisp languages.

What about changing core semantics though?


Such as what? Say, change strict evaluation to lazy?

Core semantics can be wrapped in an operator that is in some namespace. We develop (lazy ...) and everything in it is lazily evaled.

File-level annotations are possible. Emacs Lisp introduced lexical scope instead of dynamic scope as a file-level option.


In Symbolics Genera stuff was originally implemented in ZetaLisp with Flavors. Then there is CLtL1, Symbolics Common Lisp (a rather large version of Common Lisp), ANSI CL, ...

There is quite a non-trivial difference between ZetaLisp and Symbolics Common Lisp: both are very large Lisp dialects. ZetaLisp is dynamically scoped, has base 8 integers, the old Flavors system is widely used, Fexprs, ...

All these Lisp dialects are provided in the same system, and one switches the reader/printer and the packages when changing the language: one can set the language context in a listener and also per file.


> This isn’t to denigrate them, of course!

I totally understand why you included this, but it's a shame that you felt it to be necessary.

Refusing to make any breaking changes is how we end up with... well, PHP. There's nothing inherently wrong with breaking changes. Too many of them can annoy developers, sure. And sometimes it has to hurt a little bit so progress can be made (see Python). But making little to no breaking changes ever will eventually cause problems and/or headaches.

That said, I wonder how many people moved to (for example) Rust simply because C++ moves slowly. Probably not that many, but surely it's nonzero.


> C++ is a triumph in engineering

I guffawed and then threw up in my mouth a little bit at this. C++ is riddled with dozens of horrible design flaws and kludges that make me think your definition of engineering is just plain wonky. Build time is one. Because of an inherent O(n^2) build complexity (every translation unit re-parses its transitive includes), many large C++ codebases require massive, massive compile farms to achieve even reasonable turnaround times. They produce huge build artifacts that take practically forever to link and are still only barely debuggable. E.g. a recent audit of the V8 build time revealed that it compiles at an effective overall throughput of 186 lines of code per second. How do we deal with that? Distributed build systems and 50+ core workstations.

This is a travesty of engineering.


The feat of engineering is that these problems are gradually being _solved_ or worked around while maintaining backwards compatibility. For example, C++20 will have modules, which means you won't need to include everything again and again in each translation unit.
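To make that concrete, here's roughly what the C++20 version looks like; two files shown together, module and function names illustrative:

  // math.cppm -- a module interface unit: compiled once, then imported
  // as a prebuilt artifact instead of being re-parsed as text in each TU
  export module math;

  export int add(int a, int b) { return a + b; }

  // main.cpp
  import math;

  int main() { return add(2, 2); }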

And still... there are many reasons why you'd choose this language over others. They're probably doing _something_ right.

I think there's an underlying concept here.

C is three things:

1. A mid-level language, by which I mean a language where you do all of the mechanics but you abstract most of the machine-specific details. Memory management is manual, but you don't know or care about how malloc() works behind the scenes, and you can't even ask. You get integers with defined semantics up to things like overflow/wraparound, but you don't know or care if they're implemented in terms of multiple machine words or even if there's such a thing as a carry flag. It's midway between a macro assembler and, say, Python, which does nontrivial magic behind the scenes.

2. An unsafe language, with nontrivial undefined behavior (the overflow/wraparound stuff I mentioned above) and no guide rails like the ones Java provides, where the result of overflow is fully specified and out-of-bounds access raises an error. In C, very little checking code of that type is emitted or specified.

3. A systems programming language, where programmers do unsafe things, like writing values to DMA registers, things the language can't abstract away because... well... you have to write an OS kernel in something, after all. This is unsafe by design, as opposed to the above, where things are unsafe because of a safety/speed trade-off.

Rust proponents want to separate 2 from 1 and 3, and make a language where you can do things by hand and do unsafe things on purpose, but the language has more guide rails to prevent you from doing unsafe crap by accident.

C++ apparently wants to separate 1 from 2 and 3, to move more high-level and get more "language magic" (templates, iteration stuff... ) without making the language any safer in any respect. That's just an uncomfortable position for a language to be in.

My point is, it could move Rust-ward and high-level-ward if it ditched some C-isms from the language... but you gave a good explanation of why it won't.


> Rust proponents want to separate 2 from 1 and 3, and make a language where you can do things by hand and do unsafe things on purpose, but the language has more guide rails to prevent you from doing unsafe crap by accident.

The first systems programming language to introduce this concept was ESPOL, created in 1961, already with the notion of unsafe code blocks.

Even better, according to surviving manuals, binaries with unsafe blocks were tainted and required enabling execution by the admin user.


The interest in making languages safer, rather than in making code written in them safer, seems to have developed into a bigger factor recently than it was in the 90s. I can’t personally think of any not-Rust languages that prioritize operating safety over ease of coding (compared to Rust). Perl lets you rewrite the global interpreter at runtime, PHP enables decades-old code to continue running, Apple Basic offered PEEK and POKE for writing arbitrary bytes to anywhere addressable by the memory controller. I’m sure some exist, but my point is that we have been favoring expanded capabilities for decades and only now are we starting to favor expanded safeties. When taken in that light, it makes perfect sense that C++ has always wanted to expand the capabilities of C without caring about safeties. What else could have resulted from that history?

In some sense. I’m not sure when “langsec” started, but it’s certainly older than rust.

Langsec as in "the root of (many) vulnerabilities is bad input handling, leading to parse tree differentials" started in ~2007 with Meredith Patterson's Dejector. This research led to using actual parsing frameworks everywhere and, in fact, directly inspired nom. However, this doesn't have much to do with the languages themselves being safe. That research is very old; Ada was an early attempt to tighten-down type systems for safer code, and I'm sure that even that had precedent.

Yeah, it's certainly bigger than just that idea; it was just the first thing that came to mind as something older than Rust.

While it is certainly true that Langsec has been around since we invented programming, Rust is the only viable language I’ve encountered in practical use anywhere that prioritizes langsec over ease-of-delivery. The focus of C++ on features rather than safeties is echoed by almost every language invented prior to Rust.

(I don’t know why Rust was the agent of change in that priority, but I’m very glad for it!)

EDIT: COBOL is good at this, or tries to be. Lots of loopholes but it’s clear from this article anyways that they tried to make it a safe language. But they prioritized features like “call arbitrary C functions” over safety, and you can just write unsafe code without even a hint that you’re doomed. That’s perhaps the essence of what I see as the difference: Rust forces you to declare your unsafe intentions to do unsafe things.

https://www.kiuwan.com/blog/security-business-oriented-langu...


> (I don’t know why Rust was the agent of change in that priority, but I’m very glad for it!)

Mozilla did an analysis of security bugs in Firefox and many of them were memory safety issues. That’s part of why they sponsored it, at least.


Strange claim, have you ever heard about Ada?

Yes.

> C++ apparently wants to separate 1 from 2 and 3, to move more high-level and get more "language magic" (templates, iteration stuff... ) without making the language any safer in any respect.

How so? Take std::unique_ptr for example, which exists since C++11. It facilitates using the language in a safer manner, and at a higher level of abstraction (you no longer have to manually malloc and free/new and delete - you just have the concept of scoped ownership), while at the same time not adding any “behind the scenes” magic (e.g. garbage collection) so as not to leave room for any high-level language to be more performant than it is - that’s the real motto of C++, to my understanding.
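A minimal before/after sketch of that scoped-ownership point:

  #include <memory>

  struct Widget { int value = 0; };

  void old_style() {
      Widget* w = new Widget;   // manual ownership: every exit path
      w->value = 42;            // must remember to call delete
      delete w;
  }

  void modern_style() {
      auto w = std::make_unique<Widget>();  // ownership tied to scope;
      w->value = 42;                        // destroyed automatically on return
  }   // no hidden runtime machinery: the destructor call is emitted statically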


> while at the same time not adding any “behind the scenes” magic (e.g. garbage collection)

Stuff the compiler does statically is still magic. Anything which makes it more difficult to see the assembly language through the code is magic.


It seems odd that this comment, of all of the ones I've made, is getting downvoted.

Did I break a taboo? Am I not supposed to notice that the + operator in C++ programs can represent vastly different amounts of CPU work and memory usage in different contexts? Or that RAII does substantial amounts of work which is hidden from the source code?

Seems like magic is magic regardless of whether it happens at compile time or runtime.


What made this incredibly clear to me was when await/yield was renamed co_await and co_yield because some extreme minority of companies (farming? from some statement I read somewhere 'yield' was the problem) would have to change variable names. It's like it gets a chance to be cleaner, but is then quickly hammered back into ugly syntax that reminds you that you are writing C++.

Almost any actual potential leaps in improvement get watered down and neutered way before they have a chance to get near the language.


> farming? from some statement I read somewhere 'yield' was the problem

bond yields in finance


> In short, C++ is a company-driven language instead of a community-driven one like Python and Rust. Some C practices are maintained because voters from the committee agree that these practices are necessary (for their code) and it's impossible to make a completely backward-incompatible change when they have >1M LOC code base.

Well the community has more than 1 MLoC of code and is quite reluctant to break compatibility: look at how hard it has been to get Python to really migrate from v2 to v3.

If anything c++ is more willing to make breaking changes (e.g. abandoning the terrible auto_ptr) because of the better tooling (willingness of compiler vendors to put in multi-standard support and appropriate warning flags).


> If anything c++ is more willing to make breaking changes

Slept through Python 2 vs 3.


Well you're one of the few. So many packages are still on 2.7, and Dropbox -- even with Guido working there! -- took three years to move to Python 3, just finishing last autumn.

> They modified the type system so that good, idiomatic C code won't compile.

As someone with 30 years of C experience, maintaining a 60,000 LOC code base that compiles as both C and C++, I don't agree, not even slightly.

Each one of the things diagnosed in C++ but not in C is a good idea.

> bizarre octal notation which makes 0100

That would be a serious incompatibility. POSIX code like chmod(file, 0644) stops working. Under no circumstances would it be acceptable to just treat 0644 like 644; it would have to be diagnosed.

Octal notation is so needed here that, if it were removed, programmers would resort to macros like OCT(6,4,4).

What's strange in C is octal character constants, which must be exactly one, two or three octal digits. So \0000 is \000 (same as \0) followed by the literal character 0.

There is no leading zero; there are only octal constants: \123 is octal, not a hundred twenty-three.
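A small demonstration of all three points, including the OCT macro one might resort to:

  // the chmod-style literal, and the leading-zero footgun
  static_assert(0644 == 420, "6*64 + 4*8 + 4");
  static_assert(0100 == 64, "a leading zero means octal");

  // the macro programmers would fall back on if octal literals vanished
  #define OCT(a, b, c) ((a) * 64 + (b) * 8 + (c))
  static_assert(OCT(6, 4, 4) == 0644, "");

  // "\0000" is the three-digit escape \000 followed by the character '0',
  // so the array holds { '\0', '0', '\0' }
  constexpr char s[] = "\0000";
  static_assert(sizeof s == 3, "");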


Rust uses 0o for octal literals, so that you can still have them, but you don't run into the footguns with just a leading 0. It adds a nice symmetry with 0x and 0b (for hex and binary, respectively.)

What does 0100 actually do? Is it not supposed to be 64 in base 10?

0 prefix octal notation is a mistake. It bites people in the strangest places, like when they decide to make the IP addresses line up in a file by changing:

  192.168.1.10
  192.168.130.22
to:

  192.168.001.010
  192.168.130.022
What they thought would be a cosmetic change breaks their application. I've seen this happen several times. The new replacement for inet_addr (inet_pton) removes this braindamage, but it is not available on Windows.

I always thought I had to quote my IP addresses whenever I do socket programming? Is that not the case? Or does one of the parsing routines in C do something with the 0? I could understand it if I was using commas instead of dots, and passing the parts to a function, but I'm not, I'm quoting them as a string.

You pass it in as a string, and the standard library does strtol() on each component. The problem is that strtol() is "smart" and tries to autodetect the base of the input. So you could specify your IP address as 0xc0.0xa8.0xa.0x1 if you wanted, but who the hell does that?
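That "smart" behaviour is strtol's base-0 mode, which infers the base from the prefix. (Whether a given libc's inet_addr literally calls strtol this way is an implementation detail, but POSIX does allow octal and hex components.) A quick demo:

  #include <cassert>
  #include <cstdlib>

  int main() {
      // base 0 tells strtol to infer the base from the prefix
      assert(std::strtol("010", nullptr, 0) == 8);    // leading 0 -> octal
      assert(std::strtol("0x1f", nullptr, 0) == 31);  // 0x -> hex
      assert(std::strtol("10", nullptr, 0) == 10);    // plain decimal
      // forcing base 10 avoids the surprise
      assert(std::strtol("010", nullptr, 10) == 10);
  }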

It does make a value of 64 decimal. This is really surprising to a lot of people. It's really nice when you're dealing with unix permissions, and a potential footgun elsewhere.

Same with Javascript :(

0100 === 64


Even more interesting:

JavaScript:

077 == 63

078 == 78

The C compilers at least don't accept 078.


Wow if there is any digit above 7, it reverts back to decimal? Makes sense /s

A potentially better syntax would be 0o100.

> 0o100

This is some IOCCC level stuff, not distinguishable in all fonts.

I'd rather borrow a couple of ideas from Verilog, including the cosmetic underscore and the ability to represent binary, e.g. 8'b0101_1100


Who reads programs in a font where '0' and 'o' are indistinguishable side by side?

It's unclear when skimming... same as 7, l, i, 1, and I

Also, seriously, O and 0 are commonly mistaken in passwords.


You could even support binary that way:

0b10101101

If you standardize this, the leading zero becomes somewhat unnecessary. But of course it's probably too late for C. You would have to deprecate the old notation for decades before it could be removed.
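For what it's worth, C++14 already standardized both of these - binary literals and digit separators (with ' rather than Verilog's _) - while keeping the leading-zero octal form for compatibility:

  static_assert(0b1010'1101 == 0xAD, "");
  static_assert(0b10101101 == 173, "");
  static_assert(0100 == 64, "");  // the old notation, still with us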


Deprecating is as simple as emitting a diagnostic whenever it occurs, rate-limited to once per translation unit, say.

Yes, but it also means fixing all of that old code before you can turn the deprecation into full-on removal.

It would be interesting to note that such warnings would undoubtedly point out a few errors in existing code where people zero-padded a number they expected to be decimal. Many would be harmless, 00001 for example, but a few could have been causing unintended side effects (incorrect initialization values, for example).


(As I mentioned above, Rust does exactly this; 0o for octal, 0b for binary.)

> good, idiomatic C code

Well which is it?


The problem may not be headers themselves, but that there are so many ways to split the compilation and build artefacts in a project.

Java and (Turbo/Free) Pascal try to have few ways of dividing the materials: Java has class files (and Jar files, and modules), Turbo Pascal has Units. Both involve having a certain namespace correspond to some file the compiler can find and compile, and where you then use the definition from the compiled file.

Python and JavaScript/TypeScript have modules as namespaces that correspond to one source file - if you import a symbol from a module "bar/foo", the compiler will look in "bar/foo.js" (or a dozen places in node_modules, or in a dozen eggs), and if foo.js doesn't have that symbol you know something is wrong instead of wondering if there are other files that contribute to that namespace.

Lisp and C/C++ have a tradition that goes back far enough that compilation units and namespaces (packages in Common Lisp) are disjoint things. C++ is on the path to making this more complicated with the addition of modules, which create another partitioning of the compiled artifacts without forcing the others (headers, namespaces, source files) to agree with it.

So, headers are not the problem - it's that for a given entity in the source code (namespace, class, function, whatever) it's not automatically clear to the compiler where to look for it. Which effectively leads to the funky ball of dependencies between source files being variously manually specified in (auto-, C-)make files as well as extracted automatically and still being error-prone.


I think you hit on what I've felt about learning Python then trying to learn c++.

In python, all code paths are "statically discoverable". Meaning if a symbol exists in a file, I can find where that symbol comes from in that file. And if it's an import, the import tells me where to look for it in another library or relative path.

In c++ when I was trying to learn it was a lot of, "okay so why is 'foo' available here?" "Because it's imported by the 'bar' library that you're importing at the top of the header."


> In python, all code paths are "statically discoverable".

That's not exactly true. modules can monkey-patch themselves and other modules; they can override the import mechanism to do all sorts of things. It's usually considered bad form, but .. it's possible, and some people like it that way.


Yes you're right. But in the first five years of learning I've never run into that conventionally. Where I have, it was made painfully obvious by comments.

Practically speaking, when it comes to learning by reading code, it's invaluable.


The problem with headers (in C++) isn't #include. It's the bleed-through of macros (and pragmas, I guess) from headers into subsequent headers. Without that, precompiled headers would be an obvious win always and every time.

Macros are side effects and there's really no sane way to constrain them (currently), but C++ modules are an attempt to do so.
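A tiny illustration of that bleed-through (file and macro names hypothetical):

  // vendor.h -- leaks a function-like macro into every later header
  #define check(cond) ((cond) ? 0 : -1)

  // app.h -- included after vendor.h in the same translation unit
  class Validator {
  public:
      int check(int x) const;  // the preprocessor rewrites this line before
                               // the compiler sees it: a baffling syntax error
  };

A precompiled app.h can't be blindly reused, because its meaning depends on whatever macros earlier includes happened to define.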


In my experience, precompiled headers (as implemented by g++ at least) don't help as much as one might hope. I suspect that they only cache the parsed AST, as opposed to generated code. I wouldn't be surprised to find that generating optimized code is a lot more computationally intensive than parsing.

For complex projects, precompiled headers are massive, and sometimes slow you down!

In several big C++ source bases that I've worked with, we reversed the process - at compile time, concatenate all the source files together, include everything you need just once. It's massively faster than precompiled headers, at the cost of sometimes running out of compiler heap, and inability to parallelize builds. The loss of parallelism was offset by the much faster compilation, though.
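A minimal sketch of that concatenation ("unity build") approach, with hypothetical file names:

  // unity.cpp -- the only file handed to the compiler; every header
  // below is parsed once for the whole program instead of once per TU
  #include "common.h"

  #include "parser.cpp"
  #include "codegen.cpp"
  #include "runtime.cpp"
  #include "main.cpp"

One extra hazard: file-local (static) names from different .cpp files now share a single translation unit and can collide.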


I think that's a given. Generated code isn't even... generated by including a header. (I'm assuming you're thinking of templates.)

You need use-site "caching" to be able to reuse template parsing beyond the parsing step. I think linkers do a little bit of this, but that's after compile time.

Even just caching the parse will probably be a little win in the case of complex C++ headers (esp. system headers with lots of platform defines, etc), but it's unlikely to be a panacea for compile times.

I think the interesting part is what kind of tech can be built on top of the "headers are isolated" parts of the C++ modules proposal. The compiler people are nothing if not inventive!



But this problem is gradually going away. That is, with every new language standard version, the use-cases for macros become more and more limited.

Anyway, yeah, modules should sidestep this challenge.


Fixing this is one of the killer features of the "Blaze" family of build-systems (Buck, Bazel, Pants, Please), since they do not defer the actual build execution to another program (such as Make or Ninja).

Response from a CMake dev:

> Well, by the time CMake could discover -MM flags, the build has already been written and CMake (the program) is out of the picture. Linking to a CMake target is also not just "add this library to your link line" either, so a simple response file written somewhere during the build for the linker to use is not sufficient (nevermind that this file may be updated by any TU compilation rule in a library target, something build tools tend not to like too much). I guess combination configure/generate build tools can do this, but CMake is a build generator and does not execute the build at all.

https://www.reddit.com/r/cpp/comments/auyl07/are_headers_rea...


...Is it? Maybe blaze/bazel changed a lot in the past several years, but I remember it didn't even auto-import headers. You have to manually specify every header file in your dependency list. (To be precise, for every header you include, your dependency must include some BUILD target that contains that header.)

So you don't ever get "undefined symbol"... you instead get "cannot include header"! Not sure if that's an improvement. :/


The header lists are specified either as a list of directories (Bazel) or using globs (Buck). You must have been using quite an old version!

Yes, it is a big improvement. The error message tells you exactly what you failed to do. If a target does not export all of its headers correctly, then fixing that fixes it for everyone consuming that target, which scales well to large code-bases.

It is not a coincidence that so many companies (Google, Facebook, Amazon, Twitter, Thought Machine) have converged on this design.


> If you include header X then you must also link to the library target that X belongs to

Visual Studio's solution: #pragma comment(lib, "foo"). That's that problem solved. This is not the big problem.

The big problems are a mix of 3 entangled problems:

- textual substitution by macros with cross-file scope: the behaviour of an included file depends entirely on previous files.

- object memory layout and API are unnecessarily bound together, so you can't change so much as a single function argument without completely recompiling all dependent code.

- a number of things (mostly templates but some strings) are duplicated hugely because they're defined in the headers, and then de-duplicated at link time. So compilation ends up disk-bound while the compiler writes out lots of data that will be thrown away.

So, you can change one byte in a header high up the dependency chain and recompile most of your code. Slowly.


> Visual Studio's solution: #pragma comment(lib, "foo"). That's that problem solved.

Not even close to that easy; platform differences make it a pain to find libraries consistently. (Windows tends to make it most difficult, with Apple close behind for a few libraries.)


Yeah, that gets specified "outside the language" in the compiler arguments for link path and include path.

> - object memory layout and API are unnecessarily bound together, so you can't change so much as a single function argument without completely recompiling all dependent code.

These can readily be put in different headers. People often don't, but...
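The canonical technique for that decoupling is the pimpl idiom; a minimal sketch:

  // widget.h -- stable API header: no data members leak through, so layout
  // changes in the implementation don't force dependents to recompile
  #include <memory>

  class Widget {
  public:
      Widget();
      ~Widget();   // defined out of line, where Impl is complete
      void draw();
  private:
      struct Impl;                  // defined only in widget.cpp
      std::unique_ptr<Impl> impl_;  // layout changes stay behind the pointer
  };

  // widget.cpp
  struct Widget::Impl { int x = 0, y = 0; };  // free to change at will
  Widget::Widget() : impl_(std::make_unique<Impl>()) {}
  Widget::~Widget() = default;
  void Widget::draw() { /* uses impl_->x, impl_->y */ }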


Patient: "Doctor Doctor, it hurts when I do this" Doctor: "Don't do that"

This is arguing that there is a solution to manage a problem, and that because there are (caveated, external, third party) ways to manage it, there isn't a problem. Yes, there are ways to mitigate some of the problems of headers, but all the bookkeeping is part of the criticism.

And there's still duplicate definitions in separate places to keep in sync, you end up having to move all your code into headers if you want to make things generic (with compile time repercussions), and it's easy to mismatch header versions with the libraries for the same file.

Linking to the libX library for X.h is the least problematic of the header problems/criticisms.


IMO what is needed is a preprocessor with reference semantics that are compatible with the modern notion of modules. I.e. a preprocessor that considers a reference defined when it's defined in a module included by this one, instead of "before this point in the translation unit." That way it's possible to include a list of defined symbols in a precompiled header, and reuse them when including that module from some other module.

I think headers aren't "the" problem; I kind of want to say they are "a" problem, but they aren't really even that. I think I'll go with: they can be really annoying. They are quite a powerful mechanism, but can be difficult to understand when you have complex header hierarchies (as often found in embedded systems), and they introduce a bit more friction when coding.

I think they are a good fit for C and C++. Headers allow more fine-grained control of module inter-dependency and are great for targeting different processor architectures. Java, C#, etc. don't have to deal with that to the same degree so there's no direct comparison.

Java has the Foo/IFoo pattern. It's not a direct equivalence, but if you want to separate interface from implementation, you have that option.

Except the Foo/IFoo pattern is more related to OOP than it is about separating interface from implementation. You wouldn't create Foo/IFoo classes if you knew for a fact that Foo will be the only implementation of IFoo.

> Java has the Foo/IFoo pattern.

Are you thinking of C#?


It may be applicable there too; I just don't have any direct C# experience.

What? There's no real difference except spelling (aka syntax and convention).

Well there are plenty of header-only libraries that put their entire code in headers and you do not need to link to anything. You can do this in C too but it's much more popular in C++ with templates. I'm surprised the author doesn't mention this.
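What makes header-only libraries legal is inline linkage (templates get the same treatment implicitly): each translation unit compiles its own copy of every definition and the linker deduplicates them - exactly the duplicated work complained about elsewhere in this thread. A hypothetical example:

  // tinylib.h -- a header-only "library"
  #pragma once

  // 'inline' permits a definition in every TU that includes this header
  inline int clamp01(int x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }

  // templates may as well live in headers, since instantiation
  // requires the full definition to be visible
  template <typename T>
  T square(T v) { return v * v; }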

This blows up compilation times, so not really an option for larger projects.

>Headers are not necessarily evil.

Haven't demonstrated that headers are not bad. Just that there's an involved, error-prone, half-manual workaround to "automate" their discovery and use.


Idk; if you inline some functions in the header and then rely on the compiler to automatically pull the right version in, you could be in for some hard-to-find bugs.

Aren't modules supposed to improve the situation?

> In summary: Headers are not necessarily evil.

The headline by OP seems like clickbait.


Going to invoke Betteridge's Law here and say no.

I don't know if they're the problem, but they're a problem.


