Initialization in C++ is Seriously Bonkers (mikelui.io)
166 points by signa11 74 days ago | 126 comments



> [Why is a global variable zero?] Because i has static storage duration, it’s initialized to unsigned zero. Why, you ask? Because the standard says so.

No, it's because i lives in an area allocated at the OS level, and the OS has to have initialized that memory with something (otherwise you'd have a security bug). The .bss segment has been zeroed, for obvious reasons, since the advent of modern linkage.

The explanation is backwards. The reason that stack variables are UNinitialized is (contra the article which thinks it's because the programmer didn't put an initializer in the source code) that memory is on the stack, which is allocated internal to your program and in practice was used previously by some other call for some other purpose.

The fact that stack variables are uninitialized by default is actually an intentional feature, not a bug, as it correctly expresses the behavior of the runtime environment. Now, it may not be a good feature, and modern language designs have omitted it (by, let's remember, relying on much better modern compilers to elide all the extra code that would otherwise have been needed to zero out a stack frame!).

But it's got nothing to do with the syntax of the language. You can rely on .bss being zero in assembly, and similarly be surprised that stack memory is going to have junk in it.


> The reason that stack variables are UNinitialized is (contra the article which thinks it's because the programmer didn't put an initializer in the source code) that memory is on the stack, which is allocated internal to your program and in practice was used previously by some other call for some other purpose.

It is because the programmer didn't put an initializer into the source code. That's how the language is defined. What you are discussing is the underlying reason the language is defined that way. You are speaking past the author, not pointing out a mistake.

The blog post does contain a genuine error relating to this point, though: "Any C programmer worth anything knows that this initializes i to an indeterminate value." This is wrong. Reading the variable does not simply give you an indeterminate value, it gives undefined behaviour (something the article never mentions, surprisingly).

> that stack variables are uninitialized by default is actually an intentional feature, not a bug, as it correctly expresses the behavior of the runtime environment

But it doesn't. Reading an uninitialized local variable gives undefined behaviour, which might not correspond to the behaviour of using any particular garbage value.

Anyway, the language is defined that way simply for performance reasons. The risk of undefined behaviour is a considerable downside but was deemed acceptable.

> But it's got nothing to do with the syntax of the language.

We're talking about the semantics of C. If C were defined such that static integer variables be initialized with 1 rather than 0, that's what would happen, no?


You're not wrong, but I think this is a case where knowing the reason for the rules is easier than remembering the actual rules. This is just from personal experience, but for me, remembering what is and isn't initialized in C++ seemed completely arbitrary and insane until I learned more about linking and process startup. YMMV of course.


I agree there's a lot of value in understanding hardware/OSs/etc, but it's no substitute for having a solid understanding of C in general. C has much to say on undefined behaviour, and there are no shortcuts - you just have to have a good grasp of the language's dark corners.

(Aside: I believe ++i++; is no longer UB in the latest version of the language, but used to be.)

On the particular topic of initialization in C/C++, I go with the rule of thumb of being as explicit as possible.

Beyond that, I'm not afraid to assign a marker value to a local only to overwrite it soon afterwards. I save myself endless trouble in case my code is buggy, and if it's not, the compiler is likely to elide the first assignment entirely.
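A minimal sketch of that habit, assuming nothing beyond standard C++. The function name `sign` and the marker value `-999` are made up for illustration. Every branch overwrites the marker, so an optimizing compiler is free to drop the first store; if a branch were ever forgotten, the function would return the marker instead of reading an uninitialized variable.

```cpp
// Returns 1, -1 or 0 depending on the sign of x.
int sign(int x) {
    int result = -999;   // marker value: should never survive to the return
    if (x > 0) {
        result = 1;
    } else if (x < 0) {
        result = -1;
    } else {
        result = 0;
    }
    return result;       // every path above has overwritten the marker
}
```

Calling `sign(5)`, `sign(-3)` and `sign(0)` exercises all three branches.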


But if you remember the underlying reason you might think that using uninitialized variables just gives you an arbitrary value. That's not true, it's UB. The compiler can do whatever it wants if it can prove that you read an uninitialized variable.


Do compilers really try to prove UB so they can use it as an excuse to silently murder kittens? That sounds so counterproductive. If it can prove UB, it should just pop an error and stop.


What's actually happening is a bit more subtle: The compiler is allowed to assume that the programmer in fact made no mistake and there's a reason why the UB is never actually triggered at runtime. Reasoning backwards from that can open up large optimization opportunities elsewhere, like not loading a value from memory twice, or not actually checking a condition. A good article about this that was recently discussed on HN: https://news.ycombinator.com/item?id=18575383


What they do is not try to prove UB, but use it as a precondition for the proof that their optimizations don't change the program behavior - ie. they can assume that anything resulting in UB won't happen.
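As a hedged illustration of "assume UB won't happen", here is the textbook signed-overflow case. The function names are hypothetical. Because signed overflow is undefined, a compiler is permitted (not required) to treat `x + 1 > x` as always true and skip the comparison entirely; the second function expresses the same intent without relying on overflow.

```cpp
#include <climits>

// Naive check: only well-defined when x + 1 does not overflow.
// An optimizer may assume overflow never happens and fold this to "return true".
bool naive_no_overflow(int x) {
    return x + 1 > x;
}

// Well-defined alternative: compare against the limit before doing any arithmetic.
bool can_increment(int x) {
    return x < INT_MAX;
}
```

For inputs that do not overflow both functions agree; only `can_increment` has a defined answer for `INT_MAX`.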


> Do compilers really try to prove UB so they can use it as an excuse to silently murder kittens?

Clang in particular is quite aggressive about this, and is prone to producing surprising results. GCC has started doing this too, but it is much more conservative.


In the next sentence I say:

> variables must always be initialized before they’re used.

Also I believe I’m quoting close to the standard there about indeterminate value.

I stayed away from using the term “undefined behavior” because I’ve had a lot of problems in the past with students believing undefined behavior means “but it works if they ran a test with a particular compiler version and flags and didn’t see any problems”. No matter how I try to equivocate undefined behavior with invalid code, it has trouble sticking. This is an understandable perspective built from starting at interpreted languages that immediately notify of any runtime errors and otherwise don’t have undefined behavior.


Your first points are well taken. From a quick googling you're right about the way the spec phrases it. It is, of course, terribly precise about under what circumstances UB occurs - https://stackoverflow.com/a/23429477/

> I stayed away from using the term “undefined behavior” because I’ve had a lot of problems in the past with students believing undefined behavior means “but it works [...]

Seems to me that developing a due terror of UB is vital to having a basic understanding of C as a programming language * , and beyond that, it's vital to appreciating what C is.

As well as being a considerable practical headache (it's not a compiler bug, it's UB going haywire), it's one of the major differences between C and, say, Java (alongside the way C types are not portable, etc). I'd say it deserves some emphasis as a basic principle.

* I suspect some theorists would contend that, strictly speaking, C programs aren't programs, and C isn't a programming language. Programs are, theoretically, meant to unambiguously map inputs to outputs. C is not unambiguous. But, needless to say, I digress.


> No matter how I try to equivocate undefined behavior with invalid code, it has trouble sticking.

Try telling students to add -fsanitize=undefined to their compilation flags. That might help make it clear that their code is buggy.


"Any C programmer worth anything knows that this initializes i to an indeterminate value. This is wrong. Reading the variable does not simply give you an indeterminate value, it gives undefined behaviour"

Undefined behavior in the technical sense is not acceptable in a language at all, let alone a feature. If you take the idea of undefined behavior seriously, then it is valid to blow up the world in response to an error and the compiler cannot protect you. Nobody would use C or C++ if they actually accepted the meaning of undefined behavior, so to program in these languages requires embracing doublethink.


First up, I think I was technically mistaken in saying the blog was technically mistaken. It seems the language spec defines the UB in terms of read-before-assignment, but they do use the term 'indeterminate value' as well. So the blog post can be seen as incomplete, rather than mistaken.

> Undefined behavior in the technical sense is not acceptable in a language at all

The C committee disagrees, and they have their reasons. They're aiming to maximise performance and support for various weird and wonderful platforms.

C's philosophy is not like that of Java where, say, an int is defined to be 32 bit and your machine just has to make it happen, even if it's a peculiar 48 bit machine or something.
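A small sketch of that difference, assuming a hosted implementation that provides `std::int32_t` (it is an optional type). The helper name `int_bits` is made up. Plain `int` is whatever the platform finds natural, with only a 16-bit minimum guaranteed, while the exact-width typedef exists only where the hardware can supply it.

```cpp
#include <climits>
#include <cstdint>

// Width in bits of plain int on this implementation.
constexpr int int_bits() {
    return static_cast<int>(sizeof(int) * CHAR_BIT);
}

// The language guarantees only a lower bound for int.
static_assert(int_bits() >= 16, "int is at least 16 bits wide");

// Where std::int32_t exists, it has exactly 32 bits and no padding.
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t is exactly 32 bits");
```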

> If you take the idea of undefined behavior seriously, then it is valid to blow up the world in response to an error and the compiler cannot protect you

Well sure. UB means you screwed up. C isn't a hand-holding language. Explode the whole process. Perfectly valid. Not without precedent in the C++ spec, incidentally; C++ is defined to explode your process if you screw up in a certain way with exceptions https://stackoverflow.com/a/43675980/

Your program can't literally blow up the world, of course, but that's beyond the scope of the language. If UB could result in nuclear apocalypse, that would mean a compiler bug could generate a binary that did the same thing, regardless of source language.

> Nobody would use C or C++ if they actually accepted the meaning of undefined behavior, so to program in these languages requires embracing doublethink.

Plenty of C/C++ programmers have a very sloppy attitude to undefined behaviour, and sometimes they get bitten by it. Aggressive optimising compilers like gcc really do depend on you taking responsibility for writing code with defined behaviour. Break that contract, and you can see nightmare intermittent bugs that only appear when using certain compiler flags. See: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Of course, this is all made worse by the fact that it's impossible for static analysis to detect all instances of UB. If you want a language that doesn't take this attitude, you might try Ada. It's a pity more people don't.


>If you take the idea of undefined behavior seriously, then it is valid to blow up the world in response to an error and the compiler cannot protect you

That's not its job. The compiler cannot protect you from everything. (You shouldn't be using a computer that has the ability to blow up the world.) The compiler merely translates code from language A into language B. If you pass it some code in some third language which could parse as mal-formed A code, then how is it to detect that?

If I write a language with the same syntax as C but entirely different semantics, and I pass it to a C compiler, what do you think should happen? Undefined behavior is what happens.


> That's not its job. The compiler cannot protect you from everything.

The compiler can protect us from many things, such as uninitialised variables. It’s just that we choose to define the semantics of the C compiler such that it doesn’t.

Also I don’t think GP meant literally, physically destroy the world. It was a metaphor for catastrophic consequences of program misbehaviour. They meant that literally anything could happen, having dire consequences for reliability and security.


Right here is what I mean by doublethink. How can "literally anything" not include "literally physically destroy[ing] the world"? Of course, I was also using it as shorthand for any catastrophic consequence that is unacceptable.

It's the equivocation between inconsistent ideas that frustrates me. Of course, practically speaking you assume really bad things don't follow from undefined behavior. But then why the constant refrain about how we shouldn't rely on what actually happens?

I think I understand where the motivation for declaring undefined behavior comes from - people who set standards don't want responsibility for situations they don't completely control. But this disclaiming of responsibility puts the users of the standards in an impossible situation as a result.

I'm coming from a perspective of someone who programmed in C as my second language after BASIC, back in the 80s, before modern standards and before I knew anything about language standards. The philosophy and attitude of people who talk about standards and undefined behavior is something I first encountered on Usenet in the 90s, but I still am disturbed by it and haven't been "educated" to accept it.


> If you pass it some code in some third language which could parse as mal-formed A code, then how is it to detect that?

This 'other language' approach doesn't strike me as a good way of thinking about it. It misses the point that C is pretty unique in its broad use of undefined behaviour. Unlike Java, where everything has an unambiguous definition. (Well, ignoring plenty of platform-specific variation points in the standard library, such as file-path syntax.)


> No, it's because i lives in an area allocated at the OS level

What OS? I don't have one. I have to link in init code to zero sections of ram before my embedded code runs.

Sure, it got into the standard because some popular OSes like Unix zero initialized pages, but it's not a universal truth. The reason it's guaranteed in C is because of the spec.


Dear lord, sounds like you got lost from comp.lang.c. How did it get into the spec? Because Unix did it that way. Just like gets made it into the spec. It’s not like that was some willful good design. The history of C and its standardization is inextricably tied to the history of Unix.


Not necessarily true.

C89 only requires that static values be initialized.

A modern C standard (section: 6.7.8(10)) requires static values be initialized, but what the value is initialized to is _technically_ indeterminate.

There is the guideline given that integers must be zero, and pointers be NULL. But if a static storage class object doesn't consist purely of integers, pointers, or (fixed-size) structures, arrays, and unions whose elements can be recursively reduced to integers or pointers, then the standard says the initialized value is indeterminate.

While relatively straightforward, there are a few gotchas.


The C11 rules for static values are pretty explicit:

> If an object that has static or thread storage duration is not initialized explicitly, then:
>
> - if it has pointer type, it is initialized to a null pointer;
>
> - if it has arithmetic type, it is initialized to (positive or unsigned) zero;
>
> - if it is an aggregate, every member is initialized (recursively) according to these rules, and any padding is initialized to zero bits;
>
> - if it is a union, the first named member is initialized (recursively) according to these rules, and any padding is initialized to zero bits;

I think that covers every possible value you can create, and it seems pretty non-indeterminate by my reading.
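A quick way to see those rules in action, written as C++ (which zero-initializes objects with static storage duration in the same spirit as the C11 text quoted above). The struct `Record` and the globals are hypothetical names used only for this sketch.

```cpp
// Hypothetical aggregate mixing arithmetic, pointer and array members.
struct Record {
    int         count;
    double      ratio;
    const char* name;
    int         history[4];
};

// Static storage duration, no explicit initializer anywhere.
static Record g_record;
static int*   g_ptr;

// Returns true if every member starts out as zero / null, as the rules require.
bool statics_start_zeroed() {
    if (g_ptr != nullptr) return false;
    if (g_record.count != 0) return false;
    if (g_record.ratio != 0.0) return false;
    if (g_record.name != nullptr) return false;
    for (int value : g_record.history) {
        if (value != 0) return false;
    }
    return true;
}
```

The same declarations as locals inside a function would carry no such guarantee.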


I always think C/C++ is not a good first language to teach, because you need to understand some general concepts about computer systems before diving in. Probably more than half of the beginners in their first C class give up when handling pointers/arrays - it's not because the concept is hard, but because they have no idea what the "memory" in the computer actually is and how it works. You're left with a naive analogy of home addresses and phone numbers, but it really doesn't give pointers justice (especially when dealing with arrays and dynamic memory!) Every student learning C should have a basic idea about the stack and the heap, how function calls work, and how assembly code is executed, in order to get the full picture.

Anyways, the problem C++ has for beginners is that the language tries to obscure the already hard to understand low-level behavior with high-level concepts (to be fair, that is the language's ultimate goal: to provide a high-level abstraction for low-level code.) So maybe good for your second/third language, but definitely not your first.


> I always think C/C++ is not a good first language to teach, because you need to understand some general concepts about computer systems before diving in.

Some believe that it's important to teach those things about computer systems first. To anyone who knows how the machine works it's pretty obvious why a primitive stack variable is uninitialized: initializing them would cost a store which would in most cases be pointless. People use C and C++ because they want all of the performance the machine has to offer, and usually more. Stuffing the program full of immediate value loads and stores would make it huge and slow. C++ is already cursed with enough pointless stores, for example initializing the buffer of a string just after resize and just before copying something else into the same memory. We don't need more of these things.
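To make the string example concrete, here is a sketch of the pattern being complained about, next to a form that avoids the extra pass. Both function names are invented for illustration; whether a given compiler manages to remove the redundant stores in the first version is not something this sketch claims.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// resize() fills the new characters with '\0', then memcpy overwrites them all.
std::string copy_via_resize(const char* src, std::size_t n) {
    std::string out;
    out.resize(n);                      // first store: n zero bytes
    if (n != 0) {
        std::memcpy(&out[0], src, n);   // second store: the real contents
    }
    return out;
}

// Constructing from (pointer, length) writes the characters only once.
std::string copy_direct(const char* src, std::size_t n) {
    return std::string(src, n);
}
```

Both return the same string; the difference is only in how many times the buffer is written.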

The author then goes on to complain at length about {} initializers in C++11, which everyone recognizes as problematic, and which has been partially fixed in C++14. This has basically nothing to do with the first part of the article.


I learnt C as a first language, and those computer systems concepts were explained as we went. It's not that hard to understand what memory is and that variables' values are stored there, and I've had no problem with pointers.

And the big advantage is that when you understand those concepts and pointers, there's no magic in other languages (value/reference parameters, objects, functions as first-class citizens).


Totally agree. Knowing C and a little assembly takes out the magic from other higher level languages. That's a good thing.


Monads are still a little magical though.


I keep thinking that C could support monads with a little work.


As I see it, adding lambda/anonymous functions would have a transformative impact on the whole language. Adding them to C++ had a similar impact although the language offered some ways to get around the limitation.


Very true. When I look at C# pretty much all the cool stuff they have added relies on lambdas and closures. Same for JavaScript. It's the one feature that enables a ton of other features.


"Programmable semicolons."


On the contrary, properly teaching pointers can do wonders to help people understand how memory is structured.

C and C++ are horrible academic languages for other reasons - the original design is just too much of a hack to begin with, and then there's decades of legacy backwards compatibility with it in all newer developments. Syntax is ugly and inconsistent, doing things right is often harder than doing them wrong, there are many features and exceptions that are only there for historical reasons, and the standard library is very eclectic in terms of what is and isn't included (from a modern perspective). But take something like Modula-2, and pointers aren't a problem.


I think it's better to teach basic assembly before C. Using memory addresses makes so much more sense in assembly.

But for a first language I'd want to use something like Python or Javascript or even a toy language. It simply needs to be something where some basic, high level concepts can be taught without worrying about the details.


Yeah, there is really no need to be scared of assembly. It is easy to learn because there is not much to it. Only writing nontrivial programs in assembly is hard, and you don't need to do that to understand what's going on behind the scenes in compiled code.


> Only writing nontrivial programs in assembly is hard

It's not even really hard, just crushingly tedious and error-prone. You could also write all your programs on 80-column punch cards or with ed(1), but why bother if there isn't some technical reason why a text editor isn't available?


Yeah, even in the old days Pascal was the language of choice for beginners. Now Python holds that position: some schools teach Java or C# as a first language, but many use Python these days.


In embedded assemblers I have used (TMS34010), the BSS section is uninitialized when the program is loaded (I helped write the OS). When I wrote embedded C, the added C runtime code that executes (before main() is called) zeroes out the BSS section (the OS has nothing to do with it). In other words, it is the C runtime that is responsible for initializing, not the OS.


I think you're the fourth person jumping in to quibble with my "bss is zeroed by the OS" statement by snarking about "Ah hah! But what if YOU ARE the OS?!".

Yeah, I know. I live in that world too. I don't know that it's particularly relevant. The C standard is written to a norm of a Unix userspace environment, and that's clearly where the linked article is working.


It’s the C runtime that zeros it, not the OS. That’s my main point. The OS may initialize too, but that is outside the standard.


On oddball platforms it might be (though I think there's some quibbling as to whether the C runtime in such an environment is really distinct from "the OS"). It's not on Linux. The .bss segment becomes an anonymous mmap which is zeroed by the kernel when first accessed. Glibc never touches it.

(Edit to correct: obviously glibc "touches" .bss because it has its own static variables. But there's no "zero .bss" step in crt0.o)
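The kernel-side zeroing can be observed directly with an anonymous mapping. This is a sketch that assumes a Linux or similar POSIX system where `mmap` accepts `MAP_ANONYMOUS`; the function name is made up. It maps fresh memory, checks that every byte reads as zero without the program ever writing to it, and unmaps it again.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Maps `length` bytes of anonymous memory and reports whether all of it reads as zero.
bool anonymous_mapping_is_zeroed(std::size_t length) {
    void* mapping = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) {
        return false;   // could not map; treat as "not verified"
    }
    const unsigned char* bytes = static_cast<const unsigned char*>(mapping);
    bool all_zero = true;
    for (std::size_t i = 0; i < length; ++i) {
        if (bytes[i] != 0) {
            all_zero = false;
            break;
        }
    }
    munmap(mapping, length);
    return all_zero;
}
```

Nothing here involves the C runtime: the zeros come from the kernel handing out fresh pages.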


I mean, there's probably more shipped units of "oddball systems" than there are of Unix-like multi process systems that have to wipe segments like .bss because of their security model. So it's not like it's some pedantic corner case like you're implying.

And there are other parts of the C standard that would just declare 'implementation defined', so the standard doesn't have to say static duration is initialized just because that's what multi process systems do.


./libgcc/config/nds32/crtzero.S

    .L_bss_init:
        ! clear BSS, this process can be 4 time faster if data is 4 byte aligned
        ! if so, use swi.p instead of sbi.p
        ! the related stuff are defined in linker script
        la  $r0, _edata     ! get the starting addr of bss
        la  $r2, _end       ! get ending addr of bss
        beq $r0, $r2, .L_call_main  ! if no bss just do nothing
        movi    $r1, 0          ! should be cleared to 0
    .L_clear_bss:
        sbi.p   $r1, [$r0], 1       ! Set 0 to bss
        bne $r0, $r2, .L_clear_bss  ! Still bytes left to set


How does finding some uncompiled sample code someone added to gcc for an architecture no one has heard of refute the point that glibc on Linux implements bss with an anonymous mmap?


Not meant to refute, just sample code on the C runtime zeroing the bss section. You'll find this for a variety of systems, this was just one of them...
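For readers who don't speak that assembly dialect, the loop amounts to roughly the following. This is only a sketch: `zero_range` is a hypothetical name, and on a real target the two pointers would come from linker-script symbols like the `_edata` and `_end` used in the listing, which are not declared here.

```cpp
// Writes zero to every byte in the half-open range [begin, end).
// Does nothing when the range is empty, matching the early branch in the assembly.
void zero_range(unsigned char* begin, unsigned char* end) {
    for (unsigned char* p = begin; p != end; ++p) {
        *p = 0;
    }
}
```

Startup code would call something like this once, before `main` and before any static variable is read.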


Rather than pointing to Linux over and over, let's look at Win95. It kind of comes from a 'fuck local security' standpoint (everything mmaped is in one global space mapped into all processes), and still clears bss because that's what the spec expects, and like a lot of embedded systems is fully spec compliant.


The OS does initialize it, so if you’re writing code for linux in any languages, or even in assembly, you won’t be able to spy on that retrieved memory.

Now C might want to do it too, but bare metal stuff is not necessarily compliant C so that doesn’t matter much.


(Having been the lead for an RTOS) it matters less what the spec says, and more what the compiler expects. So like, you can ignore kill(3) by just not calling it. Patching the compiler to not depend on zero initialized .bss is way harder, and it's just easier to clear it yourself pre main.


You're certainly correct, but I don't understand why you're phrasing this as an objection rather than as a rationale for how the language is specified.

Personally, I enjoy that the C standard makes a bunch of guarantees that I can use for my reasoning about correctness. Of course I can only get so far reasoning in abstract, language-defined terms and ignoring the execution environment, but where it is sufficient, being able to forget about operating system minutiae is certainly a relief.

Since the main thrust of the article is that C++'s language-imposed rules are overly complicated, restricting the perspective to the language specification seems reasonable.


> The .bss segment has been zeroed, for obvious reasons, since the advent of modern linkage.

For your amusement: in the Linux kernel, at least on x86, .bss isn’t initialized when the kernel is loaded. Instead, the kernel memsets it to zero a little later. I don’t know why.

Also, because all this stuff predates any concept of security, there is no read-only equivalent of .bss in most systems, meaning that you get suboptimal code for:

const int i = 0;


Why would the code for "const int i" be any different from the code for "int i"? The Standard doesn't require the variable to be allocated in an immutable memory area. It just says that if you do anything to change its value, the result is undefined behavior. And actually changing the value in a way that would be observable via the const variable is a valid subset of undefined behavior.


The optimal code for const int i = 0; is for the compiler to inline its uses, and for there never to be any storage allocated at all.

;-)


That works unless you use &i somewhere.
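A sketch of the two situations, with hypothetical names. When only the value of the constant is used, the compiler can substitute the literal and never allocate anything; once its address escapes, the object has to live somewhere in memory.

```cpp
// Hypothetical named zero constant.
const int kZero = 0;

// Value use only: the compiler can fold kZero into the expression.
int add_zero(int x) {
    return x + kZero;
}

// Address use: kZero now needs actual storage that the pointer can refer to.
const int* address_of_zero() {
    return &kZero;
}
```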


It's hard to imagine many situations where you need to give a name to the constant value 0. And think how "optimal" it would be if there were many and we had a read-only BSS! We could cheaply get a huuuge section full of constant zeroes.

I think you can in fact realize the idea in PECOFF, by the way. I still need to figure out some aspects to it these days (it seems a bit arcane and maybe Windows doesn't follow the spec very well). But in any case PECOFF has this notion "VirtualSize" (size of section in running program) and "SizeInFile" (size of the prefix of the section that should be filled with contents from the executable file). If SizeInFile is smaller than VirtualSize then the rest of the image gets filled with zeroes I think.


> But in any case PECOFF has this notion "VirtualSize" (size of section in running program) and "SizeInFile" (size of the prefix of the section that should be filled with contents from the executable file). If SizeInFile is smaller than VirtualSize then the rest of the image gets filled with zeroes I think.

Correct. A common and low-hanging-(more like "lying on the ground")fruit optimisation for PE files is to reorder and realign the sections such that all the 0s are at the end, in which case they can be "cut off" by setting those header fields appropriately and not waste storage space.


> I think you can in fact realize the idea in PECOFF, by the way.

You can express it in ELF too, with a NOBITS segment with ALLOC but no WRITE flag. Whether that works or exercises bugs in the dynamic loader is an open question.


On Linux (in the kernel image), it exercises bugs in the dynamic loader. I tried it once :)


The OS program loader is who allocates pages to .bss and initializes them to zero. In the case of an OS, who would do that initialization? At best it would be the bootloader. Initialization during kernel_init() before using any globals is acceptable too.


Having seen both sides of this, people are arguing about the proper way to skin a cat. The standard only cares the cat be skinned not how.

On an OS it's done by the OS for security and performance reasons. One, you don't want people to snoop on memory freed from other processes. Two, the OS can use the MMU to map in previously zeroed pages of memory on demand, so you don't need to actually zero the entire .bss section on startup.

On a bare metal system usually it's done either in assembly (or more cheezy in C) + linker magic.


I don't think we should pretend that any of this was planned out. The C standard is the way it is not because it specifies the best possible version of C that can exist, but because it encodes historical behaviour.

It's possible to post-rationalize those decisions, but as I understand it, there's no particular reason why .bss has to be initialized -- it just happens to be the way Unix has worked since forever.

With modern static analysis, there's probably no performance benefit to not initializing stack variables that are read before they're written?


> With modern static analysis, there's probably no performance benefit to not initializing stack variables that are read before they're written?

Not really relevant because that's a bug.

But the common thinking "let's always default-initialize stack variables because the compiler will make it efficient anyway" is clearly "sufficiently smart compiler" thinking. Implicit default initialization will never be 100% as efficient as simply not requiring initialization.

Plus, I like the error messages I can get, sometimes, with no default initialization. The compiler can statically detect some uninitialized reads. It cannot do that if all variables are implicitly initialized. In many other cases at least one can be quite certain to experience random crashes that let one track down the bad read rather quickly.

In other words, implicitly initializing to zero normally just hides logic bugs. It doesn't make them go away.


A better idea is to force the initializer, and require some special syntax if you really want an uninitialized variable. E.g. if a language uses underscore as a syntactic wildcard, then it could be something like:

    int x = _;

The vast majority of variables can and should be initialized when they're declared (and often don't ever change after, so const is a good idea too). This is especially so in C++, which is less likely to return values via references/pointers these days (with tuples and uniform init for returning structs).
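A short sketch of that style, with made-up names (`Extent`, `measure`, `area`). The function returns a small struct by value using brace initialization instead of filling in out-parameters, so the caller can initialize everything at the point of declaration and mark it const.

```cpp
// Hypothetical result type returned by value.
struct Extent {
    int width;
    int height;
};

// Returns both results at once instead of writing through pointers.
Extent measure(int scale) {
    return {4 * scale, 3 * scale};
}

// Every local is initialized where it is declared and never reassigned.
int area(int scale) {
    const Extent extent = measure(scale);
    const int width = extent.width;
    const int height = extent.height;
    return width * height;
}
```

With an out-parameter API the caller would have to declare `width` and `height` first and fill them in afterwards, which is exactly where uninitialized locals tend to come from.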


Note that clang just got a change to experiment with this: https://reviews.llvm.org/rL349442


Doesn't seem to me like a statement that variables should be implicitly initialized.

It seems like a workaround around insanity with respect to UB (where compilers fail to notify the programmer that they statically detected a logic bug, and instead go on with compiling, making crazy optimizations based on the wrong assumption that the logic bug was never there).


> Doesn't seem to me like a statement that variables should be implicitly initialized.

Not at all. I also believe that we shouldn't hide logic bugs, but the problem with this class of bugs is how hard they are to catch: forcing the initialization can at least make the behavior more consistent across executions of the same program (avoiding the rare sequence of conditions that will clobber the value the right way before you use it). For some specific applications this may be an OK tradeoff for release builds.

But to clarify why I posted this, I was just trying to add one piece of data to this part of the thread of discussion

>> "With modern static analysis, there's probably no performance benefit to not initializing stack variables that are read before they're written?"

> "Implicit default initialization will never be 100% as efficient as simply not requiring initialization"

Actually we don't know the exact impact, it is likely codebase dependent, but this patch in clang will allow to experiment with various tradeoffs.

> where compilers fail to notify the programmer that they statically detected a logic bug

Compilers (at least clang) won't detect a logic bug without notifying the programmer. You have warnings for this.

The optimizer just "assumes" that there is no logic bug but can't reason about the logic:

    { int a = 1; foo(&a); }
    int b;
    bar(&b);

Can I optimize toward:

    int a = 1;
    foo(&a);
    bar(&a);

If we assume that there is no logic bug, then it seems like a valid transformation to me. But if bar reads its parameter before writing to it, then this optimization makes foo affect bar.
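Here is that pair written out as compilable functions, under the assumption that `bar` is well behaved and writes before it reads. The bodies of `foo` and `bar` are invented for the sketch. In that case the two forms are indistinguishable from the outside, which is why reusing the slot looks legitimate to an optimizer.

```cpp
// Hypothetical callee that reads and modifies its argument.
void foo(int* p) {
    *p += 41;
}

// Hypothetical callee that writes before reading, so an uninitialized argument is fine.
void bar(int* p) {
    *p = 7;
}

// Source form: a and b are distinct objects; b starts uninitialized.
int original_form() {
    { int a = 1; foo(&a); }
    int b;
    bar(&b);
    return b;
}

// Merged form: one object serves both calls.
int merged_form() {
    int a = 1;
    foo(&a);
    bar(&a);
    return a;
}
```

If `bar` instead read `*p` first, `original_form` would have undefined behaviour, while `merged_form` would quietly see whatever `foo` left behind.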


Thanks, that's insightful!


I'll tell you I'd rather live in the current world where the compiler emits a warning 'foo may be uninitialized' than the world we are heading into where that enables a half assed optimization that speeds up a ubenchmark on an architecture no one uses anymore.


In my view this author falls into the camp of people calling C++ a mess because it gives them too much control when they ask for it. One of his examples which he calls "The Abyss":

    #include <iostream>
    struct A {
        A(std::initializer_list<int> l) : i(2) {}
        A(int i = 1) : i(i) {}
        int i;
    };
    int main() {
        A a1;
        A a2{};
        A a3(3);
        A a4 = {5};
        A a5{4, 3, 2};
        std::cout << a1.i << " "
                  << a2.i << " "
                  << a3.i << " "
                  << a4.i << " "
                  << a5.i << std::endl;
    }
    
which outputs:

    1 1 3 2 2
he claims to be mysterious but is actually pretty reasonable.

In the case of:

a1: There is no initializer list in the variable declaration, so ctor 2 is called.

a2: An empty initializer list should reasonably behave like a default constructor, and a default constructor should be more efficient than processing an initializer_list, so ctor 2 is called. In general, the constructor with the matching number of arguments and correct types is preferred over initializer_list constructors. Makes sense, being able to specialize on number and type of arguments is more powerful than a design where a single initializer_list ctor invalidates all other ctors.

a3: No list, ctor 2 is called.

a4: Non-empty init list and no specialized non-init-list ctor, so ctor 1 is called.

a5: Several-variable init list and no overriding 3-variable constructor with matching types, therefore ctor 1 is called.
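The walkthrough above can be checked mechanically. What follows is only a sketch: it keeps the article's struct but adds a hypothetical `via` tag (and a hypothetical `Ctor` enum and `check_abyss` helper, none of which are in the article) so each object records which constructor ran.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical tag type, added only to record which constructor ran.
enum class Ctor { InitList, Int };

// Same shape as the article's struct, plus the `via` tag.
struct A {
    A(std::initializer_list<int>) : i(2), via(Ctor::InitList) {}  // "ctor 1"
    A(int i = 1) : i(i), via(Ctor::Int) {}                        // "ctor 2"
    int i;
    Ctor via;
};

// Hypothetical helper: builds the five objects and checks each one.
void check_abyss() {
    A a1;           // no initializer: default construction via ctor 2
    A a2{};         // empty braces: value-initialization, still ctor 2
    A a3(3);        // parentheses: ordinary overload resolution, ctor 2
    A a4 = {5};     // non-empty braces: initializer_list constructor
    A a5{4, 3, 2};  // non-empty braces: initializer_list constructor

    assert(a1.i == 1 && a1.via == Ctor::Int);
    assert(a2.i == 1 && a2.via == Ctor::Int);
    assert(a3.i == 3 && a3.via == Ctor::Int);
    assert(a4.i == 2 && a4.via == Ctor::InitList);
    assert(a5.i == 2 && a5.via == Ctor::InitList);
}
```

Calling `check_abyss()` from `main` should pass silently, which matches the `1 1 3 2 2` output quoted above.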

Complex? Maybe, but that's what you get with C++: very fine control of program semantics, benefit being expressive libraries. No other language fills this niche that I'm aware of.

I agree that C++ isn't a good language to teach in a CS 101 class. But no other single language is good either. The goal of CS 101 is to not make students give up before they get hooked, and that can happen because the material is too challenging or not challenging enough. For people in the former category, give them a scripting language and visual feedback, like Lua + Garrysmod. For the latter, give them assembly, haskell, C, C++ (teaching it like "C with templates" not "C with classes").


Hi. I completely agree. C++ is there for very fine control and serves the purpose well (and is constantly adding more control, e.g. trivially relocatable).

My biggest goal for a CS 101 class would just be to build computational and critical thinking skills. I would love to just start off with peanut butter and jelly sandwiches[1]. As an engineering department, our students start with an engineering programming class that needs to also serve chemical, mechanical, civil, material, and biomedical engineers along with electrical and computer engineers (I hope I didn't leave anyone out).

[1]: https://edtechmagazine.com/k12/article/2008/07/programming-a...


The problem with C++ that the article explores isn't that it's complex. It's that it's unnecessarily complex, and you could have a language with all that power (and then some, like say a proper macro facility) without so many warts. Indeed, Rust is a proof by example now.

Of course, those warts were necessary for C++ to be successful back when it was introduced and competing against others, and therefore to its popularity today. And this popularity is just as much a part of C++ appeal as its power. But we can call them out for what they are, without trying to justify them.


C++ initialization is such a convoluted mess that CppCon was able to have an entire hour dedicated to "The Nightmare of Initialization in C++".

https://www.youtube.com/watch?v=7DTlWPgX6zs

C++ has a rigorous standard. There is no rule that wasn't added for a reason. Nor are there lines of code with which you can specify which rule it must follow.

The path to hell is paved with good intentions. Just because each step is logical and defensible doesn't mean the end result isn't fire and brimstone.

I believe that good things are more than the sum of their parts. You often hear this with respect to creative works like movies or games. I think it also applies to source code, language design, and much more. The flipside is that bad things are less than the sum of their parts. I've shipped games that fall into this category. C++ initialization has a LOT of parts. And I would strongly argue the end result is less than the sum of those parts.


> it gives them too much control when they ask for it

Rust gives you too much control when you ask for it, with unsafe blocks, cells, etc. C++ will stab you in the gut at random because you looked at it the wrong way.

> But no other single language is good either.

I had Python for my CS101, Java for 102, C for 201, and C++ for ~202. This was a decade ago now, but even then I thought it was a pretty good track. You learn the fundamentals of computation in Python fully apart from the hardware warts, you get a taste of the corporate monotony many of you will face for decades (to weed out the chaff), then, when it's too late to turn back, you get a dose of cold hard reality: it gets even worse.


std::initializer_list was probably the greatest mistake made in modern C++, but you can mostly avoid the complexity it introduced just by not using it. You usually only encounter it when you are trying to use it, in which case it behaves as you'd expect.


> Complex? Yes, but that's what you get with C++:

Congrats on successfully summarizing the complaint.

It's not mysterious, but it's gratuitously complex. And this is only variable initialization.


You ended the quote early, and my sentence went beyond the article. I said:

> that's what you get with C++: very fine control of program semantics, benefit being expressive libraries.

It's not gratuitous because you can't remove much of it without losing power or expressiveness (and still remain a C).

Also, by complex I didn't mean complicated.


It doesn't really seem necessary for "A a = A{}" and "A a = {}" to be different things to have power and expressiveness. It really seems like uniform initialization covered just one arbitrary case while making other cases even less uniform.


Indeed.

If just setting integers is this complicated, what do you expect to happen when you are trying to solve real problems?


The parsing details can be complicated while the end result can be simple. Consider the actual C++ example that would be taught & used in practice:

   #include <iostream>

   struct A {
     int i = 0;  // default member initializer
   };

   int main() {
     A a;
     std::cout << a.i << std::endl;
   }
Hey, look, done. And you only had to teach a single thing - default initialization. Which has an obvious & simple syntax. The int i is always initialized to 0, as intended, and it's in a single spot at the point of declaration.

Now try doing that in C. Oh, wait, you can't. The closest you can get is this mess:

   #include <stdio.h>

   struct A {
     int i;
   } const default_A = {0};
   
   int main() {
     struct A a5 = default_A;
     printf("%d\n", a5.i);
   }
Which requires you to know that you can declare a type & an instance of that type in a single statement, why that const is where it is and why that's important here, how braced initialization rules work, that you need to always remember to manually initialize to your default_A, and that %d means 'int'. And the compiler won't help you with any of this except for the %d part if you get it wrong.


> Which requires you to know that you can declare a type & an instance of that type in a single statement

    struct A {
        int i;
    };

    const struct A default_A = { 0 };
Why didn't you try the obvious simple thing first?

The whole undertaking is pointless in any case. Why would you need default values (i.e. templates for constructors) baked in? The only reasonable default value, sometimes, is all-zeroes (or all-ones...).

Now, constant data (i.e. things that contain more useful information than just "it's the default because a real value is missing" -- and that aren't copied around pointlessly) is another case, and C's plain old value initializer syntax (as demonstrated above) serves it perfectly well.


> The only reasonable default value, sometimes, is all-zeroes (or all-ones...).

That's very false. Consider a basic string container with a small-size optimization. The default value for capacity is neither 0 nor 1, but the size of the inline array.

Similarly it could be an enum value, and the default for a given class isn't whatever the 0 value happened to line up with because that's arbitrary anyway. An example being a basic type id of a fixed number of types.
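A minimal sketch of both cases, using C++11 default member initializers. The names `SmallString`, `kInlineCapacity`, `TypeId`, and `Value` are made up for illustration, and the string type is storage only, with no real string logic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type-id enum: the sensible default is not whichever
// enumerator happens to have the value 0.
enum class TypeId { Int, Float, String, Unknown };

// Hypothetical small-string skeleton (storage only).
struct SmallString {
    static constexpr std::size_t kInlineCapacity = 15;

    char inline_buf[kInlineCapacity + 1] = {};  // inline storage, zero-filled
    std::size_t size = 0;                       // empty by default
    std::size_t capacity = kInlineCapacity;     // neither all-zeroes nor all-ones
};

struct Value {
    TypeId type = TypeId::Unknown;  // default is not the zero enumerator
};

// Hypothetical helper: checks the defaults of freshly created objects.
void check_defaults() {
    SmallString s;
    Value v;
    assert(s.size == 0);
    assert(s.capacity == 15);
    assert(v.type == TypeId::Unknown);
    assert(static_cast<int>(v.type) != 0);
}
```

Every `SmallString` and `Value` picks up these defaults at the point of declaration, with no template object to copy from.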


Considering that a vast amount of extremely important software and large projects use C++, I would say it solves real problems nicely. You can get pedantic about features of any language if you want to.


> a2: An empty initializer list should reasonably behave like a default constructor, and a default constructor should be more efficient than processing an initializer_list, so ctor 2 is called.

Yes, an initializer-list constructor given an empty list should behave the same as the default constructor, and if it doesn’t, it’s really bad style. However, if both are defined and you construct an object with an empty initializer list, the initializer-list constructor should be called with an empty list as its argument, simple as that. That’s very logical and sensible semantics, so obviously C++ had to choose something different.


But a2 isn't an empty initializer list. A{} is the same as A(). You could use A{} even if there was no initializer list constructor. The squiggly brackets are from uniform initialization: a feature created to avoid the ambiguity between zero-argument constructors and zero-argument function declarations.

Unfortunately, std::initializer_list created a new ambiguity, so it's only an initializer list if it doesn't match an existing constructor.


While it does sometimes seem like C++ does some things because of how much sense they don't make, I don't think this is one of those cases. Read the rest of the sentence you quoted and I think you'll find that the way C++ does it will make sense.


No, that reasoning is unsatisfactory. It would explain why A a; or A a(21, 37) calls A(); and A(int p, int q); respectively despite presence of A(std::initializer_list<int> l), but the initializer list constructor should still be called when one writes A a{} and A a{21, 37}. That would be consistent and predictable.
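For reference, here is a sketch of how these cases resolve under the current rules, as far as they can be checked with a small test. The struct `B`, the `Picked` enum, and `check_brace_rules` are hypothetical names. Only the empty-brace case falls back to the default constructor; non-empty braces already prefer the initializer_list constructor.

```cpp
#include <cassert>
#include <initializer_list>
#include <vector>

// Hypothetical tag recording which constructor was selected.
enum class Picked { Default, TwoInts, InitList };

// Hypothetical struct with all three constructors under discussion.
struct B {
    B() : picked(Picked::Default) {}
    B(int, int) : picked(Picked::TwoInts) {}
    B(std::initializer_list<int>) : picked(Picked::InitList) {}
    Picked picked;
};

void check_brace_rules() {
    B b1;          // default constructor
    B b2(21, 37);  // parentheses: B(int, int)
    B b3{};        // empty braces: still the default constructor
    B b4{21, 37};  // non-empty braces: initializer_list beats B(int, int)

    assert(b1.picked == Picked::Default);
    assert(b2.picked == Picked::TwoInts);
    assert(b3.picked == Picked::Default);
    assert(b4.picked == Picked::InitList);

    // The same rule is visible in the standard library:
    std::vector<int> v1(3, 1);  // three elements, each equal to 1
    std::vector<int> v2{3, 1};  // two elements: 3 and 1
    assert(v1.size() == 3);
    assert(v2.size() == 2);
}
```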


> An empty initializer list should reasonably behave like a default constructor

True. But an empty initializer list should also behave like, you know, the constructor which uses an initializer list. C++ is in the unfortunate position of having to choose one behaviour or the other. Either choice is reasonable, but both can be confusing.


> C++ is not a language I’d want to teach beginners. At no point in this post was there room for systems programming concepts, discourse on programming paradigms, computational-oriented problem solving methodologies, or fundamental algorithms.

This is a ridiculous argument. The entire point of the post was, as the author even noted, purely to deep dive into a rabbit hole, get super picky & pedantic about standards wording so that you can act surprised that copy constructors exist in C++ while it was completely glossed over in the identical C example of a struct copy initialization, and almost none of this is useful to know.

So don't fucking teach it and you'd have plenty of time to cover all that other stuff. Bam, problem solved. Ignore pre-C++11 entirely, and purely teach & use the new stuff which fixes all the complexity, and leave the rabbit hole for people that care about exploring the past.

Oh, and don't use standards wording because those are for compilers to use to implement the language, not for programmers to understand how to use it effectively. The only time it's ever useful to know the full definition of an aggregate type or how it has changed is when you want to write a blog post calling them crazy or complex. It's never useful to know when using the language. And if you have an object that cares to enforce it just

    static_assert(std::is_aggregate_v<A>, "A isn't an aggregate type");
Tada, now the compiler will tell you if/when you violated the rule and you don't need to try and understand the full scope of the ruleset.
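A slightly fuller sketch of the same idea, assuming C++17 (where `std::is_aggregate_v` was introduced). `Point`, `Tracked`, and `make_point` are hypothetical names used only for illustration.

```cpp
#include <type_traits>

// Hypothetical types for illustration.
struct Point { int x; int y; };                // no user-provided constructors
struct Tracked { Tracked() {} int id = 0; };   // user-provided constructor

// The compiler checks the rule so nobody has to memorize it.
static_assert(std::is_aggregate_v<Point>, "Point should be an aggregate");
static_assert(!std::is_aggregate_v<Tracked>, "Tracked is not an aggregate");

// Aggregate initialization: members are filled in declaration order.
constexpr Point make_point() { return Point{1, 2}; }
static_assert(make_point().x == 1 && make_point().y == 2, "members set in order");
```

If someone later adds a constructor to `Point`, the first `static_assert` fails at compile time instead of silently changing how `Point{1, 2}` is interpreted.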


Hi. I appreciate your candor, although I'm not sure what I've done to inspire such contempt. You're correct that the entire point of the post was to deep dive into a rabbit hole. I acknowledged that and also acknowledged that the standardese is unnecessary most of the time, to avoid any miscommunication in my intent. I did this to point out how large the language is and, more to the point, how potentially complex it can be, compared to C. If you've ever had to spend entire weekends running through students' broken C code at scale, and seen all the interesting ways one can complicate seemingly simple things, then I think you'd agree that giving them more ways to confuse themselves in a fast-paced academic environment is not the way forward. If you have had that experience and didn't have any problems, then you are fortunate to work with such fast learners! (Not to imply the students I've had are dumb in any way; many are very bright!)


I am not sure about the following issue: What would make your students more productive?

   - use C++, given that you stick to some convenient subset of C++ and can use stl
   - stick with C and force them to do the basics, like linked lists and string abstractions over and over again.
I guess in the olden days really good students used to develop their own library of C abstractions and reuse them across several courses, but you can't quite do that if you have C in just one course, and what's the point anyway in this day and age?


How else would they learn about linked lists and dynamic arrays?


Depends on the course. If it's a data structures course then it's fine to do the linked list and dynamic array, but forcing your students to do them again and again in later courses will not make them very productive, and they also might not enjoy the experience too much.

This could happen for example if the data structure course was in python or java and later courses are in C.

Also you can do the darn list in many ways: singly linked, doubly linked, ring, with counter/without counter. All very important in this day and age when you should know to avoid them altogether because linked lists fuck up cache behavior.


Funny you mention that: the data structures course I took let us use any language we wanted. I chose python because I thought “hey python’s easier than C!” So I started making some tree implementation from scratch, and I kept having problems because I didn’t understand Python’s name binding system at the time, because I had no mental model of pointers or references or objects really. (Of course it was a shallowcopy/deepcopy issue). The next assignment I did in C and it was much more natural. Only later did I realize how meta and absurd it was to build custom dictionaries out of python primitives.


a dictionary in python where everything is a dictionary. That's what undergraduate courses are made for!


C might be okay to learn for academic purposes but it's an absolute nightmare to use building real things.

Things like strings and pointers are nightmarish in C. Segfaults aplenty.

I've never encountered a segmentation fault since switching to modern C++. Just because you use C++ doesn't mean you have to use the entirety of it.


> how potentially complex it can be

Welp, time to go back to BASIC everyone. Or Go

Powerful tools can be complex, who would have thought


A huge reason people teach C/++ to beginners is to understand how the "machine" works. Yes, the model C++ uses doesn't map too well to modern assemblies, and yes, it maps even worse to the actual hardware, but its abstraction is the one everyone uses, so it's worth teaching. Teaching modern C++ doesn't do that, at all. "Modern" C++ even has garbage collection (reference counting) in the form of shared_ptr! The author makes the good point that C is much better suited to this job than C++.
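To make the reference-counting point concrete, here is a small sketch of what `std::shared_ptr` does behind the scenes. `Node` and `check_refcount` are hypothetical names; the `use_count()` values are only meaningful here because the code is single-threaded.

```cpp
#include <cassert>
#include <memory>

struct Node { int payload = 0; };  // hypothetical type

void check_refcount() {
    auto first = std::make_shared<Node>();
    assert(first.use_count() == 1);
    {
        std::shared_ptr<Node> second = first;  // copying bumps the count
        assert(first.use_count() == 2);
    }                                          // second is destroyed: count drops
    assert(first.use_count() == 1);

    std::weak_ptr<Node> observer = first;      // weak_ptr observes without owning
    first.reset();                             // last owner gone: Node is destroyed
    assert(observer.expired());
}
```

None of this bookkeeping is visible at the call site, which is the sense in which it hides the "machine" from a beginner.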

Also, standards definitely shouldn't be inscrutable to programmers. Look at the ECMAScript and Go specifications, they're clear as day. Sometimes you just need to know exactly what your code is doing, and this only becomes more important in the low-level scenarios that C++ is often used for. I haven't read the C++ specification, but if it's as hard to read as everyone says it is, then there needs to be some other way to understand exactly what's going on. How else am I meant to precisely understand how my structs will be laid out, or which casts are guaranteed to be valid, or whether the language allows data pointers and function pointers to be interchangeable?


> use the new stuff which fixes all the complexity

It doesn't; unfortunately, it just adds to the complexity.


How so? My experience has been that it actually reduces net complexity, except in the case where you need to support different versions of the language in the same codebase.

Take for example, `auto`, added in C++11. If you don't have to support C++98 or C++03, the use of `auto` significantly simplifies many workflows and is often strictly better with less mental overhead.


> Take for example, `auto`, added in C++11. If you don't have to support C++98 or C++03, the use of `auto` significantly simplifies many workflows and is often strictly better with less mental overhead.

When writing code, yes, but in my experience, it can also make interpreting what code that overuses auto is doing (especially what variables are and what functions return) an awful lot harder in many situations, and you end up jumping all over the place to work out that "auto result = ..." is actually "uint32_t result = ..."

That's one of my personal pet peeves about more modern C++ (and, to some extent, other languages like Rust): they seem to be more optimised for writing code quickly (which generally only happens once) than for understanding it later (i.e. when you didn't write the code, someone else did) and maintaining or altering it in the future, which generally happens a lot more over the code's lifetime.


This is one of my worries about type inference (which C++ auto is not), but so far, the Rust code I've read has definitely been much easier to understand than the equivalent C code, and it's often much more generic. I like how Rust refuses to infer function signatures, which means you still always know the types of the inputs and outputs. I don't use IDEs, but I'd assume that they also make this whole thing a non-issue.


Relying on IDEs to do this type of thing really doesn't help for situations like merge conflicts (i.e. which merge tools understand languages fully?), and it doesn't completely solve the problem of easily understanding code just at a glance - you still need to go round mouse-overing or clicking on stuff to see extra detail.


Rust actually deliberately requires types to be declared, except inside function bodies. So function return types have to be explicit. The main reason is exactly your readability argument. (Of course it also makes type inference inside the function body easier).


Yes, to be clear I wasn't accusing Rust code of having this particular problem. I was just using it as an example of a modern language where, in my opinion, there's a bit of an over-emphasis on brevity when writing the code, as opposed to reading, understanding, and maintaining it. For long-term large projects, I'm not convinced that is a total win.

I could have easily used Go or Swift as examples instead.

And to be clear, these new languages are in general improving things a lot, but I just worry there's an over-emphasis on "being able to do a lot of complex stuff with very little syntax / very few characters", which I don't really fully agree with (at least for large complex long-term projects).

That's not to say I'm right, but in my experience of programming over 15 years in everything from Ada, C, C++ to Java and Python, at least on large projects, the speed at which code was created was very rarely that important. Getting it right, bug-free and performant was generally much more important. In some cases newer more-condensed syntax can definitely help, but in others, it can cause trouble.


I definitely agree with most of your points.

Code is read much more often than written, and most newer languages optimize for convenience rather than readability.

In my view Rust is actually pretty verbose compared to other new languages, and feels a bit cumbersome to write. Technically it could be made quite a bit leaner syntax-wise.

I just wanted to point out that Rust and the Rust community often share your view and want to optimize for readability as well. There actually was a lot of heated discussion last year around proposals that suggested reducing readability for convenience.


Most pre-C++11 code I've seen that gets iterators from `begin()`, `end()`, and friends just typedefs out their actual type anyway, so the experience there is not that different from `auto`.


But those typedefs almost always (in my experience) contain some indication of at the very least an abbreviation of the type and some hints as to whether it's a pointer or not.

That's a lot better to go on than "auto".


You already know the type if you're calling .begin() or .end() on it. You don't care about the type of the iterator because you know it can iterate over the collection you're working with.

That's what auto fixes. It decouples the concept of iterating from the specific type of the iterator, which isn't important.

Similarly auto helps you achieve DRY. 'auto myFoo = std::make_unique<Foo>();'. The type wasn't removed, it just wasn't repeated twice.
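A short sketch of both uses, assuming C++14 for `std::make_unique`. `Foo`, `sum_values`, `make_foo_value`, and `check_auto_usage` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <type_traits>

struct Foo { int value = 42; };  // hypothetical type

int sum_values(const std::map<std::string, int>& m) {
    int total = 0;
    // Pre-C++11 spelling of the same loop variable:
    //   std::map<std::string, int>::const_iterator it = m.begin();
    for (auto it = m.begin(); it != m.end(); ++it) {
        total += it->second;
    }
    return total;
}

int make_foo_value() {
    auto myFoo = std::make_unique<Foo>();  // the type is stated once, on the right
    static_assert(std::is_same<decltype(myFoo), std::unique_ptr<Foo>>::value,
                  "auto deduces std::unique_ptr<Foo>");
    return myFoo->value;
}

void check_auto_usage() {
    std::map<std::string, int> counts;
    counts["a"] = 1;
    counts["b"] = 2;
    assert(sum_values(counts) == 3);
    assert(make_foo_value() == 42);
}
```

The `static_assert` is there only to show that nothing about the type was lost; the deduced type is exactly what the longhand declaration would have spelled out.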

But at this point auto, or things like auto, exist in nearly every major language. So you'll need to teach best practices for working with it at some point. That's a general thing that's everywhere.


Is there a specific case where the uint32_t helps in understanding the logic of the code?


That was a contrived example (although I've often seen auto used instead of that or even bool which I personally think are total over-uses of auto).

But knowing if something is a base type, a reference, a (smart) pointer, or an expensive-to-copy class/struct is very important when writing high-performance efficient code. Being able to see this at-a-glance by the type in the code in my experience helps tremendously with understanding what the code's doing and the implications in terms of data passing / transfer and understanding how that section of code interacts with other parts or could be changed to do other things.

I also think it allows people to be a bit sloppy and not care what's going on (i.e. with regards to whether it's by-value or by reference, etc) as they don't fully need to understand the code, and again, in high-performance computing where you seriously care about processor cycles and memory allocations, this can make a big difference if you're not careful.
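A sketch of the by-value versus by-reference trap described above. `Config`, `mutable_data`, and `check_auto_copies` are hypothetical names; the point is that plain `auto` drops the reference and silently copies.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

struct Config {  // hypothetical class
    std::vector<int> data{1, 2, 3};
    std::vector<int>& mutable_data() { return data; }
};

void check_auto_copies() {
    Config cfg;

    auto copy = cfg.mutable_data();    // plain auto drops the reference: full copy
    auto& alias = cfg.mutable_data();  // auto& keeps the reference: no copy

    static_assert(std::is_same<decltype(copy), std::vector<int>>::value,
                  "copy is a separate vector");
    static_assert(std::is_same<decltype(alias), std::vector<int>&>::value,
                  "alias refers to cfg.data");

    copy.push_back(4);   // only the copy changes
    alias.push_back(5);  // cfg.data changes

    assert(copy.size() == 4 && copy.back() == 4);
    assert(cfg.data.size() == 4 && cfg.data.back() == 5);
    assert(&alias == &cfg.data);
}
```

The two declarations differ by a single character, and nothing on the left-hand side says which one allocates.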


> This is a ridiculous argument.

Please keep it civil. I'm sure you could refute the GP without expressing contempt.


One thing that's related to syntax and initialization has annoyed me and many other c++ coders for a while:

https://en.wikipedia.org/wiki/Most_vexing_parse

And it's one of the motivations for the {} syntax.
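A sketch of the simplest form of the ambiguity, using a hypothetical `Widget` type and `check_vexing_parse` helper. Compilers typically warn about the first declaration, which is intentional here.

```cpp
#include <cassert>
#include <type_traits>

struct Widget { int value = 7; };  // hypothetical type

void check_vexing_parse() {
    Widget w1();  // parsed as a declaration of a function w1 returning Widget
    Widget w2{};  // braces: unambiguously an object

    static_assert(std::is_function<decltype(w1)>::value,
                  "w1 is a function declaration, not an object");
    static_assert(std::is_same<decltype(w2), Widget>::value,
                  "w2 is a Widget object");

    assert(w2.value == 7);
}
```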


Author here--some extra context to add:

The post was mostly written to point my students to, so I don't have to keep repeating myself. I get a not-insignificant number of 1st, 2nd, and 3rd years (in a 5-year program) believing C is some antiquated language and believing that they're getting held back in some way by learning C vs C++. One even suggested the department was incompetent for not teaching C++. Because I work in an engineering department, many students have not had a great deal of time learning programming fundamentals early on and regardless, they are eager to learn more advanced tools than they are ready to use. It is a bit rambling for the purpose--I have a tendency to...erm, overwhelm...with information to make my point.

I haven't blogged much so I originally submitted it at lobste.rs[1] for any advice on the writing and visual style of the site. I welcome any constructive feedback. E.g. "I hate that side-nav! It keeps popping in and out!"

Also to clarify, I am not anti-C++ in any way. I am a firm practitioner of Chesterton's fence[2] and believe in nuance. That cuts both ways. C++ is the way it is because it filled a specific need. Its greatest flaw is trying to appease everyone and, recently, trying to catch up quickly to recent QoL features in other languages. This fortunately gives it a lot of features other languages don't have, and it unfortunately gives it a lot of features other languages don't have.

Once again, the greater point being a warning, that C++ can easily become a time sink in language-specific knowledge instead of domain-specific knowledge. Of course sometimes that's what you want, e.g. when trying to optimize for performance.

[1]: https://lobste.rs/s/tul188/initialization_c_is_seriously_bon...

[2]: https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence


Why not teach rust?


I can think of at least 3 reasons! 1) I don’t control the curriculum. 2) I’m pretty sure no one knows it well enough to teach. 3) Rust’s ecosystem and adoption are still too small for it to be taught as an engineering tool and to deliver employment opportunities to students.

I personally like Rust and hope it does well. I see it as somewhat orthogonal to both C and C++.


I’m pretty sure it’s not orthogonal. The goal is to replace both of these languages in the next 10+ years.


I understand Rust's goal is to have the zero-cost abstractions of C and C++ with greater safety, effectively displacing them.

This is obviously a personal opinion, but sometimes I like being able to create bugs in my code. Not from an industrial or business standpoint, but from a greater understanding POV. It's easier to reason about the underlying machine (yes, yes I know C/C++ models abstract machines) when I can actually break that machine with the tools at hand. There's unsafe Rust which I have not looked at, but in terms of getting my hands dirty, sometimes C/C++ just feels better. The primitiveness, even of template programming vs Haskell's typeclasses, or constexpr vs D's CTFE. Something about that raw primitiveness is attractive. This is absolutely positively probably just experience bias. I'm not sure if anyone else can relate to this.

So in that regard, I see Rust as orthogonal. If you want that feeling like, "hey, I'm just directly fiddling raw virtual memory addresses", that's not Rust's target. Rust markets itself as a safe language that hides all those bits by default.


Not sure where it is but I saw a slide recently from a cpp conference that showed something like 25 different ways to initialize an integer. I wasn't at the conference but I wonder what point the speaker was making.


C++ is a little like the English language: a bunch of different pieces gathered together over a period of time, new features added to replace/fix old. I like it. (My comment copy/paste from another C++ thread.)


How many languages give you that much precise control over memory? So few.


Many sibling languages from the era in which they all originally appeared did so: Pascal (not the core language, but real-world implementations with their extensions, like Borland's), Modula-2, etc.


C, for example.



Navigation on this page is also pretty bonkers, iyam.


Yeah, I have nav labels hanging over and obscuring the text. (FF on Ubuntu)


Yikes! That's no bueno. And away it goes...


Btw, I'm also using FF on Ubuntu.


Thank you, I'm glad I'm not the only one who thinks so.



