The C standard is quite liberal in its use of undefined behaviour.
Some cases are due to conflicting existing implementations that predated the standard. (Though implementation-defined might make more sense for those?) Some are for performance reasons.
And some are utterly bizarre, like things that should plainly be syntax errors instead being declared undefined. If memory serves, that includes e.g. not closing your string literals. (Thankfully, all implementations I know of give you an error message.)
I have two responses to that. The first is that I think your specific example is mistaken. The C89 standard mandates a diagnostic message “for every translation unit that contains a violation of any syntax rule or constraint” (2.1.1.3). It also specifies that a string literal is a sequence of “any member of the source character set except the double-quote ", backslash \, or new-line character” (3.1.4). So my reading is that an attempt to include a newline in a string literal is a violation of a syntax rule that requires a diagnostic, and is therefore not undefined behavior. But I am far from an expert, and I would be glad if you could correct me if I have misunderstood.
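For concreteness, here is a minimal example (my own construction, not taken from the standard) with a raw newline inside a literal; every compiler I have tried rejects it with something like "missing terminating \" character":

    const char *s = "oops
    ";   /* the raw newline violates the string-literal grammar,
            so a diagnostic is required */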
But my second point is, the people who wrote the C89 standard were really smart. They were compiler writers from many areas of research and industry, and they worked very hard to find a workable compromise between all of their conflicting needs. I've often seen someone say something like “an unclosed string shouldn't be undefined behavior, it should be a syntax error”, and then someone from the committee would show up and say “we wanted to make it a syntax error, but we were supposed to codify existing practice. Widely-Used Compiler X didn't diagnose that, and in fact couldn't, because of …”. And the reason would be something I could never have thought of. So I have learned to give them the benefit of the doubt, because they know far more about it than I ever will.
Oh, I don't doubt that the standard authors were smart and worked within the constraints that were prevalent at the time. But the rules having a justified genesis doesn't make the outcomes any less bizarre.
(And especially for C++, most of the later features are essentially workarounds for bad ideas they had earlier.)
For the running example of unclosed string literals, I wonder why they didn't at least make it implementation-defined behaviour instead?
The standards also seem a bit split in their purpose: on the one hand, they often try to codify existing behaviour. On the other hand, they often introduce new features that take compilers years to implement. (That probably applies more to C++ than C.)
With the benefit of hindsight, lots of problems could have been avoided if C had come with a module system a bit more sophisticated than automated copy-and-paste of include files via the preprocessor.
GCC has lots of flags that give you slightly different dialects of C. The Linux kernel, for example, tells GCC never to delete null-pointer checks and to let signed integers wrap on overflow.
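As a sketch of what those flags change (my own toy example, not kernel code; the relevant GCC flags are -fno-delete-null-pointer-checks and -fno-strict-overflow / -fwrapv):

    #include <stddef.h>

    struct thing { int field; };

    int read_field(struct thing *p) {
        int v = p->field;   /* dereference happens first (a bug, but common) */
        if (p == NULL)      /* with plain gcc -O2 this check can be deleted:
                               the dereference lets GCC assume p != NULL;
                               -fno-delete-null-pointer-checks keeps it */
            return -1;
        return v;
    }

    int will_wrap(int x) {
        return x + 1 > x;   /* under strict C semantics signed overflow is UB,
                               so GCC may fold this to 1; with -fwrapv the
                               overflow wraps and this is 0 for x == INT_MAX */
    }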
Haskell also has different dialects, but in addition to compiler flags, you can specify which variant of the language you are using via pragmas at the top of each file. In most cases, you can mix and match modules written in different dialects, because they all get translated to a more stable intermediate representation before they are combined.
C doesn't have that luxury with its include files.
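To make that concrete: because #include is textual, a definition in one header silently rewrites the next. A sketch, with hypothetical file names:

    /* a.h (hypothetical) */
    #define min(x, y) ((x) < (y) ? (x) : (y))

    /* b.h (hypothetical): wants an ordinary function named min */
    int min(int x, int y);

    /* main.c */
    #include "a.h"
    #include "b.h"   /* b.h's declaration is macro-expanded into nonsense
                        before the compiler proper ever sees it; whether
                        this compiles depends purely on include order */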
For C++, there's a long-running proposal to add modules to the language. But from everything I've read, thanks to all the other complicated features in the language, modules are unlikely to work well for C++. (I'm mostly basing that on https://vector-of-bool.github.io/2019/01/27/modules-doa.html )
Enough ranting. Summary: I agree that the standard authors made the best effort they could, given the situation. That doesn't make the languages any more sensible, though.
Making syntax errors UB instead of specifying that the program must refuse to compile may seem unfortunate, but it does have some justification: it allows implementations to add extensions to the language and still claim compliance with the standard. If the standard mandated that any syntax not allowed by the standard must cause a compilation error, any compiler adding extensions to C syntax would be in breach of the standard.
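For example, GCC's statement-expression extension is not ISO C syntax, yet GCC still offers a conforming mode (a small sketch; the flag spellings are GCC's):

    /* Accepted by: gcc -std=c99 -c ext.c
       Rejected by: gcc -std=c99 -pedantic-errors -c ext.c
       ("ISO C forbids braced-groups within expressions") */
    #define square(x) ({ int t_ = (x); t_ * t_; })

    int f(void) { return square(3); }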
The standard could mark that behaviour as implementation-defined instead of undefined.
But it's really a non-issue: compilers like GCC don't claim to be standards-compliant in all modes and under every combination of options. They are happy enough to have some combination of command-line options that makes them behave according to a specified standard.
No, I don't. I think that extensions are extensions, which by definition are not part of the standard.
Compilers should allow extensions, but the standard does not necessarily have to. I'm not saying it should be done this way, but it is perfectly possible to define a strict standard and leave extensions out of it.
And that's what's happening in practice for some features.
GCC has a ton of options, and only a few of the myriad combinations give you a compiler that behaves strictly according to one of the C standards. And that's just fine.
In fact, compiler vendors experimenting, even in ways the standard does not allow, is one of the main avenues for coming up with new ideas for how to evolve the standard.
And minor variations on semantics that are not allowed under the standard are arguably more in the spirit of C than, say, feeding the whole source file to a Python interpreter whenever some string literal anywhere in the file isn't closed on the same line.
The former violates the standard, the latter complies.
The only sense in which an unterminated string literal can be UB is something like this.
If a string literal is not terminated, that may cause a de facto overly long string-literal token to exist. For instance:

    const char *str = "short str; int x = 42; and now a really long line follows ...
If the non-termination causes a de facto large literal to exist in the program, and that literal exceeds an implementation limit (C89's translation limits only guarantee 509 characters in a string literal, 2.2.4.1), then that is UB. Probably.
Strictly speaking, since the token is never closed, it is not a string literal, and so any limitation on string literals doesn't apply to it. Unless we interpret that limit as pertaining to the implementation's tokenizer as such, which can plausibly choke on just the valid prefix of an overly long string literal.
Anyway, if a newline occurs before the minimum limit on string literal length is reached, and that newline is not escaped with a backslash, then that is a straightforward syntax error.
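For completeness, both standard-blessed ways of writing a long literal sidestep the problem entirely (a small sketch):

    const char *a = "spliced with a backslash-newline \
    so it is still one logical line";   /* line splicing, translation phase 2 */

    const char *b = "or written as adjacent literals,"
                    " which get concatenated"
                    " later in translation";   /* phase 6 in C89 terms */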