The C standard is quite liberal in its use of undefined behaviour.
Some cases are due to conflicting existing implementations that predated the standard. (Though implementation-defined might make more sense for those?) Some are for performance reasons.
And some are utterly bizarre, like things that should plainly be syntax errors instead being declared undefined. If memory serves, that includes e.g. not closing your string literals. (Thankfully, all implementations I know of give you an error message.)
I have two responses to that. The first is that I think your specific example is mistaken. The C89 standard mandates a diagnostic message “for every translation unit that contains a violation of any syntax rule or constraint” (2.1.1.3). It also specifies that a string literal is a sequence of “any member of the source character set except the double-quote ", backslash \, or new-line character” (3.1.4). So my reading is that an attempt to include a newline in a string literal is a violation of a syntax rule that requires a diagnostic, and is therefore not undefined behavior. But I am far from an expert, and I would be glad if you could correct me if I have misunderstood.
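For concreteness, here is a minimal example (my own construction, not taken from the standard) with a raw newline inside a literal; every compiler I have tried rejects it with something like "missing terminating \" character":

    const char *s = "oops
    ";   /* the raw newline violates the string-literal grammar,
            so a diagnostic is required */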
But my second point is, the people who wrote the C89 standard were really smart. They were compiler writers from many areas of research and industry, and they worked very hard to find a workable compromise between all of their conflicting needs. I've often seen someone say something like “an unclosed string shouldn't be undefined behavior, it should be a syntax error”, and then someone from the committee would show up and say “we wanted to make it a syntax error, but we were supposed to codify existing practice. Widely-Used Compiler X didn't diagnose that, and in fact couldn't, because of …”. And the reason would be something I could never have thought of. So I have learned to give them the benefit of the doubt, because they know far more about it than I ever will.
Oh, I don't doubt that the standard authors were smart and worked within the constraints that were prevalent at the time. But the rules having a justified genesis doesn't make the outcomes any less bizarre.
(And especially for C++, most of the later features are essentially workarounds for bad ideas they had earlier.)
For the running example of unclosed string literals, I wonder why they didn't at least make it implementation-defined behaviour instead?
The standards also seem a bit split in their purpose: on the one hand, they often try to codify existing behaviour. On the other hand, they often introduce new features that take compilers years to implement. (That probably applies more to C++ than C.)
With the benefit of hindsight, lots of problems could have been avoided if C had come with a module system a bit more sophisticated than automated copy-and-paste of include files via the preprocessor.
GCC has lots of flags that give you slightly different dialects of C. The Linux kernel, for example, tells GCC never to delete null-pointer checks and to let signed integers wrap on overflow.
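As a sketch of what those flags change (my own toy example, not kernel code; the relevant GCC flags are -fno-delete-null-pointer-checks and -fno-strict-overflow / -fwrapv):

    #include <stddef.h>

    struct thing { int field; };

    int read_field(struct thing *p) {
        int v = p->field;   /* dereference happens first (a bug, but common) */
        if (p == NULL)      /* with plain gcc -O2 this check can be deleted:
                               the dereference lets GCC assume p != NULL;
                               -fno-delete-null-pointer-checks keeps it */
            return -1;
        return v;
    }

    int will_wrap(int x) {
        return x + 1 > x;   /* under strict C semantics signed overflow is UB,
                               so GCC may fold this to 1; with -fwrapv the
                               overflow wraps and this is 0 for x == INT_MAX */
    }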
Haskell also has different dialects, but in addition to compiler flags, you can specify which variant of the language you are using via pragmas at the top of each file. In most cases, you can mix and match modules written in different dialects, because they all get translated to a more stable intermediate representation before they are combined.
C doesn't have that luxury with its include files.
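To make that concrete: because #include is textual, a definition in one header silently rewrites the next. A sketch, with hypothetical file names:

    /* a.h (hypothetical) */
    #define min(x, y) ((x) < (y) ? (x) : (y))

    /* b.h (hypothetical): wants an ordinary function named min */
    int min(int x, int y);

    /* main.c */
    #include "a.h"
    #include "b.h"   /* b.h's declaration is macro-expanded into nonsense
                        before the compiler proper ever sees it; whether
                        this compiles depends purely on include order */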
For C++, there's a long-running proposal to add modules to the language. But from everything I've read, thanks to all the other complicated features in the language, modules are unlikely to work well for C++. (I'm mostly basing that on https://vector-of-bool.github.io/2019/01/27/modules-doa.html )
Enough ranting. Summary: I agree that the standard authors made the best effort they could, given the situation. That doesn't make the languages any more sensible, though.
Making syntax errors UB instead of specifying that the program must refuse to compile may seem unfortunate, but it does have some justification: it allows implementations to add extensions to the language and still claim compliance with the standard. If the standard mandated that any syntax not allowed by the standard must cause a compilation error, any compiler adding extensions to C syntax would be in breach of the standard.
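For example, GCC's statement-expression extension is not ISO C syntax, yet GCC still offers a conforming mode (a small sketch; the flag spellings are GCC's):

    /* Accepted by: gcc -std=c99 -c ext.c
       Rejected by: gcc -std=c99 -pedantic-errors -c ext.c
       ("ISO C forbids braced-groups within expressions") */
    #define square(x) ({ int t_ = (x); t_ * t_; })

    int f(void) { return square(3); }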
The standard could mark that behaviour as implementation-defined instead of undefined.
But it's really a non-issue: compilers like GCC don't claim to be standards-compliant in all modes and under every combination of options. They are happy enough to have some combination of command-line options that makes them behave according to a specified standard.
No, I don't. I think that extensions are extensions, which by definition are not part of the standard.
Compilers should allow extensions, but the standard does not necessarily have to. I'm not saying it should be done this way, but it is perfectly possible to define a strict standard and leave extensions out of it.
And that's what's happening in practice for some features.
GCC has a ton of options, and only a few of the myriad combinations give you a compiler that behaves strictly according to one of the C standards. And that's just fine.
In fact, compiler vendors experimenting, even in ways the standard does not allow, is one of the main avenues for coming up with new ideas for how to evolve the standard.
And minor variations on semantics that are not allowed under the standard are arguably more in the spirit of C than, say, feeding the whole source file to a Python interpreter whenever some string literal anywhere in the file isn't closed on the same line.
The former violates the standard, the latter complies.
The only sense in which an unterminated string literal can be UB is something like this.
If a string literal is not terminated, that may cause a de facto overly long string-literal token to exist. For instance:

    const char *str = "short str; int x = 42; and now a really long line follows ...
If the non-termination causes a de facto large literal to exist in the program, and that literal exceeds an implementation limit (C89's translation limits only guarantee 509 characters in a string literal, 2.2.4.1), then that is UB. Probably.
Strictly speaking, since the token is never closed, it is not a string literal, and so any limitation on string literals doesn't apply to it. Unless we interpret that limit as pertaining to the implementation's tokenizer as such, which can plausibly choke on just the valid prefix of an overly long string literal.
Anyway, if a newline occurs before the minimum limit on string literal length is reached, and that newline is not escaped with a backslash, then that is a straightforward syntax error.
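For completeness, both standard-blessed ways of writing a long literal sidestep the problem entirely (a small sketch):

    const char *a = "spliced with a backslash-newline \
    so it is still one logical line";   /* line splicing, translation phase 2 */

    const char *b = "or written as adjacent literals,"
                    " which get concatenated"
                    " later in translation";   /* phase 6 in C89 terms */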