I'm still learning C, having not had much reason to do so until recently, so I'm not quite sure I understand this statement. How is integer handling in C treacherous (any more so than any other language, especially other languages that operate as close to the metal as C)? Is it something to do with signed vs unsigned, and beware of over/under-flow when doing math? Or is it a more insidious subtlety I'm ignorant of because it has been biding its time for a more perfect opportunity to bite me in the ass?
Take one down, pass it around, one bottle of beer on the wall.
One bottle of beer on the wall, one bottle of beer.
Take it down, pass it around, zero bottles of beer on the wall.
Zero bottles of beer on the wall, zero bottles of beer.
Take one down, pass it around, four billion, two hundred ninety-four million, nine hundred sixty-seven thousand, two hundred ninety-five bottles of beer on the wall.
(Yes, you pretty much have it in your question; the Google search you want to make to learn more is [integer overflow]).
Integer promotion rules are not simple, either, and integer width rules are far from simple.
Unfortunately, most integer issues don’t have outcomes that are nearly that amusing.
Stats were stored in an 8-bit field and were incremented at some (variable, depended on other factors) rate as the character leveled up. Once they hit 255 in a stat, the next time they leveled up their stat would be back to single digits.
This sort of behavior is silent in C, so you have to do bounds checking yourself to make sure you don't cause underflow or overflow.
Also, be careful with enumerations. Any integer can be stored into the enumerated type and, at best, the compiler will warn you (an error if you use the right compiler flags) that you haven't cast it to the enumerated type. But nothing guarantees that it's actually in the range of the enumeration.
I find this interesting - why only one more byte of overhead? That would've limited string lengths to 255. So 2 bytes would seem the minimum, and even then, how do you go to 4 bytes once memory becomes cheap without breaking everything? Using NUL-termination, the upper bound for a string is effectively the amount of memory the OS is willing to give you, and code can keep working without modification for decades.
Am I missing something here?
You could do some kind of variable int encoding scheme, where longer strings would require more bytes for length, with some overhead to indicate how many length bytes are required for each string.
It would have also been a fair bit of overhead on those old single-MHz machines. Still, almost everything you do with strings is slow, so it probably would have been manageable. As an added bonus, it's zero extra space overhead for short strings (up to 127 characters), which are the majority of strings.
In the modern era, it's probably cheap to the point of being free, because in the vast majority of cases I would expect branch prediction to largely eliminate the checks as being very predictable.
It's not necessarily a good solution, there are likely ones that are much more performant, but many solutions exist, and a few have even been implemented in widely used languages.
On a modern machine you'd need at least 4 bytes. `std::string` uses `size_t` (64-bits on 64-bit machines).
I don't think these were invented back when K&R were designing C, but it feels like something any smart computer language designer could invent if the need arose.
For example `std::string` uses `std::size_t` as a length which is 32-bit on 32-bit machines and 64-bit on 64-bit machines. When you serialise a string to protobuf (for example) it uses a variable-length prefix. Other formats could have fixed 4-byte prefixes or whatever they want (even null-terminated).
A length prefix is pretty clearly superior.
"Fundamental" in this case means "matches reality". Having a number at the beginning doesn't match reality as closely as having the string of characters in sequential memory addresses with something to terminate them.
The quick fox made the jump\0
27The quick fox made the jump
The second one requires more work to store (a character-counting routine), and needs even more work to handle variable length strings that may exceed 255-ish bytes/characters.
I'm not discounting the benefits of prefixing the length, just saying it's not more fundamental than null-terminating an arbitrary sequence of characters.
You already couldn't make this argument stick in the ASCII era, where a string can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text), ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK (Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2, DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB (end of transmission block), CAN (cancel), EM (end of medium), SUB (substitute), ESC (escape), FS (file separator), GS (group separator), RS (record separator), US (unit separator), and DEL, but Unicode makes that argument even sillier. Strings have always contained things that aren't "characters".
The real problem is no matter what in-band character you take as the magical termination character, you will have strings that want that in it, because in the general case strings can contain anything, because C is always asking you to pass them around to things as the general-purpose storage data structure. You can fix that with an escaping scheme, but now you have an escaped string, not just "a string". Since strings do indeed need to be able to carry NUL in the general case, you either must have some sort of scheme for representing them, or expect a ton of errors when things jam the distinguished character into your string when you didn't expect it. (Note that for precisely the same reasons that NUL-termination isn't a good idea, there isn't any way to "filter" wrong NULs. You can't tell.)
You might just barely be able to argue the problem is that C's library mistook NUL-terminated strings for arbitrary-sized arrays that can contain anything, but in C if you want arbitrarily-sized arrays you would then have no choice but to pass the array size around to every call that expected such a thing. The next immediately obvious thing to do is to pack the number together with the array in a struct, and lo, we're back to length-delimited strings.
No matter how you slice it, C's got a major foundational screw-up in this area somewhere. If NUL-terminated strings are the bee's knees, C's APIs still took them in way too many places where they are not appropriate, and it caused decades of serious and often exploitable bugs.
Unicode Standard (version 10.0, section 23.1 Control Codes) makes it clear that it "specifies semantics for the use" of only 9 of those ASCII control codes you mentioned, i.e. U+0009 to U+000D (HT, LF, VT, FF, CR) and U+001C to U+001F (FS, GS, RS, US). The rest of the 65 ASCII and Latin-1 control codes, except U+0085 (NEL), "constitute a higher-level protocol that is outside the scope of the Unicode Standard".
Particularly about NUL, it says: "U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage is outside the scope of the Unicode Standard, which does not require any particular formal language representation of a string or any particular usage of null."
So Unicode makes that argument less silly.
But C never claimed to support all ASCII strings. C doesn't even have strings. C just has char arrays, which are byte arrays. When strings were formalized by convention in the stdlibs, clearly the supported strings are sequences of byte values 1-255, NUL excluded. That's the character set available for strings in the stdlibs. If you insist on using stdlib strings for some other kind of strings, that's your own problem.
That is precisely my point... there is no well-supported solution in core C for arbitrary binary strings, despite C's extremely frequent use in domains that require them. If you insist on using stdlib strings for other kinds of strings, you do have a problem... but you also have no other choice. Which brings it back to being a language/library problem.
As I already alluded to, C itself doesn't have a problem with length-delimited strings, and there are plenty of libraries you can get for them. But the core library for C does force this problem in your face by leaving you no other choice, and it is a valid criticism of C.
(C is such a disaster that the only thing to do is to leave it behind as quickly as possible. However, if we were somehow stuck with the language itself, there's a lot of ways we could improve the libraries it comes with, as again demonstrated by the many such improved libraries you can get. However, one of the things I've learned from learning a ton of languages over the past couple of decades is that a language almost never manages to escape from its own standard library, and the few that manage it (like D) pay a stiff adoption price in the process. C's standard library has a real problem here, that has caused real bugs, and no amount of wordplay is going to fix those decades of bugs.)
Also, ETX might have been a good terminator :) I assume NUL was chosen for easier checking: `if (*p) ...` vs `if (*p == 0x03) ...`
But my argument was against length prefixing somehow being "more fundamental" than having just a sequence of characters "raw" in memory addresses.
0 is not a letter of the alphabet, but nor is 01000001 (ASCII 'A').
So either the first number is special, or you look for a special number to indicate the end. Neither represents reality, because the "end" of a single group of characters is visually identical to a million white-space characters that happen to fit into the emptiness that follows.
My point being, it's probably not helpful to argue which "matches reality" when they're both just abstract representations of concepts.
The difference doesn't matter much on modern (non-embedded) processors, but it did make sense at the time C was designed. It matches the most common hardware design pattern better than the alternatives.
Somehow NUL is still an assigned character in the ASCII code table. Strange, hmmm?
Ok, it's an ASCII character code point. One that's used to terminate strings. I meant it's not a character you'd find in the middle of a string, though I realize that's kinda tautological. Back when ASCII was developed, punch cards were used. Any row in the card that wasn't punched was a NUL. It wouldn't have made sense to have it in the middle of a string. It would be like missing a character altogether.
find and xargs are examples of programs with this feature.
It depends a bit on what you call a "string". If you're thinking "something a human will want to read", then yeah, there's not much need to encode NUL. If however you take a looser view of "an 8-bit vector" then encoding NUL becomes important. Otherwise your system can't be 8-bit clean.
Overall I think the null terminator has caused more problems than it has solved, but prefixing the string length isn't a panacea either. You end up with systems with 255, 65,535, or even 4,294,967,295-byte limits on their strings. It's also more difficult to pass around an index into the string, so you end up having to make lots of copies and then possibly merge them later, or your language is cluttered with index values everywhere strings are used.
It's quite possible that if K&R had gone with length prefix strings that we would have a different class of errors where the string index gets offset or malicious values are inserted in the length field.
Using find -print0 etc. is a good idea not so much because NUL is an uncommon character (the various record separators / vertical tab / ... are no more common), but because UNIX - being a C system through and through - allows any character to appear in a file name except '/' (path separator) and NUL. Thus, NUL makes a perfect separator between filenames.
The UNIX filesystem, qua filesystem, doesn't have a character set, just NUL-terminated strings. On the plus side, it's simple to handle, and means that retrofitting UTF-8 or another encoding is pretty easy. On the downside, two bytestrings that Unicode-canonicalize to the same value may name different files, which is surprising for humans.
It's notable that many of early UNIX' competitors were much more full-fledged systems, featuring full-fledged record-oriented files and typed data instead of UNIX' bytestrings-everywhere approach.
The security track record of the author's internet-facing programs is better than most. In fact, I cannot think of any author writing similar software with a better track record.
Sometimes the most popular solutions are not necessarily the best ones for every purpose. Whenever I write programs in C from scratch, I use byte.h, buffer.h, etc., from the above mentioned author. I do not use memcpy.
That said, I am not a professional programmer and I am not writing internet-facing programs for other users. I am a student of C learning how to use C, the language. If I know how the language works, then it stands to reason I should be able to use a variety of libraries, including alternatives to the "standard libraries".
Otherwise it is arguable I would be just learning how to use a standard library, not a language.
The C language has utility on its own, as a form of notation, and it is that utility which I seek to learn about. Historical records indicate the C language was in productive use for some time before there was a "standard library".
Hold up, the Z80 offered string manipulation instructions? Using adr+len, no less? You miiiight call LDIR a string manipulation instruction using adr+len, but in practice almost all Z80 machines I've seen use NUL-terminated strings and something like CPIR to find the terminator.
Null termination is part of the C language.
char str[] = "1234";
printf("%s: %zu\n", str, sizeof(str)); /* prints 1234: 5 -- the terminating NUL is counted */
A couple of ugly corner cases:
const char str[] = "abcd\0efgh";
printf("length = %zu, size = %zu, value = \"%s\"\n",
strlen(str), sizeof str, str);
length = 4, size = 10, value = "abcd"
const char str[] = "abcd";
printf("length = %zu, size = %zu, value = \"%s\"\n",
strlen(str), sizeof str, str);
length = 4, size = 5, value = "abcd"
This is explicitly a different kind of representation in memory.
OS safety was already a concern in 1961.