Did Ken, Ritchie and Brian choose wrong with NUL-terminated text strings? (2011) (acm.org)
55 points by mayankkaizen 6 months ago | 55 comments



Worth keeping in mind that the security cost here is illusory: the track record of length-delimited data structures isn't much better than that of ASCIIZ, especially since integer handling in C is so treacherous.


> ...especially since integer handling in C is so treacherous.

I'm still learning C, having not had much reason to do so until recently, so I'm not quite sure I understand this statement. How is integer handling in C treacherous (any more so than any other language, especially other languages that operate as close to the metal as C)? Is it something to do with signed vs unsigned, and beware of over/under-flow when doing math? Or is it a more insidious subtlety I'm ignorant of because it has been biding its time for a more perfect opportunity to bite me in the ass?


Two bottles of beer on the wall, two bottles of beer.

Take one down, pass it around, one bottle of beer on the wall.

One bottle of beer on the wall, one bottle of beer.

Take it down, pass it around, zero bottles of beer on the wall.

Zero bottles of beer on the wall, zero bottles of beer.

Take one down, pass it around, four billion, two hundred ninety-four million, nine hundred sixty-seven thousand, two hundred ninety-five bottles of beer on the wall.

(Yes, you pretty much have it in your question; the Google search you want to make to learn more is [integer overflow]).
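
For the curious, a minimal C sketch of the wraparound above (assuming a 32-bit unsigned int):

    #include <stdio.h>

    int main(void) {
        unsigned int bottles = 0;
        bottles--;  /* unsigned arithmetic wraps modulo 2^32 */
        printf("%u bottles of beer on the wall\n", bottles);  /* 4294967295 */
        return 0;
    }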


And various undefined behaviours (signed overflow etc.) colluding against the programmer to the effect that compilers sometimes delete range-checking code.

Integer promotion rules are not simple, either. Integer width rules, far from simple.
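
A hedged sketch of how that plays out: since signed overflow is undefined behavior, the compiler is allowed to assume x + 1 never wraps, and may fold the first check below to a constant false:

    #include <limits.h>

    /* Intended overflow check; may be deleted outright by the optimizer,
       because x + 1 < x can only be true if signed overflow occurs. */
    int will_overflow(int x) {
        return x + 1 < x;
    }

    /* Well-defined version: test against the limit before adding. */
    int will_overflow_safe(int x) {
        return x == INT_MAX;
    }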


Here’s a lovely real-world example, discussed just recently here on HN: https://kotaku.com/why-gandhi-is-such-an-asshole-in-civiliza...

Unfortunately, most integer issues don’t have outcomes that are nearly that amusing.


A bug in an early XCom game led to characters becoming so fast that they couldn't leave the landing zone.

Stats were stored in an 8-bit field and were incremented at some (variable, depended on other factors) rate as the character leveled up. Once they hit 255 in a stat, the next time they leveled up their stat would be back to single digits.

This sort of behavior is silent in C, so you have to do the bounds checking yourself to make sure you don't cause underflow or overflow.
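
A small sketch of that failure mode, with a hypothetical checked version (the names are illustrative, not from the game):

    #include <stdint.h>

    /* Silent wraparound: 250 + 10 is stored as 4 in a uint8_t. */
    void level_up(uint8_t *stat, uint8_t gain) {
        *stat += gain;
    }

    /* Manual bounds check: saturate at 255 instead of wrapping. */
    void level_up_checked(uint8_t *stat, uint8_t gain) {
        *stat = (*stat > 255 - gain) ? 255 : *stat + gain;
    }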

Also, be careful with enumerations. Any integer can be stored in an enumerated type and, at best, the compiler will warn you (or error, with the right compiler flags) that you haven't cast it to the enumerated type. But nothing guarantees that the value is actually in the range of the enumeration.
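
For example (plain C, nothing exotic):

    enum color { RED, GREEN, BLUE };  /* valid values: 0, 1, 2 */

    void example(void) {
        enum color c = (enum color)42;  /* compiles; the cast silences the warning */
        /* c now holds a value outside the enumeration, and a switch over
           RED/GREEN/BLUE will quietly match none of its cases. */
    }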


Overflow of a signed integer - even in intermediate values - is technically undefined behavior, with the concomitant nasal demons, etc.


I would claim that there is still a difference between P-strings and, say, ASN.1 DER (which, at heart, is still TLV).


>Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory.

I find this interesting - why only one more byte of overhead? That would've limited string lengths to 256. So 2 bytes would seem the minimum, and even then, how do you go to 4 bytes once memory becomes cheap without breaking everything? Using NUL-termination, the upper bound for a string is effectively the amount of memory the OS is willing to give you, and code can keep working without modification for decades.

Am I missing something here?


1 additional byte of overhead would give you 2 bytes for the length, since you wouldn't have to have the NUL byte at the end of the string.

You could do some kind of variable int encoding scheme, where longer strings would require more bytes for length, with some overhead to indicate how many length bytes are required for each string.


You don't even need that length prefix if you use something like SDNV. Ironically this would have introduced a case where you have something like a null terminator that can overflow.

It would have also been a fair bit of overhead on those old single-MHz machines. Still, almost everything you do with strings is slow, so it probably would have been manageable. As an added bonus, it is zero space overhead for short strings (up to 127 characters), which are the majority of strings.


In the PDP era, they would have noticed the overhead of a variable-length int at the beginning that they would have to decode.

In the modern era, it's probably cheap to the point of being free, because in the vast majority of cases I would expect branch prediction to largely eliminate the checks as being very predictable.


Even in the PDP era I am not so sure, given the architectures and OSes being developed outside AT&T, some of them in production.


So then....they chose correctly?


Despite being vigorously opposed to NUL-termination in the modern era, yes, I would not criticize the people who actually made the decision for the PDP. They had no real reason to believe we'd still be discussing that decision 60 years later.


Two bytes extra for the length, but one byte saved from the marker character.


There are various schemes to address this. A simple one is to make the length prefix variable, and use the high-order bit of each byte to indicate that another length byte follows the current one. This gives you 128 possible values for each length byte. The max string size for a single-byte prefix would be 127 (2^7 - 1), the max size for two bytes would be 16383 (2^14 - 1), etc.

It's not necessarily a good solution, and there are likely ones that are much more performant, but many solutions exist, and a few have even been implemented in widely used languages.
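
As a concrete sketch of such a continuation-bit scheme (little-endian 7-bit groups here; real formats such as SDNV or protobuf varints differ in details, so treat this as illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode len 7 bits at a time, setting the high bit on every byte
       except the last. Returns the number of prefix bytes written. */
    size_t encode_len(uint8_t *buf, size_t len) {
        size_t n = 0;
        do {
            uint8_t b = len & 0x7F;
            len >>= 7;
            buf[n++] = len ? (b | 0x80) : b;
        } while (len);
        return n;
    }

    /* Decode a prefix written by encode_len; returns its size in bytes. */
    size_t decode_len(const uint8_t *buf, size_t *len) {
        size_t n = 0, shift = 0;
        *len = 0;
        do {
            *len |= (size_t)(buf[n] & 0x7F) << shift;
            shift += 7;
        } while (buf[n++] & 0x80);
        return n;
    }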


I presume the machine at the time was 16-bit, and thus you could never have a string longer than 65536 bytes in memory.

On a modern machine you'd need at least 4 bytes. `std::string` uses `size_t` (64-bits on 64-bit machines).


Length itself could also be stored with variable length encoding.


That's what I mean though - if you standardize, as part of the language, that you can have 2 bytes for the string length because at the time no machine could handle anything larger, how do you move forward once 4 or 8 bytes is not a big deal without breaking everything?


Best solution would be something like SDNV. Short strings (up to 127 characters) have no length penalty and you can handle strings of any length.

I don't think these were invented back when K&R were designing C, but it feels like something any smart computer language designer could invent if the need arose.


N-bit machines can use N-bit lengths for strings in memory. If you're storing strings your storage format has to specify a way to decide the prefix length (or you can use some other system).

For example `std::string` uses `std::size_t` as a length which is 32-bit on 32-bit machines and 64-bit on 64-bit machines. When you serialise a string to protobuf (for example) it uses a variable-length prefix. Other formats could have fixed 4-byte prefixes or whatever they want (even null-terminated).



A null-terminated representation is as close to a fundamental datatype for a string as possible. It is the same in spirit as other fundamental data types in C, like the array. People have built abstractions over these fundamental datatypes over the years.


If it isn't in libc, it doesn't get used by anyone wanting to write portable APIs.


It's not fundamental at all. You can't even represent null bytes in a null-terminated string.

A length prefix is pretty clearly superior.


A string contains characters. NUL is not a character; it's nothing.

"Fundamental" in this case means "matches reality". Having a number at the beginning doesn't match reality as closely as having the string of characters in sequential memory addresses with something to terminate them.

The quick fox made the jump\0

or

27The quick fox made the jump

The second one requires more work to store (a character-counting routine), and needs even more work to handle variable length strings that may exceed 255-ish bytes/characters.

I'm not discounting the benefits of prefixing the length, just saying it's not more fundamental than null-terminating an arbitrary sequence of characters.


"A string contains characters. NUL is not a character; it's nothing."

You already couldn't make this argument stick in the ASCII era, where a string can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text), ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK (Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2, DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB (end of transmission block), CAN (cancel), EM (end of medium), SUB (substitute), ESC (escape), FS (file separator), GS (group separator), RS (record separator), US (unit separator), and DEL, but Unicode makes that argument even sillier. Strings have always contained things that aren't "characters".

The real problem is that no matter what in-band character you take as the magical termination character, you will have strings that need to contain it, because in the general case strings can contain anything, and C is always asking you to pass them around as the general-purpose storage data structure. You can fix that with an escaping scheme, but now you have an escaped string, not just "a string". Since strings do indeed need to be able to carry NUL in the general case, you either must have some sort of scheme for representing them, or expect a ton of errors when things jam the distinguished character into your string when you didn't expect it. (Note that for precisely the same reasons that NUL-termination isn't a good idea, there isn't any way to "filter" wrong NULs. You can't tell.)

You might just barely be able to argue the problem is that C's library mistook NUL-terminated strings for arbitrary-sized arrays that can contain anything, but in C if you want arbitrarily-sized arrays you would then have no choice but to pass the array size around to every call that expected such a thing. The next immediately obvious thing to do is to pack the number together with the array in a struct, and lo, we're back to length-delimited strings.
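
For concreteness, a minimal sketch of that "next obvious thing" (the names are hypothetical; plenty of real C string libraries look roughly like this):

    #include <stddef.h>
    #include <string.h>

    /* A counted string: the length travels with the pointer,
       and embedded NUL bytes are perfectly legal. */
    struct str {
        char   *data;
        size_t  len;
    };

    /* No terminator scan needed; compare lengths, then bytes. */
    int str_eq(struct str a, struct str b) {
        return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
    }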

No matter how you slice it, C's got a major foundational screw-up in this area somewhere. If NUL-terminated strings are the bee's knees, C's APIs still took them in way too many places where they are not appropriate, and it caused decades of serious and often exploitable bugs.


> but Unicode makes that argument even sillier

Unicode Standard (version 10.0, section 23.1 Control Codes) makes it clear that it "specifies semantics for the use" of only 9 of those ASCII control codes you mentioned, i.e. U+0009 to U+000D (HT, LF, VT, FF, CR) and U+001C to U+001F (FS, GS, RS, US). The rest of the 65 ASCII and Latin-1 control codes, except U+0085 (NEL), "constitute a higher-level protocol that is outside the scope of the Unicode Standard".

Particularly about NUL, it says: "U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage is outside the scope of the Unicode Standard, which does not require any particular formal language representation of a string or any particular usage of null."

So Unicode makes that argument less silly.


NUL is a character in the ASCII character set. That is a problem because you cannot create all the strings composed of ASCII characters in C.

But C never claimed to support all ASCII strings. C doesn't even have strings. C just has char arrays, which are byte arrays. When strings were formalized by convention in the stdlibs, clearly the supported strings were those made of byte values 1-255, NUL excluded. That's the character set available for strings in the stdlibs. If you insist on using stdlib strings for some other kind of strings, that's your own problem.


"But C never claimed to support all ASCII strings."

That is precisely my point... there is no well-supported solution in core C for arbitrary binary strings, despite C's extremely frequent use in domains that require them. If you insist on using stdlib strings for other kinds of strings, you do have a problem... but you also have no other choice. Which brings it back to being a language/library problem.

As I already alluded to, C itself doesn't have a problem with length-delimited strings, and there are plenty of libraries you can get for them. But the core library for C does force this problem in your face by leaving you no other choice, and it is a valid criticism of C.

(C is such a disaster that the only thing to do is to leave it behind as quickly as possible. However, if we were somehow stuck with the language itself, there's a lot of ways we could improve the libraries it comes with, as again demonstrated by the many such improved libraries you can get. However, one of the things I've learned from learning a ton of languages over the past couple of decades is that a language almost never manages to escape from its own standard library, and the few that manage it (like D) pay a stiff adoption price in the process. C's standard library has a real problem here, that has caused real bugs, and no amount of wordplay is going to fix those decades of bugs.)


Good point. And I would agree that the error lies in choosing to use strings for inappropriate places.

Also, ETX might have been a good terminator :) I assume NUL was chosen for easier checking (if (char) ...) vs (if (char == 0x03) ...)

But my argument was against length prefixing somehow being "more fundamental" than having just a sequence of characters "raw" in memory addresses.


None of it really "matches reality." It's all binary numbers, and on a deeper level, voltages or magnetized particles.

0 is not a letter of the alphabet, but nor is 01000001 (ASCII 'A').

So either the first number is special, or you look for a special number to indicate the end. Neither represents reality, because the "end" of a single group of characters is visually identical to a million white-space characters that happen to fit into the emptiness that follows.

My point being, it's probably not helpful to argue which "matches reality" when they're both just abstract representations of concepts.


I was going more toward "closer to reality". But I take your point. Somewhere we're going to need extra info about the string itself, whether that extra info is a magic terminator or a magic prefix. The magic prefix gives great benefit, but also is more complex to implement if you want to store an arbitrary-length string.


Most CPUs have a flags register, and typically have a "zero" flag which is set when the result of the last operation was zero. Zero is special in the vast majority of hardware designs. Checking for null (zero) instead of another specific value often saves a few cycles. That's where the optimization of having all FOR loops count down towards zero comes from, the check saves a cycle or two each iteration on some CPUs. The same thing happens when reading from a buffer, the load instruction will set the Zero flag when the terminating null is read.

The difference doesn't matter much on modern (non-embedded) processors, but it did make sense at the time C was designed. It matches the most common hardware design pattern better than the alternatives.
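
A sketch of why the zero test is so cheap: in a NUL scan, the loop condition is just "did the last load produce zero?", which maps straight onto the zero flag with no extra comparison constant:

    #include <stddef.h>

    /* Minimal strlen: the while (*p) test compiles to a load plus a
       check of the zero flag on most architectures. */
    size_t my_strlen(const char *s) {
        const char *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }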


> NUL is not a character

Somehow NUL is still an assigned character in the ASCII code table. Strange, hmmm?


Ha, the fact that I spelled it "NUL" instead of using the word "null" should have made me pause. :)

Ok, it's an ASCII character code point. One that's used to terminate strings. I meant it's not a character you'd find in the middle of a string, though I realize that's kinda tautological. Back when ASCII was developed, punch cards were used. Any row in the card that wasn't punched was a NUL. It wouldn't have made sense to have it in the middle of a string. It would be like missing a character altogether.


Why would you want to represent a null byte in a string? Is there a character encoding where the null value has a meaning?


Interestingly, some Unix command-line utilities will send null-separated records if you pass a flag (often -0), because it's the least likely character to show up as part of the string.

find and xargs are examples of programs with this feature.

It depends a bit on what you call a "string". If you're thinking "something a human will want to read", then yeah, there's not much need to encode null. If however you take a looser view of "an 8-bit vector", then encoding null becomes important. Otherwise your system can't be 8-bit clean.

Overall I think the null terminator has caused more problems than it has solved, but prefixing the string length isn't a panacea either. You end up with systems with 256, 65536, or even 4294967296 byte limits on their strings. It's also more difficult to pass around an index into the string so you end up having to make lots of copies and then possibly merge them later or your language is cluttered with index values everywhere strings are used.

It's quite possible that if K&R had gone with length prefix strings that we would have a different class of errors where the string index gets offset or malicious values are inserted in the length field.


Elaborating on the NUL bytes on the command-line, e.g. find -print0 | xargs -0:

Using find -print0 etc. is a good idea not so much because NUL is an uncommon character (the various record separators / vertical tab / ... are no more common), but because UNIX - being a C system through and through - allows any character to appear in a file name except '/' (path separator) and NUL. Thus, NUL makes a perfect separator between filenames.


Then it sounds like 'find' f'ed up, if, when these things are passed around, they are not escaped properly (not saying this is the case). Just like today with various charsets, whenever there is a charset boundary, say between bytes and C library strings, which is what this is, there has to be a charset conversion.


By default, find separates by newline; this is human-friendly, but breaks if an attacker/script/... puts a newline in the filename.

The UNIX filesystem, qua filesystem, doesn't have a character set, just NUL-terminated strings. On the plus side, it's simple to handle, and means that retrofitting UTF-8 or another encoding is pretty easy. On the downside, two bytestrings that Unicode-canonicalize to the same value may name different files, which is surprising for humans.

It's notable that many of early UNIX' competitors were much more full-fledged systems, featuring full-fledged record-oriented files and typed data instead of UNIX' bytestrings-everywhere approach.


Because you just read it from a file and don't want to corrupt it?


There is one author who does not use memcpy, strcpy, etc. He wrote his own "standard" C library routines. Others have used these routines; they are public domain.

The security track record of the author's internet-facing programs is better than most. In fact, I cannot think of any author writing similar software with a better track record.

Sometimes the most popular solutions are not necessarily the best ones for every purpose. Whenever I write programs in C from scratch, I use byte.h, buffer.h, etc., from the above mentioned author. I do not use memcpy.

In doing this, I am not a professional programmer and I am not writing internet-facing programs for other users. I am a student of C learning how to use C, the language. If I know how the language works, then it stands to reason I should be able to use a variety of libraries, including alternatives to the "standard libraries".

Otherwise it is arguable I would be just learning how to use a standard library, not a language.

The C language has utility on its own, as a form of notation, and it is that utility which I seek to learn about. Historical records indicate the C language was in productive use for some time before there was a "standard library".


Who is this author, and where is his code? I'd like to study it -- I might learn something.


Almost certainly D. J. Bernstein's string API from daemontools, http://cr.yp.to/daemontools.html and his other bits of code. Some commentary at http://www.and.org/vstr/comparison . Probably more commentary elsewhere.


>The CPUs that offered string manipulation instructions—for example, Z-80 and DEC VAX—did so in terms of the far more widespread adr+len model.

Hold up, the z80 offered string manipulation instructions? Using adr+len, no less? You miiiight call LDIR a string manipulation instruction using adr+len, but in practice almost all z80 machines I've seen use NUL-terminated strings and something like CPIR to find the terminator.


The title should have "Dennis" instead of "Ritchie".


Null terminated strings are just application level logic. "strings" are just bytes in memory. There are no strings.


No. If you quote a string in C, "string data here", you get a piece of data with null termination.

Null termination is part of the C language.

Example:

    char str[] = "1234";
    printf("%s: %lu", str, sizeof(str));
Prints: 1234: 5


Use "%zu" to print a value of type size_t.

A couple of ugly corner cases:

    const char str[] = "abcd\0efgh";
    printf("length = %zu, size = %zu, value = \"%s\"\n",
           strlen(str), sizeof str, str);
output:

    length = 4, size = 10, value = "abcd"
And:

    const char str[4] = "abcd";
    printf("length = %zu, size = %zu, value = \"%s\"\n",
           strlen(str), sizeof str, str);
This has undefined behavior. (Which is a good reason to let the compiler figure out how big the array has to be. Computers are better at counting things than you are. Let them.)


> Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

These are explicitly two different kinds of representation in memory.


Maybe - but the people who followed definitely chose poorly by continuing to use null terminated strings.


Did they choose "wrong"? No. Did they make a choice and move on? Yes.


Yes, because there were already safer systems programming languages being used by the time C got invented.

OS safety was already a concern in 1961.



