
Did Ken, Ritchie and Brian choose wrong with NUL-terminated text strings? (2011) - mayankkaizen
https://queue.acm.org/detail.cfm?id=2010365
======
tptacek
Worth keeping in mind that the security cost here is illusory: the track
record of length-delimited data structures isn't much better than that of
ASCIIZ, especially since integer handling in C is so treacherous.

~~~
korethr
> ...especially since integer handling in C is so treacherous.

I'm still learning C, having not had much reason to do so until recently, so
I'm not quite sure I understand this statement. How is integer handling in C
treacherous (any more so than any other language, especially other languages
that operate as close to the metal as C)? Is it something to do with signed vs
unsigned, and beware of over/under-flow when doing math? Or is it a more
insidious subtlety I'm ignorant of because it has been biding its time for a
more perfect opportunity to bite me in the ass?

~~~
tptacek
Two bottles of beer on the wall, two bottles of beer.

Take one down, pass it around, one bottle of beer on the wall.

One bottle of beer on the wall, one bottle of beer.

Take it down, pass it around, zero bottles of beer on the wall.

Zero bottles of beer on the wall, zero bottles of beer.

Take one down, pass it around, four billion, two hundred ninety-four million,
nine hundred sixty-seven thousand, two hundred ninety-five bottles of beer on
the wall.

(Yes, you pretty much have it in your question; the Google search you want to
make to learn more is [integer overflow]).
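
The last verse can be reproduced in a few lines of C. This is only a sketch: it assumes a 32-bit unsigned counter, and the function names `take_one_down` and `take_one_down_checked` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: take one bottle down with no check for an empty wall.
 * Unsigned arithmetic is defined to wrap modulo 2^32, so 0 - 1 becomes
 * 4294967295 rather than an error. */
uint32_t take_one_down(uint32_t bottles)
{
    return bottles - 1u;
}

/* Checked variant: returns 1 and stores the new count on success,
 * returns 0 and leaves *out untouched if the wall is already empty. */
int take_one_down_checked(uint32_t bottles, uint32_t *out)
{
    if (bottles == 0u)
        return 0;
    *out = bottles - 1u;
    return 1;
}
```

Calling `take_one_down(0)` yields 4294967295, which is the number sung in the final verse. The wraparound itself is well defined for unsigned types; the trouble starts when that value is then used as a length or an index.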

~~~
zerokernel
And various undefined behaviours (signed overflow etc.) colluding against the
programmer to the effect that compilers sometimes delete range-checking code.

Integer promotion rules are not simple, either. Integer width rules, far from
simple.
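
A rough illustration of both points, with hypothetical function names. The tempting check `a + b < a` depends on signed overflow, which is undefined, so an optimizer is allowed to assume it never happens and drop the test. Checking against the limits before adding avoids that. The second function shows promotion: two `unsigned char` operands are widened to `int` before the addition (assuming, as on mainstream platforms, that `int` can hold every `unsigned char` value).

```c
#include <limits.h>

/* Overflow-safe signed addition: test the operands against INT_MAX/INT_MIN
 * before adding, so the addition itself can never overflow.
 * Returns 1 and stores the sum on success, 0 if the sum would not fit. */
int add_checked(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}

/* Integer promotion: x + y is computed as int, so 200 + 100 is 300 here,
 * not 44 as it would be if the arithmetic stayed in 8 bits. */
int sum_fits_in_byte(unsigned char x, unsigned char y)
{
    return x + y <= UCHAR_MAX;
}
```

With these, `add_checked(INT_MAX, 1, &s)` reports failure instead of overflowing, and `sum_fits_in_byte(200, 100)` returns 0.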

------
PopsiclePete
>Using an address + length format would cost one more byte of overhead than an
address + magic_marker format, and their PDP computer had limited core memory.

I find this interesting - why only _one_ more byte of overhead? That would've
limited string lengths to 255. So 2 bytes would seem the minimum, and even
then, how do you go to 4 bytes once memory becomes cheap without breaking
everything? Using NUL-termination, the upper bound for a string is effectively
the amount of memory the OS is willing to give you, and code can keep working
without modification for decades.

Am I missing something here?

~~~
tbirdz
1 additional byte of overhead would give you 2 bytes for the length, since you
wouldn't have to have the NUL byte at the end of the string.

You could do some kind of variable int encoding scheme, where longer strings
would require more bytes for length, with some overhead to indicate how many
length bytes are required for each string.
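
One way such a variable-length prefix could look is sketched below. This is an assumption, not something from the article: it uses a LEB128-style encoding (7 payload bits per byte, high bit set when more bytes follow), caps lengths at 32 bits, and the names `varlen_encode` and `varlen_decode` are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode len as 7-bit groups, least significant group first.
 * The high bit of each byte says "another byte follows".
 * out must have room for 5 bytes. Returns the number of bytes written. */
size_t varlen_encode(uint32_t len, unsigned char *out)
{
    size_t n = 0;
    while (len >= 0x80u) {
        out[n++] = (unsigned char)((len & 0x7Fu) | 0x80u);
        len >>= 7;
    }
    out[n++] = (unsigned char)len;
    return n;
}

/* Decode a length prefix from at most avail bytes.
 * Returns the number of bytes consumed, or 0 if the input is truncated,
 * longer than 5 bytes, or does not fit in 32 bits. */
size_t varlen_decode(const unsigned char *in, size_t avail, uint32_t *len)
{
    uint32_t value = 0;
    size_t i;
    for (i = 0; i < avail && i < 5; i++) {
        if (i == 4 && in[i] > 0x0Fu)
            return 0;                       /* would exceed 32 bits */
        value |= (uint32_t)(in[i] & 0x7Fu) << (7 * i);
        if ((in[i] & 0x80u) == 0) {
            *len = value;
            return i + 1;
        }
    }
    return 0;
}
```

Under this scheme a string shorter than 128 bytes carries one byte of length, the same overhead as a trailing NUL, and longer strings pay one extra byte per additional 7 bits of length.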

~~~
jerf
In the PDP era, they would have noticed the overhead of having a variable-
length int at the beginning they would have to decode.

In the modern era, it's probably cheap to the point of being free, because in
the vast majority of cases I would expect branch prediction to largely
eliminate the checks as being very predictable.

~~~
PopsiclePete
So then....they chose correctly?

~~~
jerf
Despite being vigorously opposed to NUL-termination in the modern era, yes, I
would not criticize the people who actually made the decision for the PDP.
They had no real reason to believe we'd still be discussing that decision 60
years later.

------
chrisaycock
Many prior comments over the years:

[https://news.ycombinator.com/item?id=2837571](https://news.ycombinator.com/item?id=2837571)

[https://news.ycombinator.com/item?id=7572711](https://news.ycombinator.com/item?id=7572711)

[https://news.ycombinator.com/item?id=3892410](https://news.ycombinator.com/item?id=3892410)

[https://news.ycombinator.com/item?id=8385306](https://news.ycombinator.com/item?id=8385306)

------
aknoob
A null-terminated representation is as close to a fundamental datatype for a
string as possible. It is the same in spirit as other fundamental data types in
C, like arrays. People have built abstractions over these fundamental datatypes
over the years.

~~~
IshKebab
It's not fundamental at all. You can't even represent null bytes in a null-
terminated string.

A length prefix is pretty clearly superior.

~~~
function_seven
A string contains characters. NUL is not a character; it's nothing.

"Fundamental" in this case means "matches reality". Having a number at the
beginning doesn't match reality as closely as having the string of characters
in sequential memory addresses with something to terminate them.

The quick fox made the jump\N

or

27The quick fox made the jump

The second one requires more work to store (a character-counting routine), and
needs even more work to handle variable length strings that may exceed 255-ish
bytes/characters.

I'm not discounting the benefits of prefixing the length, just saying it's not
more fundamental than null-terminating an arbitrary sequence of characters.

~~~
jerf
"A string contains characters. NUL is not a character; it's nothing."

You already couldn't make this argument stick in the ASCII era, where a string
can't contain NUL but can contain SOH (Start of Heading), STX (Start of Text),
ETX (End of Text), EOT (End of Transmission), ENQ (Enquiry), ACK
(Acknowledge), BEL, BS, HT (horizontal tab), LF, VT (vertical tab), FF (form
feed), CR, SO (shift out), SI (shift in), DLE (data link escape), DC1, DC2,
DC3, DC4 (device control 1-4), NAK (negative ACK), SYN (synchronous idle), ETB
(end of transmission block), CAN (cancel), EM (end of medium), SUB
(substitute), ESC (escape), FS (file separator), GS (group separator), RS
(record separator), US (unit separator), and DEL, but Unicode makes that
argument even sillier. Strings have always contained things that aren't
"characters".

The real problem is no matter what in-band character you take as the magical
termination character, you will have strings that want that in it, because in
the general case strings can contain anything, because C is always asking you
to pass them around to things as the general-purpose storage data structure.
You can fix that with an escaping scheme, but now you have an _escaped_
string, not just "a string". Since strings do indeed need to be able to carry
NUL in the general case, you either _must_ have some sort of scheme for
representing them, or expect a ton of errors when things jam the distinguished
character into your string when you didn't expect it. (Note that for precisely
the same reasons that NUL-termination isn't a good idea, there isn't any way
to "filter" wrong NULs. You can't tell.)

You might just barely be able to argue the problem is that C's _library_
mistook NUL-terminated strings for arbitrary-sized arrays that can contain
anything, but in C if you want arbitrarily-sized arrays you would then have no
choice but to pass the array size around to every call that expected such a
thing. The next immediately obvious thing to do is to pack the number together
with the array in a struct, and lo, we're back to length-delimited strings.

No matter how you slice it, C's got a major foundational screw-up in this area
somewhere. If NUL-terminated strings are the bee's knees, C's APIs still took
them in _way_ too many places where they are not appropriate, and it caused
decades of serious and often exploitable bugs.
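
The "pack the number together with the array in a struct" step might look like the sketch below. The type `struct byte_str` and both functions are hypothetical names for illustration, not part of any standard library, and the struct is a non-owning view that leaves allocation to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-delimited string: a pointer plus an explicit length.
 * Any byte value, including NUL, may appear in data[0..len-1]. */
struct byte_str {
    const unsigned char *data;
    size_t len;
};

/* Wrap an existing buffer; no copy is made. */
struct byte_str byte_str_from(const void *data, size_t len)
{
    struct byte_str s;
    s.data = data;
    s.len = len;
    return s;
}

/* Compare all len bytes, so embedded NULs do not end the comparison early. */
int byte_str_equal(struct byte_str a, struct byte_str b)
{
    if (a.len != b.len)
        return 0;
    return a.len == 0 || memcmp(a.data, b.data, a.len) == 0;
}
```

For two 9-byte buffers holding `"abcd\0efgh"` and `"abcd\0efgX"`, `strcmp` reports them equal because it stops at the first NUL, while `byte_str_equal` sees the difference in the final byte.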

~~~
pishpash
NUL is a character in the ASCII character set. That is a problem because you
cannot create all the strings composed of ASCII characters in C.

But C never claimed to support all ASCII strings. C doesn't even have strings.
C just has char arrays, which are byte arrays. When strings were formalized by
convention in the stdlibs, clearly the supported strings are sequences of the
byte values 1-255, NUL excluded. That's the character set available for strings
in the stdlibs.
If you insist on using stdlib strings for some other kind of strings, that's
your own problem.

~~~
jerf
"But C never claimed to support all ASCII strings."

That is precisely my point... there _is_ no well-supported solution in core C
for arbitrary binary strings, despite C's _extremely_ frequent use in domains
that require them. If you insist on using stdlib strings for other kinds of
strings, you _do_ have a problem... but you also have _no other choice_. Which
brings it back to being a language/library problem.

As I already alluded to, C itself doesn't have a problem with length-delimited
strings, and there are plenty of libraries you can get for them. But the core
library for C does force this problem in your face by leaving you no other
choice, and it is a valid criticism of C.

(C is such a disaster that the only thing to do is to leave it behind as
quickly as possible. However, if we were somehow stuck with the language
itself, there's a lot of ways we could improve the libraries it comes with, as
again demonstrated by the many such improved libraries you can get. However,
one of the things I've learned from learning a ton of languages over the past
couple of decades is that a language almost never manages to escape from its
own standard library, and the few that manage it (like D) pay a stiff adoption
price in the process. C's standard library has a real problem here, that has
caused real bugs, and no amount of wordplay is going to fix those decades of
bugs.)

------
aplorbust
There is one author who does not use memcpy, strcpy, etc. He wrote his own
"standard" C library routines. Others have used these routines; they are
public domain.

The security track record on the author's internet-facing programs is better
than most. In fact, I cannot think of any author writing similar software with
a better track record.

Sometimes the most popular solutions are not necessarily the best ones for
every purpose. Whenever I write programs in C from scratch, I use byte.h,
buffer.h, etc., from the above mentioned author. I do not use memcpy.

In doing this, I am not a professional programmer and I am not writing
internet-facing programs for other users. I am a student of C learning how to
use C, the language. If I know how the language works, then it stands to
reason I should be able to use a variety of libraries, including alternatives
to the "standard libraries".

Otherwise it is arguable I would be just learning how to use a standard
library, not a language.

The C language has utility on its own, as a form of notation, and it is that
utility which I seek to learn about. Historical records indicate there was C
language in productive use for some time before there was a "standard
library".

~~~
korethr
Who is this author, and where is his code? I'd like to study it -- I might
learn something.

~~~
eesmith
Almost certainly D. J. Bernstein's string API from daemontools,
[http://cr.yp.to/daemontools.html](http://cr.yp.to/daemontools.html), and his
other bits of code. Some commentary at
[http://www.and.org/vstr/comparison](http://www.and.org/vstr/comparison).
Probably more commentary elsewhere.

------
Sir_Cmpwn
>The CPUs that offered string manipulation instructions—for example, Z-80 and
DEC VAX—did so in terms of the far more widespread adr+len model.

Hold up, the z80 offered string manipulation instructions? Using adr+len, no
less? You miiiight call LDIR a string manipulation function using adr+len but
in practice almost all z80 machines I've seen use NUL terminated strings and
something like CPIR to find it.

------
dblotsky
The title should have "Dennis" instead of "Ritchie".

------
bobsc123
Null terminated strings are just application level logic. "strings" are just
bytes in memory. There are no strings.

~~~
callesgg
No. If you quote a string in C, "string data here", you get a piece of data
with null termination.

Null termination is part of the C language.

Example:

        char str[] = "1234";
        printf("%s: %lu", str, sizeof(str));


Prints: 1234: 5

~~~
_kst_
Use "%zu" to print a value of type size_t.

A couple of ugly corner cases:

    
    
        const char str[] = "abcd\0efgh";
        printf("length = %zu, size = %zu, value = \"%s\"\n",
               strlen(str), sizeof str, str);
    

output:

    
    
        length = 4, size = 10, value = "abcd"
    

And:

    
    
        const char str[4] = "abcd";
        printf("length = %zu, size = %zu, value = \"%s\"\n",
               strlen(str), sizeof str, str);
    

This has undefined behavior. (Which is a good reason to let the compiler
figure out how big the array has to be. Computers are better at counting
things than you are. Let them.)
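
If you are handed a buffer that may lack a terminator, one defensive option is to bound every operation by the buffer size. The sketch below is only an illustration with made-up function names; it relies on `memchr` to search at most `cap` bytes and on the `%.*s` precision to stop `snprintf` from reading past the array.

```c
#include <stdio.h>
#include <string.h>

/* Length of the text in buf, never looking beyond cap bytes.
 * Returns cap when no NUL is found inside the buffer. */
size_t bounded_len(const char *buf, size_t cap)
{
    const char *nul = memchr(buf, '\0', cap);
    return nul ? (size_t)(nul - buf) : cap;
}

/* Format the contents of buf into out without requiring a terminator:
 * the precision limits how many bytes %s is allowed to read.
 * Assumes cap fits in an int. Returns what snprintf returns. */
int format_bounded(char *out, size_t out_size, const char *buf, size_t cap)
{
    return snprintf(out, out_size, "%.*s", (int)bounded_len(buf, cap), buf);
}
```

For a four-byte array holding `a`, `b`, `c`, `d` with no terminator, `bounded_len(str, sizeof str)` returns 4 and `format_bounded` writes `abcd`, where plain `strlen` and `%s` would run off the end.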

------
csours
Maybe - but the people who followed definitely chose poorly by continuing to
use null terminated strings.

------
JustSomeNobody
Did they choose "wrong"? No. Did they make a choice and move on? Yes.

~~~
pjmlp
Yes, because there were already safer systems programming languages being used
by the time C got invented.

OS safety was already a concern in 1961.

