
NUL terminated strings were the right decision for C. They’re certainly much simpler than length fields.

Consider using a length field. How big should that field be? If it's fixed size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.

Next, what endianness should that field have? NUL terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.

Also, consider C-style string parsing, e.g. strtok/strsep. These work by writing NUL bytes into the buffer in place to split it into tokens, and could not be implemented with length-field strings.

Explicit length is better when you have an enforced abstraction, like std::string, but at that point you’re not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.

So what was the “one-byte mistake”? The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that “mistake,” we would be unable to make a string longer than 65,535 bytes - a far more serious limitation than anything NUL termination imposes!

K&R got it right.

No one doubts that there were advantages to NUL-terminated strings, but against them you have to weigh the many thousands of security holes that were thereby created.

UTF-8 got it right with its variable-length byte representation of numerical values: seven-bit values are unaffected, and larger values use more bytes as necessary.

The C approach takes a whole different philosophy. You want a "string"? NUL terminated. Simple. You want a buffer? Do it yourself.

Instead of a length field, use a pointer to the last character. The length is the difference of these two pointers. The maximum string length is the size of your address space. Problem solved.

The problem is hardly solved. Your string length computation is already wrong - the length is the difference between those pointers plus one.

Hypothetically, C could use the first `sizeof (size_t)` bytes to store the length. Endianness shouldn't be much of an issue; just use the endianness of the current machine. You don't typically write the NUL terminator when writing a string to a file; similarly, you wouldn't typically write the length field when writing a counted string to a file.

I agree that NUL-terminated strings were the right decision at the time (and aren't that bad now), but there are sane ways to do counted strings.

You piqued my interest, and it looks like NUL-terminated strings in files are actually common: tar, ELF, JPEG, and gzip all use them. When the serialization format matches the in-memory format, you can mmap directly into a struct, which is fast and convenient.

> When the serialization format matches the in-memory format, you can mmap directly into a struct, which is fast and convenient.

...and a non-starter if you ever have data interchange between differing types of machines.
