
The Most Expensive One-byte Mistake - CowboyRobot
http://queue.acm.org/detail.cfm?id=2010365
======
supersillyus
In the article he says it is just one byte extra, but that's clearly not the
case, since we probably want strings longer than 255 characters. To be
practical, even in the short term, you'd probably need at least two bytes, and
that's still a bit limiting. NUL-terminated strings keep you from having to
worry about the size of the length specifier and what byte ordering to use
when storing it.

He notes that other languages of the day didn't go with NUL-terminated
strings. Given that C is still more widely used than other languages from that
era, you could also read this as evidence that, by doing something different,
C made the right choice.

~~~
spullara
What you would likely do is use an extension bit so there would be no fixed
maximum length for the strings. This would of course add a bunch of overhead
that you may or may not make up for with the other advantages of knowing the
length of strings.
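
A rough sketch of one way such an extension bit could work, along the lines of
a LEB128-style varint (the function name here is invented for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode a length using 7 bits per byte plus an extension bit, so short
     * strings pay one byte of overhead while long strings stay representable.
     * 'out' must have room for up to 10 bytes. */
    size_t encode_len(uint64_t len, unsigned char *out)
    {
        size_t n = 0;
        do {
            unsigned char byte = len & 0x7F;   /* low 7 bits of the length */
            len >>= 7;
            if (len != 0)
                byte |= 0x80;                  /* extension bit: more bytes follow */
            out[n++] = byte;
        } while (len != 0);
        return n;                              /* number of length bytes written */
    }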

~~~
coldnose
* If you're appending data to a string, and the variable-length increases by a byte, will you have to memmove() the entire existing string down 1 byte?

* Is every programmer responsible for detecting this condition?

* (This will make manual string manipulation very complicated and dangerous.)

* Suppose you're concatenating two strings, such that the sum of the lengths requires an additional byte. This could cause a buffer overflow. How would a strcat() function avoid causing a buffer overflow here?

* Does every string need a maximum-length counter too?

* Can you access a random element in the string without having to dereference and decode the length?

On the other hand, if you use a constant-sized length,

* What happens when you overflow the maximum length?

* Can you erroneously create a shorter string by appending text?

* How should string libraries handle this condition? By abort()ing? By returning a special error code? Does every string manipulation need to be wrapped in an if() to detect the special error? How should the programmer handle this condition?

In either case,

* Can you tokenize a string in-place?

* Can an attacker read a program's entire address space by finding an address that begins 0xffffffff and treating it as a string?

~~~
spullara
I think most of these are answered with the suggestion that no one would have
made strings this way without making a matching library that handled the
concerns you have here. Honestly char[] strings are much less useful in a
UTF-8 world anyway.

------
cperciva
There's another reason for using NUL-terminated strings: It means that a
suffix of a string is also a string. Want to convert "hello world" into
"hello" and "world"? Just replace the ' ' with a '\0'. With length-prefix
strings, you would need to copy the "world" into a new memory allocation --
which is far too wasteful for 1970s programming.
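
A tiny illustration of that point, with no copying at all:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "hello world";
        char *space = strchr(buf, ' ');   /* find the separator */

        *space = '\0';                    /* "hello" now ends here */
        printf("%s / %s\n", buf, space + 1);   /* prints: hello / world */
        return 0;
    }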

~~~
matthavener
Of course, assuming that unwanted space is there. With addr and len, you could
split "helloworld" into "hello" and "world" by simply pointing to the two
separate parts, each with len == 5. (This is actually really nice when parsing
large strings into smaller strings because you can simply point all the
smaller strings into a single, large string).
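
A sketch of that addr+len style, with an invented slice type just for
illustration: both "strings" are merely views into the one backing buffer.

    #include <stdio.h>

    struct slice { const char *ptr; size_t len; };

    int main(void)
    {
        const char *big = "helloworld";
        struct slice hello = { big,     5 };   /* points at "hello" */
        struct slice world = { big + 5, 5 };   /* points at "world" */

        printf("%.*s / %.*s\n", (int)hello.len, hello.ptr,
                                (int)world.len, world.ptr);
        return 0;
    }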

~~~
ZoFreX
This is actually what Java does when you call substr, effectively (String
objects contain a reference to a char array, a length, and an offset).

~~~
daemin
That's also because (to my knowledge) strings in Java are immutable. Same
as in Ruby and a bunch of other languages.

So the optimisation that can be done is to put all of the string data into one
big (immutable) array and have each string object just reference into it for
its data.

------
WalterBright
> Another candidate could be IBM's choice of Bill Gates over Gary Kildall to
> supply the operating system for its personal computer. The damage from this
> decision is still accumulating at breakneck speed, with StuxNet and the OOXML
> perversion of the ISO standardization process being exemplary bookends for
> how far and wide the damage spreads.

First off, IBM _did_ put CPM/86 on the PC and sell it. Many people bought it.
We had it at work. The reason CPM/86 failed relative to PC-DOS was simple -
CPM/86 cost nearly $300 and PC-DOS cost $40.

There wasn't anything discernibly better about CPM/86 at the time, either in its
programming API or its user interface, and people did the sensible thing and
bought the less expensive operating system. It was a no-brainer.

As to the idea that CPM/86 was somehow a secure operating system, I haven't
heard that before. It's patently false.

What killed CPM/86, plain and simple, is that it was way, way overpriced. I
have no idea who made that decision.

~~~
blasdel
pageman has a [dead] reply below linking to
[http://www.businessweek.com/magazine/content/04_43/b3905109_...](http://www.businessweek.com/magazine/content/04_43/b3905109_mz063.htm)
— claiming that the price disparity was IBM's doing, intended to drive the
cross-platform CP/M out of the market while still 'supporting' it.

It looks like pageman got hellbanned 2 years ago for participating in one of
the erlang frenzies. I went to email him about it, but it looks like I'd
already done so 1.5 years ago!

------
brlewis
People should not forget that UNIX was an informal experiment, not a plan for
what should dominate the computing world 40 years later. When writing
informally-experimental software, you should have a bias toward doing things
differently from the mainstream, so that you can put new ideas to the test.

There are plenty of options for people who don't want to write new code using
NUL-terminated strings. They can make their own decision. There's no reason to
blame a 40-year-old decision for errors today.

~~~
btilly
If you want your new code to interact with old code, there is a very
significant advantage to using the old code's data format.

------
shabble
Whenever I've had to write non-trivial C, I try to use the bstring library
(<http://bstring.sourceforge.net/>) if possible, which provides a length+blob
approach with a nice API and can be converted to/from NUL-terminated strings
fairly easily.
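
A rough sketch from memory of how the library is typically used (double-check
the exact names and signatures against its documentation):

    #include <stdio.h>
    #include "bstrlib.h"

    void demo(void)
    {
        bstring b = bfromcstr("hello");   /* build from a NUL-terminated string */
        bcatcstr(b, " world");            /* length-aware append */
        printf("%d bytes: %s\n", blength(b), bdata(b));
        bdestroy(b);                      /* free it */
    }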

------
forgotusername
DOS did not "invent" it's own path separator, directories were added to DOS
long after command line flags, which already used slash, and COMMAND.COM
already supported placing a flag immediately after a command name (without
whitespace), making backwards compatibility with their own OS extraordinarily
hacky to implement if not impossible.

Also please remember we're talking about the late 70s here. This is one of
those annoyingly common vitriolic ideas about Microsoft, really tarnishes the
reputation of this article as being well researched by appearing here.

~~~
astine
The author mentioned this:

"IBM had decided to use the slash for command flags, eliminating Unix as a
precedent, and the period was used between filename and filename extension,
making it impossible to follow DEC's example."

~~~
WalterBright
DEC also used periods to separate filename from extension.

People forget that CPM's conventions (which were copied by CPM/86 and PC-DOS)
were copied from DEC conventions. DEC operating systems were very, very
popular at the time.

------
astral303
I wonder whether the most expensive mistake is not so much the NUL-terminated
string design as the persistent use of the library string and memory
manipulation functions that we now call "unsafe", and the culture of
constantly writing your own raw-memory-based string manipulation functions and
working with memory so directly, without safety constraints.

I remember using strings and STL containers in C++ and wondering why
_anybody_ would go back to the clunky ways of malloc'ing, scanning for NULs,
using memcpy, strcmp and other things that might "run away" on you so easily.
I remember feeling unsafe using strcpy. In contrast, it felt very safe to use
C++ strings: I had to go out of my way to code a buffer overflow.

Better, but not best: the "safe" replacements like strncpy still require you
to maintain the correct string length on your own. Basically, there are just
so many ways to shoot yourself in the foot.

The industry could've significantly addressed this by adopting a standard
library for manipulating strings and memory, at least with length-restricted
"safe" functions (strncpy, but one that always NUL-terminates). Language and
compiler support could've been added on top. All of this should've been done
way back in the day, but it wasn't.
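
A hedged sketch of the kind of length-restricted, always-terminating copy
being asked for, similar in spirit to BSD's strlcpy (the name safe_copy is
made up here):

    #include <stddef.h>

    /* Copy at most size-1 bytes of src into dst and always NUL-terminate.
     * Returns the length of src, so a return value >= size means truncation. */
    size_t safe_copy(char *dst, const char *src, size_t size)
    {
        size_t srclen = 0;

        while (src[srclen] != '\0')
            srclen++;

        if (size != 0) {
            size_t n = (srclen < size - 1) ? srclen : size - 1;
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
            dst[n] = '\0';
        }
        return srclen;
    }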

~~~
mattgreenrocks
That subjective feeling of unsafety is excellent. I can only wish that other
people felt that same tinge of doubt when invoking these operations, as one
mistake can corrupt memory.

I also find it ridiculous how tolerant the OSS community is about these
things. Major projects written in C (Pidgin) often find themselves fixing
these sorts of mistakes over and over again. Why is this acceptable? To me it
suggests a rushed design, and it potentially puts my information at risk.
Remember, all of the speed gains to be had from C are lost the moment you core
dump.

It isn't that C is inherently insecure, but we should require good reasons for
it to be used, particularly with apps that interact with the network. Right
now it seems like it has a bit too much geek cred as the language of alpha
developers.

------
_delirium
In the performance category, it also imposes a cost on string searches, unless
you know the string length via some other mechanism. Algorithms like
Boyer-Moore get their speed by skipping over some characters, but with null-
terminated strings it's unsafe to skip any character without first checking if
it's null. Granted, that's mostly a cost imposed on long strings being
searched for longish substrings.
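
A hedged sketch of why the length matters there, using the simplified
Boyer-Moore-Horspool variant (the function name is invented): the search jumps
ahead by skip distances, which is only safe when the haystack length is known
up front.

    #include <stddef.h>
    #include <string.h>

    const char *horspool(const char *hay, size_t hlen,
                         const char *needle, size_t nlen)
    {
        size_t skip[256];

        if (nlen == 0 || hlen < nlen)
            return NULL;
        for (size_t i = 0; i < 256; i++)
            skip[i] = nlen;                          /* default shift */
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        /* Jump ahead by skip[] each round; with a NUL-terminated haystack the
         * bytes being jumped over would have to be checked for '\0' first. */
        for (size_t pos = 0; pos + nlen <= hlen;
             pos += skip[(unsigned char)hay[pos + nlen - 1]]) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return hay + pos;
        }
        return NULL;
    }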

~~~
alnayyir
The case you described as having the cost imposed is literally what
Boyer-Moore excels at, so it doesn't even have the benefit of only hurting the
pathological case.

------
ZoFreX
> The damage from this decision is still accumulating at breakneck speed, with
> StuxNet and the OOXML perversion of the ISO standardization process being
> exemplary bookends for how far and wide the damage spreads.

Bashing Windows doesn't make you cool; it's just tiresome.

~~~
mattgreenrocks
Yeah, but it does make you appear cool on Slashdot.

------
sovande
A C string is a convention, not a type, and I don't know why we still cannot
get a string type in C that can live side by side with 'char *'. I mean, we
got wchar_t duct-taped to C, so why did the C99 guys not introduce a 'string'
type (addr/len)?

~~~
daemin
You can just do it yourself: typedef char* (or some other fancy array type) as
a string type and treat the first x bytes as the length of the string.

There's nothing stopping people from doing this themselves.
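
A minimal sketch of that do-it-yourself approach, with invented names: a
Pascal-style string whose allocation starts with its length.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t len;
        char   data[];     /* C99 flexible array member */
    } pstring;

    pstring *pstring_from(const char *s)
    {
        size_t len = strlen(s);
        pstring *p = malloc(sizeof(pstring) + len);
        if (p != NULL) {
            p->len = len;
            memcpy(p->data, s, len);   /* no NUL terminator stored */
        }
        return p;
    }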

------
kelnos
_If the source string is NUL terminated, however, attempting to access it in
units larger than bytes risks attempting to read characters after the NUL. If
the NUL character is the last byte of a VM (virtual memory) page and the next
VM page is not defined, this would cause the process to die from an
unwarranted "page not present" fault._

I don't see how that could happen. Are there any machines out there where VM
pages are not aligned to some multiple (even 1x) of the CPU's native word
size?

If I'm correct in assuming there aren't, then by definition, if the NUL
character is the last byte of a page, a word-sized aligned read containing it
would end exactly at the end of the page anyway. Alternatively, if you're
_starting_ your read at an unaligned address but making your read size a
multiple of the CPU word size (in which case you could read past the end of
the page), you're not really gaining anything performance-wise.
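
For reference, the kind of word-at-a-time scan being discussed looks roughly
like this (a sketch, not an actual libc implementation; real ones also have to
worry about strict aliasing and usually do this in assembly):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_words(const char *s)
    {
        const char *p = s;

        /* Go byte by byte until the pointer is 8-byte aligned. */
        while ((uintptr_t)p % sizeof(uint64_t) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* Aligned 8-byte reads can overshoot the NUL, but never cross into
         * the next page, which is the point being made above. */
        const uint64_t *w = (const uint64_t *)p;
        while (!((*w - 0x0101010101010101ULL) & ~*w & 0x8080808080808080ULL))
            w++;                       /* no zero byte in this word */

        p = (const char *)w;
        while (*p != '\0')             /* locate the zero byte within the word */
            p++;
        return (size_t)(p - s);
    }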

~~~
nitrogen
The page fault problem seems unlikely, but SIMD-optimized versions of
functions like strcspn() will read in 8-byte chunks, causing Valgrind's
memcheck to complain about invalid reads when you do a strcspn() on a string
where length % 8 != 0.

As for the original problem, I suspect x86 is relatively unique in its
tolerance for unaligned memory accesses. The ARM processor I'm working with
will return the wrong data if you try a 16-bit or 32-bit read that is not
aligned on a 16-bit or 32-bit boundary. malloc() is supposed to return memory
that is aligned to the platform's largest alignment requirement, so it takes a
bit of deliberate work to create unaligned accesses. It seems unlikely that a
NUL-terminated string will be accessed using unaligned multi-byte reads, since
optimized libc functions will take alignment into consideration.

------
barrkel
The choice of a null terminator vs len+adr is distinct from the issue of
buffer overflows. A string implementation could use storage buffers sized with
len+adr (with len as capacity) and range check them at runtime, but still use
null terminators for the actual string length. This wouldn't necessarily be
redundant; null termination has desirable semantics for algorithms that
process text as a stream (e.g. scanners, regex) and saves having to keep track
of distance to string end in a separate counter.

The deeper problem is the lack of memory safety. The state of the art in
static checks wasn't as advanced back then, though (Pascal, in its original
form, was not well received at the systems level), and dynamic checks would
have been considered too costly.
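
A rough sketch of that hybrid, with invented names: the capacity is used for
range checking while the contents stay NUL-terminated for stream-style code.

    #include <string.h>

    typedef struct {
        size_t cap;        /* total bytes available in data[] */
        char   data[64];   /* NUL-terminated contents */
    } strbuf;

    /* Usage: strbuf b = { sizeof b.data, "" };  strbuf_append(&b, "hello"); */
    int strbuf_append(strbuf *b, const char *src)
    {
        size_t used = strlen(b->data);
        size_t add  = strlen(src);

        if (used + add + 1 > b->cap)
            return -1;                            /* range check: would overflow */
        memcpy(b->data + used, src, add + 1);     /* copies the NUL as well */
        return 0;
    }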

------
iqster
Interesting read. However, the article doesn't fully consider the costs of the
alternative decision. One of the things I really hated about COM programming
was BSTRs (the string's length is stored in memory just BEFORE the address the
pointer points to). Granted, BSTRs are a kludge, but to me they illustrated
the elegance of the NULL-terminated C string vs. the alternatives.

~~~
joezydeco
The other thing that came to mind was that you have now defined a maximum
string length, unless you then had a meta-length number (describing the
"length" of the length). And how long should _that_ number be?

I could see where Ken & Co just looked up and figured null-termination was a
more elegant answer.

~~~
tlrobinson
You could use a variable-width size field similar to how UTF-8 works, but
that's getting pretty complicated and inefficient.

e.g.

0xxx xxxx

10xx xxxx xxxx xxxx

110x xxxx xxxx xxxx xxxx xxxx

1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx

etc
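
A sketch of decoding that layout (invented function; the bytes after the first
are taken as raw payload bits): count the leading one bits of the first byte
to learn how many extra bytes the length field occupies, then splice the "x"
bits together.

    #include <stddef.h>
    #include <stdint.h>

    size_t decode_len(const unsigned char *p, uint64_t *len_out)
    {
        unsigned extra = 0;

        /* 0xxxxxxx -> 0 extra bytes, 10xxxxxx -> 1, 110xxxxx -> 2, ... */
        while (extra < 7 && (p[0] & (0x80u >> extra)))
            extra++;

        uint64_t len = p[0] & (0x7Fu >> extra);   /* remaining x bits of byte 0 */
        for (unsigned i = 1; i <= extra; i++)
            len = (len << 8) | p[i];

        *len_out = len;
        return extra + 1;   /* bytes consumed by the length field */
    }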

~~~
boredguy8
And re-introduces the 'magic character' you're trying to eliminate.

~~~
fuzzix
The magic character is in a known location; there's no need to iteratively
scan for it.

The use of "magic" characters is not the problem, as I see it.

~~~
boredguy8
No it's not. In the first line, it's in [0], in the second line, it's in [1],
in the third line, it's in [2]. You have to scan through to find where the '0'
appears.

~~~
tlrobinson
True, but it's O(log N) rather than O(N) to determine length.

BTW I'm not sure if it was clear or not but those are bits in my diagram, not
bytes/characters. The "x"s are the bits of the length field, not the
characters of the string.

~~~
daemin
But you could just as easily set aside the first 4 bytes of a string to be the
length and then it would be O(1).

But does that really matter when you're doing operations on the string that
iterate over the whole string anyway? Since iterating the whole string is O(n)
anyway, you're not really gaining anything.

~~~
fuzzix
Operations that return the string length or return characters from the end of
the string become far less expensive.

You also have a fair idea of a reasonably sized buffer to move in duplication
operations, avoiding single-character move loops.

------
mrb
I would argue that a more expensive one-byte mistake was the computer industry
failing to standardize on a newline representation (CR+LF versus LF).

In mixed Unix/Windows environments, I cannot even _count_ the number of times
that this discrepancy has created issues or slowed down work: config files
accidentally saved with Windows newlines and causing cryptic error messages
("No such file or directory: filename" because "filename" from that config
file has an invisible CR at the end), discovering that a tool had been
silently converting newlines from one format to another when you wanted to
preserve them but it is too late to fix because the files have been shipped to
customers, tools stripping the LF but not the CR and corrupting text displayed
on terminals, $ in a regex not matching the end-of-line because of the
presence of CR, of course the pain of editing Unix files on Windows (the only
reason why Wordpad exists), etc.

~~~
Nick_C
> the pain of editing Unix files on Windows (the only reason why Wordpad
> exists)

An FYI for anyone who might have this problem, Notepad2 is a drop-in
replacement for notepad and can handle Unix/Windows line-ending conversions
(plus a few other nice features for a small binary).

------
AshleysBrain
C++'s std::string solves this. Not only does it take care of all memory
management for you, it also stores the string as length and a pointer. There's
also a common 'small string' optimisation that stores strings a few bytes long
inside the class itself, so it isn't doing any dynamic allocation. O(1) to get
the size and small strings don't allocate - nice! Still, people will continue
to moan that C++ is rubbish or something. You can even invent safe stack-
allocated strings using templates to mimic exactly how C does it but in a safe
and straightforward way, but the idea hasn't caught on - seems std::string is
just fine.

To be fair, C++ is best described as using both, since string literals are
still null-terminated, and in many cases you have to use the
std::string::c_str() method for backwards compatibility where a null-
terminated string is expected (which comes up regularly in practice).

~~~
burgerbrain
_"To be fair C++ is best described as using both"_

I think that is the real reason people commonly complain about C++. It's stuck
in some sort of weird twilight zone, halfway towards being a modern safe
language, but retaining enough of C to make it dangerous. Arguably more
dangerous than C since newcomers may not be aware of what they are getting
themselves into.

Kind of like the difference between a pit of quicksand, and a pit of quicksand
covered in palm leaves. ;)

~~~
astral303
I don't agree that it is more dangerous than pure C. First, you can do all
your operations on strings in 100% C++, which will be aggressively optimized
to essentially what you'd be doing with C anyway. Second, keeping the
underlying representation as NUL-terminated allows you to use other APIs that
consume char * C-style strings by calling c_str(). There is a compile-time
const check that forces you to write a cast when calling APIs that do not
consume a const char *. If the APIs that you are calling do not modify the
string, but merely read it (which is the case for almost all uses), then you
can have a C++ app that is completely safe from string buffer overflows.

This is called abstracting out an easy-to-make-tragic-mistakes-in problem into
a small layer (C++ std::string) and using that layer everywhere.

------
gatlin
> In other words, this could have been a perfectly typical and rational IT or
> CS decision, like the many similar decisions we all make every day; but this
> one had quite atypical economic consequences.

In other words, it is the exact same kind of pragmatic decisions as all the
rest of the author's examples, but is simultaneously different.

------
ChuckMcM
"Using an address + length format would cost one more byte of overhead than an
address + magic_marker format, ..."

Actually, pedantically: assuming an unsigned length, it's one byte for strings
< 256 characters, two bytes for up to 64K, etc.

Having heard Dennis at least talk about the development of UNIX, the notion
that C was 'dangerous' and not for folks who didn't know what they were doing
was both a conscious decision and an expedient one. I mean, c'mon, you can
cast a constant into a function pointer. It really was just syntactic sugar
over basic assembly.

~~~
barrkel
And yet, these days, performing such undefined operations may cause your
compiler's optimizer to assume that such code is e.g. dead, and behave
unexpectedly.

~~~
ChuckMcM
Oh absolutely. The original author was claiming that using a 'signature'
character, NUL, to indicate the end of a string is the most dangerous one-byte
'feature', and speculates on whether or not the folks at Bell Labs thought
that through.

In context, C was just a glorified version of assembler (but with better
looping constructs), and the fact that one could overrun strings, or randomly
go into the weeds if the NUL was missing, was 'understood' because it was no
more dangerous than simply writing the assembler yourself. I mean,
'JSR @R7,0x4bee' isn't really that much different :-)

I think the author was looking at the past through today's understanding of
what the C compiler does, as opposed to the purpose it was originally set out
for (which was to be more readable than assembly).

------
Flow
What always bothered me about nul termination is that it isn't a data
structure; it's a convention.

And as with all conventions, there's discipline involved. You can't unit test
it once and get rid of that kind of bug for good.

To this day I still think that (Borland) Pascal is a better language than C.
It has sets, array indices and, best of all, a real module concept.

------
rbanffy
Actually, null-terminated strings made a lot of sense when strings were short.
In my 8-bit days, I used both approaches (because, sometimes, strings have to
contain a NUL in them).

------
yuhong
One OS which did not make this mistake was the Classic Mac OS. I had a thread
on Ars Technica where I argued about whether Copland would have been more
secure than Mac OS X based on that.

------
yuhong
> Another candidate could be IBM's choice of Bill Gates over Gary Kildall to
> supply the operating system for its personal computer. The damage from this
> decision is still accumulating at breakneck speed, with StuxNet and the OOXML
> perversion of the ISO standardization process being exemplary bookends for
> how far and wide the damage spreads.

As it happens, I researched that one for years, and I now think the root cause
is likely Bill Gates being an aggressive businessman who treated business as
war.

