
The Most Expensive One-byte Mistake (2011) - fleaflicker
http://queue.acm.org/detail.cfm?id=2010365
======
yongjik
Yeah, if we had only used strings marked with 2-byte integers, everybody would
have been happy, because a 64 KB string is enough for everyone. (And let's be
realistic: nobody sane would have chosen a 4-byte string length back in the
early 70s.)

So, if we had gone down that path, what would we have? All the fun of having
"legacy" APIs that seem to work but internally only accept strings up to 64 KB
and mysteriously chop off the excess bytes when you least expect it. It's the
Y2K problem all over again.

And just when you finally think you're done with it, memory is cheaper again,
size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as
a string. Fun times again.

Have we forgotten how much trouble we went through in the 90s handling memory
on the x86 "640k is enough for everybody" architecture?

~~~
TheSoftwareGuy
Just store a pointer to the first character and a pointer to the last one. To
get the length, you just subtract the two.

~~~
simias
That's exactly equivalent to having a "size" parameter with the same size as
the pointer, except you have to use a subtract instruction when you want to
get the length of the string, so I'd say it's inferior to just storing the
length of the string.

For instance, if you copy a string you also have to update the end pointer
instead of just copying the size attribute in bulk. And you get the same
disadvantages of non-portable strings: different representations depending on
the architecture, endianness, etc.

I completely agree with the OP, there's no perfect solution. If addr + len was
truly superior I'm sure we'd see

    
    
        struct string { long len; char s[]; };
    

or for your version

    
    
        struct string { char *endptr; char s[]; };
    

everywhere. And the C standard library would have evolved along with it.

Off the top of my head, the only thing that makes '\0'-terminated strings
special in C is that it's how string literals are represented. It would be
trivial to recode all of string.h using addr + len instead of NUL termination.

~~~
1amzave
> _That's exactly equivalent to having a "size" parameter with the same size
> as the pointer, except you have to use a subtract instruction when you want
> to get the length of the string, so I'd say it's inferior to just storing
> the length of the string._

Except that it has the important property that the (effective) length
descriptor, being a pointer, would necessarily "grow" over time (across
generations of machines, e.g. 16 bit -> 32 bit, etc.) and would thus never
impose any artificial restrictions on string length.

------
gcb0
Oh the irony of history.

On the week that str+len was abused left and right, someone surfaces to the
frontpage an article about how str+NUL is wrong and everyone should use
str+len.

~~~
quotient
Now that you point it out, that's actually rather amusing.

------
millstone
NUL terminated strings were the right decision for C. They’re certainly much
simpler than length fields.

Consider using a length field. How big should that field be? If it's fixed
size, you introduce complications regarding how big a string you can
represent, and differences in field sizes across architectures. If it's
variable-sized (a la UTF-8), then you've added different complications: you
would need library functions to read and write the length, to get access to
the string contents, to calculate the amount of memory required to hold a
string of a given size, etc. Very much not in the spirit of C.

Next, what endianness should that field have? NUL terminated strings have no
endianness issues: they can be trivially written to files, embedded in network
packets, whatever. But with a length field, we either need to remember to
marshall the string, or allow for the length field to not be in native byte
order. Neither is a pleasant prospect, especially for a 1970’s C programmer.

Also, consider C-style string parsing, e.g. strtok/strsep. These could not be
implemented with length-field strings.

Explicit length is better when you have an enforced abstraction, like
std::string, but at that point you’re not writing in C. If you have to pick an
exposed representation, NUL termination is much better than Pascal-style
length fields.

So what was the “one-byte mistake?” The article says that it was saving a byte
by using NUL termination instead of a two-byte length field. Had K&R not made
that “mistake,” we would be unable to make a string longer than 65,535 bytes -
a far more serious limitation than anything NUL termination imposes!

K&R got it right.

~~~
ScottBurson
No one doubts that there were advantages to NUL-terminated strings, but
against them you have to weigh the many thousands of security holes that were
thereby created.

------
TomMasz
A _lot_ of programming decisions were made to save a byte here and there. It's
easy to point at them today and say they're "bad", but at the time they were
the absolutely _correct_ thing to do. It's hard to imagine now but not saving
that byte could mean your program wouldn't fit into RAM. Try telling your
management in the 1960s that your program won't load because it's "properly
coded" and see how far you get.

What we've failed to do is ever revisit those decisions and change them where
we've identified problems. Yes, you can probably compile (with warnings) files
from UNIX v7, but we pay for that compatibility. But there's no question
designing, building and maintaining a libc alternative is a colossal
undertaking and not likely to happen on a whim. So here we are.

------
radiospiel
Well, strings without an explicit length field allow for things like strstr(3)
or prefix parsing without performance penalties due to reallocating memory.

~~~
kevingadd
Blatantly incorrect. Try thinking over how you might implement those two
operations on a string with a length field and see if you can figure out why
your statement is wrong.

~~~
RogerL
That came across as very condescending to me. Could you tell us how you would
implement this?

I for one do not see how to implement it.

~~~
edmccard
>I for one do not see how to implement it.

(referring to an allocation-free strstr if strings had an associated length)

If a string were a (pointer, length) pair, then strstr would return a pair
(pointer+offset, length-offset) where, just like in the original strstr, the
new pointer points into the existing string. No allocation needed.

(But what about the pointer-length pair that has to be created? Well, it will
live on the stack just like the return value from the original strstr, which
has to be stored somewhere also.)

~~~
mzs
Except that's not how K&R would have done it. It would have been Pascal-style:
one byte of length directly followed by the string itself in memory, with the
pointer being the address of the length byte. Then it's hard to create strstr
without copying, so they wouldn't have. Instead you'd have a new API where
every call takes offsets from the beginning; it would likely have been
one-based, too. Anyway, C does it the way it does because the PDP-11 set the
zero condition flag on a move.

------
gumby
When I was at PARC the Mesa guys (who had counted strings) did some analysis
and (at least in those days) the counted strings ended up being, in aggregate,
faster. I suspect the advantage would be even greater these days since memory
allocation was a bigger deal back then.

I wonder if you could do this compatibly in the compiler by adding another
primitive type (counted string) which had the length in the bytes before the
start of the null-terminated string. You'd need a new type because various
routines in the standard library would have to invisibly have two versions for
counted and non-counted strings (since if you incremented a string pointer, or
used a function like strchr, you'd have to treat it as a regular char *).
"Safe" code would use a different call (say, cstrchr) that returned an index
instead of a char *. The compiler could optionally warn on unsafe "legacy"
calls as it can with strcpy instead of strncpy.

------
cliveowen
It's all true, but then again, everything would be better if we'd start from
scratch today. Compromises made to tip-toe around technology limitations are
what adds complexity to most of today's software, but even tomorrow's software
will be influenced by today's limitations. It's best not to dwell on the past.

------
crashandburn4
This page won't load for me, and neither will Google's web-cached version[1].
Does anyone have a version of this that I can see?

[1]
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://queue.acm.org/detail.cfm?id=2010365&)

~~~
HillRat
This ([http://cacm.acm.org/magazines/2011/9/122797-the-most-expensive-one-byte-mistake/fulltext](http://cacm.acm.org/magazines/2011/9/122797-the-most-expensive-one-byte-mistake/fulltext)) should work.

~~~
crashandburn4
Thanks

------
orvado
Does anyone understand what the author meant by the following statement:

${lang} is the language of the future

This looks like a macro for substitution, but maybe it's some hip new term I've
never encountered. Is it an actual language, or just a placeholder for a
language that hasn't been chosen yet?

~~~
noobiemcfoob
I think he means to imply that whoever is making the statement would
substitute ${lang} with their language of choice as the successor of C.

------
bananas
Yeah because strings with a length prefix/field are just as secure!

    
    
       200,"STR"
    

We know where that got us...

Programming 101, rules 1 & 2:

1 - never trust your inputs.

2 - always check your invariants.

------
ithinkso
With NUL-terminated strings, serialization was also simpler. If str+len had
become the standard, we would have 13 more serialization standards by now.

------
rw_grim
So to be "safe" and "secure" we can only have strings at most 255 characters
long, or we need to waste a few bytes repeatedly for short strings. Sounds
like the UTF-8 vs UTF-16/32 debate...

~~~
kevingadd
The reality is that null-terminated strings are _dramatically_ more expensive
than strings with a length counter in _every_ regard other than memory usage,
and the memory overhead of storing a length value is utterly minuscule
compared to the actual size of the string. Even if you ignore all the
secondary costs that result from the decision to use null-terminated strings,
they're just poor engineering. There are far better ways to save a few bytes.

(By secondary costs I mean things like the myriad bugs caused by null-
terminated strings, the severe performance penalties involved in copying and
manipulating them, the unfortunate implications they have for file formats and
network protocols, etc.)

~~~
TheLoneWolfling
My "ideal" string would be stored as a UTF-8 rope, with the additional
restriction (which doesn't change the interface) that all characters within a
node in the rope have the same length. (You can use overlong encodings
internally where it makes sense, e.g. one single-byte character among a run of
longer characters; that micro-optimization will in some cases save a few
bytes.)

I'd also treat a character + combining characters as a single character.

~~~
weinzierl
> _that all characters within a node in the rope have the same length [...]
> I'd also treat a character + combining characters as a single character._

The problem with this is that Unicode doesn't restrict the number of combining
marks. If your hypothetical library wants to offer full Unicode support, your
"nodes of same length" idea wouldn't work.

Of course an implementation which makes an arbitrary restriction wouldn't be
unusual. In fact, I'm not aware of any application that supports an arbitrary
number of combining marks even if the standard allows it.

When it comes to standard conformance, UAX15-D3 [1] is probably the closest we
could get. It'd require 128 bytes per character.

[1]
[http://unicode.org/reports/tr15/#Stream_Safe_Text_Format](http://unicode.org/reports/tr15/#Stream_Safe_Text_Format)

~~~
TheLoneWolfling
The length of each _node_ is not the same; each character _within_ a node is
the same number of bytes long.

So you end up with one node in the rope that holds one logical character that
is some absurd number of bytes long. (In reality there'd be a maximum of 2^8-1
or 2^16-1 or so bytes per character.)

