
A defense of C's null-terminated strings - jsnell
https://utcc.utoronto.ca/~cks/space/blog/programming/CNullStringsDefense?showcomments
======
_ph_
The article tries to make its point with unnecessarily complex code examples.
They seem to be locked into a C way of thinking. But Pascal was created at
about the same time as C and had proper types.

Strings could be represented as length-checked arrays. They could come in two
types: short ones, having a length byte at the beginning and limited to 255
elements, and long ones, having an "int" size marker and correspondingly
larger limits. The type dispatch could be handled by the compiler, the
length() function would always return an int, and the elements would be
addressed by a[i]. That's it. This would have saved us more than 3 decades of
nasty bugs if not security problems.
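In C terms, the two layouts might look something like this (a sketch; the
type and field names are invented, and Pascal dialects differ in the
details):

```c
#include <stdint.h>

/* "Short" string: one length byte up front, at most 255 elements. */
struct short_string {
    uint8_t len;
    char    data[255];
};

/* "Long" string: an int length marker and a correspondingly larger limit. */
struct long_string {
    int   len;
    char *data;
};

/* length() just reads the field: O(1), and no sentinel value is
   stolen from the byte range the contents may use. */
static int short_length(const struct short_string *s) { return s->len; }
static int long_length(const struct long_string *s)   { return s->len; }
```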

~~~
cefstat
> But Pascal was created at about the same time as C and had proper types.

The Pascal I remember (some version of Turbo Pascal in the early 90s) had
strings where you had to declare the length as part of the type. Strings of
different length had different types and you could not pass a string of length
30 to a procedure that expected a string of length 31. The way to solve this
is to declare that all your strings had the same length and hope that it would
be enough. It was horrible and at the time I found the C approach much more
reasonable. Of course Pascal improved through the years and the C approach
turned out to be a huge security problem.

~~~
valarauca1
Pascal strings are weird.

The first byte tells you the string's length. If the MSB of that first byte
is set, the string's first 4 bytes tell you the length instead. I mean, yes,
it's not a horrible system, but there are better systems.

I'll be honest I like Rust's. Just keep the length and the pointer on the
stack in a tuple.
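Sketched in C, that layout is just a "fat pointer" of (data, length)
travelling together, so no byte of the contents is reserved as a terminator
(names here are illustrative):

```c
#include <stddef.h>
#include <string.h>

struct str_slice {
    const char *data;
    size_t      len;
};

static struct str_slice slice_from(const char *s) {
    struct str_slice v = { s, strlen(s) };
    return v;
}

/* Subslicing is O(1): move the pointer and shrink the length, no copy. */
static struct str_slice slice_sub(struct str_slice s, size_t from, size_t to) {
    struct str_slice v = { s.data + from, to - from };
    return v;
}
```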

------
eru
> C's null byte terminator imposes a fixed and essentially minimal size
> penalty regardless of the length of the string.

From an information-theoretic point of view, making 1 out of 256 values of
each byte unusable as string content is an overhead linear in the length of
the string, not constant.

~~~
chongli
It's even worse than that (from a practical standpoint, not a complexity
theory standpoint) when you get into the realm of interpreted languages
written in C. Try processing text containing null bytes in bash, for example.

~~~
jstimpfle
Data containing null bytes is not text.

I like to think of bash / sh as just a job control system. It enables you to
set up process pipelines. Everything is just processes and files (streams).
Variables only serve as program arguments. Since program arguments are also
null-terminated, variables containing null bytes just make no sense.

(Of course it gets dirty as soon as you feed the output of a job into a
variable, for example. I think that isn't even defined by POSIX for non-text
streams.)

~~~
chongli
Feeding the output of a job into a variable is a common enough task. One
example I worked on recently was to use the results of one find(1) command as
an array of arguments to another. Perhaps this was a very ugly way of
comparing two directories, but it worked.

This is why it surprised me when I realized (without having thought about it)
that variables can't contain null bytes. It made my quick id3 tag parsing
script a bit trickier to write, since I could no longer just use read(1P).

------
cousin_it
What I want from strings in a programming language:

1) Memory-safe (can't touch memory that you don't own)

2) Opaque (standard library can change the implementation later)

3) Immutable (can be passed around like values)

4) Unicode (no fixed width chars, no default conversion to bytes)

I don't know any good reason to deviate from these rules. Even systems
languages would do well to follow them, IMO.

~~~
geocar
I don't think any of these are valuable, let alone essential.

1) Memory safety costs a lot and gives me little. If I know I'm not going to
touch memory incorrectly, I should be able to take advantage of that.

2) Knowing the cost of operations is essential to do any kind of performant
programming in a reasonable amount of time; e.g. whether `length(x)` is O(N)
or O(1). The set of operations specified is hugely important.

3) When I've got a 300GB log to update, an in-place algorithm will be better
for everybody. Microcontrollers don't want to waste space with extra copies
either, and a better algorithm wins.

4) I'm not sure having an opinion about Unicode in the language is ideal: you
can either store an array of bytes, or an array of code points, or a list of
UTF-8 "characters", and I see value in all of these interpretations
(comparison/copying, rendering, transcoding). I'd prefer flexibility, and the
language not switching them around on me behind my back (see #2).

~~~
vardump
I do kernel driver and embedded programming. I find your attitude scary.

1) Memory buffer errors cost a lot. Kernel panic or a frozen device after
overwriting its whole memory. Having some low cost safety would be _very_ much
welcome. One of the biggest reasons I'm interested in Rust.

2) Well, I guess you can peek at the implementation, no? strlen doesn't give
you any guarantees either -- different implementations vary by over an order
of magnitude in performance.

3) If you have a 300GB log to update, what happens when step 2 out of 3 of
your update process fails? In-place updates are risky on shared or persistent
data. And this is a low-frequency corner case anyway. Most strings don't need
to be mutated, so it's reasonable to default to that.

4) An array of bytes is usually good enough. When it's text, its encoding
should just always be UTF-8, except when that's not practical for some
external reason.

~~~
geocar
I can appreciate your "kernel driver and embedded programming" background, and
I'm working on my attitude.

I think about programming a lot.

I don't do much programming since my programs usually work correctly the first
time, but when I do I write fast web servers and operating systems and text
editors and language bindings and compilers and interpreters and
compressors/decompressors and image editing tools, backup tools, system
administration tools, reverse engineering network protocols, reverse
engineering file formats, invoice generation tools, billing and accounting
packages, mobile apps, high volume mail servers, mp4 decoders, 1wire (dallas)
drivers, ad servers, ad players, web applications like email clients and web
stores, windows device drivers, linux device drivers, and other things that I
can't remember right now.

I have observed that the bugs I tend to make have more to do with my failing
memory (i.e. what I can remember), than overflowing buffers that I allocated.
Maybe memory buffer errors happen to other programmers often enough that it's
worth worrying about for other people, but I'd argue this says more about
those other programmers, and I propose that the techniques I use to avoid
making those mistakes are more valuable than memory protection since it
clearly allows me to have my cake and eat it too.

I've observed a great deal of performance can be gained by controlling how my
structure is laid out in memory, but in order to do that, the compiler needs
to trust me.

Maybe.

Some languages are experimenting with algebraic types and what they're doing
might be a good-enough middle ground such that they can actually get zero-cost
buffer protection, but I haven't played with them much to know for certain.
I'm willing to be convinced, but I haven't yet, so I continue to maintain that
enforced memory protection is not ideal.

Re 2) I think the original point is that the implementation should be free to
change it however (and whenever) it likes. That's why they wanted opaque
strings, such that you access them only through the interface itself.

Re 3) It depends. If it fails, then we've lost however long it takes to scan
300GB (a few minutes); but just because we copied it once from the disk
doesn't mean we need to copy it again and again: that turns minutes into
hours. I agree that most strings don't need to be mutated, and that it's
reasonable to default to that; however, order-of-magnitude performance gains
are worth spending a little time thinking about, and they often require
mutating in place.

Traversing a tree is another good example, and the Deutsch-Schorr-Waite link-
inversion algorithm (Knuth, TAOCP vol. 1, § 2.3.5) is essential for
constrained-memory devices when you need a tree-walker. Sometimes I need to
mutate, so I think _enforcing_ immutability is simply not valuable.
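For the curious, here is a sketch of the link-inversion idea for binary
trees. The node type is invented, and for clarity the mark bit lives in an
explicit `tag` field; the real algorithm packs it into the pointer words. No
stack is used: child pointers are temporarily reversed to point back at the
parent and restored on the way back up.

```c
#include <stddef.h>

struct node {
    int value;
    int tag;                     /* 0: descended left, 1: descended right */
    struct node *left, *right;
};

static void traverse(struct node *root, void (*visit)(struct node *)) {
    struct node *p = root, *q = NULL, *t;
    for (;;) {
        while (p != NULL) {      /* dive left, reversing links as we go */
            visit(p);            /* pre-order visit */
            p->tag = 0;
            t = p->left;
            p->left = q;         /* left pointer now points at the parent */
            q = p;
            p = t;
        }
        while (q != NULL && q->tag == 1) {
            t = q->right;        /* climb over finished right subtrees */
            q->right = p;        /* restore the right child */
            p = q;
            q = t;
        }
        if (q == NULL)
            return;              /* climbed past the root: done */
        q->tag = 1;              /* switch from left to right subtree */
        t = q->left;
        q->left = p;             /* restore the left child */
        p = q->right;
        q->right = t;            /* right pointer now points at the parent */
    }
}
```

When the traversal returns, every reversed link has been restored, so the
tree is exactly as it was.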

~~~
vardump
> I do I write fast web servers and operating systems and text editors and
> language bindings and compilers...

I didn't want to start any contest, but to point out that C is used a lot in
those contexts, and zero termination seems to be one of the bug magnets. It's
a very common source of security vulnerabilities as well; a pretty large
portion of them are related to string processing.

> Maybe memory buffer errors happen to other programmers often enough that
> it's worth worrying about for other people

Yup. Maybe a large part of how I think of this is to protect the other
people. It's not rare for someone to mess up on an embedded system and write
a bit somewhere else in memory. Often it won't crash, but the bugs you get
take ages to hunt down.

There's just no way around it. I do want to delegate things to other people.
Defensive coding tends to catch and prevent a lot of mistakes at that point.
Many of them have pretty cowboy attitudes towards details such as parameter
validation or error checking. Some of them will need to maintain it in the
distant future.

> I've observed a great deal of performance can be gained by controlling how
> my structure is laid out in memory, but in order to do that, the compiler
> needs to trust me.

Yeah, performance is often about cache and locality of reference. Pointer
chasing and other random access destroys performance. Applying SIMD without
penalties requires at least 16-byte alignment. In 2016, 64-byte alignment is
even better; SIMD units are getting pretty wide.

Re 3)...

I still think that the 300GB string example is esoteric, even off-topic.
You're not going to put a 300GB string in any standard string abstraction.
There are probably a lot of other worries, such as running out of disk space,
undetected data corruption on disk (which happens pretty often), failure in
the middle of the modification process, etc.

> Sometimes I need to mutate, so I think enforcing immutability is simply not
> valuable.

I don't think anyone wanted to enforce immutability. Just to have it default.

~~~
geocar
> I didn't want to invoke any contest, but to remind C is used a lot in those
> contexts and zero termination seems to be one of the bug magnets. A very
> common source for security vulnerabilities as well, pretty large portion of
> them are related to string processing.

I understand. I think the vast majority of this comes from the C standard
library though, and _not_ from the C language.

qmail for example had no buffer overflow problems -- the first security
vulnerabilities found 10 years after release wouldn't affect any system that
qmail was developed to run on simply because nobody gave their mail server 4GB
of ram! Surely the author should have predicted this problem, but it
demonstrates methodology can protect against this issue.

KDB takes an interesting approach of reference counting in all operators.
Operators can then be optimised for situations where one argument (or both)
have a reference-count of exactly one. This means writing C code with KDB's
memory management library can make things very simple.

Antirez also has an interesting string library that uses a combination of
length+null-termination specifically for the purpose of finding bugs.

> There's just no way around it. I do want to delegate things to other people.
> Defensive coding tends to catch and prevent a lot of mistakes at that point.
> Many of them have pretty cowboy attitudes towards details such as parameter
> validation or error checking. Some of them will need to maintain it in the
> distant future.

I understand what you're saying, but I think technique can help a lot more
than we think: Most of the software engineering industry wants to move to
smarter and better tooling, and I'm just proposing we upgrade our brains some
too.

I don't have all the answers yet.

> You're not going to put 300GB string in any standard string abstraction.

In KDB this is actually pretty common, but it doesn't have a "standard string
abstraction": while it meaningfully supports byte arrays that are in the
300GB range, the string atoms (symbols), which are used similarly to strings
in Python et al, never approach 1MB, let alone 300GB.

However in C I just did mmap() and updated the file. It took about 15 minutes
to write and debug, and a few minutes to run on the server.

> I don't think anyone wanted to enforce immutability. Just to have it
> default.

I don't know. I parsed _I don't know any good reason to deviate from these
rules_ as "enforce". I might be wrong; you'd have to ask "cousin_it" what
he/she meant at the time. I know at least later they either were convinced,
or clarified[1] this position.

However what I meant was that I don't want to enforce those rules because I
can think of an exception for each, and they understood that. Sorry I was
unclear though.

[1]:
[https://news.ycombinator.com/item?id=10842858](https://news.ycombinator.com/item?id=10842858)

------
antirez
The SDS library
([http://github.com/antirez/sds](http://github.com/antirez/sds)) is the middle
ground between the two approaches. SDS strings are apparently C null-
terminated strings, but before the pointer that you pass around there is
metadata with the length. This makes SDS strings compatible with everything
in the C standard, but allows you to easily manipulate the string without
managing the allocation manually. Also, sdslen() is O(1) and binary safe.

The main drawback is that most functions have to reassign the pointer back,
like in:

        sds foo = sdsnew("foo");          // Creates an SDS string
        printf("%s\n", foo);              // You can print it with printf()
        foo = sdscatlen(foo, buffer, 10); // Append 10 bytes

Failing to reassign the sdscatlen() return value back to "foo" creates a bug.
SDS originated with Redis but is now a standalone library. Version 2.0 was
recently released, which synchronizes it with the version we have inside
Redis.

~~~
JoshTriplett
Seems like you could avoid that potential for error by making functions that
modify an SDS take &foo rather than just foo, and modify it in place.

As a stopgap, though, SDS could use GCC's attribute "warn_unused_result" on
all of those return values.
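A toy illustration of the attribute (this is deliberately not the real SDS
API; the names and the naive realloc-based append are invented):

```c
#include <stdlib.h>
#include <string.h>

typedef char *sds_demo;

/* With this attribute, GCC and Clang warn whenever a caller ignores the
   returned pointer -- which is exactly the "forgot to reassign" bug. */
__attribute__((warn_unused_result))
static sds_demo sds_demo_cat(sds_demo s, const char *t) {
    size_t a = strlen(s), b = strlen(t);
    sds_demo n = realloc(s, a + b + 1);  /* may move the buffer */
    if (n == NULL)
        return NULL;
    memcpy(n + a, t, b + 1);             /* copy t plus its terminator */
    return n;
}
```

With this, `s = sds_demo_cat(s, "bar");` compiles silently, while a bare
`sds_demo_cat(s, "bar");` draws a warning.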

~~~
jzwinck
Last time I tried it, that GCC attribute was only implemented for C, not C++.
It seems like a good idea anyway.

------
pjc50
Well, yes: these are the original design constraints that led to null-
terminated strings. The result is a data structure a bit like a linked list:
finding the length is an O(n) operation.

So long as all your operations on the string are char-at-a-time, and your
first act is to check the char against zero, this works fine.

In fact, _so long as you stay in the original UNIX batch processing model_ ,
where you're reading bytes from stdin and writing to stdout, the whole program
works fine. Inconvenient buffer-size issues are avoided: don't keep
intermediate buffers, just write to stdout a char at a time. (Have a look at
lex/yacc generated code for this model, for example, with their not-easily-
resettable parsers and trouble handling errors other than by calling `exit`).

The problem is not so much "string is not a type" as " _buffers_ for composing
strings are not a type".

~~~
marvy
What do you mean by that last sentence?

~~~
pjc50
Roughly the distinction Java makes between String and StringBuffer. I observed
that buffer overflows require a buffer: if you do all your IO a character at a
time, this is less risky.

For security you need not "string first char + length of string" but "string
first char + length of _allocated region into which it is safe to write_ ",
which (unlike the length of the string) cannot be inferred from the string
head pointer in any way.
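A minimal sketch of such a buffer type, where the write capacity travels
with the pointer instead of being inferred (type and function names are
invented):

```c
#include <stddef.h>
#include <string.h>

struct strbuf {
    char  *data;
    size_t len;   /* bytes currently in use (the "string length") */
    size_t cap;   /* bytes safe to write (the allocation size) */
};

/* Append at most as much of `src` as fits; returns bytes appended.
   The capacity check makes overflowing the allocation impossible. */
static size_t strbuf_append(struct strbuf *b, const char *src, size_t n) {
    size_t room = b->cap - b->len;
    if (n > room)
        n = room;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return n;
}
```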

~~~
marvy
thanks

------
lmm
C strings may well have made sense for the time. The point is they don't do
now. I don't blame the people who created C, I blame the people who continue
to use it today.

~~~
signa11
> I don't blame the people who created C, I blame the people who continue to
> use it today.

that's pretty harsh, and far-reaching. is it really justified to paint
_everything_ with that brush ?

~~~
lmm
Probably not; I probably should remain open-minded. But I've seen so much
written in C, and so rarely has C been a good choice, that I don't feel like
wasting any more of my time.

------
camperman
Can't remember where I read it because it was such a long time ago but one
explicit benefit of null-terminated strings is that they take a single
instruction to check for the end on nearly all processors: jump if zero.

That allows for elegant constructions in C such as:

        while (*p++ = *q++);

~~~
justncase80
It may be syntactically terse, but I'm not sure it's elegant, as you would
need to do several checks before this line or else you could end up with some
pretty major bugs.

For example, you would need to ensure that the buffer p points into is at
least as large as the string in q, or else you will end up with a buffer
overflow. This in turn implies that q has a known length and therefore a \0,
so you will need to iterate through the string once to find the \0 (which may
not be there).

Failure to do this can lead to execution of arbitrary code, which is why
null-terminated strings are, frankly, indefensible.
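For contrast, a bounded version of the copy, modeled loosely on strlcpy
(this is a sketch, not a standard function; the name is invented): it never
writes more than `n` bytes into `dst`, so an undersized destination is
truncated instead of overflowed.

```c
#include <stddef.h>

static size_t copy_bounded(char *dst, const char *src, size_t n) {
    size_t i = 0;
    if (n == 0)
        return 0;                /* no room even for the terminator */
    while (i + 1 < n && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';               /* always terminate what we wrote */
    return i;                    /* bytes copied, excluding the '\0' */
}
```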

~~~
camperman
I read (and used) this during a more civilized age when everyone and his dog
wasn't trying to get your program to execute arbitrary code, although I have
still used the construct on a recent project on the Raspberry Pi and it worked
just fine.

~~~
justncase80
You're very naughty ;)

------
lugus35
Take a look at bstring[0] and you'll see how to circumvent C's
null-terminated strings in a clean and lightweight way.

[0] [http://bstring.sourceforge.net/](http://bstring.sourceforge.net/)

~~~
anon1385
bstring relies on undefined behaviour for security. Do not use it.

>Bstrlib is, by design, impervious to memory size overflow attacks. The
reason it is resilient to length overflows is that bstring lengths are
bounded above by INT_MAX, instead of ~(size_t)0. So length addition overflows
cause a wrap-around of the integer value, making them negative and causing
balloc() to fail before an erroneous operation can occur. Attempted
conversions of char * strings which may have lengths greater than INT_MAX are
detected and the conversion is aborted.

>It is unknown if this property holds on machines that don't represent
integers as 2s complement. It is recommended that Bstrlib be carefully
audited by anyone using a system which is not 2s complement based.

~~~
fhars
It is not only unknown if this holds on machines that don't use 2s complement,
it is quite likely that this doesn't even hold on machines that do use 2s
complement, but have an optimizing compiler. In many cases, the overflow
checks contain terms like

        (b->slen + 1) < 0

which an optimizing compiler could compile as

        b->slen < -1

effectively switching the overflow test off (as there is no non-overflowing
execution where this code will produce other behaviour than the original code,
and behaviour on overflow is undefined, so the compiler can safely get rid of
the addition).
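For what it's worth, the check can be phrased so that the overflowing
addition is never performed in C at all: compare against INT_MAX _before_
adding, instead of testing the sign of a result that is already undefined
(the function name here is invented):

```c
#include <limits.h>

/* Well defined for all inputs: no overflowing arithmetic is ever
   performed, so the compiler cannot optimize the test away. */
static int grow_ok(int slen, int extra) {
    return extra >= 0 && slen <= INT_MAX - extra;
}
```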

~~~
fhars
I just had to check what happens using a real compiler using realistic
compiler options (gcc-4.9.2 -O2):
[https://github.com/websnarf/bstrlib/issues/8](https://github.com/websnarf/bstrlib/issues/8)

~~~
anon1385
Thanks for doing that. It's much more productive than just moaning on HN like
I did.

------
pjmlp
Regarding the comments on the OP site

" D is the first example to have array slices that I can think of, so this
might be slightly a-historical."

No, D wasn't the first one.

There were already systems programming languages with proper string types,
the same age as C or older.

Usually there was a clear distinction between array of characters and strings,
both in any case safer than C's approach.

------
bluetomcat
Another nice feature of C strings is that a substring spanning to the end is
essentially p + something. You can also insert temporary null-chars in the
middle and have copy-free substrings.
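Both tricks in a few lines (an illustration only; the buffer must be
writable, and the temporary '\0' is restored afterwards):

```c
#include <assert.h>
#include <string.h>

static void demo(void) {
    char path[] = "/usr/local/bin";

    /* Substring spanning to the end is just pointer arithmetic, no copy: */
    const char *tail = path + 5;               /* "local/bin" */
    assert(strcmp(tail, "local/bin") == 0);

    /* Copy-free middle substring: temporary '\0', then restore: */
    char *slash = strchr(path + 5, '/');
    *slash = '\0';
    assert(strcmp(path + 5, "local") == 0);    /* "local", no allocation */
    *slash = '/';
    assert(strcmp(tail, "local/bin") == 0);    /* original restored */
}
```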

Most other high-level languages will copy the substring even if you don't
intend to modify it.

~~~
vardump
> Most other high-level languages would let you copy the substring even if you
> don't intend to modify it.

C++'s std::string, yes. But in Java, C#, and Go (slices), for example,
substring doesn't copy data.

~~~
joosters
Didn't java change its behaviour?

~~~
anon1385
Yes:
[https://news.ycombinator.com/item?id=9862556](https://news.ycombinator.com/item?id=9862556)

~~~
vardump
That's a bit scary. Some older Java code relies a bit too much on substring
not copying. I think most devs understood it pins the original String's
char[] buffer until all references are gone.

~~~
pjmlp
That was always a JVM specific behavior.

Like C, any dev that makes language assumptions based on the installed
compiler is bound to get burned.

~~~
joosters
Really? Does Java offer no cost guarantees for library functions? If a given
method might be O(1) or O(n) at the whim of the JVM, programmers are doomed
to failure.

~~~
jerven
Even worse: with partial evaluation or escape analysis, a substring call
might still be non-allocating today.

So the thing about advanced compilers in the Java world is that they use
defined behaviour (results) to do optimisations, and the best optimisations
are about not doing work you can avoid. I.e., if you can avoid doing an
allocation at all because the (sub)string you just created is only used
locally, the object might not be allocated at all (heap or stack); just
registers get reassigned with the temporary values needed to compute the
result. So it's already common to have O(n) semantics turn into O(1), or
even better, O(0) ;).

In the Java world, results are specced, not how you need to get them.
(Exceptions to this rule exist.)

------
leni536
> A fixed size length field commits you to a maximum string size

size_t would be enough, but I guess it would be too much of an overhead.

~~~
kzrdude
Debating the size of the size field is a folly: it's a struct with one char
pointer and a length field. For alignment / padding reasons, the length field
will use the size of a pointer anyway, so `size_t` is entirely appropriate
(and never too small).

~~~
gpderetta
A relatively common implementation of strings (as compared to slices) is to
store the length before the char array. The length field can be encoded with a
variable size encoding. The string itself is just a single pointer.

If you have a real garbage collector of course using the same representation
for strings and slices (two pointers or pointer+len) is appropriate.

------
kzrdude
Explicitly sized strings are so much easier to work with efficiently.
SIMDified algorithms, for example.
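A trivial illustration (the function is invented): with an explicit length
the loop below has a known trip count and no data-dependent early exit, so
compilers can auto-vectorize it, whereas a strlen-style scan has to test
every byte for '\0' before doing anything else.

```c
#include <stddef.h>

/* Count spaces in a buffer of known length. Branch-free body over a
   known range: a good candidate for SIMD auto-vectorization. */
static size_t count_spaces(const unsigned char *s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == ' ');
    return n;
}
```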

