
The format of strings in early (pre-C) Unix - fcambus
https://utcc.utoronto.ca/~cks/space/blog/unix/UnixEarlyStrings
======
Thomas_Lord
Slightly off topic. The article doesn't call it out but there's a lovely
assembly hack here. In:

        bec 1f / branch if no error
        jsr r5,error / error in file name
            <Input not found\n\0>; .even
        sys exit

jsr calls a subroutine passing the return address in register 5. The routine
error interprets the return address as a pointer to the string.

r5 is incremented in a loop, outputting one character at a time. When the null
is found, it's time to return.

The instructions used to return from "error:" aren't shown but there is a
subtlety here, I think.

".even" after the string constant assures that the next instruction, "sys
exit", to which "error:" is supposed to return, is aligned on an even address.

By implication, the return sequence in "error:" must be sure to increment r5
if r5 is odd. I am guessing something like the pseudo-code:

    inc r5
    and r5, fffe
    ret r5

~~~
ksherlock
Yep!

[http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s](http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s)

    
    
        error:
            ...
    	inc	r5 / inc r5 to point to return
    	bic	$1,r5 / make it even
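
For anyone who does not read PDP-11 assembly, here is a rough C model of that return sequence. It is a sketch, not the shell's code: `error_inline`, `mem`, `out`, and `cap` are made-up names, and the loop assumes that the elided part of error: advances r5 past the NUL with an autoincrementing byte load before the inc/bic pair runs.

```c
#include <stddef.h>

/*
 * Hypothetical model: `mem` stands in for program memory and `r5` for the
 * return address left by jsr, which points at the first byte of the message.
 * Copies the message into `out` (at most cap-1 bytes plus a NUL) and returns
 * the even address at which execution would resume.
 */
size_t error_inline(const unsigned char *mem, size_t r5, char *out, size_t cap)
{
    size_t n = 0;

    for (;;) {
        unsigned char c = mem[r5++];   /* assumed: fetch a byte, then advance r5 */
        if (c == '\0')
            break;                     /* NUL ends the message */
        if (n + 1 < cap)
            out[n++] = (char)c;        /* the real routine writes to the tty */
    }
    if (cap > 0)
        out[n] = '\0';

    /* r5 is now one past the NUL and may be odd or even. */
    r5 += 1;                 /* inc r5 */
    r5 &= ~(size_t)1;        /* bic $1,r5: clear the low bit */
    return r5;               /* rounded up to the address .even padded to */
}
```

With a message starting at address 0, both "abc" (NUL at 3, next byte at 4) and "ab" (NUL at 2, pad byte at 3) resume at address 4, which is where .even would have placed the next instruction.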

~~~
Thomas_Lord
Thanks! Nifty!

------
kazinator
After skimming through this, I navigated around Chris Siebenmann's site
with the forward and back links, discovering something way more interesting
than Unix strings and refreshingly relevant:

"How I do per-address blocklists with Exim"

[https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUse...](https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUserBlocklists)

I run Exim, and I'm also a huge believer in blocking spam at the SMTP level,
and also do some things globally that should perhaps be per-user. I'm eagerly
interested in everything this fellow has to say.

~~~
username223
I've followed his site for a while, and he seems like a thoughtful person who
likes to document his reasoning. I'm not a sysadmin, so I don't care about a
lot of what he writes, but he's worth reading for stuff I care about. For
example, he's pretty astute about the "how" and "why" of spam.

------
coupdejarnac
I was hoping to read something juicy like null termination was created by a
summer intern.

~~~
tomrod
Nope! Only time-wasting and mind-engaging software like solitaire.

~~~
ghrifter
So am I overachieving by trying to learn React+Flux and Angular.js +
typescript and also trying to learn ASP.NET 5 MVC 6 while working my
internship?

Maybe I should just make HTML 5 games instead...

~~~
alayne
I don't know a less blunt way to say this, but the tools you use are usually
not as important as the software you write. It sounds like you're more
focussed on padding your resume with buzzwords. Try to work on writing an
interesting application.

~~~
kbart
If you aim to land a job at a big company, "padding your resume with
buzzwords" is not that bad, because HR people filter CVs from the pile using
these same keywords. Not that I'd say it's a good practice.

------
holmak
I have seen it claimed that null-terminated strings were encouraged by the
instruction sets of the time -- that some instruction sets make null-
terminated sequences easier to handle than length-prefixed ones. The article's
error-message-printing code snippet is a good example. Does anyone think there
is any truth to this?

~~~
toast0
Null terminated is going to be nice in most instruction sets, you don't need
to keep track of a count, so you save a register, and you have one less thing
to increment (or decrement). Loop condition is basically free too, loading the
next byte into a register is going to set the status register, so you don't
need a compare, you can just branch if the zero flag is set.

As long as you are handling good data, it's clearly more efficient.
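
The bookkeeping difference can be seen in a small C sketch. Both functions are hypothetical examples (counting spaces is an arbitrary task), and a modern optimizing compiler may well generate similar code for the two; the point is only what state each loop has to carry.

```c
#include <stddef.h>

/* Sentinel version: the only loop state is the pointer itself. */
size_t count_spaces_nul(const char *s)
{
    size_t spaces = 0;
    for (; *s != '\0'; s++)        /* loading *s doubles as the loop test */
        if (*s == ' ')
            spaces++;
    return spaces;
}

/* Counted version: a second variable, len, is carried and decremented. */
size_t count_spaces_counted(const char *s, size_t len)
{
    size_t spaces = 0;
    for (; len > 0; len--, s++)
        if (*s == ' ')
            spaces++;
    return spaces;
}
```

The counted version can stop early or handle embedded NUL bytes, but it costs one more live variable, which is the register toast0 is talking about saving.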

~~~
Zardoz84
Also, null-terminated arrays are used for things other than strings.
For example, [https://developer.gnome.org/glib/stable/glib-String-Utility-...](https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-strsplit) returns a null-terminated array of strings.

~~~
Buge
argv is a null-terminated array of strings.
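
A minimal C sketch of walking such an array without a separate count. `count_strings` is a hypothetical helper name; the only fact it relies on is that the array ends with a null pointer, which C guarantees for argv (argv[argc] is null).

```c
#include <stddef.h>

/* Count the entries of a NULL-terminated array of strings, such as argv. */
size_t count_strings(char *const *v)
{
    size_t n = 0;
    while (v[n] != NULL)   /* the null pointer is the sentinel */
        n++;
    return n;
}
```

Called as count_strings(argv) inside main, this would return the same value as argc.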

------
derefr
I always felt like NUL-termination, newline-separation, and (eventually) UTF-8
were all sort of complementary ideas: they all take as an axiom that strings
are fundamentally streams, not random-access buffers; and they all separate
the space of single-byte coding units, by simple one-bitwise-instruction-
differentiable means, into multiple lexical token types.

Taking all three together, you end up with the conception of a "string" as a
serialized bitstring encoding a sequence of five _lexical_ types: a NUL type
(like the EOF "character" in an STL stream), an ASCII control-code type (or a
set of individual control codes as types, if you like), a set of UTF-8
"beginning of rune" types for each possible rune length, a "byte continuing
rune" type, and an ASCII-printable type. (You then feed this stream into
another lexer to put the rune-segment-tokens together into rune-tokens.)
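
As a rough illustration of how cheaply those byte classes can be told apart, here is a C sketch of the first-stage lexer. The enum and function names are made up for this example, and it only looks at bit patterns: it does not reject overlong or out-of-range UTF-8 sequences, which a real decoder would have to do.

```c
/* Hypothetical token types for single bytes of a NUL-terminated UTF-8 string. */
enum byte_class {
    BC_NUL,        /* 0x00: end of string */
    BC_CONTROL,    /* ASCII control codes, including newline and DEL */
    BC_PRINTABLE,  /* remaining ASCII, 0x20..0x7E */
    BC_CONT,       /* 10xxxxxx: continues a rune */
    BC_LEAD2,      /* 110xxxxx: starts a 2-byte rune */
    BC_LEAD3,      /* 1110xxxx: starts a 3-byte rune */
    BC_LEAD4,      /* 11110xxx: starts a 4-byte rune */
    BC_INVALID     /* 11111xxx: never appears in UTF-8 */
};

enum byte_class classify_byte(unsigned char b)
{
    if (b == 0x00) return BC_NUL;
    if (b < 0x20 || b == 0x7F) return BC_CONTROL;
    if (b < 0x80) return BC_PRINTABLE;
    if ((b & 0xC0) == 0x80) return BC_CONT;
    if ((b & 0xE0) == 0xC0) return BC_LEAD2;
    if ((b & 0xF0) == 0xE0) return BC_LEAD3;
    if ((b & 0xF8) == 0xF0) return BC_LEAD4;
    return BC_INVALID;
}
```

Each test is a comparison or a single mask-and-compare, which matches the "one bitwise instruction" flavor of the argument.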

In the end, it's not a surprise that all of these components were effectively
from a single coherent design, thought up by Ken Thompson. It's a bit annoying
that each part ended up introduced as part of a separate project, though: NULs
with Unix, gets() with C, and runes with Plan 9.

One of the pleasant things about Go's string support, I think, is that it was an
opportunity for Ken to express the entirety of his string semantics as a
single ADT. That part of the compiler is quite lovely.

------
emmelaich
How else would you implement them, seriously?

You have two choices, counted or terminated.

 _Counted_ places a complexity burden at the lowest level of coding.

With _terminated_ you still have the option of implementing strings with
structs or arrays with counts or anything.

And people did of course. Many many different implementations of safe strings
exist in C; the fact that none have won out _vindicates_ the decision to use
sentinel termination.

------
bitwize
One of the worst programming ideas ever dates back even earlier than we
thought.

If only Dennis had had the foresight to nip that one in the bud...

~~~
marvy
what's your better idea? (hint: this has to work in assembly language.)

~~~
xenadu02
Strings and arrays begin with one pointer-sized word that indicates the size
of the string/array, thus making all the various mutant versions of functions
that work with them unnecessary. And eliminate the requirement to specify
length separately when passing as a function argument. And make bounds-
checking trivial. And almost entirely eliminate buffer overflows.

This would naturally want malloc to know the type and count, eg: char[] x =
malloc(char[], 100). That means no opportunity to screw it up (let the
compiler turn that into the sizeof math to pass to the actual allocator).

If bounds checking is a performance bottleneck you could turn it off with
compiler flags; that's not a valid argument against it.

But hey... all the various buffer overflows, RCEs, and various exploits are
totally worth the minor performance gains /sarcasm.
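
A sketch of what that layout might look like if written by hand in today's C. Everything here is hypothetical (`struct counted_str`, `counted_new`, `counted_get`); it is not an existing library, and the proposal above would have the compiler do this for you rather than the programmer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: one pointer-sized length word, then the bytes. */
struct counted_str {
    size_t len;
    char data[];            /* C99 flexible array member */
};

/* Allocate a counted string holding a copy of `len` bytes from `bytes`. */
struct counted_str *counted_new(const char *bytes, size_t len)
{
    if (len > SIZE_MAX - sizeof(struct counted_str))
        return NULL;        /* size computation would overflow */
    struct counted_str *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    if (len > 0)
        memcpy(s->data, bytes, len);
    return s;
}

/* Bounds-checked read: stores the byte and returns 0, or returns -1. */
int counted_get(const struct counted_str *s, size_t i, char *out)
{
    if (i >= s->len)
        return -1;          /* the length word makes this check trivial */
    *out = s->data[i];
    return 0;
}
```

Because the length travels with the data, the string can contain NUL bytes and callers never pass a separate size; the cost is the extra word per string that the replies below argue about.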

~~~
kbob
I don't think you appreciate the zeitgeist. People were building complex
systems: compilers, operating systems, databases, numerical simulations, and
worse in machines with less memory than an Arduino. Adding a byte to every
string was widely viewed as madness.

~~~
atemerev
Yes, of course. The only problem is that now my iPhone is more performant than
the top supercomputer was then, but this ugly hack with strings is still there,
alive and kicking.

And even then, one byte of memory could be nothing compared to the CPU overhead.
Or maybe not: RAM was insanely expensive in those days.

~~~
brrt
But you're comparing apples and pears. This 'ugly hack' is - for obvious
reasons - not what 90% of software on your iPhone is actually using to
manipulate strings. Instead, the ugly hack known as NSString tidily wraps the
char buffer, its byte-length, possibly an offset - most application developers
never deal with null-terminated strings!

So in other words, I don't really understand why you are arguing for replacing
a standard - one that works well for its purposes, mind you - with another
when this has in fact already happened. And I understand even less why you are
trying to frame a good and sound engineering decision as somehow a mistake.

------
castell
The predecessor of Unix, Multics, was written in PL/I and was very innovative
(modern OSes still borrow "new ideas" from it):
[https://en.wikipedia.org/wiki/Multics](https://en.wikipedia.org/wiki/Multics)

------
jamesfmilne
That was anticlimactic.

