The format of strings in early (pre-C) Unix (utoronto.ca)
102 points by fcambus 207 days ago | hide | past | web | 64 comments | favorite



Slightly off topic. The article doesn't call it out but there's a lovely assembly hack here. In:

    bec 1f / branch if no error
    jsr r5,error / error in file name
        <Input not found\n\0>; .even
    sys exit
jsr calls a subroutine passing the return address in register 5. The routine error interprets the return address as a pointer to the string.

r5 is incremented in a loop, outputting one character at a time. When the null is found, it's time to return.

The instructions used to return from "error:" aren't shown but there is a subtlety here, I think.

".even" after the string constant assures that the next instruction, "sys exit", to which "error:" is supposed to return, is aligned on an even address.

By implication, the return sequence in "error:" must just be sure to increment r5 if r5 is odd. I am guessing something like the pseudo-code:

    inc r5
    and r5, fffe
    ret r5


Yep!

http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s

    error:
        ...
        inc  r5 / inc r5 to point to return
        bic  $1,r5 / make it even


Thanks! Nifty!


After skimming through this, I navigated around Chris Siebenmann's site with the forward and back links, discovering something way more interesting than Unix strings and refreshingly relevant:

"How I do per-address blocklists with Exim"

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUse...

I run Exim, and I'm also a huge believer in blocking spam at the SMTP level, and also do some things globally that should perhaps be per-user. I'm eagerly interested in everything this fellow has to say.


I've followed his site for a while, and he seems like a thoughtful person who likes to document his reasoning. I'm not a sysadmin, so I don't care about a lot of what he writes, but he's worth reading for the stuff I do care about. For example, he's pretty astute about the "how" and "why" of spam.


I was hoping to read something juicy like null termination was created by a summer intern.


This is the origination of null terminated strings in C:

"In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled '*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator."

From https://www.bell-labs.com/usr/dmr/www/chist.pdf


Nope! Only time-wasting and mind-engaging software like solitaire.


So am I overachieving by trying to learn React+Flux and Angular.js + typescript and also trying to learn ASP.NET 5 MVC 6 while working my internship?

Maybe I should just make HTML 5 games instead...


I don't know a less blunt way to say this, but the tools you use are usually not as important as the software you write. It sounds like you're more focussed on padding your resume with buzzwords. Try to work on writing an interesting application.


If you're aiming to land a job at a big company, "padding your resume with buzzwords" is not that bad, because HR people filter CVs from the pile using those same keywords. Not that I'm saying it's a good practice.


Those technologies are unlikely to still exist by the time you graduate. Learn C.


React and Redux are a great stack for making solitaire


I have seen it claimed that null-terminated strings were encouraged by the instruction sets of the time -- that some instruction sets make null-terminated sequences easier to handle than length-prefixed ones. The article's error-message-printing code snippet is a good example. Does anyone think there is any truth to this?


Null-terminated is going to be nice in most instruction sets: you don't need to keep track of a count, so you save a register, and you have one less thing to increment (or decrement). The loop condition is basically free too, since loading the next byte into a register is going to set the status register, so you don't need a compare; you can just branch if the zero flag is set.

As long as you are handling good data, it's clearly more efficient.


Also, null-terminated arrays are used for other things that are not strings. For example, https://developer.gnome.org/glib/stable/glib-String-Utility-... returns a null-terminated array of strings.


argv is a null-terminated array of strings.


> loading the next byte into a register is going to set the status register

not in x86


If you load into CX, you can use JCXZ instead of JZ, since you don't get the zero flag for free with a load.


Aside from the Zero flag, there was also a Negative/Signed flag. Setting the high-bit in the last character (or clearing it if the rest had the high bit set) was also a way to terminate strings and save a byte over null-terminated strings.

People who go ape-shit over 0-terminated strings would probably be even more upset if we still did that.


Yes, the PDP-11 instruction set makes null-terminated strings especially efficient. Strcpy() is 3 instructions per character, for example.

The PDP-11 features that make this possible are (1) post increment addressing and (2) MOVE instructions set the condition codes.


The ASCIZ directive on the PDP-11 assembler took a string literal and blitted it to memory with a trailing NULL, so there was a convenience factor there. BCPL and B used null-terminated strings, so it made sense for C to inherit the practice.

Pascal strings were significantly size-limited due to memory constraints (an implementation detail), so there was a reason to prefer null termination, but in essence we're still slaves to an obsolete instruction set.


Yes, the architecture on which you're most likely reading this (x86 or x64) also has semi-dedicated string handling instructions, akin to higher level string functions.

For example, a string copy 'instruction' that copies a 0 terminated string from one location to another looks like this:

  lea esi, source_string
  lea edi, dest_string
  repnz movsb           ; copies esi -> edi until *esi = 0


are you sure about that? http://docs.huihoo.com/help-pc/asm-repne.html http://faydoc.tripod.com/cpu/repnz.htm

repnz movsb is undocumented and undefined; MOVS doesn't modify flags, so I'm not sure how that would work.


Whoops, looks like I've indeed mixed up my assembler.

While rep/repe/repnz can be used in conjunction with movs(b/w), it will not terminate on the Z flag in the eflags register (so it won't stop when *esi == 0), but rather when ecx = 0.

So ecx needs to contain the size of the string you want to move, and rep movsb does not use the zero flag.


Maybe that's not the whole picture, but I think a big reason for null terminated strings is that C has pointers. A char pointer makes it practical to move through strings and a NULL terminator is easier to handle than a size test. In the end, sized strings make C code more complicated, unlike other languages such as Pascal.


Which also has pointers, yet makes use of safe strings.


True, I should have said pointers and pointer arithmetic. In Pascal you cannot do these things because there is no pointer arithmetic to move along strings and arrays.


Only if you are speaking about the boring ISO Pascal.

All Pascal dialects like Apple Pascal and Turbo Pascal had pointer arithmetic.


That was an extension to the Pascal language, it doesn't really change the way to write idiomatic code.


Moving goal posts are we?

Should I list all the compiler specific features that everyday C programmer uses but aren't defined in ANSI C, e.g. inline assembly?

No professional Pascal compiler was a pure ISO Pascal, that error was left for schools teaching bad Pascal.

Professional Pascal compilers were way more powerful than C, safer and in the days of K&R C chaos, just as portable.


I have made no comment about the quality of Pascal vs. C. In fact I like Pascal a lot, my comment is just on the way to write idiomatic code in each language. In C it is much easier and common to traverse data structures. Pascal may do it in some implementations, but its not how programmers used to organize their code. This just means that for C programmers it is frequently easier to work with NULL terminated strings.


I always felt like NUL-termination, newline-separation, and (eventually) UTF-8 were all sort of complementary ideas: they all take as an axiom that strings are fundamentally streams, not random-access buffers; and they all separate the space of single-byte coding units, by simple one-bitwise-instruction-differentiable means, into multiple lexical token types.

Taking all three together, you end up with the conception of a "string" as a serialized bitstring encoding a sequence of five lexical types: a NUL type (like the EOF "character" in a STL stream), an ASCII control-code type (or a set of individual control codes as types, if you like), a set of UTF-8 "beginning of rune" types for each possible rune length, a "byte continuing rune" type, and an ASCII-printable type. (You then feed this stream into another lexer to put the rune-segment-tokens together into rune-tokens.)

In the end, it's not a surprise that all of these components were effectively from a single coherent design, thought up by Ken Thompson. It's a bit annoying that each part ended up introduced as part of a separate project, though: NULs with Unix, gets() with C, and runes with Plan 9.

One of the pleasant things about Go's string support, I think, is that it was an opportunity for Ken to express the entirety of his string semantics as a single ADT. That part of the compiler is quite lovely.


How else would you implement them, seriously.

You have two choices, counted or terminated.

Counted places a complexity burden at the lowest level of coding.

With terminated you still have the option of implementing strings with structs or arrays with counts or anything.

And people did of course. Many many different implementations of safe strings exist in C; the fact that none have won out vindicates the decision to use sentinel termination.


One of the worst programming ideas ever dates back even earlier than we thought.

If only Dennis had had the foresight to nip that one in the bud...


what's your better idea? (hint: this has to work in assembly language.)


Strings and arrays begin with one pointer-sized word that indicates the size of the string/array, thus making all the various mutant versions of functions that work with them unnecessary. And eliminate the requirement to specify length separately when passing as a function argument. And make bounds-checking trivial. And almost entirely eliminate buffer overflows.

This would naturally want malloc to know the type and count, eg: char[] x = malloc(char[], 100). That means no opportunity to screw it up (let the compiler turn that into the sizeof math to pass to the actual allocator).

If bounds checking is a performance bottleneck you could turn it off with compiler flags; that's not a valid argument against it.

But hey... all the various buffer overflows, RCEs, and various exploits are totally worth the minor performance gains /sarcasm.


I don't think you appreciate the zeitgeist. People were building complex systems: compilers, operating systems, databases, numerical simulations, and worse in machines with less memory than an Arduino. Adding a byte to every string was widely viewed as madness.


Yes, of course. The only problem is that my iPhone is now more performant than the top supercomputer was then, but this ugly hack with strings is still there, alive and kicking.

And even then, one byte of memory could be nothing compared to CPU overhead. Or maybe not; RAM was insanely expensive in those days.


But you're comparing apples and pears. This 'ugly hack' is - for obvious reasons - not what 90% of software on your iPhone is actually using to manipulate strings. Instead, the ugly hack known as NSString tidily wraps the char buffer, its byte-length, possibly an offset - most application developers never deal with null-terminated strings!

So in other words, I don't really understand why you are arguing for replacing a standard - one that works well for its purposes, mind you - with another when this has in fact already happened. And even less I understand why you are trying to frame a good and sound engineering decision as somehow a mistake?


Indeed. It's like the mainframes' 2-digit years: it wasn't that the programmers were stupid or lazy, they did what they could with what they had.


Funny, because outside AT&T people were able to do it, like those crazy folks at Burroughs in 1961.


If the string is under 255 characters, using Pascal strings has the same cost: you "waste" a single byte either way, on a null character or on a counter. Null-terminated strings become more space efficient when you need to store longer strings.


This has more far reaching consequences. You can't create pointers to strings, or you have to restrict them to only point at the head of strings. You would still have overruns, but the focus would be on overwriting the length field at the start of the string instead of overwriting the null at the end, or maybe overwriting the length on the next string. This is a seismic change to just about everything string related in C, the days of people iterating over characters with a pointer would be over.

Just think about how something like strsep (strtok to you older folks) would work in a system like this.


Remember, this was all written before computer networking, let alone internetworking. The level of defensiveness required to write network-facing code wasn't realised at the time.

The prevailing view of how to do things was based on tape (magnetic or paper). In that context, it's really inconvenient to know how long your tape is before you write it and really convenient to just keep reading it until you hit a marker or the hardware says "sorry, end of tape". See also the "end of file" character, ctrl-D.

Judging history in hindsight is uncharitable.


Ignoring history is even worse.

Burroughs B5000 in 1961, with memory safe systems programming in an Algol dialect.

https://en.wikipedia.org/wiki/Burroughs_large_systems

Similar examples can be provided for systems older or of similar age to PDP-11.


"An ordinary unprivileged user with sufficient knowledge of the system needs only the ability to be able to modify machine-code in order to penetrate the system completely."

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.162...

Security was hard then and continues to be hard now. It would also be nice to have a cost comparison between the various alternatives.


You should take into account that the PDP-11 was pretty limited. You can't afford to "waste" memory on counters (for strings bigger than 255 chars) on a computer that handled a few MiB at best, when you can use the null-character trick, which was also efficient on those machines. Also, I think there was no motivation to make it secure, given that it was initially a pet project on an unused machine.


PDP-11 was certainly more powerful than the Burroughs machines, being one decade newer.


P-Strings (Pascal) do exactly this. The original ones limited it to 255 (instead of using the pointer size), but others used more. One of the nice things about using Pascal for programming back In The Day.

The article says it doesn't know where ZTSs came into Unix, but I thought I read somewhere, many moons ago, that it was a pretty simple choice: a length-prefixed string used one more byte of memory than a null-terminated one for any string longer than 255 characters, and on the systems at the time, that mattered a lot.


I thought the limited 255 byte strings were a nuisance in Turbo Pascal. It was one of the many reasons switching to C/C++ felt like getting out of jail.


A bit smarter is to combine the length field with the metadata stored already by malloc. Malloc (generally) knows the size of the allocation, probably not the exact size, but the size rounded up to 8 or 16 bytes. For example with glibc you call malloc_usable_size(ptr) (a non-standard function of course).

You can encode the byte length of the string minus the rounded up size (known by malloc) using the same trick that OCaml uses: https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2... Using this trick you can get the byte length of the string from the base pointer in O(1). Such strings can also contain \0 characters.

Thus you don't need to store the string length explicitly as a separate field, and you can even make string pointers within the string work, and it's backwards compatible with legacy functions that expect a null-terminated string (albeit it won't work if you pass a string containing \0 to one of those legacy functions, but that is to be expected).


I wrote more about this topic: https://news.ycombinator.com/item?id=10866416


I think I read something like this in a header file for a GNU licensed compiler at some point.

The String object was a character array (presumably terminated by a NULL) which was prefixed by a size_t (it may have been assumed that String inherited another precursor class).

Still, I do like the idea of working within a bounded set. This sounds much more like a language feature where you could work within an array of known size. Though most of the time when I care about easy String handling I'm better off writing it in a higher level language.


Make str_t an opaque type. The first byte is always the length.

1. If the string is 0-15 chars long, str_t is a small string, aka char[16]. The first byte is the length, the payload starts in the 2nd byte, and the tail is always zeroed. The empty string is all zeros, always.

2. If the string is 16-248 chars, str_t is a half string. The whole payload exists in a single contiguous buffer. The first byte is the length. A pointer to the rest of the payload exists within the remaining 15 bytes, starting at the first aligned address. This way you can cast a half string to a small string when passing by value without knowing the size at compile time (and still recover the rest of the payload). The gap between the length and the pointer is reserved and always zeroed.

3. For lengths 249 and higher: each value 249-255 is not the length, but an encoded way to represent one or more styles of bigger strings. No need to hold the payload in a single contiguous buffer. The remaining 15 bytes hold a pointer to the data structure in the same fashion as a half string. Reserved bytes are used to represent the length if possible. If not, the length is required to exist at a fixed position in the first buffer, accessible when dereferencing the pointer.


Wow, that sounds nightmarish. Three different ways to store the string, pointers to buffers everywhere, and because it is so complex your string library needs to account for all possibilities because there's no way anybody is going to want to manually twiddle with those strings.

If you get it right and hide all of the complexity from the user then it's not quite as bad, although undoubtedly confusing the first few times someone inspects it in their debugger.


Yes, you are right... it is very complex. But it is consistent enough that the compiler can do it all in the background.

If you think that it is a bad idea to do:

   str_t* string = (str_t*)malloc(string_length);
then half of the goal has been achieved right there.

The other half is to have this nailed down in the language definition in sufficient detail so that implementations do not differ widely and you can poke inside it with assembly when you have to.


I used to work for a company whose C++ string class worked like this, although not quite as intricate. It had a separate mode for statically compiled strings too. It was pretty nice, although I didn't agree with every style choice. It makes for some branchy code, though.

In Visual Studio debugger you can write scripts to pretty-print string variables in the "watch" window, I would guess other good debuggers can do this too.


String operations are intrinsically complex by nature. At least they are now, in an age when you cannot assume the encoding is always ASCII.

The good thing about this approach is that if you just need to pass the string around, a shallow copy of the first 16-byte block is enough. It might or might not contain the whole of the array, but if you need to know, it means you need to go through all the corner cases: no strcat(small_buffer, unsanitized_user_input), thank you very much!


Are there any other types in C that allocate space on the heap? Is that even possible in C? Not C++; there we have std::string, which is much as you describe above.


Ken had written most of the original asm kernel.


But Dennis could have reversed it by giving C -- and thus later Unix -- a better default, you know, like Pascal had.


The predecessor of Unix, Multics, was written in PL/I and was very innovative (modern OSes still borrow its "new ideas"): https://en.wikipedia.org/wiki/Multics


That was anticlimactic.



