Knuth and the Unix Way (leancrew.com)
231 points by angersock on Dec 8, 2011 | 62 comments

McIlroy's basic point is that, instead of building single-use programs, we should be directing our effort towards building reusable components, i.e. Unix utilities. Then, over time, most programs could be built as a simple piping together of high-level components.

This vision seems to have failed for about 5 decades now, but it is still a popular one. After no one was able to build serious applications by gluing together Unix utilities, it was thought that we would instead build serious applications by gluing together objects. This vision has also failed (or at least, not succeeded in a very apparent way). In the meantime there have been other promised types of "componentry", like COM, CORBA, WS-*, ..., all more or less complete failures (or at least, egregiously unapparent successes).

So, in my view, it is Knuth who has the last laugh.

I strongly disagree with the notion that the composition of small programs to build larger ones is a "failed vision". Shell scripting is an invaluable tool in the daily-use toolbox for a huge number of people, myself included. And what is a shell, besides a programming language for gluing together other programs?

Now, if you mean that no one is writing full-scale professional desktop or web applications by piping together sed and awk, then of course you are correct. But many projects come closer to this than you might think. Git, for instance, was originally just a large collection of shell scripts, which has over time been gradually rewritten in C. Many pieces of it are, even today, still written in the POSIX Bourne shell.

To me, the principal reason for the astounding longevity of Unix is its emphasis on composability. The tiny, ubiquitous Unix utilities are ancient compared to virtually every other piece of software in common usage, yet are no less useful today than when ken and dmr first conceived of them. By making the system modular and composable at a very fundamental level, they ensured (perhaps as much by accident as by intentional design) that users far into the future would be able to continue using their tools for purposes not yet dreamed of. And indeed, this is very much the case, and is likely to continue that way for some time to come. Whatever the eventual successor to Unix turns out to be, it's likely to have the same emphasis on small building blocks with enormous synergy when composed.

Likewise, when you say that the vision of "building serious applications by gluing together objects" has "not succeeded in a very apparent way", I frankly have no idea what you're talking about. OOP is the dominant paradigm in modern programming, bar none. You might argue that some other paradigm (FP, perhaps?) would have served us better in retrospect, but the last thing anyone can truthfully claim about the object-oriented approach is that it has not been successful. It's hard to think of even one popular programming language that doesn't borrow at least a few ideas from that school of thought.

I do agree with you about the "enterprisey" component approach being largely a failure. Thankfully, CORBA, SOAP, et al. seem to be mostly behind us, or at least rapidly receding.

"Git, for instance, was originally just a large collection of shell scripts, which has over time been gradually rewritten in C. "

Doesn't that prove the parent's point that the collection of supposedly reusable components could not be glued together to form Git? There must be a reason why it was rewritten in C. And if it had to be rewritten then I don't think we can call the "large collection of shell scripts" a complete success.

"OOP is the dominant paradigm in modern programming, bar none."

There could be many reasons for that but it does not address the point that the actual gluing together of objects has somehow succeeded. How much of that OO work is actually shared between projects? How much of it is simply OOP for the sake of it because the language was chosen for convenience?

To me success would be using objects from one project without modification in a completely separate project. I just don't see that happening very often. If the objects are simple enough you might as well just rewrite them. If they are complicated enough to be of value to import, they usually require modification.

> There must be a reason why it was rewritten in C.

The motivation for rewriting in C was performance on Windows - since Windows has very inefficient forking, shell scripts are very slow there[1]. I think that says more about the deficiencies of Windows than the deficiencies of the shell scripting/pipeline model.

This is slightly beside the point anyway. Even though most of Git has been rewritten in C, Git is still composed of lots of little single-purpose commands, which I have used in pipelines to do some "outside the box" processing on Git repositories.

[1] http://en.wikipedia.org/wiki/Git_%28software%29#Portability
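As a rough illustration of using Git's single-purpose commands in a pipeline, here is a minimal sketch that counts commits per author in the current repository. The function name `commits_per_author` is made up for this example; `git log --format='%an'` prints one author name per commit, and the rest is ordinary Unix plumbing. (Git's own `git shortlog -sn` gives a similar summary, so this is only meant to show the composition.)

```shell
# Hypothetical helper: count commits per author in the current repository.
# Assumes it is run inside a Git working tree with at least one commit.
commits_per_author() {
  git log --format='%an' |  # one author name per commit
  sort |                    # group identical names together
  uniq -c |                 # prefix each name with its commit count
  sort -rn                  # most prolific authors first
}
```

Running `commits_per_author | sed 5q` would then list the five most frequent authors.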

> Doesn't that prove the parent's point that the collection of supposedly reusable components could not be glued together to form Git? There must be a reason why it was rewritten in C.

Ironically, the main reason for the rewrite was the poor performance of the fork system call on Windows (Cygwin), which makes the system slow, particularly for shell scripting.

The problem was Windows, not Git.

from the blurb: libgit2 is a portable, pure C implementation of the Git core methods provided as a re-entrant linkable library with a solid API, allowing you to write native speed custom Git applications in any language which supports C bindings.


The point here seems to be that (ba)sh scripts are wonderful for one-time tasks. Not so much for anything else/permanent workflows.

The problem is the brain-damaged shell which, probably for all sorts of accumulated legacy crap from the '70s, uses fork+exec instead of the far more efficient spawn (or posix_spawn).

Composition should be of libraries or algorithms, not arbitrary black-box programs with n different options. Shell scripting for anything non-trivial is a pain, as nothing is portable or reusable. Linux has extensions to POSIX, as does FreeBSD, and so do their specific core utilities.

> Now, if you mean that no one is writing full-scale professional desktop or web applications by piping together sed and awk, then of course you are correct.

That very much depends on where you draw the line. If Ruby on Rails, WordPress or Django are "professional web applications", then so is werc [1], a "web anti-framework" which is built entirely on rc, the Plan 9 shell.

xmobar [2], dzen [3] and geektool [4] are all notification tools that can watch files for changes or accept, process and display input through pipes.

Quicksilver [5] uses piping, though this is arguably a different paradigm — select object first, then the action, then details. I wonder why this paradigm isn't more popular and articulated, especially in the light of ubiquity of indexing tools like Spotlight on OS X, wonderbar in Firefox and DDG bang operators [6]. It would be interesting to see the trend continue.

Overall, this piping stuff around approach is very subdued, but virtually omnipresent. Modern cell phones (Samsung Bada, anyone?), TVs and other "smart" appliances provide a glimpse of the non-composable world.

[1] http://werc.cat-v.org/docs/

[2] http://projects.haskell.org/xmobar/

[3] https://sites.google.com/site/gotmor/dzen

[4] http://projects.tynsoe.org/en/geektool/

[5] http://www.blacktree.com/

[6] http://duckduckgo.com/bang.html

What's somewhat amusing, though, is that there doesn't seem to be a great deal of wholly bespoke software being written in certain industries (games, a lot of the Web stuff). It's pretty rare that a thick monolithic client of any real seriousness is still being produced.

In a way, a lot of folks are just plumbing together little utilities and libraries to do what they could be programming themselves. jQuery, Rails, node.js, redis, SQLite, etc. are the components for larger plumbed systems.

So, with the overwhelming reliance on frameworks and third-party libraries in modern software development, perhaps McIlroy may be shown right yet.

I would say that in the modern day, we seem to be building plenty of serious applications by gluing together web service APIs.

I think functional programmers would disagree with your thesis :-) As an example http://xmonad.org/manpage.html - Window manager in 200 lines of code http://www.haskell.org/haskellwiki/Xmonad/Screenshots



I am sure there are pretty cool Smalltalk and Lisp examples too.

Knuth is an algorithms engineer and his solution reflects his background. McIlroy's invention of pipes was seminal, but the state of the art has moved on much further. I think David Turner's contributions via SASL have never been fully acknowledged. I would posit that we now know how to go about achieving the composability techniques required to build large programs easily. The answer seems to lie embedded somewhere deep inside category theory! The ideas are slowly getting popular and seeping over to mainstream languages. At least for me, the way I program today is fundamentally different from 12 years ago.

As sedq, who is dead for some reason, said:

> When I read comments like this I'm left with two questions: 1. What are "serious applications"? 2. How does this person define "failed"?

> This task of counting word frequencies looks like an ideal job for awk. But there are many ways to do it, and with different UNIX utilities. UNIX is remarkably flexible.

> When awk was introduced it was not imagined that people would try to write 10 page programs with it. But of course, they did. Are these the so-called "serious applications" that some people want to write?

> If UNIX utilities and pipes (including the pipe function in C) are a "failure", why are they still with us after so many years? If that's failure, then what is "success"?

> I think it comes down to what you're trying to do. For processing text, such as the task discussed in the article, I find UNIX utilities to be enough.

> I still have no idea what "serious applications" are. Is text processing "serious"?

> Much of UNIX's userland was intended for text processing. And for that it works very well.

> If UNIX utilities and pipes (including the pipe function in C) are a "failure", why are they still with us after so many years?

I guess "us" means 1% of the population, since Mac/Windows users generally don't use the command line and the apps they do use aren't held together with pipes.

It's rather pointless to talk about the greater population of users when considering whether a programming technology is pertinent. Sure, non-technical users don't use pipes. They also don't do any programming, and are completely apathetic regarding which programs they use outside of a few "core" ones like office or photoshop. In a word, whether Bob next door uses pipes or not is immaterial.

My point is that Office and Photoshop don't use pipes either (AFAIK).

(BTW, how did sedq get banned within 3 hours of creating an account? This doesn't seem right.)

I'm sure this topic is dead, but just so you know, kernel debugging for Windows uses named pipes, at least when VMware is involved. .NET added support for named pipes in release 3.5. Mac OS X seems to fully support pipes, too. I suspect pipes are used, but transparently, by most programs that use them.

A cheap shot by McIlroy. Knuth was specifically asked to write in the literate style.

Word counting is one of those simple, domain-independent problems that lend themselves well to code reuse. It's a rare type of problem. Most tasks presented to a professional software developer could not be solved by a small shell script. A large and unmaintainable one, maybe.

McIlroy's code lends itself very well to literate coding! Here's a try:

  # First, translate (tr) squeezed (-s) runs of the complement (-c) of A-Za-z (non-word characters) into a line separator
  tr -cs A-Za-z '\n' |
  # and then lowercase every word.
  tr A-Z a-z |
  # Sort the words with a disk-based mergesort.
  sort |
  # Count the occurrences of each unique word :: [String] -> [(Int,String)]
  uniq -c |
  # And do a reverse numerical disk-based merge sort.
  sort -rn |
  # And write the first $1 lines and then quit.
  sed ${1}q
Personally, I'd change the last two to be

  sort -rn -k 1,1 | head -n $1
but that's just bikeshedding. And the parts it's made from are so modular, and so focused, that you can wrap your head around all of the problem, without worrying about how sort works, or how tr expands character ranges.

If I gave a "professional software developer" in 2011 a problem to count word frequencies, and I got back 10 pages of Pascal that didn't go much faster than code that fits on a Post-It note for my problem, I wouldn't trust that developer with anything else important.

The reason for preferring sed 42q over head -n 42 is that head didn't use to exist. IIRC it was created at Berkeley and for some years there were many systems that didn't have it. It does seem a bit superfluous when all it could do was head -42 (the -n and other options came later). I still write sed 42q to this day. :)

Your code fails for French text.

As did Knuth's; see below the discussion about that.

If you wanted to change it, to split "words" by blanks and punctuation characters, you would have

  tr -s [:blank:][:punct:] "\n"
and then that piece would fit into the pipeline, the rest unscathed. This design is far easier to reason about than a dozen pages of Pascal.

Those globs should be protected from the shell otherwise ./bc is going to alter tr's arguments.
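A minimal sketch of the quoted form, so the shell passes the character classes to tr untouched even if the current directory contains a file whose name happens to match the bracket patterns. The function name `split_words` is made up for illustration.

```shell
# Hypothetical helper: split standard input into one token per line.
# Single quotes stop the shell from treating [:blank:][:punct:] as a glob.
split_words() {
  tr -s '[:blank:][:punct:]' '\n'
}
```

For example, `printf 'a, b.c\n' | split_words` puts `a`, `b` and `c` on separate lines, and the rest of the pipeline stays as it was.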

How many modern programs comply with Unicode Text Segmentation http://unicode.org/reports/tr29/ for finding word boundaries or Unicode Collation Algorithm http://unicode.org/reports/tr10/ for sorting the words?

Honestly, that looks more like examples in a man page than explanations of an algorithm.

As Knuth was given his task, McIlroy was given his own which he performed well.

I disagree that code reuse only works in rare instances. Rather, code reuse works better in small focused pieces than in large chunks. Utilities such as "sort" get used a lot.

The rare part is that a whole solution can be made out of UNIX utilities, not that UNIX utilities are used at all. Being familiar with the utilities available, much of the coding I currently do ends up being domain specific processing called from a shell script, often in a pipeline.

Also, sometimes I can answer feature requests from users by showing them how they can actually do it themselves on the UNIX command line. This kind of thing is often forgotten, as it's no longer "development." But code I don't have to write because the user can reuse utilities should certainly count.

Agreed. Knuth no doubt chose a simple, but non-trivial, program to build from scratch to illustrate the technique. McIlroy responded with an equivalent made up from Lego bricks. Knuth documented what he did. McIlroy said "I used this command" n times.

To be fair to Knuth, it's not the sort of little utility program that he usually works on. At least from TeX, Metafont, and the TAOCP books, it's usually really big problems, or very low-level things where cycle counts matter. The two sorts in McIlroy's pipeline guarantee that his solution won't scale well at all, especially when only the top 10 are wanted.
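One way to sidestep the first sort, sketched here under the assumption that the set of distinct words fits in memory, is to count with an awk associative array instead of `sort | uniq -c`. The name `wordfreq_hash` is invented for this example, and it is a variation on McIlroy's pipeline, not something from the original exchange. The remaining `sort -rn` then only has to order the distinct words rather than every word in the input.

```shell
# Hypothetical variant of the pipeline: count words in a hash table.
# Usage: wordfreq_hash N < text   (prints the N most frequent words)
wordfreq_hash() {
  tr -cs A-Za-z '\n' |   # one word per line
  tr A-Z a-z |           # lowercase everything
  awk 'length($0) > 0 { count[$0]++ }                 # skip empty lines, tally words
       END { for (w in count) print count[w], w }' |  # emit "count word" pairs
  sort -rn |             # highest counts first
  sed "${1}q"            # keep the first N lines
}
```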

I know that this really isn't the point, but this is bugging me: McIlroy's pipeline doesn't properly handle words like don't or home-spun.[1]

My first thought for a fix (still split on spaces, but now we need to remove left-over punctuation):

    tr -s '[:space:]' '\n' | tr -d '",.!?' | etc.
I doubt the tr of that time had the character classes, but it gets at the general idea. I'm sure there are better alternatives, and I would be happy to hear them.

Oh, and also, we would need to remove single-quotes that might be left-over too, but only if they are at word boundaries, rather than within words. I'm going back to work now...

[1] https://gist.github.com/1447872
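A possible way to handle the quotes-at-word-boundaries case is to split on whitespace and then strip any non-letters from the start and end of each token, which leaves internal apostrophes and hyphens alone. This is only a sketch: the name `words` is made up, and trimming everything that is not a letter is an assumption that also removes leading or trailing digits.

```shell
# Hypothetical helper: one word per line, keeping "don't" and "home-spun" intact.
words() {
  tr -s '[:space:]' '\n' |           # split on runs of whitespace
  sed -e 's/^[^[:alpha:]]*//' \
      -e 's/[^[:alpha:]]*$//' \
      -e '/^$/d'                     # trim non-letters at both ends, drop empty tokens
}
```

Its output can be fed into the rest of the pipeline (`tr A-Z a-z | sort | uniq -c | sort -rn`) unchanged.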

Does Knuth's version handle those words? Since that would discredit the article's claim that McIlroy's version was bug-free.

No, Knuth's version doesn't handle those words either. Here's how he defines a word in his literate program:

Let's agree that a word is a sequence of one or more contiguous letters; "Bentley" is a word, but "ain't" isn't. The sequence of letters should be maximal, in the sense that it cannot be lengthened without including a nonletter.

Whether it does or doesn't, it probably includes an explanation as to why it does or doesn't, perhaps with references to articles going into more detail on the virtues of each possibility.

Although the consensus seems to be contractions count as one word, some people still believe they should count as two.

I suppose then the problem would be counting "can't" as a "can" and a "not".

Bug-free seems a bit of a strong claim either way, good catch ;)

I don't know (I haven't seen the full original articles.) One of the McIlroy quotes mentions "handling of punctuation" and "isolation of words", so clearly they were thinking of some of these edge cases.

> In particular the isolation of words, the handling of punctuation, and the treatment of case distinctions are built in.

Reminds me of this: http://xkcd.com/664/

The alt-text is sobering. How many ground-breaking ideas are put into everyday appliances?

> For every 0x5f375a86 we learn about, there are thousands we never see.

I'd say 0x5f375a86/10000 ~ 160000 (known) / 1 (never see) is a highly optimistic ratio.

I'm not sure if you're joking, but that hex number isn't a count. It's a reference to a specific instance of an elegant solution to a problem (the fast reciprocal square root function from Quake).

For those who are curious, Wikipedia has a fascinating explanation, and the C code in question:


There is a vast amount of work done in the cubicle-land that never gets outside of the legal walls.

> I can’t help wondering, though, why he didn’t use head -${1} in the last line. It seems more natural than sed. Is it possible that head hadn’t been written yet?

I'm not sure when head(1) was added, but I doubt it was added by anyone in the Unix core team at Bell Labs. It certainly didn't make it to Plan 9, because it is redundant with sed, as the example illustrates.

sed 11q is concise and clear, no need for shortcuts, but if you really have to, you can write your own head shell script that just calls sed.
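Such a wrapper might look roughly like the sketch below, written as a shell function named `myhead` (a made-up name, to avoid shadowing the real head). It assumes the old-style `-N` option and a default of 10 lines. Unlike the real head, it treats several files as one concatenated stream, and N must be at least 1.

```shell
# Hypothetical head replacement built on sed.
# Usage: myhead [-N] [file...]   (N defaults to 10 and must be >= 1)
myhead() {
  n=10
  case ${1-} in
    -[0-9]*) n=${1#-}; shift ;;   # old-style count option, e.g. -3
  esac
  sed "${n}q" "$@"                # print up to line n, then quit
}
```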

sed is concise and clear if you know the command set, which was true of all developers at the time. But in the modern world, we all think in awk^H^H^Hperl^H^H^H^Hruby or whatever. On the command line, generic tools like sed, while they still work, have lost a lot of their relevance in favor of simpler stuff like head/tail which have much clearer semantics.

According to the OpenBSD man page, head(1) first appeared in 3.0 BSD, so that would be 1979.

So, Knuth wrote it in 10 pages from the ground up?

But, the other guy just used unix commands.

I wonder how many lines of code all those utilities add up to.

Probably more than 10 printed pages.

I checked the Plan 9 tools, which are similar to the Unix ones. Though without access to Knuth's program, I can't draw a sensible conclusion from the following data.

  term% wc /sys/src/cmd/^(tr.c sort.c uniq.c sed.c)
      356     998    5993 /sys/src/cmd/tr.c
     1752    4526   28371 /sys/src/cmd/sort.c
      165     346    2185 /sys/src/cmd/uniq.c
     1455    4267   26848 /sys/src/cmd/sed.c
     3728   10137   63397 total

  term% wc <{man tr} <{man sort} <{man uniq} <{man sed}
       54     269    2028 /fd/8
      137     741    5542 /fd/7
       41     149    1225 /fd/6
      208    1078    8464 /fd/5
      440    2237   17259 total

Isn't there a middle ground where you can remain in the spirit of literate programming but still use libraries? It seems like that would be the best way. That seems like it would handle the biggest critique of the article unless I'm missing the point.

Yes, one can write libraries in literate style, and I have done so (http://monday.sourceforge.net/) in the past. However, from reading interviews with Knuth it seems that he does not find reusable software very interesting in itself.

This article is not about literate programming.

I hate to nitpick, but if the explanation is being held up as wonderfully clear, I must note that there's a typo:

'Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-a) multiple newlines.'

Surely should be 'squeezing out (-s)'? Unless I'm missing something?

The comments on the blog suggest that the typo is the blog author's, not McIlroy's.

Sorry probably wasn't clear - that's what I meant :). If he's holding it up as the paragon of clarity, he ought to reproduce it clearly.

It's a nit I know, but it confused me when I wanted to run through it and understand what was going on :)

An interesting exercise would be to actually maintain both versions: e.g. make utf-8 friendly; add error handling. I think McIlroy's version would still come out on top but not by as much. Nice, informative error handling is one of the things most lacking in shell scripts.
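As a small step in that direction, here is a sketch of the same pipeline wrapped in a function that checks its argument and reports a readable error before doing any work. The name `wordfreq`, the usage message and the exit status 2 are choices made for this example. It does not attempt UTF-8 handling, and it does not detect failures inside the pipeline itself.

```shell
# Hypothetical wrapper: McIlroy's pipeline with basic argument checking.
# Usage: wordfreq N < text
wordfreq() {
  case ${1-} in
    ''|*[!0-9]*|0*)   # missing, non-numeric, zero, or leading-zero argument
      echo "usage: wordfreq N  (N must be a positive integer)" >&2
      return 2 ;;
  esac
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1}q"
}
```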

It was fixed in Plan9, where programs return error strings, not numbers, the shell (rc) is better and errors are not lost in pipelines.

And UTF-8 was invented by Ken Thompson and Rob Pike while working on Plan9.

Yes, I'm aware of plan9. The 'what constitutes a word' and the error handling were and are the more interesting improvements.

Being a PHP programmer, I struggle to espouse such programming practices.

For example, I copy my database class into each new project :(

How can I approach PHP development in a McIlroy manner?

Part of the problem is not understanding how to incorporate library repositories into my specific project repo.

Imagine if the shell script had been written in a literate way? :)

Try hyphenating a text with "simple pipeline".

  pic challenge-accepted.ms | tbl | eqn | troff -ms

You're missing the point (intentionally, I suspect).

Your point isn't clear to me.

Once a problem becomes a non-toy problem, pipelines lose. troff(1) will do all the hyphenation (C or C++ code) in your code. Ironically troff is like TeX, it processes documents.

Of course troff hyphenates. Its job in the pipeline is to hyphenate, compute layouts, and output input for dpost. If your argument is that decoupling hyphenation from layout is difficult, I believe you are wrong. Splitting the hyphenator into a separate program is straightforward enough. Consider the text formatter

  hyphen < input | layout | topostscript
where hyphen inserts all possible hyphens, and layout discards those which are unnecessary.
