
Knuth and the Unix Way - angersock
http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/
======
jhpriestley
McIlroy's basic point is that, instead of building single-use programs, we
should be directing our effort towards building reusable components, i.e. Unix
utilities. Then, over time, most programs could be built as a simple piping
together of high-level components.
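For reference, the pipeline under discussion (McIlroy's six-command solution
from the linked article) is:

    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q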

This vision seems to have failed for about four decades now, but it is still a
popular one. After no one was able to build serious applications by gluing
together Unix utilities, it was thought that we would instead build serious
applications by gluing together objects. This vision has also failed (or at
least, not succeeded in a very apparent way). In the meantime there have been
other promised types of "componentry", like COM, CORBA, WS-*, ..., all more or
less complete failures (or at least, egregiously unapparent successes).

So, in my view, it is Knuth who has the last laugh.

~~~
tmhedberg
I strongly disagree with the notion that the composition of small programs to
build larger ones is a "failed vision". Shell scripting is an invaluable tool
in the daily-use toolbox for a huge number of people, myself included. And
what is a shell, besides a programming language for gluing together other
programs?

Now, if you mean that no one is writing full-scale professional desktop or web
applications by piping together sed and awk, then of course you are correct.
But many projects come closer to this than you might think. Git, for instance,
was originally just a large collection of shell scripts, which has over time
been gradually rewritten in C. Many pieces of it are, even today, still
written in the POSIX Bourne shell.

To me, the principal reason for the astounding longevity of Unix is its
emphasis on composability. The tiny, ubiquitous Unix utilities are ancient
compared to virtually every other piece of software in common usage, yet are
no less useful today than when ken and dmr first conceived of them. By making
the system modular and composable at a very fundamental level, they ensured
(perhaps as much by accident as by intentional design) that users far into the
future would be able to continue using their tools for purposes not yet
dreamed of. And indeed, this is very much the case, and is likely to continue
that way for some time to come. Whatever the eventual successor to Unix turns
out to be, it's likely to have the same emphasis on small building blocks with
enormous synergy when composed.

Likewise, when you say that the vision of "building serious applications by
gluing together objects" has "not succeeded in a very apparent way", I frankly
have no idea what you're talking about. OOP is the dominant paradigm in modern
programming, bar none. You might argue that some other paradigm (FP, perhaps?)
would have served us better in retrospect, but the last thing anyone can
truthfully claim about the object-oriented approach is that it has not been
successful. It's hard to think of even one popular programming language that
doesn't borrow at least a few ideas from that school of thought.

I do agree with you about the "enterprisey" component approach being largely a
failure. Thankfully, CORBA, SOAP, et al. seem to be mostly behind us, or at
least rapidly receding.

~~~
grannyg00se
"Git, for instance, was originally just a large collection of shell scripts,
which has over time been gradually rewritten in C."

Doesn't that prove the parent's point, that the collection of supposedly
reusable components could not be glued together to form Git? There must be a
reason why it was rewritten in C. And if it had to be rewritten, then I don't
think we can call the "large collection of shell scripts" a complete success.

"OOP is the dominant paradigm in modern programming, bar none."

There could be many reasons for that, but it does not show that the actual
gluing together of objects has succeeded. How much of that OO work is actually
shared between projects? How much of it is simply OOP for its own sake,
because the language happened to be chosen for convenience?

To me success would be using objects from one project without modification in
a completely separate project. I just don't see that happening very often. If
the objects are simple enough you might as well just rewrite them. If they are
complicated enough to be of value to import, they usually require
modification.

~~~
pmarin
_Doesn't that prove the parent's point that the collection of supposedly
reusable components could not be glued together to form GIT? There must be a
reason why it was rewritten in C_.

Ironically, the main reason for the rewrite was the poor performance of the
fork system call on Windows (Cygwin), which made the system slow, particularly
for shell scripting.

The problem was Windows, not Git.

~~~
vrotaru

    http://libgit2.github.com/

From the blurb: libgit2 is a portable, pure C implementation of the Git core
methods provided as a re-entrant linkable library with a solid API, allowing
you to write native speed custom Git applications in any language which
supports C bindings.

and

    www.jgit.org

The point here seems to be that (ba)sh scripts are wonderful for one-off
tasks, but not so much for anything permanent.

------
bluesnowmonkey
A cheap shot by McIlroy. Knuth was specifically asked to write in the literate
style.

Word counting is one of those simple, domain-independent problems that lend
themselves well to code reuse. It's a rare type of problem. Most tasks
presented to a professional software developer could _not_ be solved by a
small shell script. A large and unmaintainable one, maybe.

~~~
lsb
McIlroy's code lends itself very well to literate coding! Here's a try:

# First, _tr_anslate multiple _s_queezed occurrences of the _c_omplement of
A-Za-z (non-word characters) into a line separator

    tr -cs A-Za-z '\n' |

# and then lowercase every word.

    tr A-Z a-z |

# Sort the words with a disk-based mergesort.

    sort |

# Count the unique words :: [String] -> [(Int,String)]

    uniq -c |

# And do a reverse numerical disk-based merge sort.

    sort -rn |

# And write the first $1 lines and then quit.

    sed ${1}q

Personally, I'd change the last two to be

    sort -rn -k 1,1 | head -n $1

but that's just bikeshedding. And the parts it's made from are so modular, and
so focused, that you can wrap your head around all of the problem, without
worrying about how sort works, or how tr expands character ranges.
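Assembled, the six stages form a complete program. A sketch (the script name
and usage line are mine, not McIlroy's):

    #!/bin/sh
    # wordfreq: print the $1 most frequent words of standard input
    # usage: sh wordfreq 10 < book.txt
    tr -cs A-Za-z '\n' |  # one word per line
    tr A-Z a-z |          # fold to lowercase
    sort |                # bring duplicate words together
    uniq -c |             # count each distinct word
    sort -rn |            # most frequent first
    sed "${1}q"           # print the first $1 lines, then quit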

If I gave a "professional software developer" in 2011 a problem to count word
frequencies, and I got back 10 pages of Pascal that, for my problem, didn't go
much faster than code that fits on a Post-It note, I wouldn't trust that
developer with anything else important.

~~~
gnuvince
Your code fails for French text.

~~~
lsb
As did Knuth's; see the discussion of that below.

If you wanted to change it to split "words" on blanks and punctuation
characters, you would have

    tr -s [:blank:][:punct:] "\n"

and then that piece would fit into the pipeline, the rest unscathed. This
design is far easier to reason about than a dozen pages of Pascal.

~~~
ralph
Those globs should be protected from the shell; otherwise a file in the
current directory such as ./bc (which the pattern [:blank:][:punct:] matches
as a glob) is going to alter tr's arguments.
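Quoted, that stage would read (a sketch of the fix):

    # quoting stops the shell from expanding the classes as a glob
    tr -s '[:blank:][:punct:]' '\n'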

------
dubya
To be fair to Knuth, it's not the sort of little utility program that he
usually works on. Judging from TeX, Metafont, and the TAOCP books, he usually
tackles really big problems, or very low-level things where cycle counts
matter. The two sorts in McIlroy's pipeline guarantee that his solution won't
scale well at all, especially when only the top 10 are wanted.

------
telemachos
I know that this really isn't the point, but this is bugging me: McIlroy's
pipeline doesn't properly handle words like _don't_ or _home-spun_.[1]

My first thought for a fix (still split on spaces, but now we need to remove
left-over punctuation):

    tr -s '[:space:]' '\n' | tr -d '",.!?' | etc.

I doubt the _tr_ of that time had the character classes, but it gets at the
general idea. I'm sure there are better alternatives, and I would be happy to
hear them.

Oh, and we would also need to remove single-quotes that might be left over,
but _only_ if they are at word boundaries, rather than within words.
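One rough sketch (mine, not from the thread): once the words are one per
line, apostrophes at the ends of a line are exactly the boundary ones, so an
extra sed stage can strip them:

    # delete apostrophes only at the start or end of each word (one
    # word per line here); internal ones, as in don't, survive
    tr -s '[:space:]' '\n' | sed "s/^'*//; s/'*$//"

I'm going back to work now...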

[1] <https://gist.github.com/1447872>

~~~
oconnor0
Does Knuth's version handle those words? If it does, that would discredit the
article's claim that McIlroy's version was bug-free.

~~~
jhck
No, Knuth's version doesn't handle those words either. Here's how he defines a
_word_ in his literate program:

 _Let's agree that a word is a sequence of one or more contiguous letters;
"Bentley" is a word, but "ain't" isn't. The sequence of letters should be
maximal, in the sense that it cannot be lengthened without including a
nonletter._

------
michaelty
Reminds me of this: <http://xkcd.com/664/>

~~~
Splines
The alt-text is sobering. How many ground-breaking ideas are put into everyday
appliances?

~~~
d0mine
_For every 0x5f375a86 we learn about, there are thousands we never see._

I'd say 0x5f375a86/10000 ~ 160000 (known) / 1 (never see) is a highly
optimistic ratio.

~~~
jsnell
I'm not sure if you're joking, but that hex number isn't a count. It's a
reference to a specific instance of an elegant solution to a problem (the fast
reciprocal square root function from Quake).

~~~
pjscott
For those who are curious, Wikipedia has a fascinating explanation, and the C
code in question:

<http://en.wikipedia.org/wiki/Fast_inverse_square_root>

------
luriel
> I can’t help wondering, though, why he didn’t use head -${1} in the last
> line. It seems more natural than sed. Is it possible that head hadn’t been
> written yet?

I'm not sure when head(1) was added, but I doubt it was added by anyone on the
Unix core team at Bell Labs; it certainly didn't make it into Plan 9, because
it is redundant with sed, as the example illustrates.

sed 11q is concise and clear, no need for shortcuts, but if you really have
to, you can write your own head shell script that just calls sed.
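For instance (a sketch; it takes the line count as a plain first argument
rather than as -n):

    #!/bin/sh
    # head: print the first N lines of standard input (default 10)
    sed "${1:-10}q"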

~~~
ajross
sed is concise and clear if you know the command set, which was true of all
developers at the time. But in the modern world, we all think in
awk^H^H^Hperl^H^H^H^Hruby or whatever. On the command line, generic tools like
sed, while they still work, have lost a lot of their relevance in favor of
simpler stuff like head/tail which have much clearer semantics.

------
TheRevoltingX
So, Knuth wrote it in 10 pages from the ground up?

But, the other guy just used unix commands.

I wonder how many lines of code all those utilities add up to.

Probably more than 10 printed pages.

~~~
p9idf
I checked the Plan 9 tools, which are similar to the Unix ones. Though without
access to Knuth's program, I can't draw a sensible conclusion from the
following data.

    term% wc /sys/src/cmd/^(tr.c sort.c uniq.c sed.c)
        356     998    5993 /sys/src/cmd/tr.c
       1752    4526   28371 /sys/src/cmd/sort.c
        165     346    2185 /sys/src/cmd/uniq.c
       1455    4267   26848 /sys/src/cmd/sed.c
       3728   10137   63397 total

    term% wc <{man tr} <{man sort} <{man uniq} <{man sed}
         54     269    2028 /fd/8
        137     741    5542 /fd/7
         41     149    1225 /fd/6
        208    1078    8464 /fd/5
        440    2237   17259 total

------
zitterbewegung
Isn't there a middle ground where you can remain in the spirit of literate
programming but still use libraries? That seems like it would be the best of
both worlds, and it would address the article's biggest critique, unless I'm
missing the point.

~~~
housel
Yes, one can write libraries in literate style, and I have done so
(<http://monday.sourceforge.net/>) in the past. However, from reading
interviews with Knuth it seems that he does not find reusable software very
interesting in itself.

------
singular
I hate to nitpick, but since the explanation is being held up as wonderfully
clear, I must note that there's a typo:

'Make one-word lines by transliterating the complement (-c) of the alphabet
into newlines (note the quoted newline), and squeezing out (-a) multiple
newlines.'

Surely should be 'squeezing out (-s)'? Unless I'm missing something?

~~~
telemachos
The comments on the blog suggest that the typo is the blog author's, not
McIlroy's.

~~~
singular
Sorry, I probably wasn't clear: that's what I meant :). If he's holding it up
as the paragon of clarity, he ought to reproduce it clearly.

It's a nit I know, but it confused me when I wanted to run through it and
understand what was going on :)

------
emmelaich
An interesting exercise would be to actually maintain both versions: e.g. make
them UTF-8 friendly; add error handling. I think McIlroy's version would still
come out on top, but not by as much. Nice, informative error handling is one
of the things most lacking in shell scripts.
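One partial remedy today (a sketch; pipefail is a bash/ksh feature, not
historically in POSIX sh):

    #!/bin/bash
    # without pipefail, a pipeline's exit status is that of its last
    # stage, so failures in earlier stages are silently swallowed
    set -o pipefail
    if ! tr -cs A-Za-z '\n' < "$1" | sort | uniq -c | sort -rn; then
        echo "$0: pipeline failed on $1" >&2
        exit 1
    fi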

~~~
4ad
It was fixed in Plan 9, where programs return error strings rather than
numbers, the shell (rc) is better, and errors are not lost in pipelines.

And UTF-8 was invented by Ken Thompson and Rob Pike while working on Plan 9.

~~~
emmelaich
Yes, I'm aware of Plan 9. The question of what constitutes a word, and the
error handling, were and are the more interesting improvements.

------
mise
Being a PHP programmer, I struggle to adopt such programming practices.

For example, I copy my database class into each new project :(

How can I approach PHP development in a McIlroy manner?

Part of the problem is not understanding how to incorporate library
repositories into my specific project repo.

------
taeric
Imagine if the shell script had been written in a literate way? :)

------
adobriyan
Try hyphenating text with a "simple pipeline".

~~~
p9idf

    pic challenge-accepted.ms | tbl | eqn | troff -ms

~~~
adobriyan
You're missing the point (intentionally, I suspect).

~~~
p9idf
Your point isn't clear to me.

~~~
adobriyan
Once a problem becomes a non-toy problem, pipelines lose. troff(1) does all
the hyphenation itself, in C or C++ code, rather than in the pipeline.
Ironically, troff is like TeX: it processes documents.

~~~
p9idf
Of course troff hyphenates. Its job in the pipeline is to hyphenate, compute
layouts, and output input for dpost. If your argument is that decoupling
hyphenation from layout is difficult, I believe you are wrong. Splitting the
hyphenator into a separate program is straightforward enough. Consider the
text formatter

    hyphen < input | layout | topostscript

where hyphen inserts all possible hyphens, and layout discards those which are
unnecessary.
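As a toy illustration only (mine, not p9idf's; real hyphenators use pattern
tables, as troff and TeX do), the hyphen stage could mark break points with
troff's discretionary-hyphen escape \%:

    # toy hyphen stage: allow a break after two common prefixes;
    # a real implementation would consult a hyphenation pattern table
    sed -e 's/con/con\\%/g' -e 's/pre/pre\\%/g'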

