
Revisiting Knuth and McIlroy's word count programs (2011) - tosh
http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/
======
userbinator
I briefly investigated literate programming a long time ago and came away with
similar conclusions --- you can achieve a very similar effect by simply
drowning your source code in needlessly verbose comments, and being able to
move around blocks of code with no regard to their execution order is more
obfuscation than anything else. Ironically, the "modern" programming style
commonly seen in Java and C# seems to be approximating this, with their
verbosity and highly scattered execution flow.

This is extremely undesirable for debugging, but then again, Knuth probably
doesn't need to debug much; the fact that TeX is written using his literate
programming tools, yet remains remarkably bug-free, should be taken more as an
endorsement of his skill than of the tools.

The underlying philosophy behind literate programming seems to center on
using human language, but from working with and observing a
lot of other highly productive and effective developers, I've realised that
trying to make programming more like a human language doesn't really work; the
way to go is to know the _programming_ language, and the underlying machine,
very very well. One of my favourite examples of this is APL, where those
experienced can read and write programs in it as easily as they can a human
language; programs which are completely incomprehensible to anyone who doesn't
know the language.

~~~
gnuvince
I used literate programming to write ~25 programs last year, and I found that
the coolest thing about the technique was crafting a narrative for
understanding the problem and appreciating the solution.

One difficulty when reading somebody else's program is thinking "why didn't he
just do X?" and not having a clear answer. In a good literate program—and it's
harder to write good literate programs than simple programs—the author can
touch on why a particular approach is unsuitable, or why he preferred one
valid approach over another valid approach.

In the end, I think I prefer to write "regular code"; tools like editors,
debuggers, build systems, etc. work better with real code. Still, I try to
write doc-comments that inject better narratives into my programs and,
hopefully, help the readers attain a better understanding.

~~~
LukeShu
I've found that getting to include the narrative is great when you're first
writing it, and then for making small changes. But then, when something
happens that changes your perspective on the problem, the structure of the
narrative often needs to change substantially, while the structure of the code
might not. And suddenly that formerly helpful narrative is a huge piece of
legacy cruft that needs to be refactored... to support a few-line change in
the actual code.

For that reason, I'm with you in that I try to write doc-comments that inject
the narrative.

~~~
taeric
I'm curious if you have examples of times when the narrative changed
dramatically.

I have found that it is easy to rat-hole, thinking you know the entire
structure, and to start on the narrative of how you are writing it. I'm
convinced that even most of Knuth's programs were written a bit more
holistically, and that a narrative then emerged - not necessarily of how the
code was written, but of how the code could be explained.

And really, this is no different than the other abstractions we have at our
disposal. It is rare that the original function layout of the system survives
for long. Why would you think your initial narrative would?

------
taeric
That exchange was always odd. Knuth was showcasing a style of programming.
McIlroy was solving the problem.

That is, the critique was somewhat orthogonal. He even agreed that it was a
great exposition of the data structure used. What he feared was that it would
bias people toward overly complicated programs. Instead of solving problems,
they would craft elaborate solutions.

And to a large extent, that fear is a compelling one. It is a large part of
my own unease when working with some folks' elaborate abstractions in code I
deal with. Sometimes, they are absolutely needed. Often, they are not.

For pedagogical reasons, literate programming is tough to beat. Look up wc
done literately. Quite easy to read. Even Knuth's programs are easier to read
than you would think. Helped in large part by his style.

~~~
whipoodle
I disagree. You would only think the actual implementation is an
implementation detail if you don't have to ship things, as academics like
Knuth don't. There is more to learn from it than just that, but I think that's
a clear point here.

~~~
taeric
I'm not sure I follow your point. What do you disagree with from what I said?

The task given to Knuth was to demonstrate literate programming using a toy
program. He did so.

McIlroy showed how to solve the toy problem.

Oddly, Knuth's is the one I would reach for as part of my program. McIlroy's
is the one I use interactively all the time. Literally last week I did
basically that pipeline in my shell about ten times during log dives.

If I found that Elasticsearch or some other tool was doing something like
that pipeline on my queries, I would be rather unhappy. Not to mention I would
find a hella easy speedup by rewriting it.

That said, I am now highly interested in benchmarking those two programs.
Would be curious to see the results, if someone else already has.
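
In case anyone wants a starting point, a rough harness (a sketch; knuth_wc
and wordc.sh are stand-in names for however you build the two programs
locally, and Knuth's program's actual interface may differ):

    
    
        # Build a reasonably large test input, e.g. from a man page.
        man bash | col -b > input.txt
        # Knuth's compiled WEB/Pascal program (stand-in name and interface).
        time ./knuth_wc 10 < input.txt
        # McIlroy's six-command pipeline saved as a script.
        time ./wordc.sh 10 < input.txt
    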

------
vram22
I had commented on Chen's post, here:

[http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/#comment-1593558515](http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/#comment-1593558515)

and linked to a blog post of mine, in which I too had written two solutions
(one in Python and one in shell) to the original problem posed to Knuth by
Bentley. Fun.

Here are my solutions:

[https://jugad2.blogspot.in/2012/07/the-bentley-knuth-problem-and-solutions.html](https://jugad2.blogspot.in/2012/07/the-bentley-knuth-problem-and-solutions.html)

A roughly similar problem is posed and solved in the Kernighan and Pike book,
The Unix Programming Environment, IIRC, where they mention it as a nice
example of the power of combining components, which is a key part of the Unix
philosophy - quoted by commenter sswam in Chen's post:

[ This is the Unix philosophy:

Write programs that do one thing and do it well.

Write programs to work together.

Write programs to handle text streams, because that is a universal interface.

-- Doug McIlroy ]

~~~
Koshkin
> _text streams_

It is important to understand that this UNIX programming guideline (as well
as others) only works in a particular context, which is made clear by the fact
that UNIX was not designed, originally, to be an operating system for a server
or an embedded device, or, for that matter, for batch processing in the sense
OS/360 was, for example. Rather, it was designed to allow a moderate-size
group of people (mostly programmers) to use a computer in a time-sharing,
interactive manner to enter and edit and otherwise process text-based data.
Taken out of this (now mostly historical) context, this and other such
guidelines should be seen with a healthy dose of suspicion.

~~~
vram22
Yes, good point. Though text streams are still valid nowadays too; whether
they are that useful just depends on the usage and context. I've seen the
discussions on HN about Unix text streams vs. JSON or vs. PowerShell piping
objects, etc. I do know about the background of the early Unix people and the
initial uses to which it was put, having read about that.

------
jepler
I use "sort | uniq -c | sort -n | tail" all the time to find the most frequent
items, so McIlroy's program is no surprise.

Compared to something like Python's heapq.nlargest, though, there's a key
difference: The first sort has to temporarily keep all the items (storage:
O(n)) and it has to sort them all (time: O(n lg n)). With another algorithm,
the storage is only O(d+m) where d is the number of distinct words and m is
the number of most frequent items sought; and I think the complexity of the d
operations on an m-item heap is O(d lg m). Since typically m << d << n, this
can be a big savings in storage and time!
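
To get the storage half of that without leaving the shell, a one-pass awk
variant (a sketch; input.txt is a placeholder) keeps only the distinct-word
counts, so the final sort runs over d lines rather than n - though it still
sorts instead of using an m-item heap:

    
    
        tr -cs A-Za-z '\n' < input.txt | tr A-Z a-z |
        awk '{ count[$0]++ }      # one pass; stores only the d distinct words
             END { for (w in count) print count[w], w }' |
        sort -rn |                # sorts d count lines, not n words
        sed 10q
    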

~~~
LukeShu
Since we're on the topic of Unix philosophers:

"When in doubt, use brute force." \-- Ken Thompson

"Fancy algorithms are slow when n is small, and n is usually small. Fancy
algorithms have big constants. Until you know that n is frequently going to be
big, don't get fancy. (Even if n does get big, use Rule 2 [measure before you
optimize] first.)" -- Rob Pike

Which is to say: McIlroy's team had a culture that favored simpler solutions,
even with worse storage/time behavior, until actual usage showed that
performance was an issue.

~~~
vram22
And also (after Rule 5):

Rule 6: There is no Rule 6.

:)

~~~
vram22
Hasty downvoter, do your research first. That Rule 6 was not made up by me;
it was the real thing, from one of the people mentioned in the parent comment
- Rob Pike:

[https://www.lysator.liu.se/c/pikestyle.html](https://www.lysator.liu.se/c/pikestyle.html)

~~~
LukeShu
I suspect you were down-voted not because they didn't believe it was
accurate, but because they believed it didn't meaningfully contribute to the
discussion.

------
debdrup
I'm a sysadmin, not a programmer, so maybe that's why it seems like the
Haskell solution is more complicated to me - but can someone explain why a
supposed improvement for something is more complicated, and involves a bunch
of stuff that can't be expected to be included on any Unix-like system you sit
down in front of?

It's especially confusing, considering that the blogger claims to have changed
his opinion, but doesn't bother to clarify what has changed on the new blog
that he "helpfully" links to. It's also interesting that the author claims
that McIlroy would approve of his solution, without checking with him.
McIlroy's email isn't exactly hidden if you know where to look, and I know he
still posts on a few mailing lists regularly, so it's not like he's completely
unavailable.

McIlroy's solution works on any POSIX-compatible system. Feel free to check
for yourself: [http://shellhaters.org/](http://shellhaters.org/)

~~~
LukeShu
I agree with you that the Haskell solution is worse. But I disagree with some
of your reasons.

Forget the ubiquity of Unix. Forget POSIX--McIlroy's text was written before
even the first drafts of POSIX.

Part of the premise of the challenge to Knuth was to use his solution to
advocate for _his_ programming system: WEB (essentially a variant of
Pascal)--look how great it is to program in WEB! So naturally, McIlroy
included in his response a comparison to _his_ programming system: UNIX. Knuth
had designed WEB to make programming nicer; McIlroy had designed UNIX[1] to
make programming nicer. It wasn't just a showdown between word count programs,
it was a showdown of WEB vs UNIX.

And to hear some people tell it, the things that led to Unix's victory in
that little showdown are the same things that led to its ubiquity today. If
people liked Knuth's solution better, maybe we'd have WEB/Pascal systems
everywhere instead of Unix.

[1]: He wasn't the sole designer, but he did invent pipes, which is the big
item in using the Unix shell as a programming model.

~~~
flogic
It's a bit of an unfair comparison, though. The problem and tool set were
predefined before Knuth started. Also, it's a problem that's particularly
suited to Unix tools. There are many problems where WEB might have resulted in
the better solution. As a kid, I saw a program that computed the position of
the Moon in the sky given a location and time. That would probably be better
solved with WEB than Unix pipes.

------
keithpeter
There are times when you want a quick solution with available tools, and I
imagine, there are times when you need to write a portable program that is
documented and that can be maintained easily.

Trying to get McIlroy's pipeline to run on Debian...

    
    
        keith@lavazzared:~$ cat bash.txt | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q 
           4200 the
    

...I just get the most common word. Using head instead of sed gives more
lines...

    
    
        keith@lavazzared:~$ cat bash.txt | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head
           4200 the
           1822 is
           1251 to
           1221 a
           1147 of
           869 if
           804 and
           570 in
           567 shell
           562 command
    

Am I holding it wrong? The source text was the bash man page.

~~~
s_kilk
> Am I holding it wrong?

Yup. The six lines are meant to be put in a script file, then invoked with the
number of results to print as the first arg, where ${1} will work correctly.

If you just paste it into the shell you get whatever ${1} is set to in your
shell session.
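
For example, McIlroy's six commands dropped into a file - here named wordc.sh
to match the usage below - and made executable with chmod +x:

    
    
        #!/bin/sh
        # Print the $1 most common words of standard input.
        tr -cs A-Za-z '\n' |
        tr A-Z a-z |
        sort |
        uniq -c |
        sort -rn |
        sed ${1}q
    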

~~~
keithpeter
All is now clear, thanks to both parent posts

    
    
        keith@lavazzared:~$ cat bash.txt | ./wordc.sh 10
           4200 the
           1822 is
           1251 to
           1221 a
           1147 of
           869 if
           804 and
           570 in
           567 shell
           562 command

~~~
lesserknowndan
This comment thread suggests that Knuth's approach is better; i.e., the shell
script version doesn't really explain why, what, or how it is doing what it is
doing.

It could have been improved by being written in a literate style that explains
how each UNIX command-line tool is being used and why, and it would have
executed in exactly the same way.
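
Something along those lines (a sketch of what such annotations might look
like, not McIlroy's own commentary):

    
    
        #!/bin/sh
        # Print the $1 most frequent words read from standard input.
        tr -cs A-Za-z '\n' |   # squeeze runs of non-letters into newlines: one word per line
        tr A-Z a-z |           # fold to lower case so "The" and "the" count as one word
        sort |                 # group identical words on adjacent lines for uniq
        uniq -c |              # collapse each group into "count word"
        sort -rn |             # numeric sort, descending: most frequent first
        sed ${1}q              # quit after printing the first $1 lines
    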

------
Waterluvian
Are there any IDEs and associated languages that are, by design, well suited
to viewing code in different ways, where these views are all first-class?

For example, maybe I want to structure my code by class. Helpers over here.
Another module over there. But then, with a single keystroke, switch to a code
view where my functions have been inlined and my classes have been collapsed,
so that I see all of super's calls inline. And in a recursive way, so I can
expand or collapse how deep this flattening and unrolling goes.

