
The Word Count Problem - pdonis
http://blog.peterdonis.com/opinions/still-another-nerd-interlude.html
======
dinkumthinkum
I want to be charitable, but I find the criticism of Knuth here to be quite
childish. Was the point to solve the problem with as little work as possible?
If that's the case, there's probably some plugin for your favorite text editor
that already does this. So what? What was the point exactly? You could just as
easily have built a library function that does all of this, written a Pascal
program that calls the library in one line of code, declared the library not
part of your solution, and said the solution is better than the one with UNIX
utilities. So what? Is the point to illustrate that Knuth is unaware of
elementary concepts such as code reuse? Come on, that is just ridiculous.

If you had asked Knuth to solve this problem using any means in the whole
world and do it within 10 minutes, I don't think he would have any trouble
doing it.

And honestly, if the lesson is some really basic one about code reuse, not
reinventing the wheel, or something like that, I think this was a pretty
convoluted way to deliver a common-sense lesson.

~~~
powera
Some other points from the linked article that back up the view that the
criticism here is a bit ridiculous:

1) This was in 1986. That's about 4 lifetimes ago in computer programming.

2) This was an article specifically to demonstrate literate programming. It
wasn't supposed to be "the most efficient way" possible. It's approximately
the same argument as "why should I write a program to sort data in an
interview; I'm not going to get paid to write sorting functions".

~~~
zacharyvoase
1) I'd like to see you say the same about differential calculus, which
actually _was_ several lifetimes ago.

2) I have made the same argument, in interviews, many times. It's a good
argument to make.

------
kenjackson
Against all conventional wisdom I actually think McIlroy was wrong.

In my experience, writing reusable code, unless you're writing a library, is
usually a waste of time. Most code, even when written with the intent to be
reusable, isn't. Maintainability and readability almost always trump
reusability. Refactoring once reuse is actually needed is typically the better
approach.

McIlroy's approach is the less common approach of library/framework writers,
while Knuth's is the approach of app devs.

~~~
chubot
I don't agree -- you're papering over an enormous distinction. Unix tools and
the shell are the biggest success story in software reuse that we have. (I
would argue that it is why nearly everything we use today is Unix -- servers,
iOS and MacOS, Android, etc.)

I somewhat agree when people say "code reuse has failed". When people set out
to write "reusable code", it ends up not being reusable. Usually because they
think of reuse as writing a bunch of classes that they can "import".

But writing tools is another way to reuse software. For example, in Unix, you
reuse ssh for git and for scp. With the web, you can reuse a huge amount of
work in Varnish and nginx by chaining components.

So library reuse is not all it's cracked up to be, but that's not what McIlroy
is doing. His solution to the word count problem is fantastically and
obviously better. It's real reuse.

~~~
kenjackson
_I don't agree -- you're papering over an enormous distinction. Unix tools and
the shell are the biggest success story in software reuse that we have._

To be clear, I think there's a distinction here: writing a library for well-
known use cases makes sense. I wouldn't suggest stdio is a waste. But it's
almost always a waste of time to emerge from a client project with a bunch of
libraries for functionality that will likely never be used again.

And among good developers I think this is one of the biggest problems. I've
many times encountered good programmers building elaborate frameworks that
take as long as the main project itself, who then present the utility of this
framework, which we never use again. They didn't understand the problem space
well enough to know when reusability made sense and across what pivots.

 _So library reuse is not all it's cracked up to be, but that's not what
McIlroy is doing. His solution to the word count problem is fantastically and
obviously better. It's real reuse._

Knuth also reused. He reused a compiler. He reused input/output capabilities.
He reused the language's subroutine abstraction.

IMO it's less what they used and more what they produced (as all developers
reuse). Knuth didn't produce a bunch of reusable tools and I think that is
justified and the right approach most of the time.

I went back and reread the paper just now and McIlroy even alludes to this:

"The utilities employed in this trivial solution are Unix staples. They grew
up over the years as people noticed useful steps that tended to recur in real
problems. Every one was written first for a particular need, but untangled
from the specific application.

With time they accreted a few optional parameters to handle variant, but
closely related, tasks. Sort, for example, did not at first admit reverse or
numeric ordering, but these options were eventually identified as worth
adding."

What he describes is the right approach -- and what Knuth wrote is the first
step here. I'm fairly certain that if Knuth were a systems developer and this
came up over and over again, you'd see refinement and tools/libraries that
made certain aspects of this more reusable. But that's neither his domain nor
the specific ask for this client.

McIlroy's rant here seems pretentious in light of this. A much better rant
would show how Unix tools could, in six lines, replace LaTeX for typesetting
(without using anything provided by LaTeX).

~~~
chubot
I agree with your first point and didn't dispute it. Library reuse is
overrated.

But it isn't contradicting the point of what McIlroy does -- it actually
_supports_ it. Unix tools are more reusable than libraries full of code.

I can't tell what the second part of your post is saying. It doesn't matter if
Knuth "would have" done something; McIlroy _already_ did it. The problem is
solved with 6 lines. End of story. No pontificating. That's what Unix lets you
do -- get on with your day :)
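
For reference, here are the six lines, as quoted in the "more shell, less egg"
article (the ${1} parameter is how many of the top words to print):

    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q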

------
GuiA
This is fascinating, and should be a source of reflection for anyone writing
code for a living or involved with programming at any serious level.

Additionally, I can't help but think that this essay (and others in this
style) and the subsequent reflections would make for a perfect seminar
targeted at final-year CS students, as it perfectly merges academic reflection
with real-world engineering considerations, which is what many CS programs
across the world sorely need.

------
zacharyvoase
I think I found a copy of the original Knuth/McIlroy paper itself:
[http://people.mokk.bme.hu/~kornai/AzkellHaskell/bentley_1986...](http://people.mokk.bme.hu/~kornai/AzkellHaskell/bentley_1986.pdf)

------
andreasvc
If the problem is simple enough to be solved by a shell script, then obviously
literate programming is going to be overkill. Algorithmic problems such as
data structures are well suited for literate programming.

~~~
catenate
Having reusable tools in the first place, to make it a problem simple enough
to solve with a shell script, is the accomplishment that advances the state of
our art. It reduces a complex data structure to passing data through a pipe,
which makes the algorithm more usable for more people.

I wrote in a literate programming style for years, to generate LaTeX documents
and a world-scale build system for a multinational. But I'm past that, since
it's intricate and unmaintainable by others. Instead, I'd rather generate
documents in UTF-8 (an ASCII-compatible and increasingly universal format),
and sets of small tools which call each other and expose their data in files
for new tools to use. With this approach I can evolve the toolset in small,
state-preserving steps, to minimize how much I have to code to meet current
needs and implement new approaches.

~~~
andreasvc
The reusable tools in this example are only applicable because the problem is
simplistic. If you were required to extend the solution to cover, say,
language-specific definitions of word boundaries, the literate style would be
a definite win. The shell solution looks elegant, but really it's just a quick
and dirty hack as far as generality is concerned.

~~~
catenate
I'm pretty sure that the trick to tackling most problems in practice is to see
the basic problems to solve, and find existing tools to apply. Solve 80% of
the problem in 20% of the time, and move on to something else you need to get
done.

If the problem warrants it, you can revisit it later to do something more
ornate. In the meantime, you've got what you need to move on.

------
Shish2k
"Program specifically designed to demonstrate literate programming turns out
to be a good demonstration of literate programming, not a good demonstration
of library creation"

... can someone please point out what the problem is? I don't see it :-|

~~~
TazeTSchnitzel
The problem is that it isn't what they wanted it to be.

------
bjourne
I haven't checked out the original paper, but the solution presented on the
blog is a non-solution. It only works for ASCII characters and words composed
of ASCII. When you reuse someone else's code, you always get stuck with
someone else's assumptions. In this case, that English is the only language
and that the regular expression [A-Za-z] encompasses all possible letters in
the world. The shell solution, built upon standard tools, cannot be extended
to work in an international context, but a custom solution in Python (or even
Pascal) quite conceivably could.

~~~
pdonis
> _I haven't checked out the original paper, but the solution presented on the
> blog is a non-solution. It only works for ASCII characters and words
> composed of ASCII._

As I understand it, that was the original spec, which both Knuth and McIlroy
wrote to. I agree that it is limited as you say.

> _The shell solution, built upon standard tools, cannot be extended to work
> in an international context, but a custom solution in Python (or even Pascal)
> quite conceivably could._

As bryanlarsen pointed out, the shell solution can easily be extended by using
an internationalized version of tr. The Python equivalent would be to use the
built-in Unicode support. (If Pascal had that, you could do the same in
Pascal.)

However, it's worth noting that by specifying the problem that way you still
have the issue of how the input stream (which is going to be bytes) is
encoded. Essentially, the original spec declared by fiat that the encoding was
ASCII.

Also, btw, you can express non-English languages in ASCII (though certainly
not as wide a variety as in Unicode); the program as written does assume that
words are composed only of the 26 standard ASCII letters, but it could easily
be extended to include the ASCII special characters. Another exercise for the
reader. :-) Though if you're going to do this kind of extension, it might be
better just to go the whole way and handle Unicode.
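
To make that concrete, here is a minimal sketch of "going the whole way" in
Python 3 (this is not the code from the post; it assumes stdin decodes
cleanly, which on most modern systems means UTF-8, and it uses the [^\W\d_]
idiom, which matches roughly "any Unicode letter"):

    import collections
    import re
    import sys

    # Sketch only: assumes the input stream decodes (UTF-8 by default on
    # most systems); a real tool would accept or detect an encoding.
    text = sys.stdin.read().lower()

    # \w minus digits and underscore leaves roughly the Unicode letters.
    words = re.findall(r'[^\W\d_]+', text)

    k = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    for word, count in collections.Counter(words).most_common(k):
        print(count, word)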

~~~
bjourne
But your assertion was that this was an easy problem solved with just a few
lines of shell code. That assertion is only true in an extremely limited
context where the only language possible is English and the only character
encoding is ASCII. That's where the Unix tools shine, because they are great
at handling problems easily expressible as regular expressions. But it is not
a very realistic example, nor a fair comparison.

> As bryanlarsen pointed out, the shell solution can easily be extended by
> using an internationalized version of tr.

You should try that and then blog about it. :) I've spent lots of time
battling issues with ASCII-centric libraries. My conclusion is that it is not
easy at all, which is not strange, because tools like tr and sed were written
decades before Unicode support became a must-have. I can't say that it is
impossible to write a shell script to count words in a text written in Arabic
script, but it doesn't seem easy.

~~~
pdonis
> _But your assertion was that this was an easy problem solved with just a few
> lines of shell code._

Where did I assert that?

> _I can't say that it is impossible to write a shell script to count words in
> a text written in Arabic script, but it doesn't seem easy._

Is there a definition of what counts as a "letter" in Arabic script? That is,
which Unicode code points correspond to letters? And does Arabic share the
convention that a "word" is a sequence of letters delimited by non-letters?

If the answers to those questions are "yes", then the extension of the
algorithm already presented to handle Arabic is straightforward; it's just a
matter of substituting the Arabic definition of "letters" for the ASCII
definition. (With, as I said in my previous comment, the additional issue of
determining the encoding of the input and output.)

If some of the answers to the above questions are "no", then you have an issue
with the problem specification, not with the algorithm you're going to use to
solve it.
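
For illustration, the substitution is a one-line change in something like
Python (using, purely as an assumption for the example, the basic Arabic
letter block U+0621..U+064A; real Arabic text may also need the supplement and
extended blocks):

    import re

    # The ASCII definition of a "letter", per the original spec:
    ascii_word = re.compile('[A-Za-z]+')

    # A hypothetical Arabic definition: the basic Arabic letter block.
    arabic_word = re.compile('[\u0621-\u064A]+')

    # Only the Arabic-script word matches; the Latin one is a delimiter.
    print(arabic_word.findall('\u0643\u062A\u0627\u0628 word'))  # ['كتاب']

Everything downstream of the letter definition (sort, count, sort again) is
unchanged.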

------
ralph
Both the article's improved pipeline and Doug McIlroy's original have a bug.
I've just written about it on +Ade Oshineye's post about the original article.
[https://plus.google.com/105037104815911535953/posts/KuXczyiw...](https://plus.google.com/105037104815911535953/posts/KuXczyiwqep)

~~~
pdonis
Good catch! I've implemented your suggested fix of adding a sed '/^$/d' at the
beginning of McIlroy's pipeline. The github code is updated, along with an
updated test file and expected output.

(Btw, the Python version did not have the bug, since the split method of
Python strings already ignores initial whitespace.)
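
Concretely, with no arguments, str.split treats any run of whitespace as a
single delimiter and drops leading and trailing runs, so no empty "word" can
appear:

    >>> '  leading   whitespace  '.split()
    ['leading', 'whitespace']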

~~~
ralph
Your bug fix is broken; it makes no sense to place it there. I suggest
increasing the number of test inputs to first show the problem.

~~~
pdonis
Oops, you're right, the sed command should be on the second line. Fixed the
test inputs and the fix.
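
That is, with the sed on the second line, the pipeline presumably reads as
follows, the sed deleting the empty line that a leading non-letter in the
input would otherwise produce:

    tr -cs A-Za-z '\n' |
    sed '/^$/d' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q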

------
Tichy
Doesn't Pascal have libraries for sorting? I find that hard to believe. So
there must have been another reason why Knuth chose to implement everything
from scratch. Probably to demonstrate programming techniques. You don't learn
that much about algorithms by stringing together a couple of shell commands.

~~~
to3m
Pascal makes code reuse hard. <http://www.lysator.liu.se/c/bwk-on-pascal.html>
puts it better than I could. And when I used Pascal in the early 1990s (and
this was Borland Pascal, one of the better ones; it even had strings!), things
had not improved massively.

~~~
Tichy
When did that comparison take place anyway? I don't think anybody is still
using Pascal in this day and age?

~~~
to3m
1980s, I think it says?

At any rate, my point, though it wasn't really clear, was that Pascal makes
code reuse hard to such an extent that I don't think it's actually possible to
have a general-purpose sort routine in the library that would be actually
useful. Certainly nothing like qsort, anyway. It just can't be expressed.

(I'm sure modern versions of Pascal have this problem licked.)

------
websiteguy
Maybe I am missing something here, but the Python version is: O(N) sequential
disk access, O(N log N) CPU, O(N) RAM.

The shell script is the same for disk and CPU; however, it is O(1) in RAM, and
can therefore operate on input of size limited by disk, not RAM.

~~~
pdonis
You're correct, the Python version is O(N) instead of O(1) in RAM usage for
two reasons, one easily fixed and one not. Sorry if I'm belaboring the point,
but I think it's worth going into a little more detail.

The easily fixable issue is that the shell pipeline buffers reads, whereas my
Python version just uses sys.stdin.read() for simplicity. I could have
buffered the reads by using Python's generators/iterators instead. However,
that alone isn't enough to get O(1) RAM usage.

The not so easily fixable issue is that the Unix sort command uses temporary
files in order to not have to have all of its data in memory at once. See, for
example, here:

[http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...](http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html)

Python's sorted builtin doesn't work like this; it can take a generator as
input, but it returns a list. This is one respect in which Python's built-in
functionality lacks a feature that the Unix utilities have. I don't know if
anyone has tried to re-implement the Python sorted function to use temporary
files and return a generator to reduce memory usage.

[Edit: the Python version would also have to re-implement the uniq function to
take a generator as input and return a generator; itertools.groupby can do
that lazily on the sorted stream without temporary files.]
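
For illustration, here is a rough sketch (not the blog's code) of the standard
external merge sort shape such a re-implementation would take, using tempfile
for the sorted runs and heapq.merge to merge them lazily:

    import heapq
    import itertools
    import tempfile

    def external_sort(lines, chunk_size=100000):
        # Sort fixed-size chunks in RAM, spill each sorted run to a
        # temporary file, then lazily merge the runs with heapq.merge.
        # Assumes the items are strings without embedded newlines.
        runs = []
        it = iter(lines)
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines(word + '\n' for word in chunk)
            run.seek(0)
            runs.append(line.rstrip('\n') for line in run)
        return heapq.merge(*runs)

With that, a streaming uniq -c falls out of itertools.groupby over the merged
output, counting each group as it goes.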

~~~
websiteguy
Unix sort is a merge sort, using fixed RAM. uniq -c is line by line. sed is
line by line.

A superior solution in every way.

If you were going to do this in Python (or another similar language), you
would need to write a sort function that operates on a file, not a Python
collection, as the collection is always bound by RAM; that would at best be
re-implementing the sort command. Once you can sort a file, everything else is
trivial, as counts can be done line by line, or more simply, via uniq -c.

Of course, if you only care about things that fit in memory, you can do it in
Python, but it is still far easier to use the command line for these types of
problems.

~~~
pdonis
> _Of course, if you only care about things that fit in memory, you can do it
> in Python, but it is still far easier to use the command line for these
> types of problems._

I agree; I was not trying to claim that my Python solution should be used in
preference to the shell pipeline solution in any kind of "production"
environment. As I noted in another comment, Knuth's Pascal solution appears to
be open to the same criticism.

------
vog
I like that article very much, so I wanted to monitor the blog. However, I
found neither an Atom feed nor an RSS feed. That's a pretty good way to turn
off potential regular readers.

~~~
pdonis
I do have feeds set up, but I just realized they aren't linked to or
autodiscoverable. :redface: I'll fix that ASAP. Thanks for the feedback!

~~~
pdonis
Fixed--added links on the home page to RSS 2.0 and Atom feeds. Also feed
autodiscovery should work now.

~~~
vog
Great! I just subscribed to your Atom feed.

(BTW, who needs RSS anymore? Atom is so much cleaner in design and thus easier
to handle. I'm even planning to use Atom as my main format and generate my
website from that ...)

~~~
pdonis
Thanks for subscribing!

I generate everything from Markdown source using PyBlosxom's static rendering
(which has some issues that may eventually drive me to switch to something
else or roll my own). It auto-generates RSS and Atom feeds, so it costs me
nothing to have both just in case someone prefers RSS or can't use Atom for
some reason.

------
pdonis
Inspired by reading this:

[http://www.leancrew.com/all-this/2011/12/more-shell-less-egg...](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)

------
indubitably
Neither Knuth’s nor McIlroy’s solution handles Unicode, so they both suck.

