The Word Count Problem (peterdonis.com)
69 points by pdonis 1880 days ago | 59 comments



I want to be charitable, but I find the criticism of Knuth here to be quite childish. Was the point to solve the problem with as little work as possible? If that's the case, there's probably some plugin for your favorite text editor that already does this. So what? What was the point exactly? You could just as easily have built a library function that does all of this, written a Pascal program that calls the library in one line of code, stated that the library is not part of your solution, and said your solution is better than the one with UNIX utilities. So what? Is the point to illustrate that Knuth is unaware of elementary concepts such as code reuse? Come on, that is just ridiculous.

If you had asked Knuth to solve this problem using any means in the whole world and do it within 10 minutes, I don't think he would have any trouble doing it.

And honestly, if the point is to teach us some really basic lesson about code reuse, not reinventing the wheel, or something like that, I think this was a pretty convoluted way to deliver a common-sense lesson.


Some other points from the linked article that back up the view that the criticism here is a bit ridiculous:

1) This was in 1986. That's about 4 lifetimes ago in computer programming.

2) This was an article specifically to demonstrate literate programming. It wasn't supposed to be "the most efficient way" possible. It's approximately the same argument as "why should I write a program to sort data in an interview; I'm not going to get paid to write sorting functions".


1) I'd like to see you say the same about differential calculus, which actually was several lifetimes ago.

2) I have made the same argument, in interviews, many times. It's a good argument to make.


The point, imho, is to illustrate literate programming, a technique designed to produce easily understandable code. And the rebuttal is that with high-level tools and good documentation you can build equally understandable code in about 20 lines, including comments. (To be fair, one would need to understand all the tools used for a fair comparison to Knuth's program.)


> Is the point to illustrate that Knuth is unaware of elementary concepts such as code reuse....

In fact, Knuth considers reusable code to be a "menace."

http://www.informit.com/articles/article.aspx?p=1193856


> To me, "re-editable code" is much, much better than an untouchable black box or toolkit. (Knuth)

Probably Knuth misunderstood the concept.


He didn't misunderstand the concept. He emphasized the need for source access to libraries.


Yes, that's true. However, code reuse is somehow treated as opposed to source access to libraries, which should not necessarily be true.


I would argue that more code is a better solution in this case. The purpose was not to write short, clear code, but to illustrate a certain programming style. How am I supposed to "get" the style in 5 lines? A short solution would have been a failure.


The same code in composable and reusable units would have demonstrated just as much.


Against all conventional wisdom I actually think McIlroy was wrong.

In my experience writing reusable code, unless you're writing a library, is usually a waste of time. Most code, even when written with intent to be reusable -- isn't. Maintainability and readability almost always trump reusability. Refactoring at the point where reuse is actually needed is typically the better approach.

McIlroy's approach is the less common one, that of library/framework writers, while Knuth's is the approach of app devs.


I don't agree -- you're papering over an enormous distinction. Unix tools and the shell are the biggest success story in software reuse that we have. (I would argue that it is why nearly everything we use today is Unix -- servers, iOS and MacOS, Android, etc.)

I somewhat agree when people say "code reuse has failed". When people set out to write "reusable code", it ends up not being reusable. Usually because they think of reuse as writing a bunch of classes that they can "import".

But writing tools is another way to reuse software. For example, in Unix, you reuse ssh for git and for scp. With the web, you can reuse a huge amount of work in Varnish and nginx by chaining components.

So library reuse is not all it's cracked up to be, but that's not what McIlroy is doing. His solution to the word count problem is fantastically and obviously better. It's real reuse.


> I don't agree -- you're papering over an enormous distinction. Unix tools and the shell are the biggest success story in software reuse that we have.

To be clear, I think there's a distinction between writing a library for well-known use cases and what I'm describing. I wouldn't suggest stdio is a waste. But it's almost always a waste of time to emerge from a client project with a bunch of libraries for functionality that will likely never be used again.

And among good developers I think this is one of the biggest problems. I've encountered many times good programmers building elaborate frameworks that take as long as the main project itself, who then present the utility of this framework, which we never use again. They didn't understand the problem space well enough to know when reusability made sense and across what pivots.

> So library reuse is not all it's cracked up to be, but that's not what McIlroy is doing. His solution to the word count problem is fantastically and obviously better. It's real reuse.

Knuth also reused. He reused a compiler. He reused input/output capabilities. He reused the subroutine abstraction.

IMO it's less about what they used and more about what they produced (as all developers reuse something). Knuth didn't produce a bunch of reusable tools, and I think that is justified and the right approach most of the time.

I went back and reread the paper just now and McIlroy even alludes to this:

"The utilities employed in this trivial solution are Unix staples. They grew up over the years as people noticed useful steps that tended to recur in real problems. Every one was written first for a particular need, but untangled from the specific application.

With time they accreted a few optional parameters to handle variant, but closely related, tasks. Sort, for example, did not at first admit reverse or numeric ordering, but these options were eventually identified as worth adding."

What he describes is the right approach -- and what Knuth wrote is the first step here. I'm fairly certain if Knuth was a systems developer and this came up over and over again you'd see refinement and tools/libraries that made certain aspects of this more reusable. But that's neither his domain nor the specific ask for this client.

McIlroy's rant here seems pretentious in light of this. A much better rant would be to show how Unix tools could, in six lines, replace LaTeX for typesetting (without using anything provided by LaTeX).


I agree with your first point and didn't dispute it. Library reuse is overrated.

But it isn't contradicting the point of what McIlroy does -- it actually supports it. Unix tools are more reusable than libraries full of code.

I can't tell what the second part of your post is saying. It doesn't matter if Knuth "would have" done something; McIlroy already did it. The problem is solved with 6 lines. End of story. No pontificating. That's what Unix lets you do -- get on with your day :)


I won't dispute the utility and versatility of UNIX tools, but your comment that "nearly everything we use today is Unix" is wildly off the mark. According to the following article, it's estimated that Windows is installed on 36% of servers and 92% of desktop systems: http://en.wikipedia.org/wiki/Usage_share_of_operating_system...


Not sure 90% of code running under unix is shell scripts either...


> In my experience writing reusable code, unless you're writing a library, is usually a waste of time.

In the sense you're using "library" here, the Unix utilities are libraries--more precisely, they are library functions; the whole set of utilities is the library. So McIlroy was a library/framework writer.

> Most code, even when written with intent to be reusable -- isn't.

I more or less agree with this; but fortunately, you don't have to look at "most" code to find code that is reusable. As witness, once again, the Unix utilities. Or, for that matter, the Python builtins and standard library that I used to build the Python versions of the "word count" program.
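
(For anyone who hasn't read the article: the basic shape is something like the following. This is a stripped-down sketch using only builtins and the standard library, not the exact code from the article.)

    import re
    import sys
    from collections import Counter

    def word_count(text, k):
        # per the original spec, a "word" is a maximal run of letters,
        # and only the 26 ASCII letters count
        words = re.findall(r'[a-z]+', text.lower())
        return Counter(words).most_common(k)

    if __name__ == '__main__':
        k = int(sys.argv[1]) if len(sys.argv) > 1 else 10
        for word, count in word_count(sys.stdin.read(), k):
            print("%7d %s" % (count, word))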

Moreover, even if you're in Knuth's position and don't have a good base of reusable code to draw on already, it still makes a big difference how you proceed from there. Do all app devs write big monolithic apps? I believe pg once wrote an essay about bottom-up vs. top-down programming, leaning heavily towards the former.


(largely agreement here, can't say much about Knuth, though)

The essay: http://www.paulgraham.com/progbot.html ?

I think a lot of everyday reuse (by factoring out commonly used code) happens at a level of programming which fits neither application nor library development, but rather writing small tools to automate ad-hoc tasks (e.g. by scientists, admins, etc.). Environments like Python, the shell, or Lisps seem to be more suited to this kind of development than, say, Java.

In the case of the example, I could easily imagine that the "isolate words in input text" part will turn out to show up again, and then one would factor it out into a function. When the first text containing Unicode soft hyphen characters is encountered, only a single shared function has to be updated, and all the little scripts using it will benefit!
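
Something like this, say (a hypothetical sketch; the function name and the soft-hyphen handling are just illustrative):

    import re

    SOFT_HYPHEN = u'\u00ad'

    def isolate_words(text):
        # the one shared definition of "word"; when soft hyphens (or other
        # surprises) show up, only this function needs to change
        return re.findall(r'[^\W\d_]+', text.replace(SOFT_HYPHEN, u''), re.UNICODE)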

A function extracting the text from XML documents, skipping a configurable set of element types, might find similar reuse, etc.

Just an aside: I'm not familiar with Python; what I found disappointing was the mix of chained object method invocations and "outer" function calls. Since replacing the built-in string class seems to be frowned upon, I was wondering what the example script would look like using pipe[1] :-)

One question about the Python code: is the first call to sorted needed, and if yes, why?

[1] http://pypi.python.org/pypi/pipe


> The essay: http://www.paulgraham.com/progbot.html ?

Yes, that's the one I was thinking of. Thanks for finding the link!

> the mix of chained object method invocations and "outer" function calls

Actually, you're right, I could have factored out the part that builds the dict of words vs. occurrences into its own function, so that the entire pipeline would be a single chain of calls. Now that you've put the thought into my head, I may go back and do that. :-)

> I was wondering how the example script would look like using pipe[1] :-)

That would certainly make the Python look more like the shell script. :-)

One of the constraints I imposed on myself when writing the Python version was to only use what comes with it--built-in functions/syntax and the standard library, with no third-party packages, similar to how McIlroy only used "built-in" Unix commands that came with every Unix system.

> One question about the Python code: is the first call to sorted needed

You're right, it isn't. I've pushed an update to the github repo fixing this (in both versions).


> Now that you've put the thought into my head, I may go back and do that.

Just did it. :-) Factored out a "uniq" function in both versions, so the "pipeline" function is now a single chain of method invocations (with the slicing at the end). Which means it could actually be put in the "if __name__ == '__main__'" stanza, but that seems less readable to me.


Going out of your way to write reusable code, unless you're writing a library, is usually a waste of time.

That I can agree with. But there's nothing stopping you from breaking up your program into small utilities that have a simple interface. Those utilities don't need to be generic or widely applicable. But separating them from the main program can help.

Just to give an example, I once had trouble writing a function that transformed a particular tree data structure. I ended up breaking it in two: the walking part, which took the transformation part as an argument, and of course the transformation part itself. Note that the walking part was fairly generic, and could have been used for other purposes (heck, one of its arguments was a function!). But I didn't write it for that. I wrote it to reduce the amount of brainpower I would need to finish my program, through dumb separation of concerns.
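
The shape was roughly this (a Python sketch of the idea; the original wasn't Python, and the (value, children) tree representation here is just for illustration):

    def walk(tree, transform):
        # generic bottom-up walker: rebuild every subtree, then hand the
        # rebuilt node to 'transform'
        value, children = tree
        return transform((value, [walk(child, transform) for child in children]))

    # example: double every value in the tree
    doubled = walk((1, [(2, []), (3, [])]),
                   lambda node: (node[0] * 2, node[1]))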


Have you tried functional programming, yet? John Hughes argues in "Why functional programming matters" (http://www.cse.chalmers.se/~rjmh/Papers/whyfp.html) that enabling exactly this separation and subsequent gluing together is exactly what makes FP worth looking at.


Yep. The very example I was talking about was written in OCaml. It was in an earlier version of my web site compiler: http://loup-vaillant.fr/projects/ussm/

I liked Hughes' paper. Funnily enough, the lack of laziness in OCaml quite hindered me when I wrote and used my Parsec clone (again in USSM, but in the current version).


OK, forget reusability. Is a 10-page program written in WEB more maintainable and readable than a 6-line shell script?


Uh uh. We could assume the Pascal program was written in an environment which lacked UNIX utilities. The 6-line shell script is only better if you can run it at all.

Knuth's program's most glaring flaw isn't its length, but its monolithic architecture.


This is fascinating, and should be a source of reflection for anyone writing code for a living or being involved with programming on any serious level.

Additionally, I can't help but think that this essay (and others in this style) and subsequent reflections would make for a perfect seminar targeted towards final-year CS students, as it perfectly merges academic reflection and real-world engineering considerations, which is what many CS programs across the world sorely need.


I think I found a copy of the original Knuth/McIlroy paper itself: http://people.mokk.bme.hu/~kornai/AzkellHaskell/bentley_1986...


If the problem is simple enough to be solved by a shell script, then obviously literate programming is going to be overkill. Algorithmic problems, such as those involving data structures, are well suited for literate programming.


Having reusable tools in the first place, to make it a problem simple enough to solve with a shell script, is the accomplishment that advances the state of our art. It reduces a complex data structure to passing data through a pipe, which makes the algorithm more usable for more people.

I wrote in a literate programming style for years, to generate LaTeX documents and a world-scale build system for a multinational. But I'm past that, since it's intricate and unmaintainable by others. Instead, I'd rather generate documents in UTF-8 (an ASCII-compatible and increasingly universal format), and sets of small tools which call each other and expose their data in files for new tools to use. With this approach I can evolve the toolset in small, state-preserving steps, to minimize how much I have to code to meet current needs and implement new approaches.


The reusable tools in this example are only applicable because the problem is simplistic. If you were required to extend the solution to cover, say, language-specific definitions of word boundaries, the literate style would be a definite win. The shell solution looks elegant, but really it's just a quick and dirty hack as far as generality is concerned.


I'm pretty sure that the trick to tackling most problems in practice is to see the basic problems to solve, and find existing tools to apply. Solve 80% of the problem in 20% of the time, and move on to something else you need to get done.

If the problem warrants it, you can revisit it later to do something more ornate. In the meantime, you've got what you need to move on.


"Program specifically designed to demonstrate literate programming turns out to be a good demonstration of literate programming, not a good demonstration of library creation"

... can someone please point out what the problem is? I don't see it :-|


The problem is that it isn't what they wanted it to be.


I haven't checked out the original paper, but the solution presented on the blog is a non-solution. It only works for ASCII characters and words composed of ASCII. When you reuse someone else's code, you always get stuck with someone else's assumptions. In this case, that English is the only language and that the regular expression [A-Za-z] encompasses all possible letters in the world. The shell solution, built upon standard tools, cannot be extended to work in an international context, but a custom solution in Python (or even Pascal) quite conceivably could.


In my opinion, you've got that backwards. To fix the shell script, just make sure you're using an i18n version of tr (http://heirloom.sourceforge.net/), and then use [:alpha:] instead of [A-Za-z].

If unicode tr didn't exist, then by fixing tr you fix this problem for everybody, not just yourself. You're only in trouble if the source for `tr` isn't available. But the source for `tr` has always been available, even for proprietary unices.


> I haven't checked out the original paper, but the solution presented on the blog is a non-solution. It only works for ascii characters and words composed out of ascii.

As I understand it, that was the original spec, which both Knuth and McIlroy wrote to. I agree that it is limited as you say.

> The shell solution, built upon standard tools, can not be extended to work in an international context but a custom solution in Python (or even Pascal) quite concievably could.

As bryanlarsen pointed out, the shell solution can easily be extended by using an internationalized version of tr. The Python equivalent would be to use the built-in Unicode support. (If Pascal had that, you could do the same in Pascal.)
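
Roughly like this (a sketch, not the code from the article; note that the caller still has to decode the input, which is the encoding issue discussed next):

    import re
    from collections import Counter

    def word_count_unicode(text, k):
        # 'text' is assumed to be already decoded to a Unicode string;
        # [^\W\d_]+ matches runs of Unicode letters, roughly what swapping
        # [A-Za-z] for [:alpha:] does in the tr-based pipeline
        words = re.findall(r'[^\W\d_]+', text.lower(), re.UNICODE)
        return Counter(words).most_common(k)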

However, it's worth noting that by specifying the problem that way you still have the issue of how the input stream (which is going to be bytes) is encoded. Essentially, the original spec declared by fiat that the encoding was ASCII.

Also, btw, you can express non-English languages in ASCII (though certainly not as wide a variety as in Unicode); the program as written does assume that words are composed only of the 26 standard ASCII letters, but it could easily be extended to include the ASCII special characters. Another exercise for the reader. :-) Though if you're going to do this kind of extension, it might be better just to go the whole way and handle Unicode.


But your assertion was that this was an easy problem solved with just a few lines of shell code. That assertion is only true in an extremely limited context where the only language possible is English and the only character encoding is ASCII. That's where the Unix tools shine, because they are great at handling problems easily expressible as regular expressions. But it is not a very realistic example, nor a fair comparison.

> As bryanlarsen pointed out, the shell solution can easily be extended by using an internationalized version of tr.

You should try that and then blog about it. :) I've spent lots of time battling issues with ASCII-centric libraries. My conclusion is that it is not easy at all, which is not strange, because tools like tr and sed were written decades before Unicode support became a must-have. I can't say that it is impossible to write a shell script to count words in a text written in Arabic script, but it doesn't seem easy.


> But your assertion was that this was an easy problem solved with just a few lines of shell code.

Where did I assert that?

> I can't say that it is impossible to write a shell script to count words in a text written in Arabic script, but it doesn't seem easy.

Is there a definition of what counts as a "letter" in Arabic script? That is, which Unicode code points correspond to letters? And does Arabic share the convention that a "word" is a sequence of letters delimited by non-letters?

If the answers to those questions are "yes", then the extension of the algorithm already presented to handle Arabic is straightforward; it's just a matter of substituting the Arabic definition of "letters" for the ASCII definition. (With, as I said in my previous comment, the additional issue of determining the encoding of the input and output.)

If some of the answers to the above questions are "no", then you have an issue with the problem specification, not with the algorithm you're going to use to solve it.
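
(For what it's worth, Unicode itself supplies one reasonable definition of "letter", so the substitution can even be made language-agnostic. A sketch:)

    import unicodedata

    def is_letter(ch):
        # a "letter" is any code point whose Unicode general category is
        # L-something (Lu, Ll, Lt, Lm, Lo); Arabic letters fall under Lo
        return unicodedata.category(ch).startswith('L')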


> As I understand it, that was the original spec, which both Knuth and McIlroy wrote to. I agree that it is limited as you say.

My new opinion is that great programmers will go beyond the spec and make their software handle Unicode in cases like this. It's the right thing to do. If the spec didn't require unit tests, does that mean you don't write unit tests? No, you do it anyway because it's the right way to do things. BTW I'm still working on being a great programmer myself.


Both the article's improved pipeline and Doug McIlroy's original have a bug. I've just written about it on +Ade Oshineye's post about the original article. https://plus.google.com/105037104815911535953/posts/KuXczyiw...


Good catch! I've implemented your suggested fix of adding a sed '/^$/d' at the beginning of McIlroy's pipeline. The github code is updated, along with an updated test file and expected output.

(Btw, the Python version did not have the bug, since the split method of Python strings already ignores initial whitespace.)
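
For example, leading blank lines and extra whitespace just disappear:

    >>> "  \nhello   world\n".split()
    ['hello', 'world']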


Your bug fix is broken; it makes no sense to place it there. I suggest increasing the number of test inputs to first show the problem.


Oops, you're right, the sed command should be on the second line. Fixed the test inputs and the fix.


Doesn't Pascal have libraries for sorting? I find that hard to believe. So there must have been another reason why Knuth chose to implement everything from scratch. Probably to demonstrate programming techniques. You don't learn that much about algorithms by stringing together a couple of shell commands.


Pascal makes code reuse hard. http://www.lysator.liu.se/c/bwk-on-pascal.html puts it better than I could. And when I used Pascal in the early 1990s (and this was Borland Pascal, one of the better ones; it even had strings!), things had not improved massively.


When did that comparison take place anyway? I don't think anybody is still using Pascal in this day and age?


1980s, I think it says?

At any rate, my point, though it wasn't really clear, was that Pascal makes code reuse hard to such an extent that I don't think it's actually possible to have a general-purpose sort routine in the library that would be actually useful. Certainly nothing like qsort, anyway. It just can't be expressed.

(I'm sure modern versions of Pascal have this problem licked.)


Maybe I am missing something here, but the Python version is O(N) in sequential disk access, O(N log N) in CPU, and O(N) in RAM.

The shell script is the same for disk and CPU; however, it is O(1) in RAM, and therefore can operate on input whose size is limited by disk, not RAM.


You're correct, the Python version is O(N) instead of O(1) in RAM usage for two reasons, one easily fixed and one not. Sorry if I'm belaboring the point, but I think it's worth going into a little more detail.

The easily fixable issue is that the shell pipeline buffers reads, whereas my Python version just uses sys.stdin.read() for simplicity. I could have buffered the reads by using Python's generators/iterators instead. However, that alone isn't enough to get O(1) RAM usage.
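
A minimal sketch of what that might look like (the name is just illustrative):

    def words_from(stream):
        # iterate over the input a line at a time instead of slurping it
        # all in with stream.read(), yielding lowercased words as we go
        for line in stream:
            for word in line.split():
                yield word.lower()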

The not so easily fixable issue is that the Unix sort command uses temporary files in order to not have to have all of its data in memory at once. See, for example, here:

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...

Python's sorted builtin doesn't work like this; it can take a generator as input, but it returns a list. This is one respect in which Python's built-in functionality lacks a feature that the Unix utilities have. I don't know if anyone has tried to re-implement the Python sorted function to use temporary files and return a generator to reduce memory usage.

[Edit: the Python version would also have to re-implement the uniq function to take a generator as input and return a generator, which would, I believe, also require temporary files.]
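
If someone wanted to try it, a rough sketch along those lines using only the standard library might look like this (temp-file runs merged with heapq.merge; the names and chunk size are just illustrative):

    import heapq
    import tempfile
    from itertools import groupby, islice

    def external_sort(items, chunk_size=100000):
        # sort an arbitrarily large iterable of strings in bounded memory:
        # sort fixed-size chunks, spill each to a temp file, then merge
        runs = []
        items = iter(items)
        while True:
            chunk = sorted(islice(items, chunk_size))
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines(item + '\n' for item in chunk)
            run.seek(0)
            runs.append(run)
        return (line.rstrip('\n') for line in heapq.merge(*runs))

    def uniq_c(sorted_items):
        # generator version of 'uniq -c': yields (count, item) pairs; since
        # the input is already sorted, this part can stream without temp files
        return ((len(list(group)), key) for key, group in groupby(sorted_items))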


Unix sort is a merge sort, using fixed RAM; uniq -c is line by line; sed is line by line.

A superior solution in every way

If you were going to do this in Python (or another similar language), you would need to write a sort function that operates on a file, not a Python collection, as a collection is always bound by RAM, which would at best be re-implementing the sort command. Once you can sort a file, everything else is trivial, as counts can be done line by line, or more simply, via uniq -c.

Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use the command line for these types of problems.


> Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use the command line for these types of problems.

I agree; I was not trying to claim that my Python solution should be used in preference to the shell pipeline solution in any kind of "production" environment. As I noted in another comment, Knuth's Pascal solution appears to be open to the same criticism.


Looking at the original paper that was linked to elsewhere in the comments here, it appears that Knuth's original Pascal program is, like my Python version, O(N) in RAM. It keeps everything in memory and uses no temporary files, as far as I can tell.


I like that article very much, so I wanted to monitor the blog. However, I found neither an Atom feed nor an RSS feed. That's a pretty good way to turn off potential regular readers.


I do have feeds set up, but I just realized they aren't linked to or autodiscoverable. :redface: I'll fix that ASAP. Thanks for the feedback!


Fixed--added links on the home page to RSS 2.0 and Atom feeds. Also feed autodiscovery should work now.


Great! I just subscribed to your Atom feed.

(BTW, who needs RSS anymore? Atom is so much more cleanly designed and thus easier to work with. I'm even planning to use Atom as the main format and generate my website from that ...)


Thanks for subscribing!

I generate everything from Markdown source using PyBlosxom's static rendering (which has some issues that may eventually drive me to switch to something else or roll my own). It auto-generates RSS and Atom feeds, so it costs me nothing to have both just in case someone prefers RSS or can't use Atom for some reason.



Neither Knuth’s nor McIlroy’s solution handles Unicode, so they both suck.



