
Unix Text Processing (1987) - rhollos
http://oreilly.com/openbook/utp/
======
ralph
I sought permission from Tim O'Reilly, who co-wrote this back when he was an
author, to re-enter the text of the book as troff source, the original being
lost. He kindly gave it, and volunteers from the groff@gnu.org mailing list
split the chapters amongst themselves; it was quite fun.
<http://home.windstream.net/kollar/utp/> is the result.

~~~
super_mario
Thanks. This is the best version by far. Is it also possible to generate the
PDF hyperlinks/outline so it is easier to navigate the document?

~~~
dredmorbius
You can generate PDF (without the hyperlinks) with standard tools (e.g.
Ghostscript's ps2pdf). A good PDF reader will offer text search, which is a
bit of a help.

Otherwise, I'd have to investigate groff and text conversion utilities to see
if the metadata are readily accessible. Interesting question, though I suspect
it might take a slight rewrite.

~~~
ralph
PDF is provided on the book's page I gave above. You're correct, small mark-up
changes are needed to make use of the PDF-linking macros now available that
weren't then.

------
clicks
I see resources about Unix text processing utilities, about Bash, about
readline shortcuts, etc. etc. submitted very often to HN.

Exactly how useful are they? Would you say they're among the most important
skills a programmer could have? Or do we just have a disproportionately large
number of sysadmins among HN readers? Isn't some general-purpose language like
Python or Ruby almost always much better, much faster than these solutions?
Isn't it worth it to instead invest your time in learning Python well, rather
than getting familiar with the fragmented and messy environment of modern
Unix-land? Certainly, the learning curve is much steeper for Unix utilities
than, say, Python.

~~~
twistedanimator
Just yesterday I was able to write a one-liner on the command line to analyze
my Skype usage. I wanted to see if it was making financial sense to keep Skype
alongside my prepaid wireless plan.

You can download your last six months of skype activity from skype.com and
they are in the form:

    Date;Date;Item;Destination;Type;Rate;Duration;Amount;Currency
    "July 31, 2012 21:16";"2012-07-31T21:16:01+00:00";"+11234567890";"USA";"Call";0.000;00:00:10;0.000;USD
    "July 31, 2012 21:15";"2012-07-31T21:15:38+00:00";"+11234567890";"USA";"Call";0.000;01:17:02;0.000;USD

After 15 minutes or so, I came up with the following one-liner (the field
separator is set with -F up front; assigning FS inside the action only takes
effect from the next record):

    cut -f7 -d";" call_history* | grep -v "Duration" |
        awk -F: '{ s += $1*60 + $2; if ($3 != 0) s += 1 } END { print s " minutes" }'
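Fed a couple of sample records inline (with -F: so the separator applies to
the first record too), the pipeline can be sanity-checked:

```shell
# Two records in the exported Skype format, piped through the same stages:
# cut out field 7 (Duration), drop the header, sum the minutes in awk.
printf '%s\n' \
  'Date;Date;Item;Destination;Type;Rate;Duration;Amount;Currency' \
  '"July 31, 2012 21:16";"2012-07-31T21:16:01+00:00";"+11234567890";"USA";"Call";0.000;00:00:10;0.000;USD' \
  '"July 31, 2012 21:15";"2012-07-31T21:15:38+00:00";"+11234567890";"USA";"Call";0.000;01:17:02;0.000;USD' |
cut -f7 -d';' | grep -v Duration |
awk -F: '{ s += $1*60 + $2; if ($3 != 0) s += 1 } END { print s " minutes" }'
# prints: 79 minutes  (0+1 for the 10-second call, 77+1 for the 78-minute one)
```

Partial minutes round up via the `$3 != 0` test, matching how prepaid plans
bill.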

I could have done the same thing in perl or python in 5 minutes, but it was
interesting to "program" only by hooking programs together to achieve the same
thing.

After analyzing my skype logs, I found I used 1800 minutes. That would have
cost me $180 with prepaid minutes, but only cost $30 with skype.

~~~
ralph
Another way would be all awk.

        awk -F\; '
            $7 != "Duration" {
                split($7, t, ":")
                s += t[1] * 60 + t[2] + (t[3] != 0)
            }
            END {print s + 0}
        '
    

Note the handling of s == "" in END.
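A two-line demonstration of that edge case (no matching records, so s is
never assigned):

```shell
# When every line is filtered out, s stays unset: bare s prints an empty
# string, while s + 0 coerces it to the number 0.
printf 'Duration\n' | awk -F: '$1 != "Duration" { s += $1 } END { print s }'      # empty line
printf 'Duration\n' | awk -F: '$1 != "Duration" { s += $1 } END { print s + 0 }'  # 0
```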

------
kamaal
A very nice book available for free. It makes me very happy that I can learn
new ways to do my work smarter.

One of the best-known advantages of Unix text processing tools is that, as
long as you can reduce your problem to text, there exist some very powerful,
succinct, and quick solutions to even some very difficult problems.

Well, it takes some time to get a grip on how to work with Unix text
processing utilities and tools like Perl. Once you are up to speed, you see
how much work you can do so quickly with so little effort. In fact, the deeper
you get into it, the more you realize how much useless code you have been
writing over the years, when all you needed was a command with a few options.

------
zmoney
So much of this is still useful and relevant. Kind of amazing how much the web
relies upon these ancient (by our standards) tools. UNIX is one of the most
incredible technology stories ever.

------
ghshephard
This book is an absolute treasure; even at its age, it is still one of the
best descriptions of troff/nroff.

One thing I do find interesting is that the book (on text processing, no
less) runs 680 pages x 30 lines x 80 columns (plus some minimal line art),
which in theory is around 1.6 megabytes of data, yet weighs in at 28 MBytes
in this PDF.

Regardless of the irony, it's a great book which the authors clearly put
blood, sweat, and tears into.

~~~
przemoc
26.7 MB is because it's a scanned book with a text layer on top. If it were
retyped in, let's say, LaTeX, and output to PDF, it wouldn't take more than
one tenth of the current size. Remember that scanning the book also serves an
archival purpose.

~~~
ralph
The book has been re-typed, as troff, just like the original. See my comment
elsewhere on this post.

~~~
przemoc
No reason to downvote me (whoever it was). The PDF

<http://oreilly.com/openbook/utp/UnixTextProcessing.pdf>

has _scanned_ pages with text layer on top of it. That's the reason why it's
so big.

I applaud the retyping effort, as it's always better to preserve the real
content than images of it (even if OCRed), but I cannot say that I'm happy
about troff being used for this purpose. It may be a matter of taste, but I
don't like the way formatting is done in troff/nroff. That's why I never use
it directly (e.g. I use ronn to convert Markdown text to man pages, etc.).

But I understand it's done that way to preserve "the creation process" too,
which is also appreciated. And the book is about troff/nroff, so dogfooding is
present. ;)

------
kephra
I was writing good old roff just a few days ago, so this book is not even out
of date.

~~~
Surio
>> I was even writing good old roff a few days ago

Just curious, what was it for? :)

PS: I had a groff based CV (2006-ish). Now it is a LaTeX based CV.

~~~
ralph
I still regularly use groff; letters, invoices, ad hoc formatting of one-off
tasks. tbl(1) is nice and the overall speed makes it handy for producing PDF
in the back-end.

~~~
fusiongyro
I was going to ask why one would prefer groff to some TeX variant. Speed is a
pretty good reason; I'm impressed at how slow TeX is even on modern hardware.
I usually generate HTML unless I want something beautiful, and then I usually
use ConTeXt. I'm pretty fond of classic Unix stuff so maybe I'll delve more
into groff.

~~~
ralph
I've read _The TeXbook_ and other material but just find the TeX mark-up so
noisy to parse compared with the troff style of `.cmd' at the start of the
line, where cmd is often short, with the odd bit of \s+2 embedded within the
line, depending on personal preference.

troff and friends were developed on Unix in its early days by the originators
of Unix and it shows in what a good fit they are to the environment and in
their elegance; they are Unix programs. TeX was born outside of Unix; it runs
on Unix.

~~~
fusiongyro
Feel like answering a couple more questions? :) The kinds of questions I have
are not "good" questions for S.O. I'm tempted to try redoing a project of mine
from ConTeXt in troff just to see what it's like. Do you recommend any
particular macro package? Without any other input I'd probably try the -me
macros, just because they've been compared to Pascal (versus the FORTRAN of
-ms and the PL/1 of -mm).

Also, do you find yourself writing your own macros much? I haven't the
faintest idea what that would require with troff, but I rely on this with TeX
quite a bit, mostly to elevate stylistic markup into semantic markup. My
impression is that if you want semantic macros you use a macro package, and
even then you probably freely intersperse non-semantic macros.

This project I've been working on for a while uses LuaTeX so that I can
connect to a local database, perform some queries, and typeset them and their
output. I imagine this kind of thing would not be difficult to do directly
with a custom pipeline step using troff. Have you done that kind of thing
before? If so, how unpleasant was it?

Thanks for talking with me about this.

~~~
ralph
You'd do well to ask these on the
<https://lists.gnu.org/mailman/listinfo/groff> list for a wider set of
opinions. It depends on the style and complexity of the document. -ms is
simple enough that people like W. Richard Stevens would tweak it for their
books. I understand the relative newcomer, -mom, is comprehensive, modern, and
well supported by its author on the above list. I don't recognise the Pascal,
etc., analogies. :-)

I do write my own macros. They can be just short-hands for a combination of
others in the same way my ~/bin/l is exec ls -l "$@", or sometimes for a
simple document I start with just troff and have some macros on top of that.
Yes, any distinction over semantics is purely convention.
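As a sketch of how small such a macro can be (the name `Bi' and its body are
my own invention, not from any standard package), here is a short-hand that
wraps -ms's .IP with a bullet label:

```troff
.de Bi          \" Bi: begin a bulleted list item
.IP \(bu 3n     \" -ms indented paragraph, bullet glyph as the label
..
```

A document then starts each item with .Bi, and redefining Bi later restyles
every item at once; whether that counts as semantic mark-up is, as said,
pure convention.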

You may wish to read Kernighan's _Nroff/Troff User's Manual_,
<http://troff.org/54.pdf>, otherwise known as CSTR #54. It's original troff,
not groff, but as a succinct reference with elegant prose we often refer back
to it. At the end is a tutorial introducing simple macros.

Integrating troff and friends in pipelines and scripts is easy. They take
line-based text as input and produce it as output, only switching to binary
for some output formats at the last hop. You can also run system(3) from
within troff documents, e.g. to include the output of a command, but often
that's not the easiest fit.
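As a sketch of such a pipeline step (the data and column layout are invented
for illustration), here is an awk stage that turns semicolon-separated records
into tbl(1) input, ready for `tbl | groff -ms` further down the pipe:

```shell
# Wrap "name;amount" records in .TS/.TE so tbl(1) can typeset them.
printf 'widget;3\ngadget;7\n' |
awk -F';' '
    BEGIN { print ".TS"; print "l r." }   # table start, column formats
    { print $1 "\t" $2 }                  # one tab-separated row per record
    END { print ".TE" }                   # table end
'
```

The awk stage alone shows the fit: troff input is plain line-based text, so
any Unix tool can generate it.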

I recommend again the groff@gnu.org list; they're friendly, patient with
newcomers, and interested in showing how they tackle the task at hand.

~~~
fusiongyro
As it happens, the analogy is from the Unix Text Processing book, page 97:
"Mark Horton writes us: I think of ms as the FORTRAN of nroff, mm as the PL/I,
and me as the Pascal."

Thank you for taking the time to answer these questions! I will plough through
some of this documentation and make my way over to the list.

------
Surio
Question for the groff experts out there:

XeLaTeX has finally moved LaTeX into the realm of directly embedding
_virtually any true type font_ into the final document. Literally 5-6 lines of
LaTeX commands and you are done! I was amazed when I first did it. It was so
convenient, when compared to messing with pfbs, then afms, and then.... you
get the idea, no?

Is there a similar groff mechanism to pull TTF/OTF fonts into the final PS/PDF
document with similar minimal effort? If so, could someone be kind enough to
point me to a resource? Thanks.

P.S: I was searching, but could not home in on the right keywords to drive me
to an answer.

~~~
ralph
I _think_ TrueType fonts with groff is fairly painless; the topic comes up on
the groff@gnu.org list now and again but I don't pay much attention. The list
is low volume and very friendly, with some real old Unix hands on there; feel
free to ask, say I sent you if you like.
<https://lists.gnu.org/mailman/listinfo/groff>

Gunnar's Heirloom troff has TrueType support. "troff can access PostScript
Type 1, OpenType, and TrueType fonts directly, that is, it can read font
metrics from AFM, OpenType, or TrueType files, and can instruct its dpost
post-processor to include glyph data from PFB, PFA, OpenType, and TrueType
files into the output it generates".
<http://heirloom.sourceforge.net/doctools.html>

~~~
Surio
Thanks. I will check both the list and _heirloom_. Hadn't heard of heirloom
before. So, thanks again.

------
contingencies
roff: still the normal way to generate an IETF Internet Standards Draft text
document. This mechanism is showing its age though, especially regarding UTF-8
and width limitations.

~~~
4ad
I don't know about other systems, but on Plan 9, where UTF-8 was invented,
troff handles it just fine:

<http://plan9.bell-labs.com/magic/man2html/1/troff>

~~~
ralph
groff does it with the help of its preconv(1), also invoked with groff's -k
option.

        $ printf 'Hello ①②③\n' | preconv
        .lf 1 -
        Hello \[u2460]\[u2461]\[u2462]
        $ printf 'Hello ①②③\n' | groff -k -Tutf8 | grep .
        Hello ①②③
        $

------
new_test
Is there any advantage (speed?) in using something like Awk over something
like Python?

~~~
npsimons
Open up a terminal on Linux, OS X, or heck, even Cygwin or MinGW on Windows.
I'm willing to bet that awk is already pre-installed, even on a base or fresh
install. Awk, like grep, sed, and all the rest, is pretty standard and very
powerful. I'm not knocking Python; shell programming is just another paradigm,
and the tools are pretty universally available.

~~~
dredmorbius
Awk is part of the POSIX specification, so yes, it's going to be present on
pretty much any Unix-like environment. Even those which don't aim at POSIX
compliance will almost certainly have an awk interpreter. Busybox includes
one, which means that many minimal / embedded systems will include awk by
default (as they use busybox to provide core utilities).
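As a small illustration of what that ubiquity buys (the data here is made
up): a column sum that would take several lines of Python is a single
POSIX-awk expression, runnable on any of those systems with nothing to
install:

```shell
# Sum the second whitespace-separated column.
printf 'alice 10\nbob 20\ncarol 12\n' | awk '{ s += $2 } END { print s }'
# prints: 42
```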

