
Why you should learn at least a little bit of Awk - gregable
http://gregable.com/2010/09/why-you-should-know-just-little-awk.html
======
jdp23
Back in the 80s I wrote a 500-line program analysis tool in Awk. One day the
woman I was going out with handed me a printout I had left at her place,
saying something along the lines of "here's your awk code". She wasn't a
programmer so I was stunned that she knew it was Awk, and very impressed too.

Years later I ran into Brian Kernighan at a conference and told him the story,
ending it with "and that's when I knew she was the woman for me." He looked at
me like I was nuts.

~~~
SkyMarshal
Great story, but don't leave us hanging. Just how _did_ she know it was awk
code?

~~~
jdp23
one of the women she worked with used awk a lot for munging data and simple
reports from their pre-SQL database ... it's pretty recognizable :-)

------
Kliment
Awk is a great and oft-forgotten tool. Not only is it useful, the awk way of
thinking about stream processing generalizes nicely to a bunch of other areas.
You have a block that runs before anything else happens, a block run just
before the program exits, and a block run for every piece of input. In awk,
the input is a line of text, but nothing stops you from generalizing this to
say a frame from a video (split into channels in various colorspaces, fed
through a processing pipeline, returning another, processed image), a sound
frame, a sensor measurement...
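The three-block shape described above can be sketched directly in awk (the sample input here is invented just for illustration):

```shell
# Sketch of the structure described above: a BEGIN block, a per-record
# block, and an END block. The input data is made up for illustration.
printf 'a 1\nb 2\nc 3\n' | awk '
BEGIN { print "start" }        # runs before any input is read
      { sum += $2 }            # runs once for every input line
END   { print "total:", sum }  # runs after the last line
'
```

Swap "line of text" for a video frame or a sensor sample and the same three-phase shape carries over unchanged.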

~~~
jakevoytko
_"nothing stops you from generalizing this to say a frame from a video"_

This is 100% true. A coworker of mine implemented an elevation-bitmap-
to-3d-model conversion tool in 160 lines of Awk. It ran faster than our "good"
Matlab tool by a factor of 10.

Awk (or Perl) doubles the usefulness of Unix. Most of the common commands in
Unix are query commands. When you need to start manipulating queried data, Awk
is where the rubber meets the road. Piping data through the shell stops being
read-only, and becomes interactive.
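A small sketch of that query-then-manipulate pattern (the input data is invented; in practice it would come from ls, ps, du, or a grep over logs):

```shell
# The "query" stage is simulated with printf here; awk does the
# manipulation in the middle of the pipe, summing a value per key.
printf 'alice 120\nbob 340\nalice 75\n' |
  awk '{ total[$1] += $2 } END { for (u in total) print u, total[u] }' |
  sort
```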

~~~
silentbicycle
> It ran faster than our "good" Matlab tool by a factor of 10.

Could you give a bit more details there? I don't have any experience with
matlab, but I tend to think of awk as fast to write code in (and start up),
though not particularly fast in execution. (Roughly on par with Python, i.e.,
usually good enough.)

~~~
goosemo
You might want to rethink that a bit:
<http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/>

~~~
silentbicycle
That's mawk. I'm talking about the implementation that post calls "nawk", and
either way, I mean orders of magnitude - I care about a 10-100+x difference in
speed, not a 1.1-5x one. Awk and Python fall in roughly the same performance
tier for that kind of code.

Also: "I have since found large datasets where mawk is buggy and gives the
wrong result. nawk seems safe." makes me uneasy, as does the fact that it was
unmaintained for a while.

~~~
_delirium
Afaict, mawk's maintenance seems to be a bit up in the air--- the original
maintainer basically disappeared years ago and hasn't blessed any successor,
so the Debian-patched version became the de-facto current version, since at
least it staved off bitrot. Recently someone (Thomas Dickey) picked up
maintenance of a new upstream version unilaterally, starting from the Debian-
patched version, but he hasn't managed to convince the Debian mawk maintainer
to accept his new version as a new upstream (somewhat testy thread here:
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=554167>). I'm personally a
little more comfortable with something actively maintained like gawk, despite
the speed differences.

~~~
silentbicycle
Right. I usually use (n)awk because it's the default on OpenBSD, but have to
admit gawk's artificial-filesystem-based networking support is pretty cool.

------
d4rt
I can't recommend the Sed & Awk book enough.

Regular expressions are my favourite secret weapon; so many problems are made
simple by regular expressions, and so few people (outside of IT) know of them.

~~~
telemachos
After I recommended the Sed & Awk book here a few months ago, silentbicycle
countered[1] that _The Awk Programming Language_ [2] (by Aho, Kernighan and
Weinberger) was much better.

I was curious enough that I bought and read it just at the end of summer. It
really is excellent. Highly, highly recommended.

[1] <http://news.ycombinator.com/item?id=1403376>

[2] <http://cm.bell-labs.com/cm/cs/awkbook/>

~~~
signa11
from what I remember, "the unix programming environment" by Kernighan and Pike
is also pretty good, and contains a basic introduction to most of the unix
utilities.

~~~
silentbicycle
You can't go wrong with any programming books Brian Kernighan co-wrote,
really. I have _The C Programming Language_ ("K&R"), _The Practice of
Programming_, _The AWK Programming Language_, and _The Unix Programming
Environment_, and they're all great. Concise, with a lot of depth that reveals
itself on repeat reading.

Ierusalimschy's _Programming in Lua_ ("PiL") was written in a similar style. I
recommend it quite highly, too. Great language, great programming book.

Also, the PSD, SMM, and USD books (_4.4BSD Programmer's Supplementary
Documents_, etc.) are dry, but also have excellent introductions to several
classic Unix tools. They're included as documentation in some BSD
installations, and should be easy to find otherwise. The intros to lex and
yacc are particularly good.

~~~
telemachos
I'm glad you saw this thread. It's always nice to find out somebody actually
paid attention to (and appreciated) some advice you put out on the interwebs.

~~~
silentbicycle
:) Awk threads always seem to get my attention.

------
silentbicycle
Many people consider Perl to be the next evolution of awk, but I prefer to
think of awk as (just) the essentials of Perl. Perl has CPAN, etc., but for
quick string hackery, everything you need fits in one tiny awk reference. Its
design hasn't sprawled the way Perl's has. (Except for gawk. The FSF does
bloat better than anyone.)

It's incredibly handy, yet the language is small enough that you can learn
most of it in an evening, with just a bit longer if you don't know regular
expressions.

~~~
_delirium
I definitely prefer awk versus perl for one-liners, with some sed thrown in.
Perl does have some command-line switches to ease certain kinds of one-liners,
but it just feels more verbose for that kind of interactive use (feels more
oriented towards writing scripts).

I do tend to use Perl for things where speed matters, though, especially with
large amounts of data going through a regex--- Perl's regex engine seems
considerably faster than any awk (or especially sed) I've tested, at least on
a few examples I've ported in the past. I was surprised once to get an 8x
speedup by porting a 3-line sed script to a 3-line perl script (it was
basically doing s/ABC/A\nC/g on a multigigabyte file). I've heard mawk can be
speed-competitive with Perl, though.

~~~
silentbicycle
Same here, but I prefer Lua to Perl, and Lua's LPEG
(<http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html>) compares very favorably
to common regexp implementations. (There are benchmarks in the paper.)

It's based on PEGs, a different formalism than regular expressions. PEGs are
more expressive - they're able to handle balanced, recursive structures, for
example. LPEG is a nice middle ground between regular expressions and a full
LALR(1) parser.

~~~
swift
I'm not sure if "middle ground" is quite right; PEGs and CFGs can express a
different set of languages, and each has their own advantages. Probably the
most important tradeoff is that with PEGs you gain infinite lookahead and
negation, but you lose left recursion and the ability to express ambiguity.

~~~
silentbicycle
I meant "middle ground" in a practical sense, rather than the linguistic one -
REs are good for simple string hackery, but not sufficient for nested
structures. Using an actual parser generator (the yacc clone in your language
of choice) can really be overkill for simple things, though. LPEG is a bit
more expressive than just REs, but still easy to casually drop in during quick
scripting.

------
awakeasleep
Note that the first code he writes on the page

 _awk "{print $0}"_

does not work. Awk programs need single quotes to prevent the shell from
expanding $0.
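The difference is easy to see in a shell: inside double quotes the shell expands $0 itself (to the shell's own name) before awk ever sees the program.

```shell
# Single quotes pass $0 through to awk untouched.
echo 'hello world' | awk '{print $0}'   # prints the whole line
# With double quotes, the shell would substitute $0 first, so awk would
# receive a program like {print bash} instead of {print $0}.
```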

~~~
beza1e1
Depends on the shell one uses, I think. On the other hand, who doesn't use
bash these days?

~~~
silentbicycle
See: <http://news.ycombinator.com/item?id=1669409> (Pet peeve of mine.)

------
redcap
It's been a while since I've touched awk, but I've certainly got a lot of use
from it when ripping data out of logfiles and using it elsewhere on the
command-line.

I can recommend this text file of awk one-liners:

<http://www.pement.org/awk/awk1line.txt>

And for completeness, here's one for sed:

<http://sed.sourceforge.net/sed1line.txt>
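A taste of the genre from collections like those: the classic "remove duplicate lines without sorting" one-liner (sample input invented here).

```shell
# Print each line only the first time it appears, preserving order:
# seen[$0]++ is 0 (false) on first sight, so the default print action fires.
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
```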

------
chaostheory
I could understand the usefulness of awk if, say, only C, C++, and Java/C#
existed, but given that it's just as easy and fast to code something useful
and powerful using something modern like Ruby or Python, I just fail to see
the point.

Oh yeah, and let's not forget Perl.

~~~
loup-vaillant
Awk is so small that you can be productive in half an hour. It's so concise
that most useful programs are easy little one-liners. It's so fast that you
can trust it with massive data crunching.

In other words, awk is unbeatable for stream crunching. (That's the point of
being domain specific, by the way.)

~~~
chaostheory
"Awk is so small that you can be productive in half an hour. It's so concise
that most useful programs are easy little one-liners."

I can say the same for ruby and python (and perl).

From personal experience, as an awk script/program becomes more important, it
will evolve with more requirements and start to get clunky. It just isn't
practical to stick with it, since you'll eventually need the
features/libraries that the other languages have. Given the choices we have
today, why even start with awk?

On the performance side, you can always just use Lua if that's really
important.

~~~
silentbicycle
The major benefit with awk is that it runs as a pattern recognizing/processing
filter _by default_ , so it handles certain common problems in very little
code, and fits particularly well in Unix shell pipelines. I'm also a big fan
of structuring code in terms of pattern-matching. (I wrote an Erlang-style
pattern matching library for Lua, btw:
<http://github.com/silentbicycle/tamale/> )

I write a lot of little awk scripts, but if they grow past ~5 lines, they
usually get rewritten in Lua. (Perhaps eventually with inner loops in C.)
Still, Awk is simple and useful enough that it's still worth knowing.

~~~
chaostheory
"The major benefit with awk is that it runs as a pattern
recognizing/processing filter by default"

Doesn't every language have regular expressions built in now? Again I still
fail to see the point of writing it in Awk when you can write something small
and fast in a more powerful and modern language.

~~~
silentbicycle
I mean something different than regular expressions: I'm talking about how the
whole program is structured around "pattern -> action; other pattern -> other
action; ...", with special event patterns for BEGIN, END, etc. That pattern-
based dispatch is the _top level_ of the language, rather than function
definitions. (Those came later.) As the man page says, it's "pattern-
directed".

It's a higher-level approach than typical scripting languages, and that's why
it can be so concise - the model makes a lot of unpacking and looping
implicit. It's a DSL for stream-processing problems that are easily phrased
as "count these", "transform this into that", etc.
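A minimal sketch of that pattern -> action top level (the log lines are invented for illustration):

```shell
# Each rule is pattern { action }; every input line is tested against
# every pattern, and BEGIN/END are special event patterns.
printf 'ERROR disk full\nINFO ok\nERROR timeout\n' | awk '
/^ERROR/ { errors++ }
/^INFO/  { infos++ }
END      { print errors, "errors,", infos, "info" }
'
```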

Are you familiar with Prolog? It uses a similar approach, but can match on
whole trees (and other complex, nested data structures), not just a list of $N
string/numeric tokens. Also, it supports backtracking - at any point, if it
reaches a dead end, it can back up arbitrarily and try a different approach.
Sometimes slow, but very handy for prototyping.

I agree that using a language other than awk makes sense after a few lines,
but it's still a sweet spot for 1-5ish line programs. Since awk itself is
small enough that a two page cheat sheet is sufficient, it's worth keeping
around. Perl (for example) has many nooks and crannies I forget about if I
don't use it frequently.

~~~
swift
Anyone who hasn't tried a general purpose language with pattern-based dispatch
(usually referred to in practice as "pattern matching") should really do
themselves a favor and try one; it's one of the most useful language features
around. Now that I've become used to it, it's a bit unpleasant for me to use
languages that don't have it. It's a very convenient way to structure code.

The parent post mentions Prolog, which is a good example, but there are
several others worth trying that frequently come up on HN; Scala, Haskell, F#,
and Ocaml spring to mind.

~~~
silentbicycle
Yes! Anybody who knows me in person is probably tired of hearing about how
good pattern matching is by now. :) I definitely know what you mean about
missing it in languages without it, that's why I've been working on tamale.

I can't speak for Scala, but the PM in Haskell and OCaml is a bit different
since it's informed by the static typing. When patterns have variant types
(i.e., x is either Foo, Bar, or Baz * int), it also checks for complete
coverage. Same general concept, different flavor. Also very useful.

I mentioned Prolog in particular because its emphasis on unification and
backtracking make it the most pattern-matching-centric programming language
I've seen. Where other languages _have_ pattern matching, it almost _is_
pattern matching.

Also, there are well-known ways to compile pattern specifications into
efficient decision trees, so while it's a very expressive abstraction, it's
not necessarily an expensive one. If they're being constructed at runtime (as
they are in my Lua library), you can generally get a big improvement by just
indexing on the patterns' first fields and doing linear search thereafter.

------
SpaceHobo
I had great fun writing the traditional "Cloak of Darkness" exercise for
Interactive Fiction in pure AWK:

<http://zork.net/~nick/loyhargil/if/if.awk>

For comparison, here are all the published examples of this exercise in a
variety of systems:

<http://www.firthworks.com/roger/cloak/>

I won't say it's the best tool for this job, but I feel that the awkishness
provides a certain elegance to some aspects.

------
grease
I went through the article and tried the stuff on the log files on my web-
server. Useful stuff.

------
nwmcsween
I know this goes against what is said here, but I _hate_ awk. The syntax is so
convoluted that it seems parts were picked with whatever was reasonable at the
time. It's like I'm banging on rocks in a cave somewhere every time I have to
work with bash, awk, and related tools. In fact, I wrote a quick bash script
for pattern-matching some files, moving them, resizing, compressing, and then
uploading them, and it took three days of reading man pages, fiddling with
parsing, etc. until I got fed up, used Ruby, and had it done in under an hour.

------
brendano
Awk can also be faster than (naively written) C++.
[http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-
and...](http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-
elegant-big-data-munging-language/)

~~~
silentbicycle
That's not news, though. Better algorithms trump constant factors, and what is
"naively written C++" if not a murder of bad algorithm choices?

I'd bet that people _get shit done_ 10x+ faster in
awk/lua/python/ruby/lisp/whatever until having to work with nasty C++-specific
libraries dominates, though. (C is friendlier that way.)

------
stevefink
That and most places with a clue that are hiring competent sys admins will
expect at least some knowledge of sed and awk.

~~~
spudlyo
That may have been true prior to 1987; since then, however, Perl has largely
superseded sed and awk.

~~~
ciupicri
Or Python at Google.

------
cbernini
Since I was introduced to AWK I haven't looked back; 80% of what I have to do
on the command line ends up using AWK.

