Why you should learn at least a little bit of Awk (gregable.com)
162 points by gregable on Sept 29, 2010 | 62 comments

Back in the 80s I wrote a 500-line program analysis tool in Awk. One day the woman I was going out with handed me a printout I had left at her place, saying something along the lines of "here's your awk code". She wasn't a programmer so I was stunned that she knew it was Awk, and very impressed too.

Years later I ran into Brian Kernighan at a conference and told him the story, ending it with "and that's when I knew she was the woman for me." He looked at me like I was nuts.

Great story, but don't leave us hanging. Just how did she know it was awk code?

One of the women she worked with used awk a lot for munging data and simple reports from their pre-SQL database ... it's pretty recognizable :-)

Is she in the kitchen right now?

Awk is a great and oft-forgotten tool. Not only is it useful, the awk way of thinking about stream processing generalizes nicely to a bunch of other areas. You have a block that runs before anything else happens, a block run just before the program exits, and a block run for every piece of input. In awk, the input is a line of text, but nothing stops you from generalizing this to say a frame from a video (split into channels in various colorspaces, fed through a processing pipeline, returning another, processed image), a sound frame, a sensor measurement...
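A minimal sketch of that three-block shape (the input data here is invented for illustration):

```shell
# BEGIN runs before any input, END runs after all of it,
# and the bare middle block fires once per input record (line).
printf '3\n4\n5\n' | awk '
BEGIN { print "start"; sum = 0 }   # before anything else happens
      { sum += $1 }                # for every piece of input
END   { print "total:", sum }      # just before the program exits
'
# prints:
# start
# total: 12
```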

"nothing stops you from generalizing this to say a frame from a video"

This is 100% true. A coworker of mine implemented an elevation-bitmap-to-3d-model conversion tool in 160 lines of Awk. It ran faster than our "good" Matlab tool by a factor of 10.

Awk (or Perl) doubles the usefulness of Unix. Most of the common commands in Unix are query commands. When you need to start manipulating queried data, Awk is where the rubber meets the road. Piping data through the shell stops being read-only, and becomes interactive.

> It ran faster than our "good" Matlab tool by a factor of 10.

Could you give a bit more details there? I don't have any experience with matlab, but I tend to think of awk as fast to write code in (and start up), though not particularly fast in execution. (Roughly on par with Python, i.e., usually good enough.)

That's mawk. I'm talking about the implementation that post calls "nawk", and either way, I mean orders of magnitude - I care about a 10-100+x difference in speed, not a 1.1-5x one. Awk and Python fall in roughly the same performance tier for that kind of code.

Also: "I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe." makes me uneasy, as does the fact that it was unmaintained for a while.

Afaict, mawk's maintenance seems to be a bit up in the air: the original maintainer basically disappeared years ago and hasn't blessed any successor, so the Debian-patched version became the de-facto current version, since at least it staved off bitrot. Recently someone (Thomas Dickey) picked up maintenance of a new upstream version unilaterally, starting from the Debian-patched version, but he hasn't managed to convince the Debian mawk maintainer to accept his new version as a new upstream (somewhat testy thread here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=554167). I'm personally a little more comfortable with something actively maintained like gawk, despite the speed differences.

Right. I usually use (n)awk because it's the default on OpenBSD, but have to admit gawk's artificial-filesystem-based networking support is pretty cool.

Ultimately, what are you implying? Am I wrong? Awk (any implementation) isn't as fast as C, LuaJIT, or OCaml. It's likely to be good enough for many things, though (certainly prototyping), and it's definitely convenient for quick scripts.

General term for this: "Hylomorphism", defined as the composition of an anamorphism (a generator function) and a catamorphism (a fold/map-reduce function.) The initial base case of the generator runs BEGIN{}, and the terminal base case of the fold runs END{}.

Actually, I think the general term for this is a "pipe". (I've seen it called "generate and test [programming]" in Prolog books, but that's specific to a filtering pipe.)

perl -nle is a nice substitute if you need a bit more code or its version of regexps. This proved to be quite useful when working with multiple Unices that all had different awks.

Still, the One True Awk has my favorite opening line in its "b.c" source file:

    /* lasciate ogne speranza, voi ch'intrate. */

> You have a block that runs before anything else happens, a block run just before the program exits, and a block run for every piece of input

Interestingly enough, Windows PowerShell structures its cmdlets in the same way (Begin/Process/End blocks). Makes a lot of sense for stream processing, as you said.

On the other hand, both awk and sed quickly spiral out of control if you need to do anything nontrivial that spans newlines.

If the unit of input in this kind of stream processing system doesn't match the problem domain exactly, things get very difficult very quickly.

Awk isn't so bad if you're clever about RS, but sed sucks. A tragic gap in the Plan 9 legacy has been structural regular expressions, which deal with these situations adroitly.

(RS = record separator, it just defaults to newline. You can handle multi-line patterns in awk.)
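For example, setting RS to the empty string is awk's standard "paragraph mode": blank-line-separated blocks become single records, and with FS="\n" each line in the block becomes a field. (The sample data below is made up.)

```shell
# RS="" treats each blank-line-separated block as one record;
# FS="\n" makes each line within the block a field.
printf 'name: alice\nrole: dev\n\nname: bob\nrole: ops\n' |
awk 'BEGIN { RS = ""; FS = "\n" } { print $1 }'
# prints:
# name: alice
# name: bob
```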

My problem was that the records really were purely newline-delimited, but I needed to process them using information from their context in the stream.

Fair enough. That's beyond the common cases awk addresses. At that point, I just switch to Lua. (I forget if you're a Python or Ruby guy.)

I can't recommend the Sed & Awk book enough.

Regular expressions are my favourite secret weapon; so many problems are made simple by regular expressions, and so few people (outside of IT) know of them.

After I recommended the Sed & Awk book here a few months ago, silentbicycle countered[1] that The Awk Programming Language[2] (by Aho, Kernighan and Weinberger) was much better.

I was curious enough that I bought and read it just at the end of summer. It really is excellent. Highly, highly recommended.

[1] http://news.ycombinator.com/item?id=1403376

[2] http://cm.bell-labs.com/cm/cs/awkbook/

From what I remember, "The Unix Programming Environment" by Pike et al is also pretty good, and contains a basic introduction to most of the Unix utilities.

You can't go wrong with any programming books Brian Kernighan co-wrote, really. I have _The C Programming Language_ ("K&R"), _The Practice of Programming_, _The AWK Programming Language_, and _The Unix Programming Environment_, and they're all great. Concise, with a lot of depth that reveals itself on repeat reading.

Ierusalimschy's _Programming in Lua_ ("PiL") was written in a similar style. I recommend it quite highly, too. Great language, great programming book.

Also, the PSD, SMM, and USD books (_4.4BSD Programmer's Supplementary Documents_, etc.) are dry, but also have excellent introductions to several classic Unix tools. They're included as documentation in some BSD installations, and should be easy to find otherwise. The intros to lex and yacc are particularly good.

I'm glad you saw this thread. It's always nice to find out somebody actually paid attention to (and appreciated) some advice you put out on the interwebs.

:) Awk threads always seem to get my attention.

Many people consider Perl to be the next evolution of awk, but I prefer to think of awk as (just) the essentials of Perl. Perl has CPAN, etc., but for quick string hackery, everything you need fits in one tiny awk reference. Its design hasn't sprawled the way Perl's has. (Except for gawk. The FSF does bloat better than anyone.)

It's incredibly handy, yet the language is small enough that you can learn most of it in an evening, with just a bit longer if you don't know regular expressions.

I definitely prefer awk versus perl for one-liners, with some sed thrown in. Perl does have some command-line switches to ease certain kinds of one-liners, but it just feels more verbose for that kind of interactive use (feels more oriented towards writing scripts).

I do tend to use Perl for things where speed matters, though, especially with large amounts of data going through a regex: Perl's regex engine seems considerably faster than any awk (or especially sed) I've tested, at least on a few examples I've ported in the past. I was surprised once to get an 8x speedup by porting a 3-line sed script to a 3-line perl script (it was basically doing s/ABC/A\nC/g on a multigigabyte file). I've heard mawk can be speed-competitive with Perl, though.

Same here, but I prefer Lua to Perl, and Lua's LPEG (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html) compares very favorably to common regexp implementations. (There are benchmarks in the paper.)

It's based on PEGs, a different formalism than regular expressions. PEGs are more expressive - they're able to handle balanced, recursive structures, for example. LPEG is a nice middle ground between regular expressions and a full LALR(1) parser.

I'm not sure if "middle ground" is quite right; PEGs and CFGs can express a different set of languages, and each has their own advantages. Probably the most important tradeoff is that with PEGs you gain infinite lookahead and negation, but you lose left recursion and the ability to express ambiguity.

I meant "middle ground" in a practical sense, rather than the linguistic one - REs are good for simple string hackery, but not sufficient for nested structures. Using an actual parser generator (the yacc clone in your language of choice) can really be overkill for simple things, though. LPEG is a bit more expressive than just REs, but still easy to casually drop in during quick scripting.

Note that the first code he writes on the page

    awk "{print $0}"

does not work. Awk programs need single quotes to prevent bash expansion.
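Concretely: inside double quotes the shell expands $0 itself (to the shell or script name) before awk ever sees the program text.

```shell
# Single quotes pass $0 through to awk, which prints the whole record.
echo hello | awk '{print $0}'     # prints "hello"
# echo hello | awk "{print $0}"   # broken: the shell substitutes its own $0 first
```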

Depends on the shell one uses, I think. On the other hand, who doesn't use bash these days?

This can be simplified. :-)

    awk 1

Assuming you're using the Bash shell.

Same thing happens on Bourne shell, or Korn shell.
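The trick works because a bare pattern with no action defaults to printing the record: the pattern 1 is always true, so every line is printed, and there's nothing left for the shell to expand.

```shell
# "awk 1" is equivalent to awk '{print $0}': 1 always matches,
# and the default action for a matched pattern is { print }.
printf 'a\nb\n' | awk 1
# prints:
# a
# b
```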

Doh. Fixed.

It's been a while since I've touched awk, but I've certainly got a lot of use from it when ripping data out of logfiles and using it elsewhere on the command-line.

I can recommend this text file of awk one-liners:


And for completeness, here's one for sed:


I can understand the usefulness of awk if, say, only C, C++, and Java/C# existed, but given that it's just as easy and fast as Awk to code something useful and powerful using something modern like Ruby or Python, I just fail to see the point.

Oh yeah, and let's not forget Perl.

Awk is so small that you can be productive in half an hour. It's so concise that most useful programs are easy little one-liners. It's so fast that you can trust it with massive data crunching.

In other words, awk is unbeatable for stream crunching. (That's the point of being domain specific, by the way.)

"Awk is so small that you can be productive in half an hour. It's so concise that most useful programs are easy little one-liners."

I can say the same for ruby and python (and perl).

From personal experience, as an awk script/program becomes more important, it will evolve with more requirements and start to get clunky. It just isn't practical to stick with it, since you'll eventually need the features/libraries that the other languages have. Given the choices we have today, why even start with awk?

On the performance side, you can always just use Lua if that's really important.

The major benefit with awk is that it runs as a pattern recognizing/processing filter by default, so it handles certain common problems in very little code, and fits particularly well in Unix shell pipelines. I'm also a big fan of structuring code in terms of pattern-matching. (I wrote an Erlang-style pattern matching library for Lua, btw: http://github.com/silentbicycle/tamale/ )

I write a lot of little awk scripts, but if they grow past ~5 lines, they usually get rewritten in Lua. (Perhaps eventually with inner loops in C.) Still, Awk is simple and useful enough that it's still worth knowing.

"The major benefit with awk is that it runs as a pattern recognizing/processing filter by default"

Doesn't every language have regular expressions built in now? Again I still fail to see the point of writing it in Awk when you can write something small and fast in a more powerful and modern language.

I mean something different than regular expressions: I'm talking about how the whole program is structured around "pattern -> action; other pattern -> other action; ...", with special event patterns for BEGIN, END, etc. That pattern-based dispatch is the top level of the language, rather than function definitions. (Those came later.) As the man page says, it's "pattern-directed".

It's a higher-level approach than typical scripting languages, and that's why it can be so concise - the model makes a lot of unpacking and looping implicit. It's a DSL for stream-processing problems which are easily phrased as "count these", "transform this into that", etc.
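A small pattern-directed sketch in that spirit (the log format here is made up):

```shell
# Each "pattern { action }" pair fires only on matching records:
# count the ERROR lines, rewrite the WARN lines, summarize at END.
printf 'ERROR disk\nWARN net\nERROR cpu\n' | awk '
/^ERROR/ { errors++ }                          # "count these"
/^WARN/  { sub(/^WARN/, "warning:"); print }   # "transform this into that"
END      { print "errors:", errors }
'
# prints:
# warning: net
# errors: 2
```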

Are you familiar with Prolog? It uses a similar approach, but can match on whole trees (and other complex, nested data structures), not just a list of $N string/numeric tokens. Also, it supports backtracking - at any point, if it reaches a dead end, it can back up arbitrarily and try a different approach. Sometimes slow, but very handy for prototyping.

I agree that using another language than awk makes sense after a few lines, but it's still a sweet spot for 1-5ish line programs. Since awk itself is small enough that a two page cheat sheet is sufficient, it's worth keeping around. Perl (for example) has many nooks and crannies I forget about if I don't use it frequently.

Anyone who hasn't tried a general purpose language with pattern-based dispatch (usually referred to in practice as "pattern matching") should really do themselves a favor and try one; it's one of the most useful language features around. Now that I've become used to it, it's a bit unpleasant for me to use languages that don't have it. It's a very convenient way to structure code.

The parent post mentions Prolog, which is a good example, but there are several others worth trying that frequently come up on HN; Scala, Haskell, F#, and Ocaml spring to mind.

Yes! Anybody who knows me in person is probably tired of hearing about how good pattern matching is by now. :) I definitely know what you mean about missing it in languages without it, that's why I've been working on tamale.

I can't speak for Scala, but the PM in Haskell and OCaml is a bit different since it's informed by the static typing. When patterns have variant types (i.e., x is either Foo, Bar, or Baz * int), it also checks for complete coverage. Same general concept, different flavor. Also very useful.

I mentioned Prolog in particular because its emphasis on unification and backtracking make it the most pattern-matching-centric programming language I've seen. Where other languages have pattern matching, it almost is pattern matching.

Also, there are well-known ways to compile pattern specifications into efficient decision trees, so while it's a very expressive abstraction, it's not necessarily an expensive one. If they're being constructed at runtime (as they are in my Lua library), you can generally get a big improvement by just indexing on the patterns' first fields and doing linear search thereafter.

My little story about building a "real" program in awk was really just a toy example - I did it to teach myself some awk. I wouldn't really recommend writing significant persistent scripts in awk that much, if only because nobody else will want to maintain them. I think the beauty of awk for me is the little tiny bits of one liners that I can string together to do a quick bit of ad-hoc work. Anything more and I agree with you - I'd pick a real programming language.

"I think the beauty of awk for me is the little tiny bits of one liners that I can string together to do a quick bit of ad-hoc work. "

Again, my point is that you can do the same thing in Ruby, Python, Perl, or Lua just as easily and quickly as you can in Awk. Awk used to have a nice niche years back. It pretty much lost that niche the second Perl got popular, and it's even more irrelevant now that Ruby and Python make it even easier and faster to build stuff quickly.

About the only real pragmatic reason I can think of for learning Awk is to migrate existing Awk scripts, which started as nice useful one-liners but eventually evolved into spaghetti, to Python / Ruby.

I'm pretty sure he's talking about simple command line work. You seem to be missing the point. Compare:

  | awk '{print $2}' | ...

  | python -c 'import sys
  for line in sys.stdin:
      try:
          print line.split()[1]
      except IndexError:
          print' | ...

Yeah, but try

  | perl -nae 'print $F[1], "\n"'
  | ruby -nae 'puts $F[1]'

Your local python oil vendor may have a one-liner for that language as well.

I still posit that awk '{print $1}' is simpler

I had great fun writing the traditional "Cloak of Darkness" exercise for Interactive Fiction in pure AWK:


For comparison, here are all the published examples of this exercise in a variety of systems:


I won't say it's the best tool for this job, but I feel that the awkishness provides a certain elegance to some aspects.

I went through the article and tried the stuff on the log files on my web-server. Useful stuff.

I know this goes against what is said here, but I hate awk. The syntax is so convoluted that it seems parts were picked with whatever was reasonable at the time. It's like I'm banging on rocks in a cave somewhere every time I have to work with bash, awk, and related tools. In fact, I wrote a quick bash script to pattern-match some files, move them, resize, compress, and upload them; after three days of reading man pages and fighting parsing issues, I got fed up, used Ruby, and had it done in under an hour.

Awk can also be faster than (naively written) C++. http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and...

That's not news, though. Better algorithms trump constant factors, and what is "naively written C++" if not a murder of bad algorithm choices?

I'd bet that people get shit done 10x+ times faster in awk/lua/python/ruby/lisp/whatever until having to work with nasty C++-specific libraries dominates, though. (C is friendlier that way.)

That and most places with a clue that are hiring competent sys admins will expect at least some knowledge of sed and awk.

That may have been true prior to 1987, however since then Perl has largely superseded sed and awk.

Or Python at Google.

Since I was introduced to AWK I haven't looked back; 80% of what I have to do on the command line ends up using AWK.
