
Awk in 20 Minutes - craigkerstiens
http://ferd.ca/awk-in-20-minutes.html
======
na85
Story time.

Back in ~2005 when I was still very new to Linux I had this old Dell that I
put Gentoo on. It was a 900MHz Intel Coppermine with a minuscule amount of
memory, so things compiled very slowly. Being the wise Linux guru that I was
back then, I decided that I would emerge (compile) all of X, fluxbox,
OpenOffice, and probably 1 or 2 other things.

I didn't have distcc set up. It compiled for days.

One thing I really did enjoy was watching the cryptic messages crawl by, and
one that always mystified me was during the configure stages:

    
    
       checking for gawk... gawk
    

It seemed like a total nonsense word. To this day I still have fond memories
every time someone mentions (g)awk.

------
chris_wot
You are awesome. That's seriously the easiest bit of technical reading I've
ever done. I tip my hat to you!

~~~
pbhjpbhj
Yes, easy reading, as a non-programmer the only thing that lost me was the
throw away line in the code comments on "function arguments are call-by-
value". I've some recollection that call-by-value and call-by-reference are
options here, but I'm not entirely clear on what "call-by-value" means.

It probably doesn't matter though - the Wikipedia reference is ever-so-slightly
too terse for me to understand the implications thoroughly too,
[http://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_val...](http://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_value).

Am I right in saying that if I call a function on $2, say, that unless that
variable is explicitly assigned in the function that $2 remains unchanged
after the function operates. But that with a call-by-reference the value of
the variable itself will be altered? Something along those lines.

~~~
antsar
_Am I right in saying that if I call a function on $2, say, that unless that
variable is explicitly assigned in the function that $2 remains unchanged
after the function operates. But that with a call-by-reference the value of
the variable itself will be altered?_

You've got it. At least that's what passing a variable by reference vs. by
value means in other languages, such as PHP.
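For what it's worth, awk itself shows both behaviors in one language: scalar
arguments are passed by value, while arrays are passed by reference, so a
function can mutate a caller's array but not its scalars. A minimal sketch:

```shell
# Scalar arguments are copied; array arguments are shared with the caller.
awk '
function bump(x, arr) {
    x = x + 1                  # changes only the local copy
    arr["n"] = arr["n"] + 1    # changes the array in the caller
}
BEGIN {
    v = 1; a["n"] = 1
    bump(v, a)
    print v, a["n"]            # prints "1 2": v untouched, a["n"] bumped
}'
```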

------
sea6ear
Also, Awk has one of the best bite-sized tutorials/references out there. Up
there with Programming in Lua in terms of awesomeness per page count.

[http://www.amazon.com/AWK-Programming-Language-Alfred-
Aho/dp...](http://www.amazon.com/AWK-Programming-Language-Alfred-
Aho/dp/020107981X/)

------
freditup
Very well done article.

General question: when would one choose to use an awk script over something
more general purpose such as a python or ruby script? To me it would make
sense to use the latter in most cases.

~~~
babarock
Good question. I've been working with awk for several years now, and here's
how I feel about it.

AWK is old. 1977 old. Later versions that appeared (nawk and gawk are the most
common) helped make it a smoother language, but it's still a pain. There are
definite features you will be missing in an awk script:

- Any useful data structure slightly more complex than associative arrays.
Try multi-dimensional arrays; it's actually fun to do. Once.

- Any useful programming construct to help manage the complexity of scripts
longer than a few hundred lines. No classes, no variable scoping, no
namespaces in general. Not to mention an extremely permissive interpreter.

- Any easy way to deal with the environment (other than text). Try sending
HTTP requests in awk; it can be a pain.

In 2015, if you need to write a script you should almost always prefer
python/ruby over awk.
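To illustrate the point about data structures: awk's "multi-dimensional"
arrays are really flat associative arrays whose keys are the subscripts
joined with the SUBSEP character, which you have to split apart yourself when
iterating. A small sketch:

```shell
# m[1, 2] is really m[1 SUBSEP 2]: one flat key, no nested structure.
awk 'BEGIN {
    m[1, 2] = "x"
    if ((1, 2) in m)               # membership test on the joined key
        print "found"
    for (k in m) {                 # iteration yields the joined key
        split(k, idx, SUBSEP)      # split it back into its parts
        print idx[1], idx[2]
    }
}'
```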

Now if you're asking whether you should learn awk? It comes in handy. There
are _a lot_ of awk scripts in the wild, and you may need to read them or edit
them one day. Also, awk has a fun way of parsing input, which makes for very
enjoyable one-liners. Learning awk (and some complementary utilities like sed
or find) turned me into an "oh, it can be done in a quick one-liner" kind of
guy. Definitely recommended.

~~~
a3n
> Also, awk has a fun way of parsing input, which makes for very enjoyable
> one-liners.

Yes, being able to casually toss off an awk (or sed) one-liner is a very
convenient skill to have.

~~~
busterarm
I agree, but I get most of the same out of Ruby, even if it might be slightly
less readable (or ugly Ruby code).

It's nice that the mentality comes out of using certain languages though.

also [http://tomayko.com/writings/awkward-
ruby](http://tomayko.com/writings/awkward-ruby)

~~~
kazinator
So you get an ugly program that requires a third party tool, rather than just
a POSIX system. I would say that this solution then requires additional
justification compared to the Awk no-brainer.

------
michaelmcmillan
Piping things in and out of awk is really powerful and fast! Check out how
Gary Bernhardt utilizes different Unix commands (including awk) to filter out
dead links on his blog: [http://vimeo.com/11202537](http://vimeo.com/11202537)

~~~
greggyb
This is a cool video. I have been using Linux casually for a year and a half
now, but haven't progressed too much past the desktop paradigm.

I am not uncomfortable in the terminal, but I feel more akin to a foreigner
who's mastered a phrasebook, rather than someone able to hold even a
rudimentary conversation.

Do you (or does anyone else) have more examples similar to this, or resources
that would be useful in moving toward this fluency in utilizing command line
tools?

~~~
michaelmcmillan
I would strongly urge you to purchase Gary Bernhardt's screencasts at
[http://destroyallsoftware.com](http://destroyallsoftware.com). They have been
extremely influential for me. I've realized that Unix should be looked at as a
tool when programming. He also has strong opinions on how to test properly;
check this out:
[https://www.youtube.com/watch?v=tdNnN5yTIeM](https://www.youtube.com/watch?v=tdNnN5yTIeM)

~~~
greggyb
These look very interesting. I think I'll probably end up purchasing these.

------
kespindler
Going to put a shameless plug here for the python awk replacement I made.

github.com/kespindler/puffin

Instead of learning a new language and new syntax, just use python!

~~~
andre3k1
If only it was as fast as awk. Is it?

~~~
Russell91
No, the python interpreter itself takes about 0.03 seconds to start up on an
i7. Awk and sed and friends are usually about 1/10th of that. So the python
interpreter startup time dominates the cost for most simple tasks. It's not
enough to notice when you're typing things by hand at the shell, but the
difference can be painful if you're writing bash scripts. I use a similar
python tool for command line work, but I always go for sed when I'm writing
bash scripts that I plan to distribute.

------
proveanegative
This seems like a good thread to ask: are there more featureful languages that
derive from Awk (edit: i.e., can work in a data-driven mode) but don't diverge
as much as Perl did in terms of syntax?

~~~
a3n
awk applies actions to lines matching patterns. The design seems concise and
limited on purpose.

Not that you shouldn't want more features, but in the context that awk is
typically used, what sort of features would you want to add? I confess I'm not
able to understand what "can work in a data driven mode" might mean.

To answer your question a little more directly (and yet still be almost a non-
answer), additional features can be found on the other side of the pipe into
which you direct awk's output.

~~~
proveanegative
> awk applies actions to lines matching patterns. The design seems concise and
> limited on purpose.

That is what I meant by a "data driven mode".

> what sort of features would you want to add?

An extended awk could at least add new types of patterns, e.g., binary,
stateful and nested patterns or formal grammars, regex patterns with match
groups, etc., meaning you could correctly process "real" CSV files or log
files with complex structure. I believe these features could be made to fit in
with the design of the language, making them all feel like parts of the same
whole. It could also add new functions, such as ones for Unicode text
normalization.

Perl5 does much of this, and Perl6 even introduces grammars, but because of
some of its design decisions, Perl is not, for me, a joy to use the way Awk is.
Both of its versions are just plain too big.

~~~
lsiebert
You can pass multiple -f file options, but I am not sure if you can include a
library file from within awk itself, nor am I aware of any options for
conditional linking. Those would be nice; they would make it easy to write
awk libraries.

------
reacweb
I use awk only for trivial commands that fit on one line. Here is my last one:

    
    
      awk -F : '/^OPS1:/{print $2}' < machinelist
    

I think the article could add an example of this kind of simple usage.
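For illustration (the contents of machinelist are invented here), given
colon-separated name:host lines, that one-liner prints the second field of
every line whose first field is OPS1:

```shell
# Sample input piped in place of the machinelist file
printf 'OPS1:web01\nOPS2:db01\nOPS1:web02\n' |
    awk -F : '/^OPS1:/{print $2}'
# prints:
# web01
# web02
```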

~~~
aidos
Here's my last one (find half of the free memory to feed into a stress testing
tool):

    
    
      stress --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.5;}' < /proc/meminfo)k --vm-keep -m 1
    

It's like playing terminal history awk roulette :)

------
coliveira
The best introduction to awk is its man page. It is such a concise language
that you can find pretty much everything you want to learn about it in just a
few pages.

~~~
Gracana
The manual page isn't an introduction, it's a reference. It says what it is
(_the GNU Project's implementation of the AWK programming language_), and goes
on to talk about how the program is operated and how the language works, but
it doesn't explain why or where you'd use this tool. That's fine for a
reference manual, but it's a big omission for an introduction.

~~~
coliveira
I am not talking about the GNU Info page, but the traditional man page. It is
much more concise and useful.

~~~
tarblog
Can you provide a link to an online version? I'd like to read it.

------
vram22
Here's a UNIX one-liner I wrote a while ago that uses awk, sed and grep for a
real-life need:

UNIX one-liner to kill a hanging Firefox process:
[http://jugad2.blogspot.in/2008/09/unix-one-liner-to-kill-
han...](http://jugad2.blogspot.in/2008/09/unix-one-liner-to-kill-hanging-
firefox.html)

The comments on that post are also of interest.

------
wazoox
I use perl one-liners, thanks to the power of -n, -i and -p switches.

    
    
      perl -n -e <expression> <file>
    

runs the expression on each line of <file> without printing anything by
default (<file> can be STDIN of course).

    
    
      perl -p -e <expression> <file>
    

runs the expression on each line and then prints the line, whether or not the
expression matched (much like sed). To print only matching lines, use -n with
an explicit print. The -i option allows editing files in place. Add
-i<extension> to create a <file>.<extension> backup, just in case.

Examples:

    
    
      perl -n -e 'print if /admin/' file
      perl -n -e 'print if /^admin/' file
    

behave exactly like the basic awk examples. We can also replace stuff:

    
    
      perl -p -e 's/admin/bozo/g' file
    

prints all lines, replacing every occurrence of 'admin' with 'bozo'.

Now if you just want to replace all occurrences of 'admin' with 'bozo' in the
file:

    
    
      perl -pi -e 's/admin/bozo/g' file
    

Perl also allows you to use a different separator than / for regexps, which is
very useful when manipulating paths:

    
    
      perl -pi -e 's#/some/path/#/different/path/#g' file
    

Of course, instead of a simple awk/sed substitute, you can run more elaborate
code, and even use perl modules in your one-liners with the -M switch:

    
    
      perl -MData::Dumper -n -e 'print Dumper $1 if m/^(admin \w+)/' file
    
    

Lastly, perl allows you to use q(string) instead of 'string' and qq(string)
instead of "string", which avoids a lot of escaping when typing one-liners.
So instead of:

    
    
      perl -n -e 'print '\''found'\'' if m/admin/' file
    

Use

    
    
      perl -n -e 'print q(found) if m/admin/' file

------
decisiveness
This is probably the most concise introduction to awk I've read.

For more extensions[0] and advanced features like arrays of arrays and array
sorting, there's also gawk. And for larger files there's the performance-driven
mawk, which can drastically increase processing speed[1].

[0][https://www.gnu.org/software/gawk/manual/html_node/Extension...](https://www.gnu.org/software/gawk/manual/html_node/Extension-
Samples.html)

[1][http://brenocon.com/blog/2009/09/dont-mawk-awk-the-
fastest-a...](http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-
most-elegant-big-data-munging-language/)

------
lotsofcows
Nice write-up!

A couple of points:

On CentOS, at least, gawk == awk, which falsifies "patterns cannot capture
specific groups to make them available in the ACTIONS part of the code". E.g.:

    
    
      echo "abcdef" | gawk 'match($0, /b(.*)e/, a) { print a[1]; }'
    

You missed the match and not match operators ~ and !~ in your list.

Finally, I find people better understand the flexibility of awk when they
realise that awk '/bob/ { print }' is shorthand for awk '$0~/bob/ { print $0;
}'. This makes it clear that the pattern element is not limited to regexes or
to matching the whole line.
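A quick sketch of those operators matching against a single field rather than
the whole line (sample input invented here):

```shell
# ~ matches a field against a regex; !~ negates the match
printf 'bob smith\nalice jones\n' | awk '$1 ~ /bob/ { print $2 }'
# prints: smith
printf 'bob smith\nalice jones\n' | awk '$1 !~ /bob/ { print $2 }'
# prints: jones
```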

------
kasperset
Awk is also used a lot in bioinformatics, where you need one-liners to
extract/format data. Of course, Perl/Python/R/Ruby can also be used, but in
some cases Awk is just simple and graceful.

~~~
SilasX
Yes! It shows up a lot among the briefest solutions to Project Rosalind.

------
Friedduck
Were I to have had this instead of the O'Reilly book (Sed & Awk) that I
learned from. The time I could have saved.

My goal: to be able to write as clearly as this.

------
gesman
Excellent tutorial.

I'm actually working on (about to finish) a free Splunk app that monitors HTTP
traffic via Apache logs on WHM/cPanel-based hosting servers and visualizes
traffic and activity trends and patterns between IP addresses and sites.

Awk would be an excellent tool to quickly play with and "debug" log content
alongside the visual tool.

Additionally I think I'd want to utilize it for malware detection.

+ On my "to practice" list.

------
tieTYT
The one thing this is missing is a search/replace example. That's a very
common thing to do and annoyingly clunky in Awk.

~~~
jnazario
A lot of people pipe to sed, but you don't need to: you can do regex
substitutions right in awk. See sub and gsub.

    
    
           sub(r, t, s)
                  substitutes  t  for the first occurrence of the regular expression r in the string s.  If s is
                  not given, $0 is used.
    
           gsub   same as sub except that all occurrences of the regular expression are replaced; sub  and  gsub
                  return the number of replacements.
    

(This is awk from OS X, which I _think_ is nawk.)
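A quick sketch of both, using made-up input:

```shell
# sub() replaces only the first match on the line
echo "admin and admin" | awk '{ sub(/admin/, "bozo"); print }'
# prints: bozo and admin

# gsub() replaces every match and returns the replacement count
echo "admin and admin" | awk '{ n = gsub(/admin/, "bozo"); print n, $0 }'
# prints: 2 bozo and bozo
```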

~~~
favadi
If you have gawk, use gensub instead.

------
101914
BEGIN and END are somewhat like what is above the first %% and below the
second %% in a yylex "source" file. Or maybe not. I need to review the
documentation.

This brief AWK intro entices me to try making a similar one for (f)lex using
the author's concise format as a model.

In any event, Pattern --> Action is common to both programs.

~~~
kazinator
The %% in lex and yacc statically organize the file into different areas.
BEGIN and END have run-time semantics: do these things before applying the
pattern/actions to the inputs, and do these things afterward.

~~~
101914
1. I never mentioned yacc. What relevance does it have to my comment? I
typically use (f)lex without yacc/bison to do a similar job as I would use AWK
for: text processing.

2. "Statically organize the file into different areas": one is a code
generator and the other is a scripting language with an interpreter, is that
what you mean? In effect, this difference means little to me (except for speed
of execution): I store my (f)lex programs as source files that I feed to the
(f)lex code generator. Then I compile the generated C code. I store my AWK
scripts as source files that I feed to the AWK interpreter. I use both flex
and AWK to perform a similar task: text processing.

For whatever it is worth, I get better performance from my compiled flex
scanners than from my interpreted AWK scripts. But I sometimes use them for
the very same text processing jobs.

AWK:

    
    
      BEGIN { define variables }
      pattern-action rules
      END { stuff to do after EOF }
    

(f)lex:

    
    
      { definitions } user variables
      %%
      { rules } pattern-action rules
      %%
      { user routines } stuff to do after EOF
    

From the blog: "_BEGIN_, which matches only before any line has been input to
the file. This is basically where you can _initiate variables_ and all other
kinds of state in your script."

From Lesk and Schmidt: So far only the rules have been described. The user
needs additional options, though, to _define variables_ for use in his program
and for use by Lex. These can go ... in the _definitions_ section...

From the blog: There is also END, which as you may have guessed, will match
after the whole input has been handled. This lets you clean up or do some
final output before exiting.

From Lesk and Schmidt: Another Lex library routine that the user will
sometimes want to redefine is yywrap() which is called whenever Lex reaches an
end-of-file.

I regularly use yywrap in the "user routines" section. It functions much the
same way as commands I use in the END section of an AWK script.

I guess one can either focus on differences or similarities. I choose the
latter.

I care little about the "intended purpose" of a program. I care more about
what a program can actually do.

~~~
kazinator
I know, but both lex and yacc use the %% division in similar ways; that is why
I mentioned it.

Simply put, your "definitions" are not stuff that is done before pattern-
action rules, and "user routines" are not stuff that is done after EOF. It's
all just stuff that is declared. Both sections can contain code, and that code
can be called from the pattern rules. Either section could contain a main
function that calls yylex. If the lexer is reentrant, it could be re-entered
from any of those places. And so on. Fact is, the %% division has nothing to
do with processing order, unlike BEGIN and END in Awk.

~~~
101914
%% division can be used to do exactly what BEGIN and END do, and that is how I
use it. Moreover, as I recalled correctly, the Lesk and Schmidt paper
specifically mentions such usage.

My comment is not referring to the internal behavior of the two programs (as
yours is). And the Lesk and Schmidt paper is not setting down hard and fast
rules; it is only making suggestions. My comment was about how the two
programs can be used to do similar work, i.e., text processing.

If you do a lot of text processing work, at some point AWK is not fast enough.
I have other programs I use, and flex is one of them. Specifically, scanners
(filters) produced with flex.

~~~
kazinator
I don't disagree that you can put stuff that is done first above the first %%,
and then stuff that is done after scanning after the second %%. I just don't
think that this makes %% analogous to BEGIN and END. For one thing, stuff can
be moved around from one of those sections to the other, without changing the
basic organization of the program. For instance, prior to the first %% you can
put prototype declarations, and move everything to the bottom.

------
joshbaptiste
#awk on Freenode IRC is a great place for general or advanced help.

------
carapace
The second comma in the second sentence has got to go: "on files, usually
structured" <-- that one. There are some others too. Other than that this is
really great!

------
OneOneOneOne
AWK is very easy for C programmers to learn.

You can also try 'info gawk' from the Unix or Cygwin prompt for a good
tutorial.

~~~
mseepgood
Or JavaScript programmers. JavaScript's function syntax came from AWK.

~~~
agumonkey
First time I've read that, but the spec does say so.
[http://hepunx.rl.ac.uk/~adye/jsspec11/intro.htm#1006028](http://hepunx.rl.ac.uk/~adye/jsspec11/intro.htm#1006028)

------
hayksaakian
someone should add this to

[http://learnxinyminutes.com/](http://learnxinyminutes.com/)

------
baldfat
Awk & sed: I have a hard time thinking about one without the other. I think
they should be married.

~~~
AdmiralAsshat
Awk & Sed are labeled as the following in my mind:

- Awk: That thing I use when I need to grep for something over more than one
line and/or do some basic transformations.

- Sed: That thing with the painful syntax for doing ridiculously complicated
regex substitutions.

With that said, I find sed much more difficult to use than awk and generally
try to avoid it if I can. I'm even prone to just opening the file in vim and
executing the replace command through that rather than using sed.

~~~
a3n
I would use sed for transformations (simple or complex) where I only care
about one transformation (even though you can do multiple): does the current
line match this pattern? Change it to this other thing and move on to the next
line.

I would use awk when there are a handful or more of potential transformations:
Does the current line match any of these multiple patterns? Do the action
that's defined for each of the patterns, then move on to the next line.

If I need to do multiple transformations, and still want to use sed, I find it
easiest to create a chain of single sed transformations, piped together.
Somewhere in that area a shift to awk (or python) becomes justified.
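Sketching that contrast with invented patterns, a chain of single-purpose sed
passes versus one awk program holding several pattern/action pairs:

```shell
# sed: one transformation per pass, chained through pipes
printf 'add 1\ndel 2\n' |
    sed 's/^add/adding/' |
    sed 's/^del/removing/'

# awk: several pattern/action pairs in a single program
printf 'add 1\ndel 2\n' | awk '
/^add/ { print "adding", $2 }
/^del/ { print "removing", $2 }'
```

Both pipelines print "adding 1" then "removing 2"; the awk version keeps the
dispatch in one place as the number of patterns grows.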

~~~
simpleigh
I found sed great for messing around with SVN repository dumps. Delete a few
lines here to remove the commit that added a directory... change a few paths
to pretend the files in that directory had always been somewhere else... add a
few lines somewhere else.

Sed scripts are a quick way to automate simple edits to large files.

------
jacquesm
This is written by the same person who wrote 'Learn You Some Erlang for Great
Good!'.

------
known
Use gawk for processing large data files.

------
qodeninja
Finally, something useful.

------
patrickg_zill
I have been very happy in using awk, have used it to generate real results. It
is fast, a little clunky at first, but since it is a smaller language it is
quick to get up to speed on.

