
Awk in 20 Minutes (2015) - jessaustin
https://ferd.ca/awk-in-20-minutes.html
======
mauvehaus
I learned perl in college in the 5.6 days, and did a lot of text processing
with it for a time. At some point, I bit the bullet and learned awk, and I've
mostly abandoned perl as a result.

Why? awk is small enough that it fits in my head, or at least the bits I need
every couple of months do. And if I forget, it only takes 20 minutes to put
them back in. perl, by contrast, is far too large to fit in my head and comes
with an ecosystem that is much larger still.

I know I can use perl for what I use awk for, and when I've raised this point
before, people have been quick to explain how to process input by lines and
conditionally do something. For basic stuff, the fact that that's basically
all awk does[0] means there are so many fewer ways to do it wrong. I can't say
the same of perl, even when I was more familiar with it.

caveat: awk, nawk, mawk, and gawk don't necessarily share the same set of
corner cases, and you may not get error messages that make sense to you when
you bump into one with an unfamiliar awk.

[0] I know, not really, and especially for gawk. I've written awk scripts a
couple hundred lines in the past. It's true enough for the 20 minute version
though.

Edited for clarity.

~~~
saberience
What are you using Perl/awk for that it's needed so often?

I've been a software engineer since 2005 and worked my way up to being a VP of
Engineering currently and never had to use either Perl or awk (or similar). I
often read about these tools on Hackernews and I find it quite mystifying as I
manage to have written Java, Scala, C#, SQL, and so on for 15 years and
happily never needed them.

Is this a certain kind of engineering job that requires searching through text
files so often that it calls for specialized tools? I've managed my whole
career with ctrl-f and highlight-all matches.

~~~
submeta
There are hundreds of use-cases. It is not important what the use-cases are. -
There are two types of developers: those who use hacking to solve all kinds of
tasks that would otherwise need hundreds of manual steps, and those who never
think of automating smaller steps. The second kind of dev can be very good
software engineers, but they prefer IDEs to tweaking Vim or Emacs. The first
kind of dev will look for possibilities to automate repeated steps,
opportunities for tweaking, transforming, hacking. For the fun of it.

~~~
dan-robertson
I think this argument is unfairly dismissive.

I think it depends a lot on the sort of software one works with. A main use of
awk and other unix tools for me is ad-hoc data munging: looking through
production logs, taking bits of data from different sources and comparing
them. If you make e.g. gui applications and sell them, or work as a contractor
developing in-house applications for clients, you probably don’t see a lot of
production logs like that or process them that way. Any data that you expect
to process ought to go into a database, and then you can use sql (where joins
work much better than the unix join command, and you can use the actual
structure of the data instead of trying to tease it out with the smallest,
simplest code you can think of). You might be using a debugger, backtraces, or
reproduction in test to investigate production issues rather than starting
with logs. On the other hand, many other companies will have big production
systems that run on lots of different virtual machines, produce lots of logs,
and have issues that are hard to reproduce (or maybe the logs are easier),
with lots of data spread about random places, such that it may be easier to do
the hacky thing for one-off cases or prototypes rather than doing things more
thoroughly or properly. And this setup will also lead to tools that are
designed to fit into the rest of the unix ETL stack by the way they input and
output data.

On the second point, I think that even as a die-hard emacs user I wouldn’t
recommend that anyone try to use emacs as an ide for something like java (or
maybe c++). I’d probably try to use emacs myself, but I think my java-editing
experience would be worse than for those people who do it with an IDE. More
realistically, I’d try to avoid writing java if at all possible.

------
liammonahan
My absolute favorite example of what's possible with awk is this[0] calculator
from Ward Cunningham about splitting expenses on a ski trip. It's a really
beautiful little piece of code well-adapted to this problem.

[0] - [https://c2.com/doc/expense/](https://c2.com/doc/expense/)

~~~
tekknolagi
This reminds me of a story of a guy who wrote a whole company internal debit
system like this (coffee, meals out, etc) and it turned into a currency. I
feel like I either saw it here, or a similar forum. Anyone have a link? My
searches have not found it...

~~~
stuuuuuuuuu
This one? [https://royrapoport.blogspot.com/2011/05/coffee-and-its-effects-on-feature-creep.html](https://royrapoport.blogspot.com/2011/05/coffee-and-its-effects-on-feature-creep.html)

~~~
tekknolagi
That's it! You found it! Thank you.

------
freedomben
This is an excellent blog post. I will refer people to this! He touches on all
of the important points.

If you want a more thorough/deep exploration of Awk, I recently gave a talk on
it (virtually) at Linux Fest Northwest (LFNW) 2020.

Awk: Hack the planet['s text]! Part 1 (Presentation):
[https://www.youtube.com/watch?v=43BNFcOdBlY](https://www.youtube.com/watch?v=43BNFcOdBlY)

Awk: Hack the planet['s text]! Part 2 (Exercises):
[https://www.youtube.com/watch?v=4UGLsRYDfo8](https://www.youtube.com/watch?v=4UGLsRYDfo8)

If you want to try your hand at the exercises, they are on github:
[https://github.com/FreedomBen/awk-hack-the-planet](https://github.com/FreedomBen/awk-hack-the-planet)

------
augustk
It's also worth mentioning that local variables can be simulated using
additional formal parameters. In AWK, any parameter omitted from a function
call starts out uninitialized, acting as zero or the empty string.

Let's say we have a function CharCount which takes a character and a line of
text and returns the number of occurrences of that character:

    
    
        function CharCount(ch, line,
            n)
        {
            ...
        }
    

The line break in the parameter list is an AWK convention and indicates that n
is a "local variable."

~~~
ratsmack
I thought the convention was to separate local variables with three spaces
like this:

    
    
        function CharCount(ch, line,   n)

------
Marcus316
I love using awk. It was fairly easy to pick up, and it slides into my command
line workflow pretty well.

That said, there's an amazing amount you can do with it, if you really try.
Someone once joked that I should try building an IRC bot in AWK, so I did:
[https://github.com/Marcus316/rufus](https://github.com/Marcus316/rufus)

There's no practical reason for it, but it was fun to play with the idea.

------
arendtio
Very nice summary of the language. However, keep in mind that awk is not the
fastest language around. A pro-awk thread might not be the best place to tell
this story, but it is fresh and true:

Last weekend I was playing around with some data. At first, I thought 'let's
just write a line of awk and be done with it' and so I did. The execution took
20 seconds (about 17 million lines) and everything was fine.

Later that day, I came across another task which seemed too complex for an awk
one-liner so I took two lines of R and was surprised when R was done within 5
seconds on the same data set.

I was happy because I found a faster tool than the one I had, but the lesson
is that just because you use a proven tool like awk doesn't mean there aren't
any better tools. _Find out_ what works best for you.

~~~
tannhaeuser
awk is a language with multiple implementations, so we can only talk about the
performance of a particular awk implementation. That is especially true for
awk because mawk (based on a VM rather than being a traditional interpreter)
is so much faster than other awks, if you can live with its limitations [1].
If you're using gawk, setting LANG=C also helps performance because no UTF-8
handling needs to be done. Speaking of which, I've noticed gawk on recent
Ubuntus (19.10) appears broken and/or has regexp size limits, breaking awk
code with large regexps (but I haven't checked thoroughly yet).
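
As a rough illustration of the LANG=C point (the log file and the
byte-counting program here are placeholders; the gain depends on the data):

      # C locale: byte-oriented, no multibyte decoding
      time LANG=C gawk '{ n += length($0) } END { print n }' big.log
      # ambient (e.g. UTF-8) locale: character-aware, often slower
      time gawk '{ n += length($0) } END { print n }' big.log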

[1]: [https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/](https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/)

~~~
arendtio
Arch Linux

    
    
      $ awk --version
      GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

------
geocrasher
Whenever somebody says "Oh, I don't know how to use awk", this is the link I
send them. It's easily the most useful tutorial for awk on the 'net. There's a
lot more you can do with awk, but this site shows you the most useful parts in
a very short amount of time.

~~~
dwoot
It's crazy how this reappeared only hours after I went searching for awk on
Hacker News. Fred Hebert is so well known in the Erlang and Elixir
communities.

I have just coined a name for this: the "gateway drug" approach to learning
new tools. We should strive to find introductions that are small enough that
someone can digest them without a massive upfront time investment to get past
the front door :)

------
hackermailman
I learned awk from the chapter on filters in the pamphlet-sized book 'The Unix
Programming Environment' by Kernighan and Pike, which you can buy for $8.

~~~
combatentropy
I've heard many good things about that book, which is why I bought it several
months ago. From the bits I have read, I like it. But I have not yet finished
it, partly because it is not pamphlet sized. It is over 300 pages. But I look
forward to the chapter on Filters.

------
daenz
Very cool, and I'm grateful to see Awk presented in such a friendly way.

However, at some point, it makes more sense to write your Awk script in
something like Python, and my intuition says that that time is shortly after
starting a basic Awk script. Using a real programming language is almost
always the way to go with code that will become more complex over time (almost
all code) and code you have to share with a team (learning curve).

------
kazinator
By the way, Awk is whitespace sensitive:

    
    
       pattern { action }  # valid
    
       pattern {           # valid
         action
       }
    
       pattern             # not what you think
       { action }          # this action has no pattern
    

There cannot be a newline between pattern and action. This rule allows either
the pattern or the action to be omitted without ambiguity.

    
    
       pattern        # pattern with default { print } action
       { action }     # unconditional action
       pattern {      # pattern with action
          action
       }
    

Items can be put on a single line without ambiguity using semicolons:

    
    
       pattern ; { action } ; pattern { action }
    

A pattern can have multiple patterns separated by a comma. That syntax admits
optional line separation after the comma separators:

    
    
       pattern,
       pattern,
       pattern {
         action
       }
    

The action fires on a match of any of the patterns.

On the other hand, the POSIX standard Awk expression grammar has no comma
operator; the comma exists only for separating patterns, for function
arguments/parameters, and in the print statement syntax.

Plus, JNIL (just now I learned). The _in_ operator allows comma-separated
expressions for testing "multi-dimensional" array membership. Here is a hello,
world:

    
    
      $ awk 'BEGIN { a[1,2,3] = 4 ; print (1,2,3) in a }'
      1
    

(Note that multi-dimensional arrays in Awk are simulated; a single string
index is generated for that 1,2,3 by joining the values with SUBSEP.)
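
You can see the simulation directly; cat -v makes the default SUBSEP
(character 034) visible as ^\:

      $ awk 'BEGIN { a[1,2,3] = 4; for (k in a) print k }' | cat -v
      1^\2^\3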

------
gumby
One trick I use with awk, especially for throwaway scripts, is to use grep to
subset the data before feeding it into awk. Often the case is using awk to
poke at some data to diagnose a problem, not writing a script to be run often
or stuck in cron.

So to use an example from this nice short article: I might do grep GET
logfile | awk blah-blah. Then awk doesn’t need to consider the lines I don’t
care about. This is especially useful when iteratively writing the awk script.

~~~
yiyus
I find it easier to do awk '/GET/ { blah-blah }'
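
Concretely, with the GET example that would be something like (the field
number is just for illustration):

      awk '/GET/ { print $7 }' logfile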

~~~
gumby
Oh yeah, sure! I should have been clear that when dredging a really large log
file or such, it can be faster to run awk on a much smaller subset when your
awk program has a lot of productions. I find it easier to think "OK, just run
awk on these lines that matter."

Six of one, half a dozen of the other.

------
alexhutcheson
There are two use-cases I’ve run into where awk really shines, and is hard to
replace:

1. Writing scripts for environments that only have Busybox. Technically you
can write scripts in ash, but I don’t recommend it for anything beyond a
couple of lines. It’s missing a lot of the features from Bash that make
scripting easier, and it’s easy to get mixed up if you’re used to Bash and
write things that don’t work. Awk is the best scripting language available
there, even if you’re doing things that don’t exactly match what it was
designed to do (see the sketch after this list).

2. Snippets that are meant to be copy+pasted from documentation or how-to
articles. In that case, it’s often not easy to distribute a separate script
file, so a CLI “one-liner” is preferred. You also can’t count on Perl, Python,
etc. being available on the user’s system, but awk is pretty universal.
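
For point 1, a minimal sketch of awk used as a plain scripting language (the
shebang path is the conventional one, but the task is made up); busybox awk
runs this fine:

      #!/usr/bin/awk -f
      # Runs entirely in BEGIN, so awk never reads any input:
      # it's acting as a small general-purpose script here.
      BEGIN {
          for (i = 1; i < ARGC; i++)
              printf "arg %d: %s\n", i, ARGV[i]
      }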

For most other cases, I tend to create a new .py file and write a quick Python
script. Even if it’s a little more overhead, it helps keep my Python skills
sharp, and often it turns out that what I actually want is a little more
complicated than my initial idea anyway.

------
penguinjeff
Great post. I didn't know Fred had a blog. Fred's writing style never ceases
to amaze me. His book on property based testing in Elixir was a better
introduction for me to Python's Hypothesis than any of the other tutorials I
found online on the topic because it made me think about discovering
properties of my code. Would recommend it to others.

------
jftuga
One of the things I like to do with _mawk.exe_ under Windows is to automate
the same command over a group of files.

Let's say myPgm only takes one file name as a command-line parameter; then I
can do something like this:

    
    
        dir *.xyz /s/b | mawk "{print 'myPgm -x '$0}" | cmd
    

If the file paths have spaces in them, then you have to wrap each name in
double quotes. I have found it challenging to output those in mawk -- at
least easily, anyway. So, using a Windows version of _tr.exe_, I can do
something like this:

    
    
        dir *.xyz /s/b | mawk "{print 'myPgm -x ~'$0'~'}" | tr ~ \042 | cmd
    

Although somewhat crude, it is effective.

~~~
spapas82
Why not use good ol' for?

E.g. for %f in (*.doc *.txt) do type %f

~~~
jftuga
I also need to do things like:

    
    
        docker container ls -a -q | mawk "{print 'docker rm '$1}" | cmd

------
bcyn
So, I don't use awk right now. This answers "how to awk" more than "why use
awk" for me (understandable given the 20 min claim).

Does anyone have concrete, practical examples of use cases where awk made your
life much easier?

~~~
ggm
awk '{ print $2 }' does LWSP gobbling; cut -d' ' -f2 doesn't, and this
difference alone makes awk useful to me on a daily basis.

Awk has hashes like perl, which are very efficient. I use an awk expression
to print unique lines as they come in (counted through the hash insert on new
keys) instead of uniq, which only prints at the end.

Counting unique values over 300,000,000 IPs in awk was as fast as perl and
python, with a smaller memory footprint.
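
For reference (the file names are made up), the streaming-uniq idiom and a
POSIX-friendly distinct count look like:

      awk '!seen[$0]++' access.log     # print each line the first time it appears
      awk '{ u[$1]++ } END { for (k in u) n++; print n }' ips.txt   # count distinct first fields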

~~~
chme
If you just use awk for `{ print $2 }` then I would still prefer `tr -s ' ' |
cut -d ' ' -f2`, since both are part of coreutils, and with `awk` you would
add an additional dependency to your script.

~~~
em500
Why would being part of coreutils matter? Awk is part of POSIX, just like tr
and cut. Even in very constrained environments you can count on it via
busybox.

~~~
ggm
A perpetual argument is about the use of too many pipe-separated distinct
commands. The argument suggests it's lazy to use sed | awk | grep style pipe
runs, because in all probability sed or awk alone could have done it, and you
incurred two excess fork/exec() calls and therefore consumed kernel and
userspace resources beyond your need. The usual rejoinder is "get nicked"

------
dwheeler
Nice intro. Awk is a useful tool when you want to do simple line-by-line
processing.

Note: The article says that awk patterns can't capture groups. The _standard_
doesn't provide that functionality, but if you use the widely-available gawk
implementation, gawk _does_ have that capability (use the three-argument form
of match).
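
For example, gawk's three-argument match() fills an array with the capture
groups (the input line here is made up):

      $ echo 'user=alice id=42' | gawk '{ if (match($0, /id=([0-9]+)/, m)) print m[1] }'
      42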

------
marcinzelent
Last year, I attended a tech event where Rasmus Lerdorf (the creator of PHP)
was a guest speaker. When asked about his favorite programming language, he
said: AWK. He didn't elaborate on why, but looking at this article, I think
there is something to it.

------
spsrich2
It's amazingly useful. I have used it since 1987, most recently today.

------
globular-toast
I recommend this article to people all the time and often come back to it
myself. I don't use awk as much these days but when I worked in bioinformatics
I was fluent and loved it.

------
xwdv
Every time I set out to use awk for some one time task I struggle to express
what I need and end up saying fuck it and fire up vim and do exactly what I
want with some quick macros.

Am I the only one?

~~~
yaktubi
I usually say duck it and dust off my shiny Perl necklace

~~~
Noumenon72
I usually persevere and get it done in awk but then find the next time I need
awk that I don't remember any of it and can barely understand my own examples.

------
j_z_reeves
Nice, I finally took the time to read the man pages for awk and whipped up a
script to count the number of errors that occurred each day in a postgres log
file.

    
    
       cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
    

I just needed to know how awk programs are structured, the rest is just simple
programming!

EDIT: I'm not sure if it's actually correct however...

~~~
xorcist
Apart from the useless use of cat, since sort does the work here, something
like the following would probably suffice:

grep ERROR logfile | cut -f 1 -d ' ' | sort | uniq -c

~~~
j0057
There's really nothing useless about that use of cat: it makes the pipeline
compose better from left to right. It's not like you have to pay 25 cents for
each process you spawn.

~~~
xorcist
So does the pipeline above.

It's not detrimental to performance since an empty cat is a no-op in a
pipeline. You can have any number of them. But commands should be written for
humans to understand, and inserting no-ops is a distraction to the reader.

In the trivial example, "grep needle haystack" reads better than "cat haystack
| grep needle".

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=8893302](https://news.ycombinator.com/item?id=8893302)

------
aidenn0
This is a great intro; I have the GNU Awk user's manual bookmarked because
there are a lot of features in gawk that you will only rarely use but that
are quite useful.

------
gabrielsroka
bwk talking about awk, C, etc:
[https://youtu.be/Sg4U4r_AgJU](https://youtu.be/Sg4U4r_AgJU)

------
fithisux
This is the manual I needed the last 20 years.

------
dmux
The pattern-action paradigm is really simple to understand, and I suspect
it's what made Sinatra-style web frameworks stand out.

------
rafaele
Good tutorial. I like the concise survey of the language components. This is
going to make working with awk a lot less awkward.

------
known
#to print duplicate lines

    
    
        awk '++seen[$0] > 1' filename.txt
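
For example:

      $ printf 'a\nb\na\na\n' | awk '++seen[$0] > 1'
      a
      a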

------
machinesbuddy
Could remove line 32 and just put `{ flag = 0 }` before line 31, right?

------
blackrock
What’s better than awk, are one-liner python programs.

Then, you can even alias it.
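
For example, a hypothetical alias that prints the second whitespace-separated
field:

      alias col2="python3 -c 'import sys; [print(l.split()[1]) for l in sys.stdin]'"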

------
known
Common lines between 2 files

awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt

[http://archive.vn/mmd80](http://archive.vn/mmd80)
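
Spelled out with comments (same program, just reformatted):

      # NR==FNR holds only while reading the first file:
      # remember each of its lines as a key in array a.
      NR == FNR { a[$0]; next }
      # For the second file, a bare pattern gets the default
      # { print } action whenever the line exists in a.
      $0 in a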

~~~
mkl
The link doesn't load for me, but that seems pretty unreadable. I'd probably
sort, merge, and print duplicates:

    
    
      sort -m <(sort -u file1) <(sort -u file2) | uniq -d

~~~
asicsp
You can use comm as well:

    
    
        # common lines 
        comm -12 <(sort file1) <(sort file2)
    
        # lines unique to first file
        comm -23 <(sort file1) <(sort file2)
    
        # lines unique to second file
        comm -13 <(sort file1) <(sort file2)
    

Regarding readability, it is the same with any new tool or programming
language: you need to be familiar with its syntax and idioms. Someone not
familiar with the command line and the sort/uniq commands will find your
solution just as alien.

------
etxm
One time I had a file and I needed all the values in a column and so I used:

    
    
      awk '{ print $2 }' my-file
    

And it gave me what I wanted. It was cool.

------
js8
To use Awk I tended, up on the ward I ended, it was awkward.

------
jzer0cool
sed grep & awk :)

------
soufron
Awk in 2 minutes : google "awk this" and "awk that" :D

