
Learn to use Awk with hundreds of examples - asicsp
https://github.com/learnbyexample/Command-line-text-processing/blob/master/gnu_awk.md
======
aseure
I would also recommend Jonathan Palardy's good series of awk blog posts:

    
    
      * http://blog.jpalardy.com/posts/why-learn-awk/
      * http://blog.jpalardy.com/posts/awk-tutorial-part-1/
      * http://blog.jpalardy.com/posts/awk-tutorial-part-2/
      * http://blog.jpalardy.com/posts/awk-tutorial-part-3/

~~~
asicsp
glanced a bit, looks good.. will add it to further reading section, thanks :)

has some minor issues though

> modern (i.e. Perl) regular expressions

nope, supports only ERE.. doesn't have non-greedy, lookarounds, etc
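
e.g. ERE matching is always greedy; the usual workaround for "non-greedy" is a negated bracket expression (quick sketch, works in any POSIX awk):

```shell
# ERE is greedy: /<.*>/ stretches to the last ">"
echo '<a> <b>' | awk '{ match($0, /<.*>/); print substr($0, RSTART, RLENGTH) }'
# prints: <a> <b>

# "non-greedy" has to be emulated with a negated bracket expression
echo '<a> <b>' | awk '{ match($0, /<[^>]*>/); print substr($0, RSTART, RLENGTH) }'
# prints: <a>
```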

> $ cat netflix.tsv | awk '{printf "%s %15s %.1f\n", $1, $6, $5}' | sed 1d

could have just added NR>1 condition..
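
i.e. something like this (stand-in data, since I don't have netflix.tsv here):

```shell
# NR>1 skips the header line, so no separate sed 1d is needed
printf 'name\tyear\nA\t2001\nB\t2002\n' |
  awk -F'\t' 'NR > 1 { printf "%s %s\n", $1, $2 }'
# prints:
# A 2001
# B 2002
```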

~~~
kazinator
Not to mention that awk takes filename arguments, so this is an egregiously
useless use of cat.

~~~
twmb
From the second post,

> Alternatively, awk '{print $2}' netflix.tsv would have given us the same
> result. For this tutorial, I use cat to visually separate the input data
> from the AWK program itself. This also emphasizes that AWK can treat any
> input and not just existing files.

------
0xFFFE
I use sed & awk all the time. They are invaluable tools while debugging issues,
extracting fields from log files, etc. I am not dissing Python or Perl; I use
Python extensively as well. But when you are in the middle of an incident, it's
hard to beat a quick one-liner, like this random example:

awk -F":" 'BEGIN{total=0}{if($3>240)total+=$3}END{print total}' /etc/passwd

~~~
indescions_2017
>>> I use sed & awk all the time.

Me too. For example, number of unique IP requests in a log file in
milliseconds ;)

$ awk '{ print $1 } ' caddy_log | sort | uniq | wc -l

But rarely compose my own from scratch. It's mostly copy paste. And store in
admin bin for future use.

~~~
asicsp
No need for so many commands :)

    
    
        $ cat duplicates.txt
        abc  7   4
        food toy ****
        abc  7   4
        test toy 123
        good toy ****
    
        $ awk '!seen[$2]++' duplicates.txt
        abc  7   4
        food toy ****
    
        $ awk '!seen[$2]++{cnt++} END{print +cnt}' duplicates.txt
        2

~~~
kazinator
On a large file with many duplicates, seen[x]++ can overflow, unless you're
using GNU Awk with bignums (gawk -M).
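
One workaround that doesn't need bignums: test membership and store a flag instead of incrementing (sketch):

```shell
# membership test plus a flag: nothing is ever incremented, so no overflow
printf 'a\na\nb\na\n' | awk '!($0 in seen) { seen[$0] = 1; print }'
# prints:
# a
# b
```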

~~~
asicsp
that's a good point

I'll add a note, thanks :)

------
johnnylambada
While we're on the subject, let's not forget to avoid parsing HTML with regex:
[https://stackoverflow.com/questions/1732348/regex-match-open...](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

~~~
ams6110
Generally good advice, but can be OK in specific situations where you have a
known HTML structure and are just scraping some values out of it. This is not
so much parsing HTML as it is matching the patterns of the values you want to
extract.
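
For instance, pulling the title out of a page whose structure is known and fixed (hypothetical snippet; splitting on angle brackets):

```shell
# hypothetical page with a known, fixed layout: <title> on its own line
printf '<html>\n<title>My Page</title>\n<body>x</body>\n</html>\n' |
  awk -F'[<>]' '/<title>/ { print $3 }'
# prints: My Page
```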

~~~
johnnylambada
the pon̷y he comes he c̶̮omes

------
SEJeff
awk1line.txt is the original and still one of the de facto best awk references
with examples:

[http://www.pement.org/awk/awk1line.txt](http://www.pement.org/awk/awk1line.txt)

~~~
Annatar
The best reference is "The AWK programming language" by Aho, Weinberger and
Kernighan.

~~~
geospeck
Which is available at the archive.org:

[https://archive.org/details/pdfy-MgN0H1joIoDVoIC7](https://archive.org/details/pdfy-MgN0H1joIoDVoIC7)

------
bananicorn
I'm wondering:

Does awk really provide that much more value over sed, while being easier or
faster to use than a fully-fledged scripting language (thinking of perl,
python, etc.)?

(and yes, one may argue that awk IS a scripting language, I'm not disputing
that, just asking)

~~~
Annatar
AWK is a programming language. Towards the end of the AWK book by Aho,
Weinberger and Kernighan, they implement an assembler, a virtual processor and
a virtual machine for the machine code they just invented. They also implement
a relational database management system in AWK, as well as an auto-scaling
graphing solution.

I myself have implemented an XML SOAP command line client, a backup solution,
a SAN UUID management application, an automated Oracle RAC SAN storage
migration solution, a configuration management system, and Oracle database
creation / management applications in AWK.

Usually I develop a thin getopts shell wrapper around an AWK core. It works
every time, the executables are on the order of a few KB (the largest so far,
the XML SOAP client, is 24.5 KB), and they all run like a bandit. Memory
requirements are minuscule. Dependencies are minimal: the only external
dependency so far in my software has been the xsltproc binary from the libxslt
package.
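
A minimal sketch of that wrapper pattern (the script name, option, and the AWK core here are all made up for illustration):

```shell
# write the (hypothetical) tool to a file: a thin getopts shell wrapper
# that parses options, then hands them to the AWK core via -v
cat > /tmp/over.sh <<'EOF'
#!/bin/sh
threshold=0
while getopts t: opt; do
  case $opt in
    t) threshold=$OPTARG ;;
    *) echo "usage: $0 [-t n] [file...]" >&2; exit 2 ;;
  esac
done
shift $((OPTIND - 1))
# the AWK core: print lines whose first field exceeds the threshold
awk -v t="$threshold" '$1 + 0 > t' "$@"
EOF

# demo: keep lines whose first field is greater than 5
printf '3 low\n8 high\n12 higher\n' | sh /tmp/over.sh -t 5
# prints:
# 8 high
# 12 higher
```

All of the real logic lives in the awk program; the shell part only exists to give it a conventional option interface.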

AWK is easier to use than Python or Perl, and is much faster than either of
those. Typical code density ratio of Python versus AWK is 10:1, sometimes
more. This means that if you have a 650 line Python program, you can implement
the same functionality in about 280 lines of AWK, and the program will be far
simpler. I've once collapsed a 280+ line Python program into a simple 15 lines
of code in AWK.

AWK is an extremely versatile, powerful programming language.

For even more speed, AWKA can be used to transpile AWK source into C and then
it will call an optimizing C compiler to compile it into a binary executable.
Typical speedup is on the order of 100%, so if your AWK program ran in 12
seconds, it'll now finish in six.

~~~
zephyrfalcon
"""Typical code density ratio of Python versus AWK is 10:1, sometimes more.
This means that if you have a 650 line Python program, you can implement the
same functionality in about 280 lines of AWK, and the program will be far
simpler. I've once collapsed a 280+ line Python program into a simple 15 lines
of code in AWK."""

How does this work? I am not saying it can't be done, but the main benefit of
Awk seems to be quick one-liners, which are possible because you get fields
(splitting on whitespace) and records (splitting on newlines) and looping for
free. But for larger programs, this easily translates to Python: just call
readlines(), loop over it, call split() on each line. I would think that at
this point, Awk doesn't have much of an advantage anymore... but apparently
your experiences are different. What are some Awk constructs that would take a
lot more code in Python?

~~~
empthought
Every pattern being matched to every line can be a big win in more complex
processing. This is a simple but familiar example:

    
    
         seq 1 30 | awk '
         $0 % 3 == 0 { printf("Fizz"); replaced = 1 }
         $0 % 5 == 0 { printf("Buzz"); replaced = 1 }
         replaced { replaced = 0; printf("\n"); next }
         { print }'
    

Note that the awk script is far more general than the typical interview
question, which specifies the numbers to be iterated in order. The awk script
works on any sequence of numbers.

~~~
microtherion
Yes, but as zephyrfalcon said, that maps onto a series of if statements in
python. No 10:1 magic anywhere.

~~~
empthought
The "series of if statements" also has to read the line, split it, and parse
an integer. To behave like the AWK script it also has to catch an exception
and continue when the input cannot be parsed as an integer.

Go ahead, write the Python script that behaves exactly as this AWK program
does. It will likely be 4x as long, and that's because the number of different
patterns and actions to take is quite low. More complex (and hence more
situated and less easy-to-understand) use cases will benefit even more from
AWK's defaults.

Moreover the pattern expressions are not constrained to simple tests:
[https://www.gnu.org/software/gawk/manual/html_node/Pattern-O...](https://www.gnu.org/software/gawk/manual/html_node/Pattern-Overview.html#Pattern-Overview)

They can match ranges, regular expressions, or indeed any AWK expression. They
can use the variables managed by the AWK interpreter:
[https://www.gnu.org/software/gawk/manual/html_node/Auto_002d...](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#Auto_002dset)
(NR and NF are commonly used).
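
Quick illustrations (any POSIX awk):

```shell
# range pattern: print from the line where NR==3 through the line where NR==5
seq 1 10 | awk 'NR==3, NR==5'
# prints 3, 4, 5 (one per line)

# NF as the whole pattern: print only non-blank lines
printf 'a\n\nb\n' | awk 'NF'
# prints a, then b
```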

Actions one-way or two-way communicate with coprocesses with minimal ceremony:
[https://www.gnu.org/software/gawk/manual/html_node/Two_002dw...](https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html#Two_002dway-I_002fO)
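
Even plain POSIX awk (without gawk's |& coprocesses) can read from a command with one-way getline; a sketch:

```shell
# one-way pipe: read a command's output via getline in a BEGIN block
awk 'BEGIN {
  cmd = "seq 1 3"
  while ((cmd | getline n) > 0)
    total += n
  close(cmd)
  print total
}'
# prints: 6
```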

All of those mechanisms can be done in a Python script, but they add up to a
lot of boilerplate and mindless yet error-prone translation to the standard
library or Python looping and conditional logic.

~~~
microtherion
> The "series of if statements" also has to read the line, split it, and parse
> an integer

All of which are built in functions...

> To behave like the AWK script it also has to catch an exception and continue
> when the input cannot be parsed as an integer.

Not quite sure what behavior you're referring to here. When I tested your
script, it happily treated "xy" as divisible by 15.

> Go ahead, write the Python script that behaves exactly as this AWK program
> does.
    
    
      import fileinput
      for line in fileinput.input():
        replaced = False
        if int(line) % 3 == 0: print("Fizz", end=''); replaced = True
        if int(line) % 5 == 0: print("Buzz", end=''); replaced = True
        if replaced: print();
        else: print(line, end='')
    

> Moreover the pattern expressions are not constrained to simple tests

And none of these, except maybe for the range operator, are particularly
challenging for python.

~~~
empthought
1. Your script crashes when it is given input that does not parse as an
integer. The awk script does not. In this way, the awk design favors
robustness over correctness, which is a valid choice to make at times.

2. How would you modify it so it parsed a tab-delimited file and did FizzBuzz
on the third column? With awk it is a simple matter of setting FS="\t" and
changing $0 to $3.

3. How would you modify it so instead of being output unmodified, rows with
$3 that are neither fizz nor buzz output the result of a subprocess called
with the second column's contents?

Now you might say that this is all goalpost-moving, but that's the point. AWK
is more flexible and less cluttered in situations where the goalposts tend to
get moved, but where the basic text processing paradigm stays the same.

~~~
microtherion
1. Sure, it's a valid choice, and one that can easily be reproduced by
python:

    
    
      def intish(s):
        try:
          return int(s)
        except ValueError:
          return 0
    

Can python's default be reproduced as easily in awk?

2. You'd insert field = line.split('\t') at the beginning of the loop and
then refer to field[2]

3. os.popen or subprocess.run

I buy the "less cluttered" argument when the problem matches awk's defaults. I
vehemently disagree with the "more flexible" argument. A problem perfectly
suited to awk can easily turn to a poor fit with the addition of a single,
seemingly innocuous requirement (e.g. in your subprocess example, log the
standard error of your subprocess into a separate file).

~~~
empthought
So what does that look like in your program? With respect to failing fast and
verbose error reporting in AWK, it's as simple as

    
    
         !/^[0-9]+$/ {
             print "invalid input: " $0 > "/dev/stderr"
             exit 1
         }
    

at the beginning of the script. None of the other actions need to be changed;
but with your implementation, all of the calls to "int" need to be changed to
"intish".

I've got the following script (I stopped playing games with line breaks):

    
    
        #!/usr/bin/env gawk -f

        BEGIN {
            FS = "|"
        }

        $2 % 3 == 0 {
            printf("Fizz")
            replaced = 1
        }

        $2 % 5 == 0 {
            printf("Buzz")
            replaced = 1
        }

        replaced {
            replaced = 0
            printf("\n")
            next
        }

        {
            system("cal " $2 " 2018 2> errors.txt")
        }
    
    

Which can produce the following output:

    
    
        $ ./script.awk <<EOF
        > thing1|0
        > thing2|3
        > thing3|7
        > thing4|13
        > EOF
        FizzBuzz
        Fizz
             July 2018
        Su Mo Tu We Th Fr Sa
         1  2  3  4  5  6  7
         8  9 10 11 12 13 14
        15 16 17 18 19 20 21
        22 23 24 25 26 27 28
        29 30 31
    
        $ cat errors.txt 
        cal: 13 is neither a month number (1..12) nor a name
    
    

- What does the equivalent program in Python look like?

- How many characters does it have with respect to the number of characters
in the awk script? (259 with shebang.)

- How many characters would need to change to split by "," instead? (1 for
awk). (You can achieve this in Python, but you'll end up spending characters
on a utility function.)

- How many characters would need to be added to print "INVALID: " and then
the input value for lines with non-numeric values in the second column, then
skip to the next line? (55 for awk.)

Character adds/changes are the best proxy for "flexibility" I could think of
that doesn't go far afield into static code analysis.

I love Python and don't think awk is a good solution for extremely large or
complex programs; however, it seems obvious to me that it is significantly
more flexible than Python in every line-oriented text-processing task. The
combination of opinionated assumptions, built-in functions and automatically-
set variables, and the pattern-action approach to code organization, all add
up to a powerful tool that's still worth using in order to keep tasks from
becoming large or complex in the first place.

------
superasn
These are the best stories on HN and why I subscribed here in the first place.
I have often seen awk used on SO, but I've always put it off as something to
learn later. Finally today I have some basic understanding of awk, and this is
really great stuff! I did get by with Perl, but this is definitely more handy,
and the example approach to teaching it makes it super easy to understand!

~~~
asicsp
thanks :)

------
dugmartin
awk is what got me into web programming around 1994. I was working at a GE
subsidiary and all the documentation for the RTOS I was working on was printed
in huge binders from actively maintained Interleaf documents. Once I found the
SGML source documents on the server it only took a few hours to learn enough
awk to convert the SGML into a fully interlinked set of HTML documents with a
table of contents. Granted SGML to HTML is not that hard but it was fun and
useful and much nicer to search as opposed to laying out a bunch of binders on
my cube's desk.

------
nsebban
Commands like cat, cut, sort, uniq and awk are still pretty relevant nowadays.
Even on huge volumes of data.

~~~
assafmo
Especially on huge volumes of data.

------
tmaly
I still use a little sed once in a while, but Perl has replaced most of the
tools like awk for me.

~~~
jstimpfle
The advantage of awk is that it is faster in some cases and a little more
convenient for simple command-lines due to automatic field splitting. The
disadvantage is that it swallows errors silently even more than perl (I
think).
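
e.g. a non-numeric field silently becomes 0 in arithmetic, with no warning:

```shell
# awk quietly coerces a non-numeric string to 0 in arithmetic context
echo oops | awk '{ print $1 + 1 }'
# prints: 1
```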

~~~
augustk
Awk is also a part of POSIX.

------
mynegation
20 years ago, when Python was not pre-installed on most systems, awk/gawk used
to be my Swiss army knife. My first real programming job was translating a
1000 LOC awk program into C.

The syntax was simpler than ed's, and easier than getting a combination of
grep, uniq, cut, etc. correct.

I call it "the Excel of the command line".

~~~
dmd
It's still extremely useful for those of us in industries where change comes
veeerrrryyyy slooowwwly. For instance, most production servers I work with are
still on Python 1.5.2 if they have Python at all - but awk, that I can depend
on!

~~~
ams6110
Except is it gawk? The GNU-enhanced awk is not universally installed, so for
best portability stick with standard awk.

~~~
dmd
Almost certainly not!

------
rebolyte
Definitely check out the rest of that repo as well; it's a gold mine!

~~~
asicsp
thanks :)

------
geospeck
The source code[1] of the original awk, "The One True Awk" as Brian Kernighan
refers to it on his Princeton webpage, is also available.

[1][http://www.cs.princeton.edu/~bwk/btl.mirror/](http://www.cs.princeton.edu/~bwk/btl.mirror/)

------
michaelsbradley
See also the GNU Awk manual (currently Edition 4.2, Oct 2017)

GAWK: Effective AWK Programming

[https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)

~~~
asicsp
yup, nothing beats it for being a complete reference as well as a tutorial

I added references to it throughout the chapter

------
kazinator
TXR Lisp Awk macro: [http://www.nongnu.org/txr/txr-manpage.html#N-000264BC](http://www.nongnu.org/txr/txr-manpage.html#N-000264BC)

Has analogs for all salient POSIX Awk features and most GNU Awk extensions.
(Of course, not semantic cruft like the weak type system, or uninitialized
variables serving as zero in arithmetic.)

Plus:

* You can embed (awk ...) expressions anywhere, including other (awk ...) expressions.

* You can capture a delimited continuation (awk ...) and yield out of there.

* It supports richer range expressions than Awk. Range expressions combine with other range expressions unlike in Awk, so that you can express a range which spans from one range to another. Also, there are variations of the operator to exclude either endpoint of the range: rng, -rng, rng- and -rng-.

* You can "awk" over a list of strings, possibly an infinitely lazy one.
    
    
        1> (awk (:inputs '("a" "b") '("c" "d"))
                (t (prn nr fnr rec)))
        1 1 a
        2 2 b
        3 1 c
        4 2 d
        nil
    

* It has a return value: whatever the last :end returns, or else _nil_ :
    
    
        1> (awk (:end 42) (:end 43))
        [Ctrl-D]
        43
    

Build a list from the first fields of /etc/passwd:

    
    
      1> (build
           (awk (:inputs "/etc/passwd")
                (:set fs ":")
                (t (add [f 0]))))
      ("root" "daemon" "bin" "sys" "sync" "games" "man" "lp" "mail"
       "news" "uucp" "proxy" "www-data" "backup" "list" "irc" "gnats"
       "nobody" "libuuid" "syslog" "messagebus" "avahi-autoipd" "avahi"
       "usbmux" "gdm" "speech-dispatcher" "kernoops" "pulse" "rtkit"
       "hplip" "saned" "kaz" "vboxadd" "sshd" "oprofile" "ntp" "lightdm"
       "colord~" "whoopsie" "postfix")
    
    

Type conversion of fields (which are just strings) is achieved by an elegant
operator _fconv_, which takes a condensed notation such as (fconv i : r : xz),
meaning: convert the first field as a decimal integer, the last field as a
hexadecimal integer, and the fields in between as reals. The xz means that if
the last field is invalid, it gets converted to zero rather than _nil_. These
letters are just the names of lexical functions available in the _awk_ scope,
rather than built-in _fconv_ behaviors.

------
feelin_googley
Kernighan and Van Wyk, "Timing Trials, or, the Trials of Timing: Experiments
with Scripting and User-Interface Languages" (1998)

[http://web.archive.org/web/20000829071436/http://inferno.bel...](http://web.archive.org/web/20000829071436/http://inferno.bell-labs.com:80/cm/cs/who/bwk/interps/pap.html)

AWK, Perl, Tcl, Scheme, C, Java, Limbo, Visual Basic

What if the k scripting language had been included in those experiments?

[http://kparc.com/z/bell.k](http://kparc.com/z/bell.k)

k3:

1. "Basic Loop Test"

    
    
       \t 1000000(1+)/0
    

2. "Ackermann's Function Test"

    
    
       \t {:[~x;y+1;~y;_f[x-1;1];_f[x-1;_f[x;y-1]]]}[3;7]
    

3. "Indexed Array Test"

    
    
       \t x(x;|x:!200000)     
    

4. "String Test"

    
    
       \t f:{(x>#:){(i _ x),(1+i:_.5*#x)#x:,/("123";x;"456";x;"789")}/y};do[10;f[500000;"abcdef"]]
    

5. "Associative Array Test"

    
    
       \t {+/("0123456789abcdef"16_vs'!x)_lin$!x}40000
    

6. "File Copy Test"

    
    
       `f 0:(30000 _draw 300)#\:"king "       
       \t `f 0:0:`f   
    

7. "Word Count Test"

    
    
       \t (#:;+/(+/1<':" "=)';+/#:')@\:0:`f
    

8. "File Reversal Test"

    
    
       \t `f 0:|0:`f          
    

9. "Sum Test"

    
    
       `f 0:100000#,"-123.456" 
       \t +/0.0$0:`f
    

Source:
[http://web.archive.org/web/20010501041644/http://www.kx.com:...](http://web.archive.org/web/20010501041644/http://www.kx.com:80/a/k/examples/bell.k)

~~~
kbenson
From what I know (which is not well sourced), the big problem with K is that
it didn't have an open and accessible interpreter for a long time, and that
hampered adoption. I know 5-6 years ago I was interested in it, but couldn't
find any interpreters that were free for non-commercial and commercial use
(it was mostly work-related interest), and didn't stumble across Kona[1] at
the time. Now it's on my list of languages to look into again, but that's a
long list.

1: [https://github.com/kevinlawler/kona](https://github.com/kevinlawler/kona)

