
Why Learn Awk? (2016) - LinuxBender
https://blog.jpalardy.com/posts/why-learn-awk/
======
bloopernova
I use awk because there's an almost 100% chance that it's going to be
installed on any unix system I can ssh into.

I use awk because I like to visually refine my output incrementally. By
combining awk with multiple other basic unix commands and pipes, I can get the
data that I want out of the data I have. I'm not writing unit tests or perfect
code, I'm using rough tools to do a quick one-off job.

For instance, "mail server x is getting '81126 delayed delivery' from google
messages in the logs, find out who is sending those messages".

# get all the lines with the 81126 message. Get the queue IDs, exclude duplicates, save them in a file.

cat maillog.txt | grep 81126 | awk '{print $6}' | sort | uniq | cut -d':' -f1 > queue-ids.txt

# Grep for entries in that file, get the from addresses, exclude duplicates.

cat maillog.txt | grep -F -f queue-ids.txt | grep 'from=<' | awk '{print $7}' | cut -d'<' -f2 | cut -d'>' -f1 | sort | uniq

Each of those 2 one-liners was built up pipe-by-pipe, looking at the output,
finding what I needed. It's not pretty, it's not elegant, but it works. I'm
sure there's a million ways that a thousand different languages could do this
more elegantly, but it's what I know, and it works for me.

~~~
jcims
I know you’re not asking for awk protips but you can prefix the block with a
match condition for processing.

... | grep foo | awk '{print $6}' | ...

becomes

... | awk '/foo/{print $6}' | ...

If you start working this into your awk habits you’ll find delightful little
edge cases that you can handle with other expressions before the block (you
can, for example, match specific fields).
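
For instance, a quick sketch (the field numbers here are made up): the condition can also key on a specific field rather than the whole line:

      # run the block only for lines whose 5th field is exactly "foo"
      ... | awk '$5 == "foo" {print $6}' | ...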

~~~
emmelaich
To pile on :-) you often want the -w (match word) flag to grep.

In awk, I couldn't find how to do this. I tried /\bfoo\b/ and /\<foo\>/ but
neither worked. I don't know why, and don't care enough to dig, which brings me
to my major awk irritation ...

It doesn't use extended or Perl REs, which makes it quite different from ruby,
perl, python, and java. Now, according to the man page it _does_, at least on
OSX (man re_format), but as mentioned it didn't work for me.

Details

    
    
       $ echo fish | awk  '/\bfish\b/' 
    

gets nothing, vs

    
    
       $ echo fish | perl -ne '/\bfish\b/ && print'
       fish

~~~
emmelaich
UGH! Found the problem; it simply doesn't work. Assuming the OSX awk is the
same as the freebsd awk there is a very old open bug on this:

awk(1) does not support word-boundary metacharacters
[https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=171725](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=171725)

~~~
asicsp
GNU awk supports \< and \> as start and end of word anchors, which work for
GNU grep/sed as well.

GNU awk also supports \y, which is the same as \b, as well as \B for the
opposite (same as GNU grep/sed).

Interestingly, there's a difference between the three types of word anchors:

    
    
        $ # \b matches both start and end of word boundaries
        $ # 1st and 3rd line have space as second character
        $ echo 'I have 12, he has 2!' | grep -o '\b..\b'
        I 
        12
        , 
        he
         2
    
        $ # \< and \> strictly match only start and end word boundaries respectively
        $ echo 'I have 12, he has 2!' | grep -o '\<..\>'
        12
        he
    
        $ # -w ensures there are no word characters around the matching text
        $ # same as: grep -oP '(?<!\w)..(?!\w)'
        $ echo 'I have 12, he has 2!' | grep -ow '..'
        12
        he
        2!

~~~
emmelaich
Sure, but a fair bit of the value of the tool is its consistency across
platforms.

There's no point in awk if perl etc are ubiquitous and more consistent.

------
jerf
Because for some bizarre reason, "cut" doesn't ship with any decent column
selection logic that is the equivalent of awk's $1, $2, etc., even in 2020.

That's like 90% of my use of awk right there. I don't know of any easier
equivalent of "awk '{ print $2 }'" for what it does.

Posted partially so the Internet Correction Squad can froth at the mouth and
set me straight, because I'd love to be shown to be wrong here.

~~~
cauthon
> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it
> does.

Does `cut -f2` not work? My complaint with cut is that you can't reorder
columns (e.g. `cut -f3,2` )
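
(As an aside, a minimal sketch of that reordering gap, with a made-up tab-separated line: cut prints fields in file order no matter how you list them, while awk actually reorders.)

      $ printf 'a\tb\tc\n' | awk -F'\t' '{print $3, $2}'
      c b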

Awk is really great for general text munging, more than just column
extraction, highly recommend picking up some of the basics

Edit to agree with the commenters below me: If the file isn't delimited with a
single character, then cut alone won't cut it. You need to use awk or
preprocess with sed in that case. Sorry, didn't realize that's what the parent
comment might be getting at.

~~~
syncsynchalt
It does not. Compare:

    
    
        $ echo '   1     2   3' | cut -f2
           1     2   3
        $ echo '   1     2   3' | cut -f2 -d' '
        
        $ echo '   1     2   3' | awk '{print $2}'
        2
    

"-f [...] Output fields are separated by a single occurrence of the field
delimiter character."

~~~
ratsmack
Also, the field separator (FS) can be a regular expression.

    
    
        FS = "[0-9]"
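
A quick sketch of what a regex FS buys you (the sample input is invented):

      # split on any digit: the fields are "x", "y", "z"
      $ echo 'x1y2z' | awk 'BEGIN { FS = "[0-9]" } { print $2 }'
      y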

~~~
jerf
IIRC, there is an invocation of cut that basically does what I want, but every
time I try, I read the manual page for 3 or 4 minutes, craft a dozen
non-functional command lines, then type "awk '{ print $6 }'" and move on.

~~~
masklinn
> IIRC, there is an invocation of cut that basically does what I want

I don't think there is, because cut separates fields strictly on one instance
of the delimiter. Which sometimes works out, but usually doesn't.

Most of the time, you have to process the input through sed or tr in order to
make it suitable for cut.
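
(A typical workaround, sketched against a hypothetical `ls -l` listing: squeeze the runs of spaces first, then cut can cope.)

      ls -l | tr -s ' ' | cut -d' ' -f5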

The most frustrating and asinine part of cut is its behaviour when it has a
single field: it keeps printing the input as-is instead of just going off and
selecting nothing, or printing a warning, or anything which would bloody well
hint let alone tell you something's wrong.

Just try it: `ls -l | cut -f 1` and `ls -l | cut -f 13,25-67` show exactly the
same thing, which is `ls -l`.

cut is a personal hell of mine, every time I try to use it I waste my time and
end up frustrated. And now I'm realising that really cut is the one utility
which should get rewritten with a working UI. exa and fd and their friends are
cool, but I'd guess none of them has wasted as much time as cut.

------
rectang
Awk, like shell, is a fraught programming environment, full of silent failure
and hidden gotchas. Even for one-liners.

I use shell because I have to, not because I like it. I dread maintaining
shell scripts which have a bunch of awk and sed in them.

The Unix ideal of small single-purpose tools and text processing is separable
from these old warhorses.

~~~
rb808
Scripts are the antithesis of modern software development: no unit testing, no
CI, no monitoring, no consistency, often no source control. Appeals to me as a
hacker, but only my own scripts are intelligible to me - which is a bad sign. :)

~~~
umvi
I'm not actually convinced unit testing is all that valuable, unless the unit
under test has a very clear input -> output transformation (like algorithms,
string utilities, etc). If it doesn't (and most units don't), unit tests just
encumber you.

~~~
Eikon
I’m really glad to read that.

I never understood the whole religion around unit tests. Integration tests are
often far easier to write and far more valuable.

Like you said, unit tests are really nice when testing for a known, expected
output.

Unit tests that are effectively testing mocks and crazy stubs because your
method has side effects? Not for me.

~~~
coliveira
It is easy to explain: unit tests give documented proof that you care about
code robustness. They are used more for social and psychological effect than
for their advantages. In fact, outside a few domains, unit tests make it harder to
evolve software because the more tests you write, the harder it is to make
changes that move the design in a different direction. This is, by the way, my
main problem with verbose techniques of programming: the more you have to
code, the harder to make needed changes.

------
eska
I gave awk a sincere attempt, but I have to say that it wasn't worth it. As
soon as one tries to write anything bigger than a one-liner, the language shows
its warts. I found myself writing lots of helper routines for things that
should be part of the base language/library, e.g. sets with existence checks. I
also had to work around portability issues, because awk is not awk, despite
what this post claims. E.g. matching a part of a line differs among different
awks.

And some language decisions are just asinine, e.g. how function parameters and
variable scope work, fall-through by default although you almost never want
that, etc.

But hey, you have socket support! Sounds to me like things have developed in
the wrong direction.

And of course no one on your team will know awk.

I found the idea of rule-based programming interesting, but the way it
interacts with delimiters and sections (switching rules when encountering
certain lines) doesn't work well in practice.

I also found the performance to be very disappointing when I compared it to C
and python equivalents.

~~~
0xff00ffee
You realize AWK is about 1/100th the size of Python, right? That's like
comparing a Leatherman multi-tool to a Craftsman 2000 piece tool set that
weighs 1,000 lbs. This matters significantly when addressing compatibility and
when building distros that are space constrained.

Awk is there for a reason: to be small. That's why the O'Reilly press book is
called "Sed & Awk": they were originally written to work together in the early
days of unix, dating back to the late 70's. Sed (1974) & Awk (1977) are in the
DNA of unix; Python is something totally different.

~~~
eska
First of all I'm not a distro maintainer. I also doubt that people would use
awk for seriously space constrained environments. And distros ship both awk
and python anyway. And again, I don't understand why they'd support networking
but not basic data types/functions.

The only reason I could've seen to use awk was to throw code together more
quickly in a DSL.

However, this is much less the case than I had hoped. For the one-liners there
are usually specialized tools like fex that are easier to use and faster (for
batch processing).

When I compared my C/python/awk programs the difference was msec/sec/minutes.
As soon as I use such a program repeatedly it starts to hurt my productivity.
And the development time is not orders of magnitude slower in non-awk
languages.

~~~
wahern
> I also doubt that people would use awk for seriously space constrained
> environments. And distros ship both awk and python anyway.

Python is absolutely not available everywhere one can find Awk. I've never
seen a system with Python but not Awk, but have seen many systems with Awk but
not Python (excluding the BSDs, where Python is never in base, anyhow).

Actually, not many years ago I used to claim that I never saw a Linux system
with Bash that lacked Perl, but had seen systems with Perl that lacked Bash.
(And forget about Python.) This was because most embedded distros use an Ash
derivative, often because they used BusyBox for core utilities or a simple
Debian install. Perl might not have been default installed, either, but
invariably got pulled in as a core dependency for anything sophisticated.
Anyhow, the upshot was that you'd be more portable, even within the realm of
Linux, with a Perl script than a Bash-reliant shell script. Times have
changed, but only in roughly the past 5 years or so. (Nonetheless, IME Perl is
still slightly more reliable than Python, but variance is greater, which I
guess is a consequence of Docker.)

One thing to keep in mind regarding utility performance is locale support.
Most shell utilities rely on libc for locale support, such as I/O translation.
Last time I measured, circa 2015, setting LC_ALL=C resulted in _significantly_
improved (2x or better, I forget but am being conservative) standard I/O
throughput on glibc systems.[1] I never investigated the reasons. glibc's
locale code is a nightmare[2], and that's more than enough explanation for me.

Heavy scripting languages like Perl, Python, Ruby, etc, do most of their
locale work internally and, apparently, more efficiently. If you don't care
about locale, or are just curious, then set LC_ALL=C in the environment and
test again. I set LC_ALL=C in the preamble of all my shell scripts. It makes
them faster and, more importantly, has sidestepped countless bugs and gotchas.
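
A minimal sketch of that preamble (the pipeline below is only an example):

      #!/bin/sh
      # skip locale-aware collation and character classing; plain byte semantics are faster
      export LC_ALL=C

      sort access.log | uniq -c | sort -rn | head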

For the things I do, and I imagine for the vast majority of things people
write shell scripts for, you don't need locale support, or even UTF-8 support.
And even if you do care, the rules for how UTF-8 changes the semantics of the
environment are complex enough that it's preferable to refactor things so you
don't have to care, or can isolate the parts that need to care to a few
utility invocations. In practice, system locale work has gone hand-in-hand
with making libc and shell utilities 8-bit clean in the C/POSIX locale, which
is what most people care about even when they care about locale.

[1] The consequence was that my hexdump implementation,
[http://25thandclement.com/~william/projects/hexdump.c.html](http://25thandclement.com/~william/projects/hexdump.c.html),
was significantly faster than the wrapper typically available on Linux
systems. My implementation did the transformations from a tiny, non-JIT'd
virtual machine, while the wrapper, which only supports a small subset of
options, did the transformation in pure C code. My code was still faster even
compared to LC_ALL=C, which implied glibc's locale architecture has
non-negligible costs.

[2] To be fair, it's a nightmare partly because they've had strong locale
support for many years, and the implementation has been mostly backward
compatible. At least, "strong" and "backward compatible" relative to the BSDs.
Solaris is arguably better on both accounts, though I've never looked at their
source code. Solaris' implementation was fast, whatever it was doing. musl
libc has the benefit of starting last, so they only support the C and UTF-8
locales, and in most places in libc UTF-8 support simply means being 8-bit
clean, so zero or perhaps even negative cost.

~~~
kragen
There was a long period of time where it was easy to find a non-Linux Unix
with Perl installed but not Bash: SunOS, Solaris, IRIX, etc., admins would
typically install Perl pretty early on, while Bash was more niche. Like, maybe
1990 to 2000. Now we're getting into an era where lots of Unix boxes run
MacOS, and although they have Bash, it's a version of only archaeological
interest. But they do have Perl.

------
sremani
Honorable mention: Taco Bell Programming (it fits this genre).

[http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html)

Someone ought to write - Zen and the art of Unix tools usage.

~~~
dhruvkar
>> suppose you have millions of web pages that you want to download and save
to disk for later processing. How do you do it?

I don't know enough about the 'real way' or the 'taco bell way', but
interested to know --- is this doable the way Ted describes in the article via
xargs and wget?

~~~
scruple
Yes, absolutely. This is absolutely how ~~we~~ many (most?) of us used to
scrape web pages in the Dark Ages.

------
shpeedy
Skip learning sed and awk and jump straight to perl instead.

    
    
      $ perl --help
      ...
      -F/pattern/       split() pattern for -a switch (//'s are optional)
      -l[octal]         enable line ending processing, specifies line terminator
      -a                autosplit mode with -n or -p (splits $_ into @F)
    
      -n                assume "while (<>) { ... }" loop around program
      -p                assume loop like -n but print line also, like sed
      -e program        one line of program (several -e's allowed, omit programfile)
    
    

Example. List file name and size:

    
    
      ls -l | perl -lae 'print "@F[8..$#F], $F[4]"'

~~~
just_myles
Because syntactically, as a language/tool, it is super easy to remember. Writing
one-liners with awk feels more intuitive to me.

Awk example:

ls -l | awk '{print $9, $5}' or ls -lh | awk '{print $9, $5}'

Seems a whole lot simpler. To me. I find if you have to write extensive shell
scripts then maybe you can look for something more verbose like Perl, I guess.

~~~
shpeedy
Yep, but you have a bug in your awk one-liner.

~~~
wahern
If you mean the lack of quotations, then the behavior is well-defined and is
presumably what was intended. Per POSIX,

> The print statement shall write the value of each expression argument onto
> the indicated output stream separated by the current output field separator
> (see variable OFS above), and terminated by the output record separator (see
> variable ORS above).

The default value for OFS is <space> and for ORS, <newline>.
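
A small sketch of what that means in practice (made-up input; only the output separator changes):

      $ echo 'a b c' | awk '{print $1, $3}'
      a c
      $ echo 'a b c' | awk -v OFS=', ' '{print $1, $3}'
      a, c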

~~~
demiol
> If you mean the lack of quotations,

No, lack of commas in output and broken filenames with spaces.

~~~
just_myles
I see your point regarding spaces in names. I suppose I could use FILENAME. But
I think my point was made.

------
geocrasher
My favorite Awk reference is this:

[https://ferd.ca/awk-in-20-minutes.html](https://ferd.ca/awk-in-20-minutes.html)

Also, a handy trick is to combine awk and cut. For example, I had a log line
that had a variable number of columns in just one field, but immediately after
the field was a comma. I cut based on the comma:
cut -d, -f1,2

and then awk'd the last column:

cut -d, -f1,2 | awk '{ print $2" "$5" "$NF }'

So, sometimes awk and cut can help each other.

~~~
ubadair
I've been procrastinating learning awk for a while now. Thanks for this, I
read it and it's just what I needed.

------
mason55
Another good resource is "Why you should learn just a little Awk: An Awk
tutorial by Example"[0]

[0] [https://gregable.com/2010/09/why-you-should-know-just-little...](https://gregable.com/2010/09/why-you-should-know-just-little-awk.html)

------
dredmorbius
I've been parsing some documents converted from PDF (using the Poppler
library's "pdftotext" command with the "--layout" option).

I found that reading these -- sort-of half-assed structured data, but with
page-chunked artefacts and idiosyncrasies -- was difficult on a line-by-line
basis, and thought idly "this would be a lot easier if I could process by page
instead".

Text was laid out in columns, and the amount of indenting (the whitespace
between columns) was significant. So preserving this somehow would be Very
Useful.

Suddenly those pesky '^L' formfeeds were an asset, not a liability. Let's
treat the _formfeed_ ("\f") as a _record_ delimiter, and the _newline_ ("\n")
as a _field_ delimiter. We can then parse out the actual columns based on
whitespace, for each line:

    
    
        BEGIN { RS="\f"; FS="\n" }
        {
            pageno = NR
            lines = NF
            for( line=1; line<=lines; line++ ) {
                ncols = split( $line, columns, " {2,}", gaps )
            }
        }
    

This gives me:

- The running tally of pages.

- Each _line_ of the page as an individual _field_.

- Via the split() function, an array of _columns_ separated by two or more
spaces, with the separators saved as an array of _gaps_ so I have the
whitespace to play with.

Edge cases and fiddling ensue, but that's the essential bit of the code there.
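
(For concreteness, a hypothetical invocation, assuming the rules above are saved as pages.awk; the 4-argument split() wants gawk:)

      pdftotext -layout report.pdf - | gawk -f pages.awk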

Since the lines are an array, I can roll back and forth through the page
(basically being able to read forward and backwards through the text record),
testing values, finding out where column boundaries are, etc., and then output
a page's worth of content, transposing to a single-column format, with
appropriate whitespacing, when done.

In testing and debugging the output (working off of 20+ documents of 100s to
~1,000 pages), a lot of test cases, scaffolding, diagnostics, etc., have been
created and removed to make sure the Right Things are happening. Easy with
awk.

~~~
dredmorbius
And to be clear: After realising this ... and knowing what to look for ... I
found this documented in the GNU Awk User's guide:

[https://www.gnu.org/software/gawk/manual/html_node/Multiple-...](https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html)

------
hawski
I love awk, but I'm still waiting for the structural regex awk:
[http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf](http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf)

~~~
linsomniac
I just learned about SE a month or so ago, and it is indeed pretty awesome. I
tried out the "vis" editor, using SE to create multiple cursors that I then
manipulated with vi-like commands. That was a pretty sweet use case.

------
Communitivity
Awk is a command I turn to time and again. For me it's the single most
valuable command for enabling the piped single-purpose pattern.

As an example, if I want a sorted list of all open files under the home
directories on CentOS I can do this:

lsof | awk '{ print $10 }' | grep ^/home/ | sort | uniq
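
(A hedged variant, in the spirit of the protip upthread: awk can absorb the grep, and sort -u the uniq.)

      lsof | awk '$10 ~ /^\/home\// { print $10 }' | sort -u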

~~~
kmstout
Don't stop there. You've solved a real problem in your life, and you might want
that information another day. Make a tiny script that encapsulates it.
Generalize it a tiny bit, and give it a memorable name (perhaps lsof-tree).
That done, you can stop worrying about the mechanics of the solution and build
on it.

    
    
      #! /bin/bash                                                                                                           
      # lsof-tree: list open files in a given directory tree (default /home)                                                 
                                                                                                                           
      NAME=9 # set to 10 for CentOS                                                                                          
      BASE="${1:-/home}"                                                                                                     
                                                                                                                           
      lsof | awk -v NAME=$NAME '{print $NAME}' | grep "^$BASE" | sort -u

~~~
jabl
YMMV, but I find it easier and faster to know the basic utilities and how to
compose them with pipes than remembering the name of a zillion such scripts.

I guess it depends on how often you need that particular pipeline. Every day?
Sure, make a script. Every few months? Nah, I won't remember it anyway; or more
likely I'll remember that I've made a script like that, but then I have to start
searching my bin directory, in the end using more time than just writing the
pipeline in the first place.

------
rhombocombus
It is fast, robust, and frequently far more performant than a lot of modern
tools that can be overkill for most data manipulation. I use it all the time
in our ETL processes and it always works as advertised.

~~~
shpeedy
Perl is much faster[0], with many more features, a bunch of ready-to-use
libraries, a package manager (CPAN), and similar syntax to awk.

Why use awk?

[0]: [http://rc3.org/2014/08/28/surprisingly-perl-outperforms-sed-...](http://rc3.org/2014/08/28/surprisingly-perl-outperforms-sed-and-awk/)

~~~
macintux
Perl was my second programming love, but awk is much shorter and easier to
remember for the simple cases where I need it.

Remembering which Perl command-line arguments simulate awk’s line-by-line
processing is harder than just remembering awk.

~~~
mekazu
I use Perl similarly to awk if I need to use regexes rather than
whitespace-delimited fields.

I think if you know Perl really well and can remember the command line
arguments - particularly -E, -n, -I and -p - then it's a good swap-in
substitute for grep, sed, awk, cut, bash, etc. when whatever 5-minute task
you're working on gets a tiny bit more complex.

Similarly, a decent version of perl 5 seems to be installed everywhere by default.

I’m curious to know if anyone would say the same about python or any other
programs? I’m not particularly strong in short python scripting.

~~~
macintux
I would say Perl’s native support for regular expressions makes it more useful
on the CLI than Python, but Python is also very low on my preferred languages
list.

I do, however, use it for JSON pretty printing in a pipeline: python
-mjson.tool IIRC.
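
(For the curious, a minimal sketch of that pipeline:)

      $ echo '{"a": 1, "b": 2}' | python -m json.tool
      {
          "a": 1,
          "b": 2
      }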

------
berkes
Last week I threw out AWK and replaced it with Ruby (Could've been Python,
Perl or PHP even).

Because AWK is not suited for CSV. Please prove me wrong!

I had to parse 9 million lines. Some of which contain "quoted records", others,
same column, are unquoted. Some contain commas in the fields, most don't.
CSV is like that: more like a guideline than actual rules.

Two hours of googling and hacking later, I gave up and rewrote the importer in
Ruby, in under 5 minutes.

Lesson learned: I'll steer clear of AWK when I know a one-liner of Ruby (or
Python) can solve it just as well, because I know for certain the latter can
deal with all the edge cases that will certainly pop up.

~~~
F-0X
> I had to parse 9 million lines.

Awk would chew through that no problem.

> Some of which contain "quoted records", others, same column, are unquoted.

In which case, there is the FPAT variable, which can be used to define what a
field is. FPAT = "\"[^\"]*\"|[^,]*", which means "stuff between quotes, or
things that are not commas", would probably have worked for you.
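
A small sketch of that FPAT approach with gawk (the sample line is invented):

      $ echo 'foo,"bar, baz",42' | gawk 'BEGIN { FPAT = "\"[^\"]*\"|[^,]*" } { print $2 }'
      "bar, baz"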

> Some contain commas in the fields, most don't. CSV is like that: more like
> a guideline than actual rules.

Well, I would say that's absolutely false. You can't just put the delimiter
wherever you fancy and call it a well-formed file. Quoting exists for the
unfortunate cases where your data includes the delimiting character (ideally
the author would have the sense to use a more suitable character, like a tab).

This is just a retort to prevent your post from dissuading readers from awk,
which is a fantastic tool. If you actually sit down for half an hour and learn
it, rather than googling to cobble together code that works, it is wonderful. I
also don't think it is valid to base your judgement of a tool on what was
apparently garbage data.

~~~
ajanuary
Garbage and poorly specified csv files are a fact of life and people have to
deal with them all the time.

But if you want to be in a world where people only deal with well specified
files like RFC 4180 (for some definition of well specified), your quick field
pattern doesn’t conform. It doesn’t handle escaped double quotes or quoted
line breaks. If you’re using your quick awk command to transform an RFC 4180
file into another RFC 4180 file you’ve just puked out the sort of garbage you
were railing against.

While awk is a great tool if you're dealing with a csv format with a
predictable specification, and probably could be made to bend to the GP's will
with a little more knowledge, it gets trickier when you're handling
some of the garbage that comes up in the real world. What's worse is that the
programming model leads you down the path of never validating your assumptions
and silently failing.

I love awk for interactive sessions when I can manually sanity check the
output. But if I’m writing something mildly complex that has to work in a
batch on input I’ve never seen, I too would reach for ruby.

------
technofiend
Parenthetically, since there are a bunch of UNIX greybeards on HN: if anyone
has artwork of the AWK t-shirt I will happily pay any reasonable price. The
shirt has a parachute-wearing bird about to jump from an airplane and is
captioned with AWK's most famous error message: awk: bailing out near line 1.

Contact information is hn handle @ yahoo.com.

------
LinuxBender
Here are some clever one-liners for awk. [1] Please be sure to add your own.

[1] -
[https://www.commandlinefu.com/commands/matching/awk/YXdr/sor...](https://www.commandlinefu.com/commands/matching/awk/YXdr/sort-by-votes)

~~~
asicsp
I have a repo dedicated to some of the CLI text processing tools like
grep/sed/awk/perl/sort/etc. Here's my one-liner collection for awk [1]

[1] [https://github.com/learnbyexample/Command-line-text-processi...](https://github.com/learnbyexample/Command-line-text-processing/blob/master/gnu_awk.md)

~~~
LinuxBender
That is a very well-written set of examples. Clear and concise. Thank you!

------
Chris2048
Hmm. I'm sure this question will induce a flamewar of practical
"#NeverAwk"-ers fighting toolbelt bloat, versus tech-hoarding AWK apologists
arguing against throwing something out given it fills <niche>.

Here's the thing: these arguments all too commonly focus on subjective notions
of "simplicity", and toy examples divorced from _actual_ common practice,
and/or solid comparable _benchmarks_.

Show me a range of practical examples, for each competing env (awk, sed, perl,
python, ruby, maybe bash).

Include:

- time it takes to teach a total novice (the time it takes to learn whatever
is needed for _that_ example, not the entire language)

- how easy it is to recall said knowledge at a later date

- how fast an example is, multiplied by how much you are likely to use it =
actual time saved in terms of execution. For small, fast examples the
difference is irrelevant; a 10x speedup that is 0.1s vs 0.01s is meaningless.

- how _extendable_ an example is. Hence the original example should include a
series of extensions to the original task, to demonstrate how flexible /
composable they are: e.g. task 1) count lines in a file; task 2) count lines
in a file, then add 42 to it;

I suspect awk falls behind in practicality vs perl (which can do simple
one-liners, but also more complicated constructs), but perhaps has a hidden
virtue wrt speed in more expensive tasks, ala
[https://news.ycombinator.com/item?id=17135841](https://news.ycombinator.com/item?id=17135841)
or
[https://news.ycombinator.com/item?id=20293579](https://news.ycombinator.com/item?id=20293579)

~~~
flukus
> Here's the thing: these arguments all too commonly focus on subjective
> notions of "simplicity", and toy examples divorced from actual common
> practice, and/or solid comparable benchmarks.

I think you're missing the point of awk. The O'Reilly sed and awk book has some
complex examples, but when I look at my own usage they are all toy examples
within a much larger scope. It's more like a special DSL extension for my
shell than something I'd pick to build an entire solution with, so a comparison
to perl, python and ruby doesn't really make sense; they are general-purpose
languages, but awk just has a couple of features that make it a very
specialized yet useful one.

As an example, I have a system for importing and parsing log files that's
mostly done from a shell script; awk is used in two parts. The first is to
transform a structured and easy-to-read file (records '\n\n' separated) into a
csv that's easier to consume from bash; there are probably quite a few options
to do this, from tr to bash, and it's done inline. The second is to filter the
results down to what I need, so I have scripts like:

    
    
      #!/usr/bin/awk -f
      /some common error I don't care about/ { next } #skip line
      /other common error/ { next }
      /Error/ { print $0 } #this prints error lines, alternatively:
      /Error/ && !errors[gensub($1, "", "g", $0)]++ { print $0 } #print each error once
      {next} #skip everything else 
    

Apart from one single line which wasn't in the original, that's something you
could teach a total novice in minutes; the /pattern/{action} syntax is about
as simple as programming can be. Execution speed could probably be improved
with a purpose-built program, but I suspect the bottleneck would be the
spinning disk anyway. I run this over hundreds of MB every few minutes and
it's not a problem; when I run it manually it's near instantaneous. I spend
longer waiting for the desktop calculator to open up these days.

~~~
Chris2048
Can you give an example of when you'd choose AWK over perl?

perl is general purpose, but that doesn't mean it can't be used for one-liners.

------
znpy
Do you know that feeling when a problem is too much for the shell but too
little for C? That's where Perl was supposed to fit.

In everyday life, however, many small problems are a bit too much for the shell
but too little for Perl/Python/whatever.

awk fits very nicely in there.

------
samatman
> _Imagine programming without regular expressions._

I live in this future and it's beautiful.

Steps to programming without regular expressions:

1) find a PEG library for your language of choice

There is no step 2.

~~~
iagovar
Sorry for asking such a dumb question, but what's a PEG library?

~~~
tejtm
Parsing expression grammar;

It is a recursive descent parser with the tiny tweak that productions are
ordered (not a set) and short-circuit.

A 90's language you did not have to imagine, one that saved you from regex in
this way, was [is] REBOL with its built-in `parse` DSL.

examples here
[http://www.rebol.org/search.r?find=parse&form=yes](http://www.rebol.org/search.r?find=parse&form=yes)

[0][https://en.wikipedia.org/wiki/Parsing_expression_grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
[1][http://www.rebol.com/](http://www.rebol.com/)

------
shmerl
So why for example Awk instead of let's say Perl? The argument of "it's likely
to be installed" isn't very compelling.

~~~
dredmorbius
If you're versed in both, it comes down to taste, though there really _are_
cases where you'll have awk (usually via busybox) but _not_ Perl. OpenWRT
comes to mind (just verified it's not present by default, though yes, packages
are available).

For a huge number of simple tasks, awk is available and sufficient. It's
largely a subset of Perl, so yes, there's some skills overlap, but there are
times where knowing awk is the right tool and the _available_ tool will pay
off.

~~~
shmerl
I like Perl regexes more though, especially aliases. Using \d is a lot neater
than [0-9] or [[:digit:]].

~~~
dredmorbius
I hear Perl may support those ;-)

(Use the right tool for the job.)

------
YarickR2
Because you often don't need anything else to deal with a surprisingly large
class of tasks in IT.

------
jackcosgrove
After a decade of writing application software in C family languages, I am now
working on a big devops effort in a Linux environment. A lot of the syntax is
janky but it's pretty amazing what you can accomplish with a shell and the
Linux command line tools.

------
ineedasername
In the early aughts, fresh out of college, one of my first projects was to
take semi-structured text from database blobs and convert it to XML. I quickly
realized it was not a task to be fully automated: too many edge cases really
required human judgement because it was only very loosely semi-structured. I
turned to sed, awk, and pico. Sed & awk did the 99%, and dumped me into pico
when they didn't know what to do, and I would resolve the issue. Doing
it semi-supervised in this way was 100 times faster than doing it all
manually, and 10 times faster than full automation, and probably more
accurate.

But it was the ability to string together these types of command line tools
that made it possible.

------
3xblah
The author forgot one: C-like syntax.

I believe this was intentional on the part of awk's authors.

In the early days of UNIX, I think more of its users knew C. Today, it is
probably a much smaller number. However, learning AWK today, IMO, can help
someone who also intends to learn C.

------
loriverkutya
"Imagine programming without regular expressions."

Best of all worlds. I wish there were a way to get back the time I spent
debugging edge cases of different regex implementations on various OSes.

------
AzzieElbab
I have used awk for decades, but I simply stopped and dropped the habit of
using it along with sed and perl. Nowadays I would rather write a program than
a script, just to avoid memorizing all these tricks and hacks and glitches.
------
doctor_eval
It would be so great if awk had a csv mode. For whatever reason (Excel), CSV
seems to be the default text format for field oriented data.

Maybe I’m dumb but I’ve never come up with a separator regex that is quite
right.

~~~
dredmorbius
The simple case is:

    
    
        FS = ","
    

or:

    
    
        awk -F ',' <program>
    

If you're working with CSV data that has quoted strings with embedded commas,
FPAT is your friend:

    
    
        FPAT = "([^,]+)|(\"[^\"]+\")"
    

See:
[https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content)

~~~
doctor_eval
just want to say thanks for this, I haven't had to deal with CSV for years and
now only a week after you posted this I needed it.

brew install gawk

good to go

:)

------
jibbit
There used to be a website shared on HN a lot, which was a table of: Task,
SED, Awk. It was really useful but I haven’t seen it for years.

------
rwnspace
Instead, learn fex:

[https://www.mankier.com/1/fex](https://www.mankier.com/1/fex)

------
arendtio
Does someone know how I have to invoke awk to make sure that the awk code I
wrote is POSIX compliant?

------
dhruvkar
Needs a '(2016)' at the end of the title.

~~~
LinuxBender
[updated] ty.

------
jmercouris
Why not just use Emacs for simple scripts and text manipulation? Way easier,
lots of one liners as well, just as expressive.

~~~
clarry
Too big and slow. Also I disagree about the expressiveness:
(emacs-has-chosen-a-very-longwinded 'way (to (express things))) that would be
shorter in languages with more syntax.

------
Afton
(2016)

------
0xff00ffee
When writing a shell script, the robustness/reliability can be inferred from
its scope:

1. Uses shell-only commands (echo, for) - most robust; but things like
basename/dirname and regexes vary by shell (sh, bash, zsh, ksh)

2. Uses /bin - might run into a missing binary, but not likely; still robust
and allows a richer set of tools (e.g., uname, chmod and admin-ish things live
in /bin)

3. Uses /usr/bin - runs the risk of missing packages, likely not very robust
(packages drop things in here, like gzip, yacc, gcc)

4. Uses /usr/local/bin or /opt/local/bin - definitely requires package
installs, least robust

