
The State of the AWK - benhoyt
https://lwn.net/SubscriberLink/820829/5bf9bf8bb9d6f2bf/
======
freedomben
The new namespaces look interesting, although I really don't think the lack of
namespaces is what has hindered Awk from widespread use.

I am speculating of course, but as someone who evangelizes Awk, the most
common thing I hear from people as to why they don't use it is that they just
don't know the language all that well. For people who don't know much about
Awk, it looks really complex and esoteric.

To address the Awk ignorance, I put together a talk for LinuxFest Northwest
last year, and it was so popular that I gave it again (virtually, due to
COVID-19) this year. The conference was fully remote, so the talk is on
YouTube.

If you've ever wanted to learn Awk, this will take you from zero to
proficient. There are exercises as well for practice:

* Presentation: [https://youtu.be/43BNFcOdBlY](https://youtu.be/43BNFcOdBlY)

* Exercises:

- Source: [https://github.com/FreedomBen/awk-hack-the-planet](https://github.com/FreedomBen/awk-hack-the-planet)

- My solutions with explanation: [https://youtu.be/4UGLsRYDfo8](https://youtu.be/4UGLsRYDfo8)

~~~
BiteCode_dev
The reason AWK is not widespread is that it's only great for one case: working
on a stream of lines. It's a pretty narrow niche.

Sure you can code anything with it, but there is always another language that
will be better at this other thing, and still decent at what awk does. Awk
looks more like a DSL than a generalist language.

So why learn awk to save a few minutes in the rare cases you do need it? Just
learn Python/Ruby/etc.; it's a better investment, and it has a better
cross-platform story.

~~~
jimbokun
> It's a pretty narrow niche.

If you're working on any kind of Unix, it's a niche that you encounter almost
constantly.

> Awk looks more like a DSL than a generalist language.

Yes, it is pretty much the ultimate text processing DSL. So if you need to
process a text file a line at a time, in most cases AWK is the optimal
solution.
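
For example, a report that would take a small script elsewhere is one pattern-action pair in awk (the log lines here are invented):

```shell
# count lines containing ERROR in a stream; prints: 2 errors
printf 'ok\nERROR: disk full\nok\nERROR: net down\n' |
  awk '/ERROR/ { n++ } END { print n+0, "errors" }'
```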

> and it has a better cross-platform story.

I think AWK is available on almost every platform where you can install those
languages?

------
alexhutcheson
There are two use-cases I’ve run into where awk really shines, and is hard to
replace:

1. Writing scripts for environments that only have Busybox. Technically you
can write scripts in ash, but I don’t recommend it for anything beyond a
couple of lines. It’s missing a lot of the features from Bash that make
scripting easier, and it’s easy to get mixed up if you’re used to Bash and
write things that don’t work. Awk is the best scripting language available
there, even if you’re doing things that don’t exactly match what it was
designed to do.

2. Snippets that are meant to be copy+pasted from documentation or how-to
articles. In that case it’s often not easy to distribute a separate script
file, so a CLI “one-liner” is preferred. You also can’t count on Perl, Python,
etc. being available on the user’s system, but awk is pretty universal.

For most other cases, I tend to create a new .py file and write a quick Python
script. Even if it’s a little more overhead, it helps keep my Python skills
sharp, and often it turns out that what I actually want is a little more
complicated than my initial idea anyway.

~~~
theamk
This is a great question, actually -- is busybox's AWK full-featured? When I
saw that it is busybox-provided, I kinda assumed it is annoyingly limited,
just like ash is.

Should I have looked at it more carefully?

~~~
alexhutcheson
As far as I know, it supports all the awk features specified in POSIX, but
doesn’t include the extensions added in gawk or mawk.

------
dan-robertson
I’m mostly happy with awk. But splitting strings _really_ sucks. I have to
read the docs every time.
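
For anyone else who always forgets: split() takes the string, an array to fill, and a separator, and returns the number of fields. A minimal reminder (sample data invented):

```shell
# split(string, array, separator) fills the array and returns the field count
echo "usr:x:1000" | awk '{ n = split($0, f, ":"); print f[1], "(" n " fields)" }'
```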

I think my dream awk-like tool would look something like:

1. Has an “interactive” mode to see what your script is doing as you write
it. Something like a combination of less, fzf, and a Bret Victor-style
debugger showing matches/values of variables at each line.

2. Supports things that aren’t just RS-separated records of FS-separated
fields. Some formats I would like are JSON, CSV, records where the fields are
all key=value, and maybe some others. Support would mean some way to specify
patterns for different formats.

3. Extracting matching groups from regex matches.

~~~
jiggawatts
The thing you have to be _really_ careful about is escaping.

Compared to object-oriented PowerShell, where every column is just an object
property, strongly typed and everything, string-based bash programming seems
absolutely bonkers to me.

Like... what do you do if some text doesn't fit into the space available?

How do you handle Unicode?

What about simple escaping of names like O'Toole, embedded double quotes,
leading or trailing spaces that are meaningful, embedded line feeds, etc...

Eventually you _have_ to use a full parser, not a bunch of regexes. I've found
"eventually" to mean: almost immediately, even for supposedly simple
problems.

Even seemingly trivial things like _correctly_ splitting up an X.500 name as
seen in LDAP or PKI are deceptively difficult. Now, if it's an LDAP name that
includes quotes embedded in a CSV... err... I don't even know where to begin.

~~~
theamk
I found tab-separated files can get me pretty far. They handle arbitrary-length
text, Unicode is no problem at all, leading/trailing spaces are
preserved, and so on. The format is trivially convertible to name=value form,
as long as you guarantee the "name" part is well-formed (usually the case).

The only tricky parts are embedded line feeds and tabs, and those can be
defeated with a trivial escape scheme, even something as simple as '\n', '\t',
'\\'.
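
A sketch of that escape pass in awk itself. Every doubled backslash below is awk/shell quoting, which is admittedly part of the problem being discussed; the input sample is invented:

```shell
# escape backslashes first, then tabs; output is the literal text: a\\b\tc
printf 'a\\b\tc\n' |
  awk '{ gsub(/\\/, "\\\\\\\\"); gsub(/\t/, "\\\\t"); print }'
```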

Bash quoting also works. If you master the difference between "${x[@]}" and
${x[*]}, keep the order of expansion and splitting in mind at all times, and
never forget the right kind of quotes, you can have a robust system that
handles arbitrary strings. But I would not recommend this to anyone.
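
A two-line demonstration of that [@] vs [*] trap (bash, invented data):

```shell
# quoted [@] keeps elements whole; unquoted [*] re-splits on whitespace
bash -c '
x=("two words" "three more words")
printf "%s\n" "${x[@]}" | wc -l    # 2: one line per element
printf "%s\n" ${x[*]} | wc -l      # 5: whitespace splitting mangles them
'
```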

~~~
benibela
Or ASCII control characters

No reason to use tabs when there are explicit group/record/unit separators
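
Those separators drop straight into awk as RS/FS via their octal escapes (036 is the ASCII record separator, 037 the unit separator; sample data invented):

```shell
# records separated by RS (\036), fields within them by US (\037)
printf 'alice\037bob\036carol\037dave\036' |
  awk 'BEGIN { RS = "\036"; FS = "\037" } NF { print $2 }'
```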

------
salmo
Kernighan’s recent memoir has made awk pop into my head a lot more lately and
allowed me to stay in shell-script land vs. bailing to Python.

I used it plenty over the years, but with a lot of trial and error. Now I see
it as an event engine, and it just flows naturally. It’s so great for
generating reports from log files, simple csv, data dumps, etc. It’s sad to
see it used as ‘cut’ in most scripts.

It’s like a basin wrench. It has a small scope, but once you use it and grok
it, you’d never want to do those things with another tool.

That said, I have no use for features above POSIX. It’s essentially a DSL for
me, and turning it into Perl ruins the simplicity. I’ll move on to a general
purpose language that others can understand and maintain vs use esoteric
extensions.

~~~
jjice
> allowed me to stay in shell script land vs bailing to Python

This is something I've been working on as well. I know how to get any utility
task I want done in Python, but it requires a lot more work than piping
commands well in a shell. I've been going out of my way to do things in Bash
instead as a way to become more familiar with the standard Unix utils. It's a
fun little exercise that ends up resulting in more efficient use down the
line.

------
VWWHFSfQ
> Robbins believes that AWK's lack of namespaces is one of the key reasons it
> hasn't caught on as a larger-scale programming language

AWK never caught on as a large-scale programming language because it's an
esoteric language for processing text-based streams. Not because it didn't
support "namespaces". Don't kid yourselves about "what coulda been..."

~~~
Spivak
I mean, that describes bash as well, and it’s pretty darn popular.

Had awk been where it is today twenty years ago it could have been the glue
systems language instead of bash for non-interactive usage.

~~~
fiddlerwoaroof
20 years ago there was Perl, which was often claimed to replace sed and awk: I
remember several books that said things like “there’s no reason to use awk now
that we have Perl”.

------
conorh
I used dynamically generated awk scripts a few years ago to take some
delimited text files that were several hundred gigabytes and pre-process them
for faster loading to a DB. It sped up the whole process a huge amount - I
don’t remember exactly now but I think it went from 24 hours to get it all
loaded to under 2 hours. A useful tool to have in the toolbox.

------
afiodorov
I work in data processing and occasionally use awk to work with csv files
that are often gigabytes in size.

I join csv files that each have a header with

      awk '(NR == 1) || (FNR > 1)' *.csv > joined.csv

Note this only works if your csv files don't contain embedded newlines. If
they do, I recommend using
[https://github.com/dbro/csvquote](https://github.com/dbro/csvquote) to
circumvent the issue.

Yesterday I used awk as a QA tool. I had to subtract a sum of values in the
last column of one csv file from another, and I produced a

      expr $(tail -n+2 file1.csv | awk -F, '{s+=$(NF)} END {print s}') - $(tail -n+2 file2.csv | awk -F, '{s+=$(NF)} END {print s}')

beauty. This allowed me to quickly check whether my computation was correct.
Doing the same in pandas would require loading both files into RAM and writing
more code.

However, I avoid writing awk programs that are longer than a few lines. I am
not too familiar with the development environment of awk, so I stick to
either Python or Go (for speed), where I know how to debug, jump to
definition, write unit tests, and read documentation.

~~~
davidgould
If you add a guard to your awk script to check the record number you can avoid
using tail (NR > 1 skips just the header line):

      awk -F, 'NR > 1 {s+=$(NF)} END {print s}'

------
jrochkind1
My first paid programming job was as a high schooler, a summer job for a
university professor, circa 1992, using awk to analyze log files from a
NeXTStep desktop app (for some kind of scientific computing, I forget), for a
usability study.

Even during that job, as the complexity of the analysis increased, my
supervisor/mentor suggested, "Have you heard of Perl? You might find that more
convenient." And I did find it more convenient; it was clear it could do
everything awk provided as well as awk could (even using close to the same
syntax if you wanted), plus more.

Which led to my first web job, as a university student circa 1996, writing
'dynamically generated web pages' with Perl CGI, for the university. (At this
point I haven't written Perl in at least 15 years either, and I don't tend to
miss it.)

The main reason to use awk as opposed to a more convenient language (other
than "our legacy code is written in it and it would be a big investment to
port it" -- a couple of cases in OP) seems to be that it's present on nearly
every system. But isn't Perl also?

~~~
PopeDotNinja
I believe Perl is not part of POSIX. So while you'll see it in most places,
you won't see it everywhere. For example, you'd bump into strict POSIX
compliance in containers that are kept as small as possible. You probably
won't find Perl installed by default in a Busybox container, but I believe
that container would still contain awk. I can't verify this myself at the
moment.

~~~
jrochkind1
Thanks, that makes sense.

That the _functioning_ of awk was part of POSIX was something I just learned
from this article, and it surprised me. POSIX includes specifications of
entire languages within it! I guess it makes sense, but wow.

But I guess the extensions and advancements to awk being discussed in OP would
not be part of POSIX awk, so they are not necessarily to be found on a POSIX
system, true? Would people using awk for its POSIX reliability avoid using
new-fangled awk innovations?

------
jasoneckert
"AWK build scripts are surprisingly hard to kill, even when people want to"

Truer words have not been spoken, in my opinion. I've used AWK heavily for
decades, and still use it for a wide variety of parsing.

------
nly
We have a tool at work, forked from
[https://github.com/onetrueawk/awk](https://github.com/onetrueawk/awk), for
the binary serialization of one of our proprietary in-house protocols, which
has hundreds of record types.

I am blown away by how elegant the result is.

~~~
benhoyt
Neat! I'm curious -- would it be possible to share a few (non-sensitive!)
details about how this works / what this looks like?

~~~
nly
So essentially the binary records are tagged with a record type, which is used
by the tool (via a library maintained by another team) to provide metadata
(field names, types, ordering etc) for that record type.

Human-meaningful field names are made available when records are processed in
your awk expressions. This is an improvement over dumping the record as
delimited text and using regular awk with meaningless $1, $2 variables.

I could be wrong, but I believe the relational operators also recognize common
record field types like dates and timestamps that a text dump + regular awk
couldn't.

The output of the tool is always a human readable serialization.

------
vintnes
Namespaces are an interesting addition, and I appreciate GNU trying to
modernize such a ubiquitous tool, but I wish they would put more effort into
expanding the built-in utilities. There's still no `join`, for example. No
high-level date mechanics. Also, no Oniguruma! I'm tired of looping over
`match` and `sub` statements because you can't specify non-greedy regex.

The @include and @load directives are extremely useful for shipping your own
customizations, but I prefer the maintenance priorities of the jq maintainers
[1], who understand that powerful builtins are what burn into users' minds,
making a tool mentally indispensable.

1. [https://github.com/stedolan/jq](https://github.com/stedolan/jq)

~~~
tyingq
It is fairly straightforward to extend gawk since they added the gawkextlib
functionality to regular gawk (vs xgawk).

Here's the readfile extension, for example:
[https://github.com/gvlx/gawk/blob/master/extension/readfile....](https://github.com/gvlx/gawk/blob/master/extension/readfile.c)

But, you're right...there aren't many extensions, and little activity around
adding new ones.

------
ori_b
Forget namespaces. What about local variables?

~~~
prussian
They would probably point you to unused function parameters as your "local
variables."

~~~
coliveira
This is official syntax by now. It works. I don't see any reason to think of
it as necessarily inferior; it is just another way to handle local variables.

~~~
hawski
It is inferior, because there are no local variables at the top level.

------
every
For me, AWK is mostly for one-liners, at which it excels...

~~~
m463
I generally use it in the same way. Anything longer than one line gets
something else.

To be honest, what I mostly use it for is rearranging fields:

      something | awk '{print $1" "$2}'
      something | awk '{print "xyz:"$0}'
      something | awk '{print "cp "$3" "$4}' | sh

What's a shame is how cut and other utilities make it unnecessarily bothersome
to rearrange fields.

I actually made a utility once called "words" that does:

      ls -l | words 1 3 4-5
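
That's easy to approximate in plain awk. A hypothetical sketch of such a `words` helper (the real utility's name and behavior are guesses):

```shell
# words: print the given 1-based fields and ranges from each input line
words() {
  awk -v spec="$*" '
    BEGIN { n = split(spec, want, " ") }
    {
      out = ""
      for (i = 1; i <= n; i++) {
        # each spec item is either a field number like "3" or a range like "4-5"
        if (split(want[i], r, "-") == 2) { lo = r[1]; hi = r[2] }
        else { lo = hi = want[i] }
        for (f = lo; f <= hi; f++)
          out = out (out == "" ? "" : " ") $f
      }
      print out
    }'
}
```

Used as `ls -l | words 1 3 4-5`, just like the original.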

------
heinrichhartman
I came to AWK via DTrace
[http://dtrace.org/blogs/about/](http://dtrace.org/blogs/about/)

When processing event streams, it's very natural to express transformations as
FILTER + ACTION rules.

AWK is an embodiment of this idea for the domain of text processing where
events are lines (or multi-line records).

DTrace uses the paradigm to process low-level system events (like scheduling
events, or (kernel) function calls).

It's a good paradigm, worth using.

However, it's easily replicated in every other language with a loop:

        while true {
          rec = read()
          if( filter1(rec) ) { action1(rec) }
          if( filter2(rec) ) { action2(rec) }
          ...
        }

You don't need a special language to do event processing like this.
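
For comparison, awk makes that loop implicit; the same two filter+action rules are just (patterns invented for illustration):

```shell
# each pattern { action } pair runs against every input record
printf 'foo\nbar\nfoo baz\n' |
  awk '/foo/ { print "action1:", $0 }
       /bar/ { print "action2:", $0 }'
```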

------
cls59
I wish gawk had a native strptime function. Always seems like an odd omission
given strftime is included.
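
Until then, for a single fixed format you can fake strptime with mktime(), which gawk does have. A gawk-only sketch that assumes exactly this timestamp layout:

```shell
echo "2020-05-01 12:34:56" | gawk '{
  gsub(/[-:]/, " ")            # now "2020 05 01 12 34 56", the mktime datespec
  ts = mktime($0)              # epoch seconds, interpreted in local time
  print strftime("%Y, day %j", ts)
}'
```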

~~~
gav
If you add the gawkextlib/timex library you can get access to strptime.

[https://sourceforge.net/p/gawkextlib/code/ci/master/tree/](https://sourceforge.net/p/gawkextlib/code/ci/master/tree/)

~~~
cls59
Most of the awk scripts I end up writing are for execution by other people on
machines I don't have control over. If strptime were in the standard library, I
could use it as long as I knew the target OS was something like "Ubuntu XX.XX
or newer, RedHat X or newer".

When strptime is in an add-on library, I can't use it.

------
ubadair
On multiple occasions I have been surprised and disappointed that [my
installed flavor of] awk has no native support for 64-bit integers. That
aside, I love awk and use it daily. That "awk in 20 minutes" guide was a life
changer when I read it a few months ago.

------
umvi
Pardon the pun, but I've always found awk awkward to use.

Literally every time I use awk I have to google the syntax because I don't use
it often enough for it to persist in long term memory.

If only awk's syntax were a subset of a language I already knew, like Python
or JS.

~~~
OriginalPenguin
If awk's syntax was a subset of either python or JS, wouldn't it be better to
just use python or JS?

~~~
JadeNB
> If awk's syntax was a subset of either python or JS, wouldn't it be better
> to just use python or JS?

Not necessarily—sometimes it makes sense to use Datalog instead of Prolog, for
example.

------
jayzalowitz
Personal opinion:

"The State of the AWK" should actually be called "The AWK word"

------
zetalemur
For a long time, I thought awk was just an abbreviation for "awkward" (in fact
it is named after its inventors, Alfred Aho, Peter Weinberger, and Brian
Kernighan), which seemed fitting, as its syntax looks arcane compared to
modern languages - I really thought the name was some kind of elaborate joke
... :)

However, I learned to like it and use it often in CLI one-liners (mostly to
cut out or reformat specific columns, probably the most common usage).

------
Wildgoose
I don't know any other tool that is so simple and can be learned so quickly
and yet is so powerful. I use it daily.

The handful of minutes it took to learn have repaid that investment thousands
of times over.

Yes, there will be better solutions for individual cases. But awk is always
installed, always available, and for anybody quickly querying semi-structured
text data on the Unix command line it is a godsend.

------
srean
AWK really ought to have the inverse of its 'split'. Yes, I can do that with
string concatenation in a loop, but that's not ideal.
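
The user-space version is at least short. This is essentially the join() helper the gawk manual suggests:

```shell
echo "a b c" | awk '
  # join(): inverse of split(); concatenates arr[start..end] with sep between
  function join(arr, start, end, sep,    result, i) {
    result = arr[start]
    for (i = start + 1; i <= end; i++)
      result = result sep arr[i]
    return result
  }
  { n = split($0, parts, " "); print join(parts, 1, n, "-") }'
```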

