Awk: The Power and Promise of a 40-Year-Old Language (fosslife.org)
251 points by jangid 82 days ago | 118 comments

HN discussion threads for some of the links mentioned in the article:

* Using AWK and R to parse 25TB - https://news.ycombinator.com/item?id=20293579

* Command-line Tools can be 235x Faster than a Hadoop Cluster - https://news.ycombinator.com/item?id=17135841

* The State of the AWK - https://news.ycombinator.com/item?id=23240800

For awk alternative implementations, I'm keeping an eye on frawk [0]. Aims to be faster, supports csv, etc.

[0] https://github.com/ezrosent/frawk

> CSV is a complicated format

Surprisingly and unnecessarily so:

> ["DSV"] is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.

> In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

> The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).

The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
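
To make the "single special case" point concrete, here is a minimal sketch of a backslash-escaped DSV splitter in awk. The function name `dsv_fields` and the choice of `:` as separator are my own assumptions for illustration, and it works one line at a time, so it does not cover the escaped-newline case the quote mentions.

```shell
# Split one backslash-escaped DSV line into fields, one field per output line.
# dsv_fields and the ":" separator are illustrative assumptions.
dsv_fields() {
  awk -v sep=":" '{
    field = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\\" && i < n) {      # escape: take the next character literally
        i++
        field = field substr($0, i, 1)
      } else if (c == sep) {         # unescaped separator ends the field
        print field
        field = ""
      } else {
        field = field c
      }
    }
    print field                      # last field on the line
  }'
}

printf '%s\n' 'alice:C\:\\temp:x' | dsv_fields
```

The sample line comes out as three fields: `alice`, `C:\temp` and `x`. The whole parser is one branch for the escape character and one for the separator, which is the contrast with CSV quoting that the quote is drawing.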

> The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

I KNOW how CSV works, for the most part. And my brain still started tuning out/stopped building up the mental model.

The quoting also helps preserve embedded non-printable characters, newlines, etc. (yes, which can appear).

One extension of the "Unix version" would be to impose a requirement like that in JSON, where all non-printable and/or non-ASCII characters must be written as an escape sequence like "\uXXXX".

This is why I hate CSV files. Trying to reformat huge blocks of data is a job that Awk does well. The associative arrays let you build structures that let you do the heavy lifting. For record processing, Awk should be one of the first tools you look at.

"A good programmer uses the most powerful tool to do a job. A great programmer uses the least powerful tool that does the job." I believe this, and I always try to find the combination of simple and lightweight tools which does the job at hand correctly.

Awk sometimes proves surprisingly powerful. Just look at the concision of this awk one liner doing a fairly complex job:

    zcat large.log.gz | awk '{print $0 | "gzip -v9c > large.log-"$1"_"$2".gz"}' # Breakup compressed log by syslog date and recompress. #awksome
Taken from: https://mobile.twitter.com/climagic/status/61415389723039744...

Ehh. Until the 'job' gets extended and then your simple tool makes it exponentially more complex and you have to rewrite it with the more powerful tool.

The nice thing about a 1-liner is you only lose a few minutes to throwing it out entirely and rewriting it to fit a new purpose. Dwelling on what might be needed is of limited utility, because of the very real possibility that what's actually needed in the future is wildly different from what you spent all that time planning for.

This is fine. I often "prototype" my automations as shell scripts, to explore what I actually want the tool to handle. Once it gets longer than 20 or so lines, it's time to move to a better language, but I don't mind rewriting. This is a chance to add error handling, config, proper arguments, built-in help texts and whatever else.

I started to add error handling to my shell scripts and often never rewrite them. Defo agree with the sentiment that you should always be happy (and able) to rewrite a shell script; don't let its scope creep. I don't mind long(ish) shell scripts as long as the program flow is fairly linear. Too many function calls is the smell that makes me rewrite.

Choosing a "good enough for the medium term with minimal effort now" is a winner in my book, even if it's likely to be rewritten in the long term.

Exactly. I end up re-implementing my scripts if they outgrow the original scripting language anyway, because it's a good time to add proper argument and error handling, logging, etc.

Surely that isn't a weakness of a simple tool?

A 5-minute job that probably won't get extended, and that saves you from spending 20 minutes coding something up, is better than feeling annoyed that you spent those 20 minutes on the original implementation and then had to extend it anyway.

Hopefully, you also get the benefit of additional knowledge on that future implementation as well. Why wouldn't this just be a net win?

Unless you're talking about writing hack after hack after hack, eventually leaving yourself with some incomprehensible eldritch monstrosity, in which case, don't do that?

If I understand this correctly, it will gzip every line separately instead of gzipping them together... it's not really the most effective but it does work

It does not. The pipe command leaves the pipe open, and successive pipes with identical command strings reuse that same open pipe until it is explicitly closed.


Here's the link to the gawk documentation, but most flavors of AWK work similarly: https://www.gnu.org/software/gawk/manual/gawk.html#Close-Fil...
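
A small sketch that shows the reuse without needing a log file. The function name `split_by_key` and the `<key>.txt` file layout are made up for this example, and `cat >` stands in for the `gzip` command in the one-liner. If awk started a new command for every line, the `>` redirection would truncate the file each time and only the last value per key would survive.

```shell
# Write the second column of each "key value" line to <dir>/<key>.txt.
# split_by_key and the file layout are illustrative names, not from the thread.
split_by_key() {
  awk -v dir="$1" '{
    cmd = "cat > \"" dir "/" $1 ".txt\""   # one command string per key
    print $2 | cmd                          # opened on first use, reused afterwards
  }'                                        # awk closes every open pipe when it exits
}

demo_dir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'a 3' | split_by_key "$demo_dir"
cat "$demo_dir/a.txt"    # holds both values for key "a", so cat ran once for that key
```

With a very large number of distinct keys you can run into limits on simultaneously open pipes, which is where an explicit `close(cmd)` comes in.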

Wow, this is amazing. It really shows how complexity should be managed in the tool so that the user can do the naive thing and have it be accidentally optimal

It is surprising to people who expect them to behave like shell pipelines and redirections though. I somehow never got bitten by it, but have definitely corrected awk scripts written by others who didn't know about this feature.

I really love that quote "..A good programmer...", do you have a source?

I never used Awk until last year. I wanted to monitor an embedded device with little more than busybox and python on it. There was quite a bit of information in the log files (I had already written a custom log file viewer with some highlighting) but I wanted to monitor in real-time. Somehow I decided to use Awk to monitor the tail of the log file and do realtime bar-graphs by generating appropriate cursor control sequences. In the end I had about 50 lines of Awk to upload to the board and run a command to pipe the log into it - very minimally invasive and very informative.

Would recommend learning Awk with some kind of real-world use of your own. BTW it reminded me of using XSLT which I think is another often overlooked "good thing".

The biggest reason to learn AWK, IMO, is that it's on pretty much every single linux distribution.

You might not have perl or python. You WILL have AWK. Only the most minimal of minimal linux systems will exclude it. Even busybox includes awk. That's how essential it's viewed.

Something fun in that regard, speaking of minimal...the TRS-80 Color Computer community now has a version of awk that runs on NitrOS-9, a variant of OS-9/6809 originally written for the Hitachi 6309. (64K address space, no separate I and D space.)

I'm curious what linux distros don't have either some version of perl or python.

I like awk, mind, but this is not necessarily (IME) a good argument for it.

The POSIX specification includes awk, but not perl or python. The world of UNIX and UNIX-likes is larger than just Linux distributions. Depending on the utility you plan on building and the platforms you expect it to run on, it may be wiser to reach for awk than other PLs.

Modern BSDs, macOS, and Solaris certainly have Perl and Python. (iOS and Android don't, but they don't have awk either.) What other Unixes are you thinking of? AIX, HP/UX, IRIX, UnixWare, etc. should be considered retrocomputing at this point and not relevant to modern compatibility discussions.

Linux distros based on busybox, as mentioned elsewhere in this thread, are a more compelling reason for considering awk than considerations involving other Unixes.

When it comes to Python on macOS, the only version that’s installed by default is the deprecated copy of Python 2.7 that’s slated to be removed in the future. For Python 3, you need to install the developer tools. (/usr/bin/python3 ships with the OS, but is just a stub that runs the developer tools version if installed or prompts you to install it otherwise).

It’s not hard to install, but it’s not guaranteed to already be installed on every system.

You can install python and perl on BSDs, but it's different from awk, which is part of the core OS and guaranteed to be there without needing to install extra stuff.

Wasn't awk added to android in 9?

> The world of UNIX and UNIX-likes is larger than just Linux distributions

The post I was replying to specifically said Linux distros.

Anything busybox-based. I'm not sure busybox awk is very complete, either.

> I'm curious what linux distros don't have either some version of perl or python.

I imagine that DamnSmallLinux or TinyCoreLinux possibly don't have them by default. Their focus is to be as small as possible in order to download quickly and fit in a USB drive or CD. Their small size was more important back when speeds were slower and drives were smaller. They were also good for when you had a limited number of storage options and you wanted the running OS to fit completely in RAM (back when RAM was smaller).

I don't think I ever ran TinyCore without immediately connecting it to the Internet to grab a bunch of packages. Puppy Linux included Perl in its base install at one time (I don't know if it still does), and Damn Small Linux was supposed to have a cut-down version of Perl included as well.

Python definitely not, though.

Yeah, but if you are happy to program in Perl, that's basically every major Linux distro covered. Anything using DEB or RPM packaging, any machine with Git installed (which includes Windows), plus the ones I already mentioned, already have access to Perl. This is a formidable installed base with no effort needed to install a runtime.

I agree, but michaelcampbell's point seemed to be: why learn a language for its ubiquity, when more commonly used languages seem to be just as ubiquitous? So, I focused on how they're not that ubiquitous.

I see what you mean. I guess what I was trying to say is that my position is close to that of michaelcampbell's, and that I wanted to emphasize how little portability is sacrificed by adopting this position on most environments one will ever work in.

If you’re using DamnSmallLinux etc I’d imagine you can package your own awk quite easily! Perl would require a lot more packages. But all you need to do is copy a couple binaries right?

Haven't used these distros since a decade or so ago.

Not sure why I'd have to package awk. Busybox's is probably sufficient for most uses, if the need ever arose, which I don't think it normally does when using these distros.

Agreed. Not having enough space for awk would be daaaaamn small indeed.

The better question might be "which Linux distros don't have perl or python installed by default" as a lot of people are working on systems where they can't just add additional packages.

Perl has been getting cut from minimal builds of distros for a while. The default installed version of python is a bit of a crap-shoot, never mind which modules you might happen to have available.

You'll find this a lot in the embedded space. As well, you'll see a bunch of docker images that don't have perl/python.

Building a Docker image gives basically full freedom over the choice of a runtime. If your Dockerized application is written in Java or Python or PHP or C#, why not just write the tooling and scripts in the same language too? Or at least install a suitable runtime just for the scripts? Or if starting from an empty container, why not build the script into a statically-linked binary to be placed next to the application?

Typically, you want docker images as slim as possible. Both to make it faster to distribute and to prevent attacks if something escapes your application. The less in the image, the less exploitable your image is.

Beyond keeping the images slim, the times I'd reach for awk when dealing with a docker container would be when I'm debugging problems within that container. I might need to do some quick text parsing or finagling in order to troubleshoot why the application is sucking.

I'd rather not need to upload a Java script into my docker container just for quick troubleshooting.

I agree on the slimness of Docker images, but if you e.g. have some kind of video or photo CMS written in PHP, then any housekeeping or export scripts etc are better off being written in PHP as well (or even integrated into the application) given how close they're already bound with the rest of the application.

For anything beyond that, I would very greatly prefer to have "black box", extremely verbose log dumps and database dumps that I could analyse over at my actual dev machine, or a good debugger that lets me step through the code to figure out what's going wrong.

I do realise that not all languages have good tooling, or that some people prefer to use `printf` style debugging, so it may not apply to all.

A nice thing about awk vs. Perl/Python: there's a small focused set of things to learn. Once you learn them you're done.

This suggests an opening for a Perl/Python intro focused on the exact same tasks, admittedly. That seems more realistic for Perl -- unless there's someone who writes Python one-liners at the shell?

I don't think true python "one liners" are a thing, but the awkward thing about awk is that it sits in this place where what you are doing is complicated enough that you need awk, but simple enough that you need only a one liner? Those cases have been few and far between for me, enough that every time I want to reach for awk I have to go look up how to do anything more complex than printing fields. That completely defeats the point of the quick one liner.

May as well open up vim, write my 7 lines of python, and run it. Because I use it every day and don't have to look anything up, it ends up far faster. Then when I am done I either delete it, throw it in a scripts directory, or make it part of some existing infrastructure repo. And if I keep it, because I used python it is much more readable than the awk one-liner would have been.

I have tried in earnest to memorize awk's idiosyncrasies multiple times now. By the time I go to use what I learned the last time, it is months later and I have forgotten enough that I need to go look stuff up.

So in a way, here I am: The guy that writes "one liners" in python.

I think that is a good point: writing a short python script is often the best solution.

I use awk (and python) daily at work. I work with a lot of flat files, and I use awk when I am doing data quality checks. One of the "sweet spots" it hits for me is when I need to group data by value, or other relatively simple aggregations.
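
For anyone who hasn't seen that sweet spot, a minimal sketch of a group-by with associative arrays. The function name `sum_by_key` and the two-column "key amount" layout are assumptions for illustration. The trailing `sort` is there because the order of `for (k in ...)` is unspecified in awk.

```shell
# Count rows and sum the second column per distinct value of the first column.
# sum_by_key and the input layout are illustrative assumptions.
sum_by_key() {
  awk '{ count[$1]++; total[$1] += $2 }
       END { for (k in total) printf "%s %d %d\n", k, count[k], total[k] }' | sort
}

printf '%s\n' 'east 10' 'west 5' 'east 7' | sum_by_key
```

On the sample input this prints `east 2 17` and `west 1 5`. A quick data quality check is then just a matter of eyeballing the counts per key.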

Yeah, it's a different world from when I learned Awk. You might enjoy the (very short) book by the creators just because it's a great focused expression of the Unix way. But nobody needs to learn it.

Perl is sometimes "better awk".

perl actually has a one-liner way of invocation that's modeled after awk.

For example, printing the first field of each line, split on the default delimiter, can be accomplished in perl by running:

    perl -ane 'print $F[0], "\n"'
Where $F[0] is the equivalent of $1 in awk.

IMO, unless you're doing embedded work or building minimal containers, you'll pretty much always have access to a decent runtime (or several).

Python: almost every conventional server. Python dependencies are so ubiquitous that you aren't likely to find a Linux install without it.

Perl: every DEB and RPM machine, and anything with Git installed. You can't really escape it, unless you're embedded.

PowerShell (yeah, I know): every Windows machine from XP onwards (though usable only from 7 onwards), and some Linux computers if installed.

Java: lots and lots of places will have this available.

Dockerized runtime of your choice: not ubiquitous, but I expect more and more developer machines and servers to gain Docker or Docker-like container support.

There really isn't any reason to stick to AWK, unless you're working directly on embedded devices or just like using it.

> Very few people still code with the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk.

Arguably, the software world would be better off if more people did code with those 1970s languages, than with the ones we are stuck with now.

And that applies to Awk, too. As the author quotes Neil Ormos stating, Awk is well suited for personal computing, something which we have gotten further and further from at the same time as computers have become more distributed. At what point in history have such a large fraction of the human race had the ability to calculate to such an amazing order of magnitude, and at what point in history have such a large fraction of the same human race not bothered with calculation?

Awk is a great tool precisely because it puts quite a lot of expressive power in the hands of an average user on a Unix system. Sure, on a Lisp machine or Smalltalk machine there really isn't the same need for Awk: the systems languages on such machines are safe enough and expressive enough to do what Awk does. But in the Unix context — which is basically what we're all living in, with even the VMS-derived Windows more-or-less adhering to the Unix model — Awk is a godsend.

edit: correct typo

Oh man, you sound like a long lost friend. As someone who struggles to adopt really anything post ~1995 in the programming world, I couldn't agree more. I've worked for Fortune 100s my whole career, mostly in big data problem-spaces, before it ever was cool (if it even is now?), and I really feel all the problems people perceive today were solved as far back as the 1960s (e.g. Snobol4). I understand that for modern web and mobile contexts, sure, there are fancy new tools for that; but as you said, in the personal computing space, the proper tools have existed for decades.

Gawk's ability to extend it with C code is interesting as well, and pretty straightforward.

Here's the source for the fork() extension that ships with gawk...it's ~150 lines or so: https://git.savannah.gnu.org/cgit/gawk.git/tree/extension/fo...

I was able to make a (terrible/joke/but-it-kinda-works) web server with gawk using the extensions that ship with it: https://gist.github.com/willurd/5720255#gistcomment-3143007

My opinion that belongs to me is as follows. This is how it goes. The next thing I'm going to say is my opinion.

The C interop and name-spaces (also in gawk) are a bridge too far for me. By the time you need one of those, it's time to look for another language. Awk is just not enough of a language to write serious programs in. And I really like awk. It has enabled great scripting not only for log files, but also for dictionaries, back in the day when it was still hard to load one in memory.

That is my opinion, it is mine, and belongs to me and I own it, and what it is too.

It's good you're unapologetic. At the same time, these sort of features are what I love as they avoid me having to move onwards to something new, and start near ground zero. Living by the mantra "Do 2 things 1000 times, not 1000 things 2 times."

My first and only real use of awk was around 1995. I was working at a new job doing embedded software work at GE and we had a lot of documentation in SGML, written/viewed using Interleaf. Interleaf was super slow on the HP-UX workstations we had and iirc search was even slower. I got the idea to convert all the SGML files into a single HTML file and I reached for awk as I had used it for some one-liners previously. I ended up writing an awk script that generated a frameset with one sidebar frame that was a treeish table of contents and the other frame the mondo html file with anchors for the table of contents. It loaded pretty fast in the HP-UX browser and search was really fast.

I've used Python almost my entire career, but started out with the UNIX tools. I never found awk interesting, then took a peek at it recently and understood: this was the pre-perl! It had scripting-language hash tables!

PERL was originally advertised as a replacement for “awk and sed”

Yep, and I went straight to perl after learning sed, ignoring awk. awk looked even weirder than perl (I wasn't a big fan of the pattern matching style). In retrospect, I think awk is massively underappreciated (for its time and context). I can't say I'd want to work with it regularly (same for perl; in the long run, I prefer variants of C style).

First version of Perl was a replacement for C+awk+sed.

Those were the days when things like GC, hashmaps, file operations etc. were hard things on Unix.

My company mandates Windows but Git Bash has been a backdoor into Unix tools and I've recently learned sed and awk to take full advantage of it. You need to think a bit about your one liners and they'll always feel very hacky, but sed/awk (with a bit of sort thrown in) are an amazingly powerful combination for dealing with all sorts of messy data dumps. In 10 minutes I can craft a one liner that replaces a 2 hours C# console app and runs just as fast. And, surprisingly, I often find it easier to go back months later and understand the messy looking one liner than the nicely formatted, well commented, unit tested console app.

My first job getting paid to program was in awk. Processing log files.

In the middle of that job, my supervisor said: you know what, we're doing increasingly complicated things with awk and it's getting increasingly hacky... I've heard that Perl is like awk but better, do you want to learn Perl and switch to that?

And so we did. My thought then was there was little that was easier in awk than Perl, you could use Perl very much like awk if you wanted, you can even use the right command-line args to have Perl have an "implied loop" like awk... but then you can do a lot more with Perl too.

I don't use Perl anymore. Or awk.

I think I remember reading somewhere Larry Wall was inspired to create Perl in order to combine awk+sed functionality. He was sick of awk+sed being almost powerful enough to do what he needed. (I can't find a reference to this though.)

I no longer use it, but Perl was always the better solution when one thought AWK was the answer.

Perl will do those things where AWK really shines and if the problem got bigger, Perl was easier to deal with.

The problem is that awk is a very simple language, which you can learn in an afternoon. Perl is a very complex language, and is not used anymore, so you're just spending your time on something you'll rarely use.

>> Perl is a very complex language, and is not used anymore, so you're just spending your time on something you'll rarely use.

Perl is no more complex than Python, Ruby, or Powershell. If you use any of those you can be productive with Perl in a few hours.

Perl is still used, it is just not as popular as it was in the past. Do you use Git? Parts of it are written in perl. Large parts of Git were originally written in Perl, but have been migrated to C over time.

The part that's equivalent to what you'd use for your regular awk isn't very different. Sure, you can do full-scale OO programs, but that doesn't have a large impact on small string munging. I get that you might not learn it to fluff up your CV.

Also, it's usually the same kind of Perl, so you don't have to worry about whether awk is the "one true" one, or mawk, or gawk...

If you work a lot with Linux, you can pretty much count on Perl and awk always being there. So it comes in quite handy to know them.

Perl is very much still used. lol

It's used in Debian system tools and in Git, so it's still in wide use.

OpenBSD's binary package system is written in perl.

Probably as much for legacy reasons as anything else. Perl was the chosen scripting language for utilities, it works, they understand it, and they've kept with it. Sort of how they stay with CVS for their source repository.

Python isn't even installed on a base OpenBSD system.

Marc Espie rewrote the entire package system in perl in 2010, which is a bit late to be classed as legacy.


I'm not sure what was used for the version before this, but the original BSD package system was written in C.

But perl was already the "standard" for other system/config utilities, no?

I don't know what we mean by "standard," but I found a number of perl references with the following shell fragment:

    $ for x in $(echo $PATH|sed 's/:/ /g'); do file $x/*|grep perl;done
All but two hits were in /usr/sbin, and /usr/bin. I isolated those files with:

    $ file /usr/sbin/* | awk '/perl/{sub(/:.*/,"");sub(/^.*[/]/,"");printf "%s, ", $0}';echo ''
The sbin results are:

    adduser, fw_update, pkg_add, pkg_check, pkg_create, pkg_delete, pkg_info, pkg_mklocatedb, pkg_sign, rmuser, 
There are more in /usr/bin:

    $ file /usr/bin/* | awk '/perl/{sub(/:.*/,"");sub(/^.*[/]/,"");printf "%s, ", $0}';echo ''

    c2ph, corelist, cpan, enc2xs, encguess, h2ph, h2xs, instmodsh, libnetcfg, libtool, perl, perlbug, perldoc, perlivp, piconv, pkg-config, pl2pm, pod2html, pod2man, pod2text, pod2usage, podchecker, podselect, prove, pstruct, skeyprune, splain, streamzip, xsubpp, 
A perl script can't pledge() or unveil(), so I am guessing that anything sensitive has moved to C.

> A perl script can't pledge() or unveil()

It doesn't seem to support all of OpenBSD's privilege separation, but there are OpenBSD::Unveil(3p), OpenBSD::Pledge(3p), and https://github.com/rfarr/Unix-Pledge


Did not know that, thanks.

I found that to be the case many times as well. But awk also often outperforms Perl, especially mawk.

Perl was built initially as a sed/awk killer but got distracted into trying to take over the world. The interpreter for a language with 100x the number of features will always be slower. Also there's a very clear boundary for when I should use awk by itself, as part of a pipeline, or switch to a better tool. I feel like Perl has the potential to suck me imperceptibly into a huge mess where I spend 80% of my time refactoring everything.

Have you ever found a Perl one-liner slow enough that rewriting it in awk was worthwhile?

Yes but you can't learn perl as quickly as you can learn awk.

Though you can learn just enough perl to do awk-like things fairly easily. And then grow from there as needed.

IDK. On my OpenBSD system the awk man page is under 500 lines, and it pretty much covers the subject.

I've tried to get started in Perl a few times, and just found it weird. It doesn't click. Awk is kind of weird too but it's so simple it doesn't matter.

I'm sure I would eventually get Perl if I had to use it. But for me, awk and sed and shell scripting have covered my needs.

Learning awk is actually pretty simple. For years I just used the '{print $2}' version to extract fields, but after reading some short book I felt pretty confident of having understood the basics.

Sadly I don't remember which book it was, but this page looks like a good start: https://ferd.ca/awk-in-20-minutes.html

Likely the one by A, W, and K. https://news.ycombinator.com/item?id=13451454

Yes, this looks like it. Thanks :-)

There's a free awk course here for anyone interested https://www.udemy.com/course/awk-tutorial/

When you have a standardized problem setting like the implicit loop in awk, an alternative to a whole new programming language is a simple program generator of < 100 lines of code [1].

This design lets you retain easy access to large sets of pre-existing libraries as well as have a "compiled/statically typed" situation, if you want. It also leverages familiarity with your existing programming languages. I adapted a similar small program like this to emit a C program, but anything else is obviously pretty easy. Easy is good. Familiar is good.

Interactivity-wise, with a fast-running TinyC/tcc compiler backend my `rp` programs run sub-second from ENTER to completion on small data. Even with the non-optimizing tcc, they still run faster than byte-compiled/VM-interpreted mawk/gawk on a per-input-byte basis. If you take the time to do an optimized build with gcc -O3/etc., they can run much faster.

And I leave the source code around if you want to just use the program generator as a way to save keystrokes/get a fast start on a row processing program.

Anyway, I'm not trying to start a language holy war, but just exhibit how if you rotate the problem (or your head looking at the problem) ever so slightly another answer exists in this space and is quite easy. :-)

[1] https://github.com/c-blake/cligen/blob/master/examples/rp.ni...

I use awk constantly in bioinformatics, for many of the file formats designed to store genomic data, awk is the easiest tool you can use for processing.

There's even a version of awk specifically designed for bioinformatics that natively knows how to handle fasta, fastq, and sam files, among other formats.


I did the exact same thing!

quickly looking at averages/errors, a simple awk one-liner will do.
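
As a rough illustration of that kind of one-liner, here is a sketch that prints the mean and population standard deviation of the first column. The name `mean_sd` is made up, and the sum-of-squares formula is the quick version that can lose precision on large or nearly constant data.

```shell
# Mean and population standard deviation of column 1.
# mean_sd is an illustrative name; the formula is the quick, not the numerically careful one.
mean_sd() {
  awk '{ n++; s += $1; ss += $1 * $1 }
       END { if (n > 0) { m = s / n; printf "%.2f %.2f\n", m, sqrt(ss / n - m * m) } }'
}

printf '%s\n' 2 4 4 4 5 5 7 9 | mean_sd
```

For the eight sample values the sum is 40 and the sum of squares is 232, so this prints `5.00 2.00`.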

I use awk to auto-generate C header files from other header files. I work with $vendor's huge complicated kernel driver codebase. I need small pieces of $vendor's interconnected header files in order to make kernel calls to their drivers without pulling in all their code.

I only recently learned Awk enough to be useful. But I still don't reach for it when I probably should.

What are the most common cases where you reach for Awk instead of some other tools?

I recently used it to parse and recombine data from the OpenVPN status file. That file has a few differently formatted tables in the same file. Using Awk, I was able to change a variable as each table was encountered, so I could change the Awk program's behavior according to which table it was operating on.
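
The pattern looks roughly like the sketch below. The input is a simplified, made-up stand-in with two tables and an `END` marker, not the exact OpenVPN status format, and `join_tables` is a name chosen for this example.

```shell
# Track which table we are in with a state variable, then act per table.
# The section names and column layouts below are simplified assumptions.
join_tables() {
  awk -F, '
    /^CLIENT LIST$/   { table = "clients"; next }   # section header: switch state
    /^ROUTING TABLE$/ { table = "routes";  next }
    /^END$/           { table = "";        next }
    table == "clients" { bytes[$1] = $2 }           # columns: name,bytes
    table == "routes"  { print $2, $1, bytes[$2] }  # columns: address,name
  '
}

printf '%s\n' 'CLIENT LIST' 'alice,1200' 'bob,300' \
              'ROUTING TABLE' '10.8.0.2,alice' '10.8.0.3,bob' 'END' | join_tables
```

Each routing row is printed together with the byte count remembered from the client table, so the two differently shaped tables end up recombined in one pass.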

Here is a script that I use to send SMTP mail, via the gawk networking extensions. I have a few different versions, but this is the most basic:

    #!/bin/gawk -f

    BEGIN { smtp="/inet/tcp/0/smtp.yourhost.com/25";
    ORS="\r\n"; r=ARGV[1]; s=ARGV[2]; sbj=ARGV[3]; # /usr/local/bin/awkmail to from subj < in

    print "helo " ENVIRON["HOSTNAME"]       |& smtp; smtp |& getline j; print j
    print "mail from: " s                   |& smtp; smtp |& getline j; print j
    if(match(r, ",")) {
      split(r, z, ",")
      for(y in z) { print "rcpt to: " z[y]  |& smtp; smtp |& getline j; print j }
    } else { print "rcpt to: " r            |& smtp; smtp |& getline j; print j }
    print "data"                            |& smtp; smtp |& getline j; print j

    print "From: " s                        |& smtp; ARGV[2] = ""   # not a file
    print "To: " r                          |& smtp; ARGV[1] = ""   # not a file
    if(length(sbj)) { print "Subject: " sbj |& smtp; ARGV[3] = "" } # not a file
    print ""                                |& smtp

    while(getline > 0) print                |& smtp

    print "."                               |& smtp; smtp |& getline j; print j
    print "quit"                            |& smtp; smtp |& getline j; print j

    close(smtp) } # /inet/protocol/local-port/remote-host/remote-port

This allows me to bypass the local MTA (if present). The message ID is also returned, which can be useful to log.

I had to take large CSV files like {question, right_ans, wrong_ans1, wrong_ans2, wrong_ans3} and convert them into SQL insert files. There were a few caveats: some rows could be duplicates, some characters were not allowed, and some had formatting issues. The first issue was avoided by upserting, but for the other two I used Awk and Sed and put together a fairly robust script far quicker than if I had reached for Python. I probably would have reached for Python if I realised how many edge cases there were, but I didn't know that at the start, so the script just sort of grew as I went along. Now they're my go-to tools for similar tasks.

Awk is not really very good at reading complex CSVs (as defined in RFC 4180), where newlines (record separators) can appear within quoted strings. It can be done, but sometimes it's tricky.
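
To give a feel for why it is tricky, here is a rough sketch in plain POSIX awk, not a complete RFC 4180 parser. It assumes comma separators, double-quote quoting with `""` as the escaped quote, and that a record is complete once its double quotes balance. The function names `csv_second_field` and `parse` are made up for this example.

```shell
# Hypothetical helper: print field 2 of each CSV record, showing embedded
# newlines as "|" so each record stays on one output line.
csv_second_field() {
    awk '
    # Split one complete record into out[1..n], honoring quotes; returns n.
    function parse(rec, out,    n, i, c, inq, field) {
        split("", out)                      # clear any fields from the last record
        n = 0; field = ""; inq = 0
        for (i = 1; i <= length(rec); i++) {
            c = substr(rec, i, 1)
            if (inq) {
                if (c != "\"") field = field c
                else if (substr(rec, i + 1, 1) == "\"") { field = field "\""; i++ }
                else inq = 0                # closing quote
            }
            else if (c == "\"") inq = 1     # opening quote
            else if (c == ",") { out[++n] = field; field = "" }
            else field = field c
        }
        out[++n] = field
        return n
    }
    {
        buf  = (more ? buf "\n" : "") $0    # append to an unfinished record
        more = gsub(/"/, "&", buf) % 2      # odd quote count: record continues
        if (more) next
        n = parse(buf, f)
        x = (n >= 2) ? f[2] : ""
        gsub(/\n/, "|", x)
        print x
    }'
}
```

The character-by-character loop is the part that a dedicated CSV reader hides from you; the quote-counting trick only decides where a record ends.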

The PHP fgetcsv function has been more convenient when I have had more exotic examples.

If the CSV is simple, awk remains a very good tool.

CSVs with quoted fields and embedded newlines can be troublesome in awk. Years ago I found a script that worked for me. I'm not sure, but I think it was this:


There's also https://github.com/dbro/csvquote which is more Unix-like in philosophy: it sits in a pipeline, and only handles transforming the CSV data into something that awk (or other utilities) can more easily deal with. I haven't used it but will probably try it next time I need something like that.

If the CSV is RFC 4180 compliant, then it can handle it [0]. The only caveat is that you can't disable FS="" correctly. But a gawk -i ./csv.awk -e '{print $5}' would work on most CSV files I've tried.


"""I probably would have reached for Python if I realised how many edge cases there were"""

This is the counter for all the "success" stories of awk users that walked away with an underspecced and underdeveloped 5 minute solution.

Most people reach for what they know best. I'm not sure it really proves anything about relative merits.

Have found static builds of awk useful in low-dependency work. I bundled it with a Windows installer to do some wrangling we needed at install time. Another time I was sending packages to a Unix cluster, but did not have access myself. Used awk as part of the bootstrap for the package.

I used to write event-driven scripts off it - each line is a message, interpreted by awk. Something I was not able to get working with any of the awks I tried was where you append messages to the file as you are consuming it (this is kind of like code generation). I ended up doing this in python (https://github.com/cratuki/interface_script_py).

Anything that is command line based and needs small changes to text input can be done with awk. It is a very competent language for scripts.

I use it a lot to filter, slice, and dice CSV (or other delimited) or fixed-format files. Sometimes I'll use q[1] if my needs are more complex. Or awk piped to q. It can be used as a fairly decent report generator for plain-text or HTML reports.

Any time I want to process a bunch of lines in a text file, awk is my first consideration.

[1] http://harelba.github.io/q/

From what I can tell, Awk really shines in two places, transformation and collation, both of which require some form of structured file. You can transform one structure into another and you can process record by record to some form of collation or summary.

And let's not forget about the amazing commercial offering of Awk, known as Tawk (by Thompson Automation). To this day some features from Tawk cannot be found in Gawk.

Loved TAWK, but sadly they went out of business.

awk is great for data analysis - usually, I start with cut, then move to awk as complexity increases and finally to python.

I find it very unpleasant to read Awk code. It looks as bad as regex to me.

sed is pretty ancient too. I've used it a lot with Docker to alter parameters during builds.

awk is fast and really useful.

It's also generally unreadable.

I don't agree. Awk is very readable for people used to C-like languages like JavaScript. And it is much cleaner than Perl.

It is certainly more readable than sed for example.

Yeah I use sed not infrequently but try to keep things simple. Anything more complicated than a "standard" sed one-liner (google it) I will start looking for something else.
