Awk As A Major Systems Programming Language, Revisited (2018) (skeeve.com)
215 points by kick 16 days ago | 80 comments

Back in the 90s when I got my first job, Awk was the gateway drug to the stronger stuff, Perl. Once you were hooked, productivity skyrocketed and managers would wander around the office saying "how does he do that 10-day thing in just 10 minutes?" Those were the days when dynamic languages were game changers in the Unix landscape. But, in fact, the mainframe folks and the PC crowd had had Rexx and BASIC for at least a decade prior. I still can't believe C was the standard for data ETL and general reporting at so many venues.

AWK was written in 1977, so Unix folks have also had AWK for quite a while too.

> productivity skyrocketed

Was the increased productivity due to Awk or was it due to Perl?

Really, any of the scripting languages.

My recollection is there was a lot of disdain for scripting languages because they could not match the speed of C or other low level languages. Today, computer speed is so many orders of magnitude faster it’s hard to believe the speed of scripting languages was ever an issue.

John Ousterhout’s paper on programmer productivity gains comes to mind:


Just a few days ago I wrote a simple awk script to parse some log files, but it was horrendously slow. I had to replace understandable loops with weird calls to builtin functions to make it fast enough for my use case.

You're doing something wrong. I've used awk to run big data reformatting jobs in under an hour that took most of a day to run in Scala on an Apache Spark cluster. In the vast majority of cases today, if speed is your problem, then you are the problem - especially since most problems fit into RAM these days, even w/o exotic stuff like RAMcloud...

try mawk, I've had it run 4x faster than gnu awk on some things.

This! It was not unusual to have 100 folks sharing a 20 MHz workstation. Trying to run an interpreted language was a pain! Heck, even compiling a few hundred lines of C code would take seconds.

Perl was 10x faster in benchmarks I did than awk. When Yahoo benchmarked scripting languages, mod_perl won but they chose PHP anyway.

Also Perl has better support for programming-in-the-large than awk, with modules, lexical scoping and CPAN.

Booking.com, IMDB and I believe the Amazon frontend are all written in Perl.

Which version of awk did you benchmark though? Mawk is pretty fast, for bread-and-butter awk stuff often significantly faster than perl. E.g.

    perl -anE 'say($F[0]) if /error/' big.log >/dev/null  0.85s user 0.02s system 99% cpu 0.867 total

    mawk '/error/ {print $1}' big.log > /dev/null  0.21s user 0.03s system 99% cpu 0.246 total

Not the OP but mawk wasn't really a thing in the 90s. Although the project was started in the mid 90s it only really saw a year of development before it languished unmaintained until 2009.

Not so. Mawk was released in 1991. Perl (as in Perl 5) in '94.


1991 to 1996 (when it was abandoned) isn't all that long for a programming language, particularly when there were already mature and widely deployed implementations of awk out there. Also bear in mind that in the 90s people weren't as disciplined about keeping their OS updated, so mawk might never have made it onto people's systems unless they built it themselves. That's a hard sell if you've already got awk installed on the host, given that the point of running awk is a short-term productivity gain (i.e. if you were going to the trouble of compiling mawk, you might as well write your script in something lower level to begin with).

Plus if you're going to talk about older builds of mawk then you can't really ignore older versions of Perl as well (which was originally released in 1987). Otherwise you're not making a fair comparison.

I should add, I have absolutely nothing against mawk. It just wasn't available on any of the POSIX systems I used in the 90s. TBH even now it's an optional install, but at least it's an easier install than it was in the 90s.

Awk has functions with named parameters.

Local variables unfortunately can only be obtained in the form of extra parameters (which are not passed by the caller).

  function foo(x, y,   # params
               z, w)   # locals
There is a convention to separate the two by some obvious whitespace.

There are no block-scoped locals: all locals have to go into the parameter list. Initial values cannot be specified.

(Speaking of which, there are hacks for simulating the feature of optional parameters with defaulted values.)
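One such hack, sketched below (the function name and defaults are my own invention, not from the thread): an omitted parameter arrives uninitialized, which compares equal to the empty string, so the function can test for that and fill in a default.

```shell
awk '
function greet(name, punct,   msg)   # punct optional; msg is a "local"
{
    if (punct == "") punct = "!"     # uninitialized param -> use default
    msg = "Hello, " name punct
    return msg
}
BEGIN {
    print greet("world")
    print greet("world", "?")
}'
```

The obvious caveat: this cannot distinguish an omitted argument from an explicit "" (or 0), which is why it is a hack rather than a feature.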

In GNU Awk, from the following experiment, the scope appears lexical:

  function bar()  { return x }

  function foo(x) { x = 3; return bar() }

  BEGIN { x = 42; print foo() }
The output is 42, which means that the "x = 3" assignment to the local variable x in foo does not affect the access to the free variable x in bar, as it would under dynamic scope.

In my career, I wrote a number of multi-thousand line gawk programs. The longest was a text formatter about 6,000 lines long. I have now been retired for 9 years and still write occasional small gawk programs.

The good old days.

Years ago, I bought the O'Reilly Awk & Sed book with the intention of becoming a linux guru and master the commandline.

Then I realized most all the awk/sed stuff looked very similar to Perl, which I already knew, and I ended up just becoming very good at Perl 1-liners.

My first programming job was on a telephone exchange system that was 50k lines of sed, awk, and korn shell.

We had a CAD workstation management system with user management, backup and restore, plotting management and a lot of other stuff. Everything built with ksh. It worked really well at the time.

I'm sorry. ;)

lol. It was a complete system with accounts, billing, 'tui' screens, all glued together by scripts, cron jobs, and awk text processing. It definitely embodied The Unix Way™ of everything is a file and small tools you can pipe in/out of to build larger systems. They did.

How did that function in terms of maintenance? Was it clear where bugs existed or where new functionality needed to be added? I'm curious, as I've never gotten to work on a system like that before.

New features? They started rewriting it shortly after I was hired. Maintenance issues were always fixed by the grey beards. One of them was blind and used a talk box, a wyse terminal, and vi.

That sounds like a man who could handle his abstractions.

It can work quite well, especially when combined with other tools that leverage the same kind of model, such as, say, Carlo Strozzi's NoSQL (his use of the term predates and is totally different from what we think of today as NoSQL), or even things like text/cli-based accounting systems. The modularity and focus on each thing doing one specific thing makes maintenance a breeze, assuming you're familiar with the tools.

A few months back I finally discovered the true power of the shell and began writing an intro book.


Your "Gitbooks" link links to a Go book, not to your awk book.

Oh yes. Apologies for that. I haven't created the GitBooks version of the awk guide yet. Will fix it ASAP.

From a right-tool-for-the-right-job perspective, can any expert let me know when is the best time to use awk or sed if other tools are also available? I know awk and sed are different tools, so maybe I'll ask a more general question: given that other tools, e.g. Python, are available, what's the suitable scenario for using CLI tools?

"Given other tools are available" is loading the question. Right now almost all Unix-like systems have Python, but which version, 2.7 or 3.x? We still have boxes in our environment that only have 2.6. If they have 3.x, what is the x? Script portability in a diverse environment is a crucial issue for system scripts and even a lot of application support scripts.

Then there's flexibility. With bash you have a massive ecosystem of command line tools, including the ability to in-line code for tools like sed, awk and perl right there in your script. Python has a great standard library, but often the best way to do something in Python is with a third-party module, so now you need to curate a module library across all your boxes. Also suppose your script needs to do things other than just process text: suppose I want to ssh onto a box and run a command? Is Paramiko installed? No? So now I need to invoke ssh in a sub-shell, which is a PITA in Python compared to bash. Yes, it's easier in later 3.x versions, but now we're back to version soup again.

We have bash scripts running in our environment that are probably 20 years old, and if I write a new one now it will still work fine in 20 years time. Often it will be a quarter of the length of an equivalent Python script and much easier to understand and reason about.

Then again, there are many cases where a Python script makes a huge amount more sense, often run as part of a bash script.

I agree, this is probably not a good question. I think it's indeed a case-by-case question and I have to master a lot of tools to make good judgements.

For now I'm teaching myself CLI tools, and I feel that if I need to pipe more than 5 times in a one-liner, I'd rather use Python.

Awk and sed are generally superior to python for simple data transformations, like cutting out specific lines or columns, or replacing characters in files. They have pretty great performance on large files.

Once things get complicated though you should probably push whatever transformations you need to do to a database.
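For concreteness, a few of those simple transformations as one-liners (the sample data here is made up):

```shell
# A tiny made-up CSV to work on.
printf 'a,1\nb,2\nc,3\n' > /tmp/demo.csv

awk -F, '{ print $2 }' /tmp/demo.csv   # cut out the second column
sed -n '2p' /tmp/demo.csv              # keep only line 2
sed 's/,/;/g' /tmp/demo.csv            # replace commas with semicolons
```

Each of these streams the file line by line, which is why they stay fast on files that dwarf available memory.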

I would like to offer a counter-argument. I’ve been torrenting TV series for years, and I always have to do a mass rename of the files afterward to get Plex to process them properly and download the metadata. They’re always named something like “TV Show Name 2014.L337.H4XXX1080p-420.mkv”.

The other day I set out to use BASH to mass rename them, and I had such trouble backslash-escaping the dot characters and whitespace to prevent sed from interpreting them specially. Eventually I gave up and searched for the pythonic way to rename them, and it was as simple as a string replacement followed by a call to os.rename() inside a for-loop. It was a breath of fresh air to escape “command line Kung Fu” and fearing a thrashing from shell globbing.

To be fair, I got my start in using sed and awk as powertools in the BASH command pipeline, but I don’t miss them compared to a language with strong data types and simple built-in methods for handling complex manipulation. Python is built in to basically every Linux distro that has BASH, and for the sake of simple transformations, it offers a lot of succinct methods that work on either 2.7 or 3.x with no external packages.

Your example shows a very particular shortcoming in the unix command line programs. The mv command doesn't play nicely with spaces in file names. I think python is a fine solution for your use case and wouldn't recommend against your code, especially if you had to build it out for a lot of different patterns.

But if you're dealing with changing stuff in text files greater than 25 gigs I'd always recommend using the unix cmd line tool set for simple modifications if performance is a key concern. It works the same everywhere and can rip through big files. But I'm a Vim guy and I live in the shell so I'm surely quite biased lol.

There are plenty of pretty easy one-liners, though, that can rename files if you can define the pattern correctly.

echo 'TV Show Name 2014.L337.H4XXX1080p-420.mkv' | awk -F . -v q="'" '{print q$0q" "q$1".mkv"q }' | xargs mv

find . -type f -name '*.mkv' -printf '%f\n' | awk -F . -v q="'" '{print q$0q" "q$1".mkv"q}' | xargs mv
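The quote-juggling above gets fragile once filenames contain spaces or quotes; a plain shell loop sidesteps xargs entirely. A sketch, using the same assumed naming pattern as the example:

```shell
# Rename "Name.junk.more-junk.mkv" to "Name.mkv", space-safe.
for f in *.mkv; do
    base=${f%%.*}                        # everything before the first dot
    [ "$f" = "$base.mkv" ] && continue   # already in the target form
    mv -- "$f" "$base.mkv"
done
```

Quoting `"$f"` and `"$base.mkv"` is what makes spaces a non-issue; `--` guards against filenames that start with a dash.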

You need zmv

Thanks! I hadn't heard of zmv before, and it fits my use case perfectly.

Thanks this is a good point. I did read some posts here that use sqlite for sort of middle storage in the pipelines.

When you are at the command line! (And you should be there more.) If you want to do quick analysis of log files or some data, the command line is your best friend! No need to open any IDE or editor, no need to open Excel files: just quick one-liners, at worst multi-liners. It's a supercharge!

Sure, but you can invoke nearly all languages (and write one-liners) at the command line, e.g.:

python -c 'import sys; f=open(sys.argv[1]);print(len(f.readlines()))' .zshrc

But I wouldn't recommend python as a good one-liner language. So the question remains, what is awk particularly good at?

Think of Awk as being to traditional tab or whitespace separated Unix command output, as `jq` is to JSON command output. Extract a field, loop and iterate, template output, etc.

Now remember, this was popular when perl wasn't even something you could depend upon on all systems.

These days there's no reason you should even do anything terribly complicated in a nasty oneliner. Just save a nice function somewhere and call it with a concise alias.

Awk splits lines into numbered fields, so it’s really convenient for things like extracting and reordering columns from a csv file. For example, getting the second field of every line is:

  awk -F, '{ print $2 }'
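Reordering is just as terse; setting OFS controls the separator used on output (the sample input here is invented):

```shell
# Swap the first two comma-separated columns.
printf 'a,b,c\n' | awk -F, -v OFS=, '{ print $2, $1, $3 }'
# -> b,a,c
```

(Caveat: this naive -F, splitting breaks on quoted CSV fields that themselves contain commas.)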

I recently tried csvsql and it seemed to be a stronger tool for processing csv files?

> But I wouldn't recommend python as a good one-liner language. So the question remains, what is awk particularly good at?

Processing regular (i.e. machine-, not human-generated) tabular text files line by line. If you already know Python there are probably few reasons to learn awk.

One of those could be working in constrained environments where you have awk but no Python. Awk is part of POSIX and part of Busybox, so there are almost no environments which have a working Python installed but no Awk, while the reverse is common enough (e.g. embedded systems). Apple plans to remove Python, Perl and Ruby from the default MacOS install in the next version, but probably not Awk (POSIX).

Good answers.

(And I hadn't realized Apple was planning on getting rid of Python, Perl, and Ruby from default installs! Wow.)

One-off jobs or jobs that you foresee won't be expanded.

For these jobs even an "import re" in Python is too much typing, compared to ubiquitous regular expressions in sed or awk. Awk is great at handling lines of delimited fields. Again, in Python you would be writing your own boilerplate to read lines from files, split those lines, convert strings to numbers, etc.
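As a small illustration of how little boilerplate that takes, summing a numeric column is a single pattern-action (the data here is invented):

```shell
# Sum the second whitespace-separated field; awk splits fields and
# converts strings to numbers implicitly.
printf '1 10\n2 20\n3 12\n' | awk '{ s += $2 } END { print s }'
# -> 42
```

The Python equivalent needs a loop over stdin, a split(), and an int() or float() conversion before it can even start adding.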

I don't know if I'm an "expert" but I only use awk and/or sed for simple shell scripts, when I'm munging around in a terminal, or for simple CLI tools I write for myself. If I'm writing code that I need to maintain, could grow, or needs to be contributed to in a team setting...I write it in a real programming language.

I've got a real soft spot for Awk but any time I want to get anything done I have to read the manual.

I know I like it, but I'd have to be working with it a lot more for the basics to sink in.

And the thing is that in almost all cases Python or (ugh) bash can get the same job done so I rarely pick up Awk.

I use awk very often at my current job when analyzing our logs. I've been slowly compiling a list of useful shell pipelines, the majority of which involve awk. Any time I need to do some novel analysis, I can steal the syntax/ideas from myself

Mawk on debian is super fast and can determine statistics like messages/sec, unique IP addresses for a particular user, etc. from 10GB log files very quickly.

I use 'tldr' for that. It's a great man summary tool.

Awk could be a sleeping giant, because it's required as part of POSIX compliance. With the dominance of Linux though, I don't know if portability is as large of a concern as it might have been historically.

I still regularly need to write portable scripts, but it is niche, and not nearly as easy as it should be.

Several of our "grunt work" servers run macOS. So you're either left with a seriously outdated set of utilities, or you need to write wrappers that first attempt to use the g-prefixed versions and then fall back to the standard names.

The options I'm trying to use aren't GNU specific, and can be found across the BSDs. But Apple is old, so you can't fully count on POSIX.

> But Apple is old, so you can't fully count on POSIX.

Strange thing to say since macOS is officially certified UNIX(r) POSIX, and has been for many years:

* https://www.opengroup.org/openbrand/register/brand3653.htm

* https://en.wikipedia.org/wiki/POSIX#POSIX-certified

However, if you check the compliance documents, you'll find macOS has a number of waivers. For example [0].

[0] https://www.opengroup.org/csq/repository/noreferences=1&RID=...

I like awk as a lisp replacement.

I have to look at the manual to remember how to write loops in bash, so I write an awk script that writes a bash script and pipes to 'bash'. People will tell you this is a bad idea because you get character escaping risks as with SQL injection - they are right, but it is so much fun.

I have done this with three layers of code generation.
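A minimal sketch of the pattern (one layer of generation only; the input and commands are invented): awk prints shell source, which is piped straight into bash. The injection risk the replies mention is exactly here, since any metacharacters in the input land unescaped in the generated script.

```shell
# awk emits one bash command per input line; bash executes them.
printf 'alpha\nbeta\n' |
    awk '{ printf "echo item: %s\n", $1 }' | bash
# -> item: alpha
#    item: beta
```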

I totally understand your pain about iteration in bash. That was one of the main reasons I ended up writing my own shell and scripting language.

It's possible to handle the character escaping with some care.

I do this on occasion, it is fun~

'89 or '90 I wrote a personal version control tool in awk, shell and rcs. Because I didn't know better.

The problem with awk today is that it has very few features that would make it superior to Perl, Ruby or Python.

The only instance where awk has utility for me is when the program is short enough to be explicitly specified at the command line, for example:

  $3 > 10 { print $6 - $5 }
For that it is awesome: you don't have to look inside another program to figure out exactly what is happening, it is explicit, etc. It is also super fast, much, much faster than splitting with a typical scripting language.

For anything more complicated than that, it offers very few benefits (I'd be hard pressed to name any) and has significant limitations.

Thus IMO the problem with awk is that it does too little and offers too little room to grow.

I think awk is better than any of the alternatives if the problem you're solving matches a pattern-expression paradigm and, as others have said, you have no external dependencies.

Granted, this is quite a constraint and perhaps there aren't many problems that fit this model, but when you do encounter one of those problems it's good to have the right tool for the job.

But you’re missing out on so much good stuff when you do this. Awk might not replace Python once you start needing external dependencies but awk can absolutely replace bash in a lot of situations.

Bash and Awk are not the same class of languages, so they can't be compared to/replace each other.

Bash is a glue language, while Awk a scripting one, so it rather makes sense to compare Awk with Python/Ruby/similar.

I will say that I have yet to see a program that, when rewritten in awk rather than python, would make awk look worth learning instead of python.

But it could be that I was just never shown one.

I tried awk before (the default implementation on most distros) for the simple task of making a template engine. I learnt a bit about awk and sed (basic stuff), but I couldn't manage to do what I wanted, which I could do in python with a few lines in one minute.

The man pages are nice, but I didn't have the patience to start reading everything just to do simple stuff, like replacing a regex pattern with the contents of a file located at a path generated from a capture group of that regex, and some other stuff.

> to do a simple task of making a template engine

Not quite sure what you mean but it does sound like awk was the wrong tool for the job there. For the sort of templating I'm thinking of shell scripts or m4 would have been a better tool. Taking some structured data and piping it to one of those is where awk shines (that and pattern matching).

It was for making a static site generator. The templating engine was a part of it, where I wanted to add functionality for components, generating event handling, adding SEO (meta tags), gluing shell code, etc., similar to jsx/vue.

Would you have been able to do that in python if you hadn't previously programmed in python, read the docs, seen examples?

Yeah, because it's easier to find what you need. I think I only came across big guides or irrelevant examples/answers, pretty old ones too.

Comparatively, perl was also easier to find stuff for in their docs. I ended up using that for some places.

I have a small one-liner wrapped in shell that I use like uniq, but emitting lines as first seen and never repeating (a hash table counting refs).
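If I'm reading this right, that's the classic "unsorted uniq" idiom: an associative array counts how often each line has been seen, and the line prints only on its first occurrence, when the count is still zero.

```shell
# Print each distinct line the first time it appears, preserving order.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# -> a
#    b
#    c
```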

I routinely use awk when cut is obsessing about IFS. Awk does LWSP compression, so you can get ' one two. three' to match properly when cut thinks field 1 (oh god, code which counts from 1, not zero...) is a ' ' space. awk '{print $1}' just works.

I used awk to compile a list of unique IP addresses seen over GB-scale inputs, 350m+ unique IPs. It was within scale of python and perl, both for memory footprint and speed, for hash constructs. Basically, Brian coded it efficiently; all the perl claims of maximal hash efficiency did not add much in terms of speed or size outcome.

I choose to code in python3, but I use awk for one liners. Its great. I avoid gawk-isms. I don't see the need.

Agree, it's so convenient for grabbing fields that I ended up writing a bash script that generates an awk script, since '{print $1}' is cumbersome to type and I can never remember how to properly output multiple fields.
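A wrapper like that can be tiny. Here's a hypothetical sketch of one (the name `f` and its interface are my own invention, not the commenter's script): it builds the awk field list from its arguments and pipes stdin through awk.

```shell
# "f 2 5" prints fields 2 and 5 of each input line.
f() {
    spec=$(printf '$%s,' "$@")   # "2 5" -> '$2,$5,'
    awk "{ print ${spec%,} }"    # trailing comma stripped off
}

printf 'a b c d e\n' | f 2 5
# -> b e
```

Because the `$N` references are generated into the awk program text, the comma between them gives OFS-separated output, which is the "multiple fields" part that is hard to remember.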

At some point I thought I had the ideal use case for awk (a git --graph filter) and spent an evening desperately putting it together because, as other commenters mentioned, it's hard to find good documentation and examples online. Sure, I have a fast and mostly-working filter now, but the code is also hard to understand or even debug. On the other hand, the examples linked in the article are actually a lot more readable than I expected, so maybe it's something to consider for small but frequently-used log parsing scripts.

If I wound up on a host in /rescue mode awk would be my go-to to fix up 'convert this to that' changes, maybe even grobble into the piped inputs of other commands to get debug data marshalled up. If you have a bigger system, there are better tools. If you have to live in the small state of a /rescue, knowing how to use sed/ed/grep/awk is data-saving.

Sometimes people observe I'm using three tools with pipes to do one job, and I freely admit grep <pat> file | awk '{print $2}' | sed -e 's/this/that/g' is probably stupid, but I do think of these atoms as tools for the job. Grep aside, sed and awk should be fully interchangeable for many pipe jobs, and when you don't need BEGIN{} ... END{}, you could do the whole thing in awk or sed alone. If it has pre- and post- states, awk is ideal. But... the mind does what the fingers remember.
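For what it's worth, when the fingers allow it, that three-stage pipe does collapse into one awk program; a sketch with invented input:

```shell
# grep <pat> | awk '{print $2}' | sed 's/this/that/g'  in a single pass:
printf 'x this-1 y\nskip me\nx this-2 y\n' |
    awk '/^x/ { gsub(/this/, "that", $2); print $2 }'
# -> that-1
#    that-2
```

The pattern replaces grep, gsub() replaces sed, and the print handles the field extraction.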

Pipes are cheap.

The AWK Programming Language (PDF ahead): https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...

Heh. My 1988 copy is sitting on the shelf next to me...

It has a great history, but the only thing I use it for anymore is splitting fields on whitespace, and that only because the 'cut' maintainers won't add this (tiny) feature.
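The difference in question: cut treats every single delimiter as a field boundary, while awk's default splitting collapses runs of whitespace and strips the leading run.

```shell
printf '  a   b\tc\n' | awk '{ print $2 }'   # -> b
printf '  a   b\tc\n' | cut -d' ' -f2        # -> "" (an empty field)
```

With cut, the two leading spaces already produce two empty fields before "a" is ever reached, which is exactly the missing (tiny) feature.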

Fex is your new best friend!


So this is nice, but it needs to be installed. It feels dumb to use awk just to grab fields, but it's everywhere by default with no effort, and it works.

This is a real issue. In principle, we can often install whatever, but it takes years in the Linux world before one can assume that any given tool or feature is ubiquitous.

And even then, things can fall apart. Python2 made it, but now it's going away. (Yes, there is a Python3 often present, but neither is a substitute for the other.)

Ugh. Why braces ( { } )? That's the biggest tiny pain of using awk. It's an odd stretch for my fingers to type them and I have to look down to be sure I hit them right.

Use one of the many tools for customizing your keyboard layout. {} are Alt+J and Alt+K for me.

braces are pretty well built into my muscle memory at this point, but I actually agree with you aesthetically, I would prefer square brackets `fex [1:3,5]`. I presume it's because of awk <shrug>

Great tool though

Awk is the ideal AWS Lambda language, and should be supported as a first-class citizen there. Add the ability to tag such a function from a URL (w/o the API gateway cruft), and AWK and netcat could replace tons of troublesome and expensive dynamic data management and ETL code that winds up living in much more complex and expensive environments today...

https://rosettacode.org/wiki/Category:AWK has versatile code snippets in AWK

