A Crash Course In Awk (bignerdranch.com)
107 points by zdw on Oct 20, 2013 | 37 comments


Having used bash, awk, sed and a hodgepodge of utilities for twenty years, it was a revelation to finally spend some time learning perl. It's not a panacea but it does offer a much larger return for time invested in learning the basics. It is today as ubiquitous as sed & awk, and provides much more power and sophistication for no extra complexity.

$0.02


Well, considering the original remit for perl (back in the pre-Perl 5, pre-OOP, pre-CPAN days) was to glue sed and awk together with a proper programming language and full access to the POSIX library, that shouldn't be a surprise. Aren't a2p and s2p still part of the standard Perl distribution? (Haven't looked: stopped driving Perl for a living more than ten years ago.)

If there's one argument against using perl in place of these other tools it's simply the cognitive overhead of learning the extra stuff perl brings to the table on top of bash, sed and awk. On the other hand, picking up non-OO old-school perl if you're already a proficient shell scripter should be a day's work ... the important thing is knowing when to switch tools.


Author here. You're right that awk is a limited tool. That's exactly why I wanted to write about it. It just wouldn't be possible to make a comprehensive case for perl in a blog post this short, because perl does much more than awk does.

And that's kind of why I like awk as much as I do, in spite of its limitations. There's so little to the language, it's of practical use, and as a precursor to tools like perl, python, and ruby, it's historically interesting, too. And I know a lot of people who like to learn neat, useful things who don't know anything about awk.

Thanks for reading, btw - I think perl is another older tool that could stand for some evangelizing. I don't write perl, though, so I'm not the guy to do it.


I have one situation where bash + sed is preferable: embedded devices where storage is a primary cost component. Having a shell on the device is somewhat necessary for development. It so happens that the shell is also a decent language interpreter, so adding another interpreter doesn't always justify the cost.

I eventually got pretty good at doing both functional and OO-style programming (each where appropriate, and straight-up imperative scripts in many other places) in POSIX-compliant shell.


Perl is great, but today you can just as well (or more easily) use ruby, python or even php for scripting one-liners and small tasks. However, awk and sed are still fantastic options when you need just a quick bit of processing on the command line. Pipe it in, do some transformations or replacements, pipe it out. The same goes for other unix power-tools like sort, uniq, wc, head, tail, split, etc. -- they are all still very useful...
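For example, a typical quick job of this sort (with a hypothetical access.log) needs nothing beyond the shell:

  awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

which prints the most frequent values of the first column, most common first.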


Well, I have close to 30 years Unix experience so you don't have to convince me of the power of a pipeline. But I do have to disagree with your affection for awk and sed even though in the past I have used them both extensively.

The advantage of learning perl is that you can replace both utilities in your piped command line and only have to remember one syntax. And it integrates just as well as sed and awk do in any pipeline. But quite often, it will obviate the need for piping through other utilities.
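For instance (hypothetical data.txt with '#' comments and a numeric second column), a two-stage pipeline like

  sed 's/#.*//' data.txt | awk '{sum+=$2} END{print sum}'

can collapse into a single perl invocation:

  perl -ne 's/#.*//; $sum += (split)[1]; END { print "$sum\n" }' data.txt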

Perl is much closer to sed and awk than any of the other scripting languages you mention. As noted in other posts, you can even automatically convert awk scripts to perl. And it doesn't take much effort to convert sed scripts to perl where you can leverage more powerful pattern matching and procedural facilities to boot.

If you already know sed and awk, by all means keep using them, they're fine. But if you're new to the Unix command line, you'll get the most bang for your buck by learning perl instead of awk and sed.


As a sysadmin (not a web/app or systems programmer) I spent about 7 years using Perl (wrote my company's HR <-> LDAP synchronization system in it) until Python came along (for me, around 2003). Nowadays, for one- or two-liners, it's sed/awk - the simplicity of awk '{print $1,$2}' or awk '{tot+=$2; c++}END{print tot/c}' has just worked its way into my fingers. I don't even have to think about sed 's/^.client="\(.\)",dev.*/\1/' foo.txt - I just look at the log file and the sed expression appears at my fingertips.

Anything more complex than that (unless it's a really obvious fit for awk) and I switch over to Python. Its simple data model for things like HoA or AoH just imposes less cognitive overhead on me, and its rich assembly of built-in libraries makes me productive from the get-go. This sort of thing would be incredibly difficult in awk - not sure if I would have had to go to CPAN with perl - but I just assumed (and was correct) that python would have it built in.

  import json
  f=open("bus-stops.json")
  j=json.load(f)
  for a in j:
    print a['no'],",",a['lat'],",",a['lng'],",",a['name']

A simple "import json", "help(json)" at the Python command line, and 2 minutes later I was done. Also - I'm able to understand my code the next day - something I was never able to do with Perl, but for some reason I can with Python.

I probably spend 90% of my time in sed/awk, and 10% of my time in Python. Haven't touched Perl in 10 years - not because it isn't an awesome language (it really is) - it's just that I have room in my head for one full blown language at a time, and Python replaced Perl for me.


Yeah, I have no problem with python or any other language. Use what works for you. But perl does an acceptable job on every one of your examples:

  perl -naE 'say "@F[0,1]"'

  perl -nae '$tot+=$F[1]; $c++;END{print $tot/$c}'

  perl -ape 's/^.client="(.)",dev.*/\1/' foo.txt
And finally:

  use JSON;
  use IO::All;
  use feature 'say';
  $f = io('bus-stops.json')->all;
  $j = decode_json($f);
  $,=',';
  for (@$j) {
     say @$_{'no','lat','lng','name'};
  }
I still think newbies would get much more value out of learning perl basics than spending any time on sed/awk intricacies. But to each his own.


awk tip: in your averaging example, the "c++" and reference to c in the END block can be replaced by the built-in var NR (for number of records). awk's builtins are useful.
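For instance, the averaging one-liner above shrinks to (same whitespace-separated input assumed):

  awk '{tot+=$2} END{print tot/NR}'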


Good luck doing one-liners with Python, as long as it uses whitespace indentation for e.g. loops and doesn't support {} blocks as an option. :-(

I'd argue, like most others here, that Perl is the single most useful command line tool. Not the only one, of course. But afaik, you can't e.g. load a JSON lib in awk as part of a pipeline. (I deserialize dumped data structures multiple times a week in [ad hoc testing with] pipelined cmds.)

Imho, if you know Ruby or PHP like the back of your hand, don't learn another scripting language for command line use. Learn some completely different language for some other use instead.


https://en.wikipedia.org/wiki/One-liner_program#Python

You can use semicolons instead of newlines, and a block introduced by a colon can sit on the same line.

  python -c 'for c in "abc": print c; print c'
Imports and actual pipelines are a little more tedious; they can be done, but python isn't as straightforward as perl or shell-ish tools for pipelines.
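Something along these lines (Python 2 syntax to match the example above; data.txt and the field index are purely illustrative) still works in a pipeline, it's just wordier:

  python -c 'import sys; print sum(float(l.split()[1]) for l in sys.stdin)' < data.txt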


Thanks. Can you also use that for if/else, other block things?

(Last I looked -- the answer was no. It removes a use case for ideology, sigh.)


I think you're right. The closest, I think, is the pseudo ternary operator:

print "hi" if True else "bye"


No, you can't work with JSON very easily in awk. Any kind of hierarchical format like JSON and XML will give you a headache in awk, and CSV can be difficult as well.


While not as complete as a full CSV parsing lib, finding this made working with CSV in awk much easier: http://www.gnu.org/software/gawk/manual/html_node/Splitting-...

  gawk -vFPAT='[^,]*|"[^"]*"'
http://stackoverflow.com/questions/4205431/parse-a-csv-using...
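To make that concrete, a complete (hypothetical) invocation would look something like:

  gawk -v FPAT='[^,]*|"[^"]*"' '{print $2}' data.csv

which prints the second column of data.csv even when fields contain commas inside double quotes (the surrounding quotes are kept).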


That regexp fails for fields containing embedded '"' characters, but I guess you can grep for embedded double quotes ("") first.

Are there multiple variants of encoding '"' inside CSV fields? I don't know -- but some people who do know are the ones who write the CSV libs I use!

Edit: And as your link notes, it fails for embedded newlines. Imnsho, awk needs CSV (and JSON, etc.) built in, preferably as a plugin architecture. But then, why not just use the Perl superset?


Arnold Robbins created FPAT to parse CSV, but it doesn't really do that very well. I agree that it would have been better to just hardcode a CSV mode. CSV is common, so you shouldn't have to think hard in order to parse it, and FPAT is hard. PHP makes parsing CSV a breeze. You could write a good CSV parser in gawk and @include it in other scripts as a solution short of hacking gawk itself. But it's generally easier just to find some other way, such as swapping CSV for TSV -- which works better in awk.
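A minimal sketch of the TSV route (hypothetical data.tsv): once the data is tab-separated, plain awk needs nothing more than a field separator:

  awk -F'\t' '{print $1, $3}' data.tsv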

Hierarchical formats like JSON are a little different, because they don't fit the awk model very well. You could add functions to work with JSON, but working with it this way wouldn't be very awk-like. You're better off preprocessing the JSON into records with another tool to make it more awk-friendly, or simply using another language altogether.


Man, seconded. I'm not sure if a CSV-ized awk is a sensible idea, but I'd love to have it if it were. CSV might be #1 on my list of "things that will cause problems for you because they are slightly harder than you think they are".


I hear you, re CSV.

Join the dar... cough, Perl side, we have cookies. :-) We have CSV parsers and everything else, all the way up to e.g. good web libraries and the best OO among the scripting languages (Moose, ~ like the Common Lisp OO environment; more or less std for new Perl projects today.)

And there is more! You can reuse most everything you know from awk! Write: perldoc perlrun

Check for -n, -p, -i, -E flags. And, as many have noted, there is a2p.

http://perldoc.perl.org/5.16.2/perlrun.html

http://perldoc.perl.org/5.16.2/a2p.html
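A few (hypothetical) examples of the sort of thing those switches and a2p buy you:

  perl -i.bak -pe 's/foo/bar/g' config.txt           # sed-style substitution, in place, with a .bak backup
  perl -lane 'print $F[0] if $F[2] > 10' data.txt    # awk-style autosplit into @F
  a2p stats.awk > stats.pl                           # machine-translate an existing awk script to perl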

But the main reason is that we have fun. An insane programming language which throws all this "minimal mathematical notation" stuff out the window in favor of some linguistic inspirations, but still works wonderfully (do insist on keeping to the coding standards in your group. Seriously. At a minimum -- lie and say that you do, when people interview for a job at your place. :-) )


Which is why, in similar fashion to awk, we have utilities to deal with JSON [0] and CSV [1] output.

[0] https://github.com/stedolan/jq [1] https://github.com/onyxfish/csvkit
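For instance, jq can flatten the bus-stops JSON from earlier in the thread into awk-friendly records; a rough sketch, assuming the same field names as the python example above:

  jq -r '.[] | [.no, .lat, .lng, .name] | @tsv' bus-stops.json | awk -F'\t' '{print $4}'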


No question you can do more with Perl. But I wonder if you've checked out more recent versions of gawk. Modern gawk is a better language than the awk of 20 years ago, so there's less reason to dump it and use Perl.


A pretty good article. Events aren't a bad metaphor, but it's actually simpler than that. The whole program, except BEGIN and END blocks, is a big loop. Blocks with conditions in front are just if statements without having to type if. They're executed one after the other, and they can affect each other by setting variables. If it encounters a next statement, no further commands are executed and it starts over on the next line. The beauty of it is that it saves you from having to set up the loop and write I/O commands to loop through a text file.
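In other words, a condition in front of a block (hypothetical app.log):

  awk '/error/ { errs++ } END { print errs }' app.log

is just shorthand for writing out the if yourself:

  awk '{ if ($0 ~ /error/) errs++ } END { print errs }' app.log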


The event / pattern-matching analogy was pretty great for me. I never really bothered much with awk, possibly out of sheer laziness, but with this small piece of knowledge, it somehow feels much more accessible and logical to use.

That said, the loop concept and using next is also a very valuable piece of information. So thanks to both!


Nice article. awk is a fantastic tool for getting quick insights into not-so-large data. This is my take, which I wrote last Tuesday, incidentally: http://hkelkar.com/2013/10/15/rolling-up-data-with-awk/


Your post caught my eye. Since I mentioned earlier that learning perl was a good investment, I thought I'd just show a quick equivalent to the awk script in your blog post.

This produces the same output as your awk script (except it's sorted alphabetically):

  #!/usr/bin/perl -F, -an "$@"
  $wins{$F[0]} += $F[6];
  $losses{$F[0]} += $F[7];
  END {
    print "manager,total_wins,total_losses\n"
    for (sort keys %wins) {
      print "$_,$wins{$_},$losses{$_}\n";
    }
  }
EDIT: should also mention that you can automatically convert awk scripts to perl with the a2p utility which should already be installed along with perl.


Here is another awk version of that script that's a little more concise and sorts, like the above, using newer features that are in gawk:

  #!/usr/local/bin/gawk -E
  BEGIN{
  	FS=","
  }
  {
  	total_wins[$1]+=$7;
  	total_losses[$1]+=$8;
  }
  END{
  	print "manager,total_wins,total_losses"
  	n = asorti(total_wins, managers)
  	for (i=1; i<=n; i++){
  		m = managers[i]
  		print m "," total_wins[m] "," total_losses[m]
  	}
  }
(I recommend using gawk over mawk or nawk. You might need to install it over the awk that came with your OS.)


Thank you mtdewcmu, did not know about the built in sort in gawk, will make the switch.


Using gawk-only features means sacrificing portability, but gawk is so much more refined than the other awks. I don't know why it's not the default awk on a lot of OSes.


wouldn't installing gawk on all target machines take care of portability?


Portability usually means taking the machine as you find it. If you're pre-installing stuff then that is just configuring your environment.


Yes. As long as you have the right version of gawk, you're good to go.


thank you tux1968, will consider learning perl! I did have a query on the post asking how perl would fare and you just gave that answer.


A correct FizzBuzz in awk (the version in the article doesn't print numbers in the general case):

    $ seq 1 100 | awk '!($1%3){printf "Fizz";p=1}!($1%5){printf "Buzz";p=1}!p{printf $1}{printf "\n";p=0}'


Argh. I should know what FizzBuzz is before I write one. I've updated the implementation in my post to be... not wrong. (It's a little different from yours, though.)


I hope it's shorter :) 60 chars:

    !($1%3){$2="Fizz"}!($1%5){$2=$2 "Buzz"}!$2{$2=$1}{print $2}


Nice. Using $2 as an auto-clearing global is a neat little hack!


crash course indeed. I just learned the basic concept of awk in just a few minutes. superb article!



