Sculpting text with regex, grep, sed and awk

gpapilion · on Jan 23, 2012

Sed and awk are two underused Unix tools. Once you learn how to use them well, you'll constantly surprise yourself with what you can do in a simple shell script.

mattmight · on Jan 23, 2012

Agreed.

awk, in particular, is a great not-too-hot, not-too-cold ad hoc database engine for shell scripts where firing up MySQL would just be overkill.
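
To illustrate the "ad hoc database" idea, here is a minimal sketch of a GROUP BY-style query in awk. The sample data, its three-column layout, and the function names staff_data and salary_by_dept are all hypothetical, made up for this example.

```shell
# Hypothetical sample data: name, department, salary (whitespace-delimited).
staff_data() {
  printf '%s\n' \
    'alice eng 120' \
    'bob ops 90' \
    'carol eng 100'
}

# Roughly: SELECT dept, SUM(salary) FROM staff GROUP BY dept ORDER BY dept
salary_by_dept() {
  # total[] is an associative array keyed by department ($2).
  # awk's "for (d in total)" order is unspecified, so sort the result.
  awk '{ total[$2] += $3 } END { for (d in total) print d, total[d] }' | sort
}

staff_data | salary_by_dept
```

Piping through sort is a deliberate choice here, since awk does not guarantee any iteration order over array keys.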

pkrumins · on Jan 23, 2012

I wrote two e-books to teach everyone awk and sed:

awk one-liners explained http://www.catonmat.net/blog/awk-book/

sed one-liners explained http://www.catonmat.net/blog/sed-book/

Check them out if you want to become proficient at shell scripting!

khafra · on Jan 23, 2012

I've learned sed well enough to replace simple patterns usefully, and awk well enough to do a '{print $2}'. I could doubtless do much more if I really spent some time learning them. But, with the current ubiquity of Python, is there a big benefit offered by awk and sed?

billjings · on Jan 23, 2012

Python's just a different tool. Sed and awk fall down with a sufficiently involved task, but they shine in duct taping something together quickly at the command line or in a short shell script.

Two reasons why: one, everything you realistically would want to do with sed and awk is available immediately. You don't have to import any libraries or do any setup work, and the language itself implicitly assumes that you're iterating over delimited text. Getting to '/^GET/ { print $3 }' takes a lot more characters in python than awk.

Two is that in the case of awk, the language model is superior for simple parsing of structured text. If I want to extract some content from a specific XML file, for example, the event based programming model gets me there quickly and without any specialized libraries.

Of course, python wins as your problems get harder and veer away from sed+awk's strengths. My rule of thumb is that as soon as I start thinking I should break things out into functions, I switch to a stouter programming language.
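
For anyone who hasn't seen the pattern-action form in use, a small sketch of the '/^GET/ { print $3 }' program quoted above, run against a made-up log. The log format (METHOD STATUS PATH) and the function names sample_log and get_paths are assumptions for illustration only.

```shell
# Hypothetical log format: METHOD STATUS PATH
sample_log() {
  printf '%s\n' \
    'GET 200 /index.html' \
    'POST 302 /login' \
    'GET 404 /missing'
}

# awk reads each line, splits it on whitespace, and runs the action
# only for lines matching the pattern. No loop or setup code needed.
get_paths() {
  awk '/^GET/ { print $3 }'
}

sample_log | get_paths
```

The implicit read-split-match loop is what keeps the program this short; the pattern acts as the "event" and the braces hold the handler.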

cema · on Jan 23, 2012

  Getting to '/^GET/ { print $3 }' takes 
  a lot more characters in python than awk.

True. Then there is Perl which is closer to awk/sed (intentionally, too) and thus more compact than Python.

  perl -lane 'print $F[2] if /^GET/'

It tends to grow unwieldy as the problems (more precisely, solutions) become more complex.

dredmorbius · on Jan 23, 2012

The principal advantages of awk:

It's everywhere. It's part of the POSIX standard, which means that any POSIX implementation will have it. Including a surprising number of embedded systems. Even Perl is slightly less ubiquitous.

It's simple. The entire command set is in a single manpage. This is a two-edged sword: you can quickly scan the entire command set, but there are a limited number of features.

It's fast. Both in startup and execution. And variants may differ in their speed of processing specific code (gawk and mawk may differ by an order of magnitude, either way, in my experience). There's no initial scramble to read diverse libraries as in Perl or Python. But there's no functionality provided by these diverse libraries.

There's a large set of known idiomatic code to accomplish standard tasks. Can be said for many languages, but it's true.

Awk is very useful for many standard sysadminly tasks. There are other tools which fit the bill, but awk is certainly among the useful tools in your bag.

rue · on Jan 23, 2012

As a counterpoint, I'd peg its ubiquity as the only advantage for someone who isn't primarily a sysadmin and already knows Python, Ruby or Perl (and surely others). The latter produce much more maintainable code, especially in larger scripts.

For those increasingly rare times when higher-level scripting languages aren't an option, one can look up the necessary syntax.

simcop2387 · on Jan 23, 2012

Awk happens to be one of those tools that I know I really probably should learn properly for more than doing awk '{print $2}'. Anyone know a good resource for learning awk in and out?

high5ths · on Jan 23, 2012

I think I learned it by reading this, long ago: http://www.grymoire.com/Unix/Awk.html

mattmight · on Jan 23, 2012

Read the AWK section of the article.

It covers about 90% of AWK, but in condensed form.

For using AWK, try the resources linked at the bottom of the article, like Eric Pement's one-liners and Bruce Barnett's page:

http://www.grymoire.com/Unix/Awk.html

http://www.pement.org/awk/awk1line.txt

The man page for awk is pretty good too.

pkrumins · on Jan 23, 2012

See also Eric Pement's one-liners explained:

http://www.catonmat.net/blog/awk-one-liners-explained-part-o...

TheCapn · on Jan 23, 2012

When I got my first job out of university I was doing C coding for a dept. that shared space with a group that had heavy Unix scripting jobs. They were swamped with work and I was relatively free. One day one of the acting Business Analysts asked me if I knew Unix. I responded "Sure, I know all the basics." He asked if I knew sed, to which I responded "No, not really." He turned away at that point, and I went to Google.

At that point I picked up a few awk/sed tutorials (I already was quite familiar with grep) and suddenly saw the world for what it could be. I'm seriously blown away day to day by the way I make things easier with these two tools. Parsing out data from massive files along with trying to do lots of adjustments to scripting files is suddenly super easy. It makes a lot of tasks easier and I look at some of the stuff I do in my new job and wonder how I'd get through without these tools. Hell, I even wrote a crappy C program to do the basic global search/replace that existed in sed before I realized how it worked in vi :%s/.../.../g

I also get more slashdot jokes now...

tl;dr - You don't know Unix until you know sed+awk.
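
The sed counterpart of the vi :%s/.../.../g command mentioned above can be sketched as follows. The wrapper name replace_all and the sample text are hypothetical; the sed 's/old/new/g' form itself is standard.

```shell
# replace_all PATTERN REPLACEMENT
# Replaces every match of PATTERN (a basic regular expression) on each
# input line. Assumption: neither argument contains "/" or "&", which
# would need escaping before being spliced into the sed expression.
replace_all() {
  sed "s/$1/$2/g"
}

printf '%s\n' 'foo bar foo' 'no match here' | replace_all foo baz
```

Without the trailing g, sed would replace only the first match on each line.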

qntm · on Jan 24, 2012

Is it acceptable to not know sed, grep or awk but to know Perl?

gmaslov · on Jan 23, 2012

Oh dear. I like this article, but using XML parsing as the example for sed made me cringe hard. Please never, ever attempt to work with XML using regular expressions!

bwarp · on Jan 23, 2012

The "can't parse XML using regular expressions" argument is possibly flawed as it's a false dichotomy. XML is a subset of text, therefore it is parseable with regular expressions, but in the context and constraint of text, not as structured XML. I.e. you can parse the textual content but not the structure with regular expressions.

It's possibly a hack but ignoring a level of abstraction for the sake of simplicity works in a lot of cases absolutely fine.

Let's also add the constraint that an XML parser can't mathematically parse broken XML whereas a regular expression can extract data from XML. That end of the problem is far more interesting.

GEB made me appreciate this fact after some rather late nights and considerable amounts of alcohol.
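
A rough sketch of the "extract data from broken XML" point: the input below is deliberately malformed (an unclosed element), yet a regular expression still pulls out the text it is asked for. The element name title, the sample input, and the function name extract_titles are assumptions for illustration; this treats the input purely as text and knows nothing about nesting.

```shell
# Print the text of any <title>...</title> pair that sits on one line.
# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. [^<]* stops the capture at the next tag.
extract_titles() {
  sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p'
}

# Deliberately malformed input: <author> is never closed.
printf '%s\n' \
  '<book><title>First</title>' \
  '<author>Somebody' \
  '<title>Second</title></book>' | extract_titles
```

This only sees one title per line and breaks on titles spanning lines, which is exactly the structural blind spot described above.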

mattmight · on Jan 23, 2012

True.

To be fair, I noted this limitation in the example, and said that if the nesting structure is important, you should use a tool that can handle context-free languages.

My more general warning was that if you find yourself exploiting the Turing-completeness of sed, you're using the wrong tool.

tjpick · on Jan 23, 2012

I agree.

However, many regular expression implementations are not regular and have enough power to parse nested structures.

aidenn0 · on Jan 23, 2012

"awk manipulates an ad hoc database stored as text, e.g. CSV files."

I love awk and use it on a daily basis, but its biggest weakness is CSV files, since data so often contains commas. Please don't use CSV as an example of what awk is good at!

[edit] Easiest workaround if you do need to do something quickly with a CSV is to just use sed to replace unquoted commas with a string not in your data; for non throwaway uses, there is a CSV library for awk.
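
A sketch of the "translate to a safe delimiter first" workaround. Note that this swaps in awk for the sed step described above, since tracking quote state character by character is easier to get right there. The function name csv_to_tsv and the sample row are hypothetical. Assumptions: no tabs in the data, no doubled "" escapes, and no newlines inside quoted fields.

```shell
# Convert commas outside double quotes to tabs and drop the quotes,
# so later awk stages can split safely with -F '\t'.
csv_to_tsv() {
  awk '{
    out = ""; inq = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") { inq = !inq; continue }  # toggle quote state, drop the quote
      if (c == "," && !inq) c = "\t"           # unquoted comma becomes the delimiter
      out = out c
    }
    print out
  }'
}

# The quoted field keeps its embedded comma.
printf '%s\n' 'alice,"Portland, OR",42' | csv_to_tsv | awk -F '\t' '{ print $2 }'
```

For anything beyond throwaway use, a real CSV tool or library is the safer route, as the comment says.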

bingaling · on Jan 23, 2012

Csvkit may be useful: http://news.ycombinator.com/item?id=3477771

dredmorbius · on Jan 23, 2012

Also csvtool (just went hunting for it after seeing your post). https://forge.ocamlcore.org/projects/csv/

brendano · on Jan 23, 2012

Yes! -- translation into a strictly delimited format is key.

g3orge · on Jan 23, 2012

the link is missing in the "AWK, according to its creators, Aho, Weinberger and Kernighan" note.

mattmight · on Jan 23, 2012

Thanks for catching that! It's fixed.