
Learn just a little Awk (2010) - mason55
https://gregable.com/2010/09/why-you-should-know-just-little-awk.html
======
kumarharsh
Recently, I was helping out a friend in analyzing some RNA samples for her
work. These samples are huge - like nearly a gigabyte of data. There was this
tool which was recommended for the job - mirexpress. It was a small job,
perhaps 10 minutes worth of effort. To make my work easier, I provisioned a
beefy (and costly) machine on Azure to do the job, took a quick look at the
clock (it said 11 PM), ran the tool, and relaxed. The tool crashed while
reading the file.

In an attempt to fix the bug, I opened mirexpress's code. And all my
confidence in my programming ability vanished when I saw its innards. I
understand that the code may have been written by scientists who had no
experience in programming, but I have never been so utterly _disoriented_ by
bad code. Anyways, after hacking away at the mess for about 3-4 hours, I
realized that this was a fool's errand and thought I'd just phone it in the
next day saying I couldn't do it. I went to sleep thinking that it was already
late and I'd be late for work the next day.

\- 5 minutes later -

I woke up with a start, recalling this nifty tool called _awk_. I had last
used it maybe 3 years ago, and before that only in college. But I could see
how awk could do some of the things which mirexpress was claiming to do. So I
fire up my computer, write an awk script - 2 lines only! TWO FUCKING LINES!
And it runs like a charm - eats away at megabytes of sample data and gives me
results I can show. So then like any rational person, I spent the remaining
hours re-discovering awk and forgot to sleep. Pissed away the whole next day
(and some part of the day after that too!) :-D

It's really fascinating that these nifty little tools invented DECADES ago are
still going strong, and there's been no _evolutionary_ leap in the areas where
tools like awk/grep/sed excel.

~~~
AceJohnny2
Nice story! I have a friend who worked in bioinformatics, but he was good at
programming and went to better-paying pastures.

I wonder if the problem is twofold: 1) a lack of education compounded by 2)
the rapid evolution of computer systems.

Unix is a rare beast in that not just its philosophy but even its components
survive to this day and remain relevant[1]. People in unrelated fields rebuild
tools that could just as well be assembled using Unix's basic components, but
they're just not _aware_ of them. And why would they look for these antiquated
tools? They've been trained to reasonably expect old tools to have been
replaced by newer, better, more featureful ones.

[1] AWK was created before I was born, and I'm among the more senior engineers
on my 20+-person team.

~~~
ENGNR
I think part of it too is that it's exploratory. Sometimes I use javascript to
write music, and as creative twists and turns happen, the code folds in on
itself in ugly ways that don't happen when solving a clearer problem.

Compounded by the fact that in science, once you have the result, the code is
probably just an artifact, so there's no real reason to refactor it.

~~~
kkaranth
> Sometimes I use javascript to write music

Could you explain what you mean by this?

~~~
ENGNR
Just messing around and nothing worth showing, but the web audio API gives all
the low level stuff, and then hot module reload from webpack means you can
change a function while the music is playing and it updates in real time. Not
being bound by a GUI makes it trivial to do things like arpeggiate or automate
filters, and then orchestrate that at higher levels

It's easier to break the mould if you're not bound by other people's software,
but starts to look awfully sciency if you explore too far : D So it's useful
to keep wiping the slate to not be bogged down by previous experiments

~~~
usrusr
So are you programming on the sequence level (notes, parameter automation)
instead of on the synthesis level? From a previous life as a music software
geek I remember an abundance of programming environments for synthesis, and a
shocking lack of options for programming one level higher.

~~~
abhishekjha
I have never done music programming. Where can I read more about this?

~~~
cat199
not 100% focused on the programming/sound design side of things, but
[https://linuxaudio.org/resources.html](https://linuxaudio.org/resources.html)
has a lot of good pointers.

pd, processing, chuck, cm/clm, or just old-school mod-tracking are some good
ways to get going

------
MaDeuce
I was first exposed to Awk when I started work at Bell Labs in the late 80s.
Until then, I'd been using either Lisp or C exclusively and was really blown
away by how simple some things were in Awk. I used it with impunity to munge
all sorts of data for input into fault prediction tools I was working on.
Speed was never an issue for me, so I never explored the potential
improvements offered by 'awkcc'. Although perl was becoming the new hotness at
that time, Awk remained my go-to tool for many years.

If you are interested in learning Awk, I highly recommend "The AWK Programming
Language" by Aho, Kernighan, and Weinberger. It's about the same size as the
original "The C Programming Language" and is equally well-written. Previously
on HN:
[https://news.ycombinator.com/item?id=13451454](https://news.ycombinator.com/item?id=13451454)

~~~
jandrese
Perl was basically written because Larry Wall found Awk's syntax to be a
little too cryptic. In the language design business this is what we refer to
as baby steps.

Also, Awk isn't great for making reports, which is why Perl 5 to this day has
an awkward report creation system[1] that looks like some COBOL refugee
instead of idiomatic perl code.

[1]
[https://perldoc.perl.org/perlform.html](https://perldoc.perl.org/perlform.html)

~~~
kazinator
> _Perl was basically written because Larry Wall found Awk 's syntax to be a
> little too cryptic._

That seems ridiculous; where is it substantiated?

When Wall posted Perl to comp.sources.unix for the first time, he wrote _" If
you have a problem that would ordinarily use sed or awk or sh, but it exceeds
their capabilities or must run a little faster, and you don't want to write
the silly thing in C, then perl may be for you."_

Or rather, not Larry Wall, but the apparent newsgroup moderator added that
text, lifting it from the Perl manual page.

Thus he was pitching it as something that performs faster than awk and sed,
with a greater range of capabilities.

~~~
jandrese
That was a little tongue-in-cheek, but there is a comment in the Camel book
about how he wrote perl because he was scared of Awk's parser.

~~~
vram22
Could have been a joke. His writing (and talks, like the "State of" ones) is
full of them, some of them a bit subtle too :)

------
djtriptych
One of the best things I ever did for my programming career was read The Linux
Programming Interface cover to cover. Obviously I didn’t expect to completely
grok 1500ish pages of tech details, but a survey of how Linux is organized and
what it does well was a revelation.

As a result, I’m WAY less likely to reinvent the wheel. I love to argue down
tech proposals at work with something along the lines of “Linux already does
that for you”. This sort of understanding can easily save man-years of effort
on moderately complex undertakings.

As I’m writing this I again wonder if some sort of professional licensing is
needed in software engineering. A bioinformatics PhD has absolutely no way of
assessing potential CS consultants, and picking the right one could save so
much money, time, and effort.

------
dbolgheroni
Some people might think "why learn awk today when there is Python, Go, etc."
But a lot of tasks (even big ones) can be done more easily with this tool.

The fact that most OSs have a POSIX compliant version of it in the base system
also make it very valuable, not just for the people crunching data but also to
sysadmin/devops people.

~~~
jandrese
Awk works great in pipes on the command line, unlike Python, where you would
be writing a multiline script to accomplish the same task.

My 99% use case is extracting particular fields from some input stream.

    
    
      awk '{print $1,$5}'
    

That's a tremendously common problem and awk is the best tool for the job.

I once started diving deeper into awk and there is an impressive amount you
can do with it, but I hit a point pretty quickly where it would be cleaner to
write the script in a more traditional language where my successor won't be
cursing my name for writing something complex in an obscure language.

~~~
Spivak
If you just want to extract fields the cut utility might be a better fit.

    
    
        cut -d $delim -f $field1,$field2,...

~~~
kjeetgill
I first learned awk precisely to replace cut. Not being able to reorder fields
makes it an especially narrow tool. Is there anything from cut I'm missing out
on?
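
For what it's worth, cut writes selected fields in input order no matter how
you order the list, while awk can actually reorder (a quick sketch):

      $ echo 'a b c' | cut -d' ' -f3,1
      a c
      $ echo 'a b c' | awk '{ print $3, $1 }'
      c a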

~~~
mmt
> Is there anything from cut I'm missing out on?

It's a narrow tool. Other than the inherent advantage this brings in
clarity/simplicity in a script, there's nothing else to it.

~~~
kazinator
Scripts in which you have both cut and awk are not simpler or clearer than
ones in which you only use awk.

~~~
mmt
Although I completely agree with this for any given "one-liner" or pipeline
(even if it's multiple lines), I don't believe the mere use of awk in one
place in the script means using cut elsewhere isn't simpler/clearer there.

Consider a script that originally had no awk but just a

    
    
      field=$(somecmd | cut -d: -f3)
    

but now has been modified to have awk elsewhere. Did the above just become
less simple or clear? If so, is it worth changing it to the below for clarity?
Does it matter if the awk predates the cut, instead?

    
    
      field=$(somecmd | awk -F: '{print $3}')
    

I say "no" to both, although I do recognize the argument for just using the
same, consistent tool everywhere, instead.

------
akavel
Ok; my favorite awk one-liner I found somewhere and learnt by heart is for
extracting "blocks of text between some starting pattern/line and some ending
pattern/line"; most often, that would be "functions with particular name, with
their whole body". Brace yourselves:

    
    
        awk '/^func Test/{p=1}; p; /^}/{p=0}'
    

EXPLANATION

First of all, these are 3 separate "commands", separated by ';'. In order:

    
    
        /^func Test/ {p=1}
    

— if line matches regexp '^func Test' (i.e., starts with "func Test"), then
set variable p to 1 (a.k.a. "True").

    
    
        p
    

— equivalent to any of:

    
    
        p { print }
        p { print $0 }
        { if (p) { print $0 } }
    

meaning: if variable p is true-ish (in the case of this script, if p==1), then
print the current line (if no action is specified after a condition, the
default is {print}).

    
    
        /^}/ {p=0}
    

— you may have guessed by now: stop printing after encountering the end of the
function (a line starting with '}').

~~~
dbolgheroni
You can get the same behaviour with sed without "programmability":

    
    
      $ sed -n '/^func Test/,/^}/p' file
    

Which means: match 'func Test' at the beginning of a line and print (the p
command at the end) until you find a line beginning with '}'.

This is because the print command (p) accepts 2 addresses to delimit a range.
I used regexes as addresses but you can, for instance, use line numbers also:

    
    
      $ sed -n '10,20p' file
    

This prints the file's lines 10 through 20.

By default, sed prints the pattern space (each line, with any modifications) at
the end of the script. The -n I used is to avoid that.

~~~
akavel
Hah, lol, thanks, Today I Learnt ;) Will try using this next time, probably.

That said, there's one extra advantage to the awk script: by rearranging the
expressions appropriately, I can choose whether to include or exclude the first
and last lines in the output (i.e. 'FIRST; p; LAST' vs. 'FIRST; LAST; p' or
'p; FIRST; LAST', etc.). Is this also possible with sed? :)
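
To spell out the variants (awk runs the rules in order on each line, so, if I
have the mechanics right):

      awk '/^func Test/{p=1}; p; /^}/{p=0}'   # include first and last line
      awk '/^func Test/{p=1}; /^}/{p=0}; p'   # include first, exclude last
      awk 'p; /^func Test/{p=1}; /^}/{p=0}'   # exclude first, include last
      awk '/^}/{p=0}; p; /^func Test/{p=1}'   # exclude both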

------
curiousgal
Awk is a neat language, but the obsession of some programmers with condensing
awk commands into terse one-liners is just off-putting.

~~~
3rdAccount
I disagree. To me that is the value. If I wanted to write a script I'd use
Python. If I want to do something as a one-liner I use Awk and don't worry
about anything, as it isn't saved.

~~~
mehrdadn
> To me that is the value.

Interesting... to me the value (er, _a_ value) is that you can write

    
    
      /PATTERN/ { print $3 }
    

instead of

    
    
      import re, sys
      for line in sys.stdin:
        if re.search('PATTERN', line):
          print(line.split()[2])
    

And another value is that it's more likely to be preinstalled than Python.

~~~
fireattack
Does this python example actually work? What does line.split()[2] do?

~~~
mehrdadn
[https://onlinegdb.com/SkTP0yfW7](https://onlinegdb.com/SkTP0yfW7)

~~~
fireattack
Thanks. I thought /PATTERN/ { print $3 } was supposed to extract the 3rd
capture group of the regex.

------
rbc
I mostly think of awk as a query language for text that's in record format. It
doesn't seem to handle unformatted text very well and I tend to use sed or
some other kind of tool to clean it up before feeding it to awk.

I like the baked in logic for dealing with record/field formatted data. One
thing that seems to hang people up is the lack of a concatenation operator.
You just put things next to each other. It looks a little weird. Having
associative arrays is a nice plus-up from shell programming.
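
For example, a tiny sketch of concatenation by juxtaposition:

      awk '{ id = $1 "-" $2; print id }'   # $1, "-" and $2 joined with no operator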

------
b0rsuk
My boss recently asked me to write a couple of very simple Python scripts for
mobile translations. I had to convert files to replace &#123456 type sequences
into proper UTF-8 codepoints. I had to replace a bunch of unicode codepoints
with a mirrored en-us side. I had to convert lines like "abc" = "def"; into a
trivial .json file (a single flat dictionary).

The task was very simple in Python, but would be absolutely trivial and fun in
AWK if it a) had first-class unicode support, including conversion, and b) had
support for quoted fields. I think it needs a new variable, like QC (Quote
Character) or FD (Field Delimiter).

I would put money on a kickstarter or other crowdfunding initiative to
modernize AWK. I don't mean by slapping Python or another programming language
onto it, but by fixing long-standing issues with it. I think a _for_ loop like
in Python, Rust and VimL would be a better fit for an otherwise simple language
(only C-style fors are available in BEGIN/END).

~~~
hawski
What do you mean by C-style for? for (key in array) is POSIX.

What I mainly miss in awk is a structural regular expressions mode. With it I
would not miss the lack of structures, or the fact that functions like gsub
assign to a variable in place instead of returning a string.
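
A minimal sketch of the usual workaround, copying first so the original
survives:

      awk '{ s = $0; gsub(/foo/, "bar", s); print s }'   # $0 is left untouched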

~~~
b0rsuk
I mean for(x=0;x<10;x++).

This stands out like a sore thumb in a language famous for its brevity and
frictionless syntax. There should be a better way.

------
Pete_D
Note that you can put conditions before the braces, so you can also express
the example

    
    
        awk '{if ($(NF-2) == "200") {print $0}}' logs.txt
    

as

    
    
        awk '$(NF-2) == "200" { print $0 }' logs.txt
    

IMO, this style scans easier for most of the grep-ish use cases.

~~~
kazinator
{ print $0 } is the default action if you don't specify one.

Using "200" in quotes makes sense if you're looking for an exact string match;
this will fail if the datum is 0200, or 200.0.
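
For example (awk compares numerically when the field looks like a number and
the other operand is a numeric constant):

      $ echo 'a 0200 b' | awk '$2 == "200"'   # no output: string comparison
      $ echo 'a 0200 b' | awk '$2 == 200'     # prints the line: numeric comparison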

~~~
newscracker
My guess is that the GP's example is to get all HTTP status 200 lines from a
web server log file. In that case, a precise match for "200" is fine.

------
jedberg
As a sysadmin, knowing awk, cut, and join is what lets you solve problems in
seconds instead of minutes. I can't tell you how many times being able to
quickly slice up a log file led me straight to the problem I was trying to
solve.

------
slx26
>> To this day, 90% of the programmers I talk to have never used awk.

Ok, time for someone that doesn't regularly use awk to say something. I
understand that awk is great. The language being terse is nice if you use it
regularly, but otherwise it's very easy to forget, and it's strange to look at
it the first time. 90% of the programmers probably don't regularly need the
kind of functionality awk provides, and the few times they do, they can create
a simple script in a language they know better.

Awk can be a super useful tool, but I think it's reasonable that most
programmers don't use it, and the language is not designed to be used by
everyone.

In my opinion, a more universal alternative to this would be something like a
web application that allows visual programming and translates the operations
to awk, and possibly other languages too (the visual language could encode
only a subset of awk, only common operations). You would learn awk yourself if
you use it regularly, and you would know that you can simply use an intuitive
interface to solve those formatting problems you need to handle from time to
time otherwise.

------
lkurusa
Julia Evans (@b0rk) has a great zine about Awk:
[https://twitter.com/b0rk/status/1000604334026055681](https://twitter.com/b0rk/status/1000604334026055681)

(The twitter thread that follows also has some great one-liners!)

------
js2
I once wrote a network calculator in awk:

[https://gist.github.com/jaysoffian/e41ca479d70e60efe59fded93...](https://gist.github.com/jaysoffian/e41ca479d70e60efe59fded93528ab1b)

Backstory: circa 2000, at an early cloud company that no longer exists, we
used Solaris boxes and provisioned them using JumpStart. They booted to a
minimal state (we called them "embros") using DHCP. Later, our datacenter ops
folks would have to switch their networking from DHCP to static as part of
moving them to another network where a DHCP server didn't exist. So I wrote
them this menu driven program they could run on console to help with the
static configuration. (There were no high-level languages on the box save for
shell, sed and awk. Even then, apparently I had to use /usr/xpg4/bin/awk.)

~~~
noisy_boy
That reminds me of the _xpg_ extension of programs on Solaris boxes - I don't
remember the specifics but often if I was looking for something that wasn't in
/usr/bin/* I would check for the xpg version.

~~~
js2
xpg4 refers to the X/Open Portability Guide Issue 4 versions of the programs.
The commands in /usr/bin on Solaris were (are?) ancient backwards-compatible
versions.

------
ams6110
I'm working on learning just a little perl.

For some tasks it's amazingly effective, and it's usually installed even if
Python isn't.

------
inopinatus
In this age where JSON is universally encountered, I find myself using jq(1)
as much as I do awk.

(1) [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)
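
E.g., the jq analogue of an awk column grab (the .items[].name shape here is
purely for illustration):

      jq -r '.items[].name' data.json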

------
oceanghost
Does anyone else have trouble _retaining_ the syntaxes, gotchas, etc. from
dozens of languages, hundreds of tools, 3 major operating systems... etc?

~~~
jodrellblank
Yes, which is one reason I like PowerShell so much.

As a shell, the tab completion, parameter completion, and long names make it
easier to discover, easier to understand, and easier to remember.

Then it's a high level language too so if there's something that needs
scripting, it doesn't mean a complete change from shell to Python/Ruby/Perl,
it stays PowerShell.

Then it's a .Net language too, so if there's something getting a bit big for
it, it doesn't mean a complete change to C#/Java instead, it means a small
change to PowerShell with .Net methods, then maybe PowerShell with a C# core
(like Python with a C module, but still much easier to create).

Of course there's a bit of XKCD "fix having too many things by adding another
thing" going on.

But the fact that it covers the common shell tools with all their different
syntaxes reasonably well, and can be tuned to approach the speed of C#, makes
it useful for a whole lot of situations; despite having a pretty huge syntax
and list of warts, it still seems to come out well.

------
zeptomu
Don't forget about `-F`, the field delimiter (which is whitespace by default).
Obviously an option of awk, but some people disregard some *nix tools because
their data is not space-delimited (and delimiter-choice flags differ between
tools :( ).

For quick filtering it's really great, e.g. you can parse simple (pretty-
printed) XML tags via `-F'[<>]' '{print $2}'` (for a quick glance at the data -
of course not a good idea in production).
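
E.g., a quick sketch:

      $ echo '<name>Ada</name>' | awk -F'[<>]' '{ print $2, $3 }'
      name Ada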

------
bajsejohannes
I once read the first two or three chapters of _The AWK Programming
Language_. Those chapters cover the whole language and I could read them in one
sitting (actually standing, in a book store, but I digress).

To my surprise, it was powerful enough to write a somewhat limited
implementation of Snake:
[https://github.com/johshoff/snawk](https://github.com/johshoff/snawk)

------
j0e1
Here is an Awk program (not a one-liner) that extracted select snippets of
text from lots of docs and worked almost 10x faster than a grep equivalent:

    
    
       # join the whole doc into one line, then scan it for the pattern
       awk 'BEGIN { ORS=" " } { print $0 }' $2 | awk --re-interval -v pat="$1" '
       {
           # keep only the text before the "==== Refs" section
           cut_content  = substr($0, 1, match($0, "==== Refs"))
           orig_content = substr($0, 1, match($0, "==== Refs"))
           idx = 0
           # match pat only when bounded by non-letters (or line start/end)
           regex = "(^|[^a-zA-Z]{1})" pat "([^a-zA-Z]{1}|$)"
           while (match(cut_content, regex)) {
               if (RLENGTH > 0) {
                   prot = substr(cut_content, RSTART-1, RLENGTH+1)
                   gsub(/[[:punct:] ]$/, "", prot)   # trim trailing punctuation/space
                   gsub(/^[[:punct:] ]/, "", prot)   # trim leading punctuation/space

                   # print the match plus ~400 chars of context on either side
                   print prot, "\t", substr(orig_content, idx+RSTART-400, RLENGTH+800)
                   idx += (RSTART + RLENGTH - 1)
                   cut_content = substr(cut_content, RSTART+RLENGTH)
               }
           }
       }'

~~~
AdmiralAsshat
Looks interesting; can you explain how the hell it works?

I tried writing a script that would extract entries from logs (e.g. if every
entry started "====CRITICAL ERROR LOG 2018/06/15====", I wanted it to start
there and grab every line below it until it hits the next log entry starting
with the "=====" header), but I gave up. Your script might do something
similar, if I can figure out what it's looking for.

~~~
bcbrown
You can use regex ranges to match every line between two regexes:
[https://www.gnu.org/software/gawk/manual/html_node/Ranges.ht...](https://www.gnu.org/software/gawk/manual/html_node/Ranges.html)

Another option might be to use multi-line records
([https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html](https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html)).
First pre-process the file to add an empty line before each header, then use
blank lines as the record separator and newlines for the field separator.
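
For the parent's log case, a minimal sketch (assuming every entry header
starts with "===="):

      # on each header line, decide whether to print until the next header;
      # non-header lines just follow the last decision
      awk '/^====/ { p = ($0 ~ /CRITICAL ERROR LOG/) } p' logfile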

------
kjeetgill
awk is irreplaceable on the command line because there's no other good way to
extract columns (awk '{print $2, $3}') or parse/work with numbers. How else
would you print a running total for selected columns easily? awk '$4>3 {
x+=$3; print $2, $3, x }' It's the only tool that's aware of numbers, unlike
grep and sed.

My big new power move for awk is -F. You can use any regex as a field
separator! Mind blown!
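
For example:

      $ echo 'a1b22c333d' | awk -F'[0-9]+' '{ print $2, $3 }'
      b c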

~~~
sampo
> awk '{print $2, $3}'
    
    
        perl -lane 'print "$F[1] $F[2]"'
    

> awk '$4>3 { x+=$3; print $2, $3, x}'
    
    
        perl -lane 'if ($F[3] > 3) { $x += $F[2]; print "$F[1] $F[2] $x" }'
    

Don't know about perl6, though.

Edit: Better syntax per comments.

~~~
acqq
Add the "w" (for "warn") make it "-wlane" and read:

Scalar value @F[0] better written as $F[0] at -e line 1.

Scalar value @F[1] better written as $F[1] at -e line 1.

In Perl 5 the array elements of @a are $a[0], $a[1], etc. However, @F[0] works
too because a "slice" is returned.

------
forinti
Awk is nice to know, but my experience is that Perl one-liners are a lot
faster. And where you have Awk, you usually find Perl also.

------
artisin
I'm a big fan of "one-liners". Granted, they're not always the best solution,
but they're hard to beat. Something like printing the total disk space used
across filesystems:

    
    
      df -m | awk '{p+=$3}; END {print p}'
    

I bought a book titled "Awk One-Liners Explained" on a whim a while back. And
to this day, I consider it to be among the most "useful" $6 I've ever spent in
terms of productivity.

------
ealhad
Awk is an amazing tool, and every day I use it I wonder why I didn't learn it
earlier.

~~~
scbrg
I tend to post this whenever an awk discussion pops up, but at least for me,
the reason I didn't learn awk earlier is that the gnu man page for awk is
quite off-putting. I spent ten years thinking "awk is too complicated for my
weak mind." Then I stumbled on the man page of the plan9 implementation of awk
and I learned the language in fifteen minutes - if even that.

Sure, the gnu version is more powerful, but the plan9/original awk
implementation handles most cases I run into, and I can now look up the gnu
extras when I need them.

[http://man.cat-v.org/plan_9/1/awk](http://man.cat-v.org/plan_9/1/awk)

~~~
coliveira
This is a good point: GNU software complicates terribly what is supposed to be
simple.

~~~
ealhad
What can explain this?

------
coliveira
The most important feature of awk is simplicity. There are no worries about
complex language syntax; you just have a language that seems like C without
everything that makes it complex. And it excels at a simple task: reading lines
of data and transforming them as needed.

------
linsomniac
I've never really been able to keep a significant amount of awk in my head.
Mostly I use it for splitting out columns in output, especially when the
fields may be padded for output. So "{ print $2 }".

Awk has great abilities, but when I want to use it I always have to spend a
lot of time reading.

I've always wanted to come up with something in Python that combined the
awesomeness of awk with my ability to pick up and hack something together in
Python, but I've never been able to come up with the right semantics.

One thing I think could be really useful is the regex line range, where you
give it two regexes and a block of code gets executed on every line in that
range.

------
dominotw
The problem with this is that you keep relearning it every couple of months
when you need it.

~~~
ealhad
You just have to need it more often ;)

It can be really helpful in mundane data/logs analysis, for example.

------
u801e
I used to write scripts with awk, but stopped doing that after learning perl.
I wonder if learning how to use perl with the -ane option would be easier
compared to learning awk (especially in terms of text manipulation).

------
ElijahLynn
Just went through the whole thing, which was maybe 15-20 minutes of tinkering,
and was way better than procrastinating on just reading other crap. Leveled up
for sure, and it was well worth the 20 minutes!

------
mattbillenstein
I forget where I saw it now, but I've seen a pretty good write-up on using
unix tools to do very common operations -- set counting, histograms, etc. --
and awk is at the heart of most of them, usually with sort and uniq -c in
there.
It's amazing how quickly you can pick apart a log file -- minutes -- while it
might take others hours to get to the same information using Excel.
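
Something like this, e.g. a histogram of requested paths from an access log
(the $7 assumes common log format):

      awk '{ print $7 }' access.log | sort | uniq -c | sort -rn | head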

------
dbolgheroni
awk is also interesting because it's not just a language, it's a "model". It
isn't that you couldn't write the same thing in a few lines of any other
language, but because the read loop is built in, it's easier to write utilities
for the command line.

~~~
ollysb
You can get the loop behaviour with ruby

    
    
        $ echo -e $'line one\nline two' | ruby -ne 'puts $_.split(/\W+/)[1]'
        one
        two

~~~
fl0wenol
ruby's -n and -p options are venerable equivalents of perl's options, which
were present in version 1.0 waaay back in 1987 so that it could be a drop-in
replacement for awk

ruby inherited a lot of structure and ideas from perl, which it inherited from
awk, and that thought makes me irrationally happy

------
fizixer
shell + shell tools + piping = a league of its own.

As a big fan and advocate of a low-level language (C) and a high-level
language (Python) as a combo being a very powerful paradigm, I still can't
believe some of the things the, shall we call it, shell piping workflow lets
you accomplish in one line and in a matter of minutes.

Here's what you do.

Say you have a text log file. It's "semi structured" (like most log files) in
that you can extract quite a bit of information using things like
grep/sed/awk, but still not fully parse it (without a very complex parser; we
don't want to get into that). What's the first thing you do with the log file?
You view it:

    
    
        view logfile
    

(view is just opening the file in vim in read-only mode, if you don't have it,
you can use "vim -R"). You inspect the file and then quit. Try this instead:

    
    
        cat logfile | view -
    

(don't forget the hyphen at the end). Again, inspect and quit (e.g., using
:q). You might say, what a round-about/inefficient way to view a logfile. But.
Now you can do something in the middle:

    
    
        cat logfile | do something | view -
    

inspect and quit. Or do more than one thing:

    
    
        cat logfile | do something | do something else | view -
    

and you keep adding the piped commands until you're satisfied with your
output. Once you're satisfied, you can dump the output into another text file
as a "report" by replacing the last "| view -" with "> reportfile".

    
    
        cat logfile | many | processing | commands | later > reportfile
    

and you can do this multiple times to generate multiple report files of
various kinds. And you can concatenate some of those report files

    
    
        cat file1 file2
    

or put some columns of some of the reports side by side

    
    
        paste -d' ' <(cat file1 | pick column) <(cat file2 | pick column)
    

The possibilities are endless.

To give one example from my shell history: sometimes I have dozens of pdf
files open on my linux desktop (using the evince pdf viewer) and I wanna
restart the computer but don't want to lose track of which pdf files were open.
There may be automated ways of doing this, but let's say there aren't any. I
start with ps ax:
    
    
        ps ax | view -
    

A long log. I want to pick only lines that have evince:

    
    
        ps ax | grep evince | view -
    

But now the output includes the 'grep evince' process itself, which I don't want:

    
    
        ps ax | grep evince | grep -v grep | view -
    

Good. But I don't care about any of the columns of the ps output except the
last one (the one that shows the full path of the file). That's column 6:

    
    
        ps ax | grep evince | grep -v grep | awk '{ print $6 }' | view -
    

Looks good. Generate a report file from this:

    
    
        ps ax | grep evince | grep -v grep | awk '{ print $6 }' > openpdfs_YYYYMMDD.txt
    

Then I close all my pdfs, and reboot. I don't know grep or awk except for
very basic things like what I already did. But I know similar tidbits about
many other commands. If I have to perform a regex search only, I use grep, but
for search and replace I use sed. Sometimes cut is more handy for column
selection than awk. If I don't know the command but I know what I need to do, I
just google it, and 99% of the time I can find a stackoverflow post where a
linux-based one-liner is mentioned which is pretty much a drop-in for my piping
workflow.

Finally, if I want to traverse each line of the output and process them one by
one, I use a while loop. For example, if instead of dumping the file paths of
the pdfs I wanted to format them a little using directory name and file name,
I'd do this:

    
    
        ps ax | grep evince | grep -v grep | awk '{ print $6 }' | while read -r dfnam; do echo 'Directory:' $(dirname ${dfnam}) 'File:' $(basename ${dfnam}); done | view -
    

And again, view or dump.

Try this, and you'll realize that the possibilities are endless.

~~~
vram22
This is a great technique; thanks for sharing it. While I knew about all the
individual parts that you used, I don't think I ever thought of combining all
the parts with the "view -" at end, although I knew that "-" as a command-line
arg means standard input (to many commands that support it).

Very useful stuff, and as you say, the possibilities are endless.

I had done something sort of analogous, an experimental Python tool called
pipe_controller, which is not about piping commands to each other in the
traditional Unix sense; rather it is about "piping" the output of one function
to a second one, and the output of the second to a third one, and so on, as
many as one needs, under the control of a for loop. The net effect is like
normal composition of function calls, like f(g(h(x))), but some other
interesting effects can be achieved by doing it with a for loop, and changing
some things at run time:

After first creating pipe_controller (which is simple, really), I played
around with using it in a few different ways, and found that it can be used
for at least a couple of interesting things:

\- running a "pipe" (of those functions) incrementally (something like the
technique you showed), and saving / viewing the output of each intermediate
stage;

\- swapping components of the "pipe" at run time, under program control, which
again can lead to some interesting use cases.

I blogged a small series of posts about pipe_controller and such uses of it.
Here are the two last or so posts, and the previous posts can be reached by
following links in those posts:

Swapping pipe components at runtime with pipe_controller:

[https://jugad2.blogspot.com/2012/10/swapping-pipe-components-at-runtime.html](https://jugad2.blogspot.com/2012/10/swapping-pipe-components-at-runtime.html)

Using PipeController to run a pipe incrementally:

[https://jugad2.blogspot.com/2012/09/using-pipecontroller-to-run-pipe.html](https://jugad2.blogspot.com/2012/09/using-pipecontroller-to-run-pipe.html)

The pipe_controller code is here:

[https://bitbucket.org/vasudevram/pipe_controller](https://bitbucket.org/vasudevram/pipe_controller)

~~~
fizixer
Interesting. Looks like you're trying to do shell piping in python but without
the annoying nesting.

I have thought of this myself too, both in terms of python and in terms of
scheme. For example, for your python function composition example f(g(h(x))),
shell piping would look like this:

cat x | h | g | f | view -

while in scheme it would be like python, just a little different:

(f (g (h x)))

Essentially the point is: by using existing syntax and language facilities,
can we mimic the seamless shell-piping workflow (e.g., with no annoying
nesting)?

I think one issue that python or scheme will have is that, when something goes
wrong, debugging the problem would still be very annoying in python/scheme,
whereas in shell piping, you simply remove some tail-end processing commands to
view an earlier output, then fix your issue, then reintroduce the tail-end
commands that you removed, possibly with some modifications. So "view -" acts
as a debugger, not just for visual inspection of your report.

Anyway, this is an interesting area. Keep up the good work. I would just like
to mention that this is related to, and in fact part of, a much larger
programming paradigm called dataflow programming, sometimes called stream
processing. It also has connections with reactive programming (which, if you
look it up on wikipedia [1], falls under a larger paradigm called declarative
programming).

[1]
[https://en.wikipedia.org/wiki/Programming_paradigm](https://en.wikipedia.org/wiki/Programming_paradigm)

~~~
vram22
Yes, good points about the debugging issues.

Thanks for the encouragement and links. Will check them out.

------
starchild_3001
As an ML engineer & former data scientist, I use awk almost every day. It's
indispensable for picking columns, running simple stats, parsing unstructured
files, converting between formats... Hacky, but often the fastest tool.
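
E.g., a rough sketch of the "simple stats" case, mean and max of column 1:

      awk 'NR==1 { m=$1 } { s+=$1; if ($1>m) m=$1 } END { print s/NR, m }' data.txt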

------
_31
I saw a comment just yesterday(?) on HN that mentioned awk and it got me
interested (first time I heard about it). This was a good little intro that
was easy to follow along with and get a sense of what it's used for.

------
quickthrower2
I wonder how many Google job applications he's got in the last 14 hours!

~~~
HeadlessChild
According to the footer he is a software engineer at Google.

~~~
quickthrower2
Yes, I am referring to

> If you are the type of person interested in Awk, you are probably the type
> of person I'd like to see working with me at Google. If you send me your
> resume (____@gmail.com)

------
wyclif
Previously (2011, the last time this was posted):
[https://news.ycombinator.com/item?id=2932450](https://news.ycombinator.com/item?id=2932450)

------
skanga
Also try mawk - it runs much faster than awk.

------
olskool
Back in the 1980s my company wouldn't buy me a database so I hacked one up in
awk.

------
indentit
awk is a great tool, and I used to use it quite a lot. But now that we have
cross-platform PowerShell, I find it so much easier to use that; it's much more
powerful.

------
RickJWagner
Hacker News paydirt!

Awk is a great tool. Glad to see this.

