
Skip grep, use awk - dedalus
http://blog.jpalardy.com/posts/skip-grep-use-awk/
======
kiwidrew
While awk is indeed under-appreciated, there are many instances where using
grep to pre-filter your input is helpful because the grep family (GNU grep,
agrep, etc.) can match strings much much faster than awk's regex engine.

For example: GNU grep uses a heavily optimized implementation of the Boyer-
Moore string search algorithm [1] which (it is claimed) requires just 3 (x86)
cycles per byte of input. Boyer-Moore only searches for literal strings, but
grep will extract the longest literal from the search regex (e.g. the string
"foo" from the regex /foo(bar|baz)+/) and use Boyer-Moore to find candidate
matches that are then verified using the regex engine.

So if you have a large corpus of text to search, grep is your friend. It can
easily saturate your I/O bandwidth and perform an order of magnitude faster
than typical "big data" platforms. [2]

[1] [https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

[2] [https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)
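To make the pre-filter pattern concrete, here is a minimal sketch (the sample data is made up): grep -F does the fast literal scan, and awk only ever sees the surviving candidate lines.

```shell
# Tiny stand-in for a large corpus.
printf 'foobar 1\nquux 2\nfoobaz 3\n' > /tmp/corpus.txt

# grep -F scans for the literal "foo"; awk only splits and sums
# the lines that survive the filter.
grep -F 'foo' /tmp/corpus.txt | awk '{ sum += $2 } END { print sum }'
# prints 4
```

On a large file the grep stage discards most lines before awk ever has to split a record, which is exactly where the Boyer-Moore speedup pays off.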

~~~
voltagex_
I have found ack and ag to be much faster than grep (I should actually time
this).

~~~
thesmallestcat
They are perceived as faster because they automatically skip binary and VCS-
ignored files. grep is faster.

~~~
lathiat
They are faster for practical usage, though I'm fairly sure ag also runs the
search in parallel, no?

I also seem to recall the ripgrep author recently trying to optimise towards
grep in another HN post.

Love ag ever since I discovered it, though you have to be careful: every so
often it skips a filetype that mattered.

------
Ultimatt
Skip awk, use perl....

The alias below sets perl to loop over STDIN, splitting each line on one or
more whitespace characters and populating the array @F. The -n loops over the
input line by line, and -E Evaluates an expression given on the command line.

    
    
        alias glorp='perl -aF"/\s+/" -nE'
    

So now we have the command `glorp` to play with which has more familiar syntax
than awk and all of CPAN available to play with!

    
    
        $ [data is generated] | glorp '/Something/ and say $F[2]'
    

We have access to any Perl module by putting -MModule::Name=function after the
command, the following will parse a JSON record per line and glorp out what we
wanted:

    
    
        $ echo -e '{"hello":"world"}\n{"hello":"cat"}' | glorp 'say decode_json($_)->{hello};' -MJSON=decode_json
        world
        cat
    

Maybe you are used to using curl too. There is a nice web framework in Perl
called Mojolicious ([http://mojolicious.org](http://mojolicious.org)) that
provides a convenience module called 'ojo' for command line use. So grabbing
the summary sentence from Wikipedia articles is as straight forward as below.
Notice Mojolicious lets us use CSS selectors!

    
    
        $ echo -e 'grep\nawk\nperl' \
          | glorp 'say g("wikipedia.org/wiki/$F[0]")->dom->at("#mw-content-text > div > p")->all_text' -Mojo

~~~
d33
Hey, it would be awesome to have something like this for Python!

~~~
Ultimatt
Not really possible in Python for anything overly useful/complex because of
the significant white space, and regular expressions are not a first class
object in the language. ipython is the equivalent solution I suppose.

However, Ruby completely inherited this aspect of Perl and is another good
candidate. It's just even old systems have a perl installed that will perform
the awk like functionality.

~~~
phaemon
There is "Pyed Piper" aka `pyp`

[https://code.google.com/archive/p/pyp/](https://code.google.com/archive/p/pyp/)

"Pyp is a linux command line text manipulation tool similar to awk or sed, but
which uses standard python string and list methods as well as custom functions
evolved to generate fast results in an intense production environment."

~~~
Ultimatt
That's cool, but I feel like you're unlikely to run into it very often in the
wild? The vast majority of Linux systems come with ruby/perl out of the box.
Plus if Python did support this sort of hackery it would quickly garner most
of the bad press Perl has over the years :'(

~~~
bobpaul
Who cares? It's on pypi
([https://pypi.python.org/pypi/pyp/2.11](https://pypi.python.org/pypi/pyp/2.11))
so just `pip install pyp` and then you have it. You're probably not going to
want to distribute shell scripts that rely on pyp, but if you're distributing
scripts you can just distribute scripts written entirely in Python and use pyp
for your own one-liners.

------
beefsack
ripgrep[1] is functionally incredibly similar to grep and ag, but is
significantly faster[2] and supports a wider range of character encodings. In
its short lifetime it has already become the default search tool for VSCode.

I've switched to using it as my daily driver for text search and am incredibly
happy with it.

[1]:
[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

[2]:
[http://blog.burntsushi.net/ripgrep/](http://blog.burntsushi.net/ripgrep/)

Edit: I confused awk with ag originally leading to this comment. Using ripgrep
as a pre-filter to awk is still a ridiculous amount faster, especially on
large trees, so while the OP's suggestion is cool I can't see myself reaching
for it often.

~~~
cyphar
awk is a full programming language, and most of the time when I'm doing awk
scripting I have to use things like associative arrays and arithmetic. Pulling
out a field from a line is what most people use awk for, but it's honestly the
least interesting part of awk. In fact, if cut supported regular expressions
for specifying the field and record separators people wouldn't be using awk
for that purpose (because that's all that $n does).

I was under the impression that ripgrep is a grep implementation that was
incredibly optimised thanks to BurntSushi being a complete madman.

~~~
fiddlerwoaroof
The main reason I use awk 90% of the time is that its field parsing algorithm
does "the right thing" in most cases (i.e. divide fields by 1 or more
whitespace characters) without a lot of boilerplate, so it's really easy to
throw into a pipeline.

~~~
cyphar
That's what I was referencing when I said

> Pulling out a field from a line is what most people use awk for, but it's
> honestly the least interesting part of awk. In fact, if cut supported
> regular expressions for specifying the field and record separators people
> wouldn't be using awk for that purpose (because that's all that $n does).

There's nothing magical about awk's default FS. It's literally just /\s+/. If
cut's -d was slightly more clever you wouldn't need to use awk.
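A quick illustration of the difference (input line is made up): cut's single-character -d treats every space as a delimiter, while awk's default FS collapses runs of whitespace.

```shell
line='alpha    beta   gamma'   # runs of spaces between fields

# cut -d' ' sees an empty field between every pair of spaces,
# so "field 2" is one of those empty fields:
echo "$line" | cut -d' ' -f2      # prints an empty line

# awk's default field splitting collapses the runs:
echo "$line" | awk '{print $2}'   # prints: beta
```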

~~~
joosters
I wish the default 'cut' implementation could be just a little more clever -
regex delimiters would be good, it doesn't even support multiple characters :(

Also, cut's output manipulation is surprising. '-f 2,1' is actually the same
as '-f 1,2' - you can't change the order of printing.
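That behaviour is easy to see side by side (tab-separated sample input):

```shell
# cut always emits fields in input order, whatever order you ask for:
printf 'one\ttwo\n' | cut -f2,1                    # prints: one<TAB>two

# awk actually reorders:
printf 'one\ttwo\n' | awk -F'\t' '{print $2, $1}'  # prints: two one
```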

I know there are other programs that can do the job, but it's a little
frustrating when you can 'almost' get there with a chain of piped commands and
a simple tool like cut, but have to fall back on a 'real' programming language
to do just that extra bit of manipulation (awk, perl, whatever, and yes, I
know the shell is a programming language but you get my point!)

~~~
derriz
If you pipe your file through "while read f1 f2 f3 ; do echo field2 is $f2 ;
done" for example you can pick out fields. Re-ordering them is just of a
special case of any sort of bash manipulation you can do in the loop body.
Admittedly for the very basic case, it's not as terse as "cut -f2" but if
you're doing any further processing on the stream then I find it's often
shorter.
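A sketch of that pattern, including the reordering (sample input is made up):

```shell
# read splits each line on $IFS (whitespace by default), so the
# loop body can pick out and reorder fields freely:
printf 'a b c\nd e f\n' |
while read -r f1 f2 f3; do
  echo "$f2 $f1"    # second field first
done
# prints:
# b a
# e d
```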

~~~
vidarh
In bash you can also do "while read -a F ; do echo field2 is ${F[1]} ;
done". (-a assigns the fields to a zero-indexed array)

------
hiq
The author mentions that using awk instead of grep -v is not a good idea;
what about:

    
    
      awk '! /something/'
    

Doesn't it reproduce the same behaviour as the following?

    
    
      grep -v 'something'
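On a made-up sample the two do produce identical output:

```shell
printf 'keep\nsomething bad\nkeep too\n' > /tmp/in.txt

# awk prints every line for which the pattern does NOT match:
awk '!/something/' /tmp/in.txt     # prints the two "keep" lines

# grep -v inverts the match the same way:
grep -v 'something' /tmp/in.txt    # same two lines
```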

~~~
gbrown_
Yup, was going to post this as a comment myself.

------
aargh_aargh
I'd like to share a recent experience on a related note, but in the opposite
spirit - rather than reduce the number of command invocations on a command
line, it may make sense to increase it.

I had a loop operating on a text file like this:

    
    
      while read line
      do
        echo "$line" | sed -e "s/A/X/" -e "s/B/Y/" -e "s/C/Z/"
        ...
    

Gradually, as I added more things to replace, I noticed severe slowdown.
Things got fast again when I rewrote it as

    
    
      while read line
      do
        echo "$line" | sed -e "s/A/X/" | sed -e "s/B/Y/" | sed -e "s/C/Z/"
        ...
    

Turns out the latter (multiple processes running in parallel) helped with throughput.

~~~
avaika
why do you need the `while read` part at all? that's the main slowdown here:
the shell is extremely slow compared to sed/awk/whatever tool.

~~~
aargh_aargh
The script is currently a filter - the input comes from stdin.

~~~
sebcat
You can redirect stdin to sed directly, e.g.,

    
    
        cat << EOF | sed -e 's/A/X/;s/B/Y/;s/C/Z/;'
        A Brave Cow
        Was Walking
        Crows Were Airborne
        All Was Well
        EOF
    

or,

    
    
        cat << EOF > myfile
        A Brave Cow
        Was Walking
        Crows Were Airborne
        All Was Well
        EOF
        sed -e 's/A/X/;s/B/Y/;s/C/Z/;' < myfile
    

or (copy-pasted from shell for clarity),

    
    
        $ cat foo.sh
        #!/bin/sh
        sed -e 's/A/X/;s/B/Y/;s/C/Z/;'
        $ cat << EOF | ./foo.sh
        A
        Brave 
        Cow
        EOF
        X
        Yrave
        Zow

~~~
aargh_aargh
Thank you, that helped. I had to get rid of '^' anchors (because it's not
processing individual lines anymore) and I put

    
    
      sed -e '' < /dev/stdin
    

into the script, but the solution is now much simpler and faster.

~~~
sebcat
sed works on stdin by default (see third example) so there shouldn't be a need
to pipe from /dev/stdin, and line anchors should still work.

    
    
        $ cat myfile 
        foo foo
        baz baz
        $ cat foo.sh
        #!/bin/sh
        sed 's/^foo/bar/;s/^baz/foobar/'
        $ ./foo.sh < myfile 
        bar foo
        foobar baz
    

Depending on your expression, you may want to use -E to get POSIX ERE syntax.

------
jwilk
> awk uses modern (read “Perl”) regular expressions, by default – like grep -E

No, Perl regular expressions are different from the POSIX extended regular
expressions that awk and "grep -E" use.
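One concrete difference: Perl shorthand classes like \d are not part of POSIX ERE (note that -P is a GNU grep extension and isn't available in every grep):

```shell
# POSIX ERE: digit matching needs an explicit class or [[:digit:]]:
echo 'abc123' | grep -E '[0-9]+'   # prints: abc123

# PCRE via GNU grep's -P understands the Perl shorthand:
echo 'abc123' | grep -P '\d+'      # same match

# Under -E, \d is not a digit class at all.
```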

------
dredmorbius
The reason my one-liners often end up with both grep and awk pipes is because
that's how they're built: run some command, grep for content of interest,
throw awk at it to manipulate output. If this is a throwaway, the exploration
flexibility is more important than the command cleanness, and there are some
grep idioms (inverted match, case-insensitive, match count, first match only)
which are easier to code than their awk equivalents.

Sure, if I'm going to commit that to a script or shell function, I'll consider
going back to refactor it and clean it up. But for quick-and-dirty exploration
and prototyping, a long shell pipe is often the best modular way to compose a
tool.

Protip: C-x C-e will call up your one-liner into an editor session, from which
you can save it directly to a permanent file. The number of locally-written
tools which have originated in this fashion is ... probably embarrassing to
admit.

------
manasvi_gupta
I have used awk to color application logs (info=green, error=red, warn=yellow)
on linux console.

Blog - [http://manasvigupta.github.io/2015/06/27/color-your-logs-
and...](http://manasvigupta.github.io/2015/06/27/color-your-logs-and-stack-
traces.html)

------
assafmo
It's a nice idea, but I think maybe only to replace a simple egrep.

grep has tons of great options like

-F = fgrep; no regex - way faster

-v = as mentioned in the article

-o = print only matched input, not the entire line

-C = context, print lines before and after the match; can also be used partially with -A (after) and -B (before)

~~~
burntsushi
Small note: for GNU grep with a single pattern at least, the -F flag should
not impact performance. It is smart enough to recognize that a pattern is a
literal string and avoid the regex engine.

~~~
assafmo
Thanks, I didn't know this.

~~~
dullgiulio
Of course you might still want to use fgrep to make sure + and . match those
literal characters and are not interpreted as regular expression operators.
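A small demonstration of the difference (sample lines are made up):

```shell
printf '1+1\n1x1\n' > /tmp/f.txt

grep    '1.1' /tmp/f.txt   # . matches any char: prints both lines
grep -F '1.1' /tmp/f.txt   # literal dot: matches nothing
grep -F '1+1' /tmp/f.txt   # literal plus: prints only 1+1
```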

------
emmelaich
If you don't mind a bit of Perl, then Perl can be used for grep, awk and sed.
And find. Even comes with little conversion utilities to do it (mostly) for
you: a2p, s2p, find2perl.

I use Python most of the time but still use Perl where it is appropriate.

~~~
jwilk
These converters were actually removed from Perl in v5.21.1.

They are available as separate distributions on CPAN:

[https://metacpan.org/pod/App::a2p](https://metacpan.org/pod/App::a2p)

[https://metacpan.org/pod/App::s2p](https://metacpan.org/pod/App::s2p)

[https://metacpan.org/pod/App::find2perl](https://metacpan.org/pod/App::find2perl)

------
emmelaich
My favourite bad example of using grep was from a big enterprise software
vendor to kill one of their processes.

It looked something like

    
    
      ps -ef| grep SomeDaemon | grep -v grep | grep -v perl | perl -e '<do something with the pid>'
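For what it's worth, a common trick that drops the `grep -v grep` step is to wrap one character of the pattern in a bracket expression, so the grep process's own ps entry no longer matches (SomeDaemon is the placeholder name from above):

```shell
# The regex [S]omeDaemon still matches the string "SomeDaemon",
# but the literal text "[S]omeDaemon" on grep's own command line
# does not match the pattern, so grep never lists itself.
ps -ef | grep '[S]omeDaemon'
```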

~~~
thaumaturgy
Serious question: what would be a better way to do this?

~~~
CaptSpify
I would just recommend pkill:
[https://en.wikipedia.org/wiki/Pkill](https://en.wikipedia.org/wiki/Pkill)

~~~
stephen82
I tend to use the following command:

    
    
        kill -HUP $(pidof SomeDaemon)
    

and if it insists on running, I use:

    
    
        sudo kill -9 $(pidof SomeDaemon)
    

That's it, really.

~~~
desdiv
I suggest:

    
    
        killall -HUP SomeDaemon

~~~
xorcist
Don't do this! Not only is pkill more competent, it is available on several
other operating systems.

There is a killall on Solaris also which is very different but true to its
name. You do _not_ want to run it by accident.

------
netheril96
Isn't rg the hot one for grep replacement these days?

~~~
Sean1708
It doesn't replace Awk though. I think the point of this post is that you
shouldn't do `<search tool> <pattern> | awk ...`, you should just do `awk
/<pattern>/ ...`.

------
tannhaeuser
I'm sure the author is aware, but awk has at least three implementations: nawk
(the one true), gawk (what most are using), and mawk (performance-oriented,
unmaintained). Plus busybox-awk.

When benchmarking gawk, I've found using LANG=C and avoiding UTF-8 to make a
substantial difference for pattern matching.
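A sketch of that trick (sample log is made up; the actual speedup depends on the awk build, the pattern, and the input): setting LC_ALL=C for just one command forces byte-oriented matching without touching the rest of the session.

```shell
printf 'error one\nok\nerror two\n' > /tmp/log.txt

# Same result as the default locale, but the C locale lets the
# matcher work on raw bytes instead of decoding UTF-8:
LC_ALL=C awk '/error/ { n++ } END { print n+0 }' /tmp/log.txt
# prints 2
```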

~~~
bsg75
Mawk is currently maintained: [http://invisible-
island.net/mawk/mawk.html](http://invisible-island.net/mawk/mawk.html)

------
fiercenoodle
He says that you can't cleanly do grep -v with awk, but awk '!/foo/' seems to
work in my copy of awk.

------
bluetomcat
AWK is the general-purpose programmatic filter and reporting tool in the Unix
pipeline. Sed, grep and cut are specializations for specific use cases whose
implementations might have better performance. Perl and Python are probably
too general-purpose for writing compact one-liners in a pipeline.

~~~
rflrob
Even though it's more characters, I usually use perl -pe 's/search/replace/'
instead of sed in pipelines because it understands /n (and other escape
characters I don't remember). Because all it takes is to get burned a couple
times for it to be worth sticking with what you know will work.

~~~
dozzie
"/n"? What's that? My `man perlre' doesn't say anything about this regexp
flag.

~~~
jwilk
Probably not what rflrob meant, but /n prevents parentheses from capturing.

[https://perldoc.perl.org/perlre.html#%2an%2a](https://perldoc.perl.org/perlre.html#%2an%2a)

~~~
dozzie
This is somewhat surprising. A relatively new feature, and not present in my
current Perl.

------
usgroup
Except "grep | awk" is often way quicker, because awk doesn't have to split
records that grep has already thrown away, and I usually want to spawn more
processes because awk will usually eat a whole CPU if it's doing something
gnarly.

------
topspin
But I like my colorized match: grep --color ...

Also grep -n is so much nicer than something like awk '{print NR "," $0}'

Finally, I think grep must be faster if only because grep doesn't line-buffer
by default, although you may request it with --line-buffered.

------
skywhopper
awk is certainly a really important tool to know, but in general this is poor
advice. awk is significantly slower than grep -E, and typing awk commands is
often much slower as well. Not to mention that awk only operates on a single
stream of data, and can't do operations with file awareness. Sure, use awk's
line filtering when you're already going to need awk for something else, but
in general your first instinct should be grep.

------
racl101
I've always wanted to learn to use awk but I just can never find enough
examples that allow me to understand what the hell I'm doing. The learning
curve is too high for me.

------
linedash
The grep -v equivalent given is wrong and dirty. Instead it should be:

    
    
        $ [data is generated] | awk '!/something/'

------
adsche
Just FYI, the code lines do not wrap and are not scrollable for me on mobile
(Firefox/Android).

------
rhizome
To address one of the benefits, `egrep ^[^#]` is shorter than `awk '/^[^#]/'`.

~~~
thesmallestcat
That pattern doesn't need extended regular expressions (use grep -E instead of
egrep though). Also `grep -v ^#` does the same thing. `sed /^#/d` is one char
shorter.

~~~
vacri
> _Also `grep -v ^#` does the same thing._

Depends on what you're after. 'grep ^[^#]' also gets rid of empty lines, as
they don't have a first character to match.

------
readme
grep -E performs an extended regex. To get a perl compatible one you need to
use grep -P.

------
tonmoy
How am I supposed to do `grep -nrb --include \*.cpp something` ?

~~~
Xophmeister
This is close, but I don't know how to implement grep's `-b` easily:

    
    
        find . -type f -name "*.cpp" -exec awk '/something/ {print FILENAME, FNR, $0}' {} \+
    

Clearly grep wins this round!

------
thinkMOAR
is there any awk equivalent for grep "matcha\|matchb" ?

~~~
gbrown_
awk '/matcha|matchb/'

~~~
thinkMOAR
excellent :) though slightly disappointed in myself, that i did not try that
myself as i did imagine it when i asked...

------
torrent-of-ions
This is basic awk. But grep is much faster if that's all you're doing. fgrep
even more so if that's what you want.

------
kubakuba
Yay, now I've found a replacement for grep under macOS. Thanks for that!

~~~
_ph_
Why would you need that, when grep comes with macOS?

~~~
kubakuba
but the grep bundled with macOS does not support Perl-compatible expressions (-P)

