Skip grep, use awk (jpalardy.com)
332 points by dedalus on July 3, 2017 | hide | past | favorite | 130 comments

While awk is indeed under-appreciated, there are many instances where using grep to pre-filter your input is helpful because the grep family (GNU grep, agrep, etc.) can match strings much much faster than awk's regex engine.

For example: GNU grep uses a heavily optimized implementation of the Boyer-Moore string search algorithm [1] which (it is claimed) requires just 3 (x86) cycles per byte of input. Boyer-Moore only searches for literal strings, but grep will extract the longest literal from the search regex (e.g. the string "foo" from the regex /foo(bar|baz)+/) and use Boyer-Moore to find candidate matches that are then verified using the regex engine.

So if you have a large corpus of text to search, grep is your friend. It can easily saturate your I/O bandwidth and perform an order of magnitude faster than typical "big data" platforms. [2]
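A sketch of that division of labor (the log lines here are made up): grep's fast literal scan culls the input, and awk only has to split fields on the candidate lines.

```shell
# grep -F does a cheap literal scan; awk only sees the surviving lines.
printf 'GET /index.html 200\nPOST /login 500\nGET /about.html 200\n' \
  | grep -F ' 200' \
  | awk '{ print $2 }'
# prints:
# /index.html
# /about.html
```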

[1] https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

[2] https://aadrake.com/command-line-tools-can-be-235x-faster-th...

I have found ack and ag to be much faster than grep (I should actually time this).

Creator of ack here.

The three biggest reasons to use grepalikes are not necessarily the time to execute the search, but

1) The reduced amount of typing. It's much faster to type "ack foo --ruby" than it is to type "find . -name '*.rb' | xargs grep foo"

2) Better filetype matching, because ack and ag check things beyond just file extensions. ack's --ruby flag checks for *.rb and Rakefile and "ruby" appearing in the shebang line, for example.

3) More features that grep doesn't include, like better highlighting and grouping, or full Perl regular expressions (in ack) and a ton of other features.

Here's a command line to find all the header files in your C source. It takes advantage of the ability to use Perl's regular expressions to specify output:

ack --cc '#include <(.+?)>' -H --output='$1' | sort -u

Here's an article I wrote a while ago on comparing the features of ack and ag. https://blog.newrelic.com/2015/01/28/grep-ack-ag/

Even though I created ack, I don't care which tool you use, so long as it suits your needs and makes you happy, but if all you're using your grep/ack/ag for is "grep -R function_name ." then you're only scratching the surface of what the tools can do for you.

They are perceived as faster because they automatically skip binary and VCS-ignored files. grep is faster.

They are faster for practical usage, though. I'm fairly sure ag also executes the search in parallel, no?

I also seem to recall the ripgrep author recently trying to optimise towards grep in another HN post.

Love ag ever since I discovered it, though you have to be careful every so often it doesn't look in a filetype that mattered.

> they automatically skip binary and VCS-ignored files

If it does less work, and takes less time, then isn't that faster?

See http://blog.burntsushi.net/ripgrep/ for a quite nice comparison which is counter to your experience; ack/ag/pt are all slower than either grep or ripgrep.

"ack/ag/pt are all slower than either grep or ripgrep"

You need to be a touch careful with blanket statements, as this conclusion isn't quite what the data in my blog says. What it says is that ag is generally slower than GNU grep on large files, primarily because GNU grep's core search code is quite a bit more optimized than ag's. However, ag can outclass GNU grep when searching across entire directories primarily by culling the set of files that is searched, and also by parallelizing search if you want to exclude running GNU grep in a simple `find ... | xargs -P`-like command. (This is thesmallestcat's point.) This is why the first set of benchmarks in my blog don't actually benchmark GNU grep, because there is an impedance mismatch. Instead, ag/pt are benchmarked against `git grep`, where the comparison is quite a bit closer. (You'll want to check out the benchmarks on my local machine, which avoid penalizing ag for its use of memory maps.[1]) The second set of benchmarks basically swap out `git grep` for GNU grep and compare the tools on very large files.

ack isn't actually included in the benchmarks because it was incredibly slow when I tried it, although there may be something funny going on.[2] To be honest, it isn't terribly surprising to me, since every other tool is compiled down to native code.

The pt/sift story is interesting, and my current hypothesis is that the tools have received a misleading reputation for being fast. In particular, when using the tools with simple literals, it's likely that search will benefit from an AVX2 optimization when searching for the first byte. As soon as you start feeding more complex patterns (including case insensitivity), these tools slow down quite a bit because 1) the underlying regex engine needs more optimization work and 2) the literal analysis is lacking, which means the tools rely even more on the regex engine than other tools do.

The short summary of ripgrep is that it should outclass all the tools on all the benchmarks in my blog. I should update them at some point as ripgrep has gotten a bit faster since that blog post. Primarily, a parallel recursive directory iterator, which is present in sift, pt and ucg as well. Secondarily, its line counting has been sped up with more specialized SIMD routines. (I should hedge a bit here. It is possible to build patterns that a FSM-like engine will do very poorly on, but where a backtracking engine may not degrade as much. The prototypical example is large repetitions, e.g., `(foo){100}`. Of course, the reverse is true as well, since FSMs don't have exponential worst case time. Also, my benchmarks aren't quite exhaustive. For example, they don't benchmark GNU grep/ripgrep's `-f` flag for searching many regexes.)

[1] - https://github.com/BurntSushi/ripgrep/blob/master/benchsuite...

[2] - https://github.com/petdance/ack3/issues/42

Regarding your '(foo){100}' example, couldn't you expand this into a literal search for 100 repetitions of 'foo'? I guess this could interact poorly with the regex engine, and you'd be expected to cover more complex instances like '(foo){50,100}', but I think it might be worth the effort in some cases.

Yes, my example was bad. Consider `\pL{100}` instead. :-)

With respect to expanding repetitions on literals, that already works today. e.g.,

    $ rg '(foo){100}' /dev/null --debug
    DEBUG:grep::literals: required literal found: "foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo"
The truly eagle-eyed would notice that there are actually only 83 instances of `foo` in this string. In fact, literal detection is actually quite a tricky task on arbitrary regular expressions, and one must be careful to bound the number of and size of literals you produce. For example, the regex `\w+` matches an ~infinite number of literals, but `\w` is really just a normal character class, and some character classes are good to include in your literal analysis.

The 83 comes from the fact that 83 times 3 = 249, which butts up against a hard-coded limit in the regex engine for literal detection: https://github.com/rust-lang/regex/blob/d894c631cb6c9a062c13...

In this case, the literal detector knows that the literal is a prefix. That means a prefix match must still be confirmed by entering the regex engine. If you pass a smaller literal, e.g., `(foo){4}`, then you'll see slightly different output:

    DEBUG:grep::literals: literal prefixes detected: Literals { lits: [Complete(foofoofoofoo)], limit_size: 250, limit_class: 10 }
In this case, the regex compiler sees that a literal match corresponds to an overall match of the regex, and can therefore stay out of the regex engine entirely. The "completion" analysis doesn't stop at simple regexes either, for example, `(foo){2}|(ba[rz]){2}` yields:

    DEBUG:grep::literals: literal prefixes detected: Literals { lits: [Complete(foofoo), Complete(barbar), Complete(bazbar), Complete(barbaz), Complete(bazbaz)], limit_size: 250, limit_class: 10 }
This is important, because this particular pattern will probably wind up using a special SIMD multi-pattern matcher. Failing that, it will use the "advanced" version of Aho-Corasick, which is a DFA that is computed ahead-of-time without extra failure transitions (as opposed to the more general lazy DFA used by the regex engine).

Literal detection is a big part of the secret sauce of ripgrep. Pretty much every single regex you feed to a tool like ripgrep will contain a literal somewhere, and this will greatly increase search speed when compared to tools that don't do literal extraction.

Of course, literal extraction has downsides. If your literal extractor picks out a prefix that is a very very common string in your haystack, then it would probably be better to just use the regex engine instead of ping-ponging back-and-forth between the prefix searcher and the regex engine. Of course, there are probably ways to detect and stop this ping-ponging, but I haven't invested much effort in that yet. Alas, the performance improvement in the common case (literals make things faster) seems to wind up being more beneficial for the general use case.

There is more in my blog post: http://blog.burntsushi.net/ripgrep/#literal-optimizations

Skip awk, use perl....

The alias below sets perl up to loop over STDIN, splitting each line on one or more whitespace characters and populating the array @F. The -nE flags then Evaluate an expression from the command line, looping over the input line by line.

    alias glorp='perl -aF"/\s+/" -nE'
So now we have the command `glorp` to play with which has more familiar syntax than awk and all of CPAN available to play with!

    $ [data is generated] | glorp '/Something/ and say $F[2]'
We have access to any Perl module by putting -MModule::Name=function after the command, the following will parse a JSON record per line and glorp out what we wanted:

    $ echo -e '{"hello":"world"}\n{"hello":"cat"}' | glorp 'say decode_json($_)->{hello};' -MJSON=decode_json
Maybe you are used to using curl too. There is a nice web framework in Perl called Mojolicious (http://mojolicious.org) that provides a convenience module called 'ojo' for command line use. So grabbing the summary sentence from Wikipedia articles is as straightforward as below. Notice Mojolicious lets us use CSS selectors!

    $ echo -e 'grep\nawk\nperl' \
      | glorp 'say g("wikipedia.org/wiki/$F[0]")->dom->at("#mw-content-text > div > p")->all_text' -Mojo

Here's the equivalent for Ruby:

    alias glorp='ruby -ane '

    $ [data is generated] | glorp ' ~ /Something/ and puts $F[2]'

    $ echo -e '{"hello":"world"}\n{"hello":"cat"}' | glorp 'puts JSON.load($_)["hello"] ' -rjson

(Of course Ruby got the -a autosplit-mode and the -n assumed 'while gets(); ... end' loop from Perl along with $_ and $F, so it's very intentional that they're similar)

Somewhat related nodejs self plug: Use nip https://github.com/kolodny/nip

    $  echo -e 'this\nis\na\nwhatever foo' | nip 'return /whatever/.test(line) && cols[1]' # foo

Awesome thanks for sharing this! I was too lazy to give a Ruby example alongside.

And Ruby regexes are amazing.

I thought that they were amazing because they were just like Perl's. Are there any differences?

Sorry, but I strongly disagree.

I can think of two very strong reasons to prefer awk over perl:

* busybox implements awk, so it's (usually) available even on embedded platforms

* awk syntax is way clearer than perl; anyone who has a little experience in C and shell can figure out what is being done with some googling

awk is also powerful enough to separate output to different files, accumulate inputs, etc. And if you need anything more complex, there are tons of other languages to choose from then (including perl).

What's the reason for using the -F option? -a defaults to splitting on " ", which already emulates awk's behavior. If I'm not wrong, the only difference between perl -aF"/\s+/" and perl -a is the treatment of leading whitespace in lines.

To show you could use other separators defined by any regex you like.

Hey, it would be awesome to have something like this for Python!

I wrote a little tool called "pyline" about a hundred years ago... I still reach for it often when I'm in a hurry, and don't have time to futz around with awk/sed.


That's pretty cool, thanks!

Not really possible in Python for anything overly useful/complex, because of the significant whitespace and because regular expressions are not first-class objects in the language. ipython is the equivalent solution, I suppose.

However, Ruby completely inherited this aspect of Perl and is another good candidate. It's just even old systems have a perl installed that will perform the awk like functionality.

There is "Pyed Piper" aka `pyp`


"Pyp is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment."

That's cool, but I feel like you're unlikely to run into it very often in the wild? The vast majority of Linux systems come with ruby/perl out of the box. Plus if Python did support this sort of hackery it would quickly garner most of the bad press Perl has over the years :'(

Who cares? It's on pypi (https://pypi.python.org/pypi/pyp/2.11) so just `pip install pyp` and then you have it. You're probably not going to want to distribute shell scripts that rely on pyp, but if you're distributing scripts you can just distribute scripts written entirely in Python and use pyp for your own one-liners.

The lack of something like Perl/Ruby's -n and -p switches is a barrier as well. Those are handy because they automatically work for either piped input, or files named on the command line.

Of course, you can do all of this in Python, just not as a one liner.

So is the intention to be a Python equivalent of awk?

There was an awk/perl thread yesterday too.

ripgrep[1] is functionally incredibly similar to grep and ag, but is significantly faster[2] and supports a wider range of character encodings. In its short lifetime it has already become the default search tool for VSCode.

I've switched to using it as my daily driver for text search and am incredibly happy with it.

[1]: https://github.com/BurntSushi/ripgrep

[2]: http://blog.burntsushi.net/ripgrep/

Edit: I confused awk with ag originally leading to this comment. Using ripgrep as a pre-filter to awk is still a ridiculous amount faster, especially on large trees, so while the OP's suggestion is cool I can't see myself reaching for it often.

> and supports a wider range of character encodings

Since this is an uncommon feature, I'd like to emphasize this. :-) In particular, ripgrep will automatically search UTF-16 encoded files via BOM sniffing. That means you can run ripgrep over a directory on Windows and be confident that it will correctly search both UTF-8 and UTF-16 encoded files automatically without having to think about it.

More generally, it supports all encodings found in the Encoding Standard[1]. However, only UTF-16 is automatically detected (where UTF-8 is the presumed default), so you'll need to explicitly specify `-E sjis` (for example) if you want to search Shift_JIS encoded files.

Also, I didn't really need to do much of anything to get this working. This is all thanks to @hsivonen's encoding_rs[2] crate, which is now (I think) in Firefox.

[1] - https://encoding.spec.whatwg.org/#concept-encoding-get

[2] - https://github.com/hsivonen/encoding_rs

awk is a full programming language, and most of the times that I'm doing awk scripting I have to use things like associative arrays and arithmetic. Pulling out a field from a line is what most people use awk for, but it's honestly the least interesting part of awk. In fact, if cut supported regular expressions for specifying the field and record separators people wouldn't be using awk for that purpose (because that's all that $n does).
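For example, a tiny sketch of that "more interesting" side of awk — tallying input values with an associative array and arithmetic:

```shell
# Count occurrences per key with an awk associative array;
# sort only to make the output order deterministic.
printf '200\n404\n200\n500\n200\n' \
  | awk '{ count[$1]++ } END { for (code in count) print code, count[code] }' \
  | sort
# prints:
# 200 3
# 404 1
# 500 1
```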

I was under the impression that ripgrep is a grep implementation that was incredibly optimised thanks to BurntSushi being a complete madman.

Yeah, ripgrep is "a grep," not an awk. I'm not sure how they wound up being conflated here. ripgrep does have a `-r/--replace` flag which is somewhat of a generalization of grep's `-o/--only-matching` flag (and also part of ack, I believe) by permitting sub-capture expansion that probably does replace awk for the "pulling out a field from a line" use case you mentioned. But it's pretty awkward for simple cases. e.g., I'd much rather use `blah | awk '{print $2}'` to get the second field delimited by arbitrary whitespace.

>Yeah, ripgrep is "a grep," not an awk. I'm not sure how they wound up being conflated here.

Simple: a lot of people use awk just as a grep.

The main reason I use awk 90% of the time is that its field parsing algorithm does "the right thing" in most cases (i.e. divide fields by 1 or more whitespace characters) without a lot of boilerplate, so it's really easy to throw into a pipeline.

That's what I was referencing when I said

> Pulling out a field from a line is what most people use awk for, but it's honestly the least interesting part of awk. In fact, if cut supported regular expressions for specifying the field and record separators people wouldn't be using awk for that purpose (because that's all that $n does).

There's nothing magical about awk's default FS. It's literally just /\s+/. If cut's -d was slightly more clever you wouldn't need to use awk.
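To make the contrast concrete — awk's default splitting collapses runs of blanks and drops leading whitespace, while cut's single-character -d does not:

```shell
line='  alpha   beta  gamma'
# awk's default FS: runs of blanks are one separator, leading blanks ignored.
echo "$line" | awk '{ print $2 }'    # beta
# cut -d' ': every single space is a boundary, so field 2 is empty here.
echo "$line" | cut -d' ' -f2
```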

I wish the default 'cut' implementation could be just a little more clever - regex delimiters would be good, it doesn't even support multiple characters :(

Also, cut's output manipulation is surprising. '-f 2,1' is actually the same as '-f 1,2' - you can't change the order of printing.

I know there are other programs that can do the job, but it's a little frustrating when you can 'almost' get there with a chain of piped commands and a simple tool like cut, but have to fall back on a 'real' programming language to do just that extra bit of manipulation (awk, perl, whatever, and yes, I know the shell is a programming language but you get my point!)

If you pipe your file through "while read f1 f2 f3 ; do echo field2 is $f2 ; done" for example you can pick out fields. Re-ordering them is just of a special case of any sort of bash manipulation you can do in the loop body. Admittedly for the very basic case, it's not as terse as "cut -f2" but if you're doing any further processing on the stream then I find it's often shorter.

In bash you can also do "while read -a F ; do echo field2 is ${F[1]} ; done". (-a will assign each field to a zero-indexed array)

Yeah, I thought of mentioning this. The reason I don't do this, though, is that I generally dislike keeping track of when the shell does or does not automatically split input, so I only use `read` with the -r option.

You mean to say you don't use

    some-cmd | sed -E 's/^\s+//;s/\s+/ /g' | cut -f$n
Rather than

    some-cmd | awk '{ print $n }'
;) Also I agree it's silly you can't change the field order.

The default FS throws away leading blanks, though, which doesn't happen if you set it explicitly to \s+, so a tiny little bit of magic does go on after all.

Fair enough, though ultimately you could emulate it with sed:

    some-cmd | sed -E '{ s/^\s+//g ; s/\s+/ /g }' | cut -f$n

perl can do the same with the -alne flags, but indeed awk has been widely used because it does this by default.

ripgrep doesn't yet support compressed files (z/gz), while grep/ack/ag do. On *nix it's extremely common to have sparsely compressed directories, from logfiles to non-changing documentation. I really wished ripgrep would just stream to a fast coprocess and support any stream compressor transparently instead of trying to use a built-in rust library.

Suggestions are most welcome on the issue tracker. I don't think your idea has been suggested yet?

I have commented on #225 about this.

I tried a lot but could not install ripgrep on ubuntu :(

Apparently the code is in rust, which I know zero of. So, I didn't try building it either.

It's simple to install ripgrep on pretty much any Linux because I distribute statically compiled binaries:

    $ curl -LO 'https://github.com/BurntSushi/ripgrep/releases/download/0.5.2/ripgrep-0.5.2-x86_64-unknown-linux-musl.tar.gz'
    $ tar xf ripgrep-0.5.2-*.tar.gz
    $ cp ripgrep-0.5.2-*/rg $HOME/bin/rg
If you're not a fan of this approach (downloading random binaries and slapping them into your $HOME/bin), then your other choice is to build from source. I don't use Ubuntu, but install Rust[1] and then building ripgrep is easy:

    $ git clone git://github.com/BurntSushi/ripgrep
    $ cd ripgrep
    $ cargo build --release
    $ ./target/release/rg -V
    ripgrep 0.5.2
And yes, it would be great to get ripgrep packaged into Ubuntu.[2] There seems to be an up-to-date PPA here.[3]

[1] - https://www.rust-lang.org/en-US/install.html

[2] - https://github.com/BurntSushi/ripgrep/issues/10

[3] - https://launchpad.net/~x4121/+archive/ubuntu/ripgrep

The author mentions that using awk instead of grep -v is not a good idea, what about:

  awk '! /something/'
Doesn't it reproduce the same behaviour as the following?

  grep -v 'something'

Yup, was going to post this as a comment myself.

Yes, these are the same

I'd like to share a recent experience on a related note, but in the opposite spirit - rather than reduce the number of command invocations on a command line, it may make sense to increase it.

I had a loop operating on a text file like this:

  while read line; do
    echo "$line" | sed -e "s/A/X/" -e "s/B/Y/" -e "s/C/Z/"
  done
Gradually, as I added more things to replace, I noticed severe slowdown. Things got fast again when I rewrote it as

  while read line; do
    echo "$line" | sed -e "s/A/X/" | sed -e "s/B/Y/" | sed -e "s/C/Z/"
  done
Turns out the latter (multiple processes in parallel) helped with throughput.

As a minor note, the first one can be written as "s/A/X/;s/B/Y/;s/C/Z/"

Why do you need the `while read` part at all? That is the main slowdown here. The shell is extremely time-consuming compared to sed/awk/whatever tool.

The script is currently a filter - the input comes from stdin.

You can redirect stdin to sed directly, e.g.,

    cat << EOF | sed -e 's/A/X/;s/B/Y/;s/C/Z/;'
    A Brave Cow
    Was Walking
    Crows Were Airborne
    All Was Well
    EOF

    cat << EOF > myfile
    A Brave Cow
    Was Walking
    Crows Were Airborne
    All Was Well
    EOF
    sed -e 's/A/X/;s/B/Y/;s/C/Z/;' < myfile
or (copy-pasted from shell for clarity),

    $ cat foo.sh
    sed -e 's/A/X/;s/B/Y/;s/C/Z/;'
    $ cat << EOF | ./foo.sh

Thank you, that helped. I had to get rid of '^' anchors (because it's not processing individual lines anymore) and I put

  sed -e '' < /dev/stdin
into the script, but the solution is now much simpler and faster.

sed works on stdin by default (see third example) so there shouldn't be a need to pipe from /dev/stdin, and line anchors should still work.

    $ cat myfile 
    foo foo
    baz baz
    $ cat foo.sh
    sed 's/^foo/bar/;s/^baz/foobar/'
    $ ./foo.sh < myfile 
    bar foo
    foobar baz
Depending on your expression, you may want to use -E to get POSIX ERE syntax.

> awk uses modern (read “Perl”) regular expressions, by default – like grep -E

No, Perl regular expression are different than extended regular expressions that awk and "grep -E" use.

The reason my one-liners often end up with both grep and awk pipes is because that's how they're built: run some command, grep for content of interest, throw awk at it to manipulate output. If this is a throwaway, the exploration flexibility is more important than the command cleanness, and there are some grep idioms (inverted match, case-insensitive, match count, first match only) which are easier to code than their awk equivalents.

Sure, if I'm going to commit that to a script or shell function, I'll consider going back to refactor it and clean it up. But for quick-and-dirty exploration and prototyping, a long shell pipe is often the best modular way to compose a tool.

Protip: in bash, C-x C-e will call up your one-liner into an editor session, from which you can save it directly to a permanent file. The number of locally-written tools which have originated in this fashion is ... probably embarrassing to admit.

I have used awk to color application logs (info=green, error=red, warn=yellow) on linux console.

Blog - http://manasvigupta.github.io/2015/06/27/color-your-logs-and...
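The core of the trick (a minimal sketch, not the blog's exact script) is just wrapping matched lines in ANSI color escape codes:

```shell
# 31=red, 33=yellow, 32=green; \033[0m resets the color.
printf 'INFO start\nWARN disk\nERROR boom\n' | awk '
  /^ERROR/ { print "\033[31m" $0 "\033[0m"; next }
  /^WARN/  { print "\033[33m" $0 "\033[0m"; next }
  /^INFO/  { print "\033[32m" $0 "\033[0m"; next }
  { print }'
```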

It's a nice idea, but I think it only works as a replacement for a simple egrep.

grep has tons of great options like

-F = fgrep; no regex - way faster

-v = as mentioned in the article

-o = print only matched input, not the entire line

-C = context, print lines before and after the match; can also be used partially with -A (after) and -B (before)
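A couple of these in action (-o is in GNU and BSD grep, though not in POSIX):

```shell
# -v drops matching lines:
printf 'error: disk full\nok: fine\n' | grep -v '^ok'     # error: disk full
# -o prints each match on its own line instead of the whole line:
printf 'id=42 id=7\n' | grep -o 'id=[0-9]*'
# prints:
# id=42
# id=7
```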

Small note: For GNU grep on single regexes at least, the -F flag should not impact performance. It is smart enough to recognize when a pattern is a plain literal and avoid the regex engine.

Thanks, I didn't know this.

Of course you might still want to use fgrep to make sure + and . match literally and are not interpreted as regular expression operators.

Also, super handy: grep colors the part of the line that matched.

If you don't mind a bit of Perl, then Perl can be used for grep, awk and sed. And find. Even comes with little conversion utilities to do it (mostly) for you: a2p, s2p, find2perl.

I use Python most of the time but still use Perl where it is appropriate.

These converters were actually removed from Perl in v5.21.1.

They are available as separate distributions on CPAN:




perl is still one of my favorite programming languages to use for straight text-manipulation

My favourite bad example of using grep was from a big enterprise software vendor to kill one of their processes.

It looked something like

  ps -ef| grep SomeDaemon | grep -v grep | grep -v perl | perl -e '<do something with the pid>'

Serious question: what would be a better way to do this?

As per other users, you can use pkill but it's not entirely portable in meaning between OSs. This software had to run on just about every possible Unix out there. Solaris, HP-UX, AIX, many others ...

My main point is that using perl and grep, (multiple thereof!) is nuts when you can do it all in perl.

Also crazy was using -ef (all processes) when the process of interest ran as a known user. So ps -u <user> would be more appropriate.

Doing as much in Perl as possible, for portability, you'd do something like

  ps -u <user>| perl -ane 'm/[S]omeDaemon/ && kill "SIGTERM", $F[1]'
-a means autosplit into fields $F[0], $F[1], etc., so the pid is $F[1].

See `perldoc -f kill` for the Perl function kill.

You can also do the `ps` from within Perl but I don't think you'd be gaining much in readability or portability.

If it's something you need to do on a semi-regular basis, using a pidfile seems to me to be the best solution. Instead of trying to be clever about pipelines, just write the pid(s) to a file and use that.

That's the right way to do it but you always need a backup method to use in case the process dies without removing the pidfile.

Or use a dedicated uid for the daemon. You probably want that anyway.

I would just recommend pkill: https://en.wikipedia.org/wiki/Pkill

I tend to use the following command:

    kill -HUP $(pidof SomeDaemon)
and if it insists on running, I use:

    sudo kill -9 $(pidof SomeDaemon)
That's it, really.

I suggest:

    killall -HUP SomeDaemon

Don't do this! Not only is pkill more competent, it is available on several other operating systems.

There is a killall on Solaris also which is very different but true to its name. You do not want to run it by accident.

Top tip!

`pkill -v` is not verbose mode! Like grep -v, it inverts the match, so it will signal every process that does NOT match.

Remember pkill is next to pgrep...

pgrep is also useful if you're not trying to kill. They're both from Solaris, and if you have one, you have the other. pgrep | xargs can also be very useful on occasion, depending on what you're doing (mainly debugging).

> xargs can also be very useful on occasion

yes, and xargs has a particularly useful argument -P that allows to do stuff in parallel and probably deserves more love :P
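For example (-P is in GNU and BSD xargs, though not in POSIX):

```shell
# -n 1 passes one argument per invocation; -P 2 runs up to two at once.
# Output order is not guaranteed, hence the sort.
printf '1\n2\n3\n' | xargs -n 1 -P 2 echo | sort
```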

I often roll with 'pgrep' part of pkill package to fetch the pids of either the ucomm or longname (-l) of the process.

pgrep -l 'reg.*ex' | xargs -L1 do-stuff

or just pkill if the idea is just to send a signal.

To gather particular process metrics, ps can be invoked with process ids (-p), with full control of its output (-o), grep is rarely needed.

perl -e 'print `pgrep SomeDaemon `'

  do_something_with_pids $(ps -C SomeDaemon ho pid)
Though it's GNU procps.


In a script:

Get-Process | Where-Object { $_.ProcessName -eq 'SomeDaemon' } | Stop-Process

Or at the command line:

ps | ? { $_.ProcessName -eq 'SomeDaemon' } | kill

A pid file is the best IMO

grep -v grep... I wish I could admit to using this less often than I do; but it's always manual, not automated.

At some point I'll figure out a better alternative than this or pgrep which often misses processes.

Just do 'grep [S]omeDaemon', and you won't see your grep line.

No, please don't do that. Use filtering capabilities of ps, so grep becomes completely unnecessary.

For manual use, I have a shell function

  psgrep () {
    ps aux | sed -n '1p;/\<sed\>/d;/'"$1"'/p'
  }
It's basically the same "grep -v grep" trick, but with pure sed. Also, the initial "1p" ensures that I still get the column headers from ps.

Isn't rg the hot one for grep replacement these days?

It doesn't replace Awk though. I think the point of this post is that you shouldn't do `<search tool> <pattern> | awk ...`, you should just do `awk /<pattern>/ ...`.

rg + fzf = <3

Took me a while to understand the name.

can't go back, ripgrep is so fast.

I'm sure the author is aware, but awk has at least three implementations: nawk (the one true), gawk (what most are using), and mawk (performance-oriented, unmaintained). Plus busybox-awk.

When benchmarking gawk, I've found using LANG=C and avoiding UTF-8 to make a substantial difference for pattern matching.

Mawk is currently maintained: http://invisible-island.net/mawk/mawk.html

> I've found using LANG=C

This is also true for grep and tr and sort - the unicode handling does impact speed quite a bit and I still don't understand why this here https://stackoverflow.com/q/20226851/772013 is not treated as a bug

He says that you can't cleanly do grep -v with awk, but awk '!/foo/' seems to work in my copy of awk.

AWK is the general-purpose programmatic filter and reporting tool in the Unix pipeline. Sed, grep and cut are specializations for specific use cases whose implementations might have better performance. Perl and Python are probably too general-purpose for writing compact one-liners in a pipeline.

Even though it's more characters, I usually use perl -pe 's/search/replace/' instead of sed in pipelines because it understands /n (and other escape characters I don't remember). Because all it takes is to get burned a couple times for it to be worth sticking with what you know will work.

The Perl version is considerably faster on large files as well. Or at least, it is when compared to the version of BSD sed shipped with OSX.

This is probably not the case with the sed in Linux.

"/n"? What's that? My `man perlre' doesn't say anything about this regexp flag.

because OP is referring to escape characters I assume it is a typo for "\n"

Probably not what rflrob meant, but /n prevents parentheses from capturing.


This is somewhat surprising. A relatively new feature, and not present in my current Perl.

Except "grep | awk" is often way quicker because grep does not need to split the record before it does anything, and I usually want to spawn more processes because awk will usually eat a whole CPU if it's doing something gnarly.

But I like my colorized match: grep --color ...

Also grep -n is so much nicer than something like awk '{print NR "," $0}'

Finally, I think grep must be faster if only because grep doesn't line buffer by default, although you may pass --line-buffered.

awk is certainly a really important tool to know, but in general this is poor advice. awk is significantly slower than grep -E, and typing awk commands is often much slower as well. Not to mention that awk only operates on a single stream of data, and can't do operations with file awareness. Sure, use awk's line filtering when you're already going to need awk for something else, but in general your first instinct should be grep.

I've always wanted to learn to use awk but I just can never find enough examples that allow me to understand what the hell I'm doing. The learning curve is too high for me.

Use of grep -v equiv is wrong and dirty. Instead should be:

    $ [data is generated] | awk '!/something/'

Just FYI, the code lines do not wrap and are not scrollable for me on mobile (Firefox/Android).

To address one of the benefits, `egrep ^[^#]` is shorter than `awk '/^[^#]/'`.

That pattern doesn't need extended regular expressions (use grep -E instead of egrep though). Also `grep -v ^#` does the same thing. `sed /^#/d` is one char shorter.

> Also `grep -v ^#` does the same thing.

Depends on what you're after. 'grep ^[^#]' also gets rid of empty lines, as they don't have a first character to match.
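To make the difference concrete:

```shell
input='# comment
keep

also keep'
# -v '^#' only removes comments, so the blank line survives:
printf '%s\n' "$input" | grep -v '^#'
# '^[^#]' requires a non-# first character, so blank lines go too:
printf '%s\n' "$input" | grep '^[^#]'
```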

grep -E performs an extended regex. To get a perl compatible one you need to use grep -P.

How am I supposed to do `grep -nrb --include \*.cpp something` ?

This is close, but I don't know how to implement grep's `-b` easily:

    find . -type f -name "*.cpp" -exec awk '/something/ {print FILENAME, NR, $0}' {} \+
Clearly grep wins this round!

Is there any awk equivalent for grep "matcha\|matchb"?

awk '/matcha|matchb/'

Excellent :) Though I'm slightly disappointed in myself that I did not try that, as I did imagine it when I asked...

This is basic awk. But grep is much faster if that's all you're doing. fgrep even more so if that's what you want.

Yay, now I've found a replacement for grep under macOS. Thanks for that!

Why would you need that, when grep comes with macOS?

but this version does not support Perl-like expressions
