While awk is indeed under-appreciated, there are many instances where using grep to pre-filter your input is helpful because the grep family (GNU grep, agrep, etc.) can match strings much much faster than awk's regex engine.
For example: GNU grep uses a heavily optimized implementation of the Boyer-Moore string search algorithm [1] which (it is claimed) requires just 3 (x86) cycles per byte of input. Boyer-Moore only searches for literal strings, but grep will extract the longest literal from the search regex (e.g. the string "foo" from the regex /foo(bar|baz)+/) and use Boyer-Moore to find candidate matches that are then verified using the regex engine.
So if you have a large corpus of text to search, grep is your friend. It can easily saturate your I/O bandwidth and perform an order of magnitude faster than typical "big data" platforms. [2]
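For instance, letting grep do the cheap literal scan first and handing only the survivors to awk (a sketch; the file name and the assumption that the fifth field is a byte count are made up):

$ grep -F 'ERROR' huge.log | awk '{ bytes += $5 } END { print bytes }'

versus doing the filtering inside awk itself:

$ awk '/ERROR/ { bytes += $5 } END { print bytes }' huge.log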
The three biggest reasons to use grepalikes aren't necessarily the time it takes to execute the search, but:
1) The reduced amount of typing. It's much faster to type "ack foo --ruby" than it is to type "find . -name '*.rb' | xargs grep foo"
2) Better filetype matching, because ack and ag check things beyond just file extensions. ack's --ruby flag checks for *.rb and Rakefile and "ruby" appearing in the shebang line, for example.
3) Features that grep doesn't include, like better highlighting and grouping, full Perl regular expressions (in ack), and a ton of other things.
Here's a command line to find all the header files in your C source. It takes advantage of the ability to use Perl's regular expressions to specify output:
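(Something along these lines, assuming ack 2.x; the exact pattern is an assumption about what was meant:)

$ ack --cc -h --output '$1' '#include\s+<(.+)>' | sort -u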
Even though I created ack, I don't care which tool you use, so long as it suits your needs and makes you happy, but if all you're using your grep/ack/ag for is "grep -R function_name ." then you're only scratching the surface of what the tools can do for you.
See http://blog.burntsushi.net/ripgrep/ for a quite nice comparison which is counter to your experience; ack/ag/pt are all slower than either grep or ripgrep.
"ack/ag/pt are all slower than either grep or ripgrep"
You need to be a touch careful with blanket statements, as this conclusion isn't quite what the data in my blog says. What it says is that ag is generally slower than GNU grep on large files, primarily because GNU grep's core search code is quite a bit more optimized than ag's. However, ag can outclass GNU grep when searching across entire directories primarily by culling the set of files that is searched, and also by parallelizing search if you want to exclude running GNU grep in a simple `find ... | xargs -P`-like command. (This is thesmallestcat's point.) This is why the first set of benchmarks in my blog don't actually benchmark GNU grep, because there is an impedance mismatch. Instead, ag/pt are benchmarked against `git grep`, where the comparison is quite a bit closer. (You'll want to check out the benchmarks on my local machine, which avoid penalizing ag for its use of memory maps.[1]) The second set of benchmarks basically swap out `git grep` for GNU grep and compare the tools on very large files.
ack isn't actually included in the benchmarks because it was incredibly slow when I tried it, although there may be something funny going on.[2] To be honest, it isn't terribly surprising to me, since every other tool is compiled down to native code.
The pt/sift story is interesting, and my current hypothesis is that the tools have received a misleading reputation for being fast. In particular, when using the tools with simple literals, it's likely that search will benefit from an AVX2 optimization when searching for the first byte. As soon as you start feeding more complex patterns (including case insensitivity), these tools slow down quite a bit because 1) the underlying regex engine needs more optimization work and 2) the literal analysis is lacking, which means the tools rely even more on the regex engine than other tools do.
The short summary of ripgrep is that it should outclass all the tools on all the benchmarks in my blog. I should update them at some point as ripgrep has gotten a bit faster since that blog post. Primarily, a parallel recursive directory iterator, which is present in sift, pt and ucg as well. Secondarily, its line counting has been sped up with more specialized SIMD routines. (I should hedge a bit here. It is possible to build patterns that a FSM-like engine will do very poorly on, but where a backtracking engine may not degrade as much. The prototypical example is large repetitions, e.g., `(foo){100}`. Of course, the reverse is true as well, since FSMs don't have exponential worst case time. Also, my benchmarks aren't quite exhaustive. For example, they don't benchmark GNU grep/ripgrep's `-f` flag for searching many regexes.)
Regarding your '(foo){100}' example, couldn't you expand this into a literal search for 100 repetitions of 'foo'? I guess this could interact poorly with the regex engine, and you'd be expected to cover more complex instances like '(foo){50,100}', but I think it might be worth the effort in some cases.
The truly eagle-eyed would notice that there are actually only 83 instances of `foo` in this string. In fact, literal detection is quite a tricky task on arbitrary regular expressions, and one must be careful to bound the number and size of the literals you produce. For example, the regex `\w+` matches an ~infinite number of literals, but `\w` is really just a normal character class, and some character classes are good to include in your literal analysis.
In this case, the literal detector knows that the literal is a prefix. That means a prefix match must still be confirmed by entering the regex engine. If you pass a smaller literal, e.g., `(foo){4}`, then you'll see slightly different output:
In this case, the regex compiler sees that a literal match corresponds to an overall match of the regex, and can therefore stay out of the regex engine entirely. The "completion" analysis doesn't stop at simple regexes either, for example, `(foo){2}|(ba[rz]){2}` yields:
This is important, because this particular pattern will probably wind up using a special SIMD multi-pattern matcher. Failing that, it will use the "advanced" version of Aho-Corasick, which is a DFA that is computed ahead-of-time without extra failure transitions (as opposed to the more general lazy DFA used by the regex engine).
Literal detection is a big part of the secret sauce of ripgrep. Pretty much every single regex you feed to a tool like ripgrep will contain a literal somewhere, and this will greatly increase search speed when compared to tools that don't do literal extraction.
Of course, literal extraction has downsides. If your literal extractor picks out a prefix that is a very common string in your haystack, then it would probably be better to just use the regex engine instead of ping-ponging back and forth between the prefix searcher and the regex engine. There are probably ways to detect and stop this ping-ponging, but I haven't invested much effort in that yet. Still, the performance improvement in the common case (literals make things faster) winds up being more beneficial for the general use case.
The alias below sets perl up to loop over STDIN, splitting each line on one or more whitespace characters and populating the array @F. The -nE switches then Evaluate an expression given on the command line, looping over the input line by line.
alias glorp='perl -aF"/\s+/" -nE'
So now we have the command `glorp` to play with which has more familiar syntax than awk and all of CPAN available to play with!
$ [data is generated] | glorp '/Something/ and say $F[2]'
We have access to any Perl module by putting -MModule::Name=function after the command; the following will parse a JSON record per line and glorp out what we wanted:
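(For example, spelling out the full perl invocation so the program stays the last argument after -E, and assuming one JSON object per line with a "name" field; JSON::PP ships with perl:)

$ printf '%s\n' '{"name":"alice","count":3}' | perl -MJSON::PP=decode_json -nE 'say decode_json($_)->{name}'
alice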
Maybe you are used to using curl too. There is a nice web framework in Perl called Mojolicious (http://mojolicious.org) that provides a convenience module called 'ojo' for command line use. So grabbing the summary sentence from Wikipedia articles is as straightforward as below. Notice Mojolicious lets us use CSS selectors!
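(Something like this; the CSS selector is an assumption and may need tweaking for Wikipedia's current markup:)

$ perl -Mojo -E 'say g("https://en.wikipedia.org/wiki/AWK")->dom->at("#mw-content-text p")->all_text'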
(Of course Ruby got the -a autosplit-mode and the -n assumed 'while gets(); ... end' loop from Perl along with $_ and $F, so it's very intentional that they're similar)
I can think of two very strong reasons to prefer awk over perl:
* busybox implements awk, so it's (usually) available even on embedded platforms
* awk syntax is way clearer than perl; anyone who has a little experience with C and shell can figure out what is being done with some googling
awk is also powerful enough to separate output to different files, accumulate inputs, etc. And if you need anything more complex, there are tons of other languages to choose from then (including perl).
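For example, this accumulates a per-user total while also splitting records into per-status files (assuming input lines of the form "user status bytes"; the file names are made up):

$ awk '{ total[$1] += $3; print > ($2 ".txt") } END { for (u in total) print u, total[u] }' access.txt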
What's the reason for using the -F option? -a defaults to the " " separator, which already emulates awk's behavior. If I'm not wrong, the only difference between perl -aF"/\s+/" and perl -a is the treatment of leading whitespace in lines.
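That does seem to be the only visible difference, e.g.:

$ echo '  a b c' | perl -anE 'say "[$F[0]]"'
[a]
$ echo '  a b c' | perl -aF'/\s+/' -nE 'say "[$F[0]]"'
[]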
I wrote a little tool called "pyline" about a hundred years ago... I still reach for it often when I'm in a hurry, and don't have time to futz around with awk/sed.
Not really possible in Python for anything overly useful/complex because of the significant whitespace, and regular expressions are not first-class objects in the language. ipython is the equivalent solution, I suppose.
However, Ruby completely inherited this aspect of Perl and is another good candidate. It's just that even old systems have a perl installed that will perform the awk-like functionality.
"Pyp is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment."
That's cool, but I feel like you're unlikely to run into it very often in the wild? The vast majority of Linux systems come with ruby/perl out of the box. Plus if Python did support this sort of hackery it would quickly garner most of the bad press Perl has over the years :'(
Who cares? It's on pypi (https://pypi.python.org/pypi/pyp/2.11) so just `pip install pyp` and then you have it. You're probably not going to want to distribute shell scripts that rely on pyp, but if you're distributing scripts you can just distribute scripts written entirely in Python and use pyp for your own one-liners.
The lack of something like Perl/Ruby's -n and -p switches is a barrier as well. Those are handy because they automatically work for either piped input, or files named on the command line.
Of course, you can do all of this in Python, just not as a one-liner.
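A rough sketch of the non-one-liner version, using fileinput, which (like -n) reads files named on the command line or falls back to stdin; filter.py is a made-up name:

$ cat filter.py
import fileinput
for line in fileinput.input():
    if "Something" in line:
        print(line.split()[2])
$ some_command | python3 filter.py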
ripgrep[1] is functionally incredibly similar to grep and ag, but is significantly faster[2] and supports a wider range of character encodings. In its short lifetime it has already become the default search tool for VSCode.
I've switched to using it as my daily driver for text search and am incredibly happy with it.
Edit: I confused awk with ag originally leading to this comment. Using ripgrep as a pre-filter to awk is still a ridiculous amount faster, especially on large trees, so while the OP's suggestion is cool I can't see myself reaching for it often.
> and supports a wider range of character encodings
Since this is an uncommon feature, I'd like to emphasize this. :-) In particular, ripgrep will automatically search UTF-16 encoded files via BOM sniffing. That means you can run ripgrep over a directory on Windows and be confident that it will correctly search both UTF-8 and UTF-16 encoded files automatically without having to think about it.
More generally, it supports all encodings found in the Encoding Standard[1]. However, only UTF-16 is automatically detected (where UTF-8 is the presumed default), so you'll need to explicitly specify `-E sjis` (for example) if you want to search Shift_JIS encoded files.
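For example (the file name is made up):

$ rg -E sjis '東京' notes-sjis.txt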
Also, I didn't really need to do much of anything to get this working. This is all thanks to @hsivonen's encoding_rs[2] crate, which is now (I think) in Firefox.
awk is a full programming language, and most of the times that I'm doing awk scripting I have to use things like associative arrays and arithmetic. Pulling out a field from a line is what most people use awk for, but it's honestly the least interesting part of awk. In fact, if cut supported regular expressions for specifying the field and record separators people wouldn't be using awk for that purpose (because that's all that $n does).
I was under the impression that ripgrep is a grep implementation that was incredibly optimised thanks to BurntSushi being a complete madman.
Yeah, ripgrep is "a grep," not an awk. I'm not sure how they wound up being conflated here. ripgrep does have a `-r/--replace` flag which is somewhat of a generalization of grep's `-o/--only-matching` flag (and also part of ack, I believe) by permitting sub-capture expansion that probably does replace awk for the "pulling out a field from a line" use case you mentioned. But it's pretty awkward for simple cases. e.g., I'd much rather use `blah | awk '{print $2}'` to get the second field delimited by arbitrary whitespace.
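For comparison, the ripgrep spelling of that (a sketch, assuming whitespace-delimited fields) is roughly:

$ blah | rg -o -r '$1' '^\s*\S+\s+(\S+)'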
The main reason I use awk 90% of the time is that its field parsing algorithm does "the right thing" in most cases (i.e. divide fields by 1 or more whitespace characters) without a lot of boilerplate, so it's really easy to throw into a pipeline.
> Pulling out a field from a line is what most people use awk for, but it's honestly the least interesting part of awk. In fact, if cut supported regular expressions for specifying the field and record separators people wouldn't be using awk for that purpose (because that's all that $n does).
There's nothing magical about awk's default FS. It's literally just /\s+/. If cut's -d was slightly more clever you wouldn't need to use awk.
I wish the default 'cut' implementation could be just a little more clever - regex delimiters would be good, it doesn't even support multiple characters :(
Also, cut's output manipulation is surprising. '-f 2,1' is actually the same as '-f 1,2' - you can't change the order of printing.
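For example, with GNU cut:

$ printf 'one\ttwo\n' | cut -f 1,2
one	two
$ printf 'one\ttwo\n' | cut -f 2,1
one	two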
I know there are other programs that can do the job, but it's a little frustrating when you can 'almost' get there with a chain of piped commands and a simple tool like cut, but have to fall back on a 'real' programming language to do just that extra bit of manipulation (awk, perl, whatever, and yes, I know the shell is a programming language but you get my point!)
If you pipe your file through "while read f1 f2 f3 ; do echo field2 is $f2 ; done" for example you can pick out fields. Re-ordering them is just a special case of any sort of bash manipulation you can do in the loop body. Admittedly for the very basic case, it's not as terse as "cut -f2", but if you're doing any further processing on the stream then I find it's often shorter.
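For example, re-ordering two fields:

$ printf '1 alpha\n2 beta\n' | while read f1 f2; do echo "$f2 $f1"; done
alpha 1
beta 2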
Yeah, I thought of mentioning this. The reason I don't do this, though, is that I generally dislike keeping track of when the shell does or does not automatically split input, so I only use `read` with the -r option.
The default FS throws away leading blanks, though, which doesn't happen if you set it explicitly to \s+, so a tiny little bit of magic does go on after all.
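For example:

$ echo '   a b' | awk '{ print "[" $1 "]" }'
[a]
$ echo '   a b' | awk -F'[ \t]+' '{ print "[" $1 "]" }'
[]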
ripgrep doesn't yet support compressed files (z/gz), while grep/ack/ag do. On *nix it's extremely common to have sparsely compressed directories, from logfiles to non-changing documentation. I really wish ripgrep would just stream to a fast coprocess and support any stream compressor transparently instead of trying to use a built-in rust library.
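One workaround in the meantime is to do the decompression in a coprocess yourself, at the cost of losing per-file names (the paths and pattern are placeholders):

$ find /var/log -name '*.gz' -exec zcat {} + | rg 'pattern'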
If you're not a fan of this approach (downloading random binaries and slapping them into your $HOME/bin), then your other choice is to build from source. I don't use Ubuntu, but install Rust[1] and then building ripgrep is easy:
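(With the Rust toolchain installed, presumably something like:)

$ cargo install ripgrep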
I'd like to share a recent experience on a related note, but in the opposite spirit - rather than reduce the number of command invocations on a command line, it may make sense to increase it.
I had a loop operating on a text file like this:
while read line
do
echo "$line" | sed -e "s/A/X/" -e "s/B/Y/" -e "s/C/Z/"
...
Gradually, as I added more things to replace, I noticed severe slowdown. Things got fast again when I rewrote it as
while read line
do
echo "$line" | sed -e "s/A/X/" | sed -e "s/B/Y/" | sed -e "s/C/Z/"
...
Turns out the latter (multiple processes in parallel) helped with throughput.
The reason my one-liners often end up with both grep and awk pipes is because that's how they're built: run some command, grep for content of interest, throw awk at it to manipulate output. If this is a throwaway, the exploration flexibility is more important than the command cleanness, and there are some grep idioms (inverted match, case-insensitive, match count, first match only) which are easier to invoke than their awk equivalents.
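A typical throwaway of that shape (the command and fields are made up):

$ dmesg | grep -i 'error' | grep -v 'usb' | awk '{ print $1, $NF }'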
Sure, if I'm going to commit that to a script or shell function, I'll consider going back to refactor it and clean it up. But for quick-and-dirty exploration and prototyping, a long shell pipe is often the best modular way to compose a tool.
Protip: M-X-E will call up your one-liner into an editor session, from which you can save it directly to a permanent file. The number of locally-written tools which have originated in this fashion is ... probably embarrassing to admit.
Small note: For GNU grep on single regexes at least, the -F flag should not impact performance. It is smart enough to recognize that a pattern is a plain literal and avoid the regex engine.
If you don't mind a bit of Perl, then Perl can be used for grep, awk and sed. And find. Even comes with little conversion utilities to do it (mostly) for you: a2p, s2p, find2perl.
I use Python most of the time but still use Perl where it is appropriate.
As per other users, you can use pkill but it's not entirely portable in meaning between OSs. This software had to run on just about every possible Unix out there. Solaris, HP-UX, AIX, many others ...
My main point is that using perl and grep (multiples thereof!) is nuts when you can do it all in perl.
Also crazy was using -ef (all processes) when the process of interest was running as a known user. So ps -u <user> would be more appropriate.
Doing as much in Perl as possible, for portability, you'd do something like
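(a rough sketch; appuser and myserver are placeholder names)

$ ps -u appuser -o pid=,args= | perl -ane 'print "$F[0]\n" if "@F[1..$#F]" =~ /myserver/'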
If it's something you need to do on a semi-regular basis, using a pidfile seems to me to be the best solution. Instead of trying to be clever about pipelines, just write the pid(s) to a file and use that.
pgrep is also useful if you're not trying to kill. They're both from Solaris, and if you have one, you have the other. pgrep | xargs can also be very useful on occasion, depending on what you're doing (mainly debugging).
It doesn't replace Awk though. I think the point of this post is that you shouldn't do `<search tool> <pattern> | awk ...`, you should just do `awk /<pattern>/ ...`.
I'm sure the author is aware, but awk has at least three implementations: nawk (the one true), gawk (what most are using), and mawk (performance-oriented, unmaintained). Plus busybox-awk.
When benchmarking gawk, I've found using LANG=C and avoiding UTF-8 to make a substantial difference for pattern matching.
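E.g. (the pattern and file are placeholders):

$ LANG=C gawk '/some pattern/ { n++ } END { print n+0 }' big.log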
This is also true for grep and tr and sort: the unicode handling impacts speed quite a bit, and I still don't understand why https://stackoverflow.com/q/20226851/772013 is not treated as a bug.
AWK is the general-purpose programmatic filter and reporting tool in the Unix pipeline. Sed, grep and cut are specializations for specific use cases whose implementations might have better performance. Perl and Python are probably too general-purpose for writing compact one-liners in a pipeline.
Even though it's more characters, I usually use perl -pe 's/search/replace/' instead of sed in pipelines because it understands \n (and other escape characters I don't remember). Because all it takes is to get burned a couple times for it to be worth sticking with what you know will work.
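For instance, turning a colon-separated record into lines (whether sed accepts \n in the replacement depends on which sed you have):

$ printf 'a:b:c\n' | perl -pe 's/:/\n/g'
a
b
c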
Except "grep | awk" is often way quicker because it does not need to split the record before it does anything and I usually want to span more processes because awk will usually eat a whole cpu if it's doing something gnarly.
awk is certainly a really important tool to know, but in general this is poor advice. awk is significantly slower than grep -E, and typing awk commands is often much slower as well. Not to mention that awk only operates on a single stream of data, and can't do operations with file awareness. Sure, use awk's line filtering when you're already going to need awk for something else, but in general your first instinct should be grep.
I've always wanted to learn to use awk but I just can never find enough examples that allow me to understand what the hell I'm doing. The learning curve is too high for me.
That pattern doesn't need extended regular expressions (use grep -E instead of egrep though). Also `grep -v ^#` does the same thing. `sed /^#/d` is one char shorter.
[1] https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...
[2] https://aadrake.com/command-line-tools-can-be-235x-faster-th...