Why GNU grep is fast (2010) (freebsd.org)
495 points by kurren on Nov 28, 2013 | hide | past | favorite | 129 comments

In case you're interested to see what older HN contributors had to say about this, here are some of the previous submissions, all of which have significant discussion:

https://news.ycombinator.com/item?id=1626305, 1193 days ago, 115 comments

https://news.ycombinator.com/item?id=2393587, 972 days ago, 68 comments

https://news.ycombinator.com/item?id=2860759, 842 days ago, 43 comments

( ionelm also pointed out a previous submission: https://news.ycombinator.com/item?id=6814153 )

Also worth mentioning is "The Treacherous Optimization" ( http://ridiculousfish.com/blog/posts/old-age-and-treachery.h... ), although previous submissions of that provoked no discussion at all.



ADDED IN EDIT: More rigorous searching has turned up substantial discussion of the Treacherous Optimization: https://news.ycombinator.com/item?id=1627367

It's great that you posted this, but it would be greater if HN was capable of doing it automatically since it is mostly trivial, but provides a lot of value.

In the past, there were semi-automated accounts that did this, but they made more people upset than happy.

I would also prefer to see this happen on every story.

hey hey hey, if you want hacker news to stay fast you shouldn't make it do too much!

The systems I wrote were independent of HN and didn't slow it down. I got significant grief for them and pulled them.

Standard "wisdom" for start-ups includes "If you're not embarrassed by your first product then you didn't launch early enough," and "Try the minimal viable product before investing too much time." I both launched early, and made sure the "product" was absolutely minimal. It got slated by the people it was trying to help, so I didn't bother refining it.

The experience was educational and instructive.

(minor edit to remove some inappropriate snark - apologies)

I wouldn't be surprised if that was because people misinterpreted it as complaining about reposting, which is a contentious topic on HN.

The complainers were probably just a vocal minority.

Stay? When was HN ever fast? It's a tech demo for a language that's been more-or-less abandoned, that accidentally grew into a popular website.

I implemented a sort of micro-grep in the past using Thompson's construction. It was a great exercise in practical applications of heavily theoretical CS: recursive-descent parsing of the regular expression, then Thompson's construction to build an NFA from the parse tree, then simulating the NFA (computing DFA states on the fly) using two stacks. I used the Dragon Book for reference on this; all pretty awesome stuff.

The author of The Silver Searcher[1] has some blog posts about his efforts at optimizing his own grep-like tool.

There's nothing wrong with good ole grep, but on boxen I spend any significant time on, I always install ag. The integration with Emacs provided by ag.el[2] is awesome too!

[1] https://github.com/ggreer/the_silver_searcher#how-is-it-so-f...

[2] https://github.com/Wilfred/ag.el

I'm actually more familiar with the KMP algorithm than BM. So I looked it up to see what the difference was:

The classic Boyer-Moore algorithm suffers from the phenomenon that it tends not to work so efficiently on small alphabets like DNA.

The skip distance tends to stop growing with the pattern length because substrings re-occur frequently.

By remembering more of what has already been matched, one can get larger skips through the text.

One can even arrange 'perfect memory' and thus look at each character at most once, whereas the Boyer-Moore algorithm, while linear, may inspect a character from the text multiple times.

1: http://stackoverflow.com/questions/12656160/what-are-the-mai...
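For reference, the bad-character skip at the heart of Boyer-Moore can be sketched with the simplified Horspool variant (a sketch for illustration; the function name is mine, not grep's):

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: a simplified Boyer-Moore that keeps only the
 * bad-character table.  Returns the offset of the first match, or
 * (size_t)-1 if there is none. */
size_t horspool_search(const unsigned char *text, size_t n,
                       const unsigned char *pat, size_t m)
{
    size_t shift[256];

    if (m == 0 || m > n)
        return (size_t)-1;

    /* A byte that never occurs in the pattern lets us skip m bytes. */
    for (size_t i = 0; i < 256; i++)
        shift[i] = m;
    /* Bytes that do occur (except the last) allow only smaller skips --
     * on a 4-letter DNA alphabet this table saturates quickly, which is
     * exactly why the skips stop growing with the pattern length. */
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;

    /* Compare at the current window; on mismatch, shift by the skip
     * distance of the byte under the window's last position. */
    for (size_t pos = 0; pos + m <= n; pos += shift[text[pos + m - 1]])
        if (memcmp(text + pos, pat, m) == 0)
            return pos;

    return (size_t)-1;
}
```

With a full 8-bit alphabet and a long pattern, most iterations skip close to m bytes; on DNA the second loop fills the table so densely that typical skips shrink toward 1, matching the behavior described in the quoted answer.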

You would inspect characters multiple times only in a very naive implementation of Boyer-Moore. The Galil Rule[0] fixes this and is also required for proving linear worst-case runtime.

[0] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sea...

This page [1] nicely demonstrates differences between KMP and BM.

[1]: http://www.cs.utexas.edu/users/moore/best-ideas/string-searc...

I have a book called Flexible Pattern Matching in Strings[1], which covers all the algorithms that might go into a modern, state-of-the-art grep, with practical advice on performance (N.B.: Boyer-Moore is no longer state-of-the-art). The publication year is 2002.

[1] http://www.dcc.uchile.cl/~gnavarro/FPMbook/

Couldn't you speed up a small-alphabet search by considering pairs or triplets of characters, so you have N^2 or N^3 glyphs instead of N characters (going from 4 symbols to 16 to 64)?

I'm not sure exactly how your proposal might work, but, generally, the problem you would run into is that the group of characters can be shifted. So you'd end up having to compare every possible shift.

That's also a Forth way of looking at optimization.

Also reminds me of Kent Beck's quip when he was asked to optimize Chrysler's C3 system. He asked for validated sets of input and output. The programmers on site said the system wasn't producing correct results yet. His response: in that case, I can make this real fast!

I recall nearly the same story about Jerry Weinberg:

Engineers: Oh your system can only do a thousand cards a minute? Ours can do ten thousand.

Weinberg: Yeah but your system doesn't work! If mine doesn't have to work I can do a million cards a minute.

I thought I got it from one of Weinberg's books.


I think Kent was being sarcastic - his point is that it's rather pointless to optimize until the system is producing correct results. He's also famous for the "Make it work, make it right, make it fast" quote.

Are you sure he's the source of the quote?

A quick scholar.google.com search found a match in Byte magazine from 1983:

> Furthermore, many C environments contain measurement tools that enable the programmer to identify these critical sections easily. But the strategy is definitely: first make it work, then make it right, and, finally, make it fast.

The Google URL is http://books.google.com/books?ei=dtOXUoCVHoegkAef9YGQBg&id=A... but that's not enough to tell who wrote it. The full text is at http://archive.org/stream/byte-magazine-1983-08/1983_08_BYTE... . Unfortunately, it's OCR'ed, with advertisements and articles intermixed.

I believe it was written by James Joyce.

In any case, it looks like Beck was a ~22 year old undergraduate at the University of Oregon when that was published.

Looks like I'm wrong about the attribution. It's in the article "The C Language and Models for Systems Programming" by Johnson and Kernighan.

'Compile, conform, perform'

Where is this from?

Conversely, Why GNU grep is slow (and how to make it not-slow): http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...

By this logic you can write really fast programs in Haskell, because it avoids computing unused values (and therefore complete calltrees). (Unfortunately, the management of such calltrees -- thunks -- often has higher cost than outright computing them in the first place.)

I try hard, but I'm not smart enough to write programs that do nothing. :)

You can do that in any programming language by, well, not computing unused values.

That's like saying anyone can make money by just making money. It's a null statement!

In the context of the comment I replied to, no, it's not.

Haskell gives you some constructs that makes this easier but there exists no program whose Haskell implementations will run faster because of this, only programs that are easier to make fast.

The big advantage lazy evaluation has is composability. In an eager language, you have to explicitly code operations in a lazy fashion, which means that if you use a library, it will probably be eager. With Haskell you can be pretty sure that the runtime will evaluate only as much of the program as it needs to.

That said, the downside is that it then becomes very difficult to reason about the performance characteristics of your library, because its performance depends on how it's used. Simon Peyton-Jones is on record as saying that laziness is probably the wrong default for a language - the big benefit for Haskell was that it "kept them honest" wrt purity, but in a production system you probably want strictness.

The term you're looking for is tautology, not 'null statement'.

You can. A lazy algorithm will do no more reductions than an eager one, and it will usually do fewer. You just have to specify strictness. The compiler is pretty conservative about adding strictness because it has to preserve behaviour. But if you know something will be computed, e.g., a loop accumulator, then you can just tell the compiler so.

The inverse is this: what is modern software doing that makes it so slow?

Figuratively speaking, a lot of modern software is busy peeling layers of various onions, going to places to check if it needs to go there, and writing and organizing detailed reports of what it's doing, every step along the way. In other words, bureaucracy and logistics.

> GNU grep also tried very hard to set things up so that the kernel could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying.

Good luck pulling this off in Chrome.

With nacl, you could. But I think you're talking about js, and yeah, you are right.
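The mmap() trick from the quoted passage looks roughly like this in C (a sketch only: real grep also handles huge files in windows and falls back to read() for pipes and special files; memmem() is a GNU/BSD extension, not portable C):

```c
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file and search it in place, avoiding the extra
 * kernel-to-user copy that read() cost on many old Unixes.
 * Returns 1 if found, 0 if not, -1 on error. */
int file_contains(const char *path, const char *needle)
{
    struct stat st;
    int found, fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;   /* zero-length files can't be mapped */
    }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);       /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    found = memmem(map, st.st_size, needle, strlen(needle)) != NULL;
    munmap(map, st.st_size);
    return found;
}
```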

Development time. Why invest time in such optimizations instead of other features when the app is fast enough? Time is not infinite and we need to prioritize what takes our development time.

I'd say time spent optimizing grep (GNU or otherwise) is always well spent. Saving a clock cycle or two per execution in my work's enterprise would translate into needing to lease less hardware, based just on how much grep is used. I imagine the same is true of all large Unix-y enterprises.

Portability and maintainability, if you want to boil it down to the core. The speed of gnu grep probably isn't portable since it relies on the kernel to do certain things in a cooperative manner, and it certainly won't be easy to maintain while staying fast.

I see the same in some of our software. One of our seniors is rolling out a number of very, very fast collections based on compare-and-swap operations, but you always end up thinking an hour or two about five lines of code. Most problem domains don't need this kind of performance, so most software teams don't pay the maintenance price, and I think they are correct in that judgement call.

Part of the performance is algorithmic. Mixing Boyer-Moore and regexes isn't elementary. However, the knowledge is available for anyone keen to apply it.

The fastest programs are the ones that avoid preparing for things they end up not doing. Another way to say it: just do one thing well.

The fastest programs return 0 immediately and don't do shit.

Returning 0 is technically doing something. It might not be something that you find useful but it is something.

On unix you'll find two programs which do this exact thing: `false` and `true`. False returns 1, and true returns 0.

You might not believe that these are real programs but they are. You can find the binaries using `whereis`. For example on my linux install true is /bin/true.

The command line is pretty awesome.

And here is the actual source code for them:


The /bin/true source is a bit longer than expected, but the /bin/false implementation is really interesting: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob_p...

That is both hilarious and brilliant. The bin/false source code made me laugh out loud.

true.c could be useful as a kind of helloworld -- a minimal example of the conventions followed by gnu command line programs.

GNU Hello exists for this purpose.

> Returning 0 is technically doing something.

Yeah, but try telling your boss that :)

Yep, boss will turn right around and "do" paying your salary

I'd imagine the fastest programs don't always return 0, since that's an extra MOV instruction.

    $ touch foo
    $ chmod +x foo
    $ ./foo
    $ echo $?

For me, the trivial C program appears to run faster than the empty file:

    $ touch empty
    $ chmod +x empty
    $ time ./empty

    real    0m0.002s
    user    0m0.000s
    sys     0m0.000s

    $ echo "int main(){return 0;}" > trivial.c
    $ gcc trivial.c -o trivial
    $ time ./trivial

    real    0m0.001s
    user    0m0.000s
    sys     0m0.000s
Timing results are consistent over several repetitions (provided everything's in cache from disk). Linux x86_64. `mov` takes ten thousand to a million times less than a millisecond ( https://gist.github.com/jboner/2841832 ), so I can't find out this way whether removing 'return 0' changes anything.

(If I use my default zsh shell to execute ./empty, it gives me

    zsh: exec format error: ./empty
    ./empty  0.00s user 0.00s system 0% cpu 0.008 total
So I used bash for this.)

So here's what happens with empty. If you run it from the shell, first the shell will fork, then try to exec empty. But the exec fails, since empty doesn't begin with a magic value. Therefore the exec call returns an error. Now the shell picks up this error, and then tries to run the program again, this time as an argument to an invocation of the shell (i.e., it does an exec of /bin/bash, passing it "empty" as a parameter). This is why empty ends up taking longer to run.

This is the normal pattern, just in case you forget to put "#!/bin/bash" at the top of the script, so that the script can be run anyway. This is also a source of confusion for some sysadmins, when a script works from the command line but not from something like a cron script.
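That try-exec-then-reinterpret pattern can be sketched in C (this is not bash's actual code, just the shape of the fallback; error handling trimmed):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the shell's fallback: exec the file directly; if the kernel
 * refuses with ENOEXEC (no recognized magic number, no #! line), run it
 * again as a script by handing the path to /bin/sh.  Only returns if
 * both execs fail. */
void run_like_a_shell(const char *path)
{
    execl(path, path, (char *)NULL);
    if (errno == ENOEXEC)
        execl("/bin/sh", "sh", path, (char *)NULL);
    perror("exec");
}
```

This is also why the empty file "runs" at all: the first exec fails, and /bin/sh happily executes zero lines of script and exits 0.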

Before the system can execute any one of these programs, it has to fork the bash process, call exec to load the new binary into the process's memory, load the standard library and map it into the address space, open stdin / stdout, run the program, close stdin / stdout, wait for the process termination ...

Compared to all this, a mov instruction in userland won't really make a difference.

Edit: List of system calls required to execute trivial.c (on my Linux):

execve, brk, access, mmap, access, open, open, open, open, open, stat, open, stat, open, stat, open, stat, open, fstat, mmap, close, access, open, read, fstat, mmap, mprotect, mmap, mmap, close, mmap, mmap, mmap, arch_prctl, mprotect, mprotect, munmap, exit_group

I get the same results:

    $ touch empty

    $ chmod +x empty 

    $ time ./empty 

        real    0m0.005s
        user    0m0.001s
        sys     0m0.001s

    $ echo "int main(){return 0;}" > trivial.c

    $ gcc trivial.c -o trivial

    $ time ./trivial

        real    0m0.002s
        user    0m0.001s
        sys     0m0.002s

Isn't that just a default in the bash implementation?

Technically the fastest non-portable program returns whatever was in eax upon call if I remember right.

Well not really because the "do one thing well" leads to shell scripting where you join together dozens of separate tiny programs. Shell scripting is not exactly fast, nor is it robust.

Please add [2010] to the title

Why? Because it doesn't apply anymore?

Because it can help people realize they might have read this before (I had), which in turn can affect how one prioritizes which articles to read.

Also because "it's how it's done around here": the site is about news so it makes sense to at least tag possible re-posts to flag them as "not quite news yet still interesting and recommended reading".

Yes, because avoiding lots of LOC by relying on tons of external libraries is the new 'doing practically nothing'. /s

The article references an article published in 1991. Does that need stating too?

If this story was just a blogspam saying "Hey look I found this article from 1991 which is still relevant" then yes, because the story would be worth nothing by itself.

But this story is actually an email (from 2010) written by the guy WHO ORIGINALLY WROTE GREP, making it a pretty damn interesting source in its own right. In that sense, the 1991 article is very much an aside.

In conclusion, no.


Just a pedantic remark: He (Mike Haertel) originally wrote GNU grep. Grep, on the other hand, was originally written by Ken Thompson.

Good point, well made :)

Does this still work in Unicode? It seems like it'd cause the Boyer-Moore lookup tables to blow up in size.

You can use Unicode code points for the shift table(s -- the Horspool variant has only one table) but make them sparse, so the tables only contain characters that actually appear in the pattern. Hash tables make for easy lookups, but with a bit array of a few hundred kB you can also use a bitmask.

Of course, you can also just reduce the Unicode pattern to bytes, so your alphabet is never larger than 256. This will run slower, but not as much as you'd think: Boyer-Moore does benefit from larger alphabets, but only to the extent that the alphabet is actually used.

> Of course, you can also just reduce the Unicode pattern to bytes

Ah, right. I was under the impression this was unsafe, since you could end up with spurious byte matches that are not on character boundaries. But it seems the keyword is "self synchronizing", and UTF-8 (but not UTF-16) is safe to do byte-oriented searching on.

Link to a no-registration-required PDF version of "Fast String Searching" mentioned in that post


Original title is better. I'm gonna call the cops.

In case it helps anyone else, to install GNU grep on OSX using homebrew I did

  brew tap homebrew/dupes
  brew install --default-names homebrew/dupes/grep
following http://www.heystephenwood.com/2013/09/install-gnu-grep-on-ma...

Be careful with:

  --default-names do not prepend 'g' to the binary
On OS X (a BSD derivative with BSD grep), --default-names will put GNU grep in your path. A lot of things in OS X depend on BSD sed/awk/grep. Omitting --default-names instead installs the GNU variants as ggrep/gsed/gawk.

Quote of the day:

  The key to making programs fast is to make them do practically nothing. 
         -- Mike Haertel

I know all this, but it still feels like magic when you run it against a fat log file and it's done in a couple of seconds, while awk and tail were previously struggling with it for like 15 minutes...

That is similar to how I explain to some students how to have a fast database system: the key to a fast query is one that does little, and the way to make it do little is to have the data laid out exactly as it is needed.

This applies very well to optimization, too. There are only really two ways to optimize code:

1) make it do less

2) make it do more at a time.

The first corresponds to using more efficient algorithms and data structures. The second is parallelism.

I've found that taking a considerable amount of time (days even) to determine the very best solution to a problem always results in less effort and less time spent coding overall. It's also no coincidence that the code becomes incredibly simple, more readable, more efficient, more flexible, easier to build on top of, and fewer issues down the road. Plus, in the end, depending on the magnitude of the project, less time is spent on overall development.

It is never a good idea to rush any large coding project. By rush, I mean just sit down and start churning out code just so you can have something tangible within a day or two. And by large, I mean anything requiring at least a few thousand LOC. Anything less can be okay to rush though; i.e., small projects where maintainability and scalability aren't as important; e.g., some MVP to test some market.

I wholly agree that it's worth investing in finding the best solution. However, in my experience I found that having concrete code to work with is invaluable. The magic bullet for me is to slap something concrete together, then aggressively refactor / cut the crap out of it. In some sense, building software is like sculpture. You've got this amorphous block of half baked ideas, and you need to turn it somehow into a svelte shape.

* We end up finding more crap to remove than we'd thought at the start.

* We can cut down misguided arguments from wannabe architects of "what if we'll need x" by saying "we'll add this back when someone asks for x". It's much easier to point out that x is currently pointless with concrete code than at the demiurge phase where everything is possible and timelines are ignored.

* We can chop at the problem as a team, without having a long sequential and solitary step of "designing the best system". Amdahl's law applies to dev teams as well.

I call that "sketching in code." I've gotten in some trouble in interviews for "starting to write code right away," so it's a habit I'm having to fight at the moment, but I find having something in code (that I'm fully willing to throw away) really helps.

It's like Fred Brooks said:

    ...plan to throw one away; you will, anyhow.

I agree. It helps to have something tangible and to refactor later. I think the main issue that some people (myself included of course) have with that is refactoring too late. You never want to build something big on top of sketchy code. I've done this a few times myself to try to meet a deadline, and it's really bit me in the ass. We're talking weeks maybe months down the drain.

"Sketching in code" is a great analogy. Similar to how you'd trim off the rough edges in a sketched drawing, I've found myself cutting a lot of cruft when refactoring.

> Make it do less.

Yes, but less in a particular way. It's important to skip processing stuff that can't matter in the end. In this example, skipping over bytes that can't matter because the end doesn't match. In DBMS, it's important to skip over records/columns/values that can't contribute to the answer.

The surprising realization in optimization is that procrastination is a virtue. Lazy beats eager.

There is a third one, at low level: make it do something else.

For example, programmer requests a multiplication, compiler generates a shift; programmer requests a division, compiler generates a multiplication; programmer specifies a switch, compiler generates a jump table.
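Concretely (hypothetical helper names; any optimizing compiler performs these strength reductions):

```c
/* With optimization on, a compiler emits a shift for the multiply... */
unsigned times8(unsigned x) { return x * 8; }   /* becomes x << 3 */

/* ...and replaces division by a constant with a multiply by a
 * precomputed "magic" reciprocal plus a shift, since integer divide
 * is one of the slowest ALU operations. */
unsigned div10(unsigned x)  { return x / 10; }
```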

I'd argue that's just a variation of doing less, in the same way that e.g. vectorization is a variation of parallelism.

The same concept applies to transportation. How to get around faster: take shorter routes, and take all the stuff you're going to need on the first trip.

> The second is parallelism.

Or vectorization, or avoiding stalling, or filling all execution units, or...

I wonder how much these tricks matter nowadays. Even if we ignore the time to read from disk, for simple algorithms that scan data linearly, it is often the case that the CPU is spending most of its time waiting for the data to be read from RAM.

Also, in modern x86 processors, an entire cache line (64 bytes) is read whenever a single byte is needed. So for strings smaller than 64, even algorithms that don't check every byte will read every byte from disk to RAM and from RAM to the processor cache.

> Even if we ignore the time to read from disk, for simple algorithms that scan data linearly, it is often the case that the CPU is spending most of its time waiting for the data to be read from RAM.

Nope. Even DDR-3 can perform sequential transfers at speeds well exceeding one byte per CPU clock cycle.

And yes, you can ignore the time to read from disk. Think a SAN over a 10-gig link. That gets you close to byte-per-cycle territory as well. It takes on the order of a dozen cycles (more or less depending on architecture and algorithm) to perform one "step" of a search. So yes, these algorithms very much matter.

> Also, in modern x86 processors, an entire cache line (64 bytes) is read whenever a single byte is needed. So for strings smaller than 64, even algorithms that don't check every byte will read every byte from disk to RAM and from RAM to the processor cache.

Yep. But this is not necessarily the bottleneck; see my above comments about the CPU.

The CPU can also execute more than one instruction per clock cycle. For example, instructions such as MOV, CMP and even conditional jumps run at a typical throughput of 3 instructions per clock cycle on the Intel Core line of CPUs. (source: http://www.agner.org/optimize/instruction_tables.pdf)

This means that the simplest handcrafted loop that loads a byte, compares its value to the first byte of the searched string and then does a conditional jump should (if unrolled) take about 1 clock cycle per byte examined!

This means that if our DDR3 memory can read 6GB of data per second and our CPU core is clocked at 3GHz, this completely naive algorithm will run at half of the theoretical maximum speed.

Using XMM instructions (that work on 16 bytes at a time), should probably get us to the limit of the RAM speed.
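In practice that handcrafted loop is usually written on top of memchr(), which libc implementors vectorize with exactly those XMM instructions (a sketch; the function name is mine):

```c
#include <stddef.h>
#include <string.h>

/* First-byte filter: let the (typically SIMD-vectorized) libc memchr
 * race through memory to candidate positions, then verify the rest of
 * the pattern with memcmp.  Returns a pointer to the match or NULL. */
const char *memchr_search(const char *text, size_t n,
                          const char *pat, size_t m)
{
    const char *p = text, *end = text + n;

    if (m == 0 || m > n)
        return NULL;
    while ((p = memchr(p, pat[0], (size_t)(end - p))) != NULL) {
        if ((size_t)(end - p) < m)
            return NULL;             /* too little text left to match */
        if (memcmp(p, pat, m) == 0)
            return p;
        p++;                         /* skip past this false candidate */
    }
    return NULL;
}
```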

Regarding SAN-over-10-gig link. I don't know about you, but the computer I'm typing this on has an SSD that can read only a few hundred megabytes per second.

SANs typically have dozens of drives configured as a RAID array to achieve 10 Gbps performance.

Corollary (from my fortunes file):

Deleted code is debugged code. - Jeff Sickel

The phrase should be attributed to Ray Ontko, I first heard it while working on database projects for him in the early nineties. -- Jeff Sickel

Impressive, but if the goal is for BSD to have a faster grep then clearly the solution is to use GNU grep instead of rolling their own redundant and inferior clone.

presumably GNU grep is GPL and incompatible with their BSD license

The GNU GPL is perfectly compatible with their BSD license... :]

It's just that the end result is then restricted by the GPL, and they don't want that.

That is what "not compatible" means.

You can release some software that contains both GNU GPL and BSD-licensed code, and it will be legal to distribute as long as you follow the rules of both licenses; as the rules of these licenses do not conflict, there's no legal problem.

This is what most people mean when they say two FOSS licenses are "compatible."

Is grep part of the BSD base system?

In other words the correct choice of algorithm and data structure can dramatically simplify a problem and the amount of code and time needed to solve it.

It also means that with tools that let you quickly apply different techniques, ahem, composable functions, you can search for more efficient solutions with a lot less effort. That doesn't solve the smartness problem, but it makes it a lot more tractable.

This is not just down to the algorithm and data structure, though; it's also very well implemented.

> It also means that having tools where you can quickly apply different techniques, ahem, composable functions

The key word there is composable, and not functions.

The points of discarding data you don't need, not looking at stuff you don't need to look at, and minimizing the processing of the stuff you do look at are also highly valid.

I've seen 90% performance boosts in code where access to low-level data formats really wasn't possible. Though at times, using appropriate data structures / methods also helped markedly.

I don't know who to attribute the quote to, but there is one the goes something along the lines of:

"The fastest method to execute is an empty method."

> "The fastest method to execute is an empty method."

The fastest method to execute is an empty method that was never called.

The fastest method to execute is an empty method that was never called and never written.

The fastest method to execute is an empty method that was never called and never written and never planned.

This is why I say I'm at my most efficient when I just don't do any work at all.

Funny, but I think you finally went too far! Speed is one thing, and not executing a function sure can happen fast, but efficiency is productivity per unit time, and if there is no productivity, it can't be "most efficient". (Yes, I'm a pedant sometimes.)

Wally? Is that you?

One of the things I like best about Lean-derived software development methods is that they maximize the amount of work not done.

In case anyone is interested in checking it out, you can download the source here: ftp://mirrors.kernel.org/gnu/grep/

You will need some gnu common libraries. I tried building it from scratch recently, as a first step for implementing an extension idea, but didn't succeed.

(If anyone is interested, I want to add more operators, like intersection or difference of regular languages.)

The shell, combined with grep, already implements these intersection and difference operators. For intersection, all you have to do is pipe into a second grep [1]; for difference, you pipe into a second grep with -v [2].

[1] Intersection: grep regex1 file1 | grep regex2

[2] Difference: grep regex1 file1 | grep -v regex2

Yes, you can do that. But you can only do that at the top level of your expression (ie outside the expression), not intra regular expression.

The article mentions the difference in times from forcing mmap.

  $ grep --mmap "fleagle" *
  grep: the --mmap option has been a no-op since 2010
Coincidentally, the article is from 2010.

The art of solving the problem in question and ONLY the problem in question.

In some cases, Perl can actually be faster than grep and sed at basic text processing-- IIRC particularly with larger files-- but I'm not sure about the mechanics behind it.

Original submission title was: "The key to making programs fast is to make them do practically nothing."

if you don't need backtracking, re2 [1] should be faster.

[1] http://code.google.com/p/re2/

I've always loved grep. Now I know why.

Thanks to grep, it almost paid my rent.

Leaving a comment here and upvoting simply to have an access for the resources. I will really need to update my way of bookmarking pages...

Doing practically nothing is not the same as doing things efficiently and intelligently. It is a joke, but it may be misleading.

> It is a joke but it may be misleading

It's something along the lines of a mantra, not a joke.

This is a classic post.

what is grep?

A text search utility in Unix. Name comes from a command in the text editor ed and stands for "global regular expression print".

Thank you, guys, but I was kidding. I've been using Linux and other Unixes since 1998. I'm just impressed by how this post got busy with everything but Linux programming.
