I have to disagree, and I can actually use Boyer-Moore as an example of this:
Split the string you're searching into four roughly equal pieces, with an overlap so that a match straddling a boundary is still guaranteed to be found. Then run Boyer-Moore on those pieces in parallel. E.g. if you're searching for FOO, you'd split the string like this (piece 1 ends just before the last O, and piece 2 starts a couple of characters earlier, so the full FOO still falls inside one piece):
piece 1:  ...7890abcdeFO
piece 2:              FOOghijk...
Now, what we know is that in the worst case this requires more operations: the O may be detected by the search in piece 1, which will then check whether the character in front of it is an O as well. The search in piece 2 will examine that same O again, so there's redundant computation going on. However, this is theoretically faster (and I'd guess it's practically faster for large files) even though it does more total work.
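To make the splitting concrete, here is a rough sketch of the idea (illustration only, not production code): it uses memmem() -- a GNU/BSD extension -- as the per-chunk matcher instead of a hand-rolled Boyer-Moore, and plain pthreads for the parallelism; the overlap arithmetic is the interesting part. Compile with -pthread.

/* Sketch: split a buffer into overlapping chunks and search them in
 * parallel.  Each chunk overlaps the next by strlen(pattern) - 1 bytes,
 * so a match straddling a chunk boundary is still found (by the later
 * chunk).  memmem() stands in for a real Boyer-Moore matcher. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NCHUNKS 4

struct job {
    const char *start;    /* beginning of this chunk                */
    size_t      len;      /* chunk length, including the overlap    */
    const char *pattern;
    const char *hit;      /* first match within this chunk, or NULL */
};

static void *search_chunk(void *arg)
{
    struct job *j = arg;
    j->hit = memmem(j->start, j->len, j->pattern, strlen(j->pattern));
    return NULL;
}

int main(void)
{
    const char haystack[] = "................7890abcdeFOOghijk................";
    const char *pattern = "FOO";
    size_t n = strlen(haystack), m = strlen(pattern);
    size_t base = n / NCHUNKS;

    pthread_t tid[NCHUNKS];
    struct job jobs[NCHUNKS];

    for (int i = 0; i < NCHUNKS; i++) {
        size_t off = (size_t)i * base;
        size_t len = (i == NCHUNKS - 1) ? n - off         /* last chunk runs to the end    */
                                        : base + (m - 1); /* others overlap by m - 1 bytes */
        if (off + len > n)
            len = n - off;
        jobs[i] = (struct job){ haystack + off, len, pattern, NULL };
        pthread_create(&tid[i], NULL, search_chunk, &jobs[i]);
    }

    for (int i = 0; i < NCHUNKS; i++) {
        pthread_join(tid[i], NULL);
        if (jobs[i].hit)
            printf("chunk %d: match at offset %td\n", i, jobs[i].hit - haystack);
    }
    return 0;
}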
Mozilla is not big because it's full of useless crap. Mozilla is big because your needs are big. Your needs are big because the Internet is big. There are lots of small, lean web browsers out there that, incidentally, do almost nothing useful.
- Roll your own unbuffered input using raw system calls.
A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. This helped performance dramatically on fragmented data (e.g. tons of small files).
For most typical greps (at least of files as opposed to standard input), "grep" is likely disk-bound, not CPU-bound.
(Note: I mean literal prefetch, not actually reading the file from disk. This is important because file input is a blocking operation in UNIX -- the thread blocks when read() is called and can only be resumed when the read is complete, unlike the case of output. This prevents the filesystem from reordering or merging multiple reads unless they come simultaneously from different threads. This is why reads are often slower than writes on typical data.)
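For the curious: one way to issue that kind of prefetch hint is posix_fadvise(POSIX_FADV_WILLNEED), which starts readahead in the background and returns immediately. A rough sketch of the general idea (simplified, not the actual hack):

/* While grepping file i, hint the kernel to start reading file i+1.
 * posix_fadvise(POSIX_FADV_WILLNEED) only initiates readahead and
 * returns right away; the pages land in the page cache, so the disk
 * works while we are still busy with the current file. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void process_file(const char *name, int fd)
{
    char buf[65536];
    ssize_t n;
    long total = 0;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ...the actual pattern search over buf[0..n) would go here... */
        total += n;
    }
    printf("%s: %ld bytes scanned\n", name, total);
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0)
            continue;

        /* Ask for readahead on the next file before scanning this one. */
        if (i + 1 < argc) {
            int next = open(argv[i + 1], O_RDONLY);
            if (next >= 0) {
                posix_fadvise(next, 0, 0, POSIX_FADV_WILLNEED);
                close(next);  /* prefetched pages stay in the page cache */
            }
        }

        process_file(argv[i], fd);
        close(fd);
    }
    return 0;
}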
Regarding this bit: "Roll your own unbuffered input using raw system calls. / A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. "
I think you misunderstood. "Roll your own unbuffered input [....]" does not mean "do not read ahead"; it means "do not use stdio". Stdio buffers have a lot of overhead (relatively speaking) in support of features that grep does not need.
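As a rough illustration of the difference (my sketch, not grep's actual code): "rolling your own" just means read()ing straight into one big buffer of your choosing and searching it in bulk, with none of stdio's per-character machinery.

/* Raw, unbuffered input: one read() system call per block, into a
 * buffer whose size you control, searched in bulk (here: counting
 * newlines with memchr) rather than character by character. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ (256 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    static char buf[BUFSZ];
    ssize_t n;
    long newlines = 0;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* Scan the whole block at once instead of one getc() at a time. */
        for (char *p = buf; (p = memchr(p, '\n', buf + n - p)) != NULL; p++)
            newlines++;
    }

    printf("%ld lines\n", newlines);
    close(fd);
    return 0;
}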
Wow, I didn't realise grep had a separate --mmap option. Interesting.
I don't think it would make much difference in the parent's case of many, small, fragmented files because if you're mmapping each file in turn and it's not cached, it still needs to be loaded from disk - it just happens in a page fault instead of the read() call.
Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are in the list of files. Maybe grep does that, I don't know?
I guess the case where that technique would help is actually when you have a combination of (a) many files, (b) a computationally expensive pattern match (even just -i is a measurable hit) and (c) largeish files.
Because on many small files and a simple match, the disk I/O is still going to be the major component - even if you prefetch you still can't get around needing to load all the file contents from disk.
Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are. Maybe grep does that, I don't know?
It doesn't do that, and that's the first method I used in my hack.
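For reference, a generic sketch of that mmap + madvise approach (heavily simplified, certainly not the actual patch; the pattern FOO is just a placeholder): map each file, and before scanning the current one, madvise() the next one so the kernel starts faulting its pages in.

/* Scan a list of files via mmap, asking the kernel to prefetch the
 * next file's pages (MADV_WILLNEED) while the current one is scanned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapping { char *addr; size_t len; };

static struct mapping map_file(const char *path)
{
    struct mapping m = { NULL, 0 };
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return m;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.addr = p;
            m.len = (size_t)st.st_size;
        }
    }
    close(fd);  /* the mapping stays valid after close() */
    return m;
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    struct mapping prefetched = map_file(argv[1]);

    for (int i = 1; i < argc; i++) {
        struct mapping cur = prefetched;

        /* Map the next file and hint the kernel to start paging it in. */
        prefetched = (i + 1 < argc) ? map_file(argv[i + 1])
                                    : (struct mapping){ NULL, 0 };
        if (prefetched.addr)
            madvise(prefetched.addr, prefetched.len, MADV_WILLNEED);

        if (cur.addr) {
            if (memmem(cur.addr, cur.len, "FOO", 3))
                printf("%s: match\n", argv[i]);
            munmap(cur.addr, cur.len);
        }
    }
    return 0;
}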
"mmap" is not a general solution for grep. It's nice when it can be used but grep must be able to grep files much to large to reasonably mmap and various kinds of streams.
Yes, but you would think that a reasonable default would be to use both, i.e. mmap by default, and then if the file is larger than some MAX (based on memory use and file size), use read.
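Something along these lines, say (the threshold is made up and the pattern is a placeholder; a real implementation would share the match loop):

/* Use mmap for small-enough regular files, fall back to a read() loop
 * for streams, devices, and files above an arbitrary size threshold. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MMAP_MAX (256L * 1024 * 1024)  /* arbitrary cut-off */

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    int found = 0;

    if (S_ISREG(st.st_mode) && st.st_size > 0 && st.st_size <= MMAP_MAX) {
        /* Small-enough regular file: map it and search it in one go. */
        char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            found = memmem(p, (size_t)st.st_size, "FOO", 3) != NULL;
            munmap(p, (size_t)st.st_size);
        }
    } else {
        /* Stream, huge file, etc.: plain read() loop.  (For brevity this
         * ignores matches that straddle a buffer boundary.) */
        char buf[65536];
        ssize_t n;
        while (!found && (n = read(fd, buf, sizeof buf)) > 0)
            found = memmem(buf, (size_t)n, "FOO", 3) != NULL;
    }

    close(fd);
    printf(found ? "match\n" : "no match\n");
    return 0;
}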
Less than a year old. HN really needs to alert you to dupes when you submit... or does it? I know it doesn't allow you to post a dupe for a cool-off period after the original has been posted...
There is a dupe checker, but only for exact matches (and when one is found, instead of giving you a warning it automatically lodges your submission as an upvote of the existing link). The OP of this thread appended a # to the end of the URL to bypass the filter.
Thank you, although not everyone agrees. Many of the duplicate notifications get downvoted, although they sometimes get re-upvoted later. Here are two recent ones:
then HN needs to implement URL normalization and equivalence from RFC3986. These are fresh in my mind because I extended urllib in Python to support them last week.
Working out whether two URLs are equivalent isn't an easy problem, but the RFC goes some way towards solving it and picks up easy things like a trailing #.
Thanks to hashbangs (or really just the prevalence of Ajax-only sites), it's now a lot harder to tell whether two pages with #something in the URL are effectively the same.
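Dropping only an empty fragment is still safe, though -- a hashbang URL has a non-empty fragment. A minimal sketch of that kind of cheap normalization (illustration only; a real RFC 3986 normalizer also handles default ports, percent-encoding case, dot-segments and so on):

/* Normalize a URL in place, covering only the cheapest cases:
 * lowercase the scheme and host, and drop an *empty* fragment (a
 * trailing "#").  Hashbang fragments are non-empty and are left alone. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void normalize_url(char *url)
{
    char *p = url;

    /* Lowercase the scheme (everything before "://"). */
    char *scheme_end = strstr(url, "://");
    if (scheme_end) {
        for (; p < scheme_end; p++)
            *p = (char)tolower((unsigned char)*p);
        p = scheme_end + 3;
    }

    /* Lowercase the host (up to the next '/', '?', '#' or end). */
    for (; *p && *p != '/' && *p != '?' && *p != '#'; p++)
        *p = (char)tolower((unsigned char)*p);

    /* Drop an empty fragment: a '#' with nothing after it. */
    size_t len = strlen(url);
    if (len > 0 && url[len - 1] == '#')
        url[len - 1] = '\0';
}

int main(void)
{
    char a[] = "HTTP://Example.COM/story#";
    char b[] = "http://example.com/story";

    normalize_url(a);
    normalize_url(b);
    printf("%s\n%s\nequivalent: %s\n", a, b, strcmp(a, b) == 0 ? "yes" : "no");
    return 0;
}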
No cool-off period (although that is a good idea). I just submitted a good read and it had already been posted 616 days ago, so no one will see it.
Interestingly enough, awk is many times faster for an inverted grep (grep -v) than grep is. Get a large file and test yourself!
grep -v regex
awk '$0 !~ /regex/ {print}'
This is possibly largely because this optimization doesn't help in that case:
GNU grep AVOIDS BREAKING THE INPUT INTO LINES.
awk is very fast at breaking the input into lines (that's what it spends most of its day doing!). I still don't understand why grep -v is so much slower, though. (I was searching a 100+MB log file when I discovered this.)
This is discussed best in the opening chapter of O'Reilly's Beautiful Code: http://oreilly.com/catalog/9780596510046. Google Books has a readable snippet of the section at http://bit.ly/g34QRh, but I highly recommend buying the book because it is a great read.
For vi users, I recommend this plugin: https://github.com/mileszs/ack.vim. I like it better than simply replacing grep with ack. This has made me a lot more productive, as I can do complex searches MUCH faster!
The other day I was actually surprised at how slow GNU grep was. I wanted to count how many lines in a log file contained a given domain. grep -c <domain> <file> took 25 seconds on a 1.7GB log file, whilst agrep took only 5 seconds.
One thing to check: what is your current locale? GNU grep is much, much slower when it's multibyte-aware, even if you're searching for an ASCII string. Try repeating your test with LC_ALL=C grep (and don't forget to take the file cache into account).
You can check the current values of your locale with `locale`.
I'm still puzzled as to why it's so much slower with UTF-8, though. I guess GNU grep is naively converting back and forth between representations? There's nothing in UTF-8 that should prevent it from doing this efficiently. Even with complex patterns, it's possible to search through the file in about the same time as with a simple pattern. E.g. aspell can build an FSA where each transition is O(1), making the search time more or less independent of the search pattern.
I assume the problem is that you can't skip ahead X characters on multibyte data without parsing the bytes, because you don't know how big a character is. So you may have to read every byte.
But you can skip ahead without parsing all the bytes, since UTF-8 is self-synchronizing. The only scenario I can envision is that GNU grep wants to perform Unicode normalization, to catch equivalent codepoints, and that this is implemented inefficiently.
Edit: Granted, since I'm using -c, it has to look at all the bytes to find all the newlines.
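To illustrate what self-synchronizing buys you: continuation bytes are recognizable in isolation (they all look like 10xxxxxx), so after jumping to an arbitrary byte offset you can find the next character boundary by inspecting at most three bytes. A toy example (assumes valid UTF-8 input):

/* UTF-8 self-synchronization: skip to any byte offset, then step
 * forward past continuation bytes (pattern 10xxxxxx) to land on the
 * start of the next character. */
#include <stdio.h>
#include <string.h>

/* Return the offset of the first character boundary at or after pos. */
static size_t resync(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && (s[pos] & 0xC0) == 0x80)  /* continuation byte? */
        pos++;
    return pos;
}

int main(void)
{
    /* "naïve café" in UTF-8: each accented character takes two bytes. */
    const unsigned char text[] = "na\xc3\xafve caf\xc3\xa9";
    size_t len = strlen((const char *)text);

    /* Jump blindly to byte 3, which is the middle of the "ï"... */
    size_t pos = resync(text, len, 3);

    /* ...and resync lands on the next whole character, the 'v'. */
    printf("resynchronized at byte %zu: '%c'\n", pos, text[pos]);
    return 0;
}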
What language could you possibly be using here? The first line looks like C, but then in C on most systems, open returns an int (a file descriptor). Maybe you wrote open() but you meant fopen()? An easy mistake, but then how are we to interpret the syntax in the 3rd line? f.read(...)? What is this I don't even.
> I don't know how many people's code I've optimized...
data analysis, i.e. reading in large files, but overall it's pretty fast anyway. Furthermore, you don't read the whole file into memory, so your program only uses a small amount of memory while running. I'd work with grad students who would read 1-2GB of data into a process to analyze it, and they'd wonder why their system slowed down.
The best thing about it is that the OS does the caching. So, say I analyze a file: I run a program with 4 arguments that maps the file to memory. The next time it runs, the file is still in memory, so it doesn't bother re-reading the original file. (Try doing a recursive grep on a directory, then try it again right after.)
Thanks! That leaves the question: when is mmap slower, or is it always faster? And why isn't mmap the default usage pattern if it's better in most cases (and easier to use to boot)?
Mike has only given part of the answer there. GNU grep obviously is very I/O tuned. And GNU grep optimizes the special cases of constant strings and of regexps that (more or less) start with a constant string -- but there's more!
GNU grep is also fast for most commonly encountered non-constant-string regexps (even those that don't start with a constant string) because its regexp engine avoids backtracking by doing an on-the-fly conversion (of many patterns) to a DFA. These extra cases are algorithmically neat and when you want them, you're glad they're there -- but they are less sexy in benchmarks because the most common use case in the real world, by far, is a search for a constant string.
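To illustrate the no-backtracking point: once the pattern has been turned into a DFA, the matcher consumes each input byte exactly once, with a single table lookup per byte, no matter how hairy the pattern was. Here is a toy, hand-built DFA for inputs containing "foo" (GNU grep builds its tables from the regexp on the fly; this hard-coded one is only meant to show the shape of the inner loop):

/* Tiny hand-built DFA recognizing inputs that contain "foo".  The
 * matching loop does one table lookup per input byte and never backs
 * up, which is the property that makes DFA matching fast. */
#include <stdio.h>

enum { START, SAW_F, SAW_FO, MATCHED, NSTATES };

static unsigned char delta[NSTATES][256];

static void build_table(void)
{
    for (int c = 0; c < 256; c++) {
        delta[START][c]   = START;
        delta[SAW_F][c]   = START;
        delta[SAW_FO][c]  = START;
        delta[MATCHED][c] = MATCHED;   /* accepting state is absorbing */
    }
    delta[START]['f']  = SAW_F;
    delta[SAW_F]['f']  = SAW_F;        /* "ff" still ends with an 'f' */
    delta[SAW_F]['o']  = SAW_FO;
    delta[SAW_FO]['f'] = SAW_F;
    delta[SAW_FO]['o'] = MATCHED;
}

static int contains_foo(const char *s)
{
    int state = START;
    for (; *s; s++)                               /* each byte read once */
        state = delta[state][(unsigned char)*s];  /* one lookup per byte */
    return state == MATCHED;
}

int main(void)
{
    build_table();
    printf("%d\n", contains_foo("ffofoox"));  /* prints 1 */
    printf("%d\n", contains_foo("fo of o"));  /* prints 0 */
    return 0;
}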
How far could you go in a discussion like this before BSD grep could reasonably be considered a derived work of GNU grep?
The discussion is going deeply into how GNU grep is implemented, so it's clearly not a clean-room reverse-engineering kind of situation. On the other hand, nothing being discussed could be the subject of copyright, as only ideas and algorithms are put forth and no code is shown. How careful do you have to be to be sure?
I'm assuming you're asking when BSD grep has to be released under the GPL? As the BSD license is slightly more restrictive than the GPL, I imagine that the GPL would have to be adopted if there was significant code usage, which there doesn't appear to be in this case.
The case where code is reused would make it clear-cut that you need to GPL it. You can relicense BSD code (without the advertisement clause) under the GPL, but not the other way around.
My doubt is at what point you would need to GPL it even though you haven't actually reused any code, just described the GPL'd code to someone who then reimplemented it in BSD grep. This case still seems pretty benign, as only general algorithms and approaches are being shared, but it made me consider where the line might be.
I'd be surprised if the GNU people claimed use of algorithms was sufficient to warrant the forced adoption of GPL, as that goes against everything they stand for. Rather, I suspect a direct transfer of code would be the minimum necessary condition for the FSF to raise a stink.
This case is clearly not over the line, but it seems to me that the minimum can't be direct use of code. If someone documented the architecture and algorithms used in the Linux kernel thoroughly enough and handed that off to someone else who reimplemented it, I assume it would be hard to argue the result wasn't a derived work.
When LWN publishes a description of how the VM or the scheduler works, they are effectively describing the code. But maybe that's the same as a research paper on algorithms, where reimplementation is clearly possible. Maybe for it to be a derived work you'd have to describe the code so exactly that you'd effectively be translating the whole thing into another form.
I anticipated that response, but was on my way out the door and had no time to clarify.
From the point of view of the FSF, the BSD license is more restrictive, as it grants fewer freedoms to users. From the BSD point of view, GPL grants fewer freedoms to developers, and is thus more restrictive.
In the context of the question asked, I was approaching it from the POV of the GPL's authors, given that the question was whether the BSD code would have to adopt the GPL license. I apologize for my inarticulateness.
This is the Ultimate Truth of optimizing computer programs, and it seems so few people understand it.
"Why can't you make Python faster!?" Because Python does a lot of stuff without you asking.