
Why GNU grep is fast (2010) - bpierre
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
======
jmcphers
If you want the very fastest "find strings in files, especially code files",
I've become a big fan of ag, "the silver searcher":

[https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

In addition to all the tricks applied by GNU grep (such as Boyer-Moore), it
also splits work among threads to take advantage of multi-core machines.
Integrates with your favorite text editor, too!

------
sltkr
Obvious questions:

1. How do you extend Boyer-Moore to arbitrary regexps?
2. Does "not looking at every byte" really matter when the strings being
searched for are typically so short that you're gonna hit every cache line
anyway?

~~~
kevinnk
> Does "not looking at every byte" really matter when the strings being
> searched for are typically so short that you're gonna hit every cache line
> anyway?

You're not skipping bytes of the search string, you're skipping bytes of the
file being searched.

~~~
pacificmint
Of course, but you can never skip more bytes than the search string is long.

~~~
aganders3
But you get to skip those few bytes (potentially) very many times.
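
To make the skipping concrete, here is a minimal sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore. It is not GNU grep's actual code, and the function name `horspool_search` and the comparison counter are made up for illustration. It returns the match position along with how many byte comparisons it made against the text, so you can see that the shift is capped at the pattern length but is applied at almost every window.

```python
def horspool_search(text: bytes, pattern: bytes):
    """Return (index of first match or -1, number of byte comparisons made)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # Shift table: for each byte in the pattern (ignoring its final byte),
    # the distance from its last occurrence to the end of the pattern.
    # Bytes that do not occur in the pattern shift by the full length m.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        # Compare the current window right to left.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Slide the window based on the text byte under the pattern's last slot.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


if __name__ == "__main__":
    haystack = b"a" * 1000 + b"needle"
    index, comparisons = horspool_search(haystack, b"needle")
    print(f"match at {index}, {comparisons} comparisons for {len(haystack)} bytes")
```

With a 6-byte pattern the window can jump at most 6 bytes at a time, so the comparison count on this input is only a small fraction of the text length. Whether that translates into wall-clock savings is a separate question that needs measuring.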

~~~
rcxdude
Yes, but the point is that on modern CPUs checking those bytes is likely far
from the bottleneck, and thus skipping them saves you very little time, if any
at all.

~~~
taeric
You might be surprised. Consider: counting to 2 million by 5 is significantly
faster than counting by 1. Especially if you also start prefetching memory
ahead of the scan, I would expect that this can speed things up considerably.

Edit: I should say I would still think things should be benchmarked to really
know...
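
A rough sketch of how one might start measuring this, using Python's `timeit`. The `touch` helper and the 2,000,000-byte buffer are assumptions made up for this illustration. Note that in Python the interpreter's per-iteration overhead dominates, so this only shows the measurement approach and the difference in read counts. It says little about cache lines or memory bandwidth, which would need a C-level benchmark.

```python
import timeit


def touch(data: bytes, stride: int) -> int:
    """Read every `stride`-th byte of `data` and return how many reads were made."""
    reads = 0
    for i in range(0, len(data), stride):
        _ = data[i]  # the "work" is just touching the byte
        reads += 1
    return reads


if __name__ == "__main__":
    data = bytes(2_000_000)
    for stride in (1, 5):
        seconds = timeit.timeit(lambda: touch(data, stride), number=3)
        print(f"stride {stride}: {touch(data, stride)} reads, {seconds:.3f}s for 3 runs")
```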

~~~
justin66
(as mentioned by yan below) Ridiculous Fish did these sorts of benchmarks
here:
[http://ridiculousfish.com/blog/posts/old-age-and-treachery.html](http://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

------
yan
More required reading on this topic:
[http://ridiculousfish.com/blog/posts/old-age-and-treachery.html](http://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

------
alayne
Previous discussion
[https://news.ycombinator.com/item?id=6813937](https://news.ycombinator.com/item?id=6813937)

~~~
wglb
Amusingly, the top comment there lists a few of the other times that it has
shown up.

------
myrandomcomment
This has been posted in the past many times.

~~~
tedunangst
Which is curious. Some content you run into by accident and think needs
reposting. But how does one stumble upon this post in 2015? Googling for "why
is gnu grep fast" and then not noticing that two of the top four results are
submissions to HN?

~~~
danudey
Finding it linked from some other article or blog post, or having it sent to
you from someone else and then assuming that a mail thread from 2010 has never
been seen on HN before.

------
lettergram
Wow... I had this exact question when I interviewed at a big tech company, then
I had to implement it in pseudocode.

~~~
cosarara97
You were supposed to know why a particular implementation of a tool was fast
beforehand? Or they gave you GNU grep and BSD grep to compare?

~~~
taeric
Yeah, this definitely falls into the absurd category of interviewing
whiteboard questions. Ridiculously absurd. I have a book with this algorithm
detailed in it, and I think I would still have trouble implementing it well.

That or I am just dumb. Very likely.

~~~
tedunangst
"Let's implement BM" (with some hints/guidance) doesn't seem absurd. It's not
much code; handwritten can fit on a piece of paper or two. "Why is GNU grep
fast?" would definitely be an absurd question though.

~~~
taeric
I could understand possibly going through sections. Though, really, that form
of algorithm coding is so different from what I typically do day to day, that
it just seems irrelevant.

I would probably find it fun, provided the hints/guidance were good. Not sure
what to expect of it, though. As the sibling post asks, what would be the
takeaway?

~~~
tedunangst
Is the candidate capable of decomposing a problem and analyzing subproblems?
Are they capable of integrating subsolutions back into a working whole?

~~~
taeric
The problem really comes down to what you expect on that point. For many of
us, this would just be knowing that there are good algorithms for doing this.

If you don't already know of this algorithm, finding it is something that took
a surprisingly long time to do, using a field of algorithm design (pushdown
automata) that just isn't that widely utilized nowadays. (Well, I suppose your
domain may vary in this claim.)

------
rivd
The mentioned article about fast string search:

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9460&rep=rep1&type=pdf)

------
jlarocco
Text processing is actually a pretty neat area, with a lot of practical uses.

A while back I found an out of print book, "Text Algorithms," available for
download from the author in PDF and some other formats. It's not cutting edge
at this point, but still covers all the basics (like Boyer-Moore) really well.
[http://igm.univ-mlv.fr/~mac/REC/B1.html](http://igm.univ-mlv.fr/~mac/REC/B1.html)

~~~
sturakov
Awesome, thanks for the books!

------
lsiebert
I've been finding the following very useful for understanding string matching
algorithms. It has a bunch of different algorithms for exact string matching,
details on them, drawings, descriptions, speed comparisons, and code in C:
[http://www-igm.univ-mlv.fr/~lecroq/string/index.html](http://www-igm.univ-mlv.fr/~lecroq/string/index.html)

------
pekk
This should probably have a (2010) in the title or so.

------
mparramon
Related:
[http://www.developingandstuff.com/2013/07/grep-too-slow-tired-of-waiting-for-it.html](http://www.developingandstuff.com/2013/07/grep-too-slow-tired-of-waiting-for-it.html)

------
edwintorok
On the topic of fast grep-like programs, there is jrep too:
[http://lwn.net/Articles/589009/](http://lwn.net/Articles/589009/)

------
dllthomas
_"The key to making programs fast is to make them do practically nothing.
;-)"_

Very much the case.

------
Sevzi
_lettergram_ could do a faster version that would work without errors the
first time.

