
Why GNU grep is fast (2010) - letientai299
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
======
userbinator
_GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the
final letter of the target string, and uses a lookup table to tell it how far
ahead it can skip in the input whenever it finds a non-matching character._

I wonder how many times that algorithm or its variants have been rediscovered,
given that not everyone who writes code reads CS papers. Several decades ago,
working in x86 Asm, I came up with that "match the last element first and use
an indexed skip table" idea when I had to do substring matching. Much later,
when I explained to someone how I made that functionality so fast, he
basically told me I had rediscovered the
[https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm).

As it turns out, BMH is more common in practice (even when the author doesn't
realise it) than BM, and on average performs very well:

[https://old.blog.phusion.nl/2010/12/06/efficient-substring-searching/](https://old.blog.phusion.nl/2010/12/06/efficient-substring-searching/)

I suspect at least some standard library's memmem() uses BMH, since, unlike
BM, it requires no dynamic allocation or other complexity, and is otherwise a
drop-in replacement for the naive algorithm.
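
For illustration, here's a minimal BMH sketch in C. This is my own sketch of
the textbook algorithm, not any particular libc's memmem:

    #include <stddef.h>
    #include <string.h>

    /* Minimal Boyer-Moore-Horspool: compare the window against the needle
       starting from the last byte; on a mismatch, shift by a precomputed
       per-byte skip distance. */
    static const char *bmh_search(const char *hay, size_t n,
                                  const char *needle, size_t m)
    {
        size_t skip[256];
        if (m == 0 || m > n)
            return m == 0 ? hay : NULL;

        /* Default shift is the full needle length... */
        for (size_t i = 0; i < 256; i++)
            skip[i] = m;
        /* ...but bytes occurring in the needle (except its last position)
           get smaller shifts. */
        for (size_t i = 0; i + 1 < m; i++)
            skip[(unsigned char)needle[i]] = m - 1 - i;

        for (size_t pos = 0; pos + m <= n; ) {
            /* Check the last byte first; only on a hit compare the rest. */
            if (hay[pos + m - 1] == needle[m - 1] &&
                memcmp(hay + pos, needle, m - 1) == 0)
                return hay + pos;
            /* Skip ahead based on the byte under the window's last slot. */
            pos += skip[(unsigned char)hay[pos + m - 1]];
        }
        return NULL;
    }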

~~~
burntsushi
As the author of ripgrep, I'd like to add some more notes to your comment:

* Some libc implementations (at least, glibc and musl) use Two-Way (linked in a sibling comment). In particular, it also skips parts of the input like BM, uses constant space (no allocation needed) and formally has O(n) time.

* The advice in the OP is somewhat outdated. The fact that BM skips some of the input isn't really too important on modern hardware. What matters about BM is its skip loop. Most implementations use `memchr` on the last byte of the needle. If that last byte happens to occur infrequently in the haystack, then the vast majority of your search time will be in an extremely well optimized loop that uses SIMD. For that reason, BM is somewhat of a red herring, and can be easily beaten through heuristics that choose better bytes to feed into its skip loop. (For example, assume a frequency distribution of bytes and pick the rarest one in your needle.)
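
As a rough sketch of that heuristic (not ripgrep's actual code; `freq_rank` is a made-up stand-in for a real background byte-frequency table):

    #include <stddef.h>
    #include <string.h>

    /* Made-up frequency rank: treat lowercase letters and space as common,
       everything else as rare. A real table would use measured frequencies. */
    static unsigned int freq_rank(unsigned char b)
    {
        return (b == ' ' || (b >= 'a' && b <= 'z')) ? 255 : 1;
    }

    static const char *rare_byte_search(const char *hay, size_t n,
                                        const char *needle, size_t m)
    {
        if (m == 0 || m > n)
            return m == 0 ? hay : NULL;

        /* Choose the needle offset whose byte is (assumed) rarest. */
        size_t rare = 0;
        for (size_t i = 1; i < m; i++)
            if (freq_rank((unsigned char)needle[i]) <
                freq_rank((unsigned char)needle[rare]))
                rare = i;

        const char *end = hay + n;
        const char *p = hay + rare;
        while (p < end) {
            /* memchr's vectorized inner loop does the skipping. */
            p = memchr(p, needle[rare], (size_t)(end - p));
            if (p == NULL)
                return NULL;
            /* Verify a full match around the candidate position. */
            const char *start = p - rare;
            if (start + m <= end && memcmp(start, needle, m) == 0)
                return start;
            p++;
        }
        return NULL;
    }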

~~~
bberenberg
OT but seriously thanks for rg, it makes massive log file analysis so much
faster for me.

------
zenhack
It's always amazing to me how much performance work the basic GNU tools have
seen in general. Grep makes some sense, but even yes(1) is fairly carefully
tuned; in some cases it actually strikes me as kind of excessive, and not
clearly worth the readability drop.

~~~
smabie
The GNU people are absolutely obsessed with correctness and performance. It's
kind of annoying, actually. I assume it has something to do with RMS' ties to
MIT and Lisp machines (an example of staggering complexity); see Richard
Gabriel's "Worse is Better" essay.

I tend to fall in the NJ school myself, though I do think Emacs is something
special. And though it's derided as slow, Emacs has a ton of super crazy
optimizations, especially around rendering.

Pretty much every GNU project written in C is totally unreadable and
hopelessly baroque: gcc, glibc, coreutils, etc. For fun, compare some OpenBSD
Unix utility implementations with their GNU counterparts. The OpenBSD tools'
source code is a joy to read and represents the pinnacle of elegant C. The GNU
versions are ugly as sin, but functionally superior in pretty much every way.
I don't really agree with how they write code, but you've got to respect the
almost OCD attention to detail: no edge case goes unhandled.

Worlds apart from the terse and cavalier style of traditional Unix hackers.

~~~
throwaway_pdp09
> The GNU people are absolutely obsessed with correctness and performance.
> It's kind of annoying actually

I can't accept this as a criticism without some elaboration.

> The GNU versions are ugly as sin, but functionally superior in pretty much
> every way. I don't really agree with how they write code, but you got to
> respect the almost OCD attention to detail: no edge case goes unhandled.

It sounds like you want the GNU people to stop making software that is
'functionally superior in pretty much every way'.

I don't get what you're saying.

~~~
smabie
I think the cost of the complexity is too high. I much prefer simple code that
provides 90% of the functionality to code that is 10x more complicated and
provides 98%.

~~~
thomaslord
Personally I think there's a line to be drawn somewhere - I can certainly
appreciate focusing on readability, but having certain tools run super fast
can be very advantageous on a system level.

Generally I think OS-bundled tools that may be used for stream processing of
large amounts of data (like grep) should skew further towards performance,
because there are a non-trivial number of people passing gigabytes of data
through grep every day and a 10-20% performance improvement there is a
significant boon to global productivity.

Tools that solve a very specific problem and do it with a strong expectation
of correctness and performance are what makes *nix environments so powerful -
not only can you throw together a nifty bash script that duplicates the
functionality of a massive cluster operation, the script may actually
outperform the cluster because of how performance-focused those tools are!

I see this issue as being a bit similar to high- vs. low-level programming
languages. Could we write absolutely everything in JavaScript and probably
have a lot of the code be more readable than it'd be in C or Rust? Absolutely.
But that doesn't mean it's not worth using C in some cases to make the program
perform well on hardware that doesn't have enough firepower to just ignore the
overhead, and it doesn't mean it's not worth using Rust in cases where
stability is absolutely required.

~~~
zenhack
I think grep is a compelling example of when it makes sense to do the extra
work here. But I'm more dubious of the idea that carefully optimising yes was
a good use of engineering time.

~~~
throwaway_pdp09
> But I'm more dubious of the idea that carefully optimising yes was a good
> use of engineering time.

I literally can make no sense of that. Why do you think widely used pieces of
software should _not_ be well made?

~~~
zenhack
There's an opportunity cost. There has to have been a better use of that
person's time than taking a tool designed to say yes to repetitive "are you
sure?" prompts and getting it to keep up with memory throughput. The
throughput of a tool like that does not matter.
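
For context, the optimization in question is roughly of this shape: instead of
one write(2) per "y\n", fill a large buffer with repeated copies of the line
and issue large writes. A sketch, not the actual coreutils code:

    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* With a ~128 KiB buffer, each syscall outputs ~128 KiB instead
           of 2 bytes. */
        static char buf[128 * 1024];
        const char line[] = "y\n";
        size_t len = sizeof(line) - 1;
        size_t filled = 0;

        /* Fill the buffer with as many whole copies of the line as fit. */
        while (filled + len <= sizeof(buf)) {
            memcpy(buf + filled, line, len);
            filled += len;
        }
        for (;;) {
            if (write(STDOUT_FILENO, buf, filled) < 0)
                return 1;
        }
    }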

~~~
throwaway_pdp09
Here's the code for it
[https://github.com/coreutils/coreutils/blob/master/src/yes.c](https://github.com/coreutils/coreutils/blob/master/src/yes.c)

The core is about 70 lines. As for over-optimising it, I think Chesterton's
fence needs mentioning:
[https://fs.blog/2020/03/chestertons-fence/](https://fs.blog/2020/03/chestertons-fence/)

I don't think this is a useful avenue of debate for me.

------
hvs
1 year ago
[https://news.ycombinator.com/item?id=19521872](https://news.ycombinator.com/item?id=19521872)

4 years ago
[https://news.ycombinator.com/item?id=12350890](https://news.ycombinator.com/item?id=12350890)

many more:
[https://hn.algolia.com/?q=Why+GNU+grep+is+fast](https://hn.algolia.com/?q=Why+GNU+grep+is+fast)

~~~
jsnell
But also: [https://ridiculousfish.com/blog/posts/old-age-and-treachery.html](https://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

------
rektide
/me laughs in ripgrep

~~~
lcdoutlet
I am a big fan of ripgrep. I think the tool is great for quick, simple
searches. On larger data sets with sufficiently complicated regex patterns,
the performance gap is small. I have also encountered situations using rg on
medium-to-large datasets where the files may contain Unicode, regular
expressions, random garbage... and rg failed. I don't recall a situation like
this where the error handling in grep didn't recover and move on to the next
file.

Here is an example showing the performance gap to be small, using
enwiki-20200401-pages-articles-multistream.xml (72G) split into 24 files on a
ramdisk. I ran this 50 times and the results were similar.

    $ du -sch .
    72G  .
    72G  total

    $ time ls|xargs -i -P24 rg -c "<._?>" {}
    real 0m6.841s
    user 2m8.766s
    sys  0m5.637s

    $ time ls|xargs -i -P24 grep -Ec "<._?>" {}
    real 0m6.168s
    user 1m35.058s
    sys  0m24.177s

~~~
burntsushi
> I have also encountered situations using rg on med-large datasets where the
> files may contain Unicode, regular expressions, random garbage... and rg
> failed.

I would really love to have a bug report if you could find a way to reproduce
this. ripgrep should certainly not fail regardless of the input. I've actually
spent a lot of time trying to make sure that ripgrep's error handling and so
forth generally match GNU grep's behavior.

> Here is an example showing the performance gap to be small using
> (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on
> a ramdisk. I ran this 50 times and the results are similar.

Could you show me where to get `enwiki-20200401-pages-articles-multistream.xml`?
I'd love to run the benchmark myself, although I don't have a ramdisk quite
that big. :P I could shrink the data set a bit, though.

If I were to guess though, yeah, I'd generally expect GNU grep and ripgrep to
perform similarly here. Speculating:

If the match count of those runs is very high, then total search time is
likely dominated by per-match overhead. Both ripgrep and GNU grep should be
pretty close there, but there's not too much room to improve drastically.

If the match count of those runs isn't high or is very small, then GNU grep
and ripgrep will use a search algorithm that is virtually identical in its
performance characteristics. They will both probably look for occurrences of
`>` using `memchr`, and then run a full search on matching lines. It's hard to
get much faster than that, so again, not a lot of room for differentiating
yourself.

If I can get my hands on the corpus, then I'm pretty sure I could give you
(non-pathological) patterns on which searches would show very different
performance characteristics between GNU grep and ripgrep.

~~~
lcdoutlet
Thank you for your reply. Again, I am a big fan of your work, specifically rg
and fst.

I apologize for mentioning this without submitting bug reports. I will do so
in the near future.

The dataset can be found here:
[https://dumps.wikimedia.org/enwiki/20200420/enwiki-20200420-pages-articles-multistream.xml.bz2](https://dumps.wikimedia.org/enwiki/20200420/enwiki-20200420-pages-articles-multistream.xml.bz2)

In general, when I start a search, the patterns are somewhat pathological. For
example, when learning about a new codebase I might start with 10-100 regexes
and 100+ keywords. With each iteration the complexity is reduced until I find
the most relevant parts of the codebase.

I know rg performs significantly better than grep out of the box. I think grep
by default is compiled without optimizations and does not use concurrency. I
would be interested in comparing the performance characteristics between the
tools in more detail.

------
statquontrarian
> Since GNU grep passed out of my hands, it appears that use of mmap became
> non-default, but you can still get it via --mmap. [...] using --mmap can be
> worth a >20% speedup.

This option was made a no-op (accepted but ignored) in March 2010 and later
removed completely:
[https://lists.gnu.org/archive/html/grep-commit/2010-03/msg00016.html](https://lists.gnu.org/archive/html/grep-commit/2010-03/msg00016.html)

~~~
shakna
And removed for performance reasons, too.

> On modern systems, --mmap rarely if ever yields better performance.

and

> mmap is a bad idea for sequentially accessed file because it will cause a
> page fault for every read page. Just consider it a failed experiment, and
> ignore --mmap while accepting it for backwards compatibility.

~~~
burntsushi
That may have been true back then, but it isn't true today. Or at least, it
depends on your OS. On Linux, you can actually run this test fairly reliably
with ripgrep. Pick a big file (hundreds of MB or a couple GB), make sure it's
on a ramdisk (or in cache), and search for a pattern that doesn't exist:

    $ rg --mmap zqzqzqzq big-file
    $ rg --no-mmap zqzqzqzq big-file

The first is faster on my system by a sizeable margin.

There are cases where mmap can be slower. For example, when searching lots of
small files simultaneously. The overhead of managing the memory maps in the OS
ends up dominating.
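
For reference, the mmap strategy is roughly this shape. A minimal
Linux/POSIX sketch, assuming GNU memmem, not ripgrep's actual code:

    #define _GNU_SOURCE /* for memmem on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s pattern file\n", argv[0]);
            return 2;
        }
        int fd = open(argv[2], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0)
            return 2;
        /* One mmap instead of a read loop; the kernel pages the file in
           on demand and the whole file is searchable as one buffer. */
        char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                          MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED)
            return 2;
        const char *hit = memmem(data, (size_t)st.st_size,
                                 argv[1], strlen(argv[1]));
        if (hit != NULL)
            printf("match at offset %td\n", hit - data);
        munmap(data, (size_t)st.st_size);
        close(fd);
        return hit != NULL ? 0 : 1;
    }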

EDIT: Memory maps also have the downside of being susceptible to SIGBUS if
the file being searched is truncated mid-search. In this case, ripgrep aborts.
(This is mentioned in the README, and you can avoid this in all scenarios by
passing --no-mmap.)

------
The_rationalist
[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

~~~
zests
This part seemed very applicable to the thread at hand:

> It supports searching with either memory maps or by searching incrementally
> with an intermediate buffer. The former is better for single files and the
> latter is better for large directories. ripgrep chooses the best searching
> strategy for you automatically.

------
zquestz
Going to just leave this here:

[https://sift-tool.org/performance](https://sift-tool.org/performance)

------
klyrs
[2010]

This is an inspiration.

~~~
jasonmp85
Conversely, your post is depressing.

You’re impressed someone wrote something worth reading after a decade?

~~~
klyrs
One, your interpretation of my comment widely misses the mark. First paragraph
indicates the title is (was) lacking a date. Second paragraph is how I feel
about the article itself.

Two, I said "inspiration" and not "impressive." I aspire to be this diligent
about performance. None of the actual tricks are terribly impressive to me, as
I'm familiar with them. The communication itself, from a GNU author to the BSD
community, clearly explaining how they can achieve similar performance, is an
exemplary open source attitude. That's even more inspirational than the actual
tricks, IMO.

