
Why GNU grep is fast (2010) - jacobedawson
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
======
Twirrim
It's probably worth pointing out that OS X uses BSD grep. You will see a
phenomenal difference if you install GNU grep via Homebrew:

    
    
        $ brew info grep
        GNU grep, egrep and fgrep
        https://www.gnu.org/software/grep/
        /usr/local/Cellar/grep/3.3 (21 files, 885.3KB) *
          Poured from bottle on 2019-01-03 at 12:23:37
        From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/grep.rb
        ...
    

This is a fairly arbitrary example, but I've got a 1.5 GB application log sitting
on my machine right now that'll suffice. I'll even do this in a way that might
give a slight performance advantage to BSD, by using GNU grep first:

    
    
        $ time /usr/local/bin/ggrep "foobarbaz" application.log
        
        real 0m1.319s
        user 0m0.948s
        sys 0m0.345s
    

vs OS X's grep:

    
    
        $ time /usr/bin/grep "foobarbaz" application.log
        
        real 0m37.225s
        user 0m31.036s
        sys 0m1.286s
    

1s vs 37s. Same file.

There's nothing odd about the file: straight, boring application logs, where each
line starts with the logging level in capital letters. So I can use a regex
anchored to the start of the line, which is about as optimal as it gets:

First, GNU grep:

    
    
        $ time /usr/local/bin/ggrep -c "^INFO" application.log
        1527786
        
        real 0m0.622s
        user 0m0.323s
        sys 0m0.292s
    

then OS X's native BSD grep:

    
    
        $ time /usr/bin/grep -c "^INFO" application.log
        1527786
        
        real 0m3.588s
        user 0m3.206s
        sys 0m0.349s
    

BSD grep fared significantly better here than in the prior search, but it is
_still_ notably slower than GNU grep.

In my experience, this isn't the most pathological example, but it's close.
Note that this also applies to other tools that use grep under the hood, like
zgrep.

I've aliased grep to ggrep in my shell so that I avoid BSD grep whenever I
can:

    
    
        $ alias grep
        alias grep='/usr/local/bin/ggrep'

~~~
maxxxxx
How can it crunch through 1GB in 1s? Even just reading the data would take
longer on any system I know.

~~~
cyphar
As TFA mentions,

> #1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.

TFA is incredibly short, and will explain it much better than I can.

~~~
saagarjha
> it AVOIDS LOOKING AT EVERY INPUT BYTE

This would not help, since the backing storage doesn't provide support for
this kind of resolution. It would end up reading in the entire file anyways,
unless your input string is on the order of an actual block.

~~~
BeeOnRope
Sure it would help: not for the IO part, but for the CPU-bound part of actually
checking each character, which is apparently the tighter bound in this case.

~~~
saagarjha
Yeah, that's why the article talks about decreasing the amount of CPU work.
From the context of disk IO though (which is what this thread seems to be
about) this can't help.

------
harveywi
GNU grep is an excellent tool! A similar tool that is rising in popularity is
ripgrep
([https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)),
which can be orders of magnitude faster than GNU grep. Give it a try!

~~~
armitron
In my personal tests, it's not at all orders of magnitude faster than GNU
grep. In fact, for single-file, non-Unicode greps (most of my usage) the
difference is so small as to be imperceptible in interactive use.

Even in multi-file scenarios, the difference is nowhere near as large as the
difference between GNU grep and BSD grep. This means that compatibility with
GNU grep takes priority for me and it's not worth switching over to ripgrep.

~~~
burntsushi
It's a simple breakdown in communication. I try to be upfront about this, but
no matter how hard I try or how many times I try to clarify it, I can't stop
the spread of inaccurate claims. (And my goodness, this entire HN thread has
numerous inaccuracies about GNU grep alone.)

On _equivalent_ tasks, ripgrep is not orders of magnitude faster than GNU
grep, outside of pathological cases that involve Unicode support. (I can
provide evidence for that if you like.)

For example, in my checkout of the Linux kernel, here's a recursive grep that
searches everything:

    
    
        $ time LC_ALL=C grep -ar PM_RESUME | wc -l
        17
    
        real    1.176
        user    0.758
        sys     0.407
        maxmem  7 MB
        faults  0
    

Now compare that with ripgrep, with a command that uses the same amount of
parallelism and searches the same amount of data:

    
    
        $ time rg -j1 -uuu PM_RESUME | wc -l
        17
    
        real    0.581
        user    0.187
        sys     0.384
        maxmem  7 MB
        faults  0
    

That's 2x faster, but not an "order of magnitude." Now compare it with how long
ripgrep takes using the _default_ command:

    
    
        $ time rg PM_RESUME | wc -l
        17
    
        real    0.125
        user    0.646
        sys     0.654
        maxmem  19 MB
        faults  0
    

At 10x faster, this is where you start to get to "order of magnitude" faster
claims. But for someone who cares about precise claims with respect to
performance, this is uninteresting because ripgrep is 1) using parallelism and
2) skipping some files due to `.gitignore` and other such rules.

You can imagine that if your directory has a lot of large binary files, or if
you're searching in a directory with high latency (a network mount), then you
might see even bigger differences from ripgrep without generally seeing a
difference in search results because ripgrep _tends_ to skip things you don't
care about anyway.

In summary, there is an impedance mismatch when talking about performance
because most people don't have a good working mental model of how these tools
work internally. Many people report on their own perceived performance
improvements and compare that directly to how they used to use grep. They
aren't wrong in a certain light, because ultimately, the user experience is
what matters. But of course, they _are_ wrong in another light if you're
interpreting it as a precise technical claim about the performance
characteristics of a program.

~~~
dralley
Whoa, which "time" command are you using that provides memory and page fault
info?

~~~
burntsushi
I'm using the `time` built-in from zsh, with this config in my ~/.zshrc:

    
    
        TIMEFMT=$'\nreal\t%*E\nuser\t%*U\nsys\t%*S\nmaxmem\t%M MB\nfaults\t%F'

------
abhinai
Early on in my career, like most novice programmers, I thought that custom-
written C programs could be much faster than Unix tools if written well and
for a specific purpose. However, I could not beat the speed of Unix tools like
_grep_, _cut_ or _cat_ even once. That is when I realized just how well
written these tools are and just how much optimization work has gone into them.

~~~
_bxg1
It's amazing to me how programs like these seem to have avoided the universal
phenomenon of technical debt. They've crystallized into an ideal version of
themselves and haven't continued to decay past that point. Maybe it's because
of the Unix philosophy of single-purpose programs; no feature creep tends to
mean no technical debt.

~~~
asveikau
There is truth in what you say; however, if you think GNU tools are free of
technical debt or feature creep, look into how ./configure works, as an
example.

Or, there's the old joke about GNU echo: [https://www.gnu.org/fun/jokes/echo-
msg.en.html](https://www.gnu.org/fun/jokes/echo-msg.en.html)

~~~
wahern
Autoconf code is actually quite clean--you just need to know M4 and have an
appreciation of the problems autoconf was built to overcome. What changed over
time is that most proprietary unix systems disappeared[1], and POSIX
compliance and general consistency have improved dramatically, BSDs and GNU
included.

Also, best practices have shifted to a continuous development model which
keeps everybody on the upgrade treadmill. There's less concern with
maintaining backwards compatibility and catering to those not running the
latest environments. So if you make use of some newer Linux kernel API there's
only a short window where people will put in the effort to maintain a
compatibility mode, assuming they bother at all.

Lastly, containers mean people often develop and ship static environments that
can be maintained independently, sometimes never upgraded at all.

What I find interesting is how people have begun to ditch autoconf in favor of
even _more_ complex (but newer and therefore cooler) build systems when
ironically there's less need than ever for these things. Autoconf doesn't need
replacing; such layers can often be left out entirely.

That said, when feature detection and backwards compatibility _truly_ matters
there's no good alternative. CMake, for example, effectively requires the end
user to install the latest version of CMake, and if you _already_ expect
someone to install the latest version of something then why the contrivance at
all? I always sigh aloud whenever I download a project that relies on CMake
because I know that I now have _two_ problems on my hands, not just one. (But
better CMake than the other alternatives--I just won't even bother.)

[1] All that's left are AIX and Solaris. HP-UX/PA-RISC will be officially dead
next year, and EOL for HP-UX/Itanium is 2025. From a commercial perspective
Solaris seems to be deader than AIX, however Solaris still seems to see more
development--especially improved POSIX and Linux compatibility. It's much
easier to port to Solaris than AIX. It's a real shame Solaris is disappearing
because on big machines with heavy workloads the OOM killer is a fscking
nightmare on Linux. Solaris and Windows (and maybe AIX?) are the only
operating systems that do proper and thorough memory accounting, permitting
you to write reliable software.[2] The cloud services principle that says
individual processes are expendable doesn't work when your job takes hours or
days to run. (Or even just minutes, because workloads accumulate when the OOM
killer starts shooting things down, and even in the cloud you run into hard
limits on resource usage--i.e. cap on numbers of nodes. Memory overcommit is
just like network buffer bloat--intended to improve things at the small scale
but which results in catastrophic degradation at the macro level.)

[2] You can disable overcommit on Linux but the _model_ of overcommit is baked
too deeply into the kernel's design. A machine can still end up with the OOM
killer shooting down processes if, for example, the rate of dirty page
generation outpaces the patience of the allocator trying to reclaim and access
memory that is technically otherwise available.

~~~
overgard
Well, there's a lot of gross things about CMake (the syntax of its DSL is
horrifying), but in my experience if you want Windows support it's a lot
better than autoconf and make.

------
anonu
Previous Discussions:

[https://news.ycombinator.com/item?id=1626305](https://news.ycombinator.com/item?id=1626305)

[https://news.ycombinator.com/item?id=2393587](https://news.ycombinator.com/item?id=2393587)

[https://news.ycombinator.com/item?id=2860759](https://news.ycombinator.com/item?id=2860759)

[https://news.ycombinator.com/item?id=6814153](https://news.ycombinator.com/item?id=6814153)

[https://news.ycombinator.com/item?id=12350890](https://news.ycombinator.com/item?id=12350890)

[https://news.ycombinator.com/item?id=9153203](https://news.ycombinator.com/item?id=9153203)

[https://news.ycombinator.com/item?id=6813937](https://news.ycombinator.com/item?id=6813937)

(edited)

~~~
dang
Thanks! It's great to provide links to previous discussions.

Just so everybody knows: the links are for curiosity purposes. Reposts are ok
after about a year, as the FAQ explains:
[https://news.ycombinator.com/newsfaq.html](https://news.ycombinator.com/newsfaq.html)

------
blattimwind
For many situations, the more complex substring search algorithms gave way to
raw brute force some time ago, I believe. For example, if you are looking for
a given relatively short string, you can just take a prefix of, say, four
bytes and then make parallel comparisons within a vector; this simple
technique already gets you down to the general area of 1 cycle per byte (cpb).
The SSE 4.2 PCMPSTR family of instructions is basically the same thing but
microcoded, and is a bit faster.

For short patterns, which are imho by far the most common use, any algorithm
that tries to be smart and skip a couple bytes wastes cycles on being smart
where a simpler brute force algorithm has already fetched the next 16 bytes
and started to compare them while the prefetcher already went off to get the
next line from L2.
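
To make the brute-force vector idea concrete, here is a minimal sketch in C
using SSE2 intrinsics: broadcast each of the first four needle bytes into its
own register, test 16 candidate start positions per iteration, and only fall
back to memcmp where the whole 4-byte prefix lines up. This is an illustration
of the technique only, not grep's (or anyone's) production code; the function
name and structure are mine.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <string.h>
    #include <stddef.h>

    /* Find `needle` in `hay` by checking its first four bytes at 16 candidate
     * positions per iteration; only positions where the whole prefix matches
     * fall through to memcmp(). Illustrative only. */
    const char *find_prefix_sse2(const char *hay, size_t hay_len,
                                 const char *needle, size_t needle_len)
    {
        if (needle_len < 4 || hay_len < needle_len)
            return NULL;

        const __m128i b0 = _mm_set1_epi8(needle[0]);
        const __m128i b1 = _mm_set1_epi8(needle[1]);
        const __m128i b2 = _mm_set1_epi8(needle[2]);
        const __m128i b3 = _mm_set1_epi8(needle[3]);

        size_t i = 0;
        while (i + 16 + needle_len <= hay_len) {
            /* candidate j matches iff hay[i+j .. i+j+3] == needle[0..3] */
            __m128i m = _mm_and_si128(
                _mm_and_si128(
                    _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(hay + i)),     b0),
                    _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(hay + i + 1)), b1)),
                _mm_and_si128(
                    _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(hay + i + 2)), b2),
                    _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(hay + i + 3)), b3)));
            int mask = _mm_movemask_epi8(m);
            while (mask) {
                int j = __builtin_ctz(mask);   /* gcc/clang builtin: lowest set bit */
                if (memcmp(hay + i + j, needle, needle_len) == 0)
                    return hay + i + j;
                mask &= mask - 1;              /* clear it, try the next candidate */
            }
            i += 16;
        }
        /* scalar tail for the last few positions */
        for (; i + needle_len <= hay_len; i++)
            if (memcmp(hay + i, needle, needle_len) == 0)
                return hay + i;
        return NULL;
    }

The memcmp only runs at positions where all four prefix bytes already matched,
so for most inputs nearly all the work stays in the wide compares.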

~~~
glangdale
When did you measure SSE4.2 against straightforward vector usage and find it
faster? In my experience this was true more or less _never_. There's almost
nothing in SSE4.2 that can't be done better with SSSE3. Intel has slowly
deprecated SSE4.2 by not promoting it to wider vectors (it is still 128b when
most other instructions are 512b) and letting the throughput/latency stagnate.

1 cycle per byte is not a good result for single string search; you can
practically do a shift-and comparison of your input at that speed using
general purpose registers.

~~~
blattimwind
That work was all in the context of searching for a binary string with a mask.
I didn't try too long to optimize it, since performance was quickly
satisfactory, so it's not just possible, but rather very likely that my
implementation isn't particularly good. I'll keep your advice about SSE 4.2 in
mind should I revisit that code.

[https://github.com/rust-
lang/regex/blob/master/src/literal/t...](https://github.com/rust-
lang/regex/blob/master/src/literal/teddy_ssse3/imp.rs) might be of interest to
casual readers (not necessarily you, considering you were probably very
involved in developing that algorithm ;)

~~~
glangdale
I wouldn't use "Teddy" to look for single strings, at least not without heavy
modification. The boring approach of hunting for a predicate or two with
PCMPEQB or equivalent then shift-and'ing things together has worked well in
practice for that sort of thing, although it can be a bit brittle if you get
the predicate(s) wrong.

------
xiphias2
Ripgrep uses SSE3 parallelization instead of skipping input bytes to get
faster on current architectures.

~~~
taeric
I'm assuming it still does both. Parallel is nice. Never touching a byte is
tough to beat, though.

I've often thought of making sure my IDs are uncommon characters to exploit
the ability to skip a lot.

~~~
burntsushi
It does not. ripgrep does not use Boyer-Moore in most searches.

In particular, the advice in the OP is generally out of date. The "secret"
sauce to ripgrep's speed in simple literal searches is a simple heuristic:
choose the rarest byte in the needle and feed that to memchr. (The "heuristic"
is that you can't actually know the optimal choice, but it turns out that a
guess works pretty well most of time since most things you search have a
similar frequency distribution.)

The SSSE3 optimizations come from Hyperscan, and are only applicable when
searching a small number of small patterns. e.g.,
`Holmes|Watson|Moriarty|Adler`.

In other words, for common searches (which are short strings), it is much
better to spend more time in a vectorized routine than to try to skip bytes.
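
A rough sketch of that rarest-byte heuristic in C (not ripgrep's actual code;
the function and names are mine, and the frequency function below is a crude
stand-in for a real precomputed table):

    #include <string.h>
    #include <stddef.h>

    /* Crude stand-in for a real precomputed byte-frequency table: treat
     * whitespace, letters and digits as common, everything else as rare. */
    static unsigned assumed_freq(unsigned char b)
    {
        if (b == ' ' || b == '\n' || b == '\t') return 255;
        if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')) return 200;
        if (b >= '0' && b <= '9') return 150;
        return 10;
    }

    const char *rare_byte_search(const char *hay, size_t hay_len,
                                 const char *needle, size_t needle_len)
    {
        if (needle_len == 0 || hay_len < needle_len)
            return NULL;

        /* Pick the offset of the (assumed) rarest byte in the needle. */
        size_t rare = 0;
        for (size_t i = 1; i < needle_len; i++)
            if (assumed_freq((unsigned char)needle[i]) <
                assumed_freq((unsigned char)needle[rare]))
                rare = i;

        const char *p = hay;
        const char *end = hay + hay_len - needle_len + 1; /* one past last valid start */
        while (p < end) {
            /* memchr hops to the next occurrence of the rare byte... */
            const char *hit = memchr(p + rare, needle[rare], (size_t)(end - p));
            if (!hit)
                return NULL;
            /* ...and the full needle is only verified at that candidate. */
            const char *cand = hit - rare;
            if (memcmp(cand, needle, needle_len) == 0)
                return cand;
            p = cand + 1;
        }
        return NULL;
    }

The point is that the byte-at-a-time scanning happens inside memchr, which is
a heavily vectorized library routine, and the full comparison only runs at the
(hopefully rare) candidate positions.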

~~~
adrianratnapala
>... but it turns out that a guess works pretty well most of the time since most
things you search have a similar frequency distribution.)

This kind of thing is an anti-pattern unless the benefit over Boyer-Moore is _huge_.

A small performance gain in the common-case is not worth the pain of
introducing pathological cases that only bite you once you are deeply
committed.

Presumably this is not too bad for ripgrep itself, as long as it falls back to
something sensible when the assumption fails.

~~~
burntsushi
This is wrong. Boyer-Moore does the same thing. It just always selects the
last byte to feed into memchr. If that byte happens to be rare, then it works
great. But if it happens to be common, then it gets into the weeds pretty
quickly. My suggested heuristic does the same thing, but increases the chances
that it selects a rare byte.

And yes, the performance difference can be very large. Here's an example on a
~9.3GB file:

    
    
        $ time rg --no-mmap 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
        6698
        
        real    3.006
        user    1.658
        sys     1.345
        maxmem  8 MB
        faults  0
        
        $ time grep 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
        6698
        
        real    9.023
        user    7.921
        sys     1.092
        maxmem  8 MB
        faults  0
    

Notice that the pattern is `Sherlock `. The last byte is an ASCII space
character, which is incredibly common. Boyer-Moore blindly picks this as the
skip byte, but ripgrep uses a simple pre-computed frequency table to select a
better byte as the skip byte.

> Presumably this is not too bad for ripgrep itself, as long as it falls back
> to something sensible when the assumption fails.

It does. That's why I said, "ripgrep does not use Boyer-Moore in _most_
searches."

------
prolepunk
I've been using ack ([https://beyondgrep.com/](https://beyondgrep.com/)) for a
while and it seems to be better suited for my specific purpose -- looking into
source files and avoiding things like data dumps and VCS directories. It may
be slower than grep, but it is faster in the sense that I don't spend time
remembering all the parameter combinations when I search for something.

Here are a few aliases I use:

        alias ack="ack --color"   # color output
        alias ackl="ack -l"       # show file names only
        # do a case-insensitive search and open matching files in Sublime
        acksubl () { ack -l -i "${1}" | xargs subl; }

It turns out that in practice I don't use regular expressions very often when
searching text, and the most frequent question is: where might this
function/variable have been used?

------
Erwin
This is the journal referenced in the mailing list:
[https://onlinelibrary.wiley.com/journal/1097024x](https://onlinelibrary.wiley.com/journal/1097024x)

Seems the articles are free, if you don't want to pay the €6085 yearly
institutional subscription.

Other than Adrian's [https://blog.acolyer.org/](https://blog.acolyer.org/),
what else is there worth reading that is closer to engineering than theory? I
wish we still had DDJ or the C/C++ Users Journal, but today's periodicals are
things like Raspberry Pi enthusiast magazines or Monthly Minecraft Tips.

------
js2
See also “The Treacherous Optimization (2006)”:

[http://ridiculousfish.com/blog/posts/old-age-and-
treachery.h...](http://ridiculousfish.com/blog/posts/old-age-and-
treachery.html)

------
lazyant
One of my favorite idioms for searching for a string, `find /path -type f |
xargs grep blah`, has been replaced by `ag`
([https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher));
ag is instant in a big repo compared to find|grep.

~~~
twalla
It's mentioned elsewhere in this thread, but you should try ripgrep:

[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

~~~
hwj
ag has PCRE enabled by default. That's why I prefer it over ripgrep.

~~~
burntsushi
You might already know this, but the current release of ripgrep has support
for PCRE2 (as opposed to PCRE1 in ag). All you need to do is pass the -P flag.
If you want it enabled by default, then:

    
    
        echo -P > $HOME/.ripgreprc
    

and add this to your .bashrc or equivalent:

    
    
        export RIPGREP_CONFIG_PATH="$HOME/.ripgreprc"

------
westbywest
I think the author phrased it somewhat differently, but my understanding is
that grep's high throughput also comes from its use of what we (a Computer
Engineering grad research group) referred to as a variable state machine. A
colleague was researching an analogous implementation on an FPGA, for gigabit
line-speed throughput. The preferred Computer Science term is apparently
Thompson's NFA.
[https://en.wikipedia.org/wiki/Thompson%27s_construction#The_...](https://en.wikipedia.org/wiki/Thompson%27s_construction#The_algorithm)

~~~
burntsushi
Thompson's NFA construction, when used directly to search via an NFA
simulation, is dog slow. It does give you an O(nm) search, but in practice it's
slow. AIUI, GNU grep uses a lazy DFA, which uses Thompson's NFA construction
to build a DFA at search time. This does indeed lead to pretty good
performance for the regex engine. But GNU grep's speed largely comes from
optimizing the common case: extracting literals from your pattern, identifying
_candidate_ matching lines, and then checking with the full regex engine to
confirm (or reject) the match.

~~~
glangdale
I suspect Thompson's NFA is not inherently dog slow (Glushkov can be done
reasonably fast for decent-sized NFAs). The fact is that most Thompson-lineage
engines opted for the 'lazy DFA' approach and optimized that (which is
effective until it isn't). I imagine a more aggressive 'native' Thompson NFA
is possible. A nice benefit of that is not having to write to your bytecode -
there's a good deal of systems-level complexity stuff in RE2 that springs out
of a consequence of the 'lazy DFA construction' decision.

That being said, matching literals is always going to be faster, especially if
you decompose the pattern to get more use out of your literal matcher - the
downside of filtration is that if the literal is always present, you are just
doing strictly more work. At least with decomposition you've taken the literal
out of the picture. See [https://branchfree.org/2019/02/28/paper-hyperscan-a-
fast-mul...](https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-multi-
pattern-regex-matcher-for-modern-cpus/) for those who don't know what I'm
talking about (I know you've read it).

Am flirting with doing another regex engine that gets some of the benefit of
decomposition and literal matching without taking on the nosebleed complexity
of Hyperscan...

~~~
burntsushi
Do you know of any fast Thompson NFA simulation implementation? I don't think
I've seen one outside of a JIT.

Is there a fast Glushkov implementation that isn't bit parallel? I've never
been able to figure out how to use bit parallel approaches with large Unicode
classes. Just using a single Unicode aware \w puts it into the weeds pretty
quickly. That's where the lazy DFA shines, because it doesn't need to build
the full DFA for \w (which is quite large, even after the standard DFA
compression tricks).

~~~
glangdale
Unicode is a PITA. In Hyperscan, it's not pretty what gets generated for a
bare \w in UCP mode if you force it into an NFA (it's rather more tractable as
a DFA, even if you aren't lazily generating, although of course betting the
farm that you can always 'busily' generate a DFA isn't great).

I've always thought that a better job of doing NFAs (Glushkov or otherwise) and
staying bit-parallel would be done by having character reachability on
codepoints, not bytes, generally remapping down to 'which codepoints make an
actual difference'. This sounds ugly/terrifying, but the nice thing is that
remapping a long stream of codepoints could be done in parallel (as it's not
hard to find boundaries) and with SIMD. Step by step NFA or DFA work is more
ponderous as every state depends on previous states.

~~~
burntsushi
Yeah, I've looked at Glushkov based primarily on your comments about it, but
Unicode is always where I get stuck. In my regex engine, Unicode is enabled by
default and \w is fairly common, so it needs to be handled well.

And of course, one doesn't need to bet the farm on a lazy DFA if you have one,
although it is quite robust in a large number of practical scenarios. (I think
RE2 does bet the farm, to be fair.)

~~~
glangdale
Unicode + UCP is a perfectly principled thing, but it wasn't a design point
that made any sense for Hyperscan as a default. The bulk of our customers were
not interested in turning 1 state for ASCII \w into 600 states for UCP \w
unless it was free.

I think both Glushkov and Thompson can be done fast, but I agree that they are
both going to be Really Big for UCP stuff. Idle discussions among the ex-
Hyperscan folks generally lean towards 'NFA over codepoints' being the right
way of doing things.

Occam's razor suggests if you do only 1 thing in a regex system (i.e.
designing for simplicity/elegance, which would be an interesting change after
Hyperscan) it _must_ be NFA, as not all patterns determinize. If you are OK
with a lazy DFA system that can be made to create a new state per byte of
input (in the worst case) then I guess you can do that too.

I am not sure how to solve the problem of "NFA over codepoints", btw. Having
no more than 256 distinct characters was easy, but even with remapping, the
prospect of having to handle arbitrary Unicode is... unnerving.

~~~
burntsushi
Yeah, my Thompson NFA uses codepoints for those reasons. But not in a
particularly smart way; mostly just to reduce space usage. It is indeed an
annoying problem to deal with!

------
GalacticDomin8r
FWIW, BSD grep has significantly closed the gap since then, often by
replicating the GNU approach in some ways.

BSD grep also has other advantages; primarily, it's not GNU grep.

~~~
wglb
This is curious, as I just did a test with grep and ggrep. The latter is
almost 3 times faster for a very common use case I have.

~~~
GalacticDomin8r
grep -V ? ldd grep ? Use case sample ?

~~~
wglb
grep (BSD grep) 2.5.1-FreeBSD

The other commenter in this thread pointed out that this is a very old version
and the newer bsd version is better.

    
    
        real 0m2.044s
        user 0m1.932s
        sys 0m0.085s
        wgl@pondera:~$ time grep LiteonTe *.text | wc -l
           11020
        
        real 0m1.939s
        user 0m1.905s
        sys 0m0.038s
        wgl@pondera:~$ time ggrep LiteonTe *.text | wc -l
           11020
        
        real 0m0.130s
        user 0m0.087s
        sys 0m0.037s
        wgl@pondera:~$ time ggrep LiteonTe *.text | wc -l
           11020
        
        real 0m0.119s
        user 0m0.088s
        sys 0m0.035s
        wgl@pondera:~$ du -h -s *.text
        128M Kismetkismet-kali-pondera-20190325-08-46-27-1.pcapdump.text
    

The first run was done to warm the cache and its number discarded. Thus the
'grep' you see above is the second run over the 128 MB pcap file expanded with
tshark.

Dramatic.

I'll stay with GNU grep and not update the stock one for now.

~~~
GalacticDomin8r
It would help if you tested just grep when benchmarking grep. These datapoints
tell a much different story.

    
    
      # /usr/bin/grep -V
      grep (BSD grep) 2.6.0-FreeBSD
      
      root@m6600:~ # /usr/local/bin/grep -V
      grep (GNU grep) 3.3
      Copyright (C) 2018 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
      
      Written by Mike Haertel and others; see
      <https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
      
      root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
      60
              0.54 real         0.45 user         0.07 sys
      root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
      60
              0.54 real         0.44 user         0.08 sys
      root@m6600:~ # /usr/bin/time /usr/bin/grep X-User-Agent packetdump.pcap -c
      60
            0.54 real         0.41 user         0.11 sys
      root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
      60
              0.58 real         0.49 user         0.08 sys
      root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
      60
              0.60 real         0.48 user         0.11 sys
      root@m6600:~ # /usr/bin/time /usr/local/bin/grep X-User-Agent packetdump.pcap -c
      60
              0.59 real         0.50 user         0.08 sys
      root@m6600:~ # du -h -s packetdump.pcap
      225M packetdump.pcap

~~~
wglb
That is a very good point. Taking this better approach, here is what I get on
my (not updated grep) system:

    
    
        wgl:$ /usr/bin/grep --version
        /usr/bin/grep --version
        grep (BSD grep) 2.5.1-FreeBSD
        
        wgl:$ /usr/local/bin/ggrep --version
        /usr/local/bin/ggrep --version
        ggrep (GNU grep) 3.3
        Packaged by Homebrew
        Copyright (C) 2018 Free Software Foundation, Inc.
        License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
        This is free software: you are free to change and redistribute it.
        There is NO WARRANTY, to the extent permitted by law.
        
        Written by Mike Haertel and others; see
        <https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
        
        wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text | wc -l
        /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text | wc -l
                2.30 real         1.04 user         0.67 sys
            1228
        wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text | wc -l
        /usr/bin/time /usr/bin/grep LiteonTe really-big.text | wc -l
                5.65 real         5.30 user         0.33 sys
            1228
        wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text >/dev/null
        /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text >/dev/null
                0.05 real         0.03 user         0.01 sys
        wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text >/dev/null
        /usr/bin/time /usr/bin/grep LiteonTe really-big.text >/dev/null
                6.50 real         5.71 user         0.58 sys
        wgl:$ /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text -c
        /usr/bin/time /usr/local/bin/ggrep LiteonTe really-big.text -c
        1228
                2.33 real         1.05 user         0.69 sys
        wgl:$ /usr/bin/time /usr/bin/grep LiteonTe really-big.text -c
        /usr/bin/time /usr/bin/grep LiteonTe really-big.text -c
        1228
                5.37 real         5.05 user         0.31 sys
    

The wc -l is clearly polluting the result. However, I suspect that the
>/dev/null is as well. But in the worst case, I see a halving of time over the
old grep (edited), which correlates with my most common use of grep in looking
through source files.

~~~
GalacticDomin8r
The compiler is also going to make an impact, which for me is consistent across
both grep binaries.

FreeBSD clang version 6.0.1, as well as -O2.

I suspect there are still edge cases where BSD grep is quite a bit slower or
not compatible with GNU grep. However, with a closer apples-to-apples
comparison there isn't much difference anymore for my usage, which is a lot of
grep use, but pretty vanilla.

There may also be other OS differences in our comparison. My tests were run
against a fairly recent FreeBSD 12-STABLE.

------
peterwwillis
> The key to making programs fast is to make them do practically nothing. ;-)

Also key to making them small, stable, intuitive, and usable

~~~
saagarjha
FWIW, GNU's tools tend to be much larger and more complicated than competing
versions of the same command.

~~~
peterwwillis
That's natural when you try to support a standard, are multi-platform, and
have extra features.

------
z3t4
Grep is a damn useful tool; I use it every day. But is it just me, or does grep
sacrifice convenience for performance? E.g., it would rather skip some bytes
than make sure it finds everything. And case-sensitive is the default, when
most of the time you want to find all case variations.

~~~
burntsushi
No. Skipping bytes is only done when you can prove there isn't a match. Please
see my other comments in this thread.

If you want case insensitive matching by default, then use:

    
    
        alias grep="grep -i"

------
wscott
"Old age and treachery will beat youth and skill every time."
[http://ridiculousfish.com/blog/posts/old-age-and-
treachery.h...](http://ridiculousfish.com/blog/posts/old-age-and-
treachery.html)

------
KVFinn
I often have the experience of being like, "Surely I can't just do this entire
task with a bunch of greps in a timely manner, it's going to take forever,
right?" and then boom, the script takes like a minute to finish sorting
through bazillions of lines.

------
dreamcompiler
Yes, Boyer-Moore is ridiculously fast, but it's a string search algorithm, not
a regular expression search algorithm. You can't use it for regexes. Is TFA
still true when you're not searching for literal strings?

~~~
burntsushi
Yes, because it will extract literals from your regex and seaech for those
first. When a match of the literal is found, all you then need to do is find
the line boundaries of that match and run the regex engine on just that line.
It ends up working pretty well.
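
A minimal sketch of that shape in C (not GNU grep's actual code; the function
and parameter names are mine, it uses POSIX regexec for the verification step,
and it takes the already-extracted literal as a parameter rather than deriving
it from the pattern):

    #define _GNU_SOURCE     /* for memmem() */
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    /* Scan for a required literal with memmem(), expand each hit to its line,
     * and only run the regex engine on that one line. Sketch only. */
    void grep_buffer(const char *buf, size_t len,
                     const regex_t *re, const char *literal)
    {
        const char *p = buf;
        const char *end = buf + len;
        size_t lit_len = strlen(literal);

        while (p < end) {
            /* prefilter: find the next candidate via the required literal */
            const char *hit = memmem(p, (size_t)(end - p), literal, lit_len);
            if (!hit)
                return;

            /* expand the hit to the enclosing line */
            const char *line_start = hit;
            while (line_start > buf && line_start[-1] != '\n')
                line_start--;
            const char *line_end = memchr(hit, '\n', (size_t)(end - hit));
            if (!line_end)
                line_end = end;

            /* verify: run the full regex on just this line
             * (lines longer than the scratch buffer are skipped in this sketch) */
            size_t n = (size_t)(line_end - line_start);
            char line[4096];
            if (n < sizeof(line)) {
                memcpy(line, line_start, n);
                line[n] = '\0';
                if (regexec(re, line, 0, NULL, 0) == 0)
                    printf("%.*s\n", (int)n, line_start);
            }

            p = line_end + 1;   /* resume after this line */
        }
    }

The expensive regex engine then only ever sees the handful of candidate lines
that the literal scan produced.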

Of course, if your regex has no required literals at all, then this
optimization can't be performed. GNU grep can slow down quite a bit in these
cases, especially if you have a non-C locale enabled and use some Unicode
aware features such as \w.

------
emmanueloga_
I keep this in my bookmarks: [1] (which I fished out of this other blog post
[2]).

Not sure if any actual search tools other than Scalyr use that specific
implementation, but it makes for an interesting read anyway if you are into
substring search algorithms (the algo described improves on Boyer-Moore).

1:
[http://volnitsky.com/project/str_search/](http://volnitsky.com/project/str_search/)

2: [https://www.scalyr.com/blog/searching-1tb-sec-systems-
engine...](https://www.scalyr.com/blog/searching-1tb-sec-systems-engineering-
before-algorithms)

------
ananonymoususer
This would depend upon which version of GNU grep one installs... About 10
years ago (when this article was written), I upgraded a RedHat system and
discovered that some of my scripts were running 10 times slower than they had
on the previous install. I isolated the issue to grep and benchmarked the grep
from the previous release against the current release. The new grep was 10
times slower. I didn't have much more time to waste on the issue so I simply
copied the grep binary from the previous release to the new install and
everything was fine afterward.

~~~
acdha
This sounds a lot like when they started making it handle Unicode correctly.
We ran into the same problem but fixed it by setting `LC_ALL=C` in the
affected log processing scripts.

------
EGreg
That’s so 2010. Here is what I like to read:
[https://blog.burntsushi.net/ripgrep/](https://blog.burntsushi.net/ripgrep/)

------
leonardmh
It’s not that fast (comparatively):
[https://blog.burntsushi.net/ripgrep/](https://blog.burntsushi.net/ripgrep/)

------
peter_d_sherman
Boyer-Moore + Loop Unrolling... fast.

But, perhaps it could be improved further via SSE instructions, or by sending
the data to a GPGPU and parallelizing the algorithm...

~~~
djmips
see ripgrep mentioned elsewhere in comments.

------
opportune
Just in case anybody wants to look at the code, I think these two files [0, 1]
contain the main logic when it comes to searching. Please correct me if these
files aren't the right ones:

[0]
[https://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c](https://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c)

[1]
[https://git.savannah.gnu.org/cgit/grep.git/tree/src/kwsearch...](https://git.savannah.gnu.org/cgit/grep.git/tree/src/kwsearch.c)

~~~
TheDong
If you're going to link to gnu grep's source code, you might as well link to
the actual git repo grep uses:
[https://git.savannah.gnu.org/cgit/grep.git/tree/src](https://git.savannah.gnu.org/cgit/grep.git/tree/src)

~~~
opportune
Thanks, I didn't realize you could actually look at the source in the browser
for the Savannah repo.

------
stirfrykitty
Also interesting: the Linux network stack is now faster than BSD's, which,
while a recent development, is still a feat.

~~~
auvi
Could you please point me to a link where I can read about it? I am curious.

~~~
stirfrykitty
[https://medium.com/@matteocroce/linux-and-freebsd-
networking...](https://medium.com/@matteocroce/linux-and-freebsd-networking-
cbadcdb15ddd)

~~~
auvi
thanks, I have got something to read for the weekend :)

------
oneplane
This is simply a very good technical mailing list discussion. Instead of the
whole BSD vs. GNU thing, you get to-the-point information sharing on
processing structures.

I do wonder about the later messages discussing the other string matching
algorithms (one of them using a trie to easily track back): while they are
considered in the discussion, I didn't find any search/regex implementation
using them.

~~~
burntsushi
Are you referring to Aho-Corasick? If so, there are definitely some regex
engines that use it, including I believe GNU grep:
[http://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c](http://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c)

~~~
oneplane
I must've been temporarily blind while scanning the src, thanks! The kinds of
implementations that bring together various data structures and algorithms are
usually rather interesting to read and learn from.

------
bibyte
In my experience, GNU CLI tools make a particular effort to be really fast
(even on embedded hardware).

~~~
creatornator
I think this is originally to avoid any possible plagiarism claims by Unix--if
a Unix tool was originally optimized for one of size/efficiency/speed/memory
use, the GNU tools were optimized for one of the others. This is part of why
the `yes` utility is so freakin fast in GNU, despite being kind of a weird use
case for speed.

Mac builtin:

        yes | pv > /dev/null
        139MiB 0:00:05 [28.4MiB/s]

GNU:

        gyes | pv > /dev/null
        3.31GiB 0:00:06 [ 584MiB/s]
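
The usual explanation for the GNU number is output buffering: instead of one
write per "y\n", fill a large buffer with the line repeated and write it in
big chunks. A hedged sketch of that idea in C (not GNU yes's actual source):

    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[64 * 1024];
        const char line[] = "y\n";
        const size_t line_len = sizeof(line) - 1;

        /* Fill the buffer with as many whole copies of the line as fit. */
        size_t filled = 0;
        while (filled + line_len <= sizeof(buf)) {
            memcpy(buf + filled, line, line_len);
            filled += line_len;
        }

        /* Each write(2) now emits ~32k lines instead of one. */
        for (;;)
            if (write(STDOUT_FILENO, buf, filled) < 0)
                return 1;
    }

The per-line cost is almost entirely amortized away, so throughput is limited
by the pipe rather than by syscall overhead.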

~~~
zorked
I don't know if it is or isn't the plagiarism thing but GNU tools were written
with a very different, way more modern style than Unix tools.

Unix was pretty much "do a quick loop using terse code" and "see what fits in
this static small buffer, and if it doesn't, quietly truncate". GNU would
reallocate buffers to fit, and take reasonable precautions on performing well
on tiny/huge inputs.

GNU was actually considered bloated by the Unix crowd ("why so many more
kilobytes of code?" ). It's an approach that scaled better for the future
though.

~~~
creatornator
See section 2.1 of [0] and the top response to this comment [1]. You'll note
that the GNU standards there encourage optimizing for the opposite of what Unix
did, to avoid collisions in strategy.

[0] [https://www.gnu.org/prep/standards/standards.html#Reading-
No...](https://www.gnu.org/prep/standards/standards.html#Reading-Non_002dFree-
Code)

[1]
[https://news.ycombinator.com/item?id=14543536](https://news.ycombinator.com/item?id=14543536)

------
systematical
How is looking at the final letter of a string faster than starting with the
first?

~~~
paulmd
As discussed in TFA:
[https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-
sea...](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-
search_algorithm)
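
The short answer: by checking the byte under the _last_ needle position first,
a mismatch on a byte that doesn't occur in the needle at all proves that no
match can overlap that window, so the search can jump forward by the whole
needle length without ever reading the bytes in between. A sketch in C of the
simplified (Horspool) bad-character variant, not grep's exact implementation:

    #include <string.h>
    #include <stddef.h>

    /* Simplified bad-character (Horspool) search: compare at the *last* needle
     * position first; a mismatch there usually lets us slide ahead by up to
     * the whole needle length without reading the bytes we skipped. */
    const char *horspool_search(const char *hay, size_t hay_len,
                                const char *needle, size_t needle_len)
    {
        if (needle_len == 0 || hay_len < needle_len)
            return NULL;

        /* shift[c]: how far we may slide when byte c sits under the last
         * needle position; bytes absent from the needle allow a full jump */
        size_t shift[256];
        for (int c = 0; c < 256; c++)
            shift[c] = needle_len;
        for (size_t i = 0; i + 1 < needle_len; i++)
            shift[(unsigned char)needle[i]] = needle_len - 1 - i;

        size_t pos = 0;
        while (pos + needle_len <= hay_len) {
            unsigned char last = (unsigned char)hay[pos + needle_len - 1];
            if (last == (unsigned char)needle[needle_len - 1] &&
                memcmp(hay + pos, needle, needle_len - 1) == 0)
                return hay + pos;
            pos += shift[last];   /* often a jump of needle_len bytes */
        }
        return NULL;
    }

For a 10-byte needle, whenever the byte under the last position doesn't appear
in the needle the loop hops 10 bytes at a time, which is where the "avoid
looking at every input byte" claim comes from.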

~~~
systematical
Yeah, that's easy to digest.

------
deytempo
I miss the days when the internet was mostly HTML pages.

~~~
nostrademons
The Internet still mostly _is_ HTML pages. It's just the portion of the
Internet that you look at on a daily basis that has moved to JS-heavy SPAs.

------
slowrabbit
Use ripgrep
([https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep));
it's faster, and appropriately named.

~~~
burntsushi
Origins of the name:
[https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#int...](https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#intentcountsforsomething)

------
wiredfool
TLDR: It cheats.

~~~
danso
That’s an interesting term in this context. The technique is so clever and the
gains so big that it does feel like a “cheat”. But as long as it replicates the
behavior, effects, and constraints of the previous algorithm, is it really a
“cheat” even in the loosest sense of the word? It feels far more apt to view the
prior work as “tryhard”.

------
alexnewman
I mean is it?

