

Grep too slow? Use git-grep - mparramon
http://developingandstuff.blogspot.de/2013/07/grep-too-slow-tired-of-waiting-for-it.html

======
dfc
I would like to know what version of grep is being used: GNU or BSD grep? My
money is that it is an OS X box with BSD grep.[1]

Using a clone of Linus's kernel tree on a six-year-old amd64 box running Debian sid:

    
    
      (master):dfc@fob-xray:~/lk$ time git grep "leap second" > /dev/null
    
      real    0m0.739s
      user    0m0.540s
      sys     0m0.650s
      (master):dfc@fob-xray:~/lk$ time grep -r --exclude-dir=.git "leap second" . > /dev/null
    
      real    0m0.833s
      user    0m0.400s
      sys     0m0.430s
      (master):dfc@fob-xray:~/lk$ du -sh .
      1.4G    .
    
    

[1] [http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...](http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

ADDENDUM:

After running a test on a Mac mini with 10.8.4 and a clone of the same
repository I am overly confident that the title should be "BSD grep too slow?
Use git-grep or GNU grep":

    
    
      jumbo:lk dfc$ purge ; time grep -r --exclude-dir=.git "leap second" . > /dev/null
    
      real	1m8.479s
      user	0m32.229s
      sys	0m2.348s
      jumbo:lk dfc$ purge ; time ggrep -r --exclude-dir=.git "leap second" . > /dev/null
    
      real	0m41.844s
      user	0m1.682s
      sys	0m6.035s
      jumbo:lk dfc$ purge ; time git grep "leap second" > /dev/null
    
      real	0m37.325s
      user	0m1.423s
      sys	0m4.213s
      jumbo:lk dfc$ gdu -sh .
      1.3G	.
    

You can read `ggrep` as either _GNU grep_ or _good grep._
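
For anyone unsure which grep their own machine runs, here is a minimal sketch. It assumes the build names itself on the first line of `grep --version`, which GNU grep and the BSD grep shipped with OS X both appear to do; `grep_flavor` is a made-up helper name, not part of either tool.

```shell
# Classify the grep found first on PATH by the first line of its version banner.
# Assumption: the banner contains "BSD" or "GNU". BSD is tested first because
# some BSD builds describe themselves as "GNU compatible".
grep_flavor() {
  banner=$(grep --version 2>&1 | head -n 1)
  case $banner in
    *BSD*) echo bsd ;;
    *GNU*) echo gnu ;;
    *)     echo unknown ;;
  esac
}

flavor=$(grep_flavor)
echo "grep on PATH looks like: $flavor"
```

If this reports `bsd`, the timings above suggest trying GNU grep (installed as `ggrep` in this comment) or git-grep before concluding that grep itself is the bottleneck.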

~~~
calinet6
Well shoot. `brew install grep` it is then.

And there's also ack, and ag... no shortage of options that are better than
BSD grep, apparently.

It's not "WOW GNU grep is fast!" it's, "wow, BSD grep is _slow._ "

------
MatthewPhillips
The Silver Searcher is where it's at:

[https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

~~~
ggreer
I'm glad you (and many others) like it. I've neglected Ag lately, but I'll
definitely come back to it once I'm less busy (probably in the fall). There
are some features users still want, but it's tricky to implement them without
performance regressions.

~~~
riquito
It amazes me every time that I use it. You did a fantastic job. Thank you.

------
kbuck
There's another interesting method for speeding up regex searches as well - a
trigram index[1]. This is what Google's code search used to do. A simple
command-line version written in Go[2] was released for local use. There's even
an ack replacement based upon it[3].

[1]
[http://swtch.com/~rsc/regexp/regexp4.html](http://swtch.com/~rsc/regexp/regexp4.html)
[2]
[http://code.google.com/p/codesearch/](http://code.google.com/p/codesearch/)
[3]
[https://github.com/rliebling/fastrAck](https://github.com/rliebling/fastrAck)
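
A toy illustration of the trigram idea, as I understand it from [1]: record which 3-character substrings occur in each file, and for a query only scan the files that contain all of the query's trigrams. This sketch handles literal strings only (the real tools also turn regexes into trigram queries), and every function and file name in it is made up for the example.

```shell
# Print each distinct 3-character substring of the lines on stdin.
trigrams() {
  awk '{ for (i = 1; i + 2 <= length($0); i++) print substr($0, i, 3) }' |
    LC_ALL=C sort -u
}

# Print "trigram<TAB>file" for every distinct trigram of every file argument.
build_index() {
  for f in "$@"; do
    trigrams < "$f" | awk -v f="$f" '{ print $0 "\t" f }'
  done
}

# Print the indexed files that contain every trigram of a literal query.
# Usage: candidates INDEX_FILE QUERY   (QUERY needs at least 3 characters)
candidates() {
  need=$(printf '%s\n' "$2" | trigrams | wc -l)
  printf '%s\n' "$2" | trigrams |
    awk -F '\t' -v n="$need" '
      NR == FNR    { want[$0] = 1; next }   # first input: query trigrams
      ($1 in want) { hits[$2]++ }           # second input: the index
      END          { for (f in hits) if (hits[f] == n + 0) print f }
    ' - "$1" |
    LC_ALL=C sort
}

# Demo on made-up files in a scratch directory.
demo=$(mktemp -d)
printf 'handle the leap second here\n' > "$demo/a.txt"
printf 'second leap of faith\n'        > "$demo/b.txt"
printf 'nothing relevant\n'            > "$demo/c.txt"
build_index "$demo"/*.txt > "$demo/index.tsv"

# Only the candidates get a real scan; grep then removes any false positives.
found=$(candidates "$demo/index.tsv" 'leap second' | xargs grep -l 'leap second')
echo "$found"
rm -r "$demo"
```

Here `b.txt` shares most trigrams with the query but lacks `p s` and ` se`, so it is never opened by grep. The payoff is that the index is built once and the expensive scan touches only a few files per search.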

------
iambot
Why is this? Would anyone with a better understanding of the situation care to
explain why it would be faster?

~~~
bbrks
I'm not quite sure. But here's why GNU grep is fast [1]

[1] [http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...](http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

~~~
dfc
See my comment[1], your intuition was correct.

[1]
[https://news.ycombinator.com/item?id=6016352](https://news.ycombinator.com/item?id=6016352)

------
gruseom
I haven't been able to figure out how to match a one-sided word boundary using
git-grep. For example, say a file includes the word "router". You can do this
to match the exact word:

    
    
      git grep -w router
    

But what if you want to match just the word boundary at the start ("rout") or
at the end ("outer")?

~~~
moonboots
The "--perl-regexp" flag to git grep enables Perl-flavored regexes, which
support explicit word-boundary matching with `\b`. To match only word starts,
you could use the following:

    
    
        git grep --perl-regexp "\brout"
    
    

and for just word endings:

    
    
        git grep --perl-regexp "outer\b"

~~~
gruseom
This doesn't work for me, nor do the other two users' suggestions. I even just
rebuilt git in case my version was out of date.

~~~
dfc
Are you using homebrew?

    
    
      $ brew info git 
      git: stable 1.8.3.2, HEAD
      http://git-scm.com
      /usr/local/Cellar/git/HEAD (1324 files, 29M) *
        Built from source with: --with-blk-sha1 --with-pcre
      From: https://github.com/mxcl/homebrew/commits/master/Library/Formula/git.rb
      ==> Dependencies
      Optional: pcre, gettext
      ==> Options
      --with-blk-sha1
    	Compile with the block-optimized SHA1 implementation
      --with-gettext
    	Build with gettext support
      --with-pcre
    	Build with pcre support
      --without-completions
    	Disable bash/zsh completions from "contrib" directory
    
    

What does "built from source with" say?

~~~
gruseom
I don't use homebrew, but your comment made me realize what I needed to do to
successfully compile git with pcre. As a result, this now works:

    
    
      git grep -P "outer\b"
    

Yay! Thank you and moonboots.

This is an important feature for me, because I organize my code lexically
(primarily by being disciplined about using unique names for things) and rely
on grep to quickly find what I want. Occasionally one unique name overlaps
with another, and then I really want to match the word boundary.
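
To make the three cases from this sub-thread concrete, here is a small sketch in a throwaway repository. It assumes a git built with PCRE support (the thing that tripped me up above) and new enough to accept `-C`; the word list is made up.

```shell
# Scratch repository with one tracked file of made-up words.
repo=$(mktemp -d)
git -C "$repo" init -q
printf '%s\n' router rout outer sprout route > "$repo/words.txt"
git -C "$repo" add words.txt   # git grep searches tracked files; no commit needed

# -h drops the file name, so only the matching lines are printed.
whole=$(git -C "$repo" grep -h -w 'router')    # boundary on both sides
starts=$(git -C "$repo" grep -h -P '\brout')   # boundary on the left only
ends=$(git -C "$repo" grep -h -P 'outer\b')    # boundary on the right only

printf 'whole word:\n%s\n\nword start:\n%s\n\nword end:\n%s\n' \
  "$whole" "$starts" "$ends"
rm -rf "$repo"
```

The left-anchored pattern picks up `router`, `rout`, and `route` but skips `sprout`, which is exactly the distinction `-w` alone cannot express.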

------
abimaelmartell
"Grep too slow?" no...

~~~
jasonlotito
On large enough code bases, git grep is much faster than grep. Couple that
with repo for git and repo grep, and searching is painless. grep is fast, but
the nature of git grep only makes it faster.

------
derekp7
Another way of speeding up grep is to turn off Unicode. Check your _LANG_
environment variable -- if it is set to en_US.UTF-8 (or something similar),
set it to _C_ then run grep. Also speeds up wc, and makes sort behave better.

Note, you can also set an environment variable for a single command, if you
don't want to change it for your whole shell, by specifying "variable=value"
before the command:

    
    
      LANG=C grep ...
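
A quick way to convince yourself that the prefix form does not leak into the rest of the shell session (a sketch; the variable names `before`, `inside`, and `after` are just for illustration):

```shell
# Remember the caller's setting (empty if LANG is unset).
before=${LANG-}

# The assignment prefix applies to this one child process only.
inside=$(LANG=C sh -c 'printf %s "$LANG"')

# Back in the calling shell, LANG is whatever it was before.
after=${LANG-}
printf 'inside=%s  before=%s  after=%s\n' "$inside" "$before" "$after"
```

One caveat: if `LC_ALL` is set it takes precedence over `LANG`, so `LC_ALL=C grep ...` is the more forceful spelling.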

~~~
dfc
Are you sure this is still true? I thought that bug was fixed a little while
ago?

ADDENDUM:

I remembered discussing this a while back and I just found my old comment.
Someone said "fun fact, gnu grep is slow with UTF8"[1] and I said "funner fact
gnu grep was slow with UTF8"[2]. I reran the same grep that I used elsewhere
in this discussion with C and with UTF8:

    
    
      root@fob-xray:lk# sync ; echo 3 > /proc/sys/vm/drop_caches  
      root@fob-xray:lk# declare -x LANG=C ; /usr/bin/time grep --exclude-dir=.git -r "leap second" . > /dev/null
      0.51user 3.04system 1:28.40elapsed 4%CPU (0avgtext+0avgdata 992maxresident)k
      0inputs+0outputs (3major+340minor)pagefaults 0swaps
    
      root@fob-xray:lk# sync ; echo 3 > /proc/sys/vm/drop_caches  
      root@fob-xray:lk# declare -x LANG=en_US.UTF-8 ; /usr/bin/time grep --exclude-dir=.git -r "leap second" . > /dev/null
      0.41user 3.44system 1:27.01elapsed 4%CPU (0avgtext+0avgdata 1100maxresident)k
      0inputs+0outputs (2major+367minor)pagefaults 0swaps
    

I think you must have a version of GNU grep before 2.7.3 or 2.7.1. The UTF
problem seems to have disappeared. There is a decent amount of information in
the debian bug report[3].

[1]
[https://news.ycombinator.com/item?id=2860932](https://news.ycombinator.com/item?id=2860932)

[2]
[https://news.ycombinator.com/item?id=2862543](https://news.ycombinator.com/item?id=2862543)

[3] [http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408](http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408)

~~~
derekp7
You're right -- I've been stuck with older RHEL 4 systems at work, and they run
grep 2.5.something (although I'm not able to reproduce the problem at home
now, maybe I'm not on an old enough RHEL 4 system -- will check when I get
back to work). Problem disappears with the grep in RHEL 6. However there are
still glitches in the sort command. For example:

    
    
       ls -lh |sort +4 -5 -h
    

doesn't work (at least on my RHEL 6.4 box), yet:

    
    
        ls -lh |LANG=C sort +4 -5 -h
    

does work.

~~~
dfc
Wow, sort is misbehaving, but I cannot put the problem into words. I ran your two ls
invocations on the kernel source directory and diffed the output.

    
    
      --- /tmp/withutf 2013-07-09 21:03:23.060064079 -0400
      +++ /tmp/withC 2013-07-09 21:03:30.069578205 -0400
      @@ -1,26 +1,26 @@
      total 544K
      -rw-r--r-- 1 dfc dfc 252 Jul 9 19:27 Kconfig
      -rw-r--r-- 1 dfc dfc 2.5K Jul 9 19:27 Kbuild
      -drwxr-xr-x 113 dfc dfc 4.0K Jul 9 19:27 drivers
      +drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 init
      +drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 ipc
      +drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 mm
      +drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 usr
      +drwxr-xr-x 3 dfc dfc 4.0K Jul 9 19:27 block
      +drwxr-xr-x 3 dfc dfc 4.0K Jul 9 19:27 virt
      +drwxr-xr-x 4 dfc dfc 4.0K Jul 9 19:27 crypto
      +drwxr-xr-x 9 dfc dfc 4.0K Jul 9 19:27 lib
      +drwxr-xr-x 9 dfc dfc 4.0K Jul 9 19:27 security
      drwxr-xr-x 11 dfc dfc 4.0K Jul 9 19:27 kernel
      drwxr-xr-x 12 dfc dfc 4.0K Jul 9 19:27 samples
      drwxr-xr-x 13 dfc dfc 4.0K Jul 9 19:27 scripts
      drwxr-xr-x 17 dfc dfc 4.0K Jul 9 19:27 tools
      drwxr-xr-x 22 dfc dfc 4.0K Jul 9 19:27 sound
      drwxr-xr-x 26 dfc dfc 4.0K Jul 9 19:27 include
      -drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 init
      -drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 ipc
      -drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 mm
      -drwxr-xr-x 2 dfc dfc 4.0K Jul 9 19:27 usr
      drwxr-xr-x 32 dfc dfc 4.0K Jul 9 19:27 arch
      drwxr-xr-x 36 dfc dfc 4.0K Jul 9 19:27 firmware
      -drwxr-xr-x 3 dfc dfc 4.0K Jul 9 19:27 block
      -drwxr-xr-x 3 dfc dfc 4.0K Jul 9 19:27 virt
      -drwxr-xr-x 4 dfc dfc 4.0K Jul 9 19:27 crypto
      drwxr-xr-x 55 dfc dfc 4.0K Jul 9 19:27 net
      drwxr-xr-x 73 dfc dfc 4.0K Jul 9 19:27 fs
      -drwxr-xr-x 9 dfc dfc 4.0K Jul 9 19:27 lib
      -drwxr-xr-x 9 dfc dfc 4.0K Jul 9 19:27 security
      +drwxr-xr-x 113 dfc dfc 4.0K Jul 9 19:27 drivers
      -rw-r--r-- 1 dfc dfc 7.4K Jul 9 19:27 REPORTING-BUGS
      drwxr-xr-x 101 dfc dfc 12K Jul 9 19:27 Documentation
      -rw-r--r-- 1 dfc dfc 19K Jul 9 19:27 COPYING
    

There were differences in the output but I am not sure if that is a bug or if
it has to do with locale rules for sorting. What do you think is broken with
sort?

There seems to be a bug regarding UTF-8 and sort in Debian[1] but I am not sure
if it is the same problem. Do you know if there is a bug in Red Hat's Bugzilla
for the issue? Up until this thread I did not realize sort behaved differently
depending on the locale.[2] Are you sure it's not a difference in locale
expectations on how strings are sorted?

[1] [http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489](http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489)

[2] [http://stackoverflow.com/questions/5909404/sort-not-sorting-...](http://stackoverflow.com/questions/5909404/sort-not-sorting-as-expected-space-and-locale)

~~~
0x0
One is sorting "intelligently" numerically, the other is just comparing ASCII
values, on the first numerical field.

~~~
dfc
Thank you for indulging our rather off-topic curiosity. Just so I am clear, are
you saying the reason for the difference in sorting has to do with locale
interpretations and is not a bug in sort?
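
One possible reading of the diff, offered as a guess rather than a diagnosis: most rows tie on the size key (4.0K), and when all keys compare equal GNU sort falls back to comparing whole lines using the current locale's collation. Under `LC_ALL=C` that fallback is plain byte order, which would explain why only the tied rows moved. The sketch below uses made-up rows, writes the key as `-k5,5h` (the modern spelling of `+4 -5 -h`), and assumes a sort that supports `-h`.

```shell
# Made-up rows shaped like `ls -lh` output; three of them tie at 4.0K.
sample='drwxr-xr-x 113 u g 4.0K drivers
drwxr-xr-x   2 u g 4.0K init
-rw-r--r--   1 u g  252 Kconfig
drwxr-xr-x  11 u g 4.0K kernel
-rw-r--r--   1 u g  19K COPYING'

# Helper (illustrative only): keep the last column and join it onto one line.
names() { awk '{ print $NF }' | paste -s -d ' ' -; }

# Key: field 5 as a human-readable size. Ties fall back to whole-line byte order.
c_order=$(printf '%s\n' "$sample" | LC_ALL=C sort -k5,5h | names)

# -s (stable) disables that fallback, so ties keep their input order.
stable_order=$(printf '%s\n' "$sample" | LC_ALL=C sort -s -k5,5h | names)

echo "C locale, default tie-break: $c_order"
echo "C locale, stable (-s):       $stable_order"
```

If that reading is right, adding `-s` or an explicit second key should make the output identical across locales, which would point to locale collation rules rather than a bug in sort.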

------
eliben
Even better: use pss
([https://github.com/eliben/pss](https://github.com/eliben/pss))

It's not git-exclusive and has a ton of additional capabilities and features.
Besides, being a pure Python program, it's very easy to tweak if there's
something you want done differently.

[Disclaimer: shameless plug]

~~~
JulianWasTaken
I use ag, but it's good to know this exists, I hadn't seen it before; good to
have if I ever needed grep-as-library-code.

------
otikik
I use the silver searcher:
[https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

------
7histle
[https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

