
"git grep" is a lot faster than grep in a large codebase. One obvious reason is that "git grep" ignores non-checked-in files in the project directory. But I also notice a speed difference even when the project directory is clean.



That's because git-grep doesn't need to walk the filesystem to find the on-disk files. It greps directly from the object storage instead.

And AFAIK it can run the grep in parallel. Practically any non-archaic machine has multiple cores now, so git-grep can easily be faster than regular grep from the command line.
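
For comparison, plain grep can be made to use several cores too by splitting the file list across processes. A rough sketch, assuming GNU or BSD find/xargs (`-print0`, `-0` and `-P` are not POSIX). `parallel_grep` and `JOBS` are names made up for this example, and this is not a description of how git-grep threads internally:

```shell
# Hypothetical helper: fan a recursive search out over several grep
# processes, skipping any .git directory along the way.
# JOBS is an assumed knob (defaults to 4); set it to your core count.
parallel_grep() {
    pattern=$1
    dir=${2:-.}
    # -prune stops find from descending into .git; every other regular file
    # is emitted NUL-terminated so odd filenames survive the pipe.
    find "$dir" -name .git -prune -o -type f -print0 |
        # Up to JOBS greps at once, at most 64 files each. -H always prints
        # the filename, -I skips binary files.
        xargs -0 -n 64 -P "${JOBS:-4}" grep -HI -- "$pattern"
}
```

Something like `JOBS=8 parallel_grep foo .` would then search with up to eight grep processes. Results come back in no particular order, so pipe through `sort` if that matters.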


By default grep is going through all of the .git directory, which is the part `git grep` filters out. `ag` also filters it out (and is only slightly slower than git grep by default, with all the colors and stuff), or you can tell grep to only check relevant files with e.g. `grep $PATTERN $(git ls-tree --full-tree --name-only -r HEAD)`.

On my machine, using postgres's repository, I get the following:

    > time git grep foo > /dev/null
    0.22s user 0.25s system 151% cpu 0.312 total
    > time ag foo > /dev/null
    0.85s user 0.19s system 174% cpu 0.596 total
    > time grep foo $(git ls-tree --full-tree --name-only -r HEAD) > /dev/null
    0.13s user 0.10s system 93% cpu 0.255 total

grep's faster than git grep. In fact, grep is already as fast as git grep just ignoring .git:

    > time grep foo -r . --exclude-dir=.git > /dev/null
    0.15s user 0.16s system 91% cpu 0.338 total
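
One caveat with the `$(git ls-tree ...)` form above: the shell splits the output on whitespace, so paths containing spaces break, and a very large repository can overflow the argument list. A possible variant, passing the names NUL-delimited through xargs instead. `grep_tracked` is a name invented here, and it assumes you run it from the top of the working tree, since `--full-tree` prints paths relative to the repository root:

```shell
# Hypothetical helper: grep only the files recorded in HEAD.
# Assumes the current directory is the top of the working tree.
grep_tracked() {
    pattern=$1
    # -z makes ls-tree terminate each path with NUL instead of newline.
    git ls-tree --full-tree --name-only -r -z HEAD |
        # xargs -0 splits on NUL and batches the paths into as many grep
        # invocations as needed; -H keeps the filename in the output even
        # when a batch holds a single file.
        xargs -0 grep -H -- "$pattern"
}
```

Usage would be `grep_tracked foo > /dev/null` in place of the third command above. I have not timed this variant.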


You need to take the effect of the page cache into account. Since you are not clearing the page cache between tests, each test benefits from whatever the previous ones left in it. So running 'git grep' first puts it at a disadvantage compared to everything else.

I ran a test on a large repository and here are my results. The repository was Hadoop, and is available from git://github.com/apache/hadoop-common.git.

  cmccabe@keter:~/hadoop4> du -cksh .
  375M    .
  375M    total
  sudo -- sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'
  cmccabe@keter:~/hadoop4> /usr/bin/time git grep 'class TestDefaultNameNodePort'                                                                              
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.11user 0.34system 0:00.74elapsed 61%CPU (0avgtext+0avgdata 60512maxresident)k
  260256inputs+0outputs (19major+9718minor)pagefaults 0swaps

  sudo -- sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'
  cmccabe@keter:~/hadoop4> /usr/bin/time grep -rI --exclude .git 'class TestDefaultNameNodePort' *
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.13user 0.56system 0:02.40elapsed 29%CPU (0avgtext+0avgdata 5584maxresident)k
  252792inputs+0outputs (2major+414minor)pagefaults 0swaps

So you can see that git grep is faster (0.74s vs. 2.40s elapsed), even when excluding the .git directory.

Running grep a second time without clearing the cache gives a bogus result:

  cmccabe@keter:~/hadoop4> /usr/bin/time grep -rI --exclude .git 'class TestDefaultNameNodePort' *
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.03user 0.04system 0:00.08elapsed 97%CPU (0avgtext+0avgdata 5584maxresident)k
  0inputs+0outputs (0major+416minor)pagefaults 0swaps

This is because the data is all in the page cache at that point, so we're not actually accessing the disk.
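
A quick way to tell a cached run from a cold one is the `inputs` field that GNU time prints (its count of filesystem inputs): 260256 and 252792 in the cold runs above, 0 in the warm one. A small sketch that pulls that number out of the output. `fs_inputs` is a made-up name, and it assumes the default /usr/bin/time format shown in this thread:

```shell
# Hypothetical helper: read /usr/bin/time output on stdin and print the
# number that precedes "inputs+". Assumes GNU time's default format.
fs_inputs() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)inputs+.*/\1/p'
}
```

For example, `/usr/bin/time grep -rI foo . 2>&1 >/dev/null | fs_inputs` sends the timing report (written to stderr) down the pipe and discards the matches. A result of 0 suggests the run never touched the disk.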

I was curious about the true source of the speedup, and so I checked the output of the 'perf' tool. git had 1,922 CPU migrations, whereas grep had 52. Following up on this, you can see that git is spawning a bunch of threads, whereas grep only has one thread.

  cmccabe@keter:~/hadoop4> strace -f -e trace=clone git grep 'class TestDefaultNameNodePort' 2>&1 | grep -c '] +++ exited with '
  8
  cmccabe@keter:~/hadoop4> strace -f -e trace=clone grep -rI --exclude=.git 'class TestDefaultNameNodePort' *  2>&1 | grep -c '] +++ exited with '
  0

I also think git might be cheating and using a simpler regex engine than grep, but at this point I got bored. Case closed.



