
Ripgrep – A new command line search tool - anp
http://blog.burntsushi.net/ripgrep/
======
losvedir
Meh, yet another grep tool.... wait, by burntsushi! Whenever I hear of someone
wanting to improve grep I think of the classic ridiculous fish piece[0]. But
when I saw that this one was by the author of Rust's regex tools, which I
know, from a previous post on here, are quite sophisticated, I perked up.

Also, the tool aside, this blog post should be held up as the gold standard of
what gets posted to hacker news: detailed, technical, interesting.

Thanks for your hard work! Looking forward to taking this for a spin.

[0] [http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...](http://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

~~~
pohl
Another burntsushi project was recently posted, but didn't get much attention:

[https://news.ycombinator.com/item?id=12559515](https://news.ycombinator.com/item?id=12559515)

~~~
skanga
Looks interesting. But I did not find binaries for it and do not want to set
up a Rust env to try it out.

~~~
luckman212
The releases are right there on GitHub:
[https://github.com/BurntSushi/ripgrep/releases](https://github.com/BurntSushi/ripgrep/releases)

~~~
burntsushi
I think the GP was asking about xsv, not ripgrep. There are binary releases
for xsv though:
[https://github.com/BurntSushi/xsv/releases](https://github.com/BurntSushi/xsv/releases)

~~~
skanga
Awesome. Thanks

~~~
skanga
PS: You might want to revise this verbiage in the README markdown file:

    
    
        Installing xsv is a bit hokey right now. Ideally, I could release binaries for Linux, Mac and Windows. Currently, I'm only able to release binaries for Linux because I don't know how to cross compile Rust programs.

~~~
burntsushi
Ah how embarrassing! I will fix that soon. Thanks :-)

~~~
skanga
No need for any embarrassment. I should have looked for releases instead of
only looking at the doc.

------
ggreer
I'm the author of ag. That was a really good comparison of the different code
searching tools. The author did a great job of showing how each tool
misbehaved or performed poorly in certain circumstances. He's also totally
right about defaults mattering.

It looks like ripgrep gets most of its speedup on ag by:

1\. Only supporting DFA-able Rust regexes. I'd love to use a lighter-weight
regex library in ag, but users are accustomed to full PCRE support. Switching
would cause me to receive a lot of angry emails. Maybe I'll do it anyway. PCRE
has some annoying limitations. (For example, it can only search up to 2GB at a
time.)

2\. Not counting line numbers by default. The blog post addresses this, but I
think results without line numbers are far less useful; so much so that I've
traded away performance in ag. (Note that even if you tell ag not to print
line numbers, it still wastes time counting them. The printing code is the
result of me merging a lot of PRs that I really shouldn't have.)

3\. Not using mmap(). This is a big one, and I'm not sure what the deal is
here. I just added a --nommap option to ag in master.[1] It's a naive
implementation, but it benchmarks comparably to the default mmap() behavior.
I'm really hoping there's a flag I can pass to mmap() or madvise() that says,
"Don't worry about all that synchronization stuff. I just want to read these
bytes sequentially. I'm OK with undefined behavior if something else changes
the file while I'm reading it."
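
A minimal sketch of the closest thing POSIX currently offers
(MADV_SEQUENTIAL, which only encourages readahead and doesn't relax any
coherence guarantees), written against Rust's libc crate since that's the
ecosystem under discussion; this is an illustration, not ag's actual
implementation:

    
    
        // Map a file read-only and hint that access will be sequential.
        // Error handling elided; assumes the `libc` crate.
        extern crate libc;
        
        use std::fs::File;
        use std::os::unix::io::AsRawFd;
        
        unsafe fn map_sequential(file: &File, len: usize) -> *mut libc::c_void {
            let ptr = libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_READ,
                libc::MAP_PRIVATE,
                file.as_raw_fd(),
                0,
            );
            if ptr != libc::MAP_FAILED {
                // Best effort: tells the kernel to read ahead aggressively.
                libc::madvise(ptr, len, libc::MADV_SEQUENTIAL);
            }
            ptr
        }
    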

The author also points out correctness issues with ag. Ag doesn't fully
support .gitignore. It doesn't support Unicode. Inverse matching (-v) can be
crazy slow. These shortcomings are mostly because I originally wrote ag for
myself. If I didn't use certain gitignore rules or non-ASCII encodings, I
didn't write the code to support them.

Some expectation management: If you try out ripgrep, don't get your hopes up.
Unless you're searching some really big codebases, you won't notice the speed
difference. What you will notice, however, are the feature differences. Take a
look at
[https://github.com/BurntSushi/ripgrep/issues](https://github.com/BurntSushi/ripgrep/issues)
to get a taste of what's missing or broken. It will be some time before all
those little details are ironed out.

That said, may the best code searching tool win. :)

1\.
[https://github.com/ggreer/the_silver_searcher/commit/bd65e26...](https://github.com/ggreer/the_silver_searcher/commit/bd65e2691d6ce673e026842d7e7d02bc25adfd26)

~~~
burntsushi
Thanks for the response! Some notes:

1\. In my benchmarks, I do control for line numbers by either explicitly
making it a variable (i.e., when you see `(lines)`) or by making all tools
count lines to make the comparison fair. For the most part, this only tends to
matter in the single-file benchmarks.

2\. For memory maps, you might get very different results depending on your
environment. For example, I enabled memory maps on Windows where they seem to
do a bit better. (I think my blog post gives enough details that you could
reproduce the benchmark environment _precisely_ if you were so inclined. This
was important to me, so I spent a lot of time documenting it.)

3\. The set of features supported by rg should be very very close to what is
supported by ag. Reviewing `ag`'s man page again, probably the only things
missing from rg are --ackmate, --depth, some of the color configurability
flags (but rg does do coloring), --passthrough, --smart-case and --stats
maybe? I might be missing some others. And Mercurial support (but ag's is
incomplete). In exchange, rg gives you much better single file performance,
better large-repo performance and real Unicode support that doesn't slow way
down. I'd say those are pretty decent expectations. :-)

Thanks for ag by the way. It and ack have definitely triggered a new kind of
searching. I have some further information retrievalish ideas on evolving the
concept, but those will have to wait!

~~~
ggreer
In terms of core features, ripgrep is totally there. It searches _fast_. It
ignores files pretty accurately. It outputs results in a pleasant and useful
format. If a new user tries rg, they'll be very happy.

My warning about the feature differences was meant to temper ag users'
expectations. There are lots of little things that ag users are accustomed to
that are either different or missing in ripgrep. Off the top of my head: Ag
reads the user's global gitignore. (This is harder than most people think.) It
detects stdout redirects such as "ag blah > output.txt" and ignores
output.txt. It can search gz and xz files. It defaults to smart-case
searching. It can limit a search to one hardware device (--one-device),
avoiding slow reads on network mounts. And as a commenter already pointed out,
it supports the --pager option. Taken together, all those small differences
are likely to cause an average ag user some grief. I wanted to manage
expectations so that users wouldn't create annoying "issues" (really, feature
requests) on your GitHub repo. Sorry if that came off the wrong way.
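
For the curious, the redirect detection boils down to comparing stdout's
device and inode against each candidate file. A hedged sketch in Rust (not
ag's actual C implementation, and Unix-only):

    
    
        // Returns true if `path` is the file stdout is redirected to,
        // so the searcher can skip it. Hypothetical helper.
        use std::fs::File;
        use std::io;
        use std::mem::ManuallyDrop;
        use std::os::unix::fs::MetadataExt;
        use std::os::unix::io::FromRawFd;
        use std::path::Path;
        
        fn is_stdout_target(path: &Path) -> io::Result<bool> {
            // Borrow fd 1 without closing it on drop.
            let stdout = ManuallyDrop::new(unsafe { File::from_raw_fd(1) });
            let out = stdout.metadata()?;
            let file = std::fs::metadata(path)?;
            Ok(out.dev() == file.dev() && out.ino() == file.ino())
        }
    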

On a completely unrelated note: I see ripgrep supports .rgignore files,
similar to how ag supports .agignore. It'd be nice if we could combine forces
and choose a single filename for this purpose. That way when the next search
tool comes along, it can use the same thing instead of .zzignore or whatever.
It would also make it easier for users to switch between our tools. I'd
suggest a generic name like ".ignore" or ".ignores", but I'm sure some tool
creates such files or directories already.

Edit: Actually, it looks like .ignore can work. The only examples I've found
of .ignore files are actual files containing ignore patterns.

~~~
burntsushi
You raise good points, thank you. I hope to support some of those features,
since they seem like nice conveniences.

In principle I'd be fine standardizing on a common ignore file. We'd need to
come up with a format (I think I'd vote for "do what gitignore does", since I
think that's what we're both doing now anyway).

Adding files to this list is kind of a bummer though. I could probably get
away with replacing `.rgignore` proper, but I suspect you'd need to add it
without replacing `.agignore`, or else those angry users you were talking
about might show themselves. :-)

I do kind of like `.grepignore` since `grep` has kind of elevated itself to
"search tool" as a term, but I can see how that would be confusing.
`.searchignore` feels too long. `.ignore` and `.ignorerc` feel a bit too
generic, but either seems like the frontrunner at the moment.

~~~
ggreer
I also vote for "do what gitignore does". My plan is to add support for the
new file name, deprecate .agignore, and update docs everywhere. But it'd be a
while before I removed .agignore completely.

I really like .ignore, and I like it _because_ it's generic. The information I
want it to convey is:

> _Dear programs,_
>
> _If you are traversing this directory, please ignore these things._

Of course, some programs could still benefit from having application-specific
ignore files, but it'd cut down on a lot of cruft and repetition.
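
To make that concrete, a hypothetical .ignore would presumably use gitignore
syntax verbatim, e.g.:

    
    
        # Same pattern language as .gitignore (made-up example, not a spec)
        logs/
        *.min.js
        build/
        !build/KEEP.md
    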

~~~
burntsushi
Let's do it.
[https://github.com/BurntSushi/ripgrep/issues/40](https://github.com/BurntSushi/ripgrep/issues/40)

~~~
ggreer
…and merged:
[https://github.com/ggreer/the_silver_searcher/pull/974](https://github.com/ggreer/the_silver_searcher/pull/974)

I'll tag a new release in a day or two. Also, it looks like the sift author is
getting on the .ignore train:
[https://github.com/svent/sift/issues/78#issuecomment-2493352...](https://github.com/svent/sift/issues/78#issuecomment-249335277)

This worked out pretty well. :)

~~~
burntsushi
Same!
[https://github.com/BurntSushi/ripgrep/pull/41](https://github.com/BurntSushi/ripgrep/pull/41)

Agreed :-)

~~~
tonglil
This is probably the best case of out-in-the-open open source developers of
similar-but-different tools collaborating on a new standard and implementing
it in record time that I have ever seen.

Keep it up all (rg/ag/sift)!

~~~
scott_karana
I completely agree. That was one of the most reasonable and level-headed
discussions between strangers I have _ever_ seen on the Internet!

------
minimax
_In contrast, GNU grep uses libc’s memchr, which is standard C code with no
explicit use of SIMD instructions. However, that C code will be autovectorized
to use xmm registers and SIMD instructions, which are half the size of ymm
registers._

I don't think this is correct. glibc has architecture specific hand rolled (or
unrolled if you will lol) assembly for x64 memchr. See here:
[https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/memrchr.S;h=840de30cd71ba96b3ae43540e6ac255c28906cc5;hb=HEAD)

~~~
burntsushi
Drats, you're totally right. It's easy to mess up that kind of thing.

Thankfully, it looks like my analysis remains mostly unchanged. I don't see
any AVX2 in there (and indeed, I didn't when I looked at the profile either,
in contrast to Go's implementation).

I updated the blog, thanks again for the clarification.

------
jonstewart
Nice! Lightgrep[1] uses libicu et al to look up code points for a user-
specified encoding and encode them as bytes, then just searches for the bytes.
Since ripgrep is presumably looking just for bytes, too, and compiling UTF-8
multibyte code points to a sequence of bytes, perhaps you can do likewise with
ICU and support other encodings. ICU is a bear to build against when cross-
compiling, but it knows hundreds of encodings, all of the proper code point
names, character classes, named properties, etc., and the surface area of its
API that's required for such usage is still pretty small.

[1]:
[http://strozfriedberg.github.io/liblightgrep](http://strozfriedberg.github.io/liblightgrep)

~~~
burntsushi
I hadn't heard of liblightgrep, nice. It's on my short list to look at more
closely.

I doubt I'd ever be comfortable with Rust's regex engine growing a dependency
on libicu, but it's still worth understanding your implementation. Some
questions, if you don't mind. The big one is: does your regex engine use
finite automata, and does it put the text decoding into the automaton itself?
For example, when you compile the `.` regex, you end up with an automaton that
inlines UTF-8 decoding itself. It looks like this:
[https://gist.github.com/anonymous/8fbe170bfcca5d7475b59299fa...](https://gist.github.com/anonymous/8fbe170bfcca5d7475b59299fabbd0a3)
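
If you want to poke at that byte-range decomposition directly, the
utf8-ranges crate exposes it; a quick sketch (the exact output format is
illustrative):

    
    
        // Print the UTF-8 byte-range sequences covering a codepoint range,
        // i.e. the alternation of byte classes a UTF-8 DFA inlines.
        extern crate utf8_ranges;
        
        use utf8_ranges::Utf8Sequences;
        
        fn main() {
            for seq in Utf8Sequences::new('\u{0}', '\u{10FFFF}') {
                println!("{:?}", seq); // e.g. [0-7F], [C2-DF][80-BF], ...
            }
        }
    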

Does your regex library do that for each type of encoding? Or is there a
transcoding step?

~~~
jonstewart
Reply fail, see above.

------
cwillu
I wish more people actually took steps to optimize disk io though; my current
source tree may be in cache, but my logs certainly aren't. Nor are my
/usr/share/docs/, /usr/includes/, or my old projects.

Chris Mason of btrfs fame did some proof of concept work for walking and
reading trees in on-disk order, showing some pretty spectacular potential
gains:
[https://oss.oracle.com/~mason/acp/](https://oss.oracle.com/~mason/acp/)

Tooling to do your own testing:
[https://oss.oracle.com/~mason/seekwatcher/](https://oss.oracle.com/~mason/seekwatcher/)

------
_audakel
"Anti-pitch

I’d like to try to convince you why you shouldn’t use ripgrep. Often, this is
far more revealing than reasons why I think you should use ripgrep."

Love that he added this

------
bodyfour
It would be interesting to benchmark how much mmap hurts when operating in a
non-parallel mode.

I think a lot of the residual love for mmap is because it actually did give
decent results back when single core machines were the norm. However, once
your program becomes multithreaded it imposes a lot of hidden synchronization
costs, especially on munmap().

The fastest option might well be to use mmap sometimes but have a collection
of single-thread processes instead of a single multi-threaded one so that
their VM maps aren't shared. However, this significantly complicates the work-
sharing and output-merging stages. If you want to keep all the benefits you'd
need a shared-memory area and do manual allocation inside it for all common
data which would be a lot of work.

It might also be that mmap is a loss these days even for single-threaded... I
don't know.

Side note: when I last looked at this problem (on Solaris, 20ish years ago)
one trick I used when mmap'ing was to skip the "madvise(MADV_SEQUENTIAL)" if
the file size was below some threshold. If the file was small enough to be
completely prefetched from disk, the madvise had no effect and was just a
wasted syscall. On larger files it seemed to help, though.
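
Roughly this, in today's terms (illustrative Rust rather than the original
Solaris C, and the cutoff is made up):

    
    
        // Only advise the kernel on mappings big enough for readahead to
        // matter; below the cutoff the syscall is pure overhead. Assumes
        // the `libc` crate; 256 KiB is a hypothetical threshold.
        extern crate libc;
        
        const MADVISE_MIN: usize = 256 * 1024;
        
        unsafe fn maybe_advise(ptr: *mut libc::c_void, len: usize) {
            if len >= MADVISE_MIN {
                libc::madvise(ptr, len, libc::MADV_SEQUENTIAL);
            }
        }
    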

~~~
burntsushi
One thing I did benchmark was the use of memory maps for single file search
(cf. `subtitles_literal`). In that case, it saved a small but _measurable_
amount of time to memory map the file rather than to read it incrementally.
Memory
maps were only slower in parallel search on large directories.

Thankfully, ripgrep makes it easy to switch between memory maps and
incremental reading. So I can just do this for you right now on the spot:

    
    
        $ time rg -j1 PM_SUSPEND | wc -l
        335
        
        real    0m0.406s
        user    0m0.350s
        sys     0m0.293s
    
        $ time rg -j1 PM_SUSPEND --mmap | wc -l
        335
        
        real    0m0.482s
        user    0m0.380s
        sys     0m0.317s
    

Note that this is on a Linux x64 box. I bet you'd get completely different
results on a different OS.

~~~
bodyfour
Interesting that user time went up as well... not sure if that's significant.

I guess it's not too surprising that mmap isn't much of a win these days for
anything... SIMD can copy a memory page pretty fast these days.

I just installed rg from homebrew and it's quite impressive... about 2.5x
faster than ag on my macbook pro. Interestingly I get another 25% improvement
by falling back to -j3 even though I'm on a quad-core machine. Not sure what
is bottlenecking since it's all in cache.

~~~
burntsushi
Yeah, figuring out the optimal thread count has always seemed like a bit of a
black art to me. I can pretty reliably figure it out for my system (which has
8 physical cores, 16 logical), but it's hard to generalize that to others.

-j3 will spawn 3 workers for searching while the main thread does directory
traversal. It sounds like I should do `num_cpus - 1` for the default `-j`
instead of `num_cpus`.
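
Concretely, that default would look something like this sketch (assuming the
num_cpus crate):

    
    
        // Leave one CPU free for the main thread's directory traversal.
        extern crate num_cpus;
        
        fn default_worker_count() -> usize {
            std::cmp::max(1, num_cpus::get() - 1)
        }
    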

------
cm3
To build a static Linux binary with SIMD support, run this:

    
    
        RUSTFLAGS="-C target-cpu=native" rustup run nightly cargo build --target x86_64-unknown-linux-musl --release --features simd-accel

~~~
burntsushi
That's an awesome demonstration of how easy it is to swap out the libc in
Rust. :-)

Note that I also distribute statically compiled executables with musl and SIMD
enabled (using target-feature=+ssse3 instead of target-cpu=native):
[https://github.com/BurntSushi/ripgrep/releases](https://github.com/BurntSushi/ripgrep/releases)

~~~
cm3
> That's an awesome demonstration of how easy it is to swap out the libc in
> Rust. :-)

Now I have to look up how I use cargo to build a static binary on FreeBSD,
where I don't have to swap out libc.

> Note that I also distribute statically compiled executables with musl and
> SIMD enabled (using target-feature=+ssse3 instead of target-cpu=native):
> [https://github.com/BurntSushi/ripgrep/releases](https://github.com/BurntSushi/ripgrep/releases)

I took the flag from your blog post, thanks for pointing out the explicit
feature flag. That will allow the binary to run on more cpus.

------
lobster_johnson
Very nice. Not only fast, but feels modern.

Tried it out on a 3.5GB JSON file:

    
    
      # rg
      rg erzg4 k.json > /dev/null  1.80s user 2.54s system 53% cpu 8.053 total
    
      # rg with 4 threads
      rg -j4 erzg4 k.json > /dev/null  1.76s user 1.29s system 99% cpu 3.059 total
    
      # OS X grep
      grep erzg4 k.json > /dev/null  60.62s user 0.96s system 99% cpu 1:01.75 total
    
      # GNU Grep
      ggrep erzg4 k.json > /dev/null  1.96s user 1.43s system 88% cpu 2.691 total
    

GNU Grep wins, but it's pretty crusty, especially with regards to its output
(even with colourization).

~~~
burntsushi
My guess is that since you ran `rg` first, the file wasn't in memory, and you
ended up benchmarking disk IO. (Notice the sys time decrease from your first
run to the second run.) Subsequent commands then run faster with the file
already in memory.

This is one of many reasons why assembling the benchmarks in my blog post was
so difficult. For example, on every command I benchmarked, I ran them 3 times
for "warmup" and didn't record any measurements. I then ran them another 10
times in which I recorded them. You can see the raw output here:
[https://github.com/BurntSushi/ripgrep/blob/master/benchsuite...](https://github.com/BurntSushi/ripgrep/blob/master/benchsuite/runs/2016-09-20-ubuntu1604-ec2/raw.csv)

In any case, on my underpowered Mac, here are some results on a 1.2 GB file
(notice how much the time fluctuates until it's fully in cache):

    
    
        mac:~ andrew$ ggrep --version
        ggrep (GNU grep) 2.25
        Packaged by Homebrew
        mac:~ andrew$ time ggrep 'Bruce Springsteen' foo.jsonl > /dev/null   
        
        real    0m5.447s
        user    0m0.600s
        sys     0m0.350s
        mac:~ andrew$ time ggrep 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m1.247s
        user    0m0.549s
        sys     0m0.264s
        mac:~ andrew$ time ggrep 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m0.803s
        user    0m0.542s
        sys     0m0.259s
        mac:~ andrew$ time ggrep 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m0.805s
        user    0m0.544s
        sys     0m0.260s
    

And now for rg:

    
    
        mac:~ andrew$ time rg 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m1.062s
        user    0m0.339s
        sys     0m0.333s
        mac:~ andrew$ time rg 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m0.640s
        user    0m0.337s
        sys     0m0.302s
        mac:~ andrew$ time rg 'Bruce Springsteen' foo.jsonl > /dev/null
        
        real    0m0.637s
        user    0m0.336s
        sys     0m0.300s
    

Oh! And check this out, on a Mac, not using a memory map for single files is
faster. My goodness---memory map performance is all over the place.

    
    
        mac:~ andrew$ time rg 'Bruce Springsteen' foo.jsonl --no-mmap > /dev/null
        
        real    0m0.445s
        user    0m0.170s
        sys     0m0.274s
    

If I do this on my Linux machine on the same file, I get timings of 0.275s for
rg, 0.398s for rg with no memory maps (opposite direction for Mac) and 0.708s
for GNU grep (v 2.25).

Benchmarks are fun, eh?

~~~
atombender
I ran each test four times and picked the best result — not my first rodeo —
but for the first result I picked the wrong time from my output, which
obviously didn't make use of the cache. Here it is again, complete results,
with --no-mmap added:

    
    
        # ggrep
        zerogravitas$ for x in {1..4}; do (time ggrep erzg4 k.json >/dev/null); done
        ggrep erzg4 k.json > /dev/null  1.96s user 0.67s system 99% cpu 2.641 total
        ggrep erzg4 k.json > /dev/null  1.95s user 0.68s system 99% cpu 2.660 total
        ggrep erzg4 k.json > /dev/null  2.00s user 0.66s system 99% cpu 2.672 total
        ggrep erzg4 k.json > /dev/null  1.96s user 0.67s system 98% cpu 2.662 total
    
        # rg
        zerogravitas$ for x in {1..4}; do (time rg erzg4 k.json >/dev/null); done
        rg erzg4 k.json > /dev/null  1.76s user 1.40s system 99% cpu 3.180 total
        rg erzg4 k.json > /dev/null  1.77s user 1.31s system 99% cpu 3.088 total
        rg erzg4 k.json > /dev/null  1.74s user 1.36s system 99% cpu 3.128 total
        rg erzg4 k.json > /dev/null  1.76s user 1.41s system 97% cpu 3.265 total
    
        # rg --no-mmap
        zerogravitas$ for x in {1..4}; do (time rg erzg4 k.json --no-mmap >/dev/null); done
        rg erzg4 k.json --no-mmap > /dev/null  0.98s user 0.75s system 99% cpu 1.743 total
        rg erzg4 k.json --no-mmap > /dev/null  0.99s user 0.75s system 99% cpu 1.740 total
        rg erzg4 k.json --no-mmap > /dev/null  1.01s user 0.76s system 99% cpu 1.772 total
        rg erzg4 k.json --no-mmap > /dev/null  0.99s user 0.75s system 99% cpu 1.754 total
    
        # rg -j4
        zerogravitas$ for x in {1..4}; do (time rg erzg4 k.json -j4 >/dev/null); done
        rg erzg4 k.json -j4 > /dev/null  1.75s user 1.35s system 98% cpu 3.134 total
        rg erzg4 k.json -j4 > /dev/null  1.75s user 1.44s system 98% cpu 3.224 total
        rg erzg4 k.json -j4 > /dev/null  1.80s user 1.38s system 99% cpu 3.204 total
        rg erzg4 k.json -j4 > /dev/null  1.80s user 1.35s system 99% cpu 3.164 total
    
        # rg -j4 --no-mmap
        zerogravitas$ for x in {1..4}; do (time rg erzg4 k.json -j4 --no-mmap >/dev/null); done
        rg erzg4 k.json -j4 --no-mmap > /dev/null  0.98s user 0.75s system 99% cpu 1.740 total
        rg erzg4 k.json -j4 --no-mmap > /dev/null  0.97s user 0.74s system 99% cpu 1.721 total
        rg erzg4 k.json -j4 --no-mmap > /dev/null  0.99s user 0.75s system 99% cpu 1.752 total
        rg erzg4 k.json -j4 --no-mmap > /dev/null  0.98s user 0.76s system 99% cpu 1.748 total
    

Sounds like "alias rg=rg --no-mmap" is a good idea on a Mac.

~~~
burntsushi
Wow. Those are awesome results, thank you.

> Sounds like "alias rg=rg --no-mmap" is a good idea on a Mac.

I will fix that in ripgrep proper by making --no-mmap the default on Mac. :-)
It should be an easy one to knock off:
[https://github.com/BurntSushi/ripgrep/issues/36](https://github.com/BurntSushi/ripgrep/issues/36)

> not my first rodeo

Right, sorry about that. :-) Just had to cover all my bases!

(Also, `-j` on a single file won't do anything, and ripgrep should try to use
multiple threads by default when searching multiple files.)

~~~
lobster_johnson
I suspected that -j wouldn't do anything on a single file. For three large
files (6.5GB in total) I'm getting good performance, about 1.6x what GNU
Grep does, best case.

------
dikaiosune
Compiling it to try right now...

Some discussion over on /r/rust:
[https://www.reddit.com/r/rust/comments/544hnk/ripgrep_is_fas...](https://www.reddit.com/r/rust/comments/544hnk/ripgrep_is_faster_than_grep_ag_git_grep_ucg_pt/)

EDIT: The machine I'm on is much less beefy than the benchmark machines, which
means that the speed difference is quite noticeable for me.

------
rob_cameron
Nice work. But your story is incomplete if you don't include a comparison with
icgrep (parabix.costar.sfu.ca). Although icgrep is still an active research
project, it is faster in many cases and has broader Unicode support (full
Unicode level 1 of UTS #18, plus many level 2 features). For example, try the
'\N{SMIL(E|ING)}' search that finds lines containing emoji characters with
SMILE or SMILING in their Unicode name. icgrep also correctly applies Unicode
character class intersection with expressions such as [\p{Greek}&&\p{Lu}],
while ripgrep fails to meet UTS 18 level 1 requirements by interpreting '&&'
as literal characters.

Nevertheless, we appreciate the challenge that ripgrep presents to our
performance story. We definitely see some cases in which rg achieves better
performance taking advantage of fixed strings somewhere in the pattern. We'll
have to work on that... But for patterns based on Unicode classes (e.g.,
\p{Greek}), icgrep can be much faster, especially on large files. (We are only
focused on big data applications -- icgrep has significant dynamic compilation
overhead). It also does very well in cases involving alternations and
ambiguity.

The icgrep performance story is based primarily on a new bitwise data parallel
regular expression algorithm working off the Parabix transform representation
of text. See our ICA3PP 2015 or PACT 2014 papers.

It might be fair to say that icgrep is not yet polished enough for inclusion
in your study. I just added our first implementation of -r/-R flags last night
and we certainly haven't yet handled .gitignore, etc. But if you want to
understand the truth about regular expression performance, I think that the
data point represented by icgrep (and its continuing development) needs to be
included.

~~~
glangdale
Speaking as the leader of the Hyperscan project
([https://github.com/01org/hyperscan](https://github.com/01org/hyperscan)),
I'd say you might not be the only project feeling a bit neglected in the
performance comparison here.

It's nice to be mentioned - and even called out by name as the inventor of
Teddy (!) - but we're always even more pleased when someone else measures
Hyperscan, as it minimizes the prospect of us embarrassing ourselves by
posting some self-serving and/or outlandish microbenchmark.

Also it saves everyone time reading the obligatory Intel legal disclaimers...

~~~
rob_cameron
We are very interested in tackling the multiple-pattern regular expression
problem and will definitely want to use Hyperscan as a comparator.

Any help in setting up a study with both patterns and data sets would be most
appreciated!

~~~
glangdale
This sounds interesting. I suspect there are many alternate approaches to
multiple regex.

Sadly, regex benchmarking is a sewer. There are two classes of multiple regex
benchmarks: public ones and good ones, and not much intersection between the
two. Synthetic pattern generation can be manipulated to say whatever you want
it to say, Snort patterns aren't intended to be run simultaneously (so putting
a big pile of them into a sack and running them is of arguable use), and most
vendors guard proprietary signature sets closely (we have thousands, but all
the good ones are customer confidential).

That being said, there are some paths forward here. Let's talk.

~~~
rob_cameron
Certainly. I dropped an e-mail at the hyperscan account.

By the way, if you try out icgrep and use the -DumpASM option, you may notice
a very unusual characteristic: the generated code is almost completely
dominated by AVX2 instructions!

------
echelon
Rust is really starting to be seen in the wild now.

~~~
CJefferson
I agree, and I'm already using both ripgrep and rust-parallel (
[https://github.com/mmstick/parallel](https://github.com/mmstick/parallel) , a
gnu-parallel replacement which should probably get another name ).

I am really happy to see Rust apps actually being written -- Rust programmers
seem to be actually trying to replace the existing code lying around, instead
of just insulting it and telling us how much better the world would be if we
used their language.

~~~
saurik
This would have been a lot more exciting if it were designed to actually be
even slightly compatible with grep (or maybe had some core in there that was,
while leaving the other UI parts he wants to change on top) and were then
approaching GNU and saying "hey, this is something I've been working on: would
you consider making grep the first standalone tool to _move_ to Rust, and what
would it take from me to make this happen?" as opposed to writing an article
about how "I am smarter than the GNU grep people for these reasons and have
built a tool named after how my tool is going to kill their tool" which almost
seems to go out of its way to set up an air of competition rather than
collaboration. Maybe you value pure technical chops, and the difference
between "insulting it and telling us how" is much worse than "insulting it and
actually writing code", but to me they both start with "insulting it" and
demonstrate an almost tragic inability to work with others. For people like
me, people who actually want to see a language like Rust get used en masse and
entirely replace languages like C, the attitude in this blog post is extremely
depressing. Even if I were now to take the time myself to go to the GNU grep
authors and try to talk to them about this, the mere existence of this blog
post is going to make that slightly more taxing and slightly more of a battle
for everyone involved :/. (I mean: seriously... "ripgrep"?! This developer is
clearly going _out of their way_ to be combative. What ever happened to the
open source spirit of collaboration? What happened to actual communication
between teams? Why do projects seem to just assume "the design decisions, or
even specific implementations, or _even accidental mistake_ of existing tools
are set in stone, and so the right way to talk about changing them is to
discuss competition between entire projects or at best hard forks rather than
working with other people"? :/)

~~~
burntsushi
> (or maybe had some core in there that was, while leaving the other UI parts
> he wants to change on top)

But I did! Have you looked at the dependency list of ripgrep? It's utterly
filled with tons of tools that you can pick out and use in other projects for
any purpose you like. I didn't mention this in the blog post because there's
already too much there, but sure, here they are:

memchr - Fast single byte search:
[http://burntsushi.net/rustdoc/memchr/](http://burntsushi.net/rustdoc/memchr/)

walkdir - Recursive directory iterator:
[http://burntsushi.net/rustdoc/walkdir/](http://burntsushi.net/rustdoc/walkdir/)

utf8-ranges - Generate utf8 automata:
[http://burntsushi.net/rustdoc/utf8_ranges/](http://burntsushi.net/rustdoc/utf8_ranges/)

regex-syntax - A regex parser (including literal extraction helpers):
[https://doc.rust-lang.org/regex/regex_syntax/index.html](https://doc.rust-lang.org/regex/regex_syntax/index.html)

regex - The regex engine itself:
[https://doc.rust-lang.org/regex/regex/index.html](https://doc.rust-lang.org/regex/regex/index.html)

grep - Line-by-line search (as a library). This is where all of the inner
literal optimizations happen, for example.
[http://burntsushi.net/rustdoc/grep/](http://burntsushi.net/rustdoc/grep/)

And this is only the stuff that _I_ did. This doesn't count all the other
wonderful stuff I used that other folks built!

And sure, I could do better. There's more stuff I could move into the `grep`
crate, but this is only the beginning, not the end.
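
To make the reuse point concrete, here's a toy recursive searcher gluing the
pieces together (a sketch only; the real `grep` crate does buffered,
incremental search instead of reading whole files):

    
    
        // Print `path:line_number:line` for lines containing a literal.
        // Assumes the `walkdir` crate; skips unreadable/non-UTF-8 files.
        extern crate walkdir;
        
        use std::fs;
        use walkdir::WalkDir;
        
        fn search(root: &str, needle: &str) {
            for entry in WalkDir::new(root).into_iter().filter_map(|e| e.ok()) {
                if !entry.file_type().is_file() {
                    continue;
                }
                if let Ok(text) = fs::read_to_string(entry.path()) {
                    for (i, line) in text.lines().enumerate() {
                        if line.contains(needle) {
                            println!("{}:{}:{}", entry.path().display(), i + 1, line);
                        }
                    }
                }
            }
        }
    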

> "hey, this is something I've been working on: would you consider making grep
> the first standalone tool to move to Rust, and what would it take from me to
> make this happen?"

But I didn't want to do that. I don't understand why that's a problem. Doing
this requires being POSIX interface compatible, and that's not a hole I care
to dig. I don't mean any disrespect, it's just not what I want to spend my
time doing.

> as opposed to writing an article about how "I am smarter than the GNU grep
> people for these reasons and have built a tool named after how my tool is
> going to kill their tool" which almost seems to go out of its way to set up
> an air of competition rather than collaboration

I'm really sorry if I came across that way. It was of course not my intention.
My intention was to write about what I had learned, which seems like a pretty
fundamental component of collaboration. I'm sure five years from now, when
I've moved on to other problems, someone will come along and beat ripgrep, and
I can only hope that they write about it. :-)

I mean, if it weren't for the innumerable people who wrote about their
experience with this kind of stuff, I never would have been able to get here.
It only makes sense for me to write if I think I have something valuable to
share.

> (I mean: seriously... "ripgrep"?! This developer is clearly going out of
> their way to be combative.

I'm not. "rip" was supposed to mean "rip through text." I'm sorry it came off
as combative. I wasn't even aware of the "rest in peace" interpretation until
someone else pointed it out.

Honestly... I was trying hard to find a way to justify the binary name `rg`
because I liked that `r` could stand for Rust. But `rustgrep` seemed a bit too
in-your-face, so I started searching for small relevant words starting with
`r`. That's it.

~~~
saurik
> I'm sure five years from now, when I've moved on to other problems, someone
> will come along and beat ripgrep, and I can only hope that they write about
> it. :-)

I contend that the world would be a much better place if instead of someone
building a new tool which beats yours, they worked to improve your tool
(which, at the point where you moved on, would hopefully have been granted to
a body of separate maintainers, whether people you find or a group such as
Apache which specializes in maintaining valuable open source projects). Sure:
a world where people write about what they do is better than a world where
they don't, but that's a really depressing thing to be "hoping for".

I am apparently becoming extremely unpopular in these circles for expressing
this opinion, but a really important aspect of open source was about people
collaborating towards a common effort to build high quality software: to avoid
working together, to even expect that people will or should continually build
new projects from scratch that "compete" with each other, defeats many (if not
even most) of the benefits of open source software, as it relegates us to the
same process by which closed source software improves.

> But I didn't want to do that. I don't understand why that's a problem. Doing
> this requires being POSIX interface compatible, and that's not a hole I care
> to dig. I don't mean any disrespect, it's just not what I want to spend my
> time doing.

Providing some of the parts, or an offer to help, goes a long way: you don't
have to do all of the work (and others wouldn't expect that); also, it is
worth noting that GNU grep has on occasion added alternative backends (whether
"extended" expressions or later using PCRE), and often have made improvements
or added features. The assumption that grep is a tool which works the way it
does and which will always do what it does, and that improvements should come
in the form of competition, is demotivating to contribution.

~~~
burntsushi
> I am apparently becoming extremely unpopular in these circles for expressing
> this opinion

To be really clear: I don't think your opinion is necessarily the problem.
When I first read your comment, I typed up a response that I wasn't proud of.
It wasn't nice because your comment wasn't nice. I had to step away from the
computer and take a moment to put things back into focus to give you the
response I did. It wasn't easy.

And really, my reaction to your comment had nothing to do with your opinion
that we should try to collaborate more. That's a completely reasonable thing
to hope for. But some of the things you said, or implied (about me
personally), were really way way off, and I personally found them pretty
insulting.

I get that you took my blog post as combative, so maybe you think the same
about me. But you didn't ask for clarification, you just kind of dove right
into the insults and assumptions and bad faith, and personally, I think that
is just a really awful way to interact with other humans.

It's clear that we have different valuations on how to spend our time, and I
really don't appreciate your implicit condescension. I also don't appreciate
you telling me how I should spend my time. My free time is precious, and I
want to spend it doing the things I find interesting. I don't want to work on
a legacy code base, in C and spend _enormous social resources_ pushing on one
of the most established C projects _in existence_ to switch to a new
programming language. That does not sound like fun to me, and I want to work
on something _fun_ in my free time. `ripgrep` happened to be it. (N.B. Fun is
not the only criterion, but it's a big one.)

> but a really important aspect of open source was about people collaborating
> towards a common effort to build high quality software

I've spent a huge portion of my free time in the past 2.5 years contributing
to the Rust ecosystem. If that's not collaborating towards a common effort,
then I don't know what is. `ripgrep` itself is barely a blip in that effort.
All the stuff that went into building `ripgrep` that is freely available as
other libraries? Yeah, that took a while.

------
Tim61
I love the layout of this article. Especially the pitch and anti-pitch. I wish
more tools/libraries/things would make note of their downsides.

I'm convinced to give it a try.

------
h1d
"if you like speed, saner defaults, fewer bugs and Unicode"

Warning - Conditional always returns true.

------
krylon
When I use grep (which is fairly regularly), the bottleneck is nearly always
the disk or the network (in case of NFS/SMB volumes).

Just out of curiosity, what kind of use case makes grep and prospective
replacements scream? The most "hardcore" I got with grep was digging through a
few gigabytes of ShamePoint logs looking for those correlation IDs, and that
apparently was completely I/O-bound; the CPUs on that machine stayed nearly
idle.

~~~
burntsushi
> Just out of curiosity, what kind of use case makes grep and prospective
> replacements scream?

Unicode? Check out the subtitle benchmarks in the blog post. In the best case,
grep is a little slower. In the worst case, grep is orders of magnitude
slower.

ripgrep achieves speed by building UTF-8 decoding straight into its DFA regex
engine (well, strictly speaking, this is Rust's regex engine, not ripgrep).

The other case where grep users might scream is when you're searching large
code repositories. A `grep -r` might catch a large binary file, or search your
`.git` or whatever. Both `ag` and `rg` will look at your `.gitignore` so that
the results you see have higher relevance. (Of course, this is just a default,
you can always "search everything" with ripgrep too!)

~~~
bpchaps
Maybe I'm unique to this, but that sort of default would drive me batshit
insane. Not all of us are programmers by trade who use git. It's just another
gotcha to keep track of. I'd seriously recommend removing that as a default.

That said, this is an overall awesome project.

~~~
burntsushi
The default isn't going to change. The Silver Searcher has, IMO, proven that
it's a good default. I've spoken with so many people that love it, myself
included.

If you want to not respect .gitignores, it's easy: `alias rg="rg -u"`.

(To be clear, I agree that the default is a trade off. I'm not so gung-ho on
this that I think it should never be anything else. But for my project and its
goals, I think it's the best fit.)

------
chalana
I'm never sure whether or not I should adopt these fancy new command line
tools that come out. I get them into my muscle memory and then all of a sudden
I ssh into a machine that doesn't have any of these and I'm screwed...

~~~
petdance
I have to switch back and forth between ack and grep all the time. Sometimes I
use ack, sometimes I use grep. I wrote ack, but I've never stopped using grep,
and ack has never been intended as a replacement for grep.

Most of the common ack flags are the same as grep's: -i, -l, -v, -w,
-A, -B, -C, etc. That was intentional, to minimize that whiplash.

One other suggestion is that you drop ack into your ~/bin directory that you
sync between machines. It's a single Perl file, and that portability is a
feature. As long as your machine has a Perl on it, you can use ack.

------
pixelbeat
Thanks for the detailed comparisons and writeup.

I find this simple wrapper around grep(1) very fast and useful:

[http://www.pixelbeat.org/scripts/findrepo](http://www.pixelbeat.org/scripts/findrepo)

~~~
burntsushi
Thanks for the kind words!

Note that the key thing that `findrepo` doesn't support is respecting your
.gitignore files. For example, in the Rust ecosystem, we often have a `target`
directory in our projects that contains a lot of stuff we probably don't want
to search. In fact, running `cargo new` will add that directory to your
`.gitignore` automatically!

Tools like The Silver Searcher and ripgrep will ignore that directory (and
_all_ others like it) automatically.

There are other advantages to ripgrep. For example, every other tool that
supports Unicode as well as ripgrep (that's `git grep` and GNU grep)
experience a substantial slow down when trying to use more advanced Unicode
features (like \w, -i, etc.). This is one of the things my benchmarks show.

~~~
pixelbeat
Yes, ignoring .gitignore contents is a very useful feature. That could be
added quite easily to findrepo, though it would probably have to be an option,
as it is often useful to search intermediate build files etc. that aren't
checked in. Whereas `git grep` handles the other use case of only searching
checked-in files.

Interesting info wrt efficient Unicode processing for \w and -i.

cheers

------
glangdale
I'm glad to see this work get written up. If anyone wants a good project,
there are some optimizations we (the Hyperscan team) have done in our "Teddy"
SIMD string implementation that aren't captured (yet) in the Rust
implementation to my knowledge. We're very happy to see techniques from our
library get used in other projects as one of the points of open sourcing it
([https://github.com/01org/hyperscan](https://github.com/01org/hyperscan)) was
to share how we do things with the community.

~~~
burntsushi
You guys have _so many_ amazing optimizations that I haven't captured yet. :-)

I personally love bragging about your Teddy algorithm everywhere I go! Thank
you so much for opening up the Hyperscan project. It has been a huge boon!

------
justinmayer
Anyone have any suggestions regarding how to best use Ripgrep within Vim?
Specifically, how best to use it to recursively search the current directory
(or specified directory) and have the results appear in a quickfix window that
allows for easily opening the file(s) that contain the searched term.

~~~
burntsushi
I'd like to get this working too, since I know a lot of folks are happy with
it for ag and ack. rg does have a --vimgrep option that should make it as easy
as ag to use, but I don't think there is a proper integration just yet.

~~~
nkantar
I like ripgrep quite a bit from trying it today, and am hoping to find time to
work on a Vim plugin this weekend. No promises, but I'll share as soon as I
have something usable.

~~~
burntsushi
That's fantastic! Please don't hesitate to file an issue if you run into
problems.

------
chx
I am not sure how excited I am ... I readily accept this to be faster than ag
-- but ag already scans 5M lines in a second for a string literal on my
machine. Not having to switch tools when I need a recursive regexp is win
enough to tolerate a potential 0.4s vs 0.32s everyday search.

------
fsiefken
Nice, but does it compile and run on armhf? I don't see any binaries.

~~~
mbrubeck
I was able to build ripgrep from source for ARM, cross-compiling from my
laptop running Debian, following the directions here:

[https://github.com/japaric/rust-cross#tldr-ubuntu-example](https://github.com/japaric/rust-cross#tldr-ubuntu-example)

(I haven't actually run it because I don't have an ARM linux device handy.)

~~~
pyroholic
Cross-compiled and ran a couple basic searches on an armv7l device. So at
least the basic functionality works just fine.

------
xuhu
Why not make --with-filename the default even for e.g. "rg somestring"? That
seems like it could hinder adoption since grep does it and it's useful when
piping to other commands.

Is it enabled when you specify a directory (rg somestring .)?

~~~
burntsushi
It should be the default whenever you search more than one file.

------
qwertyuiop924
That is really cool. Although I think this is a case where Good Enough will
beat amazing, at least for me (especially given how much I use backrefs).

------
petre
Does it use PCRE (not the lib, the regex style)? If not, ack is just fine. My
main concern with grep is POSIX regular expressions.

~~~
Occivink
It uses the RE2 style, as provided by the Rust regex library.

------
wamatt
On a somewhat related note.

There does not appear to be a popular indexed full-text search tool in
existence.

Think cross-platform version of Spotlight's _mdfind_. Could there be something
fundamental that makes this approach unsuitable for code search?

Alternatively, something like _locate_ , but realtime and fulltext, instead of
filename only.

~~~
diimdeep
AFAIK mdfind heavily depends on file system events in macOS; it would be
painful or impossible to implement such a system in a cross-platform way with
support for file systems like FAT etc.

~~~
wamatt
sure, FAT. OTOH some people think it's not a totally intractable problem

 _A cross-platform file change monitor with multiple backends: Apple OS X File
System Events, BSD kqueue, Solaris/Illumos File Events Notification, Linux
inotify, Microsoft Windows and a stat()-based backend._

[http://emcrisostomo.github.io/fswatch/](http://emcrisostomo.github.io/fswatch/)

[https://github.com/emcrisostomo/fswatch](https://github.com/emcrisostomo/fswatch)

~~~
diimdeep
nice.

------
AlisdairO
Superb work, and a superb writeup. It's really great to see such an honest and
thorough evaluation.

------
TheGrassyKnoll

      Mega-Thanks to the authors of grep (longtime user) 
                                     ack (nice innovation)
                                      ag  (outstanding work)
                                      rg  (outstanding work)

------
timonv
Impressive! It works really well. Has anyone set it up with ctrlp in vim? I
have `rg %s --files --color=never`, which works, but shows an empty white line
at the prompt, so I need to use a cursor to jump to the desired file.

------
visarga
Great tool. Does there exist a faster implementation of sort as well? I once
implemented quicksort in C and it was faster than Unix sort by a lot, I mean,
seconds instead of minutes for 1 million lines of text.

~~~
burntsushi
I've never waited more than a few milliseconds to sort a million lines.

I expect GNU sort does quite a bit, including handling data sets that don't
fit into memory.

Anyway, this post is about regexes, which really has nothing at all to do with
sorting. They are two very different problems. :)

------
spicyj
rg is harder to type with one hand because it uses the same finger twice. :)

~~~
comex
It does?

I guess if you use a rigid fingering system. I never bothered to learn that
back in elementary school, so my fingering is ad hoc based on what I'm typing.
To type "rg" (using QWERTY), I'd just bring my middle finger up to R while
using my index finger for G. This would probably be a little slower than "ag"
because my hand is more likely to already be in place to type the latter
without movement ("home row"), but not as slow as reusing a finger would be.
It's not something I'd have to think about; this is already what I do whenever
I have to type "rg" as part of a word.

I'm curious whether an ad-hoc approach is more or less efficient overall.
Fingering customized per word clearly has the potential to optimize finger
movement, but my error rate is relatively high - mostly timing-related - which
might be exacerbated by an ad-hoc system because there are more (and more
complex) unfamiliar transitions between words.

Anyway, I upvoted you for mentioning typing. It might seem trivial - well, if
you use a lot of custom aliases, it is trivial - but if a command runs fast
enough (and ag is already very fast on small source trees), the time spent
typing its name can become a significant bottleneck. The author of Pijul, for
instance, a version control system meant to compete with Git, seems not to
recognize this... the command is 'pijul', which is essentially impossible to
type on QWERTY without reusing a finger at least once.

~~~
spicyj
I'm not super rigid with all keys but I do seem to have "R with index finger"
ingrained. R with the middle finger does work nicely in this case if I think
about it and force myself to do it.

------
pmontra
It looks very good and I'd like to try it. However, I'm lazy and I don't want
to install the whole Rust dev environment to compile it. Did anybody build a
.deb for Ubuntu 16?

~~~
steveklabnik
If you don't want to compile it yourself, the blog post has links to binaries.
Since they're entirely self-contained, you don't need a full .deb to install
them; just delete the binary when you want to get rid of it.

~~~
pmontra
I didn't notice that. Thanks.

It works. The binary is 11 MB, or 1.5 MB stripped. ag is 69176 bytes. Shared
libraries were a good invention :-) Sooner or later rg will be packaged
properly too.

~~~
cgag
I'd much rather have a self contained binary than 11 more MB of free space.

------
hxn
Looks like every tool has its upsides and downsides. This one lacks full PCRE
syntax support. Does one have to install Rust to use it?

~~~
kod
No. Binaries:
[http://blog.burntsushi.net/ripgrep/#installation](http://blog.burntsushi.net/ripgrep/#installation)

------
reubano
Nice writeup! Any chance you'll support macports for those of us who never
jumped ship to homebrew?

~~~
burntsushi
I'm not a mac user, so I'm terribly unfamiliar with the ecosystem. In
principle, I have no problem with supporting macports, but I haven't looked
into it.

One thing that would be a huge help is if someone briefly wrote up what would
be necessary to get ripgrep into macports:
[https://github.com/BurntSushi/ripgrep/issues/10](https://github.com/BurntSushi/ripgrep/issues/10)

------
serge2k
> We will attempt to do the impossible

Oh well. Waste of time then.

~~~
bo1024
Maybe the author hasn't had breakfast yet and has only done 5 impossible
things so far?

------
libman
Tragically the news that LLVM is switching to a non-Copyfree license (see
copyfree.org/standard/rejected) has ruined everything... Nothing written in
Rust can be called Free Software. :(

~~~
alphapapa
According to that site, Free Software (as defined by the FSF) is not Copyfree.

As far as I can tell, Copyfree is like the BSD license, except you can only
restrict changes to the license text itself; the license itself may not be
changed or added to, but everything else in the project may be used and
changed in any way. So, effectively, it seems like a license where the license
itself is copyrighted with all rights reserved, but the rest of the project is
public domain.

So, just like BSD, anyone can take a Copyfree project proprietary, turning it
non-free.

The GPL is as important today as ever.

------
kozikow
1\. Ag has nice editor integration. I would miss emacs helm-projectile-ag.

2\. PCRE is a good regexp flavor to master. It has a good balance of speed,
power and popularity. In addition to Ag, there are accessible libraries in
many languages, including Python.

I think it would be good if everyone settled on PCRE, rather than each
language thinking they will do regexps better.

~~~
burntsushi
PCRE suffers from worst case exponential behavior, so it's not suitable for
all tasks.
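
A classic demonstration, using a standard pathological pattern (nothing from
the post; assumes the `regex` crate):

    
    
        // Nested repetition plus a non-matching input: a backtracking
        // engine explores exponentially many paths, while a finite
        // automata engine answers in linear time.
        extern crate regex;
        
        use regex::Regex;
        
        fn main() {
            let re = Regex::new(r"(a*)*c").unwrap();
            let haystack = "a".repeat(40); // no 'c' anywhere
            // Returns quickly here; a pure backtracker can blow up.
            assert!(!re.is_match(&haystack));
        }
    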

For the most part, the syntax supported by ripgrep is a strict subset of the
syntax supported by PCRE.

But yes, I can agree that supporting PCRE can be considered an advantage if
you use advanced features heavily (backreferences and lookaround come to
mind).

~~~
db48x
Meh. Back-references and lookaround both take too much brainpower to use at an
interactive shell. I've used them a few times in programs that I was writing,
but just to find some text in some files on my disk? Never.

------
zatkin
>It is not, strictly speaking, an interface compatible “drop-in” replacement
for both, but the feature sets are far more similar than different.

------
wruza

        ...
        $ rg -uu foobar  # similar to `grep -r`
        $ rg -uuu foobar  # similar to `grep -a -r`
    

I knew it. The name is absolutely ironic. I cannot just drop it in and make
all my scripts and whatever scripts I download work immediately faster (nor is
it compatible with my shell typing reflexes). New, shiny, fast tool, doomed
from birth.

~~~
burntsushi
It fundamentally can't be interface compatible, sorry. I think I was pretty
clear about this in the blog. :-)

