Hacker News new | comments | show | ask | jobs | submit login

I'm the author of ag. That was a really good comparison of the different code searching tools. The author did a great job of showing how each tool misbehaved or performed poorly in certain circumstances. He's also totally right about defaults mattering.

It looks like ripgrep gets most of its speedup on ag by:

1. Only supporting DFA-able Rust regexes. I'd love to use a lighter-weight regex library in ag, but users are accustomed to full PCRE support. Switching would cause me to receive a lot of angry emails. Maybe I'll do it anyway. PCRE has some annoying limitations. (For example, it can only search up to 2GB at a time.)

2. Not counting line numbers by default. The blog post addresses this, but I think results without line numbers are far less useful; so much so that I've traded away performance in ag. (Note that even if you tell ag not to print line numbers, it still wastes time counting them. The printing code is the result of me merging a lot of PRs that I really shouldn't have.)

3. Not using mmap(). This is a big one, and I'm not sure what the deal is here. I just added a --nommap option to ag in master.[1] It's a naive implementation, but it benchmarks comparably to the default mmap() behavior. I'm really hoping there's a flag I can pass to mmap() or madvise() that says, "Don't worry about all that synchronization stuff. I just want to read these bytes sequentially. I'm OK with undefined behavior if something else changes the file while I'm reading it."

The author also points out correctness issues with ag. Ag doesn't fully support .gitiginore. It doesn't support unicode. Inverse matching (-v) can be crazy slow. These shortcomings are mostly because I originally wrote ag for myself. If I didn't use certain gitignore rules or non-ASCII encodings, I didn't write the code to support them.

Some expectation management: If you try out ripgrep, don't get your hopes up. Unless you're searching some really big codebases, you won't notice the speed difference. What you will notice, however, are the feature differences. Take a look at https://github.com/BurntSushi/ripgrep/issues to get a taste of what's missing or broken. It will be some time before all those little details are ironed-out.

That said, may the best code searching tool win. :)

1. https://github.com/ggreer/the_silver_searcher/commit/bd65e26...

Thanks for the response! Some notes:

1. In my benchmarks, I do control for line numbers by either explicitly making it a variable (i.e., when you see `(lines)`) or by making all tools count lines to make the comparison fair. For the most part, this only tends to matter in the single-file benchmarks.

2. For memory maps, you might get very different results depending on your environment. For example, I enabled memory maps on Windows where they seem to do a bit better. (I think my blog post gives enough details that you could reproduce the benchmark environment precisely if you were so inclined. This was important to me, so I spent a lot of time documenting it.)

3. The set of features supported by rg should be very very close to what is supported by ag. Reviewing `ag`'s man page again, probably the only things missing from rg are --ackmate, --depth, some of the color configurability flags (but rg does do coloring), --passthrough, --smart-case and --stats maybe? I might be missing some others. And Mercurial support (but ag's is incomplete). In exchange, rg gives you much better single file performance, better large-repo performance and real Unicode support that doesn't slow way down. I'd say those are pretty decent expectations. :-)

Thanks for ag by the way. It and ack have definitely triggered a new kind of searching. I have some further information retrievalish ideas on evolving the concept, but those will have to wait!

In terms of core features, ripgrep is totally there. It searches fast. It ignores files pretty accurately. It outputs results in a pleasant and useful format. If a new user tries rg, they'll be very happy.

My warning about the feature differences was meant to temper ag users' expectations. There are lots of little things that ag users are accustomed to that are either different or missing in ripgrep. Off the top of my head: Ag reads the user's global gitignore. (This is harder than most people think.) It detects stdout redirects such as "ag blah > output.txt" and ignores output.txt. It can search gz and xz files. It defaults to smart-case searching. It can limit a search to one hardware device (--one-device), avoiding slow reads on network mounts. And as a commenter already pointed out, it supports the --pager option. Taken together, all those small differences are likely to cause an average ag user some grief. I wanted to manage expectations so that users wouldn't create annoying "issues" (really, feature requests) on your GitHub repo. Sorry if that came off the wrong way.

On a completely unrelated note: I see ripgrep supports .rgignore files, similar to how ag supports .agignore. It'd be nice if we could combine forces and choose a single filename for this purpose. That way when the next search tool comes along, it can use the same thing instead of .zzignore or whatever. It would also make it easier for users to switch between our tools. I'd suggest a generic name like ".ignore" or ".ignores", but I'm sure some tool creates such files or directories already.

Edit: Actually, it looks like .ignore can work. The only examples I've found of .ignore files are actual files containing ignore patterns.

You raise good points, thank you. I hope to support some of those features, since they seem like nice conveniences.

In principle I'd be fine standardizing on a common ignore file. We'd need to come up with a format (I think I'd vote for "do what gitignore does", since I think that's what we're both doing now anyway).

Adding files to this list is kind of a bummer though. I could probably get away with replacing `.rgignore` proper, but I suspect you'd need to add it without replacing `.agignore`, or else those angry users you were talking about might show themselves. :-)

I do kind of like `.grepignore` since `grep` has kind of elevated itself to "search tool" as a term, but I can see how that would be confusing. `.searchignore` feels too long. `.ignore` and `.ignorerc` feel a bit too generic, but either seems like the frontrunner at the moment.

I also vote for "do what gitignore does". My plan is to add support for the new file name, deprecate .agignore, and update docs everywhere. But it'd be a while before I removed .agignore completely.

I really like .ignore, and I like it because it's generic. The information I want it to convey is:

> Dear programs,

> If you are traversing this directory, please ignore these things.

Of course, some programs could still benefit from having application-specific ignore files, but it'd cut down on a lot of cruft and repetition.

…and merged: https://github.com/ggreer/the_silver_searcher/pull/974

I'll tag a new release in a day or two. Also, it looks like the sift author is getting on the .ignore train: https://github.com/svent/sift/issues/78#issuecomment-2493352...

This worked out pretty well. :)

This is probably the best case of out-in-the-open open source developers of similar-but-different tools collaborating on a new standard and implementing them in record time that I have ever seen.

Keep it up all (rg/ag/sift)!

I completely agree. That was one of the most reasonable and level-headed discussions between strangers I have _ever_ seen on the Internet!

I really like .grepignore as it is generic enough to encompass all tools which have grep-like functionality while never stepping on the feet of other programs that may also require ignore files that may be different than the .grepignore.

The problem is that grep will never obey .grepignore. That's so confusing as to be a deal-breaker.

Also, what about programs that have search functionality as part of their design, but not as their core function? For example, I don't want my text editor to search .min.js files. I'd even prefer it if such files didn't show up in my editor's sidebar. Do I have to add *.min.js to .searchignore and .atomignore? (Or if the editor people ever work out a standard, maybe it will be .editorignore.)

If I had to draw a Venn diagram of ignore patterns in my text editors, my search tools, and my rsync scripts, they'd mostly overlap. I don't deny the need for application-specific ignores, but there is a large class of applications that could benefit from a more generic ignore file.

I do think it would be better to have the name at least reflect that class of applications, maybe "searchignore" like someone else suggested. There may be overlap but it's hard to predict all the types of applications people are using that need ignore functionality and something as simple as backing things up with rsync would seem like an example where someone could well want considerably different ignores.

I'd say that .ignore is too generic a name. What is to be ignored by what?

But I like the idea of standardizing this. Perhaps the cache directory tagging standard gives some inspiration.


.gignore? .ignore is way too generic. .gignore as in grep ignore, rg ignore, ag ignore, they all have a g in their names somewhere. Well, ack doesn't but what kind of name is that anyway; it sounds like someone (the author?) got annoyed with grep's lack of pcre. .gignore seems generic enough, yet specific for these tools. Expecting a single .ignore file to rule them all text search tools is rather too optimistic.

May be .searchignore?


> detects stdout redirects

Is there a portable way for you to find out that stdout is connected to output.txt? isatty() only tells you that there may be a redirection and I suppose on linux you could use /proc/self/fd/1 but I don't know how to do it portably.

If output is redirected, ag calls fstat on stdout and records the inode. It then ignores files based on inode rather than path:


Could I suggest .ignorerc?

I would immediately intuit that such a file is a (r)untime (c)onfiguration for ignoring something.

.ignore isn't... bad, it just looks like something I can safely delete, like a file styled `such-and-such~`

If you're taking the time to create a .ignore file and set up personally-relevant regexps, I doubt you'll forget. (You'll probably also have it versioned.)

What I like about '.ignore' is that it's not tied to grep (which will never use it) but expresses that the concept is agnostic. You can imagine lots of programs supporting it.

I notice that subsequent runs in the same (non changing) directory get different results. These runs are all within 20 seconds, what gives?

  $ rg each | md5sum
  670b544e15f9430d9934334a11a87b7e  -
  $ rg each | md5sum
  4d13be6b4531ad52b1b476314fe98fb7  -
  rg each | md5sum
  88e15dbb943665ea54482cb499741938  -
  rg each | md5sum
  eec6d6d5c9a592cec25aa8b0c19aae15  -
  rg each | md5sum
  ad74b78ef8f0d21450f8f87415555af0  -

  $ date
  Sat Sep 24 01:42:27 EEST 2016
  $ rg each > foo1
  $ rg each > foo2
  $ rg each > foo3
  $ rg each > foo4
  $ ls -la foo*
  -rw-r--r--  1 coldtea  staff  1429646 Sep 24 01:42 foo1
  -rw-r--r--  1 coldtea  staff  2250868 Sep 24 01:42 foo2
  -rw-r--r--  1 coldtea  staff  4536031 Sep 24 01:42 foo3
  -rw-r--r--  1 coldtea  staff  9140652 Sep 24 01:42 foo4
  $ date
  Sat Sep 24 01:42:44 EEST 2016
OS X 10.12, installed with brew.

This can happen because rg searches files in parallel, so the order in which it finishes the files can be nondeterministic. If you run with -j1 (single-threaded) then it is deterministic.

To get deterministic output in multi-threaded mode, rg could wait and buffer the output until it can print it in sorted order. This might increase memory usage, and possibly time, though I think the increase would be minor.

In the first case, it's searching in parallel, so I bet the order of results is different each time.

In the second case, rg each > foo2 found results in foo1 and put them in foo2. Then rg each > foo3 found results in foo1 and foo2, and put them in foo3. Etc. That's why the file size increases so quickly.

>In the first case, it's searching in parallel, so I bet the order of results is different.

Aha. Thought that needed the -j flag (it says: default threads: 0 in the cli help).

Could it do anything to put them out in order of "depth" (and directory/file sorting order)?

>In the second case, rg each > foo2 found results in foo1 and put them in foo2. Then rg each > foo3 found results in foo1 and foo2, and put them in foo3. Etc. That's why the file size increases so quickly.

LOL, facepalm -- yes.

Forcing it to use one worker (-j1, I think) should give it deterministic output.

`ag --pager` is the most important one (I use alias ag='ag --pager "less -FRX"')

I just made a cursory pass at `man rg`, but it seems to me that the `-g` option from ag is also missing. I use it with vim-ctrlp to search file names.

Thanks for rg and the very informative blog post!

It's there. You need to pass --files.

> Only supporting DFA-able Rust regexes. I'd love to use a lighter-weight regex library in ag, but users are accustomed to full PCRE support

Would it be possible to detect when an expression requires PCRE-specific features and use a different engine when possible?

It's possible, but it's certainly not easy. Here are some complications:

1. The DFA-regex engine's syntax must be a subset of PCRE's syntax. If it's not, then users will be very confused when regex features work fine in isolation, but cause errors when combined in the same query.

2. The DFA-regex's behavior must be the same as PCRE. If whitespace matching or unicode support is even slightly different, it will frustrate users.

3. Adding another dependency means yet another way in which compilation can fail or incompatibilities can arise.

Considering the marginal usefulness of backtracking and captures, I'd prefer to keep ag as simple as possible and ditch them.

> 1. Only supporting DFA-able Rust regexes. I'd love to use a lighter-weight regex library in ag, but users are accustomed to full PCRE support. Switching would cause me to receive a lot of angry emails. Maybe I'll do it anyway. PCRE has some annoying limitations. (For example, it can only search up to 2GB at a time.)

The standard trick here is to use the faster method for searches that it supports, and use the slower but more capable method only for searches that require it. Parse the regex, see if a DFA will work, and only use PCRE for expressions with backreferences and similar.

Or use fancy-regex. Not ready for prime-time, but potentially the best of both worlds.

Do you know any engines that actually do this? As in, is it really standard? I thought maybe Spencer's Tcl regex engine did it? Although I confess, I've never read the source.

I guess RE2/Rust/Go all kind of do it as well. For example, RE2/Rust/Go will actually do backtracking in some cases! (It's bounded of course, maintaining linear time.) But this doesn't actually meet the criteria of being able to support more advanced features.

This is probably a digression.. While it's not a regex engine, and I suppose this is standard in compiler development, the Neo4j query planner team uses this approach extensively to incrementally introduce new techniques or to add pointed optimizations.

For instance, Neo4j chooses between a "rule" planner, which can (at least while I still worked at Neo) solve any query and a "cost" planner, which can solve a large subset. For those queries the cost planner can solve, it usually makes significantly better plans, kind of like the example with regex engines here.

For the curious, that happens here: https://github.com/neo4j/neo4j/blob/3.1/community/cypher/cyp...

Likewise, once a plan is made, there are two runtimes that are able to execute them - an interpreting runtime that can execute any plan, and a compiling runtime that converts logical plans to JVM bytecode, much faster but only supports a subset.

That choice is made here: https://github.com/neo4j/neo4j/blob/3.1/community/cypher/cyp...

This goes on in finer and finer detail. Lots of similar examples in how the planners devise graph traversal algorithms on the fly, by looking for patterns they now and falling back to more generic approaches if need be.

FWIW, the overhead of this has, I would argue, massively paid for itself. It has made extremely ambitious projects, like the compile-to-bytecode runtime and the cost-based query planner safely deliverable in incremental steps.

> Do you know any engines that actually do this? As in, is it really standard?

I don't know about choosing between multiple regex engines, but GNU grep and other grep implementations check for non-regex fixed strings or trivial patterns, and have special cases for those.

Well... yeah... I was more thinking about really big switches like between PCRE and a DFA.

It looks like the sibling comment found one that I think qualifies based on the description (assuming "NFA" is used correctly :P).

For anyone else interested in the memory map issue, here's some more data: https://news.ycombinator.com/item?id=12567326

> It looks like ripgrep gets most of its speedup on ag by:

A non-trivial amount of time spent is simply reading the files off the disk. If speed is the all-encompassing metric, there's a big gain to be made by pre-processing the files into an index and loading that into memory instead, and that's what livegrep[1] does.

If you find yourself waiting for grep often enough, it's a pretty handy internal tool to configure across all repos.

[1] https://livegrep.com/about

You just said that to the author of The Silver Searcher (ag).

[0] https://github.com/ggreer/the_silver_searcher

Thanks for bringing ag to the world. It's my favorite code searching tool because the letter 'a' and the letter 'g', from its name, feel easier to type than the ones of any other code searching tool. I am definitively willing to sacrifice a few milliseconds, while searching for a word in my megabyte of code, for this feature.

I've just installed ag. It defaults to case insensitive searches when the pattern is in lowercase. Is there a way to change this default? Perhaps a config file of some sort?

There's no config file for default options. You probably want to add an alias to your bash/zsh/fishrc:

    alias ag='ag -s'
I tend to favor aliases over config files. It reduces complexity and improves startup time. Startup time may not seem like a big deal, but it really matters if you're running something like:

    find . -ctime 2 -exec ag blah
(Find all files modified in the past two days and search them for instances of "blah".) If 10,000 files were changed in the past two days, and your search program takes 10 milliseconds to parse a config file, that's an extra 100 seconds wasted.

find can batch arguments using +, much like xargs:

    find . -ctime 2 -exec ag blah {} +
This makes startup latency less important.

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact