
Qgrep Internals - gbrown_
https://zeux.io/2019/04/20/qgrep-internals/
======
pcwalton
Along these lines, it would be interesting to see a code search that can
optionally use the native OS full text search to accelerate queries. Having to
keep an index up to date for a search tool specifically is a pain. But Windows
comes with the Windows Search API, macOS has NSMetadataQuery, and Linux has a
variety of tools like Recoll that might be available depending on the system.
Since in many cases the OS maintains a full text search database anyway, a
grep tool might as well make use of it where available.

~~~
burntsushi
If I were going to build a full text search tool that was primarily used for
code, I doubt the OS native search indices could be used. For searching code,
you likely want a very different tokenizer and code-aware searches than what a
typical document indexer would use. I've often thought about leveraging the
enormous set of syntax highlighting definitions for languages out there as a
base to build a tokenizer off of that has some smarts about the things it
indexes. But it's a half-baked thought.
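As a rough illustration of how a code tokenizer might differ from a typical document indexer, here is a minimal sketch (my own, not taken from any existing tool) that splits identifiers at snake_case and camelCase boundaries before lowercasing:

```python
import re

def code_tokens(line):
    """Split a line of source into searchable tokens, breaking
    identifiers apart at snake_case and camelCase boundaries."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
        for part in ident.split("_"):  # snake_case pieces
            # camelCase / PascalCase: an all-caps acronym run, or one word
            tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", part))
    return [t.lower() for t in tokens]
```

Indexing `parse_HTTPRequest` under `parse`, `http`, and `request` is the kind of code-aware behavior a generic NLP tokenizer would not give you.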

But I agree. A prerequisite for any tool like this is being very good at
keeping the index up to date with minimal user fuss, and being reliable.
That's a hard engineering task.

~~~
pcwalton
What I was thinking was to use the OS native search as a way to quickly find a
candidate set of files, and then to manually winnow down from there. The
indexer would only help for a subset of queries: there are many queries it
wouldn't help with at all. But it could help for many common ones. I suppose
it depends on how the OS full text search is implemented: if it has hardwired
NLP stuff in it, for example, it might not be useful at all.
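That two-phase idea can be sketched in a few lines. Here `index_lookup` stands in for a hypothetical coarse-index API and a `files` dict stands in for the filesystem; both names are illustrative:

```python
import re

def literal_terms(pattern):
    # Pull the literal words (3+ chars) out of the regex; these are what
    # a coarse full-text index can actually be queried for.
    return re.findall(r"[A-Za-z0-9_]{3,}", pattern)

def indexed_search(pattern, index_lookup, files):
    """Phase 1: ask the index for files containing every literal term.
    Phase 2: winnow the candidates with a real regex scan.
    `files` maps path -> text; `index_lookup(term)` returns a set of
    paths (a stand-in for whatever the OS index provides)."""
    rx = re.compile(pattern)
    candidates = set(files)
    terms = literal_terms(pattern)
    for t in terms:  # no literals -> index can't help, scan everything
        candidates &= index_lookup(t)
    hits = []
    for path in sorted(candidates):
        for lineno, line in enumerate(files[path].splitlines(), 1):
            if rx.search(line):
                hits.append((path, lineno, line))
    return hits
```

The win comes entirely from phase 1: for a query with distinctive literals, the regex scan only ever touches a handful of candidate files.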

------
kyberias
I very rarely need to search my code base for arbitrary strings. I use more
advanced tools to find code. For example, Resharper indexes all the names and
provides me with a great user interface to find them. Let's say I need to find
a class named MyWonderfulFactoryManagerAdapterSingleton: I press Ctrl+T and
merely type famaadsi (the leading letters of the camel-case words) to find it.
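A matcher along these lines can be sketched in a few lines of Python; this is a rough approximation for illustration, not ReSharper's actual algorithm:

```python
import re

def humps(name):
    # Break a PascalCase identifier at each capital letter.
    return re.findall(r"[A-Z][a-z0-9]*", name)

def hump_match(query, name):
    """True if `query` can be read off the camel-case words of `name`
    in order, each query fragment being a prefix of some word."""
    words = [w.lower() for w in humps(name)]
    q = query.lower()

    def go(qi, wi):
        if qi == len(q):
            return True       # whole query consumed: match
        if wi == len(words):
            return False      # query left over but no words left
        w, k = words[wi], 0
        while True:
            # try matching the rest of the query against later words,
            # having consumed k characters from this word
            if go(qi + k, wi + 1):
                return True
            if qi + k < len(q) and k < len(w) and q[qi + k] == w[k]:
                k += 1
            else:
                return False

    return go(0, 0)
```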

~~~
OskarS
I used to use "smart" tools like that, but I've more and more gone back to the
"dumber" grep tools. It turns out, they're usually faster, they're almost
exactly as reliable, and they're way more flexible. I'll stick with ripgrep.

------
sudeepj
qgrep maintains some metadata ("compressed representation of the entire
searchable codebase") and ripgrep does not (as far as I know). It doesn't
appear to be benchmarking the same thing, unless I am missing something.

Besides this, I did learn a trick or two from the blog.

------
mruts
I've (unfortunately?) never needed to make code this fast, but it sounds like a
lot of fun. Seems a little unnecessary (Intellij with Scala can find
usages/implementations pretty quickly for even large codebases), but I
appreciate the engineering and dedication that went into this. Moreover, I
haven't worked on very _very_ large codebases (> 5m lines) and never with VS,
believe it was a real problem that needed to be solved.

Do other search tools like ag or ripgrep also maintain a cache? A lot of the
optimizations he described could be done on non-cached data, and I am curious
how important the I/O and cache-size bottlenecks are.

~~~
jnwatson
Ag and ripgrep do not maintain their own cache. Usually the OS file cache does
a decent job of keeping things fast.

------
asdf098asdf098
Not sure it's a fair comparison. I mean, in his comparison qgrep finds a match
at line 44478 where ripgrep says it's line 53924.

Then there's the obvious omission of the time it takes for qgrep to populate
its cache.

Seems like an apples and oranges comparison to me.

~~~
burntsushi
On my system, the line number reported by qgrep is correct and matches
ripgrep. I don't know why the article's example is off, but I'd probably
assume it was a copy & paste error.

> Seems like an apples and oranges comparison to me.

This kind of misses the point. The OP is perfectly up front about this. It's
not an apples-to-apples comparison in the sense of the tools doing roughly
similar algorithmic work internally, but it is a decent apples-to-apples
comparison in terms of the _user experience_. That is, searches are much
faster when you have a pre-computed index.

With that said, getting the user experience around keeping the index up to
date is definitely a key component of this, and the article doesn't really
explore that aspect. But I don't think index creation/update time should be
included in search benchmarks, since the whole point of a tool like this is
that creating or updating the index should be a relatively cheap and
infrequent thing compared to the searches executed against the index.
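As a concrete illustration of why a pre-computed index changes the user experience, here is a toy trigram index in the style of Google Code Search; this is not qgrep's actual on-disk scheme, just a sketch of the general technique:

```python
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> paths containing it
        self.files = {}                   # path -> full text

    def add(self, path, text):
        self.files[path] = text
        for t in trigrams(text):
            self.postings[t].add(path)

    def search(self, literal):
        # A file can only contain `literal` if it contains every one of
        # its trigrams, so intersect the posting lists first...
        candidates = set(self.files)
        for t in trigrams(literal):
            candidates &= self.postings.get(t, set())
        # ...then verify, since trigram hits can be false positives.
        return sorted(p for p in candidates if literal in self.files[p])
```

Search cost now scales with the number of candidate files rather than with the size of the codebase, which is exactly the "much faster with an index" effect being benchmarked.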

------
fefe23
Holy moly qgrep is awesome! How have I never heard of it before?!

Thanks for sharing this link!

------
nurettin
Besides Google's RE2 and writing properly parallelized and cached code, this
article highlights the most important thing, which we are in the process of
losing at the moment: testing your code on low-clock CPUs and disks with
spinning parts.

I'm pretty sure that if we had this hardware in the 60s, computer science
would have halted, because people would have thought, "Oh, why bother reducing
this algorithm to n log n when n^2 is computable in a perfectly reasonable
time frame for the use cases we have?"

------
Lowkeyloki
Reminds me of my first job out of college. It was a very large insurance
application written in VB.NET. It was a terrible application. And Visual
Studio's search functionality left much to be desired at the time. I ended up
purchasing Copernic Desktop Search, which is an indexed file search tool. It's
obviously not built for this kind of use case, but it sped up my workflow
enormously.

Nowadays, I can get by just fine with the search functionality in VSCode. (And
I don't write VB.NET anymore.)

