It's interesting that we're still seeing such interesting progress in such a fun...

burntsushi · on April 23, 2021

For queries that are already fast, I expect them to remain similarly fast. Or hope that they do. :-)

But there are many queries that are not so fast today. For example:

    $ time rg-master -Nc 'strengths' OpenSubtitles2018.raw.sample.en
    242

    real    0.255
    user    0.204
    sys     0.050
    maxmem  903 MB
    faults  0

    $ time rg -Nc 'strengths' OpenSubtitles2018.raw.sample.en
    242

    real    0.092
    user    0.068
    sys     0.024
    maxmem  904 MB
    faults  0

Where the second command has the new substring algorithm. Nothing earth shattering to be honest. ripgrep already did a good job of dealing with common cases.

This is ripgrep's current work-horse: https://github.com/rust-lang/regex/blob/3db8722d0b204a85380f...

The revised algorithm is based on the same principles (heuristics derived from an a priori frequency distribution), but shouldn't be as easy to get into cases where it runs very slowly. Here's another more poignant example (where the needle is two ASCII space characters):

    $ time rg-master -Nc '  ' OpenSubtitles2018.raw.sample.en

    real    0.887
    user    0.859
    sys     0.027
    maxmem  903 MB
    faults  0

    $ time rg -Nc '  ' OpenSubtitles2018.raw.sample.en

    real    0.084
    user    0.043
    sys     0.040
    maxmem  903 MB
    faults  0

js2 · on April 23, 2021

> heuristics derived from an a priori frequency distribution

How did you come up with the frequency distribution? English text? Code samples? Languages other than English?

burntsushi · on April 23, 2021

Yes, yes and yes. :-)

https://github.com/rust-lang/regex/blob/3db8722d0b204a85380f...

pletnes · on April 23, 2021

Wow, cool!

Since rg (which is awesome) understands which programming languages map to which file suffixes, could it make sense to do per-language frequency analysis? {}[] are common in C-style languages, but much less so in lisp, python, etc.

burntsushi · on April 23, 2021

It could. But there's a lot of code complexity there because ripgrep is made from de-coupled components. Implementing your idea would require making that frequency distribution available as an overridable element all the way down into where substring search is actually implemented (a transitive dependency of ripgrep).

On top of that, it's unlikely to make much of a difference. Not only would the frequency distributions between languages look pretty similar, but it's still just a guess at the end of the day. There is such a thing as overfitting too. :-)