More a comment for the newer members of our community: If you don't know grep, you should definitely learn it. It's both the weirdest tool (when you are new to *nix) and the most useful. I put off learning grep for a couple of years, and now, 10 years later, I use it dozens of times a day. Grep is one of the most important productivity tools in my development environment and a critical component for understanding and writing shell scripts.
Bet that breaks a few scripts, though they should be checking the exit value, not stdout or stderr.
| grep "some.expression" \
| grep -Ev 'Binary.file.*matches'
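On the point about checking the exit value rather than stdout or stderr, here is a minimal sketch of what that could look like. The helper name `check` and the sample file contents are made up for illustration; the only thing relied on is grep's documented exit status (0 for a match, 1 for no match, greater than 1 for an error).

```shell
#!/bin/sh
# Sketch: branch on grep's exit status instead of parsing its output.
# Exit status: 0 = a line matched, 1 = no line matched, >1 = error.

tmp=$(mktemp)                      # scratch file with hypothetical sample data
printf 'alpha\nbeta\n' > "$tmp"

# check PATTERN FILE (hypothetical helper, not part of grep)
check() {
    # -q suppresses normal output; stderr is discarded too,
    # so only the exit status carries information.
    grep -q -- "$1" "$2" 2>/dev/null
    case $? in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}

found=$(check 'beta' "$tmp")
missing=$(check 'gamma' "$tmp")
broken=$(check 'beta' "$tmp.does-not-exist")

echo "$found / $missing / $broken"
rm -f "$tmp"
```

A script written this way does not care which stream a diagnostic such as "Binary file ... matches" is printed to.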
When I was still forced to use Windows in the 2000s and we had a huge codebase at the company, I used an indexing tool called Wilbur (http://www.redtree.com/wilbur/index.htm). It was incredibly fast, but of course it did not support regexps.
it uses a very old and outdated substring search algorithm. even ripgrep does not use the state of the art EPSM substring search.
extended pattern matching is not jitted. in 2020 there's no excuse for that.
it has no switch to skip the most common ignored files, as in ack, ag or rg.
it cannot do unorm, thus fails to find some unicode patterns. debian shipped for a while with the unorm patches, but this caused huge performance regressions on UTF-8 locales. a lot of work is still ahead.
E.g. it fails to find "Café" in "Café" when the two é's are encoded differently (one precomposed, one as "e" plus a combining accent).
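To make the "Café" case concrete, here is a small sketch, assuming the two é's differ in the usual way: one is the precomposed U+00E9, the other is "e" followed by the combining acute accent U+0301. The variable names are made up for illustration, and the bytes are written as octal escapes so the example does not depend on how this text itself was encoded.

```shell
#!/bin/sh
# Two byte sequences that both render as "Café":
#   precomposed: U+00E9        -> UTF-8 bytes C3 A9    (octal \303\251)
#   decomposed:  "e" + U+0301  -> UTF-8 bytes 65 CC 81 (octal e\314\201)
nfc=$(printf 'Caf\303\251')
nfd=$(printf 'Cafe\314\201')

# -F: fixed-string search, -c: print the number of matching lines.
# grep exits with status 1 when nothing matches, hence the "|| true".
cross=$(printf '%s\n' "$nfc" | grep -F -c -- "$nfd" || true)
same=$(printf '%s\n' "$nfc" | grep -F -c -- "$nfc" || true)

echo "decomposed pattern vs precomposed text:  $cross"
echo "precomposed pattern vs precomposed text: $same"
```

The first count should come out as 0 and the second as 1: without normalization, grep compares the encoded forms, so the visually identical strings do not match.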
GNU grep does have some Unicode support when using the UTF-8 locale. Things like \b and \w are Unicode-aware, for example.
> it uses a very old and outdated substring search algorithm. even ripgrep does not use the state of the art EPSM substring search.
Boyer-Moore is still quite serviceable. So is Two-Way. And neither of them requires platform-specific vector instructions, so there will be a place for them for a while yet. And as for EPSM, I don't know of any place where that's used in practice. I don't even think Hyperscan implements it. It's hard to beat a well-placed memchr or a simpler prefilter SIMD approach.
> extended pattern matching is not jitted. in 2020 there's no excuse for that.
It's not the case that a JIT is always faster than a DFA (or a lazy DFA in GNU grep's case). So, umm, yeah, there are plenty of excuses for it.
> it cannot do unorm, thus fails to find some unicode patterns. debian shipped for a while with the unorm patches, but this caused huge performance regressions on UTF-8 locales. a lot of work is still ahead.
I don't know of any search tool that does Unicode normalization, probably exactly because of the performance overhead. Unicode normalization negates most if not all of the "clever" optimizations used by GNU grep. It would certainly render the choice of substring algorithm or JIT mostly useless, for example.
 - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...
Weirdly enough, I have not found such a tool recently.
Does anyone know why -o and --color were slow in older versions of Grep? Doesn't --color just affect the output, not the matching?