Grep-3.5 Released (gnu.org)
59 points by ink_13 on Sept 28, 2020 | 21 comments



Lots of performance fixes, which is great because grep is already very performant! If you're ever bored, try writing your own grep or grep-like tool in your favorite language (I tried it in Go[1], both to learn Go and to build a tool I wanted that didn't exist at the time). The naive solution won't take long, but from there you can parallelize and optimize. It's a fun and highly practical exercise, and you'll gain an appreciation for the clever algorithms grep uses.
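For the curious, the naive version really is tiny. Here's a toy sketch in plain shell (fixed strings only, nothing like the byte-level tricks real grep uses; the names are just illustrative):

    #!/bin/sh
    # toy-grep: print lines containing a fixed string.
    # Usage: toy-grep PATTERN [FILE]; reads stdin if FILE is omitted.
    pattern=$1
    file=${2:-/dev/stdin}
    while IFS= read -r line; do
        case $line in
            *"$pattern"*) printf '%s\n' "$line" ;;
        esac
    done < "$file"

From there, the fun parts are regex support, directory walking, and parallelism.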

More a comment for the newer members of our community: if you don't know grep, you should definitely learn it. It's both the weirdest tool (when you are new to *nix) and the most useful one. I put off learning grep for a couple of years, and now, 10 years later, I use it dozens of times a day. Grep is one of the most important productivity tools in my development environment and a critical component for understanding and writing shell scripts.
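For anyone in that position, a handful of invocations cover most day-to-day use (file names here are just placeholders):

    grep -n 'TODO' main.c          # print matches with line numbers
    grep -ri 'timeout' src/        # recursive, case-insensitive
    grep -v '^#' config.ini        # invert match: drop comment lines
    grep -c 'ERROR' app.log        # count matching lines
    grep -E 'foo|bar' notes.txt    # extended regex: alternation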

[1]: https://github.com/FreedomBen/findref




Awesome, thanks for this link. I remember finding this years ago when I was first rolling my own but had forgotten. This is a great resource.


One of the coolest grep features is exposed via the -f argument: it reads patterns from a file and matches lines containing any of them. Combined with -F (fixed strings, the old fgrep), this is basically the Aho-Corasick algorithm at work, and it's pretty amazing.
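A quick illustration, with hypothetical file names (-F treats the patterns as fixed strings, -f reads one pattern per line from a file):

    printf 'alice\nbob\ncarol\n' > names.txt
    grep -Ff names.txt access.log    # print log lines containing any of the names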


The message that a binary file matches is now sent to standard error, and the message has been reworded from "Binary file FOO matches" to "grep: FOO: binary file matches".

Bet that breaks a few scripts, though they should be checking the exit value, not stdout or stderr.
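For reference, checking the exit value is one line with -q (grep exits 0 on a match, 1 on no match, 2 on error):

    if grep -q 'needle' haystack.txt; then
        echo "found"
    fi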


Ironically, I see a lot of greps piped to grep to remove "Binary file FOO matches" from grep.

    some_command_generates_output \
      | grep "some.expression" \
      | grep -Ev 'Binary.file.*matches'
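With the new behavior, the same cleanup is just a redirect (or -I to skip binary files entirely):

    some_command_generates_output \
      | grep "some.expression" 2>/dev/null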


Single most useful software tool I've ever used, bar none. I installed WSL on my workstation primarily to get grep on my Windows 10 work laptop, as Notepad++'s equivalent "Search in Files" function is incredibly slow.


True, not a day goes by without calling grep. Often just for simple strings, but at least a couple of times a month with true regexps too.

Back when I was still forced to use Windows in the 2000s and we had a huge codebase at the company, I used an indexing tool called Wilbur (http://www.redtree.com/wilbur/index.htm). It was incredibly fast, but of course it did not support regexps.


Yep. I’m sure, this being Hacker News, someone will suggest ripgrep as faster and written in Rust, but grep has been there forever and will always be there when you need it. So while I might use something else occasionally, being able to use grep is extremely useful.


At least for my uses, rg is drop-in compatible with most other grep versions (if I pass the right flag or two), so I just use whichever is available. On most personal machines that's rg, on work stuff GNU grep, on small systems busybox grep, on BSDs the local grep, etc. But yes, GNU's version is common and good.
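As a concrete example of how small the difference usually is (rg is recursive and prints line numbers by default):

    grep -rn 'pattern' src/    # GNU grep
    rg 'pattern' src/          # ripgrep equivalent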


Good might be in the eye of the beholder. But in reality it can be made much, much faster, and Unicode support would also be a good idea. Strings are not just ASCII anymore.

It uses a very old and outdated substring search algorithm. Even ripgrep does not use the state-of-the-art EPSM substring search.

Extended pattern matching is not JIT-compiled. In 2020 there's no excuse for that.

It has no switch to skip the most commonly ignored files, as ack, ag, and rg do.

It cannot do Unicode normalization (unorm) and thus fails to find some Unicode patterns. Debian shipped with the unorm patches for a while, but they caused huge performance regressions in UTF-8 locales. A lot of work is still ahead. E.g. it fails to find "Café" in "Café" when the two use different encodings of é (precomposed U+00E9 vs. e plus combining acute U+0301): they render identically but differ at the byte level.
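A demonstration of that last point, with throwaway file names; the octal escapes spell out the two UTF-8 byte sequences:

    printf 'Caf\303\251\n'  > nfc.txt    # é as U+00E9 (precomposed)
    printf 'Cafe\314\201\n' > nfd.txt    # é as e + U+0301 (combining accent)
    grep 'Café' nfd.txt    # no match if your terminal types the precomposed form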


> and Unicode support would also be a good idea. Strings are not just ASCII anymore.

GNU grep does have some Unicode support when using a UTF-8 locale. Things like \b and \w are Unicode-aware, for example.
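For example, on a glibc system with the en_US.UTF-8 locale generated, \w covers non-ASCII letters (the C locale shown for contrast):

    printf 'caf\303\251\n' | LC_ALL=en_US.UTF-8 grep -oE '\w+'    # prints: café
    printf 'caf\303\251\n' | LC_ALL=C grep -oE '\w+'              # prints: caf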

> It uses a very old and outdated substring search algorithm. Even ripgrep does not use the state-of-the-art EPSM substring search.

Boyer-Moore is still quite serviceable. So is Two-Way. And neither of them requires platform-specific vector instructions, so there will be a place for them for a while yet. As for EPSM, I don't know of any place where it's used in practice; I don't even think Hyperscan implements it. It's hard to beat a well-placed memchr or a simple SIMD prefilter approach[1].

> Extended pattern matching is not JIT-compiled. In 2020 there's no excuse for that.

It's not the case that a JIT is always faster than a DFA (or a lazy DFA in GNU grep's case). So, umm, yeah, there are plenty of excuses for it.

> It cannot do Unicode normalization (unorm) and thus fails to find some Unicode patterns. Debian shipped with the unorm patches for a while, but they caused huge performance regressions in UTF-8 locales. A lot of work is still ahead.

I don't know of any search tool that does Unicode normalization, probably exactly because of the performance overhead. Unicode normalization negates most if not all of the "clever" optimizations used by GNU grep. It would certainly render the choice of substring algorithm or JIT mostly useless, for example.

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...


Well, if the GP used ripgrep, they wouldn't have needed to install WSL or even cygwin first. ripgrep is a native Windows program.


Git Bash, which comes with Git for Windows, is another low-overhead way to get grep, since you may already need Git anyway.


In the 1990s I used something called agrep, approximate grep, that did fuzzy matches. That can be useful when you don't know or don't remember the exact spelling of the string you are looking for. In theory you could write a regexp for such a fuzzy search, but that's extremely tedious if done manually (and maybe the performance would be bad?).

Weirdly enough, I have not found such a tool recently.
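From (possibly fuzzy) memory, the classic Wu-Manber agrep took the error budget as a numeric flag; the hand-rolled regex equivalent shows why doing it manually is so tedious:

    agrep -2 'homogenos' notes.txt        # up to 2 insertions/deletions/substitutions
    grep -E '.rep|g.ep|gr.p|gre.' f.txt   # "grep" with one substitution, by hand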


I use fzf in my terminal to search my history.

https://github.com/junegunn/fzf
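For anyone unfamiliar with it: fzf reads candidate lines on stdin and prints the one you pick to stdout, so it composes like any other filter:

    history | fzf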


That is a bit different from agrep, but it looks very useful. I need to try it.


agrep is a component of libtre - https://github.com/laurikari/tre/


True, I found that a while ago too. I don't remember why I didn't end up using it.


> An N^2 RSS performance regression with many patterns has been fixed in common cases (no backref, and no use of -o or --color).

Does anyone know why -o and --color were slow in older versions of Grep? Doesn't --color just affect the output, not the matching?



