
Ugrep: Faster Fuzzy Interactive Grep - bkudria
https://github.com/Genivia/ugrep
======
dnpp123
Having spent some time playing with ripgrep and hyperscan
([https://sr.ht/~pierrenn/ripgrep/](https://sr.ht/~pierrenn/ripgrep/)), the
benchmarking part looks really odd to me.

The T5/T6/T7/T8/T9 tests are in no way extensive enough to show a difference
between hyperscan and ripgrep. Plus, the benchmark already includes the pattern
compilation time in the hyperscan numbers, so this makes even less sense.

So the only upside left to this tool is usability. And from a quick test I
don't really see the point. I'd prefer to use something written in a safe
language (e.g. Rust) over this, so I guess I'll just stick with rg.

------
bkor
I use this and ripgrep. By default I tend to use ripgrep. It skips various
files, e.g. tarballs, ".git", ".svn" and so on. It's quite quick as a result.
I also like the output a lot, though it's kind of annoying that the output
format changes when writing to the screen vs. when redirected (e.g. to a pipe).
It would be nicer if ripgrep always piped things through less with
colours (the way that git does it).

Ugrep is great because it can grep through tarballs. But the default output
format of ripgrep is nicer, plus ripgrep seems quicker when searching
recursively through loads of files that could be skipped (tarballs, etc.). I'm
guessing ugrep skips fewer files.
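
For reference, searching into the tarballs looks something like this (a
sketch; the wrapper name is made up, and it assumes ugrep's -z/--decompress
option, which searches compressed files and archives):

```shell
# Hypothetical wrapper around ugrep's -z (--decompress), which also
# descends into archives such as tarballs; -r recurses into directories.
search_sources() {
  ugrep -z -r "$1" .
}
```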

~~~
burntsushi
I'm sure there are some cases where the performance difference comes down to
skipping more or fewer files, but there's a lot more to it than that. In many
cases, skipping files is actually slower! I did some analysis here:
[https://www.reddit.com/r/rust/comments/i6pfb2/ugrep_new_ultr...](https://www.reddit.com/r/rust/comments/i6pfb2/ugrep_new_ultrafast_c_grep_claims_to_be_faster/g0xybge)

> It would be nicer if ripgrep always redirected things through less with
> colours (the way that git does it).

The -p flag should help there.
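
A minimal shell-config sketch of that (the wrapper name is invented; rg -p
keeps ripgrep's colors and headings even when piping, and less -R renders the
raw escape codes instead of printing them literally):

```shell
# Hypothetical wrapper: keep ripgrep's pretty output in a pager.
rgp() {
  rg -p "$@" | less -R
}
```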

Disclaimer: I am the author of ripgrep.

~~~
bkor
> I'm sure there are some cases where the performance difference comes down to
> skipping more or fewer files

My use case is probably weird: various times I grep for things across about
13k full rpm package checkouts. So 13k directories, each with their .svn
directory, one or more tarballs, possibly some patches/scripts, plus a spec
file. I used to specifically tell GNU grep what to search for (grep something
*/SPECS/*.spec), partly to avoid getting too many unwanted matches. Ripgrep
is smart and quick enough to not need anything special. I like ugrep because
sometimes it's needed to check the sources as well. Total size is 50-70GB; the
spec files make up a tiny bit of that.

The distribution recently added a daily updated tarball with just the specs (7
or 20MB compressed, depending on whether the .svn is in there). I'm planning to
switch
over to that. I'll spare you the details for why it's more difficult than it
seems.

Thanks for the -p advice!

~~~
burntsushi
Gotcha, thanks for elaborating! In case you didn't know about it, there is also
ripgrep-all, which uses ripgrep internally, but can also search oodles of non-
plain-text files, including tarballs: [https://github.com/phiresky/ripgrep-
all](https://github.com/phiresky/ripgrep-all)

------
engelen
Disclosure: I am the author of ugrep.

Here are my 2 cents.

So what is going on with the new ugrep tool?
[https://github.com/Genivia/ugrep](https://github.com/Genivia/ugrep)

As a small organization specializing in open source software, we needed a
search tool like grep but updated to handle many compression formats including
tarballs, with filters to search PDF, DOCX, and other formats, and with the
ability to narrow the archive contents down to source-code file types when
necessary. Why? For example, to look for differences that explain bugs, to
find potential vulnerabilities in older software that is archived, and to
check for open source licenses/violations.

OK. But what about performance?

At the same time, I worked on designing a new fast pattern-matching method
that, put simply, uses logic/hashing to quickly detect possible matches before
performing the more CPU-expensive regex match. The method was extensively
tested with many parameter configurations on several machines to find the
optimal parameterization. It was then compared against the best-known
algorithms I could find, implemented in C (tested in memory, not on files, and
not reported in the ugrep project). I will gladly share this method publicly
eventually in a technical paper. For a while I contemplated filing a utility
patent, but did not move forward on that because I want this technology to be
freely available to everyone and not proprietary. Of course, it would be nice
to receive some recognition and not get ripped off. Most grep tools just use
what is already publicly available and aren't doing something new and clever,
with the possible exception of hyperscan.
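
The actual method isn't published yet, but the general two-stage idea (a cheap
filter narrows candidates before the expensive regex runs) can be sketched
with standard tools, here using grep -F as the cheap fixed-string prefilter:

```shell
# Two-stage sketch: grep -F does a cheap literal scan; only the
# surviving lines are handed to the more expensive regex engine.
printf '%s\n' 'xfoozzbar' 'nothing here' 'foobar!' \
  | grep -F 'bar' \
  | grep -E 'foo[a-z]*bar'
```

Note this is only sound when the literal is guaranteed to occur in every regex
match, which is why such a filter has to be derived from the pattern itself.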

Secondly, I am glad to see that ugrep is useful to many others. In my
conversations with ugrep users, performance is not their top concern; what
matters is having the new features that ugrep offers and other greps lack. As
long as ugrep is very fast, they are more than happy. They also suggested that
ugrep should be compatible with GNU grep's options and not try to be "too
clever" by skipping files and directories, at least not out-of-the-box. None
of the ugrep perf tests skip files or directories; they include all hidden
files/directories, binary files, and compressed files.

Combining all these requirements and suggestions by users into ugrep wasn't
trivial. But I believe we accomplished that goal reasonably well. Having said
that, ugrep is relatively new and still evolving.

There are a lot of opinionated folks when it comes to performance. Many in my
domain of expertise realized over a decade ago that it is folly to pursue "the
best performance" when the variety of architectures is vast and hardware and
software are still evolving, even if slowly. There is no set of perfect
benchmarks. There are always assumptions and requirements that affect the
results wildly. (I am a professor in CS and have spent my entire career as a
researcher, including in the area of high-performance computing.)

I enjoy working now and then on deep and challenging coding projects, such as
ugrep. It's a crazy fun project to work on when I have time.

Cheers!

