
Show HN: Rga: ripgrep, but also search in PDFs, Office documents, zip, tar.gz - phiresky
https://phiresky.github.io/blog/2019/rga--ripgrep-for-zip-targz-docx-odt-epub-jpg/
======
woodruffw
Ah, we've come full circle and reimplemented lesspipe[1][2].

[1]:
[https://manpages.debian.org/jessie/less/lesspipe.1.en.html](https://manpages.debian.org/jessie/less/lesspipe.1.en.html)

[2]: [https://www.openwall.com/lists/oss-security/2014/11/23/2](https://www.openwall.com/lists/oss-security/2014/11/23/2)

~~~
escapecharacter
I can't wait until it sends email in addition to searching for it.

~~~
gitgud
Is that a reference to [1] "Zawinski's Law of Software Envelopment"?

_"Every program attempts to expand until it can read mail. Those programs
which cannot so expand are replaced by ones which can."_

[1] [http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html](http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html)

~~~
bhaak
Are there any corollaries to Zawinski's Law?

Because lately when I develop something, it always gains either IRC or RSS
capabilities at some point.

------
noodlesUK
This is awesome!

One of the features I really liked about MacOS is spotlight. Is there some
(fast) equivalent that I’ve been missing on Linux? I’m aware of `locate`, but
that only matches the names of files, not their content. Is there a search
engine that indexes local content as well?

~~~
Arkanosis
I believe Baloo (a KDE project) does exactly that (indexing local content).
There are different tools you can then use to search for indexed data: I
assume KRunner will be the closest to Spotlight, but you can search for files
right from Dolphin, if you use it.

To be honest, I've found Baloo to be a resource hog in the past, so I've now
the habit of disabling it right after installing a distro which comes with it
(eg. Kubuntu) or not installing at all otherwise (eg. on Debian or Arch). I
should probably give it a second chance, though: on paper, that's the right
approach to search.

~~~
nisa
Baloo is basically Spotlight and probably more, but beware if you are not the
typical happy-smile business user: have a few kernel source trees lying
around in your /home and you'll get acquainted with Baloo sooner (spinning
rust) or later (SSD), because that beast manages to saturate even SSD IOPS
and, while doing so, segfaults on every other file.

You can disable content indexing and just search for filenames (80% rule),
but the switch is hidden in systemsettings, or you even need to resort to
balooctl. Index corruption is also a thing...

That being said, if you are a brave KDE user and don't mind the hassle, it
might even work 80% of the time. In theory it's a great idea, but it never
worked reliably for me. If you are a bored dev, there is probably lots of
low-hanging fruit there: seccomp, some simple heuristic to not overload on
IOPS, there are probably more efficient db structures than LevelDB, and so on...

It's a step forward from Nepomuk(?), which did the full RDF semantic web
stuff, fed a relational database, and reliably killed your HDD-based desktop
in the early 2000s, but it's still a nasty surprise when using KDE.

It's still a cool idea, but it needs some love and contributors to work
reliably on all kinds of nasty setups.

~~~
winter_blue
I’ve found searching for files on KDE (with Baloo) to be, _for some reason_,
really inaccurate and hit-or-miss. Sometimes, even if I remember the exact
name of the file, it won’t show up in the search results. Partial names have a
lower probability of showing up. Slightly inaccurate names (with a Levenshtein
distance of 1 or 2 to the actual name) have a _very low_ probability of
showing up. I end up searching for it using find or ag (the silver searcher).

Baloo is also indeed a resource hog. It can use 100% of a CPU core for hours
and hours, while it’s indexing. But I’d be fine/happy with that if it just
worked properly.
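
For reference, here is a minimal sketch of what "matching within a Levenshtein distance of 1 or 2" could look like. The `fuzzy_match` helper and the sample filenames are hypothetical, and this says nothing about how Baloo actually ranks results.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, names: list[str], max_distance: int = 2) -> list[str]:
    """Return the names within max_distance edits of query, closest first."""
    scored = [(levenshtein(query.lower(), name.lower()), name) for name in names]
    return [name for distance, name in sorted(scored) if distance <= max_distance]


# Hypothetical filenames: the transposed query is two substitutions away from the first.
candidates = ["report.pdf", "notes.txt", "reports.pdf"]
matches = fuzzy_match("reprot.pdf", candidates)
```

Comparing the query against every filename is linear in the number of files, so a real indexer would presumably need something smarter, but it shows the behaviour I'd expect from a search box.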

~~~
nisa
Ah, I can confirm all of the issues. I've thought that was my fault - so these
are probably bugs.

------
ronjouch
Or, if your usage is sufficiently infrequent to not need caching and if you'd
rather not depend on one more tool, simply configure ripgrep to use a `--pre`
flag (which is what rga is doing :).

See config and gist at
[https://github.com/BurntSushi/ripgrep/issues/1252](https://github.com/BurntSushi/ripgrep/issues/1252)
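
A minimal sketch of what such a preprocessor could look like in Python, assuming ripgrep invokes the `--pre` command with the file path as its only argument and searches whatever the command writes to stdout. The `extract_text` helper and the choice of handling only `.gz` are illustrative assumptions, not taken from the linked gist.

```python
import gzip
import sys
from pathlib import Path


def extract_text(path: str) -> bytes:
    """Return the bytes that should be searched for the given file."""
    p = Path(path)
    if p.suffix == ".gz":
        # Transparently decompress gzip files before they are searched.
        with gzip.open(p, "rb") as f:
            return f.read()
    # Anything else is passed through unchanged.
    return p.read_bytes()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Invoked as a preprocessor: write the extracted bytes to stdout.
    sys.stdout.buffer.write(extract_text(sys.argv[1]))
```

Saved as an executable script (say `pre.py`, a made-up name), it would be used along the lines of `rg --pre ./pre.py --pre-glob '*.gz' pattern`, where `--pre-glob` keeps the preprocessor from being spawned for every single file.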

~~~
phiresky
Pretty hard to make rg search in files within archives though without actually
extracting them (which I've seen more than one request for in the ripgrep
issues [1]), which is why rga includes streaming, recursive decompression of
archives including running other preprocessors (like pdf) within them (second
example in the above post and readme).

[1]:
[https://github.com/BurntSushi/ripgrep/issues/918](https://github.com/BurntSushi/ripgrep/issues/918)

~~~
ronjouch
That's a pretty cool feature indeed, I had missed it (and now I understand the
need for a separate binary with archive handling & streaming logic). Thanks
for correcting.

------
nmstoker
Looks great.

Off topic, but did anyone else see Phiresky's other work: AI that mimics human
backchanneling so it pretends it's listening (saying "yeah", "uh-huh", etc.).
Remarkable work! [https://streamable.com/dycu1](https://streamable.com/dycu1)

GitHub repo here: [https://github.com/phiresky/backchannel-prediction](https://github.com/phiresky/backchannel-prediction)

~~~
phiresky
Thank you! Yeah, I like to think I have some interesting projects. Maybe I
should make an overview page because you can only pin 6 repositories on Github
:)

------
mehrdadn
No Windows support? :\

Also, suffers from the exact same problem I see pretty much every text search
tool suffer: doesn't support other UTF encodings like UTF-16, meaning you'll
miss files.

Not sure if it can search in single-line mode either... would be nice if
anyone knows options to do that. With grep etc. it always sucks not to be able
to search for line feeds for no good reason.

~~~
phiresky
Windows support should be fairly trivial, the main problem is packaging it up
and that travis doesn't like Windows.

> doesn't support other UTF encodings like UTF-16

UTF-16 should in fact work, since ripgrep supports it too. Looks like my
binary file detection is at fault [1]..

> Not sure if it can search in single-line mode either

That works fine, just use `rga --multiline '\n' fname`

[1]: [https://github.com/phiresky/ripgrep-all/issues/5](https://github.com/phiresky/ripgrep-all/issues/5)
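
To illustrate why UTF-16 can trip up binary detection, here is a sketch assuming a simple "contains a NUL byte" heuristic. That is a common approach, but not necessarily the exact check rga performs, and `looks_binary` is a hypothetical name.

```python
import codecs

# Byte order marks that identify UTF-16 text.
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_binary(data: bytes, honor_bom: bool = True) -> bool:
    """Treat data containing NUL bytes as binary, unless it starts with a UTF-16 BOM."""
    if honor_bom and data.startswith(UTF16_BOMS):
        return False
    return b"\x00" in data


text = "hello"
with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
without_bom = text.encode("utf-16-le")  # every ASCII character is followed by a NUL byte
```

Under this heuristic `with_bom` is treated as text, while `without_bom` is classified as binary and skipped, even though both hold the same string.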

~~~
mehrdadn
Ah I see, thanks. You shouldn't require a BOM though. There are often files
without a BOM, and not all files with UTF-16 in them are text either (EXEs
etc.). I would just search for all the possible UTF byte sequences (UTF-7,
UTF-8, UTF-16LE/BE, UTF-32, possibly with a switch to allow specifying subsets
or additional encodings if you can support that?) regardless of BOM.

~~~
phiresky
In ripgrep itself you can apparently only look in files of encodings other
than UTF16LE with BOM by manually specifying `--encoding UTF16BE` etc.

I could maybe add encoding detection myself, but I'm kind of discouraged since
not even the unix `file` tool can detect those files as text, and a normal
editor opens at least a UTF16BE file completely wrong. So I'm not sure if I
want to spend my time on trying to write heuristic detection on those,
especially since UTF16 itself is broken and shouldn't really exist at all...

I'll look into what encoding_rs has to offer.

~~~
mehrdadn
Thanks! Yeah I wouldn't try to detect encodings or use heuristics either. If
you could just reduce a single pattern into the OR of a bunch of byte
sequences in each encoding, I think that should work? I'm not sure how easy
that is with the interface you're given. (I wouldn't call UTF-16 'broken', but
either way... it's a reality; a huge fraction of the time when you're
searching binary files on Windows it's to find text inside executables, which
on Windows are generally UTF-16.)
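
Roughly what I mean, as a sketch for literal strings only (turning a full regex into per-encoding variants is harder). The function name and the encoding list are my own choices; UTF-7 is left out for brevity.

```python
import re

ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def multi_encoding_pattern(literal: str) -> re.Pattern:
    """Build a bytes regex that matches `literal` encoded in any of ENCODINGS."""
    variants = {literal.encode(enc) for enc in ENCODINGS}
    # Try longer byte sequences first so a wide encoding wins over a narrower one.
    ordered = sorted(variants, key=lambda v: (-len(v), v))
    return re.compile(b"|".join(re.escape(v) for v in ordered))


pattern = multi_encoding_pattern("needle")
# A made-up "executable" blob containing the string as UTF-16LE, without a BOM.
blob = b"\x00\x01MZ" + "needle".encode("utf-16-le") + b"\xff"
match = pattern.search(blob)
```

No BOM and no encoding detection are involved: the search just asks whether any of the candidate byte sequences occurs in the file.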

------
hamilyon2
I see threads recommending software doing full-text indexing of pdfs
descending into archives. I remember that a few years ago there were some
security vulnerabilities in exactly this kind of software on some fairly
modern Linux distro.

So, if you enable it or use it, make sure the computer is isolated from
anything of value to you, and certainly don't run it on your main work or
personal machine.

~~~
nullbyte
Are you alleging there is a security issue in this software? It's open source,
feel free to show us where.

~~~
tomsmeding
GP said "exactly this _kind_ of software" -- i.e. software with a similar
type of functionality. He did not claim that the posted code specifically has
a vulnerability.

------
mmastrac
This is something I've needed for a long time! Has anyone docker-ized it yet?

Edit: unsure why I'm getting downvoted here - docker is an extremely
convenient way to run things on a server with Unraid that doesn't have a full
distro or easy way to add packages locally.

~~~
mises
I'm assuming this is sarcasm? Why on God's green earth would you want to
dockerize a utility like this?

That aside, it does seem like a very useful utility.

To respond to your edit, I don't think dockerization is necessary for a simple
binary. You're not running any thing that could benefit from it. It makes
sense if you want to run some kind of network service (torrent client, web
server, ftp, smb, blah blah blah) and forward a port to your instance. That
way, you can pre-package it on something like unraid. Here, it doesn't, and
(as far as I know) even unraid can install a simple binary.

~~~
mmastrac
Why wouldn't you want to dockerize it? It has a huge number of potential
dependencies (pandoc, pdftotext, etc). Getting all those working on my Unraid
storage server for searching is going to be painful.

~~~
mises
I feel like this case might be better served by something like flatpak or
appimage. While I'm usually not a fan, this is basically exactly what they're
built for, as compared to docker, which is not designed with this in mind. I
don't want to have to prefix every command with docker stuff.

------
triangleman
So, if you're trying to grep through 100+ MB zip files on a shared folder
(Windows box), is the most efficient way to do that to use Ansible or the
like to remote in to the server and run commands there? Of course it could
use CPU, so that could cause problems if it's a production server.

I would appreciate hearing from anyone with experience in this area.

------
orthoxerox
How well does it handle zip bombs and 42.zip?

~~~
phiresky
There is an option to limit the maximum archive recursion,
`--rga-max-archive-recursion=`, which defaults to 4. That is also needed to
handle droste.zip [1], which is a zip file that contains itself. So for huge
archives it will simply take a fairly long time, unless you limit recursion
more.

[1]: [https://alf.nu/ZipQuine](https://alf.nu/ZipQuine)
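
The idea of a recursion limit can be sketched like this in Python. This is not rga's actual implementation (rga streams, while this reads each member fully into memory and does nothing about the size of a zip bomb), and `search_zip` is a hypothetical name.

```python
import io
import zipfile


def search_zip(data: bytes, needle: bytes, max_depth: int = 4) -> list[str]:
    """Return paths of members containing needle, opening at most max_depth archive levels."""
    if max_depth <= 0:
        return []  # recursion budget exhausted: do not open this archive
    hits: list[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            content = zf.read(info)
            if zipfile.is_zipfile(io.BytesIO(content)):
                # Nested archive: descend with one level less of budget.
                nested = search_zip(content, needle, max_depth - 1)
                hits.extend(f"{info.filename}/{path}" for path in nested)
            elif needle in content:
                hits.append(info.filename)
    return hits
```

With a self-containing archive like droste.zip, each level uses up one unit of `max_depth`, so the search terminates instead of descending forever.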

------
Adam89
phiresky, does this use libripgrep under the hood?

Or do you shell out to the ripgrep binary?

~~~
masklinn
libripgrep isn't currently a thing (not merged to mainline). On reddit, the
author noted that ripgrep (and other utilities) need to be on the PATH, which
is one of the issues related to windows packaging.

~~~
burntsushi
libripgrep has been on master since 0.10.0. There just isn't any high level
documentation for it.

It's not clear whether libripgrep would be a good fit for this project or not.
They would need to reroll all the arg parsing logic themselves. libripgrep is
really about building your own search tools (or more specialized tools) using
the same internal infrastructure as ripgrep. But yeah, this is why I need high
level docs to explain this stuff. I've been putting it off until I get bstr
straightened out.

~~~
Adam89
I'd like to try libripgrep out in one of my projects, maybe I could also take
on the challenge of attempting to document it.

~~~
burntsushi
So, a lot of it is actually written already. :-)
[https://github.com/BurntSushi/blog/blob/ag/libripgrep/conten...](https://github.com/BurntSushi/blog/blob/ag/libripgrep/content/post/libripgrep.md)
(The formatting is FUBAR, so you'll want to look at the raw text.)

There's just a lot of polish that needs to be done, and converting portions of
it to appropriate API documentation. Unfortunately, I don't really have the
bandwidth to mentor this at the moment. :-( However, with that said, one super
useful thing you could do is try out libripgrep and then give feedback[1] on
how it worked for you, and in particular, which things were hard to figure
out.

[1] -
[https://github.com/BurntSushi/ripgrep/issues/1009](https://github.com/BurntSushi/ripgrep/issues/1009)

------
kekebo
>rga simply runs ripgrep (rg) with some options set, especially
--pre=rga-preproc and --pre-glob.

If all it does is run ripgrep with certain options, creating a
corresponding alias might be a simpler solution than adding another binary to
the system.

~~~
FreeFull
At the very least, you still need the rga-preproc binary

~~~
kekebo
That makes more sense, the way the sentence is written made it appear to me
like it's just a very simplistic wrapper - but looking through the code shows
it's more elaborate than that.

------
pcunite
FileSearchEX works well if you need a Windows searchy tool.

------
batbomb
it seems like this might be useful for journalists

