Show HN: Rga: ripgrep, but also search in PDFs, Office documents, zip, tar.gz (phiresky.github.io)
284 points by phiresky 35 days ago | 65 comments




I can't wait until it sends email in addition to searching for it.


Is that a reference to "Zawinski's Law of Software Envelopment" [1]?

"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."

[1] http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html


Are there any corollaries to Zawinski's Law?

Because lately when I develop something, it always gains either IRC or RSS capabilities at some point.


TBF unlike lesspipe this is an explicit invocation.

In fact, this might be why the author opted to create a separate utility rather than recommend setting --pre/--pre-glob straight into the ripgrep configuration file.


Thanks! This simplifies opening of compressed log files: I've been using zless but lesspipe makes it redundant.


You're welcome, but also check out that second link: I'd be careful about running lesspipe on untrusted inputs. It looks like this tool might have the same problems, given that it appears to spawn tools like poppler[1].

[1]: https://github.com/phiresky/ripgrep-all/blob/master/src/adap...


Kind of, though this tool does a lot more (caching, recursing into archives, and extracting all text) and is a lot faster (for the file types it can parse; lesspipe knows more), and of course lesspipe is only indirectly usable for recursive searching.

Also, most of it is completely safe Rust, so no out-of-bounds writes there :). The most dangerous part currently is probably the PDF parser.


This is awesome!

One of the features I really liked about macOS is Spotlight. Is there some (fast) equivalent that I've been missing on Linux? I'm aware of `locate`, but that only matches the names of files, not their content. Is there a search engine that indexes local content as well?


I believe Baloo (a KDE project) does exactly that (indexing local content). There are different tools you can then use to search for indexed data: I assume KRunner will be the closest to Spotlight, but you can search for files right from Dolphin, if you use it.

To be honest, I've found Baloo to be a resource hog in the past, so I now have the habit of disabling it right after installing a distro that comes with it (e.g. Kubuntu) or not installing it at all otherwise (e.g. on Debian or Arch). I should probably give it a second chance, though: on paper, that's the right approach to search.


Baloo is basically Spotlight and probably more, but woe to you if you're not the typical happy-smile business user: have a few kernel source trees lying around in your /home and you'll get acquainted with Baloo sooner (spinning rust) or later (SSD), because that beast manages to saturate even SSD IOPS and hands out segfaults on every other file while doing so.

You can disable content indexing and just search for filenames (the 80% rule), but that switch is hidden in System Settings, or you even need to drop down to balooctl. Index corruption is also a thing...

That being said, if you are a brave KDE user and don't mind the hassle, it might even work 80% of the time. In theory it's a great idea, but it never worked reliably for me. If you are a bored dev there is probably a lot of low-hanging fruit there: seccomp, some simple heuristic to not overload on IOPS, probably more efficient DB structures than LevelDB, and so on...

It's a step forward from Nepomuk(?), which did the full RDF semantic web stuff, fed a relational database, and reliably killed your HDD-based desktop in the early 2000s, but it's still a nasty surprise when using KDE.

It's still a cool idea, but it needs some love and contributors to work reliably on all kinds of nasty setups.


I’ve found searching for files on KDE (with Baloo) to be, for some reason, really inaccurate and hit-or-miss. Sometimes, even if I remember the exact name of the file, it won’t show up in the search results. Partial names have a lower probability of showing up. Slightly inaccurate names (with a Levenshtein distance of 1 or 2 to the actual name) have a very low probability of showing up. I end up searching for it using find or ag (the silver searcher).

Baloo is also indeed a resource hog. It can use 100% of a CPU core for hours and hours, while it’s indexing. But I’d be fine/happy with that if it just worked properly.


Ah, I can confirm all of these issues. I thought that was my fault, so these are probably bugs.


Baloo is the alphabetically first dependency of Dolphin in Debian (both Stretch and Buster), so you either have to use a different file manager or disable it after install.


https://wiki.gnome.org/Projects/Tracker

Most GNOME-based distros have that available, if not enabled by default. Not sure what kind of support it has for different file types.


There's Recoll: https://www.lesbonscomptes.com/recoll/

I haven't used it, so I'm not sure how fast it is, but I've seen it recommended several times.


Recoll is excellent. It was a game changer for me once I finally set it up on a NAS at home with a web interface accessible by port forwarding over SSH. Though the web interface requires some tiny tweaks to be mobile responsive.

+ It searches within compressed file types recursively.

+ Searches damn near everything

+ The huge number of ways to interact with it (GUI, Python, command line, web interface), combined with an extensive if kinda weird query language, makes it clear it's been refined for a long time.

+ Windows GUI

- Pain in the ass to make work right on Windows, and the indexing on Windows seems to take way longer for some reason.


Recoll is great. I use it like Gmail search for my whole file system and the files that get put there... who needs extensive folder organization when you can find anything, anywhere?


Only for PDFs (and I use/develop it specifically for academic PDFs): https://github.com/bellecp/fast-p



Or, if your usage is sufficiently infrequent to not need caching and if you'd rather not depend on one more tool, simply configure ripgrep to use a `--pre` flag (which is what rga is doing :).

See config and gist at https://github.com/BurntSushi/ripgrep/issues/1252
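
A very rough sketch of what that setup can look like (the script name and extensions here are just placeholders; the gist linked above is the real thing):

    #!/bin/sh
    # rg-preview: a tiny --pre preprocessor somewhere on your PATH.
    # Then put these two lines into the ripgrep config file that
    # RIPGREP_CONFIG_PATH points at:
    #   --pre=rg-preview
    #   --pre-glob=*.{pdf,gz}
    case "$1" in
      *.pdf) pdftotext "$1" - ;;   # poppler's pdftotext, plain text to stdout
      *.gz)  gzip -dc "$1" ;;
      *)     cat "$1" ;;
    esac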


It's pretty hard to make rg search files within archives without actually extracting them, though (I've seen more than one request for that in the ripgrep issues [1]), which is why rga includes streaming, recursive decompression of archives, including running other preprocessors (like the PDF one) within them (second example in the above post and readme).

[1]: https://github.com/BurntSushi/ripgrep/issues/918


That's a pretty cool feature indeed, I had missed it (and now I understand the need for a separate binary with archive handling & streaming logic). Thanks for correcting.


You might get some inspiration from Strigi, a C++ library and program that also does recursive decompression for searching. It supports indexing PDFs, archives, and emails without writing temporary files.

https://www.vandenoever.info/software/strigi/akademy2006.pdf

I've often considered rewriting Strigi in Rust, but there are too many other projects to pursue at the moment.


Looks great.

Off topic, but did anyone else see Phiresky's other work: an AI that mimics human backchanneling so it pretends it's listening (saying "yeah", "uh-huh", etc.)? Remarkable work! https://streamable.com/dycu1

GitHub repo here: https://github.com/phiresky/backchannel-prediction


Thank you! Yeah, I like to think I have some interesting projects. Maybe I should make an overview page, because you can only pin 6 repositories on GitHub :)


No Windows support? :\

Also, it suffers from the exact same problem I see pretty much every text search tool suffer from: it doesn't support other UTF encodings like UTF-16, meaning you'll miss files.

Not sure if it can search in single-line mode either... would be nice if anyone knows options to do that. With grep etc. it always sucks not to be able to search for line feeds for no good reason.


Windows support should be fairly trivial; the main problem is packaging it up, and that Travis doesn't like Windows.

> doesn't support other UTF encodings like UTF-16

UTF-16 should in fact work, since ripgrep supports it too. Looks like my binary file detection is at fault [1]..

> Not sure if it can search in single-line mode either

That works fine, just use `rga --multiline '\n' fname`

[1]: https://github.com/phiresky/ripgrep-all/issues/5


Ah I see, thanks. You shouldn't require a BOM though. There are often files without a BOM, and not all files with UTF-16 in them are text either (EXEs etc.). I would just search for all the possible UTF byte sequences (UTF-7, UTF-8, UTF-16LE/BE, UTF-32, possibly with a switch to allow specifying subsets or additional encodings if you can support that?) regardless of BOM.


In ripgrep itself, you can apparently only search files in encodings other than UTF-16LE-with-BOM by manually specifying `--encoding UTF16BE` etc.
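
Something like this, if I read the docs right (the exact encoding label spelling is whatever encoding_rs accepts):

    rg --encoding utf-16be 'needle' some-file.txt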

I could maybe add encoding detection myself, but I'm kind of discouraged, since not even the Unix `file` tool can detect those files as text, and a normal editor opens at least a UTF-16BE file completely wrong. So I'm not sure if I want to spend my time trying to write heuristic detection for those, especially since UTF-16 itself is broken and shouldn't really exist at all...

I'll look into what encoding_rs has to offer.


Thanks! Yeah I wouldn't try to detect encodings or use heuristics either. If you could just reduce a single pattern into the OR of a bunch of byte sequences in each encoding, I think that should work? I'm not sure how easy that is with the interface you're given. (I wouldn't call UTF-16 'broken', but either way... it's a reality; a huge fraction of the time when you're searching binary files on Windows it's to find text inside executables, which on Windows are generally UTF-16.)
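
To illustrate the idea (this is just me sketching the concept by hand, not something ripgrep does for you): searching for the literal "hello" as UTF-16LE comes down to searching for its byte sequence, e.g.

    # each ASCII char followed by a NUL byte; -a because the NULs trip binary detection
    rg -a '(?-u)h\x00e\x00l\x00l\x00o\x00' some.dll
    # the real feature would OR the UTF-8, UTF-16LE and UTF-16BE forms together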


Travis works fine for me for building Windows binaries.

It's still in beta, but I'm wondering what problems you are having here. Have you tried windows in .travis.yml?


Nope, haven't tried it. I just saw that ripgrep is using AppVeyor for Windows instead, so I assumed it doesn't work on Travis. I was actually just trying to add AppVeyor to this [1], but I'm getting a weird error.

[1]: https://ci.appveyor.com/project/phiresky/ripgrep-all/builds/...


Give it a crack; not sure what dependencies ripgrep is pulling in, but I've had good experiences so far. They seem to be doing OK with rolling it out.

Just a few more lines in your Travis deploy setup.


> Also, suffers from the exact same problem I see pretty much every text search tool suffer: doesn't support other UTF encodings like UTF-16, meaning you'll miss files.

Did you try it? ripgrep supports UTF-16 just fine. It even supports it automatically and transparently, via BOM detection. If there's no BOM, then you must specify the encoding explicitly.


Yes, I tried it. Without a BOM, because you can't rely on BOMs being there.


At that point, you don't know the encoding, so the only thing available to you is heuristics (including needing to guess the byte order). Either way, I don't think it's accurate to claim that ripgrep doesn't support UTF-16.


> At that point, you don't know the encoding, so the only thing available to you is heuristics (including needing to guess the byte order).

That's emphatically not the case though. I explained how you could handle it here without requiring BOM or byte order knowledge or heuristics: https://news.ycombinator.com/item?id=20198208

> Either way, I don't think it's accurate to claim that ripgrep doesn't support UTF-16.

Having UTF-16 text in a file doesn't imply the file has a BOM, and when I tried it, rga didn't work on UTF-16 that didn't have a BOM. If that's still "ripgrep supports UTF-16" in your view, then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a particular fact, not argue over its wording.


> I explained how you could handle it here without requiring BOM or byte order knowledge or heuristics:

Yes, that's an absurd amount of development effort and would result in a serious performance regression. (To the point that it's likely nobody would use ripgrep at all, so your approach would need to be put behind a flag, which seriously hinders the feature since it's no longer automatic.) Moreover, that only covers match detection, but does not actually cover output. Once you find the match, you have to determine how to print it, and the device you're printing to very likely does not support things like UTF-32 or even UTF-16 in many cases. Moreover, there are many operations that ripgrep does in a post-processing step (like limiting the output to a certain number of characters per line) that require knowing the presumed encoding (which is always UTF-8 by that point, since the data will have been transcoded to UTF-8 if UTF-16 were detected).

> UTF-16 doesn't require BOMs

You cannot decode UTF-16 without knowing its byte order. The BOM tells you that. If there is no BOM, then you need to get the byte order from some other source (or guess it). ripgrep requires the user to tell it what it is. This seems entirely reasonable to me, especially since most or all UTF-16 files I've seen include a BOM. Notably, ripgrep's support for UTF-16 is good enough for VS Code, which has a pretty sizable Windows user base.

> your view then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a fact, not argue over its wording or semantics.

At the end of the day, my concern is to correct misleading claims about what ripgrep can and can't do. ripgrep clearly has support for UTF-16, and this is actually one of its marquee features that sets it apart from other search tools. For example, grep doesn't (and literally can't) support UTF-16 at all. The only way to search UTF-16 encoded files with grep is to transcode the file to UTF-8 first or to set the locale to C, and search for the binary encoding directly. ripgrep does a lot better than that, so to lump it in with "pretty much every text search tool" is pretty misleading from my perspective.


[flagged]


I'm not saying you were intentionally misleading anyone. What I'm saying is that I'm trying to correct something that I saw as misleading. Criticism is totally fair, but criticism of criticism should be fair game too. I totally appreciate that we shouldn't take these things too personally, but that cuts both ways. I wasn't saying you were trying to be misleading; I was trying to point out an inaccuracy. Given that ripgrep is my project, and myths spread easily, I try to stay on top of that.

> If you don't care or it's too much work

I mean, I do care. Windows users and the prevalence of UTF-16 is why I added the automatic transcoding in the first place. But it's not just that it's too much work; as I said, the performance regression would be so serious that people would literally stop using ripgrep unless it was disabled by default. (In addition to the fact that printing the results puts you in a precarious situation.)


Windows 10 built-in search works pretty well, and easily finds things in PDFs, even non-English and non-Roman alphabets.


I see threads recommending software that does full-text indexing of PDFs and descends into archives. I remember that a few years ago there were some security vulnerabilities in exactly this kind of software on some fairly modern Linux distro.

So if you enable or use it, make sure the computer is isolated from anything of value to you, and certainly don't run it on your main work or personal machine.


Well, it uses the exact same PDF parsing library that e.g. Evince uses, so if you ever open untrusted PDFs in a normal viewer, this will expose you to the same danger (maybe less, since it only extracts text). But yeah, if there were a nice, safe PDF library written in pure Rust, I would of course link against that.


Are you alleging there is a security issue in this software? It's open source, feel free to show us where.


GP said "exactly this _kind_ of software" -- i.e. software with a similar type of functionality. He did not claim that the posted code specifically has a vulnerability.


This is something I've needed for a long time! Has anyone docker-ized it yet?

Edit: unsure why I'm getting downvoted here - Docker is an extremely convenient way to run things on a server with Unraid that doesn't have a full distro or an easy way to add packages locally.


I'm assuming this is sarcasm? Why on God's green earth would you want to dockerize a utility like this?

That aside, it does seem like a very useful utility.

To respond to your edit, I don't think dockerization is necessary for a simple binary. You're not running anything that could benefit from it. It makes sense if you want to run some kind of network service (torrent client, web server, FTP, SMB, blah blah blah) and forward a port to your instance; that way, you can pre-package it on something like Unraid. Here it doesn't make sense, and (as far as I know) even Unraid can install a simple binary.


Why wouldn't you want to dockerize it? It has a huge number of potential dependencies (pandoc, pdftotext, etc). Getting all those working on my Unraid storage server for searching is going to be painful.


I feel like this case might be better served by something like flatpak or appimage. While I'm usually not a fan, this is basically exactly what they're built for, as compared to docker, which is not designed with this in mind. I don't want to have to prefix every command with docker stuff.


So, if you're trying to grep through 100+ MB zip files on a shared folder (Windows box), is the most efficient way to do that to use Ansible or the like to remote into the server and run the commands there? Of course it could use CPU, so that could cause problems if it's a production server.

I would appreciate hearing from anyone with experience in this area.


How well does it handle zip bombs and 42.zip?


There is an option to limit the maximum archive recursion, `--rga-max-archive-recursion=`, which defaults to 4. That is also needed to handle droste.zip [1], which is a zip file that contains itself. So for huge archives it will simply take a fairly long time, unless you limit recursion more.

[1]: https://alf.nu/ZipQuine
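
For example, to tighten it for a directory full of deeply nested archives (pattern and path are just placeholders):

    rga --rga-max-archive-recursion=2 'pattern' backups/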


phiresky, does this use libripgrep under the hood?

Or do you shell out to the ripgrep binary?


libripgrep isn't currently a thing (not merged to mainline). On reddit, the author noted that ripgrep (and other utilities) need to be on the PATH, which is one of the issues related to windows packaging.


libripgrep has been on master since 0.10.0. There just isn't any high level documentation for it.

It's not clear whether libripgrep would be a good fit for this project or not. They would need to reroll all the arg parsing logic themselves. libripgrep is really about building your own search tools (or more specialized tools) using the same internal infrastructure as ripgrep. But yeah, this is why I need high level docs to explain this stuff. I've been putting it off until I get bstr straightened out.


> It's not clear whether libripgrep would be a good fit for this project or not

I actually looked into using libripgrep for this, but then I decided not to because of (a) not wanting to handle arg parsing myself (ripgrep has sooo many arguments) and (b) missing or hard-to-find documentation.

The main reason it might be a good idea is that currently ripgrep does not know at all about a single file returning multiple "files", and all line prefixes are "hardcoded" (e.g. "Page X: hello" in PDFs is just prefixed per line). Also, I can't currently rely on ripgrep's binary detection, because it would have to happen for "parts of files" from ripgrep's perspective.

It would be great if ripgrep had a slightly more advanced preprocessing API - allow returning multiple "files" per filename input, maybe even with a "sourcemap" of line<->Page etc.


> libripgrep has been on master since 0.10.0. There just isn't any high level documentation for it.

Oh damn, sorry (when I checked yesterday the branch was still there and the "doc" PR was still open so I assumed it wasn't merged yet)


I'd like to try libripgrep out in one of my projects; maybe I could also take on the challenge of attempting to document it.


So, a lot of it is actually written already. :-) https://github.com/BurntSushi/blog/blob/ag/libripgrep/conten... (The formatting is FUBAR, so you'll want to look at the raw text.)

There's just a lot of polish that needs to be done, and converting portions of it into appropriate API documentation. Unfortunately, I don't really have the bandwidth to mentor this at the moment. :-( However, with that said, one super useful thing you could do is try out libripgrep and then give feedback [1] on how it worked for you, and in particular, which things were hard to figure out.

[1] - https://github.com/BurntSushi/ripgrep/issues/1009


> rga simply runs ripgrep (rg) with some options set, especially --pre=rga-preproc and --pre-glob.

If all it does is run ripgrep with certain options, creating a corresponding alias might be a simpler solution than adding another binary to the system.
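
Something in this direction, I suppose (rough sketch; the alias name and the glob are just guesses at which file types you care about):

    alias rgz="rg --pre=rga-preproc --pre-glob='*.{pdf,docx,zip,tar.gz}'"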


It's true that you could mostly replace the "rga" binary with `rg --pre=rga-preproc` and just publish the rga-preproc binary, which does most of the work. But rga itself adds convenience regarding filter selection and other config options like caching.


At the very least, you still need the rga-preproc binary.


That makes more sense; the way the sentence is written made it appear to me like it's just a very simplistic wrapper, but looking through the code shows it's more elaborate than that.


FileSearchEX works well if you need a Windows searchy tool.


It seems like this might be useful for journalists.



