"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."
Because lately when I develop something, it always gains either IRC or RSS capabilities at some point.
In fact, this might be why the author opted to create a separate utility rather than recommend setting --pre/--pre-glob straight into the ripgrep configuration file.
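For context, wiring a preprocessor straight into ripgrep's own config file (the one pointed to by `RIPGREP_CONFIG_PATH`, one argument per line) would look roughly like this; the preprocessor name and globs are illustrative:

```
--pre=rga-preproc
--pre-glob=*.{pdf,epub,zip}
```

The downside of doing it globally like this is that every plain `rg` invocation then pays the preprocessing cost, which is presumably why a separate binary is the cleaner split.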
Also, most of it is completely safe Rust, so no out-of-bounds writes there :). The most dangerous part currently is probably the PDF parser.
One of the features I really liked about MacOS is spotlight. Is there some (fast) equivalent that I’ve been missing on Linux? I’m aware of `locate`, but that only matches the names of files, not their content. Is there a search engine that indexes local content as well?
To be honest, I've found Baloo to be a resource hog in the past, so I've developed the habit of disabling it right after installing a distro that ships with it (e.g. Kubuntu), or of not installing it at all otherwise (e.g. on Debian or Arch). I should probably give it a second chance, though: on paper, that's the right approach to search.
You can disable content indexing and just search filenames (the 80% rule), but the switch is hidden in System Settings, or you may even need to resort to balooctl. Also, index corruption is a thing...
That being said, if you are a brave KDE user and don't mind the hassle, it might even work 80% of the time. In theory it's a great idea, but it has never worked reliably for me. If you are a bored dev, there's probably a lot of low-hanging fruit there: seccomp, some simple heuristic to avoid overloading on IOPS, probably more efficient DB structures than LevelDB, and so on...
It's a step forward from Nepomuk, which did the full RDF semantic-web stuff, fed a relational database, and reliably killed your HDD-based desktop in the early 2000s, but it's still a nasty surprise when using KDE.
It's still a cool idea, but it needs some love and contributors to work reliably on all kinds of nasty setups.
Baloo is also indeed a resource hog. It can use 100% of a CPU core for hours and hours, while it’s indexing. But I’d be fine/happy with that if it just worked properly.
Most gnome based distros have that available, if not enabled by default. Not sure what kind of support it has for different file types.
I haven't used it, so I'm not sure how fast it is, but I've seen it recommended several times.
+ It searches within compressed file types recursively.
+ Searches damn near everything
+ The huge number of ways to interact with it (GUI, Python, command line, web interface), combined with an extensive if kinda weird query language, makes it clear it's been refined for a long time.
+ Windows GUI
- Pain in the ass to make it work right on Windows, and indexing on Windows seems to take way longer for some reason.
See config and gist at https://github.com/BurntSushi/ripgrep/issues/1252
I've considered rewriting Strigi in rust quite often but too many other projects to pursue at the moment.
Off topic, but did anyone else see Phiresky's other work: an AI that mimics human backchanneling so it pretends it's listening (saying "yeah", "uh-huh", etc.)? Remarkable work!
GitHub repo here: https://github.com/phiresky/backchannel-prediction
Also, suffers from the exact same problem I see pretty much every text search tool suffer: doesn't support other UTF encodings like UTF-16, meaning you'll miss files.
Not sure if it can search in single-line mode either... would be nice if anyone knows options to do that. With grep etc. it always sucks not to be able to search for line feeds for no good reason.
> doesn't support other UTF encodings like UTF-16
UTF-16 should in fact work, since ripgrep supports it too. Looks like my binary file detection is at fault...
> Not sure if it can search in single-line mode either
That works fine, just use `rga --multiline '\n' fname`
I could maybe add encoding detection myself, but I'm kind of discouraged since not even the unix `file` tool can detect those files as text, and a normal editor opens at least a UTF16BE file completely wrong. So I'm not sure if I want to spend my time on trying to write heuristic detection on those, especially since UTF16 itself is broken and shouldn't really exist at all...
I'll look into what encoding_rs has to offer.
It's still in beta, but I'm wondering what problems you're having here? Have you tried Windows in .travis.yml?
Just a few more lines in your travis deploy setup.
Did you try it? ripgrep supports UTF-16 just fine. It even supports it automatically and transparently, via BOM detection. If there's no BOM, then you must specify the encoding explicitly.
That's emphatically not the case though. I explained how you could handle it here without requiring BOM or byte order knowledge or heuristics: https://news.ycombinator.com/item?id=20198208
> Either way, I don't think it's accurate to claim that ripgrep doesn't support UTF-16.
Having UTF-16 text in a file doesn't imply the file has to have a BOM, and when I tried it, rga didn't work on UTF-16 that didn't have a BOM. If that's still "ripgrep supports UTF-16" in your view, then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a particular fact, not argue over its wording.
Yes, that's an absurd amount of development effort and would result in a serious performance regression. (To the point that it's likely nobody would use ripgrep at all, so your approach would need to be put behind a flag, which seriously hinders the feature since it's no longer automatic.) Moreover, that only covers match detection, but does not actually cover output. Once you find the match, you have to determine how to print it, and the device you're printing to very likely does not support things like UTF-32 or even UTF-16 in many cases. Moreover, there are many operations that ripgrep does in a post-processing step (like limiting the output to a certain number of characters per line) that require knowing the presumed encoding (which is always UTF-8 by that point, since the data will have been transcoded to UTF-8 if UTF-16 were detected).
> UTF-16 doesn't require BOMs
You cannot decode UTF-16 without knowing its byte order. The BOM tells you that. If there is no BOM, then you need to get the byte order from some other source (or guess it). ripgrep requires the user to tell it what it is. This seems entirely reasonable to me, especially since most or all UTF-16 files I've seen include a BOM. Notably, ripgrep's support for UTF-16 is good enough for VS Code, which has a pretty sizable Windows user base.
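For what it's worth, the BOM sniffing itself is tiny; a minimal sketch (not ripgrep's actual code, which lives in its encoding layer) might look like:

```rust
/// Detect a UTF-16 byte-order mark at the start of a buffer.
/// Returns the detected encoding name, or None if no BOM is present
/// (the case where ripgrep needs the user to pass the encoding explicitly).
fn detect_utf16_bom(buf: &[u8]) -> Option<&'static str> {
    match buf {
        [0xFF, 0xFE, ..] => Some("UTF-16LE"),
        [0xFE, 0xFF, ..] => Some("UTF-16BE"),
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_utf16_bom(&[0xFF, 0xFE, b'h', 0x00]), Some("UTF-16LE"));
    assert_eq!(detect_utf16_bom(b"hello"), None);
    println!("ok");
}
```

Without those two leading bytes, the only remaining options are an explicit flag from the user or a statistical guess over the content, which is exactly the heuristic territory being debated here.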
> your view then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a fact, not argue over its wording or semantics.
At the end of the day, my concern is to correct misleading claims about what ripgrep can and can't do. ripgrep clearly has support for UTF-16, and this is actually one of its marquee features that sets it apart from other search tools. For example, grep doesn't (and literally can't) support UTF-16 at all. The only way to search UTF-16 encoded files with grep is to transcode the file to UTF-8 first or to set the locale to C, and search for the binary encoding directly. ripgrep does a lot better than that, so to lump it in with "pretty much every text search tool" is pretty misleading from my perspective.
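To make the contrast concrete, here is roughly what each workflow looks like (file name illustrative; the file is created here so the example is self-contained):

```shell
# Create a BOM-less UTF-16LE test file (iconv emits no BOM for UTF-16LE).
printf 'hello world\n' | iconv -f UTF-8 -t UTF-16LE > notes-utf16.txt

# grep cannot decode UTF-16 itself: transcode to UTF-8 first.
iconv -f UTF-16LE -t UTF-8 notes-utf16.txt | grep hello

# ripgrep handles a BOM'd file transparently; for a BOM-less one,
# name the encoding explicitly:
#   rg -E utf-16le hello notes-utf16.txt
```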
> If you don't care or it's too much work
I mean, I do care. Windows users and the prevalence of UTF-16 is why I added the automatic transcoding in the first place. But it's not just that it's too much work; as I said, the performance regression would be so serious that people would literally stop using ripgrep unless it was disabled by default. (In addition to the fact that printing the results puts you in a precarious situation.)
So, if you enable it or use it, make sure the computer is isolated from anything of value to you, especially if it is your main work or personal machine.
Edit: unsure why I'm getting downvoted here - Docker is an extremely convenient way to run things on a server with Unraid, which doesn't have a full distro or an easy way to add packages locally.
That aside, it does seem like a very useful utility.
To respond to your edit, I don't think dockerization is necessary for a simple binary. You're not running anything that could benefit from it. It makes sense if you want to run some kind of network service (torrent client, web server, FTP, SMB, blah blah blah) and forward a port to your instance; that way, you can pre-package it on something like Unraid. This tool isn't that, and (as far as I know) even Unraid can install a simple binary.
I would appreciate hearing from anyone with experience in this area.
Or do you shell out to the ripgrep binary?
It's not clear whether libripgrep would be a good fit for this project or not. They would need to reroll all the arg parsing logic themselves. libripgrep is really about building your own search tools (or more specialized tools) using the same internal infrastructure as ripgrep. But yeah, this is why I need high level docs to explain this stuff. I've been putting it off until I get bstr straightened out.
I actually looked into using libripgrep for this, but then I decided not to because of (a) not wanting to handle arg parsing myself (ripgrep has sooo many arguments), (b) missing or hard to find documentation.
The main reason it might be a good idea is because currently ripgrep does not know at all about a single file returning multiple "files", and all line prefixes are "hardcoded" (e.g. Page X: hello in pdfs is just prefixed per line). Also I can't rely on ripgrep's binary detection currently, because it would have to happen for "parts of files" from the perspective of ripgrep.
It would be great if ripgrep had a slightly more advanced preprocessing API - allow returning multiple "files" per filename input, maybe even with a "sourcemap" of line<->Page etc.
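A rough sketch of what such an API could look like (entirely hypothetical types, not anything ripgrep exposes today):

```rust
// Hypothetical sketch: a preprocessor expands one on-disk file into
// several logical "files", each carrying a sourcemap from output line
// back to its source unit (e.g. a PDF page number).
struct LogicalFile {
    name: String,           // e.g. "report.pdf:pages"
    text: String,           // extracted UTF-8 text
    line_to_page: Vec<u32>, // sourcemap: line index -> page
}

trait Preprocessor {
    fn expand(&self, path: &str) -> Vec<LogicalFile>;
}

// Toy implementation standing in for a real PDF extractor.
struct FakePdf;

impl Preprocessor for FakePdf {
    fn expand(&self, path: &str) -> Vec<LogicalFile> {
        vec![LogicalFile {
            name: format!("{}:pages", path),
            text: "hello\nworld".to_string(),
            line_to_page: vec![1, 2],
        }]
    }
}

fn main() {
    let files = FakePdf.expand("report.pdf");
    // A match on line index 1 would be reported as "Page 2".
    assert_eq!(files[0].line_to_page[1], 2);
    println!("ok");
}
```

With something like this, ripgrep's own binary detection and line formatting could run per logical file instead of being faked with hardcoded "Page X:" prefixes.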
Oh damn, sorry (when I checked yesterday the branch was still there and the "doc" PR was still open, so I assumed it hadn't been merged yet).
There's just a lot of polish that needs to be done, and portions of it need converting into appropriate API documentation. Unfortunately, I don't really have the bandwidth to mentor this at the moment. :-( However, with that said, one super useful thing you could do is try out libripgrep and then give feedback on how it worked for you, and in particular, which things were hard to figure out.
 - https://github.com/BurntSushi/ripgrep/issues/1009
If all it does is run ripgrep with certain options, creating a corresponding alias might be a simpler solution than adding another binary to the system.
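For example (the preprocessor script name here is made up; substitute whatever wrapper you actually use):

```shell
# A shell function is a bit more flexible than an alias, since it
# forwards extra arguments cleanly. pdf2txt.sh is a hypothetical
# wrapper around pdftotext or similar.
rgpdf() {
    rg --pre pdf2txt.sh --pre-glob '*.pdf' "$@"
}
```

That said, part of the point of a separate binary is that rga does more than pass flags (caching, per-format adapters), so a plain alias only covers the simple case.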