Lmgrep: Lucene-based grep-like utility (jocas.lt)
106 points by merename 6 months ago | 25 comments

Very nice, examples are in the repository https://github.com/dainiusjocas/lucene-grep#text-analysis

Thanks! The text analysis machinery baked into `lmgrep` is the thing that I'm very proud of.

I agree it is its strength, keep it as prominent as possible!

I'm surprised/impressed that it can build an index and search it in 29 milliseconds. Pretty clever solution!

Thanks! This speed is possible because of two things: GraalVM native-image (fast startup) and Lucene (for doing the work).

Wow. This is nice. I can see putting it to use with csv and json with gron or something. I don’t have that much markdown to search, but having a tool be just there in the shell might change that.

I see they have enabled Snowball stemmers. I wonder if using other Lucene analyzers such as Voikko for Finnish is feasible. Snowball wasn’t particularly good when text gets complex. I used to deal with Lucene and Solr way back when. Based on the OP, I see that Graal requires changes.

Author of lmgrep here. Could you create an issue for Finnish text analysis? https://github.com/dainiusjocas/lucene-grep/issues

I'd take a look.

Here goes: https://github.com/dainiusjocas/lucene-grep/issues/84

I realize some relatively obscure Finnish stemmer and Lucene with GraalVM aren't exactly a common use case. I did some testing and provided my use case. I certainly have plenty of English-language content to search using lucene-grep. So, thank you for making it!

One use case where Lucene’s tokenizing approach tends to work less well than something like grep is when, for some reason, we want to query for a substring of a token, e.g. if the text is “I walked through the town” and I want to search for “oug”. Does lmgrep offer a performant solution for this kind of case, or would it be a situation where it’s better to stick with regular grep?
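
For readers wondering how a token-based index can answer a substring query like “oug” at all, one common technique is to index character n-grams of each token. The sketch below is a toy illustration in plain Python and is not how lmgrep works; the function names are made up for this example. Lucene does ship n-gram tokenizers, but whether lmgrep exposes them is not stated in this thread.

```python
from collections import defaultdict


def char_ngrams(token, n=3):
    """Return every substring of length n found in the token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def build_ngram_index(docs, n=3):
    """Map each character n-gram to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            for gram in char_ngrams(token, n):
                index[gram].add(doc_id)
    return index


def search_substring(index, docs, query, n=3):
    """Return sorted ids of documents where some token contains the query.

    The query must be at least n characters long (an assumption of this sketch).
    """
    query = query.lower()
    grams = char_ngrams(query, n)
    if not grams:
        raise ValueError("query must be at least n characters long")
    # Documents must contain every n-gram of the query to be candidates.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The n-grams may come from different tokens, so verify each candidate.
    return sorted(
        doc_id
        for doc_id in candidates
        if any(query in token for token in docs[doc_id].lower().split())
    )
```

The trade-off is index size: every token contributes several n-grams, so the index grows noticeably compared with whole-token indexing.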

I’ve been looking for something exactly like this. I keep the complete text of articles I’ve enjoyed but, to date, effective searching has meant spinning up an ES instance, which is painful. This is a specific use case that isn’t necessarily well-served by something like grep or ripgrep. I’ll definitely try this, thanks - looks very elegant.

Can you say more? I'm curious.

Is it automated in some way during web browsing, do you remember to copy articles to a folder when you've enjoyed them enough, or do you use a reading app/e-reader so they're already downloaded?

Sure, I just hacked something together as I wanted it to fit around my existing workflow. I’ve been using Instapaper since forever, and I wanted something built around that instead of “every URL I visit”, as most shit you read has a low signal:noise ratio.

I wrote some Python to drive Selenium to get the URLs (not the full text) from Instapaper, then pass those URLs to newspaper3k, where a lot of the downloading and parsing work is done. I then save the output to SQLite. From there I was previously having ES build indexes but recently just switched to hosted Algolia, which seems to be basically free for my use case and has some nice libraries for building real-time search front ends too. I’ll be trying lmgrep as a substitution though.
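
The storage step of a pipeline like this can be quite small. Here is a minimal sketch using only Python's standard `sqlite3` module. The table layout and the function names are assumptions for illustration, since the comment does not describe the actual schema; the Selenium and newspaper3k steps are left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the article store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def save_article(conn, url, title, body):
    """Insert an article, replacing any earlier copy saved under the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles (url, title, body) VALUES (?, ?, ?)",
            (url, title, body),
        )


def load_articles(conn):
    """Return all stored articles as (url, title, body) tuples, ordered by URL."""
    return conn.execute(
        "SELECT url, title, body FROM articles ORDER BY url"
    ).fetchall()
```

Keying on the URL means re-running the fetch job does not create duplicate rows.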

The key thing about searching the text of articles you’ve read is that you want an intelligent ranking of all articles that bear on a subject, in order of relevance. That’s not something you can get with grep/ripgrep. ES is pretty good at it out of the box. But it’s also a pain to set up and run - you’ll probably end up needing something like Docker.
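
To make "in order of relevance" concrete, here is a rough sketch of BM25-style scoring, the family of ranking function that Lucene uses by default. It is a simplified toy with whitespace tokenization, not what ES or lmgrep actually run, and the function names and parameter defaults are assumptions of this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (far cruder than a real analyzer)."""
    return text.lower().split()


def bm25_rank(docs, query, k1=1.2, b=0.75):
    """Return ids of matching documents, best match first."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for doc_id, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

grep answers "does this line match?", while a scorer like this answers "which document matches best?", which is the difference the comment is pointing at.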

There are a thousand different ways you could do something like this - this is just the way I do it.

Not OP so I can't speak for them. There are a bunch of ways to do this, ranging from more turnkey solutions to collections of scripts and extensions you can use. On the turnkey side, there are programs like ArchiveBox[1] which take links and store them as WARC files. You can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to set something up yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and manually wget those URLs. For a reference to the more "bootstrapped" version, I'll link to Gwern's post on their archiving setup [2]. It's fairly long, so I advise skipping to the parts you're interested in first.

1: https://github.com/ArchiveBox/ArchiveBox

2: https://www.gwern.net/Archiving-URLs
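
For the do-it-yourself route, pulling URLs out of Firefox's history can be sketched as below. Firefox keeps history in a `places.sqlite` file in the profile directory, with visited pages in the `moz_places` table. The helper names are made up for this example, and it is safest to run it against a copy of the file, since Firefox locks the database while it is running.

```python
import sqlite3


def open_history(path):
    """Open a copy of places.sqlite read-only (path is supplied by the caller)."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def recent_urls(conn, limit=100):
    """Return the most recently visited URLs, newest first."""
    rows = conn.execute(
        "SELECT url FROM moz_places "
        "WHERE last_visit_date IS NOT NULL "
        "ORDER BY last_visit_date DESC "
        "LIMIT ?",
        (limit,),
    )
    return [url for (url,) in rows]
```

The resulting list can then be written to a file and handed to wget or a similar downloader.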

I'm the author of `lmgrep`. Happy to hear that you liked it. I have a similar use case: searching for blog posts that are in markdown source files.

Have you considered using Docfetcher or Recoll?

Or Zotero?

I think one of the best code search tools I've seen is the one here: https://source.chromium.org/chromium

I guess that's what Google uses internally? Is there an open source alternative?

I think there's a company attempting to implement their own version of something similar. A very important part of search is also the understanding of language semantics. Something that is really cool for this is Kythe [0].

[0] - https://kythe.io/

I am pretty sure that's what sourcegraph is trying to do.

There is DXR from Mozilla but I'm not sure how generalised it is.


There is also Sourcegraph.

Thank you, DXR looks amazing.

Neat. This is similar to a tool I have been working on (but need to finish off) as I saw the same issue.

Except rather than build an index I brute forced the search each time. For most repositories it’s fast enough even with ranking.

https://github.com/boyter/cs For those interested it’s still very WIP with noticeable issues in TUI mode.
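
As a rough picture of the "no index, scan everything on each query" approach, here is a small sketch that counts literal, case-insensitive matches per file and ranks files by that count. It is only an illustration and not how cs is implemented; the function names and the ranking rule are assumptions of this example.

```python
import os


def read_tree(root):
    """Yield (path, text) for every readable UTF-8 file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    yield path, handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files


def brute_force_search(files, query):
    """Rank (path, text) pairs by how often the query occurs, most hits first."""
    needle = query.lower()
    if not needle:
        raise ValueError("query must not be empty")
    hits = []
    for path, text in files:
        count = text.lower().count(needle)
        if count:
            hits.append((count, path))
    # Most matches first; ties broken alphabetically by path.
    return [path for count, path in sorted(hits, key=lambda h: (-h[0], h[1]))]
```

Usage would look like `brute_force_search(read_tree("."), "lucene")`; every query rereads the whole tree, which is the cost traded for not maintaining an index.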

Is there an actual link to the source code somewhere in that blog post?

Yes, it is there as an in-text link:

"Then the most complicated part was to prepare executable binaries for different operating systems. Plenty of CPU, RAM, VirtualBox with Windows and macOS virtual machines, and here we go."

