Xapian: Open source search engine library

rbanffy · 2024-08-17T14:15:23.000000Z

I remember having used, a very long time ago, a self-hosted search engine on my library of PDFs, and it was unbelievably useful.

I dream about a similar thing that can do OCR on scanned docs and extract text from my also sprawling library of epub and mobi files. If someone builds something like this, with maybe a LOCAL LLM to extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs, and add that to a NAS appliance, it’d be a killer.

theolivenbaum · 2024-08-17T16:36:08.000000Z

That's what our app does: curiosity.ai, local index, support for many file types and apps out of the box, and integrated local OCR, STT and even local LLM

rmholt · 2024-08-17T17:37:18.000000Z

I couldn't find any mention on your website about LOCAL LLMs and according to your FAQ, it requires an account with your website.

Is there a way to run curiosity.ai fully offline, without an account on your servers?

Bluestein · 2024-08-17T16:56:04.000000Z

That is the future :) Much success!

glompers · 2024-08-17T15:30:46.000000Z

DEVONthink 3 [0] (Apple only) will do most of that although I don't keep up at all with its interoperability with LLM extensions.

[0] https://www.devontechnologies.com/apps/devonthink

andyfilms1 · 2024-08-17T14:24:52.000000Z

Evernote will do this: you can feed it a bunch of PDFs and other documents, and it will OCR them and make them all searchable. It's not perfect, but you can also add manual tags for things you know are important.

rbanffy · 2024-08-17T15:25:49.000000Z

At some point I'd love to further train an LLM on all my PDFs and be able to ask it questions.

Jzush · 2024-08-20T16:35:23.000000Z

For text-based PDFs, doc files, etc., this is super easy to do and requires very little technical expertise or configuration. Download the GPT4All client, pick an LLM (the new Meta Llama 3 model works really well), then configure the local documents plugin.

I wrote a scraper to download all of the California EdCode from the government's site and convert it to txt docs, and now I can ask questions about the California EdCode in plain English.

I work in a shared governance capacity that requires us to refer to the Ed code for contractual negotiations and it’s been extremely helpful.

pdw · 2024-08-17T16:56:12.000000Z

> similar thing that can do OCR on scanned docs

It's only part of what you want, but ocrmypdf will add an OCRed text layer to PDF files, making the text selectable and indexable.
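
A minimal sketch of running it over a whole folder from Python, assuming ocrmypdf is installed and on PATH. The helper names `ocr_command` and `ocr_folder` and the folder layout are made up for illustration; `--skip-text` tells ocrmypdf to leave pages that already have a text layer alone.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(src: Path, dst: Path) -> list[str]:
    """Build the ocrmypdf command line for one file (hypothetical helper)."""
    # --skip-text: do not OCR pages that already contain text.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_folder(in_dir: Path, out_dir: Path) -> list[Path]:
    """OCR every PDF in in_dir into out_dir and return the files written."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if ocrmypdf exits non-zero.
        subprocess.run(ocr_command(src, dst), check=True)
        written.append(dst)
    return written
```

Something like `ocr_folder(Path("scans"), Path("scans-ocr"))` would then leave searchable copies next to the originals, ready for whatever indexer you point at them.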

kordlessagain · 2024-08-17T15:14:16.000000Z

I have most of the code for doing this - it just needs to get rewritten for local storage (I was running it on Google Cloud). Need to pick something that doesn't run Solr as a service for local use. With Ollama, we have function calls running, so should be doable. I was also thinking about using Open WebUI.

Bluestein · 2024-08-17T16:01:36.000000Z

> extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs

The big AI players are probably already scraping the bottom of this "barrel" in their search for training data, I am sure ...

infocollector · 2024-08-17T11:59:35.000000Z

This project has been around and maintained for more than a decade! Small footprint, good speed. One downside might be GPL v2 for commercial use.

frenchman99 · 2024-08-17T12:54:09.000000Z

You can always build a small search webservice that you open source and that your proprietary software calls out to, removing the need to open source everything.

Linux is GPL too, didn't hinder companies making trillions on top of it.

the_mitsuhiko · 2024-08-17T13:05:38.000000Z

Linux is not a good example because of the syscall exemption. The licensing situation is not at all comparable as xapian’s original point of existence was embedding.

synergy20 · 2024-08-17T13:00:19.000000Z

you mean, don't compile it and link it within my application, instead wrap it as a separate program then call it via rpc remotely or locally?

frenchman99 · 2024-08-17T13:24:08.000000Z

Yes, exactly that.
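
A rough sketch of the shape, using only the Python standard library. Everything here is an assumption for illustration: the `/search?q=` endpoint, the `DOCS` toy corpus, and the `search` placeholder, which stands in for the one place where the open-sourced service would call the search library.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Toy corpus standing in for a real index (hypothetical data).
DOCS = {1: "xapian is a search library", 2: "lucene is written in java"}


def search(query: str) -> list[int]:
    """Placeholder backend: ids of documents containing every query word."""
    words = query.lower().split()
    return [doc_id for doc_id, text in DOCS.items()
            if words and all(w in text.split() for w in words)]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404)
            return
        query = parse_qs(url.query).get("q", [""])[0]
        body = json.dumps({"query": query, "ids": search(query)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Bind to localhost only; the caller decides when to start serving."""
    return HTTPServer(("127.0.0.1", port), SearchHandler)
```

Running `make_server().serve_forever()` starts the service, and the other application only ever sees JSON from `http://127.0.0.1:8080/search?q=...`.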

donio · 2024-08-17T19:09:21.000000Z

A lot more than a decade. I've been using it for 15 years and it was a very mature project even then. Repo history goes back to 1999 and according to the history page the project's roots go back to the 80s. A bit like Postgres in this respect.

https://xapian.org/history https://sigir.org/files/forum/S2000/MUSCAT_note.pdf

rbanffy · 2024-08-17T14:16:29.000000Z

For what kind of use do you think GPLv2 would be a blocker?

rurban · 2024-08-18T06:32:01.000000Z

I used it commercially, at a very big international company. All users had access to its unmodified source code and templates. There was no trouble at all.

Bluestein · 2024-08-17T12:38:50.000000Z

In fact, AIUI its roots go back about 3 decades. The "about" page has a nice historical overview.

the_mitsuhiko · 2024-08-17T12:54:07.000000Z

The project at one point started tracking files for a potential rewrite to rid itself of GPL history. I used it many years ago and I quite enjoyed it (pre-Elasticsearch times) but unfortunately the license situation didn’t help the project to become popular.

synergy20 · 2024-08-17T12:57:36.000000Z

that's true, wonder if there is an alternative that is not GPL

bearjaws · 2024-08-17T13:11:43.000000Z

Sonic search https://github.com/valeriansaliou/sonic

Maybe not exactly the same: it's a server in which you can store documents and then retrieve their IDs using a search string.

Bluestein · 2024-08-17T13:16:49.000000Z

Elasticsearch and its Amazon fork OpenSearch, perhaps?

JackSlateur · 2024-08-17T13:32:44.000000Z

Xapian is a library, while elastic has a client-server model

Xapian is more like sqlite while elastic would be mariadb

inertiatic · 2024-08-17T14:00:55.000000Z

Lucene, which is what ES builds upon, is a library with bindings in languages other than Java, and it's Apache licensed.

Bluestein · 2024-08-17T13:46:14.000000Z

Thanks for the spot-on, very illuminating comparison.

openrisk · 2024-08-17T12:33:46.000000Z

used also by recoll, the desktop search app: https://www.recoll.org/

nanna · 2024-08-17T13:07:58.000000Z

I use recoll to index and search thousands of pdfs. Because I always have the author name in the filename I can filter queries like this:

Cybernetics OR steering filename:Heidegger ext:pdf

It's an absolute power tool.

nickpsecurity · 2024-08-17T14:19:22.000000Z

I do it like this:

Title Year Author Name.pdf

Same benefits as you mentioned. You can also filter by time that way.

nanna · 2024-08-17T20:00:19.000000Z

My way is to do

year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf

This means that files are automatically organised by year of publication, that I can search by tag name, and that I don't have to escape chars in the terminal. One day I hope to get round to building an Emacs mode to filter by the different elements.
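
For anyone who wants to filter on those elements outside Emacs, here is a small sketch of a parser for that naming scheme. The `BookName` class and `parse_book_name` function are hypothetical names, and the sketch assumes the separators (`__`, `~~`, `~`, `##`, `#`) never appear inside a field.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class BookName:
    year: str
    authors: list[str]
    title: str
    subtitle: str = ""
    tags: list[str] = field(default_factory=list)


def parse_book_name(filename: str) -> BookName:
    """Split year__authors~~title~subtitle##tags.ext into its parts."""
    stem = Path(filename).stem                      # drop the extension
    head, _, tag_part = stem.partition("##")        # tags follow '##'
    left, _, title_part = head.partition("~~")      # title follows '~~'
    year, _, author_part = left.partition("__")     # authors follow '__'
    title, _, subtitle = title_part.partition("~")  # optional subtitle
    return BookName(
        year=year,
        authors=[a for a in author_part.split("_") if a],
        title=title,
        subtitle=subtitle,
        tags=[t for t in tag_part.split("#") if t],
    )
```

Fed the template itself, `parse_book_name("year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf")` yields three authors, the title `title-of-book`, the subtitle `subtitle`, and the tags `tag1`, `tag-2`, `tagn`.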

donio · 2024-08-17T20:42:04.000000Z

If your PDFs have Author properties then you might be able to do "author:Heidegger" too. The Recoll PDF filter extracts some of these fields and if I remember right it can be configured to extract additional custom properties too.

nanna · 2024-08-18T06:21:46.000000Z

That would be great, however most of my pdfs don't have the author property set.

Beijinger · 2024-08-17T14:44:13.000000Z

Recoll is magic...

dvdkon · 2024-08-17T11:58:42.000000Z

Xapian is nice. I've used it before to add interactive autocomplete to a Python web app, since my previous favourite, Whoosh, is unmaintained and somehow slower than grep on a folder (I remember it being pretty fast years back, I'd love to know what happened).

I'd say my favourite thing about Xapian is that it's just a simple library you can embed in any app, no need for a separate database and JVM tuning. For simple use cases and small-to-medium datasets, it just works.
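
To give a feel for the embedding, here is a minimal sketch of prefix-style autocomplete with the Xapian Python bindings. It is not the code from the app mentioned above; `build_index` and `complete` are illustrative names, and the guarded import is only there so the file can be loaded on machines where the bindings (usually a distro package such as python3-xapian) are not installed.

```python
try:
    import xapian  # usually from a distro package such as python3-xapian
except ImportError:  # the functions below need the bindings to actually run
    xapian = None


def build_index(path, texts):
    """Create (or open) an on-disk index at `path`, one document per text."""
    db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("en"))
    for text in texts:
        doc = xapian.Document()
        doc.set_data(text)        # stored payload returned with each hit
        indexer.set_document(doc)
        indexer.index_text(text)  # tokenise and add searchable terms
        db.add_document(doc)
    db.commit()
    return db


def complete(db, typed, limit=5):
    """Return stored texts matching what the user has typed so far."""
    parser = xapian.QueryParser()
    parser.set_database(db)  # FLAG_PARTIAL expands against the db's terms
    parser.set_stemmer(xapian.Stem("en"))
    # FLAG_PARTIAL treats the last word as a prefix, for as-you-type search.
    flags = xapian.QueryParser.FLAG_DEFAULT | xapian.QueryParser.FLAG_PARTIAL
    query = parser.parse_query(typed, flags)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    return [m.document.get_data().decode("utf-8")
            for m in enquire.get_mset(0, limit)]
```

The whole index is just a directory on disk that the app opens directly, which is the "no separate service" point in practice.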