Xapian: Open source search engine library (xapian.org)
172 points by Bluestein 29 days ago | 43 comments



I remember having used, a very long time ago, a self-hosted search engine on my library of PDFs, and it was unbelievably useful.

I dream about a similar thing that can do OCR on scanned docs and extract text from my also sprawling library of epub and mobi files. If someone builds something like this, with maybe a LOCAL LLM to extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs, and add that to a NAS appliance, it’d be a killer.


That's what our app does: curiosity.ai has a local index, support for many file types and apps out of the box, and integrated local OCR, STT, and even a local LLM.


I couldn't find any mention on your website about LOCAL LLMs and according to your FAQ, it requires an account with your website.

Is there a way to run curiosity.ai fully offline, without an account on your servers?


That is the future :) Much success!


DEVONthink 3 [0] (Apple only) will do most of that although I don't keep up at all with its interoperability with LLM extensions.

[0] https://www.devontechnologies.com/apps/devonthink


Evernote will do this: you can feed it a bunch of PDFs and other documents, and it will OCR them and make them all searchable. It's not perfect, but you can also add manual tags for things you know are important.


At some point I'd love to further train an LLM on all my PDFs and be able to ask it questions.


For text-based PDFs, doc files, etc., this is super easy to do and requires very little technical expertise or configuration. Download the gpt4all client, pick an LLM (the new Meta Llama 3 model works really well), then configure the local documents plugin.

I wrote a scraper to download all of the California EdCode from the government's site and convert it all to txt docs, and now I can ask questions about California EdCode in plain English.

I work in a shared governance capacity that requires us to refer to the Ed code for contractual negotiations and it’s been extremely helpful.


> similar thing that can do OCR on scanned docs

It's only part of what you want, but ocrmypdf will add an OCRed text layer to PDF files, making the text selectable and indexable.
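
A minimal sketch of running ocrmypdf over a whole folder from Python, assuming the `ocrmypdf` command is installed and on the PATH. The helper names `ocrmypdf_command` and `ocr_folder` are made up for illustration; `--skip-text` tells ocrmypdf to leave pages that already contain text alone, and `-l` selects the OCR language.

```python
import subprocess
from pathlib import Path


def ocrmypdf_command(src, dst, language="eng"):
    """Build the ocrmypdf command line; nothing is executed here."""
    return ["ocrmypdf", "--skip-text", "-l", language, str(src), str(dst)]


def ocr_folder(folder, out_dir):
    """Hypothetical batch helper: OCR every PDF in `folder` into `out_dir`."""
    folder, out_dir = Path(folder), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError if ocrmypdf reports a failure.
        subprocess.run(ocrmypdf_command(src, out_dir / src.name), check=True)
```

Writing the results to a separate directory keeps the original scans untouched, so a bad OCR pass can simply be rerun.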


I have most of the code for doing this; it just needs to be rewritten for local storage (I was running it on Google Cloud). I need to pick something that doesn't run Solr as a service for local use. With Ollama, we have function calls running, so it should be doable. I was also thinking about using Open WebUI.


> extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs

The big AI players are probably already scraping the bottom of this "barrel" in their search for training data, I am sure ...


This project has been around and maintained for more than a decade! Small footprint, good speed. One downside might be GPL v2 for commercial use.


You can always build a small search webservice that you open source and that your proprietary software calls out to, removing the need to open source everything.

Linux is GPL too, didn't hinder companies making trillions on top of it.


Linux is not a good example because of the syscall exemption. The licensing situation is not at all comparable as xapian’s original point of existence was embedding.


you mean, don't compile it and link it within my application, instead wrap it as a separate program then call it via rpc remotely or locally?


Yes, exactly that.
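
A rough sketch of the mechanics only, using nothing but the Python standard library: a tiny local HTTP service answers search requests, and the calling program talks to it over a socket instead of linking the search code in. Everything here is an assumption for illustration: the `DOCS` corpus and the word-matching `search` function stand in for a real index, and the `/search?q=` endpoint is invented. Whether a given split satisfies a particular license is a separate question this code does not answer.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, quote, urlparse
from urllib.request import urlopen

# Stand-in corpus; a real service would query its search library here.
DOCS = {1: "xapian is a search library", 2: "notmuch indexes mail"}


def search(query):
    """Return sorted IDs of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    return [doc_id for doc_id, text in sorted(DOCS.items())
            if all(word in text.split() for word in words)]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        ids = search(params.get("q", [""])[0])
        body = json.dumps({"ids": ids}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server():
    """Serve on a free localhost port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_search(server, query):
    """Client side: the calling application only ever sees JSON over HTTP."""
    port = server.server_address[1]
    with urlopen(f"http://127.0.0.1:{port}/search?q={quote(query)}") as resp:
        return json.load(resp)["ids"]
```

The caller would use it as `server = start_server()` followed by `remote_search(server, "search library")`, and the same client code works unchanged if the service later moves to another machine.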


A lot more than a decade. I've been using it for 15 years and it was a very mature project even then. Repo history goes back to 1999 and according to the history page the project's roots go back to the 80s. A bit like Postgres in this respect.

https://xapian.org/history https://sigir.org/files/forum/S2000/MUSCAT_note.pdf


For what kind of use do you think GPLv2 would be a blocker?


I used it commercially, at a very big international company. All users had access to its unmodified source code and templates. There was no trouble at all.


In fact, AIUI its roots go back about 3 decades. The "about" page has a nice historical overview.


The project at one point started tracking files for potential rewrite to rid itself of GPL history. I used it many years ago and I quite enjoyed it (pre elastic search times) but unfortunately the license situation didn’t help the project to become popular.


That's true. I wonder if there is an alternative that is not GPL.


Sonic search https://github.com/valeriansaliou/sonic

Maybe not exactly the same: it's a server in which you can store documents and then retrieve their IDs using a search string.


Elasticsearch and its Amazon fork OpenSearch, perhaps?


Xapian is a library, while elastic has a client-server model

Xapian is more like sqlite while elastic would be mariadb


Lucene, which is what ES builds upon, is a library with bindings in languages other than Java, and it's Apache licensed.


Thanks for the spot-on, very illuminating comparison.


used also by recoll, the desktop search app: https://www.recoll.org/


I use recoll to index and search thousands of pdfs. Because I always have the author name in the filename I can filter queries like this:

Cybernetics OR steering filename:Heidegger ext:pdf

It's an absolute power tool.


I do it like this:

Title Year Author Name.pdf

Same benefits as you mentioned. You can also filter by time that way.


My way is to do

year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf

This means that files are automatically organised by year of publication, that I can search by tag name, and that I don't have to escape chars in the terminal. One day I hope to get round to building an Emacs mode to filter by the different elements.
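
As a rough illustration of how such names can be taken apart again, here is a small Python sketch. It assumes the separators are exactly as in the pattern above (`__` after the year, `_` between authors, `~~` before the title, `~` before the subtitle, `##` before the tags, `#` between tags); the `BookName` class, the `parse_book_filename` function, and the sample filename in the usage note are invented for this example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BookName:
    year: str
    authors: list
    title: str
    subtitle: str
    tags: list


def parse_book_filename(filename):
    """Split a name like year__a1_a2~~title~subtitle##tag1#tag2.pdf into parts."""
    stem = Path(filename).stem  # drop the .pdf extension
    head, _, tag_part = stem.partition("##")
    left, _, title_part = head.partition("~~")
    year, _, author_part = left.partition("__")
    title, _, subtitle = title_part.partition("~")
    return BookName(
        year=year,
        authors=[a for a in author_part.split("_") if a],
        title=title,
        subtitle=subtitle,  # empty string when there is no subtitle
        tags=[t for t in tag_part.split("#") if t],
    )
```

For instance, `parse_book_filename("1999__doe_roe~~some-title~a-subtitle##search#ir.pdf")` yields the year, two authors, title, subtitle, and two tags as separate fields, which is enough to filter a directory listing by any of them.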


If your PDFs have Author properties then you might be able to do "author:Heidegger" too. The Recoll PDF filter extracts some of these fields and if I remember right it can be configured to extract additional custom properties too.


That would be great, however most of my pdfs don't have the author property set.


Recoll is magic...


Xapian is nice. I've used it before to add interactive autocomplete to a Python web app, since my previous favourite, Whoosh, is unmaintained and somehow slower than grep on a folder (I remember it being pretty fast years back, I'd love to know what happened).

I'd say my favourite thing about Xapian is that it's just a simple library you can embed in any app, no need for a separate database and JVM tuning. For simple usecases and small-to-medium datasets, it just works.


Love Xapian, been using it for many years via notmuch (mail) and recoll (document indexing, mainly PDFs in my case).

It's been trouble free and very performant, a real workhorse.

https://notmuchmail.org/

https://www.recoll.org/


Xapian is used in https://www.djcbsoftware.nl/code/mu/ for indexing emails.


And by Fastmail



In itself, Notmuch is a very interesting tool.



I once wanted to compile a program that used Xapian on Windows. It was basically impossible for mortals.

Imo people should use cross-platform alternatives.



