
Xapiand: A fast, simple, modern search and storage engine - Kronuz
https://kronuz.io/Xapiand/
======
michelpp
Xapian has a long history starting in the early 80s:

[https://xapian.org/history](https://xapian.org/history)

I've used Xapian extensively, but not this new Xapiand tool, so I can only
speak to the actual library. Xapian is a C++ library that accesses index data
files directly on disk. There are bindings for various languages; Python, say,
lets you do 'import xapian' and get FFI bindings to the library, then you
basically open your on-disk index files and issue queries.

Xapian supports many concurrent readers, but only one writer. It's not a
server, there are no protocols. Maybe that's what this Xapiand tool adds. In
general the overhead is very, very light: just enough RAM to hold the library
code, while the OS takes care of all the filesystem-level caching.

Many of the very same concepts that are in Lucene, Documents, Terms, weights,
flavors of BM25 relevance ranking, query parsing trees, relevancy operators,
etc, all apply to Xapian as well.

~~~
_wmd
I love Xapian, the quality of its recall is excellent and indexing performance
very hard to find fault with. There's just a tiny problem - it's stuck with
the GPL, despite a long effort to relicense the code going back years.

~~~
vkazanov
Remind me what's wrong with GPL, here? You can't repackage and resell it?

~~~
ddorian43
You can't include it in your code and sell your software without distributing
the source.

~~~
pedrocr
Is there actually precedent for a GPL library, clearly set up to be consumed
with an API, to require GPL of the whole binary? On the one hand that's
clearly not the case in the Linux API, on the other hand the LGPL exists. But
was curious if this is actually settled or just too murky to live with.

~~~
detaro
The Linux kernel has explicit rules for the cases where it allows it:

An explicit license exception for the syscall interface, stating that calling
it from userspace is freely allowed:
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/LICENSES/exceptions/Linux-syscall-note?h=v5.0-rc2)

And modules, e.g. drivers, have various statements from Linus on how they're
not necessarily considered derived works of the kernel/linked to the kernel,
e.g. [http://linuxmafia.com/faq/Kernel/proprietary-kernel-modules....](http://linuxmafia.com/faq/Kernel/proprietary-kernel-modules.html)
or
[http://lkml.iu.edu/hypermail/linux/kernel/0312.0/0670.html](http://lkml.iu.edu/hypermail/linux/kernel/0312.0/0670.html)

~~~
pedrocr
I know the situation of the Linux kernel. The discussion hinges on what is a
"derived work". Kernel modules are very interlinked with the kernel,
particularly since there are no stable APIs inside the kernel. It seems
strange that just linking to a GPL library that defined an API makes you a
derived work. But that is the FSF position:

[https://en.wikipedia.org/wiki/GNU_Readline#Choice_of_the_GPL...](https://en.wikipedia.org/wiki/GNU_Readline#Choice_of_the_GPL_as_GNU_Readline's_license)

This is why I always do libs as LGPL but it seems strange to me that it's even
needed. If I've defined a proper opaque API, to be consumed by external code I
know nothing about, it's strange to then argue that library callers are
derived works and LGPL is explicitly needed.

------
nikolay
I am always surprised when I find out that developers with bold claims to fame
have not heard of Sphinx [0] and of Xapian [1]!

[0]: [http://sphinxsearch.com/](http://sphinxsearch.com/)

[1]: [https://xapian.org/](https://xapian.org/)

~~~
symlock
Sphinx is a workhorse. Very light memory and CPU usage compared to the popular
Elasticsearch.

------
lodestone
I checked out Xapiand several months ago after I stumbled across it during a
fit of Github browsing. It certainly seems fast and very easy to add documents
but there is so little documentation that I was unable to test it out in any
significant way. I'm very interested to see where the project goes, especially
if Xapian itself switches away from the GPL.

------
coleifer
Somewhat unrelated, but I've written a RESTful search server that's powered by
SQLite's full-text search. Extremely lightweight Python (Flask) app. Nice for
blogs or small projects, if I say so myself!

[https://github.com/coleifer/scout](https://github.com/coleifer/scout)
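
For a sense of what SQLite's full-text search provides on its own, here is a
minimal sketch using Python's standard sqlite3 module. It is not taken from
scout; the table name `docs`, its columns, and the helper names `build_index`
and `search` are made up for illustration, and it assumes the SQLite library
linked into your Python was compiled with the FTS5 extension.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    # Assumption: FTS5 is available in this SQLite build; if it is not,
    # this statement raises sqlite3.OperationalError.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    return conn


def search(conn, query, limit=10):
    """Return the titles of matching documents, best match first."""
    rows = conn.execute(
        # `rank` is FTS5's built-in relevance ordering column.
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]


conn = build_index([
    ("Xapian notes", "Xapian is a C++ search library with BM25 ranking."),
    ("SQLite notes", "SQLite ships a full-text search extension."),
    ("Gardening", "Tomatoes need plenty of sun and water."),
])
print(search(conn, "search"))
print(search(conn, "tomatoes"))
```

Wrapping `search` in a single HTTP route is roughly all a small search
server needs on top of this, which is why the approach stays so lightweight.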

------
merlincorey
Still beta, last release[0] was 0.8 on 2018 November 18.

It is also not clear to me what, if anything, already integrates with this,
and therefore how much code I need to write to try it out and compare against
ElasticSearch.

[0] [https://kronuz.io/Xapiand/news/](https://kronuz.io/Xapiand/news/)

~~~
ploxiln
fwiw also 11 patch releases (meaning 0.8.z) since then
[https://github.com/Kronuz/Xapiand/releases](https://github.com/Kronuz/Xapiand/releases)

------
catmanjan
I'm interested, can anyone give a quick overview of why you'd use this over
Elasticsearch?

~~~
francislavoie
Well for one, Java is ridiculously memory hungry. The resource cost of
Elasticsearch is the #1 reason I'm not using it. I've seen a few projects
which had the aim of reimplementing the Elasticsearch backend in Rust, but
were incomplete. That would be my ideal solution, personally.

~~~
bheesham
I don't know how true that is anymore, that "Java is ridiculously memory
hungry".

It does power billions of devices, after all. :-P

~~~
gshack
So does the diesel engine, but it's not eco-friendly.

~~~
nkozyra
Compared to?

~~~
gshack
Electric engines? I am giving a contrasting analogy to the previous
statement: the fact that it is widely used does not necessarily make it good.

------
ngrilly
Xapiand depends on xapian-core which is licensed under GPL (not LGPL). Makes
me think that Xapiand should be licensed under GPL, instead of MIT?

~~~
gdamjan1
the author can license his own source code however he likes. only, when
distributing the compiled binaries, you're required to provide the whole
sources under libre (gpl compatible) terms.

------
drenvuk
It's nice to see new search servers, especially low level ones. I'm going to
give this a few tests.

~~~
hardwaresofton
Not sure if you know about tantivy but it's cool too:
[https://github.com/tantivy-search/tantivy](https://github.com/tantivy-search/tantivy)

~~~
cetra3
Also worth mentioning is Toshi: [https://github.com/toshi-search/Toshi](https://github.com/toshi-search/Toshi)

Toshi is to ElasticSearch as Tantivy is to Lucene if that makes sense.

Obviously as they are new they are not at feature parity, but Tantivy does win
at some benchmarks: [https://tantivy-search.github.io/bench/](https://tantivy-search.github.io/bench/)

~~~
bmichel
There is also Blast (golang), built on top of Bleve.

- [https://github.com/mosuka/blast](https://github.com/mosuka/blast)
- [http://blevesearch.com/](http://blevesearch.com/)

~~~
ddorian43
Yeah but it's golang, so it's kinda like java, so I see no pros in it TBH.

~~~
hardwaresofton
There are a lot of differences between Golang and Java. As much as I dislike
writing Java when I have a choice, the JVM (with Java or whatever else on top)
is a very capable tool... Could you explain what you mean by there being "no
pros"?

Are you maybe trying to get at the difficulty of tuning the JVM?

~~~
ddorian43
rust/c++/c has no gc and better performance/efficiency compared to
java/golang. so you get excited for a library/db in those languages

golang is kinda a java alternative. a db/search-engine in java/golang kinda
sucks (it will, under pressure)

~~~
hardwaresofton
While I definitely agree with you on the broad strokes of the differences
between rust/c++/c and java/golang (representing languages without runtimes
and those with them respectively), I'd say that golang is a bit more than a
java alternative if we consider more than whether a runtime is included or
not.

Of course, if the only consideration is whether a runtime is there or not,
golang is identical to java but also identical to common lisp or maybe even
interpreted languages like python.

I do want to point out that it's possible to write horribly buggy code in
c++/c (less so in rust :), which can tank performance/efficiency when compared
to a java/golang program. All things considered though, the ceiling on
performance and efficiency is of course higher in manual memory management
land.

Thanks for clarifying what you meant!

------
amelius
From the features list:

> Ranked search (so the most relevant documents are more likely to come near
> the top of the results list) with built-in support for multiple models from
> the Probabilistic, Divergence from Randomness, and Language Modelling
> families of weighting models. Custom user-supplied weighting models are also
> supported.

Could someone explain in a little more detail what these terms mean?

~~~
sciurus
tl;dr is that those are different approaches to weighting documents in order
to return the most relevant ones for a query.

For an intro to the problem space, see
[https://opensourceconnections.com/blog/2014/06/10/what-is-se...](https://opensourceconnections.com/blog/2014/06/10/what-is-search-relevancy/)

If you want a lot more detail, check out the book Relevant Search.

[https://www.manning.com/books/relevant-search](https://www.manning.com/books/relevant-search)
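
To make "weighting" concrete, here is a rough sketch of BM25, a widely used
member of the probabilistic family, in plain Python. It is a textbook-style
variant, not Xapian's or Xapiand's actual implementation; the whitespace
tokenizer, the function names, the sample documents, and the defaults k1=1.2
and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalised (b).
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "xapian is a search library",
    "lucene is a search library written in java",
    "tomatoes need sun",
]
scores = bm25_scores("xapian search", docs)
# Print documents from most to least relevant.
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The other families mentioned in the feature list keep the same overall shape,
a per-term weight summed over the query terms, but derive that weight from
different statistical assumptions.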

------
patelh
Haven't heard of Vespa? [https://vespa.ai](https://vespa.ai)

~~~
emmelaich
Can you tell us why you like it?

~~~
patelh
Has been in production far longer than any other open source solution. Runs at
scale across Yahoo, powering even ad systems, with live configuration pushes.
Everything you need for a highly available product. It also has capabilities
to be used for more complex use cases around AI.

------
karterk
If you are looking for an easy to run/manage, typo tolerant search engine,
I've been working on this:
[https://github.com/typesense/typesense](https://github.com/typesense/typesense)

------
jayalpha
apt-get install recoll

Consider your problems solved...

------
the_other_guy
I really hope this is the long-awaited thing. From the number of commits, this
project looks huge. Elasticsearch is really a big liability for
resource-limited deployments. I have seen some smaller projects made in Rust
and Go, but they can't compete with Elasticsearch at any level. This looks
different, and I hope it does compete.

------
paradoxparalax
I was genuinely wondering recently why the Hacker News site search didn't
show a very recent article with very obvious keywords, one that Google, for
example, had as its first result when I added "Hn" to those 2 keywords in the
search field. Is it a matter of indexing, meaning the article was too recent
and wasn't yet in the memory of HN's search tool (Algolia-based, I believe),
and in this case Google copied it to memory faster? Or is it purely a matter
of the algorithms themselves? The algorithms certainly seem to be the issue
when the search is for an old article, which should be in memory already. It
seems unnatural. Algolia has a free tier for open-source projects, which is
very nice of them, and thanks. I genuinely wonder whether those algorithms are
really so complex as to justify the comparatively weaker behavior seen in HN's
internal search.

~~~
manigandham
[https://hn.algolia.com/](https://hn.algolia.com/) only searches the text
content of the submitted stories (as in the title and url) and the comments.

It doesn't index the actual article content nor take into account links across
sites and content like Google does. Algolia (as self-described) is designed to
search for things (like products in an ecommerce store) rather than text with
concepts, relations, and entities in a knowledge graph like Google.

~~~
paradoxparalax
Maybe too late now, but I would like to point out that I went to read the
Elasticsearch, Algolia, and Xapiand product descriptions before I made my
comment, so I know well what Algolia is for. In the example I gave, I was
searching for a headline, not for content inside the comments, and it was a
headline that was on the first page of results at HN at that moment. So I
think I phrased my comment in a polite way towards Algolia, understanding that
search has more moving parts than the pattern-matching algorithms of the
logical core. PS: I am sincerely grateful for the information you gave in your
comment about entities, concepts, and relations in a knowledge graph. This is
exactly the kind of info I was looking for when I made the comment, so it was
enlightening to know that, and thank you again.

