Xapiand: A fast, simple, modern search and storage engine (kronuz.io)
171 points by Kronuz 39 days ago | 63 comments



Xapian has a long history starting in the early 80s:

https://xapian.org/history

I've used Xapian extensively, but not this new Xapiand tool, so I can only speak to the actual library. Xapian is a C++ library that accesses index data files directly on disk. There are bindings for various languages; Python, say, lets you do 'import xapian' and get FFI bindings to the library, then you basically open your on-disk index files and issue queries.

Xapian supports many concurrent readers, but only one writer. It's not a server; there are no protocols. Maybe that's what this Xapiand tool adds. In general the overhead is very, very light: just enough RAM to hold the library code, and the OS takes care of all the filesystem-level caching.

Many of the very same concepts that are in Lucene (Documents, Terms, weights, flavors of BM25 relevance ranking, query parsing trees, relevancy operators, etc.) apply to Xapian as well.
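For readers who haven't seen BM25 written out, here is a minimal sketch of one common variant in plain Python. It is not Xapian's or Lucene's actual implementation; the function name `bm25_scores`, the toy documents, and the parameter defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "xapian is a search library".split(),
    "lucene is a java search library".split(),
    "sqlite is a database".split(),
]
scores = bm25_scores(["xapian", "search"], docs)
```

The first document matches both query terms and scores highest, the second matches only "search", and the third matches nothing and scores zero.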


I love Xapian, the quality of its recall is excellent and indexing performance very hard to find fault with. There's just a tiny problem - it's stuck with the GPL, despite a long effort to relicence the code going back years.


Remind me what's wrong with GPL, here? You can't repackage and resell it?


You can't include it in your code and sell your software without distributing the source.


Other libraries might require you to pay them money. GPL software requires you to pay your debt in source code, instead.


I was explaining, not complaining.


Is there actually precedent for a GPL library, clearly set up to be consumed through an API, requiring the whole binary to be GPL? On the one hand that's clearly not the case with the Linux API; on the other hand the LGPL exists. I was curious whether this is actually settled or just too murky to live with.


The Linux kernel has explicit rules for the cases where it allows it:

An explicit license exception for the syscall interface, stating that calling it from userspace is freely allowed: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

And modules, e.g. drivers, have various statements from Linus on how they're not necessarily considered derived works of the kernel/linked to the kernel, e.g. http://linuxmafia.com/faq/Kernel/proprietary-kernel-modules.... or http://lkml.iu.edu/hypermail/linux/kernel/0312.0/0670.html


I know the situation of the Linux kernel. The discussion hinges on what is a "derived work". Kernel modules are very interlinked with the kernel, particularly since there are no stable APIs inside the kernel. It seems strange that just linking to a GPL library that defined an API makes you a derived work. But that is the FSF position:

https://en.wikipedia.org/wiki/GNU_Readline#Choice_of_the_GPL...

This is why I always do libs as LGPL but it seems strange to me that it's even needed. If I've defined a proper opaque API, to be consumed by external code I know nothing about, it's strange to then argue that library callers are derived works and LGPL is explicitly needed.


There are many GPL/LGPL DBs with BSD drivers (e.g. ScyllaDB). Don't know about libraries, though (you should be able to use the library externally, I think, just like you do with the DB).


Yeah, a pity


Maybe the Xapian library just needs a little push from a larger community to make relicensing faster. Xapiand could help towards that end by bringing in more people who can help. Xapiand's source code is itself licensed as MIT (before compiling), and the Xapian community is already taking big steps towards relicensing.


I am always surprised when I find out that developers with bold claims to fame have not heard of Sphinx [0] and of Xapian [1]!

[0]: http://sphinxsearch.com/

[1]: https://xapian.org/


Sphinx is a workhorse. Very light memory and CPU usage compared to the popular Elasticsearch.


Those developers roll their own search libraries.


I checked out Xapiand several months ago after I stumbled across it during a fit of GitHub browsing. It certainly seems fast, and adding documents is very easy, but there is so little documentation that I was unable to test it out in any significant way. I'm very interested to see where the project goes, especially if Xapian itself switches away from the GPL.


Somewhat unrelated, but I've written a RESTful search server that's powered by SQLite's full-text search. Extremely lightweight Python (Flask) app. Nice for blogs or small projects, if I say so myself!

https://github.com/coleifer/scout
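To give a feel for what SQLite's full-text search looks like underneath, here is a minimal sketch using Python's built-in `sqlite3` module. It is not code from Scout; the table name `posts`, the sample rows, and the `search` helper are made up for illustration, and it assumes your SQLite build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO posts (title, body) VALUES (?, ?)",
    [
        ("Xapiand released", "A search and storage engine built on Xapian."),
        ("Gardening tips", "How to grow tomatoes on a balcony."),
        ("SQLite tricks", "Full-text search with the FTS5 extension."),
    ],
)

def search(term):
    # MATCH runs the full-text query; bm25() returns lower values for better matches.
    rows = conn.execute(
        "SELECT title FROM posts WHERE posts MATCH ? ORDER BY bm25(posts)",
        (term,),
    )
    return [title for (title,) in rows]

results = search("search")
```

Here `results` holds the titles of the two posts whose body mentions "search", while `search("tomatoes")` returns only the gardening post.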


Still beta, last release[0] was 0.8 on 2018 November 18.

It is also not clear to me what, if anything, already integrates with this, and therefore how much code I need to write to try it out and compare against ElasticSearch.

[0] https://kronuz.io/Xapiand/news/


fwiw also 11 patch releases (meaning 0.8.z) since then https://github.com/Kronuz/Xapiand/releases


I'm interested, can anyone give a quick overview of why you'd use this over Elasticsearch?


Well, for one, Java is ridiculously memory hungry. The resource cost of Elasticsearch is the #1 reason I'm not using it. I've seen a few projects that aimed to reimplement the Elasticsearch backend in Rust, but they were incomplete. That would be my ideal solution, personally.


I don't know how true that is anymore, that "Java is ridiculously memory hungry".

It does power billions of devices, after all. :-P


So does the diesel engine, but it's not eco-friendly.


Compared to?


Electric motors? I'm offering a counter-analogy to the previous statement: the fact that it is widely used does not necessarily make it good.


why do you think they called their product "elastic"? java heap size? 32gb? above that and you are in for problems


Which problem? There is no problem, especially with the latest GC. Only tradeoffs.


This is a limitation. For instance, if you have billions of docs you need 200 tiny servers and have to deal with the communication/administration/monitoring between all of those. Anything around 32gb and you will have perf drops every time the GC works too hard.


Not true since g1gc


Yes, exactly the same for me, I've tried all the configuration possible with the jvm (that I know of) but nothing really worked to make it use a more reasonable amount of memory.


OpenJ9 is a low memory footprint jvm


Xapiand depends on xapian-core which is licensed under GPL (not LGPL). Makes me think that Xapiand should be licensed under GPL, instead of MIT?


the author can license his own source code however he likes. only, when distributing the compiled binaries, you're required to provide the whole sources under libre (gpl compatible) terms.


It's nice to see new search servers, especially low level ones. I'm going to give this a few tests.


Not sure if you know about tantivy but it's cool too: https://github.com/tantivy-search/tantivy


Also worth mentioning is Toshi: https://github.com/toshi-search/Toshi

Toshi is to ElasticSearch as Tantivy is to Lucene if that makes sense.

Obviously as they are new they are not at feature parity, but Tantivy does win at some benchmarks: https://tantivy-search.github.io/bench/


There is also Blast (golang), built on top of Bleve.

- https://github.com/mosuka/blast
- http://blevesearch.com/


Wow I actually forgot about Bleve!

I watched a talk on the new indexing engine a while back:

https://www.youtube.com/watch?v=zjG2Y01i3Kk

Can we attribute some of this renewed zeal in the search space to the creation of more approachable systems languages (i.e. Golang and Rust)? Maybe I just haven't been watching the search space but I feel it wasn't always this full of new projects putting up good numbers.


Yeah but it's golang, so it's kinda like java, so I see no pros in it TBH.


There are a lot of differences between Golang and Java. As much as I dislike writing Java when I have a choice, the JVM (with Java or whatever else on top) is a very capable tool... Could you explain what you mean by there being "no pros"?

Are you maybe trying to get at the difficulty of tuning the JVM?


rust/c++/c has no gc and better performance/efficiency compared to java/golang. so you get excited for a library/db in those languages

golang is kinda a java alternative. a db/search engine in java/golang kinda sucks (or at least it will under pressure)


While I definitely agree with you on the broad strokes of the differences between rust/c++/c and java/golang (representing languages without runtimes and those with them respectively), I'd say that golang is a bit more than a java alternative if we consider more than whether a runtime is included or not.

Of course, if the only consideration is whether a runtime is there or not, golang is identical to java but also identical to common lisp or maybe even interpreted languages like python.

I do want to point out that it's possible to write horribly buggy code in c++/c (less so in rust :), which can tank performance/efficiency when compared to a java/golang program. All things considered though, the ceiling on performance and efficiency is of course higher in manual memory management land.

Thanks for clarifying what you meant!


golang isn't even close to using the same amount of memory as java, so at least there's that.


Non-native English speaker here. But isn't it easier to understand like so:

"Tantivy to Toshi, is as Lucene to Elasticsearch"


As a native english speaker, the earlier phrase ("tantivy is to toshi as lucene is to elastic search") is easier for me to understand. I find your phrase a bit harder to understand, but it looks like just the kind of reorganization other languages do structure wise -- I don't know how to express it in proper grammatical terms, but the way the prepositions are swapped around makes it seem like native english words but with a non-english structure.

It might have to do with the use of analogy questions in the SAT (a standardized test all but required for high school students wanting to attend good colleges in America), though it looks like they've since been removed [0].

"_____ is to ___ as ____ is to ______" was the verbatim format of those test questions.

[0]: https://blog.prepscholar.com/sat-analogies-and-comparisons-w...


From the features list:

> Ranked search (so the most relevant documents are more likely to come near the top of the results list) with built-in support for multiple models from the Probabilistic, Divergence from Randomness, and Language Modelling families of weighting models. Custom user-supplied weighting models are also supported.

Could someone explain in a little more detail what these terms mean?


tl;dr is that those are different approaches to weighting documents in order to return the most relevant ones for a query.

For an intro to the problem space, see https://opensourceconnections.com/blog/2014/06/10/what-is-se...

If you want a lot more detail, check out the book Relevant Search.

https://www.manning.com/books/relevant-search
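As one concrete instance of the "Language Modelling" family mentioned in the feature list, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing. This is a textbook formulation, not Xapian's implementation; the function name `lm_dirichlet_scores`, the toy documents, and the small `mu` used in the example are assumptions for illustration.

```python
import math
from collections import Counter

def lm_dirichlet_scores(query, docs, mu=2000.0):
    """Log query likelihood per document, with Dirichlet-smoothed term probabilities."""
    # Collection-wide term counts act as the background language model.
    collection = Counter(term for d in docs for term in d)
    total = sum(collection.values())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            p_coll = collection[term] / total
            if p_coll == 0:
                continue  # term never seen in the collection: ignore it
            # Blend the document's own frequency with the background model.
            p = (tf[term] + mu * p_coll) / (len(d) + mu)
            score += math.log(p)
        scores.append(score)
    return scores

# Hypothetical toy corpus; a small mu suits such short documents.
docs = [
    "a fast search engine".split(),
    "a storage engine".split(),
    "a cooking blog".split(),
]
lm_scores = lm_dirichlet_scores(["search", "engine"], docs, mu=10.0)
```

Scores are log probabilities, so they are negative and higher is better. Unlike BM25, a document that matches no query term still gets a score from the smoothed background model, just a lower one.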


I really hope this is the long-awaited thing. From the number of commits, this project looks huge. Elasticsearch is really a big liability for resource-limited deployments. I have seen some smaller projects made in Rust and Go, but they can't compete with Elasticsearch at any level. This one looks different, and I hope it is.


If you are looking for an easy to run/manage, typo tolerant search engine, I've been working on this: https://github.com/typesense/typesense


Haven't heard of Vespa? https://vespa.ai


Nobody has. There's no visibility or community around it which is a constant problem with Yahoo's open source projects. The only thing that really took off was Hadoop but there was very little back then.

Vespa is also far more heavy and complex than any other search systems mentioned here.


Can you tell us why you like it?


It has been in production far longer than any other open source solution. It runs at scale across Yahoo, powering even ad systems, with live configuration pushes. It has everything you need for a highly available product. It also has capabilities for more complex use cases around AI.


apt-get install recoll

Consider your problems solved...


I was genuinely wondering recently why the Hacker News site search didn't show a very recent article with very obvious keywords, one that Google, for example, had in first place in its results when I added "HN" to those two keywords in the search field. Is it a matter of indexing, meaning the article was too recent and wasn't yet in the memory of HN's search tool (Algolia-based, I believe), while Google copied it to memory faster? Or is it purely a matter of the algorithms themselves? The algorithms certainly seem to be the issue when the search is for an old article, which should be in memory already. It seems unnatural. Algolia has a free tier for open-source projects, which is very nice of them, and thanks. I genuinely wonder whether those algorithms are really so complex as to justify the comparatively weaker behavior seen in HN's internal search.


https://hn.algolia.com/ only searches the text content of the submitted stories (as in the title and url) and the comments.

It doesn't index the actual article content, nor does it take into account links across sites and content like Google does. Algolia (as self-described) is designed to search for things (like products in an ecommerce store) rather than text with concepts, relations, and entities in a knowledge graph like Google.


Maybe it's too late now, but I would like to point out that I read the Elasticsearch, Algolia, and Xapiand product descriptions before I made my comment, so I know well what Algolia is for. In the example I gave, I was searching for a headline, not for content inside the comments, and it was a headline that was on the first page of results at HN at that moment. I think I phrased my comment in a polite way towards Algolia, understanding that search has more moving parts than the pattern-matching algorithms of the logical core. PS: I am sincerely grateful for the information in your comment about entities, concepts, and relations in a knowledge graph. This is exactly the kind of info I was looking for when I made the comment, so it was enlightening to learn that. Thank you again.


Google search with site:news.ycombinator.com (and optionally a time limit, which I wish wasn't limited to past hour/day/week/month/year) seems consistently superior to what Algolia provides.

Algolia is a YC company, so I assume that's the main reason it's being used. But the fact that it does such an awful job with such a simply structured site isn't compelling.


Hey latch, I've been working on the Algolia-based HN search and would love to improve it to provide you with a better search experience.

Do you have any specific improvements in mind? Would you mind sharing some queries that don't work? We can follow up here, and you can also open issues on https://github.com/algolia/hn-search


I have just retested, and the problem I mentioned before doesn't exist anymore. It happened around Christmas time; back then, the search returned results that were not very relevant. Now it shows in 9th position what Google shows in 1st (exactly the article I was searching for back then, still the first match on Google), but this is a minor difference in ordering within the first page, so I have to say the search is working pretty much as expected now. Sorry for not retesting before making that comment, and thanks for keeping the search good. All right, I didn't want to say it, but here it is: russia hypersonic


Well, if I search for "activemq vs rabbitmq" and then switch to "comments", I get 2 results. The 2nd hit is reasonable: "ActiveMQ: Not ready for prime time"

Google gives many more results, and a few on the first page seem quite relevant. Most notably: https://news.ycombinator.com/item?id=5531192

but also: https://news.ycombinator.com/item?id=1657574


I just googled "elixir property testing site:news.ycombinator.com" and the first hit was what I wanted.

But it's the 3rd result in Algolia, behind stories that are both older and with fewer votes.


Thanks for sharing, this is a good example where the first 2 hits on Algolia have a "better" textual relevancy (proximity between words is better, because of the "-based" word in the middle) but where the 3rd hit is most probably the one we want to see first because it has more than 100 points while the 2 others have 3.

Let me share that with the team and see whether we can try something.
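The trade-off described above can be sketched in a few lines. This is not Algolia's ranking; the story titles, point counts, the `span` helper, and the blending formula are all hypothetical, picked only to show how a text-first sort and a popularity-aware sort can disagree.

```python
import math
from itertools import product

def span(query_words, title):
    """Smallest word distance covering all query words, or None if one is missing."""
    words = title.lower().split()
    positions = [[i for i, w in enumerate(words) if w == q] for q in query_words]
    if not all(positions):
        return None
    # Brute force over every combination of positions; fine for short titles.
    return min(max(c) - min(c) for c in product(*positions))

def _matches(query_words, stories):
    scored = [(span(query_words, s["title"]), s) for s in stories]
    return [(sp, s) for sp, s in scored if sp is not None]

def rank_text_first(query_words, stories):
    # Proximity decides; points only break ties.
    hits = sorted(_matches(query_words, stories), key=lambda m: (m[0], -m[1]["points"]))
    return [s["title"] for _, s in hits]

def rank_blended(query_words, stories):
    # Lower is better: a looser match can be offset by many more points.
    hits = sorted(_matches(query_words, stories),
                  key=lambda m: m[0] - math.log10(1 + m[1]["points"]))
    return [s["title"] for _, s in hits]

# Hypothetical stories: the popular one has an extra word between the query terms.
stories = [
    {"title": "Elixir property testing notes", "points": 3},
    {"title": "Intro to Elixir property testing", "points": 3},
    {"title": "Elixir property based testing explained", "points": 120},
]
query = ["elixir", "property", "testing"]
text_first = rank_text_first(query, stories)
blended = rank_blended(query, stories)
```

With the text-first sort the 120-point story lands last because of the extra word in the middle; with the blended key it moves to the top.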



