I've used Xapian extensively, but not this new Xapiand tool, so I can only speak to the actual library. Xapian is a C++ library that accesses index data files directly on disk.
There are bindings for various languages; the Python one, say, lets you do 'import xapian' and gives you FFI bindings to the library. Then you basically open your on-disk index files and issue queries.
Xapian supports many concurrent readers, but only one writer. It's not a server, and there are no protocols; maybe that's what this Xapiand tool adds. In general the overhead is very light: just enough RAM to hold the library code, and the OS takes care of all the filesystem-level caching.
Many of the same concepts found in Lucene (Documents, Terms, weights, flavors of BM25 relevance ranking, query parse trees, relevancy operators, etc.) apply to Xapian as well.
There's an explicit license exception for the syscall interface, stating that calling it from userspace is freely allowed: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
And for modules (e.g. drivers), there are various statements from Linus on how they're not necessarily considered derived works of, or linked to, the kernel, e.g. http://linuxmafia.com/faq/Kernel/proprietary-kernel-modules.... or http://lkml.iu.edu/hypermail/linux/kernel/0312.0/0670.html
This is why I always license libs as LGPL, but it seems strange to me that it's even needed. If I've defined a proper opaque API to be consumed by external code I know nothing about, it's odd to then argue that library callers are derived works and that LGPL is explicitly needed.
It is also not clear to me what, if anything, already integrates with this, and therefore how much code I need to write to try it out and compare against ElasticSearch.
It does power billions of devices, after all. :-P
Toshi is to ElasticSearch as Tantivy is to Lucene, if that makes sense.
Obviously, as they are new, they are not at feature parity, but Tantivy does win some benchmarks: https://tantivy-search.github.io/bench/
I watched a talk on the new indexing engine a while back:
Can we attribute some of this renewed zeal in the search space to the creation of more approachable systems languages (e.g. Golang and Rust)? Maybe I just haven't been watching the search space, but I feel it wasn't always this full of new projects putting up good numbers.
Are you maybe trying to get at the difficulty of tuning the JVM?
golang is kinda a java alternative. a db/search-engine in java/golang kinda sucks (it will, under pressure)
Of course, if the only consideration is whether a runtime is there or not, then golang is identical to java, but also identical to common lisp, or maybe even to interpreted languages like python.
I do want to point out that it's possible to write horribly buggy code in C/C++ (less so in Rust :), which can tank performance/efficiency compared to a java/golang program. All things considered, though, the ceiling on performance and efficiency is of course higher in manual-memory-management land.
Thanks for clarifying what you meant!
"Tantivity to Toshi, is as Lucene to Elasticsearch"
It might have to do with the use of analogy questions in the SAT (a standardized test all but required for high school students wanting to attend good colleges in America), though it looks like they've been removed?
"_____ is to ___ as ____ is to ______" was the verbatim format of those test questions.
> Ranked search (so the most relevant documents are more likely to come near the top of the results list) with built-in support for multiple models from the Probabilistic, Divergence from Randomness, and Language Modelling families of weighting models. Custom user-supplied weighting models are also supported.
Could someone explain in a little more detail what these terms mean?
For an intro to the problem space, see https://opensourceconnections.com/blog/2014/06/10/what-is-se...
If you want a lot more detail, check out the book Relevant Search.
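To make the "weighting model" idea concrete, here's a toy BM25 scorer in plain Python. This is a simplified sketch (real engines precompute these statistics in the index); the corpus, query, and default parameters k1=1.2, b=0.75 are illustrative:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score `doc` (a list of tokens) against `query_terms` with toy BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N      # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)    # how many docs contain t
        if df == 0:
            continue
        # Rare terms get a higher idf, so they contribute more to the score.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(t)                        # term frequency in this doc
        # tf saturates via k1; b penalizes longer-than-average documents.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "xapian is a search library".split(),
    "lucene is a search library in java".split(),
    "the cat sat on the mat".split(),
]
q = "search library".split()
ranked = sorted(corpus, key=lambda d: bm25_score(q, d, corpus), reverse=True)
```

Here the two matching documents outrank the cat document, and the shorter match ranks first thanks to the length normalization. "Probabilistic", "Divergence from Randomness", and "Language Modelling" are just different families of formulas playing this same role, and "custom weighting models" means you can plug in your own scoring function.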
Vespa is also far heavier and more complex than any of the other search systems mentioned here.
Consider your problems solved...
It doesn't index the actual article content, nor does it take into account links across sites and content the way Google does. Algolia (by its own description) is designed to search for things (like products in an ecommerce store) rather than text with concepts, relations, and entities in a knowledge graph like Google.
Algolia is a YC company, so I assume that's the main reason it's being used. But the fact that it does such an awful job with such a simply structured site isn't compelling.
Do you have any specific improvements in mind? Would you mind sharing some non-working queries with us? We can follow up here, and you can also open issues on https://github.com/algolia/hn-search
Google gives many more results, and a few on the first page seem quite relevant. Most notably: https://news.ycombinator.com/item?id=5531192
but also: https://news.ycombinator.com/item?id=1657574
But it's the 3rd result in Algolia, behind stories that are both older and have fewer votes.
Let me share that with the team and see whether we can try something.