What's a common approach for keeping the index up to date? A live ETL from the DB to the search engine doesn't sound simple. Another method I can think of, once the existing data has been loaded, is to send the data to both the database and the search engine every time a user makes a CRUD operation, but that's a lot of work too if you don't already have an HTTP API and are doing mostly server-side-rendered HTML.
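To make the dual-write idea concrete, here's a minimal sketch. The `FakeSearchClient` and its `index_document` method are made up stand-ins for a real search-engine client; the database layer is plain sqlite3:

```python
import sqlite3

# Hedged sketch of the dual-write approach: every CRUD operation writes
# to the primary database first, then mirrors the change into the search
# index. FakeSearchClient is a hypothetical stand-in for a real client.

class FakeSearchClient:
    def __init__(self):
        self.docs = {}

    def index_document(self, index, doc):
        self.docs.setdefault(index, {})[doc["id"]] = doc

def create_product(db, search, product):
    # 1. The database stays the source of truth.
    db.execute("INSERT INTO products (id, name) VALUES (?, ?)",
               (product["id"], product["name"]))
    # 2. Mirror the write into the search index; a periodic re-sync job
    #    can repair any drift if this second write fails.
    search.index_document("products", product)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
search = FakeSearchClient()
create_product(db, search, {"id": 1, "name": "Dune"})
print(search.docs["products"][1]["name"])  # -> Dune
```

The weak spot, as the comment notes, is that the second write can fail independently of the first, which is why people often fall back to a periodic full or incremental re-sync job as a safety net.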
Apart from being written in Rust, MeiliSearch (https://github.com/meilisearch/meilisearch) differs mostly in its use of a bucket sort to rank the documents retrieved from the index.
Both MeiliSearch and Typesense use an inverted index with a Levenshtein automaton to handle typos, but when it comes to sorting documents:
- Typesense uses a default_sorting_field on each document, which means that before indexing your documents you need to compute a relevancy score for Typesense to be able to sort them based on your needs (https://typesense.org/docs/0.11.1/guide/#ranking-relevance)
- MeiliSearch, on the other hand, uses a bucket sort, which means there is a default relevancy algorithm based on the proximity of words in the documents, the fields in which the words are found, and the number of typos (https://docs.meilisearch.com/guides/advanced_guides/ranking....). And you can still add your own custom rules if you want to alter the default search behavior.
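A toy illustration of what bucket-sort ranking means in practice (this is not MeiliSearch's actual code, and the criteria names are simplified): candidates are first grouped by one criterion, and only within each bucket does the next criterion break ties.

```python
# Toy bucket-sort ranking: group candidates by typo count first, then
# order each bucket by word proximity. Each later criterion only ever
# reorders documents that tied on all earlier criteria.

def bucket_rank(candidates):
    buckets = {}
    for doc in candidates:
        buckets.setdefault(doc["typos"], []).append(doc)
    ranked = []
    for typos in sorted(buckets):  # fewer typos always win first
        ranked.extend(sorted(buckets[typos], key=lambda d: d["proximity"]))
    return [d["id"] for d in ranked]

docs = [
    {"id": "a", "typos": 1, "proximity": 2},
    {"id": "b", "typos": 0, "proximity": 5},
    {"id": "c", "typos": 0, "proximity": 1},
]
print(bucket_rank(docs))  # -> ['c', 'b', 'a']
```

The appeal of this scheme is that adding a custom rule just means inserting another bucketing pass into the chain, without recomputing a single global score per document.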
> Typesense uses a default_sorting_field on each document, which means that before indexing your documents you need to compute a relevancy score for Typesense to be able to sort them based on your needs
This is not entirely true. You can just use a field with a constant value (say 100) as the default sorting field, and Typesense will simply use the text-based relevancy. Please update your comment.
The reason Typesense does insist on one, though, is that it's always a good idea to have a field that indicates the popularity of a record (or a proxy for it). It makes search so much better.
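A sketch of the constant-field workaround described above. The schema shape follows the Typesense docs linked earlier, but treat the exact field names here as illustrative rather than authoritative:

```python
# Workaround sketch: give every document the same value in the default
# sorting field, so the sort field never reorders results and ranking
# falls back to text-based relevance alone.

schema = {
    "name": "books",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "popularity", "type": "int32"},
    ],
    "default_sorting_field": "popularity",
}

def to_document(title):
    # Constant popularity for every document -> a no-op tiebreaker.
    return {"title": title, "popularity": 100}

print(to_document("Dune")["popularity"])  # -> 100
```

If you later do have a real popularity signal (views, sales, votes), swapping the constant for that value is the "makes search so much better" upgrade the comment is pointing at.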
Lucene was written for public search engines like Google or DuckDuckGo (the latter is actually based on Lucene and Solr).
Lucene and Lucene-like projects (Tantivy in Rust, or Bleve in Go) are general-purpose search libraries. They can handle enormous datasets, and you can run very complex queries on them (for example, compute the average age of people named Karl across a certain type of document).
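The kind of aggregation query mentioned above, spelled out on plain Python data just to make the idea concrete (a real Lucene/Tantivy setup would run this against an index, not a list):

```python
# Filter + aggregate: "average age of people named Karl in documents of
# a certain type". Search libraries run this over an index; the logic is
# the same.

people = [
    {"name": "Karl", "age": 30, "kind": "report"},
    {"name": "Karl", "age": 40, "kind": "report"},
    {"name": "Ada",  "age": 36, "kind": "report"},
]

karl_ages = [p["age"] for p in people
             if p["name"] == "Karl" and p["kind"] == "report"]
print(sum(karl_ages) / len(karl_ages))  # -> 35.0
```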
These libraries are based on the tf-idf (term frequency-inverse document frequency) algorithm and handle typos quite poorly, for example (unless you set up your indexing to parse documents differently).
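A minimal tf-idf computation, just to make the acronym concrete (real engines use refined variants like BM25, but the core intuition is the same): a term scores high in a document when it is frequent there but rare across the corpus.

```python
import math

# Minimal tf-idf over a toy corpus. tf = how often the term appears in
# this document; idf = how rare the term is across all documents.

docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_index):
    doc = tokenized[doc_index]
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in tokenized if term in d)   # document frequency
    idf = math.log(len(tokenized) / df)
    return tf * idf

# "cat" appears in only one document, so it outscores the ubiquitous
# "the" even though "the" occurs more often in the document itself.
print(tf_idf("cat", 0) > tf_idf("the", 0))  # -> True
```

Note that nothing in this scoring knows about typos: "cta" simply has df = 0, which is exactly why typo tolerance needs a separate mechanism like a Levenshtein automaton or n-gram indexing.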
Toshi is to Tantivy what Elastic is to Lucene: it provides sharding and a JSON-over-HTTP API.
You can use Lucene and its derivatives for basically any search-related project, but you may have to dive into how it works and understand concepts like tokenization or n-grams to tune it to your needs.
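A quick illustration of those two concepts (simplified; real analyzers also handle punctuation, stemming, stop words, etc.): tokenization splits text into terms, and character n-grams index overlapping slices of each term, which is one way to get substring and typo-ish matching.

```python
# Tokenization: split text into lowercase terms.
def tokenize(text):
    return text.lower().split()

# Character n-grams: overlapping slices of a term. Indexing these lets
# a query like "sear" or a slightly misspelled term still share grams
# with the original word.
def ngrams(term, n=3):
    if len(term) < n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]

print(tokenize("Instant Search"))  # -> ['instant', 'search']
print(ngrams("search"))            # -> ['sea', 'ear', 'arc', 'rch']
```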
On the other hand, MeiliSearch (and I guess Typesense, but I can't speak for them) focuses on a subset of what you could build with Lucene or Elastic.
It is a fully functional RESTful API, made for instant search or search-as-you-type. The algorithms behind MeiliSearch are simply different: an inverted index with a Levenshtein automaton to handle typos, then a bucket sort you can tune to rank the returned documents.
The aim is to provide an easier go-to solution to implement for customer-facing search.
You won't be able to run super complex queries on terabytes of data. We just make super fast and ultra relevant search for end users.
TypeSense and MeiliSearch focus on the same usage; we chose Rust for performance, security, and the modern ecosystem that will allow easier maintenance :D
@fulmicoton and @tpayet, I'm wondering: if I want both full-text search and faceted search, then Tantivy can do that, but at this time MeiliSearch (and Typesense) don't?
@tpayet, for a database like MeiliSearch, is faceted search typically always off-topic? Or are you thinking about adding faceted search later on?
(My use case is 1) full-text search, 2) typo friendly, 3) e.g. "begin" matching also "began" and "begun", and "run" also matching "running", 4) in all languages, and 5) faceted search restricted to tags, categories, and user groups.)
MeiliSearch does not offer faceted search yet. It is one of the key features we are still missing, but we plan to work on it in the coming weeks.
For your use case today, I suggest you use Typesense if it fits your needs (they handle faceted search already), or Tantivy or Toshi.
To handle different languages, you should create one index per language. MeiliSearch and Tantivy handle kanji!
We will add faceted search in the coming weeks if you are not in a hurry :D
As of now, my project uses ElasticSearch — it works fine, but it wants lots of RAM, which I find slightly annoying, ... and the new Java v9, 10, 11, etc. feels a bit worrying. — So, for now, I'm just staying up-to-date with new Rust-based search engines.
The fat-tail queries go against PostgreSQL and the long-tail queries go against Solr. For shorter queries, PostgreSQL takes precedence. The long tail fills in Instant Answers where nothing else catches.
Yes :) Valerian Saliou, the maintainer of Sonic, is a friend of ours. He built Sonic mainly for his company (crisp.chat), and compared to MeiliSearch there is no relevancy ranking.
Sonic is "just" an inverted index with a Levenshtein automaton: it returns only the IDs of the documents that contain your query's words. You then have to retrieve the full documents from your main database, and only then can you apply some relevancy ranking yourself.
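A sketch of that flow with toy in-memory structures (all names here are made up; Sonic's actual protocol and storage differ): the search layer hands back ids only, and hydration plus any ranking happen in your own code against the primary store.

```python
# Sonic-style flow: the index maps terms to document ids; the full
# documents live only in the primary database.

inverted_index = {          # term -> ids of documents containing it
    "rust": {1, 3},
    "search": {2, 3},
}
database = {                # primary store holding the full documents
    1: {"id": 1, "title": "Rust in Action"},
    2: {"id": 2, "title": "Search Basics"},
    3: {"id": 3, "title": "Rust Search Engines"},
}

def query(terms):
    # Step 1: the "search engine" returns only matching ids.
    ids = set.intersection(*(inverted_index.get(t, set()) for t in terms))
    # Step 2: hydrate the ids from the primary database.
    docs = [database[i] for i in ids]
    # Step 3: any relevancy ranking is your job (here: a trivial sort).
    return sorted(docs, key=lambda d: d["id"])

print([d["title"] for d in query(["rust", "search"])])  # -> ['Rust Search Engines']
```

The upside is a very small and fast index; the downside, as noted, is that every result set costs you an extra round trip to the database plus your own ranking code.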
I have played around with Lucene a lot, and it seems Typesense is a very close match to its feature set, apart from the REST interface on top.
Was the decision not to use the mature Lucene platform a technical one?
The memory and hardware requirements of Lucene are quite small, even if Elastic or Solr leave a very different impression.
Glad to see a solution positioning itself a bit leaner than Solr/Elastic, though; they really are a bit heavy for many occasions.
Yes, for typo correction + instant search, Lucene definitely is not fast enough on large datasets. There are also some limitations with fuzzy searching when you also want to sort/rank documents at the same time. Lucene is also a very generic, mature library for a wider set of use cases.
What do you mean, not fast enough? Lucene/Elasticsearch/Solr are routinely used on datasets that are way beyond the scope of what this or Algolia is used on (i.e. petabyte scale). I've worked with teams indexing billions of documents. At that scale, measuring ranking quality is a much bigger concern than raw performance, which is just a matter of throwing more hardware at it (i.e. it's more of a cost concern than a performance concern). When it comes to ranking quality, a one-size-fits-all, non-sharded solution like this is definitely not going to stand up to much scrutiny. Either it fits, or more likely it doesn't and you need the knobs to tune it (which Lucene provides you plenty of).
Most of the tricks that Algolia and (presumably) this product use to be fast have more to do with how they manage memory than with their implementation language. Basically, Algolia is a non-sharded search index that is loaded into memory in its entirety. They don't have to worry about disk seeks, file-cache misses, etc. Not a problem when your entire index fits in memory. Of course, that puts an upper limit on what they can index and limits the uses to smallish datasets that in no way pose any challenge whatsoever to Lucene (especially when your index easily fits in memory).
I'm not saying it's a bad approach. I think it's a great approach for smallish data sets with very loose ranking requirements, where a one-size-fits-all solution does the job and is good enough. For many companies, search is not really core to their experience; it just needs to be idiot proof and low hassle to set up for their non-expert tech teams and product managers.
I agree with everything you've said. I was a bit sloppy in not mentioning the trade-offs. My "fast enough" remark was certainly not about handling large datasets. The parent comment had asked why Lucene was not used for Typesense, and my response was specifically to that point. Typesense targets a different set of use cases involving small-to-medium datasets where an interactive search experience is important.
Lucene is nice because it has a pretty well-defined plugin architecture: you can write a plugin that gets a parse tree, mutates the tree (to add fuzziness to the matching, for example), and then passes the tree to the next stage. It's a nicely composable way to extend functionality.
> when 1 million Hacker News titles are indexed along with their points, Typesense consumes 165 MB of memory. The same size of that data on disk in JSON format is 88 MB.
I like the compact filter_by / sort_by syntax with qualifiers.
I'm new to search libraries (frameworks?) but have been looking for something to use for a huge data dump I'm working with.
Storing everything in memory is fast, but it seems like it'd be quite the resource hog on a server -- is that a normal approach to take?
It's reassuring that the examples and documentation all revolve around books (as my data set is actually ~55 million books too), but since their data set seems to be quite a small subset of that, I worry about how well this scales, and I don't know enough about search libs to even evaluate that.
Is there a good place to start learning about what kinds of situations Typesense works best in (besides needing a Levenshtein-based search), versus what kinds of situations it wouldn't work well in (and perhaps what other libraries would work better)?
Typesense's primary focus is speed and developer convenience. It makes the assumption (which holds perhaps 99% of the time) that memory is cheap enough for indexing most datasets, especially when weighed against development time and the benefits of a solid search user experience.
Other libraries like Elastic offer more customization but also have a steeper learning curve.
Talking about fastest time to market, this is the biggest factor, more so than setting up Elastic, which, annoying as it is, is still faster than building the UI.
There is a bug in the demo search box on your home page: if no search results are found (either due to an empty string or no results for the search term), it displays "undefined result. Page 1 of NaN".
Looks great. One of Algolia's strongest features is InstantSearch for vanilla JS, React, Vue, Angular, iOS, and Android. Hopefully there can be this level of support for Typesense.
- if using an ORM, add a hook in your ORM model to update the search index whenever a database entry is updated/created/deleted.
- if not using an ORM, update your REST API/views/any code that does CRUD to update the search index after a successful data update.
- create a command-line tool that syncs all existing data to the search index. It will probably only be used a couple of times, when initializing the search index with existing data, but it's pretty handy for testing purposes.
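The first bullet can be sketched like this, using a toy model class rather than a real ORM (with something like SQLAlchemy or Django you would attach the same logic to the framework's save/delete signals instead; all names here are hypothetical):

```python
# Toy in-memory search index standing in for a real engine client.
class SearchIndex:
    def __init__(self):
        self.docs = {}

    def upsert(self, doc):
        self.docs[doc["id"]] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

class Article:
    index = SearchIndex()

    def __init__(self, id, title):
        self.id, self.title = id, title

    def save(self):
        # ... write to the primary database here ...
        # Hook: mirror every successful save into the search index.
        Article.index.upsert({"id": self.id, "title": self.title})

    def delete(self):
        # ... delete from the primary database here ...
        Article.index.delete(self.id)

a = Article(1, "Hello")
a.save()
print(1 in Article.index.docs)  # -> True
a.delete()
print(1 in Article.index.docs)  # -> False
```

The third bullet (the one-off sync tool) is then just a loop over all existing rows calling the same `upsert` path, which is also a convenient way to rebuild the index from scratch during testing.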