Typesense: Open-Source Alternative to Algolia (github.com/typesense)
447 points by karterk on Jan 29, 2020 | 54 comments

What's a common approach for keeping the index up to date? A live ETL pipeline from the DB to the search engine doesn't sound simple. Another method I can think of, after the existing data has been loaded, is to send every CRUD operation to both the database and the search engine at the same time, but that's a lot of work too if you don't already have an HTTP API and are doing mostly server-side-rendered HTML.

Pretty cool. How does it compare to the Rust counterpart, https://crates.meilisearch.com/ ?

Apart from being written in Rust, MeiliSearch (https://github.com/meilisearch/meilisearch) differs mostly in its use of a bucket sort to rank the documents retrieved from the index.

Both MeiliSearch and Typesense use an inverted index with a Levenshtein automaton to handle typos, but they differ when it comes to sorting documents:

- Typesense uses a default_sorting_field on each document, which means that before indexing your documents you need to compute a relevancy score for Typesense to be able to sort them based on your needs (https://typesense.org/docs/0.11.1/guide/#ranking-relevance)

- MeiliSearch, on the other hand, uses a bucket sort, which means there is a default relevancy algorithm based on the proximity of words in the documents, the fields in which the words are found, and the number of typos (https://docs.meilisearch.com/guides/advanced_guides/ranking....). You can still add your own custom rules if you want to alter the default search behavior.

> Typesense uses a default_sorting_field on each document, which means that before indexing your documents you need to compute a relevancy score for Typesense to be able to sort them based on your needs

This is not entirely true. You can just use a field with a constant value (say, 100) as the default sorting field, and Typesense will fall back to text-based relevancy. Please update your comment.

The reason Typesense does insist on one, though, is that it's always a good idea to have a field that indicates the popularity of a record (or a proxy for it). It makes search so much better.
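For illustration, a collection schema along those lines might look like the sketch below. It uses the Typesense collection-schema shape from the docs linked above; the collection name and the constant `popularity` field are made up:

```javascript
// Hypothetical schema: every document carries popularity = 100, so ties
// are broken purely by text-based relevancy rather than a real metric.
const schema = {
  name: 'posts',
  fields: [
    { name: 'title', type: 'string' },
    { name: 'popularity', type: 'int32' }
  ],
  default_sorting_field: 'popularity'
};
```

Once you do have a real popularity signal (votes, views, sales), you would index that instead of the constant.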

How would you say Typesense and MeiliSearch compare with Tantivy + Toshi?

(those two are a bit like Lucene + ElasticSearch — but written in Rust)

Lucene was written for public search engines like Google or DuckDuckGo (which is actually based on Lucene and Solr).

Lucene and Lucene-like projects (Tantivy, or Bleve in Golang) are general-purpose search libraries. They can handle enormous datasets, and you can make very complex queries on them (for example, computing the average age of people named Karl in a certain type of document).

These libraries are based on the tf-idf (term frequency-inverse document frequency) algorithm and handle typos quite poorly, for example (unless you set up your indexing to parse documents differently).
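For readers unfamiliar with tf-idf, here is a toy sketch of the textbook formula (not Lucene's exact scoring, which adds length normalization and other refinements); the corpus is made up:

```javascript
// Tiny corpus of whitespace-tokenized "documents".
const docs = [
  'karl is a developer',
  'karl writes search engines',
  'anna reads documents'
];

// Term frequency: how often the term appears in one document.
function tf(term, doc) {
  const words = doc.split(' ');
  return words.filter(w => w === term).length / words.length;
}

// Inverse document frequency: rarer terms across the corpus score higher.
function idf(term, corpus) {
  const n = corpus.filter(d => d.split(' ').includes(term)).length;
  return Math.log(corpus.length / (1 + n));
}

function tfidf(term, doc, corpus) {
  return tf(term, doc) * idf(term, corpus);
}
```

Note that a single-character typo ("serch") scores zero against every document here, which is the weakness the comment is pointing at.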

Toshi is to Tantivy what Elastic is to Lucene: it provides sharding and a JSON-over-HTTP API.

You can use Lucene and its derivatives for basically any search-related project, but you may have to dive into how it works and understand concepts like tokenization or n-grams to tune it to your needs.

On the other hand, MeiliSearch (and I guess Typesense, but I cannot speak for them) focuses on a subset of what you could build with Lucene or Elastic.

It is a fully functional RESTful API made for instant search, or search-as-you-type. The algorithms behind MeiliSearch are simply different: an inverted index with a Levenshtein automaton to handle typos, then a bucket sort you can tune to rank the returned documents. The aim is to provide an easier go-to solution for implementing customer-facing search.

You won't be able to make super complex queries on terabytes of data. We just make super fast and ultra relevant search for end users.
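To make the "Levenshtein automaton plus bucket sort" idea concrete, here is a toy sketch that groups candidate words into buckets by edit distance to the query (a real automaton avoids computing the full distance matrix per word; the candidate list is made up):

```javascript
// Classic dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Bucket sort: exact matches first, then one typo, then two, etc.
// Within a bucket, insertion order stands in for the other ranking rules.
function rankByTypos(query, candidates, maxTypos = 2) {
  const buckets = Array.from({ length: maxTypos + 1 }, () => []);
  for (const word of candidates) {
    const d = levenshtein(query, word);
    if (d <= maxTypos) buckets[d].push(word);
  }
  return buckets.flat();
}
```

In a real engine, later ranking rules (word proximity, field importance) reorder documents within each typo bucket.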

Typesense and MeiliSearch focus on the same usage; we chose Rust for performance, security, and the modern ecosystem that will allow easier maintenance :D

tantivy main dev here. Just chiming in to confirm this is an accurate answer.

Thanks @tpayet and @fulmicoton for the info :- )

@fulmicoton and @tpayet, I'm thinking then that if I want both full text search and faceted search, Tantivy can do that, but at this time MeiliSearch (and Typesense) can't?

(When I look here: https://github.com/meilisearch/MeiliSearch in the features list, I see no mention of faceted search, whilst Tantivy does list faceted search as a feature: https://github.com/tantivy-search/tantivy )

@tpayet, for a database like MeiliSearch, is faceted search typically always off-topic? Or are you thinking about adding faceted search later on?

(My use case is 1) full text search, 2) typo friendly, 3) e.g. "begin" also matching "began" and "begun", and "run" also matching "running", 4) in all languages, and 5) faceted search restricted to tags, categories, and user groups.)

Typesense does support faceted search. Look for the `facet_by` example in this section: https://typesense.org/docs/0.11.1/api/#search-collection
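For a sense of the shape, search parameters with faceting might look like the sketch below (following the `facet_by` parameter from the docs section linked above; the field names are made up):

```javascript
// Hypothetical faceted search parameters: facet_by returns per-value
// counts for the listed fields, and filter_by drills into one value.
const facetedSearch = {
  q: 'harry',
  query_by: 'title',
  facet_by: 'genre,authors',
  filter_by: 'genre:fantasy'
};
```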

Thanks! Based on what I read about Typesense, I'm thinking this faceted search happens in-memory (so one would want a fair amount of RAM).

You are welcome.

MeiliSearch does not offer faceted search yet. It is one of the key features that we are still missing, but we plan to work on it in the coming weeks.

For your use case today, I suggest you use Typesense if it fits your needs (they handle faceted search already), or Tantivy or Toshi.

To manage different languages, you should make one index per language. MeiliSearch and Tantivy handle kanji! We will add faceted search in the coming weeks if you are not in a hurry :D

Thanks for the info :- )

As of now, my project uses ElasticSearch — it works fine, but it wants lots of RAM, which I find slightly annoying, ... and the new Java v9, 10, 11, etc. feels a bit worrying. — So for now I'm just keeping up-to-date with the new Rust-based search engines.

DuckDuckGo isn't based on Lucene/Solr, but on the Bing API.

My bad, my information could be outdated.

Based on http://highscalability.com/blog/2013/1/28/duckduckgo-archite...:

The fat tail queries go against PostgreSQL and the long tail queries go against Solr. For shorter queries PostgreSQL takes precedence. Long tail fills in Instance Answers where nothing else catches.

It seems that Bing is now a part of their sources indeed: https://help.duckduckgo.com/results/sources

Both are used. Source: Torsten Raudssus who works for DDG as developer liaison.

You're saying they do web search in Lucene/Solr? How many TB do they have in Solr?

Have you contrasted MeiliSearch with Sonic?

Yes :) Valerian Saliou, the maintainer of Sonic, is a friend of ours. He built Sonic mainly for his company (crisp.chat), and compared to MeiliSearch there is no relevancy ranking.

Sonic is "just" an inverted index with a Levenshtein automaton: it returns only the IDs of the documents containing your query's words, and then you have to retrieve the full documents from your main database, and only then can you apply some relevancy ranking yourself.
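The two-step flow described above can be sketched like this (both stores are stand-ins; a real setup would use the Sonic client and your actual database):

```javascript
// Primary store keyed by document ID (stand-in for your real database).
const primaryDb = new Map([
  [1, { id: 1, title: 'Search basics' }],
  [2, { id: 2, title: 'Advanced search' }],
]);

// Pretend these IDs came back from the search index for a query.
const matchingIds = [2, 1];

// Step 1: hydrate the full documents from the primary store.
// Step 2: apply your own relevancy ranking (here: alphabetical by title).
const results = matchingIds
  .map(id => primaryDb.get(id))
  .sort((a, b) => a.title.localeCompare(b.title));
```

An engine with built-in ranking (MeiliSearch, Typesense) folds step 2 into the query itself.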

Thanks for sharing this info! You may want to add these useful comparisons to your Readme. I will keep an eye on MS and wish you and team the best.

Thank you so much, we will definitely add these comparisons!

Looks really cool.

I have played around with Lucene a lot, and it seems Typesense is a very close match to its feature set, apart from the REST interface on top.

Was the decision not to use the mature Lucene platform technical? The memory and hardware requirements of Lucene are quite small, even if Elastic or Solr leave a very different impression.

Glad to see a solution positioning itself a bit leaner than Solr/Elastic though, they really are a bit heavy for many occasions.

Yes, for typo correction + instant search, Lucene definitely is not fast enough on large datasets. There are also some limitations with fuzzy searching when you also want to sort/rank documents at the same time. Lucene is also a very generic, mature library for a wider set of use cases.

What do you mean, not fast enough? Lucene/Elasticsearch/Solr are routinely used on datasets that are way beyond the scope of what this or Algolia is used on (i.e. petabyte scale). I've worked with teams indexing billions of documents. At that scale, measuring ranking quality is a much bigger concern than raw performance, which is just a matter of throwing more hardware at it (i.e. it's more of a cost concern than a performance concern). When it comes to ranking quality, a one-size-fits-all, non-sharded solution like this is definitely not going to stand up to much scrutiny. Either it fits, or more likely it doesn't and you need the knobs to tune it (which Lucene provides you plenty of).

Most of the tricks that Algolia and (presumably) this product use to be fast have more to do with how they manage memory than with their implementation language. Basically, Algolia is a non-sharded search index that is loaded into memory in its entirety. They don't have to worry about disk seeks, file cache misses, etc. Not a problem when your entire index fits in memory. Of course, that puts an upper limit on what they can index and limits the uses to smallish datasets that in no way pose any challenge whatsoever to Lucene (especially when your index easily fits in memory).

I'm not saying it's a bad approach. I think it's a great approach for smallish data sets with very loose ranking requirements where a one size fits all solution does the job and is good enough. For many companies search is not really core to their experience and it just needs to be idiot proof and low hassle to set up for their non expert tech teams and product managers.

I agree with everything you've said. I was a bit sloppy in not mentioning the trade-offs. My "fast enough" remark was certainly not about handling large datasets. The parent comment had asked why Lucene was not used for Typesense, and my response was specifically to that point. Typesense targets a different set of use cases involving small-to-medium datasets where an interactive search experience is important.

Lucene is nice because it has a pretty well-defined plugin architecture where you can write a plugin which gets a parse tree, mutate the tree (to add fuzziness to the matching, for example), and then pass the tree to the next stage. It's a nicely composable way to extend functionality.

Good to know about the memory efficiency!

> when 1 million Hacker News titles are indexed along with their points, Typesense consumes 165 MB of memory. The same size of that data on disk in JSON format is 88 MB.

I like the compact filter_by and sort_by with qualifiers.

    let searchParameters = {
      'q': 'harry',
      'query_by': 'title',
      'filter_by': 'publication_year:<1998',
      'sort_by': 'publication_year:desc'
    }
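To see what those parameters mean, here is a toy in-memory evaluation of the same query against a few documents (this mimics the semantics, not Typesense's actual engine; the book data is made up):

```javascript
const books = [
  { title: "Harry Potter and the Philosopher's Stone", publication_year: 1997 },
  { title: 'Harry Potter and the Goblet of Fire', publication_year: 2000 },
  { title: 'The Hobbit', publication_year: 1937 },
];

const results = books
  .filter(b => b.title.toLowerCase().includes('harry'))    // q + query_by: title
  .filter(b => b.publication_year < 1998)                  // filter_by
  .sort((a, b) => b.publication_year - a.publication_year); // sort_by: desc
```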

I'm new to search libraries (frameworks?) but have been looking for something to use for a huge data dump I'm working with.

Storing everything in memory seems fast, but seems like it'd be quite the resource hog on a server -- is that a normal approach to take?

It's reassuring that the examples and documentation all revolve around books (my data set is actually ~55 million books too), but since their data set seems to be quite a small subset of that, I worry about how well this scales, and I don't know enough about search libs to even evaluate that.

Is there a good place to start learning about what kinds of situations Typesense works best in (besides needing a Levenshtein-based search), versus what kinds of situations it wouldn't work well in (and perhaps what other libraries would work better)?

Typesense's primary focus is speed and developer convenience. It makes the assumption (which is true perhaps 99% of the time) that memory is cheap enough for indexing most datasets, especially when weighed against development effort and the benefits of a solid search user experience.

Other libraries like Elastic offer more customization but also have a steeper learning curve.

Is it compatible with InstantSearch.js, or ReactiveSearch? https://github.com/appbaseio/reactivesearch

Talking about fastest time to market, this is the biggest factor: setting up Elastic, annoying as it is, is still faster than creating the UI.

Not at the moment, but we have an equivalent integration planned shortly. Totally agree with you that building a search UI is still a pain.

An integration with something existing or a new competitor to the projects I mentioned?

Likely an integration with an existing popular UI search library.

A dream come true. Something I've been looking for, for a long time now. Thank you for sharing this

Bit of a bad look that I can't search the docs using Typesense.

There's also https://vespa.ai/ from (former) Yahoo, which I think knows a thing or two about search.

Any support for languages other than English?

Does it do normalization as part of the typo search (in case of missed/incorrect accent marks, etc.)?

Does it do stemming at all? For English or other languages? (i.e., I search for "run" and you show me documents for "running", or the other way around.)

Any support for Chinese text (which typically doesn't have whitespace between words)?

We support English and other European languages (fuzzy search normalizes accented characters).

While it does not support stemming, fuzzy prefix matching will largely cover that, and is practically more useful.
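The intuition, sketched very roughly: many English inflections share a prefix with their stem, so a prefix match catches them without a stemmer, while irregular forms (like "began"/"begin") do not:

```javascript
// Toy prefix match: true if either string is a prefix of the other.
// Real engines combine this with typo tolerance, which this omits.
function prefixMatch(query, word) {
  return word.startsWith(query) || query.startsWith(word);
}
```

So "run" finds "running", but "began" would not find "begin", which is the limitation relative to true stemming.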

No typo or fuzzy correction for Chinese text yet.

It's written in C++, and the code is simple enough to skim. I would have expected this to be some hefty Java thing.

There is a bug in the demo search box on your home page: if no search results are found (either due to an empty string or no results for the search term), it displays "undefined result. Page 1 of NaN".

Thanks, this has been fixed.

Looks great. One of Algolia's strongest features is InstantSearch for vanilla JS, React, Vue, Angular, iOS and Android. Hopefully there can be this level of support for Typesense

Instant search isn't that hard to build a small frontend for when you have the API, though.

Definitely going to work on that.

Would love to see index restricted API keys comparable to https://www.algolia.com/doc/api-reference/api-methods/genera...

Looks very nice. The readme and website demonstrate and explain it well, good job!

Is it completely in memory? Seems like that from the readme

Why is the Hacker News title mentioning "...alternative to Algolia" instead of other open-source, self-hosted solutions?

When I think in-app search, I usually think Algolia first but then ElasticSearch and Solr immediately after.

I think mentioning any of them would be okay.

How well does this work for searching CJK (Chinese, Japanese, Korean) string fields?

Looks great! Anyone know if I can try this on Heroku somehow?

Not at the moment, but I've added this to our todo list. Looks like we need to write a custom buildpack.

Small world: developed by my colleague's brother :)

I’ve never managed a separate search database. How do you keep records in sync with your application database?

This is what I usually do with elasticsearch:

- If using an ORM, add a hook in your ORM model to update the search database whenever a database entry is updated/created/deleted.

- If not using an ORM, update your REST API/view/any code that does CRUD to update the search index after a successful data update.

- Create a command-line tool that syncs all existing data to the search index. It's probably only used a couple of times, when initializing the search index with existing data, but it's pretty handy for testing purposes.
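The first approach above can be sketched generically like this (the Model class and hook name are made up, standing in for whatever your ORM provides, e.g. Sequelize or ActiveRecord callbacks; the Map stands in for a real search client):

```javascript
// Search index stand-in: a real setup would call the search engine's API.
const searchIndex = new Map();

// Minimal ORM-like model with an after-save hook.
class Model {
  constructor() { this.afterSaveHooks = []; }
  afterSave(fn) { this.afterSaveHooks.push(fn); }
  save(record) {
    // 1. Write to the primary database (omitted here).
    // 2. Fire hooks so the search index stays in sync with every write.
    this.afterSaveHooks.forEach(fn => fn(record));
    return record;
  }
}

const users = new Model();
users.afterSave(record => searchIndex.set(record.id, record));
users.save({ id: 1, name: 'Ada' });
```

The same pattern applies to delete hooks, and the one-off CLI sync tool is just the hook body run over a full table scan.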
