
Typesense: Open-Source Alternative to Algolia - karterk
https://github.com/typesense/typesense
======
ysleepy
Looks really cool.

I have played around with Lucene a lot, and Typesense seems to be a very close
match to its feature set, apart from the REST interface on top.

Was the decision not to use the mature Lucene platform a technical one? The
memory and hardware requirements of Lucene are quite small, even if Elastic or
Solr leave a very different impression.

Glad to see a solution positioning itself a bit leaner than Solr/Elastic,
though; they really are a bit heavy for many occasions.

~~~
karterk
Yes, for typo correction + instant search, Lucene is definitely not fast
enough on large datasets. There are also some limitations with fuzzy searching
when you also want to sort/rank documents at the same time. Lucene is also a
very generic, mature library aimed at a wider set of use cases.

~~~
jillesvangurp
What do you mean not fast enough? Lucene/Elasticsearch/Solr are basically
routinely used on datasets that are way beyond the scope of what this or
Algolia is used on (i.e. petabyte scale). I've worked with teams indexing
billions of documents. At that scale measuring ranking quality is a much
bigger concern than raw performance, which is just a matter of throwing more
hardware at it (i.e. it's more of a cost concern than a performance concern).
When it comes to ranking quality, a one-size-fits-all, non-sharded solution
like this is definitely not going to stand up to much scrutiny. Either it
fits, or, more likely, it doesn't and you need the knobs to tune it (which
Lucene provides in plenty).

Most of the tricks that Algolia and (presumably) this product do to be fast
have more to do with how they manage memory than with their implementation
language. Basically, Algolia is a non-sharded search index that is loaded into
memory in its entirety. They don't have to worry about disk seeks, file cache
misses, etc. Not a problem when your entire index fits in memory. Of course
that puts an upper limit on what they can index and limits the uses to
smallish datasets that in no way pose any challenge whatsoever to Lucene
(especially when your index easily fits in memory).

I'm not saying it's a bad approach. I think it's a great approach for smallish
data sets with very loose ranking requirements where a one size fits all
solution does the job and is good enough. For many companies search is not
really core to their experience and it just needs to be idiot proof and low
hassle to set up for their non expert tech teams and product managers.

~~~
karterk
I agree with everything you've said. I was a bit sloppy in not mentioning the
trade-offs. My "fast enough" remark was certainly not about handling large
datasets. The parent comment asked why Lucene was not used for Typesense, and
my response was specifically to that point. Typesense targets a different set
of use cases involving small-to-medium datasets where an interactive search
experience is important.

------
ronlobo
Pretty cool. How does it compare to its Rust counterpart,
[https://crates.meilisearch.com/](https://crates.meilisearch.com/)?

~~~
tpayet
Apart from being written in Rust, MeiliSearch
([https://github.com/meilisearch/meilisearch](https://github.com/meilisearch/meilisearch))
differs mostly in its use of a bucket sort to rank the documents retrieved
from the index.

Both MeiliSearch and Typesense use an inverted index with a Levenshtein
automaton to handle typos, but they differ when it comes to sorting documents:

\- Typesense uses a default_sorting_field on each document, which means that
before indexing your documents you need to compute a relevancy score for
Typesense to be able to sort them based on your needs
([https://typesense.org/docs/0.11.1/guide/#ranking-relevance](https://typesense.org/docs/0.11.1/guide/#ranking-relevance))

\- MeiliSearch, on the other hand, uses a bucket sort, which means there is a
default relevancy algorithm based on the proximity of words in the documents,
the fields in which the words are found, and the number of typos
([https://docs.meilisearch.com/guides/advanced_guides/ranking....](https://docs.meilisearch.com/guides/advanced_guides/ranking.html#ranking-rules)).
And you can still add your own custom rules if you want to alter the default
search behavior.
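The two strategies above can be contrasted in a toy Python sketch. This is purely illustrative: it is neither engine's actual code, and all field names (relevancy_score, typos, word_proximity, field_rank) are made up for the example.

```python
def rank_with_sorting_field(hits):
    # Typesense-style: each document carries a relevancy score computed
    # *before* indexing (the default_sorting_field); ranking is one sort.
    return sorted(hits, key=lambda h: h["relevancy_score"], reverse=True)

def rank_with_buckets(hits):
    # Bucket-sort-style: apply ranking rules one after another; each rule
    # only reorders documents that the previous rules left tied.
    rules = [
        lambda h: h["typos"],            # fewer typos first
        lambda h: h["word_proximity"],   # closer query words first
        lambda h: h["field_rank"],       # better field (e.g. title) first
    ]
    buckets = [hits]
    for rule in rules:
        next_buckets = []
        for bucket in buckets:
            bucket.sort(key=rule)
            # split the bucket into groups that tie on this rule
            groups = {}
            for h in bucket:
                groups.setdefault(rule(h), []).append(h)
            next_buckets.extend(groups[k] for k in sorted(groups))
        buckets = next_buckets
    return [h for bucket in buckets for h in bucket]
```

The practical difference: the first approach pushes the ranking decision to index time (you must precompute the score), while the second decides at query time from per-match signals.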

~~~
KajMagnus
How would you say Typesense and MeiliSearch compare with Tantivy + Toshi?

(those two are a bit like Lucene + ElasticSearch — but written in Rust)

~~~
tpayet
Lucene was written for public search engines like Google or DuckDuckGo (which
is actually based on Lucene and Solr).

Lucene and Lucene-like projects (Tantivy, or Bleve in Go) are general-purpose
search libraries. They can handle enormous datasets, and you can run very
complex queries on them (compute the average age of people named Karl in a
certain type of document, for example).

These libraries are based on the tf-idf (term frequency-inverse document
frequency) algorithm and handle typos quite poorly, for example (unless you
set up your indexing differently so documents are parsed correctly).
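For reference, tf-idf scoring in its simplest form looks something like the toy Python sketch below (this is the textbook formula, not Lucene's actual scoring, which adds normalization and, in recent versions, defaults to BM25). Note how an exact term match is required, which is exactly why typo tolerance needs extra machinery on top:

```python
import math

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]
tokenized = [d.split() for d in docs]

def tfidf(term, doc_tokens, all_docs):
    # term frequency: how often the term appears in this document
    tf = doc_tokens.count(term) / len(doc_tokens)
    # inverse document frequency: rarer terms score higher
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / (1 + df)) + 1
    return tf * idf

# Exact string match required: a typo like "quikc" scores 0 everywhere.
scores = [tfidf("quick", d, tokenized) for d in tokenized]
```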

Toshi is like Elasticsearch for Tantivy: it provides sharding and a JSON-over-HTTP API.

You can use Lucene and its derivatives for basically any search-related
project, but you may have to dive into how they work and understand concepts
like tokenization or n-grams to tune them to your needs.

On the other hand, MeiliSearch (and I guess Typesense, but I cannot speak for
them) focuses on a subset of what you could build with Lucene or Elastic.

It is a fully functional RESTful API, made for instant search or search-as-
you-type. The algorithms behind MeiliSearch are simply different: an inverted
index, with a Levenshtein automaton to handle typos, then a bucket sort you
can tune for the ranking of the returned documents. The aim is to provide an
easier go-to solution for customer-facing search.

You won't be able to run super complex queries on terabytes of data. We just
aim to make super fast and ultra relevant search for end users.
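As a rough illustration of the inverted-index-plus-typo-tolerance idea described above, here is a toy Python sketch. A real engine compiles the query into a Levenshtein automaton and intersects it with the term dictionary; for clarity this brute-forces the edit-distance check over every indexed term, and the documents are made up:

```python
from collections import defaultdict

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(docs):
    # inverted index: term -> set of document ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def fuzzy_search(index, query, max_typos=1):
    # accept any indexed term within max_typos edits of the query
    hits = set()
    for term, doc_ids in index.items():
        if levenshtein(query.lower(), term) <= max_typos:
            hits |= doc_ids
    return hits

index = build_index({1: "Harry Potter", 2: "The Hobbit", 3: "Hairy tales"})
```

Searching for "harry" with one allowed typo matches both "harry" and "hairy"; a bucket sort like the one described above would then decide which hit ranks first.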

Typesense and MeiliSearch focus on the same use case; we chose Rust for
performance, security, and a modern ecosystem that will allow easier
maintenance :D

~~~
ddorian43
DuckDuckGo isn't based on Lucene/Solr but on the Bing API.

~~~
tpayet
My bad, my information may be outdated.

Based on [http://highscalability.com/blog/2013/1/28/duckduckgo-archite...](http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html):

> The fat tail queries go against PostgreSQL and the long tail queries go
> against Solr. For shorter queries PostgreSQL takes precedence. Long tail
> fills in Instant Answers where nothing else catches.

It seems that Bing is now a part of their sources indeed:
[https://help.duckduckgo.com/results/sources](https://help.duckduckgo.com/results/sources)

------
Scarbutt
What's a common approach for keeping the index up to date? A live ETL from the
DB to the search engine doesn't sound simple. Another method I can think of,
once the existing data has been loaded, is to write directly to both the
database and the search engine every time a user performs a CRUD operation,
but that's a lot of work too if you don't already have an HTTP API and are
doing mostly server-side-rendered HTML.

------
KaoruAoiShiho
Is it compatible with InstantSearch.js? or reactive search?
[https://github.com/appbaseio/reactivesearch](https://github.com/appbaseio/reactivesearch)

Talking about fastest time to market, this is the biggest one, rather than
setting up Elastic, which, annoying as it is, is still faster than creating
the UI.

~~~
jabo
Not at the moment, but we have an equivalent integration planned shortly.
Totally agree with you that building a search UI is still a pain.

~~~
KaoruAoiShiho
An integration with something existing or a new competitor to the projects I
mentioned?

~~~
karterk
Likely an integration with an existing popular UI search library.

------
LrnByTeach
Good to know about the memory efficiency!

> when 1 million Hacker News titles are indexed along with their points,
> Typesense consumes 165 MB of memory. The same size of that data on disk in
> JSON format is 88 MB.

I like the compact filter_by and sort_by with qualifiers:

    let searchParameters = {
      'q': 'harry',
      'query_by': 'title',
      'filter_by': 'publication_year:<1998',
      'sort_by': 'publication_year:desc'
    }

------
drusepth
I'm new to search libraries (frameworks?) but have been looking for something
to use for a huge data dump I'm working with.

Storing everything in memory seems fast, but seems like it'd be quite the
resource hog on a server -- is that a normal approach to take?

It's reassuring that the examples and documentation all revolve around books
(as my data set is actually ~55 million books as well), but since theirs seems
to be quite a small subset of that, I worry about how well this scales, and I
don't know enough about search libs to even evaluate that.

Is there a good place to start learning about what kinds of situations
Typesense works best in (besides needing a Levenshtein-based search), versus
what kinds of situations it wouldn't work well in (and perhaps what other
libraries would work better)?

~~~
karterk
Typesense's primary focus is speed and developer convenience. It makes the
assumption (which is true perhaps 99% of the time) that memory is cheap enough
for indexing most datasets, especially given the development effort saved and
the benefits of a solid search user experience.

Other libraries like Elastic offer more customization but also have a steeper
learning curve.

------
prayze
A dream come true. Something I've been looking for, for a long time now. Thank
you for sharing this.

------
wiradikusuma
There's also [https://vespa.ai/](https://vespa.ai/) from (former) Yahoo, which
I think knows a thing or two about search.

------
ng7j5d9
Any support for languages other than English?

Does it do normalization as part of the typo search (in case of
missing/incorrect accent marks, etc.)?

Does it do stemming at all? For English or other languages? (i.e., I search
for "run" and you show me documents for "running", or the other way around.)

Any support for Chinese text (which typically doesn't have whitespace between
words)?

~~~
karterk
We support English and other European languages (fuzzy search normalizes
accented characters).
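For readers curious what accent normalization typically involves, a common technique is Unicode decomposition followed by dropping combining marks. The Python sketch below shows the general idea using only the standard library; it is not necessarily what Typesense does internally:

```python
import unicodedata

def strip_accents(text):
    # NFD decomposes "é" into "e" + a combining accent mark,
    # which we can then filter out before matching.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```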

While it does not support stemming, fuzzy prefix matching largely covers the
same ground and is practically more useful.

No typo or fuzzy correction for Chinese text yet.

------
tkfu
Bit of a bad look that I can't search the docs using Typesense.

------
lxe
It's written in C++, and the code is simple enough to skim. I would have
expected this to be some hefty Java thing.

------
neurostimulant
There is a bug in the demo search box on your home page: if no search results
are found (either due to an empty string or no results for the search term),
it displays "undefined result. Page 1 of NaN".

~~~
karterk
Thanks, this has been fixed.

------
ericcholis
Looks great. One of Algolia's strongest features is InstantSearch for vanilla
JS, React, Vue, Angular, iOS and Android. Hopefully there can be this level of
support for Typesense

~~~
SahAssar
Instant search isn't that hard to build a small frontend for when you have the
API, though.

------
agentile
Would love to see index-restricted API keys comparable to
[https://www.algolia.com/doc/api-reference/api-methods/genera...](https://www.algolia.com/doc/api-reference/api-methods/generate-secured-api-key/)
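For context, secured keys of this kind are essentially HMAC-signed restriction strings that the server can verify without storing each derived key. The Python sketch below shows the general technique only; it is not Algolia's exact format, not a Typesense feature, and every name in it is illustrative:

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

def generate_secured_key(parent_key, restrictions):
    # sign the restrictions (e.g. which index, which filter) with the
    # parent key, then bundle signature + restrictions into one token
    query = urlencode(sorted(restrictions.items()))
    sig = hmac.new(parent_key.encode(), query.encode(),
                   hashlib.sha256).hexdigest()
    return base64.b64encode((sig + query).encode()).decode()

def verify_secured_key(parent_key, token):
    # recompute the HMAC from the embedded restrictions; reject on mismatch
    decoded = base64.b64decode(token).decode()
    sig, query = decoded[:64], decoded[64:]
    expected = hmac.new(parent_key.encode(), query.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return dict(pair.split("=") for pair in query.split("&"))
```

The nice property is that the backend can mint a browser-safe key per user or per index without the search server needing a database of issued keys.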

------
azhenley
Looks very nice. The readme and website demonstrate and explain it well, good
job!

------
beagle3
Is it completely in memory? Seems like that from the readme

------
ChrisCinelli
Why is the Hacker News title mentioning "...alternative to Algolia" instead of
other open-source, self-hosted solutions?

~~~
veeralpatel979
When I think in-app search, I usually think Algolia first but then
ElasticSearch and Solr immediately after.

I think mentioning any of them would be okay.

------
chocolatkey
How well does this work for searching CJK (Chinese, Japanese, Korean) string
fields?

------
jacquesc
Looks great! Anyone know if I can try this on Heroku somehow?

~~~
jabo
Not at the moment, but I've added this to our todo list. Looks like we need to
write a custom buildpack.

------
vira28
Small world. Developed by my colleague's brother :)

------
GrayTextIsTruth
I’ve never managed a separate search database. How do you keep records in sync
with your application database?

~~~
neurostimulant
This is what I usually do with elasticsearch:

\- If using an ORM, add a hook in your ORM model that updates the search
database whenever a database entry is updated/created/deleted.

\- If not using an ORM, update your REST API/view/any code that does CRUD to
update the search index after a successful data update.

\- Create a command-line tool that syncs all existing data to the search
index. It will probably only be used a couple of times, when initializing the
search index with existing data, but it's pretty handy for testing purposes.
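The hook-into-CRUD pattern above can be sketched without committing to any particular ORM or search client. In this Python sketch the SearchIndex class is a stand-in for a real SDK (Typesense, Elasticsearch, ...), and all names are illustrative:

```python
class SearchIndex:
    # stand-in for a real search-engine client
    def __init__(self):
        self.docs = {}
    def upsert(self, doc_id, doc):
        self.docs[doc_id] = doc
    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

class BookRepository:
    # every write goes through one place, which mirrors the change
    # to the search index after the database write succeeds
    def __init__(self, db, index):
        self.db = db          # dict standing in for the real database
        self.index = index
    def save(self, book):
        self.db[book["id"]] = book           # 1. write to the database
        self.index.upsert(book["id"], book)  # 2. mirror to the index
    def delete(self, book_id):
        self.db.pop(book_id, None)
        self.index.delete(book_id)
    def reindex_all(self):
        # the "command-line tool" case: bulk-sync existing rows
        for book_id, book in self.db.items():
            self.index.upsert(book_id, book)
```

One caveat worth noting: mirroring after each write is best-effort, so a periodic reindex_all (or a retry queue) is what actually keeps the two stores from drifting apart.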

