Oh hey, excited to see Typesense on the front page! Thank you for sharing OP.
Some quick context: we are a small bootstrapped team that's been working on Typesense since 2015. It started out as a nights-and-weekends project, out of personal frustration with Elasticsearch's complexity for doing seemingly simple things. So we set out (maybe naively at the time) to see what it would take to build our own search engine, just to scratch our intellectual curiosity. Over the years, we've realized that it takes a LOT of nuanced effort to build a search engine that works well out of the box.
Our goal with Typesense is to democratize search technology on two fronts:
1. Simplify and reduce the amount of developer effort it takes to build a good search experience that works well out of the box. To this end, we pore over API design to make it intuitive and set sane defaults for all parameters.
2. Make good instant-search technology accessible to individuals and teams of all sizes. To this end, we decided to open source our work and make it completely free to self-host. We also optimize for reducing the operational overhead it takes to deploy Typesense to production (eg: single binary with no runtime dependencies, one-step clustering, etc).
I left my full-time job in 2020, and my co-founder left his a month ago, so we're now both working full-time on Typesense.
Do you have a document that explains the architecture of the product? I searched a bit on your github and website but didn't find anything. Apologies in advance if I've missed something very obvious :-).
We don't have an architecture document at the moment, but here's a high-level summary from @karterk's comment from another thread:
At the heart of Typesense is a `token => documents` inverted index backed by an Adaptive Radix Tree (https://db.in.tum.de/~leis/papers/ART.pdf), which is a memory-efficient implementation of the trie data structure. ART allows us to do fast fuzzy searches on a query.
All indices are stored in-memory, while the documents are stored on disk on RocksDB. All underlying data structures were carefully designed, benchmarked and optimized to exploit cache locality and utilize all cores efficiently.
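To make the idea concrete, here's a minimal Python sketch of a `token => documents` inverted index with prefix lookup, the basis of search-as-you-type results. This is purely illustrative: Typesense's actual implementation is an ART in C++, and the class and method names here are my own invention.

```python
# Illustrative token => document-IDs inverted index built on a plain trie.
# Typesense uses an Adaptive Radix Tree (ART) for memory efficiency;
# this sketch only shows the core idea.

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.doc_ids = set()  # docs containing the full token ending here

class InvertedIndex:
    def __init__(self):
        self.root = TrieNode()

    def index(self, doc_id, text):
        """Insert every whitespace-separated token of a document."""
        for token in text.lower().split():
            node = self.root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.doc_ids.add(doc_id)

    def prefix_search(self, prefix):
        """Return IDs of all documents containing a token with this prefix."""
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return set()
            node = node.children[ch]
        # Collect doc_ids from the whole subtree under the prefix.
        result, stack = set(), [node]
        while stack:
            n = stack.pop()
            result |= n.doc_ids
            stack.extend(n.children.values())
        return result

idx = InvertedIndex()
idx.index(1, "chocolate cake recipe")
idx.index(2, "carrot cake")
idx.index(3, "chocolate mousse")
print(idx.prefix_search("cak"))    # -> {1, 2}
print(idx.prefix_search("choco"))  # -> {1, 3}
```

A real engine would also store token positions for phrase matching and use something like Levenshtein automata over the trie for fuzzy (typo-tolerant) lookups.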
I did benchmark extensively 4-5 years ago, but I don't have those numbers with me. Tries are quite expensive memory-wise by design, but I found that ART gave the best balance between speed (by exploiting cache locality) and memory. The state of the art might have improved by now.
As far as Typesense goes though, I found that the actual posting lists, document listings, and other faceting/sorting-related indexing data structures are where the bigger overhead is, especially for larger datasets.
Thanks for the feedback. My issue is that I allocate only a few MB to my indexing thread, so I'm looking for a more memory-efficient implementation to avoid having to produce and then merge too many segments from disk.
I'm currently considering using compressed pointers on some part of the tree to reduce the memory footprint as much as I can. Let's see how it goes...
Thanks. Typesense takes a different approach to indexing than Algolia.
a) While Algolia (from what is available publicly) has indices which are pre-sorted on a set of ranking factors, Typesense allows dynamic, on-the-fly sorting.
b) Also I believe Algolia memory maps their indices, but Typesense stores the raw JSON documents on-disk and constructs the index from scratch on start-up.
Typesense seems like a good fully-featured alternative to Elasticsearch. I.e. it's basically a database with fuzzy-search features (schemas, fields, facets, ordering, scoring profiles, etc), and its speed is enabled by holding everything in RAM.
If you just want the fuzzy-search part (query string -> list of matching document ids) and don't want to pay for GBs of RAM, sonic [1] seems to be an interesting project. It's very fast (μs) and uses very little RAM but doesn't offer DB-like features such as sorting, schemas/fields, scoring etc. It's more of a low-level primitive for building your own search feature than an integrated search db that's ready to use out of the box.
Thank you for pointing this out. To add to what @karterk said, we had upgraded just one of the 3 nodes in the Typesense cluster powering the home page demo to an internal RC build which has some new changes. So some queries were hitting that new node, and the others were hitting the old node, which is why the results were inconsistent.
I've now updated the demo to send queries to a node that's closest to the user, so it shouldn't jump around any more.
Man, I didn't even realize there was a demo until I read this. Looked all over for it. That "try it" message needs to be a lot more prominent, I think.
How come the search is somewhat stochastic? Deleting the "e" from "cake" and retyping it causes ~ 1 in 4 searches to display pure "cake" entries while the rest show things like "Baked praline ice cream cake".
Sorry about that. We have an RC build with a new feature that we're testing on that cluster at the moment, and unfortunately this demo is hosted on it :-/
Pretty poor demo if it doesn't demonstrate the stable build that a customer would actually be searching against. It would be better to isolate experimentation into a separate service.
I'm curious if there are any projects out there that would search "schemaless" structured data, but still retain some of the structure.
Essentially something like the result of indexing a json row of field/value pairs for a bunch of csvs (with different fields) that would lead to being able to do faceted search on an individual field across rows/datasets, or being able to find the bits related to a needle deep in the dataset haystack.
I'm not sure I see how that would provide faceting, or even how multiple rows would be coerced into something that could match either the dataset or the row.
How would it work if there were incompatible fields in the schemas across different datasets? (Unless it was basically a stringify operation across the whole thing.)
When you set the data type to "auto", the data type is inferred from the first document that you index. When subsequent documents are indexed, we first attempt to coerce their fields to the previously determined data types. For example:
If record1 has {field1: 32, field2: true, field3: '22'}
and record2 has {field1: "32", field2: 'true', field3: 22}
When record1 is indexed, the data type for field1 is set to int, field2 is set to bool and field3 is set to string.
Then when record2 shows up, field1 is coerced to an int, field2 is coerced to a bool and field3 is coerced to a string.
If a coercion is not possible, you can configure Typesense to either ignore the field or return an error.
Re: facets, you can use a regex field name and, for example, set all fields matching `.*_facet` to be facets.
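Here's a small Python sketch of the auto-schema coercion logic described above (illustrative only; the function names are mine, and Typesense's actual implementation is in C++ server-side):

```python
# Illustrative sketch of auto-schema type inference and coercion:
# the first document locks in each field's type, and later documents
# are coerced to those types.

def infer_types(first_doc):
    """Lock in each field's data type from the first indexed document."""
    return {field: type(value) for field, value in first_doc.items()}

def coerce(doc, schema, on_failure="error"):
    """Coerce a later document's fields to the locked-in types.

    on_failure: "error" raises on an impossible coercion,
                "ignore" silently drops the offending field.
    """
    out = {}
    for field, value in doc.items():
        expected = schema[field]
        try:
            if expected is bool and isinstance(value, str):
                # bool("false") would be truthy, so parse strings explicitly
                out[field] = value.strip().lower() == "true"
            else:
                out[field] = expected(value)
        except (ValueError, TypeError):
            if on_failure == "error":
                raise
            # on_failure == "ignore": drop the field
    return out

record1 = {"field1": 32, "field2": True, "field3": "22"}
record2 = {"field1": "32", "field2": "true", "field3": 22}

schema = infer_types(record1)   # field1: int, field2: bool, field3: str
print(coerce(record2, schema))  # -> {'field1': 32, 'field2': True, 'field3': '22'}
```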
I would generate a JSON feed for the site and upload that (I was curious if this can detect duplicate records. Edit: it can, via an upsert), that way you could even have full-text search.
A crawler is probably not the best tool for this if you use some form of site generation, since you can use the site generator itself to produce the information that goes into Typesense.
Exact word queries are on our near-term todo list. We've already added support for exclusion via "-" operator in the last release.
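As a rough sketch of what "-" exclusion does at query time (illustrative only, not Typesense's implementation), the query parser splits tokens into include and exclude sets, then subtracts the excluded postings from the result:

```python
# Illustrative sketch of "-" exclusion at query time
# (not Typesense's actual implementation).

def parse_query(q):
    """Split a query into (include, exclude) token lists."""
    include, exclude = [], []
    for token in q.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        else:
            include.append(token)
    return include, exclude

def search(index, q):
    """index: token -> set of doc IDs. Returns docs matching all
    include tokens and none of the exclude tokens."""
    include, exclude = parse_query(q)
    if not include:
        return set()
    result = set.intersection(*(index.get(t, set()) for t in include))
    for t in exclude:
        result -= index.get(t, set())
    return result

index = {
    "cake":      {1, 2, 3},
    "chocolate": {1, 3},
    "carrot":    {2},
}
print(search(index, "cake -chocolate"))  # -> {2}
```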
Dynamic synonyms will require more thought given the machine learning aspects involved. May require domain specific models. And I also wonder how "off-the-shelf" it will really be in practice.
I don't have comparative benchmarks, but here are some Typesense benchmarks:
A dataset containing 2.2 Million recipes (recipe names and ingredients):
- Took 3.6mins to index all 2.2M records
- On a server with 4 vCPUs, Typesense was able to handle 104 concurrent search queries per second, with an average search processing time of 11ms.
A dataset containing 28 Million books (book titles, authors and categories):
- Took 78mins to index all 28M records
- On a server with 4 vCPUs, Typesense was able to handle 46 concurrent search queries per second, with an average search processing time of 28ms.
With a dataset containing 3 Million products (Amazon product data), Typesense was able to handle a throughput of 250 concurrent search queries per second on an 8-vCPU 3-node Highly Available Typesense cluster.
Happy to answer any questions!