Hacker News | jabo's comments

We’ve interacted before on Twitter and GitHub, and I want to address your point about Raft in Typesense since you mention it explicitly:

I can confidently say that Raft in Typesense is NOT broken.

We run thousands of clusters on Typesense Cloud, reliably serving close to 2 billion searches per month.

We have airlines using us, a few national retailers with 100s of physical stores in their POS systems, logistics companies for scheduling, food delivery apps, large entertainment sites, etc. Collectively these are use cases where a downtime of even an hour could cause millions of dollars in losses. And we power them reliably on Typesense Cloud, using Raft.

For an n-node cluster, the Raft protocol only guarantees auto-recovery for a failure of up to (n-1)/2 nodes. Beyond that, manual intervention is needed. This is by design, to prevent a split-brain situation. It's not a Typesense thing, but a Raft protocol thing.
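To illustrate the quorum math (a sketch of the protocol's arithmetic, not Typesense code): a Raft cluster needs a strict majority of nodes to elect a leader and commit writes, which is where the (n-1)/2 bound comes from.

```python
# A Raft cluster of n nodes needs a majority (n // 2 + 1) of votes to
# elect a leader and commit writes, so it can auto-recover from at
# most (n - 1) // 2 simultaneous node failures.
def raft_fault_tolerance(n: int) -> int:
    majority = n // 2 + 1      # votes needed to make progress
    return n - majority        # failures survivable = (n - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n}-node cluster tolerates {raft_fault_tolerance(n)} failure(s)")
# 3 nodes tolerate 1 failure, 5 tolerate 2, 7 tolerate 3
```

This is also why clusters are typically sized with an odd number of nodes: going from 3 to 4 nodes raises the majority from 2 to 3 without increasing the number of survivable failures.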


I'm biased, but I'd recommend exploring Typesense for search.

It's an open source alternative to Algolia + Pinecone, optimized for speed (it's in-memory) and for an out-of-the-box dev experience. E-commerce is also a very common use case I see among our users.

Here's a live demo with 32M songs: https://songs-search.typesense.org/

Disclaimer: I work on Typesense.


I can also highly recommend TypeSense and have no affiliation. You'll save a lot of money and get much faster results.


I work on Typesense [1] - historically considered an open source alternative to Algolia.

We then launched vector search in Jan 2023, and just last week we launched the ability to generate embeddings from within Typesense.

You'd just need to send JSON data, and Typesense can generate embeddings for it using OpenAI, the PaLM API, or built-in models like S-BERT, E-5, etc (running on a GPU if you prefer). [2]

You can then do a hybrid (keyword + semantic) search by just sending the search keywords to Typesense, and Typesense will automatically generate embeddings for you internally and return a ranked list of keyword results weaved with semantic results (using Rank Fusion).

You can also combine filtering, faceting, typo tolerance, etc - the things Typesense already had - with semantic search.
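To give a feel for what this looks like in practice, here's a sketch of a collection schema with an auto-embedding field and a hybrid search query, based on the vector search docs [2]. The collection name, field names, and query text are made up for illustration; I'm just building the request payloads as plain dicts here rather than calling a live cluster.

```python
# Sketch of a Typesense auto-embedding setup (names are illustrative).
# The "embed" property on a float[] field tells Typesense to generate
# embeddings from the listed source fields using the chosen model.
schema = {
    "name": "products",
    "fields": [
        {"name": "title", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title"],
                "model_config": {"model_name": "ts/e5-small"},
            },
        },
    ],
}

# Hybrid search: listing both a keyword field and the embedding field
# in query_by makes Typesense run keyword + semantic search and merge
# the two ranked lists with Rank Fusion.
search = {
    "q": "warm jacket",
    "query_by": "title,embedding",
}
```

The point is that the client only ever sends plain JSON; embedding generation and rank fusion happen server-side.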

For context, we serve over 1.3B searches per month on Typesense Cloud [3]

[1] https://github.com/typesense/typesense

[2] https://typesense.org/docs/0.25.0/api/vector-search.html

[3] https://cloud.typesense.org


We store a couple million documents in typesense and the vector store is performing great so far (average search time is a fraction of overall RAG time). Didn’t realise you’ve updated to support creating the embeddings automatically; great news!


This is very difficult for me to understand. Can you explain like I'm an undergrad? What exactly does this mean? What is an embedding? What is the difference between keyword and semantic search?


Here's an example of semantic search:

Let's say your dataset has the words "Oceans are blue" in it.

With keyword search, if someone searches for "Ocean", they'll see that record, since it's a close match. But if they search for "sea" then that record won't be returned.

This is where semantic search comes in. It can automatically deduce semantic / conceptual relationships between words and return a record with "Ocean" even if the search term is "sea", because the two words are conceptually related.

The way semantic search works under the hood is using these things called embeddings, which are just a big array of floating point numbers for each record. It's an alternate way to represent words, in an N-dimensional space created by a machine learning model. Here's more information about embeddings: https://typesense.org/docs/0.25.0/api/vector-search.html#wha...

With the latest release, you essentially don't have to worry about embeddings (except maybe picking one of the model names to use and experimenting), and Typesense will do the semantic search for you by generating embeddings automatically.
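Here's a toy illustration of the idea, with made-up 3-dimensional vectors (real models use hundreds of dimensions): related words end up close together, so their cosine similarity is high even when they share no characters.

```python
import math

# Cosine similarity: 1.0 means the vectors point the same way,
# values near 0 mean they're unrelated.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model would produce these for you.
ocean = [0.9, 0.1, 0.3]
sea   = [0.8, 0.2, 0.35]
tax   = [0.1, 0.9, 0.05]

print(cosine(ocean, sea))  # high -> conceptually related
print(cosine(ocean, tax))  # low  -> unrelated
```

Semantic search is essentially this comparison done at scale: embed the query, then find the stored vectors closest to it.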


We use Typesense for vector search as well for Struct.ai in production, it works amazingly.

I'm surprised the original post doesn't benchmark Typesense.


Typesense has a vector store / search built-in: https://typesense.org/docs/0.24.1/api/vector-search.html

In the upcoming version, we've also added the ability to automatically generate embeddings from within Typesense, using either OpenAI, the PaLM API, or a built-in model like S-BERT or E5. So you only have to send JSON and pick a model, and Typesense will then do a hybrid vector + keyword search for queries.


I see you run hnswlib but do you (plan to) support external vector databases, so users can upgrade?


We don't plan to support external vector databases, since we want to build Typesense as a vector + keyword search datastore by itself.


I see. Do you plan to replace hnswlib with your own technology?


We've been using Struct's Slack bot in Typesense's Slack community here (if you want to see a demo of how it looks):

https://threads.typesense.org/kb

I love that the discussions we're having (in public channels) are now automatically indexed and made searchable publicly to any users who are looking for information on Google, etc, even if they're not a part of our Slack community.

I used to worry about all the time and effort we're putting into the walled garden of information that Slack was becoming, not to mention their untenable pricing for communities.

I now find myself spending more time writing more detailed answers in Slack, because I know it's going to be available publicly for future searchers.


You can also use this library with Typesense, which is an open source alternative to Algolia: https://github.com/typesense/typesense-autocomplete-demo

Disclaimer: I work on Typesense.


It's quite easy to plug into any results provider, in fact. I built a proof of concept using Lunr (dataset too small for anything heavier-weight).



I get this question frequently (why not use FAISS or Annoy directly, instead of a vector database?), so I'm glad to see this aspect covered in the article.

Plug: If you're ever looking for an open source alternative to Pinecone, we recently added vector search to Typesense: https://typesense.org/docs/0.24.1/api/vector-search.html

The key thing is that it's in-memory and allows you to combine attribute-based filtering, together with HNSW-based nearest-neighbor search.

We're also working on a way to automatically generate embeddings from within Typesense using any ML models of your choice.

So Algolia + Pinecone + Open Source + Self-Hostable with a cloud hosted option = Typesense
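To make the "filtering + nearest-neighbor" combination concrete, here's a sketch of the search parameters, based on the vector search docs. The collection fields ("category", "price", "embedding") and the query vector are made up; in practice the vector would come from your embedding model.

```python
# Sketch of a Typesense search combining attribute filtering with
# HNSW-based nearest-neighbor search (field names are illustrative).
query_vector = [0.12, 0.45, 0.83]  # produced by your embedding model

search_params = {
    "q": "*",  # no keyword query; rank purely by vector distance
    "filter_by": "category:=shoes && price:<100",
    # k:10 -> return the 10 nearest neighbors among the filtered docs
    "vector_query": f"embedding:({query_vector}, k:10)",
}
```

The filter is applied alongside the nearest-neighbor search, so you get the closest vectors among only the documents that match the filter, rather than filtering a pre-computed neighbor list.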



This is what we’re using. We already sync database content to a typesense DB for regular search so it wasn’t much more work to add in embeddings and now we can do semantic search.


Could you do boosting with Typesense, like favoring more recent results?


Yup. If you store timestamps as Unix timestamps, you can have Typesense sort on text match score and then the timestamp field: https://typesense.org/docs/guide/ranking-and-relevance.html#...
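A sketch of what those search parameters might look like (the collection fields and query are made up for illustration):

```python
# Rank by text relevance first, then break ties by recency using a
# Unix-timestamp field. Field names here are illustrative.
search_params = {
    "q": "jacket",
    "query_by": "title",
    "sort_by": "_text_match:desc,created_at:desc",
}
```

Putting `_text_match` first means recency only reorders results with equal relevance scores; swapping the order would make recency dominate instead.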


If I'm reading this correctly, it looks like search requests are now charged separately from the number of documents.

So if I have 5M records and 5M searches per month, in the old pricing scheme I would have paid $5K per month.

In the new pricing plan, I would pay $0.40 per 1K records and then $0.50 per 1K searches. So that's $2K for record storage and $2.5K for search requests for a total of $4.5K.
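Checking that arithmetic with the unit prices quoted above:

```python
records = 5_000_000
searches = 5_000_000

record_cost = records / 1_000 * 0.40   # $0.40 per 1K records
search_cost = searches / 1_000 * 0.50  # $0.50 per 1K searches
total = record_cost + search_cost

print(record_cost, search_cost, total)
# 2000.0 2500.0 4500.0  -> $4.5K/month vs $5K under the old scheme
```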

This blog post [1] claims they've slashed prices by 50%, which may be true at the unit level, but then it sounds like they've separated out charges for searches and records, so you still end up paying close to what you used to pay previously...

Am I misreading something?

[Disclaimer: I work on Typesense, an open source alternative to Algolia with a SaaS version, so I'm looking to understand this better myself.]

[1] https://www.algolia.com/about/news/algolia-introduces-new-de...


I think you are correct.

My biggest issues with Algolia were always:

- search requests should be separated from number of documents

-- think of a geonames service with 10 million documents vs. 500k search requests -- seems to be solved now

- it is crazy to require a new index for each sort direction

-- this is still the case

IMHO they should charge for CPU cycles + storage. Until then, self-hosted Typesense, Meilisearch, or Elasticsearch, or hosted Typesense or Elasticsearch, are still superior. I'm leaving out hosted Meilisearch here as their entry level is also nuts at $1.2k/month.


Hello, I'm the Meilisearch CEO. I think you're also correct, Jabo.

I just want to clarify: Meilisearch's pricing doesn't start at $1.2K/month, but at $0/month. We have usage-based pricing that is basically $0.25 per 1,000 documents and searches. And, funny thing, we are thinking about splitting searches and documents too, but we wanted to have more data to be sure we select the right unit price for each. :)


Sorry, that fine print is so finely printed that I could not even see it.

You are correct. For < 1k documents it should be included on the hosted list.

Also, please do split documents and search requests. In a real-life scenario there will be 100k documents stored and only the top 1% will ever be searched for.

The one thing that always held us back from further investigating your product was searching within facets, which was on the roadmap back then and is IMHO crucial to any serious search UI. No idea if you've implemented it by now.

