
Vector search is incredibly powerful at matching on context or similarity. For example, "automobile" and "car" are semantically similar, and one will rank well for the other in a search.

Vector search, though, isn't as good at handling typos and is not good at all when it comes to as-you-type searching. "Vehic" won't match "auto", for example.

We believe that each of these approaches has its uses, and that both can be used in a single search, rather than choosing between them ahead of time or through heuristics after the fact.

(I'm a Principal PM for Semantic Search and Search Ranking at Algolia.)
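
One generic way to use keyword and vector results in a single search is to fuse the two ranked lists. The sketch below uses reciprocal rank fusion, which is an assumption chosen for illustration and not a description of how Algolia's hybrid search works. The document ids and the constant k=60 (a commonly used default) are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each list contributes 1 / (k + rank) to a document's score, so a
    document that appears in more than one list rises toward the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


# Hypothetical result lists for one query, best match first.
keyword_hits = ["vehicle-101", "vehicle-202"]  # e.g. from prefix/typo matching
vector_hits = ["auto-303", "vehicle-101"]      # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here "vehicle-101" ranks first because both lists return it, while results found by only one approach still make it into the fused list.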




Aren't typos just a question of how you generate your vectors/embeddings? I'd be surprised if a transformer with a character-level tokenizer trained on a representative source of data (i.e., with typos) wouldn't be able to make sense of typos.


Can confirm. We use sentence-level transformer embeddings for (vector) search, clustering, and classification tasks. As an old school ML guy I've been amazed at how robust they are to typos, slang, punctuation, etc.

However, I'm sure there are still applications where you don't have access to a robust embedding for your domain but can apply other techniques to deal with that domain's noise.


Here is a decent intro to sentence-level transformers & embeddings:

https://www.pinecone.io/learn/sentence-embeddings/


Yes, good point. I still believe that net-net you're going to get better results on typos with a keyword-based search, but I didn't mean to imply that vector searching won't handle typos at all.


> Vector search, though, isn't as good at handling typos and is not good at all when it comes to as-you-type searching. "Vehic" won't match "auto", for example.

This is incorrect in the general case; it depends entirely on the model used to produce the word vectors and on the text corpus the model is trained on.

For instance, the fastText model is trained on words but also on their parts (character n-grams), so it should produce word vectors that are close (in cosine distance) to the vectors of their corresponding typos and partials, even if the training corpus doesn't contain those typos and partially typed words verbatim.
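
A rough illustration of the subword idea: a typo or a partially typed word shares many character n-grams with the intended word. This is not fastText itself. The helper names are hypothetical, and the Jaccard overlap of n-gram sets is only a stand-in for closeness between learned vectors. The 3 to 6 n-gram range and the < > boundary markers follow fastText's usual conventions.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, with < and > marking its boundaries."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Compare a typo, a partial, and an unrelated spelling against "vehicle".
    for other in ("vehicel", "vehic", "auto"):
        print(other, round(ngram_overlap("vehicle", other), 3))
```

The typo and the partial overlap with "vehicle", while "auto" shares no n-grams with it at all. Matching "vehic" to "auto" therefore relies on the model having learned that "vehicle" and "auto" are semantically close, on top of the subword overlap.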


I'd like to add that vector search works not just for natural language, but also for a variety of other types of unstructured data: images, video, user profiles, and pretty much anything else that can be vectorized. Here's an example of image search: https://milvus.io/docs/image_similarity_search.md


Do y'all have a technical blog? I would love to understand the problem and methods, the domains y'all cross (e.g., biometrics and fuzzy matching?), and how y'all integrate in different industries.

A good search partner is hard to find. PageRank is fun and all, but I believe better methods exist these days.


There are two related problems here: finding relevant results and ranking those results. The first has historically been done with massive inverted indexes. PageRank addresses the second: ranking those relevant results.

For the first part you can look into "embeddings" and "approximate nearest neighbor lookup" for the modern approaches. That said, inverted indexes are still very popular.
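
A minimal sketch of both retrieval styles for the first part, on toy data. The documents and the hand-made 2-d "embeddings" are made up for illustration. The vector lookup here is an exact brute-force scan, not an approximate nearest neighbor index, which real systems would use at scale.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_lookup(index, query):
    """Return the ids of documents that contain every token of the query."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, query_vec, k=2):
    """Exact top-k document ids by cosine similarity (brute force)."""
    ranked = sorted(vectors, key=lambda d: cosine(vectors[d], query_vec), reverse=True)
    return ranked[:k]


DOCS = {1: "red car for sale", 2: "used automobile parts", 3: "fresh apple pie"}
# Toy embeddings: the two vehicle documents point in a similar direction.
VECTORS = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.1, 0.9]}

index = build_inverted_index(DOCS)
```

With this data, keyword_lookup(index, "car") finds only document 1, while nearest_neighbors(VECTORS, [1.0, 0.0]) returns both vehicle documents, which is the "automobile" versus "car" effect discussed earlier in the thread.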

The second one is generally called "learning to rank", so you can find a lot written on that topic. The biggest issue here, imho, is which training data you use to provide examples of good rankings. The best algorithm trained on garbage will give you garbage.


Here's a link to our engineering blog posts: https://www.algolia.com/blog/engineering/

And our CTO, Julien, wrote an "Inside the Engine" series on how our search engine works. It doesn't have the new "hybrid search" but it shows you the base of how we do search: https://www.algolia.com/blog/engineering/inside-the-algolia-...


These are relatively easy to build and can be used for a variety of tasks like Entity Resolution, https://news.ycombinator.com/item?id=32825679



