Vector search is incredibly powerful on matching on context or similarity. For e...

marcinzm · on Sept 15, 2022

Aren't typos just a question of how you generate your vectors/embeddings? I'd be surprised if a transformer with a character-level tokenizer trained on a representative source of data (i.e., with typos) couldn't make sense of typos.

evrydayhustling · on Sept 15, 2022

Can confirm. We use sentence-level transformer embeddings for (vector) search, clustering, and classification tasks. As an old school ML guy I've been amazed at how robust they are to typos, slang, punctuation, etc.

However, I'm sure there are still applications where you don't have access to a robust embedding for your domain but can apply other techniques to deal with that domain's noise.

O__________O · on Sept 15, 2022

Here is a decent intro to sentence-level transformers & embeddings:

https://www.pinecone.io/learn/sentence-embeddings/

dustincoates · on Sept 15, 2022

Yes, good point. I still believe that net-net you're going to get better results on typos with a keyword-based search, but I didn't mean to imply that vector searching won't handle typos at all.
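
For context, one common way a keyword-based engine tolerates typos is to accept index terms within a small edit distance of the query term. The sketch below is only an illustration of that general idea, not a description of how any particular engine implements it; `fuzzy_match` and `max_typos` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_match(query, vocabulary, max_typos=1):
    """Return vocabulary terms within max_typos edits of the query (hypothetical helper)."""
    return [term for term in vocabulary if edit_distance(query, term) <= max_typos]
```

Note that plain Levenshtein counts a transposition such as "vehicel" for "vehicle" as two edits, so `max_typos` would need to be at least 2 to match it.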

dnc · on Sept 15, 2022

> Vector search, though, isn't as good on handling typos and not good at all when it comes to as you type searching. Vehic won't match on auto, for example.

This is incorrect in the general case: it depends entirely on the model used to produce the word vectors and the text corpus the model is trained on.

For instance, the fastText model is trained on words but also on their parts (character n-grams), so it should produce word vectors that are close (in cosine distance) to the vectors of their corresponding typos and partials, even if the text corpus used to train the model doesn't contain those same typos and partially typed words verbatim.
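
A rough way to see why shared n-grams help: the sketch below does not use fastText itself. It replaces learned n-gram vectors with plain n-gram counts, which is an assumption made purely for illustration, and the n-gram range of 3 to 5 is likewise an arbitrary choice.

```python
import math
from collections import Counter

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers, e.g. '<vehicle>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def ngram_vector(word):
    # Sparse bag of n-gram counts; a stand-in for summing learned n-gram vectors.
    return Counter(char_ngrams(word))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

With this toy representation, "vehic" and "vehicel" both share n-grams with "vehicle" and so get a nonzero similarity, while "vehicle" and "auto" share none. Counts alone only capture surface overlap; relating "vehicle" to "auto" would have to come from the learned vectors and the training corpus.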

fzliu · on Sept 15, 2022

I'd like to add that vector search works not just for natural language but for a variety of other types of unstructured data as well: images, video, user profiles, and pretty much anything else that can be vectorized. Here's an example of image search: https://milvus.io/docs/image_similarity_search.md

tomrod · on Sept 15, 2022

Do y'all have a technical blog? I would love to understand the problem and methods, the domains y'all cross (e.g., biometrics and fuzzy matching?), and how y'all integrate in different industries.

A good search partner is hard to find. PageRank is fun and all, but I believe better methods exist these days.

marcinzm · on Sept 15, 2022

There are two related problems here: finding relevant results and ranking those results. The first has historically been done with massive inverted indexes. PageRank addresses the second: ranking those relevant results.

For the first part you can look into "embeddings" and "approximate nearest neighbor lookup" for the modern approaches. That said, inverted indexes are still very popular.
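
To make the two retrieval styles concrete, here is a minimal sketch under stated assumptions: the documents and the 2-d "embeddings" are made up, and the nearest-neighbor lookup is exact brute force rather than approximate. Real systems use an ANN index so they don't have to score every vector.

```python
import math
from collections import defaultdict

# Toy corpus (hypothetical).
docs = {
    1: "used car for sale",
    2: "vehicle repair shop",
    3: "fresh apple pie recipe",
}

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_lookup(index, query):
    """Return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Made-up 2-d embeddings; a real model would produce hundreds of dimensions.
doc_vectors = {1: (0.9, 0.1), 2: (0.8, 0.2), 3: (0.1, 0.9)}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query_vec, vectors, k=2):
    """Exact nearest neighbors by cosine similarity (brute force)."""
    ranked = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
    return ranked[:k]
```

The inverted index only returns documents that literally contain the query terms, so a query for "auto" finds nothing here. The vector lookup returns whatever is closest in the embedding space, so its quality depends entirely on the embeddings.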

The second one is generally called "learning to rank", so you can find a lot written on that topic. The biggest issue here, imho, is which training data you use to supply examples of good rankings. The best algorithm trained on garbage will give you garbage.
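
As a small illustration of the idea, the sketch below trains a linear scorer from preference pairs with perceptron-style updates. This is just one simple pairwise approach among many; the features (text match score, popularity) and all the numbers are hypothetical. The `pairs` list is the training data the point above is about: if those preferences are wrong, the learned weights will be too.

```python
def score(weights, features):
    """Linear relevance score."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """pairs: list of (better_features, worse_features) preference examples."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                # Misordered pair: nudge weights toward the preferred document.
                weights = [w + lr * (b - x) for w, b, x in zip(weights, better, worse)]
    return weights

# Hypothetical features per document: (text_match_score, popularity).
pairs = [
    ((0.9, 0.2), (0.3, 0.8)),
    ((0.8, 0.5), (0.4, 0.9)),
    ((0.7, 0.1), (0.2, 0.3)),
]
weights = train_pairwise(pairs, n_features=2)

# Rank unseen candidates with the learned weights.
candidates = {"a": (0.85, 0.3), "b": (0.35, 0.95)}
ranked = sorted(candidates, key=lambda d: score(weights, candidates[d]), reverse=True)
```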

dustincoates · on Sept 15, 2022

Here's a link to our engineering blog posts: https://www.algolia.com/blog/engineering/

And our CTO, Julien, wrote an "Inside the Engine" series on how our search engine works. It doesn't have the new "hybrid search" but it shows you the base of how we do search: https://www.algolia.com/blog/engineering/inside-the-algolia-...

kartoolOz · on Sept 15, 2022

These are relatively easy to build and can be used for a variety of tasks like entity resolution: https://news.ycombinator.com/item?id=32825679