Could anybody explain me a simple example how to do text analysis via this vecto...

alexgarcia-xyz · 2024-08-02T16:59:48 1722617988

You can generate "text embeddings" (i.e. vectors) from your text with an embedding model. Once your text is represented as vectors, "closest vector" means "most semantically similar", as defined by the embedding model you use. This lets you build semantic search engines, recommendations, and classifiers on top of your text and embeddings. It's kind of like a fancier and fuzzier keyword search.

I wouldn't completely replace keyword search with vector search, but it's a great addition that essentially lets you perform calculations on text.
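To make the "closest vector" idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up by hand purely for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions, and the names `cosine_similarity`, `nearest`, and `toy_embeddings` are hypothetical helpers, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written stand-ins for embeddings (assumption: a real model would
# generate these from the text; the numbers here are invented).
toy_embeddings = {
    "how to install PostgreSQL": [0.9, 0.1, 0.0],
    "setting up a Postgres database": [0.8, 0.2, 0.1],
    "best pasta recipes": [0.0, 0.1, 0.9],
}

def nearest(query_vec, embeddings, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

# Pretend this vector is the embedding of a database-related query.
print(nearest([0.85, 0.15, 0.05], toy_embeddings, k=2))
```

With these toy numbers, the two database-related texts rank above the pasta one even though they share few exact keywords, which is the "fuzzier keyword search" effect in miniature.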

yAak · 2024-08-02T17:13:37 1722618817

Thank you! This was an extremely helpful explanation to me, as someone not familiar with the topic. =)

bodantogat · 2024-08-02T17:43:15 1722620595

Nice explanation. One use case where keywords haven't worked well for me, and where (at least at first glance) vectors are doing better, is longer passages: finding sentences that are similar rather than just words.

simonw · 2024-08-02T18:26:03 1722623163

I put together an extensive guide to understanding embeddings and vector search last year: https://simonwillison.net/2023/Oct/23/embeddings/

bambax · 2024-08-03T05:49:13 1722664153

Yes this is an excellent article, thank you.

Do embeddings work across languages? It must depend on the model, but are there models where similar concepts in different languages occupy the same location in vector space?

If so, it would allow searching a corpus in one language using another, without translating anything up front (neither the corpus nor the query).

brylie · 2024-08-02T16:48:53 1722617333

One use case is retrieving similar documents, such as when recommending related content. Another is retrieving document segments that are similar to a user query and passing them, along with the query, to a large language model to improve the generated response. Vector search is also better than keyword search in some ways, since it can find semantically similar documents even when the user has not used the exact keyword, or has used only a partial keyword such as “Postgres” instead of “PostgreSQL.”
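The second use case (retrieve similar segments, then hand them to a language model together with the query) might look roughly like the sketch below. Everything here is an assumption for illustration: the two-dimensional vectors are invented rather than produced by an embedding model, `retrieve` and `build_prompt` are hypothetical helper names, and the actual call to a language model is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, k=2):
    """chunks is a list of (text, vector) pairs; return the k closest texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Combine the retrieved segments and the user query into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Invented vectors standing in for real embeddings of each segment.
chunks = [
    ("Notes on configuring a PostgreSQL server.", [0.9, 0.1]),
    ("Tips for tuning Postgres queries.", [0.8, 0.3]),
    ("A guide to baking sourdough bread.", [0.1, 0.9]),
]

question = "How do I speed up my Postgres database?"
question_vec = [1.0, 0.2]  # assumed embedding of the question

top_chunks = retrieve(question_vec, chunks, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the language model
```

Only the closest segments end up in the prompt, so the model sees relevant context without being handed the whole corpus.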