I suspect the biggest difference is the input data. Embeddings are great over datasets that look like FAQs and QA docs, or data that conceptually fits into very small chunks (tweets, some product reviews, etc).
They do very badly over diverse business docs, especially with naive chunking. B2B use cases usually have old PDFs and Word docs that need to be searched, and users are often looking for specific keywords (e.g. a person's name, a product, an ID). Vector search tends to do badly on those kinds of queries, and just returning chunks misses a lot of important details. A sketch of the usual workaround is below.
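The common fix is hybrid retrieval: score documents with both BM25 (for exact keyword hits like IDs) and embeddings, then blend. A minimal sketch, assuming the rank_bm25 and sentence-transformers libraries; the sample docs, the blending weight, and the model choice are all arbitrary:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Invoice INV-4821 for Acme Corp, contact Jane Doe",
    "Time product onboarding guide for new customers",
    "Quarterly security review of the billing service",
]

# Keyword side: BM25 over whitespace tokens (real systems need better tokenization)
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Vector side: any sentence-embedding model works; this one is just a common default
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5):
    """Blend normalized BM25 and cosine scores; alpha weights the keyword side."""
    kw = np.array(bm25.get_scores(query.lower().split()))
    kw = kw / (kw.max() or 1.0)  # crude score normalization
    qv = model.encode([query], normalize_embeddings=True)[0]
    vec = doc_vecs @ qv  # cosine similarity, since vectors are normalized
    scores = alpha * kw + (1 - alpha) * vec
    return sorted(zip(scores, docs), reverse=True)

# Exact identifiers like "INV-4821" rank via the BM25 side
# even when the embedding model has never seen them
print(hybrid_search("INV-4821")[0])
```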
Even worse, named entities vary from organization to organization.
We have a client who uses a product called "Time". It's time management software. For that customer's documentation, "Time" should be close to "product" and a bunch of other things that have nothing to do with the ordinary concept of time.
I actually suspect that people would get a lot more bang for their buck fine-tuning the embedding model on their B2B dataset for their use case, rather than fine-tuning an LLM.
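For what that might look like: a rough sketch using the sentence-transformers v2-style training loop with in-batch negatives. The (query, passage) pairs are made up for illustration; in practice you'd mine them from the client's own docs and search logs so domain terms like the "Time" product land near the right neighbors:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, relevant passage) pairs from the client's corpus
train_examples = [
    InputExample(texts=["Time license renewal",
                        "How to renew your Time product subscription"]),
    InputExample(texts=["timesheet export",
                        "Exporting approved timesheets from Time to CSV"]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any base model; arbitrary choice
loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: each query's positive passage is contrasted
# against every other passage in the batch
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("embeddings-tuned-for-client")
```

Even a few thousand pairs like this can shift domain vocabulary a lot, and it's far cheaper than LLM fine-tuning.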