
This is not the best way to understand where modern embeddings are coming from.



True, but what is the best way?


Are you talking about sentence/text chunk embeddings, or just embeddings in general?

If you need high quality text embeddings (e.g. to use with a vector DB for text chunk retrieval), then they are going to come from the output of a language model, either a local one or an embeddings API.

Other embeddings are normally going to be learned in an end-to-end fashion.
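For concreteness, here's a minimal sketch of the local-model route using the sentence-transformers library (the model name and example strings are just illustrative choices; an embeddings API works the same way, you just get the vectors back over HTTP):

    # Minimal sketch: text chunk embeddings from a local model.
    # Assumes the sentence-transformers package; the model name is just
    # one common choice.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    chunks = [
        "The cat sat on the mat.",
        "A kitten rested on the rug.",
        "Quarterly revenue grew by 12%.",
    ]
    vectors = model.encode(chunks, normalize_embeddings=True)  # shape (3, 384)

    # With normalized vectors, cosine similarity is just a dot product.
    print(np.dot(vectors[0], vectors[1]))  # high: similar meaning
    print(np.dot(vectors[0], vectors[2]))  # low: unrelated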


I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. E.g.:

1. Start with bag of words. Represent each word as a one-hot vector: all zeros except a single one at that word's index. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector, and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of NLP now.

2. One big issue is that there are a lot of common noisy words (like "good", "think", "said", etc.) that can make embeddings more similar than we feel they should be. So now we develop strategies for reducing the impact of those words on our vector. Remember how we just summed up the individual word vectors in (1)? Now we'll scale each word vector by its frequency so that the more frequent the word is in our corpus, the smaller we'll make the corresponding word vector. That brings us to tf-idf embeddings.

3. Another big issue is that our representations of words don't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but in our embeddings from (2) the similarity of the two pairs is going to be about the same. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and GloVe pop up as methods of creating distributed representations of words. Now we can represent documents by summing/averaging/tf-idf-weighting our word vectors the same as we did in (2) (there's a toy sketch of steps 1-3 after step 5).

4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?'" and "She rolled the dough and said, 'Don't you love it here?'". Odds are, the similarity per (3) is going to be pretty high, despite the fact that it's clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.

5. And that brings us to now, where we have a lot more compute than we did before and access to way bigger corpora, so we can do some really interesting things, but it's all still the basic steps of breaking text down into its constituent parts, representing those numerically, and then defining a method to combine the various parts to produce a final representation for a document. The steps above help a lot by showing the motivation for each change and why we do the things we do today.
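Here's the toy sketch of steps 1-3 mentioned above. Everything in it (vocabulary, idf values, word vectors) is made up purely for illustration; a real system would learn all of these from a corpus:

    # Toy sketch of steps 1-3: bag of words, tf-idf weighting, and
    # averaged dense word vectors. Vocabulary, idf values, and word
    # vectors are invented for illustration only.
    import numpy as np

    vocab = ["i", "hate", "dislike", "like", "when", "it", "rains"]
    idx = {w: i for i, w in enumerate(vocab)}

    def bag_of_words(text):
        # Step 1: sum of one-hot vectors, i.e. word counts.
        v = np.zeros(len(vocab))
        for w in text.lower().split():
            v[idx[w]] += 1
        return v

    # Step 2: down-weight common words. Real tf-idf derives these weights
    # from document frequencies; these numbers are made up.
    idf = np.array([0.1, 2.0, 2.0, 2.0, 0.3, 0.2, 1.5])

    def tfidf(text):
        return bag_of_words(text) * idf

    # Step 3: dense word vectors. word2vec/GloVe would learn these; these
    # 3-d vectors are hand-picked so "hate" and "dislike" end up close.
    word_vecs = {
        "i": [0.1, 0.0, 0.0], "when": [0.0, 0.1, 0.0], "it": [0.0, 0.0, 0.1],
        "rains": [0.2, 0.9, 0.1], "hate": [-0.9, 0.1, 0.3],
        "dislike": [-0.8, 0.2, 0.3], "like": [0.9, 0.1, 0.3],
    }

    def avg_word_vec(text):
        return np.mean([word_vecs[w] for w in text.lower().split()], axis=0)

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = "i hate when it rains"
    b = "i dislike when it rains"
    c = "i like when it rains"
    # tf-idf can't tell hate/dislike apart from hate/like...
    print(cos(tfidf(a), tfidf(b)), cos(tfidf(a), tfidf(c)))
    # ...while averaged dense word vectors can.
    print(cos(avg_word_vec(a), avg_word_vec(b)), cos(avg_word_vec(a), avg_word_vec(c)))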


This explanation is better, because it puts things into perspective, but you don't seem to realize that your 1 and 2 are almost trivial compared to 3 and 4. At the heart of it are "methods of creating distributed representation of words", that's where the magic happens. So I'd focus on helping people understand those methods. Should probably also mention subword embedding methods like BPE, since that's what everyone uses today.

I've noticed that many educators make this mistake: they spend a lot of time explaining very basic, trivial things, then rush over difficult-to-grasp concepts or details.


> We now have a method of embedding a variable length piece of text into a fixed size vector

Question: Is it a rule that the embedding vector must be higher-dimensional than the source text? Ideally 1 token -> a 1000+ length vector? The reason I ask is that it seems like it would lose value as a mechanism if I sent in a 1000-character string and only got, say, a length-4 vector embedding for it. Since only 4 metrics/features can't possibly describe such a complex statement, I thought it was necessary that the dimensionality of the embedding be higher than that of the source?


No. Number of characters in a word has nothing to do with dimensionality of that word’s embedding.

GPT4 should be able to explain why.
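One way to see it concretely (toy numbers, no particular model implied): the embedding dimension is a hyperparameter fixed when the model is built, and any number of input tokens gets pooled down to a vector of that size.

    # Toy illustration: the embedding dimension is fixed by the model and
    # is independent of input length. All numbers here are arbitrary
    # (though 768 happens to be BERT-base's hidden size).
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embedding_dim = 50_000, 768
    embedding_table = rng.normal(size=(vocab_size, embedding_dim))

    short_input = [42]               # one token
    long_input = list(range(1000))   # a thousand tokens

    # Each token maps to a 768-d row; pooling (here, a mean) collapses
    # any number of tokens into one fixed-size vector.
    print(embedding_table[short_input].mean(axis=0).shape)  # (768,)
    print(embedding_table[long_input].mean(axis=0).shape)   # (768,)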



