I imagine it's more useful for finding related articles and clustering things than for semantic search, which will work much better against smaller chunks - especially if you're implementing Retrieval Augmented Generation.
I think the point is: if you compress 25 pages of text into 1024 floats, you will lose a ton of information, regardless of what the use case is, so you're probably still better off with chunking.
> if you compress 25 pages of text into 1024 floats, you will lose a ton of information
Sure, but then if you do it one page at a time, or one paragraph at a time, you lose a ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise.
Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.
I think losing a ton of (low-frequency) meaning by embedding separate chunks is probably less of a problem than losing high-frequency meaning by embedding the whole document at once. As you say, doing both is probably a good strategy, and I think that's why we see a lot of "summarize this text" approaches.
I use a multi-pronged approach to this based on a special type of summarization. I chunk on sentences using punctuation until the chunks are just over 512 characters, then I embed them. After embedding, I ask a foundation model to summarize the chunk (or ask a question about it) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull in their keyterms. Using those keyterms, I do set operations to find related chunks. I then run a vector search against those related chunks and combine the results with the top matches from the original vector search to assemble new prompt text.
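The chunking step is roughly this (a sketch - the splitting regex and names are illustrative, not my exact code):

```python
import re

def chunk_by_sentences(text, threshold=512):
    # Split on sentence-ending punctuation, then accumulate sentences
    # until the running chunk is just over the character threshold.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        current = (current + " " + sentence).strip()
        if len(current) > threshold:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # keep the trailing partial chunk
    return chunks
```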
This strategy is based on the idea of a "back of the book index". It is entirely plausible to look for "outliers" among the keyterms and throw the chunks carrying those keyterms in as well, to see if that surfaces some "hidden" meaning in the document.
There is also a way to keep doing the "keyterm" extraction trick as the system is used. Keyterms from answers as well as user prompts may be added to the existing index over time, helping improve the ability to return low-frequency information that may be hidden initially.
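The set operations are just unions and intersections over those stored keyterms - something like this sketch (ids and names are illustrative):

```python
def related_by_keyterms(hit_ids, keyterm_index):
    # keyterm_index maps chunk_id -> set of keyterms stored alongside its vector.
    # Union the keyterms of the chunks the vector search returned, then find
    # other chunks whose keyterm sets overlap that union.
    query_terms = set()
    for chunk_id in hit_ids:
        query_terms |= keyterm_index[chunk_id]
    return {
        chunk_id
        for chunk_id, terms in keyterm_index.items()
        if chunk_id not in hit_ids and terms & query_terms
    }

def grow_index(keyterm_index, chunk_id, new_terms):
    # Keyterms extracted from answers and user prompts get folded back in over time.
    keyterm_index.setdefault(chunk_id, set()).update(new_terms)
```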
My previous implementation used TF-IDF - I basically took all the words in the post and turned them into a giant "word OR word OR word OR word" search query and piped that through SQLite full-text search. https://til.simonwillison.net/sqlite/related-content
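In code that looked roughly like this (a simplified sketch with an illustrative schema - the TIL link has the real version):

```python
import re
import sqlite3

def related_posts(db_path, post_text):
    # Turn every word in the post into one giant "word OR word OR word" query
    # and pipe it through a SQLite FTS5 full-text index.
    words = set(re.findall(r"[a-zA-Z]+", post_text.lower()))
    query = " OR ".join(f'"{word}"' for word in words)
    db = sqlite3.connect(db_path)
    return db.execute(
        "SELECT rowid, title FROM posts_fts WHERE posts_fts MATCH ?",
        (query,),
    ).fetchall()
```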
I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.
Does that mean you'd return other docs if they share just one word?
The idea of tfidf is that it gives you a vector (maybe combined with pca or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
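For example, with scikit-learn (illustrative - swap in whatever vector index you like for the brute-force cosine at the end):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Embeddings map documents into a vector space for similarity search.",
    "TF-IDF weights words by how rare they are across the corpus.",
    "SQLite's FTS5 extension provides BM25-ranked full-text search.",
]

# Each document becomes a sparse TF-IDF vector over the corpus vocabulary.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Optionally squash that down to a small dense vector (truncated SVD here),
# then search it with the same cosine similarity you'd use on Ada embeddings.
dense = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(cosine_similarity(dense[:1], dense))
```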
My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.
Then I take the top ten by score and call those the "related articles".
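With SQLite FTS5 that falls out of a single ORDER BY, since its rank column is BM25-based (again a sketch with an illustrative schema):

```python
def top_ten_related(db, query):
    # FTS5 ranks matches with BM25, so documents sharing more words - and
    # rarer words - sort first. Take the best ten and call them "related".
    return db.execute(
        """
        SELECT rowid, title, rank
        FROM posts_fts
        WHERE posts_fts MATCH ?
        ORDER BY rank
        LIMIT 10
        """,
        (query,),
    ).fetchall()
```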
That's not quite tfidf though. I agree you can get better results than that with Ada embeddings, but I would argue you can get even better results with embeddings from smaller chunks.
Good point. I wonder how different it is to use a large context here vs. having some other model summarize an 8k article into a small paragraph and embedding that paragraph instead, where such a large context wouldn't be necessary.
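Something like this would make the comparison easy to run (a sketch using the OpenAI Python client; the model names are just placeholders for whatever you'd actually compare):

```python
from openai import OpenAI

client = OpenAI()
article = open("article.txt").read()  # the long article in question

def embed(text, model="text-embedding-ada-002"):
    return client.embeddings.create(model=model, input=text).data[0].embedding

def summarize(text, model="gpt-4o-mini"):
    # Have another model squash the article into a small paragraph first.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Summarize this article in one short paragraph:\n\n" + text,
        }],
    )
    return response.choices[0].message.content

# Path A: embed the full article with a large-context embedding model.
full_vector = embed(article)

# Path B: summarize first, then embed the short paragraph.
summary_vector = embed(summarize(article))
```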
You mean the marketing blurb? Those tend to carry low information value, sometimes even negative - as in, if you didn't know anything else about the book, reading the blurb will make you even more wrong about it than you were. This is a common feature of marketing copy.