
What is the use case for an 8k token embedding? My (somewhat limited) experience with long context models is they aren't great for RAG. I get the impression they are optimized for something else, like writing 8k+ tokens rather than synthesizing responses.

Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?




> What is the use case for an 8k token embedding?

Calculating embeddings for larger documents than smaller-window embedding models can handle.

> My (somewhat limited) experience with long context models is they aren't great for RAG.

The only reason they wouldn't be great for RAG is that they aren't great at using information in their context window, which is possible (ISTR that some models have a strong recency bias within the window, for instance) but which I don't think is a general problem of long-context models.

> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?

I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but it's not the only one.


I wonder how the performance fares when the context size is increased. Intuitively it should be higher, but some quantized models I've tested showed noticeably worse performance.


Your KV cache size is linear in the context size, which might leave you tight on memory. There is also the increased cost of recalculating the KV cache when the context window has to move, but that is close to being solved with streaming LLMs.
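
To put rough numbers on it (a back-of-the-envelope sketch; the layer/head counts below are assumptions, roughly a 7B-class decoder model):

    # Back-of-the-envelope KV cache size; all figures are assumptions.
    layers, kv_heads, head_dim = 32, 32, 128
    context_len = 8192
    bytes_per_value = 2  # fp16

    # 2x for keys and values
    kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value
    print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # ~4.0 GiB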


BERT style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.


You could get a facsimile of a summary for a full article or short story. Reducing an 8k-token article to a summary using a completions model would cost far more. So if you need to search through collections of contracts, scientific papers, movie scripts, etc. for recommendations/clustering, then bigger input sizes can do that in one shot.

Think of it like skipping the square root step in Euclidean distance. Perfectly valid as long as you don’t want a distance so much as a way to compare distances. And doing so skips the most computationally expensive operation.
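
For example (a toy NumPy sketch with random vectors):

    import numpy as np

    a, b, c = (np.random.rand(1024) for _ in range(3))

    # Squared Euclidean distance: skip the sqrt. The ranking is unchanged
    # because sqrt is monotonic, and you avoid the expensive step.
    d2_ab = np.sum((a - b) ** 2)
    d2_ac = np.sum((a - c) ** 2)
    print("b is closer to a" if d2_ab < d2_ac else "c is closer to a")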


I think I'm missing something: like, yeah, it's vector search for bigger text chunks. But arguably vector search with bigger text chunks is _definitively_ worse -- this isn't doing summarization, just turning about 25 pages of text into 1024 floats, which you can then compare with cosine similarity (see the sketch below) to measure semantic similarity to other text.

I'd much rather know what paragraph to look in than what 25 pages to look in
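
For reference, the comparison step itself is tiny (a NumPy sketch):

    import numpy as np

    def cosine_similarity(a, b):
        # a and b are the 1024-float embeddings, whether they came from
        # one paragraph or from 25 pages of text.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))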


I imagine it's more useful for finding related articles and clustering things than for semantic search, which will work much better against smaller chunks - especially if you're implementing Retrieval Augmented Generation.


I think the point is: if you compress 25 pages of text into 1024 floats, you will lose a ton of information, regardless of the use case, so you're probably still better off with chunking.


> if you compress 25 pages of text into 1024 floats, you will lose a ton of information

Sure, but then if you do it one page at a time, or one paragraph at a time, you lose a ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise.

Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.


I think the assumption that you lose a ton of (low-frequency) meaning by embedding separate chunks is less likely to hold than the assumption that you lose high-frequency meaning by embedding the whole document at once. As you say, doing both is probably a good strategy, and I think that's why we see a lot of "summarize this text" approaches.

I use a multi-pronged approach to this based on a special type of summarization. I chunk on sentences using punctuation until the chunks are just over 512 characters, then I embed them. After embedding, I ask a foundation model to summarize each chunk (or ask a question about it) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull in their keyterms. Using those keyterms, I do set operations to find related chunks. I then run a vector search against those and combine the results with the top matches from the original vector search to assemble new prompt text.

This strategy is based on the idea of a "back of the book" index. It is entirely plausible to look for "outliers" among the keyterms and consider throwing the chunks carrying those keyterms into the prompt to see if it nets us an understanding of some "hidden" meaning in the document.

There is also a way to keep doing the keyterm-extraction trick as the system is used. Keyterms from answers as well as from user prompts may be added to the existing index over time, helping improve the ability to return low-frequency information that may initially be hidden.
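
In rough Python, the flow is something like this (embed(), extract_keyterms(), assemble_prompt() and the db helpers are hypothetical stand-ins, not real library calls):

    import re

    def chunk_by_sentences(text, min_chars=512):
        # Split on sentence punctuation, accumulate until just over min_chars.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], ""
        for s in sentences:
            current = (current + " " + s).strip()
            if len(current) > min_chars:
                chunks.append(current)
                current = ""
        if current:
            chunks.append(current)
        return chunks

    def index_document(db, text):
        for chunk in chunk_by_sentences(text):
            vector = embed(chunk)               # embedding model
            keyterms = extract_keyterms(chunk)  # foundation model call
            db.store(chunk, vector, keyterms)

    def search(db, user_input, k=5):
        top = db.vector_search(embed(user_input), k)
        terms = set().union(*(hit.keyterms for hit in top))   # pull keyterms in
        related = db.chunks_with_any_keyterm(terms)           # set operations
        reranked = db.rerank_by_vector(embed(user_input), related)
        return assemble_prompt(top, reranked[:k])             # new prompt text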


I've been getting great results for related documents by embedding entire blog posts, e.g. here: https://til.simonwillison.net/gis/pmtiles#related

I'm not sure how I would do that after chunking.
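
The whole-post approach is basically this shape (a simplified sketch, not the actual code; embed() stands in for the embeddings API and returns one vector per entire post):

    import numpy as np

    def related_posts(posts, target_slug, n=5):
        # posts maps slug -> full post text
        vectors = {slug: np.array(embed(text)) for slug, text in posts.items()}
        t = vectors[target_slug]
        score = lambda v: float(np.dot(t, v) / (np.linalg.norm(t) * np.linalg.norm(v)))
        others = (slug for slug in vectors if slug != target_slug)
        return sorted(others, key=lambda slug: score(vectors[slug]), reverse=True)[:n]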


Did you compare with simple baselines like bag-of-words and word vectors?


My previous implementation used TF-IDF - I basically took all the words in the post and turned them into a giant "word OR word OR word OR word" search query and piped that through SQLite full-text search. https://til.simonwillison.net/sqlite/related-content

I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.
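
The "word OR word" trick looks roughly like this (a sketch of the technique, not the actual TIL code; the table name and schema are made up):

    import re
    import sqlite3

    # Assumes an FTS5 table created elsewhere, e.g.:
    #   CREATE VIRTUAL TABLE posts_fts USING fts5(slug, title, body);
    db = sqlite3.connect("til.db")

    def related_by_shared_words(target_body, target_slug, n=10):
        words = set(re.findall(r"\w+", target_body.lower()))
        query = " OR ".join(f'"{w}"' for w in words)  # quote each term for FTS5
        # FTS5's rank column is BM25, so rare shared words count for more.
        return db.execute(
            "SELECT slug, title FROM posts_fts "
            "WHERE posts_fts MATCH ? AND slug != ? "
            "ORDER BY rank LIMIT ?",
            (query, target_slug, n),
        ).fetchall()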


> Into a giant "word OR word OR word OR word"

Does that mean you'd return other docs if they share just one word?

The idea of tfidf is that it gives you a vector (maybe combined with pca or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
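
A minimal sketch of that idea with scikit-learn (the corpus and component count are made up):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "squared euclidean distance skips the square root",
        "cosine similarity compares embedding vectors",
        "sqlite full text search uses bm25 ranking",
    ]

    tfidf = TfidfVectorizer().fit_transform(corpus)              # sparse TF-IDF vectors
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)  # dimensionality reduction

    # Then the same vector-search step you'd run on Ada embeddings:
    print(cosine_similarity(reduced[:1], reduced[1:]))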


My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.

Then I take the top ten by score and call those the "related articles".


That's not quite tfidf though. I agree you can get better results than that with Ada embeddings, but I would argue you can get even better results with embeddings from smaller chunks.


I guess technically it's bm25, since it's using the rank mechanism in SQLite FTS5: https://www.sqlite.org/fts5.html#sorting_by_auxiliary_functi...


Good point. I wonder how different it is to use a large context here vs. having some other model summarize an 8k-token article into a small paragraph and embedding that paragraph instead, where such a large context wouldn't be necessary.


Ever read the back of a book?


You mean the marketing blurb? Those tend to carry low information value, sometimes even negative - as in, if you didn't know anything else about the book, reading the blurb will make you even more wrong about it than you were. This is a common feature of marketing copy.


Isn't it up to 8k? So you can index your documents by paragraphs if you prefer?


you could do both


Is this what you mean by RAG? https://www.promptingguide.ai/techniques/rag?


I have an explanation of RAG in the context of embeddings here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-...


You could just sum it up for us all rather than do a divert to your blog?

It's Retrieval Augmented Generation btw.

To quote:

> The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.

> The LLM can then answer the question based on the additional content you provided.
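
Or, as a minimal code sketch of that pattern (embed() and complete() are hypothetical stand-ins for whatever embedding and completion APIs you use):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rag_answer(question, chunks, k=4):
        q = np.array(embed(question))
        ranked = sorted(chunks, key=lambda c: cosine(q, np.array(embed(c))), reverse=True)
        context = "\n\n".join(ranked[:k])  # stay within the model's size limit
        prompt = ("Answer the question using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return complete(prompt)

    # In practice you'd pre-compute and store the chunk embeddings rather
    # than embedding every chunk per question.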


> You could just sum it up for us all rather than do a divert to your blog?

Why? Have links gone out of fashion?

I even linked directly to the relevant section rather than linking to the top of the page.

The paper that coined the term used the hyphen, though I think I prefer it without: https://arxiv.org/abs/2005.11401


> Have links gone out of fashion?

Yes.

You wrote far more words than needed to answer the comment, so I did it for you instead.


One of the reasons I write so much stuff is so I can provide links to things I've written to answer relevant questions.


And those of us with the sense to value your insight, and the attention-span to read more than tweet-sized content, thank you for it.


Thanks so much for your writings and for posting the link (and also for Datasette!). I've learned a lot from your blog in the past few months.


Thank you, nice blog.


Appreciate it. Your posts in general have been great - accessible to a large audience, with quality links for follow-up research, and catchy analogies even when they don't fully hold true (LLM as a calculator for words - which I admit I use, with citation!). Keep going.


Just to add that we appreciate that very much.


I liked your link a lot.


"Links have gone out of fashion" is an odd thing to write on a Link Aggregator website.


You know you're responding to a programmer famous enough to have a Wikipedia page, right?

https://en.m.wikipedia.org/wiki/Simon_Willison


I don't pay the slightest fucking attention to who I'm responding to and take people on their merit/comment.

Should we all do the ad hominem thing? You are actually suggesting that?


Yes



