
I fail to imagine an 8k-token piece of text that has a single semantic coordinate and is appropriate for embedding and vector search.

In my experience, any text is better embedded using a sliding window of a few dozen words: that is the approximate size of a semantic unit in a written English document, although it varies widely across texts and topics.
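A minimal sketch of the kind of windowing described above, assuming word-level splitting and an overlap of half a window; the window and stride sizes are illustrative, not prescribed by the comment:

```python
def sliding_windows(text: str, window: int = 48, stride: int = 24) -> list[str]:
    """Split text into overlapping chunks of roughly `window` words,
    advancing by `stride` words, so each chunk can be embedded separately.

    The defaults (48-word windows, 50% overlap) are assumptions for
    illustration; tune them per corpus."""
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words) - stride, stride):
        chunks.append(" ".join(words[start : start + window]))
    return chunks
```

Each chunk would then be embedded independently, and a query matched against chunk vectors rather than one whole-document vector. The overlap keeps semantic units that straddle a chunk boundary from being split in both copies.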




What are you using those embeddings for?

I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.


Ah yes, clustering is indeed something that would benefit from a larger context, I agree.

However, even so I would think about the documents themselves and figure out whether that is even needed. Let's say we are clustering court proceedings. I'd rather extract the abstracts from these documents, then embed and cluster those instead of the whole text.
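The abstract-first approach above can be sketched as follows. `extract_abstract`, the bag-of-words `embed`, and the greedy `cluster` are all stand-ins: a real pipeline would locate the actual abstract section (or summarize with an LLM), use a dense embedding model, and a proper clustering algorithm such as k-means. Toy implementations are used here only so the flow is self-contained:

```python
import math
from collections import Counter

def extract_abstract(document: str, max_words: int = 200) -> str:
    # Placeholder: take the leading words. A real pipeline would find
    # the abstract/summary section of the court proceeding.
    return " ".join(document.split()[:max_words])

def embed(text: str) -> dict[str, float]:
    # Toy normalized bag-of-words vector, standing in for a dense model.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def cluster(docs: list[str], threshold: float = 0.5) -> list[list[int]]:
    # Greedy single-pass grouping on abstract embeddings: join the first
    # cluster whose seed document is similar enough, else start a new one.
    vecs = [embed(extract_abstract(d)) for d in docs]
    clusters: list[list[int]] = []
    for i, v in enumerate(vecs):
        for group in clusters:
            if cosine(v, vecs[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The point is the shape of the pipeline, not the toy components: each full document is reduced to one short representative text, so a single embedding per document is meaningful again, which is what clustering needs.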




