
One issue I always run into when implementing these approaches is that the embedding model's context window is too small to represent what I need.

For example, on this project, looking at the generation of training data [1], it seems like what's actually being generated are embeddings of a single string concatenated from each book's reviews, title, description, etc. [2]. With max_seq_length set to 200, wouldn't a lengthy review push the book description past the cutoff so it never gets encoded? Wouldn't that mean queries fail to match otherwise-similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style or the book's flow instead of the plot)?

[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge... [2] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
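To make the concern concrete, here's a minimal sketch of the truncation behavior, assuming sentence-transformers (the model name and texts are made up, not taken from the repo):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary example model
    model.max_seq_length = 200  # mirrors the setting in [1]

    review = "great pacing " * 300                   # long review, well past 200 tokens
    description = "a mystery set in postwar Vienna"  # made-up description
    combined = review + description

    # sentence-transformers truncates inputs to max_seq_length tokens, so
    # everything past token 200 (including the description) is dropped.
    same = np.allclose(model.encode(combined), model.encode(review))
    print(same)  # True: the description never influenced the embedding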



I have the same problem with a project I'm working on. In my case, I'm chunking the documents and encoding the chunks, then doing semantic search over the embeddings of the chunked documents. It has some drawbacks, but it's the best approach I could think of.
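For anyone curious, the chunk-then-embed pattern I mean is roughly this (a minimal sketch, again assuming sentence-transformers; the chunk size, overlap, and toy corpus are all made up):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text, size=150, overlap=30):
        # Naive word-window chunking with overlap, so content straddling a
        # boundary still appears intact in at least one chunk.
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    docs = {
        "book-1": "long review text ... plus the publisher description ...",
        "book-2": "another review ... another description ...",
    }

    # Embed every chunk, remembering which document it came from.
    chunks, owners = [], []
    for doc_id, text in docs.items():
        for c in chunk(text):
            chunks.append(c)
            owners.append(doc_id)
    chunk_embs = model.encode(chunks, convert_to_tensor=True,
                              normalize_embeddings=True)

    def search(query, k=5):
        q = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
        hits = util.cos_sim(q, chunk_embs)[0].topk(min(k, len(chunks)))
        # A document matches if any one of its chunks matches, which is the
        # point: the description can hit even when the reviews don't.
        return [(owners[int(i)], float(s)) for s, i in zip(hits.values, hits.indices)]

    print(search("postwar mystery"))

The drawbacks I mentioned: results need deduping per document, and ranking by best chunk ignores how much of the document actually matched.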



