
One issue I always run into when implementing these approaches is that the embedding model's context window is too small to represent what I need.

For example, on this project, looking at the generation of training data [1], it seems like what's actually being generated are embeddings of a single string concatenated from each book's reviews, title, description, etc. [2]. With max_seq_length set to 200, wouldn't a lengthy review push the book description past the cutoff so it never gets encoded? Wouldn't that mean queries fail to match otherwise-similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style or the book's flow instead of the plot)?

[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge... [2] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
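To make the concern concrete, here's a minimal sketch of the truncation behavior, assuming sentence-transformers (the model name and texts are made up, not taken from the repo):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary example model
    model.max_seq_length = 200  # mirrors the setting in [1]

    review = "great pacing " * 300                   # long review, well past 200 tokens
    description = "a mystery set in postwar Vienna"  # made-up description
    combined = review + description

    # sentence-transformers truncates inputs to max_seq_length tokens, so
    # everything past token 200 (including the description) is dropped.
    same = np.allclose(model.encode(combined), model.encode(review))
    print(same)  # True: the description never influenced the embedding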



I have the same problem with a project I'm working on. In my case, I'm chunking the documents and encoding the chunks, then doing semantic search over the embeddings of the chunked documents. It has some drawbacks, but it's the best approach I could think of.
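For anyone curious, the chunk-then-embed pattern I mean is roughly this (a minimal sketch, again assuming sentence-transformers; the chunk size, overlap, and toy corpus are all made up):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text, size=150, overlap=30):
        # Naive word-window chunking with overlap, so content straddling a
        # boundary still appears intact in at least one chunk.
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    docs = {
        "book-1": "long review text ... plus the publisher description ...",
        "book-2": "another review ... another description ...",
    }

    # Embed every chunk, remembering which document it came from.
    chunks, owners = [], []
    for doc_id, text in docs.items():
        for c in chunk(text):
            chunks.append(c)
            owners.append(doc_id)
    chunk_embs = model.encode(chunks, convert_to_tensor=True,
                              normalize_embeddings=True)

    def search(query, k=5):
        q = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
        hits = util.cos_sim(q, chunk_embs)[0].topk(min(k, len(chunks)))
        # A document matches if any one of its chunks matches, which is the
        # point: the description can hit even when the reviews don't.
        return [(owners[int(i)], float(s)) for s, i in zip(hits.values, hits.indices)]

    print(search("postwar mystery"))

The drawbacks I mentioned: results need deduping per document, and ranking by best chunk ignores how much of the document actually matched.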



