
Not sure if I can answer those questions.

I'm quite new to the field myself. Many more experienced people, including the OP, I believe, would suggest that you use sentence transformers for this task. I personally don't understand how those are trained or fine-tuned, and I have never done it myself so far. What I do know is that sentence transformer results are horrible if you use them out of the box on anything other than the domain in which they were trained.

There's also the question of the compute and RAM necessary to generate the embeddings. In one of my own projects, vectorizing 40M text snippets with sentence transformers would have taken 30 days on my desktop. So all I have worked with so far is fastText. It's several hundred times faster than any BERT model, and it can be trained quite easily, from scratch, on your own domain corpus. Again, this may be an outdated technique, but it's the one thing where I have some practical experience. And it does work quite well for building a semantic search engine that is fast and does not require a GPU.
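For illustration, training fastText from scratch with the official Python bindings looks roughly like this (file name and hyperparameters are placeholders; tune them for your corpus):

    import fasttext

    # corpus.txt: plain text, one preprocessed snippet per line (hypothetical file)
    model = fasttext.train_unsupervised(
        "corpus.txt",
        model="skipgram",  # or "cbow"
        dim=100,           # embedding size
        epoch=5,
        minCount=5,        # ignore very rare tokens
    )
    model.save_model("domain_vectors.bin")

    # quick sanity check that the vectors picked up your domain vocabulary
    print(model.get_nearest_neighbors("enzyme"))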

The problem with fastText is that it only creates embeddings for individual words. You can use it to generate an embedding for a whole sentence or paragraph, but internally it will just generate embeddings for all the words in the text and average them. That doesn't give you a very good representation of what the snippet is about, because common words like prepositions get just as much weight as the actual keywords in the sentence.
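To make that concrete, here is a minimal sketch of such equal-weight averaging (fastText's built-in sentence vectors are similar in spirit), reusing the hypothetical model from above:

    import numpy as np
    import fasttext

    model = fasttext.load_model("domain_vectors.bin")

    def naive_snippet_vector(snippet):
        # every token gets the same weight, so "the" and "of" dilute the keywords
        words = snippet.lower().split()
        return np.mean([model.get_word_vector(w) for w in words], axis=0)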

Some smart people, however, figured out that you can pair BM25 with fastText: with BM25, you essentially build a document-frequency dictionary over your corpus that tells you which words are rare, and therefore especially meaningful, and which are commonplace. Then you let fastText generate embeddings for all the words in your snippet. But instead of averaging them with equal weights, you use the BM25 scores as your weights. Words that are rare and special thus get greater influence on the vector than commonplace words.
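A rough sketch of that weighting scheme, assuming a tokenized corpus and the fastText model from above (the exact BM25/IDF variant and smoothing are up to you):

    import math
    from collections import Counter
    import numpy as np
    import fasttext

    model = fasttext.load_model("domain_vectors.bin")
    corpus = [s.lower().split() for s in snippets]  # snippets: your list of text snippets

    # document frequency: in how many snippets does each word occur?
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    N = len(corpus)

    def idf(word):
        # BM25-style IDF: rare words get large weights, common words near zero
        n = df.get(word, 0)
        return math.log(1.0 + (N - n + 0.5) / (n + 0.5))

    def weighted_snippet_vector(tokens):
        # weighted average instead of a plain mean
        vecs = np.array([model.get_word_vector(w) for w in tokens])
        weights = np.array([idf(w) for w in tokens])
        return (weights[:, None] * vecs).sum(axis=0) / (weights.sum() + 1e-9)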

FastText understands no context, and it does not understand affirmation or negation. All it will do is find you the snippets that use either the same words as your query, or words that are often used in place of your query words within your corpus. But I find that is already a big improvement over mere keyword search, or even BM25, because it will find snippets that use different words but talk about the same concept.

Since fastText is computationally "cheap", you can afford to split your documents into several overlapping sets of snippets: whole paragraphs, 3-sentence, 2-sentence, and 1-sentence windows, for instance. At query time, if two results overlap, you just display the one that ranks highest.
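A sketch of that multi-granularity splitting; I'm assuming NLTK's sentence tokenizer here, but any sentence splitter will do:

    from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt") once

    def sentence_windows(sentences, size):
        # overlapping windows of `size` consecutive sentences
        if len(sentences) < size:
            return []
        return [" ".join(sentences[i:i + size]) for i in range(len(sentences) - size + 1)]

    def snippets_for(paragraph):
        sentences = sent_tokenize(paragraph)
        snippets = [paragraph]  # the whole paragraph as one snippet
        for size in (3, 2, 1):
            snippets.extend(sentence_windows(sentences, size))
        return snippets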

Personally, I would imagine that document-level search wouldn't be very satisfying for the user. We're so used to Google finding us not only the page we're looking for, but also the position within the page where the content we're after can be found. With a scientific article, it would be painful to have to scroll through the entire thing and skim it just to find out whether it actually answers our query or not.

And with sentence-level search, you'd be missing out on all the context in which the sentence appears.

As the low-hanging fruit, why not go with the units chosen by the authors of the articles, i.e. the paragraphs? Even if those vary wildly in length, it would be a starting point. If you then find that the results are too long or too short, you can make adjustments: for snippets that are too short, you could "pull in" the next one, and those that are too long you could split, perhaps even with overlap. I think for both the human end user and most NLP models, anything the length of a tweet, or around 200 characters, is about the sweet spot. If you can find a way to split your documents into such units, you'll probably do well, regardless of which technology you end up using.
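One possible way to normalize paragraph lengths along those lines; the thresholds and the merge/split logic here are just my own hypothetical starting values:

    def normalize_snippets(paragraphs, min_len=100, max_len=300, overlap=50):
        # merge too-short paragraphs into the next one; split too-long ones in half with overlap
        out, i = [], 0
        while i < len(paragraphs):
            p = paragraphs[i]
            if len(p) < min_len and i + 1 < len(paragraphs):
                out.append(p + " " + paragraphs[i + 1])
                i += 2
            elif len(p) > max_len:
                mid = len(p) // 2
                out.append(p[:mid + overlap])
                out.append(p[mid - overlap:])
                i += 1
            else:
                out.append(p)
                i += 1
        return out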

You can also check out Weaviate. If you use Weaviate, you don't have to worry about creating embeddings at all; you just focus on how you split and structure your material. You could have, for instance, an index called "documents" and an index called "paragraphs". The former would contain things like the publication date, the names of the authors, etc. The latter would contain the text of each paragraph, along with its position in the document. Then you can ask Weaviate to find you the paragraphs that are semantically closest to query XYZ, and to also tell you which articles they belong to.
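Roughly what that looks like with the v3 Weaviate Python client; the class and property names are just examples, and the exact calls may differ in newer client versions:

    import weaviate

    client = weaviate.Client("http://localhost:8080")

    client.schema.create_class({
        "class": "Document",
        "properties": [
            {"name": "title", "dataType": ["text"]},
            {"name": "authors", "dataType": ["text[]"]},
        ],
    })
    client.schema.create_class({
        "class": "Paragraph",
        "properties": [
            {"name": "text", "dataType": ["text"]},
            {"name": "position", "dataType": ["int"]},
            {"name": "inDocument", "dataType": ["Document"]},  # cross-reference to the parent article
        ],
    })

    # semantic query: paragraphs closest to "XYZ"
    result = (
        client.query
        .get("Paragraph", ["text", "position"])
        .with_near_text({"concepts": ["XYZ"]})
        .with_limit(10)
        .do()
    )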

You can also add negative "weights", i.e. search for paragraphs talking about "apple", but not in the context of "fruit" (useful if you're searching for Apple, the company).
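That negative weighting maps onto the nearText "moveAwayFrom" option; a sketch with the same client, where the force value is an arbitrary example:

    result = (
        client.query
        .get("Paragraph", ["text"])
        .with_near_text({
            "concepts": ["apple"],
            "moveAwayFrom": {"concepts": ["fruit"], "force": 0.45},  # push results away from the fruit sense
        })
        .with_limit(10)
        .do()
    )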

These things work out of the box in Weaviate. Also, you can set up Weaviate with GloVe (a static word-embedding approach similar in spirit to fastText), or you can set it up with transformers. So you can try both approaches without actually having to train or fine-tune the models yourself. With the transformers setup, there is also a Q&A module that will search within your snippets and find the substring it thinks is the answer to your question.
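If the qna-transformers module is enabled, the question-answering part looks roughly like this; this is my sketch from memory, so check the Weaviate docs for the exact response shape:

    result = (
        client.query
        .get("Paragraph", ["text", "_additional { answer { result certainty } }"])
        .with_ask({"question": "What year was the company founded?", "properties": ["text"]})
        .with_limit(1)
        .do()
    )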

I have successfully run Weaviate on a $50 DigitalOcean droplet for sub-second semantic queries on a corpus of 20+M text snippets. Only for getting the data into the system did I need a more powerful server. But you can run the ingestion on a bigger machine, and once it's done, just move things over to the smaller one.




Thank you for the write-up. As far as my research has gone, that is the best description I have seen of how to go about planning for vectorization. Undoubtedly, experimentation on our corpus is required, but it helps to have an overview so we don't wildly run down the wrong paths early on.


Exactly. Training models, generating embeddings, and building databases can all take days to run and cost hundreds of dollars in server fees. It's painful to have done all that only to realize that you've gone down the wrong path. It pays to test your pipeline on a smaller batch first; most likely you will discover some issues in your process, and that iterative cycle is much faster if you work with a small test set and do not switch over to your main corpus prematurely.



