Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the benchmark itself is text similarity and the texts being compared use distinct vocabularies.
The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.
ex. from my unit tests:
"what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match to "my jewelry is stored safely in the safe"
Found this fascinating, thanks. I've been circling SBERT over the last few weeks (along with a range of other techniques to improve retrieval quality), and reading your comment, the linked post, and its comments has really cemented for me that we're on the right track.
I learned this from https://news.ycombinator.com/item?id=35377935: thank you to whoever posted it; it blew my mind and gave me a powerful differentiator.