refulgentis on July 29, 2023 | on: Gzip beats BERT? Part 2: dataset issues, improved ...

Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the test itself is text similarity, and even more so when the things being compared use distinct vocabularies.

The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.

An example from my unit tests: "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match with "my jewelry is stored safely in the safe".
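For contrast, here is a rough sketch of what a gzip-style score sees on the same strings. It uses normalized compression distance (NCD), which is my assumption about the kind of measure meant by "gzip" here, and the helper names (`compressed_size`, `ncd`, `rank_by_ncd`) are made up for illustration. The point is that the score depends only on shared byte sequences, so a literal overlap like "safe" can pull a sentence closer whether or not it answers the question.

```python
import gzip


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared byte patterns."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def rank_by_ncd(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Return (distance, candidate) pairs sorted from closest to farthest."""
    return sorted((ncd(query, c), c) for c in candidates)


if __name__ == "__main__":
    query = "what is my safe passcode?"
    candidates = [
        "my lockbox pin is 1234",
        "my jewelry is stored safely in the safe",
    ]
    # Purely lexical: nothing here knows that "pin" answers "passcode".
    for score, text in rank_by_ncd(query, candidates):
        print(f"{score:.3f}  {text}")
```

On strings this short the gzip header dominates the compressed size, so the scores are noisy; treat the ranking as a demonstration of what the measure looks at, not as a benchmark.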

I learned this from https://news.ycombinator.com/item?id=35377935. Thank you to whoever posted it; it blew my mind and gave me a powerful differentiator.

darkteflon on July 30, 2023

Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.