Hacker News

Here's an example of Instructor Embeddings w/FeatureBase: https://gist.github.com/kordless/aae99946e7e2a5afccc83f3c4ee...

Instructor Embeddings rank high on various leaderboards for embeddings and can be run locally, regardless of how they are stored. It takes about half a second to embed 20 strings and about 2.2 seconds to embed 80 strings. I haven't tested this with different batch sizes or with GPU acceleration (I don't know if that's possible). It is also possible to quantize the vectors to 8-bit floats.
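FeatureBase's actual 8-bit encoding isn't described here; as a rough illustration of the storage trade-off, here's a sketch of quantizing a float vector to 8-bit integers with a per-vector scale (a common scheme, used purely for illustration -- the names and approach are hypothetical, not the product's format):

```python
# Hypothetical 8-bit quantization sketch -- not FeatureBase's actual encoding.
def quantize(vec):
    """Map floats to int8 range [-127, 127] using a per-vector scale."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale

def dequantize(qvec, scale):
    return [q * scale for q in qvec]

vec = [0.12, -0.5, 0.33, 0.99]
qvec, scale = quantize(vec)
approx = dequantize(qvec, scale)
# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(vec, approx))
```

The distances computed over dequantized vectors drift slightly from the float32 originals, which is usually acceptable for ranking.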

I'm using FeatureBase to store these because a) I work there and b) it will store and search the vectors by euclidian_distance and cosine_distance. Right now this is a cloud-only feature, but we'll work on getting it into the community release at some point.

Combined with our existing support for sets, set-intersection operations like tanimoto(), and filtering and aggregation (all done in memory using roaring bitmaps), this presents an interesting offering for storing training data and reporting. Being able to filter which vectors are compared by distance makes nearest-neighbor search algorithms almost unnecessary, except for extreme use cases. At that scale, it might be better to use knowledge graphs to filter the vector space instead of storing tens of millions of dense vectors and running approximate search over them.
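The filter-then-measure idea can be sketched in a few lines of Python, with plain sets standing in for the roaring bitmaps (the data and function names are made up for illustration, not FeatureBase syntax):

```python
import math

# Sketch of "filter first, then exact distance." Python sets stand in for
# roaring bitmaps; names and data are hypothetical, not FeatureBase syntax.
vectors = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}
# Set membership per document, e.g. tags stored as bitmap columns.
tags = {"doc1": {"ml", "2023"}, "doc2": {"ml"}, "doc3": {"sql", "2023"}}

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def search(query_vec, required_tags, k=2):
    # Cheap set intersection narrows the candidate list first, so the
    # exact distance scan only touches the few surviving vectors.
    candidates = [d for d, t in tags.items() if required_tags <= t]
    return sorted(candidates, key=lambda d: cosine_distance(query_vec, vectors[d]))[:k]

print(search([1.0, 0.0, 0.0], {"ml"}))  # exact scan over doc1 and doc2 only
```

Because the candidate set after filtering is small, an exact linear scan is affordable and no approximate index is needed.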



Cool. I was wondering what Tanimoto meant, since I've tried to make myself familiar with all the useful similarity measures, and apparently it's just another name for the Jaccard index.

I do think there is a lot of potential in exploring more sensitive measures of similarity or statistical dependence. It seems like the ML community has basically decided that all the heavy lifting should be done at the model embedding level, and then you can just use cosine similarity for speed and the answers just "fall out". Which is definitely nice because then you can search across millions of records per second.

But there are some lesser-known measures of similarity/dependence that can pick up on more subtle relationships -- the big drawback is that they are slow. I included a couple of these exotic ones, Hoeffding's D and HSIC, in my project, mostly out of curiosity.
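For the curious, here's what a minimal biased HSIC estimator with Gaussian kernels looks like -- a pure-Python sketch, not the project's actual implementation. The two n-by-n kernel matrices are exactly why these measures are slow compared to cosine similarity:

```python
import math

# Minimal biased HSIC estimator with Gaussian kernels (illustrative sketch,
# not the project's implementation). O(n^2) time and memory.
def hsic(x, y, sigma=1.0):
    """Biased estimator: trace(K H L H) / n^2, H = I - (1/n) * ones."""
    n = len(x)
    def gauss(a, b):
        return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    kx = [[gauss(xi, xj) for xj in x] for xi in x]
    ky = [[gauss(yi, yj) for yj in y] for yi in y]

    def center(m):
        # Double-center: subtract row and column means, add back the grand mean.
        row = [sum(r) / n for r in m]
        col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
        grand = sum(row) / n
        return [[m[i][j] - row[i] - col[j] + grand for j in range(n)]
                for i in range(n)]

    kc, lc = center(kx), center(ky)
    return sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n)) / n ** 2

x = [i / 10 for i in range(20)]
assert hsic(x, [v * v for v in x]) > 1e-4   # nonlinear dependence shows up
assert abs(hsic(x, [0.0] * 20)) < 1e-12     # constant y carries no dependence
```

Unlike cosine similarity on precomputed embeddings, there's no obvious way to index this, so it's best reserved for small candidate sets.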


> apparently it's just another name for Jaccard Index

Yes, it produces the same values for set comparisons: https://www.featurebase.com/blog/tanimoto-similarity-in-feat...
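The equivalence is easy to check in a few lines of Python (plain sets here, not FeatureBase's tanimoto()):

```python
# Tiny check that the Tanimoto coefficient on sets reduces to the Jaccard
# index (plain Python sets, not FeatureBase's tanimoto()).
def jaccard(a, b):
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    # c / (|a| + |b| - c), the bitwise form of the Tanimoto coefficient.
    c = len(a & b)
    return c / (len(a) + len(b) - c)

a, b = {1, 2, 3, 4}, {3, 4, 5}
assert jaccard(a, b) == tanimoto(a, b) == 2 / 5
```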



