Since Hamming distance calculations take relatively few gates, it might be possible to compute them in-memory (RAM/flash) to achieve a massive amount of parallelism (for example, a dedicated Hamming distance block for each stored vector) and at much lower power.
This could be useful for web-scale embeddings search.
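A minimal sketch of why this is plausible: for bit-packed vectors, Hamming distance is just XOR followed by a popcount, which is about as small as a compare circuit gets. The 16-bit toy values below are made up for illustration.

    def hamming(a: int, b: int) -> int:
        """Hamming distance between two bit-packed vectors stored as ints."""
        return (a ^ b).bit_count()  # XOR marks differing bits, popcount counts them (Python 3.10+)

    q = 0b1011_0010_1110_0001  # toy 16-bit query
    v = 0b1001_0110_1010_0011  # toy 16-bit stored vector
    print(hamming(q, v))       # -> 4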
Yeah, that's exactly right. You take the embedding of your query and then compare it to the embeddings of your documents (or chunks of documents).
You can use vector databases that implement the Approximate Nearest Neighbors (ANN) algorithm, or if you don't have too many documents you can just brute force the similarity comparison.
I'm not sure at what point you _really_ need or want a vector database, but anecdotally, calculating the Hamming distance for a couple thousand vectors seems pretty negligible.
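For a sense of scale, here is a hedged NumPy sketch of that brute-force comparison over a few thousand binary embeddings; the dimension and corpus size are assumptions, not benchmarks.

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_docs = 1024, 5000                       # assumed embedding size / corpus size
    docs  = rng.integers(0, 2, size=(n_docs, dim), dtype=np.uint8)
    query = rng.integers(0, 2, size=dim, dtype=np.uint8)

    packed_docs  = np.packbits(docs, axis=1)       # (n_docs, dim // 8) bytes
    packed_query = np.packbits(query)

    xor   = np.bitwise_xor(packed_docs, packed_query)   # differing bits, byte-packed
    dists = np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per document
    top10 = np.argsort(dists)[:10]                      # closest documents to the query
    print(top10, dists[top10])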
Very cool, thanks. I am using pgvector right now. I will run into scaling issues at some point, I suspect/hope.
I’ve been thinking about some setup on object storage. The data I have could be grouped per account. If you can compress the total index to 1% of the size of traditional float vectors, you can probably get a long way by fetching the binaries from S3 into some sort of warm cache location only when necessary.
Or just get a hetzner machine with 20TB of storage and not worry about it I guess
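A rough sketch of the per-account warm-cache idea above. The bucket name, key layout, and cache directory are assumptions, not a tested design; only the standard boto3 get_object call is real.

    import os
    import boto3

    s3 = boto3.client("s3")
    CACHE_DIR = "/var/cache/binary-index"   # hypothetical warm-cache location
    BUCKET = "my-embeddings-bucket"         # hypothetical bucket name

    def load_account_index(account_id: str) -> bytes:
        """Return the packed binary index for one account, fetching from S3 on a cache miss."""
        path = os.path.join(CACHE_DIR, f"{account_id}.bin")
        if not os.path.exists(path):
            obj = s3.get_object(Bucket=BUCKET, Key=f"indexes/{account_id}.bin")
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(obj["Body"].read())
        with open(path, "rb") as f:
            return f.read()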
It got me thinking, what might it look like to natively train a binary quantized embedding model? You can’t do calculus per se on {0,1}, but maybe you could do something like randomly flip bits with a probability weighted by the severity of the error during backprop… anyway, I’m sure there’s plenty of literature about this.
Probably just an extreme version of quantization-aware training? During training you round the prediction to the range you want, but keep it as a float.
Since rounding isn’t differentiable, there are fancy techniques to approximate that as well.
> QAT backward pass typically uses straight-through estimators (STE), a mechanism to estimate the gradients flowing through non-smooth functions
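A minimal PyTorch sketch of that straight-through estimator idea: the forward pass emits hard {-1, +1} codes, while the backward pass pretends the sign() was (clipped) identity. The model shape and names are illustrative, not from any specific paper.

    import torch
    import torch.nn as nn

    class BinarizeSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)              # hard {-1, +1} codes in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            # straight-through estimator: pass the gradient as if sign() were identity,
            # zeroed outside [-1, 1] (the common "clipped" variant)
            return grad_output * (x.abs() <= 1).to(grad_output.dtype)

    class BinaryEmbedder(nn.Module):
        def __init__(self, in_dim=768, code_dim=256):
            super().__init__()
            self.proj = nn.Linear(in_dim, code_dim)

        def forward(self, x):
            return BinarizeSTE.apply(self.proj(x))   # train end to end on the binary codes

    x = torch.randn(4, 768, requires_grad=True)
    codes = BinaryEmbedder()(x)     # values in {-1, +1}
    codes.sum().backward()          # gradients flow thanks to the STE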
Yes, binary embeddings are cool indeed. And because they can be so small, they even double as powerful cluster identifiers. Built this a while ago: https://huggingface.co/spaces/iscc/iscc-sct
I wonder what would happen if we quantized each dimension to 0.5 (or even fewer) bits instead of 1, i.e., taking 2 (or more) scalar components at a time and mapping them to 0 or 1 based on some carefully designed rules.
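One toy way to make the "0.5 bits per dimension" idea concrete: reduce each pair of components to a single bit via a random hyperplane through the origin. The rule below is just one possible choice, picked for illustration rather than taken from any existing scheme.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8                                        # toy embedding size (even, so it splits into pairs)
    planes = rng.standard_normal((dim // 2, 2))    # one 2-D hyperplane normal per pair of components

    def half_bit_quantize(x: np.ndarray) -> np.ndarray:
        pairs = x.reshape(-1, 2)                   # take 2 scalar components at a time
        return (np.einsum("ij,ij->i", pairs, planes) > 0).astype(np.uint8)

    x = rng.standard_normal(dim)
    print(half_bit_quantize(x))                    # dim/2 bits for a dim-dimensional vector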