
Hashing for large-scale similarity - suphyr
https://mesuvash.github.io/blog/2019/Hashing-for-similarity/
======
ovi256
If you're interested in state-of-the-art approaches, there is work around
learned similarity functions.

The first method uses autoencoders, which are networks that try to reconstruct
their input while passing it through a constricted, smaller space. This smaller
space is hypothesized to be a good representation of your inputs. The first
half of the autoencoder is the function mapping input -> restricted space; it
behaves like a hash function and can be used as one for similarity measures.
An example paper from this line of work: "Using very deep autoencoders for
content-based image retrieval"
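
To make that concrete, here's a minimal sketch of the encoder-as-hash idea.
The weights below are random stand-ins for a trained encoder, and the 256/32
dimensions are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in weights for an already-trained encoder (in practice these
    # come from training the full autoencoder to reconstruct its input).
    W_enc = rng.normal(size=(256, 32))  # 256-dim input -> 32-dim bottleneck

    def encode(x):
        # First half of the autoencoder: input -> restricted space.
        return np.tanh(x @ W_enc)

    def binary_hash(x):
        # Threshold the bottleneck activations to get a 32-bit binary code.
        return (encode(x) > 0).astype(np.uint8)

    x = rng.normal(size=256)
    print(binary_hash(x))  # e.g. [1 0 1 ...]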

Another, conceptually simpler approach is to directly train a similarity
function using triplet loss. For each input, have a sample labelled as similar
and a sample labelled as dissimilar. The loss forces the learned function to
output similar hashes for the similar inputs and dissimilar hashes for the
dissimilar inputs. This works surprisingly well. See for example: "Loc2Vec:
Learning location embeddings with triplet-loss networks". For image inputs,
it's best for the learned function to have some convolutional layers.
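
The loss itself is just a hinge on two embedding distances; a minimal NumPy
sketch (the margin of 1.0 is an arbitrary choice):

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=1.0):
        # Hinge on squared distances: push the dissimilar pair at least
        # `margin` further from the anchor than the similar pair.
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(0.0, d_pos - d_neg + margin)

    a, p, n = np.ones(8), np.ones(8) * 1.1, np.zeros(8)
    print(triplet_loss(a, p, n))  # 0.0: positive is close, negative is far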

~~~
mesuvash
Thanks for the pointers.

From my personal experience, autoencoders are amazing for dense input (images,
audio, etc.), specifically when the input feature space is not large. However,
in many real-world problems such as recommendation and ranking, the feature
space is generally very sparse, e.g. clicks or purchases over a large item
catalogue (say 100M items). In such cases, scaling neural models, especially
autoencoders, can be challenging.
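
A quick back-of-the-envelope (numbers purely illustrative) shows the problem:

    # A dense encoder over a one-hot input of 100M items needs
    # n_items x bottleneck weights in its first layer alone.
    n_items, bottleneck = 100_000_000, 128
    params = n_items * bottleneck
    print(params / 1e9, "B params,", params * 4 / 1e9, "GB in float32")
    # -> 12.8 B params, 51.2 GB in float32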

------
carlmr
The Jaccard and Cosine similarity metrics are easy to grasp. I think the
hashes could take some work.

Also, you're counting up to k hash functions, but the element before the three
dots is already element k. I think it should be 1, 2, ..., k instead of
1, k, ..., k.
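
In other words, the signature should be built from k distinct hash functions
h_1, h_2, ..., h_k, one per signature slot. Something like this toy MinHash
sketch (hash parameters drawn at random, k=4 arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    k, p = 4, 2**31 - 1
    # k independent hash functions h_i(x) = (a_i * x + b_i) mod p
    a = rng.integers(1, p, size=k)
    b = rng.integers(0, p, size=k)

    def minhash(items):
        # Signature slot i = min of h_i over the set's elements.
        x = np.array(sorted(items), dtype=np.int64)
        return np.min((a[:, None] * x + b[:, None]) % p, axis=1)

    s1, s2 = {1, 2, 3, 4}, {2, 3, 4, 5}
    # Fraction of matching slots estimates Jaccard(s1, s2) = 3/5.
    print(np.mean(minhash(s1) == minhash(s2)))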

~~~
mesuvash
>> I think the hashes could take some work.

Any suggestions or things that are not clear?

Thanks for your feedback. I shall update the post accordingly.

------
orasis
If you’re looking for other connections, simhash is very closely related to
random projections and can be used for dimensionality reduction in machine
learning.
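
Concretely, the simhash of a real-valued vector is just the sign pattern of
its random projections; a minimal sketch (sizes arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_bits = 300, 64  # arbitrary input and code sizes
    # One random hyperplane per bit; the Hamming distance between codes
    # tracks the angular (cosine) distance between the original vectors.
    planes = rng.normal(size=(n_bits, d))

    def simhash(x):
        return (planes @ x > 0).astype(np.uint8)

    x = rng.normal(size=d)
    y = x + 0.1 * rng.normal(size=d)  # slightly perturbed copy of x
    print(np.sum(simhash(x) != simhash(y)))  # few bits flip for similar vectors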

