Hashing for large-scale similarity (mesuvash.github.io)
57 points by suphyr 5 months ago | 5 comments

If you're interested in state-of-the-art approaches, there is a line of work on learned similarity functions.

The first method uses autoencoders, networks that try to reconstruct their input while passing it through a constricted, smaller space. This smaller space is hypothesised to be a good representation of the inputs. The first half of the autoencoder is the function mapping input -> restricted space; it behaves like a hash function and can be used as one for similarity measures. An example paper for this approach: "Using very deep autoencoders for content-based image retrieval"
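To make the idea concrete, here is a minimal numpy sketch: a linear autoencoder trained on toy data, whose encoder half is then binarised into a hash. All sizes, names, and the training setup here are illustrative assumptions, not from the paper above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 16 dims that really live near an 8-dim subspace.
basis = rng.normal(size=(8, 16)) / np.sqrt(8)
X = rng.normal(size=(200, 8)) @ basis

# Linear autoencoder 16 -> 4 -> 16; the 4-dim middle is the "constricted space".
W_enc = rng.normal(scale=0.1, size=(16, 4))
W_dec = rng.normal(scale=0.1, size=(4, 16))

def reconstruction_error():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

err_before = reconstruction_error()

# Plain gradient descent on the mean squared reconstruction error.
lr = 0.02
for _ in range(1000):
    Z = X @ W_enc                          # codes in the restricted space
    R = Z @ W_dec - X                      # reconstruction residual
    g_dec = (Z.T @ R) / len(X)
    g_enc = (X.T @ (R @ W_dec.T)) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

err_after = reconstruction_error()

# The encoder half doubles as a hash: binarise the code by sign.
def hash_code(x):
    return tuple(int(b) for b in (x @ W_enc > 0))
```

A real system would use a deep nonlinear encoder, but the shape of the trick is the same: train for reconstruction, keep the encoder, binarise its output.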

Another, conceptually simpler approach is to train a learned similarity function directly using a triplet loss. Take samples of each input, each labelled with a similar input and a dissimilar input. This forces the learned function to output similar hashes for similar inputs and dissimilar hashes for dissimilar inputs. It works surprisingly well. See for example: "Loc2Vec: Learning location embeddings with triplet-loss networks" For image inputs, it's best for the learned function to have some convolutional layers.
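A minimal sketch of the triplet loss itself; the function name and the default margin are my own choices, not from the Loc2Vec paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: push d(anchor, positive) at least `margin` below d(anchor, negative)."""
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)   # squared distance to the similar example
    d_neg = np.sum((a - n) ** 2)   # squared distance to the dissimilar example
    return float(max(0.0, d_pos - d_neg + margin))
```

When the negative is already far enough away the loss is zero, so gradient updates come only from the hard triplets, which is what pulls similar inputs together in the embedding space.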

Thanks for the pointers.

From my personal experience, autoencoders are great for dense input (images, audio, etc.), specifically when the input feature space is not too large. However, in many real-world problems such as recommendation and ranking, the feature space is generally very sparse, e.g. clicks or purchases over a catalogue of, say, 100M items. In such cases, scaling neural models, especially autoencoders, can be challenging.

The Jaccard and Cosine similarity metrics are easy to grasp. I think the hashes could take some work.

Also, you're counting up to k hash functions, but the element before the three dots is already element k. I think it should be 1, 2, ..., k instead of 1, k, ..., k.
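For concreteness, here is a minimal MinHash sketch with k hash functions (indexed 1, 2, ..., k), where the fraction of matching signature slots estimates the Jaccard similarity. The salting scheme is just one simple way to get k hash functions; everything here is an illustrative sketch, not the post's implementation.

```python
import random

def minhash_signature(items, k=128, seed=0):
    """MinHash signature: for each of k salted hash functions, keep the minimum hash over the set."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(k)]
    return [min(hash((s, x)) for x in items) for s in salts]

def estimated_jaccard(sig_a, sig_b):
    # Per hash function, P(the two minima collide) equals the Jaccard similarity,
    # so the fraction of matching slots is an unbiased estimate of it.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

For example, sets {0..9} and {5..14} have true Jaccard similarity 5/15 = 1/3, and the signature estimate converges to that as k grows.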

> I think the hashes could take some work.

Any suggestions or things that are not clear?

Thanks for your feedback. I shall update the post accordingly.

If you’re looking for other connections, simhash is very closely related to random projections and can be used for dimensionality reduction in machine learning.
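A sketch of that connection (the dimensions and bit count are arbitrary): each simhash bit is the sign of one random projection, so the Hamming agreement between two hashes is monotone in the angle between the vectors, and hence in their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)
planes = rng.normal(size=(64, 10))  # 64 random hyperplanes in a 10-d input space

def simhash(v):
    """One bit per hyperplane: which side of the plane does v fall on?"""
    return (planes @ v > 0).astype(int)

def hamming_agreement(h1, h2):
    """Number of bit positions on which the two hashes agree."""
    return int(np.sum(h1 == h2))
```

Nearby vectors agree on most bits, v and -v disagree on all of them, and scaling v changes nothing; only the direction matters, exactly as with cosine similarity.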

