The first method uses autoencoders, which are networks that try to reconstruct their input while passing it through a constricted, lower-dimensional space. This smaller space is hypothesized to be a good representation of your inputs. The first half of the autoencoder is the function mapping input -> restricted space; it behaves much like a hash function and can be used as one for similarity measures.
An example paper using this method: "Using very deep autoencoders for content-based image retrieval"
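To make the idea concrete, here is a minimal sketch of a linear autoencoder in numpy. Everything here (dimensions, learning rate, the `encode` helper) is illustrative, not taken from the papers above; the point is just that the encoder half, trained only on reconstruction, yields a compact code usable for similarity lookups.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 200, 8, 2                 # samples, input dim, bottleneck dim (illustrative)
X = rng.normal(size=(n, d))

W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights: input -> code
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights: code -> reconstruction
lr = 0.01

def encode(X, W_enc):
    """First half of the autoencoder: maps inputs to the bottleneck code."""
    return X @ W_enc

losses = []
for _ in range(500):
    Z = encode(X, W_enc)            # codes
    X_hat = Z @ W_dec               # reconstruction attempt
    err = X_hat - X
    losses.append(np.mean(err ** 2))
    # gradient descent on the (scaled) squared reconstruction error
    g_dec = Z.T @ err / n
    g_enc = X.T @ (err @ W_dec.T) / n
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# After training, the 2-dim code acts as the "hash" of a sample;
# nearby codes indicate similar inputs.
code = encode(X[:1], W_enc)
```

A real system would use a deep nonlinear encoder rather than this single linear layer, but the structure (train on reconstruction, keep only the encoder) is the same.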
Another, conceptually simpler approach is to train a similarity function directly using triplet loss.
Take samples of each input, each labelled with a similar input and a dissimilar input. This forces the learned function to output similar hashes for the similar inputs and dissimilar hashes for the dissimilar inputs. This works surprisingly well. See for example: "Loc2Vec: Learning location embeddings with triplet-loss networks" For image inputs, it's best for the learned function to have some convolutional layers.
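The anchor/similar/dissimilar setup above can be sketched as follows. This is a hedged toy example, not the Loc2Vec method: the embedding is a single linear map, the triplet is synthetic, and the gradient is a finite-difference approximation so the sketch needs nothing beyond numpy.

```python
import numpy as np

rng = np.random.default_rng(1)

d, k = 16, 4                        # input dim, embedding dim (illustrative)
W = rng.normal(scale=0.1, size=(d, k))

anchor   = rng.normal(size=d)
positive = anchor + 0.05 * rng.normal(size=d)   # a similar input
negative = rng.normal(size=d)                   # a dissimilar input

def embed(x, W):
    """The learned function: here just a linear map (a real model would be deeper)."""
    return x @ W

def triplet_loss(W, a, p, n, margin=1.0):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    d_pos = np.sum((embed(a, W) - embed(p, W)) ** 2)
    d_neg = np.sum((embed(a, W) - embed(n, W)) ** 2)
    return max(0.0, d_pos - d_neg + margin)

lr, eps = 0.05, 1e-4
losses = [triplet_loss(W, anchor, positive, negative)]
for _ in range(200):
    if losses[-1] == 0.0:
        break                       # margin already satisfied
    # finite-difference gradient, one parameter at a time
    grad = np.zeros_like(W)
    for i in range(d):
        for j in range(k):
            W_pert = W.copy()
            W_pert[i, j] += eps
            grad[i, j] = (triplet_loss(W_pert, anchor, positive, negative)
                          - losses[-1]) / eps
    W -= lr * grad
    losses.append(triplet_loss(W, anchor, positive, negative))
```

Once the loss falls below the margin, the anchor's embedding is strictly closer to the positive than to the negative, which is exactly the "similar hashes for similar inputs" property described above.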
From my personal experience, autoencoders are amazing for dense input (images, audio, etc.), more specifically when the input feature space is not large. However, in many real-world problems such as recommendation and ranking, the feature space is generally very sparse, e.g. clicks or purchases over a catalogue of, say, 100M items. In such cases, scaling neural models, especially autoencoders, can be challenging.
Also, you're counting up to k hash functions, but the element before the three dots is already element k. I think it should be 1, 2, ..., k instead of 1, k, ..., k.
Thanks for your feedback. I shall update the post accordingly.