But they are designed for different use cases. DR is mostly concerned with preserving distances, such that
c * D(f(x), f(y)) < d(x, y) < C * D(f(x), f(y))
Here, f(x) is the mapping of x to a lower-dimensional space (or any other object, really). d(·,·) and D(·,·) are the original and new distance measures, respectively. The smaller the ratio C/c, the better. Clearly, the computational overhead of applying f and D is also important.
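A quick way to see this concretely is to use a Gaussian random projection as f and measure the empirical distortion; here is a rough sketch (all dimensions and counts below are arbitrary illustration choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 200, 1000, 50
    X = rng.normal(size=(n, d))                # n points in the original space
    P = rng.normal(size=(d, k)) / np.sqrt(k)   # f(x) = x @ P, a random projection
    Y = X @ P

    # Empirical c and C: the range of d(x, y) / D(f(x), f(y)) over all pairs.
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            ratios.append(np.linalg.norm(X[i] - X[j]) / np.linalg.norm(Y[i] - Y[j]))
    print(min(ratios), max(ratios))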
LSH, on the other hand, is mostly concerned with probabilistically filtering and searching for nearby points efficiently. Here are some (old) class notes from a course I gave in 2013: https://edoliberty.github.io//classnotes/datamining2013/pdfs...
While LSH is mathematically beautiful, it is no longer considered the gold standard for ANN. There are many open source algorithms that outperform LSH on most datasets.
(I founded a vector search company: https://www.pinecone.io)
By the way, I enjoyed listening to the interview with you on the Practical AI podcast!
However, I'd like to add that LSH has recently been applied to other types of problems with very good results.
For example, LSH-based methods are at the core of the best sampling methods for kernel density estimation (KDE) and other problems [1,2]. This has recently led to fast approximations for the top eigenvalues of kernel matrices, SGD sampling, and probably more. LSH can also be used to construct sketches for kernel sums, which have applications to bioinformatics and can lead to fast neural network inference. So, LSH is alive and well - just not necessarily for ANN.
(Full disclosure: I am an author of two of the papers referenced above.)
ScaNN is currently SOTA.
Also, most search use cases have requirements beyond just "get_k_nearest" (scoring-level filtering, business logic built into scoring unrelated to similarity, disparate query patterns that don't fit nicely into a single similarity space), so unless you only need to retrieve the k most-similar documents, it might not be a good solution.
It is now supported in the Elasticsearch kNN index (they use HNSWLIB, but you can call it a descendant of the original LSH, in a way).
Check out ANN-Benchmarks for a comparison of LSH performance to other state-of-the-art methods like proximity graphs (HNSWLIB) and quantization (ScaNN).
As an introduction, LSH (with MinHash) is also described in detail in the book "Mining of Massive Datasets", ch. 3, "Finding Similar Items". Highly recommended.
If you want to play with LSH, the Python "annoy" library is the best place to start.
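A minimal Annoy session looks something like this (parameters are arbitrary; note that Annoy actually uses random-projection trees rather than classical LSH, but it's the same flavor of technique):

    from annoy import AnnoyIndex
    import random

    dim = 40
    index = AnnoyIndex(dim, "angular")   # cosine-style distance
    for i in range(1000):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
    index.build(10)                      # 10 trees; more trees -> better recall
    print(index.get_nns_by_item(0, 5))   # 5 approximate nearest neighbors of item 0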
http://infolab.stanford.edu/~ullman/mmds/book.pdf is the book; http://infolab.stanford.edu/~ullman/mmds/bookL.pdf is the hyperlinked book; and http://www.mmds.org/ seems to be the official homepage.
That having been said, this was a wonderful read, and beautifully presented.
Conversely, any dimensionality-reduction technique with a fixed number of finite-range output dimensions can be used for locality-sensitive hashing, as long as a couple of extra constraints hold: (1) nearby inputs (given an appropriate distance metric) produce nearby outputs (given a second appropriate distance metric), and (2) far-apart inputs are likely to produce far-apart outputs.
I would say that it's both a dimensionality reduction and an approximate nearest neighbor search technique.
One example of a scenario where I think the ANN perspective is more useful than the DR one is near-duplicate detection. There, the buckets tend not to be treated as dimensions so much as filtered lists of items to consider for more detailed comparison. Assuming you're trying to do a high-recall search for everything within some distance threshold, you may just look at the union of all the matching buckets.
On the other hand, if you're doing a k-nearest-neighbors search (rather than a dragnet search for all items within some threshold), you might treat them more like dimensions, since the items that share more buckets with the search vector are likely to also be the most similar. So perhaps you start by ranking by Hamming distance in the LSH space, and then refine using comparisons in the higher-dimensional space.
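A rough sketch of that two-stage idea, assuming you already have binary LSH codes alongside the original vectors (the names and shortlist size are made up for illustration):

    import numpy as np

    def knn_via_lsh(query_code, query_vec, codes, vectors, k, shortlist=100):
        # Stage 1: rank everything by Hamming distance in the cheap LSH code space.
        hamming = np.count_nonzero(codes != query_code, axis=1)
        candidates = np.argsort(hamming)[:shortlist]
        # Stage 2: refine the shortlist with exact distances in the original space.
        exact = np.linalg.norm(vectors[candidates] - query_vec, axis=1)
        return candidates[np.argsort(exact)[:k]]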
They're all applicable, but perhaps not useful in your particular problem domain. It all depends on how useful the distance metric used by the LSH is in your problem domain.
Edit: Googling finds a paper from 2015 (https://eprint.iacr.org/2014/744.pdf). Can't tell how relevant this is or how far off the mark I am, but either way this ain't novel. Back to humbly minding my own business =)
They discuss distance metrics, which may be sensitive to inductive bias, which may perhaps be substantiated by external knowledge about the specific instance of the problem. Then things are not amusing anymore.
If the proposal is to hash all of the possible lattice points using irreversible computation on Earth in a reasonable amount of time, for a cryptographically useful lattice, the waste heat would literally boil the oceans. That limitation doesn't hold for quantum computers, but results need to be converted into classical bits for storage. Presumably rainbow-table-like operations could reduce the storage space and waste heat needed to store these bits. Maybe quantum rainbow tables are a fruitful area of research.
If we're not pre-hashing all possible lattice points, then either we hash the public value and have some heuristic that allows us to hash only a small subset of lattice points for this particular value (... but then why use a hash? ... just compare points in your heuristic sequence), or we have some inverse hash operation where we hash the public value, then explore the "un-hash space" of lattice points corresponding to the LSH of the point in question.
This last general line of inquiry (trying to find search-space reducing heuristics) is probably where most of the research effort is going, though perhaps not specifically using LSH. In order to be useful for the lattice problem, the distance metrics for the input and output spaces of the LSH need to have specific relations to the lattice distance metric.
So... yea... maybe it helps to re-cast the problem in terms of LSH, but not in any obvious way. Also, any search-space reducing heuristic could probably be re-interpreted as a LSH if you squint at it hard enough.
Also, I wonder if hexagonal lattices would be a better choice for the 2D use case. Having multiple randomly sized, shifted, and rotated hexagonal lattices would come very close to: https://en.wikipedia.org/wiki/Grid_cell
The best I can find is https://medium.com/@glaslos/locality-sensitive-fuzzy-hashing... in which SSDEEP apparently uses a rolling hash to implement locality sensitive fuzzy hashing
ssdeep would be a fuzzy hashing program https://ssdeep-project.github.io/ssdeep/index.html
Whether you use locality-sensitive hashing or something else is a separate question. In my experience, graph-based ANN indices (HNSW is a nice one) have tended to do a bit better, but LSH is competitive, so I wouldn't strongly argue against a design decision choosing it. One downside I've seen with some ANN index libraries is that a lot of them don't support incremental updates/deletions and force you to build the index as a batch job. That's fine in some use cases but breaks others. The LSH-based approach Spotify uses doesn't support incremental updates, but that's not because LSH can't support it; Annoy just wasn't designed for it.
Treating your binary vector as a set allows you to use min-hashing as your LSH scheme (min-hashing applies a random permutation to the set and keeps the smallest element). This simple trick makes LSH with min-hashing quite a powerful tool for the binary vectors that are used extensively in recommender systems and other domains.
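A bare-bones sketch of the trick, treating the binary vector as the set of indices where it equals 1 (the permutation count and seed are arbitrary; real implementations use cheap hash functions instead of explicit permutations):

    import numpy as np

    def minhash_signature(binary_vec, num_perms=64, seed=0):
        members = np.flatnonzero(binary_vec)         # the "set" view of the vector
        rng = np.random.default_rng(seed)
        sig = np.empty(num_perms, dtype=np.int64)
        for j in range(num_perms):
            perm = rng.permutation(len(binary_vec))  # one random permutation per hash
            sig[j] = perm[members].min()             # first member under this permutation
        return sig

    def estimate_jaccard(sig_a, sig_b):
        # The fraction of agreeing entries estimates the Jaccard similarity.
        return float(np.mean(sig_a == sig_b))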
I've used LSH + MinHash for image search (and subsequently for audio fingerprinting). If interested, I've blogged about it here:
 - https://emysound.com/blog/open-source/2020/06/12/how-audio-f...
mash is probably the best-known implementation (https://doi.org/10.1186/s13059-016-0997-x). It allows you to do highly efficient pairwise comparisons of genome sequences, where it provides a good approximation of their actual pairwise sequence divergence.
Is MinHash used industrially outside of genomics? In theory, it'd be useful on any kind of text corpus. The design of the hash functions that the sketch is built from is non-trivial: the hashes typically come from k-mers of a fixed length and various fast hash functions over them.
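Something like this toy bottom-s sketch over k-mers, in the spirit of mash (k, s, and the hash choice are illustrative; real tools also canonicalize k-mers against the reverse complement):

    import hashlib

    def kmer_sketch(seq, k=21, s=1000):
        # Hash every k-mer and keep the s smallest values ("bottom-s" sketch).
        hashes = {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
                  for i in range(len(seq) - k + 1)}
        return set(sorted(hashes)[:s])

    def jaccard_estimate(sketch_a, sketch_b, s=1000):
        # Take the bottom s of the union, then count how many land in both sketches.
        bottom = set(sorted(sketch_a | sketch_b)[:s])
        return len(bottom & sketch_a & sketch_b) / len(bottom)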
The problem was that the base layer model (tree ensembles) did very well in terms of training error but was much too slow to evaluate in production. With LSH, you can start with a bunch of points from the base model, then train the compressed model against the reference samples.
It's more appropriately used to answer ball queries (i.e., find all points within distance r of the query) than nearest-neighbor queries.
I've used it in multiple contexts:
- Low dimension: computing all pair-wise interactions for nearby particles in physics simulations. You hash particle positions to cells, and you check interactions between nearby cells (a sketch follows this list).
- Image similarity search for loop-closure candidates in SLAM algorithms (depending on the number of images you have, brute-forcing in Hamming space is often sufficient).
- Visual vocabulary: each key-point has a descriptor value; when you use OpenCV with SIFT/SURF for matching, you can use FLANN with LSH.
- K-means-variant clustering with large K: when K is large you need an index to find the closest point efficiently (used to compute the visual vocabulary above).
- Neural networks: for example, it can be very tempting to put some kind of LSH index on the GPU as a standard neural-network layer, but if it's too small you are better off with standard brute force, and if it's too big you run out of memory, so you are looking for the sweet spot. But remember that if you have an index with 10M elements and you sparsely update only 10 elements per iteration, you'll need at least 1M iterations before each element has moved, so your algorithm will take a long time to converge.
- P2P indices: tables of hashes that can be used to recover similar data when querying them.
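For the first item, the low-dimensional grid-hash version is simple enough to sketch (assuming 2D positions in a NumPy array and an interaction cutoff r; everything here is illustrative):

    import numpy as np
    from collections import defaultdict

    def neighbor_pairs(positions, r):
        # positions: (n, 2) float array. Hash each particle to a grid cell of size r.
        cells = defaultdict(list)
        for i, p in enumerate(positions):
            cells[tuple((p // r).astype(int))].append(i)
        pairs = []
        for (cx, cy), members in cells.items():
            # Check each cell against itself and its 8 neighboring cells.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy), []):
                        for i in members:
                            if i < j and np.linalg.norm(positions[i] - positions[j]) <= r:
                                pairs.append((i, j))
        return pairs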
The version of LSH described in the article is not often used, because with real data you can often take advantage of some hierarchical representation inside the data.
It's one of many space-partitioning techniques with multiple overlapping partitions.
Depending on the constraints on your data, it can be hard to make it shine. It shines best when there is a large quantity of data, but it consumes a lot of memory, so it may be useful to keep the index on disk, and ideally you'd like to be able to update the index incrementally. The various existing libraries work well for the specific situations they are designed to handle, but quite often you are still better off rolling your own space-partitioning technique for your specific constraints.
Also, quite often, instead of doing an approximate search and having to fiddle with tuning parameters, it's better to use an exact search: if you can project into Hamming space, there are fast, exact algorithms (although memory-intensive) based on the pigeonhole principle.
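The pigeonhole trick is roughly this (a sketch of multi-index hashing over 64-bit integer codes: to find everything within Hamming distance r, split the codes into r + 1 chunks, because any true match must agree with the query exactly on at least one chunk):

    from collections import defaultdict

    def build_tables(codes, num_chunks, bits=64):
        # One exact-match table per chunk. Assumes bits is divisible by num_chunks.
        w = bits // num_chunks
        tables = [defaultdict(list) for _ in range(num_chunks)]
        for idx, c in enumerate(codes):
            for t in range(num_chunks):
                tables[t][(c >> (t * w)) & ((1 << w) - 1)].append(idx)
        return tables

    def hamming_query(q, codes, tables, r, bits=64):
        num_chunks = len(tables)   # should be r + 1 for an exact result
        w = bits // num_chunks
        candidates = set()
        for t in range(num_chunks):
            candidates.update(tables[t].get((q >> (t * w)) & ((1 << w) - 1), []))
        # Verify candidates against the true Hamming distance.
        return [i for i in candidates if bin(q ^ codes[i]).count("1") <= r]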
My recollection is a bit fuzzy, but I remember specifically using Min-Hash together with random projection.
Typically, you'll have loads of script kiddies hosting slightly modified copies of the Apple ID landing page on a compromised web server... Alternatively, you'd have loads of these pages built out of "exploit frameworks"/"kits", so we'd want to categorize and group them to identify the prevalence of a given framework/author.
Had the nice side-effect of prioritizing, speeding up, and automating takedowns for our SOC folks.
Edit: I see now that user inciampati already addressed this.
If that is not a valuable feature for you, then I believe you could use any clustering hash function you want, including a deterministic one.
So, randomization is not the only way to prevent bias: you can also avoid it entirely by cutting to the chase and using a circular falloff. But what I'm asking is how one would implement such a thing.
Let's say you have millions of phrases and you want to find the ones most similar to a given phrase. One way of doing this would be to create an embedding for each phrase and use a metric like cosine similarity on the embeddings to determine closeness. The problem is, you don't want to have to compare the embedding of your phrase to every other phrase in your collection. In this case, LSH can help you narrow it down.
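A minimal version of that narrowing-down step, using random-hyperplane hashing on the embeddings (the dimension, bit count, and names are arbitrary):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, num_bits = 384, 16                 # hypothetical embedding size / hash width
    planes = rng.normal(size=(num_bits, dim))
    index = defaultdict(list)               # bucket -> phrase ids

    def bucket(embedding):
        # One bit per random hyperplane: which side the embedding falls on.
        # Embeddings with high cosine similarity tend to get the same bits.
        return tuple((planes @ embedding > 0).astype(int))

    def add(phrase_id, embedding):
        index[bucket(embedding)].append(phrase_id)

    def candidates(query_embedding):
        # Only these need an exact cosine-similarity comparison.
        return index[bucket(query_embedding)]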
If you hash the data using something like SHA-256 and even one byte is different, it'll produce a radically different hash (which is by design). With LSH you get a measurement that lets you say "corpus A has a 90% overlap with corpus B", or, in the single-bit-flip case, a very high correlation.
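You can see the avalanche effect directly:

    import hashlib

    print(hashlib.sha256(b"the quick brown fox").hexdigest())
    print(hashlib.sha256(b"the quick brown fix").hexdigest())  # one byte changed, digest is unrelated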
To some extent both define a kind of fuzzy neighbourhood for each point, and both kind of induce a topology on the points. Both also seem to end up with some kind of weighted graph.
That said, the UMAP approach has some very deep mathematical foundations, so it could take you a while to work that one out for an LSH approach.
That said, you could also just use a few random projections as an initial step and run UMAP from there. The random projections should also give you a lower bound on the distance, which can be useful in finding the k nearest neighbours.