Hacker News new | past | comments | ask | show | jobs | submit login
Introduction to Locality-Sensitive Hashing (tylerneylon.com)
255 points by polm23 7 months ago | hide | past | favorite | 63 comments

Hi All, it's true that LSH (locality sensitive hashing) and DR (dimension reduction) are very related. They both derive their correctness from the same strong concentration phenomenon.

But, they are used/designed for different use cases. DR is mostly concerned with preserving distances such that

c * D( f(x), f(y) ) < d(x,y) < C*D( f(x), f(y) )

Here, f(x) is the mapping of x to a lower dimensional space (or any other object, really). d(-,-) and D(-,-) are the original distance measure and new distance measure respectively. The smaller the ratio C/c the better. Clearly, the computational overhead of applying f and D are also important.

LSH, on the other hand, is mostly concerned with probabilistically filtering and searching for nearby points efficiently. Here are some (old) class notes from a course I gave 2013https://edoliberty.github.io//classnotes/datamining2013/pdfs...

While LSH is mathematically beautiful, it is no longer considered the gold standard for ANN. There are many open source algorithms that outperform LSH on most datasets.

(I founded a vector search company: https://www.pinecone.io)

There are many great algorithms that outperform LSH on ANN vector search, but vectorization in itself a high-compute task, which ads latency and require GPUs, but this is not reported in benchmarks. For text LSH hash can be efficiently created directly from tokens. Of course, this does not work for semantic similarity, but can be used for lexical similarity search.

By the way. I enjoyed listening the interview with you on Practical AI podcast!

This thread contains many excellent points, and it's true that LSH is no longer SOTA for ANN problems. Generally, LSH indices are much faster to construct but slower to query than other ANN methods (graph-based, cluster-based, etc). In many applications, construction time matters less than query time, so LSH is unattractive there.

However, I'd like to add that LSH has recently been applied to other types of problems with very good results.

For example, LSH-based methods are at the core of the best sampling methods for kernel density estimation (KDE) and other problems [1,2]. This has recently led to fast approximations for the top eigenvalues of kernel matrices [3], SGD sampling [4], and probably more. LSH can also be used to construct sketches for kernel sums [5], which have applications to bioinformatics and can lead to fast neural network inference [6]. So, LSH is alive and well - just not necessarily for ANN.

(Full disclaimer: I am an author of [5] and [6])

[1]: https://arxiv.org/abs/1808.10530 [2]: http://proceedings.mlr.press/v97/siminelakis19a [3]: https://arxiv.org/abs/2102.08341 [4]: https://arxiv.org/abs/1910.14162 [5]: https://arxiv.org/abs/1912.02283 [6]: https://arxiv.org/abs/2106.11426

What are the gold standard algorithms now?


My 2c, the choice of algorithm is quite complex... It is data dependent and also depends on your read/write ratio, latency and accuracy needs, memory/cpu, and many other factors. If I were you I'd look for a versatile solution more than "the best" algorithm cos that choice will likely change... just sayin'...

+1 to sibling comment about not choosing the SOTA just cuz SOTA (if you're in the market for an ANN algo). Almost all the ones there are pretty darn fast these days and many (e.g. hnsw) are implemented out of the box by libraries like nmslib.

Also, most search use cases have requirements besides just the "get_k_nearest" functionality too (scoring-level filtering, business logic built into scoring unrelated to similarity, disparate query patterns that don't fit nicely into a single similarity space), so unless you only need to retrieve the K most-similar documents it might not be a good solution.

great post and my favorite CS topic, LSH is particularly relevant to machine learning because of its use in indexing of embeddings, which are now omnipresent in ML (from word2vec to transformers to image and graph embeddings etc, etc.)

it is now supported in ElasticSearch KNN index (they use HNSWLIB but you can call it a descendant of original LSH in a way)

check out ANN benchmarks [0] for comparison of LSH performance to other state of the art methods like proximity graphs/HNSWLIB [1] and quantization/SCANN [2]

As an introduction LSH (with MinHash) is also described in detail in the book "Mining Of Massive Datasets", ch.3, "Finding Similar items", highly recommended [3]

if you want to play with LSH, python "annoy" library is the best place to start [4]

[0] https://github.com/erikbern/ann-benchmarks

[1] https://github.com/google-research/google-research/tree/mast...

[2] https://github.com/nmslib/hnswlib

[3] http://infolab.stanford.edu/~ullman/mmds

[4] https://github.com/spotify/annoy

http://infolab.stanford.edu/~ullman/mmds is 401 for me - presumably because directory indexing is turned off.

http://infolab.stanford.edu/~ullman/mmds/book.pdf is the book; http://infolab.stanford.edu/~ullman/mmds/bookL.pdf is the hyperlinked book; and http://www.mmds.org/ seems to be the official homepage.

It seems like this approach could be adapted as a dimensionality reduction technique: select a set of k random hyperplanes denoted p_i, then map each point to a new vector r such that r_i is the distance to the hyperplane p_i. My gut tells me it may be more efficient to reuse existing nearest neighbor approaches in that embedding space than to bucket them and compute the partial intersection of 20 sets.

That having been said, this was a wonderful read, and beautifully presented.

Yes, given the dimensionality of the input is at least the dimensionality of the hash value, location-sensitive hashing is always a dimensionality reduction operation.

Conversely, any dimensionality-reducing technique with a fixed number of finite range output dimensions can be used for location-sensitive hashing as long as a couple of extra constraints hold: (1) nearby inputs (given an appropriate distance metric) will produce nearby outputs (given a second appropriate distance metric), and (2) far-apart inputs are likely to produce far-apart outputs.

That seems to imply that LSHs are a subset of DRs. Are there LSH techniques that are not applicable to DR?

Like so many of these algorithms, there are a few perspectives, and the most useful way to contextualize it depends on the problem you're trying to solve.

I would say that it's both a dimensionality reduction and an approximate nearest neighbor search technique.

One example of a scenario where I think that the ANN perspective is more useful than the DR one is near duplicate detection. There, the buckets tend not to be treated as dimensions so much as filtered lists of items to consider for more detailed comparison. Assuming you're trying to do a high recall search for everything within one distance, you may just look at the union of all the matching buckets.

On the other hand, if you're doing a k-nearest-neighbors search (rather than a dragnet search for all items within some threshold), you might treat them more like dimensions, since the items that share more buckets with the search vector are likely to also be the most similar. So perhaps you start by ranking by Hamming distance in the LSH space, and then refine using comparisons in the higher-dimensional space.

> Are there LSH techniques that are not applicable to DR?

They're all applicable, but perhaps not useful in your particular problem domain. It all depends on how useful the distance metric used by the LSH is in your problem domain.

Aaron Levin did a very interesting talk about this at !!Con 2017: https://www.youtube.com/watch?v=kKRvEJrvvso

A little off topic but after enjoying the piece (nice tufte style layout, kudos) I was digging around the other posts. Disappointed that I couldn't add an RSS feed for the future!

Few years ago I made a PoC with LSH to find similar functions within keyword-driven test framework to reduce keywords duplicates. It did find duplicates with fairy high accuracy although I did not get any comments from real live usage.

Crackpot intuition: Knowing very little about either LSH or Lattices, could something like this be used to speed up the Shortest Vector Problem and attack lattice-based post-quantum cryptography?

Edit: Googling finds a paper from 2015 (https://eprint.iacr.org/2014/744.pdf). Can't tell how relevant this is or how far off the mark I am, but either way this ain't novel. Back to humbly minding my own business =)

Replying in kind: if it optimizes memory layout then it speeds up computation. However if you reduce the hardness from 5 bn years to 1bn years it's important but amusing.

They discuss distance metrics, which may be sensitive to inducive bias, which may perhaps be substantiated by external knowledge about the specific instance of the problem. Then things are not amusing anymore.

A few observations: we don't yet know how to perform long-term qbit storage, so all multi-day storage at present is classic bits.

If the proposal is to hash all of the possible lattice points using irreversable computation on Earth in a reasonable amount of time, for a cryptographically useful lattice, the waste heat would literally boil the oceans. That's limitation doesn't hold for quantum computers, but results need to be converted into classic bits for storage. Presumably rainbow-table-like operations could reduce the storage space and waste heat necessary in storing these bits. Maybe quantum rainbow tables are a fruitful area of research.

If we're not pre-hashing all possible lattice points, then either we has the public value and we have some heuristic that allows us to hash only some small subset of lattice points for this particular value (... but then why use a hash? ... just compare points in your heuristic sequence) or we have some inverse hash operation where we hash the public value, then explore the "un-hash space" of lattice points corresponding to the LSH for the point in question.

This last general line of inquiry (trying to find search-space reducing heuristics) is probably where most of the research effort is going, though perhaps not specifically using LSH. In order to be useful for the lattice problem, the distance metrics for the input and output spaces of the LSH need to have specific relations to the lattice distance metric.

So... yea... maybe it helps to re-cast the problem in terms of LSH, but not in any obvious way. Also, any search-space reducing heuristic could probably be re-interpreted as a LSH if you squint at it hard enough.

Any relation to: https://en.wikipedia.org/wiki/Rolling_hash ?

Also I wonder if hexagonal lattices were a better choice for the 2D use case. Having multiple randomly sized, shifted and rotated hexagonal lattices would come very close to: https://en.wikipedia.org/wiki/Grid_cell

I also wish to know! Specially because https://en.wikipedia.org/wiki/Locality-sensitive_hashing says "See also: Rolling hash" (and a bunch of other things like bloom filters and wavelet compression)

The best I can find is https://medium.com/@glaslos/locality-sensitive-fuzzy-hashing... in which SSDEEP apparently uses a rolling hash to implement locality sensitive fuzzy hashing

ssdeep would be a fuzzy hashing program https://ssdeep-project.github.io/ssdeep/index.html

I remember learning about locality sensitive hashing in school a decade ago, but since then I’ve never seen it used in an ML system in industry. Does anyone actually use this technique at scale? If you have, I’d love to hear about your application and experience with it.

The recommender system mention is very valuable use case that I've seen at some of the largest ml using companies in the world. If you have a recommendation system on millions or much higher pool of content you will need an ANN index. I would expect google image search is using an ANN index. Facebook definitely uses ANN indices. Spotify is where Annoy comes from. They're pretty key at tiktok.

Whether you use locality sensitive hashing or something else is a separate question. Personal experience is graph based ANN indices (hnsw is one nice one) have tended to a bit better, but LSH is competitive so I wouldn't strongly be against a design decision choosing it. One downside I've seen with some ann index libraries is a lot of them don't support incremental updates/deletions and force you to build the index as a batch job. That's fine in some use cases, but breaks others. LSH based approach spotify uses doesn't support incremental updates, but that's not because LSH can't support it just Annoy wasn't designed for it.

It's widely used in recommenders based on embeddings. See:



It was used extensively in classic computer vision descriptor matching including most image lookup e.g. google image search today. There are many reasons it not much used in recent computer vision. The first is that it does not work when the descriptors themselves are explicitly designed to make visually similar features descriptively distinct, and the second is that it generally works poorly for binary vectors which dominate. With regards to ml, its been tried, but the datasets are either too small or not available to researchers, and the result would be far too slow compared to the current approaches. Its actually one of the areas where deep learning hasn't reached parity to my current knowledge. It would be a great project, the dataset issue aside.

For binary vectors you can choose a different distance metric (not geometric one, i.e. Jaccard) that can be used to effectively hash similar data points into similar buckets.

Treating your binary vector as a set allows you to use min-hashing as your LSH schema (min-hashing is just a random permutation of the given set). This simple trick makes LSH with min-hashing quite a powerful tool for binary vectors that are extensively used in recommenders systems and other domains.

I've used LSH + Min-Hash for image search (and subsequently for audio fingerprinting). If interested, I've blogged about it here [1].

[1] - https://emysound.com/blog/open-source/2020/06/12/how-audio-f...

Agree. Also cosine distance.

Locality sensitive hashing is extensively used in bioinformatics and genomics.

mash is probably the best-known implementation (https://doi.org/10.1186/s13059-016-0997-x). It allows you to do highly efficient pairwise comparisons of genome sequences, where it provides a good approximates their actual pairwise sequence divergence.

Is MinHash used industrially outside of genomics? In theory, it'd be useful on any kind of text corpus. The design of the hash functions that the sketch is built from is non-trivial The hashes typically come from k-mers of a fixed length and various fast hash functions over them.

I’ve used it in an HFT system. Basically as a form of latency-minimizing model compression for evaluation in production.

The problem was that the base layer model (tree ensembles) did very well in terms of training error but was much too slow to evaluate in production. With LSH, you can start with a bunch of points from the base model, then train the compressed model against the reference samples.

It was used in https://ai.googleblog.com/2020/01/reformer-efficient-transfo... for faster and more memory-efficient attention.

Academic example but I'm sure it's made it's way to this industry as well: in bioinformatics, this has proven tremendously useful for approximate string matching to do things like identify microbes in sequenced microbiome samples - cf https://genomebiology.biomedcentral.com/articles/10.1186/s13...

The general idea of quantizing points so that similar points fall into the same bucket or a near bucket is a powerful one.

It's more appropriately used to answer Ball queries (i.e. find all points within distance r of the query) than near neighbor queries.

I've used it in multiple context :

-Low dimension : Computing all pair-wise interaction for nearby particles in physics simulations : You hash particles positions to cells, and you check interactions between nearby cells.

-Image Similarity search for candidates to loop closure in SLAM algorithms (Depending on the number of images you have brute-forcing in hamming space is often sufficient)

-Visual vocabulary, each key-point has a descriptor value, when you use opencv with SIFT/SURF for matching you can use Flann with LSH.

-K-mean variant clustering with large K, when K is large you need an index to find the closer point efficiently (used to compute the visual vocabulary above)

-Neural networks : for example it can be very tempting to put some kind of LSH index on the GPU as a standard neural-network layer, but if it's too small you are better with standard brute-force, and if it's too big you run out of memory. So you are looking for the sweet-spot. But remembering that if you have an index with 10M elements and you update sparsely only 10 elements by iteration, you'll need at least 1M iterations before each element has moved. So your algorithm will take a long time to converge.

-P2P indices : Tables of hashes that can be used to recover similar data when querying them.

The version of LSH described in the article is not often used because in real data you can often take advantage of some hierarchical representation inside the data.

It's among one of the many space-partitioning technique with multiple overlapping partitions.

Depending on the constraints on your data, it's hard to make it shine. It shines best when there are large quantity of data, but it consumes a lot of memory so it may be useful to have the index on disk, and ideally you'd like to be able to update the index incrementally. The various existing libraries work well for the specific situations they are designed to handle, but quite often you are still better rolling your own space-partitioning technique for your specific constraints.

Also quite often instead of an approximate search and have to fiddle with some tuning parameters, it's better to use some exact search : If you can project into hamming spaces, there are fast and exact algorithms (although memory intensive) based on the pigeon-hole principle.

It's used in the cybersecurity world for fuzzy file comparison. See https://github.com/trendmicro/tlsh

A lifetime ago I used it on a document clustering system. It is not very good as a general similarity function, but excellent for quickly finding near duplicates in linear time. About half of the documents of set I was working on (news articles) were duplicates, so an early removal pass would speed up the actual clustering algorithm pass.

My recollection is a bit fuzzy, but I remember specifically using Min-Hash together with random projection.

We used it at work (anti-cybercrime / Phishing focused) company in conjunction with our crawler for clustering of phishing landing pages.

Typically, you'll have loads of script kiddies hosting slightly modified copies of the Apple ID landing page on a compromised web server... Alternatively you'd have loads of these pages build out of "exploit frameworks" / "kits" so we'd want to categorize and group to identify prevalence of a given framework / author.

Had the nice side-effect of prioritizing, speeding up, and automating takedowns for our SOC folks.

It is being used in the reformer architecture to relate the dot product attention layer. This enables much longer input sequences and reduces computational complexity from O(n^2) to O(nlog(n)), where n in the length of the sequence.


I've used it for a wide variety of search and recommendation problems, albeit not in isolation. In practice, I've found that it is a great way to reduce the size of a search space before handing the problem off to a more precise algorithm.

I used it in ecommerce to find duplicate product titles.

When I was in ByteDance, it was used extensively to deduplicate news articles.

Well written article. Does this get used at all bioinformatics, for maybe de novo sequence construction from short reads or something? Seems like it would have many applications in that field.

Edit: I see now that user inciampati already addressed this.

This is actually an amazing idea. I wonder if this would help with creating/authenticating real online identities from bots? It obviously can be spoofed but definitely a useful tool in the toolkit.

I can't see how LSH would be any faster for 2D/3D data than just sorting into regular X, Y buckets (like the first example). Surely this is only useful for high dimensional data?

yes, this is true. In 2D/3D, sorting into regular space partitions of exponentially decreasing size (e.g. KD-trees) is far more effective - roughly log(N) vs N^rho (rho < 1). In high dimensions, the simple partitioning strategy runs up against the "curse of dimensionality" which is where LSH starts to work better.

This is pretty awesome! But I'm not a fan of the RNG involved. Is there an approach where you use a circle or sphere with a fallof function instead of layering random mosaics/hash functions?

I was barely hanging on to a lot of the math notation, but the author argues the advantage of the randomization is that on average the performance will be constant across all datasets. That is, once you develop a sense for how the system performs, you shouldn’t be surprised by wild performance swings.

If that is not a valuable feature for you, the I believe you could use any clustering hash function you want, including a deterministic one.

The randomization is to prevent bias in the sampling, as they illustrate with the grid. They then overlay random "grid shapes" to roughly approximate a sphere. I can't prove it right now, but I'm betting that if you extended this approach to infinite hashes you would eventually get a perfect circular fallof.

So, the randomization is not the only way to prevent bias: you can also just not have it to begin with by cutting to the chase and using circular falloff to begin with. But what I'm asking is how would one implement such a thing.

What is the real world use case where you could use LSH?

It actually has a lot of overlap with deep learning. Here's an NLP example.

Lets say you have millions of phrases and you want to find the ones that are the most similar to a given one that you've chosen. One way of doing this would be to create embedding for each phrase and looking at some metric like cosine similarity of the embeddings to determine closeness of the phrases. The problem is, you don't want to have to compare the embedding of your phrase to every other phrase in your collection. In this case, LSH can help you narrow it down.

Malware detection uses it as a heuristic to peg specific “variants” to each other. Basically LSH is great for comparing corpuses of data that are slightly different.

If you hash the data using something like SHA256 and even one byte is different, it’ll produce a radically different hash (which is by design). With LSH you can have a measurement to say “corpus A has a 90% overlap with corpus B”, or in a single bit flip case, a very high correlation.

Image similarity.

This reminds me of the fuzzy simplical sets used by UMAP. Is there any relationship between the two? Could it be used to accelerate the algorithm?

Well, that was one hell of a rabbit hole.

To some extent both define a kind of fuzzy neighbourhood for each point, and both kind of induce a topology on the points. Both also seem to end up with some kind of weighted graph.

That said the UMAP approach has some very deep mathematical foundations, so it could take you a while to work that one out for a LSH approach.

That said you could also just use a few random projections as an initial step and use UMAP from there. The random projections should also give you a lower bound for the distance, which can be useful in finding the k nearest neighbours.

Does TeX equations are intended to be presented in a raw form? It's very hard to read without proper rendering.

It renders fine on my end. Try checking if your browser is blocking cdnjs.cloudflare.com for some reason.

Yep, chrome works fine!

Wow I just discovered LSH couple of hours ago, before opening HN... Just to see it being covered here

Getting a 404 'Site not Found' page.

Try to disable HTTPS Everywhere or Firefox' HTTPS Only Mode. Looks like this site is badly setup and can't handle HTTP-only requests properly.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact