Similarity search and deduplication at scale (dsalaj.com)
76 points by dsalaj on Nov 13, 2022 | 16 comments



I have been working on an entity matching solution for two years now, and I have decided to write down some of the lessons I picked up along the way. It turns out there are too many relevant details to cover in a single post, so I will cover the topic in multiple parts.

This first part is a high-level introduction, useful for project planning and architecture decisions that need to be made early in the development process. Any feedback is welcome, along with wishes for the follow-up parts if there is something specific you would like covered.


Thank you for a very helpful writeup!

Do you have any materials on word embedding strategies past Word2Vec? BERT and beyond?

I am currently working on a recommendation engine for a large library - the original idea being to find "similar" documents - and the funding comes from a plagiarism checking project.

I was slightly surprised by how deceptively simple the widely cited winnowing paper is: https://dl.acm.org/doi/10.1145/872757.872770. The key idea is simple mod reduction of hashed fingerprints.
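
For anyone curious what that looks like in practice, here is a from-memory sketch of both the "0 mod p" reduction mentioned above and the windowed-minimum selection that the paper layers on top of it; k, w, p and the hash function are illustrative choices, not the paper's exact parameters.

    # From-memory sketch of document fingerprinting per the winnowing paper;
    # k, w, p and the hash function are illustrative choices.
    import hashlib

    def kgram_hashes(text, k=5):
        # Hash every k-character substring of the (normalized) text.
        return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16)
                for i in range(len(text) - k + 1)]

    def mod_p_fingerprints(hashes, p=4):
        # Simpler baseline selection: keep only hashes that are 0 mod p.
        return {h for h in hashes if h % p == 0}

    def winnow(hashes, w=4):
        # Winnowing: keep the minimum hash in each window of w consecutive
        # hashes (rightmost occurrence on ties, as in the paper).
        return {min(reversed(hashes[i:i + w])) for i in range(len(hashes) - w + 1)}

    a = winnow(kgram_hashes("the quick brown fox jumps over the lazy dog"))
    b = winnow(kgram_hashes("a quick brown fox jumped over a lazy dog"))
    print(len(a & b) / len(a | b))  # Jaccard overlap of the fingerprints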

My project's goal is to find phrase level similarities to assist researchers.

It seems k-grams, n-grams, tf-idf, and even Word2Vec are not going to cut it. A "smarter" context-aware embedding is in order. My foray into training BERT from scratch was not very successful - my corpora are not in English...

PS. As usual I spend most of the time on improving OCR quality and preprocessing corpora...


Achieving high matching accuracy really comes down to the details of your specific domain and dataset. Regarding the attempt to use large pre-trained language models to find semantically similar documents, which is what you are attempting now, maybe try Whisper or other multilingual models, and then fine-tune them on your dataset.

But a better bet might be looking into simpler embeddings and methods and attempting to directly improve them by including some domain knowledge in the method or the process. Again, it is hard to judge what might work better just by looking at the surface.

In case you really need to work with labeled datasets, set up a strong baseline, look into active-learning methods and set up the loop, do a few iterations, and try to predict whether it will scale to your target accuracy quickly enough.
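
To make the loop concrete, here is a bare-bones uncertainty-sampling sketch (scikit-learn based; the classifier, features, and batch size are placeholder assumptions, not recommendations):

    # Bare-bones uncertainty-sampling loop; classifier, features and batch size
    # are placeholder assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning_loop(X_lab, y_lab, X_pool, label_fn, rounds=5, batch=20):
        clf = None
        for _ in range(rounds):
            clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
            # Pick the pool items the model is least certain about.
            proba = clf.predict_proba(X_pool)[:, 1]
            pick = np.argsort(np.abs(proba - 0.5))[:batch]
            # label_fn stands in for your human annotators.
            new_y = np.array([label_fn(x) for x in X_pool[pick]])
            X_lab = np.vstack([X_lab, X_pool[pick]])
            y_lab = np.concatenate([y_lab, new_y])
            X_pool = np.delete(X_pool, pick, axis=0)
            # Track validation accuracy per round to extrapolate whether the
            # loop will reach the target accuracy within your labeling budget.
        return clf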


Thank you for this writeup. Having done some work on deduplication/matching systems, my experience (as with many things in data science) is that there are a lot of things to consider and there is no single best solution. Hopefully you are able to keep up with this series, because I think it will be very helpful to many people.


I'm surprised to see that ML-based semantic search is barely touched on in this article. There's a strong focus on entity matching, but an arguably more powerful way to conduct similarity search is to leverage embedding vectors from trained models.

A great upside to this approach is that it works for a variety of different types of unstructured data (images, video, molecular structures, geospatial data, etc), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you'll be able to do this at scale with very minimal effort.
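
Not affiliated, but as a rough sketch of the embedding idea using the sentence-transformers CLIP wrapper (the model name and image files are just examples; at scale the vectors would go into a vector database such as Milvus rather than an in-memory matrix):

    # Rough sketch of cross-modal similarity with CLIP via sentence-transformers;
    # the model name and image files are illustrative.
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    # Text and images land in the same embedding space.
    img_emb = model.encode([Image.open("cat.jpg"), Image.open("skyline.jpg")])
    txt_emb = model.encode(["a photo of a cat", "a city skyline at night"])

    # High cosine similarity on the diagonal means the text/image pairs match.
    print(util.cos_sim(txt_emb, img_emb))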


Shameless plug - for folks who don't want to take on the work of model selection, on-demand scaling of model serving, and scaling the vector database for search set size and query throughput, we built a service that hides all of this behind a simple API [1]. The example in [1] is for images, but here is a quick-start for text [2].

[1] https://www.nyckel.com/semantic-image-search

[2] https://www.nyckel.com/docs/text-search-quickstart


Maybe because of the ... "at scale" bit? ML approaches are relatively computationally expensive.


And opaque and hard to modify.


The latent spaces created by neural networks inherently de-dupe data.


I would like to know if any of these techniques could be used for identifying articles that are either copies of each other, or near-copies, or different articles on the same story.


The easiest and likely most effective method may be to compute vector embeddings with a sentence transformer model and find nearest neighbors among these vectors for all articles in the set. The distance between the nearest vectors gives you a degree of similarity between the articles; you'll need to pick thresholds on these distances to decide what counts as a near copy vs. a different article on the same story. There are efficient methods to find approximate nearest neighbors among a large set of these vectors, available as both OSS and SaaS. Faiss [1], ScaNN [2], and Pinecone [3] are some examples (a minimal sketch follows after the links).

This is one of the methods mentioned in the article. I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method, Q-grams don't scale as well and are not as robust, because they don't capture the semantics of the text.

[1] https://github.com/facebookresearch/faiss

[2] https://github.com/google-research/google-research/tree/master/scann

[3] https://www.pinecone.io
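
Here is the minimal sketch mentioned above (assumes the sentence-transformers and faiss-cpu packages; the model name and the 0.9 "near copy" threshold are illustrative, not tuned):

    # Minimal sketch: sentence-transformer embeddings + Faiss nearest neighbors.
    # Model name and thresholds are illustrative. IndexFlatIP is exact search;
    # for large sets you would switch to one of Faiss's approximate indexes.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    articles = [
        "Central bank raises interest rates by 50 basis points",
        "Interest rates hiked half a percentage point by the central bank",
        "Local team wins championship after penalty shootout",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(articles, normalize_embeddings=True).astype("float32")

    # With normalized vectors, inner product equals cosine similarity.
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    scores, ids = index.search(emb, 2)  # nearest neighbor besides itself
    for i, (s, j) in enumerate(zip(scores[:, 1], ids[:, 1])):
        label = "near copy" if s > 0.9 else "same story?" if s > 0.6 else "different"
        print(f"{i} ~ {j}: cos={s:.2f} -> {label}")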


If looking for exact or near-exact duplicates, a transformer seems like it's probably overkill. Maybe it's not bad if you already have one that you can use for inference in the database, but I suspect that something as simple as Fasttext would do the job. A transformer would probably be more useful if you want to catch things like replacing words with synonyms out of a thesaurus.
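
For illustration, a near-duplicate check with averaged fastText vectors might look something like this (gensim-based; the model size and toy corpus are my assumptions, and in practice you would likely use pre-trained vectors):

    # Toy sketch of near-duplicate scoring with averaged fastText vectors (gensim).
    import numpy as np
    from gensim.models import FastText

    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox leaped over the lazy dog",
        "an entirely different piece of text",
    ]
    tokens = [d.split() for d in docs]
    model = FastText(sentences=tokens, vector_size=64, window=5, min_count=1, epochs=20)

    def doc_vec(ts):
        # Average word vectors; fastText's subword units cover unseen tokens too.
        return np.mean([model.wv[t] for t in ts], axis=0)

    vecs = np.array([doc_vec(t) for t in tokens])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    print(np.round(vecs @ vecs.T, 2))  # near-duplicates score close to 1.0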


Wouldn't that find articles that are semantically similar, rather than structurally similar (which I interpreted GP as wanting)?

In the structural-comparison case, I imagine you might have better luck with just doing cosine similarity across the term frequency vector or some such, possibly doing random projection first to reduce dimensionality.

Or really, an LSH would do the trick.
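
A quick sketch of the term-frequency route (scikit-learn; the example texts and the choice of two projected dimensions are arbitrary):

    # Sketch of structural similarity: raw term-frequency vectors, optional
    # random projection for dimensionality reduction, then cosine similarity.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.random_projection import GaussianRandomProjection

    docs = [
        "Shares of Acme Corp fell 10% on Tuesday after a weak earnings report.",
        "Acme Corp shares dropped 10 percent Tuesday following weak earnings.",
        "The city council approved a new bike lane on Main Street.",
    ]

    tf = CountVectorizer().fit_transform(docs)  # term-frequency vectors
    print(cosine_similarity(tf))                # structural, not semantic, similarity

    # For large vocabularies, project down first (cheap, roughly distance-preserving).
    proj = GaussianRandomProjection(n_components=2).fit_transform(tf.toarray())
    print(cosine_similarity(proj))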


Yes, see these examples, which frame this as plagiarism detection but effectively do the same thing:

- https://dzone.com/articles/build-a-plagiarism-checker-using-...

- https://www.pinecone.io/learn/plagiarism-detection/


You should be able to use embeddings for this (sort by cosine similarity). E.g. OpenAI has an off-the-shelf offering: https://beta.openai.com/docs/guides/embeddings/what-are-embe...

We used something similar to build a “similar articles” feature & it gave us de-duplication essentially for free.
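
For anyone wondering what that looks like concretely, a provider-agnostic sketch (embed() here is a hypothetical wrapper around whatever embeddings endpoint you use, e.g. the OpenAI one linked above; the 0.97 de-dupe threshold is illustrative):

    # Provider-agnostic sketch; embed() is a hypothetical wrapper around your
    # embeddings API of choice, returning one vector per input text.
    import numpy as np

    def embed(texts):
        raise NotImplementedError  # call your embeddings provider here

    def most_similar(query, candidates, top_k=5, dedupe_threshold=0.97):
        vecs = np.array(embed([query] + candidates), dtype="float32")
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = vecs[1:] @ vecs[0]          # cosine similarity to the query
        order = np.argsort(-sims)[:top_k]
        # Anything above the threshold is effectively a duplicate "for free".
        return [(candidates[i], float(sims[i]), sims[i] >= dedupe_threshold)
                for i in order]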


Exact and near duplicate articles should have similar or identical word frequency distributions. Maybe that can be used as a blocking criterion somehow. Although it might not be any faster to compare word frequency distributions than to compare dense low-dimensional embeddings.
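
One way the blocking could look (the "five most frequent words" key is an arbitrary illustration):

    # Sketch of blocking on word-frequency distributions: only documents sharing
    # a blocking key (here, their five most frequent words) are compared in full.
    from collections import Counter, defaultdict

    def blocking_key(text, n=5):
        counts = Counter(text.lower().split())
        return frozenset(w for w, _ in counts.most_common(n))

    def candidate_pairs(docs):
        blocks = defaultdict(list)
        for i, doc in enumerate(docs):
            blocks[blocking_key(doc)].append(i)
        # Only pairs within the same block reach the expensive comparison step.
        return [(i, j) for ids in blocks.values()
                for a, i in enumerate(ids) for j in ids[a + 1:]]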



