Ask HN: Does Semantic Search Work?

PaulHoule · on Jan 22, 2023

I helped develop a "semantic search" engine for patent and non-patent literature about 10 years ago which was highly successful, enough that our demo made a big sale on the very first day.

This engine trained a neural-network autoencoder that crunches the word counts for thousands of words down to a moderate-dimensional vector, say n=50. This captures correlations between words, so similar documents are more consistently close in the embedding space than they are in the very high-dimensional word-count space.
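
To make the idea concrete, here is a toy sketch of that kind of dimensionality reduction. It is not the engine described above: it assumes a purely linear autoencoder, a six-word vocabulary, a two-dimensional embedding, and made-up counts, and the names `train_linear_autoencoder` and `embed` are hypothetical.

```python
import random

def embed(enc, x):
    """Project a word-frequency vector into the low-dimensional space."""
    return [sum(w * v for w, v in zip(row, x)) for row in enc]

def train_linear_autoencoder(docs, dim=2, epochs=500, lr=0.05, seed=0):
    """Fit encoder and decoder matrices by full-batch gradient descent.

    docs: list of equal-length word-frequency vectors.
    Returns (enc, losses); enc is a dim x n_words matrix and losses holds
    the summed squared reconstruction error measured at each epoch.
    """
    rng = random.Random(seed)
    n_words = len(docs[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_words)] for _ in range(dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_words)]
    losses = []
    for _ in range(epochs):
        g_enc = [[0.0] * n_words for _ in range(dim)]
        g_dec = [[0.0] * dim for _ in range(n_words)]
        loss = 0.0
        for x in docs:
            h = embed(enc, x)
            y = [sum(dec[j][k] * h[k] for k in range(dim)) for j in range(n_words)]
            err = [y[j] - x[j] for j in range(n_words)]
            loss += sum(e * e for e in err)
            # Gradients of half the squared error for this document.
            back = [sum(dec[j][k] * err[j] for j in range(n_words)) for k in range(dim)]
            for j in range(n_words):
                for k in range(dim):
                    g_dec[j][k] += err[j] * h[k]
                    g_enc[k][j] += back[k] * x[j]
        losses.append(loss)
        for j in range(n_words):
            for k in range(dim):
                dec[j][k] -= lr * g_dec[j][k]
                enc[k][j] -= lr * g_enc[k][j]
    return enc, losses

# Made-up word counts: the first three documents share one vocabulary,
# the last three mostly use another.
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [1, 2, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 1, 1, 2, 2],
]
# Normalize counts to frequencies so document length does not dominate.
docs = [[c / sum(row) for c in row] for row in counts]
enc, losses = train_linear_autoencoder(docs)
vectors = [embed(enc, x) for x in docs]
```

A real system would use a nonlinear network, thousands of input words, and something like the n=50 mentioned above; the sketch only shows the compress-then-reconstruct training loop.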

This kind of system does not improve short queries (<10 words) but is great for "more like this" queries centered on a document, and for taking a paragraph you wrote describing an invention and finding prior art.
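
A "more like this" query can be sketched as a nearest-neighbor lookup in the embedding space. The snippet below assumes cosine similarity as the distance measure (the comment above does not say which one was used) and uses invented two-dimensional vectors and document IDs purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(query_vec, corpus, k=3):
    """Return the k document IDs whose vectors are closest to query_vec."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings; a real system would produce these with its model.
corpus = {
    "patent_a": [0.9, 0.1],
    "patent_b": [0.8, 0.3],
    "paper_c": [0.1, 0.95],
}
# Embedding of the paragraph describing the invention (also made up).
draft_vec = [1.0, 0.1]
nearest = more_like_this(draft_vec, corpus, k=2)
```

The query here is a whole document's vector rather than a handful of keywords, which is why this style of search suits prior-art lookups better than short queries.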

We used the TREC evaluation methodology, public data, our proprietary data, and the opinions of users to conclude our product was much better than a simple baseline search engine and our competitors.
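
For anyone unfamiliar with TREC-style evaluation, the core of it is scoring each system's ranked results against human relevance judgments for a fixed set of queries. The sketch below assumes mean average precision as the metric (one of several commonly reported in TREC work; the comment above does not say which were used), and the query IDs, document IDs, and judgments are invented.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(run, qrels):
    """Mean of average precision over every judged query."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)

# Hypothetical relevance judgments: query ID -> set of relevant document IDs.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical ranked results from two systems for the same queries.
baseline_run = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2"]}
candidate_run = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1"]}

baseline_map = mean_average_precision(baseline_run, qrels)
candidate_map = mean_average_precision(candidate_run, qrels)
```

Comparing the two numbers over the same queries and judgments is what lets you say one engine beats another, rather than just that it returns plausible results.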

yonz · on Jan 22, 2023

I can imagine how painful it would be to rely on keywords to search through patents, so your success makes sense to me.

But I should clarify: did we get enough of a comprehension boost from transformer-based pretrained LLMs to successfully interpret query intent and find related items?

PaulHoule · on Jan 23, 2023

I'd think you could evaluate a system like that with the TREC methodology.

I've seen plenty of blog posts where somebody did something poorly motivated with an embedding and ended up with a search engine that worked, but didn't do any real evaluation, so you don't know if it is better or worse than a simple search engine that uses, say, Okapi BM25.
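
A BM25 baseline is small enough to write by hand, which is part of why skipping the comparison is hard to excuse. The sketch below assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)); the documents and the function name `bm25_scores` are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "wireless charging coil for electric vehicles",
    "method for brewing coffee with a paper filter",
    "inductive charging pad for vehicles",
]
scores = bm25_scores("charging vehicles", docs)
```

Run the same judged queries through this and through the embedding-based engine, and you have the comparison those blog posts leave out.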

more_corn · on Jan 25, 2023

We’ve had good results with semantic search. We use it because keyword search doesn’t handle minor changes in words gracefully and semantic search does.

yonz · on Jan 25, 2023

Would you say it works more like fuzzy text search than searching by meaning?

For example, I expect that searching for "a snarky blog post" or "uplifting people"... wouldn't work that well.