I remember when "semantic search" was the Next Big Thing (back when all we had were simple keyword searches).
I don't know enough about the internals of Google's search engine to know if it could be called a "semantic search engine", but if not, it gets close enough to fool me.
But I feel like I'm still stuck on keyword searches for a lot of other things, like email (outlook and mutt), grepping IRC logs, searching for products in small online stores, and sometimes even things like searching for text in a long webpage.
I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? is it just a matter of integrating engines like the one that's linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better
More concretely, semantic similarity means using a neural net to embed each text, then using cosine similarity or a dot product to score how similar two entities are.
import numpy as np

embed1 = neural_net(txt1)  # map each text to a fixed-length vector
embed2 = neural_net(txt2)
sim_score = np.dot(embed1, embed2)  # higher score = more similar
If you're making a search engine, you precompute the embeddings for all the items in your database. When a user performs a search you just need to embed the query and do the dot products, which is fast for small indexes.
If you want to index millions or billions of entities, brute-force dot products become inefficient because the cost scales linearly with the size of the index. There is a family of techniques (loosely analogous to binary search) that finds the top-k most similar results in roughly O(log N) time, called approximate nearest neighbour (ANN) search. There are a few good libraries for that.
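The brute-force version described above is a few lines of NumPy. This is a minimal sketch, where `fake_embed` is a stand-in for a real embedding model (it just produces random unit vectors):

```python
import numpy as np

rng = np.random.default_rng(42)

def fake_embed(n, dim=32):
    # Placeholder for a real neural embedding model: random unit vectors.
    v = rng.standard_normal((n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

index = fake_embed(1000)   # precomputed embeddings for 1000 documents
query = fake_embed(1)[0]   # embedding of the user's query

scores = index @ query     # one dot product per document: O(N)

k = 5
top_k = np.argpartition(-scores, k)[:k]    # unordered top-k, still O(N)
top_k = top_k[np.argsort(-scores[top_k])]  # sort only those k results
```

`argpartition` avoids a full sort of the index, which matters once N gets large; but the dot products themselves remain linear in N, which is exactly why ANN indexes exist.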
Are there any semantic search implementations focused on small, local deploys?
E.g. I'm interested in local, serverless setups (on desktop, mobile, etc.) that yield quality search results in the ~instant~ time frame, but that are also complete and accurate. I.e. I ruled out investigating ANN because I wanted complete results, given the smaller datasets.
hnswlib is written in C++ and has Python bindings (you should be able to make your own for other languages). Faiss and Annoy (by Spotify) provide similar functionality.
For anybody interested in why this comment says "cosine similarity _or_ dot product", it's because the vectors in word embedding models are typically scaled to unit length.
If cos(theta) := A.B / (|A| * |B|)
and A and B are normalised, then the denominator is 1, and the RHS is equal to the dot product.
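You can verify this equivalence in a couple of lines (random vectors here, standing in for real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)

# Normalise to unit length, as embedding models typically do.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# Cosine similarity: dot product divided by the product of the norms...
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...but both norms are 1, so it collapses to the plain dot product.
assert np.isclose(cosine, np.dot(a, b))
```

In practice this is why search systems over normalised embeddings just use the dot product: it's the cheaper operation and gives identical rankings.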
I think you mean when the semantic web was the Next Big Thing. Better search would have been a nice side effect of a semantic web.
I was a big enthusiast of all the potential a semantic web could have brought. In my opinion, due to the rise of social networks, content became heavily centralized, and it wasn't in the centralized platforms' interest to annotate it or allow others to consume it. For a short period of my time on the internet, when most in my close circle of friends had WP blogs and blogrolls in the sidebar, you'd get at least some FOAF annotations on those links. I could complain for a whole afternoon about the clunkiness of the supporting software and technologies (RDF, triple stores, ontologies, graph databases, etc.), as it wasn't easy as a coder to hack on these technologies developed by consortiums.
As far as semantic search goes, I think that due to the heavy SaaS-ification of software, there isn't even an incentive to create better search tooling now. I know the landscape of search systems is huge, and while there is no way for me to assess all existing software, I've just stuck with the tried and tested Apache Solr (or Elasticsearch) on projects I worked on. And those are not easily tweaked into semantic search engines.
My experience with Google Search for the past few years is that results based on their knowledge graph have been gamed heavily. There's a lot of junk in those results that is very much adjacent to the keywords I'm searching for. You'll see the common suggestion on HN as well: when searching for a product review, add news.ycombinator.com/reddit.com to your query, depending on the type of product you're looking for.
> I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? is it just a matter of integrating engines like the one that's linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better
Several technical challenges.
First, keyword searches have a lot of history that has led to a huge amount of tuning that users have gotten used to (mostly for the better in terms of results but mostly for the worse in terms of difficulty of configuration). For example, keyword systems have evolved over decades to have synonyms (unidirectional and bidirectional), a huge number of stemming algorithms for various languages (and some that cross languages), dictionaries for decompounding, various ngram/shingling methods, phrase matching and term overlap analysis, and the ability to combine all of these together with tunable weights, etc. These have generally resulted in a lot of keyword systems continuing to be "as good as it gets" for a long time. People generally like fiddling with these knobs/dials because it gives them a sense of control...until they realize the combinatory mess they get themselves into, where they're essentially human hyperparameter tuning systems. Recently, some additional steps have been taken to get the "human" out of that with automated systems, but even then, most systems aren't set up to "learn" what synonyms to potentially introduce, whether/when/how to take word order into account, and in particular when/how these can/should combine together and when they shouldn't.
Semantic large language models "solve" some of these problems (automatic synonyms, built-in linguistic understanding of root words, etc.) if you build them right, but they have a lot of hidden technical depth. Most people try to throw something like BERT into their search and find the hardware costs and complexity go through the roof in ways they weren't ready to handle. And there's history weighing on operators' expectations ("where's my synonym configuration?"), and the answers are very different ("go through a fine-tuning step for your model") or sometimes nonexistent on most commercial platforms (how do you ensure only relevant results are returned?). And because the semantic/large language models don't know everything in the world, OOTB models still underperform keyword search on certain query types (those heavy on obscure people names, etc.) -- until they're retrained.
There's good research and companies/products coming out though that are changing a lot of this. See https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGq... for example where the BM25 rows are traditional keyword and rows 8+ are zero-shot language models, and you can start to see that in some of the recent developments, semantic/neural/large language models are starting to outperform keyword on the things keyword used to be better at. My sense (though I'm biased) is these solutions are going to rapidly evolve to eliminate many of these technical challenges.
Disclosure/source: I led product management for Elasticsearch for several years and am currently leading product management for Vectara (a neural search SaaS platform)
This is an area that's dear to me, I'm the cofounder of Vectara and have been working with embedding-based semantic search, aka neural search or neural IR, since 2017.
As to whether Google uses semantic search, the answer is yes, very heavily [1][2]. Not only that, but they have led, and continue to lead, much of the pioneering research in NLP and neural IR over the past decade [3][4][5].
Technical challenges lie along a few primary dimensions. The first has been search quality: while early neural systems like Google Talk to Books [5][6] demonstrated the potential of these techniques, benchmarks like BEIR [7], released a few years later in 2020, showed that the best keyword retrieval algorithms still outperformed neural techniques in general settings.
The landscape since then has shifted very rapidly: In 2022, for the first time, neural search methods outperformed BM25 on BEIR. This includes late interaction [8], sparse encoding [9], and, most challengingly, dense encoding [10] systems.
The second technical challenge is scalability. After decades of infrastructure optimization, keyword systems scale well to very large corpora, while semantic systems struggle to achieve the same scale. The k-d tree approach presented in the article, for example, while good for experimentation, would be difficult to productionize, as-is, in a large-scale system.
However, research into scaling dense vector retrieval has received a lot of focus recently [11], so I'm confident this will change.
I'll close by saying your observation about being stuck with keyword search in a lot of apps is accurate, but I expect that to change soon. It's becoming easier to embed neural models everywhere, and I think that distilled models in the 5-50 MB size range can feasibly power semantic search everywhere you press Ctrl-F today.
That doesn't explain why Atlassian can't build a working search function for Confluence.
I get the feeling the biggest problem with site-local search engines is a tacit requirement that the search index must always be up to date. That severely hamstrings any search engine, since there's a wealth of supplemental information to be gained from considering the corpus as a whole that simply isn't available if you support real-time updates.
Generally not a fan of Atlassian, but I can't say Confluence search has been a problem; I have more problems searching within Google Workspace (if it's still called that this week).
It's made worse by the fact that it's usually the only way of navigating Confluence, as any non-trivial Confluence instance eventually turns into a nightmare maze of abandoned stale pages, dead links, and half-baked attempts at restructuring, where the person enthusiastically pushing for the restructuring effort gave up a third of the way through because it turned out to be a lot more work than it seemed.
I've seen this time and time again in both big and small organizations that use Confluence. Makes me feel there's a fundamental design problem with the product.