Hacker News
Will keyword (BM25, TF-IDF) be replaced for search by Neural Search? (venturebeat.com)
18 points by jamesblonde on Oct 12, 2022 | 10 comments



We humans are preconditioned to extrapolate linearly (as opposed to exponentially), thanks to our hunter ancestors (and FPS games!). It is very clear that the rate of advancement of large language models is super-linear, if not exponential.

Hence I predict that keyword search will be completely supplanted as a search mechanism within the next five years.

Of course we will still need to do lookups for ISBNs and generic IDs, but that isn't keyword search; that is index-lookup functionality.

Case in point: take a look at Meta Research's Contriever model (https://github.com/facebookresearch/contriever), which already matches keyword techniques in efficacy without any supervision.
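For anyone who wants to kick the tires, here is a minimal sketch of scoring passages with Contriever, following the mean-pooling recipe and checkpoint name (facebook/contriever) from the repo's README; the rest is an illustrative assumption, not any particular product's stack:

    # pip install torch transformers
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
    model = AutoModel.from_pretrained("facebook/contriever")

    def embed(texts):
        # Mean-pool token embeddings over non-padding positions.
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            token_emb = model(**inputs).last_hidden_state
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        return (token_emb * mask).sum(dim=1) / mask.sum(dim=1)

    query = embed(["where was marie curie born?"])
    docs = embed(["Maria Sklodowska was born in Warsaw.",
                  "Paris is the capital of France."])
    print(query @ docs.T)  # unsupervised dot-product relevance scores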

This is only the beginning. Come build the future with us; we see it very clearly :)


Amr Awadallah, cofounder of Cloudera, is arguing that keyword search will be replaced with neural search within the next five years. He's not alone; a number of companies have emerged in this space recently (e.g. Deepset Haystack, Hebbia). Amr launched his new startup Vectara today. They claim their neural search as a service is as easy to use as Algolia and as scalable as Elastic, but "neural-first", leading to much higher semantic relevance (it is free for 15k queries per month). My question: do you really see neural search replacing keyword search, or do you simply see it as an extension?


Both kinds of systems may coexist; the question that remains open to me is how to combine them. Vector search, however, also enables non-textual search: querying images, videos, etc. cannot be done with BM25 and similar methods.
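On the open question of combining them, one common recipe (my sketch, not the only option) is reciprocal rank fusion, which merges the ranked lists from a BM25 index and a vector index without having to calibrate their incomparable scores:

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings, k=60):
        # rankings: several lists of doc ids, each ordered best-first.
        # k=60 is the damping constant from Cormack et al. (2009).
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits = ["d3", "d1", "d7"]     # from the keyword index
    vector_hits = ["d1", "d9", "d3"]   # from the embedding index
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # d1, d3 rise to the top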


There is a place for both:

If you are searching for a specific ID/ISBN or some other arbitrary token, keyword search will always be useful and easy to implement.

If the goal of the search is more semantically ambiguous and cannot be expressed by a unique phrase, then neural search will be the way to go.

Most of the interesting applications of search will be semantically driven and therefore neural search has a big role to play.
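As a toy illustration of that split (the heuristics here are illustrative assumptions, not production rules), a front end can route exact-token queries to the keyword/lookup index and everything else to the neural index:

    import re

    ISBN_RE = re.compile(r"^(97[89][- ]?)?(\d[- ]?){9}[\dXx]$")

    def route_query(query: str) -> str:
        # Identifiers (ISBNs, SKU-like tokens) -> keyword/lookup index;
        # free-text, semantically ambiguous queries -> neural index.
        token = query.strip()
        if ISBN_RE.match(token):
            return "keyword"
        if token.isalnum() and any(c.isdigit() for c in token):
            return "keyword"   # crude SKU/ID heuristic
        return "neural"

    assert route_query("978-0-13-468599-1") == "keyword"
    assert route_query("comfortable shoes for standing all day") == "neural"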


I agree there's a place for both right now. Curious whether you think that will be the case forever? I've been seeing huge, nearly step-function increases in precision for neural systems recently, and I'm wondering how much longer keyword will hold its lead. Keyword search has held that title so far (though IMO, in a lot of cases it doesn't need anything as complex as BM25/TF-IDF).


Here is a scenario that demonstrates my concern: if you search for a phone number, semantically almost all phone numbers have the same meaning.

The embedding model will project them all to the same part of the space, and I don't see how any modifier in the query would adjust that. For example, searching for "631 887 9812" will not look very different from searching for any other number that starts with area code 631. The results will be quite washed out, in my expectation.
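You can check this concern empirically; here's a sketch with the sentence-transformers library (the model choice is my assumption; most general-purpose text embedders will show a similar effect):

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    a, b = "631 887 9812", "631 555 0147"
    emb = model.encode([a, b])
    # Likely close to 1.0: the specific digits barely move the vector,
    # so nearest-neighbour retrieval can't tell one number from another.
    print(util.cos_sim(emb[0], emb[1]))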


Interesting! I wonder if it can help improve search for images/products/code and other artifacts.


Yes, and for voice, video, 3D objects, and digital knowledge at large.


People have been predicting the death of keyword search for at least as long as I've been working in the field (23 years and counting!). We've seen outrageous marketing claims, buzzwords like "concept search" and "insight engines", and many new companies promising a step change in search quality, some of which are still in business but many of which shone briefly and then vanished. The concept of an easy-to-use, fire-and-forget, scalable search engine that gives you great relevance out of the box isn't new. Yet the bag-of-words model remains the standard across the sector: it's well understood, with many powerful and scalable open- and closed-source implementations.

Making search work in practice, however, is hard. It's as much about process and people as it is about technology: many companies aren't even measuring search quality, recording search issues correctly, or maintaining an active search team (bigger than one poor overworked search person). No matter how clever the tech, these problems aren't going away: they're compounded by bad source-data quality, misunderstood user search intent, and bad search UX. Martin White, author of many books on search, describes search as a 'wicked problem'. Getting all these parts working in harmony so you can truly own your search is what we do here at OSC, and it takes time, investment, and commitment.

I think Vectara is very interesting, and the people involved have impressive track records (there are also some other great engines, like Vespa, Pinecone, Qdrant, Weaviate...). However, I think the future of search is hybrid: keyword search will still be there for many use cases, but enhanced by vector/neural approaches (the most widely used search engine, Lucene, recently gained vector features, and work is happening on how to combine these with keyword ranking). No single approach will magically solve everyone's search problems, cope with special cases like part-number search or the specialised language used in some sectors, or always understand the searcher's intent, without considering the human factors above or without extra tuning/training.
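For what that combination might look like, here is a sketch (mine, not Lucene's actual implementation) of the other common hybrid recipe besides rank fusion: blending a normalized BM25 score with cosine similarity via a tunable weight:

    # pip install rank_bm25 numpy
    import numpy as np
    from rank_bm25 import BM25Okapi

    docs = ["roku streaming stick", "universal tv remote", "hdmi cable 2m"]
    bm25 = BM25Okapi([d.split() for d in docs])

    def hybrid_scores(query_terms, query_vec, doc_vecs, alpha=0.5):
        # alpha=1.0 -> pure keyword; alpha=0.0 -> pure vector.
        # query_vec / doc_vecs come from whatever embedding model you use.
        kw = np.asarray(bm25.get_scores(query_terms))
        kw = kw / (kw.max() + 1e-9)                  # scale BM25 into [0, 1]
        dv = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        qv = query_vec / np.linalg.norm(query_vec)
        return alpha * kw + (1 - alpha) * (dv @ qv)  # blend with cosine sim

The right alpha is corpus-dependent, which is exactly the kind of tuning work described above.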

That said, with all these exciting new approaches, tools and companies, it's a very interesting time in the search world!

Further reading/viewing: at the Haystack EU search conference (www.haystackconf.com) a couple of weeks ago, Dmitry Kan, host of the Vector Podcast (he featured the Vectara team a while ago), gave a great keynote describing the current state of vector search. I wasn't going to release the video until Monday, but you can get an early look here: https://youtu.be/2o8-dX__EgU . You can also read the joint article we wrote for The Search Network on vector search here: https://opensourceconnections.com/wp-content/uploads/2022/05... (aimed at executives and others needing to understand the field).


I agree with many aspects of your assessment, but when you say, "The concept of an easy to use, fire-and-forget, scalable search engine that gives you great relevance out of the box isn't new. Yet the bag-of-words model remains the standard across the sector ...", I think you have to weigh the emergence of transformer-based neural networks around 2017 more strongly.

In 2018, BERT, the first demonstration of a pretrained large language model (LLM), exceeded human performance on Stanford's Question Answering Dataset [1]. Nobody in 2010 predicted such rapid progress.

Between 2017 and 2020, I worked with several teams managing very complex search systems. In one case a single LLM obviated dozens of hand-tuned relevance signals developed over the better part of a decade.

One of the main effects of neural search adoption will be raising the baseline quality of search; a second will be a reduction in the overall cost and complexity of search implementations.

For example, it's not easy to configure a keyword system to find "works fine, We have two Roku's [sic] in other televisions which are working fine" in response to "does it work with different tvs?". But neural search finds this result directly, without any tuning or configuration [2].
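A rough way to see that example in code (a sketch; the model name is my assumption, not the system used in the video):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
    query = "does it work with different tvs?"
    passage = ("works fine, We have two Roku's [sic] in other "
               "televisions which are working fine")

    # Naive term overlap finds nothing useful ("work" != "works"/"working"),
    # which is why a keyword engine needs stemming/synonym tuning here.
    print(set(query.lower().split()) & set(passage.lower().split()))
    # The embedding model scores the pair as highly similar out of the box.
    print(util.cos_sim(model.encode(query), model.encode(passage)))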

Thank you for sharing the video and the article!

[1]: https://www.nytimes.com/2018/11/18/technology/artificial-int...

[2]: https://www.youtube.com/watch?v=Tn7AqmY9yaY&t=112s



