
Challenges in Implementing a Full-Text-Search Engine - bhavaniravi
https://bhavaniravi.com/blog/challenges-in-full-text-search
======
misterman0
FTS is dead. It didn't use to be dead, but it surely is today. For years
and years we were told to use stop words, integrate a stemming library,
normalise term weights with TF-IDF, maintain a synonyms dictionary, and use
an index that keeps track of term positions so you can cater for term
proximity; do all that and the relevance of your search results would be top
notch. Top notch! Today none of that is relevant. Today you need to provide
semantic search ("SS") or your search will be considered broken, because the
results will truly suck. To provide SS you need ML.
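
The whole classic recipe fits in a toy sketch (pure Python, invented stop
list and suffix rules, nothing like a production engine):

```python
# Toy version of the old FTS pipeline: stop words, crude stemming, TF-IDF.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}

def stem(token):
    # Naive suffix stripping; a real engine would use Porter/Snowball.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def tokenize(text):
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf_index(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: how many docs contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    index = []
    for toks in tokenized:
        tf = Counter(toks)
        # Term frequency scaled by inverse document frequency.
        index.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                      for t in tf})
    return index

docs = ["the cat sat on the mat", "dogs chase cats", "a mat of straw"]
index = tf_idf_index(docs)
```

Rare terms ("sat") end up weighted higher than common ones ("cat", "mat"),
which is the whole trick, and also the ceiling: nothing here knows what any
word means.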

FTS is still commonly included in e-commerce where I guess it's not quite dead
but only because truly relevant search results are irrelevant to retailers'
bottom line.

SS is at least one order of magnitude more complex to set up, especially if
you want to be able to refresh your "index", and even more so if your data is
big. What a typical back-ender would whip up in a couple of days using
ES/Solr now needs a proper math dev, and for the math dev's model to become
useful you need them to hand it over to a distributed-systems expert.

SS through ML is commonly a nasty, duct-taped workflow that at best results
in a system that looks more like a POC or demo than a proper production
system, unless you work at Bing/Google, that is (probably, but I have no
hands-on experience of those systems).

I've been trying for years, I'd say at least ten, to "commoditize" this
workflow, to make it simpler and more usable for generalist devs (not just ML
people), but no matter what I do I keep getting crushed under the weight of
the data. To me it seems search is dead and we haven't appointed a new king.

~~~
ObserverEffect
Disclaimer: I work at Algolia
([https://www.algolia.com/](https://www.algolia.com/)), a hosted search as a
service API.

While I agree that building a great and relevant search experience with a
Lucene-based engine requires lots of extra time and effort to get right, there
are other non-TF-IDF-based solutions that provide a much faster path to great
relevance with far less effort
([https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/](https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/)),
and it's possible to have semantic ranking without too much machine learning
([https://blog.algolia.com/promote-search-results-query-rules/](https://blog.algolia.com/promote-search-results-query-rules/)).
Not to discount the value of machine learning: we're finding that for
specific use cases ML can be a very valuable way to help surface more
pertinent content for individuals based on their profile/preferences etc.
([https://blog.algolia.com/personalization-announcement/](https://blog.algolia.com/personalization-announcement/)).

This may be along the lines of what you mentioned around "commoditizing"
complex traditional search workflows. I'd be curious to hear more about what
kind of use-cases you think are trickiest without SS.

~~~
misterman0
Although I'm a big fan of Algolia search (because it's freakin' fast) I happen
to know little to nothing of your search model other than what I have learned
from Algolians chipping in right here on HN.

I used to be quite impressed with Lucene, even at version 1.0 (when a fuzzy
search meant a full table scan), then watched in joy as it conquered the
search market, before realizing how it struggled (and still does) with, well,
I hate to say it (because I'm usually ridiculed when I bring this up), y'know,
big(-ish) data. The proposed and popular solution: sharding the data onto a
cluster of machines.

Algolia seems to be a focused, streamlined, and more efficient Elasticsearch,
at least for the FTS use case.

I've worked almost exclusively in e-com for ~20 years. Algolia
FTS+personalization seems to fit the e-commerce use case pretty darn well.

I wonder, regarding "Algolia Query Rules" (which also seems like a real
killer-feature for e-commerce):

>> automatically transforming a query word into its equivalent filter (“cheap”
would become a filter price < 400...

How do you translate "cheap" into "price < 400"? By maintaining a dictionary?
Also, what if some people think 400 is quite expensive?
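
If it is a dictionary, I'd imagine something like this (pure guesswork on my
part; the vocabulary, the attribute names, and the 400 are all invented):

```python
# Hypothetical dictionary-driven query rules: certain query words get
# rewritten into structured filters instead of being matched as text.
QUERY_RULES = {
    "cheap": {"attribute": "price", "op": "<", "value": 400},
    "new": {"attribute": "release_year", "op": ">=", "value": 2018},
}

def rewrite_query(query):
    """Split a raw query into remaining search terms plus structured filters."""
    terms, filters = [], []
    for word in query.lower().split():
        rule = QUERY_RULES.get(word)
        if rule:
            filters.append(rule)
        else:
            terms.append(word)
    return " ".join(terms), filters

terms, filters = rewrite_query("cheap laptop")
# terms == "laptop"; filters carry the price < 400 constraint
```

Which only sharpens the question: a static table like this bakes one
person's idea of "cheap" into everyone's results.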

I want to build or implement a search engine that is inherently self-
maintained, in the same way you and I are self-maintained. As humans, however,
we have a serious flaw: in order to maintain an index of our knowledge, we
need to sleep. To start with I'd like to try to mimic that construct, then
move past it.

------
theandrewbailey
I've used the full text search feature in Postgres (even before then, I was
vaguely familiar with the topics covered here). It worked unless you
misspelled something, or split/merged compound words. Trigrams solved that.
Whenever I get around to upgrading, I'd love to use the websearch_to_tsquery
function.

[https://www.postgresql.org/docs/current/textsearch.html](https://www.postgresql.org/docs/current/textsearch.html)

[https://www.postgresql.org/docs/current/pgtrgm.html](https://www.postgresql.org/docs/current/pgtrgm.html)
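
For the curious, the trigram idea behind pg_trgm is simple enough to sketch
in a few lines (a toy approximation: the real extension backs this with a
GIN/GiST index rather than comparing sets directly):

```python
# Toy version of trigram matching: a misspelled word still shares most of
# its trigrams with the correct spelling, so typos stop breaking search.
def trigrams(word):
    # pg_trgm treats each word as having two spaces prefixed, one suffixed.
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # Jaccard-style overlap of the two trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

similarity("postgres", "postgers")  # high despite the transposition
similarity("postgres", "banana")    # no shared trigrams at all
```

The same trick handles split/merged compound words, since the halves still
contribute most of the trigrams of the whole.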

~~~
SigmundA
PG really needs better ranking: at least TF-IDF, but ideally BM25.

There has been some work, but I'm not sure when it will be stable; it needs a
new kind of index:

[https://github.com/postgrespro/rum](https://github.com/postgrespro/rum)
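
For reference, BM25 itself is tiny; the hard part in PG is the index
support, not the formula. A toy scorer with the usual k1/b defaults:

```python
# Minimal BM25 ranking over a list of documents. Unlike naive TF-IDF, it
# saturates term frequency (k1) and normalizes by document length (b).
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * (tf[term] * (k1 + 1)) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores
```

A short doc that repeats the query term outranks a longer one that mentions
it once, which is the behavior PG's built-in ts_rank only crudely
approximates.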

------
ddorian43
This is extremely high level. For a nice view of how Lucene internals work,
see this talk by core committer Adrien Grand:

[https://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal](https://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal)

[https://www.youtube.com/watch?v=T5RmMNDR5XI](https://www.youtube.com/watch?v=T5RmMNDR5XI)

------
andrewmatte
Nice article. I am interested to see this stuff blended with GPU/ML-powered
databases rather than the TF-IDF of decades ago, as well as it works.

~~~
m_ke
Someone needs to make a DB with first class support for dense feature vectors
(embeddings) and approximate nearest neighbor search.

These two features would allow you to do visual search, semantic text search,
recommendations, learning to rank, etc.
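
The query such a DB would answer is just this, minus the "approximate" part
that makes it scale (brute-force sketch with made-up vectors and ids):

```python
# Exact nearest-neighbor search over dense vectors: the baseline that an
# ANN index (faiss, HNSW, etc.) approximates to avoid scanning every row.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec, table, k=2):
    """table: list of (id, embedding). Returns top-k ids by cosine similarity."""
    ranked = sorted(table, key=lambda row: cosine(query_vec, row[1]),
                    reverse=True)
    return [row_id for row_id, _ in ranked[:k]]

table = [("red shoes", [0.9, 0.1, 0.0]),
         ("blue shoes", [0.8, 0.3, 0.1]),
         ("laptop", [0.0, 0.1, 0.95])]
nearest([1.0, 0.2, 0.0], table, k=2)
```

First-class support would mean the embedding column, the distance function,
and an index that beats this linear scan, all inside the database.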

~~~
mumblemumble
I'd love to have something like that. As far as I'm aware, one big limiting
factor is that there currently aren't any great ways to build an index for
approximate nearest neighbor search that doesn't require keeping the whole
index in memory. A disk-friendly indexing method would make it just a
PostgreSQL plugin away.

~~~
gravypod
There are no good exact indexing structures, but there are a lot of very
high-performance approximate NN structures. Facebook has an open-source
implementation of some of these in a project called faiss [0], which does a
relatively good job of this.

[0] -
[https://github.com/facebookresearch/faiss](https://github.com/facebookresearch/faiss)

~~~
blr246
At Frame.ai, we are using both PostgreSQL and faiss (and other tools) in our
stack to do several different kinds of inference tasks on semantic
representations of text to help companies understand and act on customer
chats, emails, and phone call transcripts.

We've frequently had the same dream of adding more native support for nearest-
neighbor type queries, since that is the workhorse of so many useful
techniques in the modern NLP stack.

Right now, we have lots of dense vectors stored in massive TOAST tables in PG.
It's faster to fetch them than to recompute them, especially since a number
of preprocessing steps limit what we pay attention to.

The discussion here about full text search versus semantic search is
interesting. In our experience, both are highly relevant. Sometimes it's most
useful for our customers to segment their conversation data by exact text
matches, and other times semantic clustering is most effective. I think
there's plenty of reason to offer both kinds of capabilities.

------
avremel
I wrote an intro to the Lucene scoring model with a python example:

[https://github.com/avremel/lucene](https://github.com/avremel/lucene)

Elastic/Solr are very decent options. Last time I checked, Algolia and other
SaaS offerings were too expensive for small businesses.

