
Let's build a full-text search engine - elorant
https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/
======
injb
>To query the index, we are going to apply the same tokenizer and filters we
used for indexing

From developing a Lucene-based search engine years ago, I learned a very
useful trick for things like stemming and synonyms: _don't_ use the same
tokenizer/filters that you used for indexing. It's much more flexible if you
instead expand the query terms, replacing a given term with a list of OR
clauses with all the synonyms.

So during indexing, if the document has "fishing", then you just add that to
the index.

Later, during querying, the user can search for "fish" with optional stemming
enabled, and their query will get re-written into

"fish" OR "fishing" OR "fishy"

This way, the user has control at query-time, and you can manage your
synonym/stemming database without having to rebuild your index.

Of course, I had to write a custom parser for this but it was well worth it
imo.
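
To illustrate, here's a minimal Go sketch of the query-time side. The `synonyms` map and `expand` helper are hypothetical stand-ins for the real synonym/stemming database and parser, not what we actually shipped:

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    // synonyms stands in for the synonym/stemming database.
    var synonyms = map[string][]string{
        "fish": {"fishing", "fishy"},
    }
    
    // expand rewrites a single query term into an OR clause over the term
    // plus all of its expansions; the index itself is never touched.
    func expand(term string) string {
        clauses := []string{fmt.Sprintf("%q", term)}
        for _, s := range synonyms[term] {
            clauses = append(clauses, fmt.Sprintf("%q", s))
        }
        if len(clauses) == 1 {
            return clauses[0]
        }
        return "(" + strings.Join(clauses, " OR ") + ")"
    }
    
    func main() {
        fmt.Println(expand("fish")) // ("fish" OR "fishing" OR "fishy")
    }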

~~~
blueline
how do you populate/maintain the synonym/stemming database? is this something
that “standard” datasets exist for? asking because this problem will fall on
my plate at work sometime in the next few months :-)

~~~
injb
There was a standard set of synonyms that everyone was using at the time
(around 12 years ago) that came in the form of a Lucene index. I'm not sure if
it's still around or where you'd find it.

IIRC the stemming was done by an algorithm and not from a database...but I
don't remember the exact details.

~~~
vaidhy
When I was building this circa 2006, I used the Porter stemmer. I don't know
if that is still state of the art, but it worked really well for my purpose.

~~~
sanxiyn
These days we use the Snowball stemmer. OP also used it.

------
adsjhdashkj
Recently I've been writing a sort of archival-focused SQL database, partly out
of want, partly out of a desire to learn. FTS would be of great benefit if I
added it.

So I ask, where do FTS and SQL constraints or implementations differ greatly?
Why does Lucene exist rather than having the same features in MySQL or
Postgres?

I'm asking in the context of road bumps and things to watch out for, should I
try to merge the ideas and implementation of a SQL DB and FTS.

 _(Note: I am aware that some SQL databases have FTS, but I imagine it still
differs or has weaknesses when compared to a strictly focused FTS engine like
Lucene)_

~~~
mamcx
One major "weakness" of RDBMSes is that all the internals are black boxes.

You can do "SELECT * FROM table"

but, weirdly, can't do "SELECT * FROM A_Index".

NOTHING in the relational model requires this limitation. It's purely
accidental.

BTW, I'm building a relational lang, and also, a very long time ago, tried to
port Lucene to Delphi.

So, what you need in order to support FTS and other kinds of structures is...
support for them.

For interfacing, you can consider ANY key-value store as a relation of exactly
2 columns. Any scalar is a relation of 1 column, 1 row. You can put relations
inside relations, so you can model recursive data.

Maybe the hardest part is having not only btree storage but other structures
too. I think you can fit FTS data in btrees anyway, but you'll probably have
to tweak them.

When building the relational lang I stumbled on how to support n-column data
inside a btree. You can split:

        pk=key
        values=col1 col2 col3

or, as I later "got", just store the full row and change the functions that do
the comparing. In Rust, start with:

        pub struct Tree {
            index: usize,               // signals which column acts as the key
            data: BTreeSet<Vec<Value>>, // stores the full row; Value = your scalar type
        }

~~~
branko_d
> but, weirdly, can't "SELECT * FROM A_Index".

In all honesty, most DBMSes have some sort of "hinting" mechanism that
accomplishes roughly the same.

It's a kludge, though, usually in the form of specially formatted comments or
some sort of syntactic "piggy-back" on top of the regular SQL, or even outside
SQL (e.g. SQL Server plan guides).

I'd love to be able to specify the physical access path with more "first-
class" syntax, and mix-and-match with logical access more easily.

------
mwsherman
As you get into this, you’ll want to improve the tokenization to something
less naive. (The naive approach is fine for the educational purpose of the
article.)

I’ve made the mistake of simply splitting on spaces or punctuation, and
eventually finding that will do the wrong thing.

Eventually I learned about Unicode text segmentation[0], which solves this
pretty well for many languages. I believe the default Lucene tokenizer uses
it. I implemented it for Go[1].

[0] [https://unicode.org/reports/tr29/](https://unicode.org/reports/tr29/) [1]
[https://github.com/clipperhouse/uax29](https://github.com/clipperhouse/uax29)
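
To make the failure mode concrete, here's a small stdlib-only Go sketch of the naive approach doing the wrong thing (a UAX #29 word segmenter keeps "don't" together as one token):

    package main
    
    import (
        "fmt"
        "strings"
        "unicode"
    )
    
    func main() {
        text := "I don't think it's naive"
    
        // Naive tokenizer: split on anything that isn't a letter.
        tokens := strings.FieldsFunc(text, func(r rune) bool {
            return !unicode.IsLetter(r)
        })
    
        fmt.Println(tokens) // [I don t think it s naive]
        // "don't" is torn into "don" + "t"; a UAX #29 segmenter keeps it
        // together as a single word token.
    }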

~~~
rawoke083600
Yeah, agreed; here's my tokenization example. Back when I was doing search for
a price comparison service, the model numbers of the products were sometimes
entered as LGA775 or LGA-775 or LGA 77-5 or LGA—775 (unicode dash) or "LGA
775".

And that is just one example :) I ended up developing a suite of "search
tests", so whenever we were doing tuning (on a daily basis) we had a set of
about 80 automatic tests for weird/previously-wrong search cases that we could
check in the blink of an eye. One could specify that the top X results needed
to include at least Y products in category XYZ, or have "brand" ABC, etc. One
might say it was "test-driven search tuning".
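
For the model-number mess specifically, the fix amounts to normalizing at both index and query time. A hedged Go sketch of the idea (not our actual code):

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    // normalizeModel collapses the common variants of a model number
    // (LGA775, LGA-775, "LGA 775", LGA—775) to a single canonical token.
    // Apply it both when indexing and when parsing the query.
    func normalizeModel(s string) string {
        return strings.Map(func(r rune) rune {
            switch r {
            case ' ', '-', '–', '—': // space, hyphen, en dash, em dash
                return -1 // drop the rune
            }
            return r
        }, strings.ToUpper(s))
    }
    
    func main() {
        for _, v := range []string{"LGA775", "LGA-775", "LGA 775", "LGA—775"} {
            fmt.Println(normalizeModel(v)) // LGA775, every time
        }
    }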

------
andyjpb
This is a great intro to full-text search.

These techniques were the basis of some of the early web search engines
(pre-Google).

More of this stuff is explored in Witten, Moffat and Bell's book "Managing
Gigabytes", which dates from the era of those early search engines, when
"gigabytes" was something unimaginably large that certainly wouldn't fit in
RAM and perhaps not even on a single disk.

~~~
ramraj07
Of the terabytes or petabytes of data we have in indices today, is it all
served from memory? I remember Google saying at some point that they had gone
fully in-memory for searches; I don't know if they have kept at it or if it's
now the standard.

~~~
MR4D
A petabyte of RAM is less than $5MM, using just RAM costs, and consumer RAM at
that [0]. So given how much money they make and how many servers they have, I
think it's highly probable. The caching of images and web pages might be more
difficult.

Plus, Google could easily just cache the results for the top million queries
without much RAM at all by comparison.

[0] - [https://www.newegg.com/team-32gb-288-pin-
ddr4-sdram/p/N82E16...](https://www.newegg.com/team-32gb-288-pin-
ddr4-sdram/p/N82E16820331396?cm_sp=homepage_ss-_-p4_20-331-396-_-08042020)

~~~
elorant
What kind of motherboard could take a PB of RAM?

~~~
MR4D
8,000 of them networked in a data center.

Back in 2011 it was thought they had a million servers, so 8,000 is easy for
them.

[0] -
[https://www.datacenterknowledge.com/archives/2011/08/01/repo...](https://www.datacenterknowledge.com/archives/2011/08/01/report-
google-uses-about-900000-servers)

------
karterk
If you use a trie for the inverted index, you can do fuzzy searches on tokens.
This can help handle typos. This is the approach I've taken with Typesense:
[https://github.com/typesense/typesense](https://github.com/typesense/typesense)
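
For illustration, a minimal Go sketch of the idea, nowhere near Typesense's actual implementation: walk the trie with an edit budget, so a query with a typo or two still reaches the stored token:

    package main
    
    import "fmt"
    
    type node struct {
        children map[rune]*node
        word     string // non-empty if an indexed token ends here
    }
    
    func newNode() *node { return &node{children: map[rune]*node{}} }
    
    func (n *node) insert(word string) {
        cur := n
        for _, r := range word {
            next, ok := cur.children[r]
            if !ok {
                next = newNode()
                cur.children[r] = next
            }
            cur = next
        }
        cur.word = word
    }
    
    // fuzzy collects every stored token within `budget` single-character
    // edits (substitute/insert/delete) of the query -- a simplified edit
    // distance, but enough to illustrate typo tolerance.
    func (n *node) fuzzy(query []rune, budget int, hits map[string]bool) {
        if len(query) == 0 {
            if n.word != "" {
                hits[n.word] = true
            }
            if budget > 0 { // allow trailing characters
                for _, c := range n.children {
                    c.fuzzy(query, budget-1, hits)
                }
            }
            return
        }
        for r, c := range n.children {
            if r == query[0] {
                c.fuzzy(query[1:], budget, hits) // exact match
            } else if budget > 0 {
                c.fuzzy(query[1:], budget-1, hits) // substitution
                c.fuzzy(query, budget-1, hits)     // insertion
            }
        }
        if budget > 0 {
            n.fuzzy(query[1:], budget-1, hits) // deletion
        }
    }
    
    func main() {
        root := newNode()
        for _, w := range []string{"search", "engine", "trie"} {
            root.insert(w)
        }
        hits := map[string]bool{}
        root.fuzzy([]rune("serach"), 2, hits)
        fmt.Println(hits) // map[search:true] -- the transposition costs 2 edits
    }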

------
hbcondo714
> Most well-known FTS engine is Lucene (as well as Elasticsearch and Solr
> built on top of it)

What attracts me to these services is their hit-highlighting[1] capability, so
you can show where the search query matches are. This is especially helpful in
long documents, but it appears that in this case the documents are just
one-liners and the entire document is returned in the search results.

[1]
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html)
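
In its most naive form, highlighting is just wrapping the matches. A hedged Go sketch (real highlighters like Elasticsearch's work from the analyzed token offsets instead of a regex):

    package main
    
    import (
        "fmt"
        "regexp"
    )
    
    // highlight wraps case-insensitive whole-word matches of term in <em>
    // tags -- the naive version of what real highlighters do with token
    // offsets from the analysis chain.
    func highlight(text, term string) string {
        re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
        return re.ReplaceAllString(text, "<em>$0</em>")
    }
    
    func main() {
        fmt.Println(highlight("A small wild cat is still a cat.", "cat"))
        // A small wild <em>cat</em> is still a <em>cat</em>.
    }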

~~~
Timothy055
Glad to see this mentioned! I worked on open sourcing the unified highlighter
for Bloomberg along with Lucene/Solr committer David Smiley. It’s a very
important feature for longer documents--in our case legal documents. We gave a
presentation about it at Lucene/Solr Revolution in 2016.[1]

[1] [https://youtu.be/tv5qKDKW8kk](https://youtu.be/tv5qKDKW8kk)

------
boyter
Not the point of the article, but the slow-down from using regular expressions
can be avoided in this case by instead generating each case permutation and
searching on that. It's actually a bit more complex, because you need to stop
doing this for long strings to avoid too many permutations, but you get the
idea.

Posting just in case someone thinks they need to build an index in order to
case-insensitively search a few gigabytes of text. It's actually how tools
like ripgrep tend to work (assuming literal extraction from the regular
expression is possible). The moment you get to tens of gigabytes, though, an
index is probably what you need.

I actually have some Go code (since the example was in Go) which does this
and, assuming you are doing literal searches, is a drop-in replacement for
re.FindAllIndex:
[https://github.com/boyter/cs/blob/master/str/index.go#L110](https://github.com/boyter/cs/blob/master/str/index.go#L110)
It also does simple case folding for you.
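
The permutation generation itself is simple. A hedged Go sketch with the cut-off mentioned above (this is an illustration, not the code in the link):

    package main
    
    import (
        "fmt"
        "unicode"
    )
    
    // permuteCase returns every upper/lower-case variant of s, bailing out
    // once the count would exceed limit (2^letters grows fast).
    func permuteCase(s string, limit int) []string {
        perms := []string{""}
        for _, r := range s {
            lower, upper := unicode.ToLower(r), unicode.ToUpper(r)
            var next []string
            for _, p := range perms {
                next = append(next, p+string(lower))
                if upper != lower {
                    next = append(next, p+string(upper))
                }
                if len(next) > limit {
                    return nil // too many permutations; fall back to regex
                }
            }
            perms = next
        }
        return perms
    }
    
    func main() {
        fmt.Println(permuteCase("lga", 16))
        // [lga lgA lGa lGA Lga LgA LGa LGA] -- search each one literally
    }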

------
GordonS
Not sure if the OP will see this, but _thanks_! I was looking for an
introduction to building basic FTS just last week, and didn't find anything at
the right level for me - this article was absolutely perfect to understand the
basics.

Not sure if you intend to write more articles, but I'd love to see this turned
into a series where you gradually add different capabilities and techniques
(e.g. using tries).

~~~
dstick
Have a look at ElasticSearch. I was taken aback by how fast it was performing
complicated full text queries in a dataset of 300 million web pages.

It was essentially a matter of scrape - clean (remove HTML) - index (add to
the DB) - search. Easy as that. If you have prior experience with JS and
scraping it should be a day’s work for the basics. The documentation is really
well done and easy to follow and the API is intuitive.

------
softwaredoug
Pretty fun - I also often recommend people explore the earlier versions of
Lucene, still on Sourceforge -

[https://sourceforge.net/projects/lucene/](https://sourceforge.net/projects/lucene/)

~~~
aidos
Agreed. Even configuring Lucene used to be enough to give you a pretty deep
understanding of how it works under the hood. Arguably that's a bad thing, but
I found it very enlightening.

------
Keverw
Pretty neat. I was considering doing something like this to add full-text
search inside some NewSQL-type databases I've been looking at that don't
support full text yet, like CockroachDB or YugabyteDB. I know some people use
another system for search and try to synchronize everything, but personally
that's more things to manage, and I'd rather have a single source of truth.

I really wish I could find a perfect system where, if you got popular, you
wouldn't have to worry about changing the application and could just throw
more servers at it, with incremental backup (I'm sort of disappointed that
some charge extra for this rather than including it in the open source
community version, but I guess you could upgrade once you're making $$$ and
have a large enough dataset that backups are slow), full-text search,
referential integrity (so you can't delete a foreign key that's referenced,
hence no dangling references), etc... I guess you can't get everything you
wish for, but things are clearly better now than they were even a decade ago.
So if you're starting a greenfield project, there are thankfully a lot more
choices.

~~~
jordanlewis
I'd be happy to work with you to make it happen in CockroachDB if you were so
inclined :)

I added array inverted index support a little while ago in this PR:
[https://github.com/cockroachdb/cockroach/pull/45157](https://github.com/cockroachdb/cockroach/pull/45157)

A simple full-text search engine would not be too different at the lower
levels.

~~~
Keverw
Interesting, though a bit over my head. I haven't gotten into Go (I'm into
TypeScript/Node), but there have been some really cool Go projects I've come
across. My idea was to just store an array of strings in JSON and generate a
query by parsing the search query string, probably separating words out too;
that could work to handle things like +required or -required for boolean
search, plus filtering out stop words.
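
Something like this is what I have in mind for the parsing part (a hedged sketch; the +/- prefix convention is just one common choice):

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    // parseQuery splits a user query into required, excluded, and plain terms.
    func parseQuery(q string) (must, mustNot, should []string) {
        for _, tok := range strings.Fields(strings.ToLower(q)) {
            switch {
            case strings.HasPrefix(tok, "+"):
                must = append(must, tok[1:])
            case strings.HasPrefix(tok, "-"):
                mustNot = append(mustNot, tok[1:])
            default:
                should = append(should, tok)
            }
        }
        return
    }
    
    func main() {
        must, mustNot, should := parseQuery("+required -banned ordinary")
        fmt.Println(must, mustNot, should) // [required] [banned] [ordinary]
    }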

I keep changing my mind on things, and I'm still super early in my project. So
far I'm thinking CockroachDB for storing the more transactional stuff
(inventory, accounts, etc.) and then MongoDB for chat, social feeds, logging,
etc.

I guess if I implemented my own search logic this way, I could then give a
consistent search experience regardless of whether what they are searching for
is in the SQL or NoSQL database.

I guess I'd store things in lowercase too. Last I looked, CockroachDB didn't
have PG's CITEXT, but for search lowercase makes sense anyway: someone in
support searching a first name will probably enter it in lowercase, while the
original casing is preserved for display.

------
make3
The next step is usually to use tf-idf or BM25 to rank the matches:
[https://en.m.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.m.wikipedia.org/wiki/Tf%E2%80%93idf)
[https://en.m.wikipedia.org/wiki/Okapi_BM25](https://en.m.wikipedia.org/wiki/Okapi_BM25)
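
For reference, a hedged Go sketch of the BM25 score for a single term/document pair, using the usual k1/b defaults (untuned for any particular corpus):

    package main
    
    import (
        "fmt"
        "math"
    )
    
    // bm25 scores one term against one document.
    //   tf     - occurrences of the term in the document
    //   docLen - document length in tokens
    //   avgLen - average document length in the corpus
    //   n      - total number of documents
    //   df     - number of documents containing the term
    func bm25(tf, docLen, avgLen, n, df float64) float64 {
        const k1, b = 1.2, 0.75
        idf := math.Log((n-df+0.5)/(df+0.5) + 1)
        return idf * (tf * (k1 + 1)) / (tf + k1*(1-b+b*docLen/avgLen))
    }
    
    func main() {
        // A term appearing 3 times in a 100-token doc, in a corpus of
        // 10,000 docs (avg 120 tokens) where 50 docs contain the term.
        fmt.Printf("%.3f\n", bm25(3, 100, 120, 10000, 50))
    }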

~~~
kristopolous
I've always found these things push poor-quality results to the top. The
merits they rank on are redundancy and repetition.

They are the "next thing taught", but they aren't useful.

~~~
boyter
What would you suggest then?

Assuming you only have the term frequency values, I cannot think of any other
algorithm you can apply that gives reasonable results. Of course this breaks
down because it's easily gamed, but unless you are building a public search
engine it should work.

Heck, tools like Lucene/Elasticsearch use tf-idf and BM25, although they can
include vector-space scoring on top of the query as well.

~~~
kristopolous
The next thing should be talking about the different classes of documents and
searches.

Sometimes thesaurus searching is appropriate, sometimes it isn't. Sometimes
removing logical modifiers (ex: not, excluding, etc) is appropriate and other
times, not.

Sometimes verbatim searching is right, sometimes not.

Sometimes approximation searching is right, sometimes it's not.

For instance, I was on Amazon the other day and I was looking for bubble
mailers, 8x12, 50 pcs ... I wasn't able to get amazon to exclude the other
unit and lot sizes ... maybe that's the right answer sometimes. A "better"
search system however, would see the numbers, know the frequency distribution
of those values and present ranges for me to choose.

Google has a focus ambiguity with a similar "one search" approach: it ignores
the more focused words to give you more generalized results, which might be
the right answer sometimes... and sometimes not.

Here's another classification example: on Google Books, when looking for, say,
"constitution" in older texts, OCR will sometimes misidentify it as
"conftitution" because of the long-s in older typography. For this search, I
shouldn't have to manually search both terms to get the results... the search
query in this case should do something different still: swap letters with
common OCR missteps and then aggregate them into a single result.

I ran my own OCR on documents and wrote a small python program to do this
using Damerau–Levenshtein distance. It's not hard, could easily be taught to
undergrads.
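
Mine was in Python, but the variant-generation half of the idea fits in a few lines of Go (a sketch; the confusion table here is a toy, not the one I used):

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    // ocrConfusions maps a character to glyphs OCR commonly mistakes it
    // for. Toy table: the long-s gives "conftitution", plus two classics.
    var ocrConfusions = map[rune][]rune{
        's': {'f'},
        'e': {'c'},
        'l': {'1'},
    }
    
    // ocrVariants returns the term plus every single-substitution variant.
    func ocrVariants(term string) []string {
        variants := []string{term}
        runes := []rune(term)
        for i, r := range runes {
            for _, sub := range ocrConfusions[r] {
                v := make([]rune, len(runes))
                copy(v, runes)
                v[i] = sub
                variants = append(variants, string(v))
            }
        }
        return variants
    }
    
    func main() {
        fmt.Println(strings.Join(ocrVariants("constitution"), " OR "))
        // constitution OR conftitution
    }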

Here's yet another example. Let's say I have a garden, I've found holes in,
say, my basil leaves, and I'm trying to find the bugs that may be causing
them. In this search it would be fine to say "basil is an herb (that's the
class of object to search); it finds that, say, cabbage loopers eat mint,
which is also an herb, and returns those results". This indirection is correct
here, assuming these caterpillars eat both plants (they do).

Now let's say I'm looking for a recipe with my basil plant and it uses the
same indirection logic and returns mint recipes. Or I'm looking for songs by
Toni Basil and I get Toni Thyme and Toni Lavender in the results ... now it's
the wrong decision.

Here's another example: suicide by tylenol overdose is common, and prompt
medical care is extremely important. People often mix up the terms "aspirin"
and "tylenol", and many use them interchangeably; big mistake here.

Pretend someone finds their friend passed out and incorrectly searches the web
for "aspirin overdose". It doesn't look like it is life-threatening, meanwhile
the clock is ticking. The engine should be saying "yo, are you sure it isn't
tylenol? that'll kill you". Getting that right (similar terms, a far more
common event, extremely dangerous) requires several of the systems I spoke of
above combined together. Sometimes right, sometimes wrong.

Anyway, multi-method classifications of documents and search intents, totally
the next step.

~~~
boyter
Got any papers that go through this in terms of pros/cons and such?

While I agree with you, I have to wonder whether a lot of the search problems
people have are easily solvable through the use of tf-idf or BM25, because it
improves on where they are starting from to a level that's often good enough.

~~~
kristopolous
I don't have a doctorate and I'm not an academic. Let me try to be helpful.

IDF is old; most of the work dates from the 1950s-1970s. You have to put
things in context. Digital storage was for surveys and structured documents,
things that could be computed upon. Wasting computers on literature and
opinions wasn't part of the application space. Why would you be storing
newspaper opinion columns on your UNIVAC?

So if we are looking at say, demographic population surveys, essentially
almanac data, then IDF serves you well. If I was searching "Duluth", then the
document concerning the measurements taken in Duluth will have that word in it
frequently.

Academically there's something called a "Zipfian" distribution and some math
to back it up, but I have very little confidence in my mathematical formalism
skills (been working on it for 20 years, still not very good) so excuse me for
skipping over it.

The pros/cons appear in "application" sections when describing a technique.
You can find these on Wikipedia. Here's Wikipedia's category on it:

I think, as a whole, we've understated the human interface problem in search.
There's some technical tools that if are exposed and used can dramatically
improve results.

For instance, a swap operator.

In some of my search systems I permit a syntax:

      s(phrase[0]:...:phrase[n])

That's because sometimes you can have a phrase say "x y z" where "z y x" and
"x z y" are common but have different meanings and you only want a subset.

Then you have polysemy and homonymy... so sometimes you want an exact term,
sometimes you want to fuzzy search, sometimes you want a collection of exact
terms. So we can do exact with, say, "=", a collection with "|", and so on...

      s(=x:y0|y1|y2:z)

Now pretend you also want a range in there:

      s(=x:y0|y1a..y1z|y2:z)

and so on. This syntax is also quite fast to process: you can permute all the
elements without too much effort, look them up, and aggregate the results in
very little time. But we are really starting to get into a regex-like query
system, which means a decent amount of technical knowledge is needed, and
that's where we have the interface problem.
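
A hedged sketch of just the expansion step, handling segments, "|" alternatives, and orderings (my real parser also handled "=" and ranges):

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    // expandSwap takes the body of an s(...) clause, e.g. "x:y0|y1:z",
    // and returns every phrase to look up: all orderings of the ":"
    // segments crossed with every "|" alternative inside each segment.
    func expandSwap(clause string) []string {
        segments := strings.Split(clause, ":")
        var out []string
        var perm func(remaining, order []string)
        perm = func(remaining, order []string) {
            if len(remaining) == 0 {
                out = append(out, cross(order)...)
                return
            }
            for i := range remaining {
                rest := append(append([]string{}, remaining[:i]...), remaining[i+1:]...)
                perm(rest, append(append([]string{}, order...), remaining[i]))
            }
        }
        perm(segments, nil)
        return out
    }
    
    // cross picks one "|" alternative from each segment, in order.
    func cross(segments []string) []string {
        phrases := []string{""}
        for _, seg := range segments {
            var next []string
            for _, alt := range strings.Split(seg, "|") {
                for _, p := range phrases {
                    next = append(next, strings.TrimSpace(p+" "+alt))
                }
            }
            phrases = next
        }
        return phrases
    }
    
    func main() {
        for _, p := range expandSwap("x:y0|y1:z") {
            fmt.Println(p)
        }
        // 3! orderings x 2 alternatives = 12 phrases: "x y0 z", "x y1 z", ...
    }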

These systems are _relatively_ easy to code and run on modest hardware (as in,
something you can easily fit under a desk), but they require more from the
user; a level of sophistication that, frankly, you just aren't going to get,
let's be real here. I've found lots of otherwise competent programmers who
struggle with, say, regex and BPF; I think I have some natural talent there
that is not super common.

This is an example of why I think search is maybe, 50% a human interface
problem.

------
keshavmr
Bleve is an FTS engine written fully in Go:
[https://blevesearch.com/](https://blevesearch.com/)

~~~
mdaniel
And, going for the HN gold, Toshi is a FTS server written in Rust:
[https://github.com/toshi-search/Toshi#readme](https://github.com/toshi-
search/Toshi#readme) (backed by the FTS library [https://github.com/tantivy-
search/tantivy#readme](https://github.com/tantivy-search/tantivy#readme) also
in Rust)

~~~
FridgeSeal
Adding to this, here are a couple more options (also written in Rust):

Meilisearch - [https://meilisearch.com/](https://meilisearch.com/)

Sonic - [https://github.com/valeriansaliou/sonic#readme](https://github.com/valeriansaliou/sonic#readme)

Sonic is slightly different in that it returns the IDs of the documents, which
you then use to fetch them from an external store, but the engineering behind
it is super cool: it uses finite-state transducers to build the tries, which
are then searched.

Edit: updated to correct link

~~~
mdaniel
Thanks for reminding me about MeiliSearch!

I think you meant your 2nd link to be
[https://github.com/valeriansaliou/sonic#readme](https://github.com/valeriansaliou/sonic#readme)

~~~
FridgeSeal
Yes I certainly did, thanks for the catch.

------
euske
I found that the notion of a token is very English-centric. What you really
want to do is convert each sentence into a set of features, be they words,
n-grams, prefixes/suffixes, etc. When searching, the way I implemented it was
to use those features only to narrow the candidates, and then apply a
post-processing step to filter out inexact results.
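
For example, character n-grams as features sidestep word boundaries entirely, which matters for languages that don't use spaces. A minimal Go sketch:

    package main
    
    import "fmt"
    
    // ngrams returns the character n-grams of s -- features that work
    // even for text with no word boundaries.
    func ngrams(s string, n int) []string {
        runes := []rune(s)
        if len(runes) < n {
            return []string{s}
        }
        out := make([]string, 0, len(runes)-n+1)
        for i := 0; i+n <= len(runes); i++ {
            out = append(out, string(runes[i:i+n]))
        }
        return out
    }
    
    func main() {
        fmt.Println(ngrams("全文検索", 2)) // [全文 文検 検索]
    }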

------
known
2 major issues:

1. Google is doing near real-time indexing of ~200,000 websites; it's
expensive for others to do that.

2. Google PageRank has matured over time to prevent
https://en.wikipedia.org/wiki/Google_bombing

------
zemo
What about the opposite? Given a large block of previously unseen text, check
whether it contains a word from a given list of bad words. I'm thinking...
profanity filtering, for example. So a user inputs a page or two of text, and
we check it for profanity.

~~~
crawdog
You could build a bloom filter with the profane words, then check your tokens
for the page against that filter.
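
A hedged sketch in Go using only the standard library (for a real deployment you'd size the filter from the word count and a target false-positive rate, and use more hash functions):

    package main
    
    import (
        "fmt"
        "hash/fnv"
        "strings"
    )
    
    const m = 1 << 16 // bits in the filter
    
    type bloom struct{ bits [m / 8]byte }
    
    // hashes derives k=2 hash values from one FNV-64 pass.
    func hashes(word string) [2]uint32 {
        h := fnv.New64a()
        h.Write([]byte(word))
        sum := h.Sum64()
        return [2]uint32{uint32(sum % m), uint32((sum >> 32) % m)}
    }
    
    func (b *bloom) add(word string) {
        for _, h := range hashes(word) {
            b.bits[h/8] |= 1 << (h % 8)
        }
    }
    
    func (b *bloom) maybeContains(word string) bool {
        for _, h := range hashes(word) {
            if b.bits[h/8]&(1<<(h%8)) == 0 {
                return false
            }
        }
        return true // possibly a false positive
    }
    
    func main() {
        var f bloom
        for _, w := range []string{"darn", "heck"} { // stand-in profanity list
            f.add(w)
        }
        for _, tok := range strings.Fields("well heck that is neat") {
            if f.maybeContains(tok) {
                fmt.Println("flagged:", tok) // confirm against the real list
            }
        }
    }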

~~~
teraflop
That would make sense if your list of expletives was too large to fit in
memory, which would be... impressive, to say the least.

~~~
GordonS
I'm Scottish, and I'll say it's definitely possible :p

------
cdbattags
Now someone do the same in Rust.

~~~
Can_Not
Does this count?
[https://github.com/valeriansaliou/sonic](https://github.com/valeriansaliou/sonic)

------
jrnichols
for some reason this sounds kind of like Gopher

