
Fast Search Using PostgreSQL Trigram Indexes - sytse
https://about.gitlab.com/2016/03/18/fast-search-using-postgresql-trigram-indexes/?
======
alexpeattie
I thought this was a great article, and trigram indexes are definitely an
amazing tool to have in the arsenal.

I think PostgreSQL fulltext search (with tsvector, tsquery etc.) is given an
unfairly bad rap though. I'm not sure that "full text search works best when
the text vectors are stored in physical columns with an index" is true - in my
experience there's no performance penalty to just indexing the tsvector
expression - no need to worry about additional columns or triggers.
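For illustration, the expression-index approach might look like this (a sketch assuming a hypothetical `posts` table with a text `body` column):

```sql
-- No extra column or trigger: index the tsvector expression itself
CREATE INDEX posts_body_fts ON posts
USING gin (to_tsvector('english', body));

-- The query must repeat the identical expression for the planner to use the index
SELECT *
FROM posts
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'trigram');
```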

I also think the assertion that the key problem is that "full text search is
that words are broken up according to the rules defined by the language of the
text" is very context-dependent. In many situations that's the _most_ awesome
feature of fulltext search. Usually when I search for "cat" I'm not interested
in results for catacombs or categories, but when I search for restaurants,
results matching restaurant (singular) are relevant too.

I definitely see that for a use-case like GitLab's, where the data includes
code, full text search's stemming would be a hindrance rather than a help.

~~~
brightball
One of the perks of using columns is that you can set up multiple columns with
different dictionaries and specify which one you're searching against.

I'm not sure how easy it would be to ensure that the index would get picked up
if searching on a weighted field with multiple columns specified or if it
would require that the program know in advance what those columns were. Would
the order matter for the index to be used?

Additionally, a column can also easily include relevant search data from
related records in other tables, or even other data sources, if desired.

~~~
alexpeattie
There's nothing stopping you indexing multiple columns with different
dictionaries either :). And indexing works beautifully with multiple weighted
columns.

The related records point is true though - indexes don't help if you're trying
to implement multi-table search. Although then I'd argue a materialized view
(refreshed with triggers) might be better than putting more columns into one
of the data tables.
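A sketch of that materialized-view idea, assuming hypothetical `issues` and `comments` tables:

```sql
-- Hypothetical example: one searchable document per issue, comments folded in
CREATE MATERIALIZED VIEW search_index AS
SELECT i.id AS issue_id,
       to_tsvector('simple',
         i.title || ' ' || coalesce(string_agg(c.body, ' '), '')) AS doc
FROM issues i
LEFT JOIN comments c ON c.issue_id = i.id
GROUP BY i.id, i.title;

CREATE UNIQUE INDEX search_index_id ON search_index (issue_id);
CREATE INDEX search_index_doc ON search_index USING gin (doc);

-- Triggers on issues/comments can then kick off a refresh;
-- CONCURRENTLY requires the unique index above
-- REFRESH MATERIALIZED VIEW CONCURRENTLY search_index;
```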

~~~
brightball
I will have to tinker with that then...

------
mbenjaminsmith
Trigrams are a very straightforward solution to this, but they have one major
limitation: they break with misspellings / alternative spellings. For example,
the words "vodka" and "votka" share no trigrams yet are obviously very
similar.
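This is easy to verify with plain, unpadded word trigrams (note that pg_trgm also pads words with spaces at the boundaries, which gives the pair a small nonzero similarity, but the point stands):

```python
def trigrams(word):
    """Return the set of 3-character substrings of a word (no padding)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

# "vodka" -> {'vod', 'odk', 'dka'}; "votka" -> {'vot', 'otk', 'tka'}
shared = trigrams("vodka") & trigrams("votka")
print(shared)  # set() -- no trigrams in common
```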

I recently worked on this problem for a project and came up with a more robust
solution using pre-computed Levenshtein indexes. You've probably heard of
Levenshtein distance as a measure of word similarity. It counts the number of
single-character edits that are required to transform one string into another.
For example "vodka" and "votka" have a Levenshtein distance of 1.

Using a related approach (single character edits) it's possible to build a so-
called Levenshtein index.

The major issue with a vanilla Levenshtein index is that it's impractically
large. There is a way to make a space / time trade-off however that makes it
much smaller. The basic idea is that instead of generating all Levenshtein
variations for a word you only use the set of variations created by deleting
single characters for a word.

Since that doesn't match all possible variations we have to compensate
somehow. For example with a full index "votka" would be in the set of
variations for "vodka" which means if we had misspelled vodka when searching
for it we'd still get a match. With our deletions-only index, however, we'd
get "voka" as our closest match. To work around this, we generate the
deletion-only variations of our search term as well. In this case our search
term "vodka" would become "odka", "vdka", "voka", etc. and each of those terms
would be matched against the index. That means our query is slower overall as
it contains a number of sub-queries but in practice it's more than fast
enough.
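The deletions-only scheme described above is essentially the SymSpell approach. A minimal sketch (hypothetical names, single-character deletions only):

```python
from collections import defaultdict

def deletes(word):
    """The word itself plus every variant with exactly one character deleted."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(words):
    """Map each deletion variant back to the dictionary words producing it."""
    index = defaultdict(set)
    for w in words:
        for d in deletes(w):
            index[d].add(w)
    return index

def lookup(index, query):
    """Match the query's own deletion variants against the index."""
    matches = set()
    for d in deletes(query):
        matches |= index.get(d, set())
    return matches

index = build_index(["vodka", "whisky", "gin"])
# "votka" and "vodka" meet at the shared deletion variant "voka"
print(lookup(index, "votka"))  # {'vodka'}
```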

I wrote a meandering blog post about it here:
[http://lattejed.com/writing-an-embedded-full-text-search-engine](http://lattejed.com/writing-an-embedded-full-text-search-engine)

There are some additional ideas in there, such as how to make a Levenshtein
index handle prefix matches better (e.g., for incremental search), that may be
helpful if you need to implement something like this.

If you do read that I'd like to point out that I no longer recommend LevelDB.
That doesn't change anything of importance in the post though.

~~~
TheLogothete
Or, you know, use a search engine. There are countless little things like that
which out-of-the-box search in DBs will never catch up with.

~~~
mbenjaminsmith
The resulting product is fast, easy to tune (for my particular application)
and handles every edge case and language I've thrown at it.

Do you have any specific concern you'd like to share or are you just generally
dismissive of other people's hard work?

~~~
TheLogothete
I'm dismissive of pointless hard work.

------
simonw
A project I worked on used trigram indexes a few years ago to solve the
autocomplete problem. They are AMAZING - they basically allow you to run LIKE
queries with wildcards anywhere in the string (as opposed to suffix-only-
wildcards) against an index.

An autocomplete search for e.g. "rub rai" becomes the following SQL query:

    
    
        select * from topics where name ilike '%rub%rai%';
    

Which, thanks to the magic of trigram indexes, returns in just a few ms, even
against hundreds of thousands of rows. Without trigram indexes, the same query
would be a full scan and would be too slow to justify hooking up to a search-
as-you-type UI.

    
    
        create extension pg_trgm;
        create index topic_name_gin on topics using gin (name gin_trgm_ops);

------
YorickPeterse
Hi! Author in question here, if anybody would like to know more about this
particular feature I'll be happy to answer any questions. The changes
discussed in this article are currently available on GitLab.com, 8.6 will be
released on the 22nd as usual.

~~~
koolba
Why bother supporting both MySQL and Postgres for the backend? Wouldn't
picking one ( _~cough~ Postgres_ ) simplify things?

From my experience, attempting to support multiple backend stores leads to
either crippled functionality, i.e. lowest common denominator CRUD usage, or
lots of messy if/then logic to use DB specific features.

~~~
YorickPeterse
Personally I'd love to drop support for MySQL but we have a few too many
organizations that only run MySQL _and_ want to use GitLab. It would be a
waste if said organizations wouldn't be able to use GitLab.

The approach I tend to go with is to make queries perform as well as possible
on PostgreSQL without using too many PostgreSQL-specific bits (which would
make supporting both DBs more difficult), while making sure they still perform
well enough on MySQL.

------
beachstartup
ah yes, the old 're-implement the search engine inside the database' project,
undoubtedly put up on the board because someone is tired of their get-the-
data-from-the-database-to-the-search-engine process breaking constantly.

next on the map: discovering how shitty dictionary management is, the joys of
NLP, and abandoning the project entirely because you realize all this stuff
has already been solved in 3 or 4 different ways and the getting-the-data-
from-the-database-to-the-search-engine process isn't really that bad, and
pulling your hair out from customers asking insane questions because they
don't understand how search engines actually work and why can't this be like
google? can't you just do it how google does it, even though you don't have
$100B and 50,000 employees?

i realize this is a product feature but i have ptsd on this topic so i had to
vent.

~~~
bazzargh
_the joys of NLP_ ... this is a search index for Gitlab, that's used primarily
to store _code_ , not natural language.

Looking at its competitor, Github's search engine clearly has a base in NLP,
and ditches many punctuation characters
([https://help.github.com/articles/searching-
code/](https://help.github.com/articles/searching-code/)) - which are _way_
more important in code than they are in English - for example, I can't search
for code containing "$/" when it should have "\Z/" to match the end of a
string. So, right now I have ~250 repos checked out from our Github Enterprise
so I can search them offline.

The queries I actually want to run are, pretty much, regexes. Those can be
accelerated by trigram indexes like the one gitlab are using (see e.g.
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)),
but apparently this isn't implemented by, say, Elasticsearch ( _Whereas prefix
matching can be made more efficient by preparing your data at index time,
wildcard and regular expression matching can be done only at query time._
[https://www.elastic.co/guide/en/elasticsearch/guide/current/...](https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html))

I'd agree with you if this wasn't code. I've seen terrible text search in the
DB, and I've built document management systems where we integrated real search
engines. But those horses aren't for this course.

~~~
beachstartup
sure, primarily code. except when code contains language, like in comments,
and other related documentation that's stored in-repo. and of course, people
are going to completely mis-use the repo to store ancillary human-readable
binary documents, which will need to be converted, indexed, searched, updated,
purged, etc, etc...

since developers tag, reference, and document their code with comments,
they're going to expect the full suite of NLP treatments with all indexed
content, including code.

which basically means: you'll have to tokenize out the special characters
for NLP, but retain them for literal code searches, thereby increasing the
size of the inverted index for every permutation of the desired search
criteria.

maybe you can detect and filter out all the comments, and index them
separately, or maybe have some kind of dual system where the NL indexing
system exists separately from the code indexing system ... you see where this
is going?

~~~
bazzargh
_they're going to expect the full suite of NLP treatments with all indexed
content, including code._ Nope. In 30 years of searching through code, I've
never once felt the need for that. On the other hand, almost every time I use
Github search, I find the limitations of a word-oriented index get in the way
(looking for partial keywords to find code relevant to an API, etc.); it's not
just special characters.

------
elchief
"Another problem with full text search is that words are broken up according
to the rules defined by the language of the text"

You can use to_tsvector('simple', column_name) if you don't want a language-
specific parse, or you can make your own.

"This means that searching for “yorick” or “peterse” will match the data, but
searching for “yor” will not"

You can use prefix searching:

    
    
      select to_tsvector('simple', 'Yorick Peterse') @@ to_tsquery('simple', 'yor:*') --true
    

Heck you can even do suffix searching:

    
    
      select to_tsvector('simple', reverse('Yorick Peterse')) @@ to_tsquery('simple', reverse('rick') || ':*')
    

The real problem with PostgreSQL's text search is that it doesn't support
BM25 or TF-IDF ranking out of the box.
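For contrast, what TF-IDF ranking involves can be sketched in a few lines (a toy illustration only, not how any database implements it; `ts_rank` weighs term frequency within a document but has no corpus-wide inverse-document-frequency component):

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` in `doc`: term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)                 # how common in this doc
    df = sum(1 for d in corpus if term in d)        # how many docs contain it
    idf = math.log(len(corpus) / df) if df else 0.0 # rare terms score higher
    return tf * idf

corpus = [["vodka", "martini"], ["vodka", "tonic"], ["gin", "tonic"]]
# "martini" appears in only one document, so it outranks the common "vodka"
score = tf_idf("martini", corpus[0], corpus)
```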

------
supergreg
IIRC Google also used trigrams for their Code search engine. This allowed them
to make a first filter before running a regex search on the resulting
documents.
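That two-stage design (trigram prefilter, then the real match over the surviving documents) can be sketched as follows; a toy Python version, assuming plain literal queries rather than full regexes:

```python
import re

def trigrams(s):
    """Unpadded 3-character substrings of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    """Inverted index: trigram -> set of doc ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for t in trigrams(text):
            index.setdefault(t, set()).add(doc_id)
    return index

def search(docs, index, literal):
    """Narrow candidates by trigram intersection, then confirm with a regex."""
    candidates = set(range(len(docs)))
    for t in trigrams(literal):
        candidates &= index.get(t, set())
    pattern = re.compile(re.escape(literal))
    return [i for i in sorted(candidates) if pattern.search(docs[i])]

docs = ["def search():", "class Search:", "searching code"]
print(search(docs, build_index(docs), "search"))  # [0, 2]
```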

Does gitlab support regex searches?

~~~
YorickPeterse
> Does gitlab support regex searches?

No, not at the moment.

------
eveningcoffee
So how well do trigram indexes scale, in terms of both index build time and
query speed?

------
mkarklins
Off topic: What is the motivation for GitLab to support two DBMS?

~~~
jobvandervoort
See my reply here:
[https://news.ycombinator.com/item?id=11315426](https://news.ycombinator.com/item?id=11315426)

