Hacker News new | comments | ask | show | jobs | submit login

To get optimum performance with PostgreSQL full text search you need to create a column to store the tsvector values with an index on it. Then you need to fill this column with the tsv values and create a trigger to update the field on INSERT and UPDATE.

I've been playing around with full text search in Postgres, and I took this sort of approach when starting out, but then realized you I could just have the index be an expression.

So instead of (per the example)

  CREATE INDEX tsv_idx ON documents USING gin(tsv);
doing something like

  CREATE INDEX tsv_idx ON documents USING gin(to_tsvector('english', text));
Is there any reason you wouldn't do this? For multi-languange you you'd have to detect the language of the text, but there's no reason you couldn't parameterize that too.

You can precompute to_tsvector() in parallel ahead of time if you're storing it in a dedicated column. CREATE INDEX runs on a single thread, including the part where it evaluates to_tsvector() for each row. If you ever need to recreate the index, it'll be faster if the tsvector is in a dedicated column.

I have a table with 30 million documents using pgsql's full text index. Creating the index takes ages, and search performance is generally very poor. The difference between creating the index with the precomputed column versus creating the index with the expression in the index itself (which is how I originally did it) was substantial.

No, there's no real reason why you would not use a functional index. The reason so many tutorials use a dedicated column is simply lack of up-to-date information, I think.

That is the way I do it also. I have used Lucene and Solr a lot in the past, but I now find Postgres text search to be more than adequate for my needs and it does make software development and deployment simpler.

You're not including the title from the JSON column or setting weights in your alternate version. I'm not sure if that can be included in the index or not, but that might be the reasoning.

If you want to preprocess your document, or aggregate different parts of it with different labels and weights, it can be helpful to store it in a separate column.

Of course you can always just index your preprocessing/aggregating function and call it every time you want to search, but depending on how expensive that is, it might be in your interest to do it upfront and make your searches a bit quicker.

It has the same effect as the UPDATE/INSERT trigger they use as far as I'm aware, without the additional storage.

With triggers you can fetch and index data from relations.

This is the main reason for me. I have to be able to quickly search data that's stored in a couple different tables.

Using an index means the read queries become a little more complex (they have to exactly match the index expression).

relation == table

relationship == foreign key

Did i phrase it wrong? You can fetch and index data from related tables.

I think he simply tried to clarify the difference to people who don't understand jargon.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact