
Full text search in milliseconds with PostgreSQL - maxnov
https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/
======
bigbento
_To get optimum performance with PostgreSQL full text search you need to
create a column to store the tsvector values with an index on it. Then you
need to fill this column with the tsv values and create a trigger to update
the field on INSERT and UPDATE._
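
(For reference, the setup the article describes looks roughly like this;
table and column names are illustrative, and the trigger uses Postgres'
built-in tsvector_update_trigger helper:)

      ALTER TABLE documents ADD COLUMN tsv tsvector;
      UPDATE documents SET tsv = to_tsvector('english', text);
      CREATE INDEX tsv_idx ON documents USING gin(tsv);
      CREATE TRIGGER tsv_update BEFORE INSERT OR UPDATE ON documents
        FOR EACH ROW EXECUTE PROCEDURE
        tsvector_update_trigger(tsv, 'pg_catalog.english', text);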

I've been playing around with full text search in Postgres, and I took this
sort of approach when starting out, but then realized I could just have the
index be an expression.

So instead of (per the example)

      CREATE INDEX tsv_idx ON documents USING gin(tsv);

doing something like

      CREATE INDEX tsv_idx ON documents USING gin(to_tsvector('english', text));

Is there any reason you _wouldn't_ do this? For multi-language you'd have to
detect the language of the text, but there's no reason you couldn't
parameterize that too.
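
For what it's worth, a minimal sketch of that parameterized variant, assuming
a lang column of type regconfig (name illustrative); the two-argument form of
to_tsvector is immutable, so it can be indexed directly:

      CREATE INDEX tsv_lang_idx ON documents USING gin(to_tsvector(lang, text));

The caveat from the Postgres docs is that queries then have to spell out the
same two-argument expression for the index to be used.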

~~~
fleetfox
With triggers you can fetch and index data from relations.
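
For example (a hypothetical sketch; the authors table and author_id column
are made up), an expression index can only see columns of its own table, but
a trigger function can pull in related data:

      CREATE FUNCTION documents_tsv_trigger() RETURNS trigger AS $$
      BEGIN
        NEW.tsv := to_tsvector('english', NEW.text || ' ' ||
          coalesce((SELECT a.name FROM authors a WHERE a.id = NEW.author_id), ''));
        RETURN NEW;
      END
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER tsv_update BEFORE INSERT OR UPDATE ON documents
        FOR EACH ROW EXECUTE PROCEDURE documents_tsv_trigger();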

~~~
pgaddict
relation == table

relationship == foreign key

~~~
fleetfox
Did I phrase it wrong? You can fetch and index data from related tables.

~~~
spoiler
I think he was simply trying to clarify the difference for people who don't
know the jargon.

------
buro9
> This query takes about 50ms! We experimented with returning the document
> text with results and found that if we returned the full text of documents,
> it added about 350ms to the query time which seems to be network overhead
> rather than a slower query. If we only returned 200 characters from the
> texts then it only added around 100ms.

The way you have written the query, it has to extract the title from all of
the JSON documents before applying the LIMIT.

Perhaps do the work to get the identifiers, order by rank, and apply the
limit... all in a subquery, so that only the outer query actually touches the
JSON documents and extracts the titles.

Try it and report back on the speed improvement.

~~~
maxnov
Thanks for the feedback! I tried out your suggestion by running the following:

      SELECT id, meta->>'title' as title, meta FROM (
        SELECT id, meta FROM (
          SELECT id, meta, tsv
          FROM data_rows, plainto_tsquery('search query') AS q
          WHERE (tsv @@ q)
        ) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('search query')) DESC LIMIT 5
      ) AS t2;

Please let me know if this isn't what you meant. The above did improve the
speed, but only a little. I don't think this approach is strictly necessary
for our use case, but if we had a larger data set it'd be worth implementing.

~~~
buro9
I hate writing SQL blind, but more like this... note that you fetch _only_ the
identifiers with the inner query, and use them in the outer query to retrieve
the documents.

        SELECT d.id
              ,d.meta->>'title' as title
              ,d.meta
          FROM (
                   -- Only select the identifier column
                   SELECT id
                     FROM data_rows
                         ,plainto_tsquery('search query') AS q
                    WHERE tsv @@ q
                    ORDER BY ts_rank_cd(tsv, q) DESC
                    LIMIT 5
               ) AS ids
         -- Join back onto the data_rows, this fetches the documents
         INNER JOIN data_rows d ON d.id = ids.id;

You'll want to go through the explain plan, but treat identifier retrieval as
a step by itself. Things like ts_headline (not shown) or meta->>'title'
(shown) should only be performed for the few rows you are actually
retrieving, and you should keep the data you pass around as small as possible
at all points.

Note that you don't need to call plainto_tsquery twice, you can use the one
you've already declared.

And I notice you are returning the JSON document whole; have you checked
whether it's faster to extract the title here or later, in whatever your
programming language is?

The giveaway clue in your post, btw, is this: "We experimented with returning
the document text with results and found that if we returned the full text of
documents, it added about 350ms to the query time which seems to be network
overhead rather than a slower query."

As soon as you notice a linear slowdown when you're only ever returning 5
rows, then internally the query must be doing something that touches _all_
rows.

PostgreSQL fulltext scales nicely; I use it on a platform driving over 300 web
sites with tens of millions of comments. I originally used it thinking it
would buy me time before I had to add a dedicated search engine, and it helps
now as I store permissions knowledge in the database too... but the truth is,
it just hasn't been an issue yet. I don't need to consider anything else as
it's performing great. The slowest part of my query isn't fulltext search,
it's my permissions stuff (because I've gone down the rabbit hole of users,
groups, roles, etc, and this is calculated for all content), but still the
95th percentile for searches is typically under 10ms.

~~~
maxnov
Just wanted to say, I implemented this change and it really did speed things
up! Thanks a bunch :) This is the query I went with:

      SELECT d.id, d.meta FROM (
        SELECT id FROM data_rows, to_tsquery('query') AS q
        WHERE tsv @@ q ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 5
      ) AS ids INNER JOIN data_rows d ON d.id = ids.id;

You can see it in action here on the top right:
[https://lateral.io/visualiser/ted](https://lateral.io/visualiser/ted)

------
loopbit
Sure, if you only need full text search there's no need to use Solr or
ElasticSearch.

I've never tried it with PostgreSQL, but I noticed the same drop in
performance in MySQL at the 1-2 million document count, although full text
search queries in MySQL are slower (from what I remember; I haven't tested in
a couple of years).

The power of Solr and the other tools out there is all the other stuff you can
do, like synonym processing, faceting, range queries, etc, etc... as well as
the extra flexibility in what data you load and how you process it. But if you
don't need that, there's no reason to use a different tool.

~~~
s3th
There's no question that Solr and ElasticSearch have more capability than
PostgreSQL alone, but you'd be surprised at what Postgres can do!

It's certainly more than just full text search. Check out this article [1]
that talks about Postgres' support for stemming, ranking / boost, multiple
languages, fuzzy search, and accent support.

[1] [http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/](http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/)

~~~
loopbit
You are right, I had assumed that its capabilities were similar to MySQL's
(don't ask me why).

Still, I would describe most of those features as 'basic', as you really need
some support for all that to use full-text search in real life, and I wonder
how easy it is to add/remove stopwords or make other tweaks like that.
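
For reference, a rough sketch of what that looks like in Postgres: copy the
built-in english configuration and point it at a snowball dictionary with a
custom stopword file (the my_english names are illustrative; the .stop file
goes in Postgres' tsearch_data directory):

      CREATE TEXT SEARCH DICTIONARY my_english_dict (
        TEMPLATE = snowball, Language = english, StopWords = my_english
      );
      CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);
      ALTER TEXT SEARCH CONFIGURATION my_english
        ALTER MAPPING FOR asciiword, word WITH my_english_dict;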

Anyway, thank you very much for the link, very interesting! I've been looking
at PostGIS as an option for a project I'm working on, and using PostgreSQL as
the main db is gaining more and more points... Thanks!

------
praveenster
"The accuracy of PostgreSQL searches is not the best. In the examples that
follow I’ve explained how to weight titles over text, but we found a few cases
where typing the exact title of a document wouldn’t return that document as
the first result."

Isn't this a concern, as the main objective of search is to provide accurate
results?
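
(For context, the title-over-text weighting the article refers to is done
with setweight when building the tsvector; a minimal sketch with illustrative
column names:)

      UPDATE documents SET tsv =
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(text, '')), 'D');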

~~~
coldtea
> _Isn 't this a concern as the main objective of search is to provide
> accurate results?_

If it provides 99% of the results and misses some documents because of some
weird bug or encoding issue, then it could very well be good enough for their
purposes. Heck, even 90% could be good enough depending on what they do (e.g.
serving articles on an online site).

For other uses, like police or medical records, they'd obviously need 100%
accurate results.

~~~
brazzledazzle
I know "good enough" is probably not just a good idea at a startup; it's
possibly mandatory, since there's only so much time and money. But as a
user/consumer/customer/target demographic I can't begin to describe how much I
disdain knowing that something exists on a site but being unable to find it
using search, particularly when I know the exact title. Reddit's search
several years ago was quite bad and left a sour taste in my mouth.

~~~
rspeer
I'm already cringing about people in this thread talking about "language
detection" and "stemming" as if there are good, easy solutions to them.

Take your favorite language detector, like cld2. Apply it to some real-world
language, like random posts on Twitter. Did it detect the languages correctly?
Welp, there goes that idea.

(Tweets are too short, you say? Tough. Search queries are _shorter_. You
probably aren't lucky enough for your domain's text to be complete articles
from the Wall Street Journal, which is what the typical NLP algorithm was
trained on.)

Stemming will always be difficult and subtle. It's useful but it isn't even
linguistically well-defined, so you'll have to tweak it a lot. If stemming
seems easy, you haven't looked at where it goes wrong for your use case.
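
(A quick way to check what Postgres' stemmer actually does is ts_lexize; the
classic conflation example, with the output expected from the built-in
snowball dictionary noted in comments:)

      SELECT ts_lexize('english_stem', 'university');  -- expected: {univers}
      SELECT ts_lexize('english_stem', 'universe');    -- expected: {univers}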

------
sbilstein
Nice writeup, I wrote a similar blog post a few years ago:
[http://tech.pro/tutorial/1142/building-faceted-search-with-postgresql](http://tech.pro/tutorial/1142/building-faceted-search-with-postgresql)

Mine covers tags and other aspects of faceted search as well.

We were able to build a really simple search facility for our marketplace
using Postgres in a very short period of time and for cheap. It was incredibly
helpful for a small startup short on people and resources.

~~~
troels
Curious - why not Sphinx or Solr?

~~~
sbilstein
In this case we wanted to ship the marketplace as quickly as possible (we had
some related promotions and partnerships that required it in a specific
timeframe) with a tiny team.

Solr/Lucene is the right answer for a full-blown solution, but it incurs the
cost of having to maintain a second service besides your webapp, as well as
learning and gaining expertise in a second system. All things that can and
should be done assuming you see growth, but as a quick solution that will work
for several months, you can knock out a reasonable search in a couple of weeks
with tech your app is already using.

------
krastanov
Naive question from a person with little DB experience: Why store the metadata
in JSON inside the database instead of in multiple columns in the same table?
Is this not the point of using a database?

~~~
devmach
If I'm not mistaken: altering a table can be costly and may cause performance
issues, especially if you have millions of records. Rows will most likely be
locked while they are updated, and during this process other threads have to
wait until the related row becomes "free".

~~~
ddorian43
That only happens if you add a non-nullable column.
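
(Roughly, on the Postgres versions current at the time: adding a nullable
column with no default is a catalog-only change, while a NOT NULL column
needs a default for existing rows and forces a full table rewrite:)

      ALTER TABLE data_rows ADD COLUMN notes text;                   -- catalog-only, effectively instant
      ALTER TABLE data_rows ADD COLUMN flags int NOT NULL DEFAULT 0; -- rewrites the whole table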

------
chdir
> The main arguments against using PostgreSQL search are accuracy and speed at
> scale.

Also, lack of faceted search, phrase search, and partial word search (although
you could use pg_trgm).
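
A rough sketch of the pg_trgm workaround for partial words (index and table
names are illustrative):

      CREATE EXTENSION IF NOT EXISTS pg_trgm;
      CREATE INDEX documents_text_trgm_idx ON documents
        USING gin (text gin_trgm_ops);
      SELECT id FROM documents WHERE text ILIKE '%postgre%';  -- uses the trigram index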

Previous 2 related discussions:

- [https://news.ycombinator.com/item?id=8381748](https://news.ycombinator.com/item?id=8381748)
- [https://news.ycombinator.com/item?id=8714477](https://news.ycombinator.com/item?id=8714477)

------
danneu
I tried using Postgres 9.4 full-text search for my forum, but ran into some
"column wider than full-text index" error (I don't remember the exact issue).
I wasn't running into any of these:
[http://www.postgresql.org/docs/8.3/static/textsearch-limitations.html](http://www.postgresql.org/docs/8.3/static/textsearch-limitations.html)

Googling the error led me to a mailing list where someone was "working on it"
or "going to revert the problem soon".

I ended up using AWS CloudSearch, which costs a minimum of $50/mo, when I was
already paying for Postgres.

------
krychu
> This query takes about 50ms!

Would be useful to know how many rows you had in the table (?)

~~~
maxnov
The table that we run this on has 220K rows with a size on disk of 1.8 GB. I
mentioned in the post that we'd tried out this approach with 23 million rows,
which was slow! From our (limited) testing, the speed holds up until around 1
to 2 million rows, at which point I'd suggest a dedicated solution.

~~~
krychu
Thanks, that's useful to know. I thought that your issues with speed in the
past were due to using a different approach to searching in PostgreSQL than
the one you described in the post.

I wonder if there are any specific limitations to search performance in
PostgreSQL stemming from its underlying data model and structures, or whether
it could reach similar performance and scale as, say, Solr (or Lucene for
that matter) without too much of an overhaul. Just thinking single node here.

------
qxmat
50ms! Wow... take it from me, you should have used Solr.

------
curiously
PostgreSQL is coming up a lot and making waves; I never thought much about it
until recently, when it turned out to be more reliable than MongoDB. An
interesting phenomenon.

------
CHY872
Seems sketchy.

While Postgres _can_ do full text search and it _may_ help you in your case,
you can set up Elasticsearch for this sort of quantity of data in like an
afternoon, if that.

Also, these 'advanced features' people are throwing around are really trivial;
they're the sort of things that would crop up in lecture 2 of an information
retrieval course. Good search is really reliant on proper weighting etc (not
like 'we weight the title more highly than the body' but more 'foogle is a far
more important word than bar, if we have both in a query, we care about foogle
more'). This generally requires a tonne of experimentation; information
retrieval is not easy and it's very subtle.

Postgres seems to try and make everyone design their own ranking function,
which is a recipe for disaster. I haven't used Elasticsearch for a while, but
I'd be surprised if they didn't have a bunch of relatively good presets.

Postgres is good for loads of things, but Elasticsearch or Solr are state-of-
the-art and Postgres isn't (so far as I know); if you aren't just hacking
around then _please_ consider the extra time to set up Elasticsearch and it'll
make your life easier.

~~~
tonyarkles
Looking at elasticsearch, the first thing I see is a pom.xml. Seems sketchy.
;)

Seriously though, none of my servers have a JVM installed. Most of them have
nginx and ruby on them. And all of them talk to a handful of postgres servers.
If a client wants to add some kind of full-text search to an app, this is a
great approach to doing it in a cost-effective way. I don't particularly want
to have to add more infrastructure just to be able to search through data.
That's an additional monthly cost, along with a new environment to monitor for
critical security updates etc.

If someone needs high performance search, sure, I'd cost out elasticsearch or
solr. If they just need something basic, postgres full-text looks like the
perfect low-effort solution.

