
Postgres full text search is good enough - chdir
http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/#1
======
birken
Well, the article is right in one sense: Postgres full text search is probably
better than you might expect, and it is very usable at non-massive scale.

However, I still wouldn't recommend using it. The queries and indexes are all
unintuitive, complex, not very readable, and require learning a bunch of new
syntax anyway (as you can see in the examples in the blog post). So take the
effort you would spend doing that and instead learn something with more
features that will keep running no matter how large your scale gets.

While the blog mentions elasticsearch/solr (which are awesome and powerful),
it doesn't mention sphinx
([http://sphinxsearch.com/](http://sphinxsearch.com/)). If I were trying to
make a simple search engine for my postgres/mysql powered site, I'd use
sphinx. The indexer can be fed directly with a postgres/mysql query, and it
includes simple APIs for all languages to interface with the searcher daemon.
And of course it is super fast, has way more features and can scale larger
than you'd ever need (sphinx powers craigslist's search).

~~~
mrweasel
One interesting thing we learned from building an online store: people use
search a lot less than you would think.

We started to notice that customers will use the menu and filtering options,
rather than searching for an item. Need something from the "electronics"
category, click the electronics menu tab, and pick the subcategory. From there
customers browse or filter until they find what they're looking for.

Of course some do use the search, and most people would get better results if
they did search directly.

My point is: maybe we could have just used PostgreSQL's full text search in
this case. The queries will be rather simple anyway. The new syntax won't
matter much; you need to learn something new anyway if you use Solr or
ElasticSearch. Search is far less important than people think; it just needs
to be there and be "good enough".

Yes, Sphinx is awesome if you're doing plain searches, but filtering can get
complicated pretty quickly.

~~~
chdir
The menu & filtering options can benefit a lot from facets (or aggregations in
ElasticSearch for more advanced options).

~~~
mrweasel
That's actually what we use. I should have been clearer: it's the full-text
search that doesn't matter too much.

------
rachbelaid
As the author, it's perhaps worth explaining the goals of this post.

The first goal is about using your current architecture to solve small/medium
search needs, helping your project introduce search without adding components
to your current architecture. It's not a solution in the long run if your
application revolves around search. I'm hoping that with the introduction of
JSONB in 9.4, the idea of using Postgres as a document store will draw more
attention to the search features, which can still be improved. But for an out-
of-the-box feature, I think the Postgres community did an amazing job. You can
get something working with Postgres in a matter of hours if you are already
using it (and are comfortable with PG), but if you have never used
SOLR/ElasticSearch it will take you much longer to introduce it into your
project and get started (ops, document sync, getting familiar with the query
language, ...).

The second goal is about introducing full-text search concepts. The post tries
to guide the reader from nothing to a quite decent full text search in
English/French (I cannot give feedback on other languages).

The third goal, which is probably the least clear, is that people are
sometimes still using MySQL search, which is IMHO a horrible search solution.
I think this happens because some web frameworks like Django provide easy
access to MATCH in the ORM. In this context, the post also aims to give MySQL
users some insight into what can be done with Postgres and make them more
aware of its features.
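
To make that concrete, here is a minimal sketch of the kind of query the post
starts from (the `post` table with `title` and `content` columns follows the
post's example schema; the search terms are just placeholders):

    SELECT title
    FROM post
    WHERE to_tsvector('english', title || ' ' || content)
          @@ to_tsquery('english', 'postgres & search');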

If you are interested in the topic, I suggest you have a look at the posts by
Tim van der Linden, who did an amazing job of going into more detail on the
subject.

[http://shisaa.jp/postset/postgresql-full-text-search-part-1.html](http://shisaa.jp/postset/postgresql-full-text-search-part-1.html)
[http://shisaa.jp/postset/postgresql-full-text-search-part-2.html](http://shisaa.jp/postset/postgresql-full-text-search-part-2.html)
[http://shisaa.jp/postset/postgresql-full-text-search-part-3.html](http://shisaa.jp/postset/postgresql-full-text-search-part-3.html)

Postgres full-text search is not a silver bullet, but depending on your needs,
it may be good enough ;)

~~~
NDizzle
Hey, thanks for the article. I'm still going through it now. Would it work for
my scenario? I'm full text indexing large legal documents - 50 to even 300
pages. What I've been doing is just throwing the entire document (OCR output)
into a big MySQL (MariaDB 10) LONGTEXT column. Then I'm using a JDBC river to
shoot those over to Elasticsearch.

I don't fully understand elasticsearch, because like you said it's insane. The
query that I build to perform a simple search (think a google-like search box
with a go button) is about 100 lines.

Should I try out postgres full text search for my purposes? I guess I could
change to indexing page by page, and then I could group on the document hash
to stamp out duplicates. That would also get me a list of page numbers where
the results were. I'm not sure if I could get the snippets though. (What I
call a snippet is the words surrounding the match - I'm returning that to
datatables so that the user can search within their search to further refine
it.)

Here's what my search (and results) looks like now:
[http://imgur.com/Fjckva6](http://imgur.com/Fjckva6) As you can tell, I'm not
too familiar with elasticsearch (A shameful offset input!) and I'm more
comfortable in a RDBMS.

~~~
rachbelaid
Which languages? English only? Any kind of fuzzy search required?

I don't have experience with big data sets, but you may want to read about
GiST/GIN to pick the right index in your case (probably GiST). From what I saw
when I prepared this post, some people are getting decent performance with
full-text search on a dataset the size of Wikipedia.

[https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf](https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf)

I think it doesn't take long to try Postgres FTS if you are already using
Postgres, but it may require more investment if you need to move from MySQL to
PG.
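
For reference, a rough sketch of materializing a tsvector column and indexing
it (table and column names here are hypothetical; whether GIN or GiST fits
better depends on your update and query patterns):

    ALTER TABLE document ADD COLUMN search_vector tsvector;
    UPDATE document
       SET search_vector = to_tsvector('english', coalesce(body, ''));
    CREATE INDEX document_search_idx ON document USING gin (search_vector);
    -- or USING gist (search_vector): cheaper to update, slower to query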

~~~
NDizzle
Thanks for the reply. English only. The only fuzziness I've told my users
about is "phrase here"~20, which searches for the given words within 20 words
of each other. That helps with the imperfect OCR'd documents. A lot of
the stuff we're indexing is best-guess OCR so it's nowhere near perfect. They
are boring agreement documents that people have scribbled all over (setting up
the redactions) so fuzzy is somewhat interesting to have. But it's not fuzzy
in the sense of a typo, it's fuzzy in the sense that it's OCR'd garbage.

------
pilif
While it's certainly good enough for english bodies, its half-assed support
for compounds is a problem for bodies in languages that use compounds (German
for example).

There is sorta-support for compounds, but only really for ispell dictionaries,
and the ispell format isn't very good at dealing with compounds (you have to
declare all permutations and manually flag them as compoundable). Plus, the
world has largely moved over to hunspell, so even just getting an ispell
dictionary is tricky.

As a reminder: This is about finding the "wurst" in "weisswürste" for example.
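
To make the ispell route concrete, here is a rough sketch of wiring such a
dictionary into a text search configuration (the german dict/affix files are
assumed to exist in $SHAREDIR/tsearch_data and to carry the compound flags,
which is exactly the hard part described above):

    CREATE TEXT SEARCH DICTIONARY german_ispell (
        TEMPLATE  = ispell,
        DictFile  = german,    -- german.dict
        AffFile   = german,    -- german.affix, must declare compound flags
        StopWords = german
    );

    CREATE TEXT SEARCH CONFIGURATION german_fts (COPY = pg_catalog.german);

    ALTER TEXT SEARCH CONFIGURATION german_fts
        ALTER MAPPING FOR asciiword, word
        WITH german_ispell, german_stem;

    -- With a compound-aware dictionary this should come back true:
    SELECT to_tsvector('german_fts', 'weisswürste')
           @@ to_tsquery('german_fts', 'wurst');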

Furthermore, another problem is that the decision whether the output of a
dictionary module should be chained into the next module specified or not is
up to the C code of the module itself, not part of the FTS config.

This bit me for example when I wanted to have a thesaurus match for common
misspellings and colloquial terms which I then further wanted to feed into the
dictionary, again for compound matching.

Unfortunately, the thesaurus module is configured as a non-chainable one, so
once there's a thesaurus match, that's what ends up in the index. No chance of
ever also looking it up in the dictionary.

Changing this requires changing the C code and subsequently deploying your
custom module, all of which is certainly doable but also additional work you
might not want to have to do.

And finally: If you use a dictionary, keep in mind that it's by default not
shared between connections. That means that you have to pay the price of
loading that dictionary whenever you use the text search engine the first time
over a connection.

For smaller dictionaries, this isn't significant, but due to the compound
handling in ispell, you'll have huge-ass(tm) dictionaries. Case in point: ours
is about 25 MB in size and costs 0.5 seconds of load time on an 8-drive 15K
RAID10 array.

In practice (i.e. whenever you want to respond to a search in less than 0.5
secs, which is, I would argue, always), this forces you to either use some
kind of connection pooling (which can have other side-effects for your
application) or to use the shared_ispell extension
([http://pgxn.org/dist/shared_ispell](http://pgxn.org/dist/shared_ispell)),
though I've seen crashes in production when using it, which led me to go back
to pgbouncer.

Aside from these limitations (neither of which will apply to you if you have
an English body, because searching that works without a dictionary to begin
with), yes, it works great.

(edited: Added a note about chaining dictionary modules)

~~~
donw
I've got a project that needed multilingual support over very, very small
documents, and ended up just rolling that myself -- neither ElasticSearch nor
Postgres copes very well when you start mixing multiple languages in the same
document.

~~~
pilif
If it's indeed the same document, you're screwed, yes. But in Postgres, I
could at least have different columns (or join tables with different language
codes) for different languages and then query them with the appropriate
dictionary.
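
A rough sketch of that per-column approach (hypothetical schema; one
expression index per language, each bound to its own dictionary):

    CREATE TABLE profile (
        id     serial PRIMARY KEY,
        bio_en text,
        bio_de text
    );

    CREATE INDEX profile_bio_en_idx ON profile
        USING gin (to_tsvector('english', coalesce(bio_en, '')));
    CREATE INDEX profile_bio_de_idx ON profile
        USING gin (to_tsvector('german', coalesce(bio_de, '')));

    -- Query whichever column matches the user's language
    SELECT id
    FROM profile
    WHERE to_tsvector('german', coalesce(bio_de, ''))
          @@ to_tsquery('german', 'wurst');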

ElasticSearch doesn't support that: all fields in a document are assumed to be
of the same language, which leads to a huge duplication of documents in cases
where most fields of a document are language-independent.

~~~
Argorak
> All fields in a document are assumed to be of the same language

That is not true. You can configure analyzers on a field basis, both for
search and index.

~~~
pilif
Oh. Cool. Thanks. When I was looking into this, it wasn't yet possible. That's
very interesting for me.

ElasticSearch has MUCH better support for language analysis, to the point
where it might be worth moving from Postgres tsearch to ElasticSearch, but
this limitation was the big hurdle that kept me from moving.

~~~
Argorak
I am surprised; that's been possible in every version I've worked with (0.12x
and up).

What you cannot do is control that via the client (e.g. "this index request
uses the Chinese analyzer for that field only"), because the analyzer setting
in the index request overrides all other configs. Mappings have supported it
for a long time.

One of the arts of search optimization on Elasticsearch is finding fields that
you want to index multiple times with different analyzers and combine in a
clever fashion.

------
brianhempel
For full text search, the only global frequency information Postgres uses is
the stop-word list. It does not do TF-IDF ranking. [1]

For example, if you search for "Bob Peterson", Postgres will rank these two
documents the same:

"I saw Bob."

"I saw Peterson."

In contrast, an IDF-aware search would notice that "Peterson" occurs in fewer
documents than "Bob" and score "I saw Peterson" higher for that reason.
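
A quick way to see this with the default `ts_rank` (which never consults
document frequency):

    SELECT body,
           ts_rank(to_tsvector('english', body),
                   to_tsquery('english', 'bob | peterson')) AS rank
    FROM (VALUES ('I saw Bob.'), ('I saw Peterson.')) AS t(body);
    -- Both rows get exactly the same rank: each matches one query term,
    -- and ts_rank never asks how rare that term is across the corpus.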

[1]
[http://en.wikipedia.org/wiki/Tf%E2%80%93idf](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[2] [http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf](http://stackoverflow.com/questions/18296444/does-postgresql-use-tf-idf)

~~~
rorydh
TF–IDF ranking doesn't seem to be too complex a thing to implement. Maybe this
is an opportunity for someone here to contribute to the open source project.

~~~
jeltz
The concurrency aspects of this seem a bit tricky. How do we ensure that a
bloated index does not screw our results too much?

------
petergeoghegan
I'm surprised that there is no mention of the recent big improvements in the
GIN infrastructure that Postgres full text search uses. The 9.4 release notes
list:

* Reduce GIN index size (Alexander Korotkov, Heikki Linnakangas)

* Improve speed of multi-key GIN lookups (Alexander Korotkov, Heikki Linnakangas)

------
Argorak
The most important sentence: "Probably not if your core business needs revolve
around search."

Postgres full text search is very good, but once you get into the realms where
Elasticsearch and SOLR really shine (complex scoring based on combinations of
fields, temporal conditions or in multiple passes, all that with additional
faceting etc.), trying to rebuild all that on top of Postgres will be a pain.

While that doesn't break the article, it runs into a nasty problem: `unaccent`
doesn't handle denormalized accents.

        # SELECT unaccent(U&'\0065\0301');
         unaccent 
        ----------
         é
        (1 row)
    

(That problem is also present in Elasticsearch if you forget to configure the
analyzer to normalize properly before unaccenting)

~~~
rachbelaid
Thanks for your comment; I was not aware of the unaccent limitation. This blog
post only presents a solution for small/medium search needs without adding an
extra dependency ... Postgres full text search is far from being a silver
bullet.

~~~
Argorak
Yep, that's how I understood it, and I like the rest a lot too. I just wanted
to make the sentence stand out, as it is so far down. BTW, the problem seems
to be that Postgres doesn't actually handle decomposed UTF-8 and is non-
compliant in that regard:

[http://www.postgresql.org/message-id/53E1AB15.8050702@2ndquadrant.com](http://www.postgresql.org/message-id/53E1AB15.8050702@2ndquadrant.com)

So making sure everything is in composed form before writing to the DB seems
to be the best way to go.

------
rboling91
Shouldn't this query:

    SELECT to_tsvector(post.title) ||
           to_tsvector(post.content) ||
           to_tsvector(author.name) ||
           to_tsvector(coalesce(string_agg(tag.name, ' '), '')) AS document
    FROM post
    JOIN author ON author.id = post.author_id
    JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id
    JOIN tag ON tag.id = posts_tags.tag_id
    GROUP BY post.id, author.id;

be rewritten as:

    SELECT to_tsvector(post.title) ||
           to_tsvector(post.content) ||
           to_tsvector(author.name) ||
           to_tsvector(coalesce(string_agg(tag.name, ' '), '')) AS document
    FROM post
    JOIN author ON author.id = post.author_id
    JOIN posts_tags ON posts_tags.post_id = post.id
    JOIN tag ON tag.id = posts_tags.tag_id
    GROUP BY post.id, author.id;

------
yror10
If you're working with a Ruby/Rails app, check out the pg_search gem
[https://github.com/Casecommons/pg_search](https://github.com/Casecommons/pg_search)

It allows you to do the things in this article with ActiveRecord models.

~~~
wuliwong
Nice! In fact, this came up first when I searched "postgress full text search
rails 4". I'll check it out. I was actually going to make a gem for my own
needs. Hopefully you've done all the legwork for me! I'll also follow the repo
and try to help out once I understand it. Thanks.

------
mihai_ionic
One use case where Postgres FTS fails spectacularly is when your search has to
return results ordered by some field (e.g., timestamp). This is critical for
applications that have to return close-to-realtime results due to the queried
content being time-sensitive.

Neither GIN nor GiST indices support such document ordering. For a top-N query
of the most recent results, the entire result set has to be processed. With
common domain-specific words, this can be more than 50% of the entire document
set. As you can imagine, this is insanely expensive. When you are in this
situation, it helps to set gin_fuzzy_search_limit to a conservative number
(20000 works for me, less if you expect heavy traffic) in postgresql.conf, so
that pathological queries eventually finish without processing every result.
Result quality will take a hit, because many documents are skipped over.
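
For illustration, the limit can also be set per session rather than in
postgresql.conf (table and column names below are made up; 20000 is just the
value mentioned above):

    SET gin_fuzzy_search_limit = 20000;  -- cap the rows a GIN scan will collect

    SELECT id, created_at
    FROM article
    WHERE search_vector @@ to_tsquery('english', 'common & term')
    ORDER BY created_at DESC
    LIMIT 10;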

If you need any type of ordered search on more than a hundred thousand
documents, do yourself a favor and use something other than Postgres.

I'm not sure what search back-end Wikipedia is using, but it seems like they
are not quite immune to this problem either:
[https://en.wikipedia.org/w/index.php?search=a+the](https://en.wikipedia.org/w/index.php?search=a+the)

------
rhema
I really wish I had some benchmarks here. Postgres can be crazy fast, but can
it compete with Elasticsearch? How long would a typical query take across
1,000,000 documents?

~~~
buro9
I haven't made benchmarks specifically against other products, but we use
PostgreSQL across a database of 7.7m blog-post length documents and have found
the search to be extremely fast.

The only times in which search slows down is when you search for common words
that are not stop words and are also performing ordering (such as ranking)
across that body of results before returning your window. That is to say, the
search is still fast but sorting a large resultset can be slow.

The only gotchas I would call out to anyone thinking of using PostgreSQL full
text search are:

1) Beware of row-level functions like ts_headline() for extracting a fragment
of the matched text; be sure that you are only doing this for the n rows that
you are returning for a LIMIT and not the entire resultset (most of which you
will discard). A sketch of this pattern follows below.

2) Beware of accuracy and think about the text you are indexing. If you are
receiving raw markdown and transforming it to HTML, then you shouldn't index
either (raw markdown may contain XSS attacks that can survive ts_headline(),
and the transformed HTML will contain markup that will be indexed). You should
figure out a way to build a block of raw text that is safe to index _and_ to
show fragments for (ts_headline()), which likely means running a striptags()
over the HTML, putting hyperlinks into the text (if you want those to be
searchable), and indexing that.
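
A minimal sketch of point 1, pushing the ranking and LIMIT into a subquery so
ts_headline() only runs on the rows actually returned (table, column, and
query terms are hypothetical):

    SELECT id,
           ts_headline('english', body, q) AS snippet
    FROM (
        SELECT id, body, q
        FROM post, to_tsquery('english', 'example & term') AS q
        WHERE search_vector @@ q
        ORDER BY ts_rank(search_vector, q) DESC
        LIMIT 20
    ) AS best;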

We had no issues with performance, control over search ordering, or anything
else. The only gotchas we encountered related to sorting and displaying
results, and we resolved both easily enough, though too few of the high-level
docs like the linked article pointed out the potential issues there.

------
hendry
Postgres, elasticsearch and SOLR all seem pretty heavyweight to me.

What's the lightweight approach to create an inverted index nowadays with some
basic stemming?

~~~
rjbwork
You can always just roll your own with lucene or lucene.NET, which is what
ES/SOLR use underneath.

~~~
collyw
I've never used Lucene, but I got the impression from others that it was
difficult to use, and that Elasticsearch provides a relatively usable layer on
top.

~~~
rpedela
It isn't that hard to use in my opinion, but it is only a library. So you need
to implement many things to turn it into a server which Solr and Elasticsearch
have already done.

------
elchief
Text search is ok in postgres. Better than mysql. Not as good as solr, but you
don't have to sync with solr if you use it.

My main complaint is that 'united states' @@ 'united states of america' is
false, i.e. no token overlap similarity.
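
Illustrated (the bare text @@ text form is roughly to_tsvector() @@
plainto_tsquery()):

    SELECT to_tsvector('english', 'united states')
           @@ plainto_tsquery('english', 'united states of america');
    -- false: the query also demands 'america', which the vector doesn't contain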

You can overcome this with the smlar extension. I have requested smlar
integration in 9.5.

~~~
aswanson
Has the team been responsive with regard to integration requests?

~~~
elchief
Craig K seemed to think it was a good idea. We'll see...

------
ipmb
We're using it on our IRC logging service,
[https://botbot.me/](https://botbot.me/) and agree with the author's
sentiments. It's not perfect, but it keeps our infrastructure simple and is
very fast over 20M+ rows.

------
resca79
I love this article, especially when I think about premature optimization.
Having a good full-text search while adding zero components to the platform is
great. When Postgres search is no longer enough, I'll be happy to add
Elasticsearch, Solr, and so on.

------
bshimmin
It's awesome what you can do with Postgres, undeniably, but it seems to me
that if you're building a Rails app, for instance, it's an absolute no-brainer
to use the brilliant Sunspot gem with Solr. Yes, you have the moderate
unpleasantness of setting up Solr and fiddling about with Java and XML files,
but it's all pretty painless after that.

I don't know if similarly nice things exist for other web frameworks.

As others have said, what's described in the OP may well be "good enough" for
many purposes but it seems like quite a lot of effort to get there.

~~~
mctx
Django has haystack[0], which I've used with an elasticsearch backend on a few
projects and is generally pretty easy to set up and get good results with. I
haven't had to do anything more advanced than autocomplete/boost/spelling
suggestions, but from what I've read it seems pretty powerful.

[0]: [http://haystacksearch.org/](http://haystacksearch.org/)

------
hiharryhere
Full text search is good when you have big passages of text to search, but it
isn't good for typos on short fields. The tolerances just aren't configurable
enough.

I was using it to search across user profiles, and while the fuzzy search was
great for a slab of text in, say, an "about me" section, it wasn't good at all
for short bits of text like a name. It's just too fuzzy for short blobs.

------
rom16384
At least for our use case, Postgres full text search wasn't good enough.
Indexing and searching 1M records was troublesome, and its advanced queries
weren't as flexible as we'd like, so after a while we started using Solr;
while that was better, it still had its share of problems, so we are currently
using ElasticSearch.

~~~
sanderjd
Hmm, was it "troublesome" in terms of performance building the index or doing
the search or was there some other issue?

~~~
rom16384
Part of the problem was due to the Django package we used to interface with
the Postgres search; it wasn't designed to handle large tables. I don't recall
how long the index took to build (hours? a day?), but the queries themselves
were slow due to their complexity and ranking using ts_rank.

~~~
sanderjd
Interesting, thanks for the info. What did you switch to? Were the queries
significantly faster?

------
rorydh
Here's a nice PDF on some of the great performance you can get out of it
[https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf](https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf)

------
chdir
If you've used both FTS in postgres and either ElasticSearch or Solr recently:

Objectively, what are some data points when you need to switch over? E.g. more
than 'x' values to index, index update time, or perhaps features like
prefix/substring/phrase search, lack of facets, etc.

------
tim333
For knocking up a site quickly with a search button I wonder how this compares
with using Google custom search? I guess with Postgres you can cover stuff
that Google has not indexed?

------
jcroll
For the record: MySQL already has fulltext search for MyISAM tables and _will_
have fulltext for InnoDB in 5.6.

~~~
feld
Did you read the article? He explained why MySQL's fulltext is not acceptable.

------
chatman
Faceting, spell correction, sloppy search, etc. are tricky to implement in
Postgres.

------
fiatjaf
Swish-e is good enough.

------
seivan
I wrote a half-assed full text search in Redis once. Want to go back and work
on it some more just for kicks.

------
iamleppert
If you're going to do something, do it right. Or leave search out of your
product if you aren't willing to invest the effort and energy required to
implement a proper solution.

Those SQL queries are hideous. Most people have learned to avoid search
because, thanks to many developers like this, it's next to useless. Please do
us a favor and leave it out of your product if you aren't willing to invest in
a proper solution. Not one that is "good enough".

~~~
sanderjd
Honestly, what is wrong with the queries? They use pretty straightforward SQL
functions, datatype conversions, operators, and relational querying
capabilities. Is it just aesthetics or are there performance or other actual
problems with them?

