
Postgres full-text search is Good Enough (2015) - okket
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
======
k_kelly
I've done two large search projects in the last year. One with Postgres search
and one with elastic search. The elastic search version was trivial to
configure, and was embarrassingly superior to the point where I wish I had
never considered Postgres.

~~~
r888888888
It's not that simple.

If you're using a secondary data store like Elasticsearch presumably you're
combining it with a traditional relational database like Postgres or MySQL.
And in most use cases, you're going to require additional constraints on top
of full text, like restricting to a specific customer or topic, etc. This
means data duplication. Any experienced developer should be breaking out in
hives at this point because this means (1) keeping two data stores in sync;
(2) adding another service to the deployment process; (3) adding another
service to development environments; (4) adding another service to debug when
problems crop up; (5) large scale schema changes and migrations are
complicated because of two data stores; (6) operational overhead from
maintaining Elasticsearch like separate backup scripts and deployment scripts
and sharding config (across production, qa, and dev); (7) queries are
potentially complicated because you may have to aggregate information from two
data stores (pagination and delayed updates make this even more hellish); (8)
if your company is anal about devs accessing production data then ES just
introduces another layer of requesting permissions and every time the schema
changes you have to ask again.

And maybe the advantages of Elasticsearch far outweigh these negatives, but
this is a conversation you're going to have with your devs because the cost of
using ES is not zero. There are tradeoffs, and maybe just using Postgres full
text search is adequate for your use case given sufficient additional
constraints.

~~~
nostrademons
Data duplication virtually _always_ happens as the organization gets larger.
Even if you stick to RDBMS-only, you're going to get it once you set up
master/slave replication, batch jobs, data science, tape backups, etc.

It's not a serious problem as long as you designate one data source as primary
and every other data source as derived. That way, if you ever get data
corruption issues, you can re-create the derived data and blow away the old
copy. You also don't face complex data-sync issues, because all data-transfer
is unidirectional. You write into Postgres or your other source-of-truth, and
then you write from Postgres to ElasticSearch. If there's a difference between
them, the Postgres version is authoritative, so copy its data over
ElasticSearch.

Most of your other points are common to adding any new major feature, and you
should always weigh the user benefits against the maintenance costs of a large
new dependency. Presumably, if you're looking at ElasticSearch, it's because
full-text search in your RDBMS is _not_ good enough (which won't always be the
case, but has been my experience and apparently the experience of several
other people here).

------
buro9
Plus it's in your database, which means it is trivial to limit the search
results by any JOINable criteria, e.g. permissions, roles, groups, users,
organisations or anything else you have in your database.
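
In SQL, that combination is a single query. A sketch, assuming hypothetical
`documents(id, title, body, org_id)` and `memberships(user_id, org_id)`
tables:

```sql
-- Full-text match plus a permission filter via an ordinary JOIN:
-- only documents in organisations the current user belongs to.
SELECT d.id, d.title,
       ts_rank(to_tsvector('english', d.body),
               to_tsquery('english', 'postgres & search')) AS rank
FROM documents d
JOIN memberships m ON m.org_id = d.org_id
WHERE m.user_id = 42
  AND to_tsvector('english', d.body) @@ to_tsquery('english', 'postgres & search')
ORDER BY rank DESC;
```

In practice you'd keep a precomputed tsvector column with a GIN index rather
than calling to_tsvector per row, but the point stands: the filter is just
another predicate.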

~~~
jpgvm
Also because it's your authoritative datastore you don't need to worry about
keeping your secondary indexing system (solr/elasticsearch etc) in-sync.

Using Postgres for search, just from a simplicity point of view, can save many
engineering hours, especially if you don't have an expert in Solr/Elasticsearch
available. This is even more true these days with SolrCloud needing a
Zookeeper ensemble as well.

~~~
yummyfajitas
It's not that hard to keep your secondary indexes up to date with Postgres.
You can set up external triggers with LISTEN/NOTIFY; i.e. every time a row is
updated, Postgres will inform the index refreshing service that an update is
needed.

[https://www.chrisstucchio.com/blog/2013/postgres_external_tr...](https://www.chrisstucchio.com/blog/2013/postgres_external_triggers.html)
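
A minimal sketch of that setup (channel and table names are made up):

```sql
-- Notify an external indexer on the 'index_updates' channel
-- whenever a row in a hypothetical documents table changes.
CREATE OR REPLACE FUNCTION notify_index_update() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('index_updates', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_index_update
AFTER INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE PROCEDURE notify_index_update();
```

The index-refreshing service runs LISTEN index_updates and re-indexes each row
id it receives.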

~~~
gnud
And how do you get back in sync if the client that did the LISTEN, crashes and
has to be restarted?

~~~
openasocket
Using NOTIFY puts the data into a message queue, so messages will just hold
there until your client listens again. Eventually the channel queue will fill
up (8GB by default), at which point NOTIFY will fail.

[https://www.postgresql.org/docs/9.0/static/sql-notify.html](https://www.postgresql.org/docs/9.0/static/sql-notify.html)

~~~
gnud
I don't think it will? When the client restarts, it will create a new session
with the postgres server. This new session will not receive messages that are
queued for the old session.

~~~
openasocket
Ah, yes, I think you're right. I suppose you would want to represent a message
queue using a table and have your daemon process remove things from the table
as it adds things to elastic search. That way if the daemon crashes the events
will just continue to buffer.
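
A sketch of that table-based queue (hypothetical names): the trigger inserts
into the table instead of only notifying, and the daemon deletes rows one
batch per transaction, so a crash just leaves unprocessed rows behind.

```sql
CREATE TABLE index_queue (
  id     bigserial PRIMARY KEY,
  doc_id integer NOT NULL,
  queued timestamptz NOT NULL DEFAULT now()
);

-- Daemon loop, one batch per transaction; commit only after the
-- corresponding Elasticsearch update succeeds.
DELETE FROM index_queue
WHERE id IN (SELECT id FROM index_queue ORDER BY id LIMIT 100)
RETURNING doc_id;
```

NOTIFY can still serve as a wake-up signal; the table is what makes the
hand-off durable.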

------
mrweasel
It's a little sad that people seem to forget that Sphinx Search exists. For
quick full-text search implementation Sphinx is much easier than trying to
spin up and understand Solr or ElasticSearch.

For a large number of sites, search could be improved greatly by simply
setting up Sphinx with an SQL query for slurping up data from your backend
database.

~~~
ngrilly
Yes, but how do you make sure your Sphinx index is always synchronized with
your SQL database, within a short window (a few seconds), and without too much
pressure on the database?

~~~
morgante
> within a short window (a few seconds),

Why is that necessary? One of the things which makes maintaining a search
index easier is that typically your search datastore has much weaker
consistency and durability requirements than your primary one.

~~~
ngrilly
Because in our application, users can add new documents, and they will be
confused if the document they just added is missing from the search results
(which by default show the most recent documents). Considering that the
Elasticsearch website and documentation emphasize "real-time" features, I
think it's not a weird requirement :-)

------
justin_vanw
This comes up over and over.

For basic utility stuff or simple backend apps it's pretty useful. For
anything serious or consumer facing it's crap.

~~~
nostrademons
TBH if you're looking to do something serious & consumer facing you may want
to look into paying for Google's Custom Search API, or if your data set
differs dramatically in shape from webpages, hiring a _really good_ data
scientist and rolling your own:

[https://developers.google.com/custom-search/](https://developers.google.com/custom-search/)

I'm an ex-Googler who's now working fairly extensively with ElasticSearch.
While I've been impressed at how easy it is to get started, _all_ the
published academic ranking algorithms are utter crap compared to what Google
has. Okapi BM25 is mentioned as the state-of-the-art up-thread; it was
developed in the mid-90s, before Google was born, and most of the advances in
search since then have happened behind closed corporate doors inside Google,
Yahoo, and Bing.

~~~
hacknat
This is solid advice. At my last company we had SOLR. I think we would have
even qualified for what most people would call a "fair" use case of a search
stack. Millions of items in the catalog, and millions of users to go along
with it. Lucene is light years faster than a SQL engine with such requirements
for search.

What is it not better at, though? Pretty much everything else: GIS, cross
joins, parent/child indexing, etc. Postgres can do it all too with a little
digging.

The thing neither of them has? Semantic search. If I ever hit that scale again
at another company, the jump won't be from SQL to Lucene, but from SQL to
Google.

~~~
rpedela
You can do semantic search with Solr and ES, it just isn't built-in. But they
support stemming and synonyms, which are the building blocks. You have to
create the synonyms file yourself with word2vec or a similar algorithm.
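
The synonyms file itself is plain text in Lucene's format, one rule per line;
for example:

```
# Equivalence group: each term matches the others.
laptop, notebook, portable computer
# One-way mapping: the left side is rewritten to the right.
usa => united states, united states of america
```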

------
patsplat
Saying postgres is "good enough" acknowledges that elastic search is "better",
but that, based on minimum viable product requirements, it may not be
necessary. It's a fair point that is mostly valid when approaching a deadline
on a fixed bid contract.

Given enough sprints, the search engine will eventually be run by an external
service. +1 for the Google Custom Search suggestion. If SEO matters to the
problem domain, might as well leverage those carefully indexed documents.

------
okket
Previous discussions:

[https://news.ycombinator.com/item?id=8714477](https://news.ycombinator.com/item?id=8714477)
(~2 years ago, 58 comments)

[https://news.ycombinator.com/item?id=8381748](https://news.ycombinator.com/item?id=8381748)
(~2 years ago, 95 comments)

~~~
ngrilly
And this one a few days ago:

[https://news.ycombinator.com/item?id=12605156](https://news.ycombinator.com/item?id=12605156)

And this new index type for PostgreSQL:

[https://github.com/postgrespro/rum](https://github.com/postgrespro/rum)
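
Going by the rum README (a sketch from memory; check the repo for the exact
syntax), it installs as an extension and lets ORDER BY rank use the index:

```sql
CREATE EXTENSION rum;
CREATE INDEX docs_rum_idx ON documents USING rum (tsv rum_tsvector_ops);

-- <=> is rum's distance operator, so ranking happens inside the
-- index scan instead of re-sorting every match afterwards.
SELECT id
FROM documents
WHERE tsv @@ to_tsquery('english', 'postgres & search')
ORDER BY tsv <=> to_tsquery('english', 'postgres & search')
LIMIT 10;
```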

------
mark_l_watson
I usually choose in favor of using Postgres for RDMS and search, but I like to
write a thin abstraction layer in case I later need to change my mind. The
overhead (coding wise) for doing this is fairly small and this approach can
save having to change a lot of application code later.
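
One cheap form of that layer, sketched in SQL with hypothetical names: the
application only ever calls a wrapper function, so switching backends later
means changing one function body rather than every query site.

```sql
CREATE OR REPLACE FUNCTION search_documents(q text)
RETURNS TABLE (id integer, rank real) AS $$
  SELECT d.id,
         ts_rank(to_tsvector('english', d.body),
                 plainto_tsquery('english', q))
  FROM documents d
  WHERE to_tsvector('english', d.body) @@ plainto_tsquery('english', q)
  ORDER BY 2 DESC
$$ LANGUAGE sql STABLE;
```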

------
elchief
It really needs TF-IDF or BM25

~~~
donretag
I was going to post the same thing. As far as I can tell, there are no term or
document frequencies in postgres. Then again, I have seen several Lucene
implementations with either or both disabled. Can you even get an explain
back from postgres? The main difference between a database and a search engine
(a subtype) is the ability to score and rank. Taking away the ability to score
effectively diminishes the value.

BM25 is simply TFIDF with a couple of tunable parameters. These parameters
will help with precisely the issues which force people to disable TF
(saturating the value early, ...). From what I can see, Elasticsearch is
mainly used as an analytics/logging engine and not search. Many NLP heavy
organizations are using Solr (or straight Lucene).
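
For reference, the standard BM25 score of a document D for query Q is TF-IDF
with two knobs: k_1 controls how quickly term frequency saturates, and b
controls document-length normalization.

```latex
\mathrm{score}(D,Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\cdot
  \frac{f(q_i,D)\,(k_1+1)}
       {f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Setting b = 0 turns off length normalization, and a small k_1 makes the TF
component saturate almost immediately, which is the same effect as the
"disable TF" tweaks mentioned above.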

~~~
elchief
there is the smlar extension (written by one of the guys that wrote fts) but
I'd prefer native

You'd think it wouldn't be that hard seeing as SQLite has lots of working
code, and it's reasonably similar to pg.

------
vemv
Everything is good enough at low-to-moderate scale.

------
amelius
Good. We need more competition in search.

