
Setting Up a Fast, Comprehensive Search Routine with PostgreSQL - craigkerstiens
https://rob.conery.io/2018/07/23/setting-up-a-fast-comprehensive-search-routine-with-postgresql/
======
deedubaya
I just yanked elasticsearch out of an app and replaced it with PG's full text
search.

It's like shedding a 500# pack from my shoulders. Everything is so much
simpler now -- dev is easier, testing is easier, deployment is easier.
Infrastructure costs are down.

Elasticsearch is a pretty neat tool, but what I was using it for was heavy-
handed. I only chose it because it was what everyone else was using.

~~~
brightball
I did the exact same thing with Solr a few years back. PG performed equally
well for speed, the results were more accurate, and the data inconsistencies
vanished. Solr was such a black box to debug that it couldn't be trusted.

After I left, the dev hired to replace me put Solr back.

~~~
ngrilly
What is the size of your dataset (number of documents and average size of
documents)?

~~~
brightball
It’s been a while so I’m going from memory here.

We had about 14 million records, and the full text search covered three
varchar(255) columns and a text column that could get very long depending on
the user. These were product auction listings.

I don’t remember the size specifics though.

~~~
ngrilly
Thanks. That's not a small dataset, and it's significant that you got good
enough performance on it. How many docs matched a typical query? And what was
a typical response time?

~~~
brightball
It's been a long time. I honestly don't remember.

At the time, when we removed Solr and replaced it with an assembled query, the
performance was at least equal and our data issues vanished.

We were having serious issues keeping the Solr dataset in sync with changes,
invalidating old entries and getting new ones to appear. A user pays to create
a listing and expects to see that listing in the search...and sometimes that
wasn't happening. When a listing expired, sometimes it wasn't going away.
Being able to keep this data up to date with a simple PG trigger completely
eliminated the problems and made it easy to tune.
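
A minimal sketch of that kind of trigger, assuming a hypothetical `listings` table with `title` and `description` columns (the parent doesn't describe the actual schema):

```sql
-- Hypothetical schema; table, column, and function names are illustrative.
ALTER TABLE listings ADD COLUMN tsv tsvector;

CREATE FUNCTION listings_tsv_update() RETURNS trigger AS $$
BEGIN
  -- Recompute the search vector on every insert/update.
  NEW.tsv := to_tsvector('english',
    coalesce(NEW.title, '') || ' ' || coalesce(NEW.description, ''));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER listings_tsv_trigger
  BEFORE INSERT OR UPDATE ON listings
  FOR EACH ROW EXECUTE PROCEDURE listings_tsv_update();
```

Because the vector is maintained in the same transaction as the row change, there is no separate index to fall out of sync.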

The search used a combination of full text, categories, geographic distance,
user names, and a couple of other filters. Being able to simply construct each
of those pieces in a query, sometimes filtering parts with subqueries, was
really effective. There were a few indexes involved.
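
A sketch of what combining those pieces in one query might look like. All names here are hypothetical, and the distance filter assumes the `earthdistance` extension (the parent doesn't say which approach was actually used):

```sql
-- Illustrative only: full text + category + geographic distance in one query.
SELECT l.id, l.title
FROM listings l
WHERE l.tsv @@ plainto_tsquery('english', 'vintage guitar')
  AND l.category_id IN (SELECT id FROM categories WHERE name = 'Instruments')
  -- earth_distance/ll_to_earth come from the earthdistance extension
  AND earth_distance(ll_to_earth(l.lat, l.lng),
                     ll_to_earth(35.2271, -80.8431)) < 80000  -- metres
ORDER BY l.created_at DESC
LIMIT 25;
```

Each predicate can use its own index, and the planner combines them.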

Keep in mind, this is also before I knew anything about Postgres partitioning
with PG Partman or how to create partial indexes. I could have made it a lot
more efficient than it was, in hindsight.

The experience was good enough though that I don't even begin to think about
pulling in a 3rd party search tool unless I have project requirements that
make it unavoidable. Elasticsearch is really good for use cases with constant
streaming data ingestion, for example. If I just need search for data that's
already in the database, though...it's really hard to justify.

~~~
ngrilly
Great info. Thanks!

------
sandGorgon
The biggest issue with Postgres search is the inability to use TF-IDF or BM25
(the current default and state of the art in Elasticsearch). The built-in
ranking system does not produce very relevant results.

Anyone who is familiar with PG internals: is there something in the internal
data structure that prevents a BM25 or TF-IDF style rank generation?

The work on RUM seems to have stagnated (and TF-IDF was a TODO there anyhow):
[https://github.com/postgrespro/rum](https://github.com/postgrespro/rum)

I have a theory that if they incorporate these algorithms, it makes postgres
potent enough that a lot of people may choose not to use elasticsearch/lucene.
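
For concreteness, PG's built-in ranking looks like this (table and column names are hypothetical). ts_rank weighs term frequency and proximity within each document, but has no corpus-wide statistics such as IDF, which is the gap described above:

```sql
-- Hypothetical docs table with a precomputed tsvector column.
SELECT id, ts_rank(tsv, query) AS rank
FROM docs, to_tsquery('english', 'search & ranking') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```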

~~~
threeseed
I had never heard of these algorithms until now.

But I would still use Elasticsearch since (a) everyone else does, so there's
lots of support, (b) there's a ridiculous number of plugins, drivers, and
rivers, (c) PostgreSQL has a far weaker scalability story, (d) Kibana is a
great UI to help with debugging, and (e) you can use ELK for your monitoring
as well as search.

I could definitely come up with a few more, but Elasticsearch is a product
that does search really well.

~~~
aisofteng
These are standard information retrieval algorithms that have been known and
used for decades.

------
evadne
The first solution uses a materialised view. It essentially precomputes
everything the query can possibly return, and it uses additional disk space,
plus WAL traffic unless it is configured unlogged. There has been no mention
of how long it takes to rebuild such a materialised view: you still get full
table scans, only at build time and against the primary read instance.
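
A sketch of that first approach, with hypothetical names. REFRESH ... CONCURRENTLY avoids blocking readers during the rebuild, but it requires a unique index and still rescans the base table:

```sql
-- Hypothetical schema; precomputes search vectors into a materialised view.
CREATE MATERIALIZED VIEW doc_search AS
SELECT id, to_tsvector('english', title || ' ' || body) AS tsv
FROM docs;

-- Required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX doc_search_id ON doc_search (id);

REFRESH MATERIALIZED VIEW CONCURRENTLY doc_search;
```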

The second solution is better, but please read the documentation [1] before
replicating it. Also, to_tsvector takes an optional regconfig parameter [2],
and here is what I get right now:

    
    
        $ psql
        psql (9.6.2)
        Type "help" for help.
    
        evadne=# \dF
                       List of text search configurations
           Schema   |    Name    |              Description              
        ------------+------------+---------------------------------------
         pg_catalog | danish     | configuration for danish language
         pg_catalog | dutch      | configuration for dutch language
         pg_catalog | english    | configuration for english language
         pg_catalog | finnish    | configuration for finnish language
         pg_catalog | french     | configuration for french language
         pg_catalog | german     | configuration for german language
         pg_catalog | hungarian  | configuration for hungarian language
         pg_catalog | italian    | configuration for italian language
         pg_catalog | norwegian  | configuration for norwegian language
         pg_catalog | portuguese | configuration for portuguese language
         pg_catalog | romanian   | configuration for romanian language
         pg_catalog | russian    | configuration for russian language
         pg_catalog | simple     | simple configuration
         pg_catalog | spanish    | configuration for spanish language
         pg_catalog | swedish    | configuration for swedish language
         pg_catalog | turkish    | configuration for turkish language
        (16 rows)
    
    

[1] [https://www.postgresql.org/docs/current/static/textsearch-
co...](https://www.postgresql.org/docs/current/static/textsearch-
controls.html)

[2] [https://www.postgresql.org/docs/current/static/textsearch-
in...](https://www.postgresql.org/docs/current/static/textsearch-
intro.html#TEXTSEARCH-INTRO-CONFIGURATIONS)

~~~
1996
A larger problem is the need to precompute.

Regular views are just as slow (or fast) as the underlying query.

It seems something in between would be needed: not just a trigger, but a
regular update as data is appended, maybe in chunks?

I do not know Postgres well. It may already exist.

~~~
manigandham
A trigger to update the tsvector column works just fine, and is the normal
implementation.
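
Postgres even ships a built-in helper function for this. A sketch, assuming a hypothetical `messages` table with `tsv`, `title`, and `body` columns:

```sql
-- tsvector_update_trigger is a built-in trigger function that keeps the
-- tsv column in sync from the listed source columns.
CREATE TRIGGER tsv_update
  BEFORE INSERT OR UPDATE ON messages
  FOR EACH ROW
  EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english',
                                            title, body);
```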

------
manigandham
It's important to note that a simple to_tsquery(...) will not handle phrases
or prefixes, or even normalize the input. There are plainto_tsquery(...) and
phraseto_tsquery(...) functions, but these don't have the same flexibility and
also won't do prefixes.

If you want some normalization and prefix matching, you can actually call
to_tsvector(...) on your query input, then stitch the result together into a
better tsquery input:

    
    
       SELECT STRING_AGG(lexeme || ':*', ' & ') 
       FROM UNNEST(to_tsvector('testing 123'))
    

This will return `123:* & test:*` which can then be used for searching. Full
example here: [https://stackoverflow.com/questions/46122175/fulltext-
search...](https://stackoverflow.com/questions/46122175/fulltext-search-
combined-with-fuzzysearch-in-postgresql/51433877#51433877)
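
The generated string can then be fed back into to_tsquery and matched against an indexed column (table and column names here are hypothetical):

```sql
-- Prefix search using the stitched-together query from above.
SELECT id, title
FROM docs
WHERE tsv @@ to_tsquery('123:* & test:*');
```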

------
thom
We've always used pg_trgm with GiST indexes, with the same sort of
materialized view described here, containing unaccented search terms. One of
the big benefits here is we can then do really fast '10 best matches' by text
similarity. Obviously this is a slightly different use case than full-text
over largish documents, but it's great when you've got a database of foreign
names that people only vaguely know how to spell.
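
A sketch of that "10 best matches" pattern, with hypothetical names. The `<->` operator is trigram distance, and a GiST index with `gist_trgm_ops` lets the k-nearest-neighbour ordering use the index:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX people_name_trgm ON people USING gist (name gist_trgm_ops);

-- Top 10 closest matches by trigram similarity, even with misspellings.
SELECT name, similarity(name, 'Siobhan') AS sim
FROM people
ORDER BY name <-> 'Siobhan'
LIMIT 10;
```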

~~~
brightball
pg_trgm is excellent for shorter, fixed-length text.

------
aldoushuxley001
What are the downsides of implementing something like this rather than going
with Elasticsearch or Solr?

Is it possible to have production-quality search using just PostgreSQL?

~~~
ngrilly
The upside: only one service to maintain (PostgreSQL) instead of two
(PostgreSQL + Elasticsearch), and not having to synchronize PostgreSQL and
Elasticsearch (which is more difficult than it sounds).

The downside: ranking can be slow when your keywords match a large number of
documents (have a look at github.com/postgrespro/rum for a solution), and
faceting is more difficult to implement than in Elasticsearch.
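
Faceting in plain SQL is doable but typically means one aggregate query per facet dimension (names here are hypothetical), whereas Elasticsearch aggregations return many facets in a single round trip:

```sql
-- Counts per category for the current search, one facet dimension per query.
SELECT category, count(*) AS n
FROM docs
WHERE tsv @@ plainto_tsquery('english', 'laptop')
GROUP BY category
ORDER BY n DESC;
```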

~~~
rbjorklin
What about running only Solr/Elasticsearch? That would keep the upside and
eliminate the downside, no?

~~~
aisofteng
Solr and Elasticsearch are not relational database systems and have none of
the optimizations for relational data that Postgres does. They are built on
top of Lucene and target document retrieval via an inverted index [1] rather
than dealing with relational data.

[1]
[https://en.wikipedia.org/wiki/Inverted_index](https://en.wikipedia.org/wiki/Inverted_index)

~~~
vesak
Only a rather small portion of the software that requires search also requires
a relational database, though.

~~~
purerandomness
Can you name an example?

I've never built something that did _not_ need a relational database to model
the domain, and search is such a common UX element that it's taken for granted
in everything you use.

I can't even think of something that could use search that's _not_ relational.
Can you?

~~~
threeseed
Yes, plenty.

Elasticsearch stores data in a JSON document-store format. So in any situation
where you need O(1) lookup for all the data related to a particular entity,
you can have nested structures within the JSON that facilitate this. In an
RDBMS it is O(n), since you have to do a costly join for each embedded
structure.

360-degree customer views are a good example of this.

~~~
purerandomness
There are a lot of misconceptions there.

If you want to store JSON documents, you can pick Postgres [1] or MySQL and
still get an RDBMS for free on top.

Elasticsearch does that too, but its access certainly isn't O(1): it has a
complexity of O(log n), just like Postgres or MySQL would, because all of them
use tree-like index data structures. (Please note that Elasticsearch is not
meant as a data storage layer.)

Now, if you just want the JSON back, you're done. O(log n) for Elasticsearch,
Postgres, or MySQL. That's it.

But normally you want to do something with that data. You have a user and want
all their products. Or you have a book and want all the authors, because you
have a real app to write.

Most data in our world is relational: a student participates in courses, a
store sells products, and so on. Querying your database using JOINs will make
use of the indexes in each of the tables you have previously set up.

These indexes prevent the theoretical worst case (the "full table scan",
where your complexity is O(n) for the table). The query planner decides which
index gives the optimal response time.

You can, however, choose not to leverage this feature of an RDBMS. What do you
do then, if you need all the courses of a student? Do you fetch all the JSON
documents and compare their IDs? Do you parse the JSON keys in your own app?
Having your app iterate over all the JSON keys already has O(n) complexity. So
you end up recreating tree-like indexes somewhere in your stack anyway. But
all of that is already provided for you, with decades of debugging and
performance optimization, for free.

To sum up: use the Postgres JSONB type if you only need to store JSON
documents. You might later want to query the nested JSON data structure and
join on it. I highly recommend "Mastering PostgreSQL in Application
Development" [2].

[1] [https://www.postgresql.org/docs/current/static/datatype-
json...](https://www.postgresql.org/docs/current/static/datatype-json.html)
[2] [https://masteringpostgresql.com/](https://masteringpostgresql.com/)
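
A sketch of joining nested JSONB back to a relational table, with hypothetical schema (a `students.profile` JSONB column holding an array of course ids):

```sql
-- Expand the JSONB array into rows, then join relationally.
SELECT s.name, c.title
FROM students s
CROSS JOIN LATERAL
  jsonb_array_elements_text(s.profile->'course_ids') AS e(course_id)
JOIN courses c ON c.id = e.course_id::int;
```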

~~~
threeseed
You have missed the point here.

The comparison was between Elasticsearch and a relational store. PostgreSQL's
JSON type is not relational. And if you are fetching an entity by its primary
key, then it should be O(1). The point was that document stores offer you the
ability to nest data structures, whilst relational forces you to join. So if
that is your query/data pattern, then a document store can be orders of
magnitude faster than a purely relational store, e.g. a 360-degree customer
view.

~~~
purerandomness
> O(1)

Primary key fetching will always be O(log n) due to how indexes are
implemented. Access time complexity will never be independent of the item
count in your data store, relational or not.

> PostgreSQL's JSON type is not relational.

It is. In Postgres, you can join nested JSON structures' keys with normal,
relational tables: [https://dba.stackexchange.com/questions/83932/postgresql-
joi...](https://dba.stackexchange.com/questions/83932/postgresql-joining-
using-jsonb)

My point is: Postgres offers you a document store and gives you _the option_
to join, if you feel like it.

