
As much as I would like it to be 'good enough', Postgres' full text search lacks the feature set to be a reasonable solution to most search use cases.

Background: I wrote a product search engine for a company called iTrackr using Postgres full text search, and later wrote the product search for milo.com using Solr. I also designed a simple (but very effective) algorithm to boost search rankings based on historical user click data, by generalizing click data onto the feature space of the document set (using mutual information). http://www.google.com/patents/US20140067786

So, why would I recommend against using Postgres' full text search:

- It is not configurable. A search engine like Solr or ElasticSearch allows you to configure the tokenization pipeline: what is indexed, how it is indexed, etc. Postgres hard-codes its tokenization pipeline. You can turn stemming on and off, but you can't, for example, lowercase text, strip out HTML characters, or handle special cases for tokenization (camel-cased or delimited text, etc). You can write your own tokenizer, but it has to be a C extension. Could you get it to do what you want? Mostly, but only with an order of magnitude more work than with Solr or ElasticSearch.

- Postgres has no facility to make it possible to do faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I've seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.

- Postgres does not give you much control over ranking. It has some good generic ranking methods, but those will only get you so far. (Query Amazon for 'ipad -dkfjdkf' to see how good its generic ranking function is.) No generic ranking function can do very well compared to something tuned to the domain, and Postgres just isn't flexible enough to extend in any reasonable way.

- It is not scalable. Solr and ElasticSearch are both very, very easy to scale up (and out). Postgres' built-in replication is very heavy, as it has to stream all database changes to all slaves (yes, there are things like Slony, but they are incredibly hard to deal with). In any real environment, you have to protect your database from load. It's the hardest thing to scale, and consulting it for all traffic is a surefire way to run into issues that are very hard to get back out of.

Now, with all that said, I would recommend it for a few cases:

- You just want to do 'LIKE'-style SQL queries, but indexed.

- You need to do arbitrary relational constraints on top of text queries. Doing a query like "give me all the tickets closed in November by Hugh where the project was ProjectX and the ticket has the phrase 'nested loop' in its title OR its description, OR its title is less than 15 characters long" is going to be more or less impossible in any other environment. If you were building something like Jira, or really anything that needs to support relational logic, Postgres' full text search would be a great way to go.

- This is probably the most common case I would recommend it: You already are using Postgres, and you just want to add on some basic search functionality to data that is already in there. In that case, being able to integrate with your existing queries, and not having to support a separate indexing step and corresponding index-updating processes, is a huge win.
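To make that second case concrete, here is a sketch of a mixed relational/text query of the kind described above. All table and column names are illustrative (a hypothetical tickets schema with a precomputed tsvector column), not anything from the original post:

```sql
-- Relational constraints combined with full text search in one query.
-- Assumes tickets.search_vector is a tsvector maintained over title + description.
SELECT t.id, t.title
FROM tickets t
JOIN projects p ON p.id = t.project_id
WHERE p.name = 'ProjectX'
  AND t.closed_by = 'Hugh'
  AND t.closed_at >= DATE '2016-11-01'
  AND t.closed_at <  DATE '2016-12-01'
  AND (t.search_vector @@ plainto_tsquery('english', 'nested loop')
       OR char_length(t.title) < 15);
```

Doing the same thing with an external search engine would mean either denormalizing all of these relational fields into the search index or joining result sets across two systems.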




> This is probably the most common case I would recommend it: You already are using Postgres, and you just want to add on some basic search functionality to data that is already in there.

Perhaps this is true for more sites than you imagine - you started your comment saying that it is not a reasonable solution to most search use cases, but the vast majority of site search on the web is subpar and way below psql FTS. So for those sites FTS would be good enough, and a big improvement on what is there with minimal time investment (far less than installing, maintaining and learning to use something like solr).

I've replaced ILIKE-type searches with psql FTS on a few sites and been very happy with it. It certainly won't be suitable if your entire value proposition is centred around search, if you need configurable tokenisation, many languages, etc, but it is more than adequate for simple searches across text columns where you don't need the psql ranking, for example, but just want a list of hits to order as you see fit. If you don't want a JVM dependency on your server, and don't have high traffic or specialised search requirements, FTS is a good choice.

There is a huge long tail of sites with terrible search who could improve it by using FTS in postgres instead, and that's who this article is aimed at - not people who know what stemming is and why they should use it, or what faceted searches are (I didn't till I read your comment, thanks!). Imagine for example a startup who want to quickly enable a decent search of their blog posts - FTS is a good option which doesn't require external dependencies and is very quick to start with.

Thanks for the insight though on what the limitations of psql FTS are and where it starts to break down, it's interesting to hear this from someone who's tried better options and written a search service which was heavily used.


I'm not convinced that you can magically improve search on those 'tail' sites. Postgres' full text search is nontrivial to set up and use, even if you have some SQL experience. I've done substantial work on text search in the past, and I still have trouble following the docs for FTS. The interface is cumbersome at best.

Finally, if search on a site is especially bad, probably the people that made it can't do better, or don't care. I would suspect that most of them are in the "can't do better" camp. One of the not-so-surprising things I have learned about these sorts of programmers is, they don't spend any time reading up on programming related topics, so they'll never see blog posts like this!


> Postgres' full text search is nontrivial to set up and use, even if you have some SQL experience.

Seemed pretty straightforward to me.
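For anyone curious, the minimal version really is only a few statements. Table and column names here are illustrative:

```sql
-- Add a tsvector column, populate it, index it, and query it.
ALTER TABLE posts ADD COLUMN search_vector tsvector;

UPDATE posts
SET search_vector = to_tsvector('english',
      coalesce(title, '') || ' ' || coalesce(body, ''));

CREATE INDEX posts_search_idx ON posts USING gin (search_vector);

SELECT id, title
FROM posts
WHERE search_vector @@ plainto_tsquery('english', 'full text search');
```

The part people tend to find unintuitive is keeping search_vector up to date (via a trigger) and the tsquery operator syntax, not the initial setup.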


Elasticsearch is much more cumbersome to set up. I think Postgres wins in relative ease of setting up.


No it isn't. DEB/RPM package install and you're pretty much done.


I'm glad to see that you got the main goal behind this post. ES/Solr are great tools, but some projects may not need some of their features.


> You can turn stemming on and off, but you can't, for example, lowercase text, strip out html characters, handle special cases for tokenization (camel cased or delimited text, etc). You can write your own tokenizer, but it has to be as a C extension.

Of course this can be done without messing with C; you just massage your data in a trigger before updating the FTS index.
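A sketch of that approach, assuming an illustrative posts table; the regexes here are simplistic stand-ins for a real normalization pass:

```sql
-- Trigger function: normalize text (strip HTML tags, split camelCase)
-- before building the tsvector, with no C extension involved.
CREATE OR REPLACE FUNCTION posts_search_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector := to_tsvector('english',
    regexp_replace(                               -- split camelCase: FooBar -> Foo Bar
      regexp_replace(coalesce(NEW.body, ''),
                     '<[^>]+>', ' ', 'g'),        -- strip HTML tags
      '([a-z])([A-Z])', '\1 \2', 'g'));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_search_trg
  BEFORE INSERT OR UPDATE ON posts
  FOR EACH ROW EXECUTE PROCEDURE posts_search_update();
```

The caveat the grandparent raises still applies: whatever massaging you do here also has to be applied to the query text at search time.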


> Postgres has no facility to make it possible to do faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I've seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.

Why exactly doesn't GROUP BY work here?
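For reference, the straightforward GROUP BY version of a facet looks like this (illustrative schema). The objection is presumably about performance rather than possibility: this recomputes counts over every matching row on every query, whereas Solr/ElasticSearch compute facet counts inside the engine as part of the search:

```sql
-- Naive facet: count of matching products per brand.
SELECT p.brand, count(*) AS hits
FROM products p
WHERE p.search_vector @@ plainto_tsquery('english', 'laptop')
GROUP BY p.brand
ORDER BY hits DESC;
```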


> You can write your own tokenizer, but it has to be as a C extension.

Another option is to tokenize the data before insertion into PostgreSQL (which, ironically, I'm currently doing in one of my projects using Lucene).


Sure, but that gives away most of the benefit of using it. You usually have to have your tokenizer algorithm available at query time, so you can tokenize the query the same way.

You also have to undertake some batch process to reindex things that have been added/changed. (You don't have to handle deletes, which is certainly the trickiest thing to keep in sync.) But once you have some external cronjob or whatever running your code over all the full text entries, you might as well make that job insert the data into ElasticSearch. Most of the difficulty in setting up full text search outside of the DB is creating and managing/monitoring the process that shuttles the data out of the DB and into ElasticSearch, and if you are going to do most of the work anyway, you might as well get the benefit, rather than shuttling the data out of the database and then right back into it.


Hi Justin, do you have a link to a technical paper on the anonymous(?) click algorithm? I was curious if you applied it to domains other than products/shopping.


I linked to the patent application (which has been denied).

The algorithm is actually very very simple.

1. Produce a set of 'features' for every document. This is what you will index into the inverted index. Features can be words from the document, or they can be things like 'price_between_0_100'.

2. Build a set of pairs (search_phrase, clicked_on_document_id).

3. For each search_phrase in your set of pairs, run the search_phrase through your search engine. The set of all results we will call SET_A.

4. SET_B is the set of documents for a search_phrase that have been clicked on. SET_B is a subset of SET_A. Using mutual information, it is possible to determine which features best differentiate the documents in SET_B from those in (SET_A - SET_B) for a specific search_phrase. These are the features you will 'boost' when you see that search_phrase.
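Steps 2-4 could be sketched in SQL over a click log like this. Note this substitutes a simple clicked-rate score for the full mutual-information computation, and all table and column names are illustrative, not from the actual system:

```sql
-- Assumed tables (illustrative):
--   results(search_phrase, doc_id)   -- SET_A: all results from re-running each phrase
--   clicks(search_phrase, doc_id)    -- SET_B: the (phrase, clicked doc) pairs
--   doc_features(doc_id, feature)    -- features produced in step 1
--
-- Score each (phrase, feature) by the fraction of result docs carrying that
-- feature that were clicked; high scorers are boost candidates for the phrase.
SELECT r.search_phrase, f.feature,
       avg(CASE WHEN c.doc_id IS NOT NULL THEN 1.0 ELSE 0.0 END) AS clicked_rate
FROM results r
JOIN doc_features f ON f.doc_id = r.doc_id
LEFT JOIN clicks c  ON c.search_phrase = r.search_phrase
                   AND c.doc_id = r.doc_id
GROUP BY r.search_phrase, f.feature
ORDER BY clicked_rate DESC;
```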

As an example, one of the features this pulled out in practice was the word 'Oceanic' for the search_phrase "lost". It basically zeroed in on the fact that the word 'Oceanic' appearing in the description of a product differentiated 'good' results from 'bad' results for the query 'lost'.

Edit: Here is a hard to follow presentation on it: https://docs.google.com/presentation/d/1xzQbQ1cM3GnaS_LoCvmJ...

Edit2: Another presentation, easy to follow but without most of the technical depth: https://drive.google.com/file/d/0B_eNCTyNa0DHcy1UOGpSUk41NDA...


What about ILIKE '%KeyWords%'? I know it doesn't get rid of 'ing' suffixes, but what about performance?
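On its own, ILIKE '%...%' cannot use a B-tree index (the leading wildcard forces a sequential scan), but the pg_trgm extension can index it. Illustrative table/column names; note it still does no stemming:

```sql
-- Speed up ILIKE '%...%' with a trigram index.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX posts_body_trgm_idx
  ON posts USING gin (body gin_trgm_ops);

-- This can now use the trigram index instead of a sequential scan:
SELECT id FROM posts WHERE body ILIKE '%keyword%';
```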


This post shows the problem when people ignore theory. Elasticsearch and Lucene are simply inverted indexes. PostgreSQL full text search is also an inverted index, but there are some small architectural issues that make the difference.

Depending on your use case, you cannot replace like or ilike with full text search. You may need to use a trigram index, which you cannot do with ElasticSearch and Lucene.

PostgreSQL text search is absolutely good enough for a great many cases. Maybe not for specialized cases like yours, but for general search it is fine.


This is not a case of 'ignoring the theory', it's a case of you being pedantic.

I didn't say "replace 'like' queries with identical functionality", I said 'replace', as in 'supersede' and 'improve upon'. Except for insane cases such as the search 'ant' matching the word 'pedANTic', which is to say whenever whole tokens are intended to be matched, postgresql's full text search will replace 'like' queries for you with better functionality (such as matching non-continuous tokens in a search text) and better performance (assuming an index is created).

However, having full control of the tokenization pipeline means you can certainly satisfy any special cases your old 'like' queries would satisfy. For example, in Solr it is trivial to tokenize additionally on word/camel case/number-letter boundaries, so that the tokens generated from something like EM22C can be EM 22 C, EM22 C, EM 22C, EM22C, etc.

That is a specific example, so please don't respond that I am ignorant of the theory, I fully realize that you can index any string, not just the exact ones I mentioned there.

Speaking of being pedantic, Postgresql's full text search is not an 'inverted index'. It is not an index at all, although it can use indexes to improve performance. In addition to GIN (Generalized Inverted Index), it can also use GiST (Generalized Search Tree).
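Concretely, the same tsvector expression can be indexed either way (illustrative table name):

```sql
-- GIN: larger and slower to update, but faster for lookups; the usual choice.
CREATE INDEX docs_fts_gin
  ON docs USING gin (to_tsvector('english', body));

-- GiST: smaller and faster to update, but lossy, so matches are rechecked
-- against the heap at query time.
CREATE INDEX docs_fts_gist
  ON docs USING gist (to_tsvector('english', body));
```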


You seem to be implying that the primary purpose of like queries is to do full text search, which is of course not true. E.g. perhaps you need prefix matching, or perhaps you are implementing an autocomplete box where you want partial matching, or you are searching over filenames and want partial matching, the latter two both being "insane" cases in your world.

And yes, there are GiST indexes for text in addition to inverted indexes. I didn't bring these up because you are comparing primarily to products using inverted indexes. My point is to illustrate that these products are actually mostly the same, with differences primarily around tooling.

No one is arguing that Solr and Lucene don't have advantages over PostgreSQL's text search. But I strongly disagree with your claim that "full text search lacks the feature set to be a reasonable solution to most search use cases."

You are quick to claim that postgresql cannot use custom tokenizers, cannot scale, or doesn't have features, then in the same paragraph say that, well, yes, it can do these things, but it's harder to work with. You also wipe all the other advantages that postgresql brings under the rug with a dismissive "oh well, these are minor situations." I guess they are minor in your products, but they are certainly not minor in others.

If you only need text search, or are using a database other than postgres, then Solr and Lucene are great choices. On the other hand, if you are using PostgreSQL already, its full text search is entirely reasonable and feature rich for many if not most search cases.


> You may need to use a trigram index, which you cannot do with ElasticSearch and Lucene

You definitely can create an analyzer to generate trigrams in Elasticsearch. Unless you mean something different with "trigram index" than indexing trigrams?



