I've actually tried both Postgres FTS and Elasticsearch (in conjunction with a Postgres DB) and found Elasticsearch easier for search. It's much more robust - for my specific case I needed a bunch of Japanese text to be searchable alongside English text; Postgres has very poor support for this - and its API is excellent and it's very quick and painless to set up (documentation is not stellar though). You don't have to worry about things like vectors or indices with ES; you can just put the data in and it will be easily searchable, and it's miles more efficient when searching across large data sets. It's Java-based, so it is a pretty big resource hog though, so I can see the advantage of someone who just needed very lightweight search for their project using tsearch.
I'm in favor of Postgres, I make my living in large part from my expertise with it; but it rubs me the wrong way when people downvote an informative comment covering someone's experience where it didn't happen to be the right tool.
no_future, I apologise to you for the poor reception your comment got from immature people who don't understand how to discuss things and who cannot handle new information. This didn't use to be representative of the community here.
He was shadowbanned because of his previous trollicious comment, not the one below, so your assumptions here are incorrect. Perhaps the mods are a bit quick to shadowban (or the algos they use are) - I personally think it should be reserved for persistent trolls, but have some sympathy - moderating a community like this is a hard, thankless task and there is an endless supply of toxic trolls. I'm sure if he asks nicely they'd reinstate the account.
This wasn't a single misstep, it was just the first one that attracted much attention.
Even a broken clock is right twice a day.
I'd much rather filter out the trolling from the useful experiences myself than have somebody shadowbanned because people disagreed with them.
Remember, downvotes signal disagreement, not civility. This is a problem.
Background: I wrote a product search engine for a company called iTrackr using Postgres full text search, and later wrote the product search for milo.com using Solr. I also designed a simple (but very effective) algorithm to boost search rankings based on historical user click data, by generalizing click data onto the feature space of the document set (using mutual information). http://www.google.com/patents/US20140067786
So, why would I recommend against using Postgres' full text search:
- It is not configurable. A search engine like Solr or ElasticSearch allows you to configure the tokenization pipeline, what is indexed, how it is indexed, etc. Postgres hard-codes its tokenization pipeline. You can turn stemming on and off, but you can't, for example, lowercase text, strip out html characters, or handle special cases for tokenization (camel-cased or delimited text, etc). You can write your own tokenizer, but it has to be a C extension. Could you get it to do what you want? Mostly, but only with an order of magnitude more work than with Solr or ElasticSearch.
- Postgres has no facility for faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I've seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.
- Postgres does not give you much control over ranking. It has some good generic ranking methods, but those will only get you so far. (Query Amazon for 'ipad -dkfjdkf' to see how well its generic ranking function does.) No generic ranking function can do very well compared to something tuned to the domain, and Postgres just isn't flexible enough to extend in any reasonable way.
- It is not scalable. Solr and ElasticSearch are both very very easy to scale up (and out). Postgres' built-in replication is very heavy, as it has to stream all database changes to all slaves (yes, there are things like SLONY, but they are incredibly hard to deal with). In any real environment, you have to protect your database from load. It's the hardest thing to scale, and consulting it for all traffic is a surefire way to run into issues that are very hard to get back out of.
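For anyone unfamiliar with faceting (the histogram point above): it's the per-category hit counts shown alongside results ("Brand: Acme (2)"). A minimal Python sketch over a toy result set - all field names and data here are invented for illustration:

```python
from collections import Counter

# Toy search results; in a real engine each hit carries structured fields.
results = [
    {"title": "iPad case", "brand": "Acme", "price_bucket": "0-25"},
    {"title": "iPad stand", "brand": "Acme", "price_bucket": "25-50"},
    {"title": "iPad cable", "brand": "Globex", "price_bucket": "0-25"},
]

def facet(hits, field):
    """Histogram of a field's values over the current result set."""
    return Counter(h[field] for h in hits)

print(facet(results, "brand"))         # Counter({'Acme': 2, 'Globex': 1})
print(facet(results, "price_bucket"))  # Counter({'0-25': 2, '25-50': 1})
```

Solr and ElasticSearch compute these counts inside the index in a single pass over every matching document; emulating this in SQL over a large result set is the slow, complex part the comment refers to.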
Now, with all that said, I would recommend it for a few cases:
- You just want to do SQL queries with 'LIKE' queries, but indexed.
- You need to do arbitrary relational constraints on top of text queries. Doing a query like "give me all the tickets closed in November by Hugh where the project was ProjectX and the ticket has the word "nested loop" in its title OR its description OR its title is less than 15 characters long" is going to be more or less impossible in any other environment. If you were building something like Jira, or really anything that needs to support relational logic, Postgres' full text search would be a great way to go.
- This is probably the most common case I would recommend it: You are already using Postgres, and you just want to add some basic search functionality to data that is already in there. In that case, being able to integrate with your existing queries, and not having to support a separate indexing step and corresponding index updating processes, is a huge win.
Perhaps this is true for more sites than you imagine - you started your comment saying that it is not a reasonable solution to most search use cases, but the vast majority of site search on the web is subpar and way below psql FTS. So for those sites FTS would be good enough, and a big improvement on what is there with minimal time investment (far less than installing, maintaining and learning to use something like solr).
I've replaced ILIKE type searches with psql FTS on a few sites and been very happy with it - it certainly won't be suitable if your entire value proposition is centred around searches, if you need configurable tokenisation, many languages etc, but it is more than adequate for simple searches across text columns where you don't need to use the psql ranking for example but just want a list of hits to order as you see fit. If you don't want to have a JVM dependency on your server, and don't have high traffic or specialised search requirements, FTS is a good choice. There is a huge long tail of sites with terrible search who could improve it by using FTS in postgres instead, and that's who this article is aimed at - not people who know what stemming is and why they should use it, or what faceted searches are (I didn't till I read your comment, thanks!). Imagine for example a startup who want to quickly enable a decent search of their blog posts - FTS is a good option which doesn't require external dependencies and is very quick to start with.
Thanks for the insight though on what the limitations of psql FTS are and where it starts to break down, it's interesting to hear this from someone who's tried better options and written a search service which was heavily used.
Finally, if search on a site is especially bad, probably the people that made it can't do better, or don't care. I would suspect that most of them are in the "can't do better" camp. One of the not-so-surprising things I have learned about these sorts of programmers is, they don't spend any time reading up on programming related topics, so they'll never see blog posts like this!
Seemed pretty straightforward to me.
Of course this can be done without messing with C, you just massage your data on a trigger before updating the FTS index.
Why exactly doesn't GROUP BY work here?
Another option is to tokenize the data before insertion into PostgreSQL (which, ironically, I'm currently doing in one of my projects using Lucene).
You also have to undertake some batch process to reindex things that have been added/changed. You don't have to handle deletes, which is certainly the trickiest thing to try to keep in sync, but once you have some external cronjob or whatever running your code over all the full text entries, you might as well just make that job insert the data into ElasticSearch. Most of the difficulty in setting up full text search outside of the DB is creating and managing/monitoring the process to shuttle the data out of the db and into ElasticSearch, and if you are going to do most of the work, you might as well get the benefit, rather than just shuttling the data out of the database and then back into the database.
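The shuttle job described above can be sketched as a trivial watermark loop. This uses in-memory stand-ins for the database and the search index; the `updated_at` column and all names are assumptions for illustration:

```python
import datetime

# Stand-ins: "rows" carrying an updated_at timestamp, and a dict as the index.
db_rows = [
    {"id": 1, "body": "first doc", "updated_at": datetime.datetime(2014, 12, 1)},
    {"id": 2, "body": "second doc", "updated_at": datetime.datetime(2014, 12, 5)},
]
search_index = {}

def sync(last_synced):
    """Index every row changed since the previous run; return the new watermark."""
    changed = [r for r in db_rows if r["updated_at"] > last_synced]
    for row in changed:
        search_index[row["id"]] = row["body"]  # bulk-index call in real life
    return max((r["updated_at"] for r in changed), default=last_synced)

watermark = sync(datetime.datetime.min)
```

Note that deletes never show up in a "changed since" query, which is exactly why keeping deletes in sync is the tricky part: an external index needs tombstones or periodic full rebuilds to notice them.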
The algorithm is actually very very simple.
1. produce a set of 'features' for every document. This is what you will index into the inverted index. Features can be words from the document, or they can be things like 'price_between_0_100'.
2. Build a set of pairs (search_phrase, clicked_on_document_id).
3. For each search_phrase in your set of pairs, run the search_phrase through your search engine. The set of all results we will call SET_A.
4. SET_B is the set of documents for a search_phrase that have been clicked on. SET_B is a subset of SET_A. Using mutual information, it is possible to determine which features best differentiate documents from SET_B from (SET_A - SET_B) for a specific search_phrase. These are the features you will 'boost' when you see that search_phrase.
As an example, one of the features this pulled out in practice was the word 'Oceanic' for the search_phrase "lost". It basically zeroed in on the fact that the word 'Oceanic' appearing in the description of a product differentiated 'good' results from 'bad' results for the query 'lost'.
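A toy sketch of steps 1-4 in Python, using a tiny invented result set for the query 'lost' (the real system worked over the whole feature space; this only shows the mutual-information scoring itself):

```python
import math

def mutual_information(docs, clicked):
    """Score each feature by the mutual information between
    'document has feature' and 'document was clicked'.
    docs: dict doc_id -> set of features (SET_A); clicked: set of ids (SET_B)."""
    n = len(docs)
    scores = {}
    for feature in set().union(*docs.values()):
        mi = 0.0
        for has_f in (True, False):
            p_f = sum(1 for f in docs.values() if (feature in f) == has_f) / n
            for is_clicked in (True, False):
                p_c = (len(clicked) if is_clicked else n - len(clicked)) / n
                joint = sum(
                    1 for d, f in docs.items()
                    if (feature in f) == has_f and (d in clicked) == is_clicked
                ) / n
                if joint > 0:
                    mi += joint * math.log(joint / (p_f * p_c))
        scores[feature] = mi
    return scores

# Toy data for the query 'lost': the two clicked results mention 'oceanic'.
results = {
    1: {"dvd", "oceanic"},   # clicked
    2: {"dvd", "oceanic"},   # clicked
    3: {"dvd", "luggage"},
    4: {"dvd", "luggage"},
}
scores = mutual_information(results, clicked={1, 2})
# 'oceanic' gets a high score; 'dvd' (present everywhere) scores zero.
```

Features with high scores for a search phrase are the ones to boost: here 'oceanic' perfectly separates clicked from unclicked results, while 'dvd' carries no information about clicks.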
Edit: Here is a hard to follow presentation on it: https://docs.google.com/presentation/d/1xzQbQ1cM3GnaS_LoCvmJ...
Edit2: Another presentation, easy to follow but without most of the technical depth: https://drive.google.com/file/d/0B_eNCTyNa0DHcy1UOGpSUk41NDA...
Depending on your use case, you cannot replace like or ilike with full text search. You may need to use a trigram index, which you cannot do with ElasticSearch and Lucene.
PostgreSQL text search is absolutely good enough for a great many cases. Maybe not for specialized cases like yours, but for general search it is fine.
I didn't say "replace 'like' queries with identical functionality", I said 'replace', as in 'supersede' and 'improve upon'. Except for insane cases such as the search 'ant' matching the word 'pedANTic', which is to say whenever whole tokens are intended to be matched, postgresql's full text search will replace 'like' queries for you with better functionality (such as matching non-continuous tokens in a search text) and better performance (assuming an index is created).
However, having full control of the tokenization pipeline means you can certainly satisfy any special cases your old 'like' queries would satisfy. For example, in Solr it is trivial to tokenize additionally on word/camel case/number-letter boundaries, so that the tokens generated from something like EM22C can be EM 22 C, EM22 C, EM 22C, EM22C, etc.
That is a specific example, so please don't respond that I am ignorant of the theory, I fully realize that you can index any string, not just the exact ones I mentioned there.
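That kind of word-delimiter splitting can be roughly sketched in Python - Solr's WordDelimiterFilter does this properly with many more options; this regex version only illustrates the idea:

```python
import re

def delimit(token):
    """Split on letter/digit and lower/upper boundaries, roughly like
    Solr's WordDelimiterFilter, and emit the parts plus the original."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", token)
    # Index the sub-tokens, the original, and the concatenated parts so
    # 'EM22C', 'EM 22 C' and 'em22c'-style queries can all match.
    return sorted(set(parts + [token] + ["".join(parts)]))

print(delimit("EM22C"))      # ['22', 'C', 'EM', 'EM22C']
print(delimit("camelCase"))  # ['Case', 'camel', 'camelCase']
```

Because all variants are emitted at index time, a query tokenized the same way will overlap with at least one of them, whichever spacing the user typed.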
Speaking of being pedantic, Postgresql's full text search is not an 'inverted index'. It is not an index at all, although it can use indexing to improve performance. In addition to GIN (Generalized Inverted Index), it can also use GiST (Generalized Search Tree).
And yes, there are GiST indexes for text in addition to inverted indexes. I didn't bring these up because you are comparing primarily to products using inverted indexes. My point is to illustrate that these products are actually mostly the same, with differences primarily around tooling.
No one is arguing that Solr and Lucene don't have advantages over PostgreSQL's text search. But I strongly disagree with your claim that "full text search lacks the feature set to be a reasonable solution to most search use cases."
You are quick to claim that PostgreSQL cannot use custom tokenizers, cannot scale, or doesn't have features, then in the same paragraph say that, well, yes, it can do these things but it's harder to work with. You also sweep all the other advantages that PostgreSQL brings under the rug with a dismissive "oh well, these are minor situations." I guess they are minor in your products, but they are certainly not minor in others.
If you only need text search, or are using a database other than Postgres, then Solr and Lucene are great choices. On the other hand, if you are using PostgreSQL already, its full text search is entirely reasonable and feature rich for many if not most search cases.
You definitely can create an analyzer to generate trigrams in Elasticsearch. Unless you mean something different with "trigram index" than indexing trigrams?
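For context, a "trigram index" indexes overlapping 3-character substrings, which is what lets a fragment like 'ant' match inside 'pedantic'. A rough Python sketch of the idea - pg_trgm's actual padding and similarity formula differ slightly:

```python
def trigrams(s):
    """Character trigrams, padded roughly the way pg_trgm does it."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard-style overlap of trigram sets (pg_trgm's % operator is similar)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# The fragment 'ant' shares a trigram with 'pedantic', so it can match:
print(sorted(trigrams("ant") & trigrams("pedantic")))  # ['ant']
```

An n-gram analyzer in ElasticSearch produces essentially the same token stream at index time, which is why "trigram index" is achievable there too, just configured differently.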
First, I'm glad that this post made the HN first page for a second time.
I encourage you to read this comment because it summarizes well a few of the missing features in Postgres around search. https://news.ycombinator.com/item?id=8715624
It's not the first time that I've gotten similar feedback, and I understand that "Enough" is a subjective notion; maybe I should've picked another title like "Postgres full-text search for basic search needs".
Just to clarify, this post is not against ES or SOLR. These tools are great and really powerful, and obviously they will do search better than Postgres. ES/SOLR are built entirely around document storage and search, while on the other side, search in Postgres is only one feature among many.
I think there are multiple reasons why you would want to give Postgres search a go:
- You have no experience with ElasticSearch or SOLR
- The project is still early stage and you don't want to invest time/money in a secondary datastore (eg: ES cluster)
- You want to quickly put a search feature (maybe not perfect) in the hands of your users
- Team knowledge: you may want to be careful before introducing a new tech to your team and ensure that people understand it. There is nothing worse than introducing a tech and being the only one who can maintain it.
The tech world is big and isn't only about the startup use case. I've worked in very different environments: a financial institution, an open source company, a startup and a service company. A thing that I learned is that you don't always have flexibility. In some companies getting new servers is a long process, and even if I hate it, you have to deal with this constraint. Then you have two choices in front of you: you do nothing and wait, or you move to offer your users a workaround solution without any extra dependency ... That's what Postgres search can be in some use cases.
From my experience, if you are already using Postgres and you find the right documentation (this post tries to describe a start-to-end example), then I think you could get search working in one day of work without any extra servers or resources.
I don't have any experience with SOLR so I will have to pick ES, but if you have never used it then you are going to have loads of "fun":
- Sync. ElasticSearch IMHO cannot be used as a canonical datasource, so you will need to figure out a sync strategy. Some people may argue about that, but I don't think that ES started with this goal in mind. They may change direction now, but still, I wouldn't store my data only in ES.
- Hosting. If you are not using a hosted ES service, then you will need to learn to host and build a cluster and also monitor it.
- Learning to use it. After reading ~50 pages of doc, you start to figure out how to use type mappings, custom analysers and the 42 ways to do a search query.
I can only speak from my experience, but it took me ~2 weeks to get my head around all these points without any previous experience.
The project for which I used PG search and wrote this blog post recently moved to ES. More or less 1 year later, I'm glad that the PG solution was good enough until now. In our case we needed to support more languages (Japanese, ..) and I also wanted to get rid of a bunch of triggers.
The results are good and very fast, but for English the results are quite similar to the Postgres solution. I think that's mainly because I'm pretty much only using ES's basic features, but we have loads of room for improvement, which is a good position to be in.
Now there are a few other use cases where you may want to consider PG search:
- You are using MySQL's built-in search; moving to Postgres can give your search a bit more power.
- A side project using Postgres where you want to add a search feature.
- Search on machine output, a case where the analyser doesn't really matter and you don't need stemming; Lucene's tokenization and stemming features are great, but wasted here.
If you are interested in PG search and you want to know more about it, then I suggest you have a look at this blog post. I think that the author did an amazing job.
I don't know if people will be interested, but I've thought a few times about writing a follow-up blog post, "When Postgres search is not Enough", to describe how to integrate Postgres with ElasticSearch using the JDBC River to pull content from the DB.. Let me know if it's something that you would be interested in.
Sorry for the numerous grammatical mistakes, but I'm not a native English speaker and I don't want to spend too much time reviewing my comment's grammar.
Thanks again for upvoting this post onto the 1st page of HN.
>> You want to quickly put a search feature (maybe not perfect) in the hands of your users
If this is your goal, rather than simply keeping your project contained to postgres as much as possible, Swiftype would be a better solution. Building a whole full-text search engine in a day with Postgres is faster than you could reasonably do with ES or Solr, but if your goal is simply the functionality, you can get a search engine into your app in less than an hour with Swiftype. We're built on Elasticsearch and Lucene, so you can get all their power from a team that works on it full time.
If you're going to end up building a secondary system for searching anyway, I wouldn't recommend it.
Also, don't use MongoDB fulltext, it's just broken.
I created a Django app to help with this syncing - check it out at https://github.com/jaddison/django-simple-elasticsearch. I love feedback and pull requests!
It's just personal opinion, but I try to avoid having my application responsible for data integrity, so I went the way of the ES River plugin to pull data. I used the JDBC one, https://github.com/jprante/elasticsearch-river-jdbc ... I ran into some problems, but in the end it works quite well and doesn't need logic in the application to keep the data synced.
Two other reasons that I used River were to avoid making the app slower by saving data to ES, and to be able to run the application during development without needing ES installed, substituting the search with a stub.
I've thought of adding in support for pushing bulk index request data to redis (for example) so that an Elasticsearch river could pull from it; this would decouple the app somewhat - but not completely as you've noted. It would likely help with throughput however, and still provide you with the ability to do pre-processing on the data as needed within your app's/project's context.
The Elasticsearch JDBC river isn't as flexible for processing data if I'm not mistaken, as it doesn't have context for the data? Please correct me if I'm wrong, but it's somewhat limited?
Also, while you can have multiple queries, you cannot update an existing document (e.g. add extra fields).
The SQL query for me ended up being ~100 lines, but the model is quite complex, with multiple language support.
I'm not going to complain about it :)