I can deeply vouch for the performance and joy of implementing Full-Text Search with PostgreSQL and Django.
I just finished a project where we chose Postgres's FTS over Elasticsearch. At the beginning, I was worried about the performance we'd see since we chose not to use ES. But after slight performance tweaking, we had even our worst-performing queries under 50ms.
You need to look at more than performance: relevancy is probably the biggest thing when implementing search. Is it more relevant than what you experienced with ES?
I'd argue that relevancy is more about your application's design than the underlying system retrieving the results. For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.
You could lean on the relevancy strategies built into ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself. Your mileage may vary, though; I'd never advocate reimplementing something that's already provided by your chosen tool.
The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases. If you're interested, you can check out Postgres's search rank and query weighting configurations.
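For reference, the weighting and ranking mentioned above look roughly like this with Django's `django.contrib.postgres.search` module. This is an illustrative sketch, not runnable outside a configured Django project; `Post` is a hypothetical model with `title` and `body` text fields:

```python
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# Weights "A".."D" map to PostgreSQL's tsvector weight labels:
# here, title matches count more than body matches.
vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("postgres")

# Post is a hypothetical model; rank and order results by relevance.
results = (
    Post.objects
    .annotate(rank=SearchRank(vector, query))
    .filter(rank__gte=0.3)
    .order_by("-rank")
)
```

The `0.3` cutoff is arbitrary; tuning it against your own dataset is part of the relevancy work discussed above.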
I've had to build complex queries against ElasticSearch, and it is specifically designed for things like this. We had custom weightings so that when you searched for certain natural keys associated with each item, they would rank above everything else, and that is easily doable with ES. Simultaneously, we would weight results according to various metadata we had attached to each entry (audio stream languages, subtitles, content owner name, genre, etc.). And finally, if you searched for the name of the media (a movie or an episode of a TV show), the user would see all the matches ranked accordingly, but again weighted according to the content owner and various language features of that media file.
You can probably hack that together with PostgreSQL, but it's basically one big query in ES. PG's FTS is still great; its use case is just slightly different.
That’s cool and I hear you; that’s a complex relevancy definition.
(Maybe surprisingly?) This type of query is natively supported by Postgres. That support is robust and mature; you don't have to hack it together.
ES is a great tool and it's clear you're a fan of it. If you're interested, I'd recommend you look into Postgres's capabilities. It's not a replacement for ES by any means (or even a competitor, in my opinion; Postgres isn't even distributed). But for specific use cases, you might find that Postgres's capabilities surprise you!
(I did leave out some bits that made it more complicated than I indicated.) Also, this was years ago; I know PG has improved its FTS a lot since then. ES is just a useful tool. If you can express your problem in relational terms, then an RDBMS is almost always the right choice.
By the way, I am a huge fan of PG and relational databases in general; PG, especially, is a great database, and the first tool I reach for when it comes to data storage. However, we had other requirements (aside from the complexity I left out) to do with versioning and so forth that swung in favour of ES. Ultimately the problem with FTS in RDBMS, for me, boils down to doing FTS across disparate -- let's call them 'documents' -- stored across multiple tables. Basically you have to use materialised views (with manual refreshing) or complex join mechanics that affect performance. Perhaps PG 10 has improved in this area also?
I was about to give him the same answer but you beat me to it. It's not just throwing stuff in a field in PG. You can weight multiple bits of data. You can even define multiple different search vectors in separate columns if you want to use different search styles in different situations.
Thanks! I replied to the OP, but weighting, although critical to our needs, was a small part of a larger problem space we had to solve. I hate introducing new technology unless it's strictly required, but we ran into limitations that forced us down the road of using ES.
> You could lean on the relevancy strategies built in to ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself.
Ranking is hard. You SHOULD lean on the tools available in Lucene/Solr/ES. PG's ranking tools are a joke in comparison.
> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.
I don't think the sense of the article is that PG FTS is better than ES, but that in some situations, like the one I illustrated in my article, you can implement the same search function with both of them; and if you already have PG in your stack, configuring and using it with Django is very simple and convenient.
> For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.
This is not the case.
The OOTB search capabilities in ElasticSearch (even by default) far, far exceed what you get in PostgreSQL FTS.
Also, you're completely contradicting yourself. You say you don't advocate reimplementing something provided by the tool, but then you suggest doing exactly that.
But in many situations you don't need all the ES features, and you can implement a quite good and fast FTS function directly with your PostgreSQL database if you already use it in your stack.
> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.
Maybe in very very limited scenarios, but in general, they aren't even close. PostgreSQL doesn't take corpus frequencies into account, which makes it pretty difficult to come anywhere near the relevance ranking quality of Elasticsearch (or any proper search engine).
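To make the corpus-frequency point concrete, here is a toy tf-idf sketch (plain Python, not PG or ES code): a term that occurs in every document gets an idf of zero, so it cannot discriminate between documents, which is exactly the corpus-level signal described above.

```python
import math

# Toy corpus of three "documents".
docs = [
    "postgres search postgres index",
    "elastic search cluster",
    "search search search",
]

def tf_idf(term, doc, corpus):
    """Score one term in one document against a corpus."""
    words = doc.split()
    tf = words.count(term) / len(words)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in corpus if term in d.split())
    # Rare terms get a large idf; ubiquitous terms get idf = 0.
    idf = math.log(len(corpus) / df)
    return tf * idf

# "search" appears in every document: idf = log(3/3) = 0, so even a
# document that is nothing but "search" scores zero for that term.
print(tf_idf("search", docs[2], docs))                 # 0.0
# "postgres" appears in only one document, so it is discriminative.
print(round(tf_idf("postgres", docs[0], docs), 3))     # 0.549
```

A ranking function without the `df` statistic (per-document term frequency only) would treat both terms alike.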
I second this. You can't let your queries take unusable amounts of time, but below a certain threshold relevancy is infinitely more important.
I'm putting together a product which has a search feature and that uses Django + MySQL and I'm struggling with relevancy. I'd happily accept 500ms queries if that guaranteed me the relevant hit would be on the first page. That's FAR more usable than 50ms queries and then the relevant hit is on page 5.
Full text search in MySQL isn't in the same ballpark as PG's. That's not a dig at MySQL, just praise for the quality of what you get from the PG implementation.
Second this: even if you are a PG fanboy and a search newbie, you need to pay attention to:
1. Issues with i18n and l10n tokenization. Does PG support other languages?
2. At minimum you need to support tf-idf (or something better); it doesn't look like PG supports this either.
3. For extremely dumb ranking, you can have a render/engaged column in PG. For decent production stuff you need a decision-tree ranker (or GBDT).
All in all, none of these are there in PG. I'm not familiar with Solr/Lucene either, but please educate yourselves before expressing such strong opinions marketed as absolute truth.
PG FTS supports other languages: https://www.postgresql.org/docs/current/static/textsearch-ps...
Anyway, the point of my article is not that PG FTS is better than ES, but that for a quite good and fast FTS function you can use only Django and PostgreSQL; most of the time you don't need all the other ES features, and at the same time your stack will be easier to build and maintain.
I think search relevancy is very important, and as I wrote in my article, starting to use PG FTS let me work on search relevancy because I had more time: time I previously spent on ES configuration and on maintaining another layer in my stack.
Having implemented this for a client in the past I have to agree that it is a cheaper option than ElasticSearch, especially for smaller projects with a lower number of records to index.
ElasticSearch easily gets expensive, and the search suggestion feature is pretty bad.
Thanks for your feedback. I obviously agree with you, but I'm starting to plan to use PG FTS with Django in some bigger projects too. I hope to write another article about it in the near future.
Depends on the data, right? I've seen good performance on larger tables after using GIN indexing, with records that rarely need to be updated and simple queries.
I'm not an expert in PostgreSQL by any means, but reducing cost and learning something never hurts.
- Using a `SearchVectorField` is a must after 500K rows.
- Make keeping this field up to date easy for yourself by populating it using `SearchVector` with a Django pre_save signal or a PostgreSQL trigger. This reduces CPU utilization significantly, as the parsing and tokenization of the field you're searching on is done ahead of time.
- Adding a GIN index on your `SearchVectorField` column will improve performance dramatically.
- You should specify your language configuration for the Postgres FTS parser. The default parser doesn't do much: it just removes spaces and normalizes case. Specifying a language lets the parser make heavier optimizations that noticeably improve performance and the quality of results. If you need support for more than one language, Django already makes it easy for this configuration to be dynamic.
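A minimal sketch tying these points together (model name and fields are hypothetical; assumes Django's `django.contrib.postgres` app is installed, and is not runnable outside a configured project). One caveat: on first insert, a pre_save signal fires before the row exists, so a post_save update or a PostgreSQL trigger is the safer variant:

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class Article(models.Model):
    title = models.TextField()
    body = models.TextField()
    # Precomputed tsvector: parsing and tokenization happen at write
    # time, not on every search query.
    search_vector = SearchVectorField(null=True)

    class Meta:
        # GIN index on the precomputed column.
        indexes = [GinIndex(fields=["search_vector"])]

@receiver(post_save, sender=Article)
def update_search_vector(sender, instance, **kwargs):
    # Language configuration specified explicitly, per the tip above.
    Article.objects.filter(pk=instance.pk).update(
        search_vector=SearchVector("title", config="english")
        + SearchVector("body", config="english")
    )
```

A database trigger would additionally cover writes that bypass the ORM (bulk updates, raw SQL).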
Not the OP, and I haven't used PG regularly for a few years, but back then I vaguely recollect the query optimizer not behaving consistently depending on whether you had a full column with a GIN (or GiST) index or a (potentially partial) index on an expression. In a nutshell, it preferred using the full column with an index rather than the expression index.
Even more importantly in some circumstances, having the full column allows the optimizer to pick another index when it's totally relevant, and filter the relevant rows without needing to recompute the TSV one by one for the subset.
If your "document" is also based on columns from another table, as in the example in my article, you can't have a GIN index on your expression, but you can have a GIN index on your dedicated column.
The Postgres docs suggest using another column. My guess is that an expression index would be too large if it held the tokenized value of all of your FTS documents. These things can be, and often are, entire written documents. Imagine the index size for 2 million rows of tokenized documents at 2,000 words each.
You might be able to get away with it if you were indexing a less-than-large number of very small documents.
But if you're maintaining a separate ts_vector column, which you then index, you're creating the exact same amount of index data.
Unless you're saying that you would populate the field only on some rows and not all of them, and control this from the app. But you could do that with an expression index, too, assuming the rule is a simple, pure function:
CREATE INDEX index_posts_on_body
ON posts USING GIN (to_tsvector('english', body))
WHERE published = true;
If you want an out-of-the-box distributed experience, you can't go wrong with Elasticsearch because of the built-in sharding mechanism. However, figuring out how many shards you need for each index, where each shard should sit, and how to balance out the cluster has always been a trial-and-error exercise. Indices continue to grow as if there were infinite space, and performance degrades. What I found useful in the end is to have multiple ES clusters, which seems obvious, right? One of the most frequently asked questions is "how many nodes do I need", and the answer is always "it depends", which is true, but gives me a chill. With Cassandra, in my experience, adding more nodes (scaling horizontally) so data spreads more thinly, and revisiting some data structure choices, would be an acceptable answer.
The other thing is that a lot of folks I have worked with just dump stuff into ES, because ES feels like a dumping ground for "json"-like documents (reminds me of MongoDB). Please compress, and understand mappings and the different field types, to make the documents smaller...
I am still very convinced SQL databases are robust enough to take any lazy dumping.
> Why not just run ElasticSearch on a single instance then ?
Sorry, I am not sure why you brought up a single instance. If you run on a single instance, you lose half of the benefits of ES, which tries to be highly available.
If you have an index that doesn't need any backups, a single instance is fine, but for most people a single instance is not an option. Furthermore, if one is already struggling with 3-4 nodes, how helpful is a single instance?
I brought up a single instance since you suggested using a SQL database, most of which don't support a distributed deployment in the same way ElasticSearch does.
My point was merely that ElasticSearch scales from very small use cases to very large ones. You don't have to set up a cluster. And using your SQL database isn't going to be significantly easier to manage.
The search index is created by adding an automatically populated search vector column to each searchable table in Postgres and to the appropriate SQLAlchemy models. We left Algolia partly because our old implementation was poorly done, overly complicated, and becoming difficult to maintain, but also because Postgres FTS could give us equivalent search capabilities (we didn't need all the features Algolia had to offer) for free.
I think it depends on your specific situation and your project requirements. In my project, for example, I got exactly the search quality I needed with Django and PostgreSQL, without installing another piece of software (Elastic), using a different query syntax, synchronizing the data, and installing another layer (Haystack).
Maybe in other projects ES will be a better solution.
My comment was in response to a question whether PG can do TF/IDF or BM25 and fuzzy search (handle misspelling) simultaneously. And as far as I am aware, PG cannot do it. My comment had nothing to do with your blog post.
I used Elastic in a past project, as I reported in my article, but I never used relevance. Can you add a link to the official Elastic documentation about it?
Depending on the ES version, it either does TF/IDF (older versions) or BM25 (newer versions) by default. You have to explicitly disable it if you don't want it.
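For the curious, here is a minimal sketch of the BM25 per-term score with the commonly cited default parameters (this is an illustration of the formula, not Lucene's exact implementation):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document."""
    # Rare terms get a higher idf, as in tf-idf.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates (controlled by k1) and is normalized by
    # document length relative to the corpus average (controlled by b).
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Going from 1 to 2 occurrences of a term helps a lot ...
gain_low = (bm25_term_score(2, 100, 100, 1000, 10)
            - bm25_term_score(1, 100, 100, 1000, 10))
# ... but from 9 to 10 occurrences the score has mostly saturated.
gain_high = (bm25_term_score(10, 100, 100, 1000, 10)
             - bm25_term_score(9, 100, 100, 1000, 10))
print(gain_low > gain_high)  # True
```

That saturation is one of the practical differences from plain TF/IDF: keyword-stuffed documents stop gaining rank.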
Can you expand on the difficulty with synchronization? With ES I understand, because you have to build it from scratch. However, Solr has the DataImportHandler [1], which is made to sync data from a relational database.
Thanks for your link. I read about PostgREST months ago and I think it is good software. I think Django is good for some problems and PostgREST for others.
I used Haystack in a different project to connect to ElasticSearch, as I wrote in my article, but in this article I wrote about Django full-text search with PostgreSQL, and you don't need Haystack for that, because there's a 'contrib' module in the Django code:
https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/s...
I'll write another article about it and, if it gets voted in, I'll give a talk about it at the next PyCon Nove conference:
http://www.paulox.net/talks/#pycon-nove
I'm seeing the same thing in Firefox 57.0.3 on Debian, if my viewport is more than or equal to 1320 pixels wide. It goes away for the text on screen if I force it to repaint, but comes back if I scroll or resize the viewport.
Same problem here. Seems to be a font rendering issue, the strange artefacts disappear at certain zoom levels. My platform is Fedora 27 with Firefox 57.0.1.
A small visual complaint: the Django logo is really too big. I am on a big screen, not sure if that's relevant, but it would be preferable to scale it back. Firefox 57 on Mac.
Thanks for your feedback. The Django logo is an SVG and I tested it only on an HD monitor, where I like it. Maybe on a bigger screen it's too big, but I haven't worked a lot on the design of my blog. I hope you enjoyed the content more than the logo. ;-)
Thanks to your and other reports, I found the issue and fixed it.
I think the layout now works correctly in low-width windows.
https://i.imgur.com/BCaPTEt.png
This is with the latest Firefox (59) – it gives me a horizontal scrollbar, so I can also scroll to the left, and then part of the content is cut off on the right instead.