Hacker News new | past | comments | ask | show | jobs | submit login
Full-Text Search in Django with PostgreSQL (paulox.net)
180 points by pauloxnet on Jan 6, 2018 | hide | past | favorite | 96 comments



I can deeply vouch for the performance and joy of implementing Full-Text Search with PostgreSQL and Django.

I just finished a a project where we chose Postgres's FTS over using Elastic Search. At the beginning, I was worried about what performance we'd see since we choose to not use ES. But after slight performance tweaking, we had even our least performing queries under 50ms.


You need to look at other stuff than performance - relevancy is probably the biggest thing when implementing search. Is it more relevant than what you experienced with ES?


I'd argue that relevancy is more your application application's design then the underlying system retrieving the results. For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.

You could lean on the relevancy strategies built in to ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself. Your millage may vary though, I'd never advocate reimplenting something that's already provided by your chosen tool.

The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases. If you're interested you can check out Postgres' search rank and query weighting configurations.


I disagree.

I've had to build complex queries against ElasticSearch and it is specifically designed for things like this. We had custom weightings so when you searched for certain natural keys associated with each item they would rank above everything else, and that is easily do-able with ES. Simultaneously, we would weigh results according to various metadata we had attached to each entry (audio stream languages, subtitles, content owner name, genre, etc.). And finally, if you searched for the name of the media (a movie or an episode in a TV show) the user would see all the matches ranked accordingly, but again weighed according to the content owner and various language features of that media file.

You can probably hack that together with PostgreSQL, but is basically one big query in ES. PG's FTS is still great; but its use-case is slightly different.


That’s cool and I hear you; that’s a complex relevancy definition.

( Maybe surprisingly? ) This is type of query is natively supported by Postgres. That support is robust and mature, you don’t have to hack it together.

ES is a great tool and it’s clear your a fan of it. If you’re interested, I’d recommend you look into Postgres capabilities. It’s not a replacement for ES by any means ( or even a competitor to in my opinion; Postgres isn’t even distributed ). But for specific use cases, you might find that Postgres capabilities surprise you!


(I did leave out some bits that made it more complicated than I indicated.) Also, this was years ago; I know PG has improved its FTS a lot since then. ES is just a useful tool. If you can express your problem in relational terms then a RDBMS is almost always the right choice.

By the way, I am a huge fan of PG and relational databases in general; PG, especially, is a great database, and the first tool I reach for when it comes to data storage. However, we had other requirements (aside from the complexity I left out) to do with versioning and so forth that swung in favour of ES. Ultimately the problem with FTS in RDBMS, for me, boils down to doing FTS across disparate -- let's call them 'documents' -- stored across multiple tables. Basically you have to use materialised views (with manual refreshing) or complex join mechanics that affect performance. Perhaps PG 10 has improved in this area also?


In my article the "document" contain data from different table and i stored it in specific column and it's very fast with a GIN index on it.


I was about to give him the same answer but you beat me to it. It’s not jus throwing stuff in a field in PG. You can weight multiple bits of data. You can even define multiple different search vectors in separate columns if you want to use different search styles in different situations.


Thanks! I replied to the OP; but weighing, although critical to our needs, was a small part of a larger problem space we had to solve. I hate introducing new technology unless it's strictly required, but we ran into limitations that forced us to down the road of using ES.


This is true.


Thanks for your reply , my opinion is th same


> You could lean on the relevancy strategies built in to ES, but in my experience you're better off understanding what relevancy means for your dataset and implementing a strategy yourself.

Ranking is hard. You SHOULD lean on the tools available in Lucene/Solr/ES. PG's ranking tools are a joke in comparison.

> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.

That simply isn't true.


I don't the sense of the article is that PG FTS is better than ES , but in some situation, as the one I illustrated in my article, you can implement a the same search function with both of them, but with if have PG already in your stack configuring and using it with Django is very simple and convenient.


> For example, putting the same dataset in Postgres or ES wouldn't make one deliver more relevant results given equal configurations.

This is not the case.

The OOTB search capabilities in ElasticSearch (even by default) far, far exceed what you get in PostgreSQL FTS.

Also you're completely contradicting yourself. You say you don't advocate reimplmenting something provided by the tool but then suggest doing exactly that.


But in many situation you don't need all the ES features and you can implement a quite good and fast FTS function directly with your PostgreSQL database if you already use it in your stack.


> The options and tools for configuring and tweaking relevancy between ES and PostgreSQL's FTS are surprisingly similar for many application use cases.

Maybe in very very limited scenarios, but in general, they aren't even close. PostgreSQL doesn't take corpus frequencies into account, which makes it pretty difficult to come anywhere near the relevance ranking quality of Elasticsearch (or any proper search engine).

In order to tell whether PG vs Elastic is appropriate for your use case, you need to do an evaluation. See: https://en.wikipedia.org/wiki/Text_Retrieval_Conference


I second this. You can't let your queries take unusable amounts of time, but below a certain threshold relevancy is infinitely more important.

I'm putting together a product which has a search feature and that uses Django + MySQL and I'm struggling with relevancy. I'd happily accept 500ms queries if that guaranteed me the relevant hit would be on the first page. That's FAR more usable than 50ms queries and then the relevant hit is on page 5.


Full text search in MySQL isn’t in the same ballpark as PG. Thats not a dig at MySQL, just praise for the quality of what you get from the PG implementation.


Why mysql? Search in pg is waaay better.


Yes I think you need both of them of course and I found it on my project with Django and PostgreSQL.


Second this, even if you are a PG fanboy and a search newbie, you need to pay attention to:

1. issues with i18n and l10n tokenization. Does PG support other languages?

2. At minimum you need to support tf-idf (or something better), it doesn't look like PG supports this either.

3. For extremely dumb ranking, you can have a render/engaged column in PG. For decent production stuff you need a decision tree ranker (or GBDT).

All in all, none of these are there in PG, I'm not familiar with Solr/Lucene either, but please educate yourselves before expressing such strong opinions marketed as the absolute truth.


PG FTS support other languages https://www.postgresql.org/docs/current/static/textsearch-ps... Anyway the point of my article is not that PG FST is better than ES, but that for a quite good and fast FTS function you can use only Django and PostgreSQL and most of the time you don't need all the other ES features and at the same time your stack will be easier to build and maintain.


I think search relevancy is very important, and I wrote in my article start using PG FTS had permitted to work on search relevancy because I had more time which I used before in ES configuration and maintain another layer in my stack.


Having implemented this for a client in the past I have to agree that it is a cheaper option than ElasticSearch, especially for smaller projects with a lower number of records to index.

ElastiSearch easily gets expensive and the search suggestion is pretty bad.


What does this mean ?

ElasticSearch starts off as a small Java application that wraps the Lucene library.

Obviously heap will increase with usage and number of documents but I am still confused how it is in any way "expensive".


The stack I proposed in my article is pretty simple: Django + PostgreSQL (DB + FTS).

In other project I used Elastic for the search function: Django + PostgreSQL (DB) + Haystack + ES (FTS).

Is obvious that the second solution is more expensive.


Thanks for your feedback, I obviously agree with you, but I'm starting to plan to use PG FTS with Django also in some bigger project. I hope to write another article about it in new future.


Depends on the data right. I've seen good performance on larger tables after using GIN indexing with records that rarely needs to be updated and simple queries. I'm not a expert in PostgreSQL by any means, but reducing cost and learning something never hurts.


I totally agree with you


Any tips on what is worth tweaking?


Absolutely.

- Using a `SearchVectorField` is a must after 500K rows.

- Make keeping this field up to date easy for yourself by populating it using `SeachVector` with a Django pre_save signal or PostgreSQL trigger. This reduces CPU utilization significantly as the parsing and tokenization of the field your searching on is done a head of time.

- Adding a GIN Index on your `SearchVectorField` column will improve performance dramatically.

- You should specify your language configuration for postgres FTS parser. The default parser doesn't do much. It just removes spaces and normalize case. Specifying a language lets the parser make heavier optimizations that noticeably improve performance and the quality of results. If you need support for more then one langue, Django already makes it easy for this configuration to be dynamic.


Why not just have a GIN index on the expression to_tsvector(body, 'english') or whatever? Then you don't need to maintain a separate column.


Not the OP, and I haven't used PG regularly for a few years, but back then I vaguely recollect the query optimizer not behaving consistently based on whether you had a full column with a GIN (or GIST) index and a (potentially partial) index on an expression for some reason. In a nutshell it preferred using the full column with an index rather than the expression index.

Even more importantly in some circumstances, having the full column allows the optimizer to pick another index when it's totally relevant, and filter the relevant rows without needing to recompute the TSV one by one for the subset.


If your "document" is based also on columns on other table, as in the example on my article, you can't have a GIN index on your expression, but you can have a GIN index on your specified column.


The Postgres docs suggest using another column. My guess is that an expression index would be too large if it held the tokenized value of all of your FTS documents. These things can and often are entire written documents. Imagine the index size for 2 Million rows of tokenized documents at 2,000 words each.

You might be able to get away with if if you were indexing a less then large amount of very small documents.

I like your expression index idea a lot.


But if you're maintaining a separate ts_vector column, which you then index, you're creating the exact same amount of index data.

Unless you're saying that you would populate the field only on some rows and not all of them, and control this from the app. But you could do that with an expression index, too, assuming the rule is a simple, pure function:

  CREATE INDEX index_posts_on_body
  ON posts (to_tsvector(body, 'english')) 
  WHERE published = true;
or similar.


Very good tips ;-)


I can report the same with both Rails and Elixir. People reach for outside search tools far too quickly.


You are right


Thanks, I'm happy to read similar experience from other developers.


Have worked with ES...

If you want the out-of-the-box distributed experience, then you can't go wrong with Elasticsearch because of the built-in sharding mechanism. However, figuring out how many shards you need for each index, where should each shard sits, and how to balance out the cluster has always been a trial and error exercise. Indices continue to grow like there is infinite space, and performance will degrade. What I find useful in the end is to have multiple ES clusters, which seems obvious right? One of the most frequently asked questions is "how many nodes do I need", and the answer is always "it depends" which is true, but gives me a chill. With Cassandra, from my experience, adding more nodes (scale horizontally) so data spread more thinly, and revisit some data structure change, would be an acceptable answer.

The other thing is a lot of folks I have worked with just dump stuff into ES because ES feels like a dumping ground for "json"-like documents (reminds of me MongoDB). Please compress, understand mappings and different fields to make the document smaller...

I am still very convinced SQL databases are robust enough to take any lazy dumping.



Its not while live indexing and has many caveats


Why not just run ElasticSearch on a single instance then ?

That way you're not having to completely redo your search infrastructure at some point in time.

You can seamlessly scale from 1 document to billions without any change to your code, backup approach, monitoring etc.


> Why not just run ElasticSearch on a single instance then ?

Sorry, I am not sure why you brought up a single instance. If you run on a single instance, you lose half of the benefits of ES which is trying to be highly available.

If we have an index which does not need any backups, single instance is fine, but for most people single instance is not an option. Furthermore, if one is already struggling with 3-4 nodes, how helpful is single instance?


I brought up single instance since you suggested using a SQL database most of which don't support a distributed deployment in the same way ElasticSearch does.

My point was merely that ElasticSearch scales from very small use cases to very large ones. You don't have to setup a cluster. And that using your SQL database isn't going to be significantly easier to manage.


> SQL database isn't going to be significantly easier to manage.

A SQL database doesn't run out of memory and crash when you add documents to it "too fast". ElasticSearch does.


We use Flask with SQLAlchemy, this guide on sqlalchemy-searchable:

https://sqlalchemy-searchable.readthedocs.io/en/latest/insta...

had us easily swap out Algolia for search with Postgres FTS.


Not familiar with SQLAlchemy. May I ask where do you host the search index and also why you decided to leave Algolia? Thanks.


Search index is created by adding a automatically populated search vector column to each searchable table in Postgres and appropriate SQLAlchemy models. Left Algolia partly because our old implementation was poorly done, overly complicated, and becoming difficult to maintain, but also because Postgres FTS could give us equivalent search capabilities (we didn't need all the features Algolia had to offer) for free.


I think everyone that use PostgreSQL FTS had similar stories and motivation to switch from external software/service.


Thanks for the link


Does anyone know how to play with relevance and spell corrections in FTS ?

In elasticsearch, relevance is Tf-IDF or BM25. I don't see anything similar in FTS.

Not sure how I can results around ranking as well as spell correction/suggestions simultaneously.


As far as I am aware, you cannot get ES/Solr quality ranking + fuzzy search out of the box.


I think it depends on your specific situation and your project requirements. In my project for example I had exactly the search quality I need with Django and PostgreSQL without installing another software (Elastic), using a different query syntax, synchronizing the data and installing another layer (haystack).

Maybe in other project ES will be a better solutions.


My comment was in response to a question whether PG can do TF/IDF or BM25 and fuzzy search (handle misspelling) simultaneously. And as far as I am aware, PG cannot do it. My comment had nothing to do with your blog post.


I used Elastic in past project as I reported in my article, but I never used relevance, can you add a link to the elastic official documentation about it ?


Depending on the ES version, it either does TF/IDF (older versions) or BM25 (newer versions) by default. You have to explicitly disable it if you don't want it.

https://www.elastic.co/guide/en/elasticsearch/reference/curr...


I might shamelessly plug my own book, Relevant Search (http://Manning.com/books/relevant-search) which introduces relevance on Elasticsearch.

Ping me (http://o19s.com/doug) if you’d like a discount code!


here - https://www.elastic.co/guide/en/elasticsearch/reference/curr...

similarity/relevance is the mathematical difference between search and database query. Its an inexact match.


Thanks. I'll read it.


If you want to reason about these things from first principles, then I recommend the IR book, which is free online: https://nlp.stanford.edu/IR-book/


Can you expand on the difficulty with synchronization? With ES I understand because you have to build it from scratch. However Solr has the DataImportHandler [1] which is made to sync data from a relational database.

1. https://lucene.apache.org/solr/guide/7_2/uploading-structure...


Not to discredit Django but if you're already using PostgreSQL you could use http://postgrest.com which is a more tightly knit component for fts.

With PostgREST you can avoid the abstractions overhead and just use plain SQL and REST calls, basically to do a fts using `to_tsquery` you would do:

CREATE TABLE blog_entry( headline text, body_text tsvector );

GET localhost:3000/blog_entry?body_text=fts.Cheese&select=headline


Thanks for your link. I read about Postgrest months ago and I think is a good software. I think Django is good for some problem and postgrest for other.


Why hasn't anyone brought up Django Haystack?

http://django-haystack.readthedocs.io/en/master/tutorial.htm...


I used haystack in different project to connect to Elastisearch as I wrote in my article, but in this article I wrote about Django Full-Text Search with PostgreSQL and you don't need haystack to use it, because there's a 'contrib' module in Django code https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/s...


Out of curiosity, which one of them is faster for doing complex search?


I'm sorry but "complex search" is too generic. I can say that in my project the PG FTS is quite fast but a lot less complex than ES


BTW I migrated the djangoproject.com from ES to PG FTS and it works great https://github.com/django/djangoproject.com/commit/1be9a483f...

I'll write another article about it and, if they vote it, I'll have a talk about it in the next PyCon Nove conference http://www.paulox.net/talks/#pycon-nove


Does the Django postgres module handle partial word lookups?


Considering your nickname, @OP, I assume you’re the author of the website: The site is currently broken on screens with a width of 1024px or less.

This screenshot was taken on my secondary monitor, which is 1280x1024 in portrait mode: https://i.imgur.com/n3LGCKM.png


It does something differently funky for me: https://i.imgur.com/gnXgBK5.png

Also I'm not a fan of all the visual noise, such as the orange paragraph markers next to every title.

EDIT: Replaced "fucking" with the intended "funky", I have no idea how that typo happened.


I'm seeing the same thing in Firefox 57.0.3 on Debian, if my viewport is more than or equal to 1320 pixels wide. It goes away for the text on screen if I force it to repaint, but comes back if I scroll or resize the viewport.


Thanks for the report.


Same problem here. Seems to be a font rendering issue, the strange artefacts disappear at certain zoom levels. My platform is Fedora 27 with Firefox 57.0.1.


Thanks for the report. I'm thinking to change font.


Thanks for your report, which browser are you using to take the screenshot ?

I used the default markdown plugin to generate paragraph markers and I found convenient to link directly to a paragraph.


I'm using Firefox 57 on Ubuntu. You may want to hide the markers until someone hovers over the title in question, like in the Django docs[1].

[1]: https://docs.djangoproject.com/en/2.0/


I updated markers as you suggested, thanks again http://www.paulox.net/2017/12/22/full-text-search-in-django-...


Looks much better, good job!


Thank you!


I'm using Firefox 57 on Ubuntu 17.10. Thanks for your suggestion about the paragraph markers. I'll update it. ;-)


A small visual complain. The Django logo is really too big. I am on a big screen, not sure if that's relevant. But it's preferable to scale back. Firefox 57 on Mac.


Thanks for your feedback. The django logo is an svg and i tested only on a HD monitor and i like it. Maybe in a bigger screen it's too big but I haven't worked a lot on the design of my blog. I hope you enjoyed the content more than the logo. ;-)


Yes! Sorry if I didn't mention it :-) thank you!


Works fine on mobile.

Until it's fixed you can probably just narrow your browser window until the hero image moves from the side to the top.


Thanks for the report. I'm happy it works well on mobile.


On my iPad I had to manually force it to reflow correctly.


Thanks for the report. I've never tested on a iPad.


Thanks for your and other report I found the issue and fixed it. I think now the layout is working correctly on low-width window https://i.imgur.com/BCaPTEt.png


Seems like OP is using one of the Pelican themes, most of them last maintained 5 years ago.


Yes, I used pelican with Flex theme for my blog. http://www.paulox.net/about/#blog

The Flex theme is not so old, but can be better, but I'm going to update it based on many report I received here.

BTW pelican site is quite fast https://developers.google.com/speed/pagespeed/insights/?url=...


Thanks for the report, I'm going to check this issue.

Can I ask which browser are you using ?


This is with latest Firefox (59) – it gives me a horizontal scrollbar, so I can also scroll to the left, then part of the content is cut off on the right instead.


You are right, thanks for the bug report.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: