Hacker News new | past | comments | ask | show | jobs | submit login
Postgres Full Text Search vs. the Rest (supabase.com)
338 points by kiwicopple 3 months ago | hide | past | favorite | 117 comments



One thing that isn't really covered in this article is relevance calculations.

A surprising (to me) thing about PostgreSQL FTS is that it doesn't do TF/IDF or BM25 relevance calculations.

These calculations take statistics about the entire corpus into account - they do things like ranking a document higher if it contains terms from the search which are statistically rare in the overall corpus.

PostgreSQL FTS uses how often the search terms appear in the document as part of the ranking score, but doesn't look at how common/rare individual terms are. https://www.postgresql.org/docs/current/textsearch-controls....

OpenSearch/Elasticsearch uses BM25: https://opensearch.org/docs/latest/opensearch/rest-api/expla...

SQLite stores these kinds of statistics and can support BM25 or TF/IDF or custom ranking functions. I wrote more about that here: https://simonwillison.net/2019/Jan/7/exploring-search-releva...

Meilisearch does something a bit different (on reading this I don't think it considers full corpus statistics, but I may be misinterpreting that): https://docs.meilisearch.com/learn/core_concepts/relevancy.h...

It looks like Typesense only considers the documents, not the overall corpus: https://typesense.org/docs/guide/ranking-and-relevance.html


It's curious they omitted MySQL, which is the usual point of comparison for Postgres.

> InnoDB full-text search is modeled on the Sphinx full-text search engine, and the algorithms used are based on BM25 and TF-IDF ranking algorithms. For these reasons, relevancy rankings for InnoDB boolean full-text search may differ from MyISAM relevancy rankings.

> InnoDB uses a variation of the “term frequency-inverse document frequency” (TF-IDF) weighting system to rank a document's relevance for a given full-text search query. The TF-IDF weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.

MySQL's FTS is fine. We're using it at work for fairly basic boolean searches on millions of documents, also retrieving the relevancy score, and it's plenty fast. We'll outgrow it one day, but for now it's pretty easy and does well enough.

https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.htm...


BTW: when you outgrow it, it will be sudden. It’s performance fails in a very non-linear way.

Edit: keep the index in memory


I thought the universal practice for getting good performance for relational databases, RDF, other graph, etc. data stores is to be able to hold indices in memory.


Yes, but paged b-trees are more graceful when they can’t fit in memory—which was the default when I cut my teeth on databases.

It’s also a reason why MyIsam separated data and index files. InnoDb is a clustered index, so all the data is packed next to the primary key.


Agreed that MySQL FTS is "fine". Have you run into issues with the default value of innodb_ft_result_cache_limit (2 GB)? That's bit us (OOM restarts) a couple times.


Yep. PG FTS works fine for a recall-based problem. If your data has a high cardinality then recall search works great. But with low cardinality data, you start needing BM25, ML, etc. The larger the dataset, the lower the cardinality in general. So PG FTS tends to work okay for small/medium datasets where most records are unique and you don't need advanced features like misspellings, synonyms, etc. Some of that advanced stuff is possible, but it's not out of the box and can be hard to set up.


What does you expression "recall search" mean? The corpus is so diverse that wrong results are impropable?


Recall referring to precision and recall [1]. Recall search meaning the use case primarily cares about finding all the results and less about ranking them correctly (precision). So find all the records containing an email address regardless of order is a recall problem

1. https://en.m.wikipedia.org/wiki/Precision_and_recall


> doesn't do TF/IDF or BM25 relevance calculations

In practice these are nearly worthless. Useful relevance ranking is difficult. Google sort of gamed it with PageRank (using inbound links). "Information Retrieval" by Buettcher et al is a good book on search implementation with a decent amount of info about relevance ranking, though maybe it no longer up to date: https://mitpress.mit.edu/9780262528870/information-retrieval...


Useful relevance ranking is definitely difficult, but a lot of people are getting something that they believe to be useful from, say, ElasticSearch or Solr, both of which use broadly TF/IDF-baesd algorithms.

Are you suggesting that you can do just as well without TF/IDF, or with postgres fts specifically? In some/all circumstances? I'd be interested in hearing more about that, if it comes from experience.

Looking at the table of contents for the Büttcher et al book, it looks to me like it covers TF/IDF-based algorithms pretty extensively. BM25 is in the table of contents specifically. Büttcher's own pedagogical search engine, _Wumpus_, includes a BM25 implementation. http://stefan.buettcher.org/cs/wumpus/docs/relevance.html


The only way I found to get any useful ranking in Solr was to implement application-level scoring of the documents based on semantic considerations of their contents, and also of relationships between the documents. And even that was pretty bad. I concocted some dubious schemes to condition those scores on external info as well, but the project ended before I got to try any of that.

Yes that book described tf/idf and bm25 but a bunch of other stuff too that was somewhat more promising. Really though, IMHO there is no getting away from understanding the actual documents, and possibly their connection with the outer world (pagerank being an example of the latter). For web search it's now probably worse than before, since instead of merely being overwhelmed by noise, the data is now actually adverserial in the sense of having SEO trying to game your ranking.

Even without that though, go on any retail site and try a search. The relevance ranking will be so awful that sorting by price or age or alphabetically will work a lot better. Same thing with the Algolia search here on HN. Chronological is almost always more useful than the search engine's idea of relevance. For automating relevance the scoring system needs much more semantic understanding of the data. Maybe that is more feasible with recent advances in NLP. I don't know whether that is good or bad.


As long as you can calculate a metric of importance for each of your rows, Postgres search rank seems to work plenty fine - we use Postgres FTS to power many of our search efforts and are replacing elasticsearch implementations because they’re harder to manage and no one seems to notice any quality differences. We just merely use a product of the importance metric and the ts_rank and that seems to do the trick.


In practice, it depends on the dataset size and use case. For web search? Mostly worthless, but can be a valuable signal to train ML. For small corpus of documents? BM25 alone does a pretty good job in general.



https://xapian.org/users.html

Also, it is used by the the public-inbox Perl scripts. See https://public-inbox.org

That includes the Linux kernel mailing list, Git mailing list and other lists available via lore.kernel.org:

https://lore.kernel.org

For example, to search the BPF mailing list for "bpf" and sort results by relevance:

https://lore.kernel.org/bpf/?q=bpf&r

To view the Xapian query parser operators:

https://lore.kernel.org/bpf/_/text/help


I think that isn't really built for large datasets. I had a buddy who used it a lot on smaller web sites though. I'm way out of date and this stuff might be horribly old fashioned. Vector search is the new hotness but on the other hand, search systems everywhere are getting worse, so who knows.


No, gmane was big enough. And it's far advanced over PostgresQL FTS or Elasticseatch


Postgres Extensions should allow you to make a TF/IDF and BM25 plugin pretty easily as well as dealing with the other relevance problems.


You can also do rapid prototyping using one of the scripting languages, then build a module after validating approaches. Python or JavaScript come to mind.


This is within reach due to the flexibility of features like setweight[0], but I don't know enough about TF/IDF or BM25.

Seems like an excellent weekend project.

[0]: https://www.postgresql.org/docs/current/textsearch-features....


I don't see how the setweight function could be used to give you something like a traditional TF/IDF algorithm, but perhaps I'm missing it. I would be interested in reading more about that if I am.


What I'm thinking is a little far out but it came up recently on a project where I'm using Postgres FTS (it's slow but I guess I might as well link it for now[0] -- please do not HN hug it).

Basically, I read on the internet (and was surprised by) the fact that setweight can be used and combined with individual terms on tsvectors, and then those tsvectors can be combined and they keep their weightings.

Some code from that project to illustrate:

    UPDATE podcasts
    SET fts_doc = setweight(to_tsvector(COALESCE(title, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(homepage_url, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_itunes_author, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_itunes_ownername, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_host, ' ')), 'A')
                  || setweight(to_tsvector(array_to_string(categories, ' ')), 'B')
                  || setweight(to_tsvector(COALESCE(description_html, ' ')), 'D')

Basically I'm making tsvectors out of chunks of the document, weighting them differently then recombining with other vectors without losing the weightings -- I'm thinking this could be applied to words identified by the corpus-level algos.

So my simplistic thinking here is that if you've done the corpus level processing, you could build an intermediate data structure and re-evaluate each search document with the appropriate weighting. It would likely be quite the lengthy stored procedure, but seems like setweight could support the usecase? Maybe I'm being a bit optimistic.

[0]: https://podcastsaver.com


If you could figure that out, it would be an awesome plugin.

PS podcastsaver looks neat!

some quick feedback:

1) your "switch back to light mode" icon looks a LOT like a gear for a settings menu. I turned on dark mode, did a search, saw the "back to light mode" icon and thought "huh, the dark mode toggle is settings now? Weird choice, let's see what's there..."

2) the show notes seem truncated. It would be helpful for me to be able to search the show notes for a defined set of podcasts. Sometimes I remember that a podcast mentioned a product or service that I wanted to check out, but I can't remember the name of the product or the overall episode, and it's painful to find the right one by scrolling back through everything in my pod catcher.

3) are you tracking Podcasts 2.0? Some interesting additional stuff to index there. https://origin.fm/blog/podcasting-2point0/


Sorry I just got around to implementing some of your feedback and didn't realize that podcasting 2.0 was the Podcast Index -- That is the main data source!


Thanks for the detailed feedback!

On (1) I can definitely see that — will fix!

(2) yeah I need to go to the source for that, I think podcast index data might have been why? I’m going to double check.

(3) no I’m not! Thank you for the pointer!

I’m going to work on all of this (and tackle the speed issue)


Hm, I'm not totally following, but... would you have to recalculate all row values every time the corpus changes? I guess that could work for a seldom-changing corpus, not sure how popular a use case that is. I suspect most people would not be interested in such an approach, instead either making do without TF/IDF, or moving to a non-pg solution.


> would you have to recalculate all row values every time the corpus changes

Yep, I mean this is always the case for corpus-level algos right?

No reason you can’t do it iteratively —- postgres has triggers…

Oh but actually thinking about it, it could be a function! You’d just need access to that intermediate representation.

> I suspect most people would not be interested in such an approach, instead either making do without TF/IDF, or moving to a non-pg solution.

Well people would be happy if it was there at all, I think. Then they could at least make the choice or have a decent option.

It probably won’t be as performant as other solutions which can make more drastic architecture changes but… might still be worth having


> Yep, I mean this is always the case for corpus-level algos right?

I am not sure which parts of which calculations lucene (Elastic Search and Solr) does on the fly vs pre-calculates after any change to corpus, because it's more or less transparent. I mean, I guess that's not entirely true -- there are definitely index-rebuilds that happen after updates, and for larger-scale things they can be resource-intensive enough that you have to account for them (for very small-scale things you can more or less ignore them), maybe it's just that Solr/ES have architectures built around accounting for that and giving you tools to deal with it with various approaches.


I agree with all this. As a heavy user of pg_search I can also say that it is also not nearly as good when you need to combine multiple full text indexes in very complex queries. A dedicated indexing tool seems to have so many advantages, with the only 2 disadvantages being complexity increase and a slight indexing delay.

Overall, pg_search is an amazing tool for low-complexity systems. Once you need really complex searching going on, a dedicated tool wins by a lot.


In case you skip to the comments like I usually do, this post is a light comparison of Postgres FTS vs:

  - MeiliSearch: https://www.meilisearch.com/
  - OpenSearch: https://opensearch.org/
  - SQLite FTS: https://www.sqlite.org/fts5.html
  - Typesense: https://typesense.org/
Some of the callouts from the results:

  - Even when consuming similar content, engines can produce different results, but generally ratios between queries on the same engine should be consistent.
  - Postgres FTS is quite close performance-wise to many other solutions, at least in their default configuration.
  - Only Typesense and MeiliSearch properly handled mis-spellings (the "suprman" query).
  - Typesense was relatively strict with matches compared to other engines.
  - OpenSearch was very fast with ingest, but also failed with the misspelling out of the box.
  - In-memory SQLite is by far the fastest, and PG isn't too far behind for this small data set.


there's also quickwit and sonic

different strengths and best use cases

https://jxyz.me/modern-search-engines


Yup! I'd heard of Quickwit and sonic, but Quickwit seems to have pivoted to being a log-search focused engine. It's built on Tantivy[0] IIRC so I could have used something like Toshi[1].

Sonic[2] I know much less about but it also seems good. Honestly anything except ES is what I like to hear about (though OpenSearch is interesting).

Another thing I think the world really needs is a CLI +/- API tool (ideally rust lib + CLI + API) that unifies interacting with these things. I got REALLY close to writing it while working on this article, but I was already running late and I have a penchant for yak shaving.

This won't be the last thing I write about search engines -- there's been a LOT of movement in the space that has nothing to do with the elastic/opensearch debacle and I don't see enough tires getting kicked.

[0]: https://github.com/quickwit-oss/tantivy

[1]: https://github.com/toshi-search/Toshi

[2]: https://github.com/valeriansaliou/sonic


I'll add: (only?) Meilisearch is fully multi-lingual.

I didn't dive into the various engines, but I was looking for one that would support Russian in a small side-project, and Meilisearch was the only one [1] that had it right there out of the box

[1] Criteria for "only one" where "out of the box, ease of operation, no fiddling with configs, if not directly inside DB then with an easy HTTP API"


Good to know. If you find some Russian language support improvement we can do, don't hesitate to create an issue on our tokeniser https://github.com/meilisearch/charabia.

Moreover, it's Hacktoberfest. If you want to help us improve the language support, it would be awesome!


So far it's been working like a charm :)


Sphinx and Manticore Search [1] have support of the Russian language out of the box for like 15+ years :)

[1] https://manticoresearch.com/


MeiliSearch is really nice to work with, can recommend it. I built a Twitter bookmark search recently and used PG + MeiliSearch after using PG FTS in another project. Very happy with that choice. I like PG FTS for some use cases (internal search where you don't need the best performance) just as it's very simple and works well enough.


MeiliSearch doesn't seem secure enough for my needs, so I'm opting to use Typesense instead. E.g. no encryption at rest in MeiliSearch and they have no intention to support it.


Hello, I'm Quentin, Meilisearch's CEO. Meilisearch Cloud is encrypted at rest. We're using, at this time, a k8s with replicated and encrypted persistent volumes. We just didn't talk about it yet.


Hi Quentin, thank you for your reply. I was just referring to this comment of yours: https://github.com/meilisearch/MeiliSearch/discussions/870#d.... It is good to hear that Meilisearch cloud is encrypted though. Will definitely keep an eye on Meilisearch for future projects.


Hi Quentin, 2 small questions. When can we expect cluster/sharding support? Glad to see there’s a new release to fix slow indexing issues and support has been added for Thai. Which languages still need adding? Keep up the good work!


Hello, I would love to know what are your needs/limitations that makes you ask for cluster/sharding. Meilisearch is cloud-native, so we don't need clusters of machines to provide a highly available solution. By default, our service is highly available and vertically scalable. We don't have geo-replication today, but it will come later in 2023. I don't know your need for scalability, but we can handle almost every use case. You may only be in an Enterprise plan.

We support almost all languages. The only thing is that in some languages, without help and intervention from the community, we stop when our level of comprehension is not enough. Today we handle perfectly Latin-based languages and all languages that are space separated. We have also worked with the community to improve Chinese, Japanese, Thai, and Korean, which is under review.


Let’s say I have 10TB of data that I want to make searchable, and want to self host, what are my options with Meilisearch?


Can I get your use case? What is the data you have to index? If it's logs, indeed, Meilisearch will not be the best fit. I would recommend https://quickwit.io. If it's structured or unstructured data, even if 10TB seems huge for our use cases, I would love to put you in contact with my engineering team. We would find a way to manage it.


There is "Run in production" guide in Meilisearch documentation https://docs.meilisearch.com/learn/cookbooks/running_product... It uses DigitalOcean, but self-hosting in the right environment should work in a close way.


Thanks for the link! I however still don’t know how I’d be able to scale meilisearch to index terabytes of data.


I'm curious also about your use case (disclaimer, I'm one of the cofounder of Quickwit). Do you want simple search with a basic UI? Or some dashboarding with grafana for example?

Thanks @qdequelen for pointing out Quickwit :)


The fact that misspelling need to be turned in open search does not mean anything "failed". If you want spelling correction, turn it on, don't call it failure.


that's a fair comment - I'll get the author to update the post (I can't edit my comment anymore though)

edit: fixed https://github.com/supabase/supabase/pull/9565


Pretty much all these engines (and PG) are fast enough.

Gist/Gin indexes are great and do a fine job to make millions of records searchable in very few milliseconds.

The problem is accuracy.

I’ve tried a few of these and accuracy is wildly different with most of them.

Accuracy depends on how much you index.

For example, even with decently designed weighting: If you index title, subtitle, tags and content — too much of content ruin relevancy.

And yes, we have proper relevancy sorting setup nicely.

The best is something like elastic search, but it does not integrate nicely with PG for our use case. Because of the multi tenant nature of our data setup.

PG will get us to that magic 80% but that leaves the all important 20% which is not great.

Not to mention the inability to index and search Asian character sets.

So even though we exclusively use it, it’s not great and every time our team can’t find something that drives my team into the Psql CLI — I start searching for alternatives again.

The big brick wall is updating, inserting and deleting from an external solution fast enough so we don’t miss stuff.

And quickly searching multi tenant records in that external solution.

And no I’m not ready to use ES as my primary database.


Yup this is exactly the problem, which is why I'm surprised with such decent extension support and heavy use no body has created something that builds better analysis and query pipelines specifically for full text search on top of these databases to solve the relevancy issues.


Do you two know about ZomboDB (https://github.com/zombodb/zombodb)?

Quite deep Postgres/Elasticsearch integration


That's really cool but that's not quite the solution to this problem that I would like.

I haven't used ZomboDB but I have managed plenty of applications where Postgres was the main db and elastic was used for FTS. Zombo looks like it makes it easier to do that type of setup, but Postgres is so high performance (at least at parity on speed w/ Elastic) and Elastic is such a pain in the ass to manage from a DevOps perspective that I'd like to eliminate the need for Elastic by investing some time into Postgres Extensions.

Plenty of great db solutions have come out of Postgres in the last few years based on extensions and fulltext search is one of the areas that has been very quiet, I think we can do better and I'd like to try.

I think in the interim Zombo looks like a really good stopgap though!


I wish we had an extension like ZomboDB but using a lighter search engine like https://github.com/quickwit-oss/quickwit, https://github.com/toshi-search/Toshi and https://github.com/mosuka/bayard

Here I'm listing engines based on https://github.com/quickwit-oss/tantivy - tantivy is comparable to Lucene in its scope - but I'm sure there are other engines that could tackle ElasticSearch.

Another thing that could happen is maybe directly embed tantivy in Postgres using an extension, perhaps this could be an option too.


Do you index keywords or bigrams? Bigrams provide context and help a ton with relevancy.


We have been looking at pg trigram to help improve relevancy.

It does take some migration but we are looking at that as an add-on.


Ah, I guess this term is overloaded. I am not familiar with the specific PG offerings, but more complex search engines are often based on indexing pairs of words (bigrams) as opposed to individual keywords. Pairs of words together are much more meaningful.


PostgreSQL has good enough built in search but I find it limited in large real world global user cases:

- CJK support is tricky for almost all of these search products, but postgresql is probably the only one that needs additional extensions, and some extensions are not updated for a while.

- Scaling is tricky, bc on one hand, your search and other normal queries may not have same load pattern, but you only have one schema in multiple nodes, so sometimes you have to waste some resource; on the other hand, dependencies of the 3rd party CJK extensions makes it hard to use managed instances and maintaining your own is time consuming.

- Most other search engines have well tuned APIs for typical use cases like filters/typo/etc, but you have to build your own in PostgreSQL. Of course it’s not entirely bad thing bc PostgreSQL does have most flexible query capabilities for those less than typical use cases.

End of day, search for end users have many semi-edge cases, one have to try all these engines to find the best fit. For large use cases, operational cost and user tolerance of search errors are all part of consideration. A conventional wisdom is do less work in the early days , and find better solutions later if have to.


This lines up with what I’ve seen. I did a full deep dive benchmark against Elasticsearch and a Postgres FTS instance a few months ago.

With some tuning and memory page size adjustments in Postgres you get very compelling speeds.

I’ve actually previously written a full twitter thread on a somewhat inactive account about why we don't use existing dbs for search more.

https://twitter.com/kinglycrow/status/1533270619353231360?s=...

That thread led me to a project/product idea where you take an existing Postgres instance used for normal products or whatever, replicate it to various read only clusters with a custom search extension loaded and some orchestrator sitting on top (I’ve written most of one in rust that uses 0mq to communicate with it’s nodes) and create drop in search from existing databases with a nice guided web gui for automatic tuning suitable for most business use cases.

It fell off when my friend who wanted to help work on it went off to Google and you know working on a search engine for them became a bit of a no no. I still think it's a great idea with a lot of value, I should circle back.

I also write (at length) about how ease of use will win search here: https://twitter.com/KinglyCrow/status/1532402654218964993


> That thread led me to a project/product idea where you take an existing Postgres instance used for normal products or whatever, replicate it to various read only clusters with a custom search extension loaded and some orchestrator sitting on top (I’ve written most of one in rust that uses 0mq to communicate with it’s nodes) and create drop in search from existing databases with a nice guided web gui for automatic tuning suitable for most business use cases.

Very interesting idea -- just want to add one thing, write it in rust (with pgx?[0]) :)

[0]: https://github.com/tcdi/pgx


I was unaware, and absolutely! Thanks for showing me this.


Absolutely -- I think the next wave of awesome pg extensions will be built on this, and I'm excited to use yours!


I built a website around pg full text search and hit a lot of weird corner cases that ultimately will require me to move to something like elastic search. It's very powerful and useful, but if your main product is search, I wouldn't use it as a foundational pillar.

They're fundamentally different products, but sometimes "pg full text search is good enough" isn't true.


I'm coming up to a large segment of work where we'll adding search with elasticsearch and this was the assessment I came to after reviewing PG FTS.

I would say, frankly, that if you already have a PG database and

1. you want better full text search than a non-existing solution.

2. you don't need the excellent searching like we've come to expect from search engines.

then use the full text search feature your PG database already has. I wouldn't necessarily go out of my way to use a PG database for FTS but I think its a good solution when you're in the right position to use it and its not too shabby.


I also had this experience. I read blog posts like this one that said "just start with postgres," so tried that, and spent a bunch of time with GIN and RUM indexes trying to work around edge cases before going to Elasticsearch, where things worked out of the box because it's designed to be a search engine. I'm sorry I can't remember specifics anymore -- it was stuff like filtering and result counts. There just tended to be more surprises because everything was a custom addition instead of a feature enabled by default. In retrospect for an evolving product, adding Elasticsearch was more of a one time up front integration investment, while working around edge cases in Postgres felt like an ongoing limitation we'd be stuck with.


You could have a bit of logic on top of PG to better handle the search query, for example using https://github.com/caub/pg-tsquery


Agreed, it's okay for an MVP but not okay for a real app.


The amount of time needed for pg fts integration is less than elastic, but if you don't replace it ~immediately~ after your MVP has been validated, you might as well have gone elastic from the beginning.


Since most deployed applications search with:

  "for item in ... if item.containsSubstring(query)" 
I would say it's definitely useful for real apps.


It's a spectrum . On the left is the incredibly common use cases of inefficient substring search over small datasets that works more often than O(1) purists would like. To the right of that is where db fts rules, it's a comfortable but small niche.

Once you start exiting that and start touching document search, being able to sort on relevance OR date with indexes, or any data ingestion killing search latency, you will have wished you went to the vast and uniform area to the right of db fts called elastic search. That is where I found myself.

If search is not integral to your product, by all means avoid the complexity and curve that is elastic and be happy that you did.


I use both ElasticSearch/OpenSearch and Postgres Full Text Search.

It's not even close: Postgres Full Text Search is a kludge compared to Lucene-based services. It is better than nothing, so, yes, I agree with the Pareto improvement idea. But it is also much, much worse than a Lucene-based alternative. If you need something and can't support having ElasticSearch/OpenSearch (heck, even Solr) running at the same time to support full text search, then sure, use Postgres FTS. But if the queries are still too slow or you need ranking or other text/information retrieval/NLP-type features, you'll want to give up on Postgres FTS and move to something else.


Hey everyone -- I did the exploration for this piece and enjoyed getting a chance to try so many of the new search solutions!

Feel free to ask me anything and let me know just how many postgres features I missed out on (pg_trgm is definitely one) -- I love to learn about corners of Postgres I've missed.


Can you elaborate what this means?

> Typesense was relatively strict with matches compared to other engines.

Perhaps an example to clarify the statement?


This had to do with the queries —- it returned less of the documents than almost anything else


We migrated FTS to OpenSearch from PostgreSQL a couple of months ago. Here the main issues we had with PostgreSQL FTS:

- cross-tables FTS requires at least materialized views, which lock data for writing at refresh. This was too much worrying for us.

- sorting by rank is not indexable, so we can't sort our dataset and have acceptable timings at the same time. Our dataset isn't enormous, but neither small (~1.5m records)


Large materialized views in general are really slow in PostgreSQL. If you go down the path of indexing joined data, consider database triggers or application-level hooks to update individual rows as-needed in a standalone search table. These are generally the same hooks necessary to update a document in OpenSearch.


You're right, that's actually what we implemented, application-level hooks, but they needed development and maintenance effort that come for free with the adapter we're using for OpenSearch integration, which also comes with welcome features: synonyms, partial matches, and many others.

Spoiler, the adapter is Searchkick: https://github.com/ankane/searchkick


Exactly that. Had the same issue with sorting. It is indexable though, and very fast, when using the RUM extension.


Never heard about RUM extension, I'll give it a look. Thanks!


Broad full text search is incredibly hard to do well. I've had to build, maintain, and improve multiple search systems. The difficulty is largely dependent on the context of the type of search problem you are solving. A lot of complexity depends on your answer to some of these questions:

  - Does the searcher already know the result they are looking for? (If yes, much easier)
  - Are there subjective and objective qualities of the results which should alter the search score, sometimes separate from the text being indexed? (If yes, much harder)
  - What is the quality of the text being indexed? (If end-user provided, this will vary widely)
Ultimately, building good search is often a struggle against providing the best possible results between searcher intent and incomplete document evaluation criteria. People never really think about when a search is working really well, but they definitely know and complain when it's working poorly.


How do you classify the difference (in problem/scope, etc) of type ahead versus full blown search? It feels like these systems can be grown almost completely differently -- you could hack together completely in-browser "search" with datalists[0] and just prune it actively (and fool most users, depending on how varied searches were).

I do wonder how much deep search really matters when people only really expect to look at the first page.

[0]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/da...


"Type ahead" or "autocomplete" is absolutely a different type of problem, and often simpler. This generally falls into the use case where the searcher already knows the specific item they are looking for. Often the results are objects owned by or known to the user in question, or you are searching through a very limited and relatively static set of documents and topics. Reference documentation for software often falls into this category.

In my experience, you don't have to spend a lot of time thinking about scoring and relevancy for these types of search. Generally you only want to include a small edit distance in the results at all to handle misspellings.

This is so vastly different when you have a corpus of millions of documents about an encyclopedia's worth of topics.

> I do wonder how much deep search really matters when people only really expect to look at the first page.

Getting the first page to have the best quality and relevancy is much more difficult if the user is searching through something like scientific papers, stock video footage. It is a challenge in bridging the distance between ideas and expectations.


> Only Typesense and MeiliSearch properly handled mis-spellings (the "suprman" query).

That's where trigrams come in. pg_trgm can fix mis-spellings, and even compound words ("super man" to "superman"). I opt for performing the search as entered, and use trigrams to offer suggestions to the user, e.g. do the search for "suprman", but show an option to search for "superman".


Absolutely -- this is something I didn't include in the article -- it wasn't really fast-search/type-ahead search per say.

Gitlab actually has a great article about this for those who want to read more:

https://about.staging.gitlab.com/blog/2016/03/18/fast-search...


I think it would be useful to include pg_trgm's performance vs others as well.

I'm not sure how it compares but in my use case with 200,000 rows it has been "good enough" for <100ms autocomplete.


I still look forward to RUM indexes [1] getting integrated into PG which would build the foundation for better ranking functions such as TF/IDF or BM25. PG seems to lag behind here and hasn't be a lot of movement in a while

1. https://github.com/postgrespro/rum


I love sqlite's FTS. I have a number of personal projects where I have an archive of data and want to occasionally be able to do ad-hoc searches over and get reasonable ranking and results. Throwing sqlite fts at the problem gets me that in just a few lines of code.

You can even combine it with a sqlite vfs[0] to run full text searches against a sqlite db stored in s3 relatively efficiently.

[0]: https://github.com/psanford/sqlite3vfshttp


That project looks neat! I could see it being useful in searching archived, infrequently accessed data.


I'm at least as interested in quality of results as I am in performance; full-text search is not something where there is an obvious single correct algorithm, and is something where results are going to vary between products (and depending on how they are configured, and what configuration options are available).

This is a lot harder to evaluate and compare because it's so not-black-and-white, but... I don't think most people are choosing a full-text search option based on latency alone, are they? While some products may have unsatisfactory latency, most popular products probably do okay, and once you have good enough latency, the issue is results quality.

Getting an overview of pg's fts capabilities, and a list of other products with similar, is I guess a useful starting point. But the article's focus on performance is not too useful to me; are there really an audience of people choosing an fts solution based mostly on performance? I want to know if pg can provide good enough results compared to fulltext-search-focused products -- it's true it's less clear how to measure that, and may depend on your exact situation. Which is why I'd be interested in reading from someone who has something to say on it!


I think Postgres Full Text Search is "good enough" for some internal use cases, but for user facing search I had a project where there are a lot of weird edge cases that eventually makes it worth it to move something dedicated. I believed we used `websearch_to_tsquery` and users complained often enough where we swapped it out for quickwit. A big issue I remember is when users wanted to search for something and order by another field (such as find all documents matching query "x" order by creation date), and PG performance just took a massive nosedive. Even though we knew where was one or two fields we wanted to co-index with FTS, you can't have compound indexes when using FTS.


Has anyone looked into Vespa? Yahoo's relatively recently (2017) open-sourced search engine. Seems incredibly full featured but also an exceptionally large code base (which is less than ideal). Still, worthy of inspection.


My experience with Postgres FTS (did a comparison with Elastic a couple years back), is that filtering works fine and is speedy enough, but ranking crumbles when the resulting set is large.

If you have a large-ish data set with lots of similar data (4M addresses and location names was the test case), Postgres FTS just doesn't perform.

There is no index that helps scoring results. You would have to install an extension like RUM index (https://github.com/postgrespro/rum) to improve this, which may or may not be an option (often not if you use managed databases).

If you want a best of both worlds, one could investigate this extension (again, often not an option for managed databases): https://github.com/matthewfranglen/postgres-elasticsearch-fd...

Either way, writing something that indexes your postgres database into elastic/opensearch is a one time investment that usually pays off in the long run.


I always thought it was interesting that full text search handles misspellings, as opposed to just telling the client that they've misspelled and to clarify their intent and retry.

From this it seems like the best solution would be to use SQLite's FTS and replicate the database using something like rqlite, litestream or mvsqlite, and then loadbalance the requests to SQLite.

Given that SQLite is serverless in the classic sense, seems like a nobrainer.


Postgres comes with a trigram module. Supports indexing and its much more stable in its result than relying on stemming especially if you're interested in misspellings, the similarity comes with variants with specialized behavior for total similarity, word similarity etc.

You might want to give that a chance.


This hasn't made the front page of HN (maybe someone should post it), but Litestream actually doesn't do replication that way any more -- I stumbled upon this commit like 2 days ago:

https://github.com/benbjohnson/litestream/pull/411

[EDIT] - I indulged myself: https://news.ycombinator.com/item?id=33204347


We use both Algolia and postgres FTS. While tsvector and GIN indices are fast, there’s a couple of downsides if you are looking for prefix based search on every key stroke.

1) It’s based on tokenizing input, and ‘english’ config will tokenize it in unexpected ways. We had people’s name being tokenized like “eddie” to “eddi” and searching for “eddie:*” doesn’t work. Even the ‘simple’ config will tokenize it undesirably. I couldn’t find the exact config to turn it off.

2) you can index multiple columns by concat strings, but when results come back, no easy way of knowing which fields matched where to highlight the search.

3) Typo tolerance isn’t there. You have to send possible typos with OR in query. Typesense achieves this with ART index where it’s very cheap to explore more paths of a trie.

4) it’s fast, but not get results on every keystroke fast. Prefix based searching is still fairly expensive compared to token based search.

Postgres FTS is good usecase if you want to search for whole words in large document.

Typesense is nice, but I wish they built an auto-sync with postgres so one doesn’t have to deal with ETL headache.


I don't have experience with PG FTS, but I've used SQLite FTS5 and love it for its ease of use. Also I really like its NEAR() queries. I was a bit surprised that the post doesn't mention this capability and compare it with the other implementations (i.e. is it available and how does it compare performance-wise).


What do near queries do?


It lets you specify a list of terms which mustn't be further apart from each other than a given number of tokens.


Does anyone have any comparable experience using FTS in DuckDB?

https://duckdb.org/docs/extensions/full_text_search


I've used Meilisearch for the last two years with around 300k items in the search index for a personal project and its pretty good. The earlier verisons were a bit rough but recently even indexing a ton of data is speedy.


What's the story with typos for meilisearch? Is it still one - two characters?

Been trying to find an alternative to mysql fts, and some typo tolerance / did you mean mechanism is needed for some of our clients.


Since its previous version, Meilisearch is offering a finer control over typo tolerance https://blog.meilisearch.com/typo-tolerance/


Cool! Meilisearch's CEO here. We have spent so much time improving indexation speed. I'm so glad the experience is better for you now! Even more improvements will be released in the future.


I've worked with FTS in the database before, and I'd much rather use Elastic.

There's an FDW for Elastic, so they work together quite nicely and you can have the best of both worlds.


Had a use case recently where a vendor dataset of approximately 1.5 million companies needed to be searchable. Fiddled around with Postgres for quite a while, not really managing to get good results while keeping query times below 1 second on average. Ended up going with elastic search instead but I hate the additional complexity. Anyone with similar experience or who actually made a similar setup work with pgsql?


We use Gin for searching cyber asset configuration at Resmo. It works good enough because users search for specific things like "logs bucket" or UUIDs and there are exact matches. Typos are not important. But if they'd be, Postgres search is not very helpful. But it beats hosting another thing just for this use case if you'd be happy enough.


Slightly off-topic but relevant to competing search engines: do any HNers have recommendations for FTS that runs in the browser (i.e. that can be "local-first")? I've been looking for one for my product and I need the best option when it comes to relevance.


Isn’t there a web standard like IndexedDB or something similar (maybe WebSQL?) that is basically a thin JS wrapper around SQLite? Wouldn’t that have full text in the browser?


Nope. Indexeddb is a key value database with indexes only on fields (no computer indexes), that may be backed by sqlite in some browsers, and levelDB in others. webSQL is gone as no-one wanted to build another sql standard and implementation just for browsers, they just wanted to use sqlite.

You can however implement a full text algorithm such as tf-idf atop indexedDB. It had reasonable performance when I made a toy version and fed Shakespeare's plays in as a corpus. Fun way to learn about stemming and building queries.


Oh also shameless plug -- If you liked all the F/OSS search engines in this post, you're going to love Awesome F/OSS[0][1] -- I'm going to highlight a new open source project every day for uhhh forever (terms and conditions apply).

I know of at least 5 more search engines I can't wait to share and they're burning a hole in my bookmarks folder.

Also, donating 60% of subscription revenue towards supporting the open source project of your choice (or one that was featured if you have no preference) -- you can sign up for that as well[2].

[0]: https://awsmfoss.com

[1]: https://twitter.com/awsmfoss

[2]: https://baserow.vadosware.io/form/Xv5rChuZb-YodDOKjpJJpuDhrE...


How does Groonga compare to it?


At this point why not compare it to grep on an SSD?


Are Solr and Lucene still a thing? I had very good experience with them many years ago.


Postgres Full Text Search has ALOT of limitations doesn't it?

I briefly looked into for storing long form text archive records for my company two years ago. There are EXTREME limitations in the source code around it that no one really talks about but have important implications.

Phrase searching doesn't really work the way our analysts would have liked and needed. There are a bunch of technical limitations in the source about how much data is tracked about the tokens. I can't remember exactly but there was something weird about stemming or lemming in the phrase search too.

The following variables need to be bumped up to get phrase searching more accurate.

- MAXSTRLEN (2047) https://github.com/postgres/postgres/blob/master/src/include...

- MAXSTRPOS (1048575) https://github.com/postgres/postgres/blob/master/src/include...

- MAXENTRYPOS (16363) https://github.com/postgres/postgres/blob/master/src/include...

- MAXNUMPOS (256) https://github.com/postgres/postgres/blob/master/src/include...

TsHeadline for highlighting doesn't consider phrase searching so you can get weird results. It probably needs to be rewritten to match websearch_to_tsquery.

The accuracy issue drained the blood from my BA's faces. I eventually just went with on-premise SOLR because it's easier to add new hardware for it than elasticsearch.

TLDR: postgres search is probably fine for short-form content, but major gotcha's once you go past those max limits. Also phrase searching will probably not work the way people are used too.


The combinatorics of full-text search are astronomical. A dashed-off comparison like this is pure noise. There are so many knobs to twiddle, and a nudge to any one of them can yield remarkably different results.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: