
The state of full text search in PostgreSQL 12 - fanf2
https://fosdem.org/2020/schedule/event/postgresql_the_state_of_full_text_search_in_postgresql_12/
======
vyruss
Thank you so much for your interest in my FOSDEM talk!

Slides link:
[https://fosdem.org/2020/schedule/event/postgresql_the_state_...](https://fosdem.org/2020/schedule/event/postgresql_the_state_of_full_text_search_in_postgresql_12/attachments/slides/4138/export/events/attachments/postgresql_the_state_of_full_text_search_in_postgresql_12/slides/4138/FTS.pdf)

Slideshare: [https://www.slideshare.net/vyruss000/the-state-of-full-text-...](https://www.slideshare.net/vyruss000/the-state-of-full-text-search-in-postgresql-12-226820234)

Video will appear under these URLs after mirrors are synced:

MP4:
[https://video.fosdem.org/2020/H.2214/postgresql_the_state_of...](https://video.fosdem.org/2020/H.2214/postgresql_the_state_of_full_text_search_in_postgresql_12.mp4)

WebM:
[https://video.fosdem.org/2020/H.2214/postgresql_the_state_of...](https://video.fosdem.org/2020/H.2214/postgresql_the_state_of_full_text_search_in_postgresql_12.webm)

------
idoubtit
In the later part of the slides, I learned that Postgres 12 now supports case- and accent-insensitive collations:
[https://www.postgresql.org/docs/current/collation.html#COLLA...](https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC)

For the many simple cases where a grep-like search is enough, my limited
experience with pg was far from satisfying. MySQL is simple to use for this
thanks to the awfully named `utf8mb4_unicode_ci` collation. With Postgres,
searching for "étranger" won't find "Etranger" or "Étranger" unless you put
some hard work into your database search system.

Unfortunately, it seems that PG12 won't change much in that aspect. From the
link above, you have to create the collation (which is already a heavy
restriction) and then "certain operations are not possible with
nondeterministic collations, such as pattern matching operations." Looks like
uncharted territory.
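For reference, the setup the linked docs describe looks roughly like this (a sketch based on the ICU examples in the PostgreSQL docs; the collation name is illustrative, and behaviour under a given locale string is worth verifying for your data):

```sql
CREATE COLLATION ignore_case_accent (
    provider = icu,
    locale = 'und-u-ks-level1',   -- primary strength: ignores case and accents
    deterministic = false
);

-- Equality comparisons now match across case and accents:
SELECT 'Étranger' = 'etranger' COLLATE ignore_case_accent;   -- true

-- But pattern matching (LIKE, ~) is not supported with
-- nondeterministic collations, as the docs warn.
```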

On a side note, I've always seen Solr, SphinxSearch and such specialized tools
used instead of the DB's FTS. The main drawback is keeping the data in sync,
but it has many pros.

~~~
irrational
    CREATE EXTENSION unaccent;

    BEGIN;

    CREATE OR REPLACE FUNCTION public.insensitive_query(text) RETURNS text AS $func$
        SELECT lower(public.unaccent('public.unaccent', CAST($1 AS text)))
    $func$ LANGUAGE sql IMMUTABLE;

    ALTER FUNCTION public.insensitive_query(text)
        OWNER TO [your user that has create rights on the database];

    COMMIT;

To use in a query:

    WHERE insensitive_query(some_table.name) LIKE insensitive_query('%Étranger%')

~~~
rakoo
If you use a function like this, it will need to be run for every row;
doesn't that defeat the whole point of having indexes?

What you can do is store the unaccented, simplified version of a word, which
I'm sure pg already does in a way.

~~~
cies
> If you use a function like this, it will need to be run [on each query] for
> every entry

~~~
rzwitserloot
No; if you create an index on the expression `insensitive_query(some_table.name)`
and you then run: `SELECT * FROM some_table WHERE
insensitive_query(name) LIKE insensitive_query('Étranger');`, the engine would
run the `insensitive_query` function once on the string `'Étranger'`, and then look
up that result in the index directly, which is O(log n) or faster. If you have
a million rows, that'd run the function only once (on `Étranger`) and do
about 20 lookups in the index before finding you the qualifying row(s).
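A sketch of that setup, assuming the `insensitive_query` function from upthread and a hypothetical `some_table(name text)` (note that a btree expression index only helps `LIKE` for left-anchored patterns; a leading `%` wildcard would need something like a pg_trgm GIN index instead):

```sql
-- Expression index over the canonicalized value.
-- text_pattern_ops lets the planner use the index for
-- left-anchored LIKE patterns regardless of locale.
CREATE INDEX some_table_name_ci_idx
    ON some_table (insensitive_query(name) text_pattern_ops);

-- An exact (case/accent-insensitive) match can use the index:
SELECT * FROM some_table
WHERE insensitive_query(name) = insensitive_query('Étranger');
```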

~~~
cies
Cheers. I did not get the index-on-a-derived-value bit.

------
sbarre
The actual slide deck (which is easy to miss) is here:

[https://fosdem.org/2020/schedule/event/postgresql_the_state_...](https://fosdem.org/2020/schedule/event/postgresql_the_state_of_full_text_search_in_postgresql_12/attachments/slides/4051/export/events/attachments/postgresql_the_state_of_full_text_search_in_postgresql_12/slides/4051/FTS.pdf)

~~~
jadbox
Really well put together. I learned several more things about PG text
searching from this.

From my personal experience working with dozens of California tech companies
of varying sizes, I'd say around 80% of the time I see ElasticSearch being
used when the product could have just used its existing PG cluster (and
existing team experience). I.e. so many companies pick up ElasticSearch only
because they aren't aware of PG's full-text search capabilities. Of course,
ES does have its advantages, but I feel those apply in very specific use-cases.

~~~
jordic
Postgresql is awesome, but I don't see any advantage when you need to add some
40 aggregations plus joins across 10 tables (easy if you denormalize into
Elastic).

~~~
mistahenry
I can think of a bunch off the top of my head that mostly stem from a serious
reduction in complexity:

- having one data store instead of two

- no data store synchronization and related consistency headaches

- the ability to easily join/restrict based on business logic (especially
access control)

- easier deployments

- easier QA/local setups

- fewer failure scenarios

- although non-standard, you're dealing with SQL and can EXPLAIN/ANALYZE your
queries with tools you already know if you understand Postgres

- the ability to extend my existing (Spring Boot) integration tests to cover
new search logic without having to worry about a separate test db

I've been building a travel blogging platform as a side project that has about
~250 users. Through a couple of choice materialized views to create search
documents from a number of different tables, Postgres FTS, and some GIN
indices, I was able to make a lightning-fast search with intelligent
suggestions (i.e. using user-specific restrictions such as follower/following,
places a user has posted, etc.).

Is my search perfect? No. I could do a better job normalizing diacritics,
stemming could be improved, etc. But it was dead simple to implement well
enough in a very short amount of time. I'm very happy with my current
solution. There's a time and place for a dedicated search solution but it's
certainly not in fledgling systems without massive load.
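The shape of that setup, sketched with a hypothetical `posts` table and column names (the view, weighting, and query are illustrative, not the commenter's actual schema):

```sql
-- Search document built from one or more tables via a materialized view.
CREATE MATERIALIZED VIEW post_search AS
SELECT p.id,
       setweight(to_tsvector('english', p.title), 'A') ||
       setweight(to_tsvector('english', coalesce(p.body, '')), 'B') AS doc
FROM posts p;

-- A GIN index makes @@ matches fast.
CREATE INDEX post_search_doc_idx ON post_search USING gin (doc);

-- Query, ranked by relevance:
SELECT id
FROM post_search
WHERE doc @@ websearch_to_tsquery('english', 'hiking iceland')
ORDER BY ts_rank(doc, websearch_to_tsquery('english', 'hiking iceland')) DESC;
```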

~~~
aldoushuxley001
Ever consider doing a write up of how you implemented your Postgres FTS?
Sounds like you’ve had a good experience with it.

------
rzwitserloot
This sounds like it's about the problem of: I have a column filled with the
last names of lots of people and I want a text search that will find these
names, but which takes into account all the various ways people end up
writing that name. So, a search for 'mueller' should also find 'Müller'.

However, that problem sounds nearly intractable to me.

Basically, it sounds like you just 'canonicalize' everything. You store,
alongside 'Müller', also the string 'mueller' (the rule being: first transform
all characters into their asciification, with the asciification of ü being
defined as 'ue', then lowercase that).

The problem is, of course.. that searching for 'MULLER' should also find this
person. And now we're more into trying to get this done by first
deconstructing the unicode into ascii + accent marks and then just wiping out
the accent marks.

But once we state that 'mueller', 'muller', and 'müller' need all be 'equal'
to each other, where does it end? A last name can be from any of a thousand
languages, each with their own unique conventions on how these things are
asciified, and any given asciification must match any other. Sounds like a
combinatorial explosion.

One could disregard this notion and say that 'muller' does not match 'müller'.
However, note that generally, Sjögren's syndrome tends to be asciified as
'Sjogren's syndrome', probably because Sjögren was swedish and I assume that
the swedish asciification of ö is o. The german asciification of ö is oe,
however; and I can't tell from a last name if it is germanic in origin,
scandinavian in origin, or from the land of fairytales and flying pandas, such
as Häagen-Dazs.

Just by thinking about this for a while I have concluded that it's hopeless,
but perhaps I'm missing some fancy trick to at least make this problem
tractable. So far my attempts to look for solutions tend to come down to advice
to use unicode normalization and then get rid of all accents, then lowercase
it, which obviously doesn't work at all (with that strategy, mueller would not
find müller), or heuristic solutions (if a lot matches, good enough), which
are their own can of worms.

Is there a solution to this to me seemingly impossible problem?

~~~
combatentropy
Have you looked into Soundex or Metaphone? I myself have not used them. I just
saw them mentioned in PHP's function list and thought they might be useful
some day.

- [https://www.php.net/manual/en/function.soundex.php](https://www.php.net/manual/en/function.soundex.php)

- [https://www.php.net/manual/en/function.metaphone.php](https://www.php.net/manual/en/function.metaphone.php)

PostgreSQL has an extension, fuzzystrmatch, that uses the same algorithms:

- [https://www.postgresql.org/docs/12/fuzzystrmatch.html](https://www.postgresql.org/docs/12/fuzzystrmatch.html)

But it warns that the functions "do not work well with multibyte encodings
(such as UTF-8)." So maybe first translate any such characters to ASCII?
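A sketch of that idea, combining unaccent (to asciify first, per the multibyte warning) with fuzzystrmatch's Double Metaphone; both extensions ship with Postgres, though whether the phonetic codes actually agree for a given pair of spellings is worth checking against your data:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Strip accents before computing the phonetic code,
-- then compare codes instead of raw strings:
SELECT dmetaphone(unaccent('Müller')) = dmetaphone('Mueller');
```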

~~~
bmn__
These only work with English, so not a general solution. There are more
modern, better tools for the problem.

------
jonatron
The video should be uploaded soon. See the FOSDEM video status page:
[https://review.video.fosdem.org/overview](https://review.video.fosdem.org/overview)

~~~
theandrewbailey
Thanks. I was just about to comment that the article seems to be missing a
video!

------
fourstar
Anyone here using
[https://github.com/zombodb/zombodb](https://github.com/zombodb/zombodb)?

~~~
FridgeSeal
I hooked it up and had a quick play around with it for a proof-of-concept
project at work. Unfortunately that project never went ahead in the form that
would have used Zombo, but during testing it was a breeze to set up and seemed
to do everything it promised more or less straight out of the box.

I'd definitely give it another, proper go if I got the chance.

~~~
fourstar
Thanks! Looks like I’ll invest some time into setting it up. Doing the current
tsvector with the English dictionary for stems and zombodb seems closer to
what I want.

~~~
zombodb
ZomboDB developer here. With ZDB you don't need tsvector at all. With a
properly defined mapping, Elasticsearch can do the stemming for dozens of
languages, including English.

ZDB exposes darn near _everything_ ES supports.

------
sandGorgon
Still no relevance algorithm support?

BM25 or TF-IDF would be welcome.

~~~
SigmundA
Would like to see this work integrated:
[https://github.com/postgrespro/rum](https://github.com/postgrespro/rum)

------
nreece
This looks great! Been considering moving from SQL Server to Postgres for our
.NET Core apps' storage and search, rather than use a separate service like
Manticore or Elastic. I wish Microsoft SQL Server had easier and better full-
text support, now that their cross-platform development framework and tooling
has improved so much.

