
PostgreSQL B-Tree index deduplication - petergeoghegan
https://blog.rustprooflabs.com/2020/09/postgres-beta3-btree-dedup
======
yxhuvud
This is an great change, but I wonder, will it apply retroactively to indices
from old databases or will the indices have to be rebuilt? Like, I understand
that just upgrading the version won't reduce the sizes, but will reading and
writing to the indices gradually reduce the size?

~~~
petergeoghegan
You will need to REINDEX after upgrading using pg_upgrade. Otherwise
deduplication won't be applied in the old indexes at all.

These days it's possible to use the CONCURRENTLY option with REINDEX, which is
much less disruptive.

------
andy_ppp
Cool, I’d love to know how this affects query performance, rather than just
index size?

~~~
redis_mlc
Postgres has a write amplification problem - when you update or insert a row,
it has to check all the indexes to see what was touched (unlike MySQL.) Uber
made this famous in a blog post where they migrated off postgres because of
it.

So with Postgres I try to keep the number of indexes per table down to around
5, but with MySQL InnoDB, 20 indexes is not a problem for many use cases.

The other little-known Postgres problem is that by default, installing it sets
up replication that writes a local file in a busy manner. This can cause
problems with fsck or on low-end SSD/flash storage resulting in corruption
that causes postgres restarts to fail hard, so disable that.

~~~
andy_ppp
I’m interested to find out why the community feel this answer is incorrect. I
don’t know enough about database internals to know! Would be good if someone
could help me to become more informed about this. I can’t personally see a way
that indexes wouldn’t be touched when inserts happen in MySQL but who knows?

~~~
jchb
Previously on HN:

Why Uber Engineering Switched from Postgres to MySQL (2016):
[https://news.ycombinator.com/item?id=17280239](https://news.ycombinator.com/item?id=17280239)

A PostgreSQL response to Uber [pdf]:
[https://news.ycombinator.com/item?id=14222721](https://news.ycombinator.com/item?id=14222721)

One important point is that Uber wrote this running on Postgres 9.2. The
latest Postgres release is now 12.4

~~~
andy_ppp
Thanks, I just always trust Postgres more from my experiences. Reading some of
the above I’m fairly unconvinced by Uber’s reasoning...

~~~
pgaddict
I'd say some of the limitations hit by Uber were real, and were a bit of a
PITA even before that. Some were improved significantly since then, others
still need to be considered when designing application accessing PostgreSQL.

I'd say it was a nice example what happens when you design application based
on how you imagine an ideal database would work, not realizing that there's a
lot of complexity and limitations. And then when you eventually have to face
reality you blame the database for having to do trade-offs and refuse to
mitigate the trouble by changing the application.

------
rossmohax
With deduplicatio enabled, how does it differ from GIN index now?

~~~
avremel
> The big difference is that duplicate index entries can still occur in B-tree
> indexes. GIN indexes have to integrate new entries into the posting list
> every time, which makes data modifications less efficient (this is mitigated
> by the pending list). In contrast, PostgreSQL deduplicates B-tree entries
> only when it would otherwise have to split the index page. So it has to do
> the extra work only occasionally, and only when it would have to do extra
> work anyway. This extra work is balanced by the reduced need for page splits
> and the resulting smaller index size.

[https://www.cybertec-postgresql.com/en/b-tree-index-
deduplic...](https://www.cybertec-postgresql.com/en/b-tree-index-
deduplication/)

