Curious how other people run these kinds of experiments on their production data in a safe way that doesn't affect the performance of prod. E.g. how do you avoid CPU spikes in prod that might affect users?
Start a new DB instance and restore all production data from the latest backup (you are doing regular backups, riiight?). This serves two purposes: it lets you debug against production data without actually touching production, and it verifies that your production backups/restores actually work in practice, not just in theory.
If the database is too big, just use the last N rows from the database instead of a full dump.
Bonus points if you log the actual SQL queries that happen in production to a log file, so you can then re-run the last X queries against your new test-production database and record the total run time. After doing a bunch of optimizations, you can re-run your little benchmark and see how it compares.
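If you go this route, one low-effort way to capture the queries is Postgres's own statement logging. A minimal sketch, assuming superuser access and that the extra log volume is acceptable on your workload:

    -- log every statement together with its duration (0 ms threshold);
    -- on a busy server this produces a lot of log output, so consider a
    -- higher threshold or enabling it only for a short capture window
    ALTER SYSTEM SET log_min_duration_statement = 0;
    SELECT pg_reload_conf();

    -- the captured statements can later be replayed against the clone
    -- (e.g. via psql) and timed before/after your index changes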
Be careful with the last N rows approach: you need an understanding of how the query scales to do this effectively (i.e. your N rows must be at an appropriate scale to optimise against). I assume that if you are at a small enough scale to be asking this question, restoring the whole backup is the best option.
What do you guys use for large PostgreSQL backups? pg_dump? Setting up replicas? Coming from MS SQL, backups seem to be a pain in the PostgreSQL world, especially on a live database that is receiving updates.
Yeah, if your database is under 1 TB, it's relatively easy and practical to just clone the full thing. Above that it starts becoming a hassle, and you might need to strip away some stuff you don't care about (and that won't affect the query plans), otherwise it can take a long time to back up and restore into a new instance.
Postgres can collect stats across all queries against the database [1]. GCP Cloud SQL has this enabled by default.
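Presumably this refers to pg_stat_statements. A minimal sketch of how you might use it, assuming the extension is available and Postgres 13+ (the timing columns are named total_time/mean_time on older versions):

    -- enable the extension (on self-managed instances it also needs to be
    -- in shared_preload_libraries, which requires a restart)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- top 10 queries by total execution time
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;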
You can do `select * from pg_stat_user_tables` to see, for each table, (1) how many sequential scans it has had, (2) how many rows those sequential scans have read, (3) how many index scans it has had, and (4) how many rows have been fetched via indexes.
You can also do `select * from pg_stat_user_indexes` to see (1) which indexes have or have not been used, (2) how many times each has been used, and (3) how many rows have been fetched using each index.
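For example (a sketch; the columns are the standard ones in these views, but the filters are arbitrary and the counters only cover the period since stats were last reset):

    -- tables that are read mostly via sequential scans
    SELECT relname, seq_scan, seq_tup_read, idx_scan
    FROM pg_stat_user_tables
    WHERE seq_scan > coalesce(idx_scan, 0)
    ORDER BY seq_tup_read DESC;

    -- indexes that have never been scanned (careful: unique/PK indexes
    -- still enforce constraints even if they are never used for reads)
    SELECT relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0;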
Using this information, you can deduce which indexes to add or remove, but you would still need to figure out which queries are not hitting indexes (an exercise for the reader).
If you save these stats once a day (e.g. using `pg_cron` to copy them to a new table with a timestamp), you will be able to monitor over time whether an index should be added or removed.
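Roughly, the `pg_cron` part could look like this (a sketch; the table and job names are made up, and the named-job form of cron.schedule assumes pg_cron 1.3+):

    -- history table with the same columns as the view, plus a timestamp
    CREATE TABLE index_stats_history AS
    SELECT now() AS snapshot_at, * FROM pg_stat_user_indexes
    WITH NO DATA;

    -- copy a snapshot every day at 03:00
    SELECT cron.schedule(
      'daily-index-stats',
      '0 3 * * *',
      $$INSERT INTO index_stats_history
        SELECT now(), * FROM pg_stat_user_indexes$$
    );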
It's great that you can do that. An experimental environment must be the same size as production.
Ideally it should be physically identical, keeping the same physical layout as prod (bloat, etc.), though sometimes that's not allowed.
Another problem is that for each experiment you might need a new environment. Plus, in larger teams, many people might want to do this work at the same time, so you end up combining efforts, synchronizing, and sharing environments. It slows you down, or increases your cloud spending a lot.
This is the kind of problem Steve Jobs described in his famous 1980 speech [1] about Apple: if we give each person their own computer, something special happens compared to 10 people sharing a single computer. It's the same here: if we can give each engineer their own full-size DB copy for experiments, something special happens. They start moving much, much faster. SQL quality improves. DB changes (DDL, massive DML) stop failing. Finally, engineers actually start learning SQL; it's an eye-opener, because they can now see how it behaves on large volumes (they didn't have good environments for experiments before!)
This is what we (Postgres.ai) have learned over the last couple of years developing Database Lab Engine [2]. It's an open-source tool for superfast cloning of Postgres databases of any size, based on copy-on-write provided by ZFS or LVM. On a single machine, you can have dozens of full-size clones up and running, conduct a lot of experiments at the same time, and still pay for only one machine. Clone provisioning takes only a few seconds; it feels like magic. But the main magic is how engineering processes change.
One of the great use cases is how GitLab uses it for Postgres query optimization and change control [3].
One challenge that Algorand (a cryptocurrency from MIT) has is that most queries on blockchain data end up going through either Rand Labs' Postgres servers or PureStake's. In my opinion, this is because there is no option for partial indexing: the Postgres instance stores either all transactions or none, which makes it impractical for most teams to set up and run on cloud servers.
They should add partial indexing, and in the long run they really should have a decentralized Indexer.
https://www.vldb.org/pvldb/vol9/p204-leis.pdf