Curious how other people run these kinds of experiments on their production data in a safe way that doesn't affect the performance of prod. E.g. how do you avoid CPU spikes in prod that might affect users?
Start a new DB instance and restore all production data from the latest backup (you are doing regular backups, riiight?). This serves two purposes: it lets you debug against production data without actually touching production, and it verifies that your production backups/restores actually work in practice, not just in theory.
If the database is too big, just use the last N rows from the database instead of a full dump.
Bonus points if you log the actual SQL queries that happen in production to a log file, so you can then re-run the last X queries against your new test-production database and record the total run time. After doing a bunch of optimizations, you can re-run your little benchmark and see how it compares.
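If you go this route, one low-effort way to capture the queries is Postgres's own statement logging. A minimal sketch, assuming superuser access and that the extra log volume is acceptable on your workload:

    -- log every statement together with its duration (0 ms threshold);
    -- on a busy server this produces a lot of log output, so consider a
    -- higher threshold or enabling it only for a short capture window
    ALTER SYSTEM SET log_min_duration_statement = 0;
    SELECT pg_reload_conf();

    -- the captured statements can later be replayed against the clone
    -- (e.g. via psql) and timed before/after your index changes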
Be careful with the last N rows approach: you need an understanding of how the query scales to do this effectively (i.e. your N rows must be at an appropriate scale to optimise against). I assume that if you are at a small enough scale to be asking this question, restoring the whole backup is the best option.
What do you guys use for large PostgreSQL backups? pg_dump? Setting up replicas? Coming from MS SQL, backups seem to be a pain in the PostgreSQL world, especially on a live database that is receiving updates.
Yeah, if your database is under 1 TB, it's relatively easy and practical to just clone the full thing. Above that it starts becoming a hassle, and you might need to strip away some stuff you don't care about (and that won't affect the query plans), otherwise it can take a long time to back up and restore into a new instance.
Postgres can collect stats across all queries against the database [1]. GCP Cloud SQL has this enabled by default.
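Presumably this refers to pg_stat_statements. A minimal sketch of how you might use it, assuming the extension is available and Postgres 13+ (the timing columns are named total_time/mean_time on older versions):

    -- enable the extension (on self-managed instances it also needs to be
    -- in shared_preload_libraries, which requires a restart)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- top 10 queries by total execution time
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;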
You can do `select * from pg_stat_user_tables` to see, for each table, (1) how many sequential scans it has had, (2) how many rows those sequential scans have read, (3) how many index scans it has had, and (4) how many rows have been fetched via indexes.
You can also do `select * from pg_stat_user_indexes` to see (1) which indexes have or have not been used, (2) how many times each has been used, and (3) how many rows have been fetched using each index.
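For example (a sketch; the columns are the standard ones in these views, but the filters are arbitrary and the counters only cover the period since stats were last reset):

    -- tables that are read mostly via sequential scans
    SELECT relname, seq_scan, seq_tup_read, idx_scan
    FROM pg_stat_user_tables
    WHERE seq_scan > coalesce(idx_scan, 0)
    ORDER BY seq_tup_read DESC;

    -- indexes that have never been scanned (careful: unique/PK indexes
    -- still enforce constraints even if they are never used for reads)
    SELECT relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0;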
Using this information, you can deduce which indexes to add or remove, but you would still need to figure out which queries are not hitting indexes (an exercise for the reader).
If you save these stats once a day (e.g. using `pg_cron` to copy them to a new table with a timestamp), you will be able to monitor over time whether an index should be added or removed.
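Roughly, the `pg_cron` part could look like this (a sketch; the table and job names are made up, and the named-job form of cron.schedule assumes pg_cron 1.3+):

    -- history table with the same columns as the view, plus a timestamp
    CREATE TABLE index_stats_history AS
    SELECT now() AS snapshot_at, * FROM pg_stat_user_indexes
    WITH NO DATA;

    -- copy a snapshot every day at 03:00
    SELECT cron.schedule(
      'daily-index-stats',
      '0 3 * * *',
      $$INSERT INTO index_stats_history
        SELECT now(), * FROM pg_stat_user_indexes$$
    );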
It's great that you can do that. An experimental environment must be the same size as production.
Ideally it should be physically identical, keeping the same physical layout as prod (bloat, etc.), though sometimes that's not allowed.
Another problem is that for each experiment you might need a new environment. Plus, in larger teams, many people might want to do this work at the same time, so you end up combining efforts, synchronizing, and sharing environments. It slows you down, or increases your cloud spending a lot.
This is the kind of problem Steve Jobs described in his famous 1980 speech [1] about Apple: if we give each person their own computer, something special happens compared to 10 people sharing a single computer. It's the same here: if we can give each engineer their own full-size DB copy for experiments, something special happens. They start moving much, much faster. SQL quality improves. DB changes (DDL, massive DML) stop failing. Finally, engineers actually start learning SQL; it's an eye-opener, because they can now see how it behaves on large volumes (they didn't have good environments for experiments before!)
This is what we (Postgres.ai) have learned over the last couple of years developing Database Lab Engine [2]. It's an open-source tool for superfast cloning of Postgres databases of any size, based on copy-on-write provided by ZFS or LVM. On a single machine, you can have dozens of full-size clones up and running, conduct a lot of experiments at the same time, and still pay for only one machine. Clone provisioning takes only a few seconds; it feels like magic. But the main magic is how engineering processes change.
One of the great use cases is how GitLab uses it for Postgres query optimization and change control [3].
One challenge that Algorand (a cryptocurrency from MIT) has is that most queries on blockchain data end up going through either Rand Labs' Postgres servers or PureStake's. In my opinion, this is because there is no option for partial indexing: the Postgres instance stores either all transactions or none, which makes it impractical for most teams to set up and run on cloud servers.
They should add partial indexing, and in the long run they really should have a decentralized Indexer.
https://www.vldb.org/pvldb/vol9/p204-leis.pdf