
Improved vacuum in PostgreSQL 13 - amitkapila
https://amitkapila16.blogspot.com/2020/05/improved-autovacuum-in-postgresql-13.html
======
ComodoHacker
Among modern successful RDBMSes employing MVCC, only PostgreSQL (correct me if
I'm wrong) stores stale row versions in the table pages themselves. Other
vendors store them in some physically separate place (tempdb in MSSQL, the
UNDO tablespace in Oracle, etc.). This reduces both regular read (scans) and
write (page splits) I/O and eliminates the need to vacuum (more I/O).

Of course this approach requires some maintenance of that separate place. But
such maintenance doesn't compete for I/O with the regular workload and doesn't
hurt performance.

So after decades of progress and improvement, it looks like this latter
approach has turned out to be more efficient and is winning out historically.

~~~
jsmith45
Actually MSSQL still has a vacuum system under the hood ("ghost record cleanup
task") that is needed even if you have disabled all MVCC features. It is not
as critical as in PostgreSQL, since there will be fewer rows to clean up.

It runs automatically in the background, with rather conservative settings
that usually prevent it from having much perf impact.

The system also ties into MVCC because deleted rows in the main table cannot
be removed while any query can still access some other version of that row.
The same is true when changing any key column in an index. In those cases, the
update is a delete and insert within that index. And the deleted row must
stick around if there is any transaction that could read it.

So while SQL Server does not get as many rows that need to be vacuumed as
PostgreSQL does, just storing the versions separately certainly does not
completely eliminate the need for after-the-fact cleanup.
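
For anyone who wants to see this in action, SQL Server exposes the ghost
record backlog per index (a minimal sketch; the table name is made up, and
the ghost counts are only populated when you ask for 'DETAILED' mode):

    SELECT index_id,
           ghost_record_count,           -- deleted rows awaiting cleanup
           version_ghost_record_count    -- ghosts retained for row versioning
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.my_table'), NULL, NULL, 'DETAILED');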

~~~
grogers
This is exactly what MySQL (and I think Oracle) does. The final delete of a
row remains tombstoned in the main table until a background process determines
it is safe to remove based on transaction visibility (MySQL calls this purge).

It seems the important difference is whether the table needs to be scanned to
find dead rows. In Postgres it seems the answer is yes. MySQL keeps this info
in a separate scratchpad so it can quickly find the rows to remove once they
are allowed to be removed.

Both approaches have advantages and disadvantages. For example, in MySQL, if
you have a very long running transaction such that it can't purge rows for a
long time, you can actually overflow the rollback segment onto disk and
purging slows down dramatically. You might even need to pause writes to get it
to recover, although I haven't ever seen it get that bad. With a table scan
approach it degrades more gracefully in that scenario.
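
If you want to watch for this, InnoDB's "history list length" is the usual
proxy for how far purge has fallen behind (the metric name is real and
enabled by default; nothing else is assumed):

    -- Reported in the TRANSACTIONS section as "History list length":
    SHOW ENGINE INNODB STATUS;

    -- Or query it directly:
    SELECT `count` FROM information_schema.INNODB_METRICS
    WHERE name = 'trx_rseg_history_len';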

~~~
evanelias
One annoying wrinkle in MySQL's purge implementation: it won't purge any old
row versions that are _newer_ than the oldest open transaction's snapshot,
even if there are no transactions that could ever possibly access those rows.
[1]

As you noted, this can get painful since purge effectively gets blocked -- if
you have a high-volume OLTP workload with heavy emphasis on quick small
UPDATEs / DELETEs, and then introduce a single very long-running read
transaction (such as a logical dump/backup), purge can't make progress. But
this is just due to the current implementation. In theory a smarter system
could differentiate between row versions actually needed by old transactions
vs ones that are inaccessible to all transactions.

[1] https://bugs.mysql.com/bug.php?id=74919

------
Epskampie
I just tried postgres yesterday for the first time after years of being on
mysql only.

I immediately got some warnings in pgAdmin that some tables need to be
vacuumed. Is this really necessary? Can I just vacuum everything at night
with a cron job?

~~~
fabian2k
In general you don't need to vacuum manually; it happens automatically in the
background. If you have a lot of activity on the DB, or fit into certain
corner cases, you may have to tune the autovacuum parameters. The defaults
are generally getting better with each Postgres release.

And if you want to play with the autovacuum settings, you really should read
up on them in detail. There are some counterintuitive aspects, e.g. in many
cases if you have problems with autovacuum you should vacuum more, not less.

One exception: after you restore a backup, you might want to run "VACUUM
ANALYZE" so that you get good statistics immediately. That will also happen
in the background later, but until then you can get weird query plans.
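
Concretely, right after the restore finishes:

    -- Vacuum every table and collect fresh planner statistics right away,
    -- instead of waiting for autovacuum/autoanalyze to get around to it:
    VACUUM ANALYZE;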

~~~
welder
Normally you only need to adjust autovacuum settings per table if your tables
have a very large number of rows:

https://lob.com/blog/supercharge-your-postgresql-performance/
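
To make that concrete, per-table tuning is just storage parameters (a sketch;
the table name and numbers are purely illustrative). On a huge table the
default trigger of "20% of rows are dead" means autovacuum waits far too
long, so you switch to an absolute threshold:

    -- Fire autovacuum after ~100k dead rows instead of 20% of the table:
    ALTER TABLE big_events SET (
        autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold    = 100000
    );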

------
PudgePacket
Maybe a silly question, but what is the point of an insert-only table? Surely
it must be read/used in some capacity? Do they mean "insert heavy"?

~~~
brianwawok
Means not updated. Say, web visitor logs. No reason to ever update a previous
log. What happened happened.

~~~
PudgePacket
.

~~~
CameronNemo
Selects would still work. But non-insert DML (delete, update) would be
disallowed, I assume.
(delete, update).

------
stormdennis
Some argue that having a separate vacuum process is a bad idea: that the
right time to clean up a row version no longer tied to any transaction is the
next time someone visits the row in a query. That way there are no extra
database accesses to each row beyond normal use.

~~~
jeltz
I see several pretty big issues with that. It would leave empty rows and
pages around, since there is no guarantee that there will ever be a scan
which hits a particular page containing deleted rows. Additionally, if you
have recently deleted a lot of rows, the next sequential scan could become
very expensive: suddenly a read query needs to dirty a lot of pages, maybe
more pages than the number of available buffers, forcing that read query to
wait on write I/O.

What PostgreSQL does have is micro-vacuum, where it tries to remove any
no-longer-visible rows when it already has to modify a page. But it only does
this as long as no index updates are necessary.
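
Incidentally, you can see how much cleanup work is outstanding from the
standard statistics views (a minimal sketch):

    -- Tables with the most dead rows waiting for (auto)vacuum:
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;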

~~~
jsmith45
MSSQL does this with a two pronged approach.

If a dead record is encountered in a scan, it is queued up to be cleaned up
by a background task that runs every 5 seconds. This typically happens soon
enough that the record is still cached, so no extra read I/O is needed, and
in any case the location of the record is already known, so no extra scans
are needed.

To handle other records, it maintains a bitmap of pages with ghost records.
Each time the task wakes up, it drains the queue and then cleans up to 10 of
the pages indicated in the bitmap. Then it goes back to sleep, and only wakes
up again 5 seconds later.

This super slow cleanup of dead records that have not been recently scanned
helps to avoid overloading the system, but does ensure cleanup will eventually
occur, even if the table never gets scanned.

Pedantic note: technically it is a bytemap, as the bit is part of a larger
structure that holds other flags about a page.

------
Jonnax
So I'm running a single Postgres 10 instance for a small project that probably
hits about 200 queries a day at most.

I haven't upgraded but would it actually be worth the effort?

~~~
barbegal
Well, what are the issues you are currently having with Postgres 10? It
doesn't sound like there is any compelling reason to upgrade.

~~~
Jonnax
Yeah. That's the question haha.

I've got pgBackRest set up for backups and pgBouncer for connection pooling,
so it's a case of wondering whether there is some benefit to the newer
versions for someone using it "casually".

Since it'd require quite a bit of reconfiguration and testing.

~~~
rrdharan
Curious why you bothered with pgBouncer for 200 queries _per day_? Just future
proofing?

EDIT: forgot, pgBouncer also allows you to do hitless restart/upgrade, which
is useful at any query volume I suppose ;)

~~~
Jonnax
Indeed.

Also, the server the DB is on is RAM constrained, and Postgres making a
process per connection eats right into that, versus using pgBouncer to pool
connections.

It seems complicated but it's actually quite simple.
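
For anyone curious, the pooling setup is just a few lines of pgbouncer.ini (a
sketch; the database name and pool sizes are made up):

    [databases]
    ; made-up database name
    mydb = host=127.0.0.1 port=5432 dbname=mydb

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    ; reuse server connections between transactions
    pool_mode = transaction
    ; at most 5 Postgres backend processes for this pool
    default_pool_size = 5
    ; many clients can share those few backends
    max_client_conn = 100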

------
AlexTrask
I'd suggest an improvement: being able to vacuum without needing free space
equal to the table's size. For example, to vacuum a 1 GB table you must have
1 GB free.

~~~
dtech
Is that always the case? I'd expect that only to be necessary with a complete
table rewrite, i.e. VACUUM FULL, which is not what (auto)VACUUM does by
default.
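
To make the distinction concrete (table name hypothetical):

    -- Plain VACUUM marks dead space reusable in place; no copy of the
    -- table is made, so no extra disk space of that size is needed:
    VACUUM my_table;

    -- VACUUM FULL rewrites the table into a new file, which is what
    -- temporarily requires free disk space roughly equal to the table:
    VACUUM FULL my_table;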

~~~
heavenlyblue
Even with a full table rewrite, why would you need that space in RAM?

~~~
hans_castorp
The "you must have a 1Gb free" refers to disk space, not main-memory (RAM).

