
No More Full Table Vacuums in PostgreSQL - dhd415
http://rhaas.blogspot.com/2016/03/no-more-full-table-vacuums.html
======
bcantrill
This sounds promising! Transaction wraparound autovacuum is a particularly
nasty pathology because it is so devastating and strikes with so little
warning: a system transitions from entirely fine to one that is entirely (and
mysteriously!) non-functional in a single transaction. And because the system
needs to be up for a pretty long time before you see this (it is
tautologically rare), many only learn about this failure mode the hard way.
Certainly, we at Joyent got completely nailed by this -- and if you want to
get a sense of how disorienting this failure mode is when you're seeing it for
the first time, read our postmortem of the service outage that transaction
wraparound autovacuum induced.[1]

A long way of saying: this is a nasty failure mode, and it's terrific to see
the PostgreSQL team tackle it. While there are still some questions (e.g.,
this sounds like the autovacuum is still kicked off, just one that is at once
more aggressive and more intelligent, but does that mean the table is still
blocked by an exclusive lock for the duration of the operation?), this is
clearly a big leap in the right direction, and (more) validation that we've
made the right choice for our transactional data; thank you PostgreSQL team!

[1] [https://www.joyent.com/blog/manta-postmortem-7-27-2015](https://www.joyent.com/blog/manta-postmortem-7-27-2015)

~~~
dice
>Since these tunables are not dynamically adjustable, we manually patched the
running process to avoid sleeping in this case.

Holy crap. You wrote a jump into an active process' memory space?

~~~
bcantrill
While I _have_ hot-patched program text on production machines many times in
my career, in this case it was a bit simpler in that we were dynamically
changing the value of a variable. The effect was instant; our chat logs from
the time captured our visceral reaction to the resulting performance
improvement.[1]

[1]
[https://twitter.com/bcantrill/status/628320729898045440](https://twitter.com/bcantrill/status/628320729898045440)

------
nickmerwin
This is one of the things about being a web developer and part-time DBA that
keeps me up at night (sometimes literally all night).

Around a month ago the source file table on Coveralls.io[0] hit transaction
wraparound, which put us in read-only mode for over 24 hours while RDS
engineers performed a full vacuum (that's not something that can be done
manually as a customer). On a managed database I'm paying thousands a month
for, I was hoping there would be some sort of early warning system.
Apparently there is one, but it's buried in the logs and doesn't trigger any
app exceptions, so it went unnoticed.
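
For what it's worth, a rough early-warning check needs nothing RDS-specific;
a plain catalog query shows how far each database has drifted toward the
roughly two-billion-transaction wraparound limit:

    -- age() is how many transactions old the database's frozen horizon is;
    -- autovacuum's anti-wraparound pass normally starts around 200 million
    -- (the autovacuum_freeze_max_age default), long before the hard limit.
    SELECT datname,
           age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;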

What's worse, there's no indication of how long a vacuum is going to take,
and no progress updates while it's running. So for a production web app with
customers, this means damage control language like:

"Our engineers have identified a database performance issue and are working
to mitigate it. Unfortunately, we do not have an ETA at this time."

About a week later, more calamity hit: the INT "id" field on the same table
exceeded its maximum value. My first thought was to change it to a BIGINT,
but ~4 hrs into the migration, with no indication of how much longer it would
take, I pulled the plug and sharded the table instead.
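
For anyone hitting the same wall: the straightforward fix is a one-line
ALTER (the table name below is just illustrative), but it rewrites the whole
table under an exclusive lock, which is exactly why it can grind away for
hours with no progress indication:

    -- rewrites every row and blocks writes for the duration
    ALTER TABLE source_files ALTER COLUMN id TYPE BIGINT;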

The moral of the story is that web devs should be aware of these pitfalls,
and that no matter how much trust you put in a managed database service, it
still _could happen to you_ (cue ominous background music).

Anyway I'm glad to see this lurking monster in our beloved database tamed,
thank you Mr Haas!

[0] [https://coveralls.io](https://coveralls.io)

~~~
amitlan
> What's worse, there's no indication of how long a vacuum is going to take,
> and no progress updates while it's running.

Upcoming 9.6 will help with this to a certain degree:
[http://www.postgresql.org/docs/devel/static/progress-reporting.html](http://www.postgresql.org/docs/devel/static/progress-reporting.html)
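
Concretely, the new pg_stat_progress_vacuum view can be polled while a
vacuum is running; something along these lines (column names as in the
development docs) shows the current phase and how many heap blocks have been
scanned so far:

    SELECT pid,
           relid::regclass AS table_name,
           phase,
           heap_blks_scanned,
           heap_blks_total
    FROM pg_stat_progress_vacuum;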

------
jedberg
Praise the deity! Oh how I wish I had this when I was running reddit's
databases. It's exactly the use case this was built for -- high write load
where little changes after it's written.

They solved the problem by moving to SSDs after I left, but I feel like
incorporating this patch might actually let them reduce the size of their
Postgres cluster, or at least grow it more slowly.

~~~
BenoitP
> Postgres cluster

This write-up [1] mentions your InfoQ talk. The general philosophy I got
from it is that postgres is used for the source of truth, gets all the writes;
and then reads can be offloaded to slave instances, or other specialized
engines like Cassandra.

However there are no details at all about the replication solution. If you
don't mind sharing, how was your general experience with it? Was it trigger-
based, WAL-based? Was it the postgres solution, or Slony/Bucardo/etc? What
were the hurdles/gotchas you encountered? Did you have to compromise between
synch/asynch? It is said that there are two big problems in computing: naming
things and cache invalidation. Did you have to design a specific caching
policy? How was it enforced? How did you decide what to put to Cassandra vs
postgres slaves?

Also, what features did you use in postgres? Did you choose it specifically for
hstore?

[1] [http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html](http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html)

~~~
jedberg
Big disclaimer: I haven't worked at reddit in five years, so everything I'm
saying was true five years ago but may not be today. I also haven't admined a
production postgres box since I left reddit.

> The general philosophy I got from it is that postgres is used for the source
> of truth, gets all the writes; and then reads can be offloaded to slave
> instances, or other specialized engines like Cassandra.

That was true at the time of the talk, I don't think it's true today. I
believe Cassandra holds the source of truth for some data now.

> However there are no details at all about the replication solution. If you
> don't mind sharing, how was your general experience with it?

It was terrible. That was the part of Postgres that was totally unsolved. I
once spent an entire family vacation resyncing databases while sitting in a
hotel room.

> Was it trigger-based, WAL-based?

We started with WAL-based replication, then moved to a tool called Londiste
(which is trigger-based). We liked that tool because it was in Python so we
could dig into the internals if we had to (which we did have to on occasion).

> Was it the postgres solution, or Slony/Bucardo/etc?

I think it was third party at the time. It looks slightly more official now?

> What were the hurdles/gotchas you encountered?

A stable replication system. :) The replication would keep dying because the
nodes would get out of sync and the replication wouldn't be able to recover.
They discovered after I left that the issue was an occasional pathological
code branch that would make a write directly to a slave instead of a master. I
think replication got a lot more stable after they fixed that.

> Did you have to compromise between synch/asynch?

It was all async and we designed around that. The nice thing was that anything
we wrote to the database went to the cache too, so even if a read slave hadn't
picked up the change yet it didn't matter because the read was never hitting
Postgres -- the data was in the cache already. Sometimes we did have problems
where out of date data would be read from a slave and then cached, but it was
usually fixed on its own, either by being written again or just falling out of
the cache.

> It is said that there are two big problems in computing: naming things and
> cache invalidation. Did you have to design a specific caching policy? How
> was it enforced?

The app developer could choose the level of caching they wanted for each piece
of data. There were multiple caches available with different levels of
globalness. For example we had caches right on the servers that were used to
cache rendered html bits. Since the html was the same as long as the
underlying data (which was part of the cache key) was the same, you never had
to invalidate it.

Basically, to solve the invalidation problem: as long as you write your code
so that the value is tied to the cache key, you never have to invalidate the
cache, because the changing data does it for you. For things like
pre-rendered html this worked great; for things where a plain value was being
cached, obviously not so much. In those cases there was a global cache, so
you just had to deal with race conditions (but for something like points, the
command that was sent was "increment 1", not "set to X").

> How did you decide what to put to Cassandra vs postgres slaves?

It depended on the use case. Things that were faster/easier to store in
Postgres were stored there. For example, the list of the top 1000 links for
every subreddit were stored there (and only there). For every vote, that list
was recalculated from the data in Postgres and put in C*. So technically it
was the source of truth for that data, but since it could always be recomputed
from Postgres, we didn't call it that. (BTW there were shortcuts for not
recomputing on every vote).
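
Schematically, the recompute was just a bounded query against the relational
data (table and column names here are made up, not reddit's actual schema),
with the result then written into C* as a denormalized copy:

    -- hypothetical shape of the per-subreddit "top links" recompute
    SELECT link_id
    FROM links
    WHERE subreddit_id = $1
    ORDER BY hot_score DESC
    LIMIT 1000;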

> Also, what features did you use in postgres? Did you choose it specifically
> for hstore?

Hstore didn't exist when we started using Postgres (or even when I left reddit
for that matter).

We didn't use any special features in Postgres; we just liked that it was rock
solid code and conformed to standard SQL better than MySQL. It also had a way
better query planner.

> From below: how on earth did moving to SSDs help Reddit avoid the Postgres
> autovacuum failure mode?!

It didn't; it just meant that when a vacuum happened the node was still
usable, as opposed to on spinning disk. A vacuum on a read slave during
peak sucked. A vacuum on a master during most any time was disastrous. I had
to schedule the full vacuums for Saturday nights and hope that was sufficient.

~~~
BenoitP
Thank you for the answer!

------
tiffanyh
Robert Haas is just amazing.

For just Postgres, he has completed:

- No more full table vacuums

- Parallel Sequential Scan

- Major memory improvements for NUMA

- Major improvements for large multi-core systems

My only concern is that it appears that Robert is the only person tackling
big/major improvements for Postgres.

I might be wrong, but that's the perception. I hope this isn't the case and
others are also tackling big improvements to Postgres.

~~~
pgaddict
While Robert certainly does a huge amount of work, that's a somewhat
misleading picture of how contributing to PostgreSQL works. There are ~20 developers
responsible for final reviews and committing changes coming from other
contributors, and Robert is one of them.

That's of course a lot of work and requires a lot of experience, but it may
skew the perception a bit.

For example, while Robert did a lot of work on the parallel stuff, large
parts of it were written by Amit Kapila, but he's not a committer, so Robert
is the one
whose name is on the git commit. The actual authors/reviewers and so on are
tracked in the commit message, but people often miss that.

I sometimes hear "But there are only 20 people working on PostgreSQL!" because
people don't realize that (and tools like ohloh/openhub/... make this mistake
too).

~~~
cookiecaper
The actual author can be correctly attributed in git by using the --author
flag at commit time. For example, git commit --author "My Friend
<friend@example.com>". The resulting git commit will show the committer as
committer and the author as author, correctly providing attribution in
changelogs. :)

~~~
anarazel
That doesn't really work well, because frequently for more complex patches
there's more than one significant author. So we end up crediting them in the
text of the commit message instead. We're slowly converging on using a bit
more standardized descriptions of who authored a commit, but so far it's not
easily machine parsable :(

~~~
cookiecaper
I think the idea is that you use a topic branch and each incremental piece is
attributed to the individual who wrote it. Then the branch is merged in when
the feature is done. If someone helps substantially at the atomic commit
level, then they get greetz in the commit message; the person with 50%+
responsibility should get the author flag. If commits are in small enough
units, that should be pretty doable and all code remains properly attributed.

I do sympathize that not enough attention has been given to this, though, and
there are probably people who don't want to follow the branching model. Would
be nice if you could identify multiple authors in git. Here's a bug discussing
that: [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=451880](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=451880)

~~~
anarazel
That model leaves you with an incredibly unusable history. So I think the
chances of going there are zero.

~~~
pgaddict
It also doesn't work because a patch is sometimes reworked by multiple people
-- either the committer himself, or a co-author of the patch series.

------
dvdplm
Continuously blown away by the quality and usefulness of Postgres. It is
truly lightyears ahead of MySQL and I keep wondering wth took me so long to
switch. Kudos.

~~~
eknkc
What I don't like about postgres is that it's really hard to have a decent
replication / failover setup manually.

Meanwhile, it seems like AWS RDS has more development on MySQL and their own
MySQL-compatible engine (Aurora?). At least they have Postgres RDS though.
Google Cloud does not offer any hosted Postgres solutions. Cloud SQL is all
MySQL.

I love Postgres but I really don't like maintaining rdbms installations.

What is my best bet for running pg on Google Cloud with HA and minimum hassle?

~~~
eyepulp
We dislike managing PG too, so we've been migrating to
[https://www.compose.io/postgresql/](https://www.compose.io/postgresql/). It's
not for every situation, but they have replication and failover and
downloadable daily backups with the option for SSH-only access. Been using
them for about 6 months on some smaller DB projects, and haven't had any
hiccups.

~~~
lololomg
They say that they'll happily host 3TB databases, but the cost is too high for
databases larger than a few GB.

1TB (with only 100GB of RAM) will set you back $150k a year!

For comparison we pay our generic service provider about 20% of that to manage
2 redundant dedicated servers with 24/7 monitoring and support.

------
merb
This is huge. Actually, the latest three versions of PostgreSQL have given us
so many things that I'm just happy to be a PostgreSQL user.

~~~
busterarm
Honestly ever since 9.0, they've been on a tear, but the pace is increasing. I
barely even have to consider using an alternative anymore.

------
devit
Why not use 64-bit ids to prevent wraparound, at least as an option?

A storage design that requires periodic maintenance doesn't seem like a good
one.

~~~
jeffdavis
Transaction IDs are 64 bits, but they need to be stored in the tuples
themselves for visibility purposes. So, only the low 32 bits are stored, but
that creates a wraparound problem.
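
You can see both widths from SQL (using the txid_* functions): txid_current()
reports the epoch-extended 64-bit counter, while the xmin system column on
any row holds only the 32-bit value that actually fits in the tuple header:

    SELECT txid_current();              -- epoch-extended, effectively 64-bit
    SELECT xmin FROM pg_class LIMIT 1;  -- the 32-bit XID stored in each tuple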

------
danharaj
I think PostgreSQL is the highest-quality code I've ever used, and it's
constantly improved at a brisk pace. I think the project exemplifies the full
potential of open source development.

~~~
pgaddict
It's also a bit frustrating. I'm a shitty developer and it takes ages to get
my patches in ;-)

------
rwmj
Is there a reason why PostgreSQL can't use 64 bit transaction IDs?

~~~
jtc331
If you're running PG at a scale where this actually matters, the increased
storage load is a serious concern: every tuple header stores two transaction
IDs (xmin/xmax), so widening them to 64 bits adds roughly 8 bytes per row.

------
brightball
I am curious whether companies like EnterpriseDB have solutions for cases
like this.

------
askyourmother
Further affirmation we made the right choice to use PostgreSQL for our DB,
which was a tough fight because internal company politics usually require the
use of Oracle or MySQL, due to previously agreed licences.

PostgreSQL remains one of those great products that you can argue for and win
based on the quality, documentation and performance of the product.

------
programminggeek
Well that sucks.

~~~
programminggeek
HN has no sense of humor.

~~~
justinclift
"vacuum" and "sucks". Get it? :D

