No More Full Table Vacuums in PostgreSQL (rhaas.blogspot.com)
290 points by dhd415 on Mar 18, 2016 | 73 comments



This sounds promising! Transaction wraparound autovacuum is a particularly nasty pathology because it is so devastating and strikes with so little warning: a system transitions from entirely fine to one that is entirely (and mysteriously!) non-functional in a single transaction. And because the system needs to be up for a pretty long time before you see this (it is tautologically rare), many only learn about this failure mode the hard way. Certainly, we at Joyent got completely nailed by this -- and if you want to get a sense of how disorienting this failure mode is when you're seeing it for the first time, read our postmortem of the service outage that transaction wraparound autovacuum induced.[1]

A long way of saying: this is a nasty failure mode, and it's terrific to see the PostgreSQL team tackle it. While there are still some questions (e.g., it sounds like the autovacuum is still kicked off, just more aggressively and more intelligently -- does that mean an exclusive lock on the table is still held for the duration of the operation?), this is clearly a big leap in the right direction, and (more) validation that we've made the right choice for our transactional data; thank you PostgreSQL team!

[1] https://www.joyent.com/blog/manta-postmortem-7-27-2015


> strikes with so little warning: a system transitions from entirely fine to one that is entirely (and mysteriously!) non-functional in a single transaction

that's half true.

After vacuum_freeze_table_age is exceeded, postgres will automatically start freeze-only autovacuum passes with very low i/o priority. Unless you have tweaked that value, it's set to 150M, which is way below the 2^32 hard failure point.

From then on, postgres will also start warning you about this in the log file.

If you ignore this, then once you're at 2^32-10M transactions, every single query to the database will yield a WARNING with a countdown of how many transactions can still be processed before armageddon.

Only if you also ignore these will you eventually run into the problem.

So while I agree that it's a very real problem, I really don't agree that it's a problem which strikes with "little warning". Postgres is very much yelling at you as it starts panicking.
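
If you want to keep an eye on it yourself, here's a minimal sketch against the standard catalogs (adjust the settings checks if you've tuned them):

    -- Age (in transactions) of the oldest unfrozen XID in each database;
    -- the anti-wraparound machinery kicks in once this crosses the freeze settings.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- The settings mentioned above:
    SHOW vacuum_freeze_table_age;
    SHOW autovacuum_freeze_max_age;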


>Since these tunables are not dynamically adjustable, we manually patched the running process to avoid sleeping in this case.

Holy crap. You wrote a jump into an active process' memory space?


While I have hot-patched program text on production machines many times in my career, in this case it was a bit simpler in that we were dynamically changing the value of a variable. The effect was instant; our chat logs from the time captured our visceral reaction to the resulting performance improvement.[1]

[1] https://twitter.com/bcantrill/status/628320729898045440


Thanks very much for sharing that, interesting reading!


This is one of the things about being a web developer and part-time DBA that keeps me up at night (sometimes literally all night).

Around a month ago the source file table on Coveralls.io[0] hit a transaction wraparound, and put us in read-only mode for over 24 hours while RDS engineers performed a full vacuum (that's not something that can be done manually as a customer). On a managed database I'm paying thousands a month for, I was hoping there would be some sort of early warning system. Well, apparently there is, but it's buried in the logs and won't trigger any app exceptions, so it went unnoticed.

What's worse is that there's no indication of how long a vacuum is going to take, nor any progress updates while it's running. So for a production web app with customers, this means damage control language like:

"Our engineers have identified a database performance issue and working to mitigate. Unfortunately we do not have an ETA at this time."

About a week later, more calamity hit: the INT "id" field on the same table exceeded its maximum value. My first thought was to change it to a BIGINT, but ~4 hrs into the migration, with no indication of how much longer it would take, I pulled the plug and sharded the table instead.
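
For context, the migration I attempted was essentially the following (table name is illustrative); the catch is that this kind of type change rewrites the whole table under an exclusive lock, which is why it can grind away for hours with no feedback:

    -- Rewrites every row and holds an ACCESS EXCLUSIVE lock (blocking reads
    -- and writes) for the duration of the rewrite.
    ALTER TABLE source_files ALTER COLUMN id TYPE BIGINT;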

Moral of the story is that web devs should be aware of these pitfalls, and that no matter how much trust you put into a managed database service, it could still happen to you (cue ominous background music).

Anyway I'm glad to see this lurking monster in our beloved database tamed, thank you Mr Haas!

[0] https://coveralls.io


> What's worse is that there's no indication of how long a vacuum is going to take, nor any progress updates while it's running.

Upcoming 9.6 will help with this to a certain degree: http://www.postgresql.org/docs/devel/static/progress-reporti...
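
Concretely, 9.6 adds a pg_stat_progress_vacuum view, so (assuming you're on 9.6 or later) you can at least see which phase a running vacuum is in and how far through the heap it has gotten, something like:

    -- One row per backend currently running VACUUM (including autovacuum).
    SELECT pid, datname, relid::regclass AS table_name,
           phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
    FROM pg_stat_progress_vacuum;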


Thanks for the post. I'm a big fan of coveralls (I use it to collate the simplecov reports from 20 CircleCI instances), and I was wondering about the downtime.


that's why I always use BIGINT for 'most' tables. Only when I know a table will never outgrow INT's range do I use INT.
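
A sketch of what that convention looks like in practice (table and columns made up for illustration):

    -- BIGSERIAL gives you a 64-bit id; the extra 4 bytes per row is cheap
    -- insurance against ever having to rewrite the table later.
    CREATE TABLE events (
        id         BIGSERIAL PRIMARY KEY,
        payload    TEXT,
        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    );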


Praise the deity! Oh how I wish I had this when I was running reddit's databases. It's exactly the use case this was built for -- high write load where little changes after it's written.

They solved the problem by moving to SSDs after I left, but I feel like incorporating this patch might actually let them reduce the size of their Postgres cluster, or at least grow it more slowly.


> Postgres cluster

This write-up [1] mentions your InfoQ talk. The general philosophy I got from it is that postgres is used for the source of truth, gets all the writes; and then reads can be offloaded to slave instances, or other specialized engines like Cassandra.

However there are no details at all about the replication solution. If you don't mind sharing, how was your general experience with it? Was it trigger-based, WAL-based? Was it the postgres solution, or Slony/Bucardo/etc? What were the hurdles/gotchas you encountered? Did you have to compromise between synch/asynch? It is said that there are two big problems in computing: naming things and cache invalidation. Did you have to design a specific caching policy? How was it enforced? How did you decide what to put to Cassandra vs postgres slaves?

Also, what features did you use in postgres? Did you choose it specifically for hstore?

[1] http://highscalability.com/blog/2013/8/26/reddit-lessons-lea...


Big disclaimer: I haven't worked at reddit in five years, so everything I'm saying was true five years ago but may not be today. I also haven't admined a production postgres box since I left reddit.

> The general philosophy I got from it is that postgres is used for the source of truth, gets all the writes; and then reads can be offloaded to slave instances, or other specialized engines like Cassandra.

That was true at the time of the talk, I don't think it's true today. I believe Cassandra holds the source of truth for some data now.

> However there are no details at all about the replication solution. If you don't mind sharing, how was your general experience with it?

It was terrible. That was the part of Postgres that was totally unsolved. I once spent an entire family vacation resyncing databases while sitting in a hotel room.

> Was it trigger-based, WAL-based?

We started with WAL-based replication, then moved to a tool called Londiste (which is trigger-based). We liked that tool because it was written in Python, so we could dig into the internals if we had to (which we did have to on occasion).

> Was it the postgres solution, or Slony/Bucardo/etc?

I think it was third party at the time. It looks slightly more official now?

> What were the hurdles/gotchas you encountered?

A stable replication system. :) The replication would keep dying because the nodes would get out of sync and the replication wouldn't be able to recover. They discovered after I left that the issue was an occasional pathological code branch that would make a write directly to a slave instead of a master. I think replication got a lot more stable after they fixed that.

> Did you have to compromise between synch/asynch?

It was all async and we designed around that. The nice thing was that anything we wrote to the database went to the cache too, so even if a read slave hadn't picked up the change yet it didn't matter because the read was never hitting Postgres -- the data was in the cache already. Sometimes we did have problems where out of date data would be read from a slave and then cached, but it was usually fixed on its own, either by being written again or just falling out of the cache.

> It is said that there are two big problems in computing: naming things and cache invalidation. Did you have to design a specific caching policy? How was it enforced?

The app developer could choose the level of caching they wanted for each piece of data. There were multiple caches available with different levels of globalness. For example we had caches right on the servers that were used to cache rendered html bits. Since the data was part of the cache key and the html was the same as long as the data was the same, you never had to invalidate it.

Basically, to solve the invalidation problem: as long as you write your code so that the value is tied to the cache key, you never have to invalidate the cache, because the changing data does it for you. For things like pre-rendered html this worked great; for things where a raw value was being cached, obviously not so much. In those cases there was a global cache, so you just had to deal with race conditions (but for something like points, the command that was sent was "increment 1", not "set to X").

> How did you decide what to put to Cassandra vs postgres slaves?

It depended on the use case. Things that were faster/easier to store in Postgres were stored there. For example, the list of the top 1000 links for every subreddit were stored there (and only there). For every vote, that list was recalculated from the data in Postgres and put in C*. So technically it was the source of truth for that data, but since it could always be recomputed from Postgres, we didn't call it that. (BTW there were shortcuts for not recomputing on every vote).

> Also, what features did you use in postgres? Did you choose it specifically for hstore?

Hstore didn't exist when we started using Postgres (or even when I left reddit for that matter).

We didn't use any special features in Postgres; we just liked that it was rock-solid code and conformed to standard SQL better than MySQL. It also had a way better query planner.

> From below: how on earth did moving to SSDs help Reddit avoid the Postgres autovacuum failure mode?!

It didn't; it just meant that when a vacuum happened the node was still usable, as opposed to with spinning disks. A vacuum on a read slave during peak sucked. A vacuum on a master at almost any time was disastrous. I had to schedule the full vacuums for Saturday nights and hope that was sufficient.


Thank you for the answer!


Great questions, all. I would add one more: how on earth did moving to SSDs help Reddit avoid the Postgres autovacuum failure mode?!


I don't think it solved it/avoided it, so much as throwing more IOPS at the issue means it takes less time.


Robert Haas is just amazing.

In Postgres alone, he has completed:

- No more full table vacuums

- Parallel Sequential Scan

- Major memory improvements for NUMA

- Major improvements for large multi-core systems

My only concern is that it appears that Robert is the only person tackling big/major improvements for Postgres.

I might be wrong, but that's the perception. I hope this isn't the case and others are also tackling big improvements to Postgres.


In this post, Haas explicitly credits Masahiko Sawada for most of the work on this feature. Haas definitely does lots of work, but there are other significant contributors as well.


While Robert certainly does a huge amount of work, this is a somewhat mistaken perception of how contributing to PostgreSQL works. There are ~20 developers responsible for final reviews and for committing changes coming from other contributors, and Robert is one of them.

That's of course a lot of work and requires a lot of experience, but it may skew the perception a bit.


For example, while Robert did a lot of work on the parallel stuff, large parts of it were written by Amit Kapila; but he's not a committer, so Robert is the one whose name is on the git commit. The actual authors/reviewers and so on are tracked in the commit message, but people often miss that.

I sometimes hear "But there are only 20 people working on PostgreSQL!" because people don't realize that (and tools like ohloh/openhub/... make this mistake too).


The actual author can be correctly attributed in git by using the --author flag at commit time. For example, git commit --author "My Friend <friend@example.com>". The resulting git commit will show the committer as committer and the author as author, correctly providing attribution in changelogs. :)


That doesn't really work well, because for more complex patches there is frequently more than one significant author. So we end up crediting them in the text of the commit message instead. We're slowly converging on using somewhat more standardized descriptions of who authored a commit, but so far it's not easily machine-parsable :(


I think the idea is that you use a topic branch, and each incremental piece is attributed to the individual who authored it. Then the branch is merged in when the feature is done. If someone helps substantially at the atomic commit level, then they get greetz in the commit message; the person with 50%+ responsibility should get the author flag. If commits are in small enough units, that should be pretty doable and all code remains properly attributed.

I do sympathize that not enough attention has been given to this, though, and there are probably people who don't want to follow the branching model. Would be nice if you could identify multiple authors in git. Here's a bug discussing that: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=451880


That model leaves you with an incredibly unusable history. So I think the chances of going there are zero.


It also doesn't work because a patch is sometimes reworked by multiple people - either the committer himself, or a co-author of the patch series.


That isn't the case. If you are interested, I would suggest joining the Postgres hackers mailing list. Reading that will give a better picture of what is happening and who is doing the happening.


> Postgres hackers mailing list

Link on Gmane:

http://news.gmane.org/gmane.comp.db.postgresql.devel.general


When we talk about "big" patches, it's probably worth examining full change sets. The changes for removing full-table vacuums are here [1]. Compare that to those for the "biggest" 9.5 feature, upsert [2] (by Peter Geoghegan), and you can see that there's probably at least a full order of magnitude difference in effort there.

That said, Postgres probably does have a problem in that there are too few people writing these major patches, but that number is greater than one.

[1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;...

[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;...


> Postgres probably does have a problem in that there are too few people writing these major patches

I don't think that's the major problem. The major problem is having enough experienced people with enough bandwidth to make those patches suitable for integration.


Robert is amazing, I 100% agree. Not only has he done a lot of great things himself, but he's helped with other people's work (including mine) as well.

But lots of others do great work, as well. Some of it very big and ambitious, some of it is more humble and incremental.


Continuously blown away with the quality and usefulness of Postgres. It is truly lightyears ahead of MySQL and I keep wondering wth took me so long to switch. Kudos.


What I don't like about postgres is that it's really hard to have a decent replication / failover setup manually.

Meanwhile, it seems like AWS RDS has more development effort on MySQL and their own MySQL-compatible engine (Aurora?). At least they have Postgres RDS though. Google Cloud does not offer any hosted Postgres solutions. Cloud SQL is all MySQL.

I love Postgres but I really don't like maintaining rdbms installations.

What is my best bet for running pg on Google Cloud with HA and minimum hassle?


One can hope that this will get fixed over time. 9.4 introduced some flexibility around streaming data changes out of the Write-Ahead Log; from [1]:

> Add support for logical decoding of WAL data, to allow database changes to be streamed out in a customizable format.

Also 9.5 added new metadata to it [2]:

> Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc.

Hopefully we will see the ecosystem pick up on those changes. Percona made a killing in helping replication take place in MySQL. I don't see why this would not happen with postgres.

I'm confident things will improve.

[1] http://www.postgresql.org/docs/9.4/static/release-9-4.html

[2] http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-...
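
As a taste of the 9.4 logical decoding interface, here's a minimal sketch using the bundled test_decoding output plugin (requires wal_level = logical and a free replication slot; the slot name is arbitrary):

    -- Create a logical replication slot using the test_decoding plugin.
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- ...do some writes, then consume the decoded change stream:
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- Drop the slot when done, so it stops retaining WAL.
    SELECT pg_drop_replication_slot('demo_slot');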


> Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc.

Maybe only tangentially related, but they also made enhancements to pg_receivexlog that make it acknowledge transactions over the wire.

It may not seem like a big deal, but with those changes I was pretty trivially able to patch a version of pg_receivexlog that replicates WALs to HDFS and flushes them (instead of the local FS), so that you can essentially treat HDFS nodes as synchronous standbys. Incredibly useful for setups where your only non-ephemeral storage is HDFS, and the 16MB chunk boundary imposed by archive_command is too coarse-grained.

I'd imagine similar things can be written for lots of other non-posix filesystems that support appending and flushing... Being able to replicate to something that's not just another plain old server is really useful (not to mention how much simpler it is than having to set up an entire other Postgres node just to synchronously ship logs.)


> What I don't like about postgres is that it's really hard to have a decent replication / failover setup manually.

This is one reason I personally have a hard time moving away from MSSQL (despite the obscene licensing). Clustering, replication, and failover of MSSQL on Windows is really quite powerful. Though not necessarily cheap or easy to setup and troubleshoot, it is wonderful when it works.


We dislike managing PG too, so we've been migrating to https://www.compose.io/postgresql/ It's not for every situation, but they have replication and failover and downloadable daily backups with the option for SSH-only access. Been using them for about 6 months on some smaller DB projects, and haven't had any hiccups.


They say that they'll happily host 3TB databases, but the cost is too high for databases larger than a few GB.

1TB (with only 100GB of RAM) will set you back $150k a year!

For comparison we pay our generic service provider about 20% of that to manage 2 redundant dedicated servers with 24/7 monitoring and support.


Compose runs on AWS, though. OP was asking about Google Cloud.


AWS is investing more into MySQL because of the market and the fact that MySQL is an interface to numerous database engines. It's significantly easier for them to build something like Aurora with MySQL compatibility because of that setup.

Don't get me wrong here, PostgreSQL is light years better but that's the big "why".


Write support for the SQL/MED standard was added in 9.3.

Postgres now has a variety of foreign data wrappers available and can serve that exact same function.
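
For example, with the bundled postgres_fdw (host names, credentials, and table names below are placeholders) you can expose a table on another server and, since 9.3, write to it as well:

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'db2.example.com', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
        OPTIONS (user 'app_user', password 'secret');

    CREATE FOREIGN TABLE remote_events (
        id      bigint,
        payload text
    ) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'events');

    -- With 9.3+ write support, this INSERT is pushed to the remote server.
    INSERT INTO remote_events (id, payload) VALUES (1, 'hello');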


My company, Database Labs (https://www.databaselabs.io/), runs a fully managed Postgres DBaaS and now has Google Cloud as a deploy target in beta.

Mail me if you'd like to be in the beta - pjlegato@databaselabs.io.


Heroku has a decent offer: https://www.heroku.com/postgres.

Another option is intoGres: https://www.intogres.com.


Those aren't for Google Cloud, though. Heroku is hosted on AWS, and intoGres has their own cloud. You don't want to incur the latency of talking to a database from a different cloud.


Hadn't heard of intoGres. Thanks for that.


There's ElephantSQL (https://www.elephantsql.com) and Aiven (https://aiven.io/postgresql.html).

However, they are super expensive. Considerably more than running it yourself, and considerably more than Cloud SQL or RDS on AWS.

Their plans are so small it seems that they're not really made for running anything big. Aiven's biggest plan is 3 nodes, each with 8 cores and up to 32GB RAM.

Elephant's most expensive plan is 1 node with 4 cores and 15GB RAM. Not sure if you just buy multiple of these. They have replication support, but I don't see anything about pricing, and they don't have automatic failover.


You're usually better off just running Google Cloud SQL, because for many workloads the benefits of hosted FT outweigh the gains of Postgres over MySQL.

Now if you need Postgres specific features... gets more tricky. Setting up FT is terribad.


Some ex-colleagues just started offering what you want, I think. I am sure they'd appreciate feedback even if it isn't: https://aiven.io (disclaimer: haven't tried it myself for lack of a use case)


I've managed some significant MySQL replication clusters for the last couple of years. They are a nightmare, frequently stopping for dumb errors and/or falling completely out of sync. It's only manageable because of Percona's tools.

While PostgreSQL replication may be a bit clunkier on the initial setup, it's a dream comparatively.


What about a dockerized Postgres? https://hub.docker.com/_/postgres/


Throwing something into a container does not magically make it have replication nor failover.


It actually makes your life harder.


Could you elaborate on this? I was considering using containers for a PG installation but am reevaluating.


What are you trying to solve by doing this? Are you planning to run other applications on the same host? Generally it's not a good idea to share a host running a database with other applications, due to their different workloads.

If you want to put multiple databases on one host, it would probably still be more efficient to put all the data in a single instance. I could see this working for a very small database, but then wouldn't you be better off using SQLite?

In addition, when you add containers to the mix you turn a single problem into many; some examples:

- assuming you have multiple hosts, you need to figure out where you'll store persistent data (and you generally want a solution with high IOPS)
- how you handle logging (where do you store the logs?)
- how the applications figure out where your database is (service discovery)
- how you solve replication (and figure out which database is the master)
- how you handle failover


I was considering deploying Postgres (as a container) into a Kubernetes cluster. Kubernetes takes care of service discovery, networking, persistent disk mounting, moving the container if a server goes down, etc.

Wouldn't you have similar problems with Postgres without the container?


I don't know Kubernetes, but it looks like it solves many of those issues already. I still believe HA might be a challenge.

Typically to make the database highly available you might set up a second one (or perhaps more) that replicates from the master. This can become problematic if the postgres containers will be moved around.

As I said, I don't know Kubernetes. If you can have containers that keep their state (e.g. you destroy one and recreate it somewhere else, and it comes back exactly the same) and also keep the same IP, then this is not an issue; but if, when you move it around, each instance is technically a new postgres, then such a setup might become problematic.

Regarding your question: the traditional way of running it is that you set up a host and run postgres on it. It doesn't move around, so most of those problems don't arise. Granted, if you do implement service discovery, then if something happens to a host you can set up another one and quickly point everything at it.


Well, on FreeBSD you can toss it into a jail with iocage and get snapshots, clones, and easy HA with VRRP or CARP. Not replication per se, but for a lot of people that works just fine and is incredibly easy to automate and accomplish.


How does this get you HA (easy or otherwise) without application-level configuration? Does this involve something like lockstep execution between the two servers?

I'd expect that if you run two Postgres servers, unaware of each other, at the same IP address you will rapidly get data corruption, but maybe I'm missing a step since you say this works fine for a lot of people.


I'd guesstimate it involves shipping snapshots to a cold standby, since snapshotting/cloning is mentioned.


If only temporal table support were at least on the TODO list...

Temporal tables: https://wiki.postgresql.org/images/6/64/Fosdem20150130Postgr...

TODO: https://wiki.postgresql.org/wiki/Todo



I don't get to use psql much anymore but the documentation is among the best I've ever seen for any project.


This is huge. Actually, the latest three versions of PostgreSQL gave us so many things; I'm just happy that I'm a PostgreSQL user.


Honestly ever since 9.0, they've been on a tear, but the pace is increasing. I barely even have to consider using an alternative anymore.


Why not use 64-bit ids to prevent wraparound, at least as an option?

A storage design that requires periodic maintenance doesn't seem a good one.


Transaction IDs are 64 bits, but they need to be stored in the tuples themselves for visibility purposes. So, only the low 32 bits are stored, but that creates a wraparound problem.
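
You can see this directly: the xmin/xmax system columns on each row are the 32-bit on-disk values, while txid_current() reports the epoch-extended 64-bit counter (the table name below is just an example):

    -- 32-bit per-tuple transaction ids used for visibility checks:
    SELECT xmin, xmax, * FROM my_table LIMIT 1;

    -- Epoch-extended 64-bit view of the current transaction id:
    SELECT txid_current();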


I think PostgreSQL is the highest-quality code I've ever used, and it's constantly improved at a brisk pace. I think the project exemplifies the full potential of open source development.


It's also a bit frustrating. I'm a shitty developer and it takes ages to get my patches in ;-)


Is there a reason why PostgreSQL can't use 64 bit transaction IDs?


If you're running PG at a scale where this actually matters, then the increased storage overhead is a serious concern.


I am curious whether companies like EnterpriseDB have solutions for cases like this.


Further affirmation we made the right choice to use PostgreSQL for our DB, which was a tough fight because internal company politics usually require the use of Oracle or MySQL, due to previously agreed licences.

PostgreSQL remains one of those great products that you can argue for and win based on the quality, documentation and performance of the product.


Well that sucks.


HN has no sense of humor.


"vacuum" and "sucks". Get it? :D


Try a good joke?



