A long way of saying: this is a nasty failure mode, and it's terrific to see the PostgreSQL team tackle it. There are still some questions (e.g., it sounds like autovacuum still kicks in, just more aggressively and more intelligently; does that mean an exclusive lock on the table still blocks for the duration of the operation?), but this is clearly a big leap in the right direction, and (more) validation that we've made the right choice for our transactional data. Thank you, PostgreSQL team!
That's half true.
After vacuum_freeze_table_age, Postgres will automatically start freeze-only autovacuum processes with very low I/O priority. Unless you have tweaked that value, it's set to 150M, which is way below the 2^32 hard failure point.
From then on, postgres will also start warning you about this in the log file.
If you ignore this, then once you're at 2^32-10M transactions, every single query to the database will yield a WARNING with a countdown of how many transactions can still be processed before armageddon.
Only if you also ignore these will you eventually run into the problem.
So while I agree that it's a very real problem, I really don't agree that it's a problem which strikes with "little warning". Postgres is very much yelling at you as it starts panicking.
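To make the escalation described above concrete, here is a toy timeline of those thresholds using the numbers from this thread. This is emphatically not Postgres source code, just the arithmetic; the function and constant names are made up for illustration.

```python
# Toy model of the wraparound escalation described above -- NOT
# Postgres internals, just the numbers quoted in this thread.
XID_LIMIT = 2**32                       # the hard failure point
FREEZE_AGE = 150_000_000                # default vacuum_freeze_table_age
WARNING_START = XID_LIMIT - 10_000_000  # per-query WARNINGs begin here

def status(xid_age):
    """What Postgres is doing, given the age of the oldest unfrozen XID."""
    if xid_age >= XID_LIMIT:
        return "wraparound"
    if xid_age >= WARNING_START:
        return "per-query WARNING countdown"
    if xid_age >= FREEZE_AGE:
        return "freeze-only autovacuum + log warnings"
    return "normal"
```

The point is how much room there is between the first log warnings at 150M and the failure point at ~4.3 billion: you have to burn through more than 4 billion further transactions while ignoring two escalating alarm stages before anything actually breaks.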
Holy crap. You wrote a jump into an active process' memory space?
Around a month ago the source file table on Coveralls.io hit a transaction wraparound, and put us in read-only mode for over 24 hours while RDS engineers performed a full vacuum (that's not something that can be done manually as a customer). On a managed database I'm paying thousands a month for, I was hoping there would be some sort of early warning system. Well, apparently there is, but it's buried in the logs and won't trigger any app exceptions, so it went unnoticed.
What's worse is there's zero indication of how long a vacuum will take, and no progress updates while it runs. So for a production web app with customers, this means damage control language like:
"Our engineers have identified a database performance issue and working to mitigate. Unfortunately we do not have an ETA at this time."
About a week later, more calamity hit: the INT "id" field on the same table exceeded its maximum value. My first thought was to change it to a BIGINT, but ~4 hrs into the migration, with no indication of how much longer it would take, I pulled the plug and sharded the table instead.
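For anyone unfamiliar with the ceiling being hit here: it's just the signed 32-bit maximum of Postgres's INT type. A quick back-of-the-envelope sketch (the insert rate is purely hypothetical, not Coveralls' actual traffic):

```python
# Postgres INT is signed 32-bit; BIGINT is signed 64-bit.
INT_MAX = 2**31 - 1      # 2,147,483,647
BIGINT_MAX = 2**63 - 1   # about 9.2 * 10**18

def seconds_until_exhausted(max_id, inserts_per_second):
    """How long a serial id column lasts at a steady insert rate."""
    return max_id // inserts_per_second

# At a hypothetical 1,000 inserts/second, INT runs out in under a
# month, while BIGINT would last hundreds of millions of years.
int_lifetime_days = seconds_until_exhausted(INT_MAX, 1_000) / 86_400
```

Which is why the usual advice is to start surrogate keys as BIGINT on any table that might grow: the storage cost is 4 extra bytes per row, versus a painful blocking rewrite later.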
Moral of the story: web devs should be aware of these pitfalls, and no matter how much trust you put in a managed database service, it can still happen to you (cue ominous background music).
Anyway I'm glad to see this lurking monster in our beloved database tamed, thank you Mr Haas!
Upcoming 9.6 will help with this to a certain degree: http://www.postgresql.org/docs/devel/static/progress-reporti...
They solved the problem by moving the SSD after I left, but I feel like incorporating this patch might actually let them reduce the size of their Postgres cluster, or at least grow it slower.
This write-up mentions your InfoQ talk. The general philosophy I got from it is that postgres is used as the source of truth and gets all the writes; reads can then be offloaded to slave instances, or to other specialized engines like Cassandra.
However there are no details at all about the replication solution. If you don't mind sharing, how was your general experience around it? Was it trigger-based or WAL-based? Was it the postgres solution, or Slony/Bucardo/etc.? What were the hurdles/gotchas you encountered? Did you have to compromise between synch/asynch? It is said that there are two big problems in computing: naming things and cache invalidation. Did you have to design a specific caching policy? How was it enforced? How did you decide what to put in Cassandra vs postgres slaves?
Also, what features did you use in postgres? Did you choose it specifically for hstore?
> The general philosophy I got from it is that postgres is used for the source of truth, gets all the writes; and then reads can be offloaded to slave instances, or other specialized engines like Cassandra.
That was true at the time of the talk, I don't think it's true today. I believe Cassandra holds the source of truth for some data now.
> However there are no details at all about the replication solution. If you don't mind sharing, how was your general experience around it?
It was terrible. That was the part of Postgres that was totally unsolved. I once spent an entire family vacation resyncing databases while sitting in a hotel room.
> Was it trigger-based, WAL-based?
We started with WAL-based replication, then moved to a tool called londiste (which is trigger-based). We liked that tool because it was written in Python, so we could dig into the internals if we had to (which we did on occasion).
> Was it the postgres solution, or Slony/Bucardo/etc?
I think it was third party at the time. It looks slightly more official now?
> What were the hurdles/gotchas you encountered?
A stable replication system. :) The replication would keep dying because the nodes would get out of sync and the replication wouldn't be able to recover. They discovered after I left that the issue was an occasional pathological code branch that would make a write directly to a slave instead of a master. I think replication got a lot more stable after they fixed that.
> Did you have to compromise between synch/asynch?
It was all async and we designed around that. The nice thing was that anything we wrote to the database went to the cache too, so even if a read slave hadn't picked up the change yet it didn't matter because the read was never hitting Postgres -- the data was in the cache already. Sometimes we did have problems where out of date data would be read from a slave and then cached, but it was usually fixed on its own, either by being written again or just falling out of the cache.
> It is said that there are two big problems in computing: naming things and cache invalidation. Did you have to design a specific caching policy? How was it enforced?
The app developer could choose the level of caching they wanted for each piece of data. There were multiple caches available with different levels of globalness. For example, we had caches right on the servers that were used to cache rendered html bits. Since the data was part of the cache key, and the html was the same as long as the data was the same, you never had to invalidate it.
Basically, to solve the invalidation problem: as long as you write your code so that the value is tied to the cache key, you never have to invalidate the cache, because the changing data does it for you. For things like pre-rendered html this worked great; for things where a plain value was being cached, obviously not so much. In those cases there was a global cache, so you just had to deal with race conditions (but for something like points, the command that was sent was "increment by 1", not "set to X").
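A minimal sketch of the "value tied to the cache key" idea described above. The names and the dict-backed cache are illustrative only (not reddit's actual code), but they show why nothing ever needs explicit invalidation:

```python
import hashlib

cache = {}  # stands in for memcached or a per-server cache

def render_html(data):
    # Hypothetical render function: output depends only on the data.
    return "<li>%s (%d points)</li>" % (data["title"], data["points"])

def cached_render(data):
    # The key is derived from the data itself, so when the data
    # changes, the key changes too. The old entry is never wrong,
    # merely unused, and eventually falls out of the cache on its
    # own -- no invalidation step required.
    digest = hashlib.sha1(repr(sorted(data.items())).encode()).hexdigest()
    key = "render:" + digest
    if key not in cache:
        cache[key] = render_html(data)
    return cache[key]
```

This only works when the cached value is a pure function of the keyed data, which is exactly the distinction the comment draws between pre-rendered html (works great) and raw mutable values like vote counts (needs a different strategy, hence "increment by 1").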
> How did you decide what to put to Cassandra vs postgres slaves?
It depended on the use case. Things that were faster/easier to store in Postgres were stored there. For example, the list of the top 1000 links for every subreddit were stored there (and only there). For every vote, that list was recalculated from the data in Postgres and put in C*. So technically it was the source of truth for that data, but since it could always be recomputed from Postgres, we didn't call it that. (BTW there were shortcuts for not recomputing on every vote).
> Also, what features did you use in postgres? Did you choose it specifically for hstore?
Hstore didn't exist when we started using Postgres (or even when I left reddit for that matter).
We didn't use any special features in Postgres; we just liked that it was rock-solid code and conformed to standard SQL better than MySQL. It also had a way better query planner.
> From below: how on earth did moving to SSDs help Reddit avoid the Postgres autovacuum failure mode?!
It didn't; it just meant that when a vacuum happened the node was still usable, as opposed to with spinning disks. A vacuum on a read slave during peak sucked. A vacuum on a master at almost any time was disastrous. I had to schedule the full vacuums for Saturday nights and hope that was sufficient.
For just Postgres, he has completed:
- No more full table vacuums
- Parallel Sequential Scan
- Major memory improvements for NUMA
- Major improvements for large multi-core systems
My only concern is that it appears that Robert is the only person tackling big/major improvements for Postgres.
I might be wrong, but that's the perception. I hope this isn't the case and others are also tackling big improvements to Postgres.
That's of course a lot of work and requires a lot of experience, but it may skew the perception a bit.
I often hear "there are only 20 people working on PostgreSQL" because people don't realize that the guy who committed a patch may not be the one who implemented it.
For example, while Robert did a lot of work on the parallel stuff, large parts of it were written by Amit Kapila; but he's not a committer, so Robert is the one whose name is on the git commit. The actual authors/reviewers and so on are tracked in the commit message, but people often miss that.
I sometimes hear "But there are only 20 people working on PostgreSQL!" because people don't realize that (and tools like ohloh/openhub/... make this mistake too).
I do sympathize that not enough attention has been given to this, though, and there are probably people who don't want to follow the branching model. Would be nice if you could identify multiple authors in git. Here's a bug discussing that: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=451880
That said, Postgres probably does have a problem in that there are too few people writing these major patches, but that number is greater than one.
I don't think that's the major problem. The real one is having enough experienced people with enough bandwidth to make those patches suitable for integration.
But lots of others do great work as well. Some of it is very big and ambitious, some of it more humble and incremental.
Meanwhile, it seems like AWS RDS has more development on MySQL and their own MySQL-compatible engine (Aurora?). At least they have Postgres RDS, though. Google Cloud does not offer any hosted Postgres solution; Cloud SQL is all MySQL.
I love Postgres but I really don't like maintaining rdbms installations.
What is my best bet for running pg on Google Cloud with HA and minimum hassle?
> Add support for logical decoding of WAL data, to allow database changes to be streamed out in a customizable format.
Also, 9.5 added new metadata to it:
> Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc.
Hopefully the ecosystem will pick up on those changes. Percona made a killing helping replication take place in MySQL; I don't see why this wouldn't happen with postgres.
I'm confident things will improve.
Maybe only tangentially related, but they also made enhancements to pg_receivexlog that make it acknowledge transactions over the wire.
It may not seem like a big deal, but with those changes I was pretty trivially able to patch a version of pg_receivexlog that replicates WALs to HDFS and flushes them (instead of the local FS), so that you can essentially treat HDFS nodes as synchronous standbys. Incredibly useful for setups where your only non-ephemeral storage is HDFS, and the 16MB chunk boundary imposed by archive_command is too coarse-grained.
I'd imagine similar things can be written for lots of other non-posix filesystems that support appending and flushing... Being able to replicate to something that's not just another plain old server is really useful (not to mention how much simpler it is than having to set up an entire other Postgres node just to synchronously ship logs.)
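The append-then-flush pattern the parent describes can be sketched in a few lines. Everything here is a stand-in (a toy sink object instead of an HDFS client, made-up names, no actual replication protocol); it only illustrates the ordering constraint: flush durably first, acknowledge the position second.

```python
# Toy sketch of streaming WAL to a durable sink -- not pg_receivexlog
# code. The key invariant: a position is acknowledged back to the
# server only after the bytes covering it have been flushed.
class DurableSink:
    """Stand-in for an HDFS file opened for append."""
    def __init__(self):
        self.buffer = bytearray()
        self.flushed = 0          # bytes made durable so far

    def append(self, data):
        self.buffer += data

    def flush(self):
        self.flushed = len(self.buffer)  # pretend hflush/fsync

def replicate(records, sink):
    """records: iterable of (lsn, wal_bytes) pairs, in order."""
    acked = []
    for lsn, data in records:
        sink.append(data)
        sink.flush()              # durable before acknowledging
        acked.append(lsn)         # only now is lsn safe to report
    return acked
```

Flushing on every record is the simplest correct policy; a real implementation would presumably batch flushes and acknowledge the highest durable position, trading latency for throughput.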
This is one reason I personally have a hard time moving away from MSSQL (despite the obscene licensing). Clustering, replication, and failover of MSSQL on Windows is really quite powerful. Though not necessarily cheap or easy to setup and troubleshoot, it is wonderful when it works.
1TB (with only 100GB of RAM) will set you back $150k a year!
For comparison we pay our generic service provider about 20% of that to manage 2 redundant dedicated servers with 24/7 monitoring and support.
Don't get me wrong here, PostgreSQL is light years better but that's the big "why".
Postgres now has a variety of foreign data wrappers available and can serve that exact same function.
Mail me if you'd like to be in the beta - email@example.com.
Another option is intoGres: https://www.intogres.com.
However, they are super expensive. Considerably more than running yourself, and considerably more than CloudSQL or RDS on AWS.
Their plans are so small it seems that they're not really made for running anything big. Aiven's biggest plan is 3 nodes, each with 8 cores and up to 32GB RAM.
Elephant's most expensive plan is 1 node with 4 cores and 15GB RAM. Not sure if you just buy multiple of these. They have replication support, but I don't see anything about pricing, and they don't have automatic failover.
Now if you need Postgres specific features... gets more tricky. Setting up FT is terribad.
While PostgreSQL replication may be a bit clunkier on the initial setup, it's a dream comparatively.
If you wanted to put multiple databases on one host, it would probably still be more efficient to put all the data in a single instance. I could see this working for a very small database, but then wouldn't you be better off using SQLite?
In addition, when you add containers to the mix you turn a single problem into many. Some examples:
- assuming you have multiple hosts, you need to figure out where you'll store persistent data (and you generally want a solution with high IOPS)
- how you handle logging (where do you store the logs?)
- how the applications figure out where your database is (service discovery)
- how you solve replication (and figure out which database is the master)
- how you handle failover
Wouldn't you have similar problems with Postgres without the container?
Typically to make the database highly available you might set up a second one (or perhaps more) that replicates from the master. This can become problematic if the postgres containers will be moved around.
As I said, I don't know Kubernetes. If, for example, you can have containers that keep state (i.e., you destroy one and recreate it somewhere else and it's exactly the same) and that also keep the same IP, then this is not an issue; but if each moved instance is technically a new Postgres, then such a setup might become problematic.
Regarding your question, the traditional way of running it is that you set up a host and run Postgres on it. It doesn't move around, so you don't need those solutions. Granted, if you implement service discovery, then when something happens to a host you can set up another and quickly point everything at it.
I'd expect that if you run two Postgres servers, unaware of each other, at the same IP address you will rapidly get data corruption, but maybe I'm missing a step since you say this works fine for a lot of people.
Temporal tables: https://wiki.postgresql.org/images/6/64/Fosdem20150130Postgr...
A storage design that requires periodic maintenance doesn't seem a good one.
PostgreSQL remains one of those great products that you can argue for and win based on the quality, documentation and performance of the product.