pg_rewind in PostgreSQL 9.5 (hlinnaka.iki.fi)
243 points by willlll on March 23, 2015 | 60 comments



This looks like a great tool, but it's also a sour reminder that replication still feels a lot like open-heart surgery on PostgreSQL.

Why can't we just type "enslave 10.0.0.2" into psql and have the computer do the hard work? The machinery has been "almost there" for half a decade now. Who do we have to bribe (wink wink, nudge) to bring the UX into a state where crutches like pg_rewind are not needed?


I'd summarize the replication story of postgres to be: "harder to use but easier to understand".

PostgreSQL generally builds things from the bottom up -- first working really hard to get the fundamental infrastructure right, and then the utility starts falling into place.

This has a few advantages:

* The fundamental infrastructure tends to be more general, so many different functional improvements come out quickly after the infrastructure is in place.

* It forces everyone involved to think through the edge cases first, leading to a more robust feature and a better-tested feature.

* When things go wrong, all of the parts make more sense individually, so it's easier to understand what happened and fix it.

* Eventually, the usability issues are addressed in a way that's not dependent on a lot of black magic.

Anecdotally, when things go wrong in $MAGICALLY_REPLICATING_DB, they tend to go very wrong, very fast. I am skeptical by nature, so I generally assume that $MAGICALLY_REPLICATING_DB glosses over a lot of the finer points, which don't get exercised in a normal testing environment. I haven't presented any facts here, so you may disagree.

A lot of people share my skepticism, and are willing to put up with a few rough edges in the usability as long as the pieces make sense and they feel they can understand and trust them. I was at a presentation by Instagram, and replication was one of the primary reasons they went with postgres. They used a tiny engineering team (one and a half engineers working on the database) to scale to tens of millions of active users (also using sharding; another thing that is harder to use but easier to understand in postgres).


A thing people consistently seem not to get when bitching about PostgreSQL (streaming) replication is that, by design, the slave pulls instead of the master pushing. There are Very Good Reasons for that.

And it's really not that hard. MVPRC (Minimum Viable PostgreSQL Replication Configuration):

  Master:
    postgresql.conf:
      wal_level = 'hot_standby'
      max_wal_senders = N
    pg_hba.conf:
      host replication all A.B.C.D/E md5
  Slave:
    postgresql.conf:
      hot_standby = 'on'
    recovery.conf:
      standby_mode = 'on'
      primary_conninfo = 'host=W port=X user=Y password=Z'
Literally six lines of "directive = value". (Well, five plus the HBA rule.)

Then you use pg_basebackup, and start the slave.
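For the curious, the pg_basebackup step is a one-liner as well. A sketch, reusing the placeholder connection values from above (-X stream needs 9.2+, and -R, which writes recovery.conf for you, needs 9.3+):

  $ pg_basebackup -h W -p X -U Y -D $PGDATA -X stream -R -P

-X stream pulls the WAL generated during the copy over the same replication connection, so the slave can start and catch up immediately.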

EDIT: Decided to be specific instead of handwavy.


> Literally six lines

Which is six lines too many for what is inherently a runtime configuration that shouldn't be mentioned in any config file at all.

> Then you use pg_basebackup, and start the slave.

I take it you don't care for failover and monitoring then?

Otherwise you would have mentioned either repmgr or pgpool here, or your conglomerate of rather complex homegrown scripts to mimic their functionality.


No, and nor did I mention "fencing". I said "minimum viable". No other open source RDBMS manages those things, so calling postgres out for lacking them is a little disingenuous.

Replication isn't "inherently" anything. Making it a runtime thing is one way of doing it. Just because it's your preferred way (or — as often seems to be the case among people who share your opinion about how postgres does it — just how they were first exposed to it) doesn't make any of the myriad other, perfectly viable ways of doing it wrong.

As for my "conglomerate of homegrown scripts", I keep my db configs versioned in git, and use symlinks from $PGDATA into a role- and environment-specific (e.g., testing-master or production-slave or whatever) subdirectory in a clone of that repo. Puppet clones the repo on spinning up a new machine, and I repoint the symlinks as necessary. You know:

  $ ln -sf /path/to/repo/environment/role/*.conf $PGDATA/
Complex, huh?


> No other open source RDBMS manages those things, so calling postgres out for lacking them is a little disingenuous.

Doesn't MySQL Cluster do this pretty well?


MySQL clustering is a goddamn PITA, just as MySQL replication is, regularly requiring you to take a plunger to unclog it. The quirks and bugs are myriad.


> fencing

Which relates to a master/slave cluster as in how?

> no other open source RDBMS out there manages those things either

Rethinkdb, Redis, Riak, ElasticSearch all do.

(them not being relational is irrelevant to replication)

> Making it a runtime thing is one way of doing it.

It's the only way of doing it for modern, dynamic application layouts (the whole "cloud" thing). The stereotypical static database cluster, carefully configured by a graybeard once and then never touched again, still exists, but it's rapidly losing ground.

> Puppet clones the repo on spinning up a new machine, and I repoint the symlinks as necessary.

So how do you monitor e.g. the replication lag?


Bringing up Redis, Riak, and ElasticSearch is nice...

You see, ElasticSearch gets split-brain surprisingly easily; with any significant load, number of hosts, or host failures you will most definitely experience it, whereas with most other databases it only happens once every couple of years, and only to the very largest and heaviest users. It's also just not known for data consistency; you have to be able to refresh data from some other canonical source periodically.

Redis requires twice as much memory as it uses to serve the request load in order to take a backup, because it has to fork to either save the snapshot or compact the write log. You can really only safely use half the server's memory.

Riak performance is difficult to manage and keep consistent, as you replace failed nodes.

What I have experience with is a very large / heavy use of mysql at a popular website. We did manual/application sharding, and master/slave streaming replication, with automatic read-failover to slave, but only manual promotion. The slave did backups each day, and there was enough transaction log on the master and slave to spin up a new slave to either, with a backup up to 3 days old, and have it catch up. The nature of the load was that it was not possible to take a backup of the master in order to seed a new slave, because the traffic never stops and we could not afford the memory or iops overhead of backing up the master, and we really didn't do downtime if we could help it. A couple hours a year maybe.

We had trending and alerting of everything, including replication and roles, and we automated it all with something not unlike the configuration it appears you have to use for PostgreSQL. Frankly these new-fangled auto-everything databases are untrustworthy; I've seen three of them fail (in ways you might not notice if you're not a graybeard), and the fourth has a name that really puts me off. RethinkDB? Yeah, go ahead, kids. I just love all this unreliable crap on the internet these days...

PS I'm 26 :)


> Frankly these new-fangled auto-everything databases are untrustworthy

I'm not asking for postgres to change its ways and become "untrustworthy".

I'm merely asking that postgres wraps more automation and convenience around the things that it already does.

Replication should generally be operated from inside psql. Spinning up a slave should be a single command. Failing over should be a single command. There should be meaningful error messages, ETAs and progress bars. No rsync'ing of WAL files, no futzing with config files. No elaborate third party 'repmgr' daemons.


> I'm merely asking that postgres wraps more automation and convenience around the things that it already does.

and if history is any indication, it will when it's good and ready to be convenient and automated. as a previous poster (or two) most eloquently noted, postgres tends to work from the bottom up. when the "fundamentals" have been hashed out and thoroughly tested through all or nearly all edge cases, then convenient interfaces "bubble up"

and hey, your patches are always welcomed :)


> and if history is any indication, it will

Very true. I was merely bemoaning the status quo. This is a thread about a third party tool to patch up a rather severe UX flaw after all...


> fencing ... relates to a master/slave cluster as in how?

Because when you promote a slave, you also need to fence writes to the old master, otherwise you can end up in a "split-brain" scenario, where the old and new masters are both accepting writes, and nobody's replicating.

Have fun cleaning that up.

> (them not being relational is irrelevant to replication)

I disagree. Their being relational imposes extra constraints upon replication. ACID?

> So how do you monitor e.g. the replication lag?

Icinga invokes this: https://bucardo.org/check_postgres/check_postgres.pl.html#re...
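For a quick sanity check without the plugin, you can also eyeball the lag directly. A sketch against the 9.x-era catalogs (pg_xlog_location_diff and pg_stat_replication's replay_location assume 9.2+):

  # on the slave: wall-clock replay lag
  $ psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"
  # on the master: byte lag per connected standby
  $ psql -c "SELECT client_addr, pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS byte_lag FROM pg_stat_replication"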

Look, I'm rapidly losing interest in going back and forth like this. You don't seem very interested in the validity of another perspective, but rather just keep finding fault in things that don't do it your preferred way. If you want to have a discussion, I'll play. If not, let's just walk away.


> Because when you promote a slave, you also need to fence writes to the master, otherwise you can end up in a "split-brain" scenario.

There is only one master, hence there can not be a "split brain". You seem to be confusing master/slave and multi-master configuration (which pg doesn't support).

> I disagree. Their being relational imposes extra constraints upon replication. ACID?

Well, it doesn't.

> Look, I'm rapidly losing interest in going back and forth like this.

You claim things are "easy" that you clearly haven't done before. I was merely calling you out on that, but agree the discussion is unlikely to go anywhere from here.

Edit: Oh and you retouching your comments all the time doesn't help either (the split-brain paragraph just changed...). I'm out.


> There is only one master, hence there can not be a "split brain".

That's very naive. There are situations -- very real ones -- where the failing master doesn't know it is failing, where clients can still connect and the new master can't safely tell the old one to stop being a master. This is where fencing comes in, and other concepts like heartbeats and STONITH and backup interconnects and so on, all of which makes automatic failover complicated and error-prone.

There are good reasons that 35+ years after the invention of the relational database, only the top commercial ones implement this really cleanly (but they still require expensive dedicated DBAs, for the most part).


> You claim things are "easy" that you clearly haven't done before.

I've been a PostgreSQL DBA for a decade, and built HA setups for my last three employers. Care to reevaluate your claims of what I have or haven't done?


> Rethinkdb, Redis, Riak, ElasticSearch all do.

None of those are RDBMSes. Since they all lack schemas, constraints, strict consistency and ACID transactions, their replication logic can be much simpler. There are some very good reasons that true multimaster replication is nearly unknown among relational databases.


> Since they all lack schemas, constraints, strict consistency and ACID transactions, their replication logic can be much simpler.

Why would an RDBMS txlog be somehow harder to ship than any other db log?

And how is that an excuse for crappy usability in any case?


Conflict management and consistency in general.

The products you mention have no interdependencies within relations or between relations, such as an RDBMS has, which means that they can just push streams of individual K/V pairs without having to maintain consistency between them.

Most of those projects only offer temporal consistency (last write wins), and the CRDTs offered (e.g., Riak) are exceedingly simple compared to the complexities of managing uniqueness constraints or foreign key constraints across transactions.

Hinted handoff -- a common technique that mitigates split-brain data loss and increases availability in a sharded scenario -- simply wouldn't work in a relational scenario because nobody would have any idea if the changes would apply when the target shard came back.

Projects like Postgres-XC (ActorDB also looks promising) have shown that multimaster replication is clearly possible in a relational database, but also that it's clearly a difficult problem to crack, because as has been discovered the last decade or so, consistency is the enemy of availability and partition tolerance.

Witness the very slow progress made by XC to retrofit multimaster replication into Postgres, and the many challenges (such as sequences) they have had to solve. Or the fact that it's pretty much alone in the open-source field in trying to solve this issue with a modern relational database. Even very mature, advanced, commercial solutions like Oracle RAC and Oracle Advanced Replication require a lot of configuration, careful design and monitoring to get right.

Note that I'm explicitly using the projects you mentioned as the benchmark here, i.e. multimaster-replicated stores with automatic failover based on consensus protocols. You can build a simpler system — single master, consensus-based election, strict consistency guarantee across shards (reducing either availability or partition tolerance) — but it's still harder than you seem to think it is.


> Hinted handoff ... multimaster

What does multi-master have to do with anything?

The subject was making the postgres master/slave replication as easy to use as the aforementioned DBs.

Yes, some of them use more advanced clustering models (multi-master) than postgres and are still easier to use. That doesn't exactly make postgres look better in the comparison.


Are you being deliberately obtuse? You explicitly mentioned four multimaster key/value stores:

  Rethinkdb, Redis, Riak, ElasticSearch all do.
  (them not being relational is irrelevant to replication)
I argued that those are not RDBMSes, and that their replication model was much simpler, something you still seem to be arguing against.

I don't disagree that Postgres's replication could be much nicer, but you're not articulating your position well. For one, bringing up these completely apples-to-oranges comparisons makes no sense.


> Are you being deliberately obtuse?

You're the one being obtuse here.

I introduced four databases, three of which use more complex replication models than PG, all of which are easier to admin than PG. And I'm asking PG to make their (simpler) replication as easy to use as these other databases.

If that's too confusing then just use Redis for reference and ignore the other three. Redis is plain master/slave and spinning up a slave takes a single command ("SLAVEOF foo"). That's exactly how it should be in postgresql as well. Not more, not less.
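(For the record, SLAVEOF takes a host and a port; a sketch with made-up addresses, driven through redis-cli:)

  $ redis-cli -h 10.0.0.2 SLAVEOF 10.0.0.1 6379

That single command makes 10.0.0.2 start pulling from 10.0.0.1, initial sync included.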


They can and do use different replication models because their data model is explicitly wildly different from Postgres. For example, Postgres couldn't support eventual consistency even if it wanted to. You're barking up the wrong tree.

As for Redis, sure, it has a simple master/slave system and Postgres could offer a similarly simple command UI that would let you issue a command instead of doing the current, trivial amount of setup. I don't disagree, but I think it's an inconsequential problem that affects a sysadmin/devops person about five minutes each year, and an issue that ultimately only deters dilettantes. The real problem to solve is dynamic promotion/demotion of masters and slaves. It's not entirely trivial [1].

[1] http://redis.io/topics/sentinel


> Postgres couldn't support eventual consistency even if it wanted to. You're barking up the wrong tree.

And you're barking at a strawman. Nobody was talking about eventual consistency.

> I don't disagree

You could do that less verbosely then.

> The real problem to solve is dynamic promotion/demotion of masters

This whole thread is about how postgres hasn't even streamlined the manual promotion/demotion of one master in a simple master/slave pairing.


I asked a few people that at the excellent pgnordic conference and got some hand-waving about how repmgr fixes everything: https://github.com/2ndQuadrant/repmgr/blob/master/QUICKSTART...

Having tried that out, I didn't find it particularly user-friendly compared to the various fancy NoSQL databases where it's just something like... database-server --connect the_master:12324 -- and you've got your cluster, even with automatic replication of data depending on your sharding rules.

I suppose that ACID-SQL makes it harder to set this up reliably.

Is there one of the commercial things like EnterpriseDB that fixes that? Effortless, reliable clustering with a nice status that says: slave2 is 95% sync'ed with master1, ETA 2 hours.


> enslave "10.0.0.2"

I know it is functionally immaterial, but boy howdy do I ever wish we'd chosen a better convention for how to refer to the relationship between these system components.


> I know it is functionally immaterial

There's nothing wrong with Master/slave. We also read man pages, shove male plugs into female ports, fork and kill children, and use dozens of other odd words to describe perfectly harmless things.


"man" is at least short for "manual".

There's nothing essential about master/slave so why not just switch to primary/replica?


Because there is no reason to screw with common terminology to begin with.

Neither Master nor Slave is a derogatory word. And using the concept of slavery to describe a relationship between inanimate objects in a technical context is not an endorsement of such practices between humans.


Because the master forces the slave to do its bidding.


But ... so what? For one thing, the metaphor doesn't have to go that way. For another, "leader/follower" captures that too.


Not if the replica is pulling changes from the primary.



I try to use 'primary' and 'replica'; but depending on who I'm talking to I find myself slipping back to the Master/Slave terminology.

And to the OP, replication is a complex process fraught with security risks. Take about 10 minutes thinking about just roles and authentication in the context of your proposed syntax and what you would need to set up ahead of time to make it work.

The current method of starting with a pg_basebackup on the replica is nearly as simple, even if it does require some minor configuration on the master. It's nowhere near open-heart surgery.
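To make "some minor configuration on the master" concrete, it amounts to a replication role plus an HBA rule, on top of the wal_level/max_wal_senders settings quoted upthread. A sketch; the role name, password, and subnet are placeholders:

  $ psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret'"
  # and in pg_hba.conf on the master:
  #   host  replication  replicator  10.0.0.0/24  md5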


> what you would need to set up ahead of time to make it work

Nothing except a shared secret or keypair. Other databases (rethinkdb, redis, riak etc.) show how it's done.


Let's just agree to use better words starting now.

Primary/Replica?


My proposal: muse and scribe. It's evocative and if you know the old terminology the initials are the same.


> simon says 10.0.0.2


Heroku uses "follower", which is much less jarring.


I find it's a useful heuristic for knowing whether someone can think clearly and abstractly. Kind of like the "all green birds have two heads" test.


Could you expand on that? I get the reference, but what is your heuristic?


Someone who has a problem with master/slave terminology probably can't think abstractly (in the same sense as someone who fails the all green birds have two heads test).


ActorDB is an interesting project that operates on distributed SQLite databases. It tries to provide clustering between instances, and it does so by continuously replicating the WAL between nodes.

I am not affiliated with the project, but just saw it the other day and thought it was a pretty cool pattern:

http://www.actordb.com/

Here is the excerpt from their description page:

---

Actors are replicated using the Raft distributed consensus protocol. The way we have integrated Raft with the database engine is by replicating the WAL (write-ahead log). Every write to the database is an append to WAL. For every append we send that data to the entire cluster to be replicated. Pages are simply inserted to WAL on all nodes. This means the master executes the SQL, but the slaves just append to WAL.

If a server is very stale or is added new, ActorDB will first send the base actor file to the new server, then it will send the WAL pages (if there are any).

We use a combined WAL file for all actors. This means even with potentially thousands of actors doing writes at the same time, the server will not be appending to thousands of files at once. All writes are appends to the same file and that performs very well.

---

Would this work for PG replication as well, I wonder?


I hope something like the BDR project evolves in a UX-centric direction (too!).

[0] http://blog.2ndquadrant.com/dynamic-sql-level-configuration-...

[1] https://wiki.postgresql.org/wiki/BDR_User_Guide


If you are performing a planned failover then the old master can be turned into a slave without extra tools or steps. Simply shut down the master first; as part of this process it waits until the slave has received the necessary WAL.

pg_rewind will be great for remastering under other scenarios (unexpected failovers, etc.)
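In shell terms, the planned failover described above looks roughly like this. A sketch only; data directories, hostnames, and replication credentials are placeholders, and following the timeline switch over streaming needs 9.3+:

  # 1. cleanly stop the old master (a clean shutdown ships the remaining WAL to the standby)
  $ pg_ctl -D /data/old_master stop -m fast
  # 2. promote the standby; it becomes the new master on a new timeline
  $ pg_ctl -D /data/standby promote
  # 3. restart the old master as a slave of the new one
  $ echo "standby_mode = 'on'" > /data/old_master/recovery.conf
  $ echo "primary_conninfo = 'host=new-master port=5432 user=replicator password=secret'" >> /data/old_master/recovery.conf
  $ echo "recovery_target_timeline = 'latest'" >> /data/old_master/recovery.conf
  $ pg_ctl -D /data/old_master start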


have to say - "pg_rewind" is nice and meta here on HN.

need to convince a postgres contributor to add easter egg "pg_essay".


It's nice to see a lot of work being put into mirroring replication in the PostgreSQL 9.0 line, but until better admin tools are there I'll probably just keep using corosync and pacemaker with a shared fiber channel volume for clustering. Sure, it's cold standby, but it only takes a couple seconds for a standby node to come up and I just have to keep my WAL backups like normal for recovery.


One pain point we have run into is the inability to fail over to a previous master that only has access to the other node's WAL logs. That is, we have nodes that can see each other's WAL logs, but not each other's full database. The last master can come back alone, a previous slave can come back alone, but a node that went down as master and came back after having missed transactions apparently cannot.


Nice, definitely something that is a pain for mirroring in SQL Server, though if you have a good runbook it's just time-consuming and boring.


MSSQL mirroring failover is very easy and painless (arguably one of the best among RDBMSes); just make sure not to use automatic failover, as a spotty network between the nodes and the witness can cause false failovers.


Sorry about that, I was referencing failing back over to the primary after failing over to the secondary. Specifically from the post:

- Cool. And how do I fail back to the old master?

- Umm, well, you have to take a new base backup from the new master, and re-build the node from scratch..


The post is about Postgres; you were referring to MSSQL mirroring, which can fail over to the secondary and back to the primary almost seamlessly, depending on the type of failure and log catch-up.

Postgres has been a pain as described in the article.......


I don't get it. Why is it a problem taking a pg_basebackup of the new master to re-seed the old master (which is now a slave) and then promoting it to master again?

Or does pg_rewind offer a way to do this that doesn't require taking the current master offline when you promote a slave?
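(For context: pg_rewind runs against the stopped old master's data directory and copies back only the blocks that changed after the timelines diverged, with the current master staying online as the source, so the full base backup is skipped. A sketch of the 9.5 invocation with placeholder paths and hosts; it also requires wal_log_hints = on or data checksums on the old master:)

  $ pg_rewind --target-pgdata=/data/old_master \
        --source-server='host=new-master port=5432 user=postgres dbname=postgres'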


Can anyone comment on how this compares to postgres-BDR?

I'm in the market for an asynchronous multi-master RDBMS to cope with a dozen masters and huge (~800ms) latencies - I think my best bets are either BDR or maybe the Cassandra storage engine for MariaDB.


how does mysql failover compare in this regard?


MySQL in master/master mode is fairly painless. Set the secondary site to readonly mode, and when failing over, ensure the primary site is definitely down before enabling writes on the secondary site.

When failing back, bring the primary site back up in readonly mode, give it some time to catch up, make secondary site readonly again, verify if you want to be sure (Percona tools help here), and finally make primary site read-write.

The initial setup in MySQL is a bit fiddly, but overall it's been fairly problem-free as far as I've seen. It's not automatic, but the cost of manual intervention is far lower than the headache of split brain and write divergence.
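For anyone not fluent in MySQL, the knob being flipped at each of those steps is the read_only flag (a sketch with placeholder hostnames; the replication threads and SUPER users are not blocked by it):

  # normally: the secondary refuses ordinary writes
  $ mysql -h secondary -e "SET GLOBAL read_only = ON"
  # failing over, once the primary is confirmed down: open the secondary for writes
  $ mysql -h secondary -e "SET GLOBAL read_only = OFF"
  # failing back: bring the primary up read-only, let it catch up, then swap the flags again
  $ mysql -h primary -e "SET GLOBAL read_only = ON"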

(Disclaimer: I far prefer postgres as a development target, but hassle-free failover and failback is a hard requirement where I work, and our business model includes giving every customer (banks etc.) a completely separate database instance, including multiple VMs on a separate vlan; per-CPU costs don't work out, it's pretty much MySQL or nothing, until postgres makes it just as pain-free.)


This looks similar* to replication w/GTIDs + transactional replication (both MySQL 5.6 features).

* MySQL replication works a little differently by using binary logs.


Every step forward in modern relational DBs for reliability is a step backwards for operations and the simplicity of the model. If you're steeped in the ecosystem and know how things "used to be", you don't see how insane things actually are. Forest, trees, etc.


Can you please expand?


he's scared of the increasing complexity of individual systems.

personally, I see where he's coming from with things like systemd.

complexity != usability.

simple things work well and work reliably.

however, I like the increasing features surrounding postgresql - and I'm playing devil's advocate. (and anyway, it's not like it's part of the core build, the postgresql binary is untouched!)



