

pg_rewind in PostgreSQL 9.5 - willlll
http://hlinnaka.iki.fi/2015/03/23/pg_rewind-in-postgresql-9-5/

======
moe
This looks like a great tool, but it's also a sour reminder that replication
still feels a lot like open heart surgery on postgresql.

Why can't we just type "enslave 10.0.0.2" into psql and have the computer do
the hard work? The machinery has been "almost there" for half a decade now.
Who do we have to bribe (wink wink, nudge) to bring the UX into a state where
crutches like pg_rewind are not needed?

~~~
afarrell
> enslave "10.0.0.2"

I know it is functionally immaterial, but boy howdy do I ever wish we'd chosen
a better convention for how to refer to the relationship between these system
components.

~~~
moe
_I know it is functionally immaterial_

There's nothing wrong with Master/slave. We also read man pages, shove male
plugs into female ports, fork and kill children, and use dozens of other odd
_words_ to describe perfectly harmless things.

~~~
kenko
"man" is at least short for "manual".

There's nothing _essential_ about master/slave so why not just switch to
primary/replica?

~~~
robd003
Because the master forces the slave to do its bidding.

~~~
kenko
But ... so what? For one thing, the metaphor doesn't _have_ to go that way.
For another, "leader/follower" captures that too.

------
rdtsc
ActorDB is an interesting project that operates on a distributed SQLite
database. It tries to provide clustering between instances, and it does so by
continuously replicating the WAL between nodes.

I am not affiliated with the project, but just saw it the other day and
thought it was a pretty cool pattern:

[http://www.actordb.com/](http://www.actordb.com/)

Here is an excerpt from their description page:

\---

Actors are replicated using the Raft distributed consensus protocol. The way
we have integrated Raft with the database engine is by replicating the WAL
(write-ahead log). Every write to the database is an append to WAL. For every
append we send that data to the entire cluster to be replicated. Pages are
simply inserted to WAL on all nodes. This means the master executes the SQL,
but the slaves just append to WAL.

If a server is very stale or is added new, ActorDB will first send the base
actor file to the new server, then it will send the WAL pages (if there are
any).

We use a combined WAL file for all actors. This means even with potentially
thousands of actors doing writes at the same time, the server will not be
appending to thousands of files at once. All writes are appends to the same
file and that performs very well.

\---

Would this work for PG replication as well, I wonder?

~~~
amitlan
I hope something like the BDR project evolves in a UX-centric direction (too!).

[0] [http://blog.2ndquadrant.com/dynamic-sql-level-configuration-...](http://blog.2ndquadrant.com/dynamic-sql-level-configuration-for-bdr-0-9-0/)

[1]
[https://wiki.postgresql.org/wiki/BDR_User_Guide](https://wiki.postgresql.org/wiki/BDR_User_Guide)

------
nierman
If you are performing a planned failover, then the old master can be turned
into a slave without extra tools or steps. Simply shut down the master first:
as part of a clean shutdown, it waits until the slave has received the
necessary WAL.

pg_rewind will be great for remastering under other scenarios (unexpected
failovers, etc.)
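A rough sketch of both paths, assuming hypothetical paths and hostnames (check the docs for your exact version):

```shell
# --- Planned failover: no extra tools needed ---
# A clean shutdown of the master waits for the standby to receive the
# remaining WAL, so the old master can later rejoin as a slave.
pg_ctl -D /var/lib/pgsql/data stop -m fast        # on the old master
pg_ctl -D /var/lib/pgsql/data promote             # on the standby

# --- Unplanned failover: rewind the crashed old master ---
# pg_rewind copies only the blocks that changed since the timelines
# diverged, instead of re-seeding the whole cluster.
pg_rewind --target-pgdata=/var/lib/pgsql/data \
          --source-server='host=newmaster port=5432 user=postgres'
# Then point the old master at the new one (recovery.conf with
# primary_conninfo in 9.x) and start it as a standby.
```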

------
andyidsinga
have to say - "pg_rewind" is nice and meta here on HN.

need to convince a postgres contributor to add easter egg "pg_essay".

------
snuxoll
It's nice to see a lot of work being put into mirroring replication in the
PostgreSQL 9.0 line, but until better admin tools are there I'll probably just
keep using Corosync and Pacemaker with a shared Fibre Channel volume for
clustering. Sure, it's cold standby, but it only takes a couple of seconds for
a standby node to come up, and I just have to keep my WAL backups like normal
for recovery.

------
keypusher
One pain point we have run into is the inability to fail over to a previous
master that only has access to the other nodes' WAL logs. That is, we have
nodes that can see each other's WAL logs, but not each other's full database.
The last master can come back alone, and a previous slave can come back alone,
but a node that went down as master and came back after having missed
transactions apparently cannot.

------
hobs
Nice, definitely something that is a pain for mirroring in SQL Server too,
though if you have a good runbook it's just time-consuming and boring.

~~~
sqlcook
MSSQL mirroring failover is very easy and painless (arguably one of the best
compared to other RDBMSs); just make sure not to do automatic failover, as
that can cause false failovers when the network between the nodes and the
witness is spotty.

~~~
hobs
Sorry about that, I was referencing failing back over to the primary after
failing over to the secondary. Specifically from the post:

\- Cool. And how do I fail back to the old master?

\- Umm, well, you have to take a new base backup from the new master, and re-build the node from scratch..

~~~
sqlcook
The post is about Postgres; you were referring to MSSQL mirroring, which can
fail back from primary to secondary and back to primary almost seamlessly,
depending on the type of failure and log catch-up.

Postgres has been a pain, as described in the article.

------
corford
I don't get it. Why is it a problem to take a pg_basebackup of the new master
to re-seed the old master (which is now a slave) and then promote it to master
again?

Or does pg_rewind offer a way to do this that doesn't require taking the
current master offline when you promote a slave?
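For reference, the re-seed in question looks roughly like this (hostnames and
paths hypothetical). The article's objection is that pg_basebackup copies the
entire cluster over the network, which hurts for large databases, whereas
pg_rewind only transfers blocks changed since the timelines diverged:

```shell
# Re-seed the old master as a slave with a full base backup,
# streaming the WAL needed to make the copy consistent:
pg_basebackup -h newmaster -U replication -D /var/lib/pgsql/data \
              -X stream -P
# Then add recovery.conf with primary_conninfo pointing at newmaster
# and start the node as a standby.
```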

------
mappu
Can anyone comment on how this compares to postgres-BDR?

I'm in the market for an asynchronous multi-master RDBMS to cope with a dozen
masters and huge (~800 ms) latencies - I think my best bets are either BDR or
maybe the Cassandra storage engine for MariaDB.

------
iradik
How does MySQL failover compare in this regard?

~~~
barrkel
MySQL in master/master mode is fairly painless. Set the secondary site to
readonly mode, and when failing over, ensure the primary site is definitely
down before enabling writes on the secondary site.

When failing back, bring the primary site back up in readonly mode, give it
some time to catch up, make the secondary site readonly again, verify if you
want to be sure (Percona tools help here), and finally make the primary site
read-write.

The initial setup in MySQL is a bit fiddly, but overall it's been fairly
problem-free as far as I've seen. It's not automatic, but the cost of manual
intervention is far lower than the headache of split brain and write
divergence.
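The failback steps above can be sketched as follows (hostnames hypothetical;
note that `read_only` does not restrict users with SUPER, and in a real setup
you'd also verify data consistency, e.g. with Percona's pt-table-checksum):

```shell
# 1. Bring the old primary back up in read-only mode.
mysql -h primary   -e "SET GLOBAL read_only = ON"
# 2. Let it catch up; watch Seconds_Behind_Master until it reaches 0.
mysql -h primary   -e "SHOW SLAVE STATUS\G"
# 3. Freeze writes on the secondary, verify, then re-enable the primary.
mysql -h secondary -e "SET GLOBAL read_only = ON"
mysql -h primary   -e "SET GLOBAL read_only = OFF"
```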

(Disclaimer: I far prefer postgres as a development target, but hassle-free
failover and failback is a hard requirement where I work, and our business
model includes giving every customer (banks etc.) a completely separate
database instance, including multiple VMs on a separate vlan; per-CPU costs
don't work out, it's pretty much MySQL or nothing, until postgres makes it
just as pain-free.)

------
bradhe
Every step forward in modern relational DBs for reliability is a step
backwards for operations and the simplicity of the model. If you're steeped in
the ecosystem and know how things "used to be", you don't see how insane
things actually are. Forest, trees, etc.

~~~
jeffdavis
Can you please expand?

~~~
dijit
he's scared of the increasing complexity of individual systems.

personally, I see where he's coming from with things like systemd.

complexity != usability.

simple things work well and work reliably.

however, I like the increasing features surrounding postgresql- and I'm
playing devils advocate. (and anyway, it's not like it's part of the core
build, the postgresql binary is untouched!)

