

Replicated PostgreSQL with pgpool2 - secure
http://michael.stapelberg.de/Artikel/replicated_postgresql_with_pgpool

======
ruckusing
Another solution is wal-e[1] which handles continuous archiving. It was built
by the Heroku guys and as such is battle-tested.

I use wal-e myself and it's indispensable and easy to use.

[https://github.com/wal-e/wal-e](https://github.com/wal-e/wal-e)

~~~
secure
After reading its README.rst, it seems to me like wal-e only cares about
backing up (and restoring) WAL files, not about synchronous replication like
the original article.

~~~
tacticus
But it is very easy to combine wal-e and the built in streaming replication.
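
For example (a minimal sketch; hostnames and paths are placeholders, and it
assumes wal-e's S3 credentials live in an envdir as its README describes),
the primary archives WAL segments through wal-e while the standby streams
from the primary directly:

```ini
# postgresql.conf on the primary: stream to standbys *and* archive to S3
wal_level = hot_standby
max_wal_senders = 3
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# recovery.conf on the standby: stream from the primary, falling back to
# the S3 archive if the standby ever lags too far behind
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch %f %p'
```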

------
rpedela
Why isn't the built-in master/slave replication plus WAL archiving to
something like S3 sufficient?

~~~
secure
Excellent question. The way I understand it is that master/slave replication
provides eventual consistency by shipping WAL logs to the other node once a
WAL log is finalized.

pgpool2 on the other hand provides statement-level consistency, i.e. a
statement is only successful when it is committed to all healthy nodes.

I agree that for many use-cases, the built-in master/slave replication is
probably good enough, but I wanted to try going all the way… :)

~~~
bremac
Master / slave replication also has a synchronous mode, so lack of consistency
isn't an issue. Either way, you still need to do some work on top of
replication in order to get automatic failover.

~~~
secure
After reading the manual again, I recall what made me not pursue the
PostgreSQL built-in master/slave replication:

""" Commits made when synchronous_commit is set to on or remote_write will
wait until the synchronous standby responds. The response may never occur if
the last, or only, standby should crash.

[…]

If you really do lose your last standby server then you should disable
synchronous_standby_names and reload the configuration file on the primary
server. """

The way I interpret this is that in case my one and only standby server
crashes, my database will not allow any modifications until I intervene. With
pgpool2, writes will continue to work on the master, and it’s my
responsibility to eventually bring back the standby server.
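
The settings the manual is talking about look roughly like this (a sketch
for a two-node setup; the standby name is a placeholder):

```ini
# postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'standby1'  # commits wait for this standby

# If standby1 crashes, commits hang until you intervene manually, e.g.:
#   synchronous_standby_names = ''      # then: pg_ctl reload
```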

~~~
joevandyk
If you want to use sync commits, you need a minimum of three servers.

Think about what happens if you only have two machines. If the standby goes
down, then it's impossible for data to be protected if the master also goes
down.

~~~
secure
With pgpool2 and two servers, I can have synchronous commits _as long as_ both
servers are healthy, and still continue operating (with less durability) when
only one server is healthy.

Of course, it’s impossible to protect against data loss when the remaining
server also goes down, but you always have that risk :). As I said, I realize
that a setup with only two servers cannot be perfect, but it’s all I’m willing
to afford for a spare-time hobby.

So, in comparison, pgpool2 provides me with a more convenient mode of
operation for my use-case.

~~~
joevandyk
What's the point of using sync commits if you don't actually ensure that the
data is on two machines before the transaction is committed?

~~~
rpedela
I think OP wants at least one machine that is healthy and fully operational
at any point in time, but the cost of a three-machine setup is too great.
There are three main failure scenarios.

1. Primary fails and standby takes over. Sync mode helps here because there
should be no data loss for completed transactions. In async mode, there could
be some data loss for completed transactions.

2. Standby fails and primary continues to operate normally. When the standby
is back online, it catches up. Currently the primary would not be able to
continue to operate normally because of those config settings.

3. Both primary and standby fail simultaneously. A very unlikely scenario,
but one that can be handled with WAL archiving, which still carries a risk of
some data loss.

------
imglorp
See also bucardo.

[http://bucardo.org/wiki/Bucardo](http://bucardo.org/wiki/Bucardo)

------
narsil
We run a similar setup at Kloudless [1]. We use PgBouncer [2] for connection
pooling, which connects to pgpool2 to load balance between our Postgres
servers. We've noticed PgBouncer is more performant at handling thousands of
simultaneous connections.

[1] [https://kloudless.com](https://kloudless.com) [2]
[http://wiki.postgresql.org/wiki/PgBouncer](http://wiki.postgresql.org/wiki/PgBouncer)
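
As a rough sketch of such a chain (the ports and database name here are made
up; PgBouncer accepts the application connections and forwards them to
pgpool2, which balances across the Postgres servers):

```ini
; pgbouncer.ini -- applications connect to PgBouncer on port 6432
[databases]
mydb = host=127.0.0.1 port=9999 dbname=mydb  ; 9999 = pgpool2's default port

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
max_client_conn = 5000   ; thousands of app connections...
default_pool_size = 20   ; ...funneled into a small server-side pool
```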

~~~
techdragon
pgbouncer is like having an nginx reverse proxy in front of your postgres
connections.

------
adamnemecek
Seems down. Google cache link:

[http://webcache.googleusercontent.com/search?q=cache:MPIiThx...](http://webcache.googleusercontent.com/search?q=cache:MPIiThxiSD8J:michael.stapelberg.de/Artikel/replicated_postgresql_with_pgpool+&cd=1&hl=en&ct=clnk&gl=us)

~~~
secure
You sure? I am monitoring the page and didn’t notice anything. The site is
IPv6-enabled, though, so perhaps your IPv6 connectivity is broken?

Traceroutes welcome in case it is reproducibly down for you :).

~~~
adamnemecek
Yeah it's reproducible. It works if I go through a proxy. If you give me your
email I'll send you a traceroute.

~~~
secure
The email address is the same as the domain, just with the first dot replaced
by an @ sign :).

------
mrmondo
I've done a lot of work with pgpool over the past year, be aware there are
lots of situations where it won't work for you. For example if your devs don't
write their own SQL and instead use a framework with limited control you're
going to have a bad time.

~~~
jbardnz
I use Postgres with Laravel so I'm not writing any SQL queries directly. Can
you go into more detail about why this won't work for me?

~~~
mrmondo
I was speaking to the author (Tatsuo Ishii) about the reliability of
pgpool-II when you do not have complete control over your SQL. He made it
clear that if you cannot control what kinds of SQL are and aren't run from
the application, and I quote, "I will suggest you avoid pgpool-II." He also
noted that "Pgpool-II does not handle multi statements query very well", so
pretty much count yourself out of using pgpool-II if you're using any
multi-statement queries, which in our case is a great deal of our SQL.
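
A "multi-statement query" here means several statements sent to the server
in a single request, the kind of thing ORMs and migration tools emit
routinely (a hypothetical example; the table and columns are made up):

```sql
-- Several statements in one request; pgpool-II's statement-level handling
-- reportedly copes poorly with batches like this:
BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT;
```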

------
RA_Fisher
Fantastic write up, really. Surely, there has to be a simpler way, no?

~~~
secure
Thank you.

As for a simpler way, I think suitable scripts for recovery should be shipped
with pgpool2 as a first step.

A second step would be integrating pgpool2 functionality into PostgreSQL
itself. That way, the whole authentication problem would go away, and you
would not need to run a separate program. Also, the WAL shipping could be
replaced by just letting the non-primary nodes use a replication connection to
the primary node directly.

That would not get rid of all of the complexity, but it’d hide a lot more of
it from the user :).

------
raarts
The problem with these setups is that 'There Are Many Ways To Do It'(tm) and
'You Really Need To Test For Your Use-Case'(tm). I need to read and
understand everything to decide what's best in my case. And then you need to
write a lot of scripts, do a _lot_ of time-consuming testing, and document
everything.

Scalable, Reliable PostgreSQL is not really there yet.

There are many ways to do it, yes, but most people just want one thing: the db
failing over in case it goes down.

~~~
jeffdavis
As the author points out below[1], this use case does seem to be solved by the
built-in postgres replication.

So, there is a single solution. If you want greater flexibility, pgpool2 may
be an option; otherwise, just use what's built-in.

[1]
[https://news.ycombinator.com/item?id=8250497](https://news.ycombinator.com/item?id=8250497)

~~~
raarts
Try reading the PostgreSQL documentation on this, then read my comment again.

~~~
jeffdavis
Can you please make a more specific criticism? What should be simplified?

~~~
techdragon
As someone who has deployed a streaming-replication master/slave pair with a
backup regime designed to achieve a point-in-time recovery target of no more
than one minute's data loss in the event of a complete failure requiring
restoration from backups....
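
A recovery point that tight is typically bounded with `archive_timeout` (a
sketch; the archive destination is a placeholder):

```ini
# postgresql.conf -- force a WAL segment switch at least every 60 seconds,
# so the archive never trails live data by more than ~1 minute
archive_mode = on
archive_timeout = 60
archive_command = 'cp %p /mnt/wal-archive/%f'  # placeholder destination
```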

Everything, it all needs to be simplified. Yes, it's the "difficult" problem
for a database, but it's also the one area where I feel Postgres is lacking.
What's needed is for the Postgres community to quit messing around and accept
that clustering is no longer something they can fob off to other projects.
They need to adopt either Postgres-XC or preferably Postgres-XL and push it
forward with full force.

The current state of affairs is "pgpool, slony, repmgr, bucardo, and the
rest: pick one, spend hours testing how well they fit your use case, then
use that." Now this is OK when a project is new, but Postgres is about 20
years old. Clustering is not new, and many of these solutions are old. We
finally have a very good clustering option, but it's not even mentioned on
half the official wiki pages about clustering, last time I looked.

Postgres is a professional-grade tool, so it's OK not to endorse things not
considered ready yet. But Postgres-XL is in use in production deployments,
and it's built out of Postgres itself, not via proxies or other layering
over or inside vanilla Postgres. It's not a hack, and endorsing it as the
definitive future of clustered Postgres, while not necessarily endorsing it
as "as good as regular Postgres", is long overdue. At least that's my
opinion as a serious production Postgres user who makes the decision to use
other databases more often than not, due to lack of good clustering.

~~~
rpedela
The last time I looked at Postgres-XL, it required handling standby data nodes
yourself. Has that changed?

~~~
techdragon
If you bring them up in matched pairs, ensuring that the data nodes use
streaming replication is a lot easier ... AND it's worth the trouble for the
ability to do whole-cluster "upgrades" without taking the cluster itself
offline.

------
xordon
Surely there must be an easier way?

~~~
akbar501
PostgresXL ([http://www.postgres-xl.org](http://www.postgres-xl.org)) is one
alternative solution.

~~~
majorsc2noob
That does not have automatic failover. I really want to use Postgres at work
but every time I start reading about failover, load balancing, sharding and so
on it seems like such a mess with PG.

I just want sharding with automatic failover to separate datacenter in a
simple package...

~~~
dc2447
I'm not sure there is actually something in the open source space that is ACID
compliant and offers all those features.

Sharding is a problem in Postgres and I am sure they will fix the disconnect
between how the data is inserted and how it's read pretty soon.

Automatic failover and datacentre awareness are not typical features in open
source RDBMSs, although some NoSQL solutions may do this for you.

I'm willing to overlook many weaknesses in PG as we get blistering performance
and amazing stability combined with features like hot schema upgrades.

