
Scaling PostgreSQL at Braintree: Four Years of Evolution - pgr0ss
https://www.braintreepayments.com/braintrust/scaling-postgresql-at-braintree-four-years-of-evolution
======
jacques_chester
> _Unlike most applications, we are much heavier on writes than reads._

I think what they meant to say is that unlike most _web_ applications, they
are heavier on writes than reads.

Write-heavy workloads -- aka OLTP -- are the bread and butter of relational
databases, to the point that people complain they aren't as good at
read-heavy workloads.

~~~
JoelJacobson
What they probably meant is that they do more writes than web applications,
not more writes than reads.

~~~
jacques_chester
Which was my point.

Credit card transactions are literally the textbook example of OLTP.

------
nathan_f77
I found this article incredibly interesting, because I've only worked on low-
traffic Rails apps so far. I've always wanted to learn about sharding, but
haven't needed to, and didn't know where to start.

Now I'm inspired to start up a few VMs and have a play with the gems they
mentioned, and try to get replication and failover happening.
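
From what I've gathered so far, the basic recipe on the 9.2-era releases
discussed here seems to be roughly this (hostnames and paths are my own
guesses, not anything from the article):

    # postgresql.conf on the master VM (plus a replication
    # entry for the standby in pg_hba.conf)
    wal_level = hot_standby
    max_wal_senders = 3

    # recovery.conf on the standby VM, after seeding its data
    # directory with pg_basebackup; hot_standby = on in its
    # postgresql.conf lets it serve read-only queries
    standby_mode = 'on'
    primary_conninfo = 'host=master-vm port=5432 user=repl'

    # simulate a failover by promoting the standby
    pg_ctl promote -D /var/lib/postgresql/9.2/main

Corrections welcome if I've got any of that wrong.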

~~~
sync
Do play, but also consider not sharding at all, e.g.
[http://37signals.com/svn/posts/1509-mr-moore-gets-to-punt-
on...](http://37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on-sharding)

------
cagenut
It's not scaling if it's not chock full of numbers (and maybe even graphs).
There's not one mention of requests per second or latency in this, nor of
users/capacity per shard, bottlenecks they ran into, etc.

~~~
peterwwillis
Sarcasm/trolling aside, it's not four years of scaling because they didn't
need more than one server until late last year.

Also, people, _please_ stop trying to do "automatic failover". There is no
such thing, in the total sense. You have to have humans checking to make sure
everything is OK, and lots and lots of tripwires to prevent one of a multitude
of catastrophic problems from occurring. Just automate the pieces in totally
reliable and non-harmful ways and let a human oversee the cut-over.

~~~
pgr0ss
For the first few years, we scaled vertically rather than horizontally. We
still went through many rounds of scaling our single server, changing both
hardware and software configs.

We were reluctant to do automatic failover as well, since we all hear horror
stories about automatic failover causing more problems than it solves.
However, we changed our minds this summer for a couple of reasons. One, our
traffic continues to grow and people rely on us for payments. If our database
server crashes in the middle of the night, we don't want our customers to have
to wait for us to do a manual failover. Two, we felt a lot more comfortable
with Pacemaker and PostgreSQL streaming replication failover than we ever did
with DRBD. We did extensive testing and tweaking, and we believe we understand
the failure scenarios.
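
Part of that testing was repeatedly checking the replication state on both
sides. Roughly the kind of sanity checks we script (a sketch, not our actual
code):

    -- on the master: is the synchronous standby attached and streaming?
    SELECT application_name, state, sync_state
    FROM pg_stat_replication;

    -- on each standby: returns true until the standby is promoted
    SELECT pg_is_in_recovery();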

------
falcolas
Yeah, DRBD is terrible for write-heavy performance. It was probably the source
of many of your MySQL issues as well, but it's good that you've found a new
home regardless.

------
JoelJacobson
To the author: for how long could you have avoided the need to shard if you
had off-loaded all the read-only queries to multiple hot-standby slaves
instead? Your scaling story is almost identical to ours, same pg versions in
the same years, DRBD, etc. But we haven't split the database into multiple
shards yet. I'm hoping to postpone it for at least a year by off-loading
read-only queries to a lot of synchronous slaves.
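
For clarity, by "synchronous slaves" I mean the built-in synchronous
replication, roughly this on the master (names are illustrative):

    # postgresql.conf on the master; with 9.1/9.2 semantics the
    # first listed standby that is connected acts as the
    # synchronous one, the rest are potential replacements
    synchronous_standby_names = 'standby1, standby2'

    # hot_standby = on in each slave's postgresql.conf makes it
    # queryable, so read-only traffic can be pointed at the slaves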

~~~
frankwiles
I would imagine, given their product, that they have a very write-heavy
workload and that the read load, while non-zero, isn't a major concern. In
typical web app environments, however, you can definitely increase your "time
to sharding" with read slaves and increased caching.

------
hahainternet
One thing that isn't mentioned, and which we are currently facing problems
with, is the timeline change that occurs on failover. The code isn't yet
merged to push the timeline update over the stream, so all slaves must read
from the shared archive.

Maybe I've implemented things incorrectly, but this is a frustrating issue for
the moment.
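
Concretely, what we do today is point every slave at the shared archive and
let recovery chase the newest timeline, roughly (paths made up):

    # recovery.conf on each slave
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432 user=repl'
    # the timeline switch (and its .history file) can only
    # arrive via the archive until streaming supports it
    restore_command = 'cp /shared/archive/%f %p'
    recovery_target_timeline = 'latest'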

~~~
pgr0ss
To be safe, we always rebuild the old master after we fail over to a
synchronous standby server. We have Capistrano tasks to automate it, so it's
not too bad. The async servers merely follow the new server after the IP
moves, so we don't need to do anything there.

Edit: We also don't increment the timeline on failover. Instead, we stop
PostgreSQL, remove the recovery.conf, and restart.
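
The rebuild itself boils down to something like this (a simplified sketch of
what the Capistrano tasks do; host and paths are illustrative):

    # on the old master, once it has been fenced off
    pg_ctl stop -D /var/lib/postgresql/9.2/main -m fast
    rm -rf /var/lib/postgresql/9.2/main
    # re-seed from the newly promoted master
    pg_basebackup -h new-master -U repl -D /var/lib/postgresql/9.2/main -P
    # drop in a recovery.conf pointing at the new master, then start
    pg_ctl start -D /var/lib/postgresql/9.2/main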

~~~
wiredfool
I've got a 3-level setup, with a primary/secondary pair that are big machines,
and a tertiary that's enough for an emergency but not much more. They're all
running as hot spares.

I didn't feel safe blatting the new primary back to the secondary till we had
the third level in there, since once you've failed over, you're at the mercy
of that one machine, and your latest backup is on the one you're overwriting.

The third-level one will follow the change in the recovery timeline if it's
incremented in recovery.conf and you make sure it gets the appropriate
history file.

------
jkahn
Thanks for the article. What do you use to measure your PostgreSQL
performance? Do you use any visual dashboards to track down bottlenecks, etc?

~~~
fdr
If you are lucky enough to be able to use Postgres 9.2, consider running
pg_stat_statements _all the time_. It is one of my favorite pieces of work: it
canonicalizes queries and can tell you a number of useful statistics about
them.

One of the advantages is that it can help you identify queries that are very
short but frequent (and could use a cache, or whatnot) and would fly below the
radar of most log aggregations, because most people turn off logging for very
short statements. See also: identifying n+1 query pathologies.

It really blows away log analysis and pgFouine for all but the most
heavyweight tasks, and its approach is a lot more sound than log analysis
tools because it uses the semantic representation -- not the syntactic one --
of the query submitted.
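
If you haven't tried it, getting it going on 9.2 is just a
shared_preload_libraries entry plus the extension; the interesting part is
then a query along these lines (column names per 9.2):

    -- postgresql.conf: shared_preload_libraries = 'pg_stat_statements'
    CREATE EXTENSION pg_stat_statements;

    -- the short-but-frequent queries that log analysis misses
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 10;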

~~~
JoelJacobson
If you need even more fine-grained statistics, check out
<https://github.com/johto/call_graph>

It gives you statistics per top-level function, which can reveal abnormalities
that are otherwise impossible to detect, such as a function that is usually
very fast but very slow for a small fraction of its calls. Consider
fa()->fx() and fb()->fx(): if fa()->fx() is always fast and accounts for a
million calls, while fb()->fx() is very slow but accounts for only a few,
then fx() will look like a very fast function with no performance problems if
you only look at pg_catalog.pg_stat_user_functions. call_graph, however, will
reveal that fx() is slow when called from the top-level function fb().

This is only relevant for systems that always access the database through
stored procedures, though.
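
For comparison, the per-function view I'm contrasting it with is just this
(requires track_functions to be turned on):

    -- postgresql.conf: track_functions = 'pl'  (or 'all')
    SELECT funcname, calls, total_time, self_time
    FROM pg_stat_user_functions
    ORDER BY total_time DESC;

    -- note this aggregates over all callers, which is exactly
    -- why fx() above looks fine here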

------
trung_pham
Why do you put audit logs in PostgreSQL? Wouldn't Riak be better storage?

~~~
pgr0ss
We have a few different types of auditing information. At that point in the
timeline, it was all in PostgreSQL. Today, we keep only the critical auditing
information in the main database.

~~~
raverbashing
Anything special about audit tables? Different machine? (maybe this is
sensitive info)

