
Scaling our infrastructure to multiple data centers - ChrisArchitect
http://engineering.instagram.com/posts/548723638608102/
======
samstave
I love writeups like this. It's great to see that companies like Facebook
and Instagram have the same problems you experience, just at a much more
massive scale - the premise of the problem, and largely the stack
involved, can be very similar if not identical.

The part about read replicas falling way behind... had that happen last year.
One got more than 700 hours behind, so we just killed that guy, built a new
one, and kept replication more timely...

With that said, a great next step for such write-ups would be slightly more
technical detail on how X was accomplished.

"We ran a second streamer as soon as the box came up" --> a link to a howto
with bulleted steps on doing this would be awesome...

I know that's asking a bit much of any regular eng team, but it would be nice
to have so that people who face the same issues, with only a couple of
engineers in a small environment, can actually learn from and implement
best practices found by at-scale orgs.

Thanks for the post though! -- The VPC migration post from back when set me up
with a lot of VPC info I needed when I also went to VPC-->HQ-DC with
overlapping CIDRs.

------
dberg
I would love to understand more about how they deal with PostgreSQL writes
across a 60ms cross-country connection. It sounds like both data centers write
to PostgreSQL in one data center, and 60ms is not insignificant.

------
jon-wood
I particularly enjoyed this snippet: "Facebook regularly tests its data
centers by shutting them down during peak hours."

That's having faith in your disaster recovery plans.

------
Florin_Andrei
> _Both PostgreSQL and Cassandra have mature replication frameworks_

Speaking of Postgres - master/slave replication is fairly easy and seems to
work well. But trying to implement master/master replication is painful. There
are some external solutions like Bucardo, which is basically a Perl-based
service that handles replication outside of PG (yikes!); there's BDR which is
integrated with the main PG daemon, but it's not production ready. And this is
only asynchronous replication.

Any suggestions for sane, reliable master/master replication with PG that
simply works?

~~~
fancy_pantser
Here's a table:
[https://wiki.postgresql.org/wiki/Replication,_Clustering,_an...](https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling#Comparison_matrix)

It's worth noting that BDR is much more mature than the table indicates and is
probably the easiest to set up today (for up to 48 masters in a pool).

~~~
Florin_Andrei
I did a test with a data backup from our live DB, and BDR fails in mysterious
ways. I had a two-node cluster and tried to restore the DB into one node. It
failed, I think, while rebuilding indexes (no useful messages in the logs),
with some scary warnings while importing some big tables. Most of the data
made it across the network to the other node - emphasis on _most_.

I _really_ wanted this thing to work. :( On paper it looks great.

------
cagenut
if you like this stuff you'll probably love this video:
[https://www.youtube.com/watch?v=HfO_6bKsy_g](https://www.youtube.com/watch?v=HfO_6bKsy_g)

~~~
nodesocket
Really clever that Fastly used rsyslog at first to send purge notifications to
nodes. Start simple, then iterate.

