
Migrating From AWS to FB - dctrwatson
http://instagram-engineering.tumblr.com/post/89992572022/migrating-aws-fb
======
agwa
> The main blocker to this easy migration was that Facebook’s private IP space
> conflicts with that of EC2

IPv6 adoption could not happen soon enough.

~~~
cheald
I think it has more to do with the fact that there is a standard series of
private IP blocks.

~~~
rdl
Right, but if everyone used IPv6 there would be no need to use non-routable
private IPs for anything; you could just use non-conflicting IPv6 addresses
and not route them.
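
As a rough illustration of picking a non-conflicting IPv6 prefix, here is a minimal Python sketch that draws a random /48 from the fd00::/8 unique local range. This is only one assumed reading of "non-conflicting": RFC 4193 describes a hash-based way to derive the 40-bit global ID, and `secrets.randbits` is swapped in here for brevity. The function name `random_ula_prefix` is made up for this example.

```python
import ipaddress
import secrets


def random_ula_prefix() -> ipaddress.IPv6Network:
    """Return a random /48 inside fd00::/8 (locally assigned ULA space)."""
    # 40 random bits make a collision between two independent sites unlikely.
    global_id = secrets.randbits(40)
    # Layout: 8 bits of 0xfd, 40 bits of global ID, then 80 zero bits.
    prefix_int = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((prefix_int, 48))


if __name__ == "__main__":
    print(random_ula_prefix())
```

Each /48 leaves 16 bits for subnets, so one draw covers a whole site.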

~~~
sliverstorm
I kind of like having standard private subnets. My router is always
192.168.1.1, or sometimes 10.0.1.1, and so is my friend's, my parents', and my
grandparents'.

~~~
jon-wood
After spending some time as a contractor doing systems work for a few
companies I've stopped ever assigning an internal network to 192.168.0.0,
192.168.1.0, or 10.0.0.0.

The number of times I found myself attempting to VPN into a client's network,
only to find it conflicted either with my home network, or whatever coffee
shop I was sitting in, was ridiculous. Depending on how many hosts you need to
run on your network there are huge numbers of possible subnets you could use
for an internal network - do yourself a favour and keep off the ones set up by
default on every router sold.

~~~
stygiansonic
This is also important advice if you're thinking about setting up VPN access
to your home network: Do not pick the most common/default subnets, e.g.
192.168.0.0/24, 192.168.1.0/24, etc. Picking a somewhat-random subnet as
suggested would mitigate the problem and it's what I did for my home network.
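
A quick way to sanity-check a candidate subnet is to test it for overlap against the usual router defaults. The sketch below uses Python's standard `ipaddress` module. The `COMMON_DEFAULTS` list and the `conflicts` helper are illustrative assumptions based on the subnets named in this thread, not an exhaustive survey.

```python
import ipaddress

# Subnets mentioned in this thread as typical router defaults
# (an illustrative, non-exhaustive list).
COMMON_DEFAULTS = [
    ipaddress.ip_network("192.168.0.0/24"),
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("10.0.1.0/24"),
]


def conflicts(candidate: str) -> list:
    """Return the common default subnets that overlap with `candidate`."""
    net = ipaddress.ip_network(candidate)
    return [default for default in COMMON_DEFAULTS if net.overlaps(default)]


if __name__ == "__main__":
    for candidate in ("192.168.1.0/24", "10.0.0.0/8", "172.22.37.0/24"):
        clashes = conflicts(candidate)
        status = "clashes with " + ", ".join(map(str, clashes)) if clashes else "looks clear"
        print(f"{candidate}: {status}")
```

Note that a wide block such as 10.0.0.0/8 overlaps every narrower 10.x default, so a small, oddly numbered subnet is the safer pick.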

------
NathanKP
I wonder why Instagram wasn't using VPC in the first place. I've been using
AWS for a startup for a few years now and I had our instances running in VPC
from about the second month onward.

It's been one of the best architecture decisions I've ever made. At this point
we only use one public IP address. (If direct access to a machine is needed
then you can connect via VPN running on the one bastion host with the public
IP address, and this gives your machine access to the local IP addresses of
instances running inside the VPC.)

All the machines in our cluster are protected inside local VPC address space.
The outside world's only access is through ELB, which exposes public service
endpoints like the API and website. I can't think of any good reason why you
wouldn't be using VPC in the first place. Having public IP addresses for
private machines sounds like a recipe for disaster if you ever accidentally
miss a port in your security rules.

~~~
mikeyk
Mike from IG here. VPC was barely a thing when we got on AWS (2010) and at the
time not the default. I would definitely have done VPC from day 1 in
hindsight, though.

~~~
blackaspen
Hindsight is 20/20.

I think you guys did an exceptional job of tackling a really difficult problem
(I've been in the same position, migrating EC2 to Datacenters). We
determined that EC2 -> VPC -> Datacenters is really the only way, and Neti
solves it surprisingly well.

Going forward, hope that acquired companies opened their AWS accounts late
enough that Amazon forced them to use VPC.

~~~
themartorana
We're small, comparatively - 20-30 servers max - and we need to get into VPC
for a new cluster that requires static internal IPs. (Reboot an EC2 Classic
instance and you may get a different 10.x address.)

In any case, the migration is daunting even at our size, although our devops
team size is 1. I do wish they had VPC when we started.

~~~
rb2k_
You could also just attach EIPs and use those, right?

~~~
themartorana
In an incredibly late reply - EIPs are public-facing; I need internal IPs for
the fastest possible LAN routing.

------
tomphoolery
Has Facebook ever been public about the tools they use for deploying new
machines onto bare metal with Chef? My company faces similar problems, albeit
at a much smaller scale, but still...I'm wondering what they have in place of
a tool such as [http://theforeman.org](http://theforeman.org) (which is very
coupled to Puppet).

~~~
rhoml
I think they just said this
[https://www.youtube.com/watch?v=SYZ2GzYAw_Q](https://www.youtube.com/watch?v=SYZ2GzYAw_Q)

------
ch
I think this is the biggest takeaway from the article:

Plan to change just the bare minimum needed to support the new environment,
and avoid the temptation of “while we’re here.”

Good engineering is knowing how to act with surgical precision when necessary.
This is what allows a craft like programming to operate in the confines of a
business.

------
blakesmith
It's usually the stateful stuff that proves challenging in big datacenter
moves, but I don't see any mention of data copying, replication, or moving.
How did you guys tackle the problems of keeping data in sync and doing a clean
cutover?

------
meritt
Is neti open-source?

~~~
kawsper
Managing iptables across datacenters and nodes would be a fun project to do
with something like Serf ([http://www.serfdom.io/](http://www.serfdom.io/))

------
0x006A
why not just create a vpn between the nodes with another private IP space and
send your data through that?

~~~
toomuchtodo
"This task looked incredibly daunting on the face of it; we were running many
thousands of instances in EC2, with new ones spinning up every day. In order
to minimize downtime and operational complexity, it was essential that
instances running in both EC2 and VPC seemed as if they were part of the same
network. AWS does not provide a way of sharing security groups nor bridging
private EC2 and VPC networks. The only way to communicate between the two
private networks is to use the public address space."

That is essentially what Neti does, except instead of static mappings, it's
dynamic and software configurable (which is pretty much the only way to go
when your entire environment is virtual and the underlying network equipment
is out of your control).
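
To make the "dynamic mappings" idea concrete, here is a minimal Python sketch that turns a table of stable addresses and public addresses into iptables DNAT commands. It is not Neti's actual implementation, which has not been published in this thread. The `render_nat_rules` helper, the mapping format, and the example addresses are all assumptions, and a real daemon would presumably refresh the table from a coordination service instead of hard-coding it.

```python
import ipaddress


def render_nat_rules(mapping):
    """Render one iptables DNAT command per stable -> public address pair.

    `mapping` is a hypothetical dict of {stable_address: public_address}.
    """
    rules = []
    for stable, public in sorted(mapping.items()):
        # Parse both sides so a malformed address raises ValueError
        # before any rule text is produced.
        stable_ip = ipaddress.ip_address(stable)
        public_ip = ipaddress.ip_address(public)
        rules.append(
            f"iptables -t nat -A OUTPUT -d {stable_ip} "
            f"-j DNAT --to-destination {public_ip}"
        )
    return rules


if __name__ == "__main__":
    # Placeholder addresses: 203.0.113.0/24 is reserved for documentation.
    example = {
        "10.200.0.5": "203.0.113.10",
        "10.200.0.6": "203.0.113.11",
    }
    for rule in render_nat_rules(example):
        print(rule)
```

Regenerating the rule list whenever the table changes is what would make such a scheme software configurable, as opposed to a fixed set of hand-written mappings.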

~~~
0x006A
Using a VPN would still be an option. Why write essentially your own VPN
(neti) instead of using an existing VPN solution? VPC is not the only VPN you
can use on EC2.

~~~
toomuchtodo
I believe Neti was a better solution at their scale (thousands of VMs, a
dynamic production environment, etc).

------
shiftb
I got an error page inside the Instagram mobile app a few days ago and was
surprised to see Facebook server chrome around the error message.

I'm impressed by how fast they got this migration done, considering the
massive scale they operate at.

------
SushiMon
I'm wondering if they got nailed by out-migration charges and how much that
was. I assume a bunch of their images were in S3. Amazon charges a pretty
penny to take things out.

~~~
ceejayoz
I'm confused by this. S3 GET requests are the cheapest request type, and
getting the images out would just cost you the bandwidth involved.

Maybe you're mixing things up with Glacier?

~~~
SushiMon
GET requests are cheap, but I was thinking of bandwidth costs to get things out
of S3 entirely and do a complete outmigration. But the prices there have come
down quite a bit since I last checked.

~~~
ceejayoz
Bandwidth costs of a one-off transfer out would be a lot less than they were
already paying to serve those images out of S3 to the public.

------
general_failure
Any reason not to choose docker over lxc? Is it because fb data centers are
already lxc friendly?

~~~
nbm
The existing Facebook deployment system supports running deployments within an
LXC (and setting up cgroups, &c.) and was written well before docker was
available.

Some background:

* [http://www.slideshare.net/dotCloud/tupperware-containerized-...](http://www.slideshare.net/dotCloud/tupperware-containerized-deployment-at-facebook)

------
yeukhon
Contributors? What contributors? People from within the company or open source
contributors?

------
mkfifo
and now Instagram can share ALL of its data with the US gov too.

------
bmetz
And those "numerous integration points" are?

~~~
mikeyk
Mike from IG here. Some early wins are integrations with spam fighting
systems, logging infrastructure, and FB's Hive infrastructure.

------
EGreg
These engineering feats are truly impressive and worth writing about.

And yet every time I read about this kind of stuff I think, how glad am I that
we are building a DISTRIBUTED social network and will never have to solve
problems on this massive scale! We won't have to move millions of other
people's photos here or there if everything is distributed from day 1. People
will be able to move their own stuff easily wherever they want.

------
maceip
"Facebook’s private IP space conflicts with that of EC2"

^ That wouldn't've happened in GCE (i.e., they should have been acquired by
Google).

~~~
cheeseprocedure
Are you sure? This document suggests GCE instances use the 10.x.x.x address
space (just as AWS instances in EC2 Classic do):

[https://developers.google.com/compute/docs/instances-and-net...](https://developers.google.com/compute/docs/instances-and-network)

~~~
maceip
""" Although Compute Engine doesn't allow creating an instance with a user-
defined local IP address, you can use a combination of routes and an
instance's ‑‑can_ip_forward ability to add local IP address as a network
static address which then maps to your desired virtual machine instance.

For example, if you want to assign 10.1.1.1 specifically as a network address
to a virtual machine instance, you can create a static network route that
sends traffic from 10.1.1.1 to your instance, even if the instance's network
address assigned by Compute Engine doesn't match your desired network address.
"""

Meaning they could have avoided conflicts using this mechanism.

~~~
cheeseprocedure
Maintaining thousands of forwarding/routing configs sounds just as nasty as
implementing Neti.

At any rate, Instagram's been around since 2010 and GCE didn't exist until
June 2012 (and wasn't generally available until this past December).

------
cooltrance
It is amazing how a post like this could reach the front page of Hacker News
just because it comes from Instagram rather than for its technical relevance.

They mentioned Neti but didn't dig into details other than "a dynamic iptables
manipulation daemon, written in Python, and backed by ZooKeeper." and they
mentioned the ip blocker which is an issue on almost every migration.

Also, taking into consideration that they haven't written a post in the past 10
months, I am sure that they can do better.

------
aristus
Called it, five years ago:
[http://www.web2expo.com/webexny2009/public/schedule/detail/9...](http://www.web2expo.com/webexny2009/public/schedule/detail/9537)

Run in multiple clouds from day one. Take the pain. It gives you flexibility.
Basic vendor management 101.

~~~
serge2k
This article mentions nothing about latency.

~~~
sumbry
We're undertaking a similar project and the latency is almost negligible. In
fact the latency is lower bridging between classic and vpc in the same
availability zone than between two classic availability zones.

~~~
ceejayoz
That's not multiple clouds, though.

