
How we spent Friday night coming back online before Instagram and others - jc4p
http://blog.fitocracy.com/post/26245878403/getting-fitocracy-back-online
======
jasonkester
Nice job staying on top of things, but looking in from the outside it does
seem a bit wasteful to spend, what, 100 man-hours of effort and expense,
canceling a handful of otherwise happy Friday nights, just to gain a couple
hours of uptime during a period when roughly nobody is using your thing.

I had a site that was affected by another one of Amazon's outages a while
back, and here was my disaster recovery plan in its entirety:

      a. go to sleep.

There's a reason you farm things like this out to Amazon in the first place.
They have a big team of smart people whose only job in life is to keep your
stuff alive, or scramble like mad to bring it back up if it goes down.

So long as your site knows how to start up automatically when the box turns
on, there's really not a lot you need to do in a situation like this.

If forty percent of the internet is down, and you're part of it, your users
will probably understand. They'll expect you to come back up when the rest of
the internet does. If you do manage to come up a bit earlier, you might get a
shrug and a "cool", but it's probably not enough of a win to cancel Christmas.

------
saurik
An ALIAS record is just an internal-to-Route-53 mapping: "when people ask for
X, pretend they asked for Y instead". This is conceptually similar to a
server-side CNAME. The reason these don't have a TTL is that users never see
them.
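
For illustration, creating such a record with boto looks roughly like this (a
sketch; the zone IDs and hostnames are made-up placeholders, and note there is
no TTL to set on the alias itself):

    # Sketch: create a Route 53 ALIAS record pointing at an ELB using boto.
    # Zone IDs and hostnames are illustrative placeholders.
    import boto
    from boto.route53.record import ResourceRecordSets

    conn = boto.connect_route53()
    changes = ResourceRecordSets(conn, "YOUR_HOSTED_ZONE_ID")
    changes.add_change("CREATE", "www.example.com.", "A",
                       alias_hosted_zone_id="ELB_HOSTED_ZONE_ID",
                       alias_dns_name="my-elb-123456.us-east-1.elb.amazonaws.com.")
    changes.commit()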

The returned record, of course, has a TTL, and the ALIAS mechanism will not
alter the TTL of the aliased data, so in the case of an ELB you are talking
about a TTL of one hour. There are no magic bullets for the distributed cache
expiration problem.

(The other comments in this article about TTL seem quite confused, though, so
this explanation might not actually have helped. Even if you have a 20-year
TTL, you are going to see changes immediately from clients that do not have
the data cached anywhere on their path to the origin.)

(In particular, switching the ALIAS record out for an A record makes no
difference at all with regard to your TTL: if the user has the target of the
old mapping cached they will use it; otherwise they will get the new one. It
isn't really "getting away with" that behavior, and it isn't due to Amazon's
DNS being special.)
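
If you want to see what TTL a resolver is actually handing back for a name,
here is a quick sketch with the dnspython library (the hostname is just an
example):

    # Sketch: inspect the TTL a resolver returns for a record (dnspython 1.x).
    # The hostname is an example, not anything from this thread.
    import dns.resolver

    answer = dns.resolver.query("www.example.com", "A")
    print("TTL: %d" % answer.rrset.ttl)
    for record in answer:
        print("A record: %s" % record.address)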

~~~
diafygi
Agreed. At the time we were quite confused about the intricacies of TTL, and
we still are. It sucks that there's not really a good way to address it. Maybe
just keep the TTLs at 300 always?

~~~
elithrar
Keep in mind that not all DNS caches (particularly those at large ISPs)
respect sub-3600s TTLs.

(they do this in an effort to reduce load; whether it is truly effective on
today's hardware is arguable)

~~~
saurik
Some ISPs (in particular those in countries far from the US, such as in the
Middle East, though occasionally even in Europe) do not even honor the
hour-long TTL used by ELB (for reasons of latency, not load). So if you care
about your traffic not being routed to someone else's server, you should not
expose ELB to end-user requests. In my case, I use it to balance my backend
servers, but the only incoming connections it handles are from CDNetworks,
which I know has a to-specification implementation of DNS caching.

------
jetsnoc
I can't praise Chef by Opscode enough. We had several webservers running with
a fresh deployment of our web stack in about ten minutes. We've built several
recipes that install the prerequisites (nginx, php5-fpm, or Rails, depending
on the system role) and then use a GitHub deployment key to check out the
stable revision of our applications.

Even before you can afford a part-time DevOps engineer, I highly recommend
automating your system administration as much as possible. When you say you
should script your installation of an application server, I recommend doing
that with Chef!
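
Even a plain Fabric script captures the idea if Chef feels like too much at
first. A minimal sketch (the host, package list, repository, and deploy path
are all made up):

    # Sketch: scripted provisioning with Fabric 1.x. Host, package list,
    # repository URL, and deploy path are illustrative placeholders.
    from fabric.api import env, run, sudo

    env.hosts = ["web1.example.com"]

    def provision():
        # Install the web stack prerequisites.
        sudo("apt-get update -q")
        sudo("apt-get install -y nginx php5-fpm git")
        # Check out the stable revision; assumes a deploy key is on the box.
        run("git clone -b stable git@github.com:example/app.git /srv/app")
        sudo("service nginx restart")

Run it with `fab provision`; what Chef adds on top is declarative convergence,
so you can safely re-run against boxes that already exist.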

This will allow you to quickly bring instances online on just about any cloud:
HP, AWS, Linode, or your own self-brewed OpenStack cloud. You will still have
challenges with your persistent data, but it's a lot easier to breathe and act
quickly when you know you'll have application servers ready to serve that
data.

edit: I have no affiliation with Opscode. We aren't even a paying customer; we
use their free server. I am sure Puppet or any similar system administration
tool will get you similar mileage.

~~~
shimon_e
I use a similar system. With OVH you can set installation templates that
handle OS partitioning, set ssh keys, and then run a script, which for me is a
Puppet script that sets everything up and deploys the latest version of the
site.

OVH have an Android app, so I can scale to a new server at the push of a
button on my phone. :D

Plus they cost peanuts compared to Amazon.

I have failover in other data centres too, but OVH makes me the happiest.

------
diafygi
Daniel Roesler here. I feel really sorry for those who paid for Multi-AZ
support on their RDS instances, only to have all the availability zones go
down. That would make me rage quite a bit.

~~~
sofuture
The only bump in service we had was ~30s of downtime when our multi-AZ RDS cut
over to the failover instance. :) I think we lucked out a little bit...

~~~
jc4p
Were you in us-east-1? That sounds amazingly lucky. If only Amazon would tell
us how many AZs actually went down...

~~~
blagospot
Does 1 AZ == 1 datacenter? It seems like it's possible 5 distinct "AZs" could
have gone down while someone else escaped with all 4 of their AZs unscathed,
since Amazon takes care to talk about availability zones rather than
datacenters, and my AZs by definition aren't the same as your AZs.

~~~
diafygi
AZs are different data centers, but in the same region. I think that means the
buildings are close enough to have super-low-latency connections to each
other. However, when a huge storm rolls through, it can wipe out all the AZs
(which is basically what happened). AZs are mostly for guarding against
floods, fires, and other localized disasters, not regional disasters that
leave 2 million people without power.

~~~
robryan
As I understand it, the separate AZs can be in the same data center, but in
theory they are completely separated from each other in terms of network,
power, etc., so an outage in one shouldn't affect the others.

------
jacques_chester
tl;dr

We waited for stuff to come back up. Our relatively simple application
required far fewer servers than more complex services with millions of users.

We're awesome!

------
wahnfrieden
Hey Daniel, thanks for the educational writeup. I have to wonder about ways
around the AMI issue. We use Puppet to set up new instances (and keep existing
instances in sync, although we tend to just recycle EC2 instances anyway).
This is pretty nice to work with given its declarative nature, but we have to
put up with long, long startup/initialization times for new instances, which
sucks during downtime, of course.

Do you think there's some middle ground of using AMIs but also using Puppet
somehow, so you make new AMIs as a perf optimization but keep the Puppet
config up to date? TBH it's something I've only casually wondered about, but
maybe it's what we both need. Having a Puppet config would mean you can launch
on basically any provider.

~~~
diafygi
Yes, exactly. Launching pre-built AMIs is tons faster than building from
scratch each time you launch.

However, we're probably going to make a deployment script (we use Fabric) that
builds the AMI from scratch. Then, when we need to update the AMI, we just
update the Fabric script and run it to make the new AMI. That way, if we ever
need to make AMIs in a different region, we can just run the Fabric script in
that region.
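
Roughly, a sketch of what we have in mind with boto (the region, base AMI ID,
and key pair name are placeholders, and the Fabric provisioning step is
elided):

    # Sketch: bake an AMI with boto; the Fabric provisioning step is elided.
    # Region, base AMI ID, and key pair name are illustrative placeholders.
    import time
    import boto.ec2

    def build_ami(region="us-east-1", base_ami="ami-12345678"):
        conn = boto.ec2.connect_to_region(region)
        reservation = conn.run_instances(base_ami, key_name="deploy-key",
                                         instance_type="m1.small")
        instance = reservation.instances[0]
        while instance.state != "running":  # wait for the instance to boot
            time.sleep(10)
            instance.update()
        # ...run the Fabric provisioning tasks against instance.public_dns_name...
        return conn.create_image(instance.id, "web-stack-%d" % time.time(),
                                 description="baked web stack")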

~~~
mryan
I use a similar process to this in my infrastructure - I use Puppet to
configure instances, then a Fabric command to create new AMIs based on these
Puppet configurations.

This gives the best of both worlds: version-controlled configuration files and
an automated process for making new AMIs. Instances also boot quickly, as
nearly all of the configuration is baked into the image.

Using EC2 tags gives you even more room for automation here.
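
For example (a sketch with boto; the "role" tag and its value are just a
convention I made up), you can select instances by role and act on only those:

    # Sketch: use EC2 tags to drive automation with boto.
    # The "role" tag and its "webserver" value are made-up conventions.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    reservations = conn.get_all_instances(filters={"tag:role": "webserver"})
    for reservation in reservations:
        for instance in reservation.instances:
            # e.g., hand these hosts to Fabric, or rebake AMIs for this role
            print("%s %s" % (instance.id, instance.public_dns_name))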

~~~
wahnfrieden
Can you elaborate on how you use EC2 tags? The only thing we use them for is
to mark a new instance as bootstrapped once Puppet finishes.

It also sounds like you could pretty easily make your AMI creation step a job
in your CI software.

------
Jeema101
Did you guys first attempt to boot up in another us-east availability zone?
Not all of them were affected, contrary to what this post seems to imply. I
had a slave DB go down, but the rest of our deployment was unaffected (the
rest of the deployment is in a different AZ than the slave DB instance, but
all are in us-east).

------
ahmedaly
I highly recommend signing up for eCompuCloud:
<http://www.ecompucloud.com/>

It relies on several cloud computing providers to prevent outages like this
from happening.

The pricing is even lower than Amazon or any other computing provider! (We buy
larger clusters, which costs us less.)

------
paulsutter
The reason it took you 12 hours to get back online? Because you really never
gave much consideration to availability.

You should THINK about your TTL values long in advance of a problem. You
should also THINK about having a backup instance running (or at least ready to
boot, if an hour of downtime is perfectly OK for your users).

More discussion here: <http://news.ycombinator.com/item?id=4181918>

~~~
uptown
"But you didnt, and you should feel really embarrassed about that."

Sounds to me like they learned a lot, and are sharing what they've learned to
help others. What's really to gain from shaming them over it?

~~~
paulsutter
My summary of their post is "we basically have no idea how to offer a reliable
service and we know almost nothing about basic Internet protocols, and we feel
like heroes anyway".

I would be a lot more impressed if they were asking questions about how DNS
really works and what they should do to avoid problems in the future. That
would be cool.

~~~
aaronbrethorst
Perhaps you could blog about this topic in more depth. I'd be interested in
reading it.

------
taligent
Surprised I am the only one who found the title amusing.

The reason you got your site online before Instagram and the others is that
they have a lot of infrastructure and moving pieces as a result of being
extremely popular. Obviously Fitocracy doesn't share those characteristics.

That said, it is unacceptable for ANY site to go down simply because one data
center loses power.

