

A rough guide to keeping your website up through catastrophic events - fredsters_s
http://blog.rainforestapp.com/post/26217277790/a-rough-guide-to-keeping-your-website-up-through

======
birken
I think one factor people don't consider enough is the tradeoffs you need to
make in order to make your app have incredible reliability. This article and
other ones talk about a bunch of work you can do to help ensure your
application stays up during rare events. However, maybe for your particular
product, having 99.5% uptime and offering a ton of features is going to help
you become more successful than 99.9% uptime would.

When you are a Google or a Twitter or an Amazon, you lose lots of money per
minute of downtime, so economically speaking it makes sense for them to invest
in this. However, for an average startup, I don't think having a couple of
hours of downtime per month is actually going to be that big of a deal. Of
course you need to ensure your data is always safe, and your machine
configuration is easy to deploy (via AMIs or something like Puppet) so you have the
ability to get back up and running in catastrophic cases, but at the end of
the day having a good "We're currently having technical issues" screen could
very well be a better investment than shooting for a technical setup that can
still be up during catastrophic events.

~~~
ggwicz
Uptime is something users love regardless of the size of the company whose
product they're trying to use. Investing in uptime is always worth it.

~~~
pg
_Investing in uptime is always worth it._

That simply can't be true. There is always going to be a point where an extra
decimal place of reliability is too costly.

~~~
fredsters_s
There's always a trade-off between the cost of failing and the cost of
engineering it out. The problem comes from a lack of understanding about
where and how apps and infrastructure fail, and how to avoid it. If you
misunderstand the problem, you'll probably misjudge it.

~~~
bermanoid
People should absolutely at least be doing some back of the envelope math on
this before choosing a strategy.

If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users,
and some percentage of those users will become ex-users - you can run a small
split test to figure out how many if you don't already have data on that.
You'll also lose a direct half day of revenue. This type of thing will happen
somewhere between once a year and once every couple of months, as low and high
estimates.

Crunch those numbers, and you'll have an order of magnitude estimate of what
downtime actually costs you, and what you can actually afford to spend to
minimize it. Keep in mind that engineering and ops time costs quite a bit of
money, and that you'll be slowing down other feature development by wasting
time on HA.

For instance, let's say you're running a game with 1M DAU, and 5M total active
users, making $10k per day (not sure if that's reasonable, but let's pretend),
and you've figured out that 12h of downtime makes you lose approximately 10%
of the users that log in during that period. In that case, 12h of downtime
costs you a one-time "fee" of $5k, and also pushes away ~1% of your total
users, which will cost you $100 per day as an ongoing "cost".

If we assume this happens exactly once, and that a mitigation strategy would
work with 100% effectiveness, then you should be willing to spend up to $100
extra per day to implement that strategy; the $5k up-front loss is not
nothing, but we can probably assume it'll get eaten up by engineering time to
implement that strategy. If such a strategy would cost significantly more than
$100 per day over your current costs, then by pursuing it you're assuming that
"oh shit it's all gone to hell!" AWS events are likely to affect you multiple
times over the period in question.
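
Something like that back-of-the-envelope calc in a few lines of Python, with
the same made-up numbers plugged in (every input here is hypothetical, not
real data):

    # Rough downtime-cost estimate -- all inputs are hypothetical examples.
    dau = 1000000              # daily active users
    total_active = 5000000     # total active users
    revenue_per_day = 10000.0  # dollars
    outage_hours = 12
    churn_rate = 0.10          # fraction of affected users who leave for good

    affected_users = dau * (outage_hours / 24.0)        # ~500k see the outage
    users_lost = affected_users * churn_rate            # ~50k, i.e. ~1% of total
    revenue_per_user_day = revenue_per_day / total_active

    one_time_loss = revenue_per_day * (outage_hours / 24.0)   # ~$5k
    ongoing_loss_per_day = users_lost * revenue_per_user_day  # ~$100/day

    print("one-time loss: $%.0f" % one_time_loss)
    print("ongoing loss:  $%.0f/day" % ongoing_loss_per_day)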

I'm not saying these numbers are realistic in any way, or that the method I've
shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of
it); I'm just saying that whether you pursue a mitigation strategy or not,
it's not terribly difficult to ground your decision in numbers. They do tend
to be right on the edge of reasonable for a lot of people, so it's worth
thinking about them (good) or, better, measuring them.

------
WALoeIII
You can't just put nodes in different regions, even with a database like
MongoDB. It will work in theory; in practice you'll have all kinds of latency
problems.

WAN replication is a hard problem and glossing over it by waving your hands is
a disservice to readers.

"Real" solutions are to run a database that is tolerant of partitioning, and
have application-level code to resolve the inevitable conflicts. Riak,
Cassandra and other Dynamo-inspired projects offer this. On the other hand you
can use a more consistent store and hide the latency with write-through
caching (this is how Facebook does it with memcached + MySQL), but now you
have application code that deals with managing this cache.
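
For illustration, a minimal sketch of that write-through pattern in Python (the
cache and db clients are hypothetical stand-ins, not any particular library):

    # Write-through caching: every write hits the database and the cache
    # together, so most reads are served from the cache and the database's
    # latency (or a cross-region hop) stays off the read path.
    class WriteThroughStore(object):
        def __init__(self, cache, db):
            self.cache = cache  # e.g. a memcached client (hypothetical interface)
            self.db = db        # e.g. a MySQL wrapper (hypothetical interface)

        def put(self, key, value):
            self.db.write(key, value)   # durable store first
            self.cache.set(key, value)  # then keep the cache in sync

        def get(self, key):
            value = self.cache.get(key)
            if value is None:           # cache miss: fall back to the database
                value = self.db.read(key)
                if value is not None:
                    self.cache.set(key, value)
            return value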

Either way you have to have very specific application code to handle these
scenarios, and you may even run a combination of solutions for different types
of data you need to store. There is no silver bullet, there is no framework or
product that does it for you.

~~~
josegonzalez
I agree wholeheartedly. HA is hard; a single blog post won't even halfway
cover it.

~~~
bifrost
HA isn't actually hard, but it does require some forethought and choosing
some technologies which aren't necessarily cool. However, I would submit that
building HA capabilities into a cat photo sharing or microblogging site is
sort of overkill; most people don't need it. Just take the hit and move on -
people are getting more and more used to sites being down/failing and will
just retry later. As much as I hate to say it, I do think it's fairly accurate.

~~~
fredsters_s
You pose an interesting question that I want to see answered: do users of B2C
sites actually care when the site is down? Does it decrease MAU? Your
intuition is that it does not, but I'd love to see some data.

------
po
Reading all of these post-mortems and guides to keeping your servers up and
running, it strikes me how much AWS jargon is in there.

The fact that so many developers have invested so much time into learning
Amazon-specific technologies means that developers are left to deal with the
problem within that worldview. Going multiple-datacenter means learning two of
every technology layer.

You _could_ solve all of these problems using standard non-Amazon Unix tools,
technologies, and products; however, Amazon has enabled a whole class of
development that makes it easier to just work within their system. It's easier
to just wait for Amazon to figure it out for the general case and trust them
than to figure it out and implement it yourself.

There are other risks with being the lone wolf, but for a lot of people, being
in the herd has a certain kind of safety, despite the limitations.

Not making a judgement call on it, but it is something that I have noticed
with these outages.

~~~
fredsters_s
Agreed. Hopefully OpenStack or similar gets some serious traction and the IaaS
players move closer to simply being providers of hardware. The likelihood of
that happening seems quite slim while AWS holds such a strong position in the
market though. And a significant chunk of AWS's success is due to their
innovation in the software layer.

~~~
po
One could draw a lot of parallels between OpenStack/AWS and Android/iOS
development. It is in the best interest of an OpenStack provider to
differentiate their offering rather than compete on price through the common
platform. Just like how there are power users who want to configure, build and
have full control of their mobile device, there's a huge class of people that
just want a working stack from one provider. I consider Rackspace to be
relatively far along in implementing the OpenStack vision, and they're still a
long way behind the offerings AWS has. Google is in a similar position with
their new cloud initiative.

As with HTML5, I'd like to see all of the main datacenters implement full
stacks and then have OpenStack be a 'codify what works' project. We need 4 or
5 providers who have the depth and breadth of what AWS has.

------
rdl
<https://status.heroku.com/incidents/151>

I assumed this was a response to the recent hella-long outage:

"There are three major lessons about IaaS we've learned from this experience:

1) Spreading across multiple availability zones in single region does not
provide as much partitioning as we thought. Therefore, we'll be taking a hard
look at spreading to multiple regions. We've explored this option many times
in the past - not for availability reasons, but for customers wishing to have
their infrastructure more physically nearby for latency or legal reasons.
We've always chosen to prioritize it below other ways we could spend our time.
It's a big project, and it will inescapably require pushing more configuration
options out to users (for example, pointing your DNS at a router chosen by
geographic homing) and to add-on providers (latency-sensitive services will
need to run in all the regions we support, and find some way to propagate
region information between the app and the services). These are non-trivial
concerns, but now that we have such dramatic evidence of multi-region's impact
on availability, we'll be considering it a much higher priority.

2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS
services have grown much more stable, reliable, and performant over the four
years we've been using them. EBS, unfortunately, has not improved much, and in
fact has possibly gotten worse. Amazon employs some of the best infrastructure
engineers in the world: if they can't make it work, then probably no one can.
Block storage has physical locality that can't easily be transferred. That
makes it not a cloud-friendly technology. With this information in hand, we'll
be taking a hard look on how to reduce our dependence on EBS.

3) Continuous database backups for all. One reason why we were able to fix the
dedicated databases quicker has to do with the way that we do backups on them.
In the new Heroku PostgreSQL service, we have a continuous backup mechanism
that allows for automated recovery of databases. Once we were able to
provision new instances, we were able to take advantage of this to quickly
recover the dedicated databases that were down with EBS problems."

Then I checked the date. It's actually Heroku's response to their super-long
April 2011 outage. Yet, it appears the "we should go across Regions" lesson
wasn't learned.

------
mnutt
It seems like you can make a good tradeoff between Chef and AMIs by
rebuilding the AMIs nightly off a fully configured system, and then running
Chef when the machine comes up to make up the incremental difference.
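
A rough sketch of the nightly bake using boto3 (the instance ID and naming are
made up; the boot-time chef-client run is what closes the gap since the last
bake):

    # Bake a fresh AMI from an already-configured instance (hypothetical IDs).
    # Run from cron or a CI job; instances launched from it still run
    # chef-client at boot to pick up anything that changed since the bake.
    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.create_image(
        InstanceId="i-0123456789abcdef0",  # the "golden" configured box
        Name="app-base-" + datetime.date.today().isoformat(),
        NoReboot=True,                     # don't bounce the source instance
    )
    print("baked " + resp["ImageId"])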

~~~
fredsters_s
Yep! You could also build the AMIs using a build / CI server each time you
push if you wanted. Once you've nailed the process and are used to how long it
takes, you'll love it.

------
bifrost
There's this really great thing called a CDN that can be used to keep your
"Web Site" up at all times, even if your source servers are down. It doesn't
help your web app, but it's better than looking like you've disappeared from
the planet.

~~~
ukd1
The problem with just using a simple CDN (i.e. one without proper ESI support,
like CloudFront) is that dynamic content is cached at a fixed TTL per URI,
which for 'apps' is likely to be very short. This means your content will
expire and then be retried from the origin...which would be down.

However for static or near static sites, it's perfect - just make sure your
TTL is correct :-)

~~~
aaronblohowiak
>This means your content will expire and then be retried from the
origin...which would be down.

I don't think ESI is necessarily a requirement; you can configure your CDN to
serve stale content in case the backend is unreachable.

~~~
bifrost
That's correct - oftentimes people don't even set page cache times on their
dynamic content.
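
For what it's worth, setting those cache times is a one-liner per response.
A minimal Flask-style sketch (the app and route are hypothetical);
stale-if-error is the RFC 5861 directive that caches supporting it can use to
keep serving a stale copy while the origin is down:

    # Hypothetical Flask app setting explicit cache headers on dynamic content.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/")
    def index():
        resp = make_response("hello")
        # Cache for 60s at the edge; if the origin is unreachable, caches that
        # support RFC 5861 may serve the stale copy for up to a day.
        resp.headers["Cache-Control"] = "public, max-age=60, stale-if-error=86400"
        return resp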

~~~
fredsters_s
I think it's because most web developers don't ever have to understand how
this stuff works. It's sometimes easier to just throw hardware at a problem
than engineer it better. However, understanding what you can do is key imho.

------
ccaum
You can use Puppet to pre-bake your AMIs for you, so you can scale very rapidly
and still use configuration management to maintain your instances:

<http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/>

------
tillk
This blog post is hilarious.

A region in AWS-speak is already multiple supposedly independent data centers
(in AWS terms, AZs - availability zones).

So if an entire region fails, that's four or so data centers which all go down
at the same time.

So how many companies on bare metal have four data centers and experience this
kind of catastrophic downtimes? Add to that, how many of these companies
operate completely in the dark about which data center is actually which?

These blog posts are annoying because it seems like these people have never
done anything they suggest themselves.

Yes, the cloud lets you set up a fully configured instance within minutes. But
at what expense? Mostly a lack of transparency about what the entire stack
looks like and what is going on.

Food for thought.

~~~
fredsters_s
That's just it - they're supposedly separate - but each zone is on the same
physical site, albeit separated. Unlikely to break at the same time, but more
likely than two separate locations. Just google for AWS multi-zone outages.

Anyway - the article isn't suggesting to use four datacentres, it's trying to
make people aware of the simple steps that can be taken to avoid failure.

~~~
tillk
OK, to be more obvious: by comparison, a multi-region setup is maybe over the
top, because essentially you are then distributing over something like eight
data centers.

Because you cannot stay in a single AZ either – who knows what will go down
and when.

It's insane how this post is so high up on this website. But it goes to show
that neither you nor the author of this blog post has ever attempted any
replication over a WAN. Otherwise it would not be called simple.

------
ukd1
Love AWS. We're using Heroku and it's been pretty painful over the last month
or so. However, it's super easy. At some point they should have an SLA, as
their underlying hosting (aka AWS) provides one with credits for outages. The
number one feature for Heroku that would help is being able to specify
multiple zones or regions when making new apps.

~~~
flyt
The failure of a single AZ in a single region does not trigger AWS' SLA.

------
talonx
This article takes too simplistic a view of real-world deployments on AWS, and
attempts to sum it up with 5 bullet points. Yes, I know the title is a "rough"
guide, but why not go into more depth and acknowledge that there's more
diversity in terms of deployment models out there? The other option would have
been to keep it very high level and not talk about specific tools.

E.g. Use Route 53? Isn't that hosted on Amazon itself? Why create another
point of failure?

MongoDB - How many big sites on the cloud use it as their primary database?

The only takeaway for me was the last paragraph - "In conclusion, you probably
used a single zone because it’s easy (hey - so do we for now!). There will
come a point where the pain of getting shouted at by your boss, client or
customers outweighs learning how to get your app setup properly yourself."

------
hoop
Am I the only one who feels like the author missed the mark?

The blog post is supposed to be about "keeping your website up through
catastrophic events" and the dominant theme seems to be "invest _more_ heavily
in Amazon." IMHO, the exact opposite needs to happen.

Sure, I understand that being in multiple regions means you supposedly have
very autonomous deployments (including Amazon's API endpoints), but nobody
can prove to us that each zone or even each region is totally separate.

I'm not saying that Amazon is being dishonest about their engineering - I
simply believe that by being 100% reliant on a single vendor you fail to
mitigate any systemic risk that is present. That risk can be technical risk or
business risk, as engineering at this level isn't strictly a technical
profession.

------
mitchellh
Regarding doing un-encrypted cross-datacenter replication with MongoDB, I
recommend the author of this blog post read this:
<http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing>

~~~
fredsters_s
SSL support is available (<http://docs.mongodb.org/manual/administration/ssl/>)
and latency can be an issue - most drivers support routing read queries to the
lowest-latency slave.
Again, obviously this is not for everyone. It's _always_ a trade-off. Hence
the title: _rough_ guide. The point of this post was to increase awareness.
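
For example, with pymongo something along these lines (the hosts are
placeholders; ssl and readPreference are standard connection-string options):

    # Hypothetical replica set spanning two regions: reads are routed to the
    # lowest-latency member and traffic is encrypted in flight.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://db-us-east-1.example.com,db-us-west-2.example.com/"
        "?replicaSet=rs0&ssl=true&readPreference=nearest"
    )
    doc = client.mydb.mycollection.find_one({"user_id": 42})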

------
ehsanu1
The article mentions using custom origins with CloudFront, and I don't
understand why setting up origin.mydomain.com was required. At work, we use
mydomain.com directly as the custom origin, and setup was super simple (just
tell CloudFront the domain). Is there anything wrong with doing it this way?

~~~
joliss
No, there's nothing wrong with that. What the author is doing is serving the
entire domain (i.e. www.mydomain.com, not just JS+CSS+images) off of
CloudFront, so you obviously need a separate host name for the actual origin
server. Hence origin.mydomain.com.

(Note that you need to use a www subdomain for this, or CNAMEs don't work.)

------
fboule
What about enhancing the autoscaling, Automated Configuration & Deployment,
and/or AMIs part of the article with virtualization (ESXi with or without
vMotion)? No need for configuration and deployment - just duplicate your VM to
have it on the other site, or let vMotion move it for you.

------
timothy2012
Okay, so before all that, I guess the first crucial step is to have the
site/system monitored by some third party such as Monitive or Pingdom. Then
you can take action based on information (facts).
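
If you'd rather roll your own from a box outside AWS, the check itself is
trivial - something like this (URL and threshold are hypothetical):

    # Dead-simple external uptime probe; run it from outside your own infra.
    import requests

    def site_is_up(url="https://www.example.com/", timeout=10):
        try:
            resp = requests.get(url, timeout=timeout)
            return resp.status_code < 500
        except requests.RequestException:
            return False

    if __name__ == "__main__":
        print("UP" if site_is_up() else "DOWN")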

Tim.

------
fredsters_s
So before anyone else points this out - yes our app is currently hosted in a
single zone, and no we do not plan on keeping it this way! (we're currently in
early Alpha)

------
mbs348
sounds hard

~~~
ukd1
It's not as simple as using Heroku, but it's not that hard. If you don't
deploy very often, using baked AMIs is the easiest solution. Automating your
build / deployment process to do this is worth the effort in the longer run,
especially if you want to scale up and down fast.

