
Amazon EC2 outage lessons - zerop
http://www.agilesysadmin.net/ec2-outage-lessons
======
sausagefeet
If you haven't, you should read James Hamilton's paper "On Designing and
Deploying Internet-Scale Services".

http://static.usenix.org/event/lisa07/tech/full_papers/hamilton/hamilton_html/

~~~
peterwwillis
This is a fantastic guide.

The 'big red button' is important when handling peak traffic, but it needs to
be expanded into 'little red buttons' as well. You might use the fail whale,
or split your site's features across different URIs and disable/fail-whale
them individually, or be able to dynamically increase logging (another great
point in the guide) to find high-latency pieces of code and big-red-button
them.
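
A minimal sketch of those 'little red buttons' as runtime kill switches (all
names here are illustrative, not from any real framework):

```python
# Per-feature "little red buttons": each site feature can be disabled at
# runtime without touching the rest of the site. All names are illustrative.

class FeatureSwitches:
    """Runtime kill switches, one per site feature."""

    def __init__(self, features):
        self._enabled = {name: True for name in features}

    def trip(self, name):
        """Hit the little red button for one feature."""
        self._enabled[name] = False

    def restore(self, name):
        self._enabled[name] = True

    def is_enabled(self, name):
        return self._enabled.get(name, False)


switches = FeatureSwitches(["search", "comments", "uploads"])

def handle_request(uri):
    # Map the URI prefix to a feature and fail-whale it if tripped.
    feature = uri.strip("/").split("/")[0]
    if not switches.is_enabled(feature):
        return "503 fail whale"
    return "200 OK"
```

The point is that tripping "search" fail-whales only /search/... URIs while
the rest of the site keeps serving.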

Perhaps you find one host is being slammed while others are relatively idle;
being able to gracefully stop incoming requests to the slammed host and allow
them to reconnect to the idle hosts can help keep the fire at bay. (Of course,
some cluster protocols have features that are supposed to do this
automatically, but they don't always work as expected in the real world.
During emergencies it can be better to rely on humans to control the flow
instead of automated rules.)

Admission metering is perhaps the second most important factor in handling
peak traffic. Random jitter doesn't help if you're at 100% capacity. You need
to be able to dial back everything - database queries, disk logging, app
server requests, frontend server connections, etc. - so you can handle the
maximum amount of traffic without falling over. The hard part is making this
fully dynamic so you don't have to keep restarting services to find the sweet
spot. Of course, this also requires extremely low-overhead, near-real-time
metrics so you have an accurate view to tune your knobs by.
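
A minimal sketch of that kind of dial-back throttle, as a token bucket whose
rate is a live-tunable knob (names are illustrative):

```python
import threading
import time

# One admission throttle per tier (DB queries, app requests, ...) whose
# rate can be re-tuned at runtime without restarting anything.

class AdmissionThrottle:
    """Token bucket whose refill rate is a live-tunable knob."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # the knob: tokens added per second
        self.burst = burst         # maximum bucket size
        self._tokens = burst
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def set_rate(self, rate_per_sec):
        """Turn the knob while the service keeps running."""
        with self._lock:
            self.rate = rate_per_sec

    def try_admit(self):
        """Admit one request if a token is available, else shed it."""
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self.burst,
                               self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False
```

During an emergency you call something like `db_queries.set_rate(100)` to
dial that tier back, then raise it again once the metrics settle.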

~~~
donavanm
Disagree on the "dynamically increase" logging. Log everything, all the time.
When you enable logging/debug you've just changed your resource profile.
Presumably you're already in a precarious, high-risk situation; don't add more
change. Storage is cheap; save your bits. If you need to save money, record at
different granularities, discard old records, or use (good) sampling.
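
A minimal sketch of what "(good) sampling" can mean here: hash a stable key
such as the request ID, so you keep either every record for a given request
or none, instead of flipping a coin per line (all names are illustrative):

```python
import hashlib

# Deterministic log sampling keyed on request ID: the same key always
# lands in the same bucket, so sampled requests keep complete traces.

def sampled(key, rate):
    """Keep roughly `rate` (0.0-1.0) of keys, always the same ones."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def log_record(request_id, line, rate=0.1):
    if sampled(request_id, rate):
        return f"{request_id} {line}"   # would be written to storage
    return None                          # dropped, but deterministically
```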

A system to dynamically/automatically capture something like
oprofile/systemtap/dtrace can be handy. But conversely, I've seen it turn a
bad instance into a dead one.

~~~
peterwwillis
The problem there is that the sheer volume of logs puts strain on the network
when logging remotely, and will crush the box with I/O load when logging
locally. The more hardware you have, the cheaper you'll want to be. And
unfortunately the engineers don't get to set the budgets ;-)

If your admission metering is up to par, you can throttle back requests to the
box while enabling specific logging. You can employ tricks like enabling extra
logging only for specific sessions, URIs, etc., or only for a specific time
period. It's not like you're going to profile your whole application while the
site's almost down, but there's usually a query, a subset of users, or
something else that's abnormally slow.
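
A minimal sketch of the per-session trick using the stdlib logging module
(the session allow-list and record attributes are illustrative):

```python
import logging

# Pass DEBUG records only for sessions under investigation; INFO and
# above always pass, so normal logging is unaffected.

DEBUG_SESSIONS = {"sess-42"}   # sessions currently being debugged

class TargetedDebugFilter(logging.Filter):
    """Emit DEBUG records only for flagged sessions."""

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return getattr(record, "session", None) in DEBUG_SESSIONS

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(TargetedDebugFilter())
logger.addHandler(handler)

# Extra detail shows up only for the session being investigated:
logger.debug("slow query plan: ...", extra={"session": "sess-42"})  # emitted
logger.debug("slow query plan: ...", extra={"session": "sess-7"})   # dropped
```

Flipping entries in the allow-list changes the logging profile for one user
instead of the whole box.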

------
peterwwillis
All of these points are great and should be taken to heart. I'd like to add
that once you take all of them into consideration, you realize you have a
codebase which isn't really dependent on the cloud anymore. What you've just
implemented is redundant, distributed, fault-tolerant services.

There's no reason you couldn't use both AWS and Linode for the same website at
the same time. You're just stretching your data out a little farther. Of
course it's also different underlying technology, but at the end of the day
they're just service providers. Abstract your code and data away from the
provider-specific APIs and you can expand to whatever platform you like in the
future with minimal glue.
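
A minimal sketch of that abstraction (all class and method names are
illustrative): the application talks to a small interface, and each provider
gets a thin adapter behind it.

```python
from abc import ABC, abstractmethod

# The application only ever sees this interface; AWS, Linode, etc. each
# get a thin adapter implementing it.

class BlobStore(ABC):
    """The only storage API the application code is allowed to see."""

    @abstractmethod
    def put(self, key, data): ...

    @abstractmethod
    def get(self, key): ...

class InMemoryStore(BlobStore):
    """Stand-in adapter; real ones would wrap S3, Linode storage, etc."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]

def save_avatar(store: BlobStore, user_id, image):
    # Application code never mentions a provider; moving from AWS to
    # Linode means writing one new adapter, not touching this function.
    store.put(f"avatars/{user_id}", image)
```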

------
josephb
This article was posted in response to the outage in April 2011, but the
lessons are still worth repeating.

~~~
gus_massa
It would be a good idea to add [Apr 2011] to the title of the submission. At
first I thought the article was about the recent outrage that was caused by an
electrical malfunction, so I was very confused.

~~~
davidw
> At first I thought the article was about the recent outrage

I think you meant "outage", not "outrage", although I could see some people
being pretty irritated :-)

------
chrislomax
I think the first point is the most important: expect downtime.

This is the one thing we have come to accept with hosts. For us it is now a
question of how easy it is to keep those servers up if things do go wrong.

If you can afford a failover data centre, then that is the only way to keep
things running.

~~~
Silhouette
_I think the first point is the most important: expect downtime._

The trouble is, a lot of these cloud services have one big sales pitch, and it
is based on the improved ease of management and reliability you get from
outsourcing to them instead of letting your own in-house IT team run things on
your own systems. If it turns out that hosting in the cloud isn't really any
more reliable than doing stuff in-house -- and yes, I did tell a lot of people
so, and so did many others who looked at the facts rather than the hype --
then you've just undermined the main argument for using these services.

The rest is a cost issue, whether it's cheaper to use scalable, cloud-based
resources or to buy in whatever big iron you need, and to some extent whether
it's cheaper to hire smart people to maintain your systems or to outsource the
management work. I think most of us who've looked into it know the answer to
that one, which is that unless you really do have a very dynamic system where
your resource needs vary by orders of magnitude within a short space of time,
cloud hosting is disproportionately expensive at almost any scale even taking
overheads into account.

~~~
chrislomax
It's weird because when I compare the pricing of cloud compared to our current
setup, the cloud always comes out cheaper. I think this is the main thing I am
looking at currently.

We pay around £1400 a month for our hosting at the minute. This entails 1 DB
server, 2 Web servers, 1 mail server, a load balancer and some other tricks
like a SAN and a NAS. Comparing that to the likes of Azure, where I would only
need 2 web servers (for redundancy) and SQL Azure, the cost comes in at around
half of what I am paying now.

I always feel like there is going to be a sting in the tail somewhere: once we
move it all over, we'll get some stupidly high bill for something that I
didn't take into account or that was not obvious.

~~~
Silhouette
I assume from the currency that you're in the UK like me. In that case, it
sounds like your current hosting is on the expensive side, unless perhaps you
have particularly high-spec servers or transfer very large amounts of data
that you didn't mention before.

FWIW, I suggest spending a few minutes looking up what a couple of other
hosting services would charge you for a similar set-up today, assuming you
haven't done so lately. I'd be a bit surprised if you couldn't save a
substantial amount of money each month for a relatively simple set-up like the
one you described.

~~~
chrislomax
Our main outgoings are software licenses, to be honest. SQL Server has gone up
to 240 per CPU now, which racks up. Thanks for the advice though, as I will
certainly look around.

I do like the prospect of the cloud, though, and instant scaling, which is why
I was looking at cloud solutions anyway.

~~~
seunosewa
You're comparing apples and oranges. SQL Server costs the same amount on
Amazon EC2 if you use it.

~~~
chrislomax
Well, we were thinking of using SQL Azure. It's £50 for a 20 GB DB, and that
gives you triple redundancy. To get this type of redundancy on a hosted
platform would take at least 3 SQL Server licenses and 3 machines to put them
on, plus managing the failover layers.

I wasn't thinking of using SQL Server on AWS, although it is slightly cheaper.

------
papsosouid
>You’re going to struggle to get the kind of performance you’d get from a
commercial SAN or NAS, with battery-backed cache, but EBS is considerably
cheaper.

I don't get this at all. It isn't a case of "struggling to get the
performance", it is a case of "you will never come close to being in the same
league, much less the same ballpark as the performance of a cheap entry level
SAN". That would be fine if they offered something with reasonable
performance, but they don't. Our only options for storage are "incredibly
slow" and "incredibly slow and transient".

The really bad aspect of this is that it isn't cheaper than a fibre channel
SAN. When I last priced it, an HA + replication setup (3 devices, 2 in an HA
cluster on site, one offsite being replicated to for disaster recovery) was
the same price as the same amount of EBS storage for 1 year.

~~~
sorenbs
This is the reason why you can't run a largish website on SQL Server on AWS,
and - I believe - a driving factor in the adoption of alternative databases
that are more easily sharded.

~~~
seunosewa
The need for sharding makes hosting on AWS even more expensive relative to
custom-tuned servers, not cheaper.

