

How SmugMug survived the Amazonpocalypse - onethumb
http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse/

======
josephruscio
I'm finding a lot of these articles about "surviving" the outage fairly
frustrating. They generally boil down to a combination of the following:

1\. "We use a multi-AZ strategy!" - This outage affected multiple AZ's
concurrently. If you did not see downtime, this means you were fortunate to
have at least one unaffected AZ. This is pure luck however, many sites with
the same level of preparation had significant downtime. (Note: A multi-AZ
strategy is sage and would have minimized your downtime, but does not warrant
a survival claim in this case.)

2\. "We aren't using EBS!" - Not a single article I've seen has claimed that
they weren't using EBS because they feared a multi-day/multi-AZ outage. They
weren't using it because it lacks predictable I/O performance in comparison to
S3. You can't retroactively claim wisdom in the category of availability for
this choice.

3\. "We don't host component <X> on AWS!" - Taking this argument to it's
logical end, any service that doesn't host on AWS could write one of these
articles e.g. "We host on Rackspace so we didn't go down!"

In short, if you don't have a completely multi-region strategy (including your
relational data-store) implemented purely on AWS, your blog post is decreasing
the signal-to-noise ratio on this issue.

------
hopeless
Best quote: "Start surprising your Ops and Engineering teams by killing stuff
in the middle of the day without warning them. They’ll love you"

It sounds stupid, but if you really do have a resilient and redundant
infrastructure, it shouldn't matter. If you fear someone randomly unplugging
things, then you have work to do ;-)

~~~
sfrench
Netflix (who also survived the outage) wrote a blog post last year in which
they talked about a system they call "Chaos Monkey", which does this exact
thing.

[http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...](http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
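
For anyone curious what that looks like in practice, here's a minimal sketch of
the idea, assuming boto3 and an opt-in instance tag (both my invention; this is
not Netflix's actual implementation):

```python
# Hypothetical chaos-monkey-style script: randomly terminate one opted-in
# instance so the team learns whether the system really tolerates failure.
# Assumes boto3 credentials/region are configured; the "chaos-opt-in" tag is
# an illustrative convention, not Netflix's or SmugMug's tooling.
import random
import boto3

ec2 = boto3.client("ec2")

def candidates():
    """Return IDs of running instances tagged as fair game for termination."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

def kill_one():
    ids = candidates()
    if not ids:
        print("No opted-in instances running; nothing to terminate.")
        return
    victim = random.choice(ids)
    print(f"Terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    kill_one()
```

Run on a schedule during business hours, it forces the "what if this box
disappears right now?" question to get answered continuously instead of once a
year.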

~~~
joegester
I don't think they meant that they were doing that on live systems. Sounds
like a debugging tool.

------
cagenut
I think SmugMug's cloud/colo hybrid is more likely to become the norm than the
all-cloud dream of not having to deal with hardware anymore. When it comes to
the "undifferentiated heavy lifting", AWS wins: S3 for bulk storage, EC2 for
asynchronous computing, CDNs for edge/delivery. But when it comes to your core
data (metadata? the 64-bit picture_id as opposed to the 2-megabyte JPEG), you
just cannot beat RAID 10 SSD-type colo'd setups right now.

Essentially I think we're going to be in an 80/20-ish cloud/colo sweet spot
situation for years to come.

------
lordmatty
Well done, SmugMug.

Perhaps you should diversify into cardiac monitoring!

------
Joakal
A little bit of information on their configuration in the past:
[http://don.blogs.smugmug.com/2008/06/03/skynet-lives-aka-ec2...](http://don.blogs.smugmug.com/2008/06/03/skynet-lives-aka-ec2-smugmug/)

------
joevandyk
One thing I'm curious about -- if you want to spread your instances over
multiple zones and you are using PostgreSQL, how do writes work? Won't writes
be slow if instances in one zone have to reach the master located in another
zone?

~~~
whakojacko
Latency between different availability zones in the same region is generally
pretty good. "Over the last month, the median is 2.09 ms, 90th percentile is
20ms, 99th percentile is 47ms. This is based on over 250,000 pings -- one
every 10 seconds over the last 30 days." From [http://www.quora.com/What-are-typical-ping-times-between-dif...](http://www.quora.com/What-are-typical-ping-times-between-different-EC2-availability-zones-within-the-same-region)

Certainly 10% being 20ms or more is a little troubling, but if that latency
only applies to writes (i.e. reads come from a slave in the same AZ), you are
probably OK.
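
A rough sketch of that split, assuming psycopg2 and made-up hostnames for a
primary in another AZ and a replica in the local AZ:

```python
# Illustrative read/write split: writes pay the cross-AZ round trip to the
# primary; reads stay on a replica in the local AZ. Hostnames/credentials are
# placeholders, not anyone's real setup.
import psycopg2

primary = psycopg2.connect(
    host="pg-primary.us-east-1a.internal", dbname="app", user="app", password="..."
)
replica = psycopg2.connect(
    host="pg-replica.us-east-1b.internal", dbname="app", user="app", password="..."
)

def record_view(photo_id, viewer_id):
    # Write: crosses AZs, so it eats the ~2-20 ms inter-AZ latency.
    with primary, primary.cursor() as cur:
        cur.execute(
            "INSERT INTO photo_views (photo_id, viewer_id) VALUES (%s, %s)",
            (photo_id, viewer_id),
        )

def get_views(photo_id):
    # Read: served by the same-AZ replica, no cross-AZ hop (may lag slightly).
    with replica.cursor() as cur:
        cur.execute("SELECT * FROM photo_views WHERE photo_id = %s", (photo_id,))
        return cur.fetchall()
```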

------
rgrieselhuber
Great article. It would be nice to know what they are doing for whatever
database they are using (MySQL, etc.), since they are not using RDS / EBS.

~~~
onethumb
Yes, I'm seriously overdue on a blog entry about current state-of-the-art for
DBs at SmugMug. :( You can watch my keynote from the MySQL conference two
years ago to see what we used to do, but things have progressed since then.
[http://don.blogs.smugmug.com/2010/04/15/my-mysql-keynote-sli...](http://don.blogs.smugmug.com/2010/04/15/my-mysql-keynote-slides-and-video/)

~~~
teoruiz
That would be awesome. +1 on my side.

------
petedoyle
Very interesting having the DBs hosted in another datacenter. I've always
assumed it'd add too much latency, but it looks like that's not the case.

Here's a traceroute between an EC2 instance in us-east-1a and rackspace.com
(which resolved to one of their VA datacenters):
<http://pastebin.com/RF5VrTic>

Sub-2ms. It also looks like us-east-1a is peered directly with whichever
Rackspace datacenter served the request.
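
If you want to sanity-check that kind of number yourself without a traceroute,
something like this gives a rough idea (the host and port are placeholders, and
TCP connect time is not exactly the same thing as ping RTT):

```python
# Quick-and-dirty latency probe: time TCP connects from the EC2 instance you
# care about to the remote datacenter's endpoint.
import socket
import statistics
import time

def connect_time_ms(host, port=443):
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=5)
    elapsed = (time.monotonic() - start) * 1000.0
    sock.close()
    return elapsed

if __name__ == "__main__":
    samples = [connect_time_ms("www.rackspace.com") for _ in range(20)]
    print(f"median {statistics.median(samples):.2f} ms, max {max(samples):.2f} ms")
```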

------
OstiaAntica
The AWS issues in the NoVa center are continuing today; our RDS is still not
fully accessible.

~~~
Terretta
I've been interested to see the AWS status page misrepresent this through its
icons while annotating the continuing issues.

1) Long before the EBS API was restored, AWS adjusted the "Amazon Elastic
Compute Cloud (N. Virginia)" status[1] for 24 April to show operational
(green). This has since been corrected in the "Amazon EC2 (N. Virginia)"
Status History.

2) Their own RDS service, which consists of instances backed by EBS, remained
unavailable to its users, proving that #1 was false. If they couldn't operate
a service (RDS) built on their own platform (EC2) normally, the underlying
service (EC2) should not have been marked Operational on the status page.

3) At present, the icon for "Amazon Elastic Compute Cloud (N. Virginia)" is
green for "Service is operating normally" instead of yellow for "Performance
issues", even though the text description is _not_ "Service is operating
normally" but "Instance connectivity, latency and error rates."

4) It seems from anecdotal observation that they're treating the status page
as "median status" at best, or perhaps closer to "20th percentile status",
meaning >80% of something can be down before it toggles to "Service
Disruption".

[1] <http://status.aws.amazon.com/>

------
idonthack
The tl;dr:

>we don’t use Elastic Block Storage (EBS), which is the main component that
failed last week.

~~~
mikeryan
More importantly, SmugMug was smart enough, when moving to the cloud, to
realize which components were the most failure-prone and to stay away from
them.

Not using EBS wasn't luck; it was a conscious decision.

~~~
SoftwareMaven
They didn't avoid it because of concerns about availability; they avoided it
because of run-time performance concerns. I don't think you can argue that
those concerns even imply anything about availability, much less have some
kind of causal relationship.

SmugMug got lucky in their choice. If EBS performance had been consistent,
they would have used it and most likely gone down like so many others.

~~~
onethumb
Not true. Our primary decision was based on unpredictable latency, but the
fact that we didn't/don't trust EBS played a huge role. EBS mucks up our basic
availability scenario - systems are no longer individual, disposable,
replaceable units. I'm sorry if that wasn't clear from the blog post - I'll go
re-read that part and update.
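
To make the "disposable, replaceable units" point concrete (my reading, not
SmugMug's actual tooling): when a node misbehaves, you don't nurse its block
device back to health, you throw the node away and stamp out another one from
the same image. A sketch with boto3 and placeholder IDs:

```python
# Illustrative "disposable unit" replacement: terminate a misbehaving node and
# launch a fresh copy from the same AMI. With no per-instance EBS volume,
# there is no state on the box worth recovering. IDs/types are placeholders.
import boto3

ec2 = boto3.client("ec2")

def replace_node(bad_instance_id, ami_id="ami-00000000", instance_type="m1.large"):
    # Throw the broken unit away...
    ec2.terminate_instances(InstanceIds=[bad_instance_id])
    # ...and stamp out an identical replacement from the golden image.
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]
```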

