

Large-scale Amazon EC2 Outage - bradly
http://status.aws.amazon.com/

======
Pewpewarrows
Two large, lengthy outages within a few months of each other? I hate to piggy-
back on Amazon's follies, but it's times like these that make me love my
Linode boxes. Total downtime over the last 5 years? 5 hours. I really couldn't
ask for a better hosting company.

Edit: The 5 hours are what I've noticed from either personal servers or those
of friends/acquaintances. YMMV.

~~~
Derferman
I must have picked a bad time to become a Linode customer. My node in Fremont
has been up for two days and has already seen two network outages.

~~~
rednaught
Overall they have great service. Only their Fremont location has been a bit
touchy this year. I've been with them since the days they used UML before Xen,
and if it's any consolation, I have experienced at least one downtime in each
DC (though this was over the course of many years). Unless you plan for
distributed systems, expect some failures from time to time no matter who the
provider is.

------
adamt
It gets better - I hope nobody was relying on their AWS DB snapshots for
backups. I just had a note from Amazon to say that one or more of my EU-West
database snapshots had missing blocks (due to an EBS software error) and had
been removed.

~~~
pdaddyo
Yep, just had the same email - I've lost all bar one EBS snapshot. I don't
think I can trust them anymore; I wish I had more time to move our
infrastructure elsewhere!

They've essentially swiss-cheesed all our backups.

Email copy follows....

Hello,

We've discovered an error in the Amazon EBS software that cleans up unused
snapshots. This has affected at least one of your snapshots in the EU-West
Region.

During a recent run of this EBS software in the EU-West Region, one or more
blocks in a number of EBS snapshots were incorrectly deleted. The root cause
was a software error that caused the snapshot references to a subset of blocks
to be missed during the reference counting process. This process compares the
blocks scheduled for deletion to the blocks referenced in customer snapshots.
As a result of the software error, the EBS snapshot management system in the
EU-West Region incorrectly thought some of the blocks were no longer being
used and deleted them. We've addressed the error in the EBS snapshot system to
prevent it from recurring.

We have now disabled all of your snapshots that contain these missing blocks.
You can determine which of your snapshots were affected via the AWS Management
Console or the DescribeSnapshots API call. The status for any affected
snapshots will be shown as "error."

We have created copies of your affected snapshots where we've replaced the
missing blocks with empty blocks. You can create a new volume from these
snapshot copies and run a recovery tool on it (e.g. a file system recovery
tool like fsck); in some cases this may restore normal volume operation. These
snapshots can be identified via the snapshot Description field which you can
see on the AWS Management Console or via the DescribeSnapshots API call. The
Description field contains "Recovery Snapshot snap-xxxx" where snap-xxxx is the
id of the affected snapshot. Alternately, if you have any older or more recent
snapshots that were unaffected, you will be able to create a volume from those
snapshots without error. For additional questions, you may open a case in our
Support Center: <https://aws.amazon.com/support/createCase>

We apologize for any potential impact this might have on your applications.

Sincerely, AWS Developer Support
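The reference-counting pass the email describes boils down to something like
the sketch below (a simplified illustration with invented block and snapshot
structures, not Amazon's actual code). The bug amounted to missing some
snapshot references during this scan, so blocks that were still in use looked
like garbage:

```python
# Simplified sketch of snapshot-cleanup reference counting: a block may
# only be deleted when no snapshot still references it.

def blocks_safe_to_delete(candidate_blocks, snapshots):
    """Return the subset of candidate blocks referenced by no snapshot."""
    referenced = set()
    for snapshot in snapshots:
        referenced.update(snapshot["blocks"])
    return candidate_blocks - referenced

snapshots = [
    {"id": "snap-1111", "blocks": {"b1", "b2"}},
    {"id": "snap-2222", "blocks": {"b2", "b3"}},
]
candidates = {"b2", "b3", "b4"}

# b2 and b3 are still referenced; only b4 is actually garbage.
print(blocks_safe_to_delete(candidates, snapshots))  # {'b4'}
```

If the scan skips snapshot snap-2222's references, b3 wrongly ends up in the
delete set - which is exactly the failure mode the email describes.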

~~~
andrewcooke
are your backups too large to be elsewhere completely? i'm working on a site
on appengine, and although i can't back up _all_ data, i can copy the critical
stuff (the user accounts) to servers elsewhere.

[i realise this is a little off-topic, but what do other app-engine users do?]

[edit: wasn't being critical, just trying to understand what others do and
why]

~~~
pdaddyo
They're not mission-critical failures, but that's not the point for me - I pay
a monthly fee to have those snapshots there, and they just carved holes in
them. Disappointing to say the least, even though it hasn't actually taken
down any of our instances.

~~~
codyrobbins
I’m glad to hear that you didn’t lose anything critical. But snapshotting EBS
volumes is not backing them up. If it’s stored using the same service, and is
therefore susceptible to all the same catastrophes that might befall that
service, then by definition it’s not a backup. I always set up a local read
slave of the database server with hot-swappable hard drives that get rotated
out of a safe deposit box or fireproof safe. The only way I can trust that
things are being backed up properly is to do it myself.

------
grandalf
My site (hosted on Heroku and RDS) was inaccessible, as was heroku.com ... but
the AWS status website said everything was OK. Heroku status also said
everything was OK.

What is the point of those status dashboards if they are not actually
monitoring the health of the cloud in real time?

Google App Engine's status dashboard quickly returns to green the minute an
outage goes away, hiding the overall unreliability from view.

~~~
bradly
I've found <http://search.twitter.com> far more reliable than any status
dashboard for finding out if there is an issue or not.

~~~
itsnotvalid
One should build a crowdsourced status checker to really know what's happening.

~~~
kalleboo
<http://downrightnow.com/> monitors twitter, their own user reports, official
status RSS feeds, etc for a probability calculation of if a site is down or
not.
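A crude version of that kind of aggregation might look like the sketch below
(the signal names and weights are invented for illustration; downrightnow's
actual model isn't public):

```python
# Combine independent "is it down?" signals (tweet volume, user reports,
# official status feed) into one weighted down-probability, roughly the
# shape of calculation a site like downrightnow might perform.

SIGNAL_WEIGHTS = {"twitter": 0.5, "user_reports": 0.3, "status_feed": 0.2}

def down_probability(signals):
    """signals maps signal name -> fraction of that source reporting 'down'."""
    return sum(SIGNAL_WEIGHTS[name] * frac for name, frac in signals.items())

# Heavy tweet/user chatter, but the official status feed still shows green:
obs = {"twitter": 0.8, "user_reports": 0.9, "status_feed": 0.0}
print(round(down_probability(obs), 2))  # 0.67
```

The point of mixing sources is exactly the complaint upthread: the official
feed alone lags, so it gets the smallest weight.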

------
jcsalterego
<http://status.aws.amazon.com/rss/EC2.rss>

<http://reports.panopta.com/cloudharmony-borderless/server/57481>

various EC2 clients affected:

- dotcloud
- dropbox
- engine yard
- foursquare
- heroku
- hootsuite
- instagram
- kicksend
- netflix
- pagerduty
- reddit
- twilio

~~~
santi
pagerduty down?! I don't know if I should be sarcastic here and laugh, or just
worry...

~~~
shazow
Their website might have gone down but I'm told that their underlying service
is redundant across multiple availability zones.

Perhaps they should do the same for the website just to instill extra
confidence, though.

~~~
AngryParsley
I didn't get any SMSes or calls from PagerDuty even though I see 4 incidents
triggered in the PD web UI. Although that could be due to Twilio going down.

------
blantonl
My business is on 15+ AWS instances and we are completely down and all hosts
are unreachable.

~~~
blantonl
as of 9:59 PM CST it appears we are back up. Additionally, all our instances
are running normally. It appears to have been a network connectivity issue
for AWS.

------
m0nastic
I'm moderately embarrassed to admit that I learned about this outage when both
the Facebook games I play went down at the same time. Both back up now though.

------
dstein
I thought it was silly how everyone was going apeshit the last time this
happened. Even here on HN people were falling over themselves proclaiming the
end of the cloud. But you know what, I kind of like seeing outages like this,
because it means Amazon is (hopefully) going to reduce the risk of a bug like
this ever happening again, which in the end makes the platform more reliable.

~~~
sjs
To be fair, the last one was a lot longer than half an hour, and even though
it affected a small percentage there was permanent data loss. 0.06% of 100 PB
is 60 TB, which is still a lot of data. I have no idea how much data they
store in one DC; maybe somebody could hazard a real guess.

~~~
rmc
_the last one was a lot longer than a 1/2 hour_

About 25% of EU west instances have been down for about 36 hours now...

~~~
sjs
Oh! I did not know that.

------
teeray
I would like to run my app on whatever AWS's status board is running on :-P

------
hysterix
Reddit gives: An error occurred while processing your request. Reference
#97.374a7b5c.1312858006.1a17105

Must be this.

~~~
azth
That's why I came here after I found out reddit was down :P

------
bane
I wonder how many 9's Amazon is down to? It's really a shame because it's such
a good service in theory.

But it really is time for Amazon to start thinking about some kind of
automatic failover system.

------
bgentry
3 out of 4 us-east-1 AZs are unavailable from the outside. The AZ that is
working is the one that has chronic availability issues, so there is no way to
recover service in that zone.

~~~
jbellis
There's an AZ with "chronic" availability issues? Where do I find out more
about this?

~~~
bgentry
I should have been more specific, I was in a rush =)

The original us-east-1b has chronic _capacity_ issues, meaning that it is
always at or near its capacity. AWS refuses to sell instance reservations for
this AZ, and it's often difficult/impossible to launch new instances in this
zone.

Keep in mind that AZs are remapped for all accounts newer than a certain point
in time (say, 1 yr ago). Your 1b may not be my 1b.

------
kenneth_chau
Why are these big-name companies still relying on exactly ONE web host? These
companies are centralizing their failures in a single cloud host. They
definitely need to abstract cloud deployment with something like the libcloud
and jclouds APIs so they can shift between multiple cloud deployments on a
dime. As for storage, they should have replication set up on other cloud
hosts as well.
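The kind of abstraction libcloud and jclouds provide can be sketched like
this (the tiny driver interface, `FakeDriver`, and the provider names are
invented for the example; the real libraries wrap actual provider APIs behind
a similar uniform interface):

```python
# Minimal sketch of provider-agnostic deployment with failover: try each
# cloud driver in order and fall back when one is unreachable.

class ProviderDown(Exception):
    pass

class FakeDriver:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def create_node(self, image):
        if not self.healthy:
            raise ProviderDown(self.name)
        return f"{self.name}:{image}"

def deploy(drivers, image):
    """Launch on the first provider that responds."""
    for driver in drivers:
        try:
            return driver.create_node(image)
        except ProviderDown:
            continue
    raise RuntimeError("all providers down")

# EC2 is down, so the deploy falls through to the second provider.
drivers = [FakeDriver("ec2", healthy=False), FakeDriver("rackspace")]
print(deploy(drivers, "web-image"))  # rackspace:web-image
```

The design point is that the app codes against one interface, so "move to
another cloud" becomes a configuration change rather than a rewrite.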

~~~
atambo
And for rubyists:

<https://github.com/geemus/fog>

------
rmc
Remember, there is a large EC2 EU-West outage that isn't fully fixed. About
25% of EC2 instances there have been down for about 36 hours now...

------
NathanKP
It looks like the outage was only 30 minutes long. At least that is Pingdom's
measurement of the downtime on my EC2 East 1 servers.

~~~
vilda
It depends on what you mean by outage. Instances with EBS volumes attached
have problems even 24 hours later.

------
vilda
Please refrain from judging the reliability of cloud providers unless you
have a representative sample. By representative sample I mean several hundred
instances.

That your instance runs at 99.9% does not say anything; you are lucky. That
your instance dies in two weeks does not say anything; you are unlucky.
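The point about sample size can be made concrete with a confidence interval
(the counts below are invented for illustration, not any provider's real
failure rates):

```python
# Why one instance tells you little: a 95% Wilson confidence interval for
# an observed failure rate only narrows as the sample grows.

import math

def wilson_interval(failures, n, z=1.96):
    """95% confidence interval for a failure probability from n trials."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# 1 failure in 10 instance-months vs. 30 in 300: the same observed rate,
# but very different certainty about it.
lo_small, hi_small = wilson_interval(1, 10)
lo_big, hi_big = wilson_interval(30, 300)
print(hi_small - lo_small > hi_big - lo_big)  # True
```

With ten instances the interval spans most of a third of the probability
range; with a few hundred it tightens enough to actually compare providers.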

------
randomanonymous
Why don't people just build in redundancy and use multiple locations, or have
a backup server at their business that can at least somewhat take over? It
seems silly that all these people rely solely on EC2 and don't have their own
hardware anymore.

~~~
acdha
They could get the same thing by simply reading Amazon's recommendations and
having EC2 servers in multiple regions as well. Most people either don't need
reliability or failed to plan for it - the only difference with EC2 is that
it's easier to add redundancy when you don't have to deal with physical
hardware, network connections, etc.

------
rtrunck
I hear all these things about the need to run in multiple regions, etc. Why
do I have to worry about this? Why isn't there an easy off-the-shelf
solution? Honestly, isn't there an easier way to have redundancy?

~~~
cloudwalking
Data synchronization between data centers is tough. There's A LOT of data
moving in those data centers all at once--it isn't easy to mirror that across
the country in real time.

~~~
diolpah
That's why the good lord invented eventually consistent data synchronization
models. If your use case can tolerate it, use it.
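The trade-off can be sketched in a few lines (a toy model with invented
names, not a production protocol): writes land on the primary immediately and
are shipped to the replica asynchronously, so a reader may briefly see stale
data but catches up once the log drains.

```python
# Toy eventually consistent replication: the primary acknowledges writes
# without waiting for the replica, which applies a queued log later.

class Replica:
    def __init__(self):
        self.data, self.log = {}, []

    def replicate(self, key, value):
        self.log.append((key, value))  # queued, not yet applied

    def drain(self):
        for key, value in self.log:
            self.data[key] = value
        self.log = []

primary, replica = {}, Replica()

def write(key, value):
    primary[key] = value
    replica.replicate(key, value)  # asynchronous in real systems

write("user:1", "alice")
print(replica.data.get("user:1"))  # None - stale read before convergence
replica.drain()
print(replica.data.get("user:1"))  # alice - converged
```

"If your use case can tolerate it" means exactly that window between the
two reads: the data is briefly stale, never lost.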

------
zrail
Huh. My piddly little website hosted on a micro instance is down. This kind of
sucks, not that I get a lot of hits, but I have a few things hosted there that
local stuff depends on.

------
andymoe
I think Amazon can fix this problem simply by raising us-east prices to be on
par with us-west. At least then there would not be quite the same incentive
to make the foolish choice of running all your instances out of one region.
(I suppose they could lower prices in us-west to match instead.)

But in any case, I expect the PaaS people like Heroku to at least start to
step up their game. My apps on Google App Engine are up, and I can run
Clojure and Ruby there too.

------
ahhrrr
My servers (on Engine Yard) were down for a few minutes but appear to be back
up again.

------
Aloisius
I pay the extra money for US-west which seems to have a much better track
record.

------
NARKOZ
rtomayko: Everyone hates AWS when it fails in exactly the way they clearly
state it will in the fine print. It's not free. You have to engineer for it.

------
xal
Looks like Pagerduty is hosted on ec2? Really?

------
reustle
100% green for me

~~~
captaincrunch
scroll down.

------
necubi
Appears to be back up, at least for us.

~~~
ksdsh
Yes, my site is up now too.

------
rjurney
US-East is the Big Lots of EC2 zones.

------
tricolon
Netflix also seems to be affected.

------
bch
Is this a pure network issue (i.e. strictly connectivity), or were instances
rebooted?

~~~
petedoyle
Seems like network only. I have an instance that was unavailable via SSH. When
things came back up I still had 81 days of uptime.

------
elemenohpee
Is this why Reddit is down?

~~~
ntkachov
yup.

------
teeray
This will likely kick Heroku into high gear with jumping off EC2.

------
mtogo
Netflix is down as well.

------
shriphani
Looks like it is up now. (7:58 pacific). I can access netflix.

~~~
dennisgorelik
11:17 pm EST - netflix.com does not work (though shows some minor signs of
life).

------
mgkimsal
is this related to the one from the other day?

DUH - doesn't seem to be - that was west coast, IIRC, and the only issue I see
now is in Virginia.

------
dennisgorelik
Netflix is down.

------
flog
We've been down for 3 days now.

------
agotterer
All my boxes just came back up

------
dbingham
Looks like we're back.

------
skennedy
Twilio is down. Bummer

~~~
taf2
damn, <http://status.twilio.com/> - calls are still going through, just a few
extra rings before my service receives the HTTP request from Twilio... gotta
love Twilio, they rock - looks like no call recordings though...

------
rmrm
netflix down on my PC and their app is down on my TV.

------
chetan51
It's back up now.

------
enoptix
Dotcloud is down

------
bgentry
Back up now.

------
presidentender
Our minecraft server on linode is down.

~~~
dbingham
Linode relies on Amazon? My linode's still up...

~~~
presidentender
I'm a fool, it's mojang's authentication.

------
nirvana
I'm sure Amazon will fix this issue, but the question it brings to my mind
is: why are so many people using EC2? Most startups need hosting, not an
elastic compute cloud. EC2 makes sense for someone who needs to spin up 1,000
workers, pull data out of S3, process it, store the results in S3, and then
spin down the workers when done.

But startups need hosts that are up 24/7. EC2 doesn't give you any guarantee
of uptime, and if an instance goes down its local (fast) disk is ephemeral.
Yet the EBS alternative, which is backed up, is very slow.

The basic VPS offering that Linode, Rackspace, and just about everyone else
provides isn't available at all from Amazon (as near as I can see). Yet this
is what startups need: local disks, a small monthly fee, and up all the time.

So Amazon requires extra engineering--to account for nodes going down more
often and to make ephemeral disks reliable or EBS performant. It also puts
you on the path to lock-in, since so many of Amazon's services have their own
unique APIs. It isn't exactly cheap, as near as I can tell, when compared to,
say, dedicated hosting in Germany.

Its advantages exist elsewhere: if you need to spin up a bunch of machines
with an API, you can do that at Rackspace Cloud too. And... that's about the
only advantage I'm able to think of. Personally, I'm architecting to have
some overcapacity built in so I can survive a spike, because I use that
excess capacity in the off hours for heavy lifting. I wouldn't try to bring
up extra nodes in the morning and shut them down in the evening anyway... and
I doubt that many startups are really doing that.

Possibly, I'm missing something. I tend to forget about features that aren't
compelling to me, but are compelling to others. So maybe there's something
that's important to these startups.

~~~
petervandijck
S3 is a big part of it. If you're storing lots of data, a "local disk" just
isn't enough. Now you're spending your time solving problems Amazon already
solved.

Occasional downtime just isn't that big of an issue for many startups, who are
still looking for product-market fit.

~~~
nirvana
I see what you're saying, and I failed to make clear that I'm coming from the
perspective of someone who is working with a cluster of Riak nodes. Riak is a
ring topology much inspired by Amazon's Dynamo design. I'd never trust
anything to just one spinning-rust device, sure. I was taking a distributed
storage architecture (using whatever open source platform you prefer) for
granted. But I recognize that S3 predates many of them.
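The ring topology mentioned above can be sketched with consistent hashing
(heavily simplified; real Riak adds virtual nodes, hinted handoff, and so on,
and the node names here are invented):

```python
# Consistent-hashing sketch: keys hash onto a ring, and each key is stored
# on the next n_replicas nodes clockwise, so losing any one node leaves a
# surviving replica of every key.

import hashlib
from bisect import bisect

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(key, nodes, n_replicas=2):
    """The n_replicas nodes clockwise from the key's ring position."""
    ring = sorted((ring_hash(node), node) for node in nodes)
    start = bisect(ring, (ring_hash(key), ""))
    return [ring[(start + i) % len(ring)][1] for i in range(n_replicas)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = preference_list("user:42", nodes)
print(len(set(owners)))  # 2 distinct nodes hold the key
```

Because replicas are placed on consecutive but distinct ring positions, no
single node failure can take the only copy of a key with it.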

------
Hisoka
Imagine launching your startup on the day AWS goes down. On another note...
why don't these outages ever take Amazon.com itself down? Come on, eat your
own dog food!!!

~~~
jrockway
Amazon specifically tells you to set up multiple servers in multiple
availability zones. They probably follow their own advice, and, as a result,
don't go down.

I've talked to some people at AWS about this, and the reason they have
availability zones is that they don't want to charge you the speed cost of
syncing data between zones if your app doesn't need 100% uptime. Generalized
replication slows down your app. AWS gives you the option of having no
replication or bringing your own.

~~~
snorkel
...and only Amazon can afford to follow their own advice. Multiple AZ hosting
ain't cheap. Most CEO/CIO/CTO types spit out their coffee when they see the
costs of fully redundant hosting in the cloud at which point they decide "For
that price we can afford to be down for 24 hours."

~~~
spydum
Very few businesses need 100% uptime. As long as you have good recovery
strategies, and exercise them routinely, you should be set. When was the last
time you ran a failover simulation? Do your ops guys know what to do? Are
there clear lines of communication as to the status of the event?

Outages are hard to avoid, but the pain can be lessened if your customers are
aware of the recovery progress and you can deliver on your recovery time
goals. Nothing is worse than being down, and leaving customers in the dark to
start rumors that your guys are not even aware of the problem.

------
CamperBob
Dropbox is up but stored content seems to be inaccessible. Yay for single
points of failure!

~~~
mtogo
Using Dropbox is a bad idea overall in my opinion. They lied to their
customers about what they can and cannot view in your account. They _lied_ to
you. They intentionally told you that they could not view your files when in
fact they can. If that's not enough, they had a critical security
vulnerability (login with no password) for four hours, proving that their
systems fail open. Finally, as if all this weren't bad enough, they do not
encrypt your files when they store them[1].

[1] _Technically_ they do encrypt the files, but the keys are right next to
them on the same infrastructure. Doesn't do any more good than not encrypting
them.

~~~
blackhole
This really isn't a problem if you aren't storing anything sensitive,
especially if you aren't even paying for it.

~~~
jules
For the bits that need to be protected there is Tarsnap
(<http://www.tarsnap.com/>). The client is open source, so you can check for
yourself that things are encrypted before going on the wire.

------
doubaokun
My servers are down in east us.

------
rmoriz
<http://i.imgur.com/ZIqmo.jpg>

~~~
ars
This type of comment is not desired here. Take it to reddit.

~~~
hamburgersushi
This site is already, effectively, /r/technology combined with /r/politics.

I like this comment.

