
Another EC2 outage (yet AWS dashboard says no error) - rdl
EC2 appears to be down (not Heroku, specifically, but EC2).  AWS dashboard doesn't yet show this.
======
rdl
Amazon Availability Zones are such a fucking lie. ("Shared nothing, and an
outage affecting one will not affect other Zones in the Region.")

I've seen more failures which take out multiple AZs than which take out only a
single AZ. So, a prudent person would split their application across regions
(which are relatively shared nothing, except for admin/account level stuff),
but Amazon goes out of its way to not make that easy -- you're using the
public Internet, pay higher costs, etc.

The right choice is probably EC2 plus a non-EC2 provider (your own hosted
stuff, another cloud (?), etc.), with protection in case either goes down. But
that is a relatively large amount of work, and if you're on a PaaS like Heroku,
which is 100% exposed to EC2, you can't do it.

Kind of sucks.

~~~
dsl
My Nagios-awoken 3 AM brain finds no fault in your logic.

~~~
rdl
Even funnier are people doing server monitoring of (things in EC2) from within
EC2. When the EC2 outage happens, there's obviously no problem because no
alerts get sent...

Doh!
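
A minimal sketch of the alternative: run the probe from a machine outside EC2,
so losing EC2 connectivity shows up as a failed check instead of silence. The
URLs and the `notify` callback are hypothetical placeholders, and the `opener`
parameter exists only so the probe can be exercised without a network.

```python
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout`."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError, HTTPError, connection resets and socket timeouts are all
        # OSError subclasses, so any of them counts as "down".
        return False


def check_all(urls, notify, probe=is_up):
    """Probe every URL, call `notify` once per failure, return the failures."""
    down = [url for url in urls if not probe(url)]
    for url in down:
        notify(f"DOWN: {url}")
    return down
```

Run it from cron (or similar) on a host at a different provider, with `notify`
wired to whatever pages you.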

~~~
cperciva
For some people, that might be fine. If you don't have plans for how to
rapidly move out of EC2, you might as well just sleep through an all-of-
EC2-goes-down outage for all you can do about it.

~~~
rdl
You should at least know there is an outage so you have something to tell your
downstream customers. It is really embarrassing to have a customer (or your
boss) call to report an outage you don't yet know about, even if there is fuck
all you can do to resolve it. Basic principle of ops.

~~~
zackattack
> Basic principle of ops

For my benefit, what are some others?

~~~
rdl
This would actually be an interesting blog post.

------
seldo
Waking up at 2.30am to a total service outage sucks, but at least I know all
you bastards are awake and screaming too. That makes me feel slightly better.

(We're back now. Kudos to Pingdom for noticing and alerting)

~~~
yanivgolan
Ah, Pingdom failed to alert us via push; we only got the email alert (luckily
it was exactly when the team was warmed up and ready to hit the keyboard :)
11 AM in Israel)

~~~
katzboaz
I have an IFTTT recipe to get a push notification if you have a server outage:
<http://ifttt.com/recipes/25122>

~~~
yanivgolan
Cool! Thanks, adding it now (IFTTT rocks)

------
mosburger
One advantage to being on the east coast vs. the rest of you silicon valley
types - I got awakened at 5:30am instead of 2:30am! :)

~~~
mukaiji
true, but that still doesn't make up for the weather difference :)

Plus, who said we were sleeping.

~~~
mosburger
fair enough. And I'm sure the next outage will happen around 11:30p PT. :)

~~~
rdl
Leaving 30 minutes before the bars in NYC close. Just saying.

~~~
intsunny
Last call in NYC is usually 4AM.

~~~
zackattack
He's talking about the next outage :)

~~~
rdl
No, actually I incorrectly added 3h to 2330 and got 0330, mostly because I'm
sleep deprived after this outage. Or I could pretend it was "an hour to
resolve the outage, then back to the bars".

------
colinhowe
Their dashboard not showing this and Twitter being more reliable as a way of
showing downtime is a sad tale.

~~~
tybris
That dashboard sucks, but Twitter being a faster, more reliable source of news
is true for most things now.

------
orph
Back now. My logs show it was down from 2:23 to 2:41.

------
faizanaziz
Painless server management is slowly becoming painful again :(

Worse if you are tied into the AWS ecosystem. You can't even move out or even
have backup servers in other places.

~~~
verelo
The largest issue I see for this is RDS. If you use RDS... I don't know, but
migrating away is not going to be easy.

Migrating the servers themselves shouldn't be too hard if you have a fairly
sane build process. If you don't, and I suspect a lot of people don't... it's
not going to be fun.

~~~
rdl
It's more that if you make use of the Amazon APIs for autoscaling other
services, you can't just directly translate that to a more static managed
hosting environment.

Probably the sane way is to special case some subset of your functionality so
it works regardless, and then gracefully scale up/down your app (performance,
scope of features, etc.) based on system health. This is a lot more complex,
and really hard to retrofit.
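
A rough sketch of that idea, assuming a single health signal (the fraction of
backends passing health checks) and a hypothetical feature list; the feature
names and the 0.9 / 0.5 thresholds are made up for illustration.

```python
from enum import Enum


class Health(Enum):
    CRITICAL = 1
    DEGRADED = 2
    HEALTHY = 3


# Hypothetical features mapped to the minimum health level they need.
# "checkout" is the special-cased core that should work regardless.
FEATURE_MIN_HEALTH = {
    "checkout": Health.CRITICAL,
    "search": Health.DEGRADED,
    "recommendations": Health.HEALTHY,
}


def classify(healthy_backends, total_backends):
    """Map the share of healthy backends to a coarse health level."""
    if total_backends <= 0:
        raise ValueError("total_backends must be positive")
    ratio = healthy_backends / total_backends
    if ratio >= 0.9:
        return Health.HEALTHY
    if ratio >= 0.5:
        return Health.DEGRADED
    return Health.CRITICAL


def enabled_features(health):
    """Return the features that stay switched on at this health level."""
    return sorted(
        name
        for name, needed in FEATURE_MIN_HEALTH.items()
        if health.value >= needed.value
    )
```

For example, `enabled_features(classify(6, 10))` keeps `checkout` and `search`
but drops `recommendations`.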

~~~
verelo
Yeah, that's a fair point.

From the very beginning I've always stayed far away from anything that locks
us into AWS. We've made no use of anything that couldn't be picked up and moved
elsewhere, which is exactly why we never adopted auto scaling.

This held us back a bit at first: even tools like SES initially only had an
API provided by Amazon. Now it supports standard SMTP connections, so we
decided there was no harm in using it, as we could easily switch providers
with no code changes.
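
A minimal sketch of why plain SMTP keeps the exit door open: the provider is
just a host, port, and credentials, so switching is a config change. The host
name and credentials in the usage note are placeholders, not real endpoints.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email; nothing here is provider-specific."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(msg, host, port=587, username=None, password=None):
    """Send `msg` through any SMTP relay that supports STARTTLS."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()
        if username is not None:
            smtp.login(username, password)
        smtp.send_message(msg)
```

Usage would look like `send(build_message(...), host="smtp.example.com",
username="user", password="secret")`, with the host read from configuration so
that moving to another relay touches no code.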

------
louisgoddard
Did a quick write-up on this here:
<http://webdev360.com/users-report-major-amazon-ec2-outage-but-official-dashboard-stays-mute-41419.html>

Does anyone know if ops teams generally have to clear these sorts of
announcements with PR/comms? I assume they would, given how they get reported
on.

~~~
rdl
At a competent company, no. There are policies in place before the outage, but
having your PR people in the loop slows things down to the point where you're
worthless to your customers. The exception is you loop in legal, PR, etc. if
someone is actually injured/dies, or if crimes are involved.

A lot of providers try to NDA their "ops to customer" service outage
notifications, but most customers flagrantly violate those NDAs. Automated
service dashboards are supposed to be automatic; ops teams often put in short
statements (especially time to fix and any interim way to mitigate the
outage).

Definitive statements after the outage are run by PR (and generally announced
by someone senior to ops), but service notifications of outages (vs. causes,
compensation, and long term corrective actions) are not.

~~~
louisgoddard
Thanks -- that's what I assumed initially, but had second thoughts after
seeing how delayed the announcements were.

~~~
rdl
I suspect there is some human admin level disconnect between their network
ops/routing and the AWS team itself. A connectivity outage wouldn't
necessarily get detected within AWS, and they probably don't have good
monitoring within the AWS product to detect problems like that. The network
team presumably doesn't have a good way to push status updates to the AWS
dashboard automatically (and it's kind of a grey area what is a "network
outage" -- if you lost routes to just Pakistan, that's not really a big deal
for most AWS customers. If you lose routes to everyone, yes, that's a big
deal.)

------
rdl
Apparently it was caused by a network problem above AWS:

2:40 AM PDT We are investigating connectivity issues for EC2 in the US-EAST-1
region.

3:03 AM PDT Between 2:22 AM and 2:43 AM PDT internet connectivity was impaired
in the US-EAST-1 region. Full connectivity has been restored. The service is
operating normally.

6:09 PM PDT We want to provide some additional information on the Internet
connectivity interruption that impacted our US-East Region last night. A
networking router bug caused a defective route to the Internet to be
advertised within the network. This resulted in a 22 minute Internet
connectivity interruption for instances in the region. During this time,
connectivity between instances in the region and to other AWS services was not
interrupted. Given the extensive experience that we have running this router
in this configuration, we know this bug is rare and unlikely to reoccur. That
said, we have identified and are in the process of deploying a mitigation that
will prevent a reoccurrence of this bug from affecting network connectivity.

We understand that when networking events affect instances in multiple
Availability Zones it causes our customers serious operational issues that are
difficult to architect around. We have been using and refining our
Availability Zone architecture for over 10 years at Amazon to provide highly
reliable services. Availability Zones provide a high degree of isolation
including physical separation, independent power distribution, independent
cooling and mechanical systems, and multiple physical links to the Internet
through multiple transit providers and peering connections. All of our regions
have exceeded 99.99% availability over the last several years. We are also
continually investing in improving our architecture as we learn more. In
addition to the remediation discussed above which addresses the specific bug
we saw last night, we are currently in the later stages of refining the way
that we do route advertisement within a region. These changes will isolate any
bad route information to inside a single Availability Zone while maintaining
the performance characteristics of our current inter-Availability Zone network
design. We have been deploying these changes carefully to avoid impact to
customers, but we expect these changes to be complete within the next several
weeks. We are confident these changes will protect us from multi-Availability
Zone impact for the sort of bug we saw last night.
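
For scale, a back-of-the-envelope sketch of what 99.99% allows, assuming the
figure is measured over a calendar year (the statement does not say which
window it uses):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted by an availability fraction over a period."""
    return (1.0 - availability) * period_minutes


# About 52.56 minutes of allowed downtime per year at 99.99%.
budget = downtime_budget_minutes(0.9999)

# Share of that yearly budget used by the 22 minute interruption (roughly 42%).
share_used = 22 / budget
```

Under that assumption, this single incident used a bit over two fifths of a
year's allowance.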

------
javery
Just updated - looks like a network connectivity issue. It looks like they
have had issues every day for the last 3 days - first time I have seen it down
though.

------
eblackburn
For those of you relieved it happened in the middle of the night: hi from
mid-morning London, where externally hosted services, including their
websites, mysteriously disappeared for half an hour.

Nothing critical, but certainly a warning shot across the bows.

------
mickeymoose
"The right choice is probably EC2 plus a non-EC2 provider (your own hosted
stuff, another cloud (?), etc.), with protection in case either goes down"

Yes, multi-cloud strategy... I know a good company for this ;-)

~~~
rdl
And yet you have no contact info in your profile...

There's also the obvious risk with even using a single PaaS running on
multiple IaaS clouds. If your account with the PaaS gets hacked, or they get
acquihired, or whatever, you can be screwed too.

Figuring out exactly where to have redundancy in your business is hard.
Especially because building something to be redundant imposes costs (more
expensive, slower development) and sometimes itself is the cause of outages
(lots of hilarious failover-related failures have taken down sites).

------
instakill
It's incredible how much of the internet depends on Amazon.

------
retrovirus
Keeping an eye on HN all the time pays off :) I too was wondering what the
hell is going on with my EC2 instances when I saw this thread.

AWS status dashboard: -1, HN: +1

------
danieldrehmer
Happened with many of our sites a few moments ago. Went to their status
history and no sign of failure. Shame on you, AWS.

------
thibaut_barrere
I have a dotcloud (EC2 based afaik) down too.

------
instakill
So this is why wildfireapp.com was down.

------
lreeves
Seeing this too. The status dashboard shows connectivity problems for US-
EAST-1 now.

------
orph
At least I wasn't imagining it.

------
tikhon
back up

