
AWS issues, affecting Heroku and others - elliottcarlson
https://status.heroku.com/#2012-10-22
======
pjscott
Amazon, those masters of tactful understatement, say:

> 10:38 AM PDT -- We are currently investigating degraded performance for a
> small number of EBS volumes in a single Availability Zone in the US-EAST-1
> Region.

<http://status.aws.amazon.com/>

~~~
elliottcarlson
Not only that - the "information" icon that they use makes it very hard to
spot issues quickly. Even on a minor issue it would be useful to have some
kind of different icon.

~~~
yuvadam
AWS cleverly designed a very informative status dashboard: green means "things
_might_ work", green with a little speck on it means "nothing is working, the
entire datacenter is down".

Effectively, this is exactly how it is.

~~~
pjscott
About 33 minutes after they first acknowledged the problem, they changed
one of those indicators to a goldenrod-yellow color. Presumably this
indicates that the world is on fire and the sky has gone black with carrion
crows:

> 11:11 AM PDT -- We can confirm degraded performance for a small number of
> EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances
> using affected EBS volumes will also experience degraded performance.

~~~
cpeterso
<http://status.aws.amazon.com/images/status0.gif>

<http://status.aws.amazon.com/images/status1.gif>

<http://status.aws.amazon.com/images/status2.gif>

<http://status.aws.amazon.com/images/status3.gif>

------
blantonl
I took our Web sites offline (RadioReference.com / Broadcastify.com) so our
back end processes (audio archiving, audio serving etc) don't crater.

Our Master MySQL server uses EBS (striped RAID across 4 EBS disks) and it is
getting killed due to severely degraded EBS performance. There is definitely a
major problem with EBS at this time.

~~~
beneth
We are running our sites across three AZs, all in US-EAST. Our three web
servers (one in each AZ) are not even able to load static test pages from
their EBS volumes at this time. We also have a Percona MySQL cluster across
three AZs and we are seeing very degraded performance in the EBS volumes on all
three servers as well. It seems like this is impacting more than a single AZ
from what I'm seeing.

~~~
blantonl
Well, we've gone from "degraded performance" to completely losing our EBS
instance store for our master MySQL server. If we don't see resolution shortly
we might have to look at promoting one of our slaves and moving forward from
there.

~~~
blantonl
We ended up promoting a slave to master, and moving forward from there. Our
master eventually returned to service, and we'll promote him tomorrow after
things settle down.

Interesting to note that we issued a reboot on our Master and it went away and
didn't return for over 45 minutes - we thought for sure it would have to be
terminated. API calls and console access were severely restricted, so even
launching in new AZs was problematic.

------
rabidsnail
Note to all the postgres users cursing EBS right now: you can persist your db
instances to s3 with <https://github.com/heroku/WAL-E>, which, combined with
streaming replication and some redundancy, should let you run off ephemeral
disk.
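For anyone who wants to try it, the setup is mostly a matter of pointing
Postgres's continuous-archiving hooks at wal-e. A minimal sketch based on the
WAL-E README (the env dir path and data directory are placeholders to adapt):

```
# postgresql.conf -- ship each completed WAL segment to S3 as it fills
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
archive_timeout = 60

# plus a periodic base backup from cron, e.g.:
#   envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.2/main
```

Restoring onto a fresh ephemeral-disk box is then `backup-fetch` plus a
`restore_command` that calls `wal-fetch`.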

~~~
joevandyk
I recommend this as well. Performance is way more consistent on the
instance/ephemeral storage as well.

------
kmfrk
I find it really hard to hate Heroku with a status page that gorgeous. There
is something to be learnt about UX there, I think.

~~~
UnoriginalGuy
I'm not a Heroku customer, but I want to be after just looking at that status
page.

~~~
pennywyze
I found it to be pretty nonsensical, to be honest. Not sure what's wrong with
having a simple table with title/description/timestamp columns. I'm going to
that page to figure out why my site isn't working, not view some designer's
portfolio.

------
ftwinnovations
Yup, my sites are down. I'm sure it's something new and complex, like a ferret
or a leprechaun got into the 7th backup power switch which overloaded this and
that... as usual, something impressively difficult to plan for, but come on!
Virginia, you're killing me...

~~~
peterwwillis
That Virginia datacenter has the worst reputation among my peers for any
Northeast US datacenter. I highly recommend moving.

------
TylerE
So the lesson seems to be, if you want to do AWS, don't use US-EAST? Seems
like that's always where these big meltdowns are.

~~~
ojiikun
The lesson is: don't build a system that requires or advertises high
availability but runs in a single datacenter. Load balance. Elect leaders.
Batch, queue, and distribute. Fail-over. Fail fast. So long as we have
earthquakes, lightning, human malice, and human error, single AZs will fail.

~~~
MicahWedemeyer
We're paying 2x for the RDS multi-AZ setup that is supposed to failover in
these cases. So far no luck. Think I'll ask for a refund.

~~~
dangrossman
I gave up on Amazon (and ate the reserved instance fees) 2+ years ago over
this. I was paying through the nose for multi-AZ RDS and when EBS failed in
one zone, nothing failed over.

~~~
evanmoran
What did you switch to?

------
krenoten
How often do these happen? I've been using EC2 for about a week and it was a
surprise when my instance ground to a halt before my eyes. Here's the series
of events from my perspective:

- Nginx is no longer able to connect to uWSGI
- Instance successfully rebooted via EC2 console
- Celery tasks are not able to reach the RabbitMQ broker
- Halfway into a dump of PostgreSQL the session goes dead
- Green check on EC2 console for the instance, but right-click -> stop causes
  the instance to get stuck in a 'stopping' state, which persists to this
  moment.

~~~
UnoriginalGuy
EC2 seems to be about as reliable as a mid-range host provider. Which is to
say that they can do 99% but not really 99.99%.

They also seem to either have no issues at all or have everything go out at
once.
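For a sense of what those extra nines mean in practice, here's a quick
back-of-the-envelope calculation (plain arithmetic, nothing EC2-specific):

```python
# Allowed downtime per year at a given availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at this availability level."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines in (0.99, 0.999, 0.9999):
    print(f"{nines:.2%}: {downtime_minutes(nines):.0f} min/year")
# → 99.00%: 5256 min/year
#   99.90%: 526 min/year
#   99.99%: 53 min/year
```

So 99% allows roughly 3.5 days of downtime per year, while 99.99% allows under
an hour.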

------
gsibble
Well, it did rain today. So obviously AWS is down.

~~~
cryptoz
Cloud computing indeed.

------
twp
Also affecting Coursera: <https://twitter.com/coursera>

It's rather annoying as the first assignments for Alex Aiken's excellent
Compilers course are due today.

~~~
tocomment
Sorry to hear that. Maybe you can get an extension.

~~~
amalag
Yes they will usually extend it, like they did with the Godaddy outage.

~~~
twp
Thanks - I've already emailed the lecturer.

For info, Coursera seems to randomize the assignments. Specifically, in the
case of multiple choice tests it seems to pick a random subset of answers to
each question, and randomize the order. This is obviously a defense against
dumb copying.

What it does mean though is that you can't resubmit your own work. I've re-
done the assignment twice this evening (the questions change enough that you
have to think a lot, even when re-doing the same assignment), only to get 500
errors both times.

Anyhow, I'm sure it'll all work out and these temporary technical problems are
but minor hiccups in what is a fantastic course and a fantastic learning
system. It's very, very much appreciated.

~~~
rz2k
The choices can change between each time you _initiate_ a quiz. However, if
you complete part of a quiz, save, then return later, the choices will be the
same, but the order will be different.

Not all of the courses are set up with multiple choices though. In addition to
varying sets of correct responses, some answers are mathematical. For
instance, a question might ask ("What is %d plus %d", a, b),
then have choices involving the quantities: ("%d", a+1), ("%d", a*b), ("%d",
a+b), etc. with a different random quantity for a and b each time you start a
quiz.

Interestingly, I saw one course where it had single-choice questions, with
more differences than simply right or wrong. For instance one answer gave
100%, another 80%, and another 0% of the points available for that question.
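The parameterized-question scheme described above is easy to sketch. A toy
version in Python (entirely my own illustration, not Coursera's actual
implementation):

```python
import random

def make_addition_question(rng: random.Random):
    """Build 'What is a plus b?' with shuffled distractors, fresh a/b each time."""
    a, b = rng.randint(1, 20), rng.randint(1, 20)
    correct = a + b
    # Distractors derived from the operands, like the a+1 / a*b examples above;
    # building a set first drops any accidental duplicates before shuffling.
    choices = list({correct, a + 1, a * b, correct + 10})
    rng.shuffle(choices)
    return f"What is {a} plus {b}?", choices, correct

# Each quiz attempt gets its own generator, so the operands, the distractors,
# and the answer order all vary between attempts.
question, choices, answer = make_addition_question(random.Random())
```

Each call produces different operands and a different choice order, which
matches the behavior described above: re-doing the "same" quiz still makes you
work the problems again.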

------
ergo14
You know guys, this always makes me wonder... so many of you experience issues
because of the cloud, why not go dedicated at Hetzner or whatever your
bare-metal provider is?

~~~
whileonebegin
Ran a remote dedicated server to host apps for several years. While it was an
enlightening experience, it was a lot of sleepless nights and high anxiety.
Traffic spikes, OS updates, security, and plain old downtime. I've used
shared, VPS, dedicated, and cloud hosting for apps. They all have their fair
share of downtime, but I like shared and cloud hosting the best. It allows me
to focus on development, rather than running the server.

~~~
zorlem
So far I haven't solved the problem of keeping a non-trivial service without
any interventions. How does running your servers and services on a PaaS
provider solve the need for OS updates and security? What are your typical
architecture and operations like?

I personally think that OS updates and security fixes/hardening are crucial in
all cases if you're running any production site (and even for the dev
environment). Granted, some of the colocation providers don't offer services
like AWS VPC (and the associated SecurityGroups and ACLs) but that's pretty
rare nowadays. In all cases, an Ops or Sysadmin person is quite important
for the stability of your business.

The only big advantage I see about using PaaS is the rapid and flexible
provisioning of new instances in case you need them. That's why the best
solution to me, so far, has been running a combination of bare-metal servers
with optional provisioning of additional cloud instances if you need them. The
price difference can go to orders of magnitude (even if you factor in the
costs of paying somebody to be a system administrator and take care of your
servers).

 _Edit: added the last paragraph_

------
mcos
"11:03 AM PDT We are currently experiencing connectivity issues and degraded
performance for a small number of RDS DB Instances in a single Availability
Zone in the US-EAST-1 Region."

Given that my sites are deployed using Multi-AZ RDS instances and yet they're
still down, this takes the cake a little.

~~~
matt2000
Are you also on Heroku? Our RDS is also multi-AZ and is actually up, but
Heroku bit the bullet so it doesn't matter.

~~~
mcos
No, just straight up EC2 and RDS. It seems not all the RDS instances are
affected, but it's just frustrating that they all haven't failed over.

------
fallous
As an AWS user since 2009, I can only imagine the tremendous difficulties of
running such a service but given the stability of my long-running instances in
US West vs East, I really do sometimes wonder if Mr Bean is on staff in
Virginia.

~~~
jrochkind1
I wonder if US East just has a LOT more customers.

------
gfodor
Just a friendly reminder that AMZN now runs a large chunk of the Internet.

------
corford
And this is why my new venture will be going pinboard.in's route and hosting
everything on our own machines. More work for me but vastly cheaper and
hopefully a lot more reliable.

~~~
joevandyk
Just don't use EBS. Most everything else is pretty stable.

~~~
elliottcarlson
Even then, there seem to be issues with ELB currently.

~~~
dialtone
ELBs unfortunately use EBS so they are always affected in these cases.

------
bifrost
I checked out the AWS status page; it says EBS is busted in NoVA, yet the
status is still green. I think someone needs to fix their UI problem state...

~~~
AtTheLast
All they had was a little information icon by the green circle with a hover
message letting you know what the problem was. They need to make that circle
red!

------
ericcholis
I know it's been said, but the Heroku status page is the easiest to use. Very
effective use of a timeline.

------
Ambadassor
Reddit is down as well

~~~
randartie
was down; somehow they recovered pretty quickly

~~~
endersshadow
Still down for me.

------
wyck
This is bullshit, so many critical sites offline.

Talk about having all your eggs in one basket.

------
krenoten
The Wikipedia page for EC2 lists a couple of similar incidents at the Northern
Virginia datacenter. Victims of the previous NVa outages included Reddit,
Quora, SpringPad, FourSquare, Netflix, etc...

Many of those companies are down right now... why didn't they learn?
FourSquare seems to be up - although slow. I can load Quora but didn't try to
log in. Netflix, Reddit, SpringPad, seem to be down for me.
[http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#Is...](http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#Issues)

~~~
dangrossman
Learning doesn't magically triple your hosting budget or your ops team
headcount. Sometimes 99.9% availability is all you can afford... or all that
makes sense to do the work for. Netflix didn't go down, though; I've been
watching TV on it throughout this incident.

~~~
krenoten
True, and learning on a Monday morning is certainly easier than learning
during peak hours for some of these companies. It does lead one to reexamine
the value of another 9, however. Out of curiosity, does it often cost triple
to shift half of the instances/LB's/etc... to another datacenter and implement
the proper consistency mechanisms?

------
jrochkind1
While it's more frustrating to have no control over the reliability, I suspect
that most companies could not achieve the level of reliability from AWS by
self-hosting, at anywhere close to the cost (inc. stafftime). Nothing is 100%,
and getting as close to 100% as you can gets exponentially more expensive the
closer you get.

Of course, this could change if AWS's reliability trends downward. But it's
never going to be 100%, and neither is anything you do yourself.

~~~
dangrossman
I don't think this is true. US-East's incident history over the past 3 years
is terrible. If you had a completely non-redundant set of rented servers
sitting in Softlayer or Rackspace's data center over the same period, even
factoring in expected hardware failures over the time period (replacing a hard
drive or two, probably), you'd have had better availability than even a multi-AZ
deployment with failover in AWS US-East.

I've got heavily used servers at Softlayer that haven't had more than a few
minutes of downtime in 4+ years, and have never been rebooted. During that
same time, I ran a site out of US-East for two years (EC2 + EBS + Multi-AZ RDS
+ ELB), and had more downtime and spent more admin time working around
significant issues with Amazon. Amazon neither saved me time nor money, nor
did it provide better availability.

~~~
jrochkind1
I wonder if anyone's been keeping and publishing statistics on exactly how
often AWS US East has been down. Perhaps it's more than I thought. Of course,
most failures haven't been affecting _all_ of US East, so maybe what would be
interesting to look at is just _some_ customer's data on exactly how much
downtime they had in the past 3 years or whatever. Without knowing what the
uptime/downtime was, hard to say how terrible it was, or how it compares to
other options.

------
durpleDrank
Been getting crap loads of alerts stemming from their North Virginia data
centre. Got 9 of these yesterday ...

"You are receiving this email because your Amazon CloudWatch Alarm "awsec2-WM-
Live1-i-8747ece1-High-CPU-Utilization" in the US - N. Virginia region has
entered the ALARM state, because "Threshold Crossed: 1 datapoint (93.13) was
greater than or equal to the threshold (85.0)." at ."

Maybe their data centre was overrun by mutants?

------
knodi
-_- I'm really beginning to hate you ec2/ebs.

------
SwaroopH
I was planning to launch our startup in a few hours. Oh well.

~~~
johncoogan
Pivot to a PaaS with a focus on uptime?

~~~
flyt
Or build your application with the characteristics of the underlying platform
in mind. Nobody would run a DB server on a single spindle drive since
occasionally they fail. Accordingly, nobody should deploy non-multi-az AWS
applications because occasionally they fail.

~~~
SwaroopH
As a matter of fact, our app is using multiple AZs, not just for instances but
also for RDS. As of now, I can't reach anything, including the load balancer.

~~~
flyt
All I've ever heard and seen in practice is that RDS should generally be
avoided for production usage. Its opaque reliance on EBS is its downfall.

~~~
SwaroopH
Possibly, but there are bigger companies using RDS in production – also, the
multi-AZ feature has worked out better than my own custom setup. Unfortunately,
a lack of resources prevents me from rolling out my own setup on bare metal.

------
jpdoctor
Anyone calculated how many 9's Heroku is down to this year?

~~~
bgentry
We do not post a yearly number right now, but the previous 1 & 3 months are
prominently displayed on our status site: <https://status.heroku.com/>

------
aaronbrethorst
Interesting. My Heroku-hosted site seems to be doing fine:
<http://www.cocoacontrols.com>

~~~
IheartApplesDix
Heroku has implemented AWS services correctly, most websites can't afford to
do that.

~~~
taligent
Well clearly they have a lot of work still to do:

<https://status.heroku.com/>

Complete outage of an AZ shouldn't ever take services offline.

------
BIackSwan
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single
Availability Zone in the US-EAST-1 Region.

Hopefully recovering in the next hour.

------
tarkeshwar
Not even able to ssh to our AWS instance. The AWS console shows normal. Are others
able to ssh?

~~~
ubercore
SSH issues are isolated to instances that are EBS-backed. Instance-storage-backed
instances are working OK for SSH.

------
clientbiller
Amazon hosts so many big sites that when it goes down it's a huge disaster.

------
matt2000
Yeah, we've had failovers on RDS and Heroku is having trouble as well.

------
alanbyrne
Affecting PHPFog/Appfog too - I think their MySQL services.

------
pearle
I'm having issues trying to access the AWS web console...

------
pearle
I'm affected. Site is down, no servers are accessible.

------
erikcw
I'm also having trouble accessing RedisToGo.

------
bryanh
Us too. RDS bit the bullet for us.

------
jonny_eh
At least Netflix is still up.

------
colinhowe
I now regret using EBS boot.

------
johncoogan
Skynet is coming online.

