

AWS Elastic Block Store issues in one AZ in us-east-1 - cperciva

The Tarsnap website went offline at 1:05 PDT and attempts to launch a new instance in the same availability zone have failed. I'm also seeing the EC2 Management Console getting very slow -- possibly because lots of other people are logging in to investigate their failing systems.

As of 1:40 PDT I'm seeing EBS requests served and the Tarsnap website is back online.

Amazon's status page reports:

1:22 PM PDT We are investigating degraded performance for some volumes in a single AZ in the US-EAST-1 Region

1:29 PM PDT We are investigating degraded performance for some EBS volumes and elevated EBS-related API and EBS-backed instance launch errors in a single AZ in the US-EAST-1 Region.

RDS and ELB are also now (1:40 PM PDT) showing "investigating connectivity issues". Sounds like there was a core networking outage in that AZ.

2:21 PM PDT We have identified and fixed the root cause of the performance issue. EBS backed instance launches are now operating normally. Most previously impacted volumes are now operating normally and we will continue to work on instances and volumes that are still experiencing degraded performance.
======
alrs
It used to be that when EBS wiped out, the failure didn't respect the boundary
between availability zones. Every AZ in a region would be lost in tandem.

If they have managed to keep this incident confined to one AZ without trashing
the whole region they have made significant progress.

~~~
bgentry
or maybe it just wasn't as bad as previous incidents :)

------
jaytaylor
This is the kind of thing you're susceptible to when you use AWS products that
start with E: EBS, ELB, etc..

~~~
nieksand
At a minimum, you should use two availability zones if you care about high
availability on AWS. The vast majority of issues (including the current EBS
hiccup) tend to be limited to one AZ.

For us, today's outage triggered a bunch of alerts and knocked 1/4 of our
US-East attribution servers offline (we're spread across four zones there),
but otherwise caused no impact. Mostly we just acked alerts and twiddled our
thumbs waiting for the situation to resolve.

------
helper
If you care about uptime you should really get off EBS (and services that
depend on it, like ELB and RDS). Having a network between your disks and your
CPU makes outages like this one inevitable.
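One way to audit your exposure is to check which instances boot from EBS rather than instance store. As a sketch, this helper filters the dict shape returned by EC2's DescribeInstances API (the sample data here is hypothetical; a real call would use boto3's `ec2.describe_instances()`):

```python
def ebs_backed_instances(describe_instances_response):
    """Return IDs of instances whose root device is EBS, given a dict
    in the shape returned by the EC2 DescribeInstances API."""
    ids = []
    for reservation in describe_instances_response["Reservations"]:
        for inst in reservation["Instances"]:
            if inst.get("RootDeviceType") == "ebs":
                ids.append(inst["InstanceId"])
    return ids

# Hypothetical sample response for illustration only.
sample = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0001", "RootDeviceType": "ebs"},
    {"InstanceId": "i-0002", "RootDeviceType": "instance-store"},
]}]}
print(ebs_backed_instances(sample))  # ['i-0001']
```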

~~~
akurilin
What do you recommend as an alternative for hosting large DBs?

~~~
insaneirish
Joyent. Your compute and disks are colocated. Separation of compute and disks
is a sure way to cause pain.

On failure:

[http://www.joyent.com/blog/on-cascading-failures-and-amazons-elastic-block-store/](http://www.joyent.com/blog/on-cascading-failures-and-amazons-elastic-block-store/)

[http://www.joyent.com/blog/network-storage-in-the-cloud-delicious-but-deadly/](http://www.joyent.com/blog/network-storage-in-the-cloud-delicious-but-deadly/)

[http://www.joyent.com/blog/magical-block-store-when-abstractions-fail-us/](http://www.joyent.com/blog/magical-block-store-when-abstractions-fail-us/)

~~~
akurilin
I can't easily get a feel for Joyent's pricing. Is it comparable to or cheaper
than AWS?

------
aren55555
This is currently causing Heroku to have issues too.
[https://status.heroku.com/incidents/548](https://status.heroku.com/incidents/548)

~~~
jdleesmiller
Now reporting as resolved... Except that my database is still down. Sigh. And
it's not just me -- a few others have tweeted at @herokustatus.

~~~
bgentry
Hi, we usually track individual database issues via outbound support tickets
opened on your account. The exception is for widespread outages, but this one
was actually fairly small in scope.

If you didn't receive a support notification, please contact
support@heroku.com so we can figure out why.

~~~
jdleesmiller
Thanks. I've now done this and referred some others to this comment.

------
josephpmay
Is Instagram still on Amazon servers?

~~~
flyt
Yes.

~~~
josephpmay
I wonder if this outage will encourage them to migrate over to Facebook's
servers.

~~~
aioprisan
Doubt it, since they probably set up their infrastructure properly with a
multi-AZ setup.

~~~
nowarninglabel
I was consistently getting nginx 504s on their site during the outage window.
Perhaps a coincidence, but it seems likely to be related.

