Hacker News
AWS Elastic Block Store issues in one AZ in us-east-1
51 points by cperciva on Aug 25, 2013 | hide | past | favorite | 28 comments
The Tarsnap website went offline at 1:05 PDT and attempts to launch a new instance in the same availability zone have failed. I'm also seeing the EC2 Management Console getting very slow -- possibly because lots of other people are logging in to investigate their failing systems.

As of 1:40 PDT I'm seeing EBS requests served and the Tarsnap website is back online.

Amazon's status page reports:

1:22 PM PDT We are investigating degraded performance for some volumes in a single AZ in the US-EAST-1 Region

1:29 PM PDT We are investigating degraded performance for some EBS volumes and elevated EBS-related API and EBS-backed instance launch errors in a single AZ in the US-EAST-1 Region.

RDS and ELB are also now (1:40 PM PDT) showing "investigating connectivity issues". Sounds like there was a core networking outage in that AZ.

2:21 PM PDT We have identified and fixed the root cause of the performance issue. EBS backed instance launches are now operating normally. Most previously impacted volumes are now operating normally and we will continue to work on instances and volumes that are still experiencing degraded performance.

It used to be that when EBS wiped out, it didn't respect the boundary between availability zones: every AZ in a region would be lost in tandem.

If they have managed to keep this incident confined to one AZ without trashing the whole region they have made significant progress.

or maybe it just wasn't as bad as previous incidents :)

This is the kind of thing you're susceptible to when you use AWS products that start with E: EBS, ELB, etc.

At a minimum, you should use at least two availability zones if you care about high-availability on AWS. The vast majority of issues (including the current EBS hiccup) tend to be limited to one AZ.
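A toy sketch of the idea (hypothetical helper, not any particular AWS SDK): spreading a fleet round-robin across zones caps the blast radius of a single-AZ outage at roughly 1/len(zones) of capacity, which is exactly the "1/4 of our servers" experience described below.

```python
from itertools import cycle

def assign_zones(num_instances, zones):
    """Round-robin instances across availability zones so a single-AZ
    outage takes out at most ceil(num_instances / len(zones)) of them."""
    zone_iter = cycle(zones)
    return [next(zone_iter) for _ in range(num_instances)]

# Four instances across two zones: losing one AZ costs half the fleet,
# not all of it.
placement = assign_zones(4, ["us-east-1a", "us-east-1b"])
```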

For us, today's outage triggered a bunch of alerts and knocked 1/4 of our USE attribution servers offline (we're in four zones there), but otherwise caused no impact. Mostly just acked alerts and twiddled thumbs waiting for the situation to resolve.

No, this is the kind of thing that happens with a poor application architecture (i.e., using one AZ). Sad that companies still can't get this right.

Where does EC2 fall in that?

If you want a safer EBS, create a RAID on top of it and use snapshots.

If you care about uptime you should really get off EBS (and services that depend on it, like ELB and RDS). Having a network between your disks and your CPU makes outages like this one inevitable.

What do you recommend as an alternative for hosting large DBs on?

I can't easily get a feel for Joyent's pricing. Is it comparable to or cheaper than AWS?

Use ephemeral disks on EC2 and a good distributed database like Cassandra or Riak. These scale well as your data grows and allow you to lose individual nodes without losing data (in sane configurations).
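The "lose individual nodes without losing data" property comes from quorum replication. A minimal sketch of the arithmetic (generic Dynamo-style N/R/W parameters, not Cassandra's or Riak's actual API):

```python
def quorums_overlap(n, r, w):
    """Reads see the latest write when read and write quorums intersect."""
    return r + w > n

def write_tolerance(n, w):
    """Number of replicas that can be down while writes still reach w nodes."""
    return n - w

# A common "sane configuration": 3 replicas with quorum reads and writes.
# One node can vanish (e.g. with its AZ) without losing data or availability.
n, r, w = 3, 2, 2
```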

+1 to the use of Riak. It is great.

But if you are stuck using PostgreSQL or similar, put your live servers on ephemeral instances and put one (otherwise unused) slave on EBS so you can snapshot it for backups. And you need at least two ephemeral systems so you have a slave to fail over to.

No need to use EBS/snapshots for backups.

Use wal-e and/or pg_receivexlog. Ideally run pg_receivexlog on a server at a different provider, so that if Amazon cancels your account your data is recoverable up to the second.
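Roughly what that setup looks like (hostnames, paths, and the role name are placeholders; check the wal-e README and the pg_receivexlog docs for the exact invocations for your PostgreSQL version):

```
# postgresql.conf on the master: ship each completed WAL segment to S3 via wal-e
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# on a machine at a different provider: stream WAL as it is written
pg_receivexlog -h master.example.com -U replication -D /srv/wal-archive
```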

Define 'large'. Up to 80T can be done in one box these days with up to 32 cores right next to it. Not cheap but fairly reliable. You'll still need to replicate off-site.

So are you thinking along the lines of Hetzner?

Sure. I've never used the cloud, I simply can't make the business case for it. Traffic or storage will kill it every time.

This is currently causing Heroku to have issues too. https://status.heroku.com/incidents/548

Now reporting as resolved... Except that my database is still down. Sigh. And it's not just me -- a few others have tweeted at @herokustatus.

Hi, we usually track individual database issues via outbound support tickets opened on your account. The exception is for widespread outages, but this one was actually fairly small in scope.

If you didn't receive a support notification, please contact support@heroku.com so we can figure out why.

Thanks. I've now done this and referred some others to this comment.

This affected me as well; I had to provision a new database and promote it to master in order to bring my app back online.

I had to do the same and Heroku support is non-responsive. This is beyond frustrating.

Is Instagram still on Amazon servers?


I wonder if this outage will encourage them to migrate over to Facebook's servers.

Doubt it, since they probably set up their infrastructure properly with a multi-AZ setup.

I was getting nginx 504s on their site consistently during this outage. Perhaps a coincidence, but it seems likely to be related.
