Hacker News
Amazon S3+SQS are down, bringing down Scribd, Docstoc, Twitter, SmugMug, JungleDisk (twitter.com)
77 points by alexwg 568 days ago | 31 comments


15 points by SwellJoe 568 days ago | link

This has been the problem with AWS all along: Aggregate downtime at good hosting providers is measured in minutes, or even seconds, per year. Downtime at AWS has historically been measurable in days per year. This level of reliability puts it well into the bottom ranks of hosting providers. We're talking about the dregs of the industry here...the hosts who have a single cheap Cogent pipe running into a single cage of machines with no power backup and no backup pipe or infrastructure redundancy. This is the sole reason we don't recommend AWS to our customers, and why we don't use it ourselves for any vital services. We want to like it, and recommend it, and we have quite a bit of software that works with it, that we enjoy selling to people. But, the reliability just isn't there, and it has been a recurring problem since the service launched.

-----

2 points by modoc 567 days ago | link

Absolutely. I'm pretty shocked by the amount of downtime they've been having.

I've been hosting with various providers for almost 10 years. I'm now with SoftLayer and VERY happy. In those 10 years, I think I've had less combined downtime than AWS has had in the last 6 months.

The big advantage to running your apps in a nebulous "cloud", aside from the scaling up-down flexibility, is that in theory the difficulty in running a stable data center (or ideally a set of load balanced, geographically diverse data centers) is taken care of for you. If the reality is it's a trade-off between getting easy scaling, and losing decent uptime numbers, I'll take the "hassle" of adding/dropping servers at SoftLayer which are actually UP, and which I have good visibility into, any day.

Hopefully they'll get to that point eventually, but for now, I'm staying far far away.

-----

13 points by staunch 568 days ago | link

Amazon should give their sysadmins bonuses tied directly to uptime.

-----

14 points by gaius 568 days ago | link

Yes and no. There are two sides to uptime - there's that the infrastructure is up, and there's that the application is running. It sucks to be a sysadmin bonused on service uptime when the network is up, the servers are up, the database is up, but the application won't stay up and as a sysadmin there's actually nothing you can do about it; all you can do is wait for the developers to patch it.

-----

8 points by tlrobinson 568 days ago | link

Well clearly the bonuses should be tied to the uptime of the particular system the employees/managers are responsible for.

-----

7 points by babul 568 days ago | link

I've been working on a private cloud using http://eucalyptus.cs.ucsb.edu/ that maps to Amazon in case of issues like this (though Amazon was supposed to be the backup).

Has anyone else been doing the same? What have you been using?

-----

5 points by vaksel 568 days ago | link

I like how techcrunch is yet to make a post about this

-----

9 points by fallentimes 568 days ago | link

They were too busy covering female bloggers featured in Playboy.

-----

5 points by nickb 568 days ago | link

9:05 AM PDT We are currently experiencing elevated error rates with S3. We are investigating.

9:26 AM PDT We're investigating an issue affecting requests. We'll continue to post updates here.

9:48 AM PDT Just wanted to provide an update that we are currently pursuing several paths of corrective action.

10:12 AM PDT We are continuing to pursue corrective action.

10:32 AM PDT A quick update that we believe this is an issue with the communication between several Amazon S3 internal components. We do not have an ETA at this time but will continue to keep you updated.

11:01 AM PDT We're currently in the process of testing a potential solution.

11:22 AM PDT Testing is still in progress. We're working very hard to restore service to our customers.

-----

4 points by danw 568 days ago | link

At least twitter is smart and only uses S3 for profile images. The service can survive without S3. Tumblr images and audio posts are also affected by the downtime.

-----

4 points by tx 568 days ago | link

Bezos likes the analogy about Amazon services being "electricity" for other businesses, i.e. you don't have to own a generator if you operate a restaurant (as they used to back in the day) - just "hook up to the grid" and you're all set.

Funny analogy, since all data centers DO have their own generators: they're not restaurants.

-----

9 points by demandred 568 days ago | link

data center = electricity for startups. why is this analogy funny?

Instead of running their own generator (server and storage), for those startups who don't need to, they can use Amazon's power, AWS.

am I missing something here?

-----

2 points by gscott 568 days ago | link

Yes, you are missing the part where the electricity cuts off, you don't know why, you can't do anything about it, and because you relied so heavily on your electricity provider you didn't set up a plan to make your own.

If your site is your personal blog or something not important, then downtime might not be a big deal. If you don't have the money for a backup electricity provider then you have to take your chances also.

-----

7 points by emmett 568 days ago | link

Yes, that's exactly so. Remember when there was a whole day of downtime for everyone in the colo in downtown SF?

Running things yourself is no guarantee of uptime.

The only real fix is to maintain fully redundant systems, which is extremely expensive. Otherwise, put up with downtime sometimes, because no other system will fully protect you.

-----

6 points by demandred 568 days ago | link

so the analogy still works. AWS=electricity for startups. For restaurants the power does sometimes go out too. If it's life critical, like a hospital, I'm sure you'll be fully redundant.

-----

4 points by tlrobinson 568 days ago | link

SQS too. http://status.aws.amazon.com/

-----

2 points by alexwg 568 days ago | link

Thanks, added that!

-----

3 points by alexwg 568 days ago | link

SmugMug too, apparently.

-----

4 points by tlrobinson 568 days ago | link

Interesting, since they like to talk a lot about how they expect S3 to go down once in a while and supposedly can handle it:

http://blogs.smugmug.com/don/category/amazon/

-----

3 points by tom_rath 568 days ago | link

It's back for us (~7:15 pm Eastern).

Any estimates on total time of the outage?

-----

3 points by tlrobinson 568 days ago | link

S3 and SQS seemed to first go down around 9:00 am PDT, and it's now 4:45 pm PDT, so about 7-8 hours... not too good. At least it happened to be a Sunday.
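
To put that estimate in perspective, here is a rough sketch of the arithmetic. The two times come from the comment above; the 99.9% yearly target and the 365-day year are assumptions used only for comparison, not the terms of any actual Amazon SLA.

```python
from datetime import timedelta

# Times taken from the comment above, expressed as offsets from midnight PDT.
outage_start = timedelta(hours=9)             # around 9:00 am PDT
outage_end = timedelta(hours=16, minutes=45)  # 4:45 pm PDT

outage_hours = (outage_end - outage_start).total_seconds() / 3600

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

# Assumption: a hypothetical 99.9% yearly availability target.
yearly_budget_hours = (1 - 0.999) * HOURS_PER_YEAR
budget_share = outage_hours / yearly_budget_hours

print(f"outage so far: {outage_hours:.2f} h")
print(f"downtime allowed per year at 99.9%: {yearly_budget_hours:.2f} h")
print(f"share of that budget used: {budget_share:.0%}")
```

Under those assumptions, a single outage of this length uses up most of a year's allowance at 99.9%.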

-----

3 points by akd 568 days ago | link

I run my own server for a hobby project and it goes down as frequently as S3 for various reasons.

-----

3 points by mechanical_fish 568 days ago | link

Oh, this is why Jungledisk is down.

-----

2 points by andyking 568 days ago | link

I did wonder why the Panoramio site they use for Google Maps photos was playing up. That explains it.

-----

1 point by sh1mmer 567 days ago | link

It's an interesting question of web services. If you depend on Amazon for your file storage, big table for your database, yahoo geo, etc., then your uptime figure is a product of the uptimes of those services.

This means that using 4 services that have a 99.9% SLA actually gives you an approx uptime of 99.6%. It doesn't sound like much, but as soon as you include something like Twitter you can really see the whole uptime graph skew.
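
A minimal sketch of that calculation, assuming the services fail independently and that the application needs all of them up at once. The function names are made up for illustration, and the 95% figure in the second example is an invented placeholder, not a measured uptime for any real service.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def combined_availability(availabilities):
    """Availability of an app that needs every dependency up at once.

    Assumes failures are independent, so the individual availabilities
    simply multiply together.
    """
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Four dependencies, each at 99.9%.
four_services = combined_availability([0.999] * 4)
print(f"four services at 99.9%: {four_services:.2%}")
print(f"downtime per year: {downtime_hours_per_year(four_services):.1f} h")

# Add one much less reliable dependency (95% is a made-up figure).
with_flaky = combined_availability([0.999] * 4 + [0.95])
print(f"with one 95% dependency: {with_flaky:.2%}")
```

The weakest dependency dominates the result: one unreliable service costs far more uptime than the four 99.9% services combined.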

-----

1 point by hello_moto 568 days ago | link

Dropbox went down as well. I like Dropbox a lot because it's so simple to use. Unfortunately, they're what some people call an "Amazon S3 re-seller". Dropbox's heartbeat depends on Amazon.

-----

1 point by pchristensen 567 days ago | link

Fortunately, with Dropbox you still have the latest version (pre-AWS crash) on your machine. It's much more usable in a crash than web-only services.

-----

1 point by cpinto 568 days ago | link

and this is the reason why you let other people pay to test the infrastructure "cloud". it's hard to justify not being able to do _anything_ when any AWS service goes down, as smugmug must be figuring out by now.

here's a piece of advice: start by leasing a couple of $75 USD per month servers. if you can, buy instead of lease. if you go bust, you can sell the hardware on ebay, whereas with AWS you can't do any of that; it's just money you're throwing away for 0 assets. AWS still needs to be managed, you still need sysadmins available 24/7 so you won't save any money there. the only thing AWS has going for it is provisioning. be smart and take advantage of that (e.g. have your own physical infrastructure and be able to send some of the load the way of AWS if and when you need to).

-----

5 points by bpedro 568 days ago | link

Or... use multiple cloud infrastructure solutions, creating a fail-over in case one of them goes down. Think of this as a "Cloud Balancer".
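
A rough sketch of what that fail-over could look like on the read path. Everything here is hypothetical: the provider functions are stand-ins for real storage clients, and the sketch assumes each object has already been written to every provider, which is the hard part in practice.

```python
class AllProvidersFailed(Exception):
    """Raised when every configured provider returned an error."""


def fetch_with_failover(key, providers):
    """Try each (name, fetch) pair in order and return the first success.

    `providers` is a list of (name, callable) pairs; each callable takes a
    key and either returns the object or raises an exception.
    """
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # broad on purpose: any failure triggers fail-over
            errors.append((name, exc))
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for two storage back ends.
def primary_store(key):
    raise ConnectionError("primary provider is down")


def backup_store(key):
    return f"contents of {key}"


used, data = fetch_with_failover(
    "avatar.png",
    [("primary", primary_store), ("backup", backup_store)],
)
print(used, data)
```

Trying providers in a fixed order keeps the logic simple; a real balancer would also need health checks, timeouts, and a way to keep the copies in sync.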

-----

4 points by trezor 567 days ago | link

And here I thought the whole cloud infrastructure was supposed to provide its own redundancy.

If you need to first set up your site to work with a cloud, and then need to add a cloud balancer to guarantee uptime, maybe a regular network load balancer and old-fashioned solutions might be a better option.

At least then you have a tried and tested solution, not to mention you got it all under your control so things can actually be fixed.

-----

-2 points by zandorg 568 days ago | link

Boingboing.net went down for me, though I had no problems with Amazon!

-----



