Amazon S3+SQS are down, bringing down Scribd, Docstoc, Twitter, SmugMug, JungleDisk (search.twitter.com)
78 points by alexwg on July 20, 2008 | 31 comments



This has been the problem with AWS all along: Aggregate downtime at good hosting providers is measured in minutes, or even seconds, per year. Downtime at AWS has historically been measurable in days per year. This level of reliability puts it well into the bottom ranks of hosting providers. We're talking about the dregs of the industry here...the hosts who have a single cheap Cogent pipe running into a single cage of machines with no power backup and no backup pipe or infrastructure redundancy. This is the sole reason we don't recommend AWS to our customers, and why we don't use it ourselves for any vital services. We want to like it, and recommend it, and we have quite a bit of software that works with it, that we enjoy selling to people. But, the reliability just isn't there, and it has been a recurring problem since the service launched.


Absolutely. I'm pretty shocked by the amount of downtime they've been having.

I've been hosting with various providers for almost 10 years. I'm now with SoftLayer and VERY happy. In those 10 years, I think I've had less combined downtime than AWS has had in the last 6 months.

The big advantage to running your apps in a nebulous "cloud", aside from the scaling up-down flexibility, is that in theory the difficulty of running a stable data center (or ideally a set of load-balanced, geographically diverse data centers) is taken care of for you. If the reality is that it's a trade-off between getting easy scaling and losing decent uptime numbers, I'll take the "hassle" of adding/dropping servers at SoftLayer which are actually UP, and which I have good visibility into, any day.

Hopefully they'll get to that point eventually, but for now, I'm staying far far away.


9:05 AM PDT We are currently experiencing elevated error rates with S3. We are investigating.

9:26 AM PDT We're investigating an issue affecting requests. We'll continue to post updates here.

9:48 AM PDT Just wanted to provide an update that we are currently pursuing several paths of corrective action.

10:12 AM PDT We are continuing to pursue corrective action.

10:32 AM PDT A quick update that we believe this is an issue with the communication between several Amazon S3 internal components. We do not have an ETA at this time but will continue to keep you updated.

11:01 AM PDT We're currently in the process of testing a potential solution.

11:22 AM PDT Testing is still in progress. We're working very hard to restore service to our customers.


I've been working on a private cloud using http://eucalyptus.cs.ucsb.edu/ that maps to Amazon in case of issues like this (though Amazon was supposed to be the backup).

Has anyone else been doing the same? What have you been using?


Amazon should give their sysadmins bonuses tied directly to uptime.


Yes and no. There are two sides to uptime - there's that the infrastructure is up, and there's that the application is running. It sucks to be a sysadmin bonused on service uptime when the network is up, the servers are up, the database is up, but the application won't stay up and as a sysadmin there's actually nothing you can do about it; all you can do is wait for the developers to patch it.


Well clearly the bonuses should be tied to the uptime of the particular system the employees/managers are responsible for.


Bezos likes the analogy about Amazon services being "electricity" for other businesses, i.e. you don't have to own a generator if you operate a restaurant (as they used to back in the day) - just "hook up to the grid" and you're all set.

Funny analogy, since all data centers DO have their own generators: they're not restaurants.


data center = electricity for startups. why is this analogy funny?

Instead of running their own generator (server and storage), for those startups who don't need to, they can use Amazon's power, AWS.

am I missing something here?


Yes, you are missing the part where the electricity cuts off, you don't know why, you can't do anything about it, and because you relied so heavily on your electricity provider you didn't set up a plan to make your own.

If your site is your personal blog or something not important, then downtime might not be a big deal. If you don't have the money for a backup electricity provider then you have to take your chances also.


Yes, that's exactly so. Remember when there was a whole day of downtime for everyone in the colo in downtown SF?

Running things yourself is no guarantee of uptime.

The only real fix is to maintain fully redundant systems, which is extremely expensive. Otherwise, put up with downtime sometimes, because no other system will fully protect you.


so the analogy still works. AWS=electricity for startups. For restaurants the power does sometimes go out too. If it's life critical, like a hospital, I'm sure you'll be fully redundant.


At least twitter is smart and only uses S3 for profile images. The service can survive without S3. Tumblr images and audio posts are also affected by the downtime.



Thanks, added that!


SmugMug too, apparently.


Interesting, since they like to talk a lot about how they expect S3 to go down once in a while and supposedly can handle it:

http://blogs.smugmug.com/don/category/amazon/


I like how techcrunch has yet to make a post about this


They were too busy covering female bloggers featured in Playboy.


I run my own server for a hobby project and it goes down as frequently as S3 for various reasons.


It's back for us (~7:15 pm Eastern).

Any estimates on total time of the outage?


S3 and SQS seemed to first go down around 9:00 am PDT, and it's now 4:45 pm PDT, so about 7-8 hours... not too good. At least it happened to be a Sunday.
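To put that estimate in yearly terms, here is a rough sketch of the arithmetic. It assumes the 9:00 am and 4:45 pm PDT figures above are the real start and end of the outage, and that this is the only downtime all year; the function name is made up for illustration.

```python
from datetime import datetime, timedelta

def best_case_annual_uptime(start, end, days_in_year=366):
    """Upper bound on yearly uptime if this were the year's only outage.

    days_in_year defaults to 366 because 2008 was a leap year.
    """
    outage = end - start
    return outage, 1.0 - outage / timedelta(days=days_in_year)

# Times taken from the comment above (both PDT, so no timezone math needed).
outage, uptime = best_case_annual_uptime(
    datetime(2008, 7, 20, 9, 0),
    datetime(2008, 7, 20, 16, 45),
)
print(f"outage: {outage}, best-case annual uptime: {uptime:.3%}")
```

So a single outage of this length already rules out "three nines" for the year by itself, before counting any other incident.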


It's an interesting question of web services. If you depend on Amazon for your file storage, big table for your database, yahoo geo, etc., then your uptime figure is a product of the uptimes of those services.

This means that using 4 services that each have a 99.9% SLA actually gives you an approx uptime of 99.6%. It doesn't sound like much, but as soon as you include something like Twitter you can really see the whole uptime graph skew.
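The multiplication is easy to check. A minimal sketch, assuming the services fail independently and that the site needs all of them up at once (both are simplifying assumptions, and the function names are just for illustration):

```python
from math import prod

HOURS_PER_YEAR = 365 * 24

def composite_uptime(uptimes):
    """Uptime of a site that needs every listed service up at the same time.

    Assumes independent failures, so the individual uptimes simply multiply.
    """
    return prod(uptimes)

def downtime_hours_per_year(uptime):
    """Expected hours of downtime per (non-leap) year at a given uptime."""
    return (1.0 - uptime) * HOURS_PER_YEAR

four_services = composite_uptime([0.999] * 4)
print(f"combined uptime: {four_services:.2%}")
print(f"downtime per year: {downtime_hours_per_year(four_services):.1f} hours")
```

Adding a fifth dependency with a noticeably worse record, say a hypothetical 98%, is just one more factor in the list, and it drags the product down much faster than the four 99.9% services do.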


Oh, this is why Jungledisk is down.


and this is the reason why you let other people pay to test the infrastructure "cloud". it's hard to justify not being able to do _anything_ when any AWS goes down as smugmug must be figuring out by now.

here's a piece of advice: start by leasing a couple of $75 USD per month servers. if you can, buy instead of lease. if you go bust, you can sell the hardware on ebay whereas with AWS you can't do any of that it's just money you're throwing away for 0 assets. AWS still needs to be managed, you still need sysadmins available 24/7 so you won't save any money there. the only thing AWS has going for it is provisioning. be smart and take advantage of that (eg. have your own physical infrastructure and be able to send some of the load the way of AWS if and when you need to).


Or... use multiple cloud infrastructure solutions, creating a fail-over in case one of them goes down. Think of this as a "Cloud Balancer".
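A very rough sketch of what that could look like in code, purely as an illustration: the backends here are stand-in functions rather than real provider APIs, and `fetch_with_failover` and the backend names are invented for the example.

```python
def fetch_with_failover(key, backends):
    """Try each (name, fetch) backend in order; return the first success.

    `backends` is a list of (name, callable) pairs. Each callable takes a key
    and either returns the stored bytes or raises an exception.
    """
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means "try the next provider"
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all backends failed for {key!r}: {errors}")

# Hypothetical stand-ins for two storage providers.
def primary_store(key):
    raise ConnectionError("primary is down")

def secondary_store(key):
    return b"contents of " + key.encode()

used, data = fetch_with_failover(
    "avatar.png",
    [("primary", primary_store), ("secondary", secondary_store)],
)
print(used, data)
```

This only covers reads. Keeping writes in sync across providers is the hard part and is not addressed by the sketch.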


And here I thought the whole cloud infrastructure was supposed to provide its own redundancy.

If you need to first set up your site to work with a cloud, and then need to add a cloud balancer to guarantee uptime, maybe a regular network load balancer and old-fashioned solutions might be a better option.

At least then you have a tried and tested solution, not to mention you got it all under your control so things can actually be fixed.


I did wonder why the Panoramio site they use for Google Maps photos was playing up. That explains it.


Dropbox went down as well. I like Dropbox a lot because it's so simple to use. Unfortunately, they're what some people call an "Amazon S3 reseller". Dropbox's heartbeat depends on Amazon.


Fortunately, with Dropbox you still have the latest version (pre-AWS crash) on your machine. It's much more usable in a crash than web-only services.


Boingboing.net went down for me, though I had no problems with Amazon!



