Is this not a 'service disruption' situation? At the bottom of the page, the yellow icon is associated with 'performance issues'.
If there's one thing that's shocked me about AWS, it's the total failure to acknowledge the severity of service disruptions. Like the above case, or the fact that a 3-hour loss of connectivity is displayed on the service history as a green tick with a small 'i' box: http://oi46.tinypic.com/x5qtch.jpg
AWS needs to blatantly copy Heroku's status system, which is worlds better for people needing fast updates on their infrastructure.
https://status.heroku.com/ vs http://status.aws.amazon.com/
You can at least mitigate it by keeping the utilization of your instances low enough that the remaining zones can absorb the load if one zone fails. I suspect that to save money, many people are pushing utilization as high as possible.
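Back-of-the-envelope (the numbers here are purely illustrative, not anything AWS publishes): if you spread evenly across N zones and want to survive losing one, your aggregate load has to fit into the remaining N-1 zones' worth of capacity.

    # Rough capacity-planning sketch: survive the loss of one AZ.
    # Illustrative only -- plug in your own fleet size and load.

    def max_safe_utilization(num_zones: int) -> float:
        """Highest overall utilization that still fits after losing one zone."""
        return (num_zones - 1) / num_zones

    for n in (2, 3, 4):
        print(f"{n} zones: keep overall utilization below {max_safe_utilization(n):.0%}")
    # 2 zones: keep overall utilization below 50%
    # 3 zones: keep overall utilization below 67%
    # 4 zones: keep overall utilization below 75%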
I can't speak to why Heroku was impacted as much as it was. It could be that they have single points of failure, or run at a utilization level that makes losing a single zone difficult.
After Apple's legendary secrecy, Amazon's EC2 is a solid #2 in terms of impenetrability.
Linode, in comparison, has always been straight-forward, personal, and as honest as a company in that space can be.
What's perhaps odd is that Amazon's customer service for their material goods is usually superb.
It's pointless to complain. We've all seen before that Amazon can't keep whole regions up. If you rely on a region being up, you will have downtime and it's your fault.
I'm not an AWS customer, just reading their docs; please correct me if I'm wrong about any of this.
That is absolutely absurd. At what point did the common-sense solution to "unacceptable downtime on AWS" become "buy two of everything"?
We operate systems that sit on the pages of the top e-commerce companies in the world. We have 10 separate segments of clusters, operating in four AZ's in East, three AZ's in West-1 and two AZ's in West-2. When this outage happened, the servers that were impacted in East were removed from our DNS, and within 9 minutes the impact of this event on their sites was eliminated.
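The mechanics are nothing exotic; it's roughly this kind of loop (the IPs and the publish_dns_records helper below are made-up placeholders, not our actual tooling; in practice you'd call your DNS provider's API, e.g. Route 53):

    # Sketch of DNS-based failover: drop unhealthy endpoints from the record set.
    # Endpoint IPs and publish_dns_records() are hypothetical placeholders.
    import urllib.request

    ENDPOINTS = {
        "203.0.113.10": "us-east",    # documentation/example IPs only
        "203.0.113.20": "us-west-1",
        "203.0.113.30": "us-west-2",
    }

    def is_healthy(ip: str, timeout: float = 2.0) -> bool:
        """A trivial HTTP health check; real checks would go much deeper."""
        try:
            urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout)
            return True
        except Exception:
            return False

    def publish_dns_records(ips: list[str]) -> None:
        """Placeholder: call your DNS provider's API (Route 53, NS1, ...) here."""
        print("publishing A records:", ips)

    healthy = [ip for ip in ENDPOINTS if is_healthy(ip)]
    if healthy:                      # never publish an empty record set
        publish_dns_records(healthy)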
At Quantcast we have physical servers in 14 cities. We use anycast to achieve site failovers in 6 seconds. Downtime for us would impact millions of websites, so we don't have downtime.
(The trend for regular ISPs is probably improving, except that mobile/carrier DNS is often particularly broken. It would be interesting to do monthly surveys of this.)
It's nontrivial to determine exactly when to drop the announcement. And be careful: if you are too eager to drop it, you may end up dropping it at more than one site at a time.
At first we used DNS with short TTLs, but TTLs are only advisory and are ignored by some implementations. We would see most traffic tail off within 10 minutes with a one-minute TTL, but it took many hours for all the traffic to migrate over to the new DNS. The folklore on going below one minute is that a huge percentage of implementations ignore sub-minute TTLs. Funny how much of the Internet's operation is passed along as folklore and not really known for sure.
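If you want to measure it instead of trusting folklore, a crude client-side probe works (substitute a record you've just changed for the example hostname):

    # Crude DNS-propagation probe: resolve a name repeatedly and log when the
    # answer changes, to see how long cached records really linger.
    import socket
    import time

    HOSTNAME = "www.example.com"   # substitute a record you have just changed
    last = None

    while True:
        try:
            ip = socket.gethostbyname(HOSTNAME)
        except socket.gaierror:
            ip = "resolution failed"
        if ip != last:
            print(time.strftime("%H:%M:%S"), HOSTNAME, "->", ip)
            last = ip
        time.sleep(10)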
Thanks for asking. Hacker News should be about sharing best practices and making the Internet a more reliable place.
The only useful/sane use case I can see for Amazon EC2 would be services like Heroku, which need to be able to automatically manage a truckload of VMs for their rapidly growing infrastructure. Doing that yourself is, I imagine, quite a headache unless you work closely with someone like Amazon or Rackspace.
Yes white boxes are cheap. Site negotiations, design, procurement, networking, operations, and maintenance are expensive in dollars and time. Personally I run "a bunch" of physical sites across the globe. It would be waaaay easier to be able to turn up rackspace/aws/google instances as needed.
You'd be surprised how many people that actually use EC2 think it is.
> Yes white boxes are cheap. Site negotiations, design, procurement, networking, operations, and maintenance are expensive in dollars and time.
It's called planning ahead of time. If you can't wait, here's a suggestion: use EC2 until your own setup is ready, then migrate.
All in all I don't mind whether people use EC2 for whatever reason. Just stating my opinion. I agree of course that in terms of "convenience" it has the upper hand: not having to wait for boxes to be added to data centers, and being able to spin up boxes in multiple regions through a single company/console. Maybe your use case does justify using EC2. Many other people's clearly do not (hence all the whining about the downtime, which they wouldn't have had if they had deployed to multiple AZs/Regions).
How do cloud services compare to a gym membership? Are you implying you can't get out of your AWS contract?
Wait for the dust to settle. We're all just going to be a bunch of Fonzies here.
EDIT: Looks like API access has been restored, so I'm cautiously optimistic about things working. Note though that some instances may have rebooted or be otherwise impacted so check your error logs.
EDIT2: Nope, ELB is still hosed. Continue to be skeptical.
Good luck, friends.
Rackspace's prices are insane. $1,314/mo for a cloud server with 30 GB of RAM, compared to $657/mo for 34 GB on AWS.
Plus, with AWS you can use reserved instances to get that cost down to $286/mo. Rackspace has no way to get the cost down.
That makes Rackspace cost over 4.5x more when comparing based on RAM.
Rackspace prices are insanely high and I can't wait to move off of them.
If your scenario is more complicated than a single server, you might find our tool useful to forecast your costs: https://my.shopforcloud.com/?guest=true (this link will create a guest account for you so you can play with it quickly)
disclaimer: I'm co-founder of ShopForCloud.com
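For anyone checking the math, here's the arithmetic using the prices quoted above (treat them as a snapshot; they change often):

    # Price comparison using only the figures quoted in this thread.
    rackspace = 1314 / 30     # $/GB RAM per month, 30 GB cloud server
    aws_ondemand = 657 / 34   # $/GB RAM per month, 34 GB instance
    aws_reserved = 286 / 34   # $/GB RAM per month, with a reserved instance

    print(f"Rackspace:     ${rackspace:.2f}/GB/mo")     # ~$43.80
    print(f"AWS on-demand: ${aws_ondemand:.2f}/GB/mo")  # ~$19.32
    print(f"AWS reserved:  ${aws_reserved:.2f}/GB/mo")  # ~$8.41
    print(f"Rackspace vs reserved: {rackspace / aws_reserved:.1f}x")  # ~5.2x

Comparing total monthly cost directly ($1,314 vs $286) gives about 4.6x; per GB of RAM it's a bit higher because the AWS box also has more memory.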
$657/mo will rent you eight 32G/i7 servers there.
If Europe is not acceptable then you should still look into American colos (e.g. LeaseWeb recently opened a US DC), which are not as cheap as Hetzner but will still give you 3-4 servers for the money.
...what do you mean lowbrow humor isn't allowed on hacker news?
So let me get this straight: the critical issue with not having electricity after a huge storm is that the A/C isn't working? And 100F/38C isn't even that hot, right?
When I got the frantic texts as EC2 first dropped offline, sure enough, the AWS status page was all green, but Twitter was alight with people talking about it.
I suspect a service disruption would have to be Godzilla.
AEP (local power company) says about 65% of customers in this area are w/o power. May be days before it's fully restored. Hope no one from the HN community got hurt.
Edit: I posted this from a computer in town. No power at my place so I can't respond to follow-up posts.
us-east-1 is a region, containing multiple AZ's.
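The zones inside it are named us-east-1a, us-east-1b, and so on. If you have credentials handy you can list them yourself; a quick sketch (this uses the newer boto3 client, so it's an assumption about your tooling; older boto has a different API):

    # List the availability zones inside a region (requires AWS credentials).
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone["State"])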
At the time it was like voodoo, and you had to triple-check your datastore actions, because they could fail for no reason on the backend.
I've been thinking about building a site with a Parse backend, and they're up, which is good to discover.
Second, or I believe third, power outage/loss of service for AWS in the past 10 days, if I'm not mistaken.
This is wild. I wonder what's going on at Amazon and if they're capable of handling this much usage in addition to having power issues, etc.
Instagram and Netflix are down from what I hear, and have been for a few hours. Now it makes sense: they're hosted on AWS.
UTC TIME STEP on the 1st of July 2012
A positive leap second will be introduced at the end of June 2012.
The sequence of dates of the UTC second markers will be:
2012 June 30, 23h 59m 59s
2012 June 30, 23h 59m 60s
2012 July 1, 0h 0m 0s
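For context on why this is quoted in an outage thread: a positive leap second inserts 23:59:60 UTC, a second that POSIX/Unix time simply can't represent, which is exactly the kind of edge case that trips up software. A quick illustration in plain Python (nothing AWS-specific):

    # Unix time has no slot for the leap second 2012-06-30 23:59:60 UTC.
    from datetime import datetime, timezone

    before = datetime(2012, 6, 30, 23, 59, 59, tzinfo=timezone.utc).timestamp()
    after = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc).timestamp()
    print(after - before)  # 1.0 -- the inserted :60 second has no timestamp of its own

    # And you can't even construct it:
    try:
        datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
    except ValueError as exc:
        print(exc)  # "second must be in 0..59"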
It's somewhat important to the original spirit of the comment since Acts of God might indeed happen anywhere (1) whereas acts of man might not. For example, disruption caused by the ongoing war in Syria wouldn't be covered by an Act of God clause.
(1) I think that's BS though...there are definitely some places where nature is considerably more stable than others.
US East or US West plus Direct Connect to your own colo space, with AWS for the burst capacity, and your own redundancy for the database servers, might be the best plan if you can't do wide area "over the Internet" database replication. (I might get an extra 10G DC (since I need 1G DC myself) and then have some colo for sale with it in US East/US West later this year.)
edit: so basically, the businesses suffering outages (heroku, netflix, etc) don't value uptime to the same extent that amazon does. they got what they paid for.
This isn't the first time this has happened to AWS. We moved our app to Linode last year after this happened to us, and it seems to affect AWS more than any other hosting I've ever used. I'd be interested to know how their infrastructure is set up, because it doesn't seem particularly robust.
It's a wonderful idea!
I even have a list of possible locations they should look into. Beyond the Virginia site, they should be looking at DCs in Oregon, California, Ireland, Singapore, Tokyo and even Sao Paulo. What do you think?
Seriously though, as horrible as downtime is, I think most internet users aren't terribly surprised when they can't go to a specific website for a short period of time.
If you want to be extra paranoid, you could always pick a secondary data center location where it would be safer in the summer, since Denver would be safe from tornadoes in the winter.  Or you could just put one on the West Coast and hope that there's not a 1-in-a-thousand tornado within a week of a big earthquake.
Yup, everywhere has got its natural disasters.
Anyone else care to speculate how this could be good press?
"This site is so popular it barely works"
Or they're just like Heroku and sit on top of AWS?
I know outages happen all the time at hosts, and maybe it's a result of either a) news being more accessible now, or b) Amazon being bigger than most other hosts... but I feel like Amazon & Heroku are going down WAYYYYY too much.
I am starting to wonder if this "tell all" policy is really best.