Per the linked dashboard, some instances in a single AZ in a single Region are h...

coderintherye · on Oct 22, 2012

I would agree with you, but Amazon is just downright dishonest in their reports, which makes me sad, because I love Amazon. Go look at the past reports: they've never shown a red marker, only "degraded performance", even when services for multiple availability zones went down at the same time due to their power outage (so had you architected to multiple AZs you were still fucked). When they have a single AZ go down, they won't even give it a yellow marker on the status page; they'll just put a footnote on a green marker. It makes their status dashboard pretty much useless for at-a-glance checking (why even have colors if they don't mean anything?).

Read their report from the major outage earlier this year: they start out by saying "elevated error rates" when many services were in fact down, and it wasn't until hours later that they finally admitted to having an issue that affected more than just one availability zone.

From Forbes: “We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.” By 11:49 EST, it reported that, “Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.” But by 12:20 EST the outage continued, “We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.” At 12:54 AM EST, AWS reported that “EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.”

kalid · on Oct 22, 2012

It's like grade inflation. You can never give out an F (Mr. Admissions officer, are you so bad at your job that you would admit such an unqualified student?), so a Gentleman's C is handed around. In Amazon's case, it's a gentleman's B+ (green, with an info icon).

A: fine
A-: problems
B+: servers are on fire

I really like Amazon as a company, use a lot of their services, but this is dishonest.

mibbitier · on Oct 22, 2012

They reserve the red mark for "The world just exploded". Orange is reserved for "The datacenter got bombed".

driverdan · on Oct 22, 2012

Amazon is downplaying the problem. It is affecting many large sites.

res0nat0r · on Oct 22, 2012

Many large sites which aren't properly architecting their infrastructure to deal with the commodity nature of the cloud.

mey · on Oct 22, 2012

http://downrightnow.com/netflix

If Netflix is down, then it's something that even most companies who know how to design failover can't cope with.

res0nat0r · on Oct 22, 2012

The last Netflix post mortem mentioned they had a bug in their configuration where they kept sending traffic to already down ELB instances, which was the cause of the last outage for them if I remember correctly.
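
To illustrate the kind of failure described above, here is a minimal sketch of routing that skips backends already known to be down. It is not Netflix's or Amazon's actual code; the names `pick_backend`, `is_healthy`, and the `elb-*` labels are hypothetical and exist only for this example.

```python
import random


def pick_backend(backends, is_healthy, rng=random):
    """Return one healthy backend, or None if every backend is down.

    `backends` is a list of backend names and `is_healthy` is a callable
    returning True/False for a name. Both are assumptions for this sketch.
    """
    healthy = [b for b in backends if is_healthy(b)]
    if not healthy:
        # Nothing to route to: let the caller fail over elsewhere
        # instead of sending traffic to a dead backend.
        return None
    return rng.choice(healthy)


# Hypothetical health state: one load balancer is down, two are up.
status = {"elb-a": False, "elb-b": True, "elb-c": True}
chosen = pick_backend(list(status), lambda name: status[name])
```

The point is only that the health check happens before the routing decision; how health is actually determined (and how stale that information may be) is the hard part and is left out here.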

Uchikoma · on Oct 22, 2012

AirBnB is also down.

creatrice · on Oct 22, 2012

Yeap, it's down the day I really need it!

incision · on Oct 22, 2012

Not necessarily, it could be some element of the Netflix architecture that due to their size and/or design trade-offs has taken longer / is harder to eliminate than it would be for others.

Other services, like Twilio, have come through several of these major problems with US-EAST generally unscathed while Netflix has had issues repeatedly.

iamdave · on Oct 22, 2012

Netflix seems to be operating just fine for me in the southwest USA.

acdha · on Oct 22, 2012

According to a site which doesn't document what its reports are based on. Given that Netflix worked for me during that period, I'm suspicious that downrightnow might be using EBS somewhere.

dsl · on Oct 22, 2012

I'm seeing issues across multiple AZs. Everything is still up, but that doesn't stop me from getting paged.

darkarmani · on Oct 22, 2012

Including companies like Amazon that aren't properly architecting their control plane.

jeremyjh · on Oct 22, 2012

They should outsource it.

milkshakes · on Oct 22, 2012

too bad their own dashboard and management interface don't fail over... http://db.tt/BcuoSnPu

sehugg · on Oct 22, 2012

I'm also having API and Console timeouts, so I would disagree that it's limited to a single AZ.

ceejayoz · on Oct 22, 2012

The API and Console timeouts are likely due to high load from everyone logging in to see what's going on.