On Cascading Failures and Amazon's Elastic Block Store (joyeur.com)
53 points by timf on April 22, 2011 | 8 comments



Good post, but the tone of Joyent's posts so often irks me. Too much poking at Amazon while putting themselves on a pedestal.

They're a competitor to Amazon, so of course they think they're superior... but they're just so smug about it.

It seems like a bad practice, especially when you end up with pie on your face later. Not too long ago, they had an entire food fight thrown in their direction, so it's not exactly like they're immune to issues.


As far as I'm concerned, Joyent looks like idiots after this post. Had the post been written in a tone that reflected on their own prior experiences and showed some solidarity, it would have been a nice post. But no, this is 20/20 hindsight, where everyone can say "I told you so" with no evidence that they knew this was coming.

Poor etiquette, terrible argumentation, and atrocious PR.


Yes, I immediately thought of (what was it?) the BingoDisk fiasco when I saw it was Joyent commenting on AWS's outage.


This is one of the key insights to take away from this whole AWS mess: When things start to go wrong, your automatic recovery code will increase the load on your system, and commonly lead you into a spiral of death.

I'm delighted to now have a name for this: "Congestive collapse."

The first time I saw congestive collapse in a real-world system, it was an ugly surprise. And this is presumably one reason why Netflix runs at 30-60% capacity across 3 AZs: They want to be able to lose a zone without overloading key systems.
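
To make the feedback loop concrete, here is a minimal Python sketch of the difference between naive retries and backed-off retries (everything here is invented for illustration; it is not Amazon's or Joyent's recovery logic):

    import random
    import time

    # Minimal sketch of the retry feedback loop behind "congestive collapse".
    # All names and numbers are invented for illustration.

    def naive_recover(fetch, attempts=10):
        """Retry immediately on failure. Across thousands of clients this
        multiplies offered load exactly when capacity has dropped."""
        for _ in range(attempts):
            try:
                return fetch()
            except ConnectionError:
                continue  # no delay: every failure instantly becomes more load
        raise ConnectionError("gave up")

    def backoff_recover(fetch, attempts=10, base=0.5, cap=30.0):
        """Capped exponential backoff with jitter. Retries spread out over
        time, giving a degraded system headroom to recover."""
        for i in range(attempts):
            try:
                return fetch()
            except ConnectionError:
                time.sleep(random.uniform(0, min(cap, base * 2 ** i)))
        raise ConnectionError("gave up")

The jitter matters because thousands of clients should not all retry at the same instant; spreading the retries out is what gives an overloaded system room to recover instead of pushing it further into collapse.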


"When things start to go wrong, your automatic recovery code will increase the load on your system, and commonly lead you into a spiral of death."

Here is a BattleBots story: when I was competing, we would see new teams in the pit area at the beginning of the competition (there were 3 days of preliminaries) that had really nice-looking bots. I'd ask them, "So, have you run it full speed into a concrete wall?" And they would either say "Yeah, wow, it was amazing ..." and tell some story of mayhem, or "No" (sometimes with a profession of confidence in their design skills or their simulations).

Teams that said "No" never made it out of the preliminaries. Not once in my experience.

That story illustrates a fundamental truth in systems analysis: "Beat it until it fails before you depend on it."

This is something that Google does really, really well, by the way. I've watched them turn off 25 core routers simultaneously, carrying hundreds of gigabits worth of data, just to verify that what they think will happen does happen.

You learn not only what breaks. If you also go through the fire drill of bringing it back online, and take copious notes when people say "Dammit! I need to do 'x' and I can't," your ability to respond will improve too.

For very large systems, this can sometimes be the only way to develop this information.

Amazon has clearly had an event of extraordinary magnitude in their data centers. They've no doubt discovered all sorts of tools that they could use to recover more quickly. I would love it if someone from there would post a complete post mortem, but my expectations are low (there is a lot of proprietary benefit in knowing some of this stuff).
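
For the software equivalent of running the bot into a concrete wall, a minimal fault-injection sketch (all names invented; not Google's or Amazon's tooling) might look like this: wrap a dependency so a test can force it to fail, then verify the fallback path actually does what you predicted.

    import random

    class FlakyDependency:
        """Wraps a callable and fails it deliberately at a configured rate."""

        def __init__(self, real_call, failure_rate=0.0):
            self.real_call = real_call
            self.failure_rate = failure_rate

        def __call__(self, *args, **kwargs):
            # Inject a failure instead of calling the real dependency.
            if random.random() < self.failure_rate:
                raise ConnectionError("injected failure")
            return self.real_call(*args, **kwargs)

    def lookup_with_fallback(primary, cache, key):
        """Behavior under test: fall back to a local cache when the
        primary store is unreachable."""
        try:
            return primary(key)
        except ConnectionError:
            return cache.get(key)

    if __name__ == "__main__":
        cache = {"user:1": "cached-value"}
        # failure_rate=1.0 forces the primary to fail every time, so we can
        # verify the fallback actually engages rather than assuming it will.
        primary = FlakyDependency(lambda key: "fresh-value", failure_rate=1.0)
        assert lookup_with_fallback(primary, cache, "user:1") == "cached-value"
        print("fallback verified under injected failure")

The useful output is less the assertion itself than the notes you take when the fallback turns out not to behave the way everyone assumed.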


Joyent knows all about shared network drive failures.

http://techcrunch.com/2008/01/15/joyent-suffers-major-downti...

Which of course they solved by getting out of the business completely.


"When the cloud goes down it becomes a fog." Funny quote from the linked Magnolia article.


No one sane runs code that automatically takes EBS volumes out of RAID arrays and replaces them.



