

On Cascading Failures and Amazon's Elastic Block Store - timf
http://joyeur.com/2011/04/22/on-cascading-failures-and-amazons-elastic-block-store/

======
krobertson
Good post, but the tone of Joyent's posts so often irks me. Too much poking
at Amazon while putting themselves on a pedestal.

They're a competitor to Amazon, so of course they think they're superior...
just so smug.

It seems like a bad practice, especially when you end up with pie in your
face later. Not too long ago, they had an entire food fight thrown in their
direction, so it's not exactly like they're immune from issues.

~~~
kmfrk
As far as I'm concerned, Joyent looks like idiots after this post. Had the
post been written in a tone that reflected on their own prior experiences
and showed some solidarity, it would have been a nice post. But no, this is
20/20 hindsight, where everyone can say "I told you so" with no evidence to
prove that they knew this was coming.

Poor etiquette, terrible argumentation, and atrocious PR.

------
ekidd
This is one of the key insights to take away from this whole AWS mess: When
things start to go wrong, your automatic recovery code will increase the load
on your system, and commonly lead you into a spiral of death.

I'm delighted to now have a name for this: "Congestive collapse."

The first time I saw congestive collapse in a real-world system, it was an
ugly surprise. And this is presumably one reason why Netflix runs at 30-60%
capacity across 3 AZs: They want to be able to lose a zone without overloading
key systems.
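
The fix, at the client level at least, is retry logic that backs off instead
of hammering a struggling service. Here's a minimal sketch of exponential
backoff with full jitter; the names (call_with_backoff, MAX_RETRIES,
BASE_DELAY) are illustrative, not from any real API:

    import random
    import time

    MAX_RETRIES = 5
    BASE_DELAY = 0.5   # seconds
    MAX_DELAY = 30.0   # seconds

    def call_with_backoff(operation):
        """Retry `operation` without turning every client into part of
        a synchronized retry storm during an outage."""
        for attempt in range(MAX_RETRIES):
            try:
                return operation()
            except IOError:
                if attempt == MAX_RETRIES - 1:
                    raise  # give up and surface the failure
                # Full jitter: sleep a random amount up to the exponential
                # cap, so clients desynchronize instead of retrying in
                # lockstep.
                cap = min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
                time.sleep(random.uniform(0, cap))

The capacity math behind the Netflix figure is simple: to lose one of three
zones without overloading the survivors, steady-state utilization per zone
has to stay at or below 2/3, so 30-60% leaves real headroom.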

~~~
ChuckMcM
"When things start to go wrong, your automatic recovery code will increase the
load on your system, and commonly lead you into a spiral of death."

Here is a BattleBots story: when I was competing, we would see new teams in
the pit area at the beginning of the competition (there were 3 days of
preliminaries) that had really nice-looking bots. I'd ask them, "So, have you
run it full speed into a concrete wall?" And they would either say "Yeah, wow
it was amazing ..." and tell some story of mayhem, or "No" (sometimes with a
prognostication of confidence in their design skills or their simulations).

Teams that said "No" never made it out of the preliminaries. Not once in my
experience.

That story underpins a fundamental truth in systems analysis: "Beat it until
it fails before you depend on it."

This is something that Google does really, really well, by the way. I've
watched them turn off 25 core routers simultaneously carrying hundreds of
gigabits worth of data, just to verify that what they think will happen does
happen.

You learn not only what breaks. If you also go through the fire drill of
bringing it back online, and take copious notes every time someone says
"Dammit! I need to do 'x' and I can't," your ability to respond will improve.

For very large systems, this can sometimes be the _only_ way to develop this
information.
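
In code form, the drill looks something like this. A minimal sketch in the
spirit of the router story; the Node class and its stop()/start()/healthy()
methods are hypothetical stand-ins for whatever your infrastructure actually
exposes:

    import random
    import time

    class Node:
        """Hypothetical stand-in for a real piece of infrastructure."""
        def __init__(self, name):
            self.name = name
            self.up = True
        def stop(self):    self.up = False
        def start(self):   self.up = True
        def healthy(self): return self.up

    def fault_drill(nodes, kill_fraction=0.2, recovery_timeout=300):
        """Kill a random subset of nodes, then verify the system heals
        in the time you claim it will. The notes you take while it is
        down are the real output of the drill."""
        victims = random.sample(nodes, max(1, int(len(nodes) * kill_fraction)))
        for node in victims:
            node.stop()  # the failure you believe you can survive

        survivors = [n for n in nodes if n not in victims]
        deadline = time.time() + recovery_timeout
        while time.time() < deadline:
            if all(n.healthy() for n in survivors):
                break
            time.sleep(5)
        else:
            raise RuntimeError("did not stabilize; you just learned something")

        for node in victims:
            node.start()  # now rehearse the recovery fire drill, too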

Amazon has clearly had an event of extraordinary magnitude in their data
centers. They've no doubt discovered all sorts of tools that they could use to
recover more quickly. I would love it if someone from there would post a
complete post mortem, but my expectations are low (there is a lot of
proprietary benefit in knowing some of this stuff).

------
kinofcain
Joyent knows all about shared network drive failures.

http://techcrunch.com/2008/01/15/joyent-suffers-major-downtime-due-to-zfs-bug/

Which of course they solved by getting out of the business completely.

------
caller9
"When the cloud goes down it becomes a fog." Funny quote from the linked
Magnolia article.

------
tillk
No one sane runs code to automatically pull and replace EBS volumes in a
RAID set.
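
The sane version of that automation pages a human instead of acting. A
minimal sketch of the distinction; page_oncall and detach_and_replace are
hypothetical names:

    def on_member_degraded(volume_id, page_oncall):
        # Deliberately do NOT call detach_and_replace(volume_id) here.
        # A single failed volume and a region-wide EBS event look identical
        # to this code; guessing wrong turns one failure into a rebuild
        # storm, exactly the congestive collapse described above.
        page_oncall("RAID member %s degraded; manual decision required"
                    % volume_id)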

