

Coffee and Design for Failure - somic
http://www.somic.org/2011/04/22/coffee-and-design-for-failure/

======
sklivvz1971
I don't think the analogy is correct: suppose that John needed to buy
life-saving medicines. Would his course of action be reasonable and rational?
No, because he would be dead. On mission-critical systems you make damn well
sure in advance you have a backup plan (or more if needed). John should have
bought an extra set of medicines, and AWS should have had better backup plans -
it seems to me that if it's taking so long to bring the systems up, there must
be something not really well thought out there.

~~~
somic
The post was not directed at AWS. A "normal accident" (one that can't be
foreseen, usually due to a confluence of factors and bad luck) happened and
they are dealing with it.

My post was directed at folks who said "web sites that went down as a result
of such an unprecedented EC2 problem made an engineering mistake by not
building to be able to withstand it."

Again - I am emphasizing "engineering mistake", not business mistake or
funding mistake or resource allocation mistake.

My point is there are things you rationally protect against. But at some
point, putting up defenses against more and more bad things stops being
rational.

For different systems this point (where it stops being rational) is different.

------
alanh
Interesting that such a question-filled post doesn’t have comments enabled.
(Not that I’m criticizing it for this reason.)

I guess I’m not sure how relevant the parable is, because while it doesn’t
cost much to have coffee & a pot at your house (you will use it anyway, and it
doesn’t cost anything day-to-day to just keep it there), this isn’t true for,
say, having a backup system just waiting to take over in the event of
catastrophic failure.

Then again, the Netflix strategy (3 clusters at 60% utilization → 2 clusters
at 90%) is superficially similar to usually having an option of homebrew,
walking, and driving for coffee, and sometimes just being forced to pick one
option. Superficially.
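
To spell out the arithmetic behind those numbers, here is a rough sketch. It
assumes the total load stays constant and is spread evenly over whichever
clusters survive, which is my simplification. The function name is made up for
illustration and is not anything Netflix publishes.

```python
def utilization_after_failure(clusters: int, utilization: float, failed: int = 1) -> float:
    """Per-cluster utilization after `failed` clusters drop out.

    Assumes (hypothetically) that total load is unchanged and is
    rebalanced evenly across the surviving clusters.
    """
    surviving = clusters - failed
    if surviving <= 0:
        raise ValueError("no surviving clusters to absorb the load")
    total_load = clusters * utilization  # load measured in "cluster units"
    return total_load / surviving


# 3 clusters at 60% carry 1.8 clusters' worth of load; 2 survivors split it.
print(round(utilization_after_failure(3, 0.60), 2))
```

Under the same assumption, 3 clusters running at anything above roughly 67%
could not absorb the loss of one, which is why the idle headroom has a real
day-to-day cost.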

------
isak2
For all John Doe knew, those events (road closed, rain + strong wind) were
unlikely to happen, so planning for them would not have been rational. He did
bring an umbrella when he saw a cloudy sky, which demonstrates good planning
(even though it didn't work out in this particular case).

------
ctdonath
Context, please: what's "the Judgment Day Outage"?

(I presume knowing that would explain the purpose of the gedankenexperiment.)

~~~
archgoon
Amazon's EC2 failures. Since April 21st, 2011 is the date that the fictional
Skynet (from Terminator) seizes control of the world's computing
infrastructure, some have been amused by the coincidence in timing.

------
cmurdock
John doesn't make money based on whether or not he gets a cup of coffee, so
this whole question is worthless in my opinion.

~~~
archgoon
It is not overly difficult to transform this story about one form of utility
(Starbucks coffee) to being about needing to get to a location to perform work
(another form of utility).

