
Five whys - jmorin007
http://www.joelonsoftware.com/items/2008/01/22.html
======
mixmax
Some years ago a major Danish bank suffered a black swan event where their
entire online financial system went down. It was traced to a bizarre problem
in an IBM DB2 database that had never occurred before. In the entire history
of DB2. All kinds of experts were flown in from all over the world to fix it,
and eventually they succeeded with some loss of data.

The result? Customers of the bank were unable to make any transactions for 5
days, and the bank's stock suffered so severely that the company lost almost a
billion dollars in market cap within a week.

Now they have redundant systems...

~~~
daniel-cussen
Did anybody get fired for buying from IBM?

~~~
pius
Nice one. :)

------
gibsonf1
With EC2 available, why be in the hardware as well as the software business?
Life seems too short to have to worry about hardware and hardware scaling -
especially things like the story of rounding up 100 servers over the weekend
to handle demand when you could just deploy as many EC2 instances as you need
in a few keystrokes in a few minutes, or this problem of a misconfigured
switch.

~~~
rcoder
There are a number of reasons not to put critical business systems on top of
EC2.

First and foremost, if you don't have control of who can physically access
your hardware, you can't make any real assurances about the privacy and
integrity of your data. I don't care how much you trust Amazon with your
personal credit card and shipping information -- you are taking a big risk
putting your _customers'_ data there, too.

Secondly, EC2 has no SLA _at all_. Given Joel's (strong) argument about the
limited usefulness of SLAs in general, that may not seem like a big deal, but
the lack of any firm commitment on Amazon's part pretty much means that you're
SOL if there are any major problems.

Finally, EC2 absolutely _does not_ protect you from issues like mis-configured
network hardware. Amazon may have incredibly redundant connectivity, including
data centers in different geographic regions, but that redundancy doesn't
necessarily benefit your virtual machines in the way it does Amazon's core
business applications. One VM instance is still tied to one physical box
hosting it, and if that box goes down, you'll lose that host and any data it
hasn't flushed elsewhere on the network.

That's not to say that EC2 doesn't have its uses; I just couldn't imagine
putting your mission-critical apps on it. Stick with prototyping, testing,
batch-processing, and the occasional extra capacity for unexpected spikes in
load, and you should be fine.

~~~
gibsonf1
1\. Without password and encryption keys, access to the hardware is
meaningless from a data security perspective.

2\. I am pretty confident that Amazon's ability to keep the computers running
is better than most other options. And unless their entire service goes down
as well as the entire S3 grid, we can very quickly launch a new instance with
no data loss (we have a hyperactive backup strategy to S3 to prevent data
loss in case of an EC2 instance failure).

3\. We can launch a new instance in just a few minutes. So if anything, this
seems like an ideal environment where uptime is extremely good, and in the
case of a failure, recovery is extremely fast. I'm not sure where we can get
similar service? S3 is especially good in terms of multiple backups in
different regions - hard to beat for assuring business clients who can't
afford to lose data that their data won't be lost.

~~~
run4yourlives
>1\. Without password and encryption keys, access to the hardware is
meaningless from a data security perspective.

Are you shitting me? What school did you go to?

~~~
gibsonf1
I guess we need to concretize this: Someone decides that the data on my system
is valuable and wants to steal it. They find out that we're running on EC2
with backup to S3, and they want to locate the equipment. How in the world
would they do it? I have no clue where the hardware is that hosts our service.

Much easier to break into someone's office and get the equipment directly, or
to break into a smaller host and grab the equipment. With the cloud, finding
the actual hardware in use seems extremely difficult - which seems to
dramatically increase your security against a hardware theft as opposed to
self hosting.

~~~
run4yourlives
1\. You're assuming the threat is targeted. You'd be just as liable (and it
would cause you just as much damage) if Amazon's janitor decided to walk off
with a couple of hard drives.

2\. You not knowing which server your data is on is just as bad. You can't
actually guarantee me that my data _hasn't been stolen_. You have no way of
knowing.

------
eVizitei
Joel makes a really good point in this article. We cannot prepare for every
eventuality, and in fact it simply isn't cost-effective to do so. The best we
can do is plan for the things that are expected to go wrong, and make
permanent fixes (either with technology or with processes) when things we
don't expect come up.

------
daniel-cussen
Why not just make an NTSB for internet outages?

~~~
wmf
An outage at a Web 2.0 site is unlikely to kill anyone, so it doesn't justify
government intervention like the NTSB. Laws have been proposed requiring
companies to report on privacy breaches, but I don't think they require
companies to say _why_ the breach happened, and they don't require an outside
audit.

The ROC project suggested something similar, although their analogy was not to
the NTSB but an internal Ma Bell policy that required a report to be written
for all outages. However, most companies don't want to release details about
their outages, especially ones that are not customer-visible. When companies
do talk about outages (as Joel just did) it's usually as a form of PR damage
control.

<http://roc.cs.berkeley.edu/#pubs>

~~~
daniel-cussen
Now I know. Thanks for the link.

------
curmudgeon
An embarrassing outage turned into yet another marketing opportunity... WTG,
Joel!

