

Details on the Oracle failure at JPMorgan Chase - babar
http://www.dbms2.com/2010/09/17/jp-morgan-chase-oracle-database-outage/

======
babar
Short on details, and the speculation about transferring the user info to a
NoSQL solution sounds naive, but it's interesting to get a peek into the
internal operations at a large enterprise. I think I want to go validate some
of our database backups right now.

~~~
georgefclay
I agree. Also his comment about "over engineering" making the system "more
brittle" was odd.

For data that important, I would have mirrored the databases to "warm standby"
servers. They could have been back up in minutes with no data loss. Sure, it
would have doubled the cost, but how much money did they lose during the
outage?

~~~
jasonwatkinspdx
You completely failed to read the article.

Otherwise you'd know that they had a fault that propagated to the hot spare.
It's also utterly daft to think that a financial enterprise as large as
JPM/Chase wouldn't already be running a HA setup. In this case it appears to
be Oracle RAC.

I'm astounded how often I have to remind people that replication and backups
are very different things, and that you need both.

I'm also depressed how many utterly thoughtless comments are made here on
hackernews lately.

~~~
gaius
No, Oracle RAC shares the same storage between two or more nodes.

What they had here would appear to be database A running on storage A which is
replicated at the storage level to storage B where database B waits in an idle
state. Because the replication system is "blind" - it only sees its own
filesystem containing bytes, not Oracle data structures - it can't tell a good
Oracle block from a bad one and copies it.
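
To make the "blind" point concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the block layout (a 4-byte CRC32 in front of the payload) and the function names are made up, and none of it is Oracle's actual block format or any storage vendor's replication. It only shows the difference between copying bytes and copying bytes that have been validated first.

```python
import zlib


def make_block(payload: bytes) -> bytes:
    """Build a toy 'database block': 4-byte CRC32 header + payload (hypothetical layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def block_is_valid(block: bytes) -> bool:
    """Recompute the CRC32 of the payload and compare it to the stored header."""
    stored = int.from_bytes(block[:4], "big")
    return stored == zlib.crc32(block[4:])


def blind_replicate(primary: list, standby: list) -> None:
    """Storage-level copy: bytes in, bytes out, no idea what they mean."""
    standby[:] = primary


def checked_replicate(primary: list, standby: list) -> list:
    """Copy only blocks that pass validation; return the indexes that were rejected."""
    rejected = []
    for i, block in enumerate(primary):
        if block_is_valid(block):
            standby[i] = block
        else:
            rejected.append(i)  # standby keeps its last good copy of this block
    return rejected


def demo():
    primary = [make_block(b"row-%d" % i) for i in range(3)]
    standby_blind = list(primary)
    standby_checked = list(primary)

    # Corrupt one byte of block 1 on the primary.
    bad = bytearray(primary[1])
    bad[-1] ^= 0xFF
    primary[1] = bytes(bad)

    blind_replicate(primary, standby_blind)
    rejected = checked_replicate(primary, standby_checked)
    return standby_blind, standby_checked, rejected
```

With the blind copy the standby ends up holding the same bad block as the primary; with the checked copy the bad block is refused and the standby keeps its last good version.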

I do this sort of setup for a living and you would be amazed at how many
"architects" there are around who have completely drunk the storage vendor
kool-aid and don't really understand how anything works (not even storage...).

~~~
GBond
This is likely the case, since the post mentioned that the storage controller
was initially blamed (but then cleared).

------
known
_caused by corruption in an Oracle database._

Could happen to any database. Not just Oracle.

------
lvecsey
So do those 8 machines and the code on them represent the system before or
after the WaMu merger, or something in between? I've heard many of these banks,
such as Citigroup, have something like 13 different databases or systems, many
of which duplicate functionality.

------
unohoo
Having worked at an enterprise software company and with several big clients
(including banks), I find it surprising (and shocking to some extent) that
JPMC didn't have a more efficient disaster recovery process in place.

I am not saying they didn't have one, just that disaster recovery scenarios
should factor in such outages. Hypothetical fire drills etc. are needed at
critical businesses like banks.

My guess is that a bunch of people at JPMC will most likely be losing their
jobs over this.

~~~
VladRussian
And a bunch of people will be getting the jobs/contracts. Like anybody who's
been working with big enterprises, I'm sure it would be a net positive effect :)

Btw, while everybody's salivating over "Oracle failure", was it really an
Oracle failure, i.e. a failure of Oracle itself?

~~~
unohoo
Oh yeah, I totally agree. I'm pretty sure that if it's an Oracle issue, IBM is
going to put its DB2 sales pitch to JPMC into overdrive.

~~~
isyourfriend
They should have used MS SQL Server instead :-)

