
What's your worst IT failure to date? - kaosmos
Last week I failed pretty hard with a website migration to a new platform. So hard that I was forced to put the previous version of the website back online. To date, in a career of almost 20 years, this is definitely my worst failure. What's your story of IT failure, if any?
======
camclay
A VMware snapshot of our master domain controller failed: Windows crashed and
the snapshot was corrupted.

Slept at the office, while I made a backup, and then manually corrected the
snapshot chain in vi. Got it up and running right as people were getting into
the office.

Most terrifying moment of my IT career.

------
shakycode
My worst IT failure to date was dropping a production database at 3am while
half-asleep. We had backups, but we lost an hour of transactional data.

------
epc
This story is ancient, and I may have written it up before on HN:

In the weeks leading up to Christmas 1998 my team and I were in the process of
migrating our web site to a Y2K "safe" complex. Our site ran in two data
centers, using Andrew File System (AFS) to mirror content. The new complex
used Distributed File System (DFS), which was the new new thing then. The
migration got put on hold because a "key customer" needed to use the new
complex for the weeks around Christmas, so we were stuck on the old complex
indefinitely.

The week before Christmas there was a security update to the routers in use
across the hosting infrastructure. Now, it's Christmas, and it's the end of the
year, so there aren't supposed to be any breaking changes, and the bureaucracy
of the hosting organization should guard against any stupidity.

However, one of the senior managers decided that it was a truly critical
security update, and pushed through a series of changes to the routers on
approximately December 22nd, 1998.

Roughly 24 hours later the AFS cell began to melt down. I don't recall the
precise changes that occurred, but something to do with the MAC addresses of
the routers becoming virtual messed up how AFS communicated across the cell.

Over the next 48 hours my team and I desperately tried to keep the site up
over dialup lines and cell phones from a ski chalet, and our various parents'
homes since we'd all gone into vacation mode and there was absolutely no
reason to expect the site to melt down. And almost no one had broadband at
home, let alone at parents' or relatives' homes.

In the end we cobbled together a half-assed simulacrum of the site on the "new
complex" based on a weeks-old snapshot and updates copied over by hand, which
worked to keep the site up through Christmas Day until the "old complex" could
recover from the unexpected router changes.

As far as I could tell it didn't hit the press, and it barely got a mention in
the various newsgroups at the time.

The following week, the sysadmin (who worked sort-of dotted line to me but
reported into the hosting organization) was effectively fired for telling me
that the site had gone down. Seems none of the management chain in the hosting
group had any idea, and since I reported more or less directly to the CIO,
they were caught short in the weekly roundup meeting. It wasn't the sysadmin's
fault; her management had declined to keep the management chain informed.

A separate result was that the Y2K migration got delayed until August 1999
because the whole event exposed many, many holes in the hosting group's
processes and procedures.

~~~
kaosmos
I can see a similarity between your story and a problem I had: you were put on
hold because of a "key customer", while I didn't have enough "bargaining power"
to get my customer to stop the old website from taking new orders and new
registrations (it is an ecommerce website) for long enough to migrate the site
correctly without further pressure. In both cases there was an added
difficulty from being too "compassionate" toward the customer's needs.

