

Ask HN: What's the SCARIEST thing that's ever happened in your career? - robmcvey

I want to hear about database horrors, ludicrously long downtime, DDoS
attacks, and that time you accidentally deleted all your user data.

You know, all that scary stuff.

What happened?
How did you resolve the issue? 
What did you learn?
======
rajacombinator
It all depends on what you consider scary.

There were the times when I happened to be the only person running the trading
desk during the nights of the Bear Stearns, Lehman, and FNM collapses, having
to turn over a ~$2B portfolio. It was all OPM (other people's money), so
not personally scary per se,
but staring into the abyss definitely altered my worldview.

Then there was the time we thought a stupid bug of mine had cost the company
$500k. Turns out it was only about $5k, but man, I still cringe with
embarrassment when I think about it. Guess I didn't learn my lesson, though,
because I still make stupid bugs now and then.

The thing that really made me toss and turn at night, though, was being stuck
in dead-end jobs with no upward trajectory and no recognition of my
contributions. That's the scariest thing that can happen to a career.

------
epc
Worst outage: Christmas 1998. The site is locked down, all changes prohibited
due to recently instituted end-of-quarter/end-of-year freezes as part of
“professionalizing” our services. I'm just out of cell phone range visiting
family when a call comes in that the site is down.

I drive back to the family home in the Chicago suburbs and dial in on one
phone line while calling into an open conference call on the other. I learn
that the site started decaying a few hours earlier, but service management
decided not to inform me until the actual outage commenced. Over the next few
hours the site goes completely black; the AFS fileservers have lost their
minds and won't accept connections from the httpd clients serving content.

We had been in the process of migrating off this old complex onto a newer one
that used DFS on newer hardware, so I hot-flip the site to the stale content
on the new complex. At least the site is up, sort of. I make minor tweaks to
make it look more recent.

Returning to the "old" complex, we learn that although service changes had
been prohibited, a manager in the service organization had decided to bypass
pretty much all internal processes and push through a change to the routers.
I don't remember the precise change; I think it had something to do with
HSRP or virtual MAC addresses. Whatever the change was, it totally hosed AFS,
which was dependent on the very thing being changed. A normal review would
have caught this, but since it was Christmas and the guy was in a hurry, no
one caught it.
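
For anyone who hasn't hit this failure mode, here's a toy sketch of why a
change like that is so destructive. Every detail below is hypothetical
(again, I don't recall the actual change); the point is just that hosts
cache the gateway's MAC address via ARP, so anything that silently swaps
the MAC behind a virtual IP black-holes traffic from every host still
holding a stale entry.

    # Toy sketch, not the actual change: assume the routers swapped the
    # HSRP group's virtual MAC. Hosts cache the gateway MAC via ARP, so
    # a MAC swap strands them until the cached entry expires.

    HSRP_GROUP = 7
    VIRTUAL_IP = "9.0.0.1"  # made-up gateway address
    virtual_mac = f"00:00:0c:07:ac:{HSRP_GROUP:02x}"  # HSRPv1 convention

    class Host:
        def __init__(self, name):
            self.name = name
            self.arp_cache = {}  # ip -> cached MAC

        def resolve(self, ip):
            # ARP entries live for minutes to hours; until one expires,
            # the host keeps sending frames to whatever MAC it cached.
            return self.arp_cache.setdefault(ip, virtual_mac)

        def send_via_gateway(self):
            cached = self.resolve(VIRTUAL_IP)
            ok = cached == virtual_mac  # router answers only its current MAC
            print(f"{self.name}: frame to {cached} -> "
                  f"{'delivered' if ok else 'black-holed'}")

    client = Host("httpd box")
    client.send_via_gateway()  # delivered; the cache matches reality

    # The ill-advised change: e.g. `standby use-bia` puts a burned-in
    # address behind the virtual IP, and nobody tells the hosts.
    virtual_mac = "08:00:2b:4c:59:1f"  # made-up burned-in MAC

    client.send_via_gateway()  # black-holed until the ARP entry expires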

Over the next 24 hours, from Christmas Eve into Christmas Day, I (working
from my parents' spare bedroom), my lead sysadmin (working from a cabin in
the Rockies), and my lead webmaster (working from HIS parents' home in the
UK) manage to resurrect the site from backups (the site itself was running
out of datacenters in Columbus, OH and Schaumburg, IL).

The punchline: at the time, my second-line manager is the CIO. Over the
entire outage I've kept him in the loop on what we were doing, who was
helping, etc.

The following Monday he's on his regular call with Global Services, reviewing
incidents, issues, etc. No one mentions that the corporate site had been down
for, effectively, two days. Finally, he brings it up, causing a colossal
bureaucratic shitstorm.

The end result? I'm reprimanded for a couple of minor infractions (a slap on
the wrist, considering my sole motivation was getting the site back online).
The sysadmin who worked through her vacation from a backwoods cabin? Fired.
Not for the work she did to get the site back online, but because management
felt she too had bypassed process and should have focused not on getting the
site online but on keeping management informed (which, it turned out, they
had been; they simply ignored the trouble tickets coming in).

The manager who approved and pushed through the "minor change" that took the
site offline? Promoted.

Lesson learned: for all of the talk about relying on the I/T professionals,
they were just as apt to make colossal mistakes, but could fall back on
"process" and bureaucracy to avoid accountability. I was advised that the
next time the site was down, I should rely on the professionals to return it
to service, and that if it took multiple hours or days, so be it.

~~~
mobiplayer
I hope everyone can learn a lesson here: You don't work during your holidays.

