
Organizational complexity grows super-linearly with headcount; in general, the number of people a company can hire per unit of time is constant or, at best, grows linearly.
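
To make that claim concrete with a back-of-the-envelope sketch (my own illustration, not from the original comment), take pairwise communication paths as a proxy for complexity:

$$\text{links}(n) = \binom{n}{2} = \frac{n(n-1)}{2} \in \Theta(n^2)$$

So even if headcount n grows at most linearly over time, the coordination surface grows roughly quadratically, steadily outpacing the rate at which new people can be added.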

Google once had a very quiet, very big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were: passwords, salts, and other secrets that cannot be represented as one-time pads and are therefore potentially dangerous magic numbers for folks to know. During one disaster-recovery exercise, they attempted to confirm that if the system had an outage, it would recover on its own.

It did not.

This tripped a very quiet panic at Google, because while the company would tick along fine for a while without access to the master password database, systems would fail one by one if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued, because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in the wallet of an executive who was on vacation and had to be flown back to the datacenter to plug it in. The other was stored in a safe built into the floor of a datacenter, and the combination to that safe was... in the password database. They hired a local safecracker to drill the safe open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.

(1) I say "ironically," but the actual point of this self-testing is to cause these kinds of disruptions before chance does. The tests aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster-recovery testing, they attach to that frown the grain of salt that "because this failure mode existed, it would have occurred eventually if it hadn't occurred today."




That's not quite how it happened. ;)

<shameless plug> We used this story as the opening of "Building Secure and Reliable Systems" (chapter 1). You can check it out for free at https://sre.google/static/pdf/building_secure_and_reliable_s... (size warning: 9 MB). </shameless plug>


Thanks for telling this story; it was more amusing than my experience of being locked in a security corridor with a demagnetised access card, looooong ago.


What if the executive had been pick-pocketed?


EDIT: I had misremembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe, in another datacenter, that held one of the two necessary activation cards. Whether they were able to pass the combination to the datacenter over a secure / semi-secure line or flew back to hand-deliver it, I do not remember.

If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the keycard in the safe on the other continent."

If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.



