Joyent us-east-1 rebooted due to operator error (joyent.com)
104 points by hypervisor on May 27, 2014 | 122 comments

It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).

I feel bad for the person who made the mistake. Even though it's obviously a systemic problem, and highly unlikely to be an act of negligence, I'm sure he/she doesn't feel too hot right now.

It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.

People tend to forget that "fixing it" isn't just technical, it involves process, too. Every new hire that whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.

Back in the day we used to say there are 2 types of network engineers, those that have dropped a backbone and those that will drop a backbone.

Your management is failing your newer engineers, if this is still the case.

Nonsense. Someone has to be operating at the sharp end of the enable prompt, and sooner or later it'll be 0330 and that person will type Ethernet0 when they meant Ethernet1, whatever management you have in place.

When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else in the ops team gets a few cheap laughs at the miscreant's expense, you have a meeting about it, discuss lessons learned, and you move on.

Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.

I've seen generally brilliant people be bit by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, obviously taking a production server down with it.

Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.

I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.

Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!

Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.

My thoughts exactly. Poor fella.

I've seen worse though. A newish officer spilled his morning coffee into the circuitry of a device worth over 10 zeros. Immediately short circuited.

Wow. Did this gold-plated B-2 bomber still fly after the coffee incident?

You know you could build five space shuttles with ten zeros, right? Are we talking dollars or Yen?

Must be counting the two after the decimal point. =)

Why are you using a throwaway account? Ohh, I just saw the "dollars or Yen" remark. TIL we use throwaway accounts for the times we feel like being assholes, so the non-elites can't track it back to our physical neuroprocessors.

I'm not sure why I'm dignifying this Reddit drivel with a response, but my karma and account age should be your hint that you're barking up the wrong tree.

"check my previous responses and my credit score for how you should treat me" ohh what an old-man response. It's too bad the Imgur "downvote everything they ever posted" script doesn't work here on HN, now isn't it?

My account's karma and history exceed your account's on this site, and even worse, this individual comment bears more value than yours! Ooh burn!

Haha, this is a pretty spectacular amount of cognitive dissonance you're demonstrating here.

Let's post-mortem this lunacy:

1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or something), notice the word "throwaway" in his user name and get really excited that you can stand on your high horse and call him out for hiding behind the shield of internet anonymity when he wants to be a (you think) racist idiot. Even though his comment is a legitimate currency conversion remark. See: https://www.google.com/search?q=1+usd+in+yen

2) He explains that clearly you're mistaken (which, really, seriously, what a kneejerk response from someone who just wanted to show how clever they were, calling out an "asshole") and further explains his account isn't even a "throwaway" in the traditional, trolly sense of the word, citing his account age and the fact that he regularly actively posts to the account.

3) You, caught perhaps in a moment of clarity, though I think I give you too much credit here, realize you were too eager to pounce on the "asshole" for his "Yen" remark, and perhaps you misread it. Your latent erection fading, you counter by explaining that your history and karma are even _more_ impressive, somehow completely avoiding taking responsibility for a completely nonsensical leap in logic and accusation of wrongdoing, while doubling down on your cognitive dissonance.

4) The Aristocrats!

It seems he has a bit of misplaced sensitivity when it comes to the Japanese: https://news.ycombinator.com/item?id=7555232

I was wildly drunk when posting this. I didn't remember doing so until I checked HN just now.

Uh, you just called him out for using a throwaway account, now age and experience means nothing? Go away, troll!

if something worth over 10 zeroes can be destroyed with a coffee spill, i would say it had it coming

As a one-time newish officer who used to be in charge of things with many zeroes, I'd be inclined to agree.

How many non-zero integers were in the price of the device, though? My computer is also worth more than 10 zeros.

Please keep in mind that "price" means "how many dollars other humans are willing to trade for it right now"; not necessarily any concrete evaluation of the device's functionality compared to a human competitor or human operator...


...then there was the new server room that was built with one of the 'big red buttons' conveniently placed behind the pull cord for the lights.

Why, yes...a couple of times...before a perspex arch was less-than-hastily fixed over the button..

Seems like not allowing food or drink near a device worth over 10 zeroes would be a no-brainer, but hindsight is tricky like that.

That sounds so awful. I can't imagine living the rest of my life knowing that I had been a net negative in the world. All of my life's earnings would just be a partial restitution of that one second of destruction.

If you have an EXTREMELY reductive point of view, that equates revenue with human worth.

supposedly 0.0000000001 billions of $

shit happens, design for the worst.

"If you reach for a star, and come up with a handful of mud..."

really? what can cost that much ?

The USS Gerald Ford cost 12.8 billion to construct + 4.7 billion in R&D ... I think we would have heard if it had been destroyed by a cup of coffee.

... And be utterly destroyed by a single cup of coffee?

A quantum computer.

As a request: It looks like each time the status page is updated, the old UPDATE: <words> is removed. For the future, it would be great if the older updates were preserved so that people looking back could understand the chain of events, rather than just seeing the first / last pieces.

Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."

Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

> sounds like they're throwing a sysadmin under the bus

at least they didn't name the operator in question...

Our internal culture is such that everyone on the team would rather be blamed for something than accuse someone else of doing it. That's shitty, and not something you do to someone. You fix the problem and then you move on.

If it makes you happy, blame me - I don't mind.

At my $DAYJOB, we are always careful to figure out exactly what happened, including by whom. It's not to assign personal blame, but I believe it's critical that everyone agrees on the facts (who, what, when, where, and [if possible] why).

Response and conversation is always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.

IOW, I agree with what I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).

It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did, how it caused or contributed to the outage, and to hear them thanked for their contribution of understanding in the post-mortem.

Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.

I think there's a difference in how you approach this with an internal-facing view and an external one.

Internally, you're right. But externally the company fucked up, not the individual.

100% agree, and it is my oversight to not draw that distinction more clearly. We have the luxury (so far) of only reporting internally.

BTW, this is the right way to do it. :)

"elijahwright" shall henceforth be used in place of "scapegoat"

Awesome! It's what I've always wanted!!!

it was that way at tech, no reason for it to change now

Now I have to figure out who you are. :-)

Sure, blame the engineers. Give people power and they use it badly: blame the engineer for giving too much power. Don't give enough power and sysadmins/users bitch and yell, "Why don't we have enough power? We're not children."

It's always the engineer's fault. :(

Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.

This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that don't make any sense when you're designing the system.

By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.

I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.

That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).

There's probably a broader term for operational philosophy like this.

...and the operations version of that is that all normal operations are performed under restricted permissions that cannot "do anything", while the full "do anything" permissions are only broken out during a major crisis.

Such an approach would have prevented this incident where "normal" operations were being performed and accidentally ALL the servers were rebooted at once.
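A minimal sketch of that split, assuming a hypothetical policy in which everyday credentials may touch only a bounded slice of the fleet and anything larger needs an explicit break-glass flag plus a stated reason. The names (`authorize`, `NORMAL_MAX_FRACTION`) and the 10% threshold are made up for illustration; nothing here reflects how Joyent's tooling actually works.

```python
# Hypothetical policy: everyday operations may touch at most this share of the fleet.
NORMAL_MAX_FRACTION = 0.10


def authorize(target_count, fleet_size, break_glass=False, reason=""):
    """Return True if an operation on target_count hosts may proceed."""
    if target_count <= 0 or fleet_size <= 0 or target_count > fleet_size:
        raise ValueError("target_count must be between 1 and fleet_size")
    if target_count <= fleet_size * NORMAL_MAX_FRACTION:
        return True  # small blast radius: normal permissions suffice
    # Fleet-wide actions stay possible, but only as a deliberate, explained choice.
    return break_glass and bool(reason.strip())
```

The operator can still reboot everything in a crisis; the point of the sketch is only that doing so cannot happen as a side effect of a routine command.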

I tend to agree with you, with the caveat that you can't have this philosophy and sell your customers 99.999% uptime[0].

[0] http://www.joyent.com/products/compute-service/features/linu...

I disagree wholeheartedly. Your operational philosophy complements your SLA goals, it doesn't force them.

I can't figure out how your comment that "understanding mistakes will happen" is compatible with 99.999% uptime.

I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.

I do think that 99.999% is doable for a properly distributed whole-system across multiple geographically-dispersed datacenters.
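To put rough numbers on this, here is a small worked calculation. It assumes a 365-day year and, for the multi-datacenter case, that failures are fully independent and the service is up whenever at least one site is up. That independence assumption is a strong one: an operator command that reaches every site at once breaks it.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def combined_availability(single, replicas):
    """Availability of `replicas` copies, assuming failures are independent
    and the service is up whenever at least one copy is up."""
    return 1 - (1 - single) ** replicas


print(round(downtime_budget_minutes(99.999), 2))  # about 5.26 minutes per year
print(combined_availability(0.999, 2))
```

Five nines on a single instance leaves a budget smaller than one reboot, which is why the target is more plausible for a distributed whole than for any one box.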

I think Joyent has gone wrong in promoting individual instance reliability.

Always ask how numbers like that are computed.

They're not. That's a statement of what customers have enjoyed up until now. The actual SLA simply states what refund you get for each hour of downtime.

It's a combined fault. Clearly the operator made a mistake, but the system shouldn't have let such a calamitous operation take place without at least three levels of "Are you sure" (or something smarter like "Confirm how many servers would you like to reboot:") before it lets you take down thousands of servers.
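The "confirm how many servers" idea can be sketched in a few lines. This is an illustration only: `confirmed` and `reboot_with_confirmation` are hypothetical names, and the function returns the hosts it would act on instead of rebooting anything.

```python
def confirmed(targets, typed_answer):
    """Proceed only if the operator typed the exact number of affected servers."""
    try:
        return int(typed_answer.strip()) == len(targets)
    except ValueError:
        return False  # "y", "yes", or an empty answer is never enough


def reboot_with_confirmation(targets, ask=input):
    answer = ask("Confirm how many servers you would like to reboot: ")
    if not confirmed(targets, answer):
        return []  # nothing is touched on a mismatch
    return list(targets)  # a real tool would issue the reboots here
```

Unlike a yes/no prompt, this forces the operator to state the blast radius, so someone who expected one server and is about to hit thousands gets a mismatch.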

Joyent's marketing is not the most transparent. They haven't updated AWS prices in their pricing page since AWS lowered their prices two months ago.


Joyent doesn't use AWS.

The 'devops' automation I made at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shutdown the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.

Parallel was calculated based on historical but manually approved load levels for each cluster, compared to current load levels. So parallel runs faster if there's very low load on a cluster, or very slowly if there's a high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if there's critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
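A rough sketch of that kind of alert-gated rollout, under stated assumptions: `get_alerts` is a hypothetical hook into the monitoring system that returns a timestamp and a count of critical alerts, the health check runs once per batch as a simplification of the box-by-box check described above, and the load-based batch sizing is one plausible reading of the description, not the commenter's actual formula.

```python
import time


def batch_size(approved_load, current_load, max_parallel):
    """More headroom below the approved load level means a larger batch."""
    if approved_load <= 0:
        return 1
    headroom = max(0.0, 1.0 - current_load / approved_load)
    return max(1, int(max_parallel * headroom))


def rolling_apply(hosts, action, get_alerts, parallel, max_alert_age=60.0, now=time.time):
    """Apply `action` batch by batch; stop when alert data is stale or critical.

    get_alerts() is a hypothetical hook returning (timestamp, critical_count).
    Returns the list of hosts that were actually changed.
    """
    done = []
    for start in range(0, len(hosts), parallel):
        fetched_at, critical = get_alerts()
        if now() - fetched_at > max_alert_age:
            break  # cannot verify health: refuse to move forward
        if critical > 0:
            break  # something is already on fire: stop making changes
        for host in hosts[start:start + parallel]:
            action(host)
            done.append(host)
    return done
```

With this shape, a mistaken fleet-wide shutdown stops after the first batch, as soon as monitoring notices that hosts have gone away.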

DevOps means being able to take out an entire datacenter with a single keystroke...

As a Devops, I can't justify building any automated way to down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when changing out (relatively) major infrastructure pieces, such as our Juniper router.

You don't intentionally build an automated way to take down all your servers at once.

You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.

Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.

Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.

Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that huge log file that constantly changes gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.

Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.

Once I typed

  rm -rf logs_ *

instead of

  rm -rf logs_*
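For anyone wondering why one space matters so much, here is a rough emulation of what the shell hands to rm in each case. The directory listing is invented, and the emulation is simplified: a real shell passes an unmatched pattern through literally and skips dotfiles when expanding `*`.

```python
import fnmatch
import shlex

# Hypothetical directory listing, for illustration only.
names = ["logs_2014-05-26", "logs_2014-05-27", "app.conf", "data"]


def shell_targets(command, listing):
    """Roughly emulate how a shell splits and glob-expands rm's arguments."""
    args = shlex.split(command)[1:]            # drop the "rm" itself
    patterns = [a for a in args if not a.startswith("-")]
    matched = []
    for pattern in patterns:
        matched.extend(fnmatch.filter(listing, pattern))
    return matched


print(shell_targets("rm -rf logs_*", names))
print(shell_targets("rm -rf logs_ *", names))
```

The first call matches only the two log entries; in the second, the stray space turns `*` into its own argument, which matches every entry in the listing.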

Our less-than-savvy Financial Director took it upon himself to restore from tape the bought ledger files to a live system after a slight mishap. Unfortunately, the bought ledger files all started with a 'b' and he managed to restore them to the root of the -nix system instead of the right place, so he mv'd b* to the right location.

All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.

Edit: He had access to the root account to maintain the accounts app (not my call)

I have nightmares about such things.

Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.

As one of the other commenters noted, a ~20 character salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter, I expect this will be the case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.

Sorry - are you telling us you had to reboot all nodes because you swapped a router out? Sounds like you need a network engineer.

And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload a machine to help with OSPF or BGP convergence. Good networking architecture and planning should mitigate anything other than an outage of a couple of minutes. No routing change should ever require a reload of a server or end node.

I believe you were down voted not for what you said, but the way you have said it.

I've been downvoted several times for what I see as relatively minor remarks. The HN readers are a sensitive bunch...

Those who are still posting on HN are orders of magnitude more sensitive than those who post on Imgur. The communities are similar-size, yet Imguraffes are much, much more accepting of my comments. What merits a handful of upvotes there brings a downvote or two on this site.

You're assuming my management has been paying for good networking architecture for the past dozen years.

I believe it. Networking is seen as a commodity now. It's transparent until it fails. There's a whole lot of technical debt lurking out there. I personally have seen the dark shadow of spanning tree suck the light out of DevOps engineers' eyes.

DevOps Borat is going to have a field day today.

Let this be a lesson to linux admins. Re-alias shutdown -r now into something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.

At one point, I worked in a computer lab that was mostly Ultrix machines. The shutdown grace period was specified in minutes ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

Then we got a hp-ux machine in the lab. For some reason, the grace period on that system was in seconds ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

System dax shutting down in 5 seconds.

Cheers for this! Would have saved me so much grief before. Now going around and installing it on the servers I manage (fortunately nothing mission critical, but many remote).

"when I got the SSH windows confused"

I've come close to that as well.

This reminds me of the paradox of being competent vs. a beginner.

It also has parallels in a few things outside computing.

Beginners make different mistakes because they don't know enough to go quickly.

Once you are experienced you fly, similar to the way you drive in a trance without thinking some times.

With power tools I've seen this as well. You tend to take more chances the more experience you have (or even in my case getting cut with an exacto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course there are other types of safety mistakes they could make for sure..)

While a newbie might do rm -fr directory * instead of rm -fr directory*, an experienced user could do that as well [1], simply by going too fast and not thinking "hey, I'm doing something dangerous, let me slow down and check before I auto hit return".

[1] I typically do

  for i in something*; do echo "$i"; done

Then if I like what I see I will up arrow and insert "rm -fr $i" after the echo. Or maybe a read x to pause in between.
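The same preview-first habit can be built into a script. A minimal sketch, assuming a hypothetical helper named `remove_matching` that defaults to a dry run and only deletes regular files on an explicit second pass:

```python
import glob
import os


def remove_matching(pattern, dry_run=True):
    """List files matching pattern; delete them only when dry_run is False."""
    matches = sorted(glob.glob(pattern))
    for path in matches:
        if dry_run:
            print("would remove", path)   # the "echo" pass
        elif os.path.isfile(path):
            os.remove(path)               # the second pass, files only
    return matches
```

Making the destructive mode opt-in means the fast, unthinking invocation is the harmless one.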

(Note: I'm not a sysadmin but I've done sysadmin tasks over many years because it is kind of relaxing in a way..)

I once put `shutdown -h now` (halt) instead of `shutdown -r now` (reboot)

Once I realized what had happened on the production server I ended up calling OVH (and they were helpful but not immediately acting).

It's not a good feeling.

This happened to me once; I don't know if this works on all linux distros but if you quickly follow a halt/shutdown with a "sudo init 6"(reboot) before your ssh-session gets SIGTERMed/KILLed, the box comes back up. This at least worked on some Ubuntu version a few years back.

Give it a try on some system that's not critically important :)

Yeah, but the problem is when you honestly don't realize you called a halting shutdown until the server doesn't come back 5 minutes later and then you review the terminal.

I tend to use /sbin/reboot instead, it amounts to the same (calls shutdown), but it's harder to get it mixed up.

A similar case happened with the Eve Online cluster (~50,000 concurrent users) a couple of years ago. A programmer, who for some reason had access to the live cluster, confused his local development instance with that of the live cluster and issued a shutdown. Luckily they were able to avert the incoming disaster in time (it was a timed shutdown), but jokes are still made about the mistake.


salt '*' system.reboot

> hubot restart all on prod

oh shit i meant stag fuckfuckfuckfuck

So have hubot second guess any changes to production unless you specifically told it you were messing with prod beforehand. Have it wait a few seconds before doing something important and listen for sounds of regret.
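A minimal sketch of that guard as plain decision logic, not real Hubot code. The time window, the grace period, and the `decide` function are all assumptions made for the example.

```python
ARM_WINDOW_SECONDS = 300   # hypothetical: how long "I'm working on prod" stays valid
GRACE_SECONDS = 5          # hypothetical: pause in which the operator can cancel


def decide(environment, armed_at, now):
    """Return ("refuse" | "run", delay_seconds) for a bot command.

    armed_at is when the operator last declared intent to touch production
    (None if never); now is the current time, both in seconds.
    """
    if environment != "prod":
        return ("run", 0)
    if armed_at is None or now - armed_at > ARM_WINDOW_SECONDS:
        return ("refuse", 0)        # prod was not announced beforehand
    return ("run", GRACE_SECONDS)   # announced: still wait and listen for regret
```

A bot using this would sleep for the returned delay and abort if the operator types anything resembling a cancel in that time.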

>hubot restart all on prod

hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"

>Hubot isn't responsible for hosing production because I actually meant staging

hubot: okay, don't say I didn't warn you.

>oh shit i meant stag fuckfuckfuckfuck

hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.

Why in the name of all that is holy do you have Hubot getting access to your production boxen?

Why does that seem like a good idea, ever?

My thought, exactly. Time to set up some good ACLs :-) http://docs.saltstack.com/en/latest/ref/clientacl.html

Looks like the janitor needed somewhere to plug in the vacuum cleaner again...

Not even just the plug. I've had outages from bits flipped simply by the static electricity generated when vacuuming near servers.

5W walkie talkies in a big sports complex with the RF getting into the keyboard controllers and acting like a maniac was punching the keyboard - would eventually hang the servers.

Fix: Replace cheapened-keyboards-with-mylar-film-(not)-screening with older models that had a full metal cage around the keyboard assembly.

You have carpets in your server room??

I assume bash.org?

It's a truly ancient anecdote; it probably predates the Internet.

The first example in RISKS is in 1994: http://catless.ncl.ac.uk/Risks/15.59.html#subj3.1 but the canonical version of the story is in a Cape Town hospital in 1996: http://web.archive.org/web/20040624065333/http://www.legends...

He might be referring to The daily WTF (worse than failure):


Unintentional Mishap while Contractor Unplugs X to fix/maintain Y is a relatively common theme on their list of horror stories.

edit: I think he might actually have meant this one: http://thedailywtf.com/Articles/I-Told-You-So.aspx

And I get 2 downvotes for this? really? downvoters care to explain why, just for asking if it was a reference from bash? Wow... Edit: Thanks to the other 2 posters who provided alternative sources. You learn by asking, no? or at least some of us do..

Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...


Thanks for asking rather than just down-voting. I wanted others to know that this isn't isolated. We have been having issues with their service for a few months now. They never know when there is a problem with hardware, for instance. Joyent support will gladly tell you everything is fine. After you insist, and insist, they will actually have someone look at the underlying infrastructure. Eventually they will acknowledge the problem and fix it (maybe). I believe the monitoring and reporting for their team is flawed or incomplete, which leads to more downtime of affected systems. Just one observation, but we have had three incidents over the past month and a half. Two within a week of each other.

I'm sorry to hear about your experience; we pride ourselves on being able to root-cause problems regardless of where they might be in the stack, but it sounds like your problem didn't get properly escalated. If you want to reach out to me privately (my HN login at acm.org), we can try to figure out what happened here -- with my apologies again for the subpar experience.

"What's this button do?"

As I've always said, "You can never protect a system from a stupid person with root".

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)

If you think that "hire great sysadmins" prevents somebody from fatfingering, you must be hiring from some more evolved species. Nobody is immune to mistakes; preventing this kind of issue is something the infrastructure and procedures should do.

I don't think "just hiring great sysadmins" is possible. People have off-days or are tired or sick, new people get on-boarded, even great people make mistakes, etc.

...or accidentally switch which of the 25 term sessions they had open

I tend to color my production terms in a red background / yellow font scheme. It tends to inspire the tired brain to understand you are in production.
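One way to automate the coloring is to build the prompt from the hostname. A sketch, assuming production hosts can be recognized by a ".prod." component in their name; that naming convention is invented for the example.

```python
RED_BG_YELLOW_FG = "\033[41;33m"   # ANSI SGR: 41 = red background, 33 = yellow text
RESET = "\033[0m"


def prompt_for(hostname):
    """Build a shell prompt string, highlighted when the host looks like production.

    The ".prod." naming convention is an assumption for this example.
    """
    base = f"{hostname}$ "
    if ".prod." in hostname:
        return f"{RED_BG_YELLOW_FG}{base}{RESET}"
    return base
```

If the result is used as a bash PS1, the escape sequences should also be wrapped in `\[` and `\]` so that line editing computes the prompt width correctly.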

Not only do you consider mistakes the province of stupid people doing dumb things, but you're crediting yourself with a proverb about it and suggesting that you possess the ability to sniff these people out from the 'great' ones.

Get a grip, you're recursively full of yourself.

Wow HN seriously!? I never once pretended that I'm able to hire people who don't make mistakes, only that you can't protect systems from administrators who mess up.

Get a grip people.

So don't give anyone root on an entire data center.

Is this like Captain Planet? It's a bit exceptional to divide access to servers of similar type between administrators such that individuals have full access to a portion of the fleet. Do they meet up and put their rings together to roll out updates? What if one of them goes on vacation?

There are keysharing protocols; you can do something like 5 sysadmins have a split of the master key such that any 3 of them can access the master account.
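For the curious, a 3-of-5 split like this can be illustrated with Shamir's secret sharing, one well-known threshold scheme (the comment above does not say which protocol it has in mind). This is a toy sketch over a fixed prime field, with the master key modelled as an integer; it needs Python 3.8 or later for the modular inverse, and anything real should use a vetted tool instead.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split(secret, threshold, shares):
    """Split secret into `shares` points; any `threshold` of them recover it."""
    if not (0 <= secret < PRIME) or not (1 <= threshold <= shares):
        raise ValueError("bad parameters")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover(points):
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Calling `split(key, 3, 5)` gives each of five sysadmins one point, and `recover` on any three of them returns the key; fewer than three points leave the constant term undetermined.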

For day-to-day maintenance of systems, that's crippling. If I need 2 cosigns to run "date" across the fleet while I'm troubleshooting an NTP issue, and then 2 cosigns again to run "service ntpd status", and so forth, my coworkers will have lit my desk on fire long before I fix the clocks.

There are definitely use cases for keysharing systems like you describe: if we're talking about getting access to a database with sensitive information, or signing a new cert that all our systems are about to put their full faith in. But for the day-to-day administrative efforts, it's overkill and ends up being counterproductive: after a certain point, Alice and Bob write scripts that let them hotkey signing off on my requests.

I'm not worried about how crippling that sort of scenario is on a day to day basis, because presumably the company doesn't mind paying a fortune for a bunch of people to sit around to hold one another's keys.

I worry about those policies when the shit hits the fan and you're trying to fix a production problem hobbled by an inability to do stuff without three fingers on every keystroke.

Agreed. Ideally, whatever system is in use for managing infrastructure provides sanity checks while I'm working, but either gets out of my way or can be sidestepped if need be. I don't want to be crippled by technical red tape when things are on fire.

"date" and service status don't typically require root.

I've not needed this, but it's a nice idea. Do you do this with a combination of sudo/PAM|pubkey auth? I can google, but can you push me off in the right direction? Thanks!

I've not been directly involved, so your googling may well be as good as mine; on a quick look you might have to do this manually using ssss (and then each person encrypts their piece with gpg --symmetric or the like).

Actually, capabilities makes it trivial to lock down things like shutdown for admin accounts. A script can do the shutdown instead in a more controlled and less error-prone fashion. Same for network device updates. Abstraction.

That has its own risks. There might be some catastrophe that needs root access on everything to fix, and you can't reach enough people to get it....

Then put the keys to datacenter-wide root somewhere safe (with a manual-ish process to access and use them), but out of the way and with alarms on it (the same alarms that you'd use in the absolute worst situation possible). Make sure anyone using it will be shamed if they don't absolutely have to.

Shame is a terrible tool for ensuring compliance. The people you want to keep will resent the fact that you're using shame as a motivating factor.

If you think keys in a safe is a good idea, ask a Googler about the legend of the Valentine safe. Short version: nobody was able to get into the safe and a locksmith had to come drill it to restore a critical service.

It's also a cautionary tale about testing your DR occasionally.

You're going to need a bigger crew.
