People tend to forget that "fixing it" isn't just technical; it involves process, too. Every new hire who whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.
When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else on the ops team gets a few cheap laughs at the miscreant's expense, you hold a meeting to discuss lessons learned, and you move on.
Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.
Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.
I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.
Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!
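For anyone who hasn't been bitten by this, the trap looks roughly like so (database, table, and column names are made up here; the point is only that the "safety net" isn't one):

mysql mydb <<'SQL'
BEGIN;
UPDATE events SET status = 'archived' WHERE created_at < '2014-01-01';
SELECT COUNT(*) FROM events WHERE status = 'archived';   -- check what changed
ROLLBACK;   -- quietly does nothing: MyISAM tables are non-transactional,
            -- so the UPDATE above was already written to disk
SQL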
Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.
I've seen worse, though. A newish officer spilled his morning coffee into the circuitry of a device with more than 10 zeros on its price tag. It short-circuited immediately.
My account's karma and history exceed your account's on this site, and even worse, this individual comment bears more value than yours! Ooh burn!
Let's post-mortem this lunacy:
1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or something), notice the word "throwaway" in his user name and get really excited that you can stand on your high horse and call him out for hiding behind the shield of internet anonymity when he wants to be a (you think) racist idiot. Even though his comment is a legitimate currency conversion remark. See: https://www.google.com/search?q=1+usd+in+yen
2) He explains that clearly you're mistaken (which, really, seriously, what a kneejerk response from someone who just wanted to show how clever they were, calling out an "asshole") and further explains his account isn't even a "throwaway" in the traditional, trolly sense of the word, citing his account age and the fact that he actively posts from the account on a regular basis.
3) You, caught perhaps in a moment of clarity, though I think I give you too much credit here, realize you were too eager to pounce on the "asshole" for his "Yen" remark, and perhaps you misread it. Your latent erection fading, you counter by explaining that your history and karma are even _more_ impressive, somehow completely avoiding taking responsibility for a completely nonsensical leap in logic and accusation of wrongdoing, while doubling down on your cognitive dissonance.
4) The Aristocrats!
Why, yes... a couple of times... before a perspex arch was less-than-hastily fixed over the button.
shit happens, design for the worst.
"To make error is human. To propagate error to all server in automatic way is #devops"
and my fav
"Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."
It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.
I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.
99.999% sounds stuck in the '90s.
at least they didn't name the operator in question...
If it makes you happy, blame me - I don't mind.
Response and conversation are always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.
IOW, I agree with what I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole it from somewhere else).
It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did and how it caused or contributed to the outage, and then hear them thanked for that contribution to understanding in the post-mortem.
Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.
Internally, you're right. But externally, the company fucked up, not the individual.
It's always the engineer's fault. :(
My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.
By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.
I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.
That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).
There's probably a broader term for operational philosophy like this.
Such an approach would have prevented this incident, where "normal" operations were being performed and all of the servers were accidentally rebooted at once.
I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.
I do think that 99.999% is doable for a properly distributed whole-system across multiple geographically-dispersed datacenters.
I think Joyent has gone wrong in promoting individual instance reliability.
Joyent doesn't use AWS.
That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.
So, if I issued a parallel, rolling 'shutdown the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.
The degree of parallelism was calculated from historical (but manually approved) load levels for each cluster, compared to current load levels. So the rollout runs faster when a cluster is under very low load, and more slowly when load is high.
One way or another, most automation should automatically stop 'doing things' if there are critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
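A crude sketch of the shape of it, serialized for simplicity (the "alerts" CLI, the cluster name, and the host list file are all stand-ins for whatever your monitoring and inventory actually expose, and this ignores the load-based parallelism):

for host in $(cat cluster-hosts.txt); do
    # refuse to touch the next box if the cluster already has critical alerts
    if [ "$(alerts --cluster web01 --severity critical --count)" -gt 0 ]; then
        echo "critical alerts coming in; aborting rollout before $host" >&2
        exit 1
    fi
    ssh "$host" 'sudo shutdown -r +1'
    sleep 120   # give monitoring time to notice if the previous box didn't come back
done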
You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.
Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.
Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.
Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that huge log file that constantly changes gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.
Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.
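The log-cleanup one, for instance, is roughly this shape (paths and filenames invented):

find /opt/app -name '*201[0-9]*' -mtime +30 -delete
# intended to match:  app_log_20140101.txt, app_log_20140102.txt, ...
# also matched:       libapp-nightly-20140315.so  (build timestamp in the filename)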
rm -rf logs_ *     # what he actually ran: the stray space makes * a separate argument
rm -rf logs_*      # what he meant to run
All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.
Edit: He had access to the root account to maintain the accounts app (not my call)
As one of the other commenters noted, a ~20-character Salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter; I expect this was a case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.
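For a sense of scale, something like the following (illustrative only, not what Joyent actually ran) is about all it takes:

salt '*' cmd.run 'reboot'     # tell every minion to reboot, all at once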
I've been downvoted several times for what I see as relatively minor remarks. The HN readers are a sensitive bunch...
Then we got an HP-UX machine in the lab. For some reason, the grace period on that system was in seconds (http://www.polarhome.com/service/man/generic.php?qf=shutdown...)
System dax shutting down in 5 seconds.
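Presumably muscle memory along these lines, where on HP-UX the trailing number is a grace period in seconds rather than the minutes you might expect:

shutdown -r 5     # 5 seconds of warning, not 5 minutes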
I've come close to that as well.
This reminds me of the paradox of being competent vs. a beginner.
It also has parallels in a few things outside computing.
Beginners make different mistakes because they don't know enough to go quickly.
Once you're experienced you fly, similar to the way you sometimes drive in a trance without thinking.
I've seen this with power tools as well. You tend to take more chances the more experience you have (in my case, even getting cut with an X-Acto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course, there are other kinds of safety mistakes they could make, for sure).
While a newbie might do "rm -fr directory *" instead of "rm -fr directory*", an experienced user could do that as well, simply by going too fast and not thinking "hey, I'm doing something dangerous, let me slow down and check before I auto-hit return".
 I typically do
for i in something*; do echo $i; done
Then, if I like what I see, I will up-arrow and insert "rm -fr $i" after the echo. Or maybe add a "read x" to pause in between.
(Note: I'm not a sysadmin, but I've done sysadmin tasks over many years because it's kind of relaxing, in a way.)
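Spelled out with the pause, the habit looks roughly like this:

for i in something*; do
    echo "$i"
    read x        # hit Enter to proceed with this one, Ctrl-C to bail out
    rm -fr "$i"
done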
Once I realized what had happened on the production server, I ended up calling OVH (they were helpful, but didn't act immediately).
It's not a good feeling.
Give it a try on some system that's not critically important :)
oh shit i meant stag
>hubot restart all on prod
hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"
>Hubot isn't responsible for hosing production because I actually meant staging
hubot: okay, don't say I didn't warn you.
>oh shit i meant stag fuckfuckfuckfuck
hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.
Why does that seem like a good idea, ever?
Fix: replace the cheapened keyboards, whose only screening was a mylar film (which isn't screening), with the older models that had a full metal cage around the keyboard assembly.
The first example in RISKS is from 1994 (http://catless.ncl.ac.uk/Risks/15.59.html#subj3.1), but the canonical version of the story is set in a Cape Town hospital in 1996: http://web.archive.org/web/20040624065333/http://www.legends...
"Unintentional mishap while a contractor unplugs X to fix/maintain Y" is a relatively common theme on their list of horror stories.
edit: I think he might actually have meant this one:
You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)
I tend to color my production terminals with a red background / yellow font scheme. It tends to help a tired brain register that you are in production.
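One way to get part of that effect in bash, coloring just the prompt rather than the whole window (the hostname pattern here is made up):

case $(hostname) in
  *prod*) PS1='\[\e[41;93m\]\u@\h:\w\$\[\e[0m\] ' ;;   # red background, bright yellow text
esac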
Get a grip, you're recursively full of yourself.
Get a grip, people.
There are definitely use cases for keysharing systems like you describe: if we're talking about getting access to a database with sensitive information, or signing a new cert that all our systems are about to put their full faith in. But for the day-to-day administrative efforts, it's overkill and ends up being counterproductive: after a certain point, Alice and Bob write scripts that let them hotkey signing off on my requests.
I worry about those policies when the shit hits the fan and you're trying to fix a production problem hobbled by an inability to do stuff without three fingers on every keystroke.
It's also a cautionary tale about testing your DR occasionally.