It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will provide a full postmortem: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are making (and will make) to both the software and our operational procedures to ensure this doesn't happen again (and that recovery is smoother for failure modes of similar scope).
It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.
People tend to forget that "fixing it" isn't just technical; it involves process, too. Every new hire who whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.
Nonsense. Someone has to be operating at the sharp end of the enable prompt, and sooner or later it'll be 0330 and that person will type Ethernet0 when they meant Ethernet1, whatever management you have in place.
When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else in the ops team gets a few cheap laughs at the miscreant's expense, you have a meeting about it, discuss lessons learned, and you move on.
Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.
I've seen generally brilliant people get bitten by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, which of course took a production server down with it.
Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.
I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.
Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!
Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.
Why are you using a throwaway account? Ohh, I just saw the "dollars or Yen" remark. TIL we use throwaway accounts for the times we feel like being assholes, so the non-elites can't track it back to our physical neuroprocessors.
"check my previous responses and my credit score for how you should treat me" ohh what an old-man response. It's too bad the Imgur "downvote everything they ever posted" script doesn't work here on HN, now isn't it?
My account's karma and history exceed your account's on this site, and even worse, this individual comment bears more value than yours! Ooh burn!
Haha, this is a pretty spectacular amount of cognitive dissonance you're demonstrating here.
Let's post-mortem this lunacy:
1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or something), notice the word "throwaway" in his user name and get really excited that you can stand on your high horse and call him out for hiding behind the shield of internet anonymity when he wants to be a (you think) racist idiot. Even though his comment is a legitimate currency conversion remark. See: https://www.google.com/search?q=1+usd+in+yen
2) He explains that clearly you're mistaken (which, really, seriously, what a kneejerk response from someone who just wanted to show how clever they were, calling out an "asshole") and further explains his account isn't even a "throwaway" in the traditional, trolly sense of the word, citing his account age and the fact that he regularly actively posts to the account.
3) You, caught perhaps in a moment of clarity, though I think I give you too much credit here, realize you were too eager to pounce on the "asshole" for his "Yen" remark, and perhaps you misread it. Your latent erection fading, you counter by explaining that your history and karma are even _more_ impressive, somehow completely avoiding taking responsibility for a completely nonsensical leap in logic and accusation of wrongdoing, while doubling down on your cognitive dissonance.
Please keep in mind that "price" means "how many dollars other humans are willing to trade for it right now"; not necessarily any concrete evaluation of the device's functionality compared to a human competitor or human operator...
That sounds so awful. I can't imagine living the rest of my life knowing that I had been a net negative in the world. All of my life's earnings would just be a partial restitution of that one second of destruction.
As a request: It looks like each time the status page is updated, the old UPDATE: <words> is removed. For the future, it would be great if the older updates were preserved so that people looking back could understand the chain of events, rather than just seeing the first / last pieces.
Our internal culture is such that everyone on the team would rather take the blame for something than accuse someone else of doing it. Pointing fingers is shitty, and not something you do to someone. You fix the problem and then you move on.
At my $DAYJOB, we are always careful to figure out exactly what happened, including by whom. It's not to assign personal blame, but I believe it's critical that everyone agrees on the facts (who, what, when, where, and [if possible] why).
Response and conversation are always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.
IOW, I agree with what I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).
It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did and how it caused or contributed to the outage, and then hear them thanked for their contribution to understanding in the post-mortem.
Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.
Sure, blame the engineers. Give people power and they use it badly, and you blame the engineers for giving too much power. Don't give enough power, and sysadmins/users bitch and yell, "Why don't we have enough power? We're not children."
This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that make no sense at design time.
By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.
I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.
That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).
There's probably a broader term for operational philosophy like this.
...and the operations version of that is that all normal operations are performed under restricted permissions that cannot "do anything", while the full "do anything" permissions are only broken out during a major crisis.
Such an approach would have prevented this incident, where "normal" operations were being performed and ALL the servers were accidentally rebooted at once.
I can't figure out how your comment that "understanding mistakes will happen" is compatible with 99.999% uptime.
I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.
I do think that 99.999% is doable for a properly distributed whole-system across multiple geographically-dispersed datacenters.
I think Joyent has gone wrong in promoting individual instance reliability.
It's a combined fault. Clearly the operator made a mistake, but the system shouldn't have allowed such a calamitous operation without at least three levels of "Are you sure?" (or something smarter, like "Confirm how many servers you would like to reboot:") before letting you take down thousands of servers.
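The "confirm how many" idea is cheap to build. A minimal shell sketch, with the target list and wording invented for illustration:

    # Hypothetical guard in front of a mass-reboot tool: make the operator
    # state the blast radius before anything happens.
    n=$(wc -l < targets.txt | tr -d ' ')
    read -r -p "About to reboot $n servers. Type that number to continue: " confirm
    if [ "$confirm" != "$n" ]; then
        echo "Aborted." >&2
        exit 1
    fi
    # ...only past this point does the actual reboot logic run.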
The 'devops' automation I built at my last company (and am building at my current company) had monitoring fully integrated into it.
That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.
So, if I issued a parallel, rolling 'shutdown the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.
Parallelism was calculated from historical (but manually approved) load levels for each cluster, compared to current load levels. So the rollout runs faster if there's very low load on a cluster, or very slowly if there's high load on a cluster.
One way or another, most automation should automatically stop 'doing things' if there are critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
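A minimal sketch of that gate, where cluster_safe_to_change stands in for whatever your monitoring system actually exposes; it should exit 0 only when it has fetched current alert data and found nothing critical:

    # Sketch only: CLUSTER, hosts.txt, and cluster_safe_to_change are
    # hypothetical stand-ins for real inventory and monitoring queries.
    CLUSTER=web-prod
    while read -r host; do
        if ! cluster_safe_to_change "$CLUSTER"; then
            echo "No clean bill of health for $CLUSTER -- stopping the rollout." >&2
            exit 1
        fi
        ssh "$host" 'sudo systemctl restart myservice'   # the per-box 'write' change
        sleep 60   # give monitoring time to notice if that change hurt
    done < hosts.txt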
As a devops engineer, I can't justify building any automated way to down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when changing out (relatively) major infrastructure pieces, such as our Juniper router.
You don't intentionally build an automated way to take down all your servers at once.
You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.
Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.
Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.
Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that huge log file that constantly changes gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.
Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.
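The log-cleanup trap, for instance, can hide in something this innocent-looking (paths and naming convention invented for illustration):

    # "Delete timestamped files older than a month" -- safe on the test box,
    # which runs release builds...
    find /opt/app -name '*-20[0-9][0-9][0-9][0-9][0-9][0-9]*' -mtime +30 -delete
    # ...not so safe on a box where nightly builds leave libraries named
    # like libplugin-20240101.so sitting next to the logs.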
Our less-than-savvy Financial Director took it upon himself to restore from tape the bought ledger files to a live system after a slight mishap. Unfortunately, the bought ledger files all started with a 'b' and he managed to restore them to the root of the *nix system instead of the right place, so he mv'd b* to the right location.
All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.
Edit: He had access to the root account to maintain the accounts app (not my call)
Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.
As one of the other commenters noted, a ~20-character Salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter; I expect this is a case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.
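For the curious, the ~20-character footgun in Salt looks roughly like this (a guess at the general failure mode, not a claim about Joyent's actual tooling or command):

    # Reboot every minion the master can see:
    salt '*' system.reboot
    # versus the same verb with the narrower target that was probably intended:
    salt -G 'role:build' system.reboot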
And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload a machine to help with OSPF or BGP convergence. Good network architecture and planning should mitigate anything beyond a couple of minutes of outage. No routing change should ever require a reload of a server or end node.
Those who are still posting on HN are orders of magnitude more sensitive than those who post on Imgur. The communities are similar-size, yet Imguraffes are much, much more accepting of my comments. What merits a handful of upvotes there brings a downvote or two on this site.
I believe it. Networking is seen as a commodity now. It's transparent until it fails. There's a whole lot of technical debt lurking out there. I have personally seen the dark shadow of spanning tree suck the light out of DevOps engineers' eyes.
Let this be a lesson to Linux admins: re-alias "shutdown -r now" to something else on production servers. I once took down access to about 6,000 servers because I ran the server-decommissioning script on our jump box when I got the SSH windows confused.
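One low-tech version of that guard (molly-guard on Debian-family systems does a more thorough job of the same idea): wrap shutdown in a shell function that makes you type the hostname first.

    # Sketch for root's shell profile on production boxes: prove you know
    # which machine you're on before you shut it down.
    shutdown() {
        read -r -p "Type this host's name to confirm shutdown: " answer
        if [ "$answer" = "$(hostname -s)" ]; then
            command shutdown "$@"
        else
            echo "Hostname mismatch -- not shutting down." >&2
        fi
    }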
This reminds me of the paradox of being competent vs. a beginner.
It also has parallels in a few things outside computing.
Beginners make different mistakes because they don't know enough to go quickly.
Once you are experienced you fly, similar to the way you sometimes drive in a trance without thinking.
With power tools I've seen this as well. You tend to take more chances the more experience you have (or, in my case, even getting cut with an X-Acto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course, there are other types of safety mistakes they could make, for sure...).
While a newbie might do "rm -fr directory *" instead of "rm -fr directory*", an experienced user could do that as well, simply by going too fast and not thinking, "Hey, I'm doing something dangerous; let me slow down and check before I hit return on autopilot."
I typically do

    for i in something*; do echo "$i"; done

Then if I like what I see I will up-arrow and insert "rm -fr $i" after the echo. Or maybe a "read x" to pause in between.
(Note: I'm not a sysadmin, but I've done sysadmin tasks over many years because it is kind of relaxing in a way...)
This happened to me once; I don't know if this works on all Linux distros, but if you quickly follow a halt/shutdown with a "sudo init 6" (reboot) before your SSH session gets SIGTERMed/KILLed, the box comes back up. This at least worked on some Ubuntu version a few years back.
Give it a try on some system that's not critically important :)
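In other words, something like this, typed very quickly (no promises it still works on current distros or init systems):

    sudo shutdown -h now    # the mistake: a halt on the wrong box
    sudo init 6             # the save: queue a reboot before the session dies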
A similar case happened with the Eve Online cluster (~50,000 concurrent users) a couple of years ago. A programmer, who for some reason had access to the live cluster, confused his local development instance with the live cluster and issued a shutdown. Luckily they were able to avert the incoming disaster in time (it was a timed shutdown), but jokes are still made about the mistake.
So have hubot second-guess any changes to production unless you specifically told it you were messing with prod beforehand. Have it wait a few seconds before doing something important and listen for sounds of regret.
>hubot restart all on prod
hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"
>Hubot isn't responsible for hosing production because I actually meant staging
hubot: okay, don't say I didn't warn you.
>oh shit i meant stag fuckfuckfuckfuck
hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.
And I get 2 downvotes for this? really? downvoters care to explain why, just for asking if it was a reference from bash? Wow...
Edit: Thanks to the other 2 posters who provided alternative sources. You learn by asking, no? or at least some of us do..
Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...
Thanks for asking rather than just down-voting. I wanted others to know that this isn't isolated. We have been having issues with their service for a few months now. They never know when there is a problem with hardware, for instance. Joyent support will gladly tell you everything is fine. After you insist, and insist again, they will actually have someone look at the underlying infrastructure. Eventually they will acknowledge the problem and fix it (maybe). I believe the monitoring and reporting for their team is flawed or incomplete, which leads to more downtime of affected systems. Just one observation, but we have had three incidents over the past month and a half, two within a week of each other.
I'm sorry to hear about your experience; we pride ourselves on being able to root-cause problems regardless of where they might be in the stack, but it sounds like your problem didn't get properly escalated. If you want to reach out to me privately (my HN login at acm.org), we can try to figure out what happened here -- with my apologies again for the subpar experience.
If you think that "hire great sysadmins" prevents somebody from fatfingering, you must be hiring from some more evolved species. Nobody is immune to mistakes; preventing this kind of issue is something the infrastructure and procedures should do.
Not only do you consider mistakes the province of stupid people doing dumb things, but you're crediting yourself with a proverb about it and suggesting that you possess the ability to sniff these people out from the "great" ones.
Is this like Captain Planet? It's a bit exceptional to divide access to servers of similar type between administrators such that each individual has full access to only a portion of the fleet. Do they meet up and put their rings together to roll out updates? What if one of them goes on vacation?
For day-to-day maintenance of systems, that's crippling. If I need 2 cosigns to run "date" across the fleet while I'm troubleshooting an NTP issue, and then 2 cosigns again to run "service ntpd status", and so forth, my coworkers will have lit my desk on fire long before I fix the clocks.
There are definitely use cases for keysharing systems like you describe: if we're talking about getting access to a database with sensitive information, or signing a new cert that all our systems are about to put their full faith in. But for the day-to-day administrative efforts, it's overkill and ends up being counterproductive: after a certain point, Alice and Bob write scripts that let them hotkey signing off on my requests.
I'm not worried about how crippling that sort of scenario is on a day-to-day basis, because presumably the company doesn't mind paying a fortune for a bunch of people to sit around holding one another's keys.
I worry about those policies when the shit hits the fan and you're trying to fix a production problem hobbled by an inability to do stuff without three fingers on every keystroke.
Agreed. Ideally, whatever system is in use for managing infrastructure provides sanity checks while I'm working, but either gets out of my way or can be sidestepped if need be. I don't want to be crippled by technical red tape when things are on fire.
I've not been directly involved, so your googling may well be as good as mine; on a quick look you might have to do this manually using ssss (and then each person encrypts their piece with gpg --symmetric or the like).
Actually, capabilities make it trivial to lock down things like shutdown for admin accounts. A script can do the shutdown instead, in a more controlled and less error-prone fashion (rough sketch below). Same for network device updates. Abstraction.
Then put the keys to datacenter-wide root somewhere safe (with a manual-ish process to access and use them), but out of the way and with alarms on it (the same alarms that you'd use in the absolute worst situation possible). Make sure anyone using it will be shamed if they don't absolutely have to.
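A very rough approximation of the script-instead-of-raw-shutdown idea, expressed in sudo terms rather than a true capability system (the group, the wrapper path, and its policy are all invented for illustration):

    # /etc/sudoers.d/ops -- admins get a vetted wrapper, not raw shutdown.
    %ops ALL=(root) NOPASSWD: /usr/local/sbin/safe-reboot
    # safe-reboot can then require a ticket number, refuse to run while
    # critical alerts are firing, rate-limit itself, and log who asked.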
If you think keys in a safe is a good idea, ask a Googler about the legend of the Valentine safe. Short version: nobody was able to get into the safe and a locksmith had to come drill it to restore a critical service.
It's also a cautionary tale about testing your DR occasionally.