

Joyent us-east-1 rebooted due to operator error - hypervisor
https://help.joyent.com/entries/40957424-Transient-availability-issues-in-US-East-1-data-center
Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted.  Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time.  We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational.  We will be providing frequent updates until the issue is resolved.
======
bcantrill
It should go without saying that we're mortified by this. While the immediate
cause was operator error, there are broader systemic issues that allowed a fat
finger to take down a datacenter. As soon as we reasonably can, we will be
providing a full postmortem of this: how this was architecturally possible,
what exactly happened, how the system recovered, and what improvements we
are/will be making to both the software and to operational procedures to
assure that this doesn't happen in the future (and that the recovery is
smoother for failure modes of similar scope).

~~~
mixologic
I feel bad for the person who made the mistake. Even though it's obviously a
systemic problem, and highly unlikely to be an act of negligence, I'm sure
he/she doesn't feel too hot right now.

~~~
socceroos
My thoughts exactly. Poor fella.

I've seen worse though. A newish officer spilled his morning coffee into the
circuitry of a device worth over 10 zeros. It immediately short-circuited.

~~~
jsmthrowaway
You know you could build five space shuttles with ten zeros, right? Are we
talking dollars or Yen?

~~~
stephengillie
Why are you using a throwaway account? Ohh, I just saw the "dollars or Yen"
remark. TIL we use throwaway accounts for the times we feel like being
assholes, so the non-elites can't track it back to our physical
neuroprocessors.

~~~
jsmthrowaway
I'm not sure why I'm dignifying this Reddit drivel with a response, but my
karma and account age should be your hint that you're barking up the wrong
tree.

~~~
stephengillie
"check my previous responses and my credit score for how you should treat me"
ohh what an old-man response. It's too bad the Imgur "downvote everything they
ever posted" script doesn't work here on HN, now isn't it?

My account's karma and history exceed your account's on this site, and even
worse, this individual comment bears more value than yours! Ooh burn!

~~~
disillusioned
Haha, this is a pretty spectacular amount of cognitive dissonance you're
demonstrating here.

Let's post-mortem this lunacy:

1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or
something), notice the word "throwaway" in his user name and get really
excited that you can stand on your high horse and call him out for hiding
behind the shield of internet anonymity when he wants to be a (you think)
racist idiot. Even though his comment is a legitimate currency conversion
remark. See:
https://www.google.com/search?q=1+usd+in+yen

2) He explains that clearly you're mistaken (which, really, seriously, what a
kneejerk response from someone who just wanted to show how clever they were,
calling out an "asshole") and further explains his account isn't even a
"throwaway" in the traditional, trolly sense of the word, citing his account
age and the fact that he regularly actively posts to the account.

3) You, caught perhaps in a moment of clarity, though I think I give you too
much credit here, realize you were too eager to pounce on the "asshole" for
his "Yen" remark, and perhaps you misread it. Your latent erection fading, you
counter by explaining that your history and karma are even _more_ impressive,
somehow completely avoiding taking responsibility for a completely nonsensical
leap in logic and accusation of wrongdoing, while doubling down on your
cognitive dissonance.

4) The Aristocrats!

~~~
Gigablah
It seems he has a bit of misplaced sensitivity when it comes to the Japanese:
https://news.ycombinator.com/item?id=7555232

------
lukasm
Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is
#devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is
already wrong but you not have Nagios alert of it yet."

------
alrs
Joyent's messaging about "we're cloud, but with perfect uptime" was always
broken.

It's mildly gross that the current messaging sounds like they're throwing a
sysadmin under the bus. If fat fingers can down a data center, that's an
engineering problem.

I care about an object store that never loses data and an API that always has
an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

~~~
knodi
Sure, blame the engineers. If you give power and people use it badly, you
blame the engineer for giving too much power. If you don't give enough power,
sysadmins/users bitch and yell: "why don't we have enough power? We're not
children."

It's always the engineer's fault. :(

~~~
alrs
Systems engineers, software engineers, architects, whatever. We're all in the
same gang.

My point is that the problem in this case is likely the system's design, not
one engineer's typing abilities.

~~~
jsmthrowaway
This comes down to operational philosophy, in the end. The point you're
dancing around is whether the system should permit grave actions that don't
make any sense when you're designing the system.

By the rules, every single system on a commercial aircraft has a circuit
breaker. Pilots make the "what if X catches on fire?" case, which is actually
pretty compelling. However, that also means there are several switches
overhead that will ostensibly crash the airplane if pulled. Pilots lobby very
strongly for the aircraft not to fight them in any way because they are the
only ones with the data, in the moment, now. They have final command over the
aircraft in every way.

I use this to point out that as you're designing systems for operations people
-- something we're increasingly doing ourselves as devops/SRE takes hold --
you might think you can anticipate every scenario and design suitable
safeguards into the system. However, sometimes, when Halley's Comet refracts
some moonlight into swamp gas and takes your fleet down, you as an operator
have to do some really crazy shit. It's in that moment, when all hell has
broken loose, I'm at the helm, and based on the data available to me I have
made a decision to shoot the system in the head: if the system fights me and
prolongs an outage because we argued about whether we'd ever need to reboot a
fleet all at once, I'm replacing the system as the first item in my
postmortem. If you make me walk row to row flipping PDUs, we're going to have
words.

That's just my philosophy. Give the operators the knives and let them cut
themselves, trusting that you've hired smart people and understanding mistakes
will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me
for a physical key, even. But if you ever prevent me from doing what I know
must be done, you are in my way. I have yet to meet a system that is smarter
than an operator when the shit hits the fan ( _especially_ when the shit hits
the fan).

There's probably a broader term for operational philosophy like this.

~~~
alrs
I tend to agree with you, with the caveat that you can't have this philosophy
and sell your customers 99.999% uptime[0].

[0] http://www.joyent.com/products/compute-service/features/linux

~~~
jsmthrowaway
I disagree wholeheartedly. Your operational philosophy complements your SLA
goals; it doesn't force them.

~~~
alrs
I can't figure out how your comment that "understanding mistakes will happen"
is compatible with 99.999% uptime.

I'm of the opinion that 99.999% for an individual instance isn't particularly
achievable in a commodity hosting environment. That kind of uptime doesn't
leave much room for the mistakes that you and I both anticipate.
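(For scale: five nines allows roughly 5.3 minutes of downtime per year, since
365.25 × 24 × 60 × 0.00001 ≈ 5.26 minutes.)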

I do think that 99.999% is doable for a properly distributed whole-system
across multiple geographically-dispersed datacenters.

I think Joyent has gone wrong in promoting individual instance reliability.

~~~
elijahwright
Always ask how numbers like that are computed.

------
Diederich
The 'devops' automation I made at my last company (and am building at my
current company) had monitoring fully integrated into the system automation.

That is, 'write'-style automation changes (as opposed to most
'remediation'-style changes) would only proceed, on a box-by-box basis, if the
affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shutdown the system' command to all
boxes, it would only take down a portion of all of the boxes before
automatically aborting because of critical monitoring alerts.

Parallelism was calculated from historical (but manually approved) load levels
for each cluster, compared to current load levels. So a parallel run proceeds
faster if there's very low load on a cluster, or more slowly if there's high
load.

One way or another, most automation should automatically stop 'doing things'
if there are critical alerts coming in. Or, put another way, most automation
should not be able to move forward unless it can verify that it has current
alert data, and that none of that data indicates critical problems.
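
A minimal sketch of that gating idea (hypothetical, not the actual tooling;
"cluster_is_healthy" stands in for a hook into whatever alerting you run):

    #!/bin/sh
    # Apply a command box by box, re-checking monitoring before each box
    # and aborting the moment any critical alert is firing.
    for host in $(cat cluster-hosts.txt); do
        cluster_is_healthy || { echo "critical alerts firing; aborting" >&2; exit 1; }
        ssh "$host" "$@"
    done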

------
jameshart
DevOps means being able to take out an entire datacenter with a single
keystroke...

~~~
stephengillie
As a Devops, I can't justify building any automated way to down or restart all
of my systems at once. We've only had to do that to resolve router
reconvergence storms when changing out (relatively) major infrastructure
pieces, such as our Juniper router.

~~~
michaelt
You don't intentionally build an automated way to take down all your servers
at once.

You build a way to automatically perform some mundane standard procedure, like
propagating a new firewall rule to all your systems at once. Then you
accidentally propagate a rule that blocks all inbound ports. Huh, when I
tested locally I didn't notice that.

Or you build a way to automatically delete timestamped log files more than a
month old. And when it runs in production, it also deletes critical libraries
which have the build timestamp in their filename. Ah, the test server was
running a nightly build instead of a release so the files were named
differently.

Or you build a way to automatically deploy the post-heartbleed replacement
certificates to all your TLS servers, and only after you do that you find you
didn't deploy the replacement corporate CA certificate to all the clients.
Hmm, the test environment has a different CA arrangement, so testers don't get
the private keys of prod certificates.

Or you build a way to retain timestamped snapshots of all your files, every
five minutes, so you can roll back anything - then find that huge log file
that constantly changes gets snapshotted every time, and everything is hanging
because of lack of disk space. Oh, production does get a lot more traffic to
log, now I think about it.

Or you do any of a hundred other things that seem like simple, low risk
operations until you realise they aren't.

~~~
codexon
Once I typed

    rm -rf logs_ *

instead of

    rm -rf logs_*

~~~
linker3000
Our less-than-savvy Financial Director took it upon himself to restore from
tape the bought ledger files to a live system after a slight mishap.
Unfortunately, the bought ledger files all started with a 'b' and he managed
to restore them to the root of the *nix system instead of the right place, so
he mv'd b* to the right location.

All was well until a scheduled maintenance restart a few weeks later and we
(eventually) discovered that /boot and /bin were AWOL.

Edit: He had access to the root account to maintain the accounts app (not my
call)

------
shiftpgdn
Let this be a lesson to Linux admins: re-alias `shutdown -r now` to something
else on production servers. I once took down access to about 6000 servers
because I ran the script to decommission servers on our jump box when I got
the SSH windows confused.
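
A minimal sketch of that aliasing idea, assuming a bash login shell (the path
and the confirmation scheme are illustrative, not a standard tool):

    # e.g. /etc/profile.d/confirm-shutdown.sh
    # Shadow shutdown with a function that demands the hostname first,
    # so a command typed into the wrong SSH window gets caught.
    shutdown() {
        printf 'This is %s. Type the hostname to confirm: ' "$(hostname)"
        read -r answer
        [ "$answer" = "$(hostname)" ] || { echo 'Aborted.' >&2; return 1; }
        command shutdown "$@"
    }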

~~~
cordite
I once typed `shutdown -h now` (halt) instead of `shutdown -r now` (reboot).

Once I realized what had happened on the production server, I ended up calling
OVH (they were helpful, but couldn't act immediately).

It's not a good feeling.

~~~
smtddr
This happened to me once. I don't know if this works on all Linux distros,
but if you quickly follow a halt/shutdown with a "sudo init 6" (reboot) before
your ssh-session gets SIGTERMed/KILLed, the box comes back up. This at least
worked on some Ubuntu version a few years back.

Give it a try on some system that's not critically important :)
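
On a disposable box, the sequence is something like this (whether the race is
winnable depends on the distro and init system, as noted above):

    sudo shutdown -h now   # oops, meant -r
    sudo init 6            # immediately follow up: runlevel 6 means reboot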

~~~
cordite
Yeah, but the problem is when you honestly don't realize you called a halting
shutdown until the server doesn't come back five minutes later and you review
the terminal.

------
dharbin
salt '*' system.reboot

~~~
quickdry21
> hubot restart all on prod

oh shit i meant stag fuckfuckfuckfuck

~~~
qbrass
So have hubot second-guess any changes to production unless you specifically
told it you were messing with prod beforehand. Have it wait a few seconds
before doing something important and listen for sounds of regret.

>hubot restart all on prod

hubot: > say "Hubot isn't responsible for hosing production because I actually
meant staging"

>Hubot isn't responsible for hosing production because I actually meant
staging

hubot: okay, don't say I didn't warn you.

>oh shit i meant stag fuckfuckfuckfuck

hubot: I hadn't started yet, but I'm doing it anyway just to teach you a
lesson.

------
jordanthoms
Looks like the janitor needed somewhere to plug in the vacuum cleaner again...

~~~
saganus
I assume bash.org?

~~~
gknoy
He might be referring to The daily WTF (worse than failure):

http://thedailywtf.com/Articles/I-Didnt-Do-Anything.aspx

Unintentional Mishap while Contractor Unplugs X to fix/maintain Y is a
relatively common theme on their list of horror stories.

edit: I think he might actually have meant this one:
http://thedailywtf.com/Articles/I-Told-You-So.aspx

------
devinegan
Joyent has been having some serious issues over the past month or two. I am
not sure if it is growing pains, bad luck or what is happening, but we had
already lost faith and trust in their Cloud prior to today. This is the nail
in the coffin from our perspective. Moving on...

~~~
rincebrain
How so?

~~~
devinegan
Thanks for asking rather than just down-voting. I wanted others to know that
this isn't isolated. We have been having issues with their service for a few
months now. They never know when there is a problem with hardware, for
instance. Joyent support will gladly tell you everything is fine. After you
insist, and insist, they will actually have someone look at the underlying
infrastructure. Eventually they will acknowledge the problem and fix it
(maybe). I believe the monitoring and reporting for their team is flawed or
incomplete which leads to more downtime of affected systems. Just one
observation, but we have had three incidents over the past month and a half.
Two within a week of each other.

~~~
bcantrill
I'm sorry to hear about your experience; we pride ourselves on being able to
root-cause problems regardless of where they might be in the stack, but it
sounds like your problem didn't get properly escalated. If you want to reach
out to me privately (my HN login at acm.org), we can try to figure out what
happened here -- with my apologies again for the subpar experience.

------
shanselman
"What's this button do?"

------
SEJeff
As I've always said, "You can never protect a system from a stupid person with
root".

You can limit the carnage and mitigate this type of thing, but you can't fully
protect against sysadmins doing dumb things (unless you just hire great
sysadmins).

~~~
wmf
So don't give anyone root _on an entire data center_.

~~~
akerl_
Is this like Captain Planet? It's a bit exceptional to divide access to
servers of similar type between administrators such that each individual has
full access to only a portion of the fleet. Do they meet up and put their
rings together to roll out updates? What if one of them goes on vacation?

~~~
lmm
There are keysharing protocols; you can do something like give 5 sysadmins
splits of the master key such that any 3 of them together can access the
master account.
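
Shamir's secret sharing is the standard construction for this; as a concrete
sketch, the `ssss` command-line tool (assuming it's available) does a 3-of-5
split:

    # Split a secret (typed at a prompt) into 5 shares,
    # any 3 of which suffice to reconstruct it.
    ssss-split -t 3 -n 5

    # Any 3 shareholders combine their shares to recover it.
    ssss-combine -t 3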

~~~
akerl_
For day-to-day maintenance of systems, that's crippling. If I need 2 cosigns
to run "date" across the fleet while I'm troubleshooting an NTP issue, and
then 2 cosigns again to run "service ntpd status", and so forth, my coworkers
will have lit my desk on fire long before I fix the clocks.

There are definitely use cases for keysharing systems like you describe: if
we're talking about getting access to a database with sensitive information,
or signing a new cert that all our systems are about to put their full faith
in. But for the day-to-day administrative efforts, it's overkill and ends up
being counterproductive: after a certain point, Alice and Bob write scripts
that let them hotkey signing off on my requests.

~~~
rodgerd
I'm not worried about how crippling that sort of scenario is on a day-to-day
basis, because presumably the company doesn't mind paying a fortune for a
bunch of people to sit around holding one another's keys.

I worry about those policies when the shit hits the fan and you're trying to
fix a production problem hobbled by an inability to do stuff without three
fingers on every keystroke.

~~~
akerl_
Agreed. Ideally, whatever system is in use for managing infrastructure
provides sanity checks while I'm working, but either gets out of my way or can
be sidestepped if need be. I don't want to be crippled by technical red tape
when things are on fire.

