
Postmortem for outage of us-east-1 - akerl_
http://www.joyent.com/blog/postmortem-for-outage-of-us-east-1-may-27-2014
======
jffry
It's great to see a postmortem with this level of detail, fairly quickly. It's
also great to see Joyent hang the blame on the system that allowed rebooting
every server, and the poor recovery from that failure, rather than continuing
to throw the operator under the bus:

    
    
      "...we will be rethinking what tools are necessary over
      the coming days and weeks so that "full power" tools
      are not the only means by which to accomplish routine tasks."

~~~
blantonl
More importantly, I hope that the person that issued this command still keeps
his job. He learned an important lesson and unless there is sheer incompetence
here, this individual will have a medal on his chest indicating he has been in
the worst "combat" that a sysadmin can endure.

Everyone makes mistakes, and judging by the language of the postmortem it
appears it was just that.

Would love to have that sysadmin on my team, because he will _never_ do that
again....

~~~
bcantrill
We debated whether or not to make this explicit in the postmortem, but yes,
the operator in question still has their job, and for exactly the reasons that
you outlined: it was an honest mistake, they were deeply apologetic (as one
might imagine) -- and we know that they (certainly!) won't be making that
mistake again. Mistakes like this are their own punishment; additional
punitive action serves only to instill fear rather than effect the changes
necessary to not repeat the failure.

~~~
dmourati
"Five why's" might be appropriate to suss out how the ultimate mistake was
even possible.

------
incision
_> 'In our experience, platforms with this network device will encounter this
boot-time issue about 10% of the time. The mitigation for this is for an
operator to simply initiate another reboot, which we performed on those
afflicted nodes as soon as we identified them.'_

This bit bothers me more than anything else. It's not just a die roll
configuration, but a known one with an operator required to do the re-rolling.

Everything is a value judgement I guess, but knowingly leaving that one
'mitigated' would drive me insane.

~~~
bcantrill
Sorry if we didn't adequately convey our frustration with this particular
issue. It's one that's been with us for a long time and it absolutely sucks --
and after trying (and failing) to work with the vendor to get the issue
understood (it's essentially a firmware-level issue), we ultimately decided to
move away from that particular NIC vendor entirely. If we could wave a wand
and be rid of these particular parts, we gladly would -- but until then, this
transient boot-time issue needs to be manually mitigated with an additional
reboot.

~~~
AceJohnny2
Why the shyness about saying the brand's name? If their product and support
was subpar to the point of blacklisting the entire vendor, it could be useful
to spread the information to A) warn others of potential problems and B) put
pressure on the vendor to improve their products.

~~~
bcantrill
I'm with you, but somehow a postmortem for our own outage seemed like the
wrong place to name-and-shame a vendor...

~~~
insaneirish
If I had to guess, it's Broadcom. While their merchant switching ASICs
(Trident+, Trident2) have become good enough to displace most custom spun
ASICs for 10 Gbps and 40 Gbps switching, their NIC hardware has long been
somewhat of a disaster. Interesting to note is that Broadcom has basically
sold the NIC business to QLogic:
[http://www.broadcom.com/press/release.php?id=s832628](http://www.broadcom.com/press/release.php?id=s832628)

------
bignaj
> _The command to reboot the select set of new systems that needed to be
> updated was mis-typed, and instead specified all servers in the datacenter.
> Unfortunately the tool in question does not have enough input validation to
> prevent this from happening without extra steps /confirmation, and went
> ahead and issued a reboot command to every server in us-east-1 availability
> zone without delay._

"To make error is human. To propagate error to all server in automatic way is
#devops." -@DEVOPS_BORAT [1]

[1][https://twitter.com/DEVOPS_BORAT/status/41587168870797312](https://twitter.com/DEVOPS_BORAT/status/41587168870797312)
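The "input validation" fix the postmortem promises is easy to sketch. Here's a hedged example (every name here is invented; the postmortem says nothing about how their actual tool works): a wrapper that refuses a large blast radius unless the operator explicitly re-types the target count.

```shell
#!/bin/sh
# Hypothetical guard for a fleet-management tool: small batches pass
# through, but anything datacenter-sized demands an extra confirmation
# step. The threshold and prompt wording are assumptions for the sketch.
MAX_UNCONFIRMED=10

confirm_targets() {
    # $1 = number of hosts the command is about to touch
    if [ "$1" -gt "$MAX_UNCONFIRMED" ]; then
        printf 'This will reboot %s servers. Type the count to confirm: ' "$1" >&2
        read -r answer
        [ "$answer" = "$1" ] || { echo 'Mismatch; aborting.' >&2; return 1; }
    fi
    return 0
}

# A routine batch goes straight through...
confirm_targets 3 && echo "rebooting 3 hosts"

# ...while a fat-fingered "every server" list forces the operator to
# restate the blast radius before anything is issued.
echo 9999 | confirm_targets 9999 && echo "rebooting 9999 hosts"
```

The point isn't the specific prompt; it's that the default path can't reach "reboot everything" without a deliberate second action.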

------
adamfeldman
I don't have a lot of experience with datacenters, and am trying to understand
this sentence:

"Because there was a simultaneous reboot of every system in the datacenter,
there was extremely high contention on the TFTP boot infrastructure, which
like all of our infrastructure, normally has throttles in place to ensure that
it cannot run away with a machine."

What does "cannot run away with a machine" mean? Why would you want to by
default restrict the speed at which that system runs?

~~~
bcantrill
The throttle that we're referring to there is a CPU throttle. When we
provision an OS-virtualized instance, there is a default throttle to prevent
it from consuming a disproportionate amount of CPU on the box. The instance
that runs TFTP on the headnode was provisioned as a relatively small instance
(it needs very little DRAM), which also gave it (by default) a CPU throttle
that restricted its CPU utilization. Normally, of course, this isn't an issue
-- but normally we don't try to TFTP boot the entire datacenter at once. This
issue was obvious immediately (thanks, DTrace!), and we resolved it on the
headnode by manually raising the throttle, and will be making the fix in
SmartDataCenter itself as well. Does that answer your question?

~~~
adamfeldman
Ah got it! The control plane has instance restrictions similar to customer
nodes, since it uses some of the same infrastructure.

~~~
bcantrill
Exactly; it uses all of the same infrastructure, actually -- it's just
provisioned on a different network (namely, the admin network). It's also
worth noting that OS-based virtualization helped us here not only because of
the global visibility we get with DTrace (which immediately indicated that
TFTP was waiting for CPU), but also because we could dynamically adjust the
throttle and simply give it more CPU without having to bounce the box and
interrupt all of the TFTP booting in progress. It was a small amount of solace
on what was easily the worst day we've had in a while, if not ever...
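For the curious, "dynamically adjusting the throttle" on an illumos/SmartOS box looks roughly like the following. This is a sketch, not what Joyent actually ran: the zone name is invented and the cap values are assumptions. `zone.cpu-cap` is expressed as a percentage of one CPU, and `prctl` changes it on the live zone with no reboot required.

```shell
# Inspect the current CPU cap on a (hypothetical) zone named tftp0:
prctl -n zone.cpu-cap -i zone tftp0

# Replace the cap with 400 (i.e. four full CPUs), effective immediately:
prctl -n zone.cpu-cap -r -v 400 -i zone tftp0
```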

------
JBiserkov
>The command to reboot the select set of new systems that needed to be updated
was mis-typed, and instead specified all servers in the datacenter.

Substitute 'reboot' with 'upgrade to Win 7' and 'datacenter' with
'university' and you get a story from about a month ago.

    
    
         rm ./*.* - delete all files in current dir
    
         rm /*.* - never do this
    

Can you spot the difference? A colleague did the latter on our testing server.
No clients were disturbed, but us devs were left working on things that can
run locally (much more pleasant stuff for sure, yet the schedule suffered a
lot).

~~~
jacquesm
Your 'never do this' line would erase initrd.img on most machines, which you'd
only find out about after the next reboot if nobody told you about it (and
good luck getting that fixed).

There are a large number of varieties of this particular error. Some with
terrible results.

rm -rf * .bak

for instance (especially when executed in the root directory).

That '#' prompt is there for a reason.

The way to solve these sorts of issues is to first list the files using
'find' until you're totally happy with the result, and then pass 'rm' as the
command to find.

Of course, nobody does this ;)
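A minimal sketch of that find-first workflow (the directory and filenames are invented for the demo):

```shell
# Demo in a scratch directory so nothing real is at risk.
mkdir -p /tmp/find_first_demo && cd /tmp/find_first_demo
touch keep.txt old1.bak old2.bak

# Step 1: dry run -- print exactly what would be removed, and nothing else.
find . -maxdepth 1 -name '*.bak' -print

# Step 2: once the listing is what you expect, hand rm to find.
find . -maxdepth 1 -name '*.bak' -exec rm -- {} +

ls    # keep.txt survives; the .bak files are gone
```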

~~~
jethro_tell
>Your 'never do this' line

Would also erase everything else on the box, so you might not know for a few
seconds, but then the database errors start, the lib and pid files start
disappearing, and now you're the king of a mountain of shit. It won't take
till the next reboot to notice.

If you ever get a chance, rm -rf / a box before you throw it out and just play
around with it for a couple minutes while it eats itself.

You can spare yourself most errors like that with tab completion, the
built-in sanity checker. I tab complete everything since I'm dyslexic, and it
saves me a shit ton of time because I'm never far from the error when I
notice it.

~~~
jsmthrowaway
> Would also erase everything else on the box so you might not know for a few
> seconds, but then the database errors start and the lib and pid files start
> disappearing and now your the king of a mountain of shit. It won't take till
> next reboot to notice.

Yes, it will, because the original post didn't include -r. So it's only
deleting things that match the glob in the root directory. On many systems,
that is nothing.

~~~
eurleif
Even with -r,

    
    
      /*.*
    

only matches files and directories in / that have a period in their name. So
it would skip over /usr, /home, etc.

~~~
jsmthrowaway
That's what "match the glob" means.

~~~
eurleif
My point was that the lack of -r doesn't really change anything; even with -r,
it would still only delete initrd.img and similar, which would take until the
next reboot to notice.
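It's easy to convince yourself of this with a throwaway directory standing in for `/` (filenames invented):

```shell
# Build a fake root: some dot-less directories, some dot-containing files.
mkdir -p /tmp/glob_demo && cd /tmp/glob_demo
mkdir usr home etc
touch initrd.img vmlinuz.old

# The same glob pattern as /*.* -- only dot-containing names match,
# so usr, home, and etc are untouched:
echo *.*    # initrd.img vmlinuz.old
```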

------
sathio
so, basically it was like a "DELETE FROM table;" without limit :)

~~~
phil21
One of the worst days of my life was when one of my techs called me as I
hopped on a plane...

    
    
      Tech: "hey phil, I just broke $bigcustomer's database"
      Phil: "..."
      Tech: "I typed drop database ImportantDB;2"
      Tech: "instead of drop database Importantdb2;"
    

Even worse because this was, let's say, a very undersized install due to
customer cheapness. And all we had for backups (since who wants to PAY for
backups?!?!) was a day-old mysqldump off a slave. It took multiple days to
re-import all that data. Customer was not pleased.

------
samstave
There should be zero ways for a single operator to reboot the whole DC. The
only way to trigger a simultaneous DC-wide reboot should be separate commands
from separate operators, jointly submitted. TPI and all that.

~~~
paulwolf
As a non-security focused company, they probably don't think there is an issue
giving so much power to one command that can be issued by a single person.

------
geetarista

        On behalf of all of Joyent, we are extremely sorry for this outage, and the severe inconvenience it may have caused to you, and your customers.
    

I hate it when people apologize saying _maybe_ it was inconvenient. If someone
is using your service and you fucked up, it _is_ an inconvenience.

~~~
RussianCow
Doesn't mean everyone is affected by it. If I was hosting some unimportant
website on their platform, I probably wouldn't even notice, and it wouldn't be
an inconvenience to me. Silly thing to feel so strongly about.

------
hueving
This is the same company that chose to publicly shame one of the best
contributors to node.js into quitting because he closed a pull request
changing a gendered pronoun in a code comment. They place very little value on
engineering excellence and more on 'being cool'. This leads to half-baked
tools using <buzzword technology> like the one described in the article that
allowed every server to be simultaneously rebooted without confirmation.

~~~
mirashii
For those wondering why this is downvoted, read the "shaming" and the reasons
behind it for yourself.

[http://www.joyent.com/blog/the-power-of-a-pronoun](http://www.joyent.com/blog/the-power-of-a-pronoun)

~~~
vacri
I've always hated that essay. "We value empathy here at Joyent. That's why we
would rather fire an [theoretical] employee than retrain them".

~~~
chris_wot
It's funny how Cantrill is so willing to throw someone else under a bus, yet
he's the guy who once asked if another developer had "ever kissed a girl"?

