
Knightmare: A DevOps Cautionary Tale (2014) - strzalek
http://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
======
vijucat
I once shut down an algorithmic trading server by hastily typing (in bash):

- Ctrl-r for reverse-search through history

- typing 'ps' to find the process status utility (of course)

- pressing Enter... and realizing that Ctrl-r actually found 'stopserver.sh'
in history instead. (There's a ps inside stoPServer.sh)

I got a call from the head Sales Trader within 5 seconds asking why his GUI
showed that all Orders were paused. Luckily, our recovery code was robust and
I could restart the server and resume trading in half a minute or so.

That's $250 million to $400 million of orders on pause for half a minute. Not
to mention my heartbeat.

Renamed stopserver.sh to stop_server.sh after that incident :|

P.S. typing speed is not merely overrated, but dangerous in some contexts.
Haste makes waste.

~~~
phaemon
Might be better having something like the following at the top of your script:

    
    
        echo "This will STOP THE SERVER. Are you sure you want to  do this?"
        echo "Type 'yes' to continue:"
        read response
    
        if [[ $response != "yes" ]]
        then
          echo "You must type 'yes' to continue. Aborting."
          exit 1
        fi
    
        echo "Stopping server ..."
    

It barely takes any time to type 'yes' but it makes you stop and think.

~~~
marcosdumay
That'll make it impossible to use in scripts. This reminds me of the time I
tried to fix my "rm -r * .o" with a CLI trash system, instead of doing proper
backups.

Might be better to start reviewing EVERY command one sends to important
servers, and testing them if viable. Which is probably the line that vijucat
took... that's the line that everybody ends up taking; the only thing that
changes is the number of accidents needed.

~~~
euid
Then you add an optional CLI argument that makes it skip the prompt, and use
that version in scripts.
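
Something like this, as a rough sketch (the --force flag name and the messages
are arbitrary, not from any particular tool):

    #!/usr/bin/env bash
    # Pass --force to skip the interactive check, e.g. when invoked from
    # automated scripts; humans running it by hand still get the prompt.
    if [[ $1 != "--force" ]]; then
      echo "This will STOP THE SERVER. Are you sure you want to do this?"
      echo "Type 'yes' to continue:"
      read response
      if [[ $response != "yes" ]]; then
        echo "You must type 'yes' to continue. Aborting."
        exit 1
      fi
    fi

    echo "Stopping server ..."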

------
ratsbane
I can't read something like this without feeling really bad for everyone
involved and taking a quick mental inventory of things I've screwed up in the
past or potentially might in the future. Pressing the enter key on anything
that affects a big-dollar production system is (and should be) slightly
terrifying.

~~~
click170
I have the same fear, but I wonder if that fear stems from a lack of training
and/or documentation and/or time.

When I ask myself why I am afraid of deploying to production servers, it's
always because I don't fully understand what the deployment process does. If I
had to deploy manually I would be lost at sea.

Is it just naiveté to think that enough documentation and training makes that
fear go away?

~~~
vinceguidry
It is, ultimately, a management problem.

Computer systems are rarely useful on their own. They need to be attached to
some other process in order to derive any value from them. Sometimes it's a
network, sometimes it's a factory, sometimes it's just a process run by humans.

In most domains that you attach computers to, it's possible for one person to
understand both the computer side of it and the other side of it. Web
development is like this: you can easily understand both the web and computer
systems, as one is really just another version of the other.

If you're attaching the computer system to another very hard-to-understand
system, like, say, a surgical robot, then the person that understands both
domains well enough to avoid problems like this is a unicorn; there might well
be only one person in the world that's built up expertise in both fields.

To get around this you need careful, effective management of the two pools of
labor, and robust dialogue between the two teams. The second this starts
breaking down is the second you start marching down the road to disaster.

In the aforementioned example, everyone would be well-aware of the failure
modes. So it's a bit easier to manage. In finance, failure modes can be so
subtle, particularly as the system grows more complex, that they can escape
detection by both teams unless they're both checking each other's work and
keeping each other honest.

Institutional knowledge transfer has to constantly be happening, areas of
ignorance on both sides have to constantly be appraised and plans undertaken
to reduce said ignorance. The more everyone knows about the entire system, the
more likely it will be that critical defects like this can be discovered
before they strike.

This kind of effective interaction of those at the bottom can only be
organized and directed at the top. It's very much a "captains winning the
war"-type situation, but captains can't lead without support from the
generals.

~~~
click170
Do you have any advice with regards to institutional knowledge transfer, or
have you seen any examples of when this was done exceptionally well?

Knowledge transfer problems have been a running theme at many of my previous
workplaces, and I'm interested in what I can do to help. I'm a documentation
proponent, but there must be more.

~~~
vinceguidry
It's a really hard problem. From a personal standpoint as a coder, the problem
with documentation is that it has to be maintained the same as the rest of the
application it's documenting. That's why you see the push towards self-
documenting code in the Ruby world, where you can just look at a code file and
know exactly what it's doing because of convention. Every tool you add to your
workload doesn't just impose an initial dev cost, but also an ongoing
maintenance cost.

When you're also dealing with humans, you have to pay a management cost too.
So you have some documentation; where are you going to put it? Is there a
repository somewhere where you could put it where anyone who wants to work on
the program later will know to look? Often there isn't. So you have to make
one. It will need the use of company resources. You need to educate people
that there's this place for documentation and everyone needs to use it,
because it won't make any sense for a documentation repository to have just
your own documentation in it.

The size and scope of what it takes to be effective at this makes it a
management problem: resources need to be allocated, and directions have to be
given. Someone has to drive the project, to make it happen above and beyond
his job duties.

Most of the time this happens when something big fails. Today, we found out
that an important scheduled email had not gone out for the last eight weeks.
The stakeholders did not notice that the email was not hitting their desks
every MWF as usual. The problem only got fixed because another, less important
process had someone on the ball enough to notice it was failing and complain;
once it was fixed, the important email started coming in again, prompting a
giant WTF. My reaction is a giant shrug: if it's important to you, you need to
be monitoring it, because I'm not omniscient. That pushes the responsibility
back onto management.

So the knowledge transfer in this situation goes as follows. First, I need to
know which business processes are important; "all of them" is not an
acceptable answer. Second, other teams need to be aware that when systems are
automated, that means they're not really being monitored; that's what
automation means. Whoever is in charge needs to delegate a human to do that
monitoring manually. That person can't be me, my time is too valuable for that
shit.

Eventually, I can build a system for getting the kind of feedback that wasn't
built into the system in the first place, maybe some kind of job-verification
system that, after enough tweaking, makes the system as a whole more reliable.
But that still won't remove the necessity for some human to have the job at
the end point of the system to ensure that the information is flowing on time
and on target. No matter how robust the system is, there will still be silent
failure modes that can go for months or years unnoticed.

~~~
click170
Wow, thanks for your thoughtful response; you've helped me realize I already
have it rather good by comparison. E.g. we already have a designated One
Source Of Truth for documentation.

Having been a coder, I do understand that documentation is just something else
to maintain. And this is a sentiment I hear from many people in development:
"Read the code" is a common response.

As someone who shares part of the on-call rotation, though, I've grown to see
it differently. While that documentation is extra overhead to maintain, having
it ready will save me from having to call you at 3AM when your project breaks
in production. If your documentation was written without consideration for
on-call, I have no choice and you're getting woken up at 3AM. I've found that
when cast in this light, I have yet to meet a developer who wasn't eager to
avoid that early-morning phone call.

Thank you for your comment, it's good to know the color of the grass on the
other side of the fence sometimes :)

------
NhanH
Every time I read this story, there is one question I've never understood: why
couldn't they just shut down the servers themselves? There ought to be some
mechanism to do that. I mean, $400 million is a lot of money to not just bash
the server with a hammer. It seems like they realized the issue early on and
were debugging for at least part of the 45 minutes. I know they might not have
had physical access to the server, but wouldn't there be any way to do a hard
reboot?

~~~
jacques_chester
In the story he points out that there was no kill switch.

And, as has been found in other disasters in other industries, kill switches
are hard to test.

~~~
beat
Long ago in a previous life, I worked in a factory that made PVC products,
including plastic house siding. One of my co-workers got his arm caught in a
pinch roller while trying to start a siding line by himself. There was a kill
switch on the pinch roller - six feet away and to his left, when his left arm
was the one that was caught. Broke every bone in his arm, right up to his
collarbone.

He screamed for help, but no one could hear him over the other noisy
machinery. Welcome to the land of kill switches.

~~~
RankingMember
Yikes. That reminds me of The Machinist.

------
otakucode
While articles like this are very interesting for explaining the technical
side of things, I am always left wondering about the organizational/managerial
side of things. Had anyone at Knight Capital Group argued for the need of an
automated and verifiable deployment process? If so, why were their concerns
ignored? Was it seen as a worthless expenditure of resources? Given how common
automated deployment is, I think it would be unlikely that none of the
engineers involved ever recommended moving to a more automated system.

I encountered something like this about a year ago at work. We were deploying
an extremely large new system to replace a legacy one. The portion of the
system which I work on required a great deal of DBA involvement for
deployment. We, of course, practiced the deployment. We ran it more than 20
times against multiple different non-production environments. Not once in any
of those attempts was the DBA portion of the deployment completed without
error. There were around 130 steps involved and some of them would always get
skipped. We also had the issue that the production environment contained some
significant differences from the non-production environments (over the past
decade we had, for example, delivered software fixes/enhancements which
required database columns to be dropped... this was done on the non-production
systems, but was not done on the production environment because dropping the
columns would take a great deal of time). Myself and others tried to raise
concerns about this, but in the end we were left to simply expect to do
cleanup after problems were encountered. Luckily we were able to do the
cleanup and the errors (of which there were a few) were able to be fixed in a
timely manner. We also benefitted from other portions of the system having
more severe issues, giving us some cover while we fixed up the new system. The
result, however, could have been very bad. And since it wasn't, management is
growing increasingly enamored with the idea of by-the-seat-of-your-pants
development, hotfixes, etc. When it eventually bites us as I expect it will, I
fear that no one will realize it was these practices that put us in danger.

~~~
bd_at_rivenhill
You should read Charles Perrow's "Normal Accidents" and all will be revealed.
This is hardly a new problem.

[http://www.amazon.com/Normal-Accidents-Living-High-Risk-
Tech...](http://www.amazon.com/Normal-Accidents-Living-High-Risk-
Technologies/dp/0691004129/ref=sr_1_1?s=books&ie=UTF8&qid=1423026671&sr=1-1&keywords=normal+accidents)

------
ooOOoo
The post is quite poor and suffers a lot from hindsight bias. The following
article is much better: [http://www.kitchensoap.com/2013/10/29/counterfactuals-
knight...](http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-
capital/)

~~~
jackgavigan
Great link. Thanks for posting.

------
serve_yay
If you fill the basement with oily rags for ten years, when the building goes
up in flames, is it the fault of the guy who lit a cigarette?

~~~
dredmorbius
In the case of wildfires, why yes, yes indeed we do.

~~~
serve_yay
The conditions for wildfire do not arise out of neglect and laziness.

~~~
dredmorbius
The conditions for _risk_ from wildfire _do_ arise out of a multitude of
factors. Including discounting hazards, poor construction, insufficient
warning or evacuation capabilities, and more.

But more to the point: the conditions in which wildfires are common involve
such _insanely_ high risk of conflagration that the least spark can set them
off. And do.

When you've got to tell people _not to mow lawns or trim brush_ for fear of
stray sparks setting off tinder-dry brush, you're simply sitting on top of (or
in the midst of) a bomb. And that _is_ the reality in much of the world --
Australia, the western US, Greece, and elsewhere. Or a hot car exhaust from a
parked vehicle. Or broken glass. Or, yes, flame in the form of cigarettes, a
campfire, or barbecue.

But except for the case of deliberate arson (which does happen), I find the
practice of convicting people for what's essentially a mistake waiting to
happen to be quite distasteful.

------
rgj
Repurposing a flag should be spread over two deployments. First remove the
code using the old flag, then verify, then introduce code reusing the flag.

Even if the deployment had been done correctly, _during_ the deployment there
would still have been old and new code running in the system at the same time.

------
gunnark01
I used to work in HFT, and what I don't understand is why there were no risk
controls. The way we did it was to have explicit shutdown/pause rules (pause
meaning that the strategy will only try to get flat).

The rules were things like:

- Too many trades in one direction (AKA a big position)

- P/L down by X over Y

- P/L up by X over Y

- Orders way off the current price

Whenever there was a shutdown/pause, a human/trader would need to assess the
situation and decide to continue or not.
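
For illustration only, a very rough sketch of what such a check might look
like. get_position, get_pnl, and pause_strategy are hypothetical stand-ins for
whatever the real trading system exposes, and the limits are made up:

    #!/usr/bin/env bash
    # Hypothetical kill-switch check, run periodically alongside the strategy.
    MAX_POSITION=100000      # shares; made-up limit
    MAX_DRAWDOWN=500000      # dollars over the check window; made-up limit

    position=$(get_position)   # hypothetical helper: current net position
    pnl=$(get_pnl)             # hypothetical helper: P/L over the window

    if (( position > MAX_POSITION || position < -MAX_POSITION )); then
      pause_strategy "position limit breached: $position"   # hypothetical helper
    elif (( pnl < -MAX_DRAWDOWN )); then
      pause_strategy "drawdown limit breached: $pnl"
    fi

The point is less the specific limits than that the checks run outside the
strategy logic and always end with a human deciding whether to resume.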

~~~
neomantra
To add insult to injury, Knight got fined for not having appropriate risk
controls after this incident.

------
Mandatum
I remember reading a summary of this when it occurred in 2012. It's obvious to
everyone here what SHOULD have been done, and I find this pretty surprising in
the finance sector.

Also your submission should probably have (2014) in the title.

~~~
johansch
The art of rehashing old stories for exposure. Without mentioning where you
got your material of course.

~~~
Mandatum
[https://news.ycombinator.com/item?id=4333089](https://news.ycombinator.com/item?id=4333089)

I believe it was this, as it was a Guardian article - which has since been
consigned to the digital graveyard.

------
solarmist
Why would they repurpose an old flag at all? That seems crazy to me unless it
was something hardware bound.

~~~
vertex-four
To keep the messages as short as possible, to reduce the time-costs of
transmitting and processing them. It's HFT, they do things like that.

~~~
jemfinch
I assume "flag" in this context means something akin to a command-line flag.

~~~
socceroos
I assumed it was a single bit in a string of them that describes a message.

~~~
MattHeard
You're correct. Single bits representing boolean values are often called
"flags".

------
beat
It's nice to see a more detailed technical explanation of this. I've used the
story of Knight Capital as part of the pitch for my own startup, which
addresses (among other things) consistency between server configurations.

This isn't just a deployment problem. It's a _monitoring_ problem. What
mechanism did they have to tell if the servers were out of sync? Manual review
is the recommended approach. Seriously? You're going to trust human eyeballs
for the thousands of different configuration parameters?

Have computers do what computers do well - like compare complex system
configurations to find things that are out of sync. Have humans do what humans
do well - deciding what to do when things don't look right.
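
A rough sketch of the computer half of that, with hypothetical hostnames and a
hypothetical config path (a real system would compare far more than one file):

    # Diff one config file on each host against a reference host and flag drift.
    REFERENCE=trade-host-1
    CONFIG=/etc/trading/router.conf

    for host in trade-host-2 trade-host-3 trade-host-4; do
      if ! diff -q <(ssh "$REFERENCE" cat "$CONFIG") \
                   <(ssh "$host" cat "$CONFIG") > /dev/null; then
        echo "DRIFT: $host differs from $REFERENCE for $CONFIG"
      fi
    done

A human only gets involved when that loop prints something.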

------
narrator
Somebody was on the other side of all those trades and they made a lot of
money that day. That's finance. No money disappears, no physical damage gets
done, and somebody on the other side of the poker table gets all the money
somebody else lost.

~~~
noonespecial
It's probably weirder than that. As soon as their system "lost containment",
all of that money probably just diffused into the market like smoke amid
hundreds of thousands of trades. There's no "big winner" on the other side of
the table to grumble at.

HFT systems aren't like a bunch of poker players slyly eyeing one another
across a table; they're more like electro-chemical gold miners concentrating
gold from the rivers of cyanide sludge pumping up from the financial system
mines. Heaven help them if they depolarize.

------
__abc
This must be an old wives' tale. I live in Chicago and a trading firm on the
floor beneath us went bankrupt, at roughly the same time, with a similar
"repurposed bit" story.

Maybe it's the same one .....

~~~
fnordfnordfnord
Knight Capital. It was a big deal. Not an old wives' tale.
[http://www.nanex.net/aqck2/3522.html](http://www.nanex.net/aqck2/3522.html)

~~~
easytiger
He was joking.

~~~
fnordfnordfnord
In that case, "whoosh".

------
bevacqua
Ah yes, this story is legendary. I discuss it in my JavaScript Application
Design book[1]. Chaos-monkey server-ball-wrecking sounds like a reasonable way
to mitigate this kind of issue (along with sane development/deployment
processes, obviously).

[1]: [http://bevacqua.io/bf](http://bevacqua.io/bf)

------
aosmith
Wasn't Knight in trouble for some other things as well?

------
recursive
"Power Peg"? More like powder keg.

------
danbruc
What really looks broken to me in this story is the financial system. It has
become a completely artificial and lunatic system that has almost nothing to
do with the real - goods and services producing - economy.

~~~
raincom
Trickle-down economics = financialization of capital. Those who influence
policy are beneficiaries of the financialization of capital: think of grads
from elite law schools and b-schools! If there were no financialization of
capital, these grads from elite law/b-schools wouldn't get fat bonuses.

------
hcarvalhoalves
As usual in catastrophic failures, a series of bad decisions had to occur:

- They had dead code in the system

- They repurposed a flag from a previous functionality

- They (apparently) didn't have code reviews

- They didn't have a staging environment

- They didn't have a tested deployment process

- They didn't have a contingency plan to revert the deploy

It could have been minimized or avoided altogether by fixing just one of these
points. Incredible.

~~~
UnoriginalGuy
> They (apparently) didn't have code reviews

I don't get that. There was no code issue. The old and new code both worked as
intended; it was a deployment and deployment-verification problem.

> They didn't have a staging environment

Yes, they did. They staged the new code and tested it. They did a slow
deployment also.

> They didn't have a contingency plan to revert the deploy

They did revert the deploy within the 45 minutes. It made it worse.

I think you need to re-read the article. Your assessment is strange given the
event.

~~~
paulrademacher
Code review could have been another set of eyes to predict the problem of
reusing a flag.

~~~
solarmist
If the message was kept as compact and low-level as possible, it was probably
a bit flag, so in that context it makes sense to repurpose it.

Being so removed from binary and bit-level interactions, it can be easy to
forget things like this.

