
Knightmare: A DevOps Cautionary Tale - nattaylor
http://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
======
AndrewBissell
This story puts the lie to a couple of canards about HFT:

\- "It's risk free." Any time you put headless trading code into the market
you are risking a catastrophic loss. That risk can be managed to a degree with
many layers of programmatic safeties, and other practices like having your
operations people look for warning emails the day after you've deployed new
code. But the risk is always present.

\- "It makes the market more unstable." The most important market-maker in
U.S. equities blew itself up in spectacular fashion and had to remove itself
from the trading entirely. Sending unaccounted orders into the market in an
endless loop is about the worst mistake an algorithmic trading firm can make.
Can anyone pick the day this happened out of a long-term chart of the S&P 500?

------
ams6110
Automated deployment would not necessarily have prevented this. Errors happen
when humans deploy software manually, and errors happen when humans configure
automated deployment tools. The real problem was lack of a "kill switch" to
shut down the system when it became obvious something was wrong.

~~~
30thElement
A kill switch wouldn't have saved them. What killed Knight wasn't the $400
million loss, it was the lack of confidence all other firms had in them
afterwards. Brokers can't just shut down in the middle of the trading day.

They managed to raise the money to cover the loss, but afterwards they were
getting around 10% of their normal order volume [1].

Somewhat ironically, the closest thing they had to a kill switch, backing out
the code, actually made the situation worse, as it made all 8 servers
misbehave instead of just the first one [2].

The full SEC report in [2] is an interesting read; just skip the parts about
"Rule 15c3-5...".

[1] [http://www.businessinsider.com/look-how-knight-capitals-trading-volume-is-just-withering-away-2012-8](http://www.businessinsider.com/look-how-knight-capitals-trading-volume-is-just-withering-away-2012-8)

[2]
[http://www.sec.gov/litigation/admin/2013/34-70694.pdf](http://www.sec.gov/litigation/admin/2013/34-70694.pdf)

~~~
personZ
Note that the loss of confidence came about because a) it went on for 45
minutes, and b) the financial loss was large enough that it seriously
threatened the ongoing business of the firm. A kill switch would absolutely
have helped them, as it would have addressed both a and b. No, they shouldn't
stop fulfilling orders lightly (outages and glitches do happen in order
fulfillment), but the alternative was clearly much, much worse.

As for the backout: they reverted the code on the seven servers while
erroneously leaving the revised configuration in place, so it wasn't really a
kill switch at all. It was frantic fumbling that made things worse.

They would have been better off just shutting the servers down at the first
hint of trouble.

------
siliconc0w
I think the lesson is actually more about how to do proper versioning and
message serialization in higher-risk distributed systems. Higher message
versions should fail to deserialize and cause the message to be re-queued (or
sent to a dead-letter queue). Then you monitor queue length like a hawk and
have a plan in place for rolling back not just the consumers but the
producers as well.
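
Something like this, as a rough sketch (the queue hooks and version numbers
are made up, but the shape is what I mean):

    import json

    SUPPORTED_VERSION = 2   # highest schema version this consumer understands

    def process(msg):
        print("processing", msg)   # stand-in for the real business logic

    def handle(raw, requeue, dead_letter, attempts=0, max_attempts=5):
        """Consume one message; refuse anything newer than we understand."""
        msg = json.loads(raw)
        if msg.get("version", 1) > SUPPORTED_VERSION:
            # A producer was upgraded before this consumer: park the message
            # instead of guessing at fields we weren't built to interpret.
            if attempts < max_attempts:
                requeue(raw)       # queue-depth alarms fire if this piles up
            else:
                dead_letter(raw)
            return
        process(msg)

The point is that an old consumer refuses new-schema messages loudly instead
of half-understanding them.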

------
fizx
Every company I've worked at had similar (though far less costly) issues.

Put an API method in every service that exposes the SHA of the running code,
the build time of the binary (if compiled), and the timestamp when the deploy
was initiated. (Btw, having this information in the filesystem is
insufficient: what if a post-deploy restart failed?) Verify after
every deploy.
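
A rough sketch of the idea, with Flask standing in for whatever framework you
use (the values shown are placeholders that would be injected at build and
deploy time):

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Baked into the binary/image at build time so they describe the code
    # actually serving traffic, not whatever happens to be on disk.
    GIT_SHA = "abc1234"                     # injected by the build
    BUILD_TIME = "2014-04-17T09:00:00Z"     # binary build timestamp
    DEPLOY_TIME = "2014-04-17T09:15:00Z"    # when this deploy was initiated

    @app.route("/version")
    def version():
        return jsonify(sha=GIT_SHA, built=BUILD_TIME, deployed=DEPLOY_TIME)

The deploy script then polls /version on every host and refuses to declare
success until all of them report the expected SHA.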

------
Havoc
>BTW – if there is an SEC filing about your deployment something may have gone
terribly wrong

I can't help but smile at this comment. Production servers crashing is bad
news, but the above is a whole new level of bad.

------
bowlofpetunias
The quite common lack of a kill switch is something that never fails to amaze
me. Especially in large scale deployments with all kinds of distributed
processes where you cannot simply turn off "the" server.

Everybody is worried about downtime, but downtime is rarely the worst that can
happen.

Things turning to shit fast and _not being able to stop it_ is both much more
common and much harder to recover from.

So many organizations have dutifully implemented a single command deployment
but don't even have a playbook for simply pulling the plug.

~~~
hft_throwaway
That's very true in this case. The issue here is that Knight isn't just
trading for its own account. They're a broker that likely has some SLA-ish
agreement with clients, or faces reputational risk at the very least. As a
registered market-maker they're obligated to quote two-way prices. Shutting
down costs them money and exposes them to regulatory risk.

~~~
hga
And if you read between the lines, loss of reputation is what killed them:

" _In 2012 Knight was the largest trader in US equities with market share of
around 17% on each the NYSE and NASDAQ. Knight’s Electronic Trading Group
(ETG) managed an average daily trading volume of more than 3.3 billion trades
daily, trading over 21 billion dollars…daily._ "

This was for others, per the preceding sentence.

" _Knight only has $365 million in cash and equivalents. In 45-minutes Knight
went from being the largest trader in US equities and a major market maker in
the NYSE and NASDAQ to bankrupt. They had 48-hours to raise the capital
necessary to cover their losses (which they managed to do with a $400 million
investment from around a half-dozen investors). Knight Capital Group was
eventually acquired by Getco LLC (December 2012) and the merged company is now
called KCG Holdings._ "

Per
[https://news.ycombinator.com/item?id=7652573](https://news.ycombinator.com/item?id=7652573),
afterwards " _they were getting around 10% of their normal order volume_ "

Losing 90% of their business "overnight" tends to be fatal.

Per a link from the article the above links to, " _An equities trader
explained that Knight was the "last place" he would go to execute a trade.
Others expressed befuddlement at the firm's inability to rectify the trading
error for a full 45 minutes._"

Now all that was immediately after the screwup, but it's hard to imagine it
getting better without a perception that the people managing their operations
were replaced with people worthy of trust.

~~~
kasey_junk
Knight (and most big HFT firms) trade for others or sell their execution
platforms. They also have proprietary trading arms, of which their market
making strategies were assumedly a part.

The way they were able to raise the funds they needed was to go to existing
stakeholders for the money. This led directly to their merger with Getco,
another huge HFT firm.

------
coldcode
In addition to having a repeatable and dependable deployment process, it's
also a good idea to remove unused functionality completely from deployed
products. I also don't understand why people insist on using integer flags
for things that matter. I've seen this type of error before: people reuse a
bit pattern to mean something different, and two pieces of code start doing
different things.
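
A toy illustration of the failure mode (flag names are made up; Knight's
actual flag lived in order messages, not Python):

    from enum import Flag, auto

    # The hazard: one bit, two meanings. Any stale code path that still
    # tests bit 0 under its old meaning now fires on the new feature.
    POWER_PEG = 0x01      # retired meaning of bit 0
    RLP = 0x01            # bit 0 repurposed for the new feature

    flags = RLP
    if flags & POWER_PEG:              # stale check, silently true
        print("legacy path runs!")

    # Safer: every flag gets a fresh bit and retired names stay reserved.
    class OrderFlags(Flag):
        POWER_PEG = auto()             # reserved forever, never reassigned
        RLP = auto()                   # new feature gets its own bit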

------
iandanforth
I think this says more about their business than their deployment process. It
might be a good rule of thumb to say, "If your business can lose $400M in 45
minutes, you're not in business, you're playing poker."

~~~
rcxdude
Many businesses can lose that much or more in a very short time if something
goes wrong, by destruction of product, damage to equipment, or damage to
environment. Software can be responsible for all of these.

~~~
hga
Famous example, although the cost was much less:

[http://catless.ncl.ac.uk/Risks/18.74.html#subj5](http://catless.ncl.ac.uk/Risks/18.74.html#subj5)

As I recall, it was a "vanity" function that blew up, i.e. something not at
all necessary for running those smelters.

Ah, look here for some much more expensive ones in space exploration:
[https://en.wikipedia.org/wiki/List_of_software_bugs](https://en.wikipedia.org/wiki/List_of_software_bugs)

The second YF-22 was destroyed by a software error. That was plenty
expensive, I'm sure; then again, that's why in peacetime we build and
thoroughly exercise test aircraft before starting production.

------
teyc
This case is extremely interesting, because it presents a very difficult
problem. What is it that Knight could have done to prevent such a serious
error from occurring?

At the core, it seems, is that each application server is effectively running
as root, with enormous capacity to cause immediate damage. The lesson from
[http://thecodelesscode.com/case/140](http://thecodelesscode.com/case/140) is
to "trust no-one". This implies having automated supervisors that have the
capacity and authority to shut down machines. This is difficult to build, and
difficult to reason about.

Secondly, it warns us of the dangers of sharing global variables/flags.
Humans lack the capacity to reason effectively about what happens when a
repurposed flag gets used by an old piece of code. That alone should be
sufficient heuristic to avoid doing so. This was utterly preventable.

Thirdly, incomplete/partial deployment is extremely dangerous. While assembly
signing and other approaches work for binaries, they say nothing about
configuration files. Perhaps best practice in highly dangerous situations
requires configuration to be versioned and checked by the binaries as they
load; after all, a configuration represents an executable specification.
Similarly, relying on environment variables is extremely risky as well.
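
As a minimal sketch of that versioned-configuration point (the field name and
version scheme are invented):

    import json
    import sys

    EXPECTED_CONFIG_VERSION = 42   # bumped in lockstep with the binary

    def load_config(path):
        with open(path) as f:
            cfg = json.load(f)
        found = cfg.get("config_version")
        if found != EXPECTED_CONFIG_VERSION:
            # Refuse to start rather than trade on a stale or
            # partially deployed configuration.
            sys.exit("config version %r != expected %r"
                     % (found, EXPECTED_CONFIG_VERSION))
        return cfg

A half-deployed server then fails loudly at startup instead of quietly
running with old semantics.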

------
BryantD
Allspaw's post on this incident
([http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/](http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/))
is much better. In particular, he explains why the SEC document is
not a post-mortem and should not be used to reach this kind of conclusion.

------
rbc
I think it shows how release engineering hasn't been seriously funded at many
companies. There are a bunch of tools serving different communities, but most
of them operate in the context of a single server, while production reality
is clusters of machines. We need better tools for managing cluster
deployments and hot-swapping code. The Erlang platform takes on some of this,
but doesn't seem to have picked up the following it probably deserves. I bet
there are some lessons to be learned there.

------
chris_wot
Why didn't they remove the old code first?

In LibreOffice, we are spending a LOT of time trying to remove unused and
outdated code.

~~~
EliRivers
Because time (and money) spent removing old code that's not used right now is
time (and money) spent for zero short-term profit increase.

~~~
chris_wot
We should let such firms die when they fall over their own crudulence.

~~~
kasey_junk
Knight never got any government assistance and their trades stood. Unlike
when the big investment banks mess up, HFT firms tend to pay for their
mistakes.

~~~
hga
While some people are drawing lessons from this incident about HFT, I've seen
no indication Knight was a major player in that; the summary says they were
"engaging in market making, electronic execution, and institutional sales and
trading." I.e., those actions are on behalf of others, plus stock exchange
market makers have obligations to keep what they're responsible for liquid.

~~~
kasey_junk
It depends on what your definition of HFT is. But they were frequently
responsible for ~20% of equity volume in a day. They were market making purely
electronically & at low latency. That hits nearly every definition of HFT that
I know.

------
personZ
There are many interesting lessons from this tale, but the conclusion that
automated deployment would have saved the day seems a bit of a jump.

Automation does not protect you from automated devastation, gaps, or human
errors. Your automation tools, just like written instructions, require
configuration -- a list of servers, for instance.

Automation can be bullet-proof in a continuous-deployment situation, but it
is less reliable when you do infrequent deployments, as a financial firm like
this does. I say this having been at a firm where we moved from "a list of
deployment steps" to "fully automated" for our quarterly builds, and the
result was much, much, much worse than it was before. We could certainly have
resolved this (for instance, by having a perfect replica of production), but
the amount of work and testing we put into our deployment process exceeded,
by several magnitudes, what our manual process required.

An observer did not validate the deployment (which should happen whether it's
automated or not for such a deploy). They _ignored_ critical warning messages
sent by the system pre-trading (the system was telling them it was in a SNAFU
state). Systems in the cluster didn't verify versions with each other. The
configuration did not demand a version. Most importantly for a system of this
sort, they didn't have a trade gateway through which they could easily see
what the system was doing, and gate abnormal behaviors quickly and easily.
(Such a system should be as simple as possible, the premise being that it's
an intermediate step between the decision/action systems and the market. The
principle is exactly the same as sending a mass customer mailing to a holding
pen for validation: to ensure that your macros are correct, that people
aren't multi-sent, to do throttling, etc.)
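
In sketch form, such a gateway is just a chokepoint that counts everything
passing through it and can trip itself (the limits here are invented):

    import time

    class OrderGate:
        """Every order passes through here on its way to the market, so
        this is the one place to count, throttle, and halt."""

        def __init__(self, max_per_second=100, max_open_orders=10_000):
            self.max_per_second = max_per_second
            self.max_open_orders = max_open_orders
            self.window = int(time.time())
            self.sent_this_second = 0
            self.open_orders = 0
            self.halted = False

        def submit(self, order, send_to_market):
            if self.halted:
                raise RuntimeError("gateway halted")
            now = int(time.time())
            if now != self.window:
                self.window, self.sent_this_second = now, 0
            self.sent_this_second += 1
            self.open_orders += 1   # a real gate decrements on fills/cancels
            if (self.sent_this_second > self.max_per_second
                    or self.open_orders > self.max_open_orders):
                self.halted = True   # the "holy shit" button trips itself
                raise RuntimeError("order limits breached; halting")
            send_to_market(order)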

~~~
vacri
The bit that surprised me the most was the lack of killswitch (or a halt or a
pause); that a human supervisor couldn't invoke a "holy shit!" button.

~~~
yummyfajitas
A "killswitch" is not a trivial thing to build.

If you simply run `kill -9 pid`, you might be holding a large position, or
worse, you might be holding some naked shorts. (In fact, you almost certainly
are.) This is risky. It can result in failures to deliver, make you vulnerable
to large market movements, etc.

Another form of "killswitch" is to not open any new positions, but still
attempt to close out your old positions with varying degrees of
aggressiveness. But if your system is _wildly_ broken, this might not be doing
what you think it's doing. As I understand it, this happened to Knight.
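
In sketch form, the distinction is a mode rather than a process kill (the
modes and the reduce-only rule here are illustrative):

    from enum import Enum

    class Mode(Enum):
        NORMAL = 1       # open and close positions freely
        REDUCE_ONLY = 2  # no new exposure; only orders shrinking a position
        HALT = 3         # send nothing new at all

    def allow(order_qty, position, mode):
        """Buys are positive, sells negative, for both arguments."""
        if mode is Mode.HALT:
            return False
        if mode is Mode.REDUCE_ONLY:
            # Must trade against the position, without overshooting zero.
            return (order_qty * position < 0
                    and abs(order_qty) <= abs(position))
        return True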

~~~
ramchip
I think that's a strange way to approach a killswitch. The point of a
killswitch is to stop everything NOW, so the system can be checked. After the
problem is understood, then it's safe to restart strategies gradually and
unwind positions. A system shouldn't attempt to send more orders when it's in
an unpredictable state. I believe many regulators require such a switch.

------
codr
No rollback plan? Wtf.

------
micro-ram

      repurposed an old flag
    

The flag was not OLD since there was still code in the CURRENT code base which
COULD use it.

~~~
VintageCool
There was live code in the current code base which could use the flag, but it
hadn't been active in 8 years.

------
chollida1
The blog spam isn't necessary. To get the actual findings, check out the SEC
post-mortem located here.

[http://www.sec.gov/litigation/admin/2013/34-70694.pdf](http://www.sec.gov/litigation/admin/2013/34-70694.pdf)
(PDF warning).

~~~
greenyoda
The posted article is much more readable than this SEC document. I wouldn't
call it "blog spam" at all. (I'd define "blog spam" as a blog post that links
to another article without adding any value.)

