
Knightmare: A DevOps Cautionary Tale (2014) - redredhathat
https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
======
JackFr
Amusing personal anecdote -- the Knight debacle caused the market in general
to tumble. The week before, a coworker of mine -- sure of a market drop, but
for other reasons -- had bought a raft of puts on the S&P 500. When I saw him
looking glum at work after the Knight news broke, I asked him what was wrong;
didn't you make a ton? Yeah, he said, but I can't get out because my account's
with Knight.

~~~
manwithplan
Cool story, didn't happen. There were no retail trading accounts at Knight. In
fact, there was no outside money of any kind. The S&P500 fell about 0.75% on
the day in question: a non-trivial decline, but not really remarkable. It was
up about 0.4% on the week. Also, this is incredibly not how the OCC deals with
members in default.

~~~
alasdair_
>There were no retail trading accounts at Knight.

The article states “The NYSE was planning to launch a new Retail Liquidity
Program (a program meant to provide improved pricing to retail investors
through retail brokers, like Knight)”

This pretty strongly implies Knight was a retail broker.

I assume I’m missing something - can you clarify?

~~~
manwithplan
I don't know precisely what the article means. The NYSE Retail Liquidity
Program, which still exists, describes two categories of participants: member
organizations (MOs) and liquidity providers (LPs). MOs have retail orders,
which are defined as originating with an actual person, i.e. not a computer. LPs
provide liquidity to those orders, i.e. they take the other side of the trade.

I believe Knight would have been interested as an LP. It is not inconceivable
that in some circumstances Knight would have been able to submit retail flow
as an MO, but 100% of that flow would have been routed to it from brokerages
holding actual retail accounts.

Knight was a trading firm, not a hedge fund, and certainly not an institution
which held outside money in retail accounts. But consider an entity which does
have retail accounts and also has proprietary trading for its own account.
Suggesting that the former would become inaccessible if the latter lost lots
of capital in bad trading is absurd: it would mean that retail-customer and
proprietary monies were commingled, and it would require the violation of
untold numbers of regulations. This did not happen with Knight and indeed has
never, ever happened.

~~~
JackFr
As an example, in the case of Lehman Brothers, the bankruptcy didn’t affect
retail customers, who were protected by the SIPC and whose investment accounts
were quickly moved to other brokerages.

But it doesn’t always go so smoothly. To find a bankruptcy with commingled prop
and customer funds you need only look to MF Global. To find a retail brokerage
bankruptcy with commingled funds you can go full “Wolf of Wall Street” and look
at Stratton Oakmont.

And despite how smoothly everything turned out for Lehman, there was a period of
a day or two for some retail guys where it wasn’t exactly clear where their
money was.

------
whalesalad
Back when I used to smoke I would occasionally hang out with this guy from an
investment bank that traded on the Japanese exchange. They had really cool
working hours (they started a lot later in the day) because we were based in
Hawaii, which is a few hours behind Japan.

Anyway, the guy told me that they had multiple big red physical kill switches
so that they could immediately turn things off if shit ever hit the fan with
their systems.

If you have ever spent time in Michigan, you'll notice that manufacturer
test vehicles have a big ass red button on the dashboard to kill the vehicle
in case something goes wrong.

I cannot imagine doing anything remotely close to this sort of thing without a
big ass red kill switch on my desk.

~~~
brazzy
They did have a kill switch. What they did not have was someone with the
authority and guts to throw it in time.

This may have something to do with the fact that killing an HFT bot without
some kind of orderly wind-down might leave you with some _very_ expensive open
positions.
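
A minimal sketch of what an orderly wind-down could look like (all names and
interfaces here are hypothetical, just to illustrate cancel-then-flatten-then-
halt rather than pulling the plug):

    # Hypothetical kill-switch handler; gateway/book are illustrative interfaces.
    def kill_switch(gateway, book):
        """Halt trading with an orderly wind-down instead of a hard stop."""
        gateway.block_new_orders()                # 1. stop submitting anything new
        for order in book.open_orders():          # 2. pull every resting order
            gateway.cancel(order)
        for pos in book.open_positions():         # 3. flatten what is already held
            gateway.submit_market_order(pos.symbol, -pos.quantity)
        gateway.disconnect()                      # 4. only now drop the session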

------
t0mas88
I'm not sure the conclusion of the post is the "One and Only Answer", because a
fully automated deploy process has another risk that has bitten both AWS and
Google at some point: automatically taking down huge numbers of
instances.

~~~
SteveNuts
A lot of the time those issues have been "fully automated (but with human
inputs)" or "fully automated with no guardrails".

~~~
bobbiechen
This seems to cover all the cases. Either there are guardrails (as human
inputs), or there aren't. Unless I'm missing a middle ground here?

~~~
HelloNurse
Automated checks. For example, in this case, confirming that the other
containers are quiescent (as they are supposed to be) and locking them down
before the potentially conflicting operation.
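
A rough sketch of what such a pre-flight check could look like (hypothetical
names, just to make the idea concrete: verify quiescence, lock, and only then
proceed):

    def preflight(servers, is_quiescent, lock, unlock):
        """Abort unless every other server is quiescent and locked first."""
        locked = []
        try:
            for host in servers:
                if not is_quiescent(host):
                    raise RuntimeError(host + " is still active; aborting deploy")
                lock(host)              # freeze it so it can't wake up mid-deploy
                locked.append(host)
            return locked               # the risky operation runs only after this
        except Exception:
            for host in locked:         # best effort: release anything we froze
                unlock(host)
            raise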

~~~
bobbiechen
I'd be willing to bet that it's extremely rare for a fully automated process
to have absolutely no guardrails/checks/tests, and it's also extremely rare
for a fully automated process to have 100% test coverage.

If this check existed and the system failed in some other way, it would be
characterized as "fully automated with no guardrails" (for the scenario which
caused the failure). "We had tests but missed an edge case" usually doesn't
get you any sympathy.

So what's left? Formally proving correctness is overkill for most things. The
"end-to-end" argument [1] might be able to detect when something goes wrong at
the end and roll back or alert, but what if the intermediate steps have already
caused damage or prevent the "end" from being reached at all? If a run is
taking longer than usual, how do you differentiate between harmless delays in
the intermediate steps and the run being entirely broken somehow?
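
One pattern that gets at that middle ground (just a sketch, with made-up names
and thresholds) is a watchdog around the run: an overall deadline plus a final
end-to-end assertion, so a slow-but-healthy run and a silently broken one both
get surfaced:

    import time

    # Sketch of a watchdog: per-run deadline plus a final end-to-end check, so
    # "taking longer than usual" and "entirely broken" both become visible.
    def run_with_watchdog(steps, end_to_end_ok, deadline_s, poll_s=5):
        start = time.monotonic()
        for step in steps:
            step()                                    # intermediate work
            if time.monotonic() - start > deadline_s:
                raise TimeoutError("run exceeded its deadline; treat as broken")
        while not end_to_end_ok():                    # only the end state proves success
            if time.monotonic() - start > deadline_s:
                raise TimeoutError("steps finished but end state never verified")
            time.sleep(poll_s)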

[1] [http://pages.cs.wisc.edu/~bart/739/papers/end-to-end.pdf](http://pages.cs.wisc.edu/~bart/739/papers/end-to-end.pdf)

------
floatingatoll
Previous discussions on HN:

2014:
[https://news.ycombinator.com/item?id=7652036](https://news.ycombinator.com/item?id=7652036)

2015:
[https://news.ycombinator.com/item?id=8994701](https://news.ycombinator.com/item?id=8994701)

~~~
lostlogin
Thanks - the top comment from vijucat in the 2015 discussion is anxiety
inducing.

“- Ctrl-r for reverse-search through history
- typing 'ps' to find the process status utility (of course)
- pressing Enter, ...and realizing that Ctrl-r actually found 'stopserver.sh'
in history instead. (There's a ps inside stoPServer.sh)”

~~~
erinaceousjones
I had a habit of doing `sudo shutdown now` on my desktop as I was leaving my
office. I don't know why; it takes longer than simply hitting the power
button.

Didn't notice I was still SSH'ed into "the" server which was at the time a
single point of failure for my entire project, and as a lowly not-an-IT-
person-just-a-developer in our corporate environment, I didn't have access to
the machine to go power it back on. And the IT people I knew who could help
had gone home for the day.

Felt super dumb writing that up in the downtime log the next day.

Having read this article, I'm super glad I'm working on very niche,
slow-paced stuff which, when it goes down for ~12 hours, is a minor annoyance
to our users rather than "you're costing us millions of $currency per minute" :-)

------
toomuchtodo
This is less about DevOps and more about poor software engineering practices
(code reviews, unit testing, paying off your technical debt by
refactoring/removing old code, etc.), although properly managing and
instrumenting deploys might have stemmed the bleeding and kept losses
manageable.

It's good, though; poor decisions must have a cost. The only way to enforce
good engineering practices that are human-time intensive is for there to be a
cost to not following them.

~~~
mongol
I think at its core it is right in the guts of DevOps. The "flag" that
protects dead code is dev, and the unforeseen deployment scenario is ops. With
a DevOps mindset you need to think of both. I think it is a stellar example of
what can go wrong if you don't consider both the dev and the ops aspects.

------
toolslive
Off topic, but a "knightmare" is also a chess term: a good-knight-vs-bad-bishop
position that went horribly wrong for the owner of the bishop.

~~~
evilotto
Also a Batman alternate timeline

~~~
swish_bob
And a late-'80s/early-'90s children's TV programme.

------
forgottenpass
What's to take away from this?

Automate deployment? Fine, but boring. That's the prevailing dogma today. I
don't remember where the devops hype train was in 2012. Package management had
already been a solved problem for years, even though it was (and continues to
be) regarded as involving too much "icky reading", and a repository system of
plain directories on vanilla webservers is all way too unoptimized for resume
padding.

Learn how to identify and manage risk like an engineer? Understand how
business process and software can implement risk controls and mitigations?

I kid, so I don't cry.

~~~
Traster
There's a whole slew of lessons to learn from this. Leaving dead code in your
system and then deciding to _repurpose_ it. Manually deploying with no
verification. No checks in place to disable a system during crazy behaviour.
No real alerting system. No procedures in place for when a system goes wrong.
No audit log to refer to when rolling back.
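
On the "manually deploying with no verification" point, even a tiny post-deploy
check would have caught the straggler server. A sketch (hypothetical names;
assume the deploy tool can collect an artifact hash from each host):

    import hashlib
    from pathlib import Path

    # Sketch: refuse to enable the new code path (or a repurposed flag) unless
    # every server reports the same artifact hash as the one just released.
    def artifact_hash(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def verify_rollout(hashes_by_host, expected):
        stale = sorted(h for h, v in hashes_by_host.items() if v != expected)
        if stale:
            raise RuntimeError("stale servers, do not enable flag: " + ", ".join(stale))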

The lesson from this article is kind of funny

>It is not enough to build great software and test it; you also have to ensure
it is delivered to market correctly so that your customers get the value you
are delivering

While true, I don't see any indication this was great software or that it was
properly tested.

>Had Knight implemented an automated deployment system – complete with
configuration, deployment and test automation – the error that caused the
Knightmare would have been avoided.

Or to put it another way - had Knight implemented a higher quality deployment
system than the quality of any of their other systems, they _might_ have
avoided this issue.

These stories are never about a single thing gone wrong. The whole point about
critical systems is that you _should_ need dozens of things to go wrong for
them to fail, and then you should fail safe.

~~~
wikiman
The fundamental truth of software. Your system is only as good as its worst
component. every. single. time.

Deployment is a component. Monitoring is a component. They are also OpEx and
therefore "inferior".

------
jmalicki
(2014)

~~~
dang
Added. Thanks!

