Knightmare: A DevOps Cautionary Tale (dougseven.com)
101 points by nattaylor 1269 days ago | 60 comments

This story puts the lie to a couple of canards about HFT:

- "It's risk free." Any time you put headless trading code into the market you are risking a catastrophic loss. That risk can be managed to a degree with many layers of programmatic safeties, and other practices like having your operations people look for warning emails the day after you've deployed new code. But the risk is always present.

- "It makes the market more unstable." The most important market-maker in U.S. equities blew itself up in spectacular fashion and had to remove itself from the trading entirely. Sending unaccounted orders into the market in an endless loop is about the worst mistake an algorithmic trading firm can make. Can anyone pick the day this happened out of a long-term chart of the S&P 500?

Automated deployment would not necessarily have prevented this. Errors happen when humans deploy software manually, and errors happen when humans configure automated deployment tools. The real problem was lack of a "kill switch" to shut down the system when it became obvious something was wrong.

An operations group should:

1. know what a normal morning looks like

2. and recognize the abnormality

3. and have the authority to shut down all trading immediately

DevOps is not an excuse to fire your operations staff, it's a requirement that your developers work with and understand your operations staff and vice-versa.

As someone else pointed out, "shut down all trading immediately" isn't so clear cut as "let's stop doing this".

The other aspect is that while I think people should be empowered, shutting down business operations should be decided by more than just an operations team.

Here's an example - one morning on my way to work many years ago, I got a frantic call from our head of operations - "the site is slow", "we're getting DOSed", "I'm going to start blocking netblocks from the major offenders".

I talked him into waiting until I got there, and took a look - sure enough, we're doing 6x our normal traffic, web servers are slowing down, most of it coming from the US. But was it a DoS? Credit card transactions were also up, but not 6x, about 1.5x-2x, so more people were buying, but not proportional to the normal traffic volumes.

A short while later we figured it out - unbeknownst to us, we had been featured on a major US morning news show, complete with a walkthrough on how to use our site. Millions of people jumped on the site to give it a shot, but many of them abandoned it, or the site got too slow for them to purchase on. We fixed it, got it up and running again, and made a ton of money.

But if our operations group had obeyed the "clear" signs of what was going on and just started blocking whole netblocks, we would have lost money and hurt our business.

A kill switch wouldn't have saved them. What killed Knight wasn't the $400 million loss, it was the lack of confidence all other firms had in them afterwards. Brokers can't just shut down in the middle of the trading day.

They managed to raise the money to cover the loss, but afterwards they were getting round 10% of their normal order volume [1].

Somewhat ironically, the closest thing they had to a kill switch, backing out the code, actually made the situation worse as it made all 8 servers misbehave instead of just the first one[2].

The full SEC report in [2] is an interesting read, just skip the parts about "regulation 5c-15...".

[1] http://www.businessinsider.com/look-how-knight-capitals-trad...

[2] http://www.sec.gov/litigation/admin/2013/34-70694.pdf

Note that the loss of confidence was because a) it went on for 45 minutes, and b) the financial loss was large enough that it seriously threatened the ongoing business of the firm. A kill switch would absolutely have helped them, as it would have solved both a and b. No, they shouldn't casually stop fulfilling orders (though outages and glitches do happen in order fulfillment), but the alternative was clearly much, much worse.

And to the backout, they reverted the code on the 7 servers while erroneously leaving the revised configuration, so it wasn't really a kill switch at all. It was frantic fumbling that made things worse.

They would have been better off just shutting the servers down at the first hint of trouble.

I think the lesson is actually more about how to do proper versioning and message serialization in higher risk distributed systems. Higher message versions should fail to deserialize and cause the message to re-queue (or go to a dead letter queue). Then you monitor queue length like a hawk and have a plan in place for rolling back not just the consumers but the producers as well.
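A minimal sketch of that idea, assuming JSON messages that carry an integer "version" field; the field names, `SUPPORTED_VERSION`, and the in-memory dead letter queue are all made up for illustration and stand in for whatever broker you actually use:

```python
import json
from collections import deque

SUPPORTED_VERSION = 1  # hypothetical: highest message version this consumer understands


class UnsupportedVersion(Exception):
    """Raised when a message is newer than this consumer can interpret."""


def deserialize(raw: str) -> dict:
    msg = json.loads(raw)
    version = msg.get("version")
    # Refuse anything we cannot positively identify as a version we know.
    if not isinstance(version, int) or version > SUPPORTED_VERSION:
        raise UnsupportedVersion(f"cannot handle version {version!r}")
    return msg


def consume(raw_messages, dead_letters: deque) -> list:
    """Process what we understand; park everything else instead of guessing."""
    handled = []
    for raw in raw_messages:
        try:
            handled.append(deserialize(raw))
        except UnsupportedVersion:
            dead_letters.append(raw)  # the length of this queue is what you alert on
    return handled


dead = deque()
ok = consume(
    ['{"version": 1, "side": "buy"}', '{"version": 2, "side": "buy", "new_field": true}'],
    dead,
)
```

The point of the design is that an old consumer never tries to interpret a newer message; it fails loudly and the backlog becomes the signal.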

Every company I've worked at had similar (though far less costly) issues.

Put an API method in every service that exposes the SHA of the running code, the build time of the binary (if compiled), and the timestamp when the deploy was initiated. (btw, having this information in the filesystem is insufficient, because what if a post-deploy restart failed?) Verify after every deploy.
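Roughly what I mean, as a sketch: `BuildInfo`, `version_payload`, and `stale_hosts` are hypothetical names, the SHAs and timestamps are placeholders, and the HTTP layer is left out so the check itself is easy to see.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BuildInfo:
    git_sha: str            # SHA of the code the process is actually running
    built_at: str           # build time of the binary
    deploy_started_at: str  # when the deploy that produced this process began


def version_payload(info: BuildInfo) -> str:
    """Body a service could return from a hypothetical /version endpoint."""
    return json.dumps(asdict(info), sort_keys=True)


def stale_hosts(expected_sha: str, payloads: dict) -> list:
    """Return the hosts whose running process reports a different SHA."""
    return sorted(
        host
        for host, body in payloads.items()
        if json.loads(body)["git_sha"] != expected_sha
    )


# Placeholder values: seven hosts got the new build, one was missed.
new = BuildInfo("abc123", "2000-01-02T00:00:00Z", "2000-01-02T01:00:00Z")
old = BuildInfo("9f8e7d", "2000-01-01T00:00:00Z", "2000-01-01T01:00:00Z")
reports = {f"host{i}": version_payload(new) for i in range(1, 8)}
reports["host8"] = version_payload(old)
```

Calling `stale_hosts("abc123", reports)` after the deploy flags the host that is still running the old build, which is the check that has to come from the live process rather than the filesystem.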

>BTW – if there is an SEC filing about your deployment something may have gone terribly wrong

I can't help but smile at this comment. Production servers crashing is bad news, but the above is a whole new level of bad.

The quite common lack of a kill switch is something that never fails to amaze me. Especially in large scale deployments with all kinds of distributed processes where you cannot simply turn off "the" server.

Everybody is worried about downtime, but downtime is rarely the worst that can happen.

Things turning to shit fast and not being able to stop it is both much more common and much harder to recover from.

So many organizations have dutifully implemented a single command deployment but don't even have a playbook for simply pulling the plug.

That's very true in this case. The issue here was that Knight isn't just trading for its own account. They're a broker where they likely have some SLA-ish agreement with clients, or face reputational risk at the very least. As a registered market-maker they're obligated to quote two-way prices. Shutting down costs them money and exposes them to regulatory risk.

And if you read between the lines, loss of reputation is what killed them:

"In 2012 Knight was the largest trader in US equities with market share of around 17% on each the NYSE and NASDAQ. Knight’s Electronic Trading Group (ETG) managed an average daily trading volume of more than 3.3 billion trades daily, trading over 21 billion dollars…daily."

This was for others, per the preceding sentence.

"Knight only has $365 million in cash and equivalents. In 45-minutes Knight went from being the largest trader in US equities and a major market maker in the NYSE and NASDAQ to bankrupt. They had 48-hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors). Knight Capital Group was eventually acquired by Getco LLC (December 2012) and the merged company is now called KCG Holdings."

Per https://news.ycombinator.com/item?id=7652573 afterwards "they were getting round 10% of their normal order volume"

Losing 90% of their business "overnight" tends to be fatal.

Per a link from the article the above links to, "An equities trader explained that Knight was the "last place" he would go to execute a trade. Others expressed befuddlement and the firm's inability to rectify the trading error for a full 45 minutes."

Now all that was immediately after the screwup, but it's hard to imagine it getting better without a perception that the people managing their operations were replaced with people worthy of trust.

Knight (and most big HFT firms) trade for others or sell their execution platforms. They also have proprietary trading arms, of which their market making strategies were presumably a part.

The way they were able to raise the funds they needed was to go to existing stakeholders for the money. This led directly to their merger with Getco, another huge HFT firm.

In addition to having a repeatable and dependable deployment process, it's also a good idea to remove unused functionality completely from deployed products. I also don't understand why people insist on using integer flags for things that matter. I've seen this type of error before, people reuse a bit pattern to mean something different and two pieces of code start doing different things.
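One alternative to reusing bit patterns, sketched with made-up names (`OrderMode`, `RETIRED`, `parse_mode` are all hypothetical): give each meaning its own named value, never recycle a value, and reject retired ones explicitly.

```python
from enum import Enum


class OrderMode(Enum):
    # Hypothetical modes; each meaning gets its own value that is never reused.
    LEGACY_TEST = "legacy_test"            # retired feature, kept only so it can be rejected
    RETAIL_LIQUIDITY = "retail_liquidity"  # the new feature gets a new name


RETIRED = {OrderMode.LEGACY_TEST}


def parse_mode(value: str) -> OrderMode:
    mode = OrderMode(value)  # raises ValueError for values nobody ever defined
    if mode in RETIRED:
        raise ValueError(f"{mode.name} is retired and must not be acted on")
    return mode
```

With this shape, an old meaning and a new meaning can never collide on the wire, and a stale sender gets an error instead of silently triggering different behavior.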

I think this says more about their business than their deployment process. It might be a good rule of thumb to say, "If your business can lose $400M in 45 minutes, you're not in business, you're playing poker."

Many businesses can lose that much or more in a very short time if something goes wrong, by destruction of product, damage to equipment, or damage to environment. Software can be responsible for all of these.

Famous example, although the cost was much less:


As I recall, it was a "vanity" function that blew up, i.e. something not at all necessary for running those smelters.

Ah, look here for some much more expensive ones in space exploration: https://en.wikipedia.org/wiki/List_of_software_bugs

The second YF-22 was killed by a software error. That was plenty expensive, I'm sure, then again, that's why in peacetime we build and thoroughly exercise test aircraft before starting production.

From the article, which even has a Wikipedia link to market maker:

"Knight Capital Group is an American global financial services firm engaging in market making, electronic execution, and institutional sales and trading."

Institutions and other big entities buy and sell stocks, in huge quantities. Someone has to execute these trades, and doing it electronically is infinitely faster and more efficient, and usually less error prone. And the platforms for doing this are therefore very "powerful".

But "With great power comes great responsibility", and this company was manifestly grossly irresponsible on many levels, it was likely only a matter of time before something like this would kill them.

The market is tough and I know a real business (commodity) that can lose about a million a minute in market shifts if done really wrong (like this).

If you can lose $400M in 45 minutes, you need an actual deployment team with actual procedures and triple check code verifications.

You are right in this case.

Those companies exist for one reason... in the past there were rules so people don't send money to the wrong place in the stock exchange. Those brokers and speed traders got ahead of everyone by bypassing those safeties with little regard for safety. The only sad part in this story is that it still hasn't happened to all of them.

This case is extremely interesting, because it presents a very difficult problem. What is it that Knight could have done to prevent such a serious error from occurring?

At the core, it seems, is that each application server is effectively running as root, with enormous capacity to cause immediate damage. The lesson from http://thecodelesscode.com/case/140 is to "trust no-one". This implies having automated supervisors that have the capacity and authority to shut down machines. This is difficult to build, and difficult to reason about.

Secondly, it warns us of the dangers of sharing global variables/flags. Humans lack the capacity to reason effectively what happens when a repurposed flag gets used by an old piece of code. That should be sufficient heuristic to avoid doing so. This is utterly preventable.

Thirdly, incomplete/partial deployment is extremely dangerous. While assembly signing and other approaches work for binaries, there's nothing said about configuration files. Perhaps best practice in highly dangerous situations requires configuration to be versioned and checked by the binaries as they load. After all, a configuration represents an executable specification. Similarly, relying on environment variables is extremely risky as well.
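A sketch of what "checked by the binaries as they load" could look like, assuming a JSON config with a "schema_version" field; that field, `BINARY_CONFIG_SCHEMA`, and `ConfigMismatch` are invented for the example:

```python
import json

BINARY_CONFIG_SCHEMA = 3  # hypothetical: the config schema this build was written against


class ConfigMismatch(Exception):
    """The config on disk was not written for this build."""


def load_config(text: str) -> dict:
    config = json.loads(text)
    found = config.get("schema_version")
    # Refuse to start rather than run new config on old code (or the reverse).
    if found != BINARY_CONFIG_SCHEMA:
        raise ConfigMismatch(
            f"binary expects schema {BINARY_CONFIG_SCHEMA}, config declares {found!r}"
        )
    return config
```

A process that refuses to start on a mismatch turns a silent partial deployment into an obvious failed one.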

Allspaw's post on this incident (http://www.kitchensoap.com/2013/10/29/counterfactuals-knight...) is much better. In particular, he explains why the SEC document is not a post-mortem and should not be used to reach this kind of conclusion.

I think it shows how release engineering hasn't been seriously funded at many companies. There are a bunch of tools in use, serving different communities, but most of them operate in the context of a single server. Production reality is clusters of machines. We need better tools for managing cluster deployments and hot-swapping code. The Erlang platform takes on some of this, but doesn't seem to have picked up the following it probably deserves. I bet there are some lessons to be learned there.

Why didn't they remove the old code first?

In LibreOffice, we are spending a LOT of time trying to remove unused and outdated code.

That's what I'm thinking as well. You can't leave land mines lying around and then blame the poor guy who steps on one.

If you find yourself afraid to pull old code out, you've probably got a combination of technological and cultural problems.

Because time (and money) spent removing old code that's not used right now is time (and money) spent for zero short-term profit increase.

We should let such firms die when they fall over their own crudulence.

Knight never got any government assistance and their trades stood. Unlike when the big I banks mess up, HFT firms tend to pay for their mistakes.

While some people are drawing lessons from this incident about HFT, I've seen no indication Knight was a major player in that; the summary says they were "engaging in market making, electronic execution, and institutional sales and trading." I.e. those actions are on behalf of others, plus stock exchange market makers have obligations to keep what they're responsible for liquid.

It depends on what your definition of HFT is. But they were frequently responsible for ~20% of equity volume in a day. They were market making purely electronically & at low latency. That hits nearly every definition of HFT that I know.

There are many interesting lessons from this tale, but the conclusion that automated deployment would have saved the day seems a bit of a jump.

Automation does not protect you from either automated devastation, gaps, or human errors. Your automation tools, just as with written instructions, require configuration -- a list of servers, for instance.

Automation can be bullet-proof when it's a continuous deployment situation, but is less reliable when you do infrequent deployments, as a financial firm like this does. I say this having been in a firm where we moved from "a list of deployment steps" to "fully automated" for our quarterly builds, and the result was much, much, much worse than it was before. We could certainly have resolved this (for instance by having a perfect replica of production), but the amount of delta, work, and testing we did on our deployment process exceeded our manual process by several orders of magnitude.

An observer did not validate the deployment (which should be the case whether automated or not for such a deploy). They ignored critical warning messages sent by the system pre-trading (the system was warning them that it was a SNAFU situation). Systems in a cluster didn't verify versions with each other. Configuration did not demand a version. Most importantly for a system of this sort, they didn't have a trade gateway through which they could easily see what the system was doing and gate abnormal behaviors quickly and easily (such a system should be as simple as possible, the premise being that it's an intermediate step between the decision/action systems and the market. The principle is exactly the same as sending a mass customer mailing to a holding pen for validation to ensure that your macros are correct, people aren't multi-sent, to do throttling, etc).
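To make the gateway idea concrete, here is a deliberately tiny sketch. `OrderGate` and its per-window order limit are hypothetical; a real gateway would look at much more than a count, but the shape is the same: it sits between the strategy and the market, trips on abnormal volume, and stays tripped until a human re-arms it.

```python
class OrderGate:
    """Sits between the decision systems and the market; knows nothing about strategy."""

    def __init__(self, max_orders_per_window: int):
        self.max_orders = max_orders_per_window
        self.count = 0
        self.tripped = False

    def allow(self) -> bool:
        """Return True if the next order may go out."""
        if self.tripped:
            return False
        self.count += 1
        if self.count > self.max_orders:
            self.tripped = True  # abnormal volume: block this and all later orders
            return False
        return True

    def new_window(self) -> None:
        """Called by a timer at the start of each window; does NOT clear a trip."""
        self.count = 0

    def reset(self) -> None:
        """Manual re-arm by a human after the behavior has been investigated."""
        self.tripped = False
        self.count = 0
```

Keeping the gate this dumb is the point: it has to keep working when everything upstream of it is in an unknown state.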

The bit that surprised me the most was the lack of killswitch (or a halt or a pause); that a human supervisor couldn't invoke a "holy shit!" button.

A "killswitch" is not a trivial thing to build.

If you simply run `kill -9 pid`, you might be holding a large position, or worse, you might be holding some naked shorts. (In fact, you almost certainly are.) This is risky. It can result in failures to deliver, make you vulnerable to large market movements, etc.

Another form of "killswitch" is to not open any new positions, but still attempt to close out your old positions with varying degrees of aggressiveness. But if your system is wildly broken, this might not be doing what you think it's doing. As I understand it, this happened to Knight.

I think that's a strange way to approach a killswitch. The point of a killswitch is to stop everything NOW, so the system can be checked. After the problem is understood, then it's safe to restart strategies gradually and unwind positions. A system shouldn't attempt to send more orders when it's in an unpredictable state. I believe many regulators require such a switch.

You know that saying about finding yourself in a hole? A "close out my risk" button is fine for situations like losing money or whatever, but if you have no clue what your orders, trades and risk even are, the only sensible thing to do is stop making it worse.
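The two kinds of switch being argued about here can be sketched side by side. Everything below is hypothetical (`Mode`, `permitted`, signed share counts); note that the close-only check is only as good as the position number fed into it, which is exactly the objection when the system has lost track of its own orders.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    CLOSE_ONLY = "close_only"  # no new risk; only orders that shrink the position
    HALT = "halt"              # send nothing at all


def permitted(mode: Mode, position: int, order_qty: int) -> bool:
    """position and order_qty are signed share counts (+ long/buy, - short/sell)."""
    if mode is Mode.HALT:
        return False
    if mode is Mode.CLOSE_ONLY:
        # Allowed only if the resulting position is strictly smaller in size.
        return abs(position + order_qty) < abs(position)
    return True
```

`HALT` needs no trustworthy state at all, which is why it is the safer default when nobody knows what the orders, trades, and risk actually are.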

This probably surprised regulators and the exchanges too, as it is a requirement for automated trading.

A kill switch for a system that holds complex and rapidly changing state (open positions, market data) is quite different from, and much more complicated than, a kill switch for a system that can be simply halted at a given instant.

This has nothing to do with automation of deployments. Any part of an automated deployment can fail. At scale a single failure cannot be allowed to halt the deployment, either.

This is an architectural mistake. Distributed systems must always be able to operate in an environment with a variety of versions, without ill effects.

They repurposed a flag and then failed to test the mixed environment.

Hindsight is 20/20, of course.

Posted a note about the human aspect of uptime recently:


> Automation can be bullet-proof when it's a continuous deployment situation, but is less reliable when you do infrequent deployments

If your deployment pipeline is fully automated, why aren't you making lots of little deployments? The safest change to make is the smallest change possible, after all.

Our deployment can be done in one command in about 15 minutes, start to finish. Yet, we only deploy once every few weeks. Why? There's more to releasing code than simply pushing it live: QA has to review it, documentation has to be written/updated, marketing may need to write a press release, sales and customers may need to be notified, etc.

"Has to" is pretty strong. You folks have chosen to work that way, but other people do it differently.

I hope you're trolling, but rolling out high-impact code as is described in this article and comment thread without serious review and QA is akin to business suicide, as demonstrated by the article.

The code was QAed, but they didn't test old and new versions against each other. Version A could accept a flag and run obsolete logic that would lose control of its orders but never sent it, so this problem never happened. Version B sent this flag and the receiver would send RPI orders with it. Put a Version B sender and a Version A receiver together and you end up with a disaster.

From a systems perspective, my takeaways on this are:

-Don't re-use a message for a semantically different purpose in a distributed system where you're running different software versions (even in cases where you don't plan to, really, since you may roll back or end up running the wrong code by mistake)

-Version your messages so anything that changes their meaning can only be accepted by a receiver that follows that protocol

-QA old and new builds against one another
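The last point can be a very small test. This is a toy model, not Knight's code: the senders, receivers, and the single "flag" field are invented to mirror the Version A / Version B situation described above, and the test simply runs every sender build against every receiver build.

```python
from itertools import product


def send_old():
    return {"flag": False}  # old sender never sets the flag


def send_new():
    return {"flag": True}   # new sender sets it to request the new feature


def receive_old(msg):
    # Old receiver still treats the flag as a request for obsolete logic.
    return "obsolete_logic" if msg["flag"] else "normal"


def receive_new(msg):
    return "new_feature" if msg["flag"] else "normal"


SENDERS = {"old": send_old, "new": send_new}
RECEIVERS = {"old": receive_old, "new": receive_new}
INTENDED = {"old": "normal", "new": "new_feature"}  # what each sender means to happen


def incompatible_pairs():
    """Return every (sender, receiver) build pair that does the wrong thing."""
    bad = []
    for (s_name, send), (r_name, receive) in product(SENDERS.items(), RECEIVERS.items()):
        if receive(send()) != INTENDED[s_name]:
            bad.append((s_name, r_name))
    return bad
```

Testing each build only against itself passes here; only the cross product exposes the new-sender/old-receiver pair.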

If you really want to look at the root cause of this, it's cultural. Trading desks don't want to spend development time on things that don't generate PnL. Traders want to try lots of ideas so many features are built that don't get used. Code cleanup gets put on the back burner. Developers do sketchy stuff like re-purposing a message field because it's annoying or time-consuming to deploy a new format. If traders aren't developers themselves, they may underestimate the risk of pressuring operations & devs to work more quickly.

Things like this are probably the biggest risk faced by automated traders, and the good shops take it very seriously. I've never been scared of any loss due to poor trading, but losses due to software errors can be astonishing and happen faster than you can stop them.

I'm not trolling at all.

Let me take his points:

> QA has to review it

QA review is one approach to quality, but it's far from the only one. In Lean Manufacturing, heavy QA is seen as wasteful, covering up for upstream problems. Their approach is to eliminate the root problems. That let Toyota kick the asses of the US car manufacturers in the 80s.

>documentation has to be written/updated

This to me smells of a phasist approach, with disconnected groups of specialists. Some people work with cross-functional teams, so that everything important (e.g., both code and user documentation) is updated at the same time.

>marketing may need to write a press release, sales and customers may need to be notified

This is confusing releasing code with making features active for most users. You can do them together, but it's not the only way. Feature flags and gradual rollouts are two other options.
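For anyone who hasn't seen a gradual rollout, a minimal sketch: the code ships to everyone, but a feature is switched on for a stable percentage of users. `in_rollout` and the bucketing scheme are illustrative assumptions, not any particular product's API.

```python
import hashlib


def in_rollout(feature: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Users in buckets below the threshold see the feature.
    return bucket < percent
```

Because the bucket depends only on the feature name and user id, raising `percent` from 10 to 50 keeps the original 10% enabled and adds more users, and setting it to 0 turns the feature off without a deploy.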

More broadly, in this case rolling out the code with serious review and QA was also business suicide. The "do more QA" approach is trying to increase MTBF, with the goal of nothing bad happening ever. But there's another approach: to minimize MTTR (or, more accurately, to minimize impact of issues). Shops like that are much better at recovering from issues. Rather than trying to pretend they will never make mistakes, they assume they will and work to be ready for it.

The biggest problem that I see with having heavy QA as a gateway to release (from a lean production perspective) is that it tends to encourage deploying large batches of changes at once. When something goes wrong (which it will), which one of the n changes (or combination of changes) caused the problem? How can you roll back/roll forward a fix to just the one problem?

No, the safest change to make is one that has been reviewed and tested thoroughly and found to be correct.

Plenty of harm can be found in even small changes; the reality is that continuous deployment only works for companies that can afford to regularly push minutely broken software to their customers.

Wealthfront has been doing continuous deployment for years, and they're financially liable for any bugs. You might want to reconsider your opinion.

No, they're financially liable for financially impacting bugs. Broken web UI annoys your customers, but is unlikely to bankrupt you.

I'm not getting your point. Is there some software that doesn't have the occasional minor UI bug? My point is that Wealthfront is doing Continuous Deployment in an environment where even small changes can cause substantial harm, and they seem to be doing fine at it.

Yes, there is software that is designed with the goal of never wasting the user's time or goodwill with buggy or ill conceived UX or implementation.

As for "doing fine at it", that they exist doesn't prove that, or even define what "fine" is.

I'm still not getting your point. So what if software is designed with a goal of being magical? The question is how good the real-world results are.

From hearing them talk at events and reading their blog, they seem to be doing fine. Surviving for 6 years and their recent C round from a number of smart people also suggests they are doing fine. If you have some substantial indication that they aren't, I look forward to reading it.

The point is simple: The fact that they raised an investment round and haven't gone out of business is not an argument for their methodology.

"Fine" doesn't mean anything in that context. I could ship lower quality software to our customers and do "fine", but I prefer to ship high quality well-polished software that doesn't foist the cost of my development laziness onto our customers.

They've been in business 6 years. I'm arguing that these methods are sustainable, so yes, them sustaining them is a good hint that it's sustainable.

I also prefer to ship high-quality software to customers. And from everything they've said, so do they. Your (unsubstantiated) claim appears to be that longer feedback loops (with particular supporting practices) result in net higher quality than shorter ones (with particular supporting practices). I don't think that's true, and places like Wealthfront and Etsy are good counterexamples.

No rollback plan? Wtf.

  repurposed an old flag
The flag was not OLD since there was still code in the CURRENT code base which COULD use it.

There was live code in the current code base which could use the flag, but it hadn't been active in 8 years.

Old doesn't mean unused...

The blog spam isn't necessary. To get the actual findings check out the SEC post-mortem located here.

http://www.sec.gov/litigation/admin/2013/34-70694.pdf (PDF warning).

The posted article is much more readable than this SEC document. I wouldn't call it "blog spam" at all. (I'd define "blog spam" as a blog post that links to another article without adding any value.)
