- "It's risk free." Any time you put headless trading code into the market you are risking a catastrophic loss. That risk can be managed to a degree with many layers of programmatic safeties, and other practices like having your operations people look for warning emails the day after you've deployed new code. But the risk is always present.
- "It makes the market more unstable." The most important market-maker in U.S. equities blew itself up in spectacular fashion and had to remove itself from trading entirely. Sending unaccounted-for orders into the market in an endless loop is about the worst mistake an algorithmic trading firm can make. Can anyone pick the day this happened out of a long-term chart of the S&P 500?
1. know what a normal morning looks like
2. and recognize the abnormality
3. and have the authority to shut down all trading immediately
DevOps is not an excuse to fire your operations staff; it's a requirement that your developers work with and understand your operations staff, and vice versa.
The other aspect is that while I think people should be empowered, shutting down business operations should be decided by more than just an operations team.
Here's an example: one morning on my way to work many years ago, I got a frantic call from our head of operations. "The site is slow", "we're getting DoSed", "I'm going to start blocking netblocks from the major offenders".
I talked him into waiting until I got there, and took a look - sure enough, we're doing 6x our normal traffic, web servers are slowing down, most of it coming from the US. But was it a DoS? Credit card transactions were also up, but not 6x, about 1.5x-2x, so more people were buying, but not proportional to the normal traffic volumes.
A short while later we figured it out - unbeknownst to us, we had been featured on a major US morning news show, complete with a walkthrough on how to use our site. Millions of people jumped on the site to give it a shot, but many of them abandoned it, or the site got too slow for them to purchase on. We fixed it, got it up and running again, and made a ton of money.
But if our operations group had obeyed the "clear" signs of what was going on and just started blocking whole netblocks, we would have lost money and hurt our business.
They managed to raise the money to cover the loss, but afterwards they were getting around 10% of their normal order volume.
Somewhat ironically, the closest thing they had to a kill switch, backing out the code, actually made the situation worse as it made all 8 servers misbehave instead of just the first one.
The full SEC report is an interesting read; just skip the parts about "regulation 5c-15...".
And as to the backout: they reverted the code on the 7 servers while erroneously leaving the revised configuration in place, so it wasn't really a kill switch at all. It was frantic fumbling that made things worse.
They would have been better off just shutting the servers down at the first hint of trouble.
Put an API method in every service that exposes the SHA of the running code, the build time of the binary (if compiled), and the timestamp when the deploy was initiated. (btw, having this information in the filesystem is insufficient, because what if a post-deploy restart failed?) Verify after every deploy.
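A minimal sketch of such an endpoint, using only the Python standard library. The environment variable names (`GIT_SHA`, `BUILD_TIME`, `DEPLOY_STARTED_AT`) are illustrative assumptions, something a deploy pipeline would stamp in, not a standard. The point is that the facts are captured in the running process at startup, so a failed post-deploy restart can't lie to you:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_version_info():
    """Capture version facts at process start, not from the filesystem.

    The env var names here are hypothetical; your deploy tooling would
    set them however it likes.
    """
    return {
        "git_sha": os.environ.get("GIT_SHA", "unknown"),
        "build_time": os.environ.get("BUILD_TIME", "unknown"),
        "deploy_started_at": os.environ.get("DEPLOY_STARTED_AT", "unknown"),
        "pid": os.getpid(),  # proves which process is actually answering
    }

# Captured once at startup; if the restart failed, the old process
# keeps reporting the old SHA, which is exactly what you want to see.
VERSION_INFO = build_version_info()

class VersionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/version":
            body = json.dumps(VERSION_INFO).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve it: HTTPServer(("", 8080), VersionHandler).serve_forever()
```

A post-deploy verification step then just curls `/version` on every host and compares SHAs against what it meant to ship.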
I can't help but smile at this comment. Production servers crashing is bad news, but the above is a whole new level of bad.
Everybody is worried about downtime, but downtime is rarely the worst that can happen.
Things turning to shit fast and not being able to stop it is both much more common and much harder to recover from.
So many organizations have dutifully implemented a single command deployment but don't even have a playbook for simply pulling the plug.
"In 2012 Knight was the largest trader in US equities with market share of around 17% on each the NYSE and NASDAQ. Knight’s Electronic Trading Group (ETG) managed an average daily trading volume of more than 3.3 billion trades daily, trading over 21 billion dollars…daily."
This was for others, per the preceding sentence.
"Knight only has $365 million in cash and equivalents. In 45-minutes Knight went from being the largest trader in US equities and a major market maker in the NYSE and NASDAQ to bankrupt. They had 48-hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors). Knight Capital Group was eventually acquired by Getco LLC (December 2012) and the merged company is now called KCG Holdings."
Per https://news.ycombinator.com/item?id=7652573 afterwards "they were getting round 10% of their normal order volume"
Losing 90% of their business "overnight" tends to be fatal.
Per a link from the article the above links to, "An equities trader explained that Knight was the "last place" he would go to execute a trade. Others expressed befuddlement and the firm's inability to rectify the trading error for a full 45 minutes."
Now all that was immediately after the screwup, but it's hard to imagine it getting better without a perception that the people managing their operations were replaced with people worthy of trust.
The way they were able to raise the funds they needed was to go to existing stakeholders for the money. This led directly to their merger with Getco, another huge HFT firm.
As I recall, it was a "vanity" function that blew up, i.e. something not at all necessary for running those smelters.
Ah, look here for some much more expensive ones in space exploration: https://en.wikipedia.org/wiki/List_of_software_bugs
The second YF-22 was killed by a software error. That was plenty expensive, I'm sure, then again, that's why in peacetime we build and thoroughly exercise test aircraft before starting production.
"Knight Capital Group is an American global financial services firm engaging in market making, electronic execution, and institutional sales and trading."
Institutions and other big entities buy and sell stocks, in huge quantities. Someone has to execute these trades, and doing it electronically is infinitely faster and more efficient, and usually less error prone. And the platforms for doing this are therefore very "powerful".
But "With great power comes great responsibility", and this company was manifestly grossly irresponsible on many levels, it was likely only a matter of time before something like this would kill them.
If you can lose $400M in 45 minutes, you need an actual deployment team with actual procedures and triple check code verifications.
Those companies exist for one reason... in the past there were rules so people don't send money to the wrong place on the stock exchange. Those brokers and speed traders got ahead of everyone by bypassing those safeties with little regard for safety. The only sad part in this story is that it still hasn't happened to all of them.
At the core, it seems, is that each application server is effectively running as root, with enormous capacity to cause immediate damage. The lesson from http://thecodelesscode.com/case/140 is to "trust no-one". This implies having automated supervisors with the capacity and authority to shut down machines. This is difficult, and difficult to reason about.
Secondly, it warns us of the dangers of sharing global variables/flags. Humans lack the capacity to reason effectively about what happens when a repurposed flag gets used by an old piece of code. That should be sufficient heuristic to avoid doing so. This is utterly preventable.
Thirdly, incomplete/partial deployment is extremely dangerous. While assembly signing and other approaches work for binaries, nothing comparable is said about configuration files. Perhaps best practice in highly dangerous situations requires configuration to be versioned and checked by the binaries as they load. After all, a configuration represents an executable specification. Similarly, relying on environment variables is extremely risky as well.
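One way to sketch that "configuration checked by the binary" idea, assuming a JSON config with a hypothetical `config_version` field: the binary hard-codes the version it was built against, and refuses to start on a mismatch. A repurposed flag then fails loudly at load time instead of silently activating dead code paths:

```python
import json

# Bumped whenever any flag's meaning changes; compiled into the binary.
EXPECTED_CONFIG_VERSION = 2

class ConfigVersionError(RuntimeError):
    pass

def load_config(text):
    """Parse config and refuse to run if its version doesn't match the binary.

    If an old binary (expecting version 1) is accidentally left running
    against a new config (declaring version 2), the mismatch is fatal at
    startup rather than a silent misinterpretation at trading time.
    """
    cfg = json.loads(text)
    version = cfg.get("config_version")
    if version != EXPECTED_CONFIG_VERSION:
        raise ConfigVersionError(
            f"config version {version!r} != expected {EXPECTED_CONFIG_VERSION}"
        )
    return cfg
```

The field name and versioning scheme here are assumptions; the design point is only that config and code declare compatibility to each other explicitly.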
In LibreOffice, we are spending a LOT of time trying to remove unused and outdated code.
If you find yourself afraid to pull old code out, you've probably got a combination of technological and cultural problems.
Automation does not protect you from automated devastation, gaps, or human error. Your automation tools, just like written instructions, require configuration -- a list of servers, for instance.
Automation can be bullet-proof in a continuous deployment situation, but it's less reliable when you do infrequent deployments, as a financial firm like this does. I say this having been in a firm where we moved from "a list of deployment steps" to "fully automated" for our quarterly builds, and the result was much, much worse than before. We could certainly have resolved this (for instance, by having a perfect replica of production), but the delta, work, and testing we put into our deployment process exceeded our manual process by several orders of magnitude.
An observer did not validate the deployment (which should be the case, automated or not, for a deploy like this). They ignored critical warning messages sent by the system pre-trading (the system was telling them it was a SNAFU situation). Systems in the cluster didn't verify versions with each other. The configuration did not demand a version. Most importantly for a system of this sort, they didn't have a trade gateway where they could easily see what the system was doing and gate abnormal behaviors quickly and easily. Such a gateway should be as simple as possible; the premise is that it's an intermediate step between the decision/action systems and the market. The principle is exactly the same as sending a mass customer mailing to a holding pen for validation: to ensure that your macros are correct, that people aren't multi-sent, to do throttling, etc.
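A toy illustration of that gateway idea (the limits and class shape are my own invention, not anyone's production design): a small, dumb chokepoint between the strategy code and the market that counts orders and halts on anomalies, rather than trying to be clever:

```python
from collections import defaultdict

class TradeGateway:
    """Thin chokepoint between decision/action systems and the market.

    Deliberately simple: it counts orders and enforces crude limits.
    The limits below are illustrative placeholders.
    """

    def __init__(self, max_orders_per_symbol=1000, max_order_qty=10_000):
        self.max_orders_per_symbol = max_orders_per_symbol
        self.max_order_qty = max_order_qty
        self.order_counts = defaultdict(int)
        self.halted = False

    def submit(self, symbol, qty):
        if self.halted:
            return "rejected: gateway halted"
        if qty > self.max_order_qty:
            self.halted = True
            return f"rejected: qty {qty} over limit, halting all flow"
        self.order_counts[symbol] += 1
        if self.order_counts[symbol] > self.max_orders_per_symbol:
            self.halted = True
            return f"rejected: too many orders in {symbol}, halting all flow"
        return "accepted"
```

A runaway loop hammering one symbol trips the per-symbol counter within a bounded number of orders, instead of running for 45 minutes.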
If you simply run `kill -9 pid`, you might be holding a large position, or worse, you might be holding some naked shorts. (In fact, you almost certainly are.) This is risky. It can result in failures to deliver, make you vulnerable to large market movements, etc.
Another form of "killswitch" is to not open any new positions, but still attempt to close out your old positions with varying degrees of aggressiveness. But if your system is wildly broken, this might not be doing what you think it's doing. As I understand it, this happened to Knight.
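The "close-only" idea above can be sketched as a tiny state machine. This is an illustrative sketch, not Knight's actual design; and as the comment notes, it is only as trustworthy as the position tracking it consults:

```python
class OrderManager:
    """Illustrative 'close-only' kill switch.

    In CLOSE_ONLY mode, an order passes only if it reduces an existing
    position without flipping through zero; HALT rejects everything.
    Caveat: if position tracking is itself broken, CLOSE_ONLY mode
    is no safer than NORMAL mode.
    """
    NORMAL, CLOSE_ONLY, HALT = "normal", "close_only", "halt"

    def __init__(self):
        self.mode = self.NORMAL
        self.positions = {}  # symbol -> signed quantity (negative = short)

    def allow(self, symbol, signed_qty):
        if self.mode == self.HALT:
            return False
        if self.mode == self.NORMAL:
            return True
        pos = self.positions.get(symbol, 0)
        # Reduces the position only if it has the opposite sign and
        # doesn't overshoot through zero into a new position.
        return pos * signed_qty < 0 and abs(signed_qty) <= abs(pos)
```

The aggressiveness knob mentioned above would then live in *how* the close-out orders are priced, not in this gating logic.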
This is an architectural mistake. Distributed systems must always be able to operate in an environment with a variety of versions, without ill effects.
They repurposed a flag and then failed to test the mixed environment.
Hindsight is 20/20, of course.
If your deployment pipeline is fully automated, why aren't you making lots of little deployments? The safest change to make is the smallest change possible, after all.
From a systems perspective, my takeaways on this are:
-Don't re-use a message for a semantically different purpose in a distributed system where you're running different software versions (even in cases where you don't plan to, really, since you may roll back or end up running the wrong code by mistake)
-Version your messages so anything that changes their meaning can only be accepted by a receiver that follows that protocol
-QA old and new builds against one another
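The message-versioning takeaway can be sketched like this (field names and the JSON framing are assumptions for illustration): the receiver dispatches strictly on a declared version and refuses anything it doesn't recognize, so a repurposed or reinterpreted field can never be silently misread by old code:

```python
import json

class UnknownMessageVersion(ValueError):
    pass

def handle_v1(msg):
    # In v1, "flag" meant "enable the legacy algo".
    return ("legacy", msg["flag"])

def handle_v2(msg):
    # In v2 the field was renamed rather than repurposed, so a v1
    # receiver can't accidentally act on it.
    return ("rlp", msg["rlp_enabled"])

# Handlers keyed by message version.
HANDLERS = {1: handle_v1, 2: handle_v2}

def dispatch(raw):
    """Accept only messages whose version this build actually knows."""
    msg = json.loads(raw)
    handler = HANDLERS.get(msg.get("version"))
    if handler is None:
        raise UnknownMessageVersion(f"version {msg.get('version')!r}")
    return handler(msg)
```

A mixed cluster then fails closed: an old receiver seeing a new version raises instead of guessing, which is exactly the scenario the third takeaway (QA old and new builds against one another) is meant to exercise.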
If you really want to look at the root cause of this, it's cultural. Trading desks don't want to spend development time on things that don't generate PnL. Traders want to try lots of ideas so many features are built that don't get used. Code cleanup gets put on the back burner. Developers do sketchy stuff like re-purposing a message field because it's annoying or time-consuming to deploy a new format. If traders aren't developers themselves, they may underestimate the risk of pressuring operations & devs to work more quickly.
Things like this are probably the biggest risk faced by automated traders, and the good shops take it very seriously. I've never been scared of any loss due to poor trading, but losses due to software errors can be astonishing and happen faster than you can stop them.
Let me take his points:
> QA has to review it
QA review is one approach to quality, but it's far from the only one. In Lean Manufacturing, heavy QA is seen as wasteful, covering up for upstream problems. Their approach is to eliminate the root problems. That let Toyota kick the asses of the US car manufacturers in the 80s.
>documentation has to be written/updated
This to me smells of a phasist approach, with disconnected groups of specialists. Some people work with cross-functional teams, so that everything important (e.g., both code and user documentation) is updated at the same time.
>marketing may need to write a press release, sales and customers may need to be notified
This is confusing releasing code with making features active for most users. You can do them together, but it's not the only way. Feature flags and gradual rollouts are two other options.
More broadly, in this case rolling out the code with serious review and QA was also business suicide. The "do more QA" approach is trying to decrease MTBF, with the goal of nothing bad happening ever. But there's another approach: to minimize MTTR (or, more accurately, to minimize impact of issues). Shops like that are much better at recovering from issues. Rather than trying to pretend they will never make mistakes, they assume they will and work to be ready for it.
Plenty of harm can be found in even small changes; the reality is that continuous deployment only works for companies that can afford to regularly push minutely broken software to their customers.
As for "doing fine at it", that they exist doesn't prove that, or even define what "fine" is.
From hearing them talk at events and reading their blog, they seem to be doing fine. Surviving for 6 years and their recent C round from a number of smart people also suggests they are doing fine. If you have some substantial indication that they aren't, I look forward to reading it.
"Fine" doesn't mean anything in that context. I could ship lower quality software to our customers and do "fine", but I prefer to ship high quality well-polished software that doesn't foist the cost of my development laziness onto our customers.
I also prefer to ship high-quality software to customers. And from everything they've said, so do they. Your (unsubstantiated) claim appears to be that longer feedback loops (with particular supporting practices) result in net higher quality than shorter ones (with particular supporting practices). I don't think that's true, and places like Wealthfront and Etsy are good counterexamples.
repurposed an old flag
http://www.sec.gov/litigation/admin/2013/34-70694.pdf (PDF warning).