He said he saw the whole dev team just power off and go home at 11am, followed quickly by the rest of the employees. At that point, there was nothing they could do.
The craziest thing is that it went on for so long. No one caught it until their own traders saw it come across Bloomberg and CNBC. They actually thought it was a rival HFT and tried to play against it.
The only people that came out of this ahead were aggressive algos on the other side and a few smart individual traders. A lot of retail guys had stop losses blown through that normally would never have been hit. After trading was halted they set the cap at 20% loss for rolling back trades. So if you lost 19% of your position in that short period of craziness, tough luck.
> The week after this we had a trader in our office who had a meeting at Knight on the morning it happened.
> He said he saw the whole dev team just power off and go home at 11am, followed quickly by the rest of the employees.
1) Dev and Trading/Sales happen at different physical locations.
2) I actually know someone who spent their day cleaning this up, and according to someone who was on the tech team and working that day, no one went home early.
Think about it: the firm just lost a shitload of money due to an IT issue. Devs were frantically searching the code for the bug, and sysadmins were rifling through server logs. No one had nothing to do :)
> After trading was halted they set the cap at 20% loss for rolling back trades. So if you lost 19% of your position in that short period of craziness, tough luck.
This is just plain false. The normal procedures for busting trades were followed. There was no 20% "cap" for losses; how would you even determine what a 20% loss is?
Per the 20%, I forget what it was, but I know trades were rolled back and there was some kind of threshold for them. When I first wrote this, a 20% loss was what I remembered. There's an article on it somewhere with the actual amounts. I think it had to do with how far stop limits were off from the open price of the stock.
FYI, NYSE rolled back transactions based on predetermined price rules. There were no discretionary rollbacks associated with Knight's "big day".
> The only people that came out of this ahead were aggressive algos on the other side
Don't forget the other market makers; they took the other side of those Knight trades and made the spread every time Knight lost the spread.
In my previous life working with telcos, I once tried to teach a particularly huge customer how to use CVS to manage configurations across a cluster of 10+ machines. They didn't see any value in it, so they stuck to their good old process of SSHing into each machine individually, "cp config.xml config.xml.20131022", and then editing the configs by hand. It didn't take long until a typo in a chmod command took down the whole thing (the node couldn't take down a network interface anymore, so failover stopped working), and they spent several weeks flying in people from all over the planet to debug it... and they still didn't learn their lesson!
The other day he was describing a strange bug that was triggering in the field at a large telco. It was strange because it always got triggered after exactly 85 days of deployment.
It turned out to be a debug script that was pinging a development server and timing out (because the development server was not accessible from the field). The series of retrials and timeouts totaled to exactly 85 days, after which the rest of the script would activate and uninstall crucial dependencies!
By having negotiated profit margins. Bell Labs was meant to soak up excess money. What we are now nostalgic for would never have happened without an interval during which a telco was considered a "natural monopoly."
(Also: lots of folks really don't enjoy learning new stuff. Or new ways of working. No, they really don't. Even if the new techniques are vastly better and more efficient. Put this down to a human cognitive bias favouring the tried-and-trusted over new-and-untested. There's a lot to be said for that when you're a neolithic hunter-gatherer or an iron-age peasant -- if you try something new and it fails, you maybe get to watch your family starve next spring -- but it's a bit less useful as a rule of thumb in the data centre.)
Of course intelligent, incremental deployment works too, but let's not confuse those poor suits...
SSH is great at least in the sense that it's configuration-format-agnostic, mostly-OS-agnostic, client-side-automation-agnostic, and many other things like that.
As noted above, regulators found that high-frequency traders exacerbated price declines. They determined that HFTs sold aggressively to eliminate their positions and withdrew from the markets in the face of uncertainty.
Berkshire Hathaway has a bid-ask spread of around $1,000, yet you don't see a lot of people complaining, do you?
Thank you. That is all that needs to be said.
Norway's sovereign wealth fund (one of the largest in the world, they own 1% of all US stocks) just came out on this exact topic:
> Next, the paper takes on HFT's usual defence – that they provide the market with much-increased liquidity. Norges Bank worries that this liquidity is "transient" - i.e. HFTs often place large orders only to then cancel them, creating "phantom" liquidity and leaving "buyside traders fac[ing] new challenges in assessing posted liquidity."
But great for you! Because maybe you're the guy selling MSFT shares to them. You get the benefit of the price rising faster.
The market is more efficient. Norway can no longer take advantage of the fact that it knows that there's all this additional demand (originating from itself) and it takes a while for everyone else to figure that out.
This is because there's not just one HFT. There are tons. So if one of them tries to do this you won't sell to them, you'll sell to one of the other ones for a better price. Yay competition!
"Volatility, a measure of the extent to which a share’s price jumps around, is about half what it was a few years ago."
Quoted from here:
Although I wish I could find a better article / source with more info and some hard data to back that up. That sentence is not nearly as compelling as it could be if it had more details.
1. kagillion (shares, %ADV, notional dollars)
2. good old days (single year or a range)
3. too much (in $ or %)
2. It's not a binary thing but a gradual transition from mostly human market makers in the 80s to mostly algorithmic market makers today.
3. I'm on shakier ground on this one, but a fraction of a % of their cost.
Liquidity is always there. Until it ain't.
And I've yet to see a convincing argument that HFT is of any benefit to anyone other than the HFT firms themselves.
At the most basic level, HFT firms make money, right? Otherwise they wouldn't be doing this. Where does this money come from? Entities that are holding these same stocks for longer periods of time. It increases the costs to the longer-term buyers, and it decreases profits to the long-term sellers.
HFT is sand in the gears of the economy, not lubricating oil like they would have you believe.
It's like saying that if you add a 1% tax on food that all grocery stores will go out of business because their margins tend to be really small (around 1%). Clearly that's not what actually happens.
Several exchanges (the Hong Kong Stock Exchange, for one) rate-limit each connection to the exchange, charging fees based on the number of transactions per second allowed on the connection.
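For the curious, per-connection throttling like this is usually some variant of a token bucket. A minimal Python sketch of the idea (the rates and names are made up for illustration, not HKEx's actual scheme):

```python
import time

class TokenBucket:
    """Allow at most `rate` messages/second on a connection, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity     # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # reject (or queue) the message

# Usage: a connection licensed for 100 messages/sec with bursts of 20.
bucket = TokenBucket(rate=100, capacity=20)
if bucket.allow():
    pass  # forward the order to the matching engine
```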
Avoiding systems that don't create value would help us live in a world with more value in it. That's why considering this is very productive, and helps moral people avoid wasting precious resources on zero sum games.
He was giving his own value judgement about it.
Also, I do think that knowing whether something actually creates value or not (whether it's a positive-sum, negative-sum, or zero-sum game) is definitely helpful in understanding large systems. E.g., it helps predict whether a society will succeed or fail, by seeing how much of it is wasted on negative- or zero-sum games.
Societies will succeed more or less based on various parameters, one of which is whether they generate value to sustain themselves or not.
A perfect description of many forms of organized crime.
The classic example is -- if I recall correctly -- a mutation sometimes found in mice. The mutation causes a male to only produce male children. This mutation will spread through the population until there are only male mice, and one generation later the mice are all dead.
The "forced coercion" typically only comes into play if someone fails to hold up their end of a contract---criminal organizations cannot sue.
But, you know what they say: In theory there's no difference between theory and practice. In practice, there is. ;)
I think Insider Trading is more of a problem than HFT.
There's a limited total profit potential everyone competes for, which limits the number of players and the expenses that can be justified.
And that profit potential is directly related to market volatility. Which peaked in 2008 and has since gone down drastically.
A lot of the loudest criticism of HFT is that it's too competitive - a lot of people who used to make a comfortable living from the bid/ask spread are no longer able to due to computers driving down profit margin.
Unless you're prepared to devote a lot of time to learning, it's best to stick to index funds or lower-risk investments (GICs).
Having said that, we deployed a system that was mostly automated, with a human operator to oversee investments and shut it down if any out-of-the-ordinary transactions (based on experience) were taking place. She happily sat there approving the recommendations even though they were absolutely outside of anything we'd ever generated in the past, and the system bled accounts dry in one evening. So sometimes, even with a human observing, you're still boned.
Maybe you have to figure out a way to test people for unusually high aptitude at looking at mind-numbingly dull repetitive things over and over again, but then still being able to notice the aberrant ones. And then only put people in that job with unusually high aptitude there.
Or have people only do pretty short shifts at that task.
I'm pretty confident that this person wasn't unusually negligent; if you have most anyone doing that job hour after hour, day after day, they will lose the ability to flag the aberrant stuff.
If an alert system is not perceived as highly reliable in directing positive action, then the humans involved will inevitably disable the alert system, either by pulling out a screwdriver or rewriting their mental rubrics to ignore the messages as noise.
Knight Capital is just the finance version of Three Mile Island and Deepwater Horizon -- the means to mitigate or prevent disaster were on hand, but the people in charge just dithered by the kill switch because they were confused. Well, if the people in charge are confused, that is a reason to start the emergency procedures.
People ignore repetitive things, but they usually notice when it changes. They can tell you that it's shaped different or explain how it sounds different from normal.
If this system made dissimilar transactions look very similar to the monitor, then it is to blame, not the idea of having a monitor at all.
What you want is two or three people really in charge, where the individuals are empowered to say: "I am totally confused. If someone cannot explain what is going on to me so that I understand, I am starting shutdown procedures, immediately. Do YOU know exactly what is going on?"
That said, if you're going to fly the jet liner in full manual mode, you better make sure your co-pilot is watching the instruments.
Automated deployments would have helped them. They made lots of mistakes, but IMHO not automating the deploy was likely the #1 mistake here.
The Knight computer error was spectacular and catastrophic but us humans have a longer track record of making catastrophic financial decisions in the market.
Puts and Calls are confusing as they are both something you buy. It's not like a sports bet where you're betting on the team to win so the other side will lose if they win. You can buy a put and a call in the same stock and profit on both if your strike prices are aligned properly.
The opposite side of buying a call is selling a call. Just like shorting a stock, when you sell an option your gain is capped (at the premium you collected) while your loss potential is unlimited. As an option seller, you're hoping the option expires worthless and out of the money so that you can keep the premium. Most brokerages require a high level of clearance before they'll let you sell naked puts and calls, as it's generally a bad idea if you can't cover the potential losses.
Actually, we would both lose money if the stock price doesn't move. But yes, generally in order for a call or put to go up, the other must go down.
It's not zero sum, though. If you had purchased a 950 call and an 850 put just before Google announced earnings (with the stock around 900), the call would now be worth more than you paid for both options combined. The counterparty who sold the options is the one who loses (and it is zero sum with them, per design).
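To make the arithmetic concrete, here's a small Python sketch of that strangle's payoff at expiry (the premiums are invented for illustration; real pre-earnings prices would differ):

```python
def payoff_at_expiry(spot: float, call_strike: float, put_strike: float,
                     call_premium: float, put_premium: float) -> float:
    """Net P&L of holding one call and one put (a strangle) to expiry."""
    call_value = max(spot - call_strike, 0.0)
    put_value = max(put_strike - spot, 0.0)
    return call_value + put_value - (call_premium + put_premium)

# Stock at ~900 before earnings; hypothetical premiums of 15 and 12.
# A big move either way wins; no move loses both premiums.
for spot in (900, 1010, 790):
    print(spot, payoff_at_expiry(spot, call_strike=950, put_strike=850,
                                 call_premium=15, put_premium=12))
# 900  -> -27 (both legs expire worthless)
# 1010 -> +33 (call worth 60, minus the 27 paid)
# 790  -> +33 (put worth 60, minus the 27 paid)
```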
For instance, let's say a large institutional investor determines that a 5-yr 10-yr flattener on South Whereisitstan bonds is very attractively priced, but can't stomach the risk of the 10-yr yield going through the roof. They pay an investment bank to create some OTC options on the Whereisitstan 10-year notes and several medium-sized hedge funds take the other side. If the hedge funds are right, the big institution over-pays for the options in strict dollar terms, but the options allow the institution to enter into a very attractive bond trade they otherwise would have been unwilling to enter. In this case, everyone could win, even though the big institution takes a (both realized and statistical) loss on the options.
Instead, we designed our system so that there's a very low threshold for it to stop trading if it appears to have lost money, but to only do so temporarily. If our marked-to-market position recovers shortly thereafter while the trading system is idle, then the apparent loss was probably due to a market data glitch. On the other hand, if our position does not recover, then the temporary stoppage becomes permanent, and a human intervenes. (Obviously, there are more details here, but this is the general idea, and it's worked very well for us.)
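For anyone curious what that looks like in code, here's a minimal Python sketch of the pause-and-verify idea (the thresholds, timings, and hook names are all invented, not the poster's actual system):

```python
import time

APPARENT_LOSS_LIMIT = 10_000   # deliberately low: pause early and often
RECOVERY_WINDOW_SEC = 30       # idle long enough for a data glitch to clear

def drawdown_check(marked_to_market_pnl, halt, resume, page_human) -> None:
    if marked_to_market_pnl() > -APPARENT_LOSS_LIMIT:
        return                          # nothing unusual, keep trading
    halt()                              # stop trading immediately, but temporarily
    time.sleep(RECOVERY_WINDOW_SEC)     # do nothing while idle
    if marked_to_market_pnl() > -APPARENT_LOSS_LIMIT:
        resume()                        # position recovered: likely a data glitch
    else:
        page_human()                    # loss is real: stoppage becomes permanent
```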
This is a point that bears amplifying. People who do not work in the financial industry may not appreciate just how bad market data feeds are. Radical jumps with no basis in reality, prices dropping to zero, regular ticks going missing, services going offline altogether with no warning, etc.
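Given feeds that bad, even a crude plausibility filter in front of a strategy catches a lot. A hedged Python sketch (the 10% jump threshold is an arbitrary placeholder):

```python
from typing import Optional

def is_plausible_tick(price: float, last_good_price: Optional[float],
                      max_jump: float = 0.10) -> bool:
    """Reject obviously bad ticks before a strategy ever sees them."""
    if price <= 0:
        return False                 # zero/negative prices are feed noise
    if last_good_price is not None and abs(price / last_good_price - 1) > max_jump:
        return False                 # a radical jump with no basis in reality
    return True
```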
This particular subsystem was a replacement for a previous version, which was a kill switch and had led to opportunity costs. We spent a lot of time designing, implementing, and testing it, but haven't felt the need to touch it since then. It's sufficiently general that it doesn't need to be adapted as our strategies change, and it doesn't need to adapt to changing market conditions (as, e.g., a trading strategy does).
Those types of risk controls (i.e. capital thresholds) are bog standard in our industry and it still kind of blows my mind that Knight managed to get around them so easily.
Since there were no procedures in place, would you like to be the guy who pulled the plug on the (let's guess) $100 million/minute processing system? Do you think you could get another job after that? What would the costs be for violating contracts? You could single handedly sink the company (which, in the end, this issue basically did).
I don't blame the guy for not killing all operations. He never should have been put in that situation. Proper QA, regression testing, monitoring, running a shadow copy and verifying its output: there are tons of things that could have prevented or mitigated this.
Guy1 "Sir, we're losing money much faster than predicted, turn your key"
Guy2 "It could just be market variance"
Guy1 pulls out gun "Turn your key, sir"
Honest question: is that really doable at this scale?
So you'd need two codebases and two developer teams, coordinated enough that their code produced exactly the same output yet independent enough that they didn't make the same mistakes. With the challenges of coordination this would more than double your costs.
Of course, with the benefit of hindsight, the costs might have been worthwhile...
Also, the market may not go against you immediately. What if the glitch in the system means you're opening positions in stocks and you drive up the price by doing so? The losses are not immediately apparent. There's no screen where you could watch your losses run up in real time. The losses only become apparent once you try to unwind those positions and that's the case in many kinds of scenarios.
I believe it took J.P. Morgan months to unwind the London Whale positions and really know what losses were incurred.
I think there's a better chance of catching a glitch at the point where the positions are opened.
"Moreover, because the 33 Account held positions from multiple sources, Knight personnel could not quickly determine the nature or source of the positions accumulating in the 33 Account on the morning of August 1. Knight’s primary risk monitoring tool, known as “PMON,” is a post-execution position monitoring system. At the opening of the market, senior Knight personnel observed a large volume of positions accruing in the 33 Account. However, Knight did not link this tool to its entry of orders so that the entry of orders in the market would automatically stop when Knight exceeded pre-set capital thresholds or its gross position limits. PMON relied entirely on human monitoring and did not generate automated alerts regarding the firm’s financial exposure. PMON also did not display the limits for the accounts or trading groups; the person viewing PMON had to know the applicable limits to recognize that a limit had been exceeded. PMON experienced delays during high volume events, such as the one experienced on August 1, resulting in reports that were inaccurate."
"Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called."
From then on, they purely and simply deserved everything that happened to them.
Reusing a flag that did something different in a currently-deployed version, without having a "transition" version that ignores that flag? Dodgy, but makes sense if you're in a rush.
Needing to manually deploy code to 8 different servers? Just stupid.
Computers are very powerful when placed in certain configurations. The more powerful the system you're dealing with the more cautious you should be. If they were dealing with an app then, sure, I'd have a lot more pity for them not taking precautions - such precautions would not be reasonable to expect of them. But if you're not being excessively paranoid about such a powerful system as was deployed here, then you're doing it wrong.
I do feel some pity for them based on the fact that there's not a tradition of caution in programming. And I do agree that there were multiple points of failure in there. But testing all the code that's going to be on a system like this is a base level of caution that should be used - whether or not you intend to use that code. If you think it's too much bother to test, then it shouldn't be there - but if it's gonna be there then for god's sake test.
The real problem here is that they were using incremental deployment and did not have a good process for ensuring the same changes were successfully made to all servers.
Code is either working correctly, and verified to be such by automatic testing, or it's not there. You don't leave unused code lying around to be removed later!
This is the biggest argument in my opinion against incremental deployment: it is hard to know exactly what is on any given box. Each time you push an incremental piece to a server you have effectively created a completely unique and custom version of the software on that server. Much better to package the entire solution and be able to say with certainty, "Server A has build number 123."
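Concretely, that certainty is cheap to get: fingerprint the artifact and refuse to call the deploy done until every host reports the same fingerprint. A minimal Python sketch (paths and hostnames are made up):

```python
import hashlib
import pathlib

def build_fingerprint(artifact_path: str) -> str:
    """Hash the deployed artifact so 'same build' is verified, not assumed."""
    return hashlib.sha256(pathlib.Path(artifact_path).read_bytes()).hexdigest()

def verify_fleet(fingerprints: dict) -> None:
    """fingerprints maps hostname -> fingerprint gathered from each server."""
    if len(set(fingerprints.values())) != 1:
        raise RuntimeError("fleet is inconsistent: %r" % fingerprints)

# e.g. verify_fleet({"server1": fp1, "server2": fp2, ..., "server8": fp8})
# before declaring "every server is running build 123".
```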
That is just painful to read. How many times do we hear a company couldn't figure out how to migrate code properly? Do any software engineering programs teach proper code migration?
Next time a manager questions money spent on integration or system testing, hand them a printout of this SEC document and explain how much the problem can cost.
Cool - all you have to do to get away with financial crimes is create a system with no protections against breaking the law.
 http://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse#... - no criminal penalties, but civil penalties and lost their license to practice.
If every software project was run like an avionics project, software would be more reliable and of higher quality. But the world would be worse off; most of the software people use would never come into existence.
Sometimes the cost of failure appears low, but is actually massive because the failure mode is not understood. For example a spreadsheet that miscalculates and causes a bad investment decision and a corporate failure.
Software is often chosen because it trades off against weight (in physical systems) or people (more commonly).
Software is fundamentally different because it is not commonly toleranced or the tolerancing of software is not understood. Reliability in physical engineering is understood in terms of the limits to which a component can be pushed. This concept seems not to be applicable to software.
It would be nice to see strict regulation on systems where lives would be endangered should the software fail, but this also raises the issue of how you regulate.
In Structural engineering you can say don't use material X if the forces acting on it exceed Y newtons. The same regulation in software doesn't make sense, you can't say "only use Haskell" or "don't use library Z" because the interactions between the tools we use are much more complicated than many "real world" engineering tasks.
We then run into the fact that a lot of software engineers have no real power in their companies, they do what management says or they get fired, I'd guess that when any other kind of engineer says "this won't work" managers listen. In my opinion a better solution to holding software engineers responsible would be holding the company and managers to account, at least at this point in time.
And this brings us back to licensure, if we had a PE category for this sort of software engineering, where people really staked their livelihood on what they signed off on, these sorts of processes might be taken seriously. So when you're told, after giving a 2 year estimate, that you have 6 months, you can honestly reply: I cannot do that. And have a body to point to to back you up in your decision when you get fired and they hire on a less reputable "engineer".
*Backing away is when a market maker makes a firm offer to buy or sell shares, receives an order to execute that transaction (which they are ethically and legally obligated to do) and instead cancels the trade so they can trade those shares at a more favorable price (capturing enormous unethical profits in fast-moving markets while regulators did virtually nothing to enforce the rules in a meaningful way)
Learn more: http://bit.ly/1ddUzWP
I really feel bad for people who think like that. A process where tests and deployments are automated and repeatable is vital to quick, robust deployment. Quick deployment without tests just isn't going to work well.
However, it happened fairly regularly that smaller trading errors would cause a couple (or tens of) thousand dollar win or loss for the client. If it was a loss, the client universally complains and gets comped by the company. If it was a win and it is found, the client keeps mum and the company does not raise the issue.
I think you'd be surprised at what happens in large companies. I went through four, count 'em, four major releases with a company, and each time the failure was in load balancing and not testing the capacity of the servers we had prior to release.
Even after the second release was an unmitigated disaster, the CTO said we needed more time to do load testing and making sure the servers were configured to handle traffic spikes to the sites we were working on. It happened again, TWICE after he said we needed to do this.
You would think something as basic as load testing would be at the top of the list of "to do's" for a major release, but it wasn't. It wasn't even close.
> Sadly, the primary cause was found to be a piece of software which had been retained from the previous launchers systems and which was not required during the flight of Ariane 5.
DevOps isn't a role (to begin with), and a lot of the practices documented in the text are the opposite of good DevOps practices.
However, like agile before it, despite the fact that it really means something purposeful and rigorous, the word "devops" has become widely abused to camouflage undisciplined, thoughtless, cowboy behaviour.
A handy way of telling the difference is to ask yourself "what would Devops Borat do?"; if it's something Devops Borat would do, it's the false devops.
Then you discover it does a mediocre job of each of those tasks as compared to a dedicated printer, scanner and fax machine. Sure, they'll take up more desk space, but you'll get higher quality results.
Also, regardless of whether the deploy went well or badly, discuss the aftermath with co-workers. I'd almost guarantee that some of these problems had cropped up before, and being able to ask "how can this never happen again?" is important, because otherwise those problems will be forgotten and stumbled over again in a future, perhaps more critical, incident.
Seems like as a rule, they're likely to cause instability, and I have a hard time seeing any benefits in them.
Benefits? Knight gave a bunch of other market participants a better price than they could get anywhere else, and no-one traded at a price they didn't agree to.
Software is the same.
Knight had code that hadn't been run in 8 years. Sure, the code worked 8 years ago, but things have changed around it since then. As the problem code never ran, no-one noticed it getting broken, or had any reason to fix it if it broke in testing.
Most likely the code worked fine 8 years ago, broke in the intervening 8 years, and hence was broken when activated.
I would have expected it to just segfault or error out in some way.
It's like if you cause an accident while you're driving by breaking the law; you get a traffic citation (and the accompanying fine), even if your car is totaled as a result of the accident, because you did something illegal.
-- Relied on financial risk controls that were not capable of preventing the entry of orders that exceeded pre-set capital thresholds for the firm in the aggregate.
(The charge also states that Knight violated rules on covering shorts, but I guess this is not so important).
The SEC is quite right to fine firms that have lost money with poor risk controls: the point is that bad risk management can hurt the whole sector. It is like fining a factory owner who lets their plant break pollution regulations.
Deploying in such a way that all your servers are not running the same codebase is obviously bad.
Deploying to production with no plan for how to roll it back if something goes wrong is obviously bad.
Not having anyone monitor things closely enough, including the hundreds of warning emails they got before the market opened, is obviously bad.
There is no hindsight necessary here. You could look at what they were doing and predict a catastrophe.
I think the main problem here is nobody at this company pushed back on this stupid development plan of reusing a flag for a different purpose. There's no excuse for that (or maybe there is, they had run out of fields in some fixed-width message format or something dumb like that). Also apparently the use of the flag was not tied hermetically to the binary in production; when they rolled back the binary the flag was still there but it meant something different to the old software.
The correct way to roll this type of change out is for the new input (the "flag" in this case) to be totally inert for the old version of the software, and for the new version to have a config file or command line argument that disables it. So _first_ you start sending this new feature in the input, which is meaningless and ignored by the existing software, and then you roll out the new software to maybe 1% of your fleet and see if it works. Then roll it out to maybe 10% and leave it that way for a week. Insist that your developers have created a way to cross-check the correctness of the feature in the 10% test fleet (structured logging etc.). If it looks good, roll it to 100%. You now have three ways to disable it: turn it off in the input stream, turn it off in the new software with the config or argument, or roll back the software.
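A tiny Python sketch of that double gate, with all names invented: the new message field is inert to old binaries, and even new binaries ignore it unless local config opts in, so either side can be switched off independently of a binary rollback:

```python
def handle_order(msg: dict, config: dict) -> None:
    # Old binaries never read "use_new_router", so sending it is inert there.
    # New binaries additionally require the local config to opt in.
    if config.get("enable_new_router") and msg.get("use_new_router"):
        route_with_new_code(msg)   # new path, only on opted-in hosts
    else:
        route_with_old_code(msg)   # default path everywhere else

def route_with_new_code(msg: dict) -> None: ...   # hypothetical new feature
def route_with_old_code(msg: dict) -> None: ...   # existing behaviour
```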
Doesn't look like these guys really knew what they were doing.
The trading floor is only open a few hours every day, and the functionality being rolled out required the markets to be open. Furthermore, since the changes were all for new functionality, they rolled it out in stages days ahead of time (good move).
Roll out the code in advance, and have the production machines switch to it at a defined, synchronized time?
I mean, imagine you only have one production machine. If you're willing to admit that you can have it switch from version X to version Y with no downtime, then synchronization is the only barrier to doing the same on n machines. Why would you need scheduled downtime?
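One common way to get that synchronized, zero-downtime flip: stage the release on disk ahead of time, then have every host atomically repoint a `current` symlink at the agreed moment. A Python sketch (the paths and timestamp are placeholders):

```python
import os
import time

CUTOVER_AT = 1_700_000_000   # agreed epoch timestamp, identical on every host

def flip_at(cutover_ts: float, new_release_dir: str) -> None:
    time.sleep(max(0.0, cutover_ts - time.time()))  # wait for the agreed moment
    os.symlink(new_release_dir, "current.tmp")      # stage the new link aside
    os.replace("current.tmp", "current")            # atomic swap: no half state

# e.g. flip_at(CUTOVER_AT, "releases/v123") running on all n machines
```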
But there's at least 30 minutes of downtime per week per market (usually per day), and the vast majority of downtimes coincide during the weekend - so this is all moot discussion and needlessly complex solution. If you can afford the downtime, switch midnight GMT between Saturday and Sunday, when all markets are closed.
If your plan is for the software to be on all servers, it needs to be on all servers.
What sometimes happens when you want to decommission features is that you just turn the flags off rather than remove the code. There's an obvious allure to this: you already tested the on/off functionality of the switch when you did the original roll-out, so you can avoid having to test whether you removed the code correctly. It sounds like in this case they removed the code and repurposed the switch that disabled said code (it may really be a shared memory system and they were running out of flags), but they fucked it up. The old code was still there on some servers, and the switch was turned back on with the intent of enabling the new feature it had been re-purposed for, re-enabling this old code.
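A toy Python illustration of why that's so dangerous: the same wire flag means two different things depending on which build happens to be running (all names here are invented, not Knight's actual code):

```python
def old_build_handle(order: dict) -> None:
    if order.get("flag_e"):            # to the old build, this bit still means
        power_peg_test_code(order)     # "run the retired Power-Peg-style path"

def new_build_handle(order: dict) -> None:
    if order.get("flag_e"):            # to the new build, the same bit means
        new_retail_router(order)       # "use the new routing feature"

def power_peg_test_code(order: dict) -> None: ...   # dead code, never retested
def new_retail_router(order: dict) -> None: ...     # the intended new feature
```

Flip the flag on with even one server still running the old build, and you have silently resurrected code nobody has exercised in years.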