The week after this we had a trader in our office who had a meeting at Knight on the morning it happened.
He said he saw the whole dev team just power off and go home at 11am, followed quickly by the rest of the employees. At that point, there was nothing they could do.
The craziest thing is that it went on for so long. No one caught it until their own traders saw it come across Bloomberg and CNBC. They actually thought it was a rival HFT and tried to play against it.
The only people that came out of this ahead were aggressive algos on the other side and a few smart individual traders. A lot of retail guys had stop losses blown through that normally would never have been hit. After trading was halted they set the cap at 20% loss for rolling back trades. So if you lost 19% of your position in that short period of craziness, tough luck.
I don't want to call your friend a liar, but this is most likely false.
> The week after this we had a trader in our office who had a meeting at Knight on the morning it happened.
> He said he saw the whole dev team just power off and go home at 11am, followed quickly by the rest of the employees.
1) Dev and Trading/Sales happen at different physical locations.
2) I actually know someone who spent their day cleaning this up and according to someone who was on the tech team and working that day no one went home early.
Think about it: the firm just lost a shitload of money due to an IT issue. Devs were frantically searching the code for the bug, sysadmins were rifling through server logs. No one had nothing to do :)
> After trading was halted they set the cap at 20% loss for rolling back trades. So if you lost 19% of your position in that short period of craziness, tough luck.
This is just plain false. The normal procedures for busting trades were followed. There was no 20% "cap" for losses; how would you even determine what a 20% loss is?
I really have no reason to doubt this guy as he's a pretty prolific trader. That said...traders are rather known for hyperbole.
Per the 20%: I forget exactly what it was, but I know trades were rolled back and there was some kind of threshold for them. When I first wrote this, a 20% loss was what I remembered. There's an article on it somewhere with the actual amounts. I think it had to do with how far stop limits were from the stock's opening price.
I don't know about Knight but in the big London trading desks developers and traders are in the same building for just this reason. As you can imagine traders can be quite demanding when things go wrong. Also plugs will get pulled if things are going horribly wrong.
Just another reminder of how systems that you'd think are rock solid often aren't.
In my previous life working with telcos, I once tried to teach a particularly huge customer how to use CVS to manage configurations across a cluster of 10+ machines. They didn't see any value in it, so they stuck to their good old process of SSHing into each machine individually, "cp config.xml config.xml.20131022", and then editing the configs by hand. It didn't take long until a typo in a chmod command took down the whole thing (= the node couldn't take down a network interface anymore, so failover stopped working), and they spent several weeks flying in people from all over the planet to debug it... and they still didn't learn their lesson!
I heard similar stories from a friend working for a big telco.
The other day he described a strange bug that was triggering in the field. It was strange because it got triggered after exactly 85 days of deployment.
It turned out to be a debug script that was pinging a development server and timing out (because the development server was not accessible from the field). The series of retrials and timeouts totaled to exactly 85 days, after which the rest of the script would activate and uninstall crucial dependencies!
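The arithmetic is plausible even with long per-attempt timeouts. The comment doesn't give the real retry count or timeout, so the numbers below are invented purely to show how a retry loop can sum to exactly 85 days:

```python
# Hypothetical reconstruction: the retry count and timeout are assumptions,
# chosen only to illustrate how blocked retries can add up to 85 days.
TIMEOUT_SECONDS = 30 * 60   # each ping attempt blocks for 30 minutes
ATTEMPTS = 4080             # retries before the script gives up

total_seconds = ATTEMPTS * TIMEOUT_SECONDS
print(total_seconds / 86400)  # 85.0 days
```

The lesson generalizes: any "harmless" script that retries on a fixed schedule is a time bomb with a computable fuse.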
Legacy telcos are the epitome of large institutions where many of the best talents leave. After just a few years of working with them, I couldn't believe they could get a dial tone. How did they produce so much great R&D?
By having negotiated profit margins. Bell Labs was meant to soak up excess money. What we are now nostalgic for would never have happened without an interval during which a telco was considered a "natural monopoly."
Reminds me of a colleague who RDPed into each of our 140 subsidiaries to change a config file.
He had a list of servers on his desk and ticked off every server.
Took him the whole day to apply the changes.
(Also: lots of folks really don't enjoy learning new stuff. Or new ways of working. No, they really don't. Even if the new techniques are vastly better and more efficient. Put this down to a human cognitive bias favouring the tried-and-trusted over new-and-untested. There's a lot to be said for that when you're a neolithic hunter-gatherer or an iron-age peasant -- if you try something new and it fails, you maybe get to watch your family starve next spring -- but it's a bit less useful as a rule of thumb in the data centre.)
When the service guy at a former job quit, the databases on customers' production servers started getting corrupted. It turned out he had been manually logging into the database at every single customer regularly and fixing the problem on site. I don't know if he had told the responsible developers (I worked on a different product), but by then it was apparent that the system was not stable.
Considering that I've watched a new automated deployment process nuke an entire network due to a mistake in the config... Sometimes slow and plodding and potentially losing a node (as opposed to 100 nodes) is seen as preferable to the PTBs.
Of course intelligent, incremental deployment works too, but let's not confuse those poor suits...
We often get frustrated with strict change management processes, excruciating verification etc .. but there is a reason why the most mature (not always the biggest..) operators have these processes in place.
High Frequency Trading seems so abstract. There's no value created, it seems. It's like something in between imperfect systems, scraping off the margin created by that imperfection. It's fascinating, and interesting from an algorithmic point of view (like a computer game), but at the same time I don't feel sympathy for this company going out of business.
I really hate to go down this road because it's been rehashed thousands of times on Hacker News, but high frequency traders add value to the market by adding liquidity (and therefore reducing spreads --> cost to you for executing) and price discovery.
This liquidity argument has been rehashed a thousand times, but did you know that most of the orders made by HFTs end up getting cancelled?
As noted above, regulators found that high frequency traders exacerbated price declines. Regulators determined that high frequency traders sold aggressively to eliminate their positions and withdrew from the markets in the face of uncertainty.
Sort of. If you are buying or selling a lot of shares it's quite a bit more expensive. That probably doesn't matter to you or me if we are buying AAPL, because it's a small number of shares, but it does hurt any index/mutual funds you're invested in.
Norway's sovereign wealth fund (one of the largest in the world, they own 1% of all US stocks) just came out on this exact topic:
> Next, the paper takes on HFT's usual defence – that they provide the market with much-increased liquidity. Norges Bank worries that this liquidity is "transient" - i.e. HFTs often place large orders only to then cancel them, creating "phantom" liquidity and leaving "buyside traders fac[ing] new challenges in assessing posted liquidity."
So what happens is Norway's sovereign wealth fund wants to buy a kagillion shares of MSFT (or whatever). In the good old days they could probably complete this transaction before the price went up too much. Now, thanks to HFTs, this additional demand is noticed faster, the prices rises faster, and it costs Norway more money. So, bummer for them.
But great for you! Because maybe you're the guy selling MSFT shares to them. You get the benefit of the price rising faster.
The market is more efficient. Norway can no longer take advantage of the fact that it knows that there's all this additional demand (originating from itself) and it takes a while for everyone else to figure that out.
No, it's more like they want to buy a kagillion shares and before they can complete the trade a HFT buys them and sells to Norway for slightly more. It's not good for me, I sold to the HFT. It's just intraday noise that only does the HFT any good.
If that was actually happening it would mean that spreads (the difference between where market makers buy and sell stocks) were large. In fact, the exact opposite is true. Spreads are tiny. Much smaller than they used to be!
This is because there's not just one HFT. There are tons. So if one of them tries to do this you won't sell to them, you'll sell to one of the other ones for a better price. Yay competition!
1. Hrm, ok. So apparently it's a 770B fund. Let's say they want to move .01% of their assets into MSFT. That would be $77M or about 2.2M shares. That would be about 5% of the average daily volume of MSFT.
2. It's not a binary thing but a gradual transition from mostly human market makers in the 80s to mostly algorithmic market makers today.
3. I'm on shakier ground on this one, but a fraction of a % of their cost.
I really hate to go down this road because it's been rehashed thousands of times on Hacker News...
And I've yet to see a convincing argument that HFT is of any benefit to anyone else other than themselves.
At the most basic level, HFT firms make money, right? Otherwise they wouldn't be doing this. Where does this money come from? Entities that are holding these same stocks for longer periods of time. It increases the costs to the longer-term buyers, and it decreases profits to the long-term sellers.
HFT is sand in the gears of the economy, not lubricating oil like they would have you believe.
I hear this argument made all the time, and maybe it is true at the micro level. On a macro level, price discovery appears to be one algorithm's "opinion" matched against hundreds of other algorithms' opinions, and they all tend to go in the same direction.
HFT does not add liquidity, but it does squeeze spreads in liquid contracts. That's somewhat good for retail, but terrible for institutional investors, which hurts retail on the back end. HFT is just a front-running operation in those cases.
I would support the idea of putting a 0.0001% transaction fee on every trade. It should be enough to drive the whole HFT industry out of business while not having any impact on the markets themselves.
There already are trading fees. Each exchange charges a small fee for every trade. It's how they make money and is why they have an incentive to increase volume. For example, on the CME futures exchange the fee per energy contract is usually 50 cents. Of course, those are only for trades that are actually filled. For quoting there are no transaction costs, but there are often rules to limit quoting (transactions/fills under a certain threshold, a total number of transactions per day, etc.). If these rules are broken, they often come with a fine.
This isn't true. "Each exchange charges a small fee for every trade": not all exchanges charge fees for posting liquidity. Smaller ECNs will actually pay traders to provide liquidity. If you hit or lift someone's bid or offer you will get charged a fee, but if you post an order that does not cross against the current market you can get paid for it. These rebates are very important for HFT desks.
A lot of HFT's high-frequency-ness comes from the sheer number of orders being sent to market. Only a very small fraction of these orders result in transactions. A transaction fee would not affect most of the order flow, which never gets executed.
There should be a small tax per order put out, even if cancelled, plus a small percentage-based tax on each trade. If that behavior has been determined to exacerbate price declines, that externality should be captured in a tax and de-incentivized properly.
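To see why a per-order levy bites where a fill-only fee doesn't, here is a rough sketch. Every number in it (order volume, fill rate, fee sizes) is an illustrative assumption, not data about any real desk:

```python
# All figures below are made up for illustration.
ORDERS_PER_DAY = 1_000_000
FILL_RATE = 0.02                 # 98% of orders cancelled unfilled
AVG_FILL_NOTIONAL = 10_000.0     # dollars of stock per filled order

PER_FILL_FEE = 0.0001 / 100      # the 0.0001% fee proposed upthread
PER_ORDER_TAX = 0.001            # a $0.001 tax on every order, filled or not

fills = ORDERS_PER_DAY * FILL_RATE
fill_fee_cost = fills * AVG_FILL_NOTIONAL * PER_FILL_FEE
order_tax_cost = ORDERS_PER_DAY * PER_ORDER_TAX

print(round(fill_fee_cost, 2))   # 200.0  -- barely touches the quoting
print(round(order_tax_cost, 2))  # 1000.0 -- 5x more, and it scales with order flow
```

With a 98% cancel rate, the per-order tax is what actually prices the quoting behavior; the fill-only fee mostly misses it.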
Ripple works this way, a fraction of an XRP is charged as a fee for each transaction (posting and cancelling offers). As a decentralized bid/ask ledger protocol, the fee is needed to prevent tx spam and DoS attacks.
Looking at systems by considering whether they 'create value' in some generalized utilitarian sense is unproductive. Such systems survive by being able to extract energy somehow, in this case by exploiting properties of the stock trading system. I guess you could say that they create a lot of value—for the people doing it. Very few modern economic activities make sense in a broader perspective, they exist purely because they allow some energy to flow towards the people perpetuating them, on a more local level.
I disagree completely. You seem to be confusing money/worth with value. If someone manages to redirect money towards themselves without creating a new economic resource, then no new value is made.
Avoiding systems that don't create value would help us live in a world with more value in it. That's why considering this is very productive, and helps moral people avoid wasting precious resources on zero sum games.
What I'm trying to say is that systems don't exist because they are 'good', they exist because they survive. When trying to interpret the world, 'value' is a much less powerful explainer than survival.
Also, I do think that knowing whether something actually creates value or not (is a positive-sum, negative-sum or zero-sum game) is definitely helpful in understanding large systems. i.e: It helps predicting whether a society will succeed or fail, by seeing how much of it is wasted on negative/zero sum games.
Societies will always succeed for a while and then fail. It is the cycle of life and energy in the economy. There is "value" created by HFTs that is used by HFTs and their families. It helps them survive and reproduce. This does not mean value is created for society overall, but it certainly makes society more complex. The more complex an ecosystem/economy is, the more likely it will be able to adapt to future environmental changes.
You don't even have to create value for your species, or lasting value for your genes.
The classic example is -- if I recall correctly -- a mutation sometimes found in mice. The mutation causes a male to only produce male children. This mutation will spread through the population until there are only male mice, and one generation later the mice are all dead.
How is that related to this story, however? This story is not about HFT -- unless I'm misreading or misunderstanding -- but instead about bog-standard retail order fulfillment. The fulfillment process had a bug that kept filling the same orders over and over, leading to this issue.
I think High Frequency Trading will eventually be neutralized through competition. I noticed their profits aren't as staggering as they were a few years ago. I would like to see a law that limits how close a company (or individual) can set up shop next to an exchange. I think insider trading is more of a problem than HFT.
I think it's simpler than this: just delay every trade by a random amount between zero and ten seconds. For people who actually want to buy or sell instruments because of their value to them, that won't be a problem. For people who just want to play Street Fighter II with the market, it's game over.
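A minimal sketch of that proposal, assuming the exchange buffers each incoming order and releases it only after a uniform random delay (the function names and data shapes here are invented):

```python
import heapq
import random

def submit(pending, order, now, max_delay=10.0):
    """Hold the order for a random 0..max_delay seconds before matching."""
    release_at = now + random.uniform(0.0, max_delay)
    heapq.heappush(pending, (release_at, order))

def release_due(pending, now):
    """Return every order whose random delay has elapsed, earliest first."""
    out = []
    while pending and pending[0][0] <= now:
        out.append(heapq.heappop(pending)[1])
    return out
```

The point of the randomness: being 100 microseconds faster than a rival no longer determines who trades first, so the latency arms race stops paying.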
Why have a limit on how close people can be to the exchange? For one thing, everyone would just colo at exactly the minimum distance, achieving nothing. For another, ability to colo is not really an issue. If you wanted to create an HFT startup, colo is not going to be a major cost compared to hiring developers.
A limit on distance is in effect a speed limit. It means anyone that is able to reach the limit has a shot at competing. Whether that is good or not, or makes a difference or not is another discussion, but you can certainly affect the competitive situation massively that way.
Competition is already high, and colocating is not that expensive for anyone with the ability to compete.
A lot of the loudest criticism of HFT is that it's too competitive - a lot of people who used to make a comfortable living from the bid/ask spread are no longer able to due to computers driving down profit margin.
But they're not really competing on the absolute scale of how fast they can execute a trade, but rather on how much faster than everyone else they can execute. If everyone else is slowed by an equal amount, the game doesn't change at all.
This is ridiculous. It's possible to make money in the stock market without inside information, but it requires a lot of knowledge. Markets are very fluid and complicated, and behaviour patterns are constantly changing.
Unless you're prepared to devote a lot of time to learning, it's best to stick to index funds or lower-risk investments (GICs).
Just one of the risks of automation, and a good reminder why human monitoring is necessary.
Having said that, we deployed a system that was mostly automated, with a human operator overseeing investments who was supposed to shut it down if any out-of-the-ordinary transactions (based on experience) were taking place. She happily sat there approving the recommendations even though they were absolutely outside of anything we'd ever generated in the past, and it bled accounts dry in one evening. So sometimes, even with a human observing, you're still boned.
You should read the linked PDF - they had systems that were 100% dependent on human monitoring, that no one was checking, or where no one recognized anything unusual. If anything, their failures were due to massive lack of automation in deployment, testing, and monitoring.
Oh, the other amazing thing we used to have: a system that generated so many alerts that everyone just created Outlook rules to automatically delete all of them. I mean, really? I wondered whether we should modify the system to only generate messages for actual exceptional cases. People looked at me like I was an idiot.
Yeah, the person monitoring ours just rubber-stamped it. Afterwards everyone remarked that you could tell the recommendations were bad just by looking at what was in front of you. Our theory was that she was too busy watching the Breaking Bad finale or something.
You know how, if the first 100 times a dialog box comes up the correct response is to click 'OK', people start just clicking 'OK' on every dialog box? Then the 201st one comes up saying "Destroy everything? [Cancel] [OK]", and they click 'OK' too and don't think anything of it.
It's like how they keep the TSA x-ray readers at the airport from becoming too complacent: every once in a while you have to slip a gun through the scanner and see if they catch it. Otherwise you're just looking for (presumably) rare events and you become numb to the never-ending stream.
Sure, you click reflexively, but you should notice the text was bad either after clicking or within a couple more clicks. Letting the first few errors through is reasonable, but letting a wall of them through without noticing anything wrong, when reading them is your job, is inexcusable.
I don't know if it's excusable or not, but it may be incompatible with typical human cognition to expect someone to be able to do that.
Maybe you have to figure out a way to test people for unusually high aptitude at looking at mind-numbingly dull repetitive things over and over again, but then still being able to notice the aberrant ones. And then only put people in that job with unusually high aptitude there.
Or have people only do pretty short shifts at that task.
I'm pretty confident that this person wasn't unusually negligent. If you have most anyone doing that job hour after hour, day after day, they will lose the ability to flag the aberrant stuff.
Yours is the correct viewpoint: it is incompatible with human cognition.
If an alert system is not perceived as highly reliable in directing positive action, then the humans involved will inevitably disable the alert system, either by pulling out a screwdriver or rewriting their mental rubrics to ignore the messages as noise.
Knight Capital is just the finance version of Three Mile Island and Deepwater Horizon -- the means to mitigate or prevent disaster were on hand, but the people in charge just dithered by the kill switch because they were confused. Well, if the people in charge are confused, that is a reason to start the emergency procedures.
I would argue that having many more people monitoring can encourage the fear of making the wrong call: "Hey, someone smarter and more experienced than me should make sense of this." "Hey, the smart new guy is supposed to be watching this. I will look more closely later."
What you want is two or three people really in charge, where the individuals are empowered to say: "I am totally confused. If someone cannot explain what is going on to me so that I understand, I am starting shutdown procedures, immediately. Do YOU know exactly what is going on?"
I'm not seeing anything here that makes me think they had any kind of automation at all! In my experience, if they had automation things would have consistently failed on all servers the deployment was executed on.
That said, if you're going to fly the jet liner in full manual mode, you better make sure your co-pilot is watching the instruments.
The parent post may be pointing out that the point of this software is to automate trading on the stock market. It's risky if your software testing, rollout, monitoring and rollback process is not sufficiently automated. This second kind of automation is the kind that you or I are most likely more familiar with. And was lacking.
Don't humans also make similarly large-scale mistakes? JPMorgan's infamous London Whale comes to mind. Also, I could be wrong, but aren't most derivatives a zero-sum game: don't I have to lose money on my puts for you to make money on your calls? Didn't so many people lose money on securities because they misunderstood their exposure?
The Knight computer error was spectacular and catastrophic but us humans have a longer track record of making catastrophic financial decisions in the market.
Options are complicated. At their most basic level they are no different than a bet, so yes zero-sum. However when used in a spread or as a hedge or any other way to avoid risk or when sold against stock you own as an income generator, it's tough to call them zero-sum.
Puts and Calls are confusing as they are both something you buy. It's not like a sports bet where you're betting on the team to win so the other side will lose if they win. You can buy a put and a call in the same stock and profit on both if your strike prices are aligned properly.
The opposite side of buying a call is selling a call. Just like shorting a stock, when selling an option the most you can make is the premium you collected, while your loss potential is unlimited. As an option seller, you're hoping the option expires worthless and out of the money so that you can keep the premium. Most brokerages require a high level of clearance before they'll let you sell naked puts and calls, as it's generally a bad idea if you can't cover the potential losses.
For most products there is a legal framework that forbids buying an insurance contract unless you have an insurable interest. However, under the Commodity Futures Modernization Act of 2000, designed by Summers, Greenspan, Levitt, and Rainer, state insurance regulators are forbidden from regulating OTC derivatives as insurance products. They were already forbidden from regulating exchange traded derivatives (e.g. options and futures).
don't I have to lose money on my puts for you to make money on your calls ?
Actually, we would both lose money if the stock price doesn't move. But yes, generally in order for a call or put to go up, the other must go down.
It's not zero sum, though. If you had purchased a 950 call and an 850 put just before Google announced earnings (with the stock around 900), the call would now be worth more than you paid for both options combined. The counterparty who sold the options is the one who loses (and it is zero sum with them, per design).
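For concreteness, here is the payoff arithmetic for that trade at expiry. The 950/850 strikes and the ~900 spot come from the comment; the premiums are assumed, since the comment doesn't give them:

```python
# Strikes and spot follow the example above; premiums are made-up numbers.
def call_payoff(spot, strike):
    return max(spot - strike, 0.0)

def put_payoff(spot, strike):
    return max(strike - spot, 0.0)

call_premium, put_premium = 8.0, 7.0   # assumed prices paid per share
cost = call_premium + put_premium      # 15.0 total outlay

spot_after = 1010.0                    # a big post-earnings jump
value = call_payoff(spot_after, 950.0) + put_payoff(spot_after, 850.0)
print(value - cost)  # 45.0 -- the buyer's gain is exactly the seller's loss
```

The put expires worthless, but the call alone is worth four times the combined premiums, which is the sense in which the buyer wins on the pair while the trade stays zero-sum against the option seller.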
If you consider gains and losses just in terms of dollar value, then options are zero-sum. However, if you take non-linear utility functions into account, then a fairly priced transfer of risk from a more risk-averse party to a less risk-averse party is a win-win situation.
For instance, let's say a large institutional investor determines that a 5-yr 10-yr flattener on South Whereisitstan bonds is very attractively priced, but can't stomach the risk of the 10-yr yield going through the roof. They pay an investment bank to create some OTC options on the Whereisitstan 10-year notes and several medium-sized hedge funds take the other side. If the hedge funds are right, the big institution over-pays for the options in strict dollar terms, but the options allow the institution to enter into a very attractive bond trade they otherwise would have been unwilling to enter. In this case, everyone could win, even though the big institution takes a (both realized and statistical) loss on the options.
I'm shocked they didn't have a killswitch or automated stop-loss of some kind. A script that says "We just lost $5M in a few minutes; maybe there's a problem." Or, a guy paid minimum wage to watch the balance, with a button on his desk. $172,222 is a lot of minimum-wage years.
I work for a small automated trading firm (in foreign exchange), and marking positions to market is one of the difficulties in designing an effective kill switch, because these marks can easily make the difference between a large gain and a large loss. In fast-moving markets (which is when a kill switch is most useful), it's very hard to determine the true mid-market rate. Our system of course always has such a notion, but if we used it to shut down our system every time it looked like we had lost money, then a market data glitch (which is not at all uncommon) would impose a large opportunity cost as a human intervened during an active market.
Instead, we designed our system so that there's a very low threshold for it to stop trading if it appears to have lost money, but to only do so temporarily. If our marked-to-market position recovers shortly thereafter while the trading system is idle, then the apparent loss was probably due to a market data glitch. On the other hand, if our position does not recover, then the temporary stoppage becomes permanent, and a human intervenes. (Obviously, there are more details here, but this is the general idea, and it's worked very well for us.)
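That "pause first, confirm later" logic might look roughly like this. This is my sketch, not the poster's actual system: the threshold, the recovery window, and the class name are all invented, and a real system would track per-instrument marks:

```python
class LossGuard:
    """Temporarily halt trading on an apparent loss; escalate if it persists."""

    LOSS_THRESHOLD = -5_000.0  # marked-to-market P&L that triggers a pause (assumed)
    PAUSE_SECONDS = 30.0       # how long to idle before re-checking (assumed)

    def __init__(self):
        self.trading_enabled = True
        self._paused_at = None

    def on_mark(self, pnl, now):
        if self._paused_at is None:
            if pnl < self.LOSS_THRESHOLD:
                self.trading_enabled = False   # temporary stop, no human yet
                self._paused_at = now
        elif now - self._paused_at >= self.PAUSE_SECONDS:
            if pnl >= self.LOSS_THRESHOLD:
                # Position recovered while idle: likely a market data glitch.
                self.trading_enabled = True
                self._paused_at = None
            else:
                # Loss persisted through the idle window: stop for good.
                raise RuntimeError("loss persisted; human intervention required")
```

A phantom loss from a bad tick pauses trading briefly and self-clears; a real loss survives the pause and escalates to a human, which is the distinction the comment is describing.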
> a market data glitch (which is not at all uncommon)
This is a point that bears amplifying. People who do not work in the financial industry may not appreciate just how bad market data feeds are. Radical jumps with no basis in reality, prices dropping to zero, regular ticks going missing, services going offline altogether with no warning, etc.
This particular subsystem was a replacement for a previous version, which was a kill switch and had led to opportunity costs. We spent a lot of time designing, implementing, and testing it, but haven't felt the need to touch it since then. It's sufficiently general that it doesn't need to be adapted as our strategies change, and it doesn't need to adapt to changing market conditions (as, e.g., a trading strategy does).
They unintentionally built up a huge position well beyond their internal capital thresholds. You don't have to worry about mark to market to detect that. If you think your FX strategy could end up with a max notional of, let's say, £50MM, you can put something in place to stop trading if you exceed that. If you are a smaller shop, your prime broker is probably already doing something like that for you.
Those types of risk controls (i.e. capital thresholds) are bog standard in our industry and it still kind of blows my mind that Knight managed to get around them so easily.
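A pre-trade version of that control is almost trivial, which is what makes its absence so surprising. A sketch, with invented names and numbers (real systems also net by instrument, track fills, etc.):

```python
class NotionalLimiter:
    """Reject any order that would push gross notional past a hard limit."""

    def __init__(self, max_gross_notional):
        self.max_gross = max_gross_notional
        self.gross = 0.0

    def try_order(self, qty, price):
        notional = abs(qty) * price
        if self.gross + notional > self.max_gross:
            return False  # reject before the order ever reaches the market
        self.gross += notional
        return True

limiter = NotionalLimiter(50_000_000)     # the £50MM example above
print(limiter.try_order(100_000, 400.0))  # True: £40MM, within the limit
print(limiter.try_order(50_000, 400.0))   # False: would total £60MM
```

The key property is that the check sits in the order path itself, unlike Knight's PMON, which only observed positions after execution.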
How many millions in orders do they normally process per minute?
Since there were no procedures in place, would you like to be the guy who pulled the plug on the (let's guess) $100 million/minute processing system? Do you think you could get another job after that? What would the costs be for violating contracts? You could single handedly sink the company (which, in the end, this issue basically did).
I don't blame the guy for not killing all operations. He never should have been put in that situation. Proper QA, regression testing, monitoring, running a shadow copy and verifying its output: there are tons of things that could have prevented or mitigated this.
In this case the system was deploying new functionality, so your shadow couldn't just run last week's version or you'd get false alarms all the time. And obviously deploying the same broken version to the primary and the shadow wouldn't detect anything.
So you'd need two codebases and two developer teams, coordinated enough that their code produced exactly the same output yet independent enough that they didn't make the same mistakes. With the challenges of coordination this would more than double your costs.
Of course, with the benefit of hindsight, the costs might have been worthwhile...
Stop loss orders are not a panacea here. You could lock in a loss for no other reason than the market becoming very volatile due to some news item or some other reason like the flash crash. And you can't even guarantee to limit your losses at the stop loss price. It's just the price that triggers an attempt to unwind the position.
Also, the market may not go against you immediately. What if the glitch in the system means you're opening positions in stocks and you drive up the price by doing so? The losses are not immediately apparent. There's no screen where you could watch your losses run up in real time. The losses only become apparent once you try to unwind those positions and that's the case in many kinds of scenarios.
I believe it took J.P. Morgan months to unwind the London Whale positions and really know what losses were incurred.
I think there's a better chance of catching a glitch at the point where the positions are opened.
It looks like they saw positions accumulating in one of their accounts, but couldn't identify the source. And maybe they saw the positions too late, because of lag. Here's a description from that SEC report of what went wrong with their monitoring system:
"Moreover, because the 33 Account held positions from multiple sources, Knight personnel could not quickly determine the nature or source of the positions accumulating in the 33 Account on the morning of August 1. Knight’s primary risk monitoring tool, known as “PMON,” is a post-execution position monitoring system. At the opening of the market, senior Knight personnel observed a large volume of positions accruing in the 33 Account. However, Knight did not link this tool to its entry of orders so that the entry of orders in the market would automatically stop when Knight exceeded pre-set capital thresholds or its gross position limits. PMON relied entirely on human monitoring and did not generate automated alerts regarding the firm’s financial exposure. PMON also did not display the limits for the accounts or trading groups; the person viewing PMON had to know the applicable limits to recognize that a limit had been exceeded. PMON experienced delays during high volume events, such as the one experienced on August 1, resulting in reports that were inaccurate."
Because not testing it can have unintended side effects at some low level of probability, and we all know this - mostly we just don't think it's going to happen to us. However, as the significance of the risk associated with that low level of probability goes up the demands for securing against that risk go up.
Computers are very powerful when placed in certain configurations. The more powerful the system you're dealing with the more cautious you should be. If they were dealing with an app then, sure, I'd have a lot more pity for them not taking precautions - such precautions would not be reasonable to expect of them. But if you're not being excessively paranoid about such a powerful system as was deployed here, then you're doing it wrong.
I do feel some pity for them based on the fact that there's not a tradition of caution in programming. And I do agree that there were multiple points of failure in there. But testing all the code that's going to be on a system like this is a base level of caution that should be used - whether or not you intend to use that code. If you think it's too much bother to test, then it shouldn't be there - but if it's gonna be there then for god's sake test.
Right, and they were deleted - the code was not there after the refactoring. Except that, because of an incremental deployment, it was never actually removed from one of the servers. It wasn't really a case of untested software being run, but of one server running an older version of the software due to a deployment failure. The newer version repurposed the Power Peg flag, but since that one server was still running the older code, it behaved differently: it carried out the old behavior, which was not suitable for the current environment.
This is the biggest argument in my opinion against incremental deployment: it is hard to know exactly what is on any given box. Each time you push an incremental piece to a server you have effectively created a completely unique and custom version of the software on that server. Much better to package the entire solution and be able to say with certainty, "Server A has build number 123."
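The whole-package approach can be enforced mechanically. As a minimal sketch - assuming each server can report a build identifier (the reporting mechanism here is hypothetical) - a pre-trading check refuses to proceed unless the entire fleet agrees on one build:

```python
# Sketch of a pre-trading fleet consistency check. The build-reporting
# mechanism is assumed/hypothetical; the point is that divergence is
# detected automatically instead of relying on a technician's memory.
def check_fleet_consistency(builds):
    """builds: dict of server name -> reported build id.
    Returns the single common build id, or raises if the fleet diverged."""
    unique = set(builds.values())
    if len(unique) != 1:
        raise RuntimeError(f"fleet inconsistent, builds seen: {sorted(unique)}")
    return unique.pop()

# Eight SMARS-like servers, one missed during deployment.
fleet = {f"smars-{i}": "build-123" for i in range(1, 9)}
fleet["smars-8"] = "build-95"  # the forgotten eighth server
try:
    check_fleet_consistency(fleet)
except RuntimeError as err:
    print("DEPLOY BLOCKED:", err)  # caught before the market opens
```

A check like this would have flagged the eighth SMARS server the moment the deploy finished, rather than after 45 minutes of live trading.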
"During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review."
That is just painful to read. How many times do we hear that a company couldn't figure out how to migrate code properly? Do any software engineering programs teach proper code migration?
Next time a manager questions money spent on integration or system testing, hand them a printout of this SEC document and explain how much the problem can cost.
Look up the 1999 WorldCom outage due to a screwed-up load of Lucent's Jade platform upgrade. Fun times. Best I can say about that is that within a year WCOM execs had bigger problems than just pissing off CBOT....
They also amassed over $3 billion in net short positions spread across 75 stocks during those 45 minutes, causing significant losses to investors whose stop-loss orders were triggered when they otherwise would not have been. They didn't just harm themselves...
Anyone using stop loss orders harmed themselves. It's like saying that you wouldn't have gotten hurt rolling down the steepest hill in town on a skateboard with no helmet if someone hadn't parked in a red zone.
Be very careful with what you are suggesting. If we are going to call them criminals because of a bug in their code and system, then don't be upset the day you are called a criminal for a bug in your code, for not having unit tests, or for deploying the wrong thing without the right process in place.
Having worked in the aerospace side of software dev, I'd actually be cool with this. Outside of software, real engineers have real liability when their systems fail. PEs are held accountable in most countries that license engineers when they sign off on things that should not have been signed off on. Too many people on the software side, even in safety-critical systems, play fast and loose with the "engineering" aspect. We know how to design sound software, we know how to analyze it, we know how to test it. Deliberate, systematic approaches to these tasks can bring orders of magnitude greater confidence in the systems we produce; we just choose not to do it. Technically, in aerospace, we have DO-178C, which provides guidelines on the artefacts that need to be produced, but too often these artefacts are created after the fact rather than prior/concurrent to development (where they belong). Criminal liability and careers ruined by loss of licensure are risks that might actually temper some of the recklessness in the industry.
You are missing an important point: different systems fail in very different ways. The reason why people will generally prefer software over other ways of building things is because the cost of failure is often very low. No one is going to die if a website serves an incorrect image, nor will such a bug require billions of dollars worth of semiconductor inventory to be recalled to repair.
If every software project was run like an avionics project, software would be more reliable and of higher quality. But the world would be worse off; most of the software people use would never come into existence.
I don't think I'm missing the point, I'm not calling for photoshop plugin devs to have the same requirements as others, I'm saying I wouldn't object to licensure for software devs on many categories of systems: medical and avionics are two obvious categories. Perhaps I should have stated that more clearly, but I figured most people would get that a browser-based game is in a totally different league than the software controlling acceleration in your vehicle.
Sometimes this is true, but it misses some things.
Sometimes the cost of failure appears low, but is actually massive because the failure mode is not understood. For example a spreadsheet that miscalculates and causes a bad investment decision and a corporate failure.
Software is often chosen because it trades off against weight (in physical systems) or people (more commonly).
Software is fundamentally different because it is not commonly toleranced or the tolerancing of software is not understood. Reliability in physical engineering is understood in terms of the limits to which a component can be pushed. This concept seems not to be applicable to software.
But the way in which 'real world' engineering is treated by companies is extremely different to that of software engineering. A structural engineer tells you that your mega-bridge is going to take another 2 years to build safely and you eat up the cost, a software engineer tells you they need another 2 years to make sure your program will work reliably and you tell them they have 6 months.
It would be nice to see strict regulation on systems where lives would be endangered should the software fail, but this also raises the issue of how you regulate.
In structural engineering you can say: don't use material X if the forces acting on it exceed Y newtons. The same kind of regulation doesn't make sense in software; you can't say "only use Haskell" or "don't use library Z", because the interactions between the tools we use are much more complicated than in many "real world" engineering tasks.
We then run into the fact that a lot of software engineers have no real power in their companies, they do what management says or they get fired, I'd guess that when any other kind of engineer says "this won't work" managers listen. In my opinion a better solution to holding software engineers responsible would be holding the company and managers to account, at least at this point in time.
Correct, you can't declare language use as a requirement the way you can a material. Check out DO-178C; it doesn't do that. It requires artefacts and processes to be in place so that when the system is done, if done right, you have a high degree of confidence that - whether it's written in C or Haskell or OCaml or Fortran - it was designed and tested well, and consequently that errors are minimized or their impact is mitigated by design.
And this brings us back to licensure: if we had a PE category for this sort of software engineering, where people really staked their livelihood on what they signed off on, these sorts of processes might be taken seriously. So when you're told, after giving a 2-year estimate, that you have 6 months, you can honestly reply: I cannot do that. And you'd have a body to point to that backs up your decision when you get fired and they hire a less reputable "engineer".
Very interesting, though I was happy to see Knight Capital take the huge loss, since they were such complete scumbags who stole hundreds of millions of dollars by backing away from trades* during the dotcom boom and bust.
*Backing away is when a market maker makes a firm offer to buy or sell shares, receives an order to execute that transaction (which they are ethically and legally obligated to do) and instead cancels the trade so they can trade those shares at a more favorable price (capturing enormous unethical profits in fast-moving markets while regulators did virtually nothing to enforce the rules in a meaningful way)
I would love to hear from an ex-Knight tech. Wouldn't be surprised if they wrote something along the lines of: "Management just wanted this thing in ASAP!", or perhaps "Tests weren't part of the KPIs". I may sound biased against non-techs, but I have seen this time and time again. Testing is a barrier to quick deployment, and "How much money are we losing while doing all that stoopid testing?".
I really feel bad for people who think like that. A process where tests and deployments are automated and repeatable is vital to quick, robust deployment. Quick deployment without tests just isn't going to work well.
I remember when Knight was in the news over this, but I never heard the technical details about what took place. It's scary stuff, especially given the money on the line, and it makes a good case study for devops. I understand the temptation to re-use a field, but normally I'm for using new values in those fields rather than repurposing old ones.
It's not really possible to fail upwards this way. It would be like forgetting how to play chess in the middle of a game and then winning. Anomalies are universally negative in high-stakes environments, or if they're positive, only engender modest improvements.
As others have noted, not really possible. You can lose a lot of money very fast by repeatedly buying high and selling low. Everybody will trade with you, happily. The opposite, placing low buy orders and high sell orders, will simply result in nothing happening.
Knight's systems were sending small, aggressive orders on both sides of the market, so their net positions were relatively small compared to the total volume traded. They were hemorrhaging money very quickly because they were crossing the spread repeatedly.
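A back-of-envelope calculation (all numbers made up, not Knight's actual figures) shows how fast repeatedly crossing the spread bleeds money even when the net position stays near zero:

```python
# Buying at the ask and immediately selling at the bid loses the spread on
# every round trip, regardless of which way the stock moves. Illustrative
# numbers only.
bid, ask = 10.00, 10.02          # a 2-cent spread
shares_per_trip = 100
round_trips = 1000
loss = (ask - bid) * shares_per_trip * round_trips
print(f"loss after {round_trips} round trips: ${loss:,.2f}")
```

Scale the trip count and share sizes up to Knight's order rate across 150+ symbols and 45 minutes, and nine figures of losses stop looking mysterious.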
In a big long-running error like this it is extremely unlikely that a company would have come out on top.
However, it happened fairly regularly that smaller trading errors would cause a couple (or tens of) thousand dollar win or loss for the client. If it was a loss, the client universally complains and gets comped by the company. If it was a win and it is found, the client keeps mum and the company does not raise the issue.
Dead code takes down another system. A perfect storm of failures that they made themselves. My gut feeling is that most trading firms could suffer a similar loss. Having worked for a 3rd party accounting management firm that kept logs for smaller traders I really realized how borked the whole system is. 60s era pen and paper stuff moving at the speed of light.
> Sadly, the primary cause was found to be a piece of software which had been retained from the previous launchers systems and which was not required during the flight of Ariane 5.
AOL. True devops is about building systems that make this kind of disaster impossible. Or at least very hard.
However, like agile before it, despite the fact that it really means something purposeful and rigorous, the word "devops" has become widely abused to camouflage undisciplined, thoughtless, cowboy behaviour.
A handy way of telling the difference is to ask yourself "what would Devops Borat do?"; if it's something Devops Borat would do, it's the false devops.
I like to think of DevOps as one of those cheap, 3-in-1 printers that you buy thinking that you'll be saving money and desk space.
Then you discover it does a mediocre job of each of those tasks as compared to a dedicated printer, scanner and fax machine. Sure, they'll take up more desk space, but you'll get higher quality results.
Look at this as a handy guide to how not to deploy, in one single page. Pretty much any of the things in bold, had they not been done, would have prevented this from happening. Reused flags, letting dead code sit around in production for years, lack of change management, etc. - all are signs of something gone horribly wrong.
Also, regardless of whether the deploy went well or badly, discuss the aftermath with co-workers. I'd almost guarantee that some of these problems had cropped up before, and being able to ask "how can this never happen again" is important, because otherwise those problems will be forgotten and stumbled over again in a future, perhaps more critical incident.
In a sense the markets should be unstable. A perfect market would be so sensitive that any new information, every order from a fundamentals trader, should shift the price. That's what these high speed transactions get you - more accurate prices, and faster. And as a result of that they can offer much narrower spreads than you'd get elsewhere.
Benefits? Knight gave a bunch of other market participants a better price than they could get anywhere else, and no-one traded at a price they didn't agree to.
A non-economist wants to know: if liquidity is beneficial to our economy and liquidity is a function of time, how does the time-benefit curve look as t approaches 0? I don't know if you can quantify the benefit and map this curve, but if you could I don't imagine it would scale to infinity as time approached zero.
Liquidity isn't a function of time, it is a function of relative time between the predators (arbitrageurs) and the prey (market makers). If the predators are much faster than the prey, liquidity will disappear since the market makers can't survive (their prices are too stale and they are getting taken advantage of). It has always been this way, even since Nathan Rothschild used carrier pigeons to get news of the Battle of Waterloo.
Perhaps you could do consensus checking retrospectively? I.e., out of N supposedly identical servers a random one gets to make any given decision in real time but then a separate system goes back and compares all servers' results and stops their operation if there's divergence?
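The idea above can be sketched in a few lines - note this is purely an illustration of the scheme, with a made-up decision format, not anything Knight actually ran:

```python
# Retrospective consensus: act on one randomly chosen server's answer in
# real time, then compare all servers' answers offline and halt on any
# divergence. Decision strings here are illustrative.
import random

def realtime_decision(decisions):
    """decisions: list of (server, result) pairs from N identical servers.
    Pick one at random to act on immediately, without waiting for consensus."""
    return random.choice(decisions)[1]

def retrospective_check(decisions):
    """Offline pass: True only if every server produced the same result."""
    return len({result for _, result in decisions}) == 1

batch = [("srv1", "BUY 100 XYZ"), ("srv2", "BUY 100 XYZ"),
         ("srv3", "SELL 100 XYZ")]  # one server diverged, e.g. stale code
if not retrospective_check(batch):
    print("divergence detected: halt order entry and investigate")
```

The trade-off is that you only catch the divergence after some bad decisions have already gone out, so the checker's lag directly bounds your exposure.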
You may have heard people with things like backups and emergency generators saying "you have to test this stuff weekly, in case someone has broken it so it'll fail the moment you call on it."
Software is the same.
Knight had code that hadn't been run in 8 years. Sure, the code worked 8 years ago, but things have changed around it since then. As the problem code never ran, no-one noticed it getting broken, or had any reason to fix it if it broke in testing.
Most likely the code worked fine 8 years ago, broke in the intervening 8 years, and hence was broken when activated.
If I understand this correctly, this isn't like having untested code around. It's more like leaving highly toxic medicine in the bathroom cabinet when you no longer need it, or leaving an electrically powered band saw plugged in when it's not in use.
>>>What kind of cowboy shop doesn’t even have monitoring to ensure a cluster is running a consistent software release!?
I think you'd be surprised at what happens in large companies. I went through four, count 'em, four major releases with a company, and each time the failure was in load balancing and in not testing the capacity of the servers we had prior to release.
Even after the second release was an unmitigated disaster, the CTO said we needed more time to do load testing and making sure the servers were configured to handle traffic spikes to the sites we were working on. It happened again, TWICE after he said we needed to do this.
You would think something as basic as load testing would be at the top of the list of "to do's" for a major release, but it wasn't. It wasn't even close.
No, the SEC fined them for losing money stupidly. In order to have access to the market like they did, they had to follow certain laws that are enforced by the SEC. When they were losing all that money they weren't following those laws.
It's like if you cause an accident while you're driving by breaking the law; you get a traffic citation (and the accompanying fine), even if your car is totaled as a result of the accident, because you did something illegal.
These are pros who are paid very well to have a clue. It is not silly if "acting stupidly" is described in writing, such that all parties adequately understand when the hammer is likely to come down. I am sure a lot of traders push the envelope and "drive 71 in a 65 mph zone". But it is still not silly to give the guy driving 81 a ticket.
Trading companies are supposed to have effective risk policy (i.e., they are supposed to think about what happens if things do wrong and formulate rules to ameliorate risks) and compliance controls (i.e., bureaucratic controls that ensure that policy actually is followed). The SEC charge does not paint a pretty picture on these counts, but the key point seems to be this one:
-- Relied on financial risk controls that were not capable of preventing the entry of orders that exceeded pre-set capital thresholds for the firm in the aggregate.
(The charge also states that Knight violated rules on covering shorts, but I guess this is not so important).
The SEC is quite right to fine firms that have lost money with poor risk controls: the point is that bad risk management can hurt the whole sector. It is like fining a factory owner who lets their plant break pollution regulations.
I'm not sure I agree with "Deploying in such a way that all your servers are not running the same codebase is obviously bad." I have a lot of experience in large scale systems (although this incident with 8 machines does not qualify) and I would say there is _always_ a period of transition where versions X and Y are online in production simultaneously. How can it be otherwise? You'd need scheduled downtime to do it any other way.
I think the main problem here is nobody at this company pushed back on this stupid development plan of reusing a flag for a different purpose. There's no excuse for that (or maybe there is, they had run out of fields in some fixed-width message format or something dumb like that). Also apparently the use of the flag was not tied hermetically to the binary in production; when they rolled back the binary the flag was still there but it meant something different to the old software.
The correct way to roll this type of change out is for the new input (the "flag" in this case) to be totally inert for the old version of the software, and for the new version to have a config file or command line argument that disables it. So _first_ you start sending this new feature in the input, which is meaningless and ignored by the existing software, and then you roll out the new software to maybe 1% of your fleet, and see if it works. Then roll it out to maybe 10% and leave it that way for a week. Insist that your developers have created a way to cross-check the correctness of the feature in the 10% test fleet (structured logging, etc.). If it looks good, roll it to 100%. You now have three ways to disable it: turn it off in the input stream, turn it off in the new software with the config or argument, or roll back the software.
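The staged rollout above can be sketched as follows - the function and field names (`rlp_flag`, `rlp_enabled`) are illustrative placeholders, not Knight's actual interfaces:

```python
# Sketch of the three-kill-switch rollout: the new input flag is inert for
# the old build, and the new build only honors it when local config agrees.
def route_standard(order):
    return ("standard", order["symbol"])

def route_rlp(order):
    return ("rlp", order["symbol"])

def handle_order_old(order):
    # Old build: unknown fields in the input are simply ignored.
    return route_standard(order)

def handle_order_new(order, config):
    # New build: the flag only takes effect if local config also enables it.
    if order.get("rlp_flag") and config.get("rlp_enabled"):
        return route_rlp(order)
    return route_standard(order)

order = {"symbol": "XYZ", "rlp_flag": True}
assert handle_order_old(order) == ("standard", "XYZ")  # upstream flag is inert
assert handle_order_new(order, {"rlp_enabled": False}) == ("standard", "XYZ")
assert handle_order_new(order, {"rlp_enabled": True}) == ("rlp", "XYZ")
```

Because the flag means nothing to the old build, a server that missed the deploy just ignores it instead of executing eight-year-old dead code.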
Doesn't look like these guys really knew what they were doing.
"How can it be otherwise? You'd need scheduled downtime to do it any other way."
Trading floor is only open a few hours every day, the functionality being rolled out required the markets to be open. Furthermore, since the changes were all for new functionality they rolled it out in stages days ahead of time (good move).
> How can it be otherwise? You'd need scheduled downtime to do it any other way.
Roll out the code in advance, and have the production machines switch to it at a defined, synchronized time?
I mean, imagine you only have one production machine. If you're willing to admit that you can have it switch from version X to version Y with no downtime, then synchronization is the only barrier to doing the same on n machines. Why would you need scheduled downtime?
Synchronization is non-trivial, but the question is mostly how fine you need the synchronization to be. E.g., if you are doing a live upgrade using, say, Erlang or Nginx, you can sort of decide when new processes will be served by the new server, but existing processes and requests may linger with old code until much later.
But there's at least 30 minutes of downtime per week per market (usually per day), and the vast majority of downtimes coincide during the weekend - so this is all moot discussion and needlessly complex solution. If you can afford the downtime, switch midnight GMT between Saturday and Sunday, when all markets are closed.
It's most likely a giant C++ server or set of servers. It's not really that surprising that old code is in there. One way releases are handled in these kinds of places is to roll out new features on some kind of external flag system, perhaps shared memory flags or some other mechanism. This way if something goes wrong with that one feature, you can disable it by flipping the switch, rather than having to back out the software, which would likely cause downtime during market hours, a major problem.
What sometimes happens when you want to decommission features is you just turn the flags off rather than remove the code. There's an obvious allure to this: you already tested the on/off functionality of the switch during the original roll-out, so you can avoid having to test whether you removed the code correctly. It sounds like in this case they removed the code and repurposed the switch that disabled it (it may really be a shared-memory system and they were running out of flags), but they fucked it up. The old code was still there on some servers, and when the switch was turned back on to enable the new feature it had been re-purposed for, it re-enabled the old code.
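The hazard described above fits in a few lines. This is a toy illustration of the mechanism, not Knight's actual code; the flag name and return values are made up:

```python
# Same flag name, two meanings: flipping it on for the new feature silently
# re-activates the "dead" code on any server still running the old build.
def old_build(flags):
    if flags.get("power_peg"):      # original meaning: legacy Power Peg logic
        return "POWER_PEG_ORDERS"   # dead code, never meant to run again
    return "NORMAL_ORDERS"

def new_build(flags):
    if flags.get("power_peg"):      # flag repurposed: now means the new RLP feature
        return "RLP_ORDERS"
    return "NORMAL_ORDERS"

flags = {"power_peg": True}  # switched on to enable the new feature
assert new_build(flags) == "RLP_ORDERS"        # seven servers: intended behavior
assert old_build(flags) == "POWER_PEG_ORDERS"  # stale eighth server: disaster
```

The flag itself behaved exactly as designed on every server; the failure is that "as designed" meant two different things depending on which binary was running.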
It was a flag that was re-purposed. If the code is never deleted and the flag's use is discontinued then the code never runs. However, it seems that the flag was reused years later causing the old code to be inadvertently reactivated. It has nothing to do with the language it was written in.
Poster was probably thinking of directories of interpreted code (i.e. Python, PHP, Ruby), compared to compiled binaries (C++, C, C#, Java), since many compiled languages have dynamic deployment mechanisms like OSGi's hotswapping of individual modules.
It is peg, not keg. A peg refers to an order whose limit price is automatically adjusted relative to some benchmark. For instance, you might always want to be 1 penny away from the best bid. I don't know specifically what "Power Peg" was, though.
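A minimal sketch of the pegging idea, with the caveat that "Power Peg" was a Knight-internal order type whose exact rules aren't public, so this only illustrates the generic concept:

```python
# A pegged order's working limit price is re-derived from a benchmark each
# time the benchmark moves - here, a fixed number of cents behind the best bid.
def pegged_price(best_bid, offset_cents=1):
    """Limit price pegged offset_cents behind the best bid."""
    return round(best_bid - offset_cents / 100.0, 2)

print(pegged_price(10.00))  # pegged one cent behind a $10.00 bid
print(pegged_price(10.05))  # re-pegs automatically as the bid moves up
```

The point is that the trader never updates the price manually; the system chases the benchmark, which is also why a malfunctioning pegging algorithm can keep firing orders indefinitely.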