We work in an incredibly immature industry. And trying to enforce better practices rarely works out as intended. To give one example: we rolled out mandatory code reviews on all changes. Now we have thousands of rubber-stamped "looks good to me" code reviews without any remarks.
Managers care about speed of implementation, not quality. At retrospectives, I hear unironic boasts about how many bugs were solved last sprint, instead of reflection on how those bugs were introduced in the first place.
Agree with this. A lot of developers are in a filter bubble where they stick to communities that advocate modern practices like automated testing, continuous integration, containers, gitflow, staging environments, etc.
As a contractor, I get to see the internals of lots of different companies. Forget practices even as basic as code reviews - I've seen companies with no source control, no staging environments, and no local development environments, where all changes are made directly on the production server via SFTP on a basic VPS. A lot of the time there are no internal experts who are even aware there are better ways to do things - it's not that they lack the resources to make improvements.
SVN is easier to understand and use, but then you'd have to break those habits later to get to git. On the other hand, going straight to git might be too big a step and cause a reversion back to whatever system was already there.
Then disallow direct commits to master, so people have to work in feature branches and make merge requests through the platform (GitHub/GitLab/Bitbucket). I find merging and branching locally is where people normally trip up.
Git GUI tools always make git seem way more complicated than it is, so depending on the team's platform I would recommend the CLI from the start.
Say you want to edit a few chars in a 250MB file. Why don't you, just to be sure, paste that whole file into the comment field? Do that for a few hundred commits. Tortoise really hated that one, and crashed Windows Explorer every time you dared look at the logs (out of memory).
Or the time some joker (his CV proudly declares 10 years of developer experience) deletes the root of the tree, doesn't know about history, and goes with his manager straight to the storage admin, who wipes everybody's commits for that day (a few hundred people). Clearly there's no need to contact someone who knows anything about Subversion if the data is gone, and maybe this way nobody will notice anything and yell at them.
Or say you want to do an upgrade. In theory, every user leaves, the service and network port get shut down, the VM instance is backed up just to be sure, and you do the svn upgrade. Of course, enterprise IT means I have to write detailed instructions for every party involved and am under no circumstances allowed to look at that server myself.
So it turns out: A) some users just keep on committing straight through the maintenance window. B) The clown who shuts down the service doesn't check whether the service is actually shut down, and there is a bug in the shutdown script, so svn just keeps on running. C) The RPM containing the patch travels over an unreliable network that actually manages to drop a few bytes while downloading over HTTP. D) The guy who should shut down the svn network port is away eating, so they decide to skip that step. E) SVN gets installed anyway (what do you mean, checksum mismatch?) and starts committing all kinds of weird crimes to its storage. F) The VM guy panics and rolls back to the previous version - except for the mount which contains the data files. G) Then they do it all again, and mail me that the release was successful, without any detail of what happened.
Let me tell you, svn really loved having its binary change right under it, in the middle of a commit, while meeting its own previous version in memory. Oh, and clients having revision N+10 while the server is at revision N - a problem that solves itself in a few minutes, as N goes up really fast ;-)
Now that's what happens with Subversion, which is rock solid and never drops a byte once committed. This company is now discovering the joys of git, where you can rewrite history whenever you feel like it.
I've introduced a lot of people to SVN over the past decades. Be it programmers, sysadmins, artists, or translators, it's fairly quick to learn.
I couldn't begin to imagine introducing anybody to git. It's a horrible nightmare to use, even for developers; nothing comes close in how many times it screws up and forces you to search for help on the internet.
Subversion is newer than RCS. But that doesn't mean every use of the latter can or even should be replaced.
IBM ClearCase is the way to go
I'm not into Java development, but this sounds fine on the face of it, without you giving the context of how this pipeline is triggered.
You pulled the code from google drive, modified it, pushed it to PROD, checked it and moved it back to google drive... and asked the other developers to update.
Well, I've seen companies that have their own idea of source control: lots of copies on the network drive, and an Excel registry of what is in which file.
It is source control. Just bad source control.
However, there's also the skill set of taking such a ... let's call it well-aged development team and approach, and modernizing and/or professionalizing it. And yes, sometimes this means building some entirely lovecraftian deployment mechanism on mutable VMs because of how the application behaves. But hey, automated processes are better than manual processes, which beat undocumented processes. Baby steps.
Omg. And here I am feeling ashamed to tell others about my small small personal website with separate dev, qa, and production environments on the same server (via VirtualHost), code checked into GitHub, deployed via Jenkins self-hosted on another VPS, which was initially spun up with Ansible and shell scripts. All done by me for self-training purposes. All because I thought businesses would have something more sophisticated, with bells and whistles.
And then I hear there are businesses that make changes directly on live production servers...
But I'm not surprised by such stories, as I have seen some bad workflows in real businesses that deal with tens of millions of dollars a year.
Years ago, I worked in the NOC of a company that's top in its small niche. They have dozens of employees and have been around for years.
Part of the job responsibility was rolling out patches to production servers. The kicker was the production servers were all Windows servers, running various versions, covering practically all Windows versions ever released by Microsoft. You can see where this is headed.
Rolling out a patch was done by hand, one Windows server at a time. Everything was manual.
The instructions for deploying a change were easily multiple lines long, each line in a different style of writing, often in plain-text format. We would print them out so we could check items off as we went down the list.
The CTO is still there, but every IT person under him has left or been let go. Working in IT there is a struggle because of the lack of automation and the old, old stuff, but the CTO just blames bad employees and keeps churning them in and out. The real issue is the decade or two of legacy stuff that needs to be cleaned up and/or thrown out, which can only happen under the CTO's direction. But he knows he won't get that kind of budget from higher-ups, so he just keeps hiring and firing employees and/or bringing in H1B workers who are basically trapped once they join. And the few H1B workers I met there were truly, completely non-technical. One did not want to learn keyboard shortcuts for common tasks... Good guy, though.
> ...all changes are being made directly on the production servers via SFTP
I know this used to be common, but recently? Curious how often this is still the case.
Several times within the last year for me. Not all companies have big tech departments with knowledgeable developers advocating modern best practices. Some big internal systems start out from someone internal applying basic self-taught skills, for example.
To be fair, the jump to using Git (and especially dealing with Git merge conflicts) is scary. It can be a hard sell to get people to move from a system they already completely understand and have used successfully for years, even if their system looks like ancient history to us.
Literally heard "...but my IDE already automatically uploads to FTP on save, I'm usually the only one editing it, and I already know what I changed" last week.
Meanwhile the CEO who has been rejecting the €¥$£ in the budget since 2000 is angry at everyone!
Oh the times I have seen this!!!
This was for a 100+ year old company with millions of dollars in annual revenue that was owned by the government. So, yeah. 100% the IT director's fault, who'd been there since the early 90s.
Having no testing/staging environments remains pretty common, along with its cousin "production work happens on staging". Partnering with not-primarily-software companies and asking about staging infrastructure, you hear that regularly. And yeah, SFTP/SCP/SSH is a standard push-and-deploy approach in places where that happens.
On the other hand, outside of a tech company, and with fewer than a dozen developers, don't expect to find any source control. Consultants see a lot of this shit; they work in all industries, including ones where developers don't otherwise exist, and on a lot of thrown-away projects.
Funny thing. Git probably made it worse in recent years by being impossibly hard to use.
/seriously, though - I...hope this isn't being done any longer - but I bet it is. Sigh...
The writer breaks down why this document is not a post-mortem despite superficial similarities.
>Again, the purpose of the doc is to point out where Knight violated rules. It is not: 1) a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or 2) how Knight as an organization evolved over time to focus on evolving some procedures and not others, or 3) how engineers anticipated in preparation for deploying support for the new RLP effort on Aug 1, 2012.
>To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.
>It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend the document to give explanation.
He also makes an interesting point related to what you're saying: the SEC says the risk management controls were inappropriate, but clearly Knight thought they were appropriate or they would have fixed it.
>What is deemed “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.
I'd go into it more, but I'd instead recommend you take a look at his breakdown, as I'd just be doing a shoddy summary of a really interesting write-up.
> This would mean that Knight Capital did have appropriate controls the day before the accident.
I think these claims are seriously confused.
SEC fines don't require mens rea, so Knight is simply being punished for having inappropriate controls, their view on the matter be damned. Kitchensoap rightly observes that "this event was very harmful" does not imply "this event was caused by extreme negligence". But the SEC filing focuses on alerting and controls; position limits don't prevent a specific misstep, but they limit the maximum size of any error that does occur. (Knight had position limits on accounts, but didn't use them as fundamental boundaries restricting actual trade volume.) The thesis is that Knight should have prepared to mitigate "unknown unknowns", in which case the size of the error is relevant because the size was exactly what should have been controlled for.
On appropriate controls, SEC fines are certainly outcome-biased, but the claim is obviously that these controls were always inappropriate, and the disaster simply revealed them. Post-disaster punishment creates an ugly system where people who don't take excess risk can be outcompeted before their competitors crumble, but the rule isn't actually conditional on failures.
Kitchensoap asks whether Knight would be judged so harshly if they'd only lost $1,000. Socially, perhaps not, but legally they actually would have! The SEC isn't just punishing Knight for losing money but for disrupting the market with improper trades; it specifically notes that for some "...of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants...". A smaller loss wouldn't have defended against that charge, while a smaller trade wouldn't have violated SEC rules.
I think the author is basically aware of this, since his fundamental point is that the SEC is describing the legal wrongs rather than the technical mistakes. That's a good point and I'm glad you linked this. But I think his focus on the specific deployment error neglects the fact that the missing position controls were the larger legal and technical failure.
What makes negligence extreme? Doing things you have specifically been warned against would be one thing, but there are others, including being oblivious to the magnitude of the risk when, with a little thought, it should have been clear.
The inverse of the above quote is equally valid: not-very-harmful outcomes do not imply that the negligence is not extreme, and it was all the days of operating without big problems that allowed the organization to be blind to the risk it was running, day in, day out.
>> Clearly Knight thought [its controls] appropriate or they would have fixed it.
But the problem is that it did not think about it, in a meaningful way: it did not have an informed opinion. Every day in which nothing very bad happened contributed to the normalization of deviance. I am sure there were other days when things went wrong, but without the worst possible outcome, and they became just part of the way things are, instead of a wakeup call.
By this logic, we can claim that if a hospital is storing all its employee passwords in plaintext, that's "appropriate" because if it was inappropriate they wouldn't do so.
Or that if a company is neglectful about offsite backups, that's an "appropriate" data retention strategy because if it wasn't, then the company would be taking backups.
In this case, if Knight thought its controls were "appropriate", then that's the problem that needs fixing.
But I appreciate the writing style, and I've got his site bookmarked for more systems safety reading in the future. It's a slog, but often this sort of ruthlessly-comprehensive breakdown is the best way to understand exactly how a complex thing went awry. Reading them every so often - even for non-software topics like drug treatments - seems to be a good refresher for my own error-analysis skills.
The link here excerpts the full filing, but the real meat wasn't included:
> 16... Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers...
17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.
Knight did $1.4 billion in revenue per year, but there was no safeguard against adopting $7 billion worth of positions in 45 minutes. The end result was that they lost 4 years of profits in under an hour. That's crazy, and not at all standard.
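A hard cap like that can be sketched in a few lines. This is an invented illustration (the `ExposureGuard` class, names, and limits are all hypothetical), not a description of how Knight's SMARS actually worked:

```python
# Hypothetical sketch: a gross-exposure kill switch checked before every
# child order goes out. Once the cap is breached, all further orders are
# rejected until a human intervenes.

class KillSwitchTripped(Exception):
    pass

class ExposureGuard:
    def __init__(self, max_gross_exposure):
        self.max_gross_exposure = max_gross_exposure  # dollars
        self.gross_exposure = 0.0
        self.tripped = False

    def check_and_add(self, order_notional):
        """Raise (and latch) if this order would exceed the gross cap."""
        if self.tripped:
            raise KillSwitchTripped("trading halted: exposure limit hit")
        if self.gross_exposure + abs(order_notional) > self.max_gross_exposure:
            self.tripped = True
            raise KillSwitchTripped("order would exceed gross exposure limit")
        self.gross_exposure += abs(order_notional)

guard = ExposureGuard(max_gross_exposure=50_000_000)  # e.g. a $50M hard cap
guard.check_and_add(10_000_000)   # fine
guard.check_and_add(30_000_000)   # fine
try:
    guard.check_and_add(20_000_000)  # would take us to $60M: halt
except KillSwitchTripped as e:
    print("halted:", e)
```

The point isn't the ten lines of code; it's that the limit is a hard boundary the order flow physically cannot cross, rather than an email someone may or may not read.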
And everyone benefits: you, the employer, your colleagues, the fresh grads with starry eyes.
First, management has to be willing to assign you to cleanup tasks, rather than have you work on new features.
Second, cleaning up spaghetti code safely and turning it into something maintainable is often not easy -- it wasn't written to be maintained! No tests, missing documentation, ill-considered interfaces...
It isn't enough to keep your head off the chopping block: management has to be fully on board.
I found an impending any-day-now production down scenario last week and got it fixed while learning tremendously and fixing some incidentally related things. It feels amazing.
No tests, no docs, no comments, total spaghetti, global variables everywhere, no logs, and I am loving it. What a job.
Upper upper management is starting to keep tabs on the "productive" output of my team though. We'll see.
Whether or not that's true, I don't know if it's the right framing for the problem. Think about how many new cars or new airplanes or new bridges are designed every year. (One or two per large firm? Less? None by any small firms?) Then think about how many new web services are designed every year. (Several for a small firm, dozens for a large firm?)
If "maturing" the software design process means 10x or 100x the cost and time that a project takes today, that isn't going to fly in the market. It might make more sense to point out the discrepancies between the typical software project (low cost for mistakes, high tolerance for downtime) and the atypical finance/medicine/aerospace software project (don't screw up or people die). Maybe the specific part of the industry that needs to mature is the _awareness_ of when you cross that boundary, especially within a single organization. The folks working on the internal HR system and the folks working on the high-frequency trading system need to use different procedures (and different levels of funding and different management expectations), and if they're the same folks on different days that might be very far from obvious.
About a decade ago, I was a contractor in a security services team at a large bank. There were systems in place to minimize the impact of our software, which handled requests for access, and the actual access granted. Our software was developed like most, or worse in the beginning. In the end, by separating workflows and agreeing on interfaces, a lot of issues were prevented altogether.
Other systems should be designed with similar safeguards and separate certain control flows behind well thought out APIs.
Others still should be created much closer to classic waterfall, with adjustments meaning a stop-work, re-evaluate, update design, and proceed process in place.
It should really depend on the situation.
If the analysis shows the defect could have reasonably been caught during the code review phase then management sometimes has to recalibrate the reviewers. It's also important to select the right reviewers based on the scope and risk of the change. For the largest, most complex changes we sometimes pull in up to 6 reviewers including members of other teams and high level architects.
In finance, I've seen people compute deals worth billions of dollars using excel spreadsheets and a team of MBAs.
The team of MBAs might screw up an Excel formula and lose money on a deal, but presumably if the math comes out to "let's pay $500 billion for Yahoo", somebody's going to sanity-check that before they transfer the money.
* Lack of unit (or any) testing
* Lack of good versioning practices and code review (e.g emailing around a sheet that's been doing the rounds for 15 years in various different guises and formats and has who knows what horrors lurking)
* Lack of typing (e.g doing a SUM across a dataset consisting of "3", $3, 3, "III", an emoji of the number 3, a jpeg screenshot of the number 3 - might not add up to 18)
* Lack of precision (rounding errors)
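The typing bullet is easy to make concrete. Here's a minimal sketch of a "strict" sum that fails loudly on non-numeric cells, instead of silently skipping them the way a spreadsheet SUM does (the cell values are invented for illustration):

```python
# Minimal sketch: a strict sum that rejects non-numeric cells outright
# rather than silently ignoring them.

def strict_sum(cells):
    total = 0.0
    for i, cell in enumerate(cells):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(cell, bool) or not isinstance(cell, (int, float)):
            raise TypeError(f"cell {i} is not numeric: {cell!r}")
        total += cell
    return total

clean = [3, 3.0, 3]
messy = [3, "3", "$3", "III"]

print(strict_sum(clean))       # 9.0
try:
    strict_sum(messy)
except TypeError as e:
    print("rejected:", e)      # rejected: cell 1 is not numeric: '3'
```

A spreadsheet happily returns 3 for the messy row; a strict pipeline refuses to answer at all, which is the behavior you actually want when billions ride on the total.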
Not sure if you are talking about the author of the SEC document (the “bug report”) or the blogger, but in either case, what is okay in 99% of software development may be quite inadequate for critical software used in tightly regulated industries. Context matters.
> Managers care about speed of implementation, not quality.
Managers care about speed of implementation rather than quality insofar as quality is often hard to measure unambiguously, and its impact on the bottom line is hard to assess (not just to quantify, but even whether there is really any impact).
This is a fairly dramatic example of the impact being made concrete.
with this practice I've gotten into a better habit of looking at the code changes and building a mental image of what things do
this, coupled with annotation in the IDE, helps me really keep up with the code
only occasionally do I get an actual WTF that prompts me to leave a remark or withhold approval
either way, I'm better informed about the code change and implementation than I would be otherwise
I'd add that it's the incredibly immature economic system, paired with an even more immature (new) industry, that leads to this kind of failure.
I feel like this is a symptom of tech in capitalism. Where the goal is to maximize profits and minimize effort, rather than doing the job correctly. Fitting that this would befall a high freq trading firm.
It reads to me like standard RCA (Root Cause Analysis) language/tone.
How are you defining "top" and "bottom" here? I can agree with the 80/20 split, but I don't think it necessarily tracks with, say, name recognition or market cap. Amazon, for example, utterly failed on their last Prime day, and from the discussions of those Amazonians I know, it was pretty much inevitable.
The REASON these fast-and-loose habits are habits is that, when they work, they make money. You can't look at a lottery and say the winners "did it right" and the losers didn't - the winners just haven't failed yet. The companies that AREN'T trying to grow at ridiculous speeds but ARE trying to maintain quality (and are fortunate enough to be in a market niche that supports that) are the ones most likely to have bulletproof reliability. Those won't be the "top" companies by most people's definitions.
I'm not familiar with what happened. I'm also ex-Amzn. I think there is another dimension to this: scale. At Amazon's scale there are quite different challenges than in a smaller company. There are solutions that could not possibly scale to their needs.
Fast-and-loose was largely introduced when Google and Facebook entered the scene. I remember that Google did a survey and most users were OK with some breakage in exchange for getting the newest features ASAP. Many people concluded from this that all software development is like that. Ironically, Facebook lately adopted certain technologies to reduce the number of bugs in their frontend code (ReasonML). I think there is a large distance between bulletproof and feck-all reliability; top companies are closer to the bulletproof end of the spectrum, while bottom companies are closer to the feck-all end.
Each of these is strongly correlated to quality.
Most companies are not tech companies, fyi. So you're talking about the top 20% of a very small subset of businesses.
I remember the week after this. Everyone I knew who worked at a fund was going over their code and also updating their Compliance documents covering testing and deployment of automated code.
As a side note, one of the biggest ways funds tend to get in trouble with their regulators is by not following the steps outlined in their compliance manual. It's been my experience that regulators care more that you follow the steps in your manual than that those steps are necessarily the best way to do something.
I came away from this thinking the worst part was that their system did send them errors; it's just that when you deal with billions of events, emailed errors tend to get ignored, because at that scale logging generates so many false positives.
I still don't know the best way to monitor and alert users for large distributed systems.
The other takeaway was that this wasn't just a software issue but a deployment issue as well. It wasn't one root cause but a number of issues that built up to cause the failure.
1) New exchange feature going live so this is the first day you are actually running live with this feature
2) old code left in the system long after it was done being used
3) re-purposed command flag that used to call the old code, but now is used in the new code
4) only a partial deployment leaving both old and new code working together.
5) inability to quickly diagnose where the problem was
6) you are also managing client orders and have the equivalent of an SLA with them so you don't want to go nuclear and shut down everything
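Points 2-4 above can be sketched in miniature. This is a hypothetical illustration (all names invented), showing how a repurposed flag can mean different things on an updated server and a stale one:

```python
# Hypothetical sketch of points 2-4: the same order flag routes to the new
# logic on an updated server, but to a long-dead legacy path on a server
# the partial deployment missed. All names are invented.

def handle_order(order, server_version):
    if server_version >= 2:
        # updated server: the flag now triggers the new feature
        if order.get("flag"):
            return "new_logic"
        return "default"
    else:
        # stale server: the same flag still routes to the dead code path,
        # untested for years
        if order.get("flag"):
            return "power_peg_legacy"
        return "default"

order = {"flag": True}
print(handle_order(order, server_version=2))  # new_logic
print(handle_order(order, server_version=1))  # power_peg_legacy
```

The nasty part is that nothing errors: every server handles the flag "successfully", just with wildly different semantics.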
I write apps that generate lots of logs too... I think an improvement lies in some form of automated algorithmic/machine-learning (to incorporate a buzzword into your pitch) log analysis.
When I page through the log in a text editor, or watch `tail` if it's live, there's a lot of stuff that looks like
TRACE: 2019-04-01 09:45:03 ID A1D65F19: Request 1234 initiated
ERROR: 2019-04-01 09:45:04 ID A1D65F19: NumberFormatException: '' is not a valid number in ProfileParser, line 127
WARN : 2019-04-01 09:45:04 ID A1D65F19: Profile incomplete, default values used
WARN : 2019-04-01 09:45:14 ID A1D65F19: Timeout: Service did not respond within 10 seconds
TRACE: 2019-04-01 09:45:14 ID A1D65F19: Request 1234 completed. A = 187263, B = 1.8423, C = $-85.12, T = 11.15s
Don't email me when there's a profile incomplete warning. Don't email me any time there's an "ERROR" entry, because that just makes people reluctant to use error level logging. Definitely don't email me when there's a unique request complete string, that's trivially different every time. But do let me know when something weird is going on!
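One cheap version of "let me know when something weird is going on" is to alert only on log shapes you haven't seen before. A rough sketch; the normalization patterns here are invented and would need tuning for real logs:

```python
import re

# Sketch: normalize log lines (strip timestamps, request IDs, numbers)
# and alert only the first time a new "shape" of WARN/ERROR appears.

seen_shapes = set()

def shape(line):
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<ts>", line)
    line = re.sub(r"ID [0-9A-F]+", "ID <id>", line)
    line = re.sub(r"-?\$?\d[\d.,]*", "<n>", line)
    return line

def should_alert(line):
    if not (line.startswith("ERROR") or line.startswith("WARN")):
        return False
    s = shape(line)
    if s in seen_shapes:
        return False          # same old noise, suppress it
    seen_shapes.add(s)
    return True               # something genuinely new

log = [
    "TRACE: 2019-04-01 09:45:03 ID A1D65F19: Request 1234 initiated",
    "WARN : 2019-04-01 09:45:04 ID A1D65F19: Profile incomplete, default values used",
    "WARN : 2019-04-01 09:46:04 ID B2E77A01: Profile incomplete, default values used",
    "ERROR: 2019-04-01 09:47:00 ID B2E77A01: Timeout after 10 seconds",
]
for line in log:
    if should_alert(line):
        print("ALERT:", line)
```

With this, the second "Profile incomplete" is suppressed as a known shape, while the first timeout still gets through. Real systems layer rate limits and severity on top, but novelty detection alone kills a surprising amount of inbox noise.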
About 2/3 of it came down to filtering certain classes of errors. For 4xx errors, I stopped email notification altogether, since they were already being trapped/handled by the system. Others were a little more specific. Ironically, .NET tends to report some things that should be 4xx errors as 500s, so reclassifying those took out a lot as well.
In the end, within about a month, the emails were down to a manageable 20 or so a day and got more visibility as a result.
Often the answer is writing better alert triggers that take historical activity into account to cut down on false positives. Other times it's simply to reduce the number of alerts. In every case you need an alerting strategy that balances stakeholder needs, and you need to realign on that strategy quarterly. It's ultimately an operational problem, not a technical one.
Alas, back in the real world, logging is always the last thing teams have time to think about...
I’ve also had clients in the past use Splunk with ML forecasting models that inject fields as part of the ingest pipeline. I don’t know the details of that implementation; I just know how the dev teams were using it.
In your example, the `NumberFormatException` is a bad ERROR entry, because it's covered by the WARN entry right below it. Meaning the request did not fail - a default value was used in place of the bad parse. So that exception should also be at WARN level.
(Arguably, overwriting input values with defaults because of a parsing error is probably a bad idea and should be an ERROR due to rejecting the request. But I'm rolling with what we got here.)
I'd imagine that this is somewhere statistical process control would apply: if you're dealing with a system in which errors are expected, then monitor frequency & magnitude, and alert when they fall outside of one or two standard deviations from the mean.
For a financial company, you'd probably want a graph of net worth or somesuch, and alert when it falls outside of one standard deviation. If you don't have the IT to calculate net worth at hourly or better granularity, then get there, and aim for to-the-minute granularity. This shouldn't be hard, but it might be, and if it is then it's worth fixing.
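That statistical-process-control idea can be sketched directly. A rough illustration, assuming a rolling window of past observations; the window size, `k`, and `min_samples` are arbitrary choices here:

```python
import statistics

# Sketch: keep a window of a metric (error counts, net worth, ...) and
# alert when a new observation falls more than k standard deviations
# from the window's mean.

def spc_alert(history, value, k=2.0, min_samples=10):
    if len(history) < min_samples:
        return False  # not enough data to say anything yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any change is anomalous
    return abs(value - mean) > k * stdev

history = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
print(spc_alert(history, 101))   # False: within normal variation
print(spc_alert(history, 150))   # True: way outside the band
```

The nice property is that the threshold adapts to each metric's own noise level, so you don't have to hand-pick a magic number per alert.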
In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.
If you want to leave the old code in, fine, but then it still needs to be tested.
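A minimal sketch of what "still needs to be tested" could look like: pin the legacy path's stopping condition with a regression test, so a refactor elsewhere can't silently break it. The function and numbers here are invented, loosely modeled on the cumulative-fill check the SEC filing describes:

```python
# Hypothetical legacy sizing function: decide how many more shares to
# send for a parent order, given what has already been executed.

def power_peg_child_qty(parent_qty, already_filled):
    """Legacy Power Peg sizing: stop once the parent order is filled."""
    return max(parent_qty - already_filled, 0)

def test_power_peg_stops_when_filled():
    assert power_peg_child_qty(100, 0) == 100
    assert power_peg_child_qty(100, 100) == 0   # fully filled: send nothing
    assert power_peg_child_qty(100, 120) == 0   # overfill: never go negative

test_power_peg_stops_when_filled()
print("legacy path still honors cumulative fills")
```

Had a test like this run against the code after the 2005 refactor moved the cumulative-quantity function, the broken stopping condition would have failed in CI instead of on the trading floor.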
Of course, violating those SLAs could cause bankruptcy through client dissatisfaction, but that seems less certain than bleeding out money.
Not a joke. Bitfinex lost its Bitcoins by being compliant (regulators required that users' wallets be displayed on the blockchain). It would have been safer to just have a single cold wallet.
I've been working on warehouse management software (running on the mobile barcode scanners each warehouse worker carried as he moved stuff around the warehouse, confirming each step with the system by scanning barcodes on shelves and products).
We had a test mode, running on a test database, and production mode, running on the production database, and you could switch between them in a menu during the startup.
During testing/training users were running on the test database, then we intended to switch the devices to production mode permanently, so that the startup menu wouldn't show.
A few devices weren't switched for some reason (I suspect they were lost when we did the switch and found later), and on these devices the startup menu remained active.
Users were randomly taking devices each day in the morning, and most of them knew to choose "production" when the menu was showing. Some didn't, and were choosing the first option instead.
We started getting small inaccuracies in the production database. People were directed by the system to take 100 units of X from shelf Y, but there were only 90 units there. We looked at the logs on the (production) database and on the application server, but everything looked fine.
We were suspecting someone might just be stealing, but later we found examples where there was more stuff in reality on some shelves than in the system.
At that time we introduced a big change to pathfinding, and we thought the system was directing users to put products in the wrong places. Mostly we were trying to confirm that this was the cause of the bugs.
Finally we found the reason, by deploying a change to the thin-client software running on the mobile devices to gather log files from all of them and send them to the server.
I've heard about this case many times before but somehow in the other renditions they downplayed or neglected to mention that the deployments were manual. As this story was first explained to me, one of the servers was not getting updated code, but I was convinced by the wording that it was a configuration problem with the deployment logic.
Performing the same action X times doesn't scale. Somewhere north of 5 you start losing count, especially if something happens and you restart the process (did I just do #4, or am I remembering doing it an hour ago?).
The problem was - the startup menu with testing/production choice was enabled independently of the autoupdate mechanism (separate configuration file ignored by autoupdates) for some technical reason (I think to allow a few people to test new processes while most of the warehouse works on the old version on production database).
I rarely work on this system, but had to make an emergency change last summer. We deployed the change at around 10 pm. A number of our tests failed in a really strange way. It took several hours to determine that one of the 48 servers still had the old version. Its disk was full, so the new version rollout failed. The deployment pipeline happily reported all was well.
We got lucky in that our tests happened to land on the affected server. The results of this making it past the validation process would have been catastrophic. Not as catastrophic as this case, I hope, but it'd be bad.
We made a couple of human process changes, like telling the sysadmins to not ignore full-disk warnings anymore (sigh). We also fixed the rollout script to actually report failures, though I still don't fully trust it.
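For illustration, a minimal sketch of what "actually report failures" might look like in a rollout loop. Everything here is hypothetical: `copy_fn` stands in for whatever the real per-server copy step is, and the free-space helper shows the kind of precondition that was missing.

```python
import shutil

def enough_space(path: str, required_bytes: int) -> bool:
    """True if the filesystem holding `path` has the free space we need."""
    return shutil.disk_usage(path).free >= required_bytes

def deploy(servers, copy_fn):
    """Run `copy_fn` (a hypothetical per-server copy step that raises on
    error) against every server, collecting failures instead of hiding
    them. A rollout that fails anywhere must say so loudly."""
    failed = []
    for server in servers:
        try:
            copy_fn(server)
        except Exception as exc:
            failed.append((server, str(exc)))
    if failed:
        raise RuntimeError(
            f"deployment failed on {len(failed)} server(s): {failed}")
    return len(servers)
```

The point of the sketch is the final `raise`: a pipeline that swallows a per-server error and reports success is exactly the failure mode described above.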
Handling an out-of-space condition should be part of your test suite - it certainly was back when I looked after a MapReduce-based billing system at BT, and that was back in the day when a cluster of 17 systems was a really big thing.
We actually did have monitoring on the capacity of the disk in question. We discovered during the analysis that it had been alerting the responsible team all day. They had just been ignoring it.
I was concerned that it's possible to read your comment as if it was critical of the parent - was that your intention?
Though for the system I mentioned one of the reasons they employed me as a developer was that I was also a sysad on PR1MES - an early example of devops maybe.
So they received these 90 minutes before they were executed, and as so often happens in many organizations, automated emails fly back and forth without anyone paying attention.
Also... running new trading code and NOT having someone looking at it LIVE at the kick-off, that is simply irresponsible and reckless.
(Except I had remembered them losing $250M, not $465M, yeow)
The sad thing about this is if the engineering team had insisted on removing the old feature toggle first, deploying that code and letting it settle, and only then started work on the new toggle, they may well have noticed the problem prior to turning on the flag, and it certainly would have been the case that rolling back would not have caused the catastrophic failure they saw.
Basically they were running with scissors. When I say 'no' in this sort of situation I almost always get pushback, but I also can find at least a couple people who are as insistent as I am. It's okay for your boss to be disappointed sometimes. That's always going to happen (they're always going to test boundaries to see if the team is really producing as much as they can). It's better to have disappointed bosses than ones that don't trust you.
Anyway, this is what the deployment looked like two years after:
* All configuration files for all production deployments were located in a single directory on an NFS mount. Literally countless .ini files for hundreds of production systems in a single directory, without any subdirectories (or any other structure) allowed. The .ini files themselves were huge, as typically happens in a complex system.
* The deployment config directory was called 'today'. Yesterday's deployment snapshot was called 'yesterday'. That was as much revision control as they had.
* In order to change your system configuration, you'd be given write access to the 'today' directory. So naturally, you could end up wiping out all other configuration files with a single erroneous command. Stressful enough? This is not all.
* Reviewing config changes was hardly possible. You had to write a description of what you changed, but I've never seen anybody attach an actual diff of the changes. Say you changed 10 files; in the absence of a VCS, manually diffing 10 files wasn't something anybody wanted to do.
* The deployment of binaries was also manual. Binaries were on the NFS mount as well. So theoretically, you could replace your single binary and all production servers would pick it up the next day. In practice though, you'd have multiple versions of your binary, and production servers would use different versions for one reason or another. In order to update all production servers, you'd need to check which version each server used and update that version of the binary.
* There wasn't anything to ensure that changes to configs and binaries were done at the same time in an atomic manner. Nothing checked that the binary used the correct config. No config or binary version checks, no hash checks, nothing.
Now, count how many ways you can screw up. This is clearly an engineering failure. You cannot put more people or more process over this broken system to make it more reliable. On the upside, I learned more about reliable deployment and configuration by analyzing shortcomings of this system than I ever wanted to know.
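As a sketch of the missing safeguard: a release could ship a small manifest that ties a binary to its config by hash, so a server refuses to start on a mix-and-match pair. Nothing here is from the system described; the function names and JSON shape are invented for illustration.

```python
import hashlib
import json

def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_manifest(binary: bytes, config: bytes) -> str:
    """Record the hashes of a binary and its config as one release unit."""
    return json.dumps({"binary": _sha256(binary), "config": _sha256(config)})

def verify(manifest: str, binary: bytes, config: bytes) -> bool:
    """A server would call this at startup: refuse to run unless the binary
    and config on disk are the exact pair that was released together."""
    want = json.loads(manifest)
    return (want["binary"] == _sha256(binary)
            and want["config"] == _sha256(config))
```

Even something this crude rules out several of the "count how many ways you can screw up" scenarios: a stale binary with a new config, or vice versa, fails closed instead of running quietly wrong.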
However, what this neglects to mention is the risk associated with a catastrophic software error. If you are, say, Instagram and you lose an uploaded image of what someone ate for lunch, that is undesirable and inconvenient. The consequences of that risk, should it come to fruition, are relatively low.
On the other hand, if you employ software developers who are literally the lifeblood of your automated trading business, you'd think that a company like that would understand the consequences and treat this "cost center" as a critical asset rather than just a commodity.
Unfortunately you would be wrong. Nearly every developer I have ever met who has worked for a trading firm has told me that the general attitude is to treat nearly all employees who are not generating revenue as disposable commodities. It's not just developers but also research, governance, secretarial, customer service, etc. This is a bit of a broad brush, but generally the principals and traders of those aforementioned firms are arrogant and greedy and cut corners whenever possible.
In this case you'd think these people would be rational enough to know that cutting corners on your IT staff could be catastrophic. This is where you would be wrong. Friends who have worked at small/mid-sized financial firms have told me those firms generally treat their staff like garbage and routinely push out people who want decent raises/bonuses, etc. These people are generally greedy and also egocentric and egomaniacal, and they believe all their employees are leeching from their yearly bonus directly.
This story is not a surprise to me in the least. What's shocking is no one in the finance industry has learned anything. Instead of looking at this story as a warning, most of the finance people hear this story and laugh at how stupid everyone else is and that this would never happen to them personally because they're so much smarter than everyone else.
What if we're smarter than everyone else? When I was in big bank, we had mandatory source control, lint, unit tests, code coverage, code review, automated deployment, etc... pretty good tools actually. Not everybody is stuck in the stone age.
Even in a small trading company before that, we had most of the tooling although not as polished. Very small company with a billion dollars a month in executed trade. One could say amateur scale.
I'm not an expert here. Part of what I said is based on the 6 different people I've met who have worked in the industry. I'm just saying: if you have $400+ million to lose and you rely on IT infrastructure that allows you to make that money, then you can spend a few million on top-notch people and processes to prevent this kind of thing. I worked at a relatively large media company, and every deployment had a go/no-go meeting where knowledgeable professionals asked probing questions and you defended your decisions. I'd love to know what they did at Knight Capital. The idea of reusing an existing variable for code that was out of use strikes me as a terrible idea.
But spend $200,000 on managing $460,000,000? No way!
"The new RLP code also repurposed a flag" - this is the moment when a terrible software development idea was executed, the one that resulted in all of this mess.
Of course I don't know the full context and maybe, just maybe there was a really solid reason to reuse a flag on anything.
What I observe more often is something like this though:
1. We need a flag to change behaviour of X for use case A, let's introduce enable_a flag.
2. We want similar behaviour change of X also for use case B, let's use the enable_a flag despite the fact the name is not a good fit now.
3. Turns out use case B needs to be a bit different so let's introduce enable_b flag but not change the previous code so basically we need them both true to handle use case B.
4. Turns out for use case A we need to do something more but things should stay the same for B.
5. At this point no one really knows what enable_a and enable_b really do. Hopefully at least someone noticed that enable_a affects use case B.
What would probably be even better is separate, properly named config flags for each little behaviour change and just use all 5 of those to handle different use cases.
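The drift in steps 1-5 can be shown in miniature. The flag names below are invented for illustration; the legacy function captures the end state where behaviour depends on flag combinations nobody can explain, and the dict shows the alternative of narrowly named flags composed explicitly per use case.

```python
# End state of the reuse spiral: enable_a and enable_b no longer mean
# what their names say, and only their combination selects the behaviour.
def handle_legacy(enable_a: bool, enable_b: bool) -> str:
    if enable_a and enable_b:
        return "use case B"  # both must be true -- why? history.
    if enable_a:
        return "use case A plus the extra step from round 4"
    return "default"

# The alternative: one narrowly named flag per behaviour change, with use
# cases expressed as explicit, readable combinations (names invented).
USE_CASES = {
    "A": {"skip_validation": True, "extra_audit_step": True},
    "B": {"skip_validation": True, "alternate_routing": True},
}
```

In the second form, answering "what does use case B actually do?" is a table lookup rather than an archaeology project.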
This is a little inevitable when working with (internal) binary protocols. You have some bits that used to be used as one thing, haven't been used for that thing in a while, and it can be very tempting to just repurpose those old bits for new tricks.
In that case I'd call the sin deprecating and reusing in a single step. If you have to change the meaning of bits in a binary protocol, you should deprecate the old meaning, wait several release cycles, and then repurpose it only after you're very, very confident that there are no remaining clients or servers lurking around production with the old meaning, and that you won't need to roll back production to the old state ever again.
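A small sketch of the deprecate-first discipline, using a hypothetical one-byte header with invented flag names: the retired bit stays reserved for several release cycles, and messages that still set it are rejected loudly rather than silently reinterpreted.

```python
# Hypothetical one-byte header with invented flag names.
FLAG_COMPRESSED  = 0x01
FLAG_ENCRYPTED   = 0x02
# 0x04 carried an old meaning until it was deprecated; it stays reserved
# (and must be zero) for several release cycles before any reuse.
FLAG_RESERVED_04 = 0x04

def validate_header(flags: int) -> int:
    """Reject any message that sets a reserved bit: it means an old
    client or server with the old meaning is still lurking around."""
    if flags & FLAG_RESERVED_04:
        raise ValueError("reserved bit 0x04 is set; stale peer?")
    return flags
```

Failing loudly here is the whole point: a reserved bit that trips an error during the quarantine period is exactly the stale peer you needed to find before reassigning the bit.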
Roll-out failures like this happen all the time; you have to roll-out new changes almost assuming they will happen.
Either way, it shows some attempt at longer-term thinking undone by short term implementation done improperly. That’s a microcosm of this story as a whole which gives this incident a fractal quality.
A hard lesson to learn, and a hard rule to push for with others who have not yet learned.
Imagine what our species could do if experience were directly and easily transferable...
Same goes for functions, classes, React components, DB tables and everything else.
Just model it as closely as possible to the real world. The world doesn't really change that often. What does is how we interpret and behave within it (logic/behaviour/appearance on top).
If you have a Label and Subheader in your app, create separate components for them. It doesn't matter that they look exactly the same now. Those are two separate things and I guarantee you more likely than not at some point they will differ.
My rule of thumb is: If it's something I can somehow name as an entity (especially in product and not tech talk) it deserves to be its own entity.
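As a toy, framework-neutral illustration of that rule (plain Python classes standing in for UI components): two entities that look identical today are kept separate because they are different things in the product vocabulary.

```python
from dataclasses import dataclass

# Label and Subheader render identically today, but they are separate
# named entities in product language, so they get separate types.
@dataclass
class Label:
    text: str
    size: int = 12
    weight: str = "bold"

@dataclass
class Subheader:
    text: str
    size: int = 12       # identical today; free to diverge tomorrow
    weight: str = "bold"
```

When the design inevitably changes one of them, the change lands in one place instead of forking a shared component that half the app depends on.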
In the last 16 years I've worked in software, the last 10 or so have not included manually copying files to production.
"CI" would generally help keep things tidy and transparent. It would be easier to find out what branches or flags or parameters already exist and which ones are not relevant any more; someone could rename them with an "OLD_" prefix or otherwise clean up, so that you typically pick from a list of options instead of setting a flag by accidentally copying it from one script to another.
It is also easier to have error visualization plugins, or to look around previous deployments or test runs and get a pretty good idea if any new errors have shown up, even if you haven't analyzed the old ones.
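One hedged sketch of the kind of tidiness tooling this enables: a trivial inventory of `--flag`-style options across deploy scripts, so stale flags stand out as candidates for an "OLD_" prefix. The script names and flags below are invented.

```python
import re

def flag_inventory(script_texts):
    """Scan a {name: text} mapping of deploy scripts for --flag-style
    options and report which scripts use each one, so flags that appear
    in only one dusty script stand out for cleanup."""
    uses = {}
    for name, text in script_texts.items():
        for flag in sorted(set(re.findall(r"--[\w-]+", text))):
            uses.setdefault(flag, []).append(name)
    return uses
```

It's a blunt instrument, but even this level of visibility beats discovering a forgotten flag the day someone copies it into the wrong script.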
While something may be a rounding error to the company, you can't get anyone at the political level where it actually is a rounding error to pay attention.
Did you forget a where clause while deleting data on a table, or were you actually on the production server hosting the database?
Any code you write that interacts with a database (or really any production code at all...) should be reviewed before being merged. And developers shouldn't be writing raw SQL commands on a production server. It's hard for me to see this as anything other than an organizational failure rather than your own.
EDIT: Based on the number of downvotes this has received, I can only imagine we have a lot of devs on HN who cowboy SQL in production...holy hell how can any of what I said be controversial.
That said, I'd expect at least a backup of production, then again he said he lost 1 hr of data so it was likely between backups.
I mean I get it, I've made mistakes like this as well knowing I shouldn't have (we had test and prod running on the same server, about 40K people received a test push notification). But the bigger your product gets, the less you can afford to risk losing data.
I was just trying to explain that many businesses like the one I'm at don't do business in tech (mine sells wholesale clothing), with 6 people in the tech department, so understandably there are limits on how far best practices can go. While I would usually consider it a mistake, if you thought you were just making a quick, supposedly read-only query, and it happened to hit some random edge-case bug and crash a DB... sure, you should have tested that on the test DB first, but I'd be kind of understanding of how that happened.
Depends on the business too, if you're a startup-tech company then yea, get your -stuff- together! It's just a lot of business only need their website and some order management, their focus isn't on the tech side of things.
But the backup, hell yeah, you NEED mitigating controls (preventive/corrective) for when you allow people to make changes in Prod that haven't been gone through all the testing phases.
You can't just say "Hire more people" because the current setup is "working" and isn't considered critical to the rest of the business when it isn't tech related.
The DBAs in the company did think SQL should be reviewed in advance, but that's not how our department did it. I think it's arguable that, given that reviewers before or after doing something dangerous may miss things, it's better to establish safe practices and if you do that, then you don't really need a review in advance.
Same with "rm -r" --- run an "ls" first.
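That look-before-you-leap pattern can be sketched for SQL too. This is a toy example using sqlite3; `table` and `where` are interpolated directly into the query string, so the sketch assumes a trusted operator at a console, not untrusted input.

```python
import sqlite3

def careful_delete(conn, table, where, params, max_rows):
    """Count the rows a DELETE would hit before running it, and refuse if
    the number is surprising -- the SQL analogue of `ls` before `rm -r`.
    `table` and `where` come from a trusted operator (they are pasted
    into the statement), so don't use this shape with untrusted input."""
    cur = conn.cursor()
    (n,) = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where}", params).fetchone()
    if n > max_rows:
        raise RuntimeError(f"would delete {n} rows (limit {max_rows}); aborting")
    cur.execute(f"DELETE FROM {table} WHERE {where}", params)
    conn.commit()
    return n
```

A forgotten WHERE clause then fails at the count check instead of after the data is gone; wrapping the whole thing in a transaction (as above, committed only after the delete) covers the crash-mid-operation case too.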
Sure it's not ideal but life isn't ideal. When you have a production problem, you need to fix production.
With the right tools and processes, it's possible to build very successful companies with tiny engineering teams. Developers run queries against prod because there's nobody else to do so. The risk of mistakes is mitigated by the increased situational awareness, developer quality, and communication in 3-person teams vs 30-person teams.
Neither approach is inherently 'wrong', although running a 30-person team the same way as a 3-person team (or vice versa) can have nasty consequences.
It has budget alerting, so the capabilities are obviously there, but it's never been added. Instead, there's just a vaguely insulting guide on writing a script to catch the alert and trigger a shutdown...
Pretty sure AWS still doesn't.
I suggest further reading, starting with Therac-25.
Funny tangent- the breakroom at that job was somewhat near the base stations. Some days around lunchtime we'd have transmission interruptions. The root cause ended up being an old noisy microwave.
My claim is mostly that "human negligence" is so universal of a root cause that it's meaningless. What form of human negligence happened, and how could it be averted in the future?
I think deploys get better with time, but that initial blast of software development at a startup is insane. You literally need to do whatever it takes to get your shit running. Some of these details don't matter because initially you have no users. But if your company survives for a couple of years and builds a user base, you still have the same shitty code and practices from the early times.
So many more interesting and meaningful uses of computing than trying to build a system to out-cheat other systems in manipulating the global market for the express interest of amplifying wealth.
What's crazy is that there were already rules in place to prevent stuff like this from happening - namely the Market Access Rule https://www.sec.gov/news/press/2010/2010-210.htm which was in place in 2010.
When the dust settled, Knight sold the entire portfolio via a blind portfolio bid to GS. It was a couple $bn gross portfolio. I think they made a pretty penny on this as well.
Ah the good old "fk it we'll do it live" approach to managing billions.
A: I'd like to hire some people to improve our processes. It will take time and money and prevent future problems, but you will never notice.
B: Time and money and no new features? No way, I won't approve that.
A: tries to sell it some more even though they are technical and not a salesperson