Another great point referenced in the Power of 10 rules: a requirement is only valuable as long as you can enforce it. With some exceptions, if you can't enforce it with a static analysis tool or something of the sort, it might not belong in your coding standards.
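To make the "enforceable" part concrete, here is a minimal sketch of a rule that a tool can check mechanically. It is only an illustration: the Power of 10 rules target C, while this uses Python's standard `ast` module on Python source, and the 60-line limit, `MAX_FUNCTION_LINES` and `find_long_functions` are assumptions made up for the example.

```python
import ast

# Assumed limit, in the spirit of the Power of 10 function-length rule.
MAX_FUNCTION_LINES = 60


def find_long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES):
    """Return (name, length) for every function whose definition spans more than max_lines lines."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length counts from the "def" line to the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                violations.append((node.name, length))
    return violations


if __name__ == "__main__":
    # Hypothetical input: one short function and one 71-line function.
    sample = "def short():\n    return 1\n\ndef long_one():\n" + "    x = 0\n" * 70
    for name, length in find_long_functions(sample):
        print(f"{name}: {length} lines (limit {MAX_FUNCTION_LINES})")
```

A rule like "write clear code" has no equivalent check, which is the distinction the comment is drawing.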
After skimming the document for 15 minutes, I didn't find a single thing that would be insightful to experienced software developers or their managers. Things like, "Increased complexity means increased errors at all levels of development." It should be possible to write a useful book about software safety, but this is not it.
Safety critical software is all about the process and organization. The process must be so good that you can hire a total newbie to write code and none of his sloppy work would get into the final product.
What you need is a software engineer who writes 20 lines of code and can then justify each of the 100 lines in a resulting spreadsheet that is full MC/DC code coverage at the processor instruction level. Then you need two testing engineers who can do the same for code that others have written. Then you need everyone to go through every change, justify them, rewrite tests and provide reports.
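For anyone unfamiliar with MC/DC (modified condition/decision coverage): each condition in a decision has to be shown to independently change the outcome. The sketch below illustrates the idea at source level only, not at the processor instruction level mentioned above. The `decision` function and the `independence_pairs` helper are hypothetical, made up for this example.

```python
from itertools import product


def decision(a: bool, b: bool, c: bool) -> bool:
    # Hypothetical decision under test.
    return a and (b or c)


def independence_pairs(func, n_conditions):
    """For each condition index, list the pairs of input vectors that differ
    only in that condition and produce different outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for inputs in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if inputs[i]:
                continue  # visit each pair once, starting from the False side
            flipped = inputs[:i] + (True,) + inputs[i + 1:]
            if func(*inputs) != func(*flipped):
                pairs[i].append((inputs, flipped))
    return pairs


if __name__ == "__main__":
    for index, found in independence_pairs(decision, 3).items():
        print(f"condition {index}: {found}")
```

For this decision, `b` and `c` each have exactly one independence pair, and both pairs share the vector (True, False, False). Adding one pair for `a`, for example (False, True, False) with (True, True, False), gives four test vectors in total for three conditions.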
Working like this would be my dream job. Typical software "engineering" has close to no process, no rigor, it's just smart kids pumping code like mad. And then people are upset when major issues arise.
And this is a demonstration of why the process must be so good. (Sorry; it was too perfect to resist)
That's because the purpose of this document is not to advance the process of producing safe software, it is to facilitate the process of assigning blame when something goes wrong, and in particular to help deflect blame from upper management and onto a scapegoat lower down on the food chain. This document is not operative in the forward direction, i.e. it is not the case that people read it and acquire knowledge which they then use to write better software. It is operative in the reverse direction, i.e. when something goes wrong, upper management scours the document to find a place where whatever went wrong deviated from the rules. They then say, "Look, it's not our fault, we had the right rules in place. It is the fault of whoever deviated from the rules."
This is not a phenomenon unique to NASA, BTW. Most large bureaucratic organizations eventually develop this dynamic.
An 'experienced professional' doesn't treat every document like a Bible from which new enlightenment can be gained. Often you need such things as this manual in order to quiet the rabble and give the mob something else to argue about, instead of eating themselves - as often happens when standards are not set and expectations not set and met.
At the time of the Apollo program, MIT was contracted to write the navigation software. If it had been done their research way, we would never have landed on the moon, as they wanted an all approach.
Instead, what was negotiated between the engineers and astronauts was an approach where NASA radio signals carried the navigation stuff, and the nav computers in the command and lunar modules were backups that did not do full navigation for the whole mission.
For an early thorough review of that team, there was a great CACM article I read in 1982-3 as I recall, it was written not long after the first launch in 1981, mentions how they changed/touched 70%? of the software after that launch, and in simulation found a possible error that would do something asymmetric with the SRBs, which would have been fatal to the shuttle (although maybe not the pilots, who in the first 4 flights had SR-71 derived ejection systems).
A previous employer opened an offshoring centre, stuffed it with fresh grads, and 6 months later it was CMM level 5 having never shipped a single line of production code...
Other than the fact you can quantify how developed your employers' processes are.
And maybe CMM-2 is where an org pretends to follow a process, but when crunch time hits throws it out the window, so you get the worst of both worlds (or maybe that's typical of CMM-3, can't remember).
Section 6.4 (p109) is a nice overview of requirements analysis benefits and types. One subsection covers high-assurance requirements, which most projects don't have. Appendix H checklist is decent. 6.6.3 (p126) covers many useful analyses often done at compiler or type system level in CompSci. Follows with basic, but inadequate, explanation of formal specs.
Section 7.4.2 (p141) goes into all sorts of techniques for fault-tolerance. 7.4.4 (p145) talks language considerations for reducing defects. Following sections are limiting complexity & designing for easy maintenance. If only enterprises and their management read that stuff... 7.5.2 onward talks Design Analysis with examples nicely rating benefit vs effort required. First and interface failure lists plus especially design constraints are good as people overlook some subset of them usually.
Section 8.4 (p169) is where coding & testing practices begin. Quite thorough with cost benefit analysis I mostly agree with on coding side. Remember that requirements & design already knocked out most issues with code basically just implementing a precise spec. That's why some get "Low" rating when, in tossed-together coding, they might otherwise have high impact.
Ch 9 (p179) is main section on testing. It could be subsetted but pp 182-183 is nice, exhaustive list of what to look at. 9.4 (184) nice list of testing types. Nevermind, the latter sections are even better. Section 11.1.4 (p210) on languages, compilers, etc is pretty thorough with a sound, uncommon recommendation on language used. :) CASE tools (pp236) has nice list of capabilities for general, SW tooling worth imitating.
pp264 has list of common, human errors. p273 has list of questions to ask about dependencies, esp 3rd party software.
So, contrary to mysterypie et al, I find the document to have about everything you need to know to write software that either doesn't fail or handles failure well. It's meant to get you started on every aspect so you can follow up on it with specialist texts. It also drops literally hundreds of useful heuristics and list items that help you achieve your goal. Many of them are non-obvious. Quite a few would've prevented failures I see regularly on HN from otherwise competent developers. I'm for trimming the fat out of this thing to make it the reference text on high-assurance system development that it deserves to be. Plus, collecting together with it key information it references (esp specialist guides) so people can selectively look up and master pieces at a time.
Great idea, forbid normal developers from officially changing the source code and let an expert handle this important matter. Don't even think about using a modern VC system, IBM has fully integrated Enterprise SCM(tm) ready for you.
Change control is an important part of developing safe software. Arbitrary changes should be avoided. Once a piece of software has reached a level of maturity, it should be subject to a formal change control process. What that level of maturity is will vary by group. It could be when the component compiles, when the CSCI (which may contain several components) is completed, or when the whole program is at its first baseline.
Oh, you noticed that this 500 line copy paste function could be replaced by three 80 line functions? Sorry buddy, before you can change this you have to fill out this form in triplicate and wait for a decision from three middle managers. Maybe we can fit this change in the next release, but don't get your hopes up.
Because as it turns out, writing mission critical, one-off software is difficult. Touching a piece of code that's firmly embedded in the system could have serious consequences. (Of course, it shouldn't; any mistakes should be caught by testing but we all know how that goes.)
These are rules for a place that makes massive, expensive and dangerous things, not a webdev shop.
And don't dismiss the idea: they do use Microsoft Windows in surgical robots (including eye laser robots, which is why I'll never have it).
The way this is (normally/sanely) done is that the control code for the robot runs on its own CPU and does all the failsafe, real-time-constrained stuff. The robot's control code doesn't receive control-level instructions, but messages of the form "move to position X,Y,Z" or "move N units on axis Y" (which it validates and, ideally, against which it also has mechanical limits, i.e. the mechanism itself cannot move to an invalid or dangerous position). The application that gets user input and sends these messages doesn't need to be NASA-level stuff as long as it runs on another CPU.
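A rough sketch of the validation step, to show the shape of it. Real control firmware would not be written in Python, and the `MoveTo` message, the `AXIS_LIMITS` table and its numbers are all hypothetical. The point is only that the controller checks every message itself and trusts nothing the UI side sends.

```python
from dataclasses import dataclass

# Hypothetical workspace limits in millimetres; real limits come from the mechanism.
AXIS_LIMITS = {"x": (0.0, 200.0), "y": (0.0, 150.0), "z": (0.0, 50.0)}


@dataclass(frozen=True)
class MoveTo:
    """Hypothetical "move to position X,Y,Z" message from the UI application."""
    x: float
    y: float
    z: float


def validate(cmd: MoveTo) -> bool:
    """Accept the command only if every coordinate is a number within its axis limits."""
    for axis, (low, high) in AXIS_LIMITS.items():
        value = getattr(cmd, axis)
        # Reject anything that is not a plain number (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        # The range comparison also rejects NaN and infinities.
        if not (low <= value <= high):
            return False
    return True


if __name__ == "__main__":
    print(validate(MoveTo(10.0, 20.0, 5.0)))
    print(validate(MoveTo(10.0, 20.0, 50.1)))
```

Rejected commands are simply dropped here; what a real controller does on rejection (report, halt, enter a safe state) is a design decision this sketch leaves out.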
In more recent years, this CPU and the one that runs Windows and the nice interface have been starting to come in the same box (and, more recently, even on the same die), but until not so long ago these were usually separate boxes, connected through serial, USB or Ethernet. This is still common; I finished the firmware for one such device just a few months ago.
I don't know if you can actually pass through FDA's process with a robot that's running Windows, but frankly, I wouldn't try it. But there's nothing inherently unsafe in running the UI on general-purpose software and hardware, as long as the control code is robust enough in its validation and the interface between the two is correctly designed.
Machines can be rebuilt no matter how expensive they are, but human lives are gone for good. Having extremely rigid policies about accepting changes doesn't sound that crazy to me. What kind of serious professional would want to change the code of a life-critical system the same way they change that of a CRUD webapp?
That's a kind of software development where saying "oh, human lives at risk here, I'll be super-careful" doesn't quite cut it.
Well, there is some merit to it, isn't there? You could make a mistake when splitting up that function, like with any other change you make.
Yes, it sucks. Then again, I'd rather not cost our company more than my yearly salary for a simple mistake.
You wouldn't believe how poorly IT processes are implemented over such critical infrastructure (oil rigs, infrastructure handling Wall Street financial transactions, etc.) in outsourcing companies. I am not even talking about how big a mess the configuration and related documentation are.
But whatever you do change, you really want to make sure it is worth the risk; screwing that up has greater consequences than asking someone to restart the web server.
With Source Code Management, what you're doing is making sure someone owns the process of ensuring well-documented, well-verified code is what becomes part of a system that will be used for the next 40+ years. Those big enterprise systems you deride have proven they can handle decades of use. Can you say the same of your modern VC system?
NASA didn't have a modern VC system by then, it's perfectly reasonable that they'd implement the procedure by hand.