NASA Software Safety Guidebook (2004) [pdf] (nasa.gov)
104 points by Tomte on July 1, 2016 | 38 comments

As a NASA flight software engineer who has created and worked with coding standards for critical software, I would highly recommend Gerard Holzmann's (JPL) Power of 10 Rules [http://spinroot.com/gerard/pdf/P10.pdf]. This is a set of the 10 most important (and thus highly restrictive) embedded flight software rules. It gets straight to the point and, at only 6 pages, is much easier to apply to actual software development than this ~400 page tome.

EDIT: Another great point referenced in the Power of 10 rules: a requirement is only valuable as long as you can enforce it. With some exceptions, if you can't enforce it with a static analysis tool or something of the sort, it might not belong in your coding standards.
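To make the enforceability point concrete, here is a minimal sketch of a mechanical check for one rule in the spirit of the Power of 10 (no recursion). The Power of 10 targets C and real projects would use an established static analyzer; this Python version, the function name `find_direct_recursion`, and the `SAMPLE` source are assumptions made purely for illustration, and the check only catches direct self-calls by name.

```python
import ast

def find_direct_recursion(source):
    """Return the names of functions that call themselves directly.

    Illustrative only: indirect recursion (f calls g calls f) and calls
    made through attributes or aliases are not detected.
    """
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == node.name):
                    offenders.append(node.name)
                    break
    return offenders

# Hypothetical code under review.
SAMPLE = '''
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)

def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''

print(find_direct_recursion(SAMPLE))
```

A rule like this can be run on every commit and fails loudly, which is what makes it enforceable; a rule such as "write clear code" cannot be checked this way.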

I don't doubt for a minute that NASA has highly competent programmers who write bulletproof ultra-reliable software, but they didn't get their knowledge or skill from this tedious bureaucratic tome.

After skimming the document for 15 minutes, I didn't find a single thing that would be insightful to experienced software developers or their managers. Things like, "Increased complexity means increased errors at all levels of development." It should be possible to write a useful book about software safety, but this is not it.

I think many people fail to understand that safety critical software can't rely on "highly competent programmers". No matter how competent, programmers can't be trusted to write safety critical code.

Safety critical software is all about the process and organization. The process must be so good that you can hire a total newbie to write code and none of their sloppy work would get into the final product.

What you need is a software engineer who writes 20 lines of code and can then justify each of the 100 lines in a resulting spreadsheet that shows full MC/DC code coverage at the processor instruction level. Then you need two testing engineers who can do the same for code that others have written. Then you need everyone to go through every change, justify it, rewrite tests and provide reports.
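For readers unfamiliar with MC/DC (modified condition/decision coverage): each condition in a decision must be shown to independently change the decision's outcome. The sketch below works at source level, not at the processor instruction level described above, and the helper `independence_pairs` and the sample decision `a and (b or c)` are hypothetical names chosen for illustration.

```python
from itertools import product

def independence_pairs(decision, n_conditions):
    """For each condition index, list pairs of input vectors that differ
    only in that condition and yield different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*flipped):
                pairs[i].append((vec, flipped))
    return pairs

# Hypothetical decision under test.
def decision(a, b, c):
    return a and (b or c)

pairs = independence_pairs(decision, 3)
for index, found in pairs.items():
    print(index, found)
```

Picking one pair per condition gives a candidate test set. For this decision, the four vectors (True, False, False), (True, True, False), (True, False, True) and (False, True, False) cover all three conditions, compared with eight vectors for exhaustive testing.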

To extrapolate on this, the skillset to identify "highly competent programmers" AND populate your organization with them exclusively either does not exist or is at least unreliable enough that it can never be assumed to be a permanent state of your organization, as proven by all long-term software projects and the organizations that created them.

Including those at NASA.

> What you need is a software engineer who writes 20 lines of code and can then justify each of the 100 lines in a resulting spreadsheet that shows full MC/DC code coverage at the processor instruction level. Then you need two testing engineers who can do the same for code that others have written. Then you need everyone to go through every change, justify it, rewrite tests and provide reports.

Working like this would be my dream job. Typical software "engineering" has close to no process, no rigor, it's just smart kids pumping code like mad. And then people are upset when major issues arise.

Yeah I think it mainly comes down to who the code is serving. When people's lives are on the line, it is extremely valuable to make bullet-proof code. However, when you have to get a product out to users to start fueling growth for VCs.. well then let's get that code out the door ASAP. Extending the process in the latter case in exchange for more robust code doesn't always make sense.

Article speaking at a 30,000 ft level about what you describe.


> who writes 20 lines of code and can then justify each of the 100 lines

And this is a demonstration of why the process must be so good. (Sorry; it was too perfect to resist)

> After skimming the document for 15 minutes, I didn't find a single thing that would be insightful to experienced software developers or their managers

That's because the purpose of this document is not to advance the process of producing safe software, it is to facilitate the process of assigning blame when something goes wrong, and in particular to help deflect blame from upper management and onto a scapegoat lower down on the food chain. This document is not operative in the forward direction, i.e. it is not the case that people read it and acquire knowledge which they then use to write better software. It is operative in the reverse direction, i.e. when something goes wrong, upper management scours the document to find a place where whatever went wrong deviated from the rules. They then say, "Look, it's not our fault, we had the right rules in place. It is the fault of whoever deviated from the rules."

This is not a phenomenon unique to NASA, BTW. Most large bureaucratic organizations eventually develop this dynamic.

I read a somewhat tongue-in-cheek blog post about "blame-oriented software design". I can't find it now, but it had actually useful advice on design techniques like logging and API precondition checks to avoid blame for other people's bugs. :)

A remarkable number of people with impressive resumes don't just lack these insights, they will actively argue with them.

Yeah, it seems sort of discomfiting to hear a claim that 'experienced professionals will sniff at this and not learn a thing from it'. You're not supposed to learn anything from this - it's there so that the standard is set, and professionals - experienced, or otherwise - can refer to the standard, and measure performance against it.

An 'experienced professional' doesn't treat every document like a Bible from which new enlightenment can be gained. Often you need such things as this manual in order to quiet the rabble and give the mob something else to argue about, instead of eating themselves - as often happens when standards are not set and expectations not set and met.

You have to remember how this process was developed..

At the time of the Apollo program, MIT was contracted to write the navigation software. If it had been done their research way, we would never have landed on the moon, as they wanted an all approach.

Instead, what was negotiated between the engineers and astronauts was an approach where NASA radio signals carried the navigation data, and the nav computers in the command module and lunar module were backups that did not provide full navigation for the whole mission.

See this current HN item on writing Space Shuttle software: https://news.ycombinator.com/item?id=12014248

For an early thorough review of that team, there was a great CACM article I read in 1982-3 as I recall, it was written not long after the first launch in 1981, mentions how they changed/touched 70%? of the software after that launch, and in simulation found a possible error that would do something asymmetric with the SRBs, which would have been fatal to the shuttle (although maybe not the pilots, who in the first 4 flights had SR-71 derived ejection systems).

The thing to bear in mind is that not every organisation needs 'NASA-level' quality. IMHO The Capability Maturity Model is a must-read for everyone interested in professional software development. https://en.wikipedia.org/wiki/Capability_Maturity_Model_Inte...

CMM is pure snake oil. You can't swing a cat in some cities without hitting a level 5 certified organization - but is your outsourced SAP implementation or helpdesk really space shuttle quality? Does it need to be?

A previous employer opened an offshoring centre, stuffed it with fresh grads, and 6 months later it was CMM level 5 having never shipped a single line of production code...

What do I look for in this model if I'm merely employed as a software developer?

Other than the fact you can quantify how developed your employers' processes are.

The model is really just an abstract compilation of best practices and probably isn't helpful for just a line-level developer. Personal Software Process (PSP), with its pluses and minuses, is a more concrete implementation aimed at a single developer. You could in theory do it within the framework of your company's development process. Overall it's not that dissimilar from Joel's checklist[1] with an emphasis on metrics.

[1] http://www.joelonsoftware.com/articles/fog0000000043.html

Well, it wouldn't be very fun to be in an organization at CMM-3 as I recall, that's a stage where the organization is serious about it but doesn't really understand why they're following these processes, more of a cargo cult than the real thing. Level 4 is what you really want to strive for.

And maybe CMM-2 is where an org pretends to follow a process, but when crunch time hits throws it out the window, so you get the worst of both worlds (or maybe that's typical of CMM-3, can't remember).

If you are going to stay as a junior or even mid level developer - not much. But if there's even a slight chance you might move on to software architect, management or sales then in my experience it is essential. That doesn't mean you have to like it or promote it. Just having an understanding of what it is and how it might help makes a difference.

I find this more useful and informative, to be honest (While C specific, it can be adapted for other languages as well): http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf

This actually has a lot of good information. It's just that you have to get to it to know that. The sections talking about methods like OOP are a pretty good summary of pros and cons. The PDF Reader p93 "Good programming practices..." has good ones common in safety-critical embedded & some OS development but that I see almost no other C programmers doing. There's some common advice mixed in. That whole section is worth finding the document for, as low-level programmers will find at least 5 things they didn't think about. At least.

Section 6.4 (p109) is a nice overview of requirements analysis benefits and types. is high-assurance requirements which most projects don't have. Appendix H checklist is decent. 6.6.3 (p126) covers many useful analysis often done at compiler or type system level in CompSci. Follows with basic, but inadequate, explanation of formal specs.

Section 7.4.2 (p141) goes into all sorts of techniques for fault-tolerance. 7.4.4 (p145) talks language considerations for reducing defects. Following sections are limiting complexity & designing for easy maintenance. If only enterprises and their management read that stuff... 7.5.2 onward talks Design Analysis with examples nicely rating benefit vs effort required. First and interface failure lists plus especially design constraints are good as people overlook some subset of them usually.

Section 8.4 (p169) is where coding & testing practices begin. Quite thorough with cost benefit analysis I mostly agree with on coding side. Remember that requirements & design already knocked out most issues with code basically just implementing a precise spec. That's why some get "Low" rating when, in tossed-together coding, they might otherwise have high impact.

Ch 9 (p179) is main section on testing. It could be subsetted but pp 182-183 is nice, exhaustive list of what to look at. 9.4 (184) nice list of testing types. Nevermind, the latter sections are even better. Section 11.1.4 (p210) on languages, compilers, etc is pretty thorough with a sound, uncommon recommendation on language used. :) CASE tools (pp236) has nice list of capabilities for general, SW tooling worth imitating.

pp264 has list of common, human errors. p273 has list of questions to ask about dependencies, esp 3rd party software.

So, contrary to mysterypie et al, I find the document to have about everything you need to know to write software that either doesn't fail or handles failure well. It's meant to get you started on every aspect so you can follow up on it with specialist texts. It also drops literally hundreds of useful heuristics and list items that help you achieve your goal. Many of them are non-obvious. Quite a few would've prevented failures I see regularly on HN from otherwise competent developers. I'm for trimming the fat out of this thing to make it the reference text on high-assurance system development that it deserves to be. Plus, collecting together with it key information it references (esp specialist guides) so people can selectively look up and master pieces at a time.

I also liked that bit (p48): "“Cutting edge” techniques or processes may not be the best choice, unless time and budget exist to allow extensive training of the development and assurance teams. Staying with a well understood process often produces better (i.e. safer and more reliable) software than following the process-of-the-year."

Oh yeah! On Schneier's blog, I often post Nick P's Law of Trustworthy Technology: "Don't put your trust in a new method until there's been at least 10 years proving or improving it." Also, "tried and true beats novel or new" if high-integrity development.

Be aware of potential problems if you split control of software configuration management (e.g. having software documents is maintained by a project or company configuration management group, and the source code version control handled by the programmers). It may be difficult to keep the documents (e.g. design) synchronized with the code. Someone with configuration management experience should be “in charge” of the source code, to enforce change control and comprehensive documentation of changes.

Great idea, forbid normal developers from officially changing the source code and let an expert handle this important matter. Don't even think about using a modern VC system, IBM has fully integrated Enterprise SCM(tm) ready for you.

Change control is an important part of developing safe software. Arbitrary changes should be avoided. Once a piece of software has reached a level of maturity, it should be subject to a formal change control process. What that level of maturity is will vary by group. It could be when the component compiles, when the CSCI (which may contain several components) is completed, or when the whole program is at its first baseline.

Oh, you noticed that this 500 line copy paste function could be replaced by three 80 line functions? Sorry buddy, before you can change this you have to fill out this form in triplicate and wait for a decision from three middle managers. Maybe we can fit this change in the next release, but don't get your hopes up.

> Oh, you noticed that this 500 line copy paste function could be replaced by three 80 line functions? Sorry buddy, before you can change this you have to fill out this form in triplicate and wait for a decision from three middle managers. Maybe we can fit this change in the next release, but don't get your hopes up.

Because as it turns out, writing mission critical, one-off software is difficult. Touching a piece of code that's firmly embedded in the system could have serious consequences. (Of course, it shouldn't; any mistakes should be caught by testing but we all know how that goes.)

These are rules for a place that makes massive, expensive and dangerous things, not a webdev shop.

Until somebody tries to drive a surgical robot using your webdev javascript components...

And don't dismiss the idea: they do use Microsoft Windows in surgical robots (including eye laser robots, which is why I'll never have it).

I've worked on such devices before. I don't know how it goes for the surgical robots you mentioned, but in general, this can be done in a sane manner, even using a Windows application for the UI. I wouldn't be partial to operating a surgical robot through a web interface, but only because of the unpleasant UI latency and the inconvenience of doing anything that's not HTTP from it.

The way this is (normally/sanely) done is that the control code for the robot runs on its own CPU and does all the failsafe, real-time-constrained stuff. The robot's control code doesn't receive control-level instructions, but messages of the form "move to position X,Y,Z" or "move N units on axis Y" (which it validates and, ideally, against which it also has mechanical limits, i.e. the mechanism itself cannot move to an invalid or dangerous position). The application that gets user input and sends these messages doesn't need to be NASA-level stuff as long as it runs on another CPU.
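The validation step on the control side could look roughly like the sketch below. Real control firmware would normally be C on a dedicated CPU; the `MoveTo` message, the `LIMITS` table and its values, and the `validate`/`handle` functions are all hypothetical and only illustrate the idea that the controller checks every high-level request before acting on it.

```python
import math
from dataclasses import dataclass

# Hypothetical workspace limits in millimetres; real values would come
# from the mechanism's design, not from this sketch.
LIMITS = {"x": (0.0, 200.0), "y": (0.0, 150.0), "z": (-50.0, 0.0)}

@dataclass(frozen=True)
class MoveTo:
    """A 'move to position X,Y,Z' message as sent by the UI application."""
    x: float
    y: float
    z: float

def validate(cmd):
    """Return a list of rejection reasons (empty if the command is acceptable)."""
    errors = []
    for axis, (low, high) in LIMITS.items():
        value = getattr(cmd, axis)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{axis}: not a number")
        elif not math.isfinite(value):
            errors.append(f"{axis}: not finite")
        elif not low <= value <= high:
            errors.append(f"{axis}: {value} outside [{low}, {high}]")
    return errors

def handle(cmd):
    """Accept the command only if every axis passes validation."""
    errors = validate(cmd)
    if errors:
        return ("REJECTED", errors)
    return ("ACCEPTED", [])

print(handle(MoveTo(10.0, 20.0, -5.0)))
print(handle(MoveTo(10.0, 999.0, -5.0)))
```

The point of the design is that a bug in the UI application can at worst produce a rejected request; the software limits here would still be backed by the mechanical limits mentioned above.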

In more recent years, this CPU and the one that runs Windows and the nice interface have been starting to come in the same box (and, more recently, even on the same die), but until not so long ago these were usually separate boxes, connected through serial, USB or Ethernet. This is still common, I've finished the firmware for one such device just a few months ago.

I don't know if you can actually pass through FDA's process with a robot that's running Windows, but frankly, I wouldn't try it. But there's nothing inherently unsafe in running the UI on general-purpose software and hardware, as long as the control code is robust enough in its validation and the interface between the two is correctly designed.

Also, the fact that human lives might be at stake makes the whole system life-critical rather than just mission-critical.

Machines can be rebuilt no matter how expensive they are, but human lives are gone for good. Having extremely rigid policies about accepting changes doesn't sound that crazy to me. What kind of serious professional would want to change the code of a life-critical system the same way they change that of a CRUD webapp?

That's a kind of software development where saying "oh, human lives at risk here, I'll be super-careful" doesn't quite cut it.

> Oh, you noticed that this 500 line copy paste function could be replaced by three 80 line functions? Sorry buddy, before you can change this you have to fill out this form in triplicate and wait for a decision from three middle managers. Maybe we can fit this change in the next release, but don't get your hopes up.

Well, there is some merit to it, isn't there? You could make a mistake when splitting up that function, like with any other change you make.

The software I write runs on the telephone network, where downtime not only costs us thousands of dollars of unbillable time, but penalties our company has to pay to the Monopolistic Phone Company for violating our SLAs. Even a simple change (like zeroing out a few variables in a certain condition---nothing critical to the call path but would make billing much easier for us) can take months to work its way through the process of two companies (six months so far and still waiting for this patch to be deployed).

Yes, it sucks. Then again, I'd rather not cost our company more than my yearly salary for a simple mistake.

A small error, which, if I remember correctly, although it's not mentioned in this account, was in a tiny change that was not tested as fully as it should have been, took down AT&T's long distance network in 1990, back when it probably still carried most of the nation's long distance calls: http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...

I have the same experience. I was assigned to an account, a big oil company, where one mistake like a typo in a port or a source/destination IP had a huge impact, even when the mistake was fixed within minutes.

You wouldn't believe how poorly IT processes are implemented for such critical infrastructure (oil rigs, infrastructure handling Wall Street financial transactions, etc.) at outsourcing companies. And that's not even getting into how big a mess the configuration and related documentation are.

You don't really need to replace those 500 lines of code with a shorter version on a rover trekking around on Mars.

But whatever you do change, you really want to make sure it is worth the risk; screwing that up has greater consequences than asking someone to restart the web server.

Okay, it can be replaced by three 80 line functions. Where are the unit tests? What's the difference in object size after the code's been compiled? As it's been broken up into three separate function calls, how much more of the stack is it taking up? Does your code rely on excessively complex tricks which would make it more difficult for someone 20 years down the road to understand? What benefit does your change make to the overall process?

With Source Code Management, what you're doing is making sure someone owns the process of ensuring well-documented, well-verified code is what becomes part of a system that will be used for the next 40+ years. Those big enterprise systems you deride have proven they can handle decades of use. Can you say the same of your modern VC system?

The most expensive outage I ever witnessed was due to such an innocuous cosmetic change. In general, I have never been able to notice a clear correlation between what seems likely to fail and what actually fails, therefore I'm fine with restricting updates to working code, however ugly it is.

That's just the usual procedure of any large git based project.

NASA didn't have a modern VC system by then, it's perfectly reasonable that they'd implement the procedure by hand.
