The Power of Ten – Rules for Developing Safety Critical Code (spinroot.com)
287 points by kiyanwang on July 14, 2016 | 149 comments

This is an all-time classic.

Too bad that the article doesn't mention the original paper:


Some interesting HN discussions around applying these NASA coding standards to JavaScript:


The original paper describing and justifying these rules in more detail is at http://spinroot.com/gerard/pdf/P10.pdf, and the official document is at http://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf

Any Debian users interested in Spin ought to be able to `apt-get install spin` sometime soon: https://tracker.debian.org/pkg/spin

It's already landed in unstable; testing shouldn't be too far away.

Thanks! We updated the link from http://www.rankred.com/nasa-coding-rules.

I understand the need for linking to the original source, but I just looked at both and the latter is arguably harder to read.

Yes, but the rankred article hasn't reproduced the full paper; it's cherry-picked and summarised in a way that gives you no idea whether you've missed something worthwhile.

The original paper isn't that terrible to read.

Also the amount of third party cookie crap that site tries to deposit in my browser is crazy.

That's why HN prefers source material, not blogspam stuff like this which adds nothing new and often removes possibly useful information.

It looks to me as though the recommendation that RankRed has titled "Rule No. 5 — Low Assertion Density" would be better described as "High Assertion Density" — the recommendation is for a minimum of two assertions per function (and functions are supposed to be short per rule 4).

The recommendations look good to me and (with one caveat) correspond to rules that I apply when writing C code with a high reliability requirement.

My one caveat is in "Rule No. 8 – Limited Use of Preprocessor" which bans all complex uses of the preprocessor. The problem is that it is common in C to encounter situations where the only way to avoid lots of code duplication is to store a table of facts in a macro definition and use the preprocessor to expand those facts into the relevant bits of code in each place where they are used. So in these situations you face a trade-off between the risks due to complex preprocessor use, and the risks due to code duplication (namely, a maintainer might change a fact in one place where it is used but fail to change it in another). My experience is that the risks due to code duplication are very high, and so it's worth the risk of using complex preprocessor macros to avoid them. The risks can be mitigated by implementing the necessary macros in a structured way to keep the complexity under control: http://garethrees.org/2007/04/24/relational-macros/

Assertions in code cut both ways. Sometimes they can be great, telling you exactly which assumed invariant is violated. However, that doesn't tell you where or how the invariant was violated.

Sometimes assertions are just crutches for lazy programming. Instead of handling a perfectly valid (corner) case, some people just assert that it doesn't happen. Lo and behold, years later, it does happen. And those years later, the context is completely lost. How easy is it to handle the corner case now? Hard. In this situation, instead of simply asserting that it doesn't happen, it might be warranted to just assume that it _can_ happen and handle the case. Then, follow up with coverage testing, where one needs to (actually try hard to) come up with an input that triggers it.

Assertions can also have bugs. The worst is trying to figure out which assertions are really valuable and which aren't.

Shotgun-spraying assertions in the code is not a good strategy in general.

Obviously a recommendation to have a high assertion density does not mean "shotgun-spraying assertions in the code".

If your point is just that the rule could be applied mechanically and without thinking and that would be bad, then that's true, but it applies to everything, not just to the rule about assertions. Someone could apply the "keep functions definitions below 60 lines" rule in a perverse way, by splitting every long function definition at an arbitrary point in the first 60 lines and tail-calling a continuation. It doesn't mean the rule isn't a good one.

In my experience that's what will happen in practice, especially when there is a specific metric attached. Doubly so once there is a mechanically enforced minimum number of assertions.

Wouldn't it be better if we could create programs that were correct by construction (and thus needed no assertions)?

I see — your experience suggests that rules inevitably get turned into thoughtlessly evaluated metrics that then get gamed.

It is a shame when that happens, but when it does, it's not the fault of the rules, it's the fault of the organizational culture. The rules would still be valuable if you used them thoughtfully and with the aim of improving the reliability of the product, not gaming some management system.

I don't think they're intentionally gamed, it's just part of human nature. If you don't have to defend your assertions (because more assertions is presumed better, as per the rules), then people are liable to err on the side of putting more in than they need to.

With the exception of assert(ptr != NULL) assertions, I think most of the ones I've hit have actually been completely duff, and with some thought could just be removed. I dread to think what would happen if I grepped the commit histories of all the projects I've worked on for "removed duff assert".

>(and thus needed no assertions)

Assertions are not meant to be needed. Any bug-free program should behave exactly the same with and without assertions.

On the other hand, in any sufficiently complex algorithm, asserting the pre- and postconditions generally helps readability, maintainability and correctness.

Speaking from experience, the biggest problem with NASA's software engineering requirements (http://nodis3.gsfc.nasa.gov/displayDir.cfm?t=NPR&c=7150&s=2B) is the way that they tend to feed down into non-safety critical projects. It's getting better though.

Sadly quality is highly disregarded in our field.

I dream of the day when not making use of contracts, static analysis and type-based programming is seen as a quality smell, and not something that only a few are allowed to make use of.

It's not "disregarded", it's eclipsed by a need to ship often and ship a lot. There is a tendency to overstate the number of problems that come from bugs, because they are painful to debug. Therefore, somewhat experienced programmers are often too defensive and susceptible to paralysis by analysis (of which uber-complicated and rigid typing is a sort). Even if some advanced typing system would save a few days of debugging after refactoring down the line, it doesn't matter if a competitor ships buggy RoR-based systems a few months earlier.

Advanced typing/static analysis is not a pinnacle of software engineering, it's a set of training wheels. Useful? Yeah. But sometimes you may need to skip them.

> ... it doesn't matter if a competitor ships buggy RoR-based systems a few months earlier.

You hit the nail on the head! However, I would like to mention that competition is only one very important factor.

I'd add that IT is a cost, just as a baker is for a bakery, a cook for a restaurant, a mechanic for a repair shop. Do you want to be the n.1 restaurant? Then you must pay, so that your business is good food, not marketing - notice: the best restaurants don't even have good websites, as the food speaks for them. Yes, you must pay a lot to get the best cook, who will take all the time and require all the staff he wants/needs to prepare the best recipes, and so forth. We think we are in a better business just because IT is something relatively new, which not everyone fully understands, so we think we are the best in everything we do, because we bring innovation. Sadly, we are not.

The current reality is that most companies are average or below average, with no great plans to develop further or to become the next IBM, NASA, Intel, or whatever. Most of them want to survive, some want to make a lot of money fast, and who knows, maybe in 5-10 years sell the company for 1, 2, 10 mln USD. As long as the mentality is driven by the idea of the acquisition, not by ambition, things won't change - most effort will be spent hiding the pile of s* that employees produce, trying to draw the right numbers on the right graphs. And even IBM, NASA, and Intel fail. They do. The difference lies in the ability to stand up and get on their feet again - to learn from failures. They have also wasted huge amounts of money - it's documented everywhere, every time one of their projects gets shut down. However, they have a goal in mind: to be the n.1 at whatever they do.

> Do you want to be the n.1 restaurant?

Depends on what you mean by "n.1". The "n.1" restaurant in terms of monetary value is McDonalds. They do market (heavily). They don't pay for the best cook (not anywhere near). And they don't have the best recipes. But they are wildly successful. Conversely, michelin star restaurants go bankrupt all the time.

There's nothing wrong with aspiring to produce michelin star code. But as McDonalds shows, you can be incredibly ambitious and long-lasting with quality being at best a secondary priority.

> The "n.1" restaurant in terms of monetary value is McDonalds.

No, the No. 1 franchised chain of restaurants is McDonald's; there's a pretty big difference between the No. 1 restaurant and the No. 1 chain of restaurants.

> But they are wildly successful.

The chain has, over its lifetime, been wildly successful, a success initially fueled by a reputation for extremely high quality, consistency, and efficient service compared to the greasy-spoon diners that were the mainstay of low-cost restaurant food at the time of their initial growth; the recent history is something a bit different (they have, IIRC, recently ended a period of year-over-year drops in same-restaurant sales by, among other things, significantly cutting the number of restaurants, particularly the corporate-owned rather than franchised ones.)

> Conversely, michelin star restaurants go bankrupt all the time.

So do individual (independently owned and operated) McDonald's restaurants.

> But the goal of most companies isn't critical accolades, it's to make money.

Yeah, but McDonald's didn't make more money than other restaurants by skimping on quality, they made more money by being successful based on quality, and moving quickly with that success to branch out to other locations rather than merely saturating a single market, and by offloading much of the risk (and some of the rewards) of expansion to fee-paying franchisees, which is what differentiates them from single-location restaurants that are trying to extract maximum value from a particular local market.

Trying to generalize from this to say anything about succeeding in the software market is probably pointless, mostly resulting in very bad analogies.

McDonald's is the Ford of restaurants: automation (using procedures and equipment) to reduce cost and keep a consistent, relatively high level of quality.

McDonalds has an executive chef with a high-end restaurant background.[1] He heads a team of other chefs. Their job is to design and test recipes that can reliably be implemented by following the directions.

[1] https://en.wikipedia.org/wiki/Dan_Coudreaut

Aside from marketing, which you mentioned, the quality McDonald's pays for is standardization.

Standardization is why chain restaurants exist at all. They give people a way to get a meal they know is going to be of a certain quality at a certain price with a sufficiently low probability of being surprised. Surprises are, on the whole, generally negative: Good restaurants are few and far between, especially at the fast food price point, and a sufficiently bad surprise can have health implications. There's a reason one of the first chains was White Castle: White implies purity and a standard of cleanliness, as opposed to the local greasy spoon cafe where the only assurance of quality is that it hasn't been shut down yet.

So McDonald's pays for processes and materials that it can blast out into a million little restaurants, all the same, secure in the knowledge that minimum-wage workers can be sufficiently skilled and motivated to carry out those processes and use those materials the right way. Doing anything better might lead to a much improved experience in some restaurants but it will reliably lead to total disaster in others, which is utterly contrary to the business model.

By that standard, McDonald's is fairly high quality.

> paralysis by analysis (of which uber-complicated and rigid typing is a sort).

How does "paralysis by analysis" have anything to do with typing? And how is typing uber-complicated to begin with? This is ultra-basic logic... And why exactly do you think dynamic languages are exempt from typing?

How does any of that have anything to do with a competitor shipping "buggy RoR-based systems a few months earlier"? You can do all kinds of shit even in more statically checked languages if you really don't care about quality, and ship buggy products sooner too.

How should a system that proves some aspects of your program are free of some class of defects be considered "training wheels"? Do you also consider compiler warnings a bad thing for time to market, and so they should be disabled? How can you gain any time by not using an automatic checker? Is your brain faster than a microprocessor at repetitive tasks? This makes no sense. You do not abstain from putting on your seat-belt just because you think you know how to drive, nor do you disable all the safety features of whatever equipment just because you kind of think, without even the beginning of a reasoning to back it up, that you are going to do things "faster". And all the while, the logic telling you that what you are doing is more dangerous is immediate and irrefutable, and the studies on the cost of discovering bugs at different phases of development are clear.

Arguably there are some meta-programming aspects of Ruby which can let you program faster than with a language without equivalent features. But in no way can we deduce from this that "advanced typing/static analysis" are "training wheels". This is one of the things that is just so wrong in our industry that I sometimes don't really know what to answer. I might as well see a claim that the earth is flat and try to convince the claimer otherwise, but I have the feeling that people making such claims are not amenable to logic to begin with.

> You do not abstain from putting on your seat-belt just because you think you know how to drive, nor do you disable all the safety features of whatever equipment just because you kind of think, without even the beginning of a reasoning to back it up, that you are going to do things "faster".

Lots of people do.

Yes, they often turn out dead, or missing limbs. But it's a fact that they do, to the point that industries have to police and punish people that don't use safety gear.

Developers have a similar mindset, just way less dangerous.

> Developers have a similar mindset, just way less dangerous.

Less dangerous? Not really.

- Therac-25

- Ariane 5

- Airbus A400M crash

- JAS 39 Gripen crash

- Chinook helicopter crash

- Patriot missile failure

Just a few examples, as I can't be bothered to provide an exhaustive list.

In my experience weakly typed languages are a trade-off, where you get something to run at all sooner but to run correctly only later. It all depends on the sort of product you're making whether that's worth it. When you're building something long-lived, there is no benefit to using a weakly typed language. Yes, initial coding takes less time, but since 70% of your cost is maintenance, it is dwarfed by the time spent debugging.

Good points, but mainly I was referring to the need to maintain extensive documentation purely for the satisfaction of bureaucratic review (though, admittedly, that point was only in my head).

Example: All projects that are to be released must demonstrate that they are maintaining one of these:


Type based programming does not quite work in weakly typed languages.

I think you should take what he said as saying we should get rid of those.

Could be implied, of course. Still, there is a noticeable segment who think otherwise (eg a large portion of the Clojure community).

> get rid of those

There are many legitimate uses for scripting languages. It would be impossible and undesirable to get rid of them.

Scripting languages are just that, for plain scripts, not full blown applications.

that's my point. you don't get rid of the tools you use incorrectly, you learn to use them correctly.

I was talking in the context of making full stack applications, not scripts for copying files around, automate software installation and such.

And even there I am most likely to use a ML derivative language than Perl, Python or something else.

What is the correct use of node.js? I claim there is none.

weakly typed != scripting.

Even in C you can e.g. use singleton structs to distinguish different semantic types.

Didn't you use to be able to use typedef scalars for that? Somewhere along the way that got lost (some name-mangling issue trumped it).

> Didn't you used to be able to use typedef scalars for that?

Oh, I was so disappointed when I realised that those aren't considered distinct types. That ruins half the use of typedef.

Which I am not a big fan of.

I'd rather use the Algol and ML families of languages, when given the option.

For me, I'd frame NASA's approach to software quality as a core value of the organization. The problem is that determining where a mission critical level of attention to detail is not necessary requires a mission critical approach because failure there means failure to develop mission critical code to support a mission critical function.

So the process of determining what does and does not require mission critical rigor has to be handled with mission critical rigor. Adding such a process is just one more point of failure. It requires the same attention as just writing the code "right" but is more subject to political maneuvering and other unproductive mental overhead and subjectivity.

Looking at the engineering requirements, there are guidelines here that I've seen medium to large organisations use simply to manage the project and code base.

Some of them seem to be universal, such as static code analysis; sticking to a well-defined subset of the language (deterministic behaviour); sticking to an overall coding and naming convention; modularisation of the code-base; and avoiding threads mutating the state of shared objects (share state between threads only in a single location, e.g. IPC). A similar thing happened with the Java API, where the designers attempted to make the standard library thread-safe. They soon discovered that was a really bad idea and backed off very fast.

At work I joke with my colleagues about the Java framework they use: they've now moved back to static enums and static control IDs for web pages instead of generating them dynamically on the fly. It's at the point where I feel like I'm programming Win32 GUIs all over again.

Regarding Rule 4 - No Large Function - I think John Carmack had a pretty good explanation on why that's not really a good thing:


>The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further. It disallowed both subroutine calls and backward branches, except for the one at the bottom of the main loop. Control flow went forward only. Sometimes one piece of code had to leave a note for a later piece telling it what to do, but this worked out well for testing: all data was allocated statically, and monitoring those variables gave a clear picture of most everything the software was doing. The software did only the bare essentials, and of course, they were serious about thorough ground testing.

>No bug has ever been found in the "released for flight" versions of that code.

I was pretty excited to learn more about that style, and then I came across...


...but I'd still like to know more about this style of programming, and languages that facilitate programming in that manner.

If I read that article right, it sounds like the software fault attributed to the crash is that the "control laws" that the software implemented were improperly chosen, not that the software did not function as intended.

Which is to say, a serious problem, but if we take "bug" to mean a difference between designed function and realized function, not a "bug", and therefore not something which should tarnish the record of the coding style to which the bug-free nature of the Gripen software is attributed. The problem seems to have been at a different level than coding.

You're correct to point out that the coding style isn't to blame for the software fault. And IMO the last paragraph of the article hints at the most probable fundamental cause.

But I just don't buy that this was not a software fault. It clearly was a case of a faulty software specification.

> The problem seems to have been at a different level than coding.

This makes me queasy. Software engineers working on these sorts of systems -- or at least a few senior ones -- should understand enough of the domain to say "this spec is not adequate" or "bad things might happen under conditions xyz; what's the correct behavior in these cases?". And of course all software should notice and then degrade gracefully when assumptions are observably violated.

To absolve the software engineer of any responsibility for understanding the context surrounding his software is to wrongly assume there's not much to software engineering beyond programming to someone else's spec.

I absolutely see it as a software problem and a software engineering problem, just one orthogonal to considerations of the value of the coding style adopted in preventing code that deviated from specifications.

Does anyone know where this quotation comes from?

It's very hard to formulate a clear and easily enforced rule that restricts complexity. "No large functions" is one way to do it.

Really long functions that have very low complexity (very few nested control structures, ideally none) and do not contain any non-trivial loops are easy to reason about. They are essentially the one case I'm familiar with where longer is not worse after a point.

I posted a write-up of Dr. Holzmann's talk "Mars Code" which he gave at USENIX Hot Topics in System Dependability '12 (http://www.verticalsysadmin.com/making_robust_software/), about how NASA/JPL writes and tests software that survives the rigors of interplanetary travel; planetary entry, descent, and landing; and operation in another world. :)

This is a very interesting talk about writing the software for the Curiosity Mars mission by the head of Software Reliability at JPL, Gerard Holzmann, who also wrote these 10 rules:


Nice tool setup. I would like to work under those circumstances.

If we're talking about high reliability code, one thing claimed about Haskell, is "if it compiles, it has no bugs".

How close is it to the truth? And how close are we to having that kind of capability for real-time programming (even assuming we're willing to forsake portability, community, and maximum efficiency to some extent)?

Haskell is better than most languages at achieving this but still far from the truth. Type systems help a lot but they can't help you in every way. I write Haskell commercially and I can tell you there are many times our test suite catches an issue not caught by the compiler. (In fact just today we found a bug in which a particular maximum function returned the maximum of all historical values of a particular time-varying quantity but forgot to include the current value. The test suite, not the compiler, found the bug.) But it's also true that with Haskell you can have a smaller and less extensive test suite while maintaining the same level of confidence about your code.

Another thing is, if you want to write high reliability code in a general-purpose language, you have to have the discipline not to use all language features because a general-purpose language is by default not safe enough. This article basically outlines a subset of the C language for writing high reliability software, and I imagine if Haskell is used for similar software, a large number of language features and functions in the base library would need to be banned.

I doubt that statement can be true in any programming language. Bugs come in many shapes and a functional language can prevent a fraction of it. You can have logic errors, misunderstood requirements, wrong database queries etc, the list is pretty much infinite.

The https://en.wikipedia.org/wiki/Curry–Howard_correspondence says there is a correspondence between any logical statement and a type in a sufficiently advanced type system. So yes, there are languages aimed at eliminating logic errors, and Haskell goes pretty far(though its type system isn't quite advanced enough).

If I grasp that right, it says that if something is provable mathematically, then you can write a program for it with an equal meaning.

I still don't see how that prevents user error. My question is then how do you mathematically prove intent?

Also, how far should "sufficiently advanced" be?

We already have a tool for that in the form of unit tests, and types do help if the subject is abstract enough.

You can't prove intent.

However, you can consider the formal specification of your intent (the type), to be an example of fully declarative programming.

Since you write your type without any care about how it might be executed -- the holy grail of abstraction -- you are less likely to make errors.

When using a type system that isn't capable of fully specifying what you're doing (i.e. Haskell), you are of course subject to making implementation errors within the range of possible programs that type check. But in practice it's usually enough to catch the sorts of mistakes that you're likely to make.

> However, you can consider the formal specification of your intent (the type), to be an example of fully declarative programming.

The vast majority of formal specification and verification tools (I believe Coq and Agda are the only exceptions, and they are rarely used in the industry) express the intent directly in logic rather than in the types (HOL in Isabelle and HOL Light, ZFC+LTL in TLA+ and maybe Scade, ZFC in Alloy, and a typed set theory in SPARK, I believe).

> But in practice it's usually enough to catch the sorts of mistakes that you're likely to make.

I think this claim is supported by little evidence. Most non-dependent type systems are extremely weak (or require cumbersome encodings) when it comes to expressing all but the most trivial properties (e.g. they can't even express that the value returned from a max function is indeed maximal, let alone more elaborate properties). Their expressive strength is that of a finite-state machine. How much that prevents real bugs requires empirical study.

You need more than an advanced type system. You need a declarative constraint-solving and proof system. Instead of telling the compiler how to perform a task, you would declare the assumptions and the desired relationships and then, with the help of the proof system, determine what implementation fulfills exactly those constraints.

Or you take the more practical Monte Carlo-like approach - i.e. fuzzing.

An expressive and ergonomic type system goes a very long way toward codifying that intent.

For example, it's quite obvious what the intent of this Idris function is: `f : Vect n (Vect m a) -> Vect m (Vect n a)` (where `Vect` is the type of lists of a given length).

Curry Howard just says that computation and proof simplification work similarly. Correctness has more to do with how rich your type system is and with how you take advantage of it during modeling (to rule out undesired program states as ill-typed)

I came here to read stories about floating point rounding errors and was surprised there is no mention of them. Maybe all failsafe code is done with integer math...

Haskell's type system protects from a certain class of bug, but it can't prevent logic errors - something like using < instead of > won't trigger a build error, but it certainly is a bug.

Sometimes you can express the relevant relationship in the type system. Just adding types to an existing program won't tell you much, but if you work with the type system it's possible to move a lot of your business logic there.

The stricter and more advanced a type system is, the closer this comes to the truth. Of course, no compiler can check that the function you implemented is the function you wanted to implement, but it can check that it reliably computes the function you typed in.

The question hinges on how many of the errors in your program express themselves as type errors too.

>How close is it to the truth?

Often a lot closer than in other languages, but that's still not close at all. And it depends a lot on how careful your code is. If you want to rely on the type system more heavily, you have to do more work upfront. E.g., using a simple SQL library that just takes queries as strings vs building your queries in a type-safe EDSL. You can write sketchy Haskell just like any other language, but you're likely to be much more aware of it.

>And how close are we to having that kind of capability for real-time programming?

Last time I looked it seemed like real-time Haskell still had a fair way to go. Presumably because it has garbage collection, lazy evaluation by default and other things that make it hard to prove time bounds. It can also be really hard to optimize sometimes IMO. But I doubt the features that make Haskell reliable are dependent on the features that don't suit real-time programs. I don't know of any pure functional languages with similar type systems, though. https://en.wikipedia.org/wiki/List_of_programming_languages_... has a pretty small list.

These rules are not just about correctness in the sense that "This code's computations are accurate" but also in the sense that "This code's performance is constant and predictable". Haskell helps with the first sense, but not really the second, from what little experience I have.

I imagine you will find this paper interesting

"Haskell vs. Ada vs. C++ vs. Awk vs. ... An Experiment in Software Prototyping Productivity"


I would imagine, though, that "prototyping" and "safety critical" are on opposite ends of the spectrum...

That paper was written in a time when languages like Haskell only got funding money from DARPA if it was for researching "prototyping".

Hence the way it was written.

It's not particularly true.

What they're saying is that Haskell (and Scala, and others) have such good type systems that your data should never be wrong.

That has nothing to do with logic, so if your code should be `if (foo)` and you write `if (bar)` or `if (!foo)`, there is no way for a type system to catch that.

Additionally, there are a lot of constraints you can't capture with type systems. For example: given type Foo, type Bar extending Foo, and types * extending Foo, my type should extend Foo but not be or extend Bar.

Even just with raw data, imagine if your function required, and was only valid if, you passed in a prime number, or an even integer between 2 and 22, or a sentence containing 3 or more words none of which are Chinese, etc. It gets hard very quickly.

If you can do the wrong thing perfectly, then you have a bug.

It's more "If it compiles it probably does something that someone would find useful but possibly not what you actually want."

How does that fixed upper bound on loops work? If there's an array of dynamic size that needs to be looped over, how do you do that?

I can't speak for NASA, but in general if you are writing extremely safety critical code you simply don't use dynamically sized things. In addition to safety, dynamically sized objects cause issues for the real-time constraints on these systems as well.

There are no dynamically sized arrays. All memory is allocated at startup and that's it.

This is the reason why you see power of two limits in safety critical software. For example "radar is able to track up to 256 simultaneous objects".

    for (i = 0; i < 256; i++) {
        if (NULL != object[i]) {
            // do stuff
        }
    }

#define MAX_NUM_OBJS 100

for(int i = 0; i < num_objs && i < MAX_NUM_OBJS; i++)

Combined with the no dynamic allocation rule, you can guarantee that the number of elements in the array must be less than some maximum since you have limited dedicated space for storing them.

There might not be a valid object at every location in the preallocated memory.

Say you have a list of visible satellites, and the number of them can change as they go in/out of view. There's never more than 100, but sometimes there's 50 and sometimes 60.

I supposed you could have an object with an INVALID flag, which the loop can use in its logic?

That's a good point. If you do have a situation where the number of objects is somewhat dynamic, you could use layers of protection such as (re)initializing the unused objects to a known safe state, adding a flag as you mentioned, adding sentinel values to the end, and at the very least you do have the guarantee that you aren't running off into memory that was actually dedicated to another purpose and contains fundamentally different data. If you are doing something really dynamic, you could embed two linked lists (or an embedded list and a free index stack or even two packed stacks). The reason for stacks over lists is that lists can accidentally become cycles.

My take on it: if a task can only allocate an area of memory once, at initialization, then you can use the allocation size to put a constraint on the iterators based on the data type size.


  const size_t taskSize = 256;
  Task *task = (Task *) malloc(taskSize);

  // Treat the task's memory as an array of ints
  int *intPtr = (int *) task;
  size_t intCount = taskSize / sizeof(int);

  for (size_t i = 0; i < intCount; i++)
      // do stuff with intPtr[i]

Agreed, this puzzled me as well. I also thought it was curious that we didn't get dynamically sized arrays in C until C99; before that you had to malloc on the heap. It makes a little more sense now, having read this.

Found the rule `There shall be no use of dynamic memory allocation after task initialization.`

They must use pre-allocated string buffers for some of the tasks. Though I figure if you're doing a flight controller for a jet fighter or satellite you don't need a lot of string parsing.

dynamic memory allocation is very much frowned upon in embedded systems, not just for strings.

Everything should be static and deterministic at all times. This is the easiest/only way to ensure you have no resource issues.

You should always (statically) allocate for maximum/worst case.. because you have analysed your worst-case, haven't you?

That rhetorical question touches on a very important point, and I agree 100%. Dynamic memory allocation allows programmers to absolve themselves of considering worst case (memory usage) scenarios. As long as they handle allocation failures, their program should function predictably under all memory circumstances (in reality, we know this isn't always the case, especially when the failure occurs deep in the call stack).

The problem is predicting the memory circumstance itself. The way your program behaves is essentially tied to something external, which makes it that much harder to predict the overall behavior. This scheme makes sense in environments where you really have no idea how much memory will be available to you, such as a conventional PC program, but not on a system that has a single dedicated purpose.

One could make the argument that if the program gracefully handles an allocation failure, then there should be no problem. My counter argument would be that in many embedded systems, it's better to have a predictable, but low failure threshold as opposed to a potentially higher, but unpredictable failure threshold.

Another analogy: I would rather have a program with a 100% chance of hitting a bug under a well understood circumstance rather than a program with a 0.00001% chance of hitting a bug under an unpredictable/unknown circumstance.

In an embedded environment, you have complete control over the system and all aspects of it (or at least you should).

You often need to understand the behavior of your system under all circumstances, and that becomes _much_ easier to do when you operate with fixed size data structures because you now know exactly when you will run out of memory.

Consider a hypothetical embedded system where you’re creating a sub-module which must handle external events by means of a message queue. If you know that you can only have three possible events, you can statically allocate your message queue to be 3 deep to ensure that you will never drop an event. If you were to use dynamic memory allocation, you can’t make that guarantee because you don’t know what the other components of the system are doing (and how much memory they’re allocating). Even if there are no other allocations taking place, you still can’t guarantee that yours will succeed due to the possibility of fragmentation.

Statically allocating your buffers ensures that if the program can be loaded in the first place, you can predict with 100% certainty your program will be able to handle those 3 events.

true, embedded systems (and in theory server s/w) are different in a number of respects to your typical desktop-level s/w.

Embedded systems typically are not 'manned', they have to handle all issues themselves and continue providing service. Note that this does not mean "no resets".

Resets should be designed for because they will happen.

All effort should be made to make errors deterministic as (like you say) they will become a real time-suck.

IMO, servers should also be considered embedded systems and designed like this, but unfortunately the culture around server software is for very dynamic, very resource heavy and very inefficient, non-deterministic s/w.

How do you analyze worst case ? don't you need to know what calls what, up to what depth, and that's dynamic by nature ?

It all comes from the defined requirements and specifications.

i.e. "You shall handle x messages in y milliseconds."

from that, you derive your worst case buffer size given that you can service that buffer every 'z' milliseconds at most (Note, that involves a hard-real-time requirement as it is a bounded maximum time).

As said by TickleSteve, you define the worst case rather than analyze it.

As for the stack usage requirements, it seems like this could be determined statically by some parametric process, but I’m no expert on this.

Does anyone see a reason why there couldn’t be some algorithm to statically analyze some code to derive the worst case stack usage?

For example, take every function and assume that every variable declaration will be required. Add them up. Then, follow every path down the call tree while adding up the required stack for each call.

...in a typical GCC based system, you can for example use "-fstack-usage" and "-fcallgraph-info" to determine a worst case.

(tho that takes a bit of analysis, there are tools around that can automate this).

Almost all of the programs that I have ever read parse strings far too often. Almost all scientific and engineering software can eliminate it except at the UI and database.

I have written software for mobile network base stations where there were hardly any strings. I think it was only process names and log messages, both of which were string constants.

I imagine unmanned space software is similar; there is little need for a text UI on the spacecraft itself.

What counts as a task?

task == thread of control.

(Seriously? downvoted because I gave a direct, true answer to a question?)

I'm afraid I don't know what you mean by that either.

It must unknowingly be embedded-specific terminology, because I knew exactly what he was talking about.

So, what is it, then?

what is a task/thread/process?

where do I start.... a processor naturally has one thread of control, i.e. it executes instructions sequentially.

To simulate multiple things happening at a time a context-switcher periodically saves and restores the current processor state and switches to another sequence.

Each sequence is a thread-of-control... commonly known as a thread (or task, or process).

...and this isn't embedded-specific, just computer-science specific.

I know what a thread is. I thought you meant something special with "thread of control".

Recently there was a thread here where many argued against some company owner that has set the objective that the code created by his employees should be as simple as possible.

Often the arguments boiled down to: It looks only clever if the skill level of your employees is too low, you should work on that.

When I look now at the rules that NASA enforces, it seems to me that their objective is to prevent bugs by not allowing code that is overly complex and I doubt it's because their engineers are not clever enough.

In fact I believe that it takes a lot of cleverness to create code that seems so simple that you don't even need to read the documentation.

It seems that people often confuse complex with clever or impressive. Impressive is unifying the falling of an apple with the orbiting of the planets into three simple universal laws.

Complexity is an emergent phenomenon. It occurs when there are too many interactions to track. To make something simple is to take all of that complexity and to understand and model it as the interaction of a few simple rules repeated over and over and over.

Great analogy, I agree!

I think the compiler is as much responsible for secure code as the code itself; tools should be updated regularly with these core concepts in mind. These 10 points, I figure, are just starting points. There are a lot of other places where the flow of a program can be changed by abusing simple keywords such as volatile, which is where correct use of the compiler serves as a nightwatch to stop any undefined behavior from crossing the wall.

My first two years at undergrad would have been so much easier if I had read this piece of document.

I remember using a quadruple pointer (char ****) in one of my systems classes.

Also, after having dabbled in Haskell, I know the value of static analyzers. That is like good practices on steroids. And you start to avoid these errors, and start developing better practices.

Change the environment to change your habits.

All of these make perfect sense, but I would have allowed tail recursion (and only that). Its properties are well-defined and boundedness can be easily proven (also it is easy to distinguish tail and non-tail recursive calls).

They're using C. Many C compilers don't optimize tail recursion away, and even for those that do, it is not generally obvious from the C syntax whether something is a tail call or not.

> a function shouldn’t have more than 60 lines of code.

I started having headache spikes after reading this sentence that refers to keeping functions "short", in the article about safety critical programs.

Keep in mind that we're talking about C code here. Error handling etc. blows that up quite significantly.

I agree that making functions shorter does not automatically make them safer, but I do see how having a well designed "main" function in the ballpark of ~80LOC which calls another 10 auxiliary (and inlined) minor functions of ~15-25LOC would be easier to reason about than an equivalent function of ~220LOC.

The fact that you could satisfy the business requirement with a single function of ~140LOC does not mean the other approach is more verbose, but that you are only thinking about the happy path, and a bunch of edge cases are left undefined.

While the embedded world does have some hard depth limits, my experience has been that the readability benefits of breaking up a function are worth it. This seems especially true when the nested functions are carefully named.

In rule 5 (about assertions), why the comparison to true? Why is "if (<boolean expression> == true)" preferable to "if (<boolean expresson>)"?

I assume it's a habit to make sure you communicate to the type checker: "I expect this expression to be boolean".

Rule 0: don't write safety-critical code in C.

Nice snark, I guess, but... did you actually read the article? One of the reasons for C is that the tooling around it gives you much more support. For safety critical code, you need that, because programmers make mistakes, no matter what the language.

So if you write safety critical code in Haskell, say, you won't have a large number of the holes that C gives you. But for safety critical systems, their rules say that you can't write code that has even the possibility of falling into those holes. That makes C much safer. That leaves you with a safer C, plus tooling, vs. Haskell (or whatever your choice is) without the tooling.

So, snark aside, what is your actual alternative that you recommend?

> did you actually read the article?

Yes. I used to work with the person who wrote it.

> One of the reasons for C is that the tooling around it gives you much more support.

That's because C is an unsafe language, so to write safe code you need that tooling. If you start with a safe language the need for that tooling simply evaporates because many of the mistakes that the tooling is designed to catch are simply not possible.

> For safety critical code, you need that, because programmers make mistakes, no matter what the language.

That's certainly true, but many of those mistakes are mistakes in the rendering of the desired semantics into code. Neither rules nor tooling can catch those kinds of mistakes.

> But for safety critical systems, their rules say that you can't write code that has even the possibility of falling into those holes. That makes C much safer.

No, it doesn't. All of the rules that are not specific to C can be equally well applied to any other language.

> So, snark aside, what is your actual alternative that you recommend?

Good heavens, just about anything is better than C if what you care about is safety. Ada. Java. Scheme. Haskell. OCaml. Python. Go. Swift. Common Lisp.

Well, of those, Java, Haskell, and Lisp (at a minimum) are completely unsuitable due to allocation and, worse, garbage collection. In a hard real-time system those are a deal breaker. (Usually. You can have deterministic allocators and guaranteed-worst-response-time garbage collectors, and then you can still prove you can meet your timing requirements, but it's not easy. Or, you can try to write in those languages such that they never allocate after initialization, which might be possible, at least for Java.)

Python? Can't you get "method not supported" (or whatever the technical term is within Python)? That's not safety, by any stretch of the imagination. Or is there some way to prove that it can't happen, by some kind of (ahem) external tooling?

Ada I could accept. Go and Swift might be suitable. I don't know enough about Scheme or OCaml to comment.

> completely unsuitable

You are wrong. And you even admit that you are wrong in your own comment:

"You can... but it's not easy"

There's a big gap between "completely unsuitable" and "not easy."

But you are wrong for an even more fundamental reason: the matter at hand is not hard-real-time code, it is safety-critical code (re-read the title). Those are not the same thing. Safety-critical code can be and has been written in Common Lisp. Moreover, an effort to port the Common Lisp code to C++ failed. So it is manifestly untrue that Common Lisp is "completely unsuitable" for writing safety critical code.

The title says "safety critical code". But the article says no allocations, because of the unpredictable (time) behavior. So the environment that the article describes is safety-critical, and part of that is predictable (guaranteed) timing.

> You are wrong. And you even admit that you are wrong in your own comment

Do you realize that you're agreeing with me? I put that part in parentheses in for a reason...

> Do you realize that you're agreeing with me?

No. I am not agreeing with you. You are wrong about everything, even the real-time part. You are even wrong when you say it's not easy to write hard-real-time code in GC'd languages. It is no harder to do that in GC'd languages than in non-GCd languages. In fact, it's easier. You are correct that it is hard to write a hard-real-time GC. But that job only has to be done once, and it has already been done so at this point it's a sunk cost. Writing non-consing code is also not hard, certainly no harder than writing non-consing code in C.

Not only is Common Lisp suitable for writing safety-critical code, it completely dominates C in every possible way. And this is not just speculation, it has been actually demonstrated in the field.


> But the article says no allocations, because of the unpredictable (time) behavior

You're wrong about this too. It does say it's because of the unpredictable behavior, but the "time" part is your own extrapolation. The article itself says nothing about hard-real-time. The word "time" does not even appear anywhere. (Not that this matters because it is no harder, and often easier, to write hard-real-time code in Lisp as it is in C.)

> Writing non-consing code is also not hard, certainly no harder than writing non-consing code in C.

I call BS on this claim. To write non-consing code in C, what functions do I need to avoid? malloc, calloc, and realloc. To avoid writing consing code in Lisp, what functions do I have to avoid? Can you list all of them for me?

> Not only is Common Lisp suitable for writing safety-critical code, it completely dominates C in every possible way.

BULLSHIT. Your extreme departure from reality makes it hard for me to respond any more politely than that.

Earlier, you cited a safety-critical system that had been written in Lisp. (I believe that you've cited it before.) They tried to re-write it in C++, and failed. You know what else I read on HN today? Banks have COBOL code that can't be re-written in something better and/or more modern. But it's not the wonderfulness of the language that's preventing it; it's that nobody understands what the code does well enough to be able to write a true functional equivalent. So, you've got a nice example, and a nice story, but it may not prove the point that you're trying to prove.

Now, what else have you got? How many safety critical systems are written in Lisp? How many of those are hard real-time systems? What are their defect rates compared to people writing MISRA C? And, if Lisp is so wonderful compared to C, how come so many more safety-critical systems are written in C? Is it because the entire profession is blind, and only you and a few enlightened ones realize how wonderful Lisp is? Or is it because people that have done this for decades see more problems with Lisp than you are aware of?

> It does say it's because of the unpredictable behavior, but the "time" part is your own extrapolation.

True. Now, in what ways do allocations behave unpredictably? Well, any particular allocation can fail due to being out of memory, and the location allocated is unpredictable. But the primary way that allocations are unpredictable is in the amount of time they take.

If you want to argue with that statement, along with your argument, tell me how many years of your career you have spent in embedded systems. And don't try to argue that embedded systems isn't what we're talking about here, just because the title says "safety critical" rather than "embedded".

> how many years of your career you have spent in embedded systems

Depends on how you count, but probably about 20. 15 of those were at NASA. Gerard Holzmann (the author of the paper we're discussing) was in the office next to mine for about 3 years. Oh, and Common Lisp code that I wrote actually controlled a spacecraft back in 1999.


Does that qualify me to talk about these things?

> Does that qualify me to talk about these things?

Yes, certainly. (Though I have been in embedded longer - 25 years. But you've been in safety-critical longer than me.) I must admit that I under-rated your experience.

What part of the stuff from that link was written in Lisp? Was it just the planner, or also the exec? That is, did it have to be real time?

And, in my previous reply, I asked a bunch of questions (besides the ones about whether you knew what you were talking about). Would you answer them?

> What part of the stuff from that link was written in Lisp?

All of it. (Except for an IPC system which was written in C. See https://www.youtube.com/watch?v=_gZK0tW8EhQ if you want to know the whole story.)

> Would you answer them?

I'll answer a few of them.

> if Lisp is so wonderful compared to C, how come so many more safety-critical systems are written in C?

Politics mainly. And inertia. And the sunk cost fallacy.

> Is it because the entire profession is blind, and only you and a few enlightened ones realize how wonderful Lisp is?

Yes. Pretty much.

> Or is it because people that have done this for decades see more problems with Lisp than you are aware of?

They, like you, think they see problems, but they, like you, are wrong.

Tellingly, none of the people who think there are problems with Lisp actually know Lisp.

> Politics mainly. And inertia. And the sunk cost fallacy.

I would say no, sort of, and maybe. I don't think I've ever seen a language chosen because of politics. For "inertia", I would say "conservatism" - people know that they can build systems (even safety critical ones) using C, and they know where the problems are. They don't know where the problems are using Lisp. And they don't want to take the time to learn Lisp - that's sunk cost, but whether it's a fallacy or not depends on whether Lisp really is more suited for such work than C. You assert that it is; many of us would like to see considerably more evidence before we agree.

What evidence? Maybe a few hundred hard real time systems successfully written in Lisp. (But how are we going to get them, if everybody defaults to C? I will grant you that there's a chicken-and-egg problem here...)

> > Is it because the entire profession is blind, and only you and a few enlightened ones realize how wonderful Lisp is?

> Yes. Pretty much.

Yeah... um... I think I'll just let your words speak for themselves.

> chicken-and-egg


I want to follow up that whole discussion by stating I really appreciate someone with actual experience in this area chiming in. NASA code attracts a huge amount of bikeshedding here.


These are basically a subset of rules that MISRA defines for C programming.


Because code quality is usually not on the top list of most companies in IT field.

And all of these are so obvious that it should be called 'undergraduate coding rules for any system'.


I don't know for sure, but if I had to guess, it's that they are on relatively new, uncommon, or limited architectures for which a variety of compilers just don't exist. Many other compilers depend on being bootstrapped through C and many other runtimes depend on some subset of standard C library functions. Also if you stick to a certain part of C and write very simple code, it can be safe and more importantly very predictable. In embedded usage, you need to know that the execution time can fall within a bound. It can also be important to shy away from getting too high level in architecture or concept because that can hide complex but catastrophic logical bugs.

This formatting doesn't match the rest of HN which makes it difficult to read. Is this a bug somewhere?

Some kind of unicode copy/paste hijinx, probably coming from a PDF somewhere.

They seem to be different Unicode codepoints: FULLWIDTH LATIN SMALL LETTER

I wonder, is that really something that should go into a character set?

It was needed for JIS compatibility, since they encode halfwidth and fullwidth versions of kana and English characters, because having different fonts for them was too expensive.

Also it's needed for this tweet.


Oh, sorry, it has to do with the way Windows switches between Chinese and English. Sometimes it goes into a weird monospaced mode for some reason. I thought that would get stripped.

JPL's safety-critical programming seems like a really good place for Rust (https://www.rust-lang.org), given that Rust has an upfront focus on safety.

> Do not use [..] direct or indirect recursion.

Ok.. I get it that they don't want their C programmers to do that, but do they also mean that this is to "complex" for normal developers to implement in a fault-free manner?

They mean that it's hard to write a static analysis tool that can determine if functions using recursion will ever terminate or not.

It also puts a lot of stress on the stack of machines that may not have much stack to go around.

Just out of curiosity, why is that harder than writing a static analysis that determines if a loop with condition will finish?

The sort of software these rules are intended for run in constrained environments. The other question besides "Will this loop/recursion terminate?" is "Will we blow up the stack?".

Ignoring the time required to establish a new stackframe (versus a mere jump for a loop), the creation of a stack frame uses more memory. The less memory you can use, while still producing maintainable code, the better. And an easy way to assist with this is to remove recursion.

You could make the case for tail recursion, which a sufficiently advanced compiler will/can/should turn into a mere jump, like a loop, because it is. But this isn't guaranteed in the C standard and so should not be relied upon.

And then there's Rule #2. Loops are bounded, the index variable is supposed to only be altered by the loop structure (so: `for(i = 0; i < MAX; i++)`, that increment is the only place `i` should be altered). By providing a compile-time max, and incrementing (deterministically) the index variable, you ensure that static analysis of the loop terminating (or not) is possible.

And tail recursion is equivalent to a loop, but a form of loop which breaks Rule 2.

Tail recursion should be equivalent to a loop, if the language compiler/implementation offers TCE/TCO. C does not guarantee this.

But I'll also point out, tail recursion could be statically analyzed in the same way that you would with a for loop (for the purpose of Rule 2). As long as it is structured in a way that demonstrates that:

1) Forward progress is always being made (that is, there's an iteration/index parameter that always increases/decreases monotonically) towards a bound: void f(state,n) { if(n > MAX) return; f(g(state),n+1); }

2) There's a single entry point which initiates the recursion (loop) with 0 (or MAX if you're decrementing, or whatever): void start(state) { f(state,0); }

Both can be demonstrated (with code structured like the above) statically.

All true.

A modified C compiler could be built that would do TCO. But, for this kind of application, it would mean that you'd have to re-validate your compiler, so that's pretty much not going to happen.

Looking it up, it seems that GCC (at least) presently does do (at some optimization levels) tail call elimination. Others probably do as well. I guess I've never looked it up because it was never relevant (if I'm using C, I might as well use for loops, they're at least as clear as the equivalent tail recursive function).

Rule #2 is "Fixed Upper Bound for Loops".

Although they define this flexibly as "It should be possible for a verification tool to prove statically that a preset upper-bound on the number of iterations of a loop can't be exceeded."

So any loop for which they can in fact write such a verifier is simple enough.
