I absolutely love Erlang and think that, along with Clojure, it provides a complete ideology for developing modern software.
But the article implies (and more than once) that the rover's architecture borrows from Erlang, while the opposite is true. Erlang adopted common best practices from fault-tolerant, mission-critical software, and packaged them in a language and runtime that make deviating from those principles difficult.
The rover's software shows Erlang's roots, not its legacy.
It wasn't the intention to imply that one borrowed from the other. The concept of isolation and virtual memory protection goes back to at least the '60s, for instance. It would be unwise not to take that knowledge into account - even though we are extremely good at forgetting our past in this industry :/
The key takeaway is that these methods seem to work. They worked in a setting of C code on the existing rovers as well as the newest one. They worked on large telephony switches written in Erlang. My guess would be that many other mission-critical systems possess the same traits.
Erlang is best suited for online applications (as opposed to batch processing) that require handling a high rate of concurrent events and need to be fault-tolerant (highly available). Erlang was designed with those applications in mind, and provides just the right abstractions (and no more) for implementing them correctly and efficiently (like lightweight processes with mailboxes, complete process isolation, and supervision hierarchies).
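To make that concrete, here is a minimal, made-up sketch of those primitives (the module and names are mine, just a toy ping/pong): spawn/3 starts an isolated process with its own heap, and the only way in or out of it is its mailbox.

    -module(pingpong).
    -export([start/0, loop/0]).

    %% spawn/3 creates an isolated process with its own heap and mailbox.
    start() ->
        Pid = spawn(?MODULE, loop, []),
        Pid ! {self(), ping},
        receive pong -> ok end.

    %% Each process blocks on its own mailbox via receive; nothing is shared.
    loop() ->
        receive
            {From, ping} ->
                From ! pong,
                loop()
        end.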
But what I like most about Erlang (and I like Clojure for the same reason) is that it is not a kitchen-sink aggregation of programming language features thought to be useful or cool (like some other languages that I won't mention), but a perfect (or near-perfect) mix of just the right features, all made to work with one another in synergy.
In other words, Erlang, like Clojure, does not say, "here are the features you can program with", but rather, "this is how you should program". And the features these two languages provide are, I think, just the right ones for developing software with modern requirements (scaling and high-availability) on modern hardware (concurrency).
Erlang's features, however, do tend to focus more on programming in the large, or high-level software organization (they help make your software scalable and fault-tolerant), and less on low-level constructs (it's hard or impossible to write a super-high-performance data structure in Erlang). It delegates those low-level problems to units written in other languages. Clojure is the opposite, I think. It's designed to make building correct and efficient data structures and data-processing functions easy, but it provides little guidance on the high-level organization of complex, large software.
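On the delegation point: the usual route in Erlang is a port that talks to an external program over stdin/stdout. A sketch only ("crunch" is a hypothetical external binary, say written in C):

    -module(delegate).
    -export([crunch/1]).

    %% Hand the heavy lifting to an external OS process via a port.
    %% {packet, 4} prefixes each message with a 4-byte length header.
    crunch(Input) ->
        Port = open_port({spawn, "crunch"}, [binary, {packet, 4}]),
        Port ! {self(), {command, Input}},
        receive
            {Port, {data, Result}} ->
                erlang:port_close(Port),
                Result
        end.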
Both are also relatively easy to learn, which only shows their elegance (though Erlang does suffer from a somewhat antiquated syntax, owing to its origins in Prolog).
I'm curious too. The author implies that Erlang lets me manage memory explicitly like C does so that I can, by design, guarantee I never run out of memory and never get a GC at a critical time. How does Erlang handle that?
Erlang still uses GC; however, because there is no shared memory (except ETS[1]), the GC can run independently for each process. This is eons better than the JVM[2], with all the benefits of not having to worry about manual memory management. It's not C, though, and I don't believe the author implies that in any way. In fact he acknowledges the exact opposite.
I wouldn't say it is "eons better than the JVM". The JVM GC, especially the new G1 collector, is probably the most advanced and performant garbage collector available on any platform. Erlang, however, does provide isolation.
There are versions of the JVM for hard real-time applications (Erlang is for soft real-time only), that provide explicit memory management, and a very fine-tuned GC, like this one: http://java.sun.com/javase/technologies/realtime/index.jsp
Mind telling me more about the fault-tolerant abilities/properties of Erlang? I'm not really able to conceptualize it right now. Maybe an example? :) Thanks.
Back in the 90s there was a software engineering fad (an unfair term, but it was faddish at the time) called the Capability Maturity Model, and JPL was one of two software development sites that qualified for the highest level (5), which involves continuous improvement, measuring everything, and going from rigorous spec to code via mathematical proof.
This process (which Ed Yourdon neatly eviscerated when applied to business software) produces software that is as reliable as the specification and underlying hardware.
It may be a fad for the industry at large, but it's a requirement in US government contracting (as CMMI). It goes beyond software, too. My former employer just got their systems engineering up to CMMI Level 5[1] and was working hard on getting electrical engineering there (they are only at 3).
But in this case that's almost completely wrong. For example, "bug-ridden"? Outfits like NASA use classical techniques (code inspection etc.) to ensure that their software has exceedingly low error rates. This has been well studied. Such an approach works, it's just too expensive for most commercial projects. As for "slow", how likely is that?
On another note, it's pretty cool that the first three names credited in the JPL coding standard document (which is linked to at the bottom of the OP and is surprisingly well written) are Brian Kernighan, Dennis Ritchie, and Doug McIlroy.
Jokes aside, there isn't much of a speed requirement here, really. It all runs fine on a radiation-hardened 200 MHz processor. Nothing on Mars is running away.
> On another note, it's pretty cool that the first three names credited in the JPL coding standard document are Brian Kernighan, Dennis Ritchie, and Doug McIlroy.
It's not surprising at all. It's the work of Gerard Holzmann[1], who worked at Bell Labs at the Computing Sciences Research Center with all the other Unix folks where he developed the Spin[2] model checker. (As a sidenote, Spin was used to check the sanity of the Plan 9 kernel).
"Recursion is shunned upon for instance,...message passing is the preferred way of communicating between subsystems....isolation is part of the coding guidelines... The Erlang programmer nods at the practices."
Great article. The only thing he left out is the parallel to Erlang Supervisor Trees, which give the ability to restart parts of the system that have failed in some way without affecting the overall system.
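For those unfamiliar, a minimal supervisor sketch looks something like this (rover_sup and nav_worker are made-up names, not anything from the actual rover code):

    -module(rover_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    %% one_for_one: a crashing child is restarted on its own, while its
    %% siblings (and the rest of the tree) keep running. Give up after
    %% 5 restarts within 10 seconds and escalate the failure upward.
    init([]) ->
        {ok, {{one_for_one, 5, 10},
              [{nav_worker, {nav_worker, start_link, []},
                permanent, 5000, worker, [nav_worker]}]}}.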
The biggest difference from Erlang is VxWorks's inability to isolate task faults or runaway high-priority processes. (Tasks are analogous to processes in Erlang.) VxWorks 6.0 supports isolation to some degree, but it was released in '04, after the design work on the rover started. Without total isolation, a lot of the supervisor benefits of VxWorks go away.
Erlang could use a bit more isolation as well. For example, limiting the mailbox size and the memory available to a single process.
Memory limits would be great for some peace of mind when reusing libraries for processing input. For example, processing XML, even with a SAX parser, still leaves you open to attacks like huge tag values or, to take something more real-world, huge href="data:" attributes. Obviously the right fix in this case is a limit in the parser (which xmerl doesn't have, unfortunately), but it'd be nice to have some safety net from the VM as well.
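Until the VM grows such knobs, the best you can do is an approximation in user land. A sketch (watchdog, MaxBytes and MaxMsgs are made up here) that polls process_info/2 from some monitoring process and kills anything over budget:

    -module(watchdog).
    -export([check/3]).

    %% Kill Pid if it exceeds a heap budget (in bytes) or a mailbox
    %% budget (in messages). Weaker than a real VM-level limit, since
    %% a process can blow past the budget between polls.
    check(Pid, MaxBytes, MaxMsgs) ->
        case erlang:process_info(Pid, [memory, message_queue_len]) of
            [{memory, Mem}, {message_queue_len, Len}]
              when Mem > MaxBytes; Len > MaxMsgs ->
                exit(Pid, kill);
            _ ->
                ok
        end.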
Hm... What do you mean by isolating task faults? I think a lot of that depends on the underlying hardware, right (e.g., if the board has an MMU)? I know you can insert a taskSwitchHook (I think it's called) that could detect and kill runaway high-priority processes.
Edit: in response to the reply, I suppose I should have said tasks instead of "processes" (which in VxWorks would be the RTP)
VxWorks has no processes[1]. It has tasks. Basically, you write kernel code, there's no user mode.
[1] As someone mentioned, VxWorks 6 did introduce processes and "usermode", called RTP. As with most features of VxWorks, you compile that into your image if you want the feature. But there's a lot of inertia, and much of the VxWorks stuff I see doesn't use RTP yet.
The motivation for writing the software in C is this: Code Reuse. NASA and its associated labs have produced some rock-solid software in C. In space missions the RAD750 is commonly used (with its non-hardened version, the MCP), along with the Leon family of processors. Test beds and other ground hardware are often little-endian Intel processors. VxWorks is commonly used on many missions and ground systems, but so is QNX, Linux, RTEMS, etc. The only common thing the diverse set of hardware, operating systems, and compiler toolchains all support is ANSI C. This means that nifty languages like Erlang or whatever - though there may be a solid case for using them - are not practical in this circumstance.
I know some clever folks in the business have done interesting work on ML-to-C compilers, but it's still in the early R&D phase at this point - the compiler itself would have to be thoroughly vetted.
I didn't read it as arguing against C, just noting that there seems to be a lot of commonality between the way the code in the Mars rovers is designed and the way that robust Erlang applications are typically designed.
Precisely. One thing weighing heavily against using Erlang for this problem is that you need hard real-time behaviour, and Erlang does not provide that. The other point - that you need static allocation almost everywhere - also counts against using Erlang for the rovers.
That leaves you with very few languages you can use, and C is a good, predictable one for the problem. Its tool support is also quite good, with static verification etc. And it is a good target for compilation. As someone else notes, most of those 2.5 Megalines are auto-generated.
If you don't read the article, at least read the tl;dr:
"TL;DR - Some of the traits of the Curiosity Rovers software closely resembles the architecture of Erlang. Are these traits basic for writing robust software?"
Since much of those 2.5-MLOC are auto-generated, wouldn't it be accurate to say that they did use a something-to-C compiler, only instead of, say, ML-to-C, it was their own in-house something-to-C compiler? And that 'something' seems to have borrowed a number of features from Erlang.
But a compiler in the sense of translating a higher-level language down into C? Or from something isomorphic to C, essentially plugging values into a boilerplate template? The latter is far more likely.
"We know that most of the code is written in C and that it comprises 2.5 Megalines of code, roughly[1]. One may wonder why it is possible to write such a complex system and have it work. This is the Erlang programmers view."
Contrast this with https://www.ohloh.net/p/erlang: "In a Nutshell, Erlang has had 7,332 commits made by 162 contributors representing 2,346,438 lines of code"
I'm not sure if those roughly 154kloc really make a difference...
From an Erlang programmer's point of view, 2.5 MLOC is a complex system.
On the other hand, every Hello World in Erlang drags in about 2.5MLOC of liability (even if much of that is never run). And I doubt it's all autogenerated.
So if anything, 2.5 MLOC of generated NASA code is probably less complex than the Erlang runtime.
Great article! I'd like to add that the D programming language also offers a lot of features for creating robust code with multiple paradigms, although the syntax is heavily C-oriented rather than functional.
D adds 'immutable' and 'shared' to the familiar C 'const' qualifier: 'immutable' is for data that will never change (as opposed to 'const', which only promises no changes in the declaring scope), and 'shared' is for data shared across threads. For everything else, you're encouraged to use message passing via the std.concurrency module.
Pure functional code can be enforced by the compiler using the 'pure' qualifier. There is even compile-time function evaluation for calls with constant arguments, which is awesome when combined with its type-safe generics and metaprogramming.
There's unit tests, contracts, invariants and documentation support right in the language. Plus the compiler does profiling and code coverage.
I'd be curious to test D against Erlang for such a system. (Not saying Erlang shouldn't be used, it's the next language on my to-learn list, just that the switch to functional might be too radical for most developers used to imperative and OO and D provides the best of both worlds.)
I've been interested in D for a while, for these reasons and more - its features look nice, but it never seems to have gotten much popularity/mindshare. Could you hazard a guess why?
In the early days there were some issues in the community which led to the Phobos/Tango divide in the standard library for D1.
This is now past history, as the community has united around D2 (now known just as D) and strives for compliance with the book "The D Programming Language" by Andrei Alexandrescu.
D2 development is done in the open, with the source code available on GitHub.
Besides the reference compiler, dmd, owned by Digital Mars, there are also the LDC and GDC compilers. Currently it seems that GDC might be integrated into GCC as of the 4.8 release.
Right now it seems more people are picking up D, mainly for game development projects.
Indeed: the D1 splits, and also questions being answered with "that will be in D2". There is now the problem that there is not yet a consensus on how D2 compares to C++11.
Does anyone have any knowledge of why Ada isn't used over C? Specifically, it seems like Ada gives you a lot better tools when it comes to numerical overflows/underflows.
Also, what compiler does NASA use? Something like CompCert? What kind of compiler flags? Do they run it through an optimizer at all?
See my post below - to reuse code cross platform. There's a diverse set of compiler toolchains, operating systems, architectures. Only ANSI C is supported by all of them. The compilers are specific to the target OS and hardware, and flags are unsurprisingly the strictest possible for C89.
Interesting. I'm curious whether they have looked into CompCert, and if so what they think of it. Maybe it doesn't target the architecture they want. There is also Vellvm, which seems like something a space mission would care about. Although I've never heard of a gcc compiler bug being the cause of a NASA mission failure, so perhaps gcc is Good Enough?
Great article and comparison, and a nice way of highlighting one of Erlang's strengths.
However: I'm dubious that it's a strength many people here need. No, the article did not say anything about that, but I am. A few minutes of downtime now and then, for a web site that's small and iterating rapidly to find a good market fit, is not the biggest problem. And while Erlang isn't bad at that, I don't think it's as fast to code in as something like Rails, with all kinds of stuff ready to go out of the box.
That said, I'd still recommend learning the language, just because it's so cool how it works under the hood, and because sooner or later, something will take its best ideas and get popular, so having an idea how that kind of thing works will still be beneficial.
As you mentioned Rails, I thought I should mention Chicago Boss. It's a blindingly fast Rails-inspired framework that takes many of the Erlangisms out of coding in Erlang: http://chicagoboss.org/
I think two reasons: 1) VxWorks directly supports message passing (http://www.vxdev.com/docs/vx55man/vxworks/ref/msgQLib.html). 2) They seem to prefer simple, obvious, "less magic" interfaces. STM is nice for its "magic", but message passing creates very well-defined, documented interfaces between code.
> It turns out that many of the traits of Erlang systems overlap with that of the Rovers. But I don't think this is a coincidence. The software has certain different properties — the rovers are hard realtime whereas the erlang systems are soft realtime.
> The things that do not overlap has to do with the need of having soft realtime vs hard realtime. In Erlang we can yield service. It is bad, but we can do it. On a rover it can be disastrous. Especially in the flight control software. Fire a rocket too late and you are in trouble.
The OTP virtual machine takes a lot of memory. It's an interpreter, which means much slower execution, as the article and others above pointed out. The Erlang VM is soft-realtime: it can't guarantee that something will finish within a certain number of micro- or milliseconds, or if it does make a guarantee, it's too loose for what they need (just guessing here).
But the concepts are very similar - message passing being the way to communicate between modules, rather than shared memory.
This brings another topic - the Linux vs Minix debate :) - I guess there are right things to be done for the right time, and right target. It's just getting all these things right is the hardest.
The figure is just mentioned to say that the code base is quite complex. And what are you talking about? It’s not a rant on anything. Certainly not the 2.5 MLOC.
So a Mars Rover is much closer to a browser/backbone/node.js app than I could ever imagine. The basic structure is surprisingly similar to javascript apps these days: isolated modules, message passing/event loop, fault tolerance.
Node.js is cooperatively multitasked. VxWorks (and Erlang) are preemptively multitasked. So the basic structure is quite different. If one of your node.js events infinite loops, it is game over. Be it web server or rover. Not so here.
Though, with the coding standards forbidding any recursion (direct or indirect) and requiring statically bounded loops, you're not going to have an infinite loop anyway (or, incidentally, write a UTM).
Please explain what you mean by that - "pre-emptive"?
My understanding (coming from C and OS terms) is that pre-emptive means taken over: e.g., if I have a real OS thread, it is temporarily "paused" and the resources (CPU/mem/IO) are given to something else. At some point control is restored.
But this happens without the knowledge or instruction of the thread itself. So things like priority inversion are hard to battle with pre-emptive multitasking - for example, a low-priority thread A holding a mutex while a higher-priority thread B waits for it. (And there is no need for mutexes if only message passing is used.)
The node.js worker threads are native threads, which are preemptive on all current platforms. The JavaScript context runs an event loop which most likely must lock its message queue; callbacks for async operations are queued for execution on a future tick of this event loop. All of this seems very preemptive to me.
What seems like cooperation in node.js is really just async operations queuing up on the event loop. Since requests are also async events, they get interleaved with callbacks from existing requests.
To me, cooperation is when you yield the thread to another coroutine. This saves the state of the call stack, the registers, everything; meaning you don't force your user to keep that state in closures. The user code in a cooperative environment feels sequential and blocking and results are passed by return values, not by calling continuations.
It's also friendlier to exceptions, since it doesn't lose the entire call stack; with node.js you only get the stack since the beginning of the current event loop's tick.
> The JavaScript context is running an event loop which most likely must perform locking on its message queue and callbacks to async operations are queued for execution on a future tick of this event loop. All of this seems very preemptive to me.
There’s a single loop which blocks until the task yields while waiting on the result from one of the worker threads. That’s cooperation. All queued connections are starved until that happens. In Erlang, or just with pthreads, the connections are processed independently. Think separate event loops for each connection.
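For contrast, here is roughly what process-per-connection looks like in Erlang (a toy echo server of my own; gen_tcp is the real OTP API, the rest is made up). Each accepted socket gets its own process, so one slow connection never starves the others:

    -module(echo_server).
    -export([start/1]).

    start(Port) ->
        {ok, Listen} = gen_tcp:listen(Port, [binary, {active, false}]),
        accept_loop(Listen).

    accept_loop(Listen) ->
        {ok, Sock} = gen_tcp:accept(Listen),
        %% One independent process ("event loop") per connection.
        Pid = spawn(fun() -> handle(Sock) end),
        ok = gen_tcp:controlling_process(Sock, Pid),
        accept_loop(Listen).

    handle(Sock) ->
        case gen_tcp:recv(Sock, 0) of
            {ok, Data} ->
                gen_tcp:send(Sock, Data),
                handle(Sock);
            {error, closed} ->
                ok
        end.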
> To me, cooperation is when you yield the thread to another coroutine. This saves the state of the call stack, the registers, everything; meaning you don't force your user to keep that state in closures. The user code in a cooperative environment feels sequential and blocking and results are passed by return values, not by calling continuations.
That has nothing to do with how execution is scheduled. Cooperative scheduling requires passing continuations[1] so that execution can be resumed while it waits. The simplest implementation is to use callbacks, the way node.js does it. Futures and deferreds[2] are a little more sophisticated (Python's Twisted, and probably something for node.js exists as well), as they allow for better composition. And of course you can hide the continuations entirely, which can be done in both Scala (a compiler plugin) and Python (gevent, or using generators), rewriting the direct control flow by breaking it at yield points automatically (this is how exception throws work in most languages, btw) - but the limitations inherent in having a single event loop per thread will still exist.
> a single loop which blocks until the task yields while waiting on the result from one of the worker threads
Yes, node.js is cooperative, but since all I/O is asynchronous, the time spent blocking is mostly dispatching and simple operations; it doesn't block while waiting. That's where its performance and high concurrency come from. Doing CPU-heavy work in the server/main process is a no-no.
Obviously that approach is fine for many things. Before node.js, people wrote those kinds of servers in Twisted or Netty, with great results. A Netty-based framework powers, for example, much of Twitter. I was just explaining how the scheduling works :)
If a function call in a node.js app blocks, doesn't the app hang while waiting for it? A single process with event callbacks is almost the opposite of what is described here.
My comment was specifically about module isolation and message-passing, the design patterns, not language implementation, preemptive vs cooperative or whatever.
I hate to harp on this, since you were downvoted to oblivion, but node.js doesn't have module isolation, as a crash in one function will stop the entire program.
And it doesn't have message-passing since it doesn't have multiple processes. There's nothing to pass messages to.
The only design pattern it shares is splitting code into subsets, and every language (that matters) has that.
On node.js: not if you use domains (0.8+) to isolate module exceptions, and you do (should) have multiple processes. On the client, a module throwing an error won't crash the browser either. Finally, events are a form of message-passing.
In node.js, operations that would normally block are performed in a worker thread. For example, to read a file you specify a filename and a callback function that will be called when the file is in memory. After the call, execution of the current thread eventually returns to the event loop, and other events (such as requests or other callbacks) are processed until the file is read. Once the file is read, a call to the specified callback is queued on the event loop and then executed, resuming your code.
Thanks for the hate. I know they're worlds apart and there are extreme implementation differences, but from a design pattern/code organization perspective it sounds similar; I just thought that was curious. A JavaScript programmer's view on the Rover's software.
(and I feel stupid having to apologize for a harmless comment. way to go HN)