
Safety-critical realtime with Linux - corbet
https://lwn.net/SubscriberLink/734694/2e85660e26897085/
======
WalterBright
Every industry seems determined to learn from scratch on their own how to make
safety critical systems. None look at an industry that figured this out 50+
years ago - the airframe industry, which has an incredibly good track record
of making safe systems out of unreliable parts.

I wrote a couple articles on the general idea:

[https://digitalmars.com/articles/b39.html](https://digitalmars.com/articles/b39.html)

[https://digitalmars.com/articles/b40.html](https://digitalmars.com/articles/b40.html)

~~~
fit2rule
You're right, but it's the rail transportation people that figured it out
before the Air guys ..

~~~
WalterBright
You'll see bits and pieces of it in other places, for example, the dual
independent braking system in all cars. But the auto engineers failed to
generalize that into the car's computer controls.

The Fukushima and Deepwater Horizon disasters show how other industries fail
at applying the concept. Reading the sequence of failures in those just makes
me grind my teeth. For example, venting the hydrogen overpressure, good idea.
Venting it into an enclosed space with sparking electrical equipment,
spectacularly dumb. (And the list goes on and on with both disasters.)

Nobody has taken it to heart and applied it pervasively like the airframe
industry.

------
kev009
I find that subject line completely terrifying. Please use a small trusted
compute base, hopefully with rigorous auditing and attempts at formal
modeling, for safety critical systems. The Linux kernel development process is
not suitable for this domain.

~~~
mechatronix00
I’ve read SpaceX uses Linux to fly Falcon and Dragon. If true, I wonder how
they got happy with it for safety critical tasks. Or maybe those safety
critical tasks get offloaded to microcontrollers with easier to audit
codebases.

edit: reference to the article-
[https://lwn.net/Articles/540368/](https://lwn.net/Articles/540368/)

~~~
ficklepickle
I would have thought they'd use Dragonfly BSD. Get it!?! Dragon FLY... 'Cuz it
flies...

I'll show myself out.

------
srcmap
Designing realtime with Linux is non-trivial.

I worked on an HA (highly available) system with Linux inside Xilinx's Virtex
Pro PPC. It was a redundant system with multiple fault detections and
switchover if any subsystem detected a failure.

There was one 250 ms hard real-time requirement: if I am a slave and don't
detect the master's UDP ping for 250 ms, I will assume the master has failed
somehow, start action, and take over control as master.

The sub-system did trigger from time to time while the master was alive and
working perfectly OK.

Eventually I figured out that one of the system APIs was taking > 250 ms.
(I forget which one now; that was > 10 years ago.) I had to profile very
carefully and redesign the code to get around that API.
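
A minimal sketch of the slave-side timeout described above, written in Python for readability and not as a real-time implementation. The names `HeartbeatMonitor` and `FakeClock` are hypothetical; only the 250 ms figure comes from the description. The clock is injected so the behaviour can be shown deterministically.

```python
import time

TIMEOUT_S = 0.250  # the 250 ms deadline from the description above


class HeartbeatMonitor:
    """Tracks the last ping seen from the master (hypothetical helper)."""

    def __init__(self, timeout_s=TIMEOUT_S, clock=time.monotonic):
        # A monotonic clock is used so wall-clock adjustments cannot
        # fake or hide a timeout.
        self._clock = clock
        self._timeout_s = timeout_s
        self._last_seen = clock()

    def heartbeat_received(self):
        """Call whenever a ping from the master arrives."""
        self._last_seen = self._clock()

    def master_presumed_dead(self):
        """True once no ping has been seen for longer than the timeout."""
        return (self._clock() - self._last_seen) > self._timeout_s


class FakeClock:
    """Simulated clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
monitor = HeartbeatMonitor(clock=clock)

clock.now = 0.100                 # ping arrives at t = 100 ms
monitor.heartbeat_received()

clock.now = 0.300                 # 200 ms of silence: stay slave
ok_at_200ms = monitor.master_presumed_dead()

clock.now = 0.400                 # 300 ms of silence: take over
dead_at_300ms = monitor.master_presumed_dead()
```

With this logic, any call that stalls the ping path for more than 250 ms looks exactly like a dead master, which is consistent with the spurious takeovers described.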

~~~
burntrelish1273
Sorta like a distributed watchdog or HA failover.

(Btw my favorite acronym is STONITH.)

~~~
occultist_throw
Yeah, STONITH bites us on our ass every so often. Primarily because of
split-brain and no good way to ameliorate that without throwing out data.

It's better to have an odd quorum, to break ties.
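
A small sketch of why an odd node count helps, assuming a simple strict-majority rule. The function name `has_quorum` is made up for illustration and is not taken from any particular cluster manager.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if this partition holds a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


# 4-node cluster split 2/2: neither half has a majority, so it is a tie.
even_split = (has_quorum(2, 4), has_quorum(2, 4))

# 5-node cluster split 3/2: exactly one side can keep running.
odd_split = (has_quorum(3, 5), has_quorum(2, 5))
```

With an odd cluster size, a two-way partition can never leave both sides equal, so at most one side keeps quorum.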

------
aidenn0
Segmentation faults are a bad example of a fault you don't want in a safety-
critical system. A crash is okay, because you will usually have some sort of
fall-back (e.g. most power steering systems work unpowered). It's non-crashes
that cause silent improper behavior that are bad.

Of course, a segmentation fault is usually a symptom of pointer misuse, which
means your code is likely to also suffer from corruptions.

~~~
AnimalMuppet
Not all systems work with the software down. Yes, power steering works with
the power down, but that's for an engine failure, not for an electronics
failure or software crash. But I believe that airplanes that use software to
adjust the wing control surfaces move them to neutral positions on software
failure. (I can't prove that, and I have no first-hand knowledge - just recall
hearing it once, if I remember correctly). That means that, while the
software's down, the wings won't break off, and it won't crash due to the
control surfaces doing something bizarre, _but you can't navigate the plane._
That's still pretty bad.

~~~
skykooler
Commercial aircraft have mechanical fallbacks for all controls, and emergency
protocols for if some controls don't work (for example, using spoilers to turn
if ailerons are inoperative). Many military jets have pure software control,
as they are aerodynamically unstable and are unable to be flown safely by hand
if the fly-by-wire system goes down; however, the procedure for a system
failure in a military jet is usually "eject!", which is not feasible for
passenger aircraft.

~~~
WalterBright
The Boeing 757 had 3 independent hydraulic systems, with 3 actuators per
surface. Any two could overpower the 3rd. Any software in the loop was dual,
written on different hardware using different algorithms, with a comparator
that would physically disconnect both if they disagreed. The pilot could also
physically disconnect them via circuit breakers.
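
A rough sketch of the compare-and-disconnect idea, in Python purely for illustration. The channel functions, the tolerance, and the control law are all invented here; on the aircraft described, the channels run on different hardware and the disconnect is physical.

```python
def channel_a(stick_input):
    """First implementation of a made-up control law."""
    return stick_input * 0.5


def channel_b(stick_input):
    """Second, independently written implementation of the same law."""
    return stick_input / 2.0


def faulty_channel(stick_input):
    """Simulates a channel that has started producing bad output."""
    return stick_input * 0.5 + 1.0


def compare_and_command(stick_input, a=channel_a, b=channel_b, tolerance=1e-6):
    """Return (engaged, command); disengage both channels on disagreement."""
    out_a = a(stick_input)
    out_b = b(stick_input)
    if abs(out_a - out_b) > tolerance:
        # The comparator does not guess which channel is right:
        # both are taken out of the loop.
        return (False, None)
    return (True, out_a)


healthy = compare_and_command(10.0)
tripped = compare_and_command(10.0, b=faulty_channel)
```

The point of the design is that the comparator only needs to detect disagreement, which is far simpler to get right than deciding which channel is correct.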

There were also various mechanisms to prevent the surfaces from going too far
(cannot move full travel at 500 mph, it would rip the airplane apart).

This is all very serious stuff, and was worked over by a lot of people
imagining every perverse thing that could happen, and going through the list
of things that had happened, to ensure it is safe.

The track record of the 757 in service shows how effective this is:

[https://en.wikipedia.org/wiki/Category:Accidents_and_inciden...](https://en.wikipedia.org/wiki/Category:Accidents_and_incidents_involving_the_Boeing_757)

None of them resulted in recommendations for design changes.

------
zurn
I wonder how they arrive at the X microsecond worst-case number for the
software-based solutions. Does it take into account a perfect storm of APIC
interprocessor events, interrupts, SMP cache coherency protocol worst-case
behaviour and cross-CPU TLB shootdowns, misses on all levels of
instruction/data TLBs and caches and DRAM, CPU trace cache behaviour, ECC
machine check events, worst case OoO core behaviour wrt branch prediction and
speculative execution, worst case interference from other SMT threads, other
SoC functions accessing DRAM, etc?

It would seem to me that a worst case scenario could easily cause slowdowns of
many orders of magnitude. You could mitigate some of them by careful manual
memory layout and hardware specific tricks like hardwired TLB entries, but
still be left with a lot of uncovered stuff.

~~~
ridiculous_fish
Interrupt-based designs are bad for real time, for the reasons you give.
Instead you want to use techniques like polling and static scheduling, where
every process gets a fixed time slice. This reduces the variability and
improves fidelity to your model, since you know at every point what processes
are running.
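
A minimal sketch of a static schedule, assuming a fixed table of time slices repeated every major frame. The task names and slice lengths are invented. The lookup shows the property mentioned above: for any time `t` you can say which task owns the CPU.

```python
# Static schedule: (task name, slice length in ms), repeated forever.
# Names and durations are illustrative only.
SCHEDULE = [
    ("read_sensors", 2),
    ("control_law", 5),
    ("write_actuators", 2),
    ("housekeeping", 1),
]

MAJOR_FRAME_MS = sum(slice_ms for _, slice_ms in SCHEDULE)


def task_running_at(t_ms):
    """Return the task that owns the CPU at time t_ms, by table lookup."""
    offset = t_ms % MAJOR_FRAME_MS
    for name, slice_ms in SCHEDULE:
        if offset < slice_ms:
            return name
        offset -= slice_ms
    raise AssertionError("unreachable: offset is always inside the frame")
```

For example, `task_running_at(23)` falls 3 ms into the third major frame, which is inside the `control_law` slice.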

~~~
occultist_throw
Easy to say that interrupts are bad. But when you need to know of a pin
change, rising or falling IRQ, interrupts are the only game here. And not only
that, but if we were to do the timeslice strategy, then we need to poll that
pin at least 2x faster than its fastest input.

I would think with that strategy, you'd be servicing "slice-interrupts" more
than anything else.

I'll stick with writing ISR's that are only a few commands and do my work
outside the ISR's, as standard in industry now.
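
A simulated sketch of that short-ISR pattern, in Python only to show the shape of it. The function names and the pulse-width calculation are made up; on a real microcontroller the handler would be registered with the interrupt controller and the queue would be a lock-free ring buffer or similar.

```python
from collections import deque

pending_edges = deque()  # events recorded by the "ISR", drained later


def pin_change_isr(timestamp_us, level):
    """Keep the handler tiny: record the event and return."""
    pending_edges.append((timestamp_us, level))


def main_loop_iteration():
    """Do the real work outside the ISR; returns pulse widths in us.

    Simplification: a pulse that straddles two calls is dropped.
    """
    widths = []
    rising_at = None
    while pending_edges:
        timestamp_us, level = pending_edges.popleft()
        if level == 1:
            rising_at = timestamp_us
        elif rising_at is not None:
            widths.append(timestamp_us - rising_at)
            rising_at = None
    return widths


# Simulate two pulses arriving as four edge interrupts.
pin_change_isr(100, 1)
pin_change_isr(130, 0)
pin_change_isr(200, 1)
pin_change_isr(260, 0)
pulse_widths = main_loop_iteration()
```

The handler does one append and returns, so its worst-case run time is easy to bound; everything with variable cost happens in the main loop.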

Now, on a related topic, if they're discussing getting the system to a 10us
timing, that would be useful in using Linux as a 3d printer controller
directly, rather than an Atmel or STM32 chip. But the requirements for what
needs to be turned off seem pretty onerous, unfortunately.

------
cjbillington
Looking forward to CPUs getting on-die FPGAs so we can actually chuck some
fully deterministic timing tasks onto them. Even if you're running on metal
without an OS, there are heaps of things that can stop your code from running
with predictable timing, and it seems like it's getting worse as CPUs are
getting more complex.

~~~
planteen
We've had FPGA with CPU for 5 years or so on Xilinx Zynq. :)

~~~
kyzyl
Amusingly, Xilinx makes dev tools that are so unbelievably bad that it's
almost impossible to do any of the best practices for development that you
would hope for in safety critical systems. No source control, code review,
reproducible builds, integration testing, etc.

Presumably you can sort of brute force it if you have defense level budgets,
but it's a seriously bad situation.

~~~
planteen
Yeah last time I used Vivado a couple years ago, version control on the
project was a nightmare. Reproducible builds seem to be a problem with stock
FPGA synthesis tools in general.

------
arca_vorago
I thought that the kernel had improved enough in recent years for SIL 3...
perhaps not though.

I wasn't as aware of the issue of safety-critical systems as I should have
been until I was inside a couple industrial companies where PLC's were
everywhere (for this very reason). The thing that interests me now about this
is how hard I see net-connected PLCs pushing into industrial applications,
mostly because everyone in industry is on the edge of their seat for IoT to
hit so they can use and abuse the data (instead of waiting for a service call
to pull data like they used to, why not just use an LTE-modem PLC, for
example?). Do you see where I am going with this? Safety-critical industrial
applications below SIL 4 are increasingly vulnerable, and it's not from lack
of realtime response to stimuli. In the end, using Linux in realtime just
seems to exacerbate this particular angle on the issue that I see. It does
make me wonder about the implications of microkernel design vs monolithic in
such applications though.

[http://www.nfpa.org/codes-and-standards/all-codes-and-standa...](http://www.nfpa.org/codes-and-standards/all-codes-and-standards/list-of-codes-and-standards/detail?code=79)

[https://en.wikipedia.org/wiki/IEC_61508](https://en.wikipedia.org/wiki/IEC_61508)

[https://webstore.ansi.org/RecordDetail.aspx?sku=ANSI%2fRIA+R...](https://webstore.ansi.org/RecordDetail.aspx?sku=ANSI%2fRIA+R15.06-2012)

[https://www.iso.org/standard/69883.html](https://www.iso.org/standard/69883.html)

[https://webstore.iec.ch/publication/22797](https://webstore.iec.ch/publication/22797)

[https://en.wikipedia.org/wiki/Comparison_of_real-time_operat...](https://en.wikipedia.org/wiki/Comparison_of_real-time_operating_systems)

------
irundebian
Can somebody here recommend any books on developing safety-critical systems?
I've read some part of Kleidermacher's "Embedded Systems Security" book and
found it very helpful.

------
rocqua
The title says [LWN subscriber-only content], the link seems to suggest the
same.

It feels like LWN has bad access control, and someone abused that to post an
article that shouldn't be free.

~~~
cesarb
LWN has a cool feature where any subscriber can make these "subscriber links"
to share an article with a friend/coworker, a relevant mailing list, or
sometimes even HN. See
[https://lwn.net/op/FAQ.lwn#slinks](https://lwn.net/op/FAQ.lwn#slinks) for
details.

In this case, the link was posted by Jon Corbet, who is LWN's main editor and
developer.

