For external hardware inputs, we had software simulating the messages when we couldn't get the real equipment in the lab. We were writing software for a radar that didn't yet exist.
And it was tested and tested and tested and tested.....
Then, when the pieces came together in integration, they tested and tested. And when they built the physical radar, the software worked. A few configuration things here and there needed adjusting.
Actually testing was built into the software. When it came up it would talk to the physical parts to make sure everything was communicating ok before it could start running.
Was it bug free? Probably not. You can't test every possible scenario. But we took the work seriously. It took longer to get code done than any other environment I've been in, but the code ended up very solid quality.
I've been wondering about this for a while. We tend to run unit tests, integration tests, whatever tests, while the software is in development. However, once it is in "production" (for whatever definition of production), usually no tests are performed. At most, there's some sort of metrics and server monitoring, but nothing in the actual software.
It's a work of fiction, but Star Trek has these "diagnostics" that are run, with several levels of correctness checks, for pretty much everything. In your typical app, it could be useful to ask it to "run diagnostics", to see if everything was still performing as expected.
Each BIT does a varying level of testing based on its runtime budget, but there are a lot of very basic tests that don't make much sense until you see your first field report of some register bit "sticking". It's much better to ground a plane that can't add 1+1 than to find that out in mid-flight.
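As a rough illustration (not from any particular avionics codebase, and the register is a stand-in for a real memory-mapped address from a board support package), a power-on BIT can be as plain as a walking-ones test on a scratch register plus an arithmetic sanity check:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for a memory-mapped scratch register; on real hardware this
     * would be a volatile pointer to a BSP-defined address. */
    static volatile uint32_t scratch_reg;

    /* Walking-ones test: write each single-bit pattern and read it back, to
     * catch a bit stuck at 0 or 1 (or shorted to a neighbour). */
    static bool bit_scratch_register(void)
    {
        for (int i = 0; i < 32; i++) {
            uint32_t pattern = (uint32_t)1u << i;
            scratch_reg = pattern;
            if (scratch_reg != pattern)
                return false;
        }
        return true;
    }

    /* The "can it still add 1 + 1" check sounds pointless until a field report
     * shows otherwise; volatile keeps the compiler from folding it away. */
    static bool bit_basic_arithmetic(void)
    {
        volatile uint32_t one = 1u;
        return (one + one) == 2u;
    }

    int main(void)
    {
        bool ok = bit_scratch_register() && bit_basic_arithmetic();
        printf("BIT %s\n", ok ? "passed" : "FAILED -- do not dispatch");
        return ok ? 0 : 1;
    }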
As I recall, they modified the sensors to avoid latch-up (an extra pullup resistor) and updated the FCS software to provide a warning if all 3 sensors return zero output. Even though this wasn't really an error in the FCS software, it could be argued that it failed to detect an erroneous input from 3 redundant sensors that latched up for the same reason.
This is a classic example of a pilot who misunderstood the pre-takeoff checklist procedure and cost the USAF a $325m aircraft.
> During the mishap sequence, the MP started engines, performed an IBIT, and had a fully
functioning Flight Control System. Subsequently, the MP shut down engines to allow
maintenance personnel to service the Stored Energy System. During engine shut down,
the MA's Auxiliary Power System (APU) was running. The MP believed the APU
provided continuous power to the Flight Control System, and therefore another IBIT after
engine restart was unnecessary. This belief was based on academic training, technical
data system description, and was shared by most F/A-22 personnel interviewed during
To be fair, this seems to be more of a classic example of a documentation or training problem.
So you could also classify this as a user interface / design problem.
Now, hopefully your system is set up such that race conditions cannot happen, but good luck with that.
You could command the external things to run diagnostics and report back. Part of this was that it made fixing things easier (like your car computer's diagnostics); part of it was required so we could figure out where things weren't working optimally.
For example, if a motor was running and its controller computer didn't hear from the main system in X seconds, it would just stop, send a message about what it was doing, and then wait for an instruction. Presumably this was to prevent all heck from breaking loose if the main system went down or was not responsive.
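A rough sketch of that kind of communications watchdog (the names, timeout value, and structure are invented here, not taken from the actual system): if the main system goes quiet for too long while the motor is running, the controller stops, reports what it was doing, and waits.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define COMMS_TIMEOUT_MS 3000u   /* the "X seconds"; the real value would come from the requirements */

    /* Stand-ins for the real motor and bus interfaces. */
    static void motor_stop(void)           { puts("motor stopped"); }
    static void report_status(int command) { printf("was executing command %d, awaiting instructions\n", command); }

    struct motor_ctrl {
        uint32_t last_msg_ms;   /* timestamp of the last message from the main system */
        bool     running;
        int      last_command;
    };

    /* Called periodically from the controller's main loop or a timer tick. */
    static void comms_watchdog(struct motor_ctrl *m, uint32_t now_ms)
    {
        if (m->running && (now_ms - m->last_msg_ms) > COMMS_TIMEOUT_MS) {
            motor_stop();                     /* fail to a safe state */
            report_status(m->last_command);   /* say what we were doing */
            m->running = false;               /* then wait for a fresh instruction */
        }
    }

    int main(void)
    {
        struct motor_ctrl m = { .last_msg_ms = 0, .running = true, .last_command = 7 };
        comms_watchdog(&m, 5000);   /* main system has been silent for 5 s: trip the watchdog */
        return 0;
    }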
This was all spelled out in long requirements documents.
I sometimes wonder if our cars' controlling computers are doing this...
Little about being a biker, programmer, and aspiring tree-shade mechanic reassures me about the safety of these systems. The internals of a bike are much more exposed to abuse, bikers are known to take a spanner to their machines, and motorcycle repair workshops are a much more informal industry (at least here in India). What happens to a bike that is not subject to regular maintenance? On the other hand, I have always marvelled at how mechanical systems like motorcycles are usually built with some sort of graceful failure in mind - in a lot of cases a motorcycle will warn you about a faulty component before it fails catastrophically. I assume the people who designed these electronic systems would have kept that in mind while designing them (although stuff like Toyota's unintended acceleration does not inspire confidence).
And what happens in the event of a catastrophic failure? A car locking up at speed is still dangerous, but there is room for error. If the front tire of your motorcycle locks up at speed, the odds of you walking away from the incident are not high.
Don't get me wrong, these systems DO SAVE more LIVES than they could possibly take away in the long run, but I am still disconcerted by the whole thing.
Another thing that worries me is that these same companies are also working on self-driving cars.
The worst of it is that everyone pretty much jumped into the race after Google. It also doesn't look like existing solutions work on a real-time system. I'm a bit worried about being hit by a car because it was running a garbage collection process and did not react quickly enough.
Seemed like a funny combination of lights to blink on.
Of course there's a certain level of destructive testing you can't do live, but that you really ought to do on your development system, load testing being a simple example. It behooves the wise developer to keep these quite separated in the code. :)
The errors should then be logged and the program should be restarted by a watcher process.
Here's an example of how you can both log output and e-mail errors if a process crashes, using a startup script (Linux, Ubuntu):

exec sudo -u user /bin_location /program_path 2>&1 >>/log_path | tee -a /error_log_path | mail -s email_subject email@example.com

(Redirections are processed left to right: stdout ends up appended to /log_path, while stderr goes through the pipe, where tee appends it to /error_log_path and mail sends it on.)
When the error gets thrown in your face, there's a higher chance that it gets fixed.
But this also has its setbacks. Losing the whole state can be really bad.
This sounds really strange to me. So may I ask why? I find that recursion - most of the time - helps shorten and clarify the code. Also, doesn't recursion make induction proofs trivial?
Arguments about the clarity and correctness of the code often ignore the possibility of stack overflow. Most naive implementations of DFS will hit the stack limit given trees that are all one long path from a single root to a single leaf.
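To make the failure mode concrete, here's a sketch in plain C (not from any particular codebase): the recursive walk needs one call-stack frame per level of depth, so a degenerate path-shaped tree of a few hundred thousand nodes will typically blow the default stack, while the iterative version with an explicit, bounded stack fails detectably instead of crashing.

    #include <stddef.h>

    struct node { int value; struct node *left, *right; };

    /* Recursive DFS: short and clear, but one stack frame per level of depth.
     * On a tree that is one long path from root to leaf, depth equals node
     * count, which is usually enough to overflow a default-sized stack. */
    long sum_recursive(const struct node *n)
    {
        if (n == NULL)
            return 0;
        return n->value + sum_recursive(n->left) + sum_recursive(n->right);
    }

    /* Iterative DFS with an explicit, bounded stack: same traversal, but
     * running out of room is detected and reported. (For the path-shaped
     * worst case above, this stack never holds more than one pending node.) */
    #define MAX_PENDING 4096

    long sum_iterative(const struct node *root)
    {
        const struct node *stack[MAX_PENDING];
        size_t top = 0;
        long total = 0;

        if (root != NULL)
            stack[top++] = root;
        while (top > 0) {
            const struct node *n = stack[--top];
            total += n->value;
            if (top + 2 > MAX_PENDING)
                return -1;              /* bound exceeded: fail detectably */
            if (n->left)  stack[top++] = n->left;
            if (n->right) stack[top++] = n->right;
        }
        return total;
    }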
1. Rule: Restrict all code to very simple control flow constructs – do not use goto statements, setjmp or longjmp constructs, and direct or indirect recursion.
Rationale: Simpler control flow translates into stronger capabilities for verification and often results in improved code clarity. The banishment of recursion is perhaps the biggest surprise here. Without recursion, though, we are guaranteed to have an acyclic function call graph, which can be exploited by code analyzers, and can directly help to prove that all executions that should be bounded are in fact bounded. (Note that this rule does not require that all functions have a single point of return – although this often also simplifies control flow. There are enough cases, though, where an early error return is the simpler solution.)
This is rule 1 of 10, so he apparently feels strongly about "banishing recursion." Gerard was formerly at Bell Labs and is also a fellow of the ACM and a member of the NAE.
In most systems lots of really important stuff is allocated at the bottom of the heap. It's very easy for a clobbered global flag (yes, hissss, globals, these are very constrained computers we're talking about here) to cause a system to have its shit get real at an alarming rate.
Also, many of these systems were hard real time, as in "if we don't respond in under 30ms something expensive goes bang", and again recursion can cause problems with that. Lots of these systems are interrupt driven and have no garbage collection or threading, so you can't just pre-empt them in that event; by the time you spot the problem you've blown through your deadline and something went bang.
On the plus side you had a pretty good idea about how long the max processing would take (and avoid the timeouts and aforementioned "bang"), as the OS couldn't interrupt us. Certain system calls couldn't be made while in what we called "soft real time". Memory allocation was done upfront.
The process control of that system was interesting. You could assign processes to processors or groups of processors and then give those groups a scheduling method.
I haven't seen anything like it in the years since I left.
If you assume infinite precision arithmetic and a very friendly environment for inputs, recursion always looks simpler, but by the time you clean it up to handle real-world issues, the non-recursive version instead looks simpler.
It's too easy to write recursive end conditions along the lines of "if x == 42" when your helpful floating point routine somehow mysteriously rounded x to 42.00000001 so it'll never be equal, or "no (supposedly) UTF-16 encoded string would ever have an odd number of bytes" (even though I have no control of the source and the source is known to occasionally be insane), or at least that's how I remember it. I've run into both. It's not funny at the time, but in retrospect it's usually fairly hilarious.
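A contrived sketch of the floating point version of that trap (the 42 and the 0.1 step are made up purely for illustration): the "obvious" end condition compares floats for exact equality, so a value that lands at 42.00000001 sails past the base case and the recursion only stops when the stack gives out.

    #include <math.h>

    /* Broken end condition: if rounding leaves x at 42.00000001, the base
     * case never fires and the recursion runs until the stack overflows. */
    double seek_bad(double x)
    {
        if (x == 42.0)                 /* exact equality on a floating point value */
            return x;
        return seek_bad(x - 0.1);
    }

    /* Safer: terminate on a condition that must eventually become true, and
     * treat "close enough" as equal rather than insisting on an exact bit
     * pattern. */
    double seek_ok(double x)
    {
        if (x <= 42.0 || fabs(x - 42.0) < 1e-9)
            return 42.0;
        return seek_ok(x - 0.1);
    }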
Personally I think its harder for people to understand concurrency issues WRT recursion, but I'll probably just get flamed for that one. I feel more people have "leveled up" with concurrency and non-recursive code and functional style programming than have leveled up to include recursion in that mix. Imagine two (three?) concurrent recursive algos fighting each other over one data structure.
If you are careful about test vectors, you can pseudo-exhaustively prove that an iterative dual to a recursive algorithm is equivalent.
This is very convenient in most cases, but could hide the fact that a direct recursive call will overflow your stack.
This isn't apparent and would test out OK until someone makes a small, seemingly insignificant change that the compiler can't do tail call optimization on, and all of a sudden things fail.
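A sketch of that "seemingly insignificant change" (compiler behaviour varies, so treat the optimization claim as "usually, with optimization enabled", not a guarantee of the C language):

    /* Tail call: the recursive call is the very last thing the function does,
     * so an optimizing compiler can usually reuse the stack frame, making it
     * effectively a loop. */
    long sum_to_tail(long n, long acc)
    {
        if (n == 0)
            return acc;
        return sum_to_tail(n - 1, acc + n);
    }

    /* One small change: the addition now happens AFTER the recursive call
     * returns, so the call is no longer in tail position, the optimization no
     * longer applies, and a large n is back to one stack frame per step.
     * Same mathematical result, very different behaviour. */
    long sum_to_plain(long n)
    {
        if (n == 0)
            return 0;
        return n + sum_to_plain(n - 1);
    }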
I can only imagine the feeling of satisfaction you and your colleagues felt at that moment.
My favorite was one where we had an entire test harness written in Python that could completely control the operation of the device being tested in a way that resembled human input. Code was first written in Ada against a monolithic requirements document, while testers wrote their standard test cases against the same document. After the exhaustive amount of testing that took place by developers, testers themselves had the freedom to create contrived test cases that might have escaped the attention of devs (What if we just turn the machine on and off 10 times, because why not?).
This had the advantages of a formal software process as well as the ability to exploit human creativity. It also led to me losing a bet that a doppler radar can't be fooled with an empty potato chip bag.
* Lots of manual testing. While we did unit testing and some automated integration testing, most defects were found using exhaustive manual testing by trained engineers.
* Randomized UI testing. Used UI automation to exercise the UI with various physical configurations of the system. Would often run this overnight on many systems and analyze failures every day.
* Extensive hazard analysis. Basically, we wrote down everything that could possibly go wrong with the system (including things like gamma radiation), estimated the likelihood and harm, and then listed mitigations. The entire system could run safely even if there was full power failure. "Fail safe"
* Detailed software specifications, each of which was linked to manual test cases. Test cases were signed off when executed.
* Animal testing for validation. We went to a vet school and put a bunch of dogs under and brushed their teeth.
* Limited release for production. We would launch the system at one or two hospitals and monitor it for a few weeks before broader release.
What I mean is, does it continue applying anesthetic in the event of a power failure or does that stop entirely?
How about, we require that the source be open, and if it's too convoluted for the hospital's respected experts to check, then it fails inspection?
Hardware uses many standard parts and materials, and similarly, the software could use a few plain-simple-standard libraries, like libc (but not the floating point functions), zlib, libpng.
The Therac 25 case was just incompetence. The vendor was told about the problem, but was in denial, then later supplied a hardware fix which didn't fix the problem. The problem had to be thoroughly investigated and proven by a doctor and operator over many months. Why couldn't the vendor have investigated more thoroughly themselves, in a week or so? Why weren't they more careful about race conditions? (The problem was triggered by a human able to type into the interface too fast. An actual human.)
The combined utility of hardware and software interlocks is that they're complementary:
Sometimes it's easy to specify an interlock in the language of hardware: Never, under any circumstances, should it be possible to slew an avalanche-control howitzer to point at permanent structures; let's use a steel pipe to block the barrel from traversing beyond safe limits.
Sometimes it's easy to specify an interlock in software: Never, under any circumstances, should a rocket launch unless every desk at mission control has authenticated their assent with the main control system.
When interacting with the real world, real-world interlocks are handy, but they're hardly sufficient to guarantee safety. Nothing is.
The Therac-25 is a famous comp.risks cautionary tale. Among the many, many design misfeatures (if you haven't come across it, it's worth a read) was the one that killed people:
It was capable of providing two kinds of radiation therapy; electron beam radiation and X-ray radiation. It worked by having an electron beam generator which could be operated at either high power or lower power. Low power was used directly. High power was only used to irradiate a tungsten target which produced X-rays. (I'm simplifying here.)
You can probably guess what went wrong; people were directly exposed to the high power electron beam. Several of them died.
The obvious interlock here (which apparently previous versions had) was to have a mechanical switch which would only enable the high-power beam when the tungsten target was rotated into place. No target, no high power. Simple and relatively foolproof (although it's possible for interlocks to go wrong too).
The problems in the Therac-25 went a lot further than just the bad design of the target selection, which did have a (badly designed) interlock. It checked that the rotating beam target was in the correct position to match the high/low power setting (and NOT the third "light window" position, which has no beam attenuator at all).
While many design choices contributed to the machine's problems, you could probably say that two big design failures led to the deaths associated with Therac-25. One was this interlock, which failed if the target wasn't exactly in place (there was no locking mechanism, either, just a friction stop). If the target was turned slightly, the 3 micro-switches would sense the wrong pattern (a bit shift)... which was the pattern for one of the OTHER positions.
There was also a race condition in the software that would turn on the beam at a power MUCH higher than was ever used. This race was only triggered when you typed in the treatment settings very quickly, which is why the manufacturer denied there was a problem: when they tried to recreate the bug by carefully - that is, very slowly - following the reported conditions, it never failed.
Therac-25 is an incredibly powerful lesson in what we mean by "Fail Safe", and why it is absolutely necessary to have defense in depth. Fixing the target wouldn't have fixed the race condition power-level bug. Fixing any of the software wouldn't have fixed the bad target design that could be turned out of alignment. Oh, and they had a radiation sensor (which could shut off the machine as another independent layer of defense)... but they mounted it on the turnable target, so the micro-switch problem allowed the sensor to be moved away from the beam path.
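To make "defense in depth" slightly more concrete, here's a hedged sketch (the mode names, switch patterns, and function are invented for illustration, not how the Therac's actual logic worked): the software cross-checks the commanded mode against the sensed turntable position and an independent radiation monitor, and refuses to energize the beam if anything disagrees. As noted above, even this isn't enough if one valid switch pattern can mechanically shift into another valid pattern, or if the "independent" sensor rides on the same moving part.

    #include <stdbool.h>
    #include <stdint.h>

    enum mode { MODE_XRAY, MODE_ELECTRON, MODE_FIELD_LIGHT };

    /* Expected 3-microswitch patterns for each turntable position
     * (values are purely illustrative). */
    static const uint8_t expected_pattern[] = {
        [MODE_XRAY]        = 0x5,   /* tungsten target in the beam path */
        [MODE_ELECTRON]    = 0x3,   /* scanning magnets in the beam path */
        [MODE_FIELD_LIGHT] = 0x6,   /* mirror in the beam path: no beam allowed */
    };

    bool beam_enable_allowed(enum mode commanded, uint8_t sensed_pattern,
                             bool radiation_monitor_ok)
    {
        if (commanded == MODE_FIELD_LIGHT)
            return false;                               /* never fire in the light position */
        if (sensed_pattern != expected_pattern[commanded])
            return false;                               /* turntable not where we think it is */
        if (!radiation_monitor_ok)
            return false;                               /* independent sensor disagrees: fail safe */
        return true;
    }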
The really telling thing, though, is how the previous model acted. It was not software controlled, and was an old-style electromechanical device. It turns out the micro-switch problem existed there as well (among other problems)... and it would blow fuses regularly. Which was yet another layer of safety. It turns out that when they upgraded it to a software-based control system, they got cheap and took out all those "unnecessary" hardware interlocks and "redundant" features. There is a lot of blame to go around, but this is where I put most of the responsibility. You never assume one (or even a few) safety feature will work - the good engineer assumes it will all break at any moment, and makes sure that it will still Fail Safe.
> (although it's possible for interlocks to go wrong too)
If there is one lesson to learn from the Therac-25, this was it. Things break, mistakes happen, and when you're building a device that shoots high-energy x-rays at people, you need to assume that everything did go wrong, and make sure the rest of the device can safely handle that situation.
Good. When a fuse blows, it shows something is wrong, and needs fixing. Replacing the fuse with a nail or something else that doesn't blow is a sure-fire way to set the thing on fire. Bad enough for a desk-lamp, a little worse for radiotherapy machine.
Sounds like people were irritated by fuses blowing, and decided to simply short-circuit the fuses instead.
Obviously, something was still very wrong. User error (or other bugs? I'm not sure) in the older hardware and the infamous race condition in the software-controlled Therac-25 were causing the beam to turn on at some shockingly high power. The better design of the older models saved people's lives by simply blowing fuses when the power went too high.
You could, perhaps, blame the poor communication between the hospitals and the manufacturer, because the fuse problem should have caused a bit of a panic among the engineers who designed the machine.
Kinda makes me worry about how many times I've microwaved the meat-n-two-veg just a bit.
Serious question: where do I get a job like this? It's my dream way of programming professionally.
In most user-facing software in the Internet age, the reverse is true: we do not really know what the software should do, but the penalty for a bug is not great, so it is not worth expending enormous effort to be bug free; instead it is better to expend great effort to be nimble and find out just what it is the software should be doing to begin with.
Very different environments yielding very different methodologies. I can't say one is better, just de gustibus non est disputandum.
The coding process is slower than most people are used to and can become frustrating.
Is my feeling about every industry I've worked in.
The stuff that runs telecoms (mostly billing side) particularly is the stuff of nightmares.
As other people have said here, nobody wants to touch it. Developers would often limit themselves to fixing just a small portion of the code even though they thought the overall system could be improved in many ways, for fear of breaking something, causing a few million dollars of damage and getting fired. There was no assurance that any part of the systems should work like this or that... only some vague expectations.
You're right, that's a system that should be built from scratch with that kind of concern, but unfortunately it's not.
The rewrite doesn't have to be complete; it can (should) be done in pieces of course.
The big issue is duplicating the system, with all its edge cases, while it is still morphing in production; it often feels like trying to paint a moving bus.
I work in automotive, so it is governed by processes such as ASPICE and ISO 26262.
We are currently porting code over from C++ to a C# system with parallel computation. The current system has been flying for a long time but has no testing and is tied to a bad UI. So we are re-writing.
That said, accuracy is number one. We have a pretty solid method for testing so far. We now have some robust input scenarios, and we know that we want to get a specific output. So we are able to do fairly robust automated "regression" testing. If the numbers don't match then we have an issue, and we have to fix that before moving on.
After every validation that the new code gives us acceptable margins of error we wrap it up with unit tests so that we can then modify the code to try to optimize. Our testing is integrated from the highest level of the code down (I know backwards) but that's how we know we can validate the input.
We have a lot of testing and a long schedule. If this weren't critical software we'd have a much shorter turn around on what we are writing. We also work very closely with subject matter experts on every change we make. We have a guy who's been working with this software (and the subsequent theory) for 20 years. He's open to change, but he also validates everything so we don't accidentally change the output when we're optimizing.
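A hedged sketch of what that kind of numeric regression check can look like (the names, values, and tolerance are invented, and it's in C rather than their C#): run the new implementation against recorded scenarios and fail the run if any output drifts outside the accepted margin of the reference output.

    #include <math.h>
    #include <stdio.h>

    /* One recorded scenario: an input and the reference output produced by
     * the trusted legacy implementation. Values are placeholders. */
    struct scenario {
        double input;
        double expected;
    };

    /* Stand-in for the rewritten analysis code under test. */
    static double new_model(double input)
    {
        return input * input;
    }

    /* The "acceptable margin of error" agreed with the subject matter experts. */
    #define REL_TOL 1e-6

    static int run_regression(const struct scenario *cases, int n)
    {
        int failures = 0;
        for (int i = 0; i < n; i++) {
            double got = new_model(cases[i].input);
            double err = fabs(got - cases[i].expected) /
                         fmax(fabs(cases[i].expected), 1.0);
            if (err > REL_TOL) {
                printf("case %d: expected %.12g, got %.12g (rel err %.3g)\n",
                       i, cases[i].expected, got, err);
                failures++;
            }
        }
        return failures;   /* nonzero blocks further optimization work */
    }

    int main(void)
    {
        const struct scenario cases[] = { {2.0, 4.0}, {3.0, 9.0} };
        return run_regression(cases, 2);
    }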
...and what are you running it on? I wasn't aware of any embedded operating systems which supported C#!
Many ARM embedded systems can support C# with the open source .NET Micro Framework (http://www.netmf.com), which doesn't require any OS and was originally developed for Microsoft's SPOT watch. I haven't used it myself, and I agree with you that it doesn't sound ideal at first for realtime apps, but non- or soft-realtime embedded applications are common too.
Performance isn't the only thing we are considering. We want to get improved performance, but our old analysis code was extremely hard to maintain, so that went into the decision as well. Honestly, I'd probably pick another language, but I wasn't on the project when it started.
I personally would have liked to do this with F# because of how functional it is at its core, but that's cuz we have a lot of Microsoft expertise in house.
Also, one thing with the engineering apps is that anything where engineers (not software engineers) don't have to learn a new language is going to be an easier sell.
Here's what I wrote about that in a blog post on Antifragility and SW development:
At the end of the book, there is a chapter on ethics that Taleb calls “skin in the game”. To have skin in the game, you should share both in the upside and downside. Taleb quotes the 3,800 year old Hammurabi’s code: “If a builder builds a house and the house collapses and causes the death of the owner – the builder shall be put to death”. It is interesting to view this from a software development perspective. I have never worked on software where people’s lives were in danger if the software failed, but I would not be willing to submit to Hammurabi’s code if I did. But I think a little less extreme form of skin in the game is actually very good. Being on call for example. If the software you wrote fails, you may get called in the middle of the night to help fix it. I have been on call at most of the places I have worked in the past, and I think it has a lot of benefits. It gives you an incentive to be very thorough, both in development and testing. It also forces you to make the software debuggable – otherwise you yourself will suffer. Another way of introducing skin in the game is dog-fooding – using the software you are developing in your daily work. I have never worked on software that we have been able to dog-food, but I think that is another great practice.
And yet you can still measure one value in the metric system and another in English units and drill a smoking hole in Mars. It was sort of striking to read Charles Fishman's statement about the software being bug free, followed immediately by the supporting fact that the last three versions had one bug each. If they had one bug, how are you 100% sure they didn't have two?
I bet they required the guy who delivered the rocket fuel to sign something saying it contained no impurities, the guy who delivered the external tank to sign something saying it did not leak, etc... why should the software guy be special?
Yeah I know. We're special. But the world doesn't always see it that way.
In the durable goods world, you don't pretend things are perfect. Failure modes are designed and disclosed, replacement of parts is expected and made reasonable, tolerances are marked, failure rate metrics like MTBF are known, and as a customer you choose the price-quality tradeoff that makes sense for you.
I just wish consumer products were also sold this way. Instead we pretend every product is awesome and act surprised when things break.
I always thought the whole thing was dumb - of course there's a limit. Or are we supposed to believe we can push megabytes/s nonstop all month? I'd rather have them just tell me what the limit is and what happens when you go over it than pretend it's unlimited. And stop having the tech press act like the sky is falling when they discover that the unlimited plan actually has a limit.
At the end of the demonstration, the trash can is full of discarded water bottles.
I think you're being overcritical here. It's really really really difficult to reach 100% in any real form of measurement. So I think when they say "bug free", they probably mean the chance of a bug is below some threshold of probability. The famous six sigma rule comes to mind.
I do grant you that this is not specified explicitly in the article, but they do say: "each 420,000 lines long-had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors."
No doubt, and that was in fact my point. I think most of us would be very reluctant to use the phrase "bug free," and in this case his statement was obviously meant to be in stark contrast to that reluctance.
For example, after buying a lottery ticket, I can pretty comfortably say that I'm not going to win. Sure, there is a chance, but is it really going to happen? No.
As another example, if I put down a waterbottle on my bike seat, there's a 1% chance of the water spilling onto the sidewalk. At this point, I'm also comfortable saying that it's not going to happen. However, if the waterbottle has a 1% chance of exploding when I put it down, I don't think I'll be comfortable saying that it won't happen. The chance of that must be much much lower before I can accept it.
I would like to give the person who said that the benefit of the doubt. I'm sure they understand the implication of bug free, but it's likely just easier to say that rather than explaining stats/tradeoffs to a journalist.
They're a specialty system for writing code (and mathematical proofs) where every possible system behavior for a given range of inputs can be examined for safety (outputs within allowed ranges with no unexpected behavior) and for liveness (the expected progression from one output to another).
We know how to write software that comes arbitrarily close to perfection. But as defects asymptotically approach zero, cost skyrockets.
The interesting question is what technologies can bend that cost/quality curve.
On the flip side, the cavalier attitude of developers on the very, very low end of the curve (where huge quality improvements can be had very cheaply) towards those cheap practices also frustrates me. What do you mean you can't be bothered to run the automated test suite that I have handed you on a silver platter on your build server? Are you serious? How can you pass up something so cheap, yet so powerful? And I don't just mean, how can you not be bothered to set it up, etc... I mean, as a professional engineer, how can you justify that?
As devil's advocate, why not just run this for me (e.g. on every commit/every push)?
Much like the web usability ethos of "don't make me think": why make me work?
The lower the barrier to testing - ideally zero, it just happens without the dev having to do anything - the more testing will happen.
I don't often get the chance to set things up this way, but when I do, each dev works in their own git branch, and sends a pull-request with their changes. The test server(s) then run the complete test-suite on the branch, and either note the PR with "Tests passed" or emails the dev with "Tests failed" and the reasons. Devs don't need to think about running tests, reviewers/release managers don't need to even consider PRs until the "Tests passed" message shows up…saves time and effort for everyone, and improves code quality. The cost is simply the initial setup time.
In my real-life experience with a multiple-team environment, which is where the question came from, my running the tests doesn't do any good if you're going to consider it "my" server and simply ignore the results.
The key point here isn't a technical one. The key point here was: as a professional engineer, how can you justify not taking such a great bang/buck quality option that will far, far more than pay for itself? Explaining how I can be a professional engineer on your behalf more than misses the point.
I am 90% sure that actually taking any organisation and committing to good known process will raise the game by orders of magnitude - so technologies supporting and enforcing said process will be of benefit
And I think the process looks like this
1. Written requirements up front
2. Total isolation / integration points defined and contractually enforced
3. Test harnesses built first
4. Performance and event metrics built in
#2 and #3 might be feasible, but they will require a massive shift of perception across most stakeholders. It's a politics game, and you will need the perennial support of a very influential sponsor to push through the feet dragging phase.
#4 might be easier to sell, but it still requires time and effort to implement. In a sense, this is ultimately also a political matter.
It absolutely has not gone out of style in avionics software engineering. As a person who writes software for avionics, I can say that extensive design reviews at every step combined with rigorous testing is exactly how we build software. That's how it's done at every avionics software company I've ever worked at (3 so far). Formal methods are generally still too cutting-edge and complicated for many people in this industry.
So maybe Google doesn't bother with design reviews, but those of us writing life-or-death software definitely do.
That machine is much easier to code than a space shuttle obviously, but I still wondered about it. The tech has to be rock solid. Even one malfunction could cause so much despair in a family, and could also cost your company millions.
I've been told by my cardiologist (and engineers working for the manufacturer) that it doesn't "fail safe" if the battery level drops too low to keep the device running (which is inevitable if it isn't replaced after 7-8 years, but it can and sometimes does happen prematurely and without warning).
In that situation, not only can it suddenly become unable to correct an arrhythmia (as expected), it could actually cause one all by itself, or pace above 200BPM for no reason.
No one I've talked to in the healthcare industry seems at all surprised about this for some reason. They just started monitoring it more often the closer it got to the "replace me now" indicator level.
Each developer got between 1/2 and 3/2 QA people, in addition to dedicated QA engineers. You had to submit detailed test plans for features. Surprisingly, the company didn't expend much effort on unit tests -- they were there, but not heavily emphasized.
One paper on the topic talks about the Ford Pinto fuel system design: http://users.wfu.edu/palmitar/Law&Valuation/Papers/1999/Legg...
The GM ignition-switch recall also sparked a similar debate: http://en.wikipedia.org/wiki/2014_General_Motors_recall
So it's not uncommon that economics outweighs risk-to-life in a lot of businesses.
And when that software disagrees…but the plane happens to be at 40,000 feet? The plane just stops running until the component is replaced?
I don't know too much about plane hardware, but I scuba-dive a rebreather which has critical life-support electronics.
It has two independent computers, and three O2 oxygen-pressure cells. The 3 cells report their reading of O2 pressure to both computers. Both computers simultaneously display the pressure on independent displays. One computer is primary (active, controlling the O2 pressure), whilst the secondary is display-only.
Both computers use a majority-rule…the two oxygen cells with the closest value win, whilst the third is ignored. This could potentially be fatal - two failing cells can report incorrect pressures and win the vote - so rebreather divers are also taught manual techniques to validate the computer readings (such as a diluent flush which is expected to produce a known predictable reading).
So there are a few techniques beyond simple a/b testing: best of 3 (or more, if available); independent circuits (ideally designed + built by independent manufacturers); manual techniques to give human verification of the data.
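For the "best of 3" part, a minimal sketch of the closest-pair vote (the general technique only, not the rebreather's actual firmware): the two readings that agree most closely win and are averaged, and the outlier is ignored. As noted above, two cells failing the same way can still out-vote the healthy one, which is exactly why the manual cross-checks matter.

    #include <math.h>

    /* Closest-pair voting over three sensor readings: the two values that
     * agree most closely "win" and are averaged; the third is treated as the
     * outlier. Two sensors failing identically will still out-vote a healthy
     * one, so this is only one layer of defense. */
    double vote_best_of_three(double a, double b, double c)
    {
        double ab = fabs(a - b);
        double ac = fabs(a - c);
        double bc = fabs(b - c);

        if (ab <= ac && ab <= bc) return (a + b) / 2.0;
        if (ac <= ab && ac <= bc) return (a + c) / 2.0;
        return (b + c) / 2.0;
    }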
Whilst I certainly don't have extensive knowledge of real-time/safety-critical systems, it's clear that there are a lot of techniques, processes and procedures that we wouldn't necessarily be aware of in unrelated tech (e.g. web-dev) that do directly relate to that subject, and might well solve many of the scenarios we come up with.
From what he said they have 5 systems: two running software from vendor A, two from vendor B, and another running software from vendor C as a failsafe. A quorum has to agree; if any system doesn't, it gets pulled out of service and analyzed to find out why its output wasn't the same. If you get down to one there is still a mechanical failover, but at that point you're already bringing things down. Three at any time was considered a critical failure; two at once would be full stop time.
I'd presume planes are similar. I know the space shuttle did the same at least so I assume the technique is common.
(Just like multiply redundant hardware won't help you if there's a design flaw and they all fail at once.)
And the compiler would have to be inspected too.
This part is kinda cool because there's also always been a movement toward this for open-source/commercial security software, at least for the kernel. Of course, we have no such thing today (as in, I doubt anyone here runs such a system in production), but the interest is there.
Corporate video by people who do this, explaining it: https://www.youtube.com/watch?t=116&v=YpxPAuHNpdM
SpaceX do HIL tests with the Falcon 9 & Dragon spacecraft.
Feels sad that a lot of lessons learned are getting lost along the way...
I think we should beware of adding layers of "silver bullet" technologies that promise to fix the last crisis, but increase the friction of development.
To illustrate what I mean, PLCs are normally programmed in ladder logic. From that level, I don't think you can crash the machine or corrupt memory. So the risks are limited to the "operating system" if you will, which can be more mature and tested than the "application".
For example, if all trains run with minute-level accuracy at every point, then there won't be any collisions by mistake.