
No, it indicates that the problem domain is sufficiently dangerous that the risk of leaving a known error in place must be balanced against the risk of a fix introducing a different, unknown error. There were ~500,000 787 flights in 2017 with an average of ~200 people per flight. The 737-MAX problems resulted in 385 fatalities, roughly two full crashes' worth, so if a fix had a 1 in 250,000 per-flight chance of causing a different error that could result in a fatal crash, it would be worse than the 737-MAX problems. Do you have confidence that the systems you have worked on have processes in place to guarantee less than a 1 in 250,000 chance that a fix would cause another error? If not, are you aware of any organization whose development practices you have first-hand knowledge of that you are confident could give such a guarantee? That is the risk analysis that must be done when making a fix.
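
A rough Python sketch of that break-even arithmetic (the flight count, passenger count, and fatality figures are just the estimates above, not official statistics):

  # Back-of-the-envelope: how rare must a fix-induced fatal failure be
  # before it is no worse than the 737-MAX outcome?
  flights_per_year = 500_000        # approx. 787 flights in 2017
  passengers_per_flight = 200       # approx. average load
  max_fatalities = 385              # 737-MAX total across both crashes

  # 385 fatalities at ~200 people per flight is roughly two fatal crashes.
  equivalent_crashes = max_fatalities / passengers_per_flight   # ~1.9

  # Per-flight failure probability that would produce the same toll over
  # one year of 787 flights (rounded to 1 in 250,000 above).
  threshold = equivalent_crashes / flights_per_year
  print(f"break-even per-flight risk: 1 in {1 / threshold:,.0f}")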

To be fair, this somewhat overstates the requirements, since not all systems are critical and not all errors cause critical problems. In addition, the risk must be balanced against the alternative, in this case the risk of relying on a reboot being done every 51 days, so you would need to analyze the failure probability and possible consequences of the status quo and compare that against the possible failure modes of a software fix.

As an addendum to the risk analysis, the above analysis covered only one year of errors, on a per-flight basis. If you expect the 787 to fly for ~30 years, then the fix must not cause two crashes over those 30 years, which works out to a 1 in 7,500,000 per-flight chance. The average flight is ~5,000 km, which is ~4-5 hours per flight, for a total flight time of ~60,000,000 hours. A plane takes ~3 minutes to fall from cruising altitude, so the fleet-wide downtime budget is 6 minutes per 60,000,000 hours, a 1 in 600,000,000 downtime fraction. That is 99.9999998% uptime: 8 9s, 6,000x the holy grail of 5 9s availability in the cloud industry, and 60,000x the availability guaranteed by the AWS SLA (again, somewhat overstated, since you need correlated failures to amount to roughly 3 minutes of continuous failure, but that depends on an analysis of mean time between failures and mean time to resolution which I do not have access to).
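
The same arithmetic extended over a service life, again as a rough Python sketch (every number here is one of the estimates above):

  # Extend the estimate to a ~30-year service life.
  flights_per_year = 500_000
  years = 30
  total_flights = flights_per_year * years               # 15,000,000

  # Budget: no more than two fix-induced crashes over the whole period.
  per_flight_budget = 2 / total_flights                  # 1 in 7,500,000

  hours_per_flight = 4                                   # ~5,000 km at ~4-5 h
  total_hours = total_flights * hours_per_flight         # ~60,000,000 h

  # ~3 minutes to fall from cruising altitude, so two crashes correspond
  # to ~6 minutes of fleet-wide "downtime" over the whole lifetime.
  downtime_fraction = (6 / 60) / total_hours              # 1 in 600,000,000
  print(f"per-flight budget : 1 in {total_flights / 2:,.0f}")
  print(f"downtime fraction : 1 in {1 / downtime_fraction:,.0f}")
  print(f"required uptime   : {(1 - downtime_fraction) * 100:.7f}%")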




> .. if a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it would be worse than the 737-MAX problems.

That's an improper calculation. It's a slight nuance, but it makes orders of magnitude of difference. A better formulation is:

If a fix had a 1 in 250,000 chance of causing a different error that would result in a fatal crash then it would be worse than the 737-MAX problems.

And the converse:

If a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it could be worse than the 737-MAX problems.

The MAX's MCAS generated errors far more often than 1 in 250k, and a few of them resulted in crashes.


Yes. I was being slightly sloppy with my wording; I meant "would" and "would".


Well, what I mean is that perhaps such critical software should be carefully rewritten to rigorous architectural and QA standards so it doesn't have to rely on an ugly hack in the first place.


Rewriting software is not only costly and subject to breakage, but for some of these systems requires an absolutely monumental FAA recertification process.

The cost of recertifying software shouldn't be the sole reason not to rewrite something, but I imagine you've been part of a rewrite where not quite everything worked as intended, even with a lot of tests.

"The devil you know" very much applies to software that controls such life-critical functions and flying airplanes. If a pilot knows how to work around something, introducing something they may not know how to work around could be the difference between life and death.

Why did two 737-MAXes crash and the fleet get grounded? New systems were introduced that pilots didn't know how to address, and seemingly could not work around, despite the engineers who designed them not wanting that outcome.

A rewrite, even with the most rigorous architectural and QA standards, is not a panacea.


I agree, critical software should be written to a high quality standard. In fact, I will take it one step further and say that critical software must be written to an OBJECTIVE quality level that is sufficient for the problem at hand. If that level cannot be reached, where we are confident that the risk is mitigated to the desired degree, then the software should not be accepted, no matter how hard they tried or how much they adhered to "best practices". We do not let people build bridges just because they tried hard. They have to demonstrate that the bridge is safe, and, if nobody knows how to build a safe bridge in a given situation, then the bridge is NOT BUILT.

To then circle back to airplane software, the standards of original development are even higher than the one I stated above. There are ~10,000,000 flights in the US per year according to the FAA, and for at least the 10 years before the 737-MAX problems (I believe closer to 20), software had not been implicated in a single passenger air fatality. That means that in over 100,000,000 flights there were only two fatal crashes due to software (not even in the US, so we would actually need to include global flight data, but I will not bother with that since I am unaware of the count of software-related fatalities in other countries) for a total fleet-wide failure rate of 1 in 50,000,000, 7 9s on a per-flight basis. If we use the per-time basis I used above, there are ~25,000,000 flight-hours per year, so over 10 years that is 6 minutes in 250,000,000 hours, a 1 in 2,500,000,000 downtime fraction, or 99.99999996% uptime: 9 9s, 25,000x gold-standard server availability, 250,000x the availability guaranteed by the AWS SLA. Also note that with servers we can use independent replicas to gain redundancy, allowing failure probabilities to multiply (a 1 in 100 failure rate for each server means the chance of both failing at the same time, assuming independence, is 1 in 10,000), but the same does not apply to airplanes, since every airplane must succeed on its own.
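
For completeness, the same back-of-the-envelope style applied to that fleet-wide record in Python (approximate FAA figures, and the usual idealized independence assumption for the server example):

  # Fleet-wide record for the ~10 years before the 737-MAX accidents.
  flights_per_year = 10_000_000        # approx. US flights per year (FAA)
  years = 10
  total_flights = flights_per_year * years               # 100,000,000
  software_crashes = 2                                   # the two MAX accidents

  per_flight_rate = software_crashes / total_flights     # 1 in 50,000,000

  flight_hours_per_year = 25_000_000
  total_hours = flight_hours_per_year * years             # 250,000,000 h
  downtime_fraction = (6 / 60) / total_hours              # 1 in 2,500,000,000
  print(f"per-flight rate : 1 in {1 / per_flight_rate:,.0f}")
  print(f"uptime          : {(1 - downtime_fraction) * 100:.8f}%")

  # Servers gain availability from independent replicas: two servers that
  # each fail 1 time in 100 fail together (assuming independence) only
  # 1 time in 10,000. Airplanes get no such benefit; every airframe must
  # succeed on its own.
  p_single = 1 / 100
  print(f"replicated pair : 1 in {1 / p_single ** 2:,.0f}")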

The thing to understand is that the software problems we are seeing are not necessarily an indication that their standards are lower than those prevailing in the software industry and that they should adopt its practices. It could be, and likely is, that the OBJECTIVE quality level we require is extremely high and they have not been able to achieve it of late. This obviously does not excuse their problems since, as I stated above, they must reach the OBJECTIVE quality level we require; it is just an observation that maybe it is not because they are incompetent, maybe they are really, really good and the problem is just really, really, really hard.



