So those systems didn't fail when a bitflip happened?
> The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail.
The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy. (For context, I've written code for NASA, written a few proposals on making things more radiation hardened, and my PhD thesis was on a new class of error correcting codes - so I do know a little about making redundant software and hardware specifically designed to mitigate bitflips).
By claiming a bitflip didn't kick off the problems, and trying to push the cause elsewhere, you may as well blame all of engineering for making a device that can kill on failure.
So your argument is a red herring.
>On the whole, you fail to make a case that preventing bitflips is the solution to a problem
Yes, had those bitflips been prevented, or not happened, those fatalities would not have happened.
>Ya, I'm not buying that bitflips are a problem.
If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
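A minimal sketch of that exercise, assuming a plain Python process holding a buffer filled with a known pattern and rescanning it periodically. The names `count_flipped_bits` and `watch_for_bitflips` are made up for illustration, and on a machine with ECC RAM any corrected errors would never be visible to a program like this.

```python
import time

PATTERN = 0xAA  # alternating bits 10101010, an arbitrary choice of fill pattern


def count_flipped_bits(buf: bytearray, pattern: int = PATTERN) -> int:
    """Return how many bits in buf differ from the expected fill pattern."""
    # Fast path: every byte still equals the pattern.
    if buf.count(pattern) == len(buf):
        return 0
    # Slow path: XOR each byte with the pattern and count the set bits.
    return sum(bin(b ^ pattern).count("1") for b in buf)


def watch_for_bitflips(size_bytes: int, rounds: int, pause_s: float) -> int:
    """Fill a buffer, rescan it `rounds` times, and return total flips seen."""
    buf = bytearray([PATTERN]) * size_bytes
    total = 0
    for _ in range(rounds):
        time.sleep(pause_s)
        flipped = count_flipped_bits(buf)
        if flipped:
            total += flipped
            # Restore the pattern so the same flip is not counted twice.
            buf[:] = bytes([PATTERN]) * size_bytes
    return total
```

Something like `watch_for_bitflips(1 << 30, rounds=24, pause_s=3600)` would hold 1 GiB for a day. How many flips show up, if any, depends heavily on the hardware and environment.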
I guess you define something being a problem differently than I or the ECC ram industry do.
> So those systems didn't fail when a bitflip happened?
I didn't say that. I'm saying that the root cause (as in "root cause analysis") is not the bitflip. Designating the bitflip as the root cause is like analyzing your drunk driving accident and concluding that the root cause must be ethanol, rather than your drinking habits.
> The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy.
Of course, and I'm not actually arguing that adding in ECC is completely worthless to that effect, though it is close to worthless. Luckily, ECC is quite cheap, if not free, so throwing it in there makes sense.
However, suppose ECC increased the cost by several magnitudes: would it still be worth it? Obviously not. Redundancy alone reduces the probability of spurious failure by several magnitudes, and simply increasing redundancy would be far cheaper than adding in ECC.
> If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
My point is that either bitflips don't really matter, in case data integrity is not mission critical, or ECC doesn't actually solve the problem, in case data integrity is mission critical.
If you have solved the problem of data integrity through redundancy, then ECC doesn't make much of a difference anymore. If you haven't solved the problem, then ECC will only prevent a vanishingly small subset of disasters that are awaiting you.
> I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
I don't care how often it happens. I care about the odds of a bitflip causing an actual problem. If a computer crashes, that's okay, it'll reboot. If any data were to be corrupted, it would most likely happen at the disk level and not the DRAM level.
> I guess you define something being a problem differently than I or the ECC ram industry do.
Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.
> If you have solved the problem of data integrity...
As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.
> Redundancy alone reduces the probability of spurious failure by several magnitudes
ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.
Naive redundancy ignores almost a century of better methods from forward error-correcting codes. I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.
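To illustrate the overhead difference, here is a sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword carrying 4 data bits. Three full copies of the same 4 bits would take 12 bits. This is only an illustrative example, not the code any particular system discussed here uses, and the function names are made up.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode a 4-bit value into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code: list[int]) -> int:
    """Correct up to one flipped bit and return the original 4-bit value."""
    c = list(code)
    # XOR of the 1-based positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            syndrome ^= pos
    if syndrome:
        c[syndrome - 1] ^= 1
    data_bits = [c[2], c[4], c[5], c[6]]
    return sum(bit << i for i, bit in enumerate(data_bits))
```

Flipping any one of the seven bits of `hamming74_encode(x)` still decodes back to `x`. Two flips in the same codeword are miscorrected, which is why practical memory ECC typically adds an extra parity bit to at least detect double errors.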
>Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.
And we're done. If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems. Good luck.
> As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.
The actual problem is binary. You either solved it, or you didn't. ECC is "free", but it doesn't actually solve the problem. Actually solving the problem requires engineering.
Of course there's a probabilistic element to it, but the problem is to drive the probability of failure to "vanishingly small". The utility of adding or removing a vanishingly small constant to another vanishingly small constant is vanishingly small. This is what ECC does for you.
> ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.
ECC reduces the probability of spurious failure due to bitflips in DRAM by several magnitudes. However, spurious failure can occur for so many more reasons that the bitflip issue becomes a vanishingly small part.
> I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.
As you know, having worked for NASA, this is the right choice under certain circumstances. If there are lives on the line and you have a choice between "not solving a problem" and "a terribly expensive solution", you should go with the latter.
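For reference, the "multiple exact copies and having them vote" scheme can be sketched as a bitwise majority vote over three copies of a word. This is a minimal illustration with a made-up function name, not a description of how any real flight system implements voting.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return, for each bit position, the value held by at least two of the three copies."""
    return (a & b) | (a & c) | (b & c)
```

If one copy suffers a flip, the other two outvote it and the original word is recovered. If two copies are corrupted in the same bit position, the vote returns the wrong value, so this too only lowers the odds rather than eliminating them.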
> If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems.
ECC does not solve the problem of data integrity. If you actually solve the problem of data integrity, you will find that ECC becomes effectively redundant. Do we not fundamentally agree on this? If so, why not?
That's not to say ECC is entirely useless from an administrative standpoint. It makes DRAM bitflips one less thing to worry about. One less thing out of thousands of things. Commensurately, the cost of ECC in a given deployment, like its utility, is vanishingly small.