Pairing like that would find this kind of error. It is also believable that mainframe shops would be willing to cut their throughput in half to get that level of reliability.
I was completely unaware of it, even in a research setting. I did know about things like ECC RAM, RAID, Chipkill RAM, and algorithms for continuously checking results, including some implemented in hardware to isolate faulty CPUs, but they all added extra hardware or software, and none reused existing commodity cores. It's so simple that of course it should be a thing.
This Wikipedia article has a pretty good laundry list of the approaches different companies have taken to lock-stepping and related techniques: https://en.wikipedia.org/wiki/Lockstep_%28computing%29 . It's also one of the reasons this class of machine is so expensive. Each mainframe core would be two cores in a "normal" system, and mainframes also have logic to decommission a CPU without a shutdown by migrating its workload to a spare processor.
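For anyone who wants the idea in code form, here is a rough software sketch of pairing plus failover to a spare. It is only an illustration under my own assumptions, not how any real mainframe implements it: `Core`, `LockstepPair`, `LockstepMismatch`, and `run_with_spares` are made-up names, and the "fault" is simulated by flipping one bit of the result. Note that a pair can only tell that the two results disagree, not which core is wrong, so the sketch retires the whole pair.

```python
from typing import Callable, List, Tuple


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree on a result."""


class Core:
    """Hypothetical stand-in for a physical core."""

    def __init__(self, name: str, fault: bool = False) -> None:
        self.name = name
        self.fault = fault  # simulated hardware fault, for illustration only

    def run(self, fn: Callable[[int], int], x: int) -> int:
        result = fn(x)
        # A faulty core silently flips the low bit of its result.
        return result ^ 1 if self.fault else result


class LockstepPair:
    """Two cores that execute the same work and compare their outputs."""

    def __init__(self, a: Core, b: Core) -> None:
        self.a = a
        self.b = b

    def run(self, fn: Callable[[int], int], x: int) -> int:
        ra = self.a.run(fn, x)
        rb = self.b.run(fn, x)
        if ra != rb:
            # Disagreement is detectable, but the faulty core is not identifiable.
            raise LockstepMismatch(f"{self.a.name}={ra} vs {self.b.name}={rb}")
        return ra


def run_with_spares(
    pairs: List[LockstepPair], fn: Callable[[int], int], x: int
) -> Tuple[int, List[LockstepPair]]:
    """Try each pair in order; retire any pair that disagrees and move on."""
    retired: List[LockstepPair] = []
    for pair in pairs:
        try:
            return pair.run(fn, x), retired
        except LockstepMismatch:
            retired.append(pair)  # "decommission" and migrate to the next pair
    raise RuntimeError("no healthy pair left")


if __name__ == "__main__":
    pairs = [
        LockstepPair(Core("c0"), Core("c1", fault=True)),  # this pair disagrees
        LockstepPair(Core("c2"), Core("c3")),              # healthy spare pair
    ]
    result, retired = run_with_spares(pairs, lambda v: v * v, 7)
    print(result, [f"{p.a.name}/{p.b.name}" for p in retired])
```

The cost is visible right in the sketch: every piece of work runs twice, which is the "cut throughput in half" trade-off mentioned above.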
I only spoke with such confidence in my previous post because I thought I had worked on high-end machines. I worked with a number of Sun and Oracle machines with CPU failure detection and CPU hot swapping, but this feature was always flaky. I am eager to learn about what I missed.
Could you post a link to some reading on this?