> Even in a high availability system, hardware is a minor contributor to system outages.
> By applying the concepts of fault-tolerant hardware to software construction, software MTBF can be raised by several orders of magnitude.
> Dealing with system configuration, operations, and maintenance remains an unsolved problem.
I have no experience with them myself, but over the years I've seen many mentions of how IBM mainframes are designed to keep running while components are modified or swapped out.
I also remember reading years ago about a research OS that journaled everything, so that you could supposedly pull the plug at any random moment and recover the running state with no trouble.
The stability of the mainframes may be misleading, though, since those older, isolated systems faced a different environment from today's networked ones.
I would postulate that for a general-purpose computer, software stability is inversely proportional to connectivity.
Something I'd also really like to see more often in various systems is a better config for enabling/disabling checks, similar to how log levels can be adjusted. That is, checks divided by overhead (expensive, noticeable, minor) as well as by software module, so you could enable all checks related to some particular subsystem without enabling expensive checks globally or having to enable each check individually.
Some people say that if a check is worth doing at all, it should always be enabled, but I think that overlooks the reality that some checks can be useful but expensive. For an example of such a check: whenever I implement a custom data structure, I find it useful to have assertions before and after each operation verifying that all invariants hold, such as the order of elements being preserved. That check is O(n), though, so you definitely don't want it running in production or even in most debug builds.
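As an illustration of that kind of check (a hypothetical sketch, not any particular codebase): a sorted container whose O(n) ordering invariant is asserted before and after each mutating operation, behind a flag that would normally be off outside of heavyweight debug builds.

```python
import bisect

CHECK_INVARIANTS = True  # expensive O(n) checks; would be False in production

class SortedList:
    """A list kept in ascending order by inserting into position."""

    def __init__(self):
        self._items = []

    def _check(self):
        # O(n) invariant: every adjacent pair must be ordered. Turning a
        # cheap O(log n) insert into an O(n) operation is exactly why this
        # check can't be unconditionally enabled.
        if CHECK_INVARIANTS:
            assert all(a <= b for a, b in zip(self._items, self._items[1:])), \
                "order invariant violated"

    def insert(self, value):
        self._check()   # precondition: structure valid on entry
        bisect.insort(self._items, value)
        self._check()   # postcondition: still sorted after the operation

s = SortedList()
for v in [3, 1, 2]:
    s.insert(v)
```

With the flag on, a bug that corrupts the ordering is caught at the operation that introduced it rather than much later at a confusing distance from the cause.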