Why do computers stop and what can be done about it? (1985) [pdf] (jimgray.azurewebsites.net)
23 points by twoodfin 16 days ago | 6 comments

This was written in 1985. That's 36 years ago. The more things change, the more they stay the same.

> Even in a high availability system, hardware is a minor contributor to system outages.


> By applying the concepts of fault-tolerant hardware to software construction, software MTBF can be raised by several orders of magnitude.


> Dealing with system configuration, operations, and maintenance remains an unsolved problem.

> Dealing with system configuration, operations, and maintenance remains an unsolved problem.

I have no experience whatsoever with them, but I think I've seen many mentions over the years of how IBM mainframes have been designed to keep running while anything is modified or swapped out.

I also remember reading, years ago, about a research OS that journaled everything so that you could supposedly pull the plug at any random moment and recover the running state with no trouble.



At a high level, think of how much money Amazon pours into AWS, Microsoft spends on Azure or O365, or Google spends on GCP and its internal systems. The dollar amounts are staggering, yet all major systems continue to have outages that usually boil down to "Human Error".

The stability of the mainframes may be misleading, as the older, isolated systems are different from today's networked systems.

I would postulate that for a general purpose computer, software stability is inversely proportional to connectivity: the more connected the system, the less stable it tends to be.

I remember a discussion about a set of changes that introduced a technically redundant null check, and Linus criticizing the change for it. In my opinion, though, redundant checks are often not a bad idea. A check might be redundant now, but software is ever changing, and these "redundant" checks often help catch mistakes, so long as they can be afforded from a performance perspective. I pretty much never view a check as redundant unless the other check is in the same function or in the same _small_ module.

Something I'd also really like to see more often in various systems is better config for enabling/disabling checks, sort of like how log levels can be adjusted. E.g., checks divided by cost into something like expensive, noticeable, and minor overhead, and also by software module, so you could enable all checks related to some particular subsystem without enabling expensive checks globally or having to enable each check individually.
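
A minimal sketch of what that could look like, in Python. Everything here is made up for illustration: `Level`, `CheckConfig`, and the module names "db" and "net" are hypothetical, and the dotted-name fallback is just borrowed from how logger names work. The predicate is passed as a callable so a disabled check costs almost nothing.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical cost tiers, ordered like log levels."""
    OFF = 0
    MINOR = 1
    NOTICEABLE = 2
    EXPENSIVE = 3


class CheckConfig:
    """Decides which checks run, globally and per module (illustrative only)."""

    def __init__(self, default=Level.MINOR):
        self.default = default
        self.per_module = {}

    def set_level(self, module, level):
        self.per_module[module] = level

    def enabled(self, module, cost):
        # Fall back from "db.index" to "db" to the global default.
        name = module
        while name:
            if name in self.per_module:
                return cost <= self.per_module[name]
            name = name.rpartition(".")[0]
        return cost <= self.default

    def check(self, module, cost, predicate, message=""):
        # The predicate is only called when the check is enabled.
        if self.enabled(module, cost) and not predicate():
            raise AssertionError(f"[{module}] check failed: {message}")


cfg = CheckConfig(default=Level.MINOR)
cfg.set_level("db", Level.EXPENSIVE)  # turn on everything for one subsystem
```

With this, `cfg.check("db.index", Level.EXPENSIVE, ...)` runs, while the same call for "net" is skipped because "net" is still at the global default.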

Some people say that if a check is worth doing at all, it should always be enabled, but I think that overlooks the reality that some checks are useful but expensive. For example, whenever I implement a custom data structure, I find it useful to have assertions before and after each operation checking that all invariants hold, such as the order of elements being preserved. That check is O(n), though, so you definitely don't want it running in production or even in most debug builds.
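
To make that concrete, here is a rough Python sketch of the pattern. The `SortedList` class and its `check_invariants` flag are hypothetical names, not from any library. The O(n) scan runs before and after each insert, but only when the flag is on.

```python
import bisect


class SortedList:
    """Toy sorted container with an optional O(n) invariant check."""

    def __init__(self, check_invariants=False):
        self._items = []
        self._check = check_invariants

    def _assert_sorted(self):
        # O(n) scan over adjacent pairs; skipped entirely when disabled.
        if not self._check:
            return
        for a, b in zip(self._items, self._items[1:]):
            assert a <= b, f"order violated: {a!r} > {b!r}"

    def add(self, value):
        self._assert_sorted()               # invariant holds on entry
        bisect.insort(self._items, value)   # O(log n) search, O(n) shift
        self._assert_sorted()               # invariant holds on exit

    def __iter__(self):
        return iter(self._items)


s = SortedList(check_invariants=True)
for v in (3, 1, 2):
    s.add(v)
```

The check on entry is the "redundant" one: it only fires if something else has already corrupted the list, which is exactly the kind of mistake it is there to catch.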

Nobody can prove that they stop...

Um, a (1985) might be useful here.
