

Why do computers stop and what can be done about it? (1985) [pdf] - wormold
http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

======
rdtsc
This is a very good article. It is interesting how much of it is still
relevant.

Fault tolerance and reliability don't get enough attention. It is only after a
pattern of failures and late nights spent debugging segfaults that one starts
to appreciate fault tolerance.

Sadly, over the years we have been conditioned to accept that software crashes
and, well, that is how it is. You just restart. Fix that dangling pointer. Hire
another 5 engineers to hover over the code and squash bugs.

It is a cultural and priority thing. In some domains and applications crashing
systems can be dangerous: heart pacemakers, traffic lights, crypto modules,
life support systems. The same goes for some less critical ones, like phones.
If my server crashes, I'll somehow accept it. But if my desk phone does, I will
be very disturbed by it. The expectation is that I can pick up and talk on my
phone, even though behind the scenes there are quite a few complicated things
going on.

> In the future, hardware will be even more reliable due to better design,
> increased levels of integration, and reduced numbers of connectors.

This is interesting coming from 1985. Has it though? It should have. But I
think the proliferation of home PCs and blue screens of death, along with
cheaply made and flaky hardware, really lowered everyone's expectations of
reliability.

This is why languages and platforms like Erlang are very interesting. Erlang
was designed with reliability and fault tolerance at the top of the "todo"
list. Can't say there are too many things like that out there.

Here is a longer read by Erlang's creator Joe Armstrong. It is his thesis on
building fault-tolerant systems, mostly the famous AXD301 switch.

[http://www.erlang.org/download/armstrong_thesis_2003.pdf](http://www.erlang.org/download/armstrong_thesis_2003.pdf)

Anyway an interesting read.

And note that supervision and isolated processes can be emulated in other
languages to a certain extent. Just spawn OS processes. Use IPC. Connect to
other machines, set up watchdogs. So if anything, it is at least useful to
study the patterns and the design approach. You can then apply it to any
system.
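
For what it's worth, here is a minimal sketch of the "spawn OS processes and
watch them" idea in Python. The `supervise` function and its `max_restarts`
parameter are names made up for illustration, and this is nowhere near a real
OTP supervisor (no restart intensity window, no supervision tree). It only
shows the basic pattern: the worker runs in its own process, and a parent
restarts it when it exits abnormally.

```python
import subprocess


def supervise(cmd, max_restarts=3):
    """Run `cmd` as a child OS process and restart it if it exits non-zero.

    Returns the number of restarts that were needed. Raises RuntimeError once
    `max_restarts` is exhausted, so a persistent fault is escalated instead of
    being retried forever. Both names are hypothetical, not from any library.
    """
    restarts = 0
    while True:
        # The child has its own address space, so a crash there cannot
        # corrupt the supervisor's memory.
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"child {cmd!r} failed {restarts + 1} times, giving up"
            )
        restarts += 1
```

Something like `supervise([sys.executable, "worker.py"])` would keep a
(hypothetical) `worker.py` running through a few crashes and then give up
loudly rather than spin.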

~~~
tedd4u
"And note that supervision and isolated processes can be emulated in other
languages to a certain extent. Just spawn OS processes. Use IPC."

This has been applied in the past couple of years in the web browser domain:
Chrome and Safari do this for each tab and some of the plugins.

------
kjhughes
Note the author: Jim Gray, 1998 Turing Award recipient. Tragically, Gray was
lost at sea in 2007.

([http://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist)](http://en.wikipedia.org/wiki/Jim_Gray_\(computer_scientist\)))

------
bcoates
Interesting point I haven't seen discussed directly before:

    
    
      System administration, which includes operator actions, system
      configuration, and system maintenance was the main source of failures
      -- 42%. Software and hardware maintenance was the largest category.
      High availability systems allow users to add software and hardware and
      to do preventative maintenance while the system is operating. By and
      large, online maintenance works VERY well. It extends system
      availability by two orders of magnitude.
    

I've usually seen the online vs offline maintenance tradeoff depicted as the
availability benefits of online maintenance vs the massive development cost of
a system that remains available despite operator error during maintenance --
but if you meet the much easier standard of fail-fast (human error causes the
system to go down but doesn't corrupt the database) then online maintenance is
still an overall win even if it's the primary cause of failures.
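
To make the fail-fast standard concrete, here is a rough sketch in Python,
under my own assumptions rather than anything from the paper. The
`apply_config` function and the `replicas` field are hypothetical. The idea is
that a bad operator change raises immediately (the "system goes down" part),
while the stored state is only ever replaced atomically, so it is never left
half-written.

```python
import json
import os
import tempfile


def apply_config(path, new_config):
    """Fail fast on bad operator input without corrupting the stored state.

    `apply_config` and the `replicas` key are hypothetical names used only
    for illustration.
    """
    # Validate before touching anything durable: on bad input we stop here.
    replicas = new_config.get("replicas")
    if not isinstance(replicas, int) or replicas < 1:
        raise ValueError("invalid config: 'replicas' must be a positive integer")

    # Write the new state to a temporary file in the same directory, then
    # swap it into place. os.replace is atomic on the same filesystem, so a
    # crash mid-write leaves the old file at `path` intact.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(new_config, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
```

The operator mistake still takes the operation down, but the previous good
state is what you come back up with.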

------
dekhn
I don't think anybody in industry still uses the Tandem approach.

D'oh! Of course HP acquired Tandem (by way of Compaq) and brought the product line forward:
[http://en.wikipedia.org/wiki/NonStop](http://en.wikipedia.org/wiki/NonStop)

------
EdwardCoffin
In the same vein as this paper, I've found a lot of food for thought in this
set of slides from Rick Harper's course on Robust Programming at Stratus
Computer:
[http://ftp.stratus.com/vos/doc/papers/papers.html](http://ftp.stratus.com/vos/doc/papers/papers.html)

------
joeheyming
life before the internet was a thing

------
faldore
wat

