is an interesting read and provides a more concrete example of how to run a highly concurrent, fault-tolerant application.
It doesn't cover the social or psychological bullet points in this article; it is more technical. But I found it very readable.
As an addition, I can think of these patterns (off the top of my head, mostly from Erlang talks I've listened to, some from practice):
* Build the system out of isolated components. Isolation prevents failures from propagating. In Erlang, just launch a process, and don't use custom compiled C modules loaded into the VM, since a crash in one of those takes the whole VM down with it. In other cases, launch an OS process (or container).
* If your service is running on one single machine, it is not fault tolerant.
* Don't handle errors locally. Build a supervision tree where one part of the system does just the work it is intended to do (e.g. handling a client's request), and another, isolated part does the monitoring and error handling. Have one process monitor others, one machine monitor another, etc.
Once a segfault or malloc failure has occurred, installing a handler and trying to recover might not be the best solution. Restarting an OS (or Erlang) process might be easier. Put another way: once a process has been fouled up, don't trust it to heal itself. Trust another one to clean up after it and spawn a new instance of it.
* Try not to have a master or a single point of failure. Sometimes a master is unavoidable if the system has to be consistent; in that case, elect it with a well-defined algorithm or library (Paxos, Zab, Raft, etc.).
* Try to build a crash-only system, so that isolated units (OS or Erlang processes) can be killed instantly, for any reason, at any time, and the system still works. If you control the system you are building, use atomic file renames, append-only logs, and SIGKILL (or technologies that use those underneath). Don't rely on orderly shutdowns. Sometimes you are forced to use databases, hardware, or systems that don't behave nicely to begin with. Then you might not have a choice.
* Always test failure modes as much as possible. Randomly kill or mess with your isolated units (kill your processes), degrade your network, simulate switch failures, power failures, storage failures. Then simulate multiple failures at once -- your software crashes while it is still dealing with a hardware failure, and so on.
* As a side effect of the first point and the crash-only property, think very carefully about your storage. To be restartable, a process has to have saved a sane checkpoint of its state before it died. That means having a reliable, stable, and fault-tolerant storage system. Sometimes recomputing the state works as well.
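To make the crash-only and checkpoint points a bit more concrete, here is a minimal sketch in Python of saving state with an atomic file rename. The function names (`save_checkpoint`, `load_checkpoint`) and the choice of JSON are my own assumptions for illustration, not something from the talks or from Cook's piece. The idea is that the process can be SIGKILLed at any instant and a restarted instance still finds either the previous complete checkpoint or the new one, never a half-written file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically replace the checkpoint at `path` with `state` (a JSON-able dict)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final rename
    # stays on one filesystem (needed for the rename to be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace is an atomic rename on POSIX: readers see either the old
        # checkpoint or the new one, never a partially written file.
        os.replace(tmp_path, path)
    except BaseException:
        # We failed before the rename completed; remove the leftover temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the last complete checkpoint, or `default` if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A restarted worker would call `load_checkpoint("state.json", default={"n": 0})` on startup and `save_checkpoint("state.json", state)` after each unit of work. This sketch leaves out some things a real system would likely want, such as an fsync on the containing directory so the rename itself survives a power failure, and cleanup of stray temp files left behind by a kill in the middle of a write.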
Moreover, Cook's piece is very broadly applicable; it doesn't apply just to software systems.
These two documents are complementary, not mutually exclusive.