
No Single Points of Failure - honoredb
http://techblog.mdsol.com/2014/06/16/no_single_points_failure.html
======
lsh123
A good summary, with one exception: monitoring, instrumentation, and logging
didn't get enough attention. Failure is the norm, so first and foremost
you want to know when a failure occurs, and then you need to be able to
investigate what went wrong. You should literally monitor/instrument
everything: every API call, every DB access, every page rendering should
include code to track latency, result codes, payload size, etc. Every error
(even a benign one) should be logged, preferably with a stack trace. Any
unexpected condition should be checked and logged. All the instrumentation
data should be graphed and stored for a long period of time so you can analyze
the impact of your code changes on system performance and correlate it with
system failures.
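The "instrument every call" advice could be sketched like this in Python; the wrapper name and the exact fields logged are my own illustrative choices, not anything from the article:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("instrumentation")

def instrumented(name, fn, *args, **kwargs):
    """Wrap any call (API, DB, rendering) with latency/result instrumentation."""
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        latency_ms = (time.monotonic() - start) * 1000
        # Record latency and a rough payload size on every success.
        log.info("%s ok latency_ms=%.1f payload_bytes=%d",
                 name, latency_ms, len(repr(result)))
        return result
    except Exception:
        latency_ms = (time.monotonic() - start) * 1000
        # Log every error, even benign ones, with a full stack trace.
        log.exception("%s failed latency_ms=%.1f", name, latency_ms)
        raise
```

In a real system you would ship these measurements to a metrics store and graph them over months, as the comment suggests, rather than rely on log lines alone.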

------
colechristensen
I don't like the 'no single point of failure' maxim because I think it leads
people to make strange or incorrect decisions and to neglect things in order
to serve the maxim instead of doing what's best.

Being 'fail safe' is much more important than being redundant. That is, you
need to design your product's failure. How well it works and how rarely it
fails are important, but not nearly as important as how well it fails.

This means monitoring so you know when it fails, auditing so you know how it
failed after the fact, backups for recovering after the fact, and most
importantly (and hardest to define), predicting what can fail and how, and
designing your product's behavior after that failure.
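One minimal sketch of "designing the failure" is graceful degradation: when a dependency fails, serve stale cached data instead of nothing. The function name and the plain-dict cache here are hypothetical, just to show the shape of the idea:

```python
def fetch_with_fallback(fetch, cache, key):
    """Designed failure mode: on error, degrade to stale data, not an outage."""
    try:
        value = fetch(key)
        cache[key] = value  # refresh the fallback copy on every success
        return value, "fresh"
    except Exception:
        if key in cache:
            return cache[key], "stale"  # degraded but still working
        raise  # no safe fallback left; fail loudly so monitoring sees it
```

The point is that the behavior after the failure (stale data, a clear "stale" signal, a loud error when even that is impossible) is a deliberate design decision made in advance, not an accident.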

~~~
dtauzell
I agree that monitoring and quick recovery are important. It is hard to
eliminate all points of failure.

My favorite failures are what I call "distributed single points of failure".
An example is a Linux bug triggered by a cron job that is set to run at the
same time across all servers.
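A common mitigation for the simultaneous-cron problem is to splay the schedule: derive a stable per-host offset so the fleet doesn't fire in lockstep. A small sketch (the function name is my own, and hashing the hostname is one of several ways to pick the offset):

```python
import hashlib

def splay_minute(hostname, window_minutes=60):
    """Map a hostname to a stable minute offset within the hour, so an
    hourly job runs at a different (but consistent) time on each server."""
    digest = hashlib.sha256(hostname.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes
```

Each host computes its own offset at install time, so the job still runs once per window everywhere, just not at the same instant, which turns a fleet-wide trigger back into independent failures.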

------
bittermang
There's always a single point of failure. The user.

