
Difference Between Fault Tolerance, High Availability, Disaster Recovery (2014) - chynkm
http://www.pbenson.net/2014/02/the-difference-between-fault-tolerance-high-availability-disaster-recovery
======
notacoward
When I was working in this area during relatively early days, the difference
was sometimes expressed this way:

* Fault tolerance: near-infinite MTBF

* High availability: near-zero MTTR

With HA there _is_ a blip. It might not be visible to an application because
of retries, but it is visible outside of the HA system/component to some
degree.

Special bonus thought: as a guide to designing or implementing an HA/FT
system, I always found it helpful to think in terms of what happens to system
reliability as size increases. In a traditional system, system reliability
goes down because of dependencies between nodes/components. In some systems
this degradation is even worse than you'd think because it's tied to the
number of connections - O(n^2) rather than O(n). In an HA system, system
reliability should go _up_ because of nodes being able to cover for each
other.
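A toy model (my own sketch, not from the comment; the per-component numbers are made up) shows both scaling behaviors: a traditional chain where every node is a dependency, the worse O(n^2) case where every pairwise connection must also hold up, and an HA system where any one surviving node keeps the system up:

```python
def chain(p, n):
    # Traditional system: all n nodes are dependencies -> reliability p^n.
    return p ** n

def chain_with_links(p, q, n):
    # Worse case: reliability also depends on every pairwise connection,
    # n*(n-1)/2 of them, each with reliability q.
    return p ** n * q ** (n * (n - 1) // 2)

def ha(p, n):
    # HA system: any one of n nodes can cover -> fails only if all n fail.
    return 1 - (1 - p) ** n

for n in (2, 4, 8):
    print(n,
          round(chain(0.99, n), 4),
          round(chain_with_links(0.99, 0.999, n), 4),
          round(ha(0.99, n), 8))
```

As n grows, the first two columns fall (the second faster, because of the quadratic link count) while the HA column climbs toward 1 — the "reliability should go up" property.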

The key question was always: if X fails, what other part of the system can
make up for (not just survive) it? If it's a whole node, what other node(s)
can take its workload? If it's a disk, where is another copy of the data? If
it's a network, how else can nodes communicate or at least synchronize? That
last was interesting because it led to things like serial lines or pinging
through shared disks as a last-resort way to convey cluster state. Fun times.

~~~
jmts
MTBF: Mean time between failures

MTTR: Mean time to repair
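Both quantities feed the standard steady-state availability identity from reliability engineering (a general formula, not something stated in the thread): availability = MTBF / (MTBF + MTTR). It makes the parent's framing concrete — pushing MTBF toward infinity (fault tolerance) or MTTR toward zero (HA) both drive availability toward 1:

```python
def availability(mtbf_hours, mttr_hours):
    # Steady-state availability: fraction of time the system is up.
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Two very different systems, nearly the same availability (~0.9999):
print(availability(10_000, 1))    # rarely fails, slow to repair
print(availability(100, 0.01))    # fails often, recovers almost instantly
```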

~~~
Arnavion
I've only heard "mean time to _recovery_" in the context of software, though
Wikipedia does imply "mean time to repair" is also valid for software.

------
ndespres
I like these analogies a lot. I regularly have to explain to my clients the
difference between the backup system, the disaster recovery system, and the
file server replication system. None of them is the same as any other, and
each component has different levels of redundancy going down the stack (RAID,
HA for the virtual machine, shared storage between hosts, etc.), and it's no
easy task to explain that yes, while component X is redundant, it does not
meet the definition of "highly available" or "backed up." So I appreciate any
explanation like this article that attempts to simplify and illustrate any of
these definitions.

------
vinay_ys
In that plane analogy, assuming the plane's design load capacity requires all
4 engines to be fully functioning, a fully loaded plane suffering a failure of
1 out of 4 engines is dealing with a "degradation scenario".

It will have to jettison some of the load (proportional to the loss of one
engine) to save the rest, or it risks losing the entire plane.

This is a fault-tolerant system with degradation possibilities.

If this degradation possibility isn't acceptable, then the plane's design load
has to be reduced. The plane will carry only as much load as can be safely
flown with the minimum number of surviving engines (say, 3 out of 4, 2 out of
4, or even 1 out of 4).

When the plane is operating in this configuration with redundant online
engines, there are interesting challenges w.r.t. engine efficiency. These 4
engines will now have to be at their best efficiency when loaded at only 3/4,
1/2, or 1/4 of their capacity, because that is expected to be their normal
operating condition. And they should be able to operate under a higher load
(which may be inefficient and stressful) for long enough for the plane to
land safely.
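A toy calculation (the engine numbers are entirely hypothetical, not from the comment) makes the derating concrete: sizing the design load for the minimum surviving engines also fixes how lightly each engine is loaded in normal operation:

```python
ENGINES = 4
ENGINE_CAPACITY_T = 25  # hypothetical lift per engine, in tonnes

for k in (3, 2, 1):  # minimum engines that must survive
    design_load = k * ENGINE_CAPACITY_T
    normal_fraction = design_load / (ENGINES * ENGINE_CAPACITY_T)
    print(f"survive on {k}/{ENGINES} engines: design load {design_load} t, "
          f"normal per-engine utilization {normal_fraction:.0%}")
```

Tolerating one engine failure caps the design load at 75 t and normal utilization at 75% per engine; tolerating three failures drops both to a quarter of full capacity.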

Obviously, a plane that has to operate with spare engine capacity is more
expensive. Such a solution is worth deploying only if the cargo being carried
is valuable enough to justify the extra cost, even after weighting by the
probability of this failure scenario occurring.
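That expected-value comparison can be written down directly (all numbers below are made up for illustration):

```python
def redundancy_pays_off(cargo_value, p_scenario, extra_cost):
    # Expected loss avoided by the spare capacity vs. what the spare costs.
    return p_scenario * cargo_value > extra_cost

print(redundancy_pays_off(cargo_value=10_000_000, p_scenario=0.001,
                          extra_cost=5_000))  # True: 10,000 expected loss > 5,000
print(redundancy_pays_off(cargo_value=100_000, p_scenario=0.001,
                          extra_cost=5_000))  # False: 100 expected loss < 5,000
```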

These exact same trade-offs exist in the design of distributed software
systems that have to tolerate the failure of machine components.

