
Why Must Systems Be Operated? - luu
http://brooker.co.za/blog/2016/01/03/correlation.html
======
marcolinux
From the article: "...a black box monitoring of the system is not sufficient.
Black box monitoring, including external monitors, canaries, and so on, only
tell the system which side of an externally visible failure boundary a system
is on. Many kinds of systems, including nearly every kind that includes some
redundancy, can move towards this boundary through multiple failures without
crossing it. Black-box monitoring misses these internal state transitions.
Catching them can significantly improve the actual, real-world, durability and
availability of a system."

The author uses RAID, but this observation is valid for systems in general. I
really wish there were some kind of guidance, manual, or best practices
available on how to design and/or auto-regulate those internal state
transitions.
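
To make the quoted distinction concrete, here is a minimal sketch of a two-drive mirror seen from both sides. The names (`MirrorState`, `internal_state`, `black_box_ok`) are hypothetical and not from the article; the only assumption is that an external probe succeeds as long as at least one drive still works.

```python
from enum import Enum


class MirrorState(Enum):
    HEALTHY = "healthy"    # every drive is working
    DEGRADED = "degraded"  # at least one drive failed, data still readable
    FAILED = "failed"      # no working drive left


def internal_state(drive_ok):
    """White-box view: classify the mirror by how many drives still work."""
    working = sum(1 for ok in drive_ok if ok)
    if working == len(drive_ok):
        return MirrorState.HEALTHY
    if working > 0:
        return MirrorState.DEGRADED
    return MirrorState.FAILED


def black_box_ok(drive_ok):
    """Black-box view: an external probe only sees whether reads succeed."""
    return any(drive_ok)
```

With one drive down, `black_box_ok([True, False])` still reports success, while `internal_state([True, False])` reports `DEGRADED`. That degraded state is the internal transition an external monitor cannot see.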

~~~
derefr
It seems like there's a possibility for a general rule-of-thumb (that could be
operationalized) in systems that try to use redundancy to increase fault-
tolerance.

The naive approach would be linear: treat two half-failed drives in a mirror
as having one whole failed drive. That wouldn't work too well, though, mostly
because "half-failed" is actually already nearly failed.

This seems like the kind of thing that could use a "calibration curve", where
you observe how the reported health actually correlates with the remaining MTBF,
and then divide future reported-health estimates by that calibration curve to
get their actual health.
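
A rough sketch of what that calibration could look like in code. The data points are made up for illustration, and this version looks up the calibrated value by piecewise-linear interpolation rather than literally dividing by the curve; `CALIBRATION` and `calibrated_health` are hypothetical names.

```python
from bisect import bisect_right

# Hypothetical observations, sorted by reported health:
# (reported health, observed fraction of service life actually remaining).
# These numbers are invented for illustration only.
CALIBRATION = [(0.0, 0.0), (0.5, 0.05), (0.8, 0.3), (0.95, 0.7), (1.0, 1.0)]


def calibrated_health(reported, curve=CALIBRATION):
    """Map a reported health value onto the observed curve by linear interpolation."""
    xs = [x for x, _ in curve]
    ys = [y for _, y in curve]
    # Clamp values outside the observed range to the end points.
    if reported <= xs[0]:
        return ys[0]
    if reported >= xs[-1]:
        return ys[-1]
    # Find the segment [x0, x1] that contains the reported value.
    i = bisect_right(xs, reported)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (reported - x0) / (x1 - x0)
```

With this invented curve, a drive reporting 50% health maps to about 5% of its life remaining, which is the "half-failed is nearly failed" effect described above.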

I'm guessing the calibration curve will just, itself, end up being a bathtub
curve—which means that e.g. a drive with one or two SMART errors would need to
be considered "already on its way out." But sometimes such drives live a long
time—as a matter of cost, it's probably too expensive to throw out every disk
that will _probabilistically_ fail soon. It might be possible, though, to move
them to some sort of "non-front-line" service instead, maybe moved to Dynamo-
like n=17 highly-redundant storage. (I wonder if AWS actually "recycles" EBS
volumes into Dynamo/S3 volumes this way.)

---

As an aside, I've always wondered why we don't use calibration curves more.
They're great for estimating a lot of things:

• Remaining battery life: _sort of_ works this way (in that battery _output_
is fairly constant until the battery "runs out"; 10% remaining = battery at a
slightly lower voltage, 0% remaining = battery still plenty charged but no
longer charged enough to output the proper voltage for the device.) But could
be calibrated way better by actually correlating reported battery life to
time, as observed by the device during its service life. This is made harder
because we don't usually let batteries drain dry, though we _do_ let them get
into that precarious 10% "suboptimal voltage" case quite often. A properly-
calibrated battery report should discharge linearly and recharge on an
S-curve, rather than the other way 'round.

• Progress bars. If you're an OS manufacturer and you want to distribute an OS
update, deploy it to a bunch of test machines and track how long each phase of
the update takes, average them a bit pessimistically (maybe take the third-
sigma median.) Now you can make a progress bar that _appears_ to fill
linearly, and gives a _real, calibrated_ estimate on time-to-completion. The
bar can be fronting pretty oblivious software, as long as it's split into
phases itself: each phase can be kicked off and then the bar can just ease
between N% and (N+P)% over the (pessimistically) estimated phase time, quickly
cubic-sliding up to (N+P)% if the phase completes early. (I do know one piece
of software that actually did things this way: Mac OS, pre-System-7.)
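
The phase-based progress bar can be sketched roughly as follows. The phase estimates are hypothetical numbers standing in for timings measured on test machines, and the sketch leaves out the cubic easing when a phase finishes early; `phase_bounds` and `bar_position` are made-up names.

```python
# Hypothetical pessimistic per-phase estimates in seconds (invented values).
PHASE_ESTIMATES = [30.0, 120.0, 50.0]


def phase_bounds(estimates):
    """Return (start, end) bar fractions per phase, proportional to estimated time."""
    total = sum(estimates)
    bounds = []
    done = 0.0
    for est in estimates:
        bounds.append((done / total, (done + est) / total))
        done += est
    return bounds


def bar_position(phase, elapsed_in_phase, estimates=PHASE_ESTIMATES):
    """Bar fraction while `phase` is running, given seconds spent in that phase."""
    start, end = phase_bounds(estimates)[phase]
    # Ease linearly across the phase; hold at the phase end if it overruns.
    frac = min(elapsed_in_phase / estimates[phase], 1.0)
    return start + (end - start) * frac
```

If a phase completes early, the caller simply moves on to the next phase, whose start fraction equals the previous phase's end. Because each phase gets a share of the bar proportional to its estimated time, the bar fills at a constant rate as long as the estimates hold.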

------
pjc50
Perhaps better phrased as "why do complex systems require more active
maintenance", with the argument that this is because they have many more
"partial failure requiring recovery actions" states. Whereas simple systems
are either working or not.

------
golergka
Excellent description with states and Markov chains. I have never thought of
durability and stability in these terms, and I certainly will from now on.
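
For anyone who wants to play with the states-and-Markov-chain view, here is a small sketch under my own assumptions, not necessarily the article's exact model: a two-drive mirror where each drive fails independently at a constant rate, a failed drive is repaired at a constant rate, and data is lost when both drives are down. `mean_time_to_data_loss` is a hypothetical name.

```python
def mean_time_to_data_loss(fail_rate, repair_rate):
    """Expected time until both drives of a mirror have failed.

    States are the number of working drives: 2 -> 1 at rate 2 * fail_rate,
    1 -> 2 at rate repair_rate, 1 -> 0 (data loss) at rate fail_rate.
    Let t2 and t1 be the expected times to reach state 0 from states 2 and 1:
        t2 = 1 / (2 * fail_rate) + t1
        t1 = 1 / (fail_rate + repair_rate)
             + repair_rate / (fail_rate + repair_rate) * t2
    Solving these two linear equations gives the closed forms below.
    """
    t1 = (2 * fail_rate + repair_rate) / (2 * fail_rate ** 2)
    t2 = 1 / (2 * fail_rate) + t1
    return t2
```

Noticing the degraded state sooner amounts to a higher `repair_rate`, which lengthens the expected time to data loss even though the failure rate of the drives is unchanged.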

------
peterwwillis
Saying "simple system" is like saying "teacup"; what kind of tea, what
temperature? Are you pouring right after steeping or after cooling? Does it
have a handle and/or saucer? Is there a ceremony or cultural aesthetic
involved? Do you want it to develop a patina? Is it used for more than one
kind of tea?

Why must we operate systems? Entropy, for one, but also because simplicity is
relative.

