
Gray Failure: The Achilles' Heel of Cloud-Scale Systems [pdf] - wallflower
https://www.cs.jhu.edu/~huang/paper/grayfailure-hotos17.pdf
======
lsc
yeah, this is really interesting, and something that, as far as I can tell,
most companies aren't so interested in. I remember one place I worked with
many thousands of computers; during burn-in, they'd rewrite every sector of
the disks, and would only fail a disk if it couldn't reallocate all sectors
three passes in a row.

Which seemed crazy to me, because the disks were used in non-redundant
configurations... if there were read errors on those disks, it would cause
actual data corruption, and eventually caused servers several steps up the
line to crash and set my pager off, which is how I found out about it.

That's the hard part of infrastructure as code; a lot of programmers don't
understand (or don't think about?) what it means to have a failure. In this
case, running the disks non-redundantly was reasonable; the system would have
dealt just fine with the whole server falling over... but because it
"recovered" the error, the error was propagated all the way up to my goddamn
pager. (Infernal pager? Sisyphean pager? That job had the most active pager
I've worn in 20 or so odd years of wearing pagers.)

~~~
eropple
_> a lot of programmers don't understand (or don't think about?) what it means
to have a failure_

Most of my devops consulting these days is more on the human side of things
(devs and ops not getting along, Managements Just Don't Understand, etc.) but
whenever I end up in a design review this is _still_ the first thing I ask:
"how does this break, under what circumstances will it break, and how do we
respond to it breaking without waking somebody with SSH access up at two in
the morning?".

~~~
jungturk
It's expensive, and perhaps less enjoyable than other aspects of engineering,
but it certainly pays dividends in many environments.

[https://en.wikipedia.org/wiki/Failure_mode_and_effects_analy...](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis)

------
woliveirajr
> Moreover, there are many types of gray failure that are not performance-
> related

The only possible scenarios I can think of are even worse: data loss,
uptime loss...

