Gray Failure: The Achilles' Heel of Cloud-Scale Systems [pdf] (jhu.edu)
47 points by wallflower on Jan 28, 2018 | 6 comments

Yeah, this is really interesting, and something that, as far as I can tell, most companies aren't so interested in. I remember one place I worked, with many thousands of computers: during burn-in, they'd rewrite every sector of each disk, and would only fail the disk if it couldn't reallocate all sectors three passes in a row.

Which seemed crazy to me, because the disks were used in non-redundant configurations... if there were read errors on those disks, it would cause actual data corruption, which eventually caused servers several steps up the line to crash and set my pager off, which is how I found out about it.

That's the hard part of infrastructure as code; a lot of programmers don't understand (or don't think about?) what it means to have a failure. In this case, running the disks non-redundantly was reasonable; the system would have dealt just fine with the whole server falling over... but because it "recovered" the error, the error was propagated all the way up to my goddamn pager. (Infernal pager? Sisyphean pager? That job had the most active pager I've worn in 20 or so odd years of wearing pagers.)
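A minimal sketch of the fail-stop idea described above, not the system from that job: when a read hits a low-level I/O error, take the whole node out of service instead of "recovering" and passing possibly bad data up the line. The names `NodeFailed` and `read_block`, and the `opener` parameter, are hypothetical and exist only for illustration.

```python
import errno


class NodeFailed(Exception):
    """Hypothetical signal: pull this whole node out of service."""


def read_block(path, offset, length, opener=open):
    """Read `length` bytes at `offset`, converting disk I/O errors into NodeFailed.

    `opener` is injectable so the failure path can be exercised without bad hardware.
    """
    try:
        with opener(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    except OSError as exc:
        # EIO is the errno the OS typically reports for a failed device read.
        if exc.errno == errno.EIO:
            raise NodeFailed(f"I/O error reading {path}") from exc
        raise  # other errors (missing file, permissions) are not disk failures
```

The design choice is that the layer above already tolerates a dead server, so the cheapest correct response to a partial disk failure is to turn it into a total one.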

> a lot of programmers don't understand (or don't think about?) what it means to have a failure

Most of my devops consulting these days is more on the human side of things (devs and ops not getting along, Managements Just Don't Understand, etc.) but whenever I end up in a design review this is still the first thing I ask: "how does this break, under what circumstances will it break, and how do we respond to it breaking without waking somebody with SSH access up at two in the morning?"

It's expensive, and perhaps less enjoyable than other aspects of engineering, but it certainly pays dividends in many environments.


If a client is hell-bent on cost savings from non-redundant disks, sell them netboot (call it 'cloud'), possibly with a smaller local SSD for cache purposes, and proper ZFS+RAID on the servers.

I think there's a place for non-redundant disk... it's just that you have to understand the difference in failure modes. Redundant disk mostly either works or doesn't (well, it returns all data or none; the speed can vary when it's in a non-optimal state, but it is rare to lose only some of the data on a RAID). A common failure mode for non-redundant spinning disk, on the other hand, is for certain sectors to become inaccessible while the rest of the disk is okay.

As long as you plan for that last part, it can work. Planning for it can mean just wiping and reinstalling every time you get an unrecoverable read error, it can mean RMAing the disk after the first reallocated sector (as a RAID would), and it can mean using something like ZFS to catch errors and mark whole files bad.
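A rough sketch of how the first two options could be written down as an explicit policy, assuming the counters come from something like SMART data. The names `DiskHealth`, `Action`, and `decide` are made up for illustration, and the thresholds are assumptions, not anyone's production values.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep in service"
    REINSTALL = "wipe and reinstall"
    RMA = "RMA the disk"


@dataclass
class DiskHealth:
    # Hypothetical counters, e.g. parsed from SMART attributes.
    reallocated_sectors: int
    unrecoverable_read_errors: int


def decide(health: DiskHealth, rma_on_first_realloc: bool = True) -> Action:
    """Pick a response for a non-redundant disk; any partial failure triggers action."""
    if rma_on_first_realloc and health.reallocated_sectors > 0:
        return Action.RMA
    if health.unrecoverable_read_errors > 0:
        return Action.REINSTALL
    return Action.KEEP
```

For example, `decide(DiskHealth(reallocated_sectors=1, unrecoverable_read_errors=0))` returns `Action.RMA` under the default policy.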

Centralized disk has its own set of issues, though it is a possible solution.

> Moreover, there are many types of gray failure that are not performance-related

The only possible scenarios I can think of are even worse: data loss, lost uptime...
