Crash-Only Software (usenix.org)
45 points by llambda on Aug 21, 2011 | 15 comments

I really enjoyed one of this paper's citations, "Why do computers stop and what can be done about it" written by Jim Gray in 1985. A really nice and early look at how to build reliable systems from unreliable but independent parts. http://www.hpl.hp.com/techreports/tandem/TR-85.7.html

What would be necessary to adapt a Linux distro to be crash-only? "shutdown now" really would mean shutdown now. Unfortunately, I doubt many Linux services or applications would be happy if their shutdown scripts did not run.

If the distro can't be made crash-only, perhaps just the Linux kernel could. After executing shutdown scripts for userspace services, just crash the kernel. Good tests for filesystems and device drivers! :)

The kernel (along with a modern FS) already is crash-only. Journals are designed to recover a FS quickly from any state after a power failure. The remainder of the kernel mostly is a bunch of in-memory structures recreated on every boot.

As for userspace, well, e.g. Firefox is a lot of the way there. SQLite itself is designed to handle failures extremely gracefully, and by virtue of regularly flushing most of its state into SQLite, Firefox itself achieves a great deal of reliability (although this might not be true of its cache storage, etc.).

You can work towards making userspace crash-only. Generally you want to make it as stateless as possible, i.e. avoid writing to lots of files and keep state only in config databases that are crash-proof.
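For the files you do have to write, one common way to make an update crash-proof is to write a temporary file, fsync it, and rename it over the original. A minimal sketch, assuming a POSIX-style filesystem where rename within a directory is atomic; `atomic_write_json` is a made-up name for illustration:

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Replace `path` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk first
        os.replace(tmp_path, path)  # atomic swap of old for new
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Assumption: on POSIX, fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A reader never sees a half-written file with this pattern, so there is nothing for a shutdown script to tidy up.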

A few people have been looking at stateless userspace, e.g. see http://www.infoq.com/presentations/Runtime-Changes

Current Linux distros generally have too much poorly documented state in the filesystem, much of which may not be updated in a crash proof way...

Mac OS X already provides a lot of the framework for applications. It is opt-in for app developers, but it allows an app to be kill -9'ed on shutdown.

Definitely take a look at Sudden Termination, as Apple calls it. Along with restoring apps as they were upon login, it is one of the few features that mean a reboot is no longer a productivity killer.

This is one that Damien Katz cites as a big influence on the CouchDB design.

See also Erlang, where one commonly encouraged way of handling errors is to crash a process, and let another process restart it.

Erlang is "let it crash", which is not exactly the same thing, but it's quite close, and just as interesting.
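The crash-and-restart idea can be sketched outside Erlang too. This is not how OTP supervisors are implemented, just a rough Python illustration using OS processes; `supervise` and `max_restarts` are made-up names:

```python
import subprocess
import sys


def supervise(argv, max_restarts=5):
    """Run `argv` as a child process and restart it whenever it crashes.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        result = subprocess.run(argv)
        if result.returncode == 0:
            return restarts  # the child finished normally
        if restarts >= max_restarts:
            # Give up instead of restarting forever.
            raise RuntimeError(
                "child kept crashing after %d restarts" % restarts
            )
        restarts += 1


if __name__ == "__main__":
    # Hypothetical usage: supervise whatever command is given on the command line.
    print("restarts needed:", supervise(sys.argv[1:]))
```

The worker needs no error-handling path of its own: it just exits, and the supervisor decides what happens next.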

Is Firefox crash-only? When I have to restart my system, I just do it without closing the browser. After the system reboots, I launch Firefox and it brings all the windows and tabs back.

(I don't know if there's a cleaner way to do that, without manually saving all the tabs in every window)

There is a quit option in the menu, so no. I think it can be configured to restore state on a clean exit. Chrome can.

If the default behavior isn't working for you, give the Session Manager add-on a try.

Very good stuff. I was first exposed to it by the design of Cassandra, which works exactly like that. To stop it you just crash it, and it always starts with recovery.
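To make "always start with recovery" concrete, here is a toy append-only log store. It is a sketch of the general idea only, not Cassandra's commit log; `LogStore` and its file format are invented for illustration:

```python
import json
import os


class LogStore:
    """Toy key-value store: every start is a recovery, there is no clean shutdown."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()
        self._log = open(path, "ab")

    def _recover(self):
        if not os.path.exists(self.path):
            return
        good = 0  # byte offset of the end of the last complete record
        with open(self.path, "r+b") as log:
            for line in log:
                try:
                    if not line.endswith(b"\n"):
                        raise ValueError("torn record")
                    key, value = json.loads(line)
                except (ValueError, TypeError):
                    break  # a crash interrupted this append; stop replaying
                self.data[key] = value
                good += len(line)
            # Drop the torn tail so later appends start on a clean boundary.
            log.truncate(good)

    def put(self, key, value):
        record = json.dumps([key, value]).encode("utf-8") + b"\n"
        self._log.write(record)
        self._log.flush()
        os.fsync(self._log.fileno())  # durable before we acknowledge it
        self.data[key] = value
```

There is deliberately no `close` or `shutdown` method: killing the process at any point and constructing a new `LogStore` on the same path is the only restart procedure.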

Should be part of the standard CS curriculum.

Definitely interesting for anyone who's into Operating Systems and/or distributed programming.
