
Crash-only software: More than meets the eye (2006) - djpuggypug
http://lwn.net/Articles/191059/
======
rdtsc
> The resulting system is often more robust and reliable because crash recovery
> is a first-class citizen in the development process, rather than an
> afterthought,

Because of that I usually make all my services and systems crash-only. I end up
using things like atomic file moves, opening files append-only, stopping
services with kill -9 and so on. To make your system crash-only, you have to go
down to the base system calls.
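
To illustrate the atomic file move, here is a minimal sketch in Python, assuming a POSIX-style filesystem. The function name `atomic_write` and the `.tmp-` prefix are made up for this example, not taken from any particular codebase. The idea is to write to a temporary file in the same directory, fsync it, and then rename it over the target, so that a kill -9 at any point leaves either the old file or the new one, never a half-written one.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`; readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the bytes on disk before the rename
        os.replace(tmp_path, path)  # the atomic step on POSIX filesystems
    except BaseException:
        # We failed before the rename completed: remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also fsync the directory so the rename itself survives a power loss (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A hard kill between the write and the rename can still leave a stray `.tmp-*` file behind, so a crash-only service would probably delete those at startup rather than count on the cleanup branch running.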

Some observed effects so far (many are covered in the article):

* Faster restarts (if your regular operation involves restarting lots of processes).

* Less code (don't have to handle both the clean shutdown and dirty shutdown).

* Recovery/cleanup code, if it is needed, often ends up moved to startup instead of shutdown (you might have to recover corrupt files when you start up again, for example by re-truncating the files to a known offset based on some index).

* Something else might need to manage external resources (OS IPC resources, shared memory, IPC message queues etc.). This could be a supervisor process.

* If you do a lot of socket operations on localhost, your sockets could get stuck in TIME_WAIT state and you'll eventually run out of ephemeral ports if you do a lot of restarts (say during testing). SIGTERM is often caught, and processes (or their libraries) then perform a cleaner shutdown.

* Think carefully about the database you use and check whether it can support crash-only operation. Some do, some don't (I won't name any names here).
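
The "re-truncate to a known offset" recovery from the list above could look roughly like this. It is only a sketch under my own assumptions: an append-only log plus a tiny index file holding the last committed offset as an 8-byte little-endian integer. The function names and the file layout are hypothetical. The log is fsynced before the index is updated, so after a crash the index never points past durable data, and startup just chops off whatever was written after it.

```python
import os
import struct


def read_committed_offset(index_path: str) -> int:
    """Return the last committed log offset, or 0 if there is no usable index yet."""
    try:
        with open(index_path, "rb") as f:
            raw = f.read(8)
    except FileNotFoundError:
        return 0
    return struct.unpack("<Q", raw)[0] if len(raw) == 8 else 0


def recover(log_path: str, index_path: str) -> int:
    """Run at every startup: drop any bytes written after the last commit."""
    committed = read_committed_offset(index_path)
    # O_CREAT so a first start and a restart after a crash take the same path.
    fd = os.open(log_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        size = os.fstat(fd).st_size
        if size < committed:
            raise RuntimeError("index points past the end of the log")
        if size > committed:
            os.ftruncate(fd, committed)  # discard the torn tail
            os.fsync(fd)
    finally:
        os.close(fd)
    return committed


def append(log_path: str, index_path: str, payload: bytes) -> int:
    """Append a record, then commit it by publishing the new offset."""
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, payload)  # assumes small records that go out in one write
        os.fsync(fd)           # the data must be durable before the index says so
        new_offset = os.fstat(fd).st_size
    finally:
        os.close(fd)
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(struct.pack("<Q", new_offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, index_path)  # atomic swap of the index file
    return new_offset
```

Note there is no shutdown code at all here. Stopping is just kill -9, and `recover` is the only cleanup path, which means it gets exercised on every single start.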

------
mjb
Crash-only software is a great concept, and this article is a very interesting
summary of what it is (and what it's not). If you read only one section of
Candea and Fox's paper, I would recommend section 3 "Properties of Crash-Only
Software". It lays out some basic properties of proper crash-only software,
which work as guidelines even for software that doesn't go all the way to the
crash-only ideal.

My favorite one of the principles is "All important non-volatile state is
managed by dedicated state stores". Being both crash-only (or even just
tolerating crashes) and keeping state is a very difficult combination, and you
don't want every one of your services needing to solve that problem over and
over. Dedicated state stores let you hand this problem off, which turns many
systems stateless (or at least without hard state). Tolerating crashes in
soft-state-only services is much easier, perhaps even trivial if you follow
the other rules.

I wrote a blog post about this paper a while back
(http://brooker.co.za/blog/2012/01/22/crash-only.html), if anybody is
interested.

------
sbierwagen
I've got a billion tabs open in Firefox (plus a bunch of extensions), which
seems to expose some O(n^2) algorithm in the internals, because it becomes
unusably slow after running for 24 hours. I can either quit it normally, which
takes 7 minutes, or just kill the process and restart it.

~~~
mtdewcmu
Firefox is a house of cards, because all those billion tabs are sharing a
single process and they all need to cooperate perfectly.

~~~
Dewie
In what ways do the tabs need to cooperate/interact?

~~~
pessimizer
They have to share the same thread. If one blocks, they all block.

~~~
mtdewcmu
There are many threads, but threads are not isolated by the OS to the same
extent as processes; hence, their fates are all intertwined. The user can't
kill one misbehaving thread, and even if you could, you couldn't expect the
program to be stable afterward.

~~~
coryrc
Heh, I have attached to the process, gone to the stuck thread, gone up a few
frames, and told it to return. Never did recover, as expected :) But back
before it restored all your tabs, it was worth a shot at saving the state.

------
signa11
surprising that no one mentioned erlang in this context. the view that you
don't need to program defensively at all. programs can terminate for a variety
of reasons, and as long as a monitoring process / program can take corrective
action it's all good.
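
A rough sketch of that monitoring idea in Python. This is not Erlang's actual supervisor API, just the same "let it crash and restart it" shape, and `supervise`, `max_restarts` and `backoff_seconds` are names made up for the example. The worker does no defensive cleanup of its own; the supervisor simply starts it again after any abnormal exit, including being killed with -9.

```python
import subprocess
import sys
import time


def supervise(argv, max_restarts=5, backoff_seconds=0.1):
    """Run argv, restarting it after every abnormal exit.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        # Blocks until the child exits; a child killed by a signal reports a negative code.
        exit_code = subprocess.call(argv)
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"giving up after {restarts} restarts (last exit code {exit_code})"
            )
        restarts += 1
        time.sleep(backoff_seconds)  # short pause so a crash loop does not spin


if __name__ == "__main__":
    # Hypothetical worker command, only here to show how the call looks.
    supervise([sys.executable, "-c", "print('worker running')"])
```

The restart cap is the corrective-action part. Once a worker keeps dying, the failure gets pushed up a level instead of being retried forever.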

