

Crash-only Software - ams1
http://snarkmarket.com/2010/5750

======
mkramlich
When I first wrote programs for a Unix box I would think of the main program
as the application. But there would almost always be some sort of shell script
to wrap invocation. Sometimes also a cronjob. Maybe a watchdog restart daemon
process, etc. Then my understanding expanded to really consider all of this
working in concert as the application, even the OS and its configuration, and
any other service dependencies. Whatever had to be a certain way, manually
configured or tuned or whatever, to achieve the desired application result.

I think that's also where the notion of error handling comes into play because
it's obvious at that level that it's not enough to just handle exceptional
cases gracefully inside the body of your main program(s), and in fact, you may
not want to, you may want to let certain things percolate up and cause the
process to exit, if only to achieve a uniformity in how you handle edge cases.
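A minimal sketch of that wrapper idea, in Python rather than a shell script. The main program does no recovery of its own; it just exits, and one outer loop treats every failure the same way by restarting it. The `supervise` name, the restart limit, and the delay are all made up for illustration, not anything from the article.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.0):
    """Run cmd, restarting it whenever it exits with a non-zero status.

    Returns the list of exit codes observed, one per run.
    """
    codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.call(cmd)  # blocks until the child exits
        codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
        time.sleep(delay)  # crude pause before the next attempt
    return codes


if __name__ == "__main__":
    # Hypothetical child that always fails, standing in for the real program.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(child, max_restarts=2))
```

A real watchdog would also want logging and some backoff, but the point is that the error-handling policy lives in one place outside the program.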

------
stcredzero
_Or imagine a crash-only business that goes bankrupt every four years as part
of its business plan. Every part of the enterprise is designed to scatter and
re-form, so the business can withstand even an existential crisis. It’s a
ferocious competitor because it fears nothing._

Is that a startup or a terrorist cell?

In a way, the traditional edit-compile-test-debug cycle is a "deploy-only"
development style. The "continuous integration" part of extreme programming is
taking that even further.

------
metellus
I'm having trouble finding a distinction between crash-only software and
software with good error handling. If the software (or bank or government or
whatever) crashes and then restarts gracefully and automatically, I don't see
how it could be said to have crashed.

~~~
tpz
The distinction is simply (but critically!) this: the standard operating
procedure (and, in fact, the only available way!) to shut down a piece of
crash-only software is to _crash it forcibly_. There is no quit button, no
shutdown command, no remote API, there is just kill. Sounds simple enough, but
where it really gets interesting is in how this mindset affects your design
and implementation decisions.

Imagine for a moment what it would be like if during every second of work
during every day of the week each and every thing you did was accompanied by a
corresponding design exercise along the lines of "okay, but what if it crashes
right then." Now _that_ would sure produce some "software with good error
handling", now wouldn't it. _That_ is crash-only software.

~~~
mkramlich
Yep, you'd be forced to think more about transactions. What activities needed
to be carried out as a transaction, and for what activities it didn't matter.
Because at any point the process could exit.
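One way this could look in practice, as a rough sketch and assuming the state is a small JSON file on a POSIX filesystem: write the new version to a temporary file, flush it to disk, then rename it over the old one. A kill at any point leaves either the old file or the new one, never a half-written mix. `save_state` and `load_state` are hypothetical names for illustration.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Replace the file at path with state, all-or-nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename
    # stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp, path)  # the commit point: readers see old or new
    except BaseException:
        # Ordinary failure: remove the temp file. After a real crash a
        # stray temp file may remain, which is harmless.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def load_state(path, default=None):
    """Read the last committed state, or default if none was ever saved."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Anything that doesn't need this treatment (caches, progress output) can be written the sloppy way, which is the other half of the question about what has to be a transaction.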

------
robinsloan
I'm curious to know if anybody here on HN has anything to say about the
analogy. How might this extend beyond the domain of programming? Where else
could this pattern apply?

E.g. there's a terrific comment over on Snarkmarket that talks about gliders,
and how they are, in a sense, always crashing.

~~~
eru
Could you expand on the gliders? Or give the link to the comment on
Snarkmarket?

~~~
robinsloan
"Powered aircraft pilots practice engine-out scenarios as part of emergency
training. A checklist is brought out that usually sets a good glide angle and
starts the pilot thinking about a place to land.

However, glider pilots perform that skill every flight. From the moment of
release and throughout the flight, they’re constantly planning ahead,
thinking of where the landing spot will be (and some alternates) and what
flight path will get them there."

<http://snarkmarket.com/2010/5750/comment-page-1#comment-11786>

~~~
eru
Thanks!

------
ig1
Isn't this basically a standard principle behind fault-tolerant systems design
(and languages like Erlang), where you just stuff the system full of
redundancy so you can crash individual parts when they go wrong and recover
instantly?

------
eru
Is this a dupe? I read this before, probably on HN.

Either way, it deserves to be re-submitted.

