Hacker News
Crash-only Software (snarkmarket.com)
29 points by ams1 on June 27, 2010 | 14 comments

Or imagine a crash-only business that goes bankrupt every four years as part of its business plan. Every part of the enterprise is designed to scatter and re-form, so the business can withstand even an existential crisis. It’s a ferocious competitor because it fears nothing.

Is that a startup or a terrorist cell?

In a way, the traditional edit-compile-test-debug cycle is a "deploy-only" development style. The "continuous integration" part of extreme programming is taking that even further.

I'm having trouble finding a distinction between crash-only software and software with good error handling. If the software (or bank or government or whatever) crashes and then restarts gracefully and automatically, I don't see how it could be said to have crashed.

The distinction is simply (but critically!) this: the standard operating procedure (and, in fact, the only available way!) to shut down a piece of crash-only software is to crash it forcibly. There is no quit button, no shutdown command, no remote API, there is just kill. Sounds simple enough, but where it really gets interesting is in how this mindset affects your design and implementation decisions.

Imagine for a moment what it would be like if, during every second of work on every day of the week, each and every thing you did was accompanied by a corresponding design exercise along the lines of "okay, but what if it crashes right then?" Now that would sure produce some "software with good error handling", wouldn't it? That is crash-only software.

Yep, you'd be forced to think more about transactions. What activities needed to be carried out as a transaction, and for what activities it didn't matter. Because at any point the process could exit.
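To make the "could exit at any point" idea concrete, here is a minimal sketch of one common way to make a state update behave like a transaction: write to a temporary file, flush it to disk, then rename it over the old file. The function names `save_state` and `load_state` and the JSON format are my own assumptions for illustration, not something from the paper.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Replace the file at `path` with `state` (as JSON) in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-state-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # A crash before this line leaves the old file untouched; a crash
        # after it leaves the new file fully in place.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup when we fail without being killed outright.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_state(path, default=None):
    """Return the last fully written state, or `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

If the process is killed halfway through `save_state`, the worst case is a stray temporary file; `load_state` still sees the previous complete state. Anything that does not need this guarantee can skip the ceremony, which is the "for what activities it didn't matter" half of the question.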

That's a fair point -- a crash handled gracefully & completely sorta ceases to be a crash at all. I think one distinction the authors of the original paper make is that an app's off switch ought to be external to the app. More like "kill -9" than "exit"; more like pulling the plug than pressing a "power down" button.
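A rough sketch of what an external off switch looks like from the outside, assuming a POSIX system. The "worker" here is just a child process that sleeps, and `start_worker` and `pull_the_plug` are made-up names for illustration.

```python
import subprocess
import sys


def start_worker():
    """Start a stand-in worker: a child interpreter that sleeps for a minute."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def pull_the_plug(proc):
    """Stop the worker from outside; it gets no chance to run cleanup code."""
    proc.kill()  # sends SIGKILL on POSIX, the same signal as `kill -9`
    # On POSIX the return code is the negated signal number.
    return proc.wait(timeout=10)
```

The worker has no shutdown handler to run because SIGKILL cannot be caught, so whatever state it leaves behind has to be recoverable at the next start.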

Yes. CouchDB is designed as crash-only software: it detects a crash and restarts immediately, and shutting it down is done with kill, not exit.

It's the distinction between attempting an automatic recovery under application control and the "badness happened; get out NOW" philosophy of failure design.

In complex systems, there's always the assumed potential for untoward behavior from unanticipated recovery environments; the handling of recovery can be a bigger problem than a complete failure.

In these environments, clean failures are preferable.

And staging the recovery processing can be preferred. This goes as far as staging application start-up and sequencing the component server reboots manually. Yes, manually.

I'm curious to know if anybody here on HN has anything to say about the analogy. How might this extend beyond the domain of programming? Where else could this pattern apply?

E.g. there's a terrific comment over on Snarkmarket that talks about gliders, and how they are, in a sense, always crashing.

Could you expand on the gliders? Or give the link to the comment on Snarkmarket?

"Powered aircraft pilots practice engine-out scenarios as part of emergency training. A checklist is brought out that usually sets a good glide angle and starts the pilot thinking about a place to land.

However, glider pilots perform that skill every flight. From the moment of release and throughout the flight, they’re constantly planning ahead, thinking of where the landing spot will be (and some alternates) and what flight path will get them there."



Isn't this basically a standard principle behind fault-tolerant systems design (and languages like Erlang), where you just stuff the system full of redundancy so you can crash individual parts when they go wrong and recover instantly?

Is this a dupe? I read this before, probably on HN.

Anyway, it deserves to be re-submitted in any case.

When I first wrote programs for a Unix box I would think of the main program as the application. But there would almost always be some sort of shell script to wrap invocation. Sometimes also a cronjob. Maybe a watchdog restart daemon process, etc. Then my understanding expanded to really consider all of this working in concert as the application, even the OS and its configuration, and any other service dependencies. Whatever had to be a certain way, manually configured or tuned or whatever, to achieve the desired application result.
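For what it's worth, the watchdog piece of that picture can be tiny. Below is a minimal sketch, not a production supervisor: `supervise`, `max_restarts`, and `delay` are hypothetical names, and the policy (restart on any non-zero exit, stop after a clean exit) is an assumption chosen to keep the example bounded.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, delay=0.1):
    """Run `cmd`, restarting it after each failed exit.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        completed = subprocess.run(cmd)
        exit_codes.append(completed.returncode)
        if completed.returncode == 0:
            break  # the job finished normally; nothing to restart
        if attempt < max_restarts:
            time.sleep(delay)  # brief pause so a crash loop does not spin
    return exit_codes
```

A real deployment would more likely lean on an existing init system or process supervisor; the point is only that the restart logic lives outside the main program and is part of "the application" in the wider sense.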

I think that's also where the notion of error handling comes into play because it's obvious at that level that it's not enough to just handle exceptional cases gracefully inside the body of your main program(s), and in fact, you may not want to, you may want to let certain things percolate up and cause the process to exit, if only to achieve a uniformity in how you handle edge cases.
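As a small illustration of letting things percolate up, here is a sketch where the worker deliberately catches nothing. The worker source and the name `run_worker` are invented for the example; the only behavior relied on is that the Python interpreter exits with status 1 on an uncaught exception.

```python
import subprocess
import sys

# Hypothetical worker: no try/except anywhere, so any bad input ends the
# process with a traceback and a non-zero exit status.
WORKER_SOURCE = """
import sys
value = int(sys.argv[1])   # raises ValueError on malformed input
print(100 // value)        # raises ZeroDivisionError on zero
"""


def run_worker(arg):
    """Run the worker once in a child process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, arg],
        capture_output=True,
        text=True,
    )
    return result.returncode
```

Whatever sits outside the worker sees the same signal, a non-zero exit, for every kind of failure and can apply one policy to all of them instead of a different handler per edge case.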
