
Let It Crash the Right Way (2009) - jacquesm
https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the-right-way/
======
erikb
I'd say in a test environment you should let it crash a lot. But in
production, facing users (who can be other programmers who suddenly forget
that they are programmers and know what you are dealing with), there are few
situations where you should really crash.

Think about a very dumb user application like Word. Let's say Word crashed
with a stack trace, some log messages, the best error message ever, maybe even
saving the user's data in a backup file somewhere. The user will think Word is
unusable.

Now let's consider a scenario in which Word recovered under the hood by losing
the user's data but never losing the GUI. The user still thinks it's bad, but
he won't think it's unusable. To some degree he will blame himself for not
saving the data himself.

That's why I would be far less willing to crash in front of the user
and would apply Pokemon Error Handling (Gotta Catch them All) in all
user-facing software as a feature, not a bug.

~~~
thaumaturgy
I have trouble imagining very many situations where crashing is _the right
thing to do_.

Program gets bad or unexpected input (parsers, tokenizers, renderers, data
processing): parse or process as much valid input as possible, then set a
helpful error condition and return.

Program encounters a resource constraint (malloc failed, can't get a lock, no
socket available, permissions problem): set a helpful error condition and
return.

Something really went bugnuts (kernel panic): crash, but as gracefully as
possible.

Raising exceptions is an OK way to transmit errors up the call chain, although
it's not my personal favorite, but allowing unhandled exceptions to find their
way up the call chain doesn't seem like a great idea to me.

edit: I went back and re-re-read the post. His points here seem reasonable:

> _Do I know how I’m supposed to handle an error here?_

> _If not, then should I handle it? (Thus going back to specification)_

etc., but I don't understand the code example at the end; I'd think it would
be good to have more precise information about what failed and where and why,
rather than omitting that exception handling and letting a parent process or
something else deal with it further up the chain where the cause is far more
opaque.

~~~
lostcolony
Not having that catchall will trigger a "badmatch", which will cause the
process to crash. The Erlang VM will log it to the error log. This isn't a
"throw/catch", where the caller will catch it and have to do something; the
process this happens in -will- die (and the parent may time out waiting for
a response).

But it's an Erlang process; it should be supervised (or separately linked or
monitored, but that's neither here nor there). If it is, it will be restarted
in a known good state (or at least it should be; that's the guarantee the
developer needs to work towards in an Erlang process).

So given that, the outcome is that you get a log of what happened (and it
stands out), you wrote less code, and you return to a known good state.

The outcome of the catch-all is that you have to explicitly log what happened,
it is very easy to miss, and your process/system is left running in a possibly
ambiguous state (since either that message should have been handled, or it
never should have been sent).
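
To make the contrast concrete, here is a minimal sketch (not the article's
code). The names `crash_demo`, `lookup/1`, `strict/1` and `defensive/1` are
made up for illustration. `strict/1` asserts the happy path and dies with a
badmatch otherwise; `defensive/1` has the catch-all and quietly returns
`undefined`. `run/0` uses a monitor only so the exit reason can be observed.

```erlang
-module(crash_demo).
-export([strict/1, defensive/1, run/0]).

%% Hypothetical lookup that can fail.
lookup(known) -> {ok, 42};
lookup(_)     -> {error, not_found}.

%% "Let it crash": assert the expected shape; anything else is a badmatch.
strict(Key) ->
    {ok, Value} = lookup(Key),
    Value.

%% Catch-all: the failure is swallowed and the caller gets a placeholder.
defensive(Key) ->
    case lookup(Key) of
        {ok, Value} -> Value;
        _           -> undefined
    end.

%% Run strict/1 in a monitored process and return how that process exited.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> strict(unknown) end),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

`crash_demo:run()` should return a tuple whose first element is
`{badmatch,{error,not_found}}`, followed by the stack trace of the dead
process, while `crash_demo:defensive(unknown)` returns `undefined` and leaves
no trace unless you log it yourself.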

Now, that said, dealing with bad user data and sending them a useful error
message -is- something that requires some work, some sussing out of what went
wrong, yes.

Talking to external resources (as alluded to elsewhere in mentioning jlouis'
Fuse library) -is- something you want to limit crashes on (since supervisors
are built as a hierarchy, and just because a piece of hardware is temporarily
unavailable, you shouldn't crash so often that the system comes down).

Malloc...you probably still want to crash on. Crashes will percolate up the
supervisor hierarchy, until eventually hitting the highest level one,
effectively dropping all memory and resetting to a good state. If that doesn't
get you to a good state, there -is- no way to get there without manual
intervention.

------
snarfy
"We all know the saying it’s better to ask for forgiveness than permission.
And everyone knows that, but I think there is a corollary: If everyone is
trying to prevent error, it screws things up. It’s better to fix problems than
to prevent them. And the natural tendency for managers is to try and prevent
error and overplan things."

— Ed Catmull, President of Pixar

https://signalvnoise.com/posts/2440-we-all-know-the-saying-its-better-to-ask-f

------
jakejake
Please excuse the self-promotion, but if you like this topic, I wrote a related
post, "Insidious Bugs or: How I Stopped Worrying and Learned to Love
Exceptions": http://verysimple.com/2009/04/03/insidious-bugs/

It does seem like a lot of people are afraid to let a user see an error, but
if you're not extremely careful then attempts to shield the user can mask
bugs.

------
jimbobimbo
Thank you for posting this! I'll be sure to share the link with my team: a lot
of people look at exceptions as if they're some kind of voodoo that needs
to be handled explicitly in every nook and cranny of the code.

------
EGreg
This is great advice for everything except debugging.

I would like to accumulate a "stack trace" of calls among actors to see what
led to a particular exception, and log it.

~~~
lostcolony
Slight problem there, at least in Erlang; if A sends a message to B that
causes B to crash, A doesn't know about it. A doesn't in any way see the error
(though it may be restarted if part of the same supervisor, and it may time out
and crash if waiting for a response). So handling that message/not handling
that message doesn't affect debugging; the top of your stack trace will be in
B, regardless (and you may have a separate one at A if it times out waiting
for a response from B, or if it's been linked to B such that it dies when B
dies). Errors do not propagate except where you've said they should; i.e., by
your supervision tree, or by explicit linking.
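
A small sketch of that isolation, with made-up names (`isolation_demo`,
`b_loop/0`). The sending process only learns that B died because it
explicitly asked for a monitor; drop the `erlang:monitor/2` call and it would
sit in the `receive` until the timeout.

```erlang
-module(isolation_demo).
-export([run/0]).

%% B crashes on the first message it receives.
b_loop() ->
    receive
        Msg -> error({cannot_handle, Msg})
    end.

run() ->
    B = spawn(fun b_loop/0),
    %% Ask to be notified when B dies. Without a monitor or a link,
    %% the sender never hears about the crash.
    Ref = erlang:monitor(process, B),
    B ! whatever,   % the send itself succeeds no matter what B does next
    receive
        {'DOWN', Ref, process, B, {Reason, _Stack}} ->
            {b_crashed, Reason}
    after 1000 ->
        no_news
    end.
```

The stack that arrives in the `'DOWN'` message belongs to B alone; nothing
from the sending process appears in it.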

~~~
EGreg
No, B should log it along with the accumulated call stack, before crashing.

~~~
lostcolony
In Erlang, B will log only what stack accumulated in B's process. Nothing from
A's. That is -

    
    
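      %% Module name is arbitrary.
      -module(demo).
      -export([run/0]).
    
      %% B waits for any message and then throws; A sends one and finishes.
      run() ->
        B = spawn(fun() -> receive_func() end),
        A = spawn(fun() -> send_func(B), io:format("A completed normally~n", []) end),
        {A, B}.
    
      send_func(Pid) -> Pid ! whatever.
    
      receive_func() -> receive _ -> throw(blah) end.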
    

You will receive a call stack listing only an error in receive_func. You will
not have the anonymous function in B (due to tail call optimization) in the
call stack, and you will not have the send_func in the call stack, at all,
because that process completed normally; there was no error involved with it
(a fact you can see from the io:format call, which will be invoked just fine). A
sent its message, and then printed something, and that's it. B's error does
not relate to A; A might have a separate error (such as if I were to add a
receive clause to send_func, expecting a response, that times out after a
while), but that error is unrelated to B, and will create an entirely separate
error in the logs, with an entirely separate call stack.

~~~
jacquesm
Sorry for posting that just before running out of the office. You are 100%
correct. So the thing to do here is to isolate the message that caused the
crash; that way you don't need to know what happened in 'A'. You _shouldn't_
have to know what happened in 'A'!

If you do have to know what happened in A, then you are using messages where
you should have used function calls; after all, a message should be
context-free once it leaves the sender. This may require some re-thinking of
the concept of messaging. It is wrong to see a message as some kind of RPC
mechanism: a message is a self-contained package sent to a recipient that
contains everything the recipient needs in order to process it and/or elicit a
reply. Any other state should be elsewhere and should not be required at all
to identify the cause of the crash.
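
A rough sketch of what such a self-contained message could look like. The
module name, the `{request, From, Ref, Payload}` shape and the `divide`
operation are all invented for illustration. The payload carries the
operation and its arguments, and the envelope carries the reply address, so
the offending message can be replayed against `handle/1` without knowing
anything about the sender.

```erlang
-module(selfcontained_demo).
-export([handle/1, run/0]).

%% Pure handler: everything it needs is in the payload itself.
handle({divide, X, Y}) -> X div Y.

%% The worker keeps no per-sender state; the reply address and the
%% correlation reference travel inside the message.
worker() ->
    receive
        {request, From, Ref, Payload} ->
            From ! {reply, Ref, handle(Payload)},
            worker()
    end.

run() ->
    {Pid, MRef} = spawn_monitor(fun worker/0),
    Ref = make_ref(),
    Pid ! {request, self(), Ref, {divide, 10, 0}},
    receive
        {reply, Ref, Result} ->
            {ok, Result};
        {'DOWN', MRef, process, Pid, {Reason, _Stack}} ->
            {crashed, Reason}
    after 1000 ->
        timeout
    end.
```

Here `selfcontained_demo:run()` should come back as `{crashed, badarith}`, and
calling `selfcontained_demo:handle({divide, 10, 0})` directly reproduces the
same failure with no sender involved.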

Getting this right can be quite tricky.

