

Error Codes or Exceptions? Why is Reliable Software So Hard? - troels
http://damienkatz.net/2006/04/error_code_vs_e.html

======
agentultra
Why not conditions and restarts as Lisp does it? Lisp bakes it into the
language, but it's not impossible to implement in Java (or Blub X of choice..)
according to Peter Seibel (see his Google talk).

The one thing that makes working with exceptions difficult is that when one is
encountered the stack is unwound up to the first matching handler (and
possibly just passed off to the next one, etc). You can lose a lot of state
this way and it makes restarting computations difficult (or even deciding what
to do in some cases).

Using exceptions just requires careful planning and I think conditions and
restarts are a much more elegant way of solving the problem.

------
alanh
To answer the titular question: Because (1) most software, and most
frameworks, are designed optimistically, thinking almost exclusively about
when everything goes well. And (2) because of _positive bias_ we test cases we
expect to go well.†

For example, in the Tornado web stack, it’s certainly possible to gracefully
handle an exception you threw in the middle of responding to a method, but you
have to basically augment the framework to do so. It’s practically unfinished
in that sense. Would that have been the case if it were designed with failure
modes occupying its creators’ minds even half as much as successful ones?

†The most fun example/intro to this I’ve seen:
<http://www.fanfiction.net/s/5782108/8/Harry_Potter_and_the_Methods_of_Rationality>

~~~
zmmmmm
> Because (1) most software, and most frameworks, are designed optimistically,
> thinking almost exclusively about when everything goes well. And (2) because
> of positive bias we test cases we expect to go well

Yes, this realization finally hit me a few years ago and started to make me
rein in my use of exceptions. I realized that exceptions are all about giving
priority to the optimistic case in the code - making it as clear and simple
and with as little branching as possible. However, after seeing projects
through a full lifecycle (from initial conception through to maintenance in
production), I now realize that the _really_ important code - the code that we
spend most of our time trying to figure out in maintenance over the lifetime
of the application, and whose state is hardest to understand and deal with -
is _in fact_ the error handling code. Exceptions focus on hiding that
code, making it "disappear" and seem implicit. What seems like a win
initially can be a huge loss when you're trying to write a robust system where
every possible behavior is explicit and well understood.

~~~
moe
Don't equate exception with "Error".

Many common problems, especially in network servers, can be solved quite
elegantly with an "exception driven design". In those scenarios the exceptions
serve as a vehicle to bubble state through multiple layers. Pretty much like
an event-framework, except many people don't realize they have that baked
right into their language.

A simple example would be a network-server that detects a protocol phase
change at the lowest protocol level. By raising a "PhaseChange"-Exception you
can quite nicely propagate such an event to the higher layers that need to
know about it, without resorting to duplicated/shared state or awkward call-
chains that introduce nasty dependencies and then need their own exception
handling, and without running into potential synchronization issues.
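
A rough Python sketch of that shape, assuming a three-layer server. The layer functions, the `PhaseChange` fields, and the use of a `STARTTLS` line as the trigger are all invented for illustration:

```python
class PhaseChange(Exception):
    """Not an error: carries a protocol event up through the layers."""

    def __init__(self, new_phase):
        super().__init__(new_phase)
        self.new_phase = new_phase


def read_frame(raw):
    # Lowest layer: the only place that understands the wire format.
    if raw.startswith("STARTTLS"):
        raise PhaseChange("tls")
    return raw


def decode_command(raw):
    # Middle layer: knows nothing about phases, so the event passes through.
    return read_frame(raw).strip().upper()


def serve(lines):
    # Top layer: owns the session state and reacts to the event.
    phase, handled = "plain", []
    for raw in lines:
        try:
            handled.append((phase, decode_command(raw)))
        except PhaseChange as event:
            phase = event.new_phase
    return handled


session = serve(["helo a", "STARTTLS", "mail b"])
```

The middle layer needs no phase flag and no extra return value. Only the top layer, which cares about the phase, mentions it.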

------
bad_user
I don't really like the article, but I will grant the author the benefit of
the doubt as he's the creator of CouchDB. Maybe his ideas are right, but
worded badly.

I lost my interest at "_PHP to the Rescue_". I understand the point he tried
to make, but he's wrong.

PHP (or CGI/FastCGI for that matter) only works because the computation done
on each request is very light and it maps well onto the HTTP protocol, which is
at its core a stateless protocol. But it only works because the real work of
indexing, searching and retrieving of documents gets pushed to the database
server, a piece of software that does serious gymnastics to serve the results
you want. The database server does give you ACID and it does this precisely
because it can UNDO and once the database server crashes, all hell breaks
loose.

A better example in this context would be Chrome, which sandboxes each tab
inside its own process, such that the crashing of a tab doesn't affect the
whole browser. But tabs themselves have to be long-running and stable, and
those tabs are still crashing and valuable work may still be lost, not to
mention the whole browser still freezes because of plugins that haven't been
fully sandboxed.

Also, it doesn't warm my heart when I lose an hour's work, even if the other
tabs haven't crashed.

    
    
        Don't Undo Your Actions, Just Forget Them 
        ... Use a Functional Language
    

This doesn't address the bigger and most important issue - some resources are
inherently mutable.

Once a file is changed, it stays changed (sorry, you'll have to replace the OS
to avoid it and what can I say, good luck with that). Once an email is sent,
it stays sent. Once a phone call is made, you can't really pretend that it
isn't. Once a bank transaction is made, you have to undo it if anything went
wrong, otherwise you're facing serious penalties.

Not dealing with mutable data only works insofar as you're dealing with dumb
logic and not everything is as simple as an HTTP request that returns rows of
comments in a simple blog.

Also, I find an article bitching about OOP and about UNDO weird at best if it
doesn't reference THE OOP recipe for undoing whatever you changed --
<http://en.wikipedia.org/wiki/Command_pattern>
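
For reference, a bare-bones Python sketch of the Command pattern used for rollback. The `Append` and `Fail` commands and the `run_all` helper are hypothetical, just enough to show the shape: each command carries its own inverse, and a failure partway through undoes the completed steps in reverse order.

```python
class Append:
    """A command that knows how to reverse itself."""

    def __init__(self, target, item):
        self.target, self.item = target, item

    def execute(self):
        self.target.append(self.item)

    def undo(self):
        self.target.pop()


class Fail:
    """A command that always fails, to exercise the rollback path."""

    def execute(self):
        raise RuntimeError("step failed")

    def undo(self):
        pass


def run_all(commands):
    """Execute commands in order; on failure, undo the completed ones."""
    done = []
    try:
        for command in commands:
            command.execute()
            done.append(command)
    except Exception:
        for command in reversed(done):
            command.undo()
        return False
    return True


log = []
ok = run_all([Append(log, "a"), Append(log, "b")])
rolled_back = run_all([Append(log, "c"), Fail()])
```

After both runs `log` holds only `"a"` and `"b"`. This only works when every step has a real inverse, which is exactly what a sent email lacks.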

~~~
damienkatz
Boy, you missed the point. Undo here isn't about user-undo. It's about when
the program can't continue: you need to undo all incomplete state changes and
return to a previous state. Really this article is about the lack of
transactions in programming languages, and how error conditions cause problems
for software with long-lived state.

The point about PHP is that it successfully does that precisely because of the
way it's architected. The state is kept in the DB server, where the transaction
problem is solved.

When using long-lived server processes, things tend to get into inconsistent
states on error conditions unless you are very careful about how you are
programming and dealing with state mutation and error conditions. As ugly as it
seems, PHP style development frees you from that.

I like your point about Chrome. It's a great architecture and it's actually
quite similar to the original Apache/PHP combo with its process isolation.

Anyway, user-level undo is still an important concept, but orthogonal to
creating reliable software systems.

~~~
nostrademons
You could just tl;dr the article as "Use transactions". Transactions don't
have to be tied to a database, and anyone with a CS background should know
what they are.
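
As a small illustration that a transaction needs no database, here is a Python sketch that snapshots an in-memory dict and restores it if the block raises. The `transaction` helper and the account data are made up, and a deep copy is the crudest possible implementation, so treat it as a demonstration of the idea only.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state):
    """Snapshot a dict and restore it if the with-block raises."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)
        raise


def transfer(state, src, dst, amount):
    with transaction(state) as tx:
        tx[dst] += amount  # state is mutated first...
        if tx[src] < amount:
            raise ValueError("insufficient funds")  # ...then the failure hits
        tx[src] -= amount


accounts = {"alice": 100, "bob": 0}
transfer(accounts, "alice", "bob", 30)
try:
    transfer(accounts, "alice", "bob", 500)
except ValueError:
    pass  # the half-finished credit to bob has been rolled back
```

The failed transfer leaves `accounts` exactly as it was after the first one. It covers in-memory state only; anything that has already left the process is out of its reach.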

------
jrockway
I'm not sure his example for "exceptions are too complicated" is very good.
When delivering mail, the operation isn't "first try the primary server, then try
the backup server". It's "try the servers in ascending order of priority,
randomly selecting if the priorities are the same". From that, the code
becomes much simpler:

    
    
        def servers := ( sort possible servers according to rules );
        def sent_to;
    
        send_attempt: for server in servers {
            try {
                server->send_mail_to( message );
                sent_to := server;
                last send_attempt;
            }
            catch Temporary Failure {
                logger->note_temporary_failure( ... )
            }
        }
    
        if(!sent_to)
            throw Permanent Failure ("EPIC FAIL");
    
    

The key is to use exceptions for what they're good for: unwinding the stack
under certain conditions. We want the stack to unwind if there is some reason
to abort mail sending completely ("network cable not plugged in"), but we only
want to unwind to the point of trying the next server if the server is simply
unavailable.
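
For comparison, a runnable Python rendering of the same idea. The exception classes, the `deliver` signature, and the `flaky_send` stub are assumptions made up for the sketch, not any real mail library's API. Because `return` leaves the loop directly, the `sent_to` variable is not needed here.

```python
import random


class TemporaryFailure(Exception):
    """This server is unavailable right now; try the next one."""


class PermanentFailure(Exception):
    """No server accepted the message."""


def deliver(message, servers, send, log):
    """Try servers in ascending priority order, ties broken randomly.

    servers is a list of (priority, name) pairs; send(name, message) may
    raise TemporaryFailure. Any other exception propagates to the caller.
    """
    ordered = sorted(servers, key=lambda s: (s[0], random.random()))
    for _priority, name in ordered:
        try:
            send(name, message)
            return name
        except TemporaryFailure as exc:
            log.append((name, str(exc)))  # note it and move on
    raise PermanentFailure("no server accepted the message")


def flaky_send(name, message):
    # Stand-in transport: the primary is down, everything else works.
    if name == "mx1":
        raise TemporaryFailure("connection refused")


failures = []
used = deliver("hi", [(10, "mx1"), (20, "mx2")], flaky_send, failures)
```

Here `used` ends up as `"mx2"` and `failures` records the one temporary failure on `mx1`.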

Using error codes alone would be complicated ("if result_code ==
TRY_AGAIN_LATER then { try the other server } otherwise { return
TOTALLY_FAILED_DUDE }"), and using exceptions without the "sent_to" state
would also be quite complicated (I'm not really sure how to even do that).

The key is to use the features for what they're good at, rather than to treat
them like an ideology.

From an ideology perspective, your language's designer already decided what he
wants you to do. If you can accidentally ignore an error, you're using the
wrong one.

That means in Java, you tend to favor exceptions, because you get some
compile-time type checking to make sure you're handling them. And if you don't
handle them, your program exits at runtime, which is a good thing. Return
codes, on the other hand, are easy to ignore, and your program produces
undefined results if you forget to check them (and no compiler is going to
tell you where that is: you'd better have good tests, tolerant users, and no
deadline).

In Haskell, on the other hand, the ideology is reversed. If you throw
exceptions in IO, they basically work like exceptions in untyped languages --
your program exits unless you were lucky enough to have a runtime handler. (If
you throw them outside of IO, like with "error", you're double fucked. You
can't even catch them outside of IO, which basically means Never Do That.)

But the good news is, there are type-safe return codes. You define a type
like:

    
    
        data Either a b = Left a | Right b
    

and then you make your functions return Either the Right answer or an error.
If you have a function that returns an error code, you can't do anything with
the result until you intentionally handle the error. If you forget, your
program simply does not compile -- there's no way to ignore errors except by
explicitly writing code to ignore the error.

And, of course, this is generally hidden from the programmer with monads! You
don't even have to think about handling the return codes: the language does it
for you.
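
A loose Python approximation of the pattern, for readers who don't know Haskell. Python will not refuse to compile if you ignore a `Left`, so this shows only the chaining that the monad does for you, not the type safety. The `bind`, `parse_int`, and `reciprocal` names are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Left:
    """Failure case: carries the error."""
    error: object


@dataclass(frozen=True)
class Right:
    """Success case: carries the answer."""
    value: object


def bind(result, fn):
    """Apply fn to a Right's value; pass a Left through untouched."""
    if isinstance(result, Left):
        return result
    return fn(result.value)


def parse_int(text):
    try:
        return Right(int(text))
    except ValueError:
        return Left(f"not a number: {text!r}")


def reciprocal(n):
    return Left("division by zero") if n == 0 else Right(1 / n)


good = bind(parse_int("4"), reciprocal)
bad = bind(parse_int("x"), reciprocal)
zero = bind(parse_int("0"), reciprocal)
```

The first error short-circuits everything after it: `bad` is the parse error and `reciprocal` is never called on it.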

(C is a weird case where both exceptions and error codes work poorly. For that
reason, I make sure to write C in very tiny pieces that can be composed with a
safer high-level language. In that case, there are only a few places where
errors can occur, and they tend to be simple like "tried to allocate memory,
it didn't work, so I deallocated everything else" or "failed to send to the
socket: EAGAIN". The complex logic like "failed to send to primary MX" should
be handled higher up in your stack.)

------
cturner
I believe this is a solved problem, although it's not used much.

EOF and Apache Cayenne are ORMs that have a notion of a hierarchy of 'editing
contexts'. If you save to the top-level one, it commits to the schema. But you
can create a 'child editing context'. You can then interact with this graph.
When you commit it, it gets pushed up to the parent editing context. Of
course, if you want you can then commit again to persist it.

This allows you to build and throw out a deck.
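
A toy Python model of the nesting, to make the two-step commit concrete. This is not the EOF or Cayenne API; the `EditingContext` class and its method names are invented for the sketch.

```python
class EditingContext:
    """Toy nested editing context backed by plain dicts."""

    def __init__(self, parent=None):
        self.parent = parent
        self.committed = {}  # stands in for the database at the root
        self.pending = {}    # edits not yet saved

    def child(self):
        return EditingContext(parent=self)

    def get(self, key):
        for layer in (self.pending, self.committed):
            if key in layer:
                return layer[key]
        return self.parent.get(key) if self.parent else None

    def set(self, key, value):
        self.pending[key] = value

    def save(self):
        """Push pending edits one level up, or persist them at the root."""
        target = self.parent.pending if self.parent else self.committed
        target.update(self.pending)
        self.pending.clear()

    def discard(self):
        self.pending.clear()


root = EditingContext()
scratch = root.child()

scratch.set("order:1", "draft")
scratch.discard()                      # thrown away; root never saw it
after_discard = root.get("order:1")

scratch.set("order:2", "widgets")
scratch.save()                         # now pending in root, not persisted
root.save()                            # second commit persists it
```

Discarding the child costs nothing, and nothing reaches `root.committed` until both levels have saved.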

This creates a new problem, a human one. People are used to interacting with
web applications, and having everything saved somewhere after each screen
change. With this arrangement, stuff doesn't necessarily get saved. I built an
application that allowed you to create an order, and then create a person
associated with it, and all this in memory. Users were annoyed that when they
didn't save the order, the people they'd created associated with it also got
thrown out.

~~~
jamesaguilar
He should have been persisting the work in progress too. That's not really a
weakness inherent to editing layers.

------
smanek
I think the ideas behind STM
(<http://en.wikipedia.org/wiki/Software_transactional_memory> - usually used
for concurrency), present some interesting opportunities to automatically
'rewind' state in the case of exceptions.

I can imagine systems where you provide multiple implementations, and the
system automatically finds one that works.

~~~
zmmmmm
STM is great, but let's face it, it's not getting that email back once it went
off to the user, or "uncharging" the user's credit card (sure you can refund,
but you've still made a mess on their statement).

------
oconnore
That read like it was written by a Haskell programmer pretending to be an
Enterprise-y Java coder™. Sort of condescending, but maybe I like it?

~~~
fferen
Certainly better than the reverse, isn't it?

------
lcargill99
Finite state machines make exception processing easier. They also inspire
horror in those used to if() then ... else() trees, but them's the breaks.
SFAIK, FSMs are the difference between high reliability and ... not so reliable
systems.

~~~
Zaak
This sounds interesting, but I can't think of how FSMs could be used to
process exceptions. Would you explain?

~~~
lcargill99
Each failure event becomes a state transition instead of an exception. You can
then go a long way towards proving that nothing in the FSM object's state was
trashed. You can also build a test harness that provides good coverage (up to
100%) and documentation of that level of coverage.

That's one way to do hi-rel processing in C, and it makes choice of
language less of an issue. And if you have adequate logging of test site
installs (meaning all state transition data is logged), then you can
reproduce 100% of failures in a controlled environment.

It's not magic, but somehow, making the error cases explicit events has (for
me at least) the effect of letting me reason about them more effectively.
It's just another event.
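
A tiny Python sketch of the shape, with a made-up connection machine. The states, events, and the `run` helper are all hypothetical. Failures such as `timeout` and `send_failed` sit in the same table as everything else, and the transition log is enough to replay a run exactly.

```python
TRANSITIONS = {
    ("idle", "connect"): "connecting",
    ("connecting", "connected"): "ready",
    ("connecting", "timeout"): "idle",       # failure is just another event
    ("ready", "send_ok"): "ready",
    ("ready", "send_failed"): "connecting",  # recover by reconnecting
    ("ready", "close"): "idle",
}


def run(events, state="idle"):
    """Feed events through the table, logging every transition taken."""
    log = []
    for event in events:
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            raise ValueError(f"no transition from {state!r} on {event!r}")
        log.append((state, event, next_state))
        state = next_state
    return state, log


final, trace = run(
    ["connect", "timeout", "connect", "connected", "send_failed", "connected"]
)

# Replaying the logged events reproduces the run exactly.
replayed = run([event for _state, event, _next in trace])
```

Since the table is the whole behavior, a test harness can enumerate every `(state, event)` pair in it, which is where the coverage claim comes from.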

------
joeyh
This is an article from April 2006, so it was written around the same time he
was rewriting CouchDB in Erlang.

~~~
damienkatz
Indeed. This article is me trying to explain what's been missing from most
server-side languages; it became so much more obvious once I learned Erlang.
But at the time, almost no one had heard of Erlang, so it mostly just gets a
passing mention.

