Hacker News
Exceptions in Elixir (whatdidilearn.info)
55 points by ck3g 8 months ago | 13 comments

How does "fail fast" look in practice? Is there a "master" thread catching and handling exceptions thrown by other threads?

Not quite. Erlang's VM (BEAM) uses supervision trees to keep track of the various processes and their crashes. Each supervisor and worker in the tree is its own process. So, if you have 5 supervisors, and each of those starts 2 worker processes, you will have 15 processes in your supervision tree (the VM will spawn a bunch for itself to use, so do not think you only have 15 processes running in your system).

Basically, at the top level of your application, you will have a supervisor that will look after all of the processes that are important to your application. Each of these processes could have any kind of functionality (e.g., database connection, HTTP server, etc), or be another supervisor. When you start these applications, they too may start a supervision tree of processes that are important to them (e.g., the database connection may actually start a pool of processes).

In "fail fast" or "let it crash", only the process that actually threw the exception will die. The supervisor that is looking after that process will be notified of it being killed and, depending on how you have the supervisor configured, it may or may not start a new process to replace the one that died.
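A minimal sketch of the restart behaviour described above, runnable as a plain `.exs` script. `Demo.Worker` and the `:boom` message are made up for illustration; the short sleep is a crude way of giving the supervisor time to restart the child.

```elixir
defmodule Demo.Worker do
  use GenServer

  # Hypothetical worker, registered under its module name so it is easy to find.
  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}

  # An uncaught exception in a callback kills only this process.
  @impl true
  def handle_cast(:boom, _state), do: raise("simulated failure")
end

# :one_for_one restarts only the child that died.
{:ok, sup} = Supervisor.start_link([Demo.Worker], strategy: :one_for_one)

old_pid = Process.whereis(Demo.Worker)
ref = Process.monitor(old_pid)
GenServer.cast(Demo.Worker, :boom)

# Wait until the old worker is really gone.
receive do
  {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
end

# Give the supervisor a moment to start the replacement.
Process.sleep(100)
new_pid = Process.whereis(Demo.Worker)
IO.inspect({old_pid, new_pid}, label: "old and new worker")
```

The crash is logged, the supervisor stays up, and the registered name now points at a fresh process with fresh state.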

Another thing to note, depending on how the supervisor is configured, it may actually crash if a particular process it is monitoring crashes too many times. This will make the supervisor crash and it should bubble up to its supervisor. Unfortunately, it is possible to take down your entire application this way.
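A rough sketch of that restart limit, assuming a deliberately strict setting of one allowed restart per five seconds (`max_restarts` and `max_seconds` are real `Supervisor` options; `Demo.Flaky` is a made-up worker). The script traps exits only so it can observe the supervisor going down instead of dying with it.

```elixir
defmodule Demo.Flaky do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast(:boom, _state), do: raise("simulated failure")
end

# Turn the supervisor's exit signal into a message we can inspect.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([Demo.Flaky],
    strategy: :one_for_one,
    max_restarts: 1,
    max_seconds: 5
  )

# Crash the worker twice in quick succession: the first crash is restarted,
# the second exceeds the limit.
for _ <- 1..2 do
  pid = Process.whereis(Demo.Flaky)
  ref = Process.monitor(pid)
  GenServer.cast(pid, :boom)

  receive do
    {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
  end

  Process.sleep(100)
end

# The supervisor gives up and exits; its own parent would see this exit.
sup_exit =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    1_000 -> :still_running
  end

IO.inspect(sup_exit, label: "supervisor exit reason")
```

In a real tree the process linked to that supervisor is its parent supervisor, which is how the failure bubbles upward.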

TLDR: There is no master process that does all of this. Though, each supervisor is sort of a master process for each of its child supervisors and workers, and the processes a supervisor watches may or may not be restarted upon failure.

So, is there a particular strategy to organizing code in order to 'hot-swap' it (failing code) out, while keeping a production system up and running?

Pretty much! I haven't played with that feature myself, but Erlang's telecom origins help explain this feature. If you're upgrading a telephone switch with N live calls, it'd be optimal to not have to kill those calls just to upgrade some software. There's more nuance to it than that, but "little-to-no downtime", or hot swappable code, is a language feature. Pretty neat idea in an era of "throw away the whole VM/container" to push a config update.

So, there's maybe two parts to your question? How do you structure your code to make it possible to hot load code -- and how does that help you recover from crashes.

The BEAM VM allows for an old version and a current version of all modules. When you call into a function with a fully qualified name (Module:Function), it always calls into the current version; if you call a function within a module only by its function name, it calls into the same version that is executing, which could be the old version. So, you need to periodically (or on demand via some message) make a fully qualified call, to ensure your process will migrate. You also need to make sure the old version doesn't stay on the stack, so you have to be tail recursive, at least sometimes. You also need to make sure you make your new code able to cope with state developed by old code, which can be challenging at times.
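To make the local versus fully qualified distinction concrete, here is a small sketch of a hand-rolled receive loop. `Demo.Counter`, the `:upgrade` message, and `migrate/1` are all hypothetical names; the script does not actually load a new module version, it only shows where the fully qualified tail call would go.

```elixir
defmodule Demo.Counter do
  def start, do: spawn(fn -> loop(0) end)

  def loop(count) do
    receive do
      :incr ->
        # Local call: keeps running whichever version of the module we are in.
        loop(count + 1)

      :upgrade ->
        # Fully qualified tail call: continues in the current (newest) version,
        # and leaves nothing from the old version on the stack.
        # migrate/1 is also called fully qualified, so the new code gets to
        # convert state that the old code built up.
        __MODULE__.loop(__MODULE__.migrate(count))

      {:get, from} ->
        send(from, {:count, count})
        loop(count)
    end
  end

  # Placeholder state conversion; a new version would adapt old state here.
  def migrate(count), do: count
end

pid = Demo.Counter.start()
send(pid, :incr)
send(pid, :upgrade)
send(pid, :incr)
send(pid, {:get, self()})

count =
  receive do
    {:count, n} -> n
  after
    1_000 -> :timeout
  end

IO.inspect(count, label: "count after upgrade")
```

OTP behaviours such as GenServer handle this migration step for you through their `code_change` callback, so a loop like this is mostly useful for understanding the mechanism.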

If your service is generally stable, but occasionally crashes with some types of requests, then you're in a good place. If something is crashing a lot, it can cascade into a supervisor crash, and it is likely that you will have a bad day. In theory, when your service starts (started by you, or if the supervisor restarts it), it has a consistent state, and will be able to service requests; but often it started crashing because some service it requires stopped working right, and restarting the client doesn't really help.

I've found let it crash is a good philosophy, but shouldn't always be implemented literally. In an HTTP server, I'd rather catch crashes, log them and return an error to the client -- not just close the socket. In Erlang server processes that don't maintain much state running in pg2, it's better to catch and log, because requests are going to be lost if you actually crash.
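A sketch of the "catch, log, return an error" approach, assuming a made-up `Demo.Handler` with requests modelled as plain maps and responses as `{status, body}` tuples. It is not tied to any particular HTTP library.

```elixir
defmodule Demo.Handler do
  # Stand-in for real request logic; the second clause simulates a bug.
  def handle(%{path: "/ok"}), do: {200, "hello"}
  def handle(%{path: path}), do: raise(ArgumentError, "cannot handle #{path}")

  # Wrap the handler so an exception becomes a logged 500 response
  # instead of a dead connection process.
  def safe_handle(request) do
    handle(request)
  rescue
    error ->
      IO.puts(:stderr, Exception.format(:error, error, __STACKTRACE__))
      {500, "internal server error"}
  end
end

IO.inspect(Demo.Handler.safe_handle(%{path: "/ok"}))
IO.inspect(Demo.Handler.safe_handle(%{path: "/boom"}))
```

Note that `rescue` only covers exceptions; exits and throws would need a `catch` clause as well.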

Other responses are great. One thing I'd also point out is that the exceptions aren't really caught so much as the process is crashed, and the supervisor boots up a new process depending on how you've set it up.

In practice, this means your processes need to be able to recover their state when they boot up. I have some processes that act as caches of database values, which is the simplest type of recovery because you can just load from the db.

Another sub-point on your point is that restarting a process after failure can reduce the number of code paths that need to be handled after errors.

Instead of having to handle multiple possible error paths and create multiple recovery methods to fix up state, it’s often simpler to crash and re-use the main initialization route to do the recovery. This way you end up with one well known and tested "initialization route".

To build on your example, I’d imagine it’d be simple to have a check that compares the current revision of the cached data against a global version or timeout. Then if this check fails you just crash and allow the supervisor to restart the process. The regular initialization path will load the data from the database.
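One way this could look, as a sketch only: a cache that schedules a staleness check in `init/1` and simply stops when the timer fires, so the supervisor restarts it and the normal initialization path reloads the data. `Demo.Cache`, the `:loader` function (standing in for the database query), and the `:max_age_ms` option are all invented for the example.

```elixir
defmodule Demo.Cache do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def get(key), do: GenServer.call(__MODULE__, {:get, key})

  # The single, well-tested initialization route: load everything fresh.
  @impl true
  def init(opts) do
    loader = Keyword.fetch!(opts, :loader)
    max_age = Keyword.get(opts, :max_age_ms, 60_000)
    Process.send_after(self(), :check, max_age)
    {:ok, %{data: loader.()}}
  end

  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.fetch(state.data, key), state}
  end

  # No in-place refresh logic: when the data is too old, crash and let the
  # supervisor run init/1 again.
  @impl true
  def handle_info(:check, state), do: {:stop, :stale_cache, state}
end

# Fake "database" that returns a different version on every load.
loader = fn -> %{version: System.unique_integer([:positive, :monotonic])} end

{:ok, sup} =
  Supervisor.start_link(
    [{Demo.Cache, loader: loader, max_age_ms: 200}],
    strategy: :one_for_one
  )

{:ok, first} = Demo.Cache.get(:version)
Process.sleep(300)
{:ok, second} = Demo.Cache.get(:version)
IO.inspect({first, second}, label: "versions before and after restart")

Supervisor.stop(sup)
```

The very short `max_age_ms` here is only to keep the demo quick; restarts count against the supervisor's restart limit, so a real value would be much longer.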

I wrote a blog post[0] that walks through some example Elixir code to explain it. It boils down to having Supervisor processes that monitor worker processes (or other supervisor processes), and the supervisor receives a message when the worker process exits. The supervisor can then decide what to do based on what the error message is.

[0] https://medium.com/@tylerpachal/let-it-crash-creating-an-exa...

Nitpicking, in Erlang-lingo supervisors do not monitor but instead link the worker processes. This is important when the supervisor itself fails as monitors are unidirectional and would keep the (then) unsupervised processes alive while a link is bidirectional, so all workers will be killed if the supervisor goes down (and they are not explicitly trapping exits).
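The difference is easy to see with bare processes. In this sketch a "parent" links to its child while a "watcher" only monitors its child; both then exit abnormally. All names are made up, and none of the children trap exits.

```elixir
me = self()

# Case 1: link. The child is tied to the parent in both directions.
parent =
  spawn(fn ->
    child = spawn_link(fn -> Process.sleep(:infinity) end)
    send(me, {:child, child})

    receive do
      :crash -> exit(:boom)
    end
  end)

linked_child =
  receive do
    {:child, pid} -> pid
  end

child_ref = Process.monitor(linked_child)
send(parent, :crash)

# The parent's exit signal travels over the link and takes the child down.
linked_reason =
  receive do
    {:DOWN, ^child_ref, :process, ^linked_child, reason} -> reason
  after
    1_000 -> :still_alive
  end

# Case 2: monitor. Only the watcher learns about the child, not vice versa.
watcher =
  spawn(fn ->
    child = spawn(fn -> Process.sleep(:infinity) end)
    Process.monitor(child)
    send(me, {:child, child})

    receive do
      :crash -> exit(:boom)
    end
  end)

orphan =
  receive do
    {:child, pid} -> pid
  end

watcher_ref = Process.monitor(watcher)
send(watcher, :crash)

receive do
  {:DOWN, ^watcher_ref, :process, ^watcher, _reason} -> :ok
end

IO.inspect(linked_reason, label: "linked child went down with")
IO.inspect(Process.alive?(orphan), label: "monitored child still alive")
```

The monitored child is left running with nobody looking after it, which is exactly the situation links prevent.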

Awesome, thanks for the info! I guess I have never really thought about the case where my supervisor would go down before the workers would. But you're right, the processes are "linked" in Elixir as well.

Yes, a typical Elixir/Erlang application is a process tree with parent processes acting as supervisors to children processes. Each supervisor knows how to handle its children crashing, and properly replace crashed processes. They can do things like “if one child crashed restart all children” or “only restart the crashed child”.

The underlying mechanisms are called links and monitors. When a process fails, a message is delivered to other processes, which makes the exception asynchronous toward the processes that you want to receive it.

As other threads say you use this to build supervision trees.

Processes can monitor each other. Processes dedicated to monitoring and restarting other processes are called supervisors. You code for the "happy path" and let a tree of supervisors handle unexpected failures. An uncaught exception will torpedo its process.

Processes might also monitor each other for other reasons. For example, a resource pool would monitor processes which check out a resource, in case they die before returning the resource.
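A toy version of that resource pool idea, to show the shape of it rather than a production design. `Demo.Pool` and its functions are hypothetical; the pool monitors whoever checks a resource out and reclaims it when the `:DOWN` message arrives.

```elixir
defmodule Demo.Pool do
  use GenServer

  def start_link(resources), do: GenServer.start_link(__MODULE__, resources)
  def checkout(pool), do: GenServer.call(pool, :checkout)
  def checkin(pool, resource), do: GenServer.call(pool, {:checkin, resource})
  def available(pool), do: GenServer.call(pool, :available)

  # busy maps monitor reference => resource held by the monitored process.
  @impl true
  def init(resources), do: {:ok, %{free: resources, busy: %{}}}

  @impl true
  def handle_call(:checkout, _from, %{free: []} = state), do: {:reply, :empty, state}

  def handle_call(:checkout, {pid, _tag}, %{free: [res | rest]} = state) do
    # Watch the borrower so the resource can be reclaimed if it dies.
    ref = Process.monitor(pid)
    {:reply, {:ok, res}, %{state | free: rest, busy: Map.put(state.busy, ref, res)}}
  end

  def handle_call({:checkin, res}, _from, state) do
    case Enum.find(state.busy, fn {_ref, held} -> held == res end) do
      {ref, _held} ->
        # Normal return: stop watching and drop any pending :DOWN message.
        Process.demonitor(ref, [:flush])
        {:reply, :ok, %{state | free: [res | state.free], busy: Map.delete(state.busy, ref)}}

      nil ->
        {:reply, {:error, :unknown_resource}, state}
    end
  end

  def handle_call(:available, _from, state), do: {:reply, length(state.free), state}

  # The borrower died without checking in: put its resource back.
  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    case Map.pop(state.busy, ref) do
      {nil, _busy} -> {:noreply, state}
      {res, busy} -> {:noreply, %{state | free: [res | state.free], busy: busy}}
    end
  end
end

{:ok, pool} = Demo.Pool.start_link([:conn_a])
me = self()

borrower =
  spawn(fn ->
    {:ok, _res} = Demo.Pool.checkout(pool)
    send(me, :checked_out)
    Process.sleep(:infinity)
  end)

receive do
  :checked_out -> :ok
end

while_borrowed = Demo.Pool.available(pool)
Process.exit(borrower, :kill)

# Crude wait for the pool to process the :DOWN message.
Process.sleep(100)
after_crash = Demo.Pool.available(pool)
IO.inspect({while_borrowed, after_crash}, label: "free resources")
```

A monitor fits here rather than a link because the pool should notice the borrower dying without being taken down by it.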
