Basically, at the top level of your application, you will have a supervisor that looks after all of the processes that are important to your application. Each of these processes could provide any kind of functionality (e.g., a database connection, an HTTP server, etc.), or be another supervisor. When these children start, they too may start a supervision tree of processes that are important to them (e.g., the database connection may actually start a pool of processes).
In "fail fast" or "let it crash", only the process that actually threw the exception will die. The supervisor that is looking after that process will be notified of it being killed and, depending on how you have the supervisor configured, it may or may not start a new process to replace the one that died.
Another thing to note: depending on how the supervisor is configured, it may itself crash if a particular process it is monitoring crashes too many times within a given period. That crash then bubbles up to its supervisor. Unfortunately, it is possible to take down your entire application this way.
TLDR: There is no master process that does all of this, though each supervisor is sort of a master process for its own child supervisors and workers, and the processes a supervisor watches may or may not be restarted upon failure.
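For concreteness, here is a rough sketch of what such a top-level supervisor can look like in Erlang/OTP. The module and child names (my_app_sup, db_conn, http_listener) are invented for illustration; the per-child restart type and the intensity/period flags are what control the "may or may not restart" and "crashes too many times" behaviour described above.

```erlang
-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% If children crash more than 5 times within 10 seconds, the
    %% supervisor itself gives up and exits, bubbling the failure up to
    %% *its* supervisor (or taking the application down at the top).
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [#{id => db_conn,
                  start => {db_conn, start_link, []},
                  restart => permanent},    % always restart if it dies
                #{id => http_listener,
                  start => {http_listener, start_link, []},
                  restart => transient}],   % restart only on abnormal exit
    {ok, {SupFlags, Children}}.
```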
The BEAM VM allows an old version and a current version of every module to be loaded at the same time. When you call a function with a fully qualified name (Module:Function), it always calls into the current version; if you call a function within a module only by its function name, it calls into the same version that is executing, which could be the old one. So you need to periodically (or on demand, via some message) make a fully qualified call to ensure your process will migrate. You also need to make sure the old version doesn't stay on the stack, so you have to be tail recursive, at least sometimes. And you need to make your new code able to cope with state built up by old code, which can be challenging at times.
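Here's a minimal sketch of that pattern in a hand-rolled server loop (module and message names invented; if you use gen_server, it makes the fully qualified call and handles code_change/3 for you):

```erlang
-module(counter).
-export([start/0, loop/1]).

start() ->
    spawn(fun() -> loop(0) end).

loop(Count) ->
    receive
        {bump, From} ->
            From ! {count, Count + 1},
            %% Local tail call: stays on whatever version of the module
            %% this process is already executing (possibly the old one).
            loop(Count + 1);
        upgrade ->
            %% Fully qualified tail call: jumps to the current version of
            %% the module, so the process migrates to the new code.
            ?MODULE:loop(Count)
    end.
```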
If your service is generally stable, but occasionally crashes on some types of requests, then you're in a good place. If something is crashing a lot, it can cascade into a supervisor crash, and you are likely to have a bad day. In theory, when your service starts (started by you, or restarted by the supervisor), it has a consistent state and will be able to service requests; but often it started crashing because some service it depends on stopped working right, and restarting the client doesn't really help.
I've found "let it crash" is a good philosophy, but it shouldn't always be implemented literally. In an HTTP server, I'd rather catch crashes, log them and return an error to the client -- not just close the socket. In Erlang server processes that don't maintain much state, running in pg2, it's better to catch and log, because in-flight requests will be lost if you actually crash.
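Something along these lines, as a sketch; handle_request/1 and do_work/1 are invented placeholders, and a real HTTP server would turn {error, internal} into a 500 response rather than letting the connection just drop:

```erlang
-module(http_handler).
-export([handle_request/1]).

handle_request(Req) ->
    try
        {ok, do_work(Req)}
    catch
        Class:Reason:Stacktrace ->
            %% Log the failure and hand back an error tuple; the layer
            %% above can turn this into a 500 response instead of the
            %% client just seeing its socket close.
            logger:error("request failed: ~p:~p~n~p",
                         [Class, Reason, Stacktrace]),
            {error, internal}
    end.

%% Placeholder for the real work; anything that throws or errors in here
%% is caught above instead of killing the process.
do_work(Req) ->
    erlang:error({not_implemented, Req}).
```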
In practice, this means your processes need to be able to recover state when they boot up. I have some processes that act as caches of database values, which is the simplest kind of recovery because you can just reload from the DB.
Instead of having to handle multiple possible error paths and create multiple recovery methods to fix up state, it's often simpler to crash and reuse the main initialization route to do the recovery. This way you end up with one well-known and well-tested initialization route.
To build on your example, I'd imagine it'd be simple to have a check that compares the current revision of the cached data against a global version number or a timeout. If that check fails, you just crash and allow the supervisor to restart the process; the regular initialization path will reload the data from the database. A rough sketch of this is below.
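Here is that idea as a sketch, assuming hypothetical db:load_all/0 and db:current_version/0 helpers and a made-up check interval:

```erlang
-module(cache).
-behaviour(gen_server).

-export([start_link/0, get/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(CHECK_INTERVAL, 60000).  % re-check freshness every minute

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get(Key) ->
    gen_server:call(?MODULE, {get, Key}).

init([]) ->
    %% The one well-tested recovery path: load everything from the DB.
    erlang:send_after(?CHECK_INTERVAL, self(), check_freshness),
    {ok, #{version => db:current_version(), data => db:load_all()}}.

handle_call({get, Key}, _From, State = #{data := Data}) ->
    {reply, maps:find(Key, Data), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(check_freshness, State = #{version := Version}) ->
    case db:current_version() of
        Version ->
            erlang:send_after(?CHECK_INTERVAL, self(), check_freshness),
            {noreply, State};
        _Newer ->
            %% Stale: crash on purpose and let the supervisor restart us,
            %% which runs init/1 again and reloads from the database.
            {stop, stale_cache, State}
    end.
```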
As other threads say, you use this to build supervision trees.
Processes might also monitor each other for other reasons. For example, a resource pool would monitor processes which check out a resource, in case they die before returning the resource.
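For example, a sketch of that pattern (the module layout and resource representation are invented; real pools like poolboy do the same thing with more care):

```erlang
-module(resource_pool).
-behaviour(gen_server).

-export([start_link/1, checkout/0, checkin/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(Resources) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Resources, []).

checkout() -> gen_server:call(?MODULE, checkout).
checkin(Res) -> gen_server:cast(?MODULE, {checkin, Res}).

init(Resources) ->
    {ok, #{free => Resources, out => #{}}}.

handle_call(checkout, {ClientPid, _Tag},
            State = #{free := [Res | Rest], out := Out}) ->
    %% Watch the client; if it dies before checking the resource back in,
    %% we get a 'DOWN' message and can reclaim the resource.
    MonRef = erlang:monitor(process, ClientPid),
    {reply, {ok, Res}, State#{free := Rest, out := Out#{MonRef => Res}}};
handle_call(checkout, _From, State = #{free := []}) ->
    {reply, {error, empty}, State}.

handle_cast({checkin, Res}, State = #{free := Free, out := Out}) ->
    %% Stop watching whichever client was holding this resource.
    case [M || {M, R} <- maps:to_list(Out), R =:= Res] of
        [MonRef | _] ->
            erlang:demonitor(MonRef, [flush]),
            {noreply, State#{free := [Res | Free],
                             out := maps:remove(MonRef, Out)}};
        [] ->
            {noreply, State#{free := [Res | Free]}}
    end.

handle_info({'DOWN', MonRef, process, _Pid, _Reason},
            State = #{free := Free, out := Out}) ->
    case maps:take(MonRef, Out) of
        {Res, Out2} ->
            %% The client died while holding a resource: put it back.
            {noreply, State#{free := [Res | Free], out := Out2}};
        error ->
            {noreply, State}
    end.
```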