It has been extremely stable, and scaling has been a non-issue. Error reporting has become easier and easier, now that companies like Sentry and AppSignal have integrations for Elixir.
Elixir is VERY fault-tolerant. DB connection crashing? Ah well, reconnects immediately, while still serving the static parts of the application. PDF generation wonky? Same thing. Incredibly fast on static assets, still very fast for anything else.
I've had nothing but fun with the language and the platform. And the Phoenix Framework is just icing on the cake. I've been fortunate to have been to many community events, and meeting (among so many others) José and Chris at conferences has made me very confident that this piece of software has a bright future. The Elixir Slack is also VERY helpful, with maintainers of most important libraries being super responsive.
I would not start another (side or production) project with anything other than Elixir.
I still don't understand this.
I don't think I've ever built a web server in any language where this wasn't true unless I specifically wanted hard failure.
The amount of fault tolerance would be a per-app design goal rather than something that seems to be a language feature. I've worked on apps in all languages that range from any failure being a hard failure to being impossible to crash, and this is due to business requirements.
For example, regarding your examples, just about every web server I can think of will automatically turn uncaught exceptions into 500 responses unless you opt otherwise.
In most languages, you achieve this behaviour by rescuing/catching exceptions. In Erlang/Elixir, we don't like to do that, because exceptions are a mechanism to signal that something went wrong, and telling the system to continue despite failures is not a good practice.
Instead, in Erlang/Elixir, you organize your software using separate entities (called processes), which are completely isolated. Therefore, by definition, if something fails, it won't affect other parts of your system. This also leads to other features like supervision trees, which allow you to restart part of your application, exactly because you know all of those entities are isolated.
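To make the isolation and supervision point a bit more concrete, here is a minimal sketch rather than anything from a real app. The module Demo.Counter and the names :pdf_counter and :db_counter are made up for illustration. Two Agent processes run under a one_for_one supervisor; one is killed on purpose, and the expectation is that the supervisor starts a fresh copy while the other process keeps its state.

```elixir
defmodule Demo.Counter do
  use Agent

  # Each counter is its own process holding its own private state.
  def start_link(name), do: Agent.start_link(fn -> 0 end, name: name)
  def increment(name), do: Agent.update(name, &(&1 + 1))
  def value(name), do: Agent.get(name, & &1)

  # Hypothetical helper: poll until the supervisor has registered a
  # replacement process under the same name.
  def await_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid)
    end
  end
end

# Two children of the same module need distinct ids in the supervisor.
children = [
  Supervisor.child_spec({Demo.Counter, :pdf_counter}, id: :pdf_counter),
  Supervisor.child_spec({Demo.Counter, :db_counter}, id: :db_counter)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

Demo.Counter.increment(:pdf_counter)
Demo.Counter.increment(:db_counter)

# Simulate a crash in one component only.
old_pid = Process.whereis(:db_counter)
Process.exit(old_pid, :kill)
new_pid = Demo.Counter.await_restart(:db_counter, old_pid)

# The crashed counter comes back as a new process with its initial state;
# the sibling was never touched and keeps what it had.
IO.inspect(new_pid != old_pid, label: "db_counter restarted")
IO.inspect(Demo.Counter.value(:db_counter), label: "db_counter after restart")
IO.inspect(Demo.Counter.value(:pdf_counter), label: "pdf_counter")
```

Nothing here rescues an exception. The failed process simply dies, and because it shared no memory with its sibling, restarting it from a known initial state is safe.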
When you have shared mutable state, it is much harder to have something like built-in supervisors, because you have no guarantee that a crashed entity did not also corrupt the shared state.
In a nutshell, I would say Erlang/Elixir makes you think more about failures and how things go wrong.
I know this sounds a bit handwavy, but it is not that trivial to explain those details in text. I have also given a talk on this called Idioms for Building Fault-Tolerant and Distributed Applications, in case you are interested: https://www.youtube.com/watch?v=B4rOG9Bc65Q
The magic is in the supervisor pattern, explained here for Erlang: http://erlang.org/documentation/doc-4.9.1/doc/design_princip...
It is hard to describe why this "feels different" in Elixir than it does in Express.js or a Tomcat running a Java application. It's all experiential for me, but maybe I can put the sentiment in words: I always KNOW that whatever part of my application may break, however much and for whatever duration, the scheduler and the supervisors will make sure that the rest of the system runs exactly as intended, and the broken part of the system will be back up eventually. I did not have this feeling (as strongly) prior to working with Elixir.
But I will admit this is a very subjective position. And I am not sure you'd experience it the same way were you in a similar situation.
A dead BEAM process can be restarted in a few microseconds, load up some complex state, and keep going. I don't believe the same can be said of a dead K8s service.
In most runtimes, initialization like that is linear (think bash’s execfail switch); if something fails to initialize, the whole HTTP app daemon will crash out, get restarted by its init(8) process, and then try again.
In Erlang, you’ve got something more like “services” in the OS sense: components of the program that each try to initialize on their own, independently, in parallel, with client interfaces that can return “sorry, not up yet” kinds of errors as well as the regular kind—or can just block their clients’ requests until they do come up (which is fine, because the clients are per-request green threads anyway.) In Erlang, the convention is that these services will just keep retrying their init steps when they hit transient internal errors, with the clients of the component being completely unaware that anything is failing, merely thinking it isn’t available yet.
Certainly, Erlang still has a linear+synchronous init phase for its services, just like OSes have a linear+synchronous early-init phase at boot. But the only things that should happen in that phase involve acquiring local resources like memory or file handles which, if unavailable, reflect a persistent runtime configuration error (i.e. the dev or the ops person screwed up), rather than a transient resource error.
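A rough sketch of that convention, under some loud assumptions: Demo.Connection, the flaky_connect function, and the 50 ms retry delay are all invented for illustration, and real libraries typically add backoff and logging. The synchronous init/1 only does cheap local setup; the connection attempt happens afterwards and is retried on transient errors, while callers get a "not up yet" style reply in the meantime.

```elixir
defmodule Demo.Connection do
  use GenServer

  def start_link(connect_fun) do
    GenServer.start_link(__MODULE__, connect_fun, name: __MODULE__)
  end

  def query(q), do: GenServer.call(__MODULE__, {:query, q})

  @impl true
  def init(connect_fun) do
    # Synchronous phase: local setup only. The risky part is deferred
    # by sending ourselves a message that is handled after init returns.
    send(self(), :connect)
    {:ok, %{connect_fun: connect_fun, conn: nil}}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn}}

      {:error, _reason} ->
        # Transient failure: stay alive and try again shortly.
        Process.send_after(self(), :connect, 50)
        {:noreply, state}
    end
  end

  @impl true
  def handle_call({:query, _q}, _from, %{conn: nil} = state) do
    {:reply, {:error, :not_connected}, state}
  end

  def handle_call({:query, q}, _from, state) do
    {:reply, {:ok, {state.conn, q}}, state}
  end
end

# Hypothetical backend that fails on the first two connection attempts.
{:ok, attempts} = Agent.start_link(fn -> 0 end)

flaky_connect = fn ->
  n = Agent.get_and_update(attempts, fn n -> {n + 1, n + 1} end)
  if n < 3, do: {:error, :unavailable}, else: {:ok, :fake_conn}
end

{:ok, _pid} = Demo.Connection.start_link(flaky_connect)

# Right after startup the service is running but not connected yet.
early = Demo.Connection.query("select 1")
IO.inspect(early, label: "early query")

# Give the retries time to succeed, then ask again.
Process.sleep(300)
late = Demo.Connection.query("select 1")
IO.inspect(late, label: "later query")
```

The alternative mentioned above, blocking callers until the service comes up, could be done by stashing the `from` argument and replying later with GenServer.reply/2 instead of answering immediately.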
Indeed, any language runtime could adopt a component initialization framework like this; but no language other than Erlang, AFAIK, has this as its universal “all ecosystem libraries are built this way” standard. If you want this kind of fault-tolerance from random libraries in other languages, you tend to have to wrap them yourself to achieve it.
(You could say that things like independent COM apartments or CLR application domains which load into a single process are similar to this, but those approaches bring with them the overhead of serialization, cross-domain IPC security policy enforcement, etc., making them closer to the approach of just building your program as a network of small daemon processes with numerous OS IPC connections. Erlang is the “in the small, for cheap” equivalent to these, for when everything is part of the same application and nothing needs to be security-sandboxed from anything else, merely fault-isolated.)
The fault tolerance that I love about Erlang/Elixir is the actor model. Everything is (or can be) an actor, which is like a living and breathing instance of a class. So they can live and do their own stuff, and then if they fail at that and need to be recreated, they get recreated by something that supervises them.
Contrast this with, for instance, a Django or Rails app: if a vital service in the system dies, the entire Ruby or Python runtime will (potentially) die and then respawn. It's cheap and we don't care, right? It will get restarted. The net result is similar: you don't get woken up in the middle of the night and customers are happy. But in systems where you want or NEED an entire system to remain on 24x7x365, it changes the game.
I've written large applications in Clojure/Clojurescript and I've seen/reviewed reasonably large code bases in Elixir, and while I would agree that Elixir is a very good solution for many problems, it is not a tool for everything.