
The Zen of Erlang - mononcqc
http://ferd.ca/the-zen-of-erlang.html
======
jcizzle
I think the important thing about Erlang (the system as a whole) is that you
really have to understand what it does first. It's not one of those platforms
you can just jump into, start toying around with, and have things work 'good
enough.' Once you do understand what it does, though, it is exceptionally good
at it.

This post does a great job of explaining what Erlang, as a whole, does and why
it does it.

------
agentgt
The lighter version of let-it-crash is a circuit breaker. This is done quite
frequently in the JVM world because... well, the JVM has a really shitty
startup time, and even restarting thread pools can be expensive.

I get the whole let-it-crash idea, but I really would like more tools for
feedback control and backpressure handling (i.e. what's the right number of
threads to allocate, how many failures/timeouts should you allow, etc.). Even
monitoring is a pain (i.e. too many alarms). I don't know if Erlang provides
libraries for this, but it's a hard problem (see
[https://github.com/Netflix/Hystrix/issues/131](https://github.com/Netflix/Hystrix/issues/131)).

~~~
davidw
Erlang has circuit breakers too, like this:
[https://github.com/jlouis/fuse](https://github.com/jlouis/fuse)

Sadly, they are not mentioned much in books or other documentation, despite
being a potentially extremely useful piece of infrastructure for some kinds of
projects.
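
For the curious, a rough sketch of what using fuse can look like. The fuse
name, the thresholds, and the `storage:get/1` call are all made up for
illustration; see the fuse README for the real details:

```erlang
%% Install a fuse (the name storage_fuse is hypothetical): it blows
%% after 5 melts within 10 seconds, and resets 60 seconds later.
ok = fuse:install(storage_fuse, {{standard, 5, 10000}, {reset, 60000}}),

case fuse:ask(storage_fuse, sync) of
    ok ->
        try storage:get(Key)             %% hypothetical backend call
        catch _:_ ->
            fuse:melt(storage_fuse),     %% register the failure
            {error, unavailable}
        end;
    blown ->
        {error, unavailable}             %% breaker open: fail fast
end.
```

The nice part is that callers hitting a blown fuse fail fast instead of
piling up on a dead backend.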

~~~
mononcqc
Chapter 3 of Erlang in Anger
([http://www.erlang-in-anger.com/](http://www.erlang-in-anger.com/)) does
mention them among other strategies for handling overload (3.2.2). I tried to
put as much concise production experience as I could into that manual.
Hopefully it proves helpful!

~~~
thedudemabry
Since you're here, I just want to thank you for the most thorough, accessible,
and pragmatic Erlang writing I've run across. Cheers!

------
davidw
Here's a devil-in-the-details question that you might consider adding to your
excellent article:

You have a web server in there, and also a storage system. What happens when
the errors propagate up and the storage system dies? Does it force the entire
node to reboot? Shouldn't the web server stay up to keep users informed that
there is a serious problem, rather than simply going away? What's the best way
to accomplish that?

~~~
mononcqc
Author here. This is a challenging one, because it is intimately related to
what is acceptable or not to your users.

By default you could say that if the storage mechanism must be up and
available and it isn't, then the front-end shouldn't be responsive and it
should crash.

You could also say that you want the front-end app to be available if the
storage layer is offline. This has two possible consequences:

a) you disconnect the front-end and the back-end so that they do not depend on
each other. This can be done either through application strategies (you can
define the storage app as 'transient' so it can fail without shutting down the
system) or by putting the front-end on a different Erlang node.

The latter means that your dependency on the storage back-end is not as direct
as it seems.
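
If it helps, the application-strategy part of a) roughly corresponds to OTP's
application restart types (the app names here are hypothetical):

```erlang
%% The restart type given to application:start/2 decides what a
%% terminating application does to the rest of the node:
%%   permanent - any termination takes the node down
%%   transient - only an abnormal termination takes the node down
%%   temporary - the node keeps running either way
ok = application:start(frontend, permanent),
ok = application:start(storage, transient).
```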

b) this is my preferred solution, and it requires you to rework what you think
of as 'depends on'. If you expect the storage layer to fail and that you must
be able to service the front-end anyway, then the architecture demoed in the
presentation needs an asterisk.

The reason for this is that the dependency as described crashes if the
database is not available, because the storage subtree acts as a proxy for
'the database'. The OTP structure encodes 'my database is available'.

I can rework that requirement to mean 'the storage layer is up and ready to
talk to a database'. This is a huge change because it no longer promises the
DB is available; it promises that something whose job it is to talk to the DB
is available.

You can then change your interface accordingly. I go into some more detail
about this in "It's about the guarantees"
[http://ferd.ca/it-s-about-the-guarantees.html](http://ferd.ca/it-s-about-the-guarantees.html)

In a nutshell, the difference in both initialization and supervision
approaches is that in the one described in b), the client's callers make the
decision about how much failure they can tolerate, not the client itself. The
client making the decisions is what is described in the presentation.
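
A rough sketch of what b) can look like in code (module, record, and function
names are all made up): the storage client survives a DB outage and hands back
a tagged error, leaving the failure decision to each caller:

```erlang
%% In the storage client (a gen_server): having no connection is a
%% normal state to be in, not a reason to crash.
handle_call({get, _Key}, _From, State = #state{conn = undefined}) ->
    {reply, {error, not_connected}, State};
handle_call({get, Key}, _From, State = #state{conn = Conn}) ->
    case db:get(Conn, Key) of                  %% hypothetical DB API
        {ok, Value}     -> {reply, {ok, Value}, State};
        {error, Reason} -> {reply, {error, Reason},
                            State#state{conn = undefined}}
    end.

%% A caller that tolerates staleness falls back to a cache; one that
%% can't tolerate it simply crashes on the error tuple instead.
fetch(Key) ->
    case storage_client:get(Key) of
        {ok, Value} -> Value;
        {error, _}  -> cache:lookup(Key)       %% hypothetical fallback
    end.
```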

Sadly I could not fit all of that and the compromise of supervision structures
within the hour I had allocated for my presentation, so this comment and the
side-blog post ought to do (I've also put that material in Erlang in Anger, if
you happen to grab that free ebook).

~~~
davidw
I wish more people would talk about this kind of thing in the Erlang world.
Supervision trees are nice, but there are real-world examples like the above
where it's not quite so cut-and-dried, and some additional design is required.
Each of your proposed solutions involves compromises, costs, and benefits of
their own that may not be obvious to someone new to Erlang.

The insight of people such as yourself who have already run into these
problems is very valuable to those of us with less experience.

Thanks!

~~~
mononcqc
I think a lot of these things are experience-related, or usually cemented
within a specific implementation. A lot of people may apply these principles
correctly because that's what they find works best, without necessarily
bringing it to a conscious level, or to a level of explicitness that makes it
easy to teach or use.

Garrett Smith is starting to hit on that with
[http://www.erlangpatterns.org/](http://www.erlangpatterns.org/) and trying to
broadcast that kind of information to the rest of the community, but I'm
guessing participation hasn't been strong enough to help (I know I haven't
contributed enough to that website personally).

------
siscia
A little shameless plug, but someone might find it interesting.

I wrote a very tiny booklet about writing highly scalable, fault-tolerant,
distributed systems. The source and the compiled PDF can be found here:
[https://github.com/siscia/intro-to-distributed-system](https://github.com/siscia/intro-to-distributed-system)

~~~
harigov
Good job. I think it is way too introductory, tbh. A few examples of working
distributed systems, with discussion of why they are the way they are, might
be useful. Also, unless you have plans to update it in the future, you might
get more readers interested if you publish it as a blog.

~~~
siscia
I will keep the booklet alongside the blog posts I write; if people are
interested, I will keep expanding it.

Thanks for your feedback though :)

------
technion
When I wrote ctadvisor[0], I continually ran into issues with certificates in
the chain that weren't encoded the way I expected. Sometimes it was
legitimate: it took a week to realise I occasionally hit an email certificate
that looks quite different. And sometimes it was just because some CAs
generate unusual certs.

Every time such a thing happened, it would crash and just plow on. I never
actively planned that. It's incredibly powerful.

[0][https://github.com/technion/ct_advisor](https://github.com/technion/ct_advisor)

------
such_a_casual
A really excellent piece. Having no experience whatsoever with Erlang, I feel
like I have a very strong idea of its purpose and approach after reading this.
Not only that, but it's convinced me that a system for prioritizing and
restarting pieces of code is essential to all projects. It seems dumb to not
have a system like this. Thank you for taking the time to write this up.

------
finishingmove
Best article I've read on Erlang. Learned a couple of new things.

------
morenoh149
will a video be posted?

~~~
mononcqc
No. The conference organizers didn't have time to set up a recording.

Maybe if I end up giving the talk again somewhere, there could be a recording.

------
hubbins
I thought the inclusion of a photo of the Challenger disaster as an example of
"blow it up" was in very poor taste. People died.

~~~
mononcqc
"blow it up" was an example of a thing that "could not make sense" for rocket
science as a quote. I also did not know (and currently do not know) of
significant rocket explosions or failures that didn't result in the loss of
human life, sadly.

Looking at the list here
[https://en.wikipedia.org/wiki/List_of_spaceflight-related_ac...](https://en.wikipedia.org/wiki/List_of_spaceflight-related_accidents_and_incidents),
I'm guessing Soyuz 33, STS-1 and a few others would have worked, but any of
those would have brought back similar images. Whether the space shuttle image
was of an intact one or of the Challenger explosion, a failure in rocket
science reminds you of whichever ones you have seen; car crashes and airplane
crashes are likely the same.

Then again, it's possible the whole slide is in bad taste. I wanted to convey
what the 'let it crash' stuff felt like to me the first time I heard it, and
the Challenger disaster felt both higher profile and more distant in our
collective memory than any random disaster I could have used.

I could probably have avoided discussing the topic entirely, but I hoped that
the surrounding context, in which it's obviously a bad idea to have 'blow it
up' as a rocket science motto, would save it. It possibly failed.

~~~
simoncion
> I wanted to convey what the 'let it crash' stuff felt to me the first time I
> heard it, and Challenger's disaster felt both higher profile and more
> distant in our collective memory than any random disasters I could have
> used.

This was a good choice.

> ...I hoped that the context around it where I think it would obviously be a
> bad idea to have 'blow it up' as a rocket science motto would save it.

Given enough people, someone will _inevitably_ take offense to _anything_ you
write. If someone is insufficiently capable of considering the _context_ in
which a reminder of a thirty-year-old high-profile disaster [0] is presented,
they're gonna be unreasonably kerfluffled.

[0] A disaster that was _caused_ by a _serious_ failure to remember and stay
within the safety margins of a _very_ complex and hazardous system... which
makes the choice of this _particular_ disaster even _more_ apt.

