

Do messages get lost when Erlang modules are upgraded? - satvikc
http://pankajmore.in/code-reloading-in-erlang.html

======
rubyrescue
Hot code reloading is really amazing in Erlang. We regularly use it to patch a
function without restarting. Getting reltool properly configured to deploy the
new version of your application, though, is a MAJOR pain in the ass.

for the non-erlangers:

the 'simple' way (ie i have a tiny change i want to make, so i'll just ssh in
and do it) is to just edit your code, then attach to an existing node, and do
make:all([load]) to pick up the code changes. we added a makefile "make
attach" that does this for us.

the 'reltool' way is you package your application into releases. We use
jenkins build numbers as the third digit in our releases, so each one is
packaged into a zip, encrypted, and then is ready to deploy. at deploy time,
the zip is copied to all machines, then we use some escript to attach to the
running nodes, and use the 'release_handler' to bring up the N+1 version of our
app. it's a pain but once it's configured it's amazing...

~~~
blankverse
Could you talk a little bit about what happens to processes which are running
different versions of the code? Do you make your code backward compatible so
that old processes continue to work with the new process? Or do you just let
the old processes crash?

~~~
strmpnk
You can upgrade state in many cases if the changes are trivial, though
depending on the ephemerality of a process you might let it finish or crash
and restart instead of doing this.

~~~
blankverse
So, in the worst case, to achieve basic code reloading in other Erlang-like
systems (e.g. Cloud Haskell), we could let every process in the cluster crash
and restart with the new version?

~~~
strmpnk
It's rarely "only crash" as that eventually bubbles up to being equivalent to
a reboot. It's simply that you can have subsystems that do that in effective
isolation if the effects of a reboot are minimal.

On the code state upgrade side there are usually a few different issues that
can arise and I'd be very curious to hear if there are ways Haskell would
handle some of these.

The first one is type changes. I might have a record that has a new field
added. Now it's not necessarily pretty to upgrade on call with a pattern match
or using a code upgrade protocol but it's easily expressed dynamically.
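
To make the "upgrade on call" idea concrete, here is a minimal sketch in Python (chosen only so it runs anywhere; it is not Erlang). The names `StateV1`, `StateV2`, `upgrade` and `handle` are hypothetical, and the default for the new field is an assumption. In OTP the comparable hook is a gen_server's `code_change` callback.

```python
from dataclasses import dataclass


@dataclass
class StateV1:
    # Old shape of the process state.
    count: int


@dataclass
class StateV2:
    # New shape: one field added by the upgraded code.
    count: int
    label: str


def upgrade(state):
    """Convert a known old state shape to the current one."""
    if isinstance(state, StateV1):
        # Assumed default for the new field; a real upgrade would pick
        # whatever default makes sense for the application.
        return StateV2(count=state.count, label="unknown")
    return state


def handle(state, message):
    """New-version handler: upgrades stale state before using it."""
    state = upgrade(state)
    if message == "incr":
        return StateV2(count=state.count + 1, label=state.label)
    return state
```

A process that still carries a `StateV1` value gets converted the first time the new handler touches it, so old and new state shapes can coexist for a while.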

Another is in the interface, like adding new arguments or changing from a
synchronous call to an asynchronous one. These are a bit easier to handle via
indirection though they show that you'll need to plan your entry/exit points
for upgrades carefully (again, OTP has things like gen_server which make this
much easier).

If Haskell can manage to get past the type boundary issue then it's really a
matter of supporting at least 2 simultaneous versions of code so each process
can be scheduled and upgraded in natural course. Handling more than 2 could be
of use depending on how aggressively you want to purge. For example, a local
rather than fully qualified call can be captured in a closure and passed around
in some value to be called later. These long-lived references will need to
be handled carefully or you might get some delayed surprises.
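
A rough Python sketch of the two-version idea and the stale-reference surprise follows. `CodeServer` and the `greet_*` functions are hypothetical names used only for illustration; this models the bookkeeping, not the Erlang VM. In particular, Erlang's `code:purge/1` kills processes still running purged code, which this sketch does not attempt to model.

```python
class CodeServer:
    """Keeps at most two versions of each named function: current and old."""

    def __init__(self):
        self.current = {}
        self.old = {}

    def load(self, name, fn):
        # The previous 'old' version is dropped (purged); the previous
        # 'current' version becomes 'old'.
        if name in self.current:
            self.old[name] = self.current[name]
        self.current[name] = fn

    def call(self, name, *args):
        # Late-bound lookup: always runs the newest loaded version.
        return self.current[name](*args)

    def is_loaded(self, fn):
        # True while fn is still the current or old version of something.
        return fn in self.current.values() or fn in self.old.values()


def greet_v1(who):
    return "hello " + who


def greet_v2(who):
    return "hi " + who


def greet_v3(who):
    return "hey " + who


server = CodeServer()
server.load("greet", greet_v1)

# A captured reference, standing in for a closure built from a local call.
captured = server.current["greet"]

server.load("greet", greet_v2)
late = server.call("greet", "joe")         # goes through the lookup: v2
stale = captured("joe")                    # still runs the v1 code
still_there = server.is_loaded(captured)   # v1 is now the 'old' version

server.load("greet", greet_v3)
purged = not server.is_loaded(captured)    # v1 has been pushed out
```

The late-bound call follows every upgrade, while the captured reference keeps pointing at code that the server has since dropped.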

------
pointernil
In my very early years when discovering coding I remember fantasizing about a
system where I could start my creation as a very simple endless loop and,
without ever having to stop it, add to it whatever would make up an application
(... more a coding experiment in Logo or BASIC, as it was at that time ;)

Fast forward almost 30 years: in Erlang light-weight processes are (most of
the time) tail recursive functions handling messages... endless loops. Those
light-weight processes are running those endless loops.

The described module upgrade functionality allows for uninterrupted system
upgrades of servers/services in the back-end which btw. is often serious
business in production and the cause of the reltool complexities.

BUT: the same hot code reloading system allows for a very sleek development
experience. You start your first version of the service and from there on
update the source and the service changes its behavior most of the time with
no restart, no lost state etc. (with the help of some monitoring tools source
changes can be picked up automagically...)

This kind of dev-environment is simply flow-inducing.

------
rdtsc
This is pretty far beyond what other frameworks and systems can even dream
about. Not everyone needs this, but when they do need it, Erlang is the only
system I know of that can handle it.

~~~
thesz
I have to say "when one REALLY-REALLY-REALLY-REALLY BADLY needs it". Because
the same effect can be achieved by different means most of the time.

~~~
rdtsc
How? You have a C++ or Java object instance running in a process. How do you
upgrade that code without restarting the process?

~~~
alanning
I imagine you would do what erlang does for you, just manually - isolate
individual server instances, do the update, then make them public again.

Say for example you are updating code in your service's web-tier. Have your
LBs not send any more new traffic to the server instance, wait some reasonable
length of time until existing connections are completed, deploy, restart, give
LBs the A-OK to start traffic up again. Repeat until all web-tier instances
are updated.
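
The drain-and-restart loop described above can be sketched in Python as follows. `Instance`, `rolling_update` and `drain_now` are hypothetical names, and the "load balancer" is reduced to an `in_rotation` flag; a real deployment would call whatever API the actual load balancer exposes.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    in_rotation: bool = True
    open_connections: int = 0


def rolling_update(instances, new_version, drain):
    """Update one instance at a time so the rest keep serving traffic."""
    log = []
    for inst in instances:
        inst.in_rotation = False       # LB stops sending new traffic
        # Record how many instances are still serving during this step.
        serving = sum(1 for i in instances if i.in_rotation)
        drain(inst)                    # wait for existing connections
        inst.version = new_version     # deploy and restart
        inst.in_rotation = True        # LB resumes traffic
        log.append((inst.name, serving))
    return log


def drain_now(inst):
    # Stand-in for "wait some reasonable length of time".
    inst.open_connections = 0


web_tier = [
    Instance("web1", "1.0", open_connections=2),
    Instance("web2", "1.0", open_connections=5),
]
log = rolling_update(web_tier, "1.1", drain_now)
```

Note that the drain step is the weak point: it assumes connections finish in a reasonable time, which is not true of long-lived sessions.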

~~~
rdtsc
OK, but what if the server you isolated was holding a long-running process? Yes,
for a web service that serves quick HTTP requests and responses you could do
that. Not everything is short-lived request and response messages. Some
sessions and processes are long-lived. If you have a socket open and data
streaming into it, it is not easy to isolate. You could, say, send it a
message that says -- start isolating.

How do you hotswap with data held on the stack or the heap? What's more, if two
parts of your code need updated instance data, how do you ensure that the update
is synchronous or happens in the right order? In Java they have a
$transformer() method. OK, but how do you regulate the order in which
$transformer() gets called? Otherwise a new version of one instance will call
an old version of another.

I am not saying it is impossible to do it, there might be a way, but it is
just usually working against the framework and against the default setup of
the system.

~~~
alanning
Yes, it can get very complicated trying to handle all the edge cases as you
described. I would guess doing it the same way erlang does it is actually the
simplest: let it crash.

This gets into coding mindset: my impression is that most Erlang programmers
expect their code to crash, whereas most other languages seem to lend
themselves to expecting the program _not_ to crash.

I am not very well versed in erlang but my readings so far imply that the
graceful handling of crashes is really where erlang/OTP shines. Regardless of
the framework, I would say it comes down to proper queuing, being able to
safely retry work, and to some extent having some smarts on the clients.

------
strmpnk
I was a little confused by the title as it implied a relationship between a
process's message box and some module which is not the case.

The article does at least demonstrate that processes can transition between
two versions of the same code w/o resetting state, which is, at its core, the
very thing that makes code upgrades remotely practical.
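
As a rough illustration of that point (in Python, not Erlang, and with hypothetical names such as `Process`, `handler_v1` and `handler_v2`), the sketch below keeps the mailbox and state with the process and looks the handler up on every iteration, so swapping the code does not touch queued messages.

```python
from collections import deque


def handler_v1(state, msg):
    return state + [("v1", msg)]


def handler_v2(state, msg):
    return state + [("v2", msg)]


class Process:
    def __init__(self, handler):
        self.code = {"handle": handler}   # swappable code table
        self.mailbox = deque()            # owned by the process, not the code
        self.state = []

    def send(self, msg):
        self.mailbox.append(msg)

    def step(self):
        # Look the handler up on every iteration, so a swap takes effect
        # at the next message without touching mailbox or state.
        msg = self.mailbox.popleft()
        self.state = self.code["handle"](self.state, msg)


proc = Process(handler_v1)
for m in ("a", "b", "c"):
    proc.send(m)

proc.step()                          # "a" is handled by v1
proc.code["handle"] = handler_v2     # swap while "b" and "c" are still queued
while proc.mailbox:
    proc.step()
```

After the loop, the state holds all three messages in order, the first handled by the old code and the rest by the new code.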

Other comments mention some more sophisticated machinery like release
upgrades. Erlang also has many code upgrade options baked into OTP. I make use
of many of these features both during development and, with some careful
review, in production. I'm always disappointed when going back
to a system that has to "reboot" itself after getting used to hot upgrades and
distributed erlang (version discrepancies in a cluster can present a similar
problem if you don't want to pause your system).

~~~
blankverse
The title is confusing! It should have been "Do messages get lost during code
reloading in erlang?"

In a prod env, how do you make sure that different versions of your code
coexist peacefully?

~~~
banachtarski
You can have multiple sets of instructions for the same function but indicate
that only one should be active. This makes it easy to rollback a deploy for
example. Also, when you do a live upgrade, there must be at least two versions
exposed to the VM at one point before the switch occurs.

Easy hot code reloading is one of the great benefits of CSP.

