
Graceful Shutdown - rumcajz
http://250bpm.com/blog:146
======
zimbatm
Any sufficiently-complex program is bound to re-implement an inferior version
of Erlang OTP :-p

Quickly you find out that recording the list of processes / tasks / ...,
including the relationships between them, is a useful thing to have. With that
first-class registry, it's possible to implement all sorts of shutdown and
restart logic.
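
To make the registry idea concrete, here is a minimal sketch in Go. It is not how OTP does it; `Task`, `Spawn` and `ShutdownOrder` are hypothetical names made up for illustration. Once the parent/child relationships are recorded, a shutdown order (children before parents) falls out of a simple tree walk:

```go
package main

import "fmt"

// Task is a hypothetical registry entry: a name plus the tasks it started.
type Task struct {
	Name     string
	Children []*Task
}

// Spawn registers a new child under t and returns it.
func (t *Task) Spawn(name string) *Task {
	c := &Task{Name: name}
	t.Children = append(t.Children, c)
	return c
}

// ShutdownOrder returns task names in the order they should be stopped:
// every child before its parent (a post-order walk of the tree).
func ShutdownOrder(t *Task) []string {
	var order []string
	for _, c := range t.Children {
		order = append(order, ShutdownOrder(c)...)
	}
	return append(order, t.Name)
}

func main() {
	root := &Task{Name: "app"}
	web := root.Spawn("web")
	web.Spawn("conn-1")
	web.Spawn("conn-2")
	root.Spawn("cron")
	fmt.Println(ShutdownOrder(root)) // [conn-1 conn-2 web cron app]
}
```

A restart policy could walk the same tree in the opposite direction, starting parents before children.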

~~~
rumcajz
Disclaimer: I have almost no experience with Erlang/OTP.

But the question here is not whether you _can_ implement different kinds of
shutdown, it's rather which kind covers the common graceful shutdown use
cases.

If Erlang/OTP has such an algorithm, I'd love to hear about it.

~~~
_asummers
Erlang has an explicit concept of processes and supervisors, and communication
by message passing. When a worker process dies, the supervisor is made aware
of it, and upon receipt of the message, can decide what to do. In some cases
that means restarting the worker to some known good state (turn it off and
on), in others just restarting a generic worker (a web server, for example),
and in others the supervisor cannot recover on its own, so it crashes itself
and lets its own supervisor do the same exercise. It also has a notion of hot code
reloading, which can go and change process state to satisfy the new software
requirements of the new app version (think upgrading a telephone pole
remotely).

~~~
rumcajz
I meant it the other way round. The question was not about what happens when
the child dies. It was rather how the parent lets the child know that it
should terminate. Or that it should eventually terminate, but that it should
attempt a clean shutdown first.

~~~
_asummers
The supervisor can also terminate its children. The communication goes both
ways. But the children may also be supervisors with their own children, which
all have their own termination behavior. The VM is designed around reliability
and addresses this sort of behavior with language constructs.

~~~
rumcajz
Ok, I guess I'll have to give it a closer look. But are you positive they have
a good story about interactions between hard and soft shutdowns? E.g. when
the grandparent requests a hard shutdown but the parent a soft shutdown. Or
vice versa. Or if timeouts of cancellations on different levels of the
hierarchy don't match?

~~~
_asummers
It was designed for dealing with telephone switches, which could go offline
for any reason, and need to be worked around. I recommend the entirety of
LYSE, but this chapter [0] should give you some footing. To your specific
question, upon receipt of the DOWN message, the processes can respond
according to what makes sense to the domain.

[0]
[https://learnyousomeerlang.com/supervisors](https://learnyousomeerlang.com/supervisors)

~~~
rumcajz
Thanks!

------
keypusher
Is this not a solved problem? You have the load balancer stop sending new
connections to the host. You send SIGTERM to the process. You wait until a
timeout, then you SIGKILL the process. This is how it's done in Kubernetes,
ECS, and other systems I've worked with. Trying to engineer the entire
lifecycle from within the webserver has a wide array of problems that the
author is bending over backwards to end up only partially solving.

~~~
rumcajz
Here's a scenario: Your webserver has an open long-lived WS connection. You
stop the load balancer. The connection still lives. Then you send SIGTERM to
the webserver. The connection still lives. The main coroutine of the webserver
catches the SIGTERM and wants to do a graceful shutdown. But the connection in
question is running in a separate coroutine. So it has to send a graceful
shutdown signal to that coroutine. Etc. You end up having to deal with
the problems described in the article.

~~~
ninkendo
If you have a websocket architecture, the client needs to tolerate a severed
connection and transparently reconnect, without user-visible impact. Anything
else is going to lead to disaster... the server end can't stay up forever.

The right thing to do in this case is for the server to also disconnect any
websockets when it gets the SIGTERM, in addition to stopping the accept loop
and the other things it would normally do. Clients simply have to reconnect
(and have enough tolerance in the socket's in-band protocol to recover state,
roll back transactions, etc.) It's a scenario that's bound to happen to a
certain percentage of your clients anyway, due to all sorts of external
factors, and must be handled regardless.

~~~
rumcajz
The CLOSE frame exists in the WS protocol for a reason. It's there so that both
sides know that all the messages were delivered before the connection was
closed.

------
maxxxxx
As a long time desktop developer I always find it interesting how little
thought is put into shutdown on the server side. The only way seems to be to
take it off the load balancer and hope that everything has finished after some
time.

I have spent numerous hours figuring out how to shut down multi-threaded
desktop apps within a reasonable time. It’s not easy.

------
skybrian
This problem seems similar to processing a request that has a deadline, after
which the request is discarded and any resources should be freed.

In Go, you can keep track of deadlines using a Context [1]. Subrequests also
need to know the deadline (or perhaps a shorter deadline), so they get passed
a derived context. The same mechanism also supports cancellation.

[1] [https://golang.org/pkg/context/](https://golang.org/pkg/context/)

~~~
latchkey
I deployed a Go agent binary [1] across about 2000 Raspberry Pi class
machines (for litecoin mining) using overseer [2].

It allowed me to implement zero downtime upgrades, which was pretty cool.
Internally within the agent, I used the context pkg to implement a
'Manager' which provided cron-like functionality where each thing could be
shut down gracefully [3]. I used the same mechanism to shut the internal
webserver down gracefully as well [4].

[1]
[https://github.com/blockassets/bam_agent](https://github.com/blockassets/bam_agent)

[2]
[https://github.com/jpillora/overseer](https://github.com/jpillora/overseer)

[3]
[https://github.com/blockassets/bam_agent/blob/f7edce599b770b...](https://github.com/blockassets/bam_agent/blob/f7edce599b770b26189fd2c2fbb7cff81ac827a6/monitor/manager.go#L53)

[4]
[https://github.com/blockassets/bam_agent/blob/f7edce599b770b...](https://github.com/blockassets/bam_agent/blob/f7edce599b770b26189fd2c2fbb7cff81ac827a6/web_server.go#L50)

------
jkcxn
Kotlin is doing something like this with coroutine cancellation. Not quite the
same, because the cancellation requires cooperation, but I reckon you could add
a timeout and get the composability that the article talks about.

[https://kotlinlang.org/docs/reference/coroutines/cancellatio...](https://kotlinlang.org/docs/reference/coroutines/cancellation-and-timeouts.html)

------
protomyth
It's interesting that very few new languages actually take into account the
maintenance of programs coded in the language. I still think that something
like tuple spaces (Linda) that can be "frozen" would actually help in breaking
apart programs so shutdowns and, more likely, partial shutdowns are possible.

~~~
dnautics
Erlang-OTP

~~~
protomyth
Yeah, that's one language, and I'm really not sure they went far enough.

~~~
dnautics
OTP is still evolving. It's more like a standard library than a first-class
language or VM construct. What do you want? Perhaps some additional libraries
could help.

