
Grace: Graceful restart for Go servers - tilt
https://github.com/facebookgo/grace
======
AYBABTME
I think graceful restarts are fine for servers that offer web pages or other
things where a client will not retry a request.

For all other cases[1], I think graceful restarts are a bad practice. When you
use that pattern, you come to rely on your server closing down cleanly. It
makes you consider abrupt terminations (pulling a power plug, kill -9, network
partitions) an edge case.

Because you separate 'shutdown cleanly' from 'shutdown abruptly', and because
in most deployments clean shutdowns are much more frequent than abrupt ones,
you end up not exercising the abrupt termination case very often. This means
that your system's reaction to abrupt terminations is likely buggy or
neglected. If it works fine because you took great care in handling that case,
the effect is that your codebase is now more complex. You have two code paths
to handle termination: a clean path and a dirty path.

If instead you shut down only one way (abruptly), those two cases become the
same. You have reduced the problem space and now your code shuts down only one
way. You can be confident that your system is resilient to abrupt failures,
since you exercise that shutdown path every single time you stop your server.
Some call this crash-only software.

A caveat with this is that clients must be able to handle abrupt terminations.
This usually means that clients must have the ability to retry their
operations (with exponential backoff), and that each operation they perform
has a timeout.
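
A minimal sketch of the client side described above: each attempt gets its own timeout, and the wait between attempts doubles. The names `retry` and `flakyCalls` are made up for illustration, the durations are arbitrary, and jitter is left out to keep it short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op up to attempts times. Every attempt runs under its own
// timeout (perTry), and the pause between attempts starts at base and doubles.
func retry(ctx context.Context, attempts int, base, perTry time.Duration, op func(context.Context) error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		tryCtx, cancel := context.WithTimeout(ctx, perTry)
		err = op(tryCtx)
		cancel()
		if err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the last attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// flakyCalls simulates a server that dies on the first `failures` calls and
// reports how many calls the client ended up making.
func flakyCalls(failures int) (int, error) {
	calls := 0
	err := retry(context.Background(), 5, time.Millisecond, time.Second, func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("connection reset")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := flakyCalls(2)
	fmt.Println("calls:", calls, "err:", err)
}
```

This only helps if the operation is safe to repeat; a non-idempotent call needs some form of deduplication on the server as well.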

    
    
        How To Know If You're Infected By The Graceful Shutdown Fever: 
        
        Assume that your servers are `kill -9`'d every 2s at 
        random, are you confident that your system will keep 
        working properly?
    

[1]: another valid case I found was in distributed systems that require a
certain part of their members to be up at any time. If the nodes coordinate
their restarts, they can ensure that a minimum number of nodes are kept
healthy.

[2]: I've spent many hours arguing this with very smart colleagues
who disagree, so this is a bit of an opinionated view (still I think I'm right
=])

~~~
nemothekid
I think in the case of web servers (that are most times stateless),
guaranteeing it will survive a "kill -9" just means having some sort of
supervisor service.

However, there are times you want a graceful restart because an abrupt restart
affects your client's UX - such as during a file upload, or where a client may
be running a long-running synchronous operation (like a credit card charge).
In both cases the system may be resilient against a -9 and retries, but it is
annoying if "the world stops" because of a change to prod.

~~~
AYBABTME
A supervisor service boils down to the same issue if there's a network
partition or if the machine just blows up, a disk fails, the kernel panics,
the power supply dies, etc. The thing is you can't guarantee that any process
will survive an entire request, so you might as well work as if you were
living in Hell and every request has a 0.5 probability of failing (by any
means).

~~~
nemothekid
Right, which are cases where a graceful restart will never help you. If a user
is uploading a 4K video file and the server dies because god said so, well
they can just try again.

If the server dies because the dev team decides to push to prod 4x in an hour
and the user has to _retry every time_, a graceful restart lets you do that
without pissing off anyone who was using that live connection.

~~~
AYBABTME
Agreed, that's what I meant in:

    
    
        I think graceful restarts are fine for servers that offer 
        web pages or other things where a client will not retry a
        request.
    

\+ when retrying a request is not practical.

------
zimbatm
A couple of my own also written in Go. The main difference is that control is
external which makes them language-agnostic:

* [https://github.com/zimbatm/socketmaster](https://github.com/zimbatm/socketmaster) \- very simple

* [https://github.com/pusher/crank](https://github.com/pusher/crank) \- adds a control socket, dynamic config and more fine-grained controlled restarts

Both are designed to sit on top of your process manager of choice. Unlike
self-forking process solutions they don't change PID, so they work great with
systemd and upstart.

------
tobz
Just commenting to say I used this library for a project at work to great
effect. I tried rcrowley/goagain, which had issues and felt like I was
shoehorning everything in. I tried another that I can't remember.

There are a few that are drop-in replacements for http.ListenAndServe, which is
great, until you want to do something custom, then it gets a lot more kludgy.
Grace is not kludgy.

And, most importantly, it definitely works. Was able to hammer the bejeezus
out of my daemon, restarting it willy nilly, without issue or fanfare.

------
vezzy-fnord
_Additionally it is implemented using the same API as systemd providing socket
activation compatibility to also provide lazy activation of the server._

Couldn't immediately tell from reading the source - but is this to say that it
uses the sd-* APIs directly, that it implements behavior compatible with
sd_listen_fds, or some other form of "socket activation" like superserver or
fd-holding?

~~~
daakus
Largely it just means that it uses the `LISTEN_FDS` environment variable and
passes the sockets down starting at fd=3 (more here:
[https://github.com/facebookgo/grace/blob/master/gracenet/net...](https://github.com/facebookgo/grace/blob/master/gracenet/net.go)).
Admittedly our systemd use case is no more and I haven't tested with it
recently so I want to clarify I should probably re-test and make sure it still
works.
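
A rough sketch of the convention described here, not grace's actual code: read the count from `LISTEN_FDS` and wrap descriptors 3, 4, ... as listeners. The helper names `countFromEnv` and `inheritedListeners` are made up for illustration. systemd's own protocol also sets `LISTEN_PID`, which this sketch ignores.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// Inherited descriptors start right after stdin, stdout and stderr.
const firstInheritedFD = 3

// countFromEnv parses the value of LISTEN_FDS. An empty value means the
// process was started normally and inherited nothing.
func countFromEnv(v string) (int, error) {
	if v == "" {
		return 0, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid LISTEN_FDS value %q", v)
	}
	return n, nil
}

// inheritedListeners wraps descriptors 3..3+n-1 as net.Listeners.
func inheritedListeners(n int) ([]net.Listener, error) {
	var ls []net.Listener
	for i := 0; i < n; i++ {
		fd := firstInheritedFD + i
		f := os.NewFile(uintptr(fd), "inherited-listener")
		l, err := net.FileListener(f)
		f.Close() // FileListener works on its own duplicate of the descriptor
		if err != nil {
			return nil, fmt.Errorf("fd %d is not a usable listener: %w", fd, err)
		}
		ls = append(ls, l)
	}
	return ls, nil
}

func main() {
	n, err := countFromEnv(os.Getenv("LISTEN_FDS"))
	if err != nil {
		fmt.Println(err)
		return
	}
	ls, err := inheritedListeners(n)
	if err != nil {
		fmt.Println(err)
		return
	}
	if len(ls) == 0 {
		// Nothing inherited: open a fresh listener instead.
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			fmt.Println(err)
			return
		}
		ls = append(ls, l)
	}
	for _, l := range ls {
		fmt.Println("listening on", l.Addr())
		l.Close()
	}
}
```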

------
shockzzz
This line is indicative of what I hate about Golang code:

 _a := newApp(servers)_

[https://github.com/facebookgo/grace/blob/master/gracehttp/ht...](https://github.com/facebookgo/grace/blob/master/gracehttp/http.go#L124)

EDIT: read my comment below where I elaborate

[https://news.ycombinator.com/item?id=9695880](https://news.ycombinator.com/item?id=9695880)

~~~
ejcx
Why? Is this really a golang thing?

If you are going to be creating these complex structures why not use a
factory function for it? If there were many creations of this type it could
severely increase the bulk of the code....

------
jzelinskie
I've been using
[https://github.com/tylerb/graceful](https://github.com/tylerb/graceful) for a
few years now. At the time Go (1.3 IIRC) added the ability to actually access
the internal connection state for connections created via net/http, it was the
only library to actually do the right thing.

~~~
tobz
FWIW, 'grace' provides graceful restarting (and graceful shutdown) while
'graceful' looks to simply provide graceful shutdown.

------
epoxyhockey
Also worth a look:
[https://github.com/fvbock/endless](https://github.com/fvbock/endless)

It allows for graceful restarts after replacing your binary. I'm not sure if
that is a feature of grace?

------
stevewepay
What is the technique to shut down and restart a server without losing
connections? Is it just making sure the connection isn't closed or reset
before shutting down the server?

~~~
GauntletWizard
You don't just need to preserve the socket; you also need to keep both the new
and old server in memory long enough for the existing connections to die.

Perhaps there's a kernel-level API that could be added to allow sockets to be
snatched or handed over to a new process. That is, honestly, probably the more
apropos solution. The fact that sockets act as a kind of lock is an
implementation detail.

~~~
AYBABTME
Then the newborn server would have to know what half-done work has been
done/exchanged with the client and recover from there. This seems like an
impossible challenge to me, or at least much more complex than sharing the
socket while the old server finishes serving its ongoing connections while
accepting no new ones.
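
To illustrate the "finish ongoing connections, accept no new ones" half, here is a small sketch using `http.Server.Shutdown` from current Go's net/http (it did not exist when this thread was written). The function `drainDemo` and the 100ms handler delay are made up for the demonstration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// drainDemo starts a server with a slow handler, begins one request, and
// calls Shutdown while that request is still in flight. It returns the body
// the client received.
func drainDemo() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	started := make(chan struct{}, 1)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case started <- struct{}{}: // tell the caller a request is in flight
		default:
		}
		time.Sleep(100 * time.Millisecond) // simulate slow work
		io.WriteString(w, "done")
	})}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called

	type result struct {
		body string
		err  error
	}
	resc := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err != nil {
			resc <- result{"", err}
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		resc <- result{string(b), err}
	}()

	<-started
	// Shutdown closes the listener, then waits for active requests to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return "", err
	}
	res := <-resc
	return res.body, res.err
}

func main() {
	body, err := drainDemo()
	fmt.Println("body:", body, "err:", err)
}
```

The timeout on the context bounds how long the old process is willing to wait; past it, remaining connections are cut off just as in an abrupt stop.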

~~~
GauntletWizard
I think you've misunderstood me; you don't need/want to share any connections
that the old process has, only the listen socket. You 'steal' the socket, and
then any 'accept' calls on the previous owner's FD behave as if there were no
calls, while new connections show up on the thief's socket FD.

~~~
AYBABTME
I think that's how it's currently done, but the procedure is issued by the
dying server. But yes, maybe that'd be cleaner if it was a proper API call
initiated by the newborn process.
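
For reference, a sketch of a handoff issued by the old process, using only the standard library: duplicate the listening socket with `(*net.TCPListener).File` and pass it through `exec.Cmd.ExtraFiles`, where entry i shows up as descriptor 3+i in the child. The helper names `handoffCommand` and `preparedFDs` are hypothetical, and the command is only prepared here, never started. `File` is a Unix-side facility, so this is not expected to work on Windows.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// handoffCommand prepares (but does not start) a child process that would
// inherit ln as file descriptor 3 and learn about it through LISTEN_FDS.
func handoffCommand(ln *net.TCPListener, binary string) (*exec.Cmd, error) {
	f, err := ln.File() // a duplicate of the listening socket
	if err != nil {
		return nil, err
	}
	cmd := exec.Command(binary)
	cmd.ExtraFiles = []*os.File{f} // entry i becomes fd 3+i in the child
	cmd.Env = append(os.Environ(), "LISTEN_FDS=1")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd, nil
}

// preparedFDs builds a handoff command for a throwaway listener and reports
// how many descriptors the child would inherit.
func preparedFDs() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	cmd, err := handoffCommand(l.(*net.TCPListener), os.Args[0])
	if err != nil {
		return 0, err
	}
	defer cmd.ExtraFiles[0].Close()
	return len(cmd.ExtraFiles), nil
}

func main() {
	n, err := preparedFDs()
	fmt.Println("descriptors to hand over:", n, "err:", err)
}
```

In a real restart the old process would call `cmd.Start()`, stop accepting on its own copy of the socket, and exit once its remaining connections have finished.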

------
mholt
I see some syscall signal stuff in there - how well does this run on Windows?

~~~
twotwotwo
I'd be happy to be corrected, but my guess is it wouldn't compile: running
`GOOS=windows godoc syscall` I don't see SIGUSR2 or Kill defined. Maybe an
enterprising person could port it, if the semantics are similar enough between
Windows and Unices for those APIs that exist for both (os.StartProcess,
CloseOnExec, etc.).

