
Readiness protocol problems with Unix daemons - vezzy-fnord
http://homepage.ntlworld.com./jonathan.deboynepollard/FGA/unix-daemon-readiness-protocol-problems.html
======
lambda
Yep, I've run into the problems with "expect fork"/"expect daemon" in Upstart
first hand. Forking is a terrible readiness protocol, and Upstart is terrible
at keeping track of a process that doesn't do exactly what it expects. In
fact, if you choose the wrong one, you wind up with Upstart not even noticing
that a process has gone away, and not being able to do anything to reset
Upstart's knowledge of the world without spawning programs in a loop until the
PID namespace wraps around and you get back to the original PID that it's
expecting to track, so that Upstart can finally notice it die and properly
clean up after it. This is seriously the only recommended workaround for a bug
that went unfixed for years:
[https://bugs.launchpad.net/upstart/+bug/406397](https://bugs.launchpad.net/upstart/+bug/406397)

I'm glad that Debian and Ubuntu have finally decided to go with an actually
maintained init system, one that properly tracks processes even after
arbitrarily many forks and supports a reasonable readiness protocol.

~~~
bonobo3000
I have also had a little mismatch with upstart for a particular application -
the app itself spawns new processes in response to user input, but upstart can
only track the main app not the children.

I am curious - precisely knowing when an app is running and when it's not seems
inherently dependent on the app and your definition of "running" for that app.
Tracking apps from the system level in terms of processes etc. can only go so
far. How do the init systems you mentioned tackle this?

~~~
Shish2k
> precisely knowing when an app is running and when it's not seems inherently
> dependent on the app [...] How do the init systems you mentioned tackle
> this?

In systemd's case, it provides an API so that the app can explicitly say "I am
now ready to receive requests", among other things.

[http://www.freedesktop.org/software/systemd/man/sd_notify.ht...](http://www.freedesktop.org/software/systemd/man/sd_notify.html)
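For a daemon that does not link against libsystemd, the readiness message can be sent by hand. What follows is a minimal sketch, assuming the documented convention that the service manager passes a datagram socket path in the `NOTIFY_SOCKET` environment variable (a leading `@` meaning a Linux abstract socket) and that the unit is configured to expect notifications. The helper name `sd_notify` is reused here for illustration only; this is not the library function itself.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process is
    not running under a manager that asked for notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a socket in the Linux abstract namespace.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

The daemon would call `sd_notify("READY=1")` once its listening sockets are open and it can actually serve requests, rather than relying on a fork to signal that.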

(I've also found the related watchdog functionality very useful -- once you
tell the init system that your daemon is ready to serve requests, then you can
also tell it every time a request is completed; then if you go for say 10
seconds without serving a request, init will assume that your service has hung
and kill/restart it. I've seen a small number of daemons do this for
themselves, but that number is nowhere near 100%, and even those who do
implement it don't do it as well as systemd does.)
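A rough sketch of that per-request keep-alive pattern, assuming the unit sets `WatchdogSec=` so that the manager exports the timeout as `WATCHDOG_USEC`. The `notify` argument stands in for whatever sends strings to the notification socket, and the names `watchdog_interval` and `serve` are made up for this example.

```python
import os


def watchdog_interval(environ=os.environ):
    """Return seconds between keep-alive pings, or None if no watchdog is set."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    # Ping at half the timeout so a single late ping does not cause a restart.
    return int(usec) / 1_000_000 / 2


def serve(requests, handle, notify):
    """Announce readiness, then ping the watchdog after each completed request."""
    notify("READY=1")
    for request in requests:
        handle(request)
        notify("WATCHDOG=1")
```

Pinging only after a completed request means an idle daemon would also be restarted, so a daemon with quiet periods would probably want a timer that fires every `watchdog_interval()` seconds as well.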

------
anonttttttt
The problem is actually solved, see
[http://welz.org.za/notes/on-starting-daemons.html](http://welz.org.za/notes/on-starting-daemons.html) - it is just
that people these days write flaky daemons which crash and want to have the
init system restart services blindly - rather leave that to the monitoring
tool which checks the service offered, so that we also notice if it is stuck
in an infinite loop. Or even better, make crashes something to be eliminated,
not papered over.

------
simoncion
I'm wondering what's wrong with the way OpenRC's start-stop-daemon works. It
either:

* forks a child to run a foreground task

or

* starts a self-daemonizing service

then waits for a user-configured period of time to make sure that the child or
daemon is still around before declaring success.
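The "start it, wait, check it is still alive" idea can be sketched in a few lines. This is only an illustration of the approach for the foreground-task case, not OpenRC's actual code; `start_and_wait` and its parameters are hypothetical names.

```python
import subprocess


def start_and_wait(argv, wait_seconds):
    """Start a foreground task and report success only if it survives the wait.

    Returns the running process on success, or None if the child exited
    (for any reason) before wait_seconds elapsed.
    """
    proc = subprocess.Popen(argv)
    try:
        proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return proc  # still running after the grace period: call it started
    return None  # child died during the grace period: startup failed
```

Note that this only proves the process did not die early. It says nothing about whether the process is ready to serve requests yet.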

OpenRC also provides programs to mark a service as started-but-inactive, and
-I think [0]- was-inactive-but-is-now-fully-started [1]. This lets you avoid
blocking system start with a daemon that might take a while to initialize, or
handle a daemon that might flap in and out of availability, or handle "Is it
actually started" logic for daemons for which the answer to that question is
pretty complex.

Daemons that _require_ a given service to be started will wait to start until
that service is no longer in the "inactive" state.

Edit: OpenRC _can_ be configured to start a cgroup for a given service, and
-presumably- kill all tasks in that cgroup on exit; mooting the problem of
incorrectly marking a service as "failed".

[0] The documentation for this state is at odds with the state name.
Regardless, the state is just a label for a particular service, so it seems
reasonable to believe the state name rather than the manpage.

[1] Among several other states. See the BUILTINS section of openrc-run(8) for
more info.

~~~
simoncion
Almost completely unrelated to TFA:

In the Apache Tomcat Systemd House of Horror story, we find this:

> And let us mention another horror in catalina.sh: the logging. The dæmon
> process has its standard output and standard error redirected to a log file.
> This is a known-bad way to go about logging. One cannot size-cap the log
> file; one cannot rotate the log file; one cannot shrink the log file. It
> grows, as one file, forever until the dæmon is shut down.

I was under the distinct impression that logrotate _can_ handle
cat-std*-to-file style logfiles: to rotate, copy the logfile, then truncate
the original.

Indeed, from logrotate(8):

    
    
      copytruncate
             Truncate the original log file to zero size in place after
             creating a copy, instead of moving the old log file and
             optionally creating a new one. It can be used when some
             program cannot be told to close its logfile and thus might
             continue writing (appending) to the previous log file forever.
             Note that there is a very small time slice between copying the
             file and truncating it, so some logging data might be lost.
             When this option is used, the create option will have no
             effect, as the old log file stays in place.
    

Why has the author never read the logrotate man page? :(

~~~
sveiss
Doing that is inherently racy (as acknowledged in your man page quote), and
you will lose some log data while the rotation happens if it's being written
to quickly enough. If you need all your log data, you can't rotate or shrink
it with this style of logging, hence 'known-bad'.
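The window is easy to reproduce deterministically. The sketch below mimics copy-then-truncate rotation with a hook that lets the "daemon" write one line between the two steps. The function names and the `between` hook are invented for the demonstration, and it assumes the daemon's output was opened in append mode (as with `>>`).

```python
import os
import shutil
import tempfile


def rotate_copytruncate(log_path, rotated_path, between=None):
    """Copy the log aside, then truncate the original in place."""
    shutil.copyfile(log_path, rotated_path)
    if between is not None:
        between()  # simulates the daemon writing during the race window
    os.truncate(log_path, 0)


def demo():
    """Return (rotated contents, current log contents) after one rotation."""
    workdir = tempfile.mkdtemp()
    log = os.path.join(workdir, "daemon.log")
    rotated = log + ".1"
    with open(log, "a") as writer:  # append mode, like a `>>` redirect
        def write_line(text):
            writer.write(text + "\n")
            writer.flush()

        write_line("one")
        rotate_copytruncate(log, rotated, between=lambda: write_line("two"))
        write_line("three")
    with open(rotated) as r, open(log) as c:
        return r.read(), c.read()
```

In this run the rotated copy holds only "one" and the live log holds only "three". The line written between the copy and the truncate appears in neither file.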

~~~
simoncion
It _is_ inherently racy. No one is disputing that. But, remember that the
author said:

> One cannot size-cap the log file; one cannot rotate the log file; one cannot
> shrink the log file. It grows, as one file, forever until the dæmon is shut
> down. [0]

Anyone with even the most basic understanding of how star-nix [1] systems
handle files understands this to be an untrue statement.

Take particular care to note that the author's statement was _unqualified_. If
he had mentioned _anywhere_ that you cannot do those things without some risk
of data loss, there would be _no_ controversy. Instead, he said -flat out-
that rotating and size-capping a cat-std-whatever-to-log-file style log file
is _impossible_.

[0]
[http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/sys...](http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/systemd-house-of-horror/tomcat.html)

[1] star-nix rather than $ASTERISKnix because I can't figure out how to escape
a literal asterisk. :/

