
Kill Me Softly – Kill processes in a reliable manner - alanfranz
https://github.com/alanfranz/killmesoftly
======
rdtsc
If you are the one designing your applications, don't rely on SIGTERM and
clean shutdowns. Design them so SIGKILL is the default way to stop them, in
such a way that no data is corrupted and persistent state stays consistent.
Then you don't have to rely on some rarely executed "recovery" code or, even
worse, take chances on what might happen if SIGTERM doesn't work.

Handle SIGTERM only if you are stuck with a black-box application (a database
that will break if you pull the power plug on it, and so on).

~~~
khrf
If you think of it from a developer's view, you'll come to the conclusion that
you can't create a program that doesn't need to clean things up before
termination, unless it's a very useless program that doesn't process any
incoming data. That's why there is SIGTERM in operating systems.
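A minimal sketch of the kind of cleanup described here, in the Bash used elsewhere in this thread: a worker traps SIGTERM and removes its scratch file before exiting. `run_worker` and its scratch-file argument are hypothetical names for illustration only. Note that SIGKILL cannot be caught, so no handler like this ever runs on `kill -9`, which is why rdtsc argues for designing around SIGKILL instead.

```bash
# Hypothetical worker: appends to a scratch file until told to stop.
# Meant to run as its own process, e.g.  ( run_worker /tmp/scratch ) &
run_worker() {
  scratch=$1
  # On SIGTERM: remove the partial state, then exit with 143 (128 + 15),
  # the status a shell reports for a process killed by SIGTERM.
  trap 'rm -f "$scratch"; exit 143' TERM
  # SIGKILL cannot be trapped, so nothing here runs on kill -9.
  while :; do
    date >> "$scratch"   # stand-in for real work
    sleep 1              # the shell runs the trap once this sleep returns
  done
}
```

The shell only runs a trap after the current foreground command (`sleep 1` here) finishes, so shutdown can lag by up to a second.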

~~~
lomnakkus
That's nonsense. All ACID databases handle the requirement that a SIGKILL
doesn't corrupt data. There may be a non-trivial recovery cost if you
experience an unexpected shutdown, but that's about it for any application
that's serious about data integrity.

EDIT: In fact they handle a much stronger requirement that a power failure
doesn't corrupt data, but the point stands.

~~~
vinceguidry
ACID databases can do this because they have the powerful hand of math giving
them superpowers. Most developers don't.

~~~
pjscott
SQLite is a library that you can just link to. Use that for your
state-handling, and you need not fear a thousand SIGKILLs. This is a very
popular approach.

Or, if you're on a server, you can use another database. Or use atomic file
writes, if you have fancy needs. If you have _really_ exotic needs then sure,
do your own thing -- but most of the time, killability is easy and beneficial.
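For the atomic-file-write route, a common pattern is to write the new contents to a temporary file in the same directory and `mv` it over the old one. Within one filesystem `mv` uses rename(2), which replaces the target in a single step, so a SIGKILL at any point leaves either the old file or the complete new one. `atomic_write` is a hypothetical helper name. This sketch only protects against the process being killed; surviving a power failure would additionally need an fsync, which it omits. Also, `mktemp` creates the file with mode 0600, so the replaced file ends up with those permissions.

```bash
# atomic_write TARGET: replace TARGET with whatever arrives on stdin, such
# that a kill at any moment leaves either the old or the new contents.
atomic_write() {
  local target=$1 tmp
  # Temp file next to the target: rename is only atomic within one filesystem.
  tmp=$(mktemp "$target.XXXXXX") || return 1
  if ! cat > "$tmp"; then
    rm -f "$tmp"   # don't leave a half-written temp file behind
    return 1
  fi
  # mv performs a rename(2) here, swapping the new file in as one step.
  # A kill before this line leaves the old file untouched (plus, at worst,
  # a stray temp file); a kill after it leaves the complete new file.
  mv -f "$tmp" "$target"
}
```

Usage: `printf 'counter=42\n' | atomic_write state.txt`.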

~~~
vinceguidry
If you do that, then you have to model program state not as objects but as
database rows. Have fun using an ORM on all your projects, for everything.

------
tspiteri
I know it's normally a good idea to avoid code duplication, but in this case I
would have copied the function inside kms_functions into both kmsn and kmsp,
since this removes the dependency on readlink and also makes it possible to
move the executables around without care.

~~~
wdewind
One of the best things I learned from a very senior engineer was "There are
far worse things in life than duplicating code." Dependencies are frequently
one of those things.

~~~
khrf
Oh, thank you, these are golden words. I was amused recently when my
colleagues wanted to avoid code duplication of one small library and ended up
creating a new repository, learning awful Python packaging and releasing a new
package for every small change. Much more pain gained, IMHO.

------
peterwwillis
Alternative modes of operation that would be useful:

* Send SIGUSR1 if TERM doesn't work

* Send SIGCONT if TERM doesn't work

* Attempt to reap the process

* Attempt to ptrace the process

* Give a hint as to why it can't be killed (like blocked I/O, a zombie, etc)

* Kill the process group (dangerous!)
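Two of these modes can be sketched with plain `ps` and `kill`. `explain_state` gives a hint from the first letter of the STAT column (Z zombie, D uninterruptible sleep such as blocked I/O, T stopped), and `kill_group` signals the whole process group. Both names are hypothetical, and the state letters are the Linux procps ones; other systems' `ps` may use different codes.

```bash
# explain_state PID: print a hint about why PID may not be going away.
# Looks only at the first letter of ps's STAT column (procps meanings).
explain_state() {
  local stat
  stat=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
  if [ -z "$stat" ]; then
    echo "no such process"
    return 1
  fi
  case $stat in
    Z*) echo "zombie: already exited, waiting for its parent to reap it" ;;
    D*) echo "uninterruptible sleep (often blocked I/O): even SIGKILL usually waits until it wakes" ;;
    T*) echo "stopped: a SIGTERM handler cannot run until it gets SIGCONT" ;;
    *)  echo "state $stat: running or sleeping normally" ;;
  esac
}

# kill_group PID [SIGNAL]: send SIGNAL (default TERM) to every process in
# PID's process group. Dangerous: the group may include unrelated siblings,
# or even the shell running this function.
kill_group() {
  local pgid
  pgid=$(ps -o pgid= -p "$1" 2>/dev/null | tr -d ' ')
  [ -n "$pgid" ] || return 1
  kill -s "${2:-TERM}" -- "-$pgid"   # a negative PID addresses the group
}
```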

~~~
ad_hominem
SIGUSR1 is often caught by applications for their own IPC. SIGQUIT would
probably be a safer one to send, but that can also be caught and repurposed.
Sending other signal types would definitely have to be opt-in to prevent
unintended side effects.

------
alanfranzoni
Often I need to kill processes, and wait until they're dead. I want to send a
SIGTERM first, then a SIGKILL if it fails. And I want this process to be
automated. That's it.
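The behaviour described here (SIGTERM, wait, then SIGKILL, blocking until the process is gone) can be sketched in a few shell functions. This is not the code from the killmesoftly repository: `term_then_kill`, `wait_gone`, `is_alive`, the 10-second default timeout and the return codes are all assumptions made for the example. A zombie (ps state Z) has already exited, so it counts as dead here; without that check, a process whose parent never reaps it would look alive forever.

```bash
# is_alive PID: succeed if PID exists and has not already exited.
# kill -0 also fails for processes we may not signal; this sketch
# treats those as gone, since we could not kill them anyway.
is_alive() {
  kill -0 "$1" 2>/dev/null || return 1
  case $(ps -o stat= -p "$1" 2>/dev/null) in
    Z*) return 1 ;;   # zombie: exited, only waiting to be reaped
  esac
  return 0
}

# wait_gone PID TIMEOUT: poll once a second; succeed as soon as PID is
# gone, fail if it is still alive after TIMEOUT seconds.
wait_gone() {
  local waited=0
  while is_alive "$1"; do
    [ "$waited" -ge "$2" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# term_then_kill PID [TIMEOUT]: SIGTERM, wait up to TIMEOUT seconds
# (default 10), then escalate to SIGKILL and wait again.
# Returns 0 if SIGTERM was enough, 1 if the signal could not be sent,
# 2 if SIGKILL was needed, 3 if the process survived even SIGKILL
# (e.g. stuck in uninterruptible I/O).
term_then_kill() {
  local pid=$1 timeout=${2:-10}
  kill -TERM "$pid" 2>/dev/null || return 1
  wait_gone "$pid" "$timeout" && return 0
  kill -KILL "$pid" 2>/dev/null
  wait_gone "$pid" "$timeout" && return 2
  return 3
}
```

Usage: `term_then_kill 12345 5; echo $?`. Distinct return codes let a caller log when SIGKILL was actually needed, which addresses some of the concern that automatic escalation hides problems.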

------
pepve
I really want to like this. But SIGKILL is such a special instrument in the
administrator's toolbox that I don't want it automatically sent after a short
timeout. Moreover I don't think anyone should want that. This tool will hide
or allow you to ignore important issues with your system.

I do unequivocally like the blocking behaviour, though; I might implement that
for myself.

Edit: so just something like this:

    
    
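      killwait() {
        local pid=$1
        kill "$pid" || return 1    # SIGTERM by default; stop if it can't be sent
        # Poll once a second until ps no longer finds the pid
        while ps -p "$pid" >/dev/null 2>&1; do
          sleep 1
        done
      }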

~~~
dimman
I'd recommend adding some kind of max-retries mechanism, because otherwise, if
the process ignores SIGTERM, for instance, you can wait forever.
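The suggestion amounts to a small change to pepve's `killwait`: count the polls and give up after a limit. The `max_tries` parameter, its default of 30 and the return value of 1 are assumptions. Unlike the SIGTERM-then-SIGKILL approach of the tool itself, this version never sends SIGKILL; it only stops waiting. One caveat that applies to the original too: `ps -p` also lists zombies, so a process whose parent never reaps it will look alive until the limit is hit.

```bash
# killwait PID [MAX_TRIES]: send SIGTERM, then poll once a second until the
# process is gone. Returns 1 if it is still there after MAX_TRIES polls.
killwait() {
  local pid=$1 max_tries=${2:-30} tries=0
  kill "$pid" || return 1
  while ps -p "$pid" >/dev/null 2>&1; do
    if [ "$tries" -ge "$max_tries" ]; then
      echo "killwait: $pid still running after $max_tries tries" >&2
      return 1
    fi
    sleep 1
    tries=$((tries + 1))
  done
  return 0
}
```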

~~~
pepve
I personally prefer utilities like this to do one simple task and perform it
well. Adding a timeout would just complicate things unnecessarily, compared to
the alternative of pressing Ctrl-C when you feel you've waited long enough.

------
vacri
So instead of two kill messages, we now have two scripts to run, dependent on
name or pid?

Just put everything in one script - these scripts are shorter than an init
boilerplate section - and use an option to switch. Do process names by
default, then perhaps use a -p arg to switch the script into 'pid' mode.
Bash's 'getopts' is pretty easy to implement.

As it stands, you currently have to waste mental power to figure out whether
you want the 'n' or 'p' version anyway, so may as well just make it an arg.
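The single-script suggestion could look roughly like this with `getopts`: process names by default, `-p` for a PID. The `kms` name is hypothetical, looking names up with `pgrep -x` is an assumption about how one might do it, and the function only resolves and prints PIDs so that the option handling stays in focus; the actual signalling would follow.

```bash
# kms [-p] NAME|PID: resolve the target to PIDs, by exact process name
# unless -p says the argument is already a PID. Prints one PID per line.
kms() {
  local OPTIND=1 opt by_pid=0 pids
  while getopts p opt; do
    case $opt in
      p) by_pid=1 ;;
      *) echo "usage: kms [-p] NAME|PID" >&2; return 2 ;;
    esac
  done
  shift $((OPTIND - 1))
  if [ $# -ne 1 ]; then
    echo "usage: kms [-p] NAME|PID" >&2
    return 2
  fi

  if [ "$by_pid" -eq 1 ]; then
    pids=$1
  else
    # pgrep -x matches the whole process name, not a substring
    pids=$(pgrep -x "$1") || { echo "kms: no process named $1" >&2; return 1; }
  fi
  # A real script would now run its SIGTERM/wait/SIGKILL logic on each PID.
  printf '%s\n' "$pids"
}
```

`local OPTIND=1` resets `getopts` state on each call, so the function behaves the same when invoked repeatedly from one shell.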

~~~
ukandy
Is the functionality even needed? It doesn't take much to type:

ksm `pidof cupsd`

------
benatkin
A lot of these bash scripts could be replaced with Go programs. It doesn't
take much code for a bash script to get ugly.

~~~
ukandy
Sure they could, but then you would need different compiled versions for
different architectures, and this is kind of OTT for the task it accomplishes
as it is!

~~~
benatkin
This would be an incentive to package the code, which would mean being able to
check it against SHAs. There are a lot of Go programs on Homebrew now. I'd
like to see a trend of smaller ones getting added there.

------
faragon
In my opinion, current process handling is braindead, both in POSIX and
Windows. That could be "fixed" with an additional identifier that stays unique
for the whole OS uptime, so that other processes can track instances too, not
just the ones doing the fork.

~~~
dap
As you point out, using the pid alone is not reliable. FWIW, Solaris and
illumos systems have process contracts that allow you to reliably track
processes[0]. If the child forks, the new child is still in the same contract
(unless configured otherwise). And both the child and the parent (the watcher)
can restart and then return to the normal state (parent watching child
processes).

Solving this problem requires kernel support because the kernel is generally
the only fault domain in the system that cannot crash while the system is
still running.

Edit: forgot the link the first time. [0]
[http://illumos.org/man/4/process](http://illumos.org/man/4/process)

~~~
feld
Linux has this capability in recent kernel versions (3.x?). I don't know what
the implementation is called (prctl?), but DragonflyBSD added this in
procctl(2) in release 4.0. FreeBSD already had a procctl(2) and added the same
functionality to be compatible.

[http://lists.dragonflybsd.org/pipermail/commits/2014-Novembe...](http://lists.dragonflybsd.org/pipermail/commits/2014-November/343844.html)

So now FreeBSD and DragonflyBSD have plans to utilize this exactly like
Illumos and build lightweight process supervision without the need for
something as disruptive as systemd.

------
alanfranzoni
Thanks to everybody for the feedback. Standalone executable scripts that are
compatible with both OS X and Linux are now available.

------
_cipher_
> Those should be available on almost any Linux system, even on minimal
> installs:

> [..]

> Bash

Maybe I'm nitpicking here, but why _should_ Bash be available?

------
dankilman
Those should be available on almost any Linux system, even on minimal
installs:

Linux

~~~
cbd1984
Tautologies aren't false.

