Killing a process and all of its descendants (morningcoffee.io)
280 points by shiroyasha23 72 days ago | 75 comments

This is why cgroups were invented. They solve this problem. Start a process in its own cgroup, and you can later confidently kill the process and all of its descendants. Container "technologies" use cgroups extensively, as does systemd service management.

[CGroups original developer]

Yes, for tracking processes and reliable resource control. Prior to cgroups, in Google's Borg cluster management daemon the best strategy I was able to come up with for reliably and efficiently tracking all the processes in a job was:

- assign each job a supplementary group id from a range reserved for Borg, and tag any processes that were forked into that job with that group id

- use a kernel netlink connector socket to follow PROC_EVENT_FORK events to find new processes/threads, and assign them to a job based on the parent process; if the parent process wasn't found for some reason then query the process' groups in /proc to find the Borg-added group id to determine which job it's a part of.

- if the state gets out of sync (due to a netlink queue overflow, or a daemon restart) do a full scan of /proc (generally avoided since the overhead for continually scanning /proc got really high on a busy machine).

That way we always have the full list of pids for a given group. To kill a job, nuke all the known processes and mark the group id as invalid, so any racy forks will cause the new processes to show up with a stale Borg group id, which will cause them to be killed immediately.
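For illustration, the "full scan of /proc" fallback could look roughly like the sketch below. This is not Borg's code. It only assumes that the tag is a supplementary group id visible on the `Groups:` line of /proc/<pid>/status; the function names and the gid 60001 used in the usage note are made up for the example.

```python
import os

def supplementary_groups(status_text):
    """Return the group ids listed on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(g) for g in line.split()[1:]}
    return set()

def pids_tagged_with(gid, proc_root="/proc"):
    """Scan every numeric entry under proc_root and collect pids tagged with gid."""
    tagged = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                groups = supplementary_groups(f.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        if gid in groups:
            tagged.append(int(entry))
    return sorted(tagged)
```

Calling `pids_tagged_with(60001)` would list every process carrying the hypothetical job gid 60001. The scan is inherently racy, which is why the comment above pairs it with invalidating the group id.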

This approach would have had trouble keeping up with a really energetic fork bomb, but fortunately Borg didn't generally have to deal with actively malicious jobs, just greedy/misconfigured ones.

Once we'd developed cgroups this got a lot simpler.

cgroups was extremely useful for a system I built that ran on Borg, Exacycle, which needed to reliably "kill all child processes, recursively, below this process". I remember seeing the old /proc scanner and the new cgroups approach, being able to get the list of pids below a process, and realizing, belatedly, that UNIX had never really made this easy.

Was giving each job its own UID not an option? Users are the original privilege separation, after all, and kill -1 respects that.

No, because multiple jobs being run by the same end-user could share data files on the machine, in which case they needed to share the same uid. (Or alternatively we could have used the extra-gid trick to give shared group access to files, but that would have involved more on-disk state and hence be harder to change, versus the job tracking which was more ephemeral.) It's been a while now, but I have a hazy memory that in the case where a job was the only one with that uid running on a particular machine, we could make use of that and avoid needing to check the extra groups.

That's definitely the correct way to do this today. But even then `kill -9 $(< /sys/fs/cgroup/systemd/tasks)` is not enough if your goal is to reliably kill all processes because that's not atomic. Instead you'll have to freeze all processes, send SIGKILL and then unfreeze.
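A rough sketch of the freeze, SIGKILL, unfreeze sequence, assuming a cgroup v1 freezer hierarchy where the directory contains `freezer.state` and `cgroup.procs` (cgroup v2 uses different file names). The function name and the injectable `send_signal` parameter are inventions for this example, and running it against a real cgroup needs write access to it.

```python
import os
import signal
import time

def freeze_kill_thaw(cgroup_dir, send_signal=os.kill, timeout=5.0):
    """Freeze a v1 freezer cgroup, SIGKILL every member, then thaw it again."""
    state_path = os.path.join(cgroup_dir, "freezer.state")
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")

    def write_state(value):
        with open(state_path, "w") as f:
            f.write(value)

    pids = []
    write_state("FROZEN")
    try:
        # The kernel may report FREEZING for a while, so poll until it settles.
        deadline = time.monotonic() + timeout
        while True:
            with open(state_path) as f:
                if f.read().strip() == "FROZEN":
                    break
            if time.monotonic() > deadline:
                raise TimeoutError("cgroup did not reach FROZEN")
            time.sleep(0.01)
        # Nothing can fork while frozen, so this list is stable.
        with open(procs_path) as f:
            pids = [int(line) for line in f if line.strip()]
        for pid in pids:
            try:
                send_signal(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # already gone
    finally:
        # Always thaw, otherwise the queued SIGKILLs are never acted on.
        write_state("THAWED")
    return pids
```

The signals are queued while the tasks are frozen and take effect once the cgroup is thawed, so no member gets a chance to fork in between.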

Can freezing be done atomically?

Not sure to be honest. From the documentation: "Writing "FROZEN" to the state file will freeze all tasks in the cgroup". Even if not, it should still be sufficient once all tasks are frozen: If you then send SIGKILL to all processes in the group, no fork bomb or similar process kerfuffle will be able to avoid being killed once they get unfrozen.

Unfortunately in cgroupv1, the freezer cgroup could put the processes into an unkillable state while frozen. This is fixed in cgroupv2 (which very recently got freezer support) but distros have yet to switch wholesale to cgroupv2 due to lack of adoption outside systemd.

Is that really an issue though? Unkillable is fine so long as it immediately handles the kill -9 as soon as it's unfrozen without running any additional syscalls.

There are cases where signals might be dropped (though I'm not sure if SIGKILL has this problem off the top of my head -- some signals have special treatment and SIGKILL is probably one of them). And to be fair this is a more generic "signals have fundamental problems" issue than it is specifically tied to cgroups.

It depends what you need. If you don't care that the kill operation might not complete without unfreezing the cgroup, then you're right that it's not an issue. But if the signal was lost (assuming this can happen with SIGKILL), unfreezing means that the number of processes might not decrease over time and you'll have to retry several times. Yeah, it'd be hard to hit this race more than ~5 times in a row but it still makes userspace programs more complicated than they need to be.

Yes, the freezer cgroup can be used to "atomically" put an entire cgroup tree into a frozen mode. However, unless you're using cgroupv2, the process might be stopped in an unkillable state (defeating the purpose). So this is not an ideal solution.

Really the best way to do it is to put it inside a PID namespace and then kill the pid1. Unfortunately, most processes don't act correctly as a pid1 (the default signal mask is different for pid1, causing default "safe exit" signal behaviour to break for most programs). You could run a separate pid1 that just forwards signals (this is what Docker does with "docker run --init" and similar runtimes do the same thing). But now the solution has gotten significantly more complicated than "use PID namespaces".

Arguably the most trivial and workable solution is process groups and using a negative pid argument to kill(2), but that requires the processes to be compliant and not also require their own process groups. (I also haven't yet read TFA, it might say that this approach is also broken for reasons I'm not familiar with.)
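A minimal sketch of the process-group approach in Python. `os.killpg(pgid, sig)` is the same operation as kill(2) with a negative pid. The helper names are made up, and as noted above it only works if the descendants stay in the group they were started in.

```python
import os
import signal
import subprocess

def spawn_in_new_group(argv, **popen_kwargs):
    """Start argv as the leader of a new session, and therefore of a new process group."""
    # start_new_session=True makes the child call setsid(), so its pgid equals its pid.
    return subprocess.Popen(argv, start_new_session=True, **popen_kwargs)

def kill_group(proc, sig=signal.SIGKILL):
    """Signal every process still in proc's group, then reap the leader."""
    try:
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass  # the group is already empty
    return proc.wait()
```

Anything that calls setsid() or setpgid() on its own escapes the group, which is the "compliant processes" caveat from the comment.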

Wait, what does cgroupv2 do with unkillable processes?

Maybe I'm misreading - is it that cgroupv1's freezer puts processes in an unkillable state? Or does cgroupv2's freezer have a way of rescuing processes already in uninterruptible sleep?

If you freeze a cgroupv1 freezer, the processes may be frozen at a point within their in-kernel execution such that they are in an uninterruptible sleep. The reason is that the cgroupv1 freezer basically tried to freeze the process immediately without regard to its in-kernel state.

Fixing this, and making the freezer cgroup more like SIGSTOP on steroids (where the processes were put into a killable state upon being frozen, if possible) was the main reason why cgroupv2 support for freezer was delayed for so many years.

So the answer is "both, kinda". I'm not sure how it'd deal with legit uninterruptible sleep (dead-or-live locked) processes but I'll look into it.

Semantically it shouldn't be necessary I think?

I think the freezer cgroup does this, but I don't think systemd uses it.

Exactly. Solaris implemented this as "contracts" to support its service management framework (SMF, which is similar to systemd, but came out first and is superior in many ways).

systemd uses cgroups, correct? Just wondering what the options are for learning more about this. Would it be enough, assuming you'd only be working with systemd operating systems, to learn the systemd concepts of slices etc.?

Slices generally map 1:1 with cgroups. Try running systemd-cgtop and you can see the resource usage of each of the cgroups.

systemd uses cgroups, yes.

The GNU coreutils timeout command encapsulates a lot of these issues. It's surprisingly difficult to handle all the edge cases and races.


Thank you, timeout is new to me and helpful. And MaiZure! Ye gods. That website is an absolute gold mine for improving techniques around reading and understanding source code.

Cool. Anyone know how they make diagrams like those?

I was curious as well. According to the author, "All the diagrams were hand-crafted in PowerPoint."


You can do those flow diagrams easily with Mermaid.

Sometimes I wonder if I'm on a list somewhere for frequent Google searches such as this one - "How to kill a parent with all the children"!

Personally I use the `killall` command and it works as expected, at least on Debian.

For instance, when I want to stop conky, all I do is run `killall conky` and it kills all of its processes at once.

Another, longer way to do the same thing is to run `kill -9 $(pidof conky)`, which kills all returned processes.

Just be careful around Solaris!

Bazel-watcher tries to accomplish this on Linux by using process group IDs. It works, if imperfectly sometimes. I ported this to Windows[1] using Job Objects. As usual, the Windows API was hell, and I made use of undocumented syscalls in order to make it work (though that part is partly Go’s fault: when you start a process, it immediately drops the thread handle on the floor. If you start a process suspended, that means it’s impossible to resume using documented APIs.)

Thankfully Raymond Chen wrote an article[2] about Job Objects which helped me figure out the last bits. I genuinely am not sure I could have gotten it right without that article. There’s so many subtle ways to fail!

[1] https://github.com/bazelbuild/bazel-watcher/pull/144/files

[2] https://devblogs.microsoft.com/oldnewthing/20130405-00/?p=47...

I'm confused by this post, but I'm an ex-dev on the NT kernel.

ResumeThread is a well documented API. As is TerminateJobObject. The one thing I find a bit baroque is the recently added way to make sure a process is in a job on creation, which I believe is part of the ProcThreadAttributes mechanism.

Yes, ResumeThread is. NtResumeProcess is what I had to use, because Go immediately closes the thread handle.

I’m sure it’s safe to rely on NtResumeProcess. I have used it since XP without issue. But I definitely wish there was a better way to go back from a process to a thread. The best I could find is using Toolhelp32 to iterate all the threads on the system, which I believe is just wrapping NtQuerySystemInformation. Would’ve worked but definitely wasn’t fast.

On Linux you can also use prctl with PR_SET_PDEATHSIG to set a signal that the calling process will receive when its parent dies. This is a syscall each child needs to make from within the program. See 'man 2 prctl'.
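A hedged sketch of making that prctl call from Python through ctypes. The setting applies to the process that makes the call, so here it runs in the child between fork and exec via `preexec_fn`. The constants come from <linux/prctl.h>; the helper names are made up for the example.

```python
import ctypes
import os
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # values from <linux/prctl.h>
PR_GET_PDEATHSIG = 2

_libc = ctypes.CDLL(None, use_errno=True)

def set_pdeathsig(sig=signal.SIGKILL):
    """Ask the kernel to send sig to the calling process when its parent dies."""
    zero = ctypes.c_ulong(0)
    if _libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def get_pdeathsig():
    """Return the parent-death signal currently set for the calling process (0 if none)."""
    value = ctypes.c_int(0)
    zero = ctypes.c_ulong(0)
    if _libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(value), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return value.value

def spawn_tied_to_parent(argv):
    """Start argv so that it gets SIGKILL if this process dies first."""
    # preexec_fn runs in the child after fork() and before exec().
    return subprocess.Popen(argv, preexec_fn=set_pdeathsig)
```

Per the man page the setting is cleared in the children of fork(), so it does not propagate to grandchildren on its own; each descendant has to opt in.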

One low hanging fruit I've thrown into my ssh-ing aliases/functions is just throwing 'timeout' on the front. I usually exit cleanly and/or don't spin up zombie-prone processes, but sometimes I do dumb things, so to exit after a day...

  timeout 86400 ssh me@example.com

I was reading the man page for timeout, and it looks like you can throw some kill options, but I never needed them, so I never looked for them.
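For reference, the kill options are `-s`/`--signal` and `-k`/`--kill-after`, e.g. `timeout -k 10 86400 ssh me@example.com`. The sketch below is a rough Python analogue of that behaviour (SIGTERM at the deadline, SIGKILL after a grace period). It is not how coreutils implements it, and the function name is made up.

```python
import signal
import subprocess

def run_with_timeout(argv, limit, kill_after):
    """Run argv; send SIGTERM after limit seconds and SIGKILL kill_after seconds later."""
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=limit)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)  # polite request first
    try:
        return proc.wait(timeout=kill_after)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()
```

Note that this only signals the direct child, so anything the child has forked is left alone.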

Really nice writing style on this article.

It covers everything that needs to be covered, but it also gets right to the point. Yet without being overly terse or dry.

And it explains everything clearly. So often the writer is good at understanding an idea but not conveying it. This lays it out where it's easy to pick up.

Can somebody re-explain the last part about nohup propagation to descendant processes?

I don’t quite get what the implications are. The author also doesn’t seem to talk about their solution for it in the context.

>but on BSD and its variants like MacOS, the session ID isn’t present or always zero

Don't know about macOS, but the session ID/pointer is present on FreeBSD and OpenBSD.

Related post with interesting comments:

UNIX one-liner to kill a hanging Firefox process:


Isn't this exactly why you can freeze cgroups?

Even waiting for a process to exit is surprisingly hard (impossible?) in Linux, unless it's your child.

That is getting considerably easier with the addition of pidfds, though: https://lwn.net/SubscriberLink/794707/93ffb35438fd3710/

> Beyond the ability to unambiguously specify which process should be waited for, this change will eventually enable another interesting feature: it will make it possible to wait for a process that is not a child — something that waitid() cannot do now. Since a pidfd is a file descriptor, it can be passed to another process via an SCM_RIGHTS datagram in the usual manner. The recipient of a pidfd will, once this functionality is completed, be able to use it in most of the ways that the parent can to operate on (or wait for) the associated process.

So, to wait for a process that is not your child, do you have to get the relevant pidfd from its parent? In which case, this doesn't help all that much. Or is there some other way to get pidfds for arbitrary processes?

Perhaps pidfd_open(pid, ...)? https://lwn.net/Articles/789023/

I find it bizarre they called it "pidfd_" rather than just "process_" or "proc_"... almost seems like they deliberately avoided the obvious?

It's because the object you get is a file descriptor.

In fact it's exactly equivalent to getting a file descriptor for /proc/$pid -- Christian (the person who developed the patchsets) quite cleverly solidified a trick that some folks knew about for several years (that you could use /proc/$pid as a race-free way of checking if a process has died if you grabbed a handle while it was still alive). Before pidfd_send_signal(2) there wasn't a way to use that "interface" nicely. But now it's a first-class citizen (and Christian had to fight a lot of battles to get this in over several releases).

It's really cool work and I have high hopes it will be used far and wide because it solves so many individual problems in one fell swoop.

> It's because the object you get is a file descriptor.

It is, but you're opening the object, not the descriptor (which doesn't even exist yet). When you open a kernel object, you always get back a (new) file descriptor representing that object. It's against previous naming conventions too. It's not like mkfifo() was called mkfifofd() or socket() was called socketfd() or perf_event_open() was called perf_event_fd_open()...

I'm really amused that you're so excited about it and find it so cool. I mean, I'm not suggesting it isn't awesome that they added it, but to me, it's such a glaring obvious deficiency that I'm completely flabbergasted that a lot of battles had to be fought to include it. It should've been added and embraced with open arms over two decades ago...

> It is, but you're opening the object, not the descriptor (which doesn't even exist yet).

pidfd_open(2) is still a proposed interface, and isn't in mainline yet (and if I'm remembering the ML discussions correctly, it might not even go in any time soon). The currently-available interfaces are pidfd_send_signal(2), CLONE_PIDFD, and the new poll(2) support for pidfds. In that context, calling it "pidfd_" makes more sense (since you're operating on existing handles) and thus pidfd_open(2) also makes sense because otherwise the naming would be needlessly inconsistent.

There are also several pre-existing APIs that are called process_ (such as process_vm_{read,write}v(2)) which use pids and not pidfds -- so calling the new APIs process_ (or even proc_) could lead to confusion. From memory the first couple of iterations of the patchset changed the name several times until we landed on pidfd_ and nobody really complained much afterwards.

Also (and now I'm just nitpicking), mkfifo(3) doesn't give you an fd -- it's just a wrapper around mknod(2). But I do get your point.

> I'm really amused that you're so excited about it and find it so cool.

I might be a little bit more biased towards thinking it's cool, since the developer is a good friend of mine and we went back and forth on the design quite a lot (the fact he managed to get /proc/$pid fds to have these features is pretty remarkable and it's unbelievably cool that it didn't require having multiple classes of fds -- if you'd have asked me a year ago I would've said it'd be very hard to get right and would never be merged because it'd be so invasive). But thinking that it's neat isn't mutually exclusive with thinking that (something like) it should've been implemented a long time ago.

What obvious? The command creates and returns a pidfd for the pid passed, not a process or a proc (which would be conflated with other concepts in Linux).

But you're opening the object (which already exists), not a file descriptor (which doesn't exist yet). And when you create or open a kernel object, you always get back a (new) file descriptor representing that object. See perf_event_open, mkfifo, etc. I'm not really sure how it'd have conflated with anything to call it proc_open or something like that, but I'll take your word on it.

You can get a pidfd by opening /proc/$pid. That's all a pidfd is. The proposed pidfd_open(2) is for separate and more specialised use-cases where /proc isn't available and other restrictions apply.

Posted 1 week ago! Do you know what took them so long to realize this is the correct way to do it? Windows has had it for some two decades...

Unfortunately, as Linux has matured, many of its developers refuse to adopt technologies for the sole reason that it "stinks" of Microsoft.

Cutler and his team were pretty forward thinking given that NT is over 30 years old. See also: the async i/o headaches and the inability to WaitForSingleObject()/WaitForMultipleObjects() (or an analogue) in Linux.

It's a shame really.

> many of its developers refuse to adapt technologies for the sole reason that it "stinks" of Microsoft.

I don't know where you got that idea, but it is not true at all. Linux developers have a history of ignoring (and reinventing) technologies and interfaces developed in other Unix systems, too. You can listen to any Bryan Cantrill rant for more details. The claim that they refuse to adopt technologies that "stink" of Microsoft is a mischaracterization.

re: WaitForMultipleObjects, finally something like that is likely coming to Linux:


> It's a shame really.

Well, yes, and of course, yet at the same time it's FOSS, so ... if someone really needed it, they should have proposed a patch. ¯\_(ツ)_/¯

> Well, yes, and of course, yet at the same time it's FOSS, so ... if someone really needed it, they should have proposed a patch. ¯\_(ツ)_/¯

It's a bit like doing I/O that doesn't throw your data away... just use O_DIRECT, bring your own I/O manager and page cache and scheduling and simply implement all the low-level compatibility stuff yourself if you need such a weird thing ¯\_(ツ)_/¯

I mean MySQL did that with libaio (which needs O_DIRECT anyway), Oracle on Linux did that and some with ASM (their own filesystem basically), also Ceph with BlueStore (LevelDB as a FS).

Because general purpose filesystems and I/O subsystems are just that, general purpose.

The PostgreSQL developers regularly complain about the problems of the Linux I/O and fs APIs, but ... to my knowledge they did not do much else. (Other than working around the problems in userspace, and taking the performance hit.)

If you want I/O that doesn't throw your data away, use the perfectly fine buffered I/O on let's say XFS. Fast, safe, sane. If you need more juice, add NVMe, and/or MD-RAID / LVM. And eventually you need to scale out to multiple nodes anyway, and then the performance of the cluster consistency subsystem will be the weak point.

Things don't get accepted into mainline just because you did the work. This is a weird misconception people have about open source. You also see it when people ask "why did you build this instead of just contributing it to $largeProject?"

Huh. I never said it'll be magically accepted.

But you also can't say that it's somehow impossible to get anything other than what Linus et al. think is the very epitome of perfect.

Upstreaming stuff is an effort. Contributing to such a complex and fragile pile of code/crap/C must be done through the kernel devs.

But if you want to add a new FS, I/O scheduler, bytecode VM, or whatever, it's possible. Look at Ceph, btrfs, bcachefs, all the SSD/NAND oriented FS-es, look at eBPF, and the XDP [eXpress Data Path] IO Visor network thingie, or WireGuard [which is still not merged as far as I know, because it tried to bring its own crypto library instead of using the already existing stuff]. All the stuff getting upstreamed from Android piece by piece.

A lot of kernel devs are happy about this or that, but they are pragmatists.

And Linus doesn't veto big ideas just because, but does not want to pull sub-par implementations. (See KDBUS for example.)

RHEL had an entire async i/o subsystem. It was rejected upstream.

Could you provide some details about this?

You could do it for a very long time. Folks are mentioning the new pidfd stuff, but that interface is built on much older tricks.

In particular, what you could do is grab a handle to /proc/$pid. This is now called a pidfd, but this works on old kernels too. Then, to check if the process has died you just do a fstatat(2) and see if you get ESRCH -- if you do, the process has died and this will work even if the pid is reused. I think you could use inotify to avoid polling, but I'm not sure.

The main benefit for the new poll support for pidfds is that you can get the exit status. And obviously CLONE_PIDFD has other benefits as well as the incredibly useful feature of pidfd_send_signal(2) which was the first patch sent for Linux 5.1.

Not anymore: b53b0b9d9a613c418057f6cb921c2f40a6f78c24 (pidfd: add polling support)

Wow, finally! Committed just 3 months ago. Might be able to finally rely on it in Linux distros a few years from now!


It's in kernel v5.3, which is expected in September, so it might be in Ubuntu Eoan Ermine 19.10. (After all 5.0 got released in March and Ubuntu 19.04 ships with 5.0.)

Yeah, that's when you'll just start seeing the new kernel in the wild. Not when you'll stop seeing older kernels in the wild.

There will be RHEL6 installs for the next 10 years, but why do they matter exactly?

Is it any easier on any other platform? Genuinely asking.

Windows NT has had “kill process tree” since as far back as I recall (mid 1990s). See:

  taskkill /t

kqueue allows waiting for any pid to exit.

Isn’t netlink’s PROC_EVENT_EXIT pretty straightforward?

The proc connector has a few problems:

1. It's effectively unmaintained (which I found out when I started mailing around asking if there was interest in me sending patches that fix the rest of the issues listed).

2. It requires privileges to use, making it useless when compared to other alternatives that can help solve some of the other issues (pidfds or just the good old /proc/$pid fstatat(2) trick).

3. It doesn't work in containers at all.

4. It has several pretty serious bugs which could even be argued to be security bugs. But since it has effectively zero users now, I'd be surprised if anyone would be interested in such bugs.

I wanted to fix these issues quite desperately, because it would allow for init systems that don't suffer from the cgroup or ptrace downsides. Unfortunately, it uses netlink and so any changes are mind-bogglingly complicated (especially if you want to tie it to PID namespaces because then you're really SoL since netlink is fundamentally tied to network namespaces).

If you're a privileged process. Usually I, a non-root user, want to wait for another process I started in another process tree.

"Wait until the child process..."

So, no. That's not it.

Off topic, but this headline makes me think "...unto the tenth generation upon the Earth..."

Advanced Programming in the UNIX Environment is an excellent book that dives more into this.

I just use rkill
