At the end there's some sort of deadlock between namespace teardown, signal delivery, and FUSE, but it isn't articulated in a way that's super comprehensible to me. The kernel flushes open files on exit and also kills everything in the namespace on exit. But doesn't that mean the race was always hittable if you killed the FUSE daemon at the wrong time relative to the FUSE client's shutdown? It's also not obvious to me why this would affect non-FUSE filesystems.
Signal delivery and multithreaded process teardown in the kernel are certainly tricky, and it's really easy to get these weird edge cases wrong.
Any filesystem that involves writing to another process, possibly a distant one (think NFS), will be susceptible to this.
Generally this wouldn't be a problem: pid 1 would normally be an actual init, which wouldn't have any open FDs for remote files (unless it was a diskless boot) and wouldn't exit (init never exits, and historically init dying would cause Unixes to panic). So this doesn't usually come up.
But containers break all of these assumptions, since you can have an application be init. And if applications-as-init start helper daemons for FUSE or whatever, they need to clean up in order; if they don't, bad things happen. In this case, the application being init caused the kernel to kill all the other processes in the namespace when the application exited.
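For what it's worth, my reading of the failure sequence (reconstructed from the discussion; function names from mainline, details hedged) is roughly:

```
app (pid 1 in the namespace) exits
  -> zap_pid_ns_processes(): SIGKILL every other task in the
     namespace, then wait for all of them to be reaped

fuse daemon: killed; outstanding requests are now unanswerable

fuse client: killed
  -> do_exit()
       exit_signals()      // sets PF_EXITING; no further signal delivery
       exit_files()
         -> filp_close() -> ->flush() sends a request to the dead
            daemon and waits for a reply that never comes; another
            SIGKILL can't wake it because of PF_EXITING

pid 1: stuck in zap_pid_ns_processes() waiting on the stuck client
```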
Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies. The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.
I mean, the kernel cannot rely on any userspace process to do anything -- even init, and even ignoring containers. I don't think the root cause here is init-related -- that's just what caused flush to hang forever in this situation, and as you point out, it could hang for any number of other reasons.
> Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies.
> The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.
Sure. You could also do an interruptible flush before blocking signals instead of after.
I am curious: if you were to walk over every PID in the pid namespace after zap_pid_ns_processes has sent its signals, and perform wake-ups, would that break them out of the `wait_event` loop?
Btw, this class of weirdness with FUSE isn't that unusual.
The crux of it is that once you've called exit_signals() from do_exit(), signals will not get delivered. So if you subsequently use the kernel's completions or other wait code, you will not get the signal from zap_pid_ns_processes(), so you don't know to wake up and exit.
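To make the ordering concrete, here's a heavily simplified, non-compilable sketch of the relevant do_exit() path (paraphrasing mainline; the exact calls and wait primitive vary by version):

```c
void do_exit(long code)
{
        struct task_struct *tsk = current;

        exit_signals(tsk);   /* sets PF_EXITING; after this point,
                              * complete_signal() will never wake us */
        /* ... */
        exit_files(tsk);     /* close fds; may call ->flush(), e.g. into FUSE */
}

/* somewhere under ->flush(), conceptually: */
wait_event_killable(req->waitq, test_bit(FR_FINISHED, &req->flags));
/* wait_event_killable() breaks out on fatal_signal_pending(), but a
 * SIGKILL queued now is added to the pending set without the wakeup,
 * so the condition is never re-checked and we sleep forever. */
```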
There's a test case here if people want to play around: https://github.com/tych0/kernel-utils/tree/master/fuse2
I'm glad you inherited this :).
Oh, I wasn't suggesting that it was about killable vs. unkillable.
Couple of things:
1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?
2. Following this path:

```
__send_signal_locked()
  -> (jump to out_set)
       sigaddset()    // the signal is marked pending here
```
One note, in the post:

```
# grep Pnd /proc/1544574/status
```

Shouldn't it be `ShdPnd`?
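For reference, the distinction: `SigPnd` is the per-thread pending set, while `ShdPnd` is the thread-group (shared) pending set, which is where a process-directed SIGKILL lands. Quick check against a live process:

```shell
# Show both pending-signal masks for the current process;
# on an idle process both are normally all zeroes.
grep -E '^(SigPnd|ShdPnd)' /proc/self/status
```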
I would argue they're all invalid if PF_EXITING is present. Maybe I should send a patch to WARN() and see how much I get yelled at.
> Shouldn't it wake up? Even if it's in PF_EXITING, that would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.
No, because the signal doesn't get delivered by complete_signal(). wants_signal() returns false if PF_EXITING is set. (Another maybe-interesting thing would be to just delete that check.) Or am I misunderstanding you?
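For anyone following along, the check in question (paraphrased from kernel/signal.c; the exact body varies by version) -- note that the PF_EXITING test sits in front of the SIGKILL special case:

```c
/* Paraphrase of wants_signal(): complete_signal() uses this to decide
 * whether to wake a task for a newly queued signal. */
static inline bool wants_signal(int sig, struct task_struct *p)
{
	if (sigismember(&p->blocked, sig))
		return false;
	if (p->flags & PF_EXITING)
		return false;	/* exiting tasks never "want" anything... */
	if (sig == SIGKILL)
		return true;	/* ...so even SIGKILL never reaches this */
	/* ... */
}
```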
> Shouldn't it be "ShdPnd"
derp, fixed, thanks.
Oh, I see -- you're suggesting exactly that:
> (Another maybe-interesting thing would be to just delete that check.)
As somebody who has written a non-trivial amount of upstream Linux filesystem code and who is leading the containers team at my current employer, I've found your writing more interesting than perhaps most people on the face of the planet might. I'm also a bit surprised at how often companies write their own custom FUSE filesystems. A lot of them I only hear about when former employees of those companies join mine and clue me in about their existence. It seems like every large-ish company has at least one these days.
It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?
Also, my whole career I've been doing ps aux, but TIL about ps awwfux. Which I guess goes to show there's always some gap in one's basic Linux-fu!
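For anyone else decoding it, those are BSD-style procps options:

```shell
# a  - processes of all users (combined with x: every process)
# x  - include processes without a controlling terminal
# u  - user-oriented output format
# f  - ASCII-art process hierarchy ("forest")
# ww - unlimited width, don't truncate the command line
ps awwfux | head -n 3
```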
Hi Mike. So far so good for me.
> It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?
We do have both of these enabled, and have alerts in the fleet that log them. I've found them very useful for saying "there's a bug", but not generally applicable to debugging it. That said, without those tools we wouldn't catch these things at all except via user reports.
Something that might (?) be useful is something like lockdep for hung tasks. It wouldn't have helped in this case, since the bug was in the signals wakeup path, but in, e.g., the xfs case I cited at the bottom, maybe it would have.
To generalize a bit, I think the problem is doing any sort of interruptible wait -- because we can no longer be interrupted. Uninterruptible waits behave no differently once signal delivery is off. I might be oversimplifying, though.
Edit: reading the article, it's clearer that this happens in the kernel's:

```
exit_signals(tsk); /* sets PF_EXITING */
```
Or zap_pid_ns_processes() running too late, yeah.
First, a signal wouldn't get the process out of blocking on flushing because filesystem waits are non-interruptible.
Second, if a process is exiting then a) no handler for a signal (other than SIGKILL or SIGSTOP) could run, and b) the default action (exiting) wouldn't change the state of the process.
Therefore the process can't want the signal if the process is exiting.
In the few months since my comment you've linked, I've put some work into a possible FUSE-less Bedrock implementation. It will likely have some downsides compared to a FUSE-based solution, but the trade-off may be worthwhile for some users. While it's too early to commit to this, I'm hoping to eventually support switching between a FUSE-mode and a non-FUSE-mode with a reboot to allow users to pick the desired trade-off in the desired contexts.
https://bugzilla.kernel.org/show_bug.cgi?id=34932
https://bugzilla.kernel.org/show_bug.cgi?id=198879
https://lists.debian.org/debian-kernel/2011/10/msg00412.html
https://github.com/keybase/client/issues/12458
https://github.com/libfuse/libfuse/issues/248
https://bugs.launchpad.net/ubuntu/+source/sshfs-fuse/+bug/17...
(Another one of my favorites is cat -vet)
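(-v shows non-printing characters, -e marks each line end with `$`, and -t prints tabs as `^I` -- handy for spotting stray whitespace:)

```shell
# Make invisible whitespace visible: tab -> ^I, end of line -> $
printf 'a\tb \n' | cat -vet
# -> a^Ib $
```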
(I considered filing a bug report, but there is one already: https://github.com/idank/explainshell/issues/214)
Bigcorps are large and diverse. In this case, it seems to be user/desktop-facing, as it's a FUSE module for studio assets.