This article is mostly about rediscovering the distinction between interruptible and non-interruptible waits.
At the end there is some sort of deadlock between namespace teardown, signal delivery, and FUSE, but... it isn't articulated in a way that is super comprehensible to me. The kernel flushes open files on exit and also kills things in the namespace on exit. But doesn't that mean the race condition was always hittable if you killed the FUSE daemon at the wrong time relative to the FUSE client shutdown? It's not totally obvious to me why this would impact other, non-FUSE filesystems.
Signal delivery and multithreaded process teardown in the kernel are certainly tricky, and it's really easy to get these weird edge cases wrong.
> It's not totally obvious to me why this would impact other non-FUSE filesystems.
Any filesystem that involves writing to another process, possibly a distant one (think NFS), will be susceptible to this.
Generally this wouldn't be a problem because PID 1 would normally be a real init: it wouldn't have any open FDs for remote files (unless it was a diskless boot), and it wouldn't exit (init never exits, and historically init exiting/dying would cause Unixes to panic), so this doesn't usually come up.
But containers can break all these assumptions since you can have applications be init. And if applications-as-init start helper daemons for FUSE or whatever, they need to clean up in order, and if they don't then bad things happen. In this case the application being init caused the kernel to kill all the other processes in the namespace when the application exited.
Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies. The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.
> Generally this wouldn't be a problem because PID 1 would normally be a real init: it wouldn't have any open FDs for remote files (unless it was a diskless boot), and it wouldn't exit (init never exits, and historically init exiting/dying would cause Unixes to panic), so this doesn't usually come up.
I mean, the kernel cannot rely on any userspace process to do anything -- even init, even ignoring containers. I don't think the root cause here is even init-related -- that's just what caused the flush to hang forever in this situation, but as you point out that could happen for any number of reasons.
> Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies.
Right.
> The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.
Sure. You could also do an interruptible flush before blocking signals instead of after.
I don't understand why wants_signal returns false on PF_EXITING even if the signal is SIGKILL (and from the kernel). Shouldn't it still wake up the process, so it can get out of the flush?
I am curious: if you were just to walk over every PID in the pid namespace after zap_pid_ns_processes runs, and perform wakeups, would it break out of the `wait_event` loop?
Btw, this class of weirdness with FUSE isn't that unusual.
Author here (hi Sargun), it's not really about rediscovering killable vs. unkillable waits, and any confusion is probably a result of my poor writing.
The crux of it is that once you've called exit_signals() from do_exit(), signals will not get delivered. So if you subsequently use the kernel's completions or other wait code, you will not get the signal from zap_pid_ns_processes(), so you don't know to wake up and exit.
Oh, I wasn't suggesting that it was about killable vs. unkillable.
Couple of things:
1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?
2. Following this:
  zap_pid_ns_processes ->
    __fatal_signal_pending(task)
    group_send_sig_info ->
      do_send_sig_info ->
        send_signal_locked ->
          __send_signal_locked -> (jump to out_set)
            sigaddset  // It has the pending signal here
            ...
            complete_signal
Shouldn't it wake up, even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.
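The kprobe experiment in point 1 could be prototyped without patching the kernel, e.g. as a bpftrace one-liner. This is only a sketch: it assumes a bpftrace build with BTF support, and hard-codes PF_EXITING as 0x00000004 (its value in include/linux/sched.h), which you'd want to verify against your kernel's headers.

    # Count kernel stacks where a task enters prepare_to_wait_event
    # while PF_EXITING (0x00000004) is already set:
    sudo bpftrace -e '
      kprobe:prepare_to_wait_event
      /curtask->flags & 0x00000004/
      { @callers[kstack] = count(); }'

Running that on a busy fleet would give a rough answer to "how many callers hit this state in practice".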
> Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?
I would argue they're all invalid if PF_EXITING is present. Maybe I should send a patch to WARN() and see how much I get yelled at.
> Shouldn't it wake up, even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.
No, because the signal doesn't get delivered by complete_signal(). wants_signal() returns false if PF_EXITING is set. (Another maybe-interesting thing would be to just delete that check.) Or am I misunderstanding you?
Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.
As somebody who has written a non-trivial amount of upstream Linux filesystem code and who is leading the containers team at my current employer, I've found your writing more interesting than perhaps most people on the face of the planet might. I'm also a bit surprised at how often companies write their own custom FUSE filesystems. A lot of them I only hear about when former employees from those companies join mine and clue me in about their existence. It seems like every large-ish company these days has at least one now.
It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?
Also, my whole career I've been doing ps aux, but TIL about ps awwfux. Which I guess goes to show there's always some gap in one's basic knowledge of Linux-fu!
> Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.
Hi Mike. So far so good for me.
> It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?
We do have both of these enabled, and have alerts to log them in the fleet. I have found them very useful for saying "there's a bug", but not generally applicable to debugging it. However, without those tools we wouldn't catch these things at all except via user reports.
Something that might (?) be useful is something like lockdep for hung tasks. It wouldn't have helped in this case, since it was a bug in signal wakeup, but e.g. in the xfs case I cited at the bottom maybe it would have.
I think they definitely should not. I've considered sending a patch that adds a WARN() or some syzkaller test for it or something, especially now that I've seen it in other filesystems.
Well, only if the wait is for userspace or a remote resource, right? Regular disks are sometimes considered infallible (or at least, the IO will timeout eventually in the generic SCSI logic) and might be ok to wait on.
To generalize a bit, I think the problem is doing any sort of interruptible wait -- because we can no longer be interrupted. Uninterruptible waits aren't any different without signal delivery. I might be oversimplifying, though.
It sounds like exit_signals() is being called too early, and based on the test case linked this might be a library issue rather than a code or kernel issue?
Edit: Reading the article, it's clearer that this happens in the kernel's own exit path (exit_signals() is called from do_exit()).
It doesn't matter. Filesystem waits are historically non-interruptible. The correct fix is indeed to allow the flushes to fail fast rather than wait forever.
> I don't understand why wants_signal returns false on PF_EXITING even if the signal is SIGKILL (and from the kernel). Shouldn't it still wake up the process, so it can get out of the flush?
First, a signal wouldn't get the process out of blocking on flushing because filesystem waits are non-interruptible.
Second, if a process is exiting then a) no handler for a signal (other than SIGKILL or SIGSTOP) could run, and b) any default-exit actions wouldn't change the state of the process.
Therefore the process can't want the signal if the process is exiting.
I'm reminded of Bedrock Linux hanging during sleep because Linux sleeps the FUSE daemon while a process is waiting on FUSE and unable to be suspended for sleep: https://news.ycombinator.com/item?id=34583495
To be clear, Bedrock Linux does not hang during sleep. Rather, it cannot enter sleep consistently; if it fails to, it simply continues operating as it was before the request to suspend. The underlying issue is a long-standing Linux kernel bug [0] in which the kernel cannot reliably suspend if FUSE is in use. When the kernel detects this scenario it simply doesn't suspend, but rather logs its difficulty to dmesg. This is not specific to Bedrock, but hits all projects which leverage Linux's FUSE functionality [1]. It probably hits Netflix's ndrive as well.
In the few months since my comment you've linked, I've put some work into a possible FUSE-less Bedrock implementation. It will likely have some downsides compared to a FUSE-based solution, but the trade-off may be worthwhile for some users. While it's too early to commit to this, I'm hoping to eventually support switching between a FUSE-mode and a non-FUSE-mode with a reboot to allow users to pick the desired trade-off in the desired contexts.
I love `ps axufww` so much and ah fuck to having learned it that way. It's the first thing I run on a server when logging in. It tells you so much about the system. That, `w` and a `dmesg -T` will go 90% of the way to diagnosing most system issues.
The explainshell output is actually largely wrong here. For historical reasons, GNU ps has two completely different sets of flags depending on whether you use a dash (ps -abcd) or not (ps abcd). The man page documents both sets, but explainshell is getting confused and showing a mix of dash versions and non-dash versions. For example, it shows the documentation for '-f' ("full-format listing"), but it should be showing the documentation for 'f' ("ASCII art process hierarchy (forest)").